lawrence k wrote:

What PHP code would give me this kind of 100% certainty?

I was bored so wrote this. I'm quite proud of myself, as I wrote it and ran it and it worked first time! :-)

It not only checks that the UTF-8 is valid, it forces it to be valid.


<?php

/**
 * PHP UTF-8 Validation Library
 * Copyright (C) 2007 Toby Inkster
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 * 
 * This library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser General Public
 * License along with this library; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
 *
 * @author Toby Inkster <URL:http://tobyinkster.co.uk/>
 * @copyright Copyright (C) 2007 Toby Inkster
 * @license http://www.gnu.org/copyleft/lgpl.html GNU Lesser General Public Licence
 */

/**
 * Utlity function to retrieve the first byte from a string.
 *
 * Note this function has a side-effect. As well as returning the
 * first byte of the string, it also modifies the string passed
 * as a parameter to remove the initial byte.
 *
 * @param string $string String to shift.
 * @return string First byte of string.
 */
function shift_byte (&$string)
{
        if (strlen($string)<1)
                return FALSE;

        $byte = substr($string, 0, 1);
        $string = substr($string, 1);
        return $byte;
}

/**
 * Validate a string as UTF-8, and modify the string to remove nasties.
 *
 * Note this function has a side-effect. As well as returning a
 * boolean to indicate whether the given string was valid, it also
 * modifies the string replacing any invalid characters with a
 * replacement character. (The replacement character is a question
 * mark, but you can change this if you like.)
 *
 * Note that in UTF-8, most characters have several alternative
 * representations. RFC 3629 says that the shortest representation
 * is the correct one. Other representations ("overlong forms")
 * are not valid. Earlier UTF-8 specifications did not prohibit
 * overlong forms, though suggest emitting a warning when one is
 * encountered. This function DOES NOT CHECK FOR OVERLONG FORMS!
 *
 * @param string $string String to validate.
 * @return boolean Was the string valid or not?
 */
function validate_utf8 (&$string)
{
        $new            = '';
        $valid          = TRUE;
        $replacement    = '?';

        /* Loop through each UTF-8 character. */
        while (strlen($string))
        {
                /* Array of bytes to store this character. */
                $c = array();

                /* Firstly, assume that a character is a single byte. */
                $c[0] = shift_byte($string);

                /* "Seven Z" notation. */
                if (ord($c[0]) <= 0x7F)
                {
                        $new .= $c[0];
                }

                /* "Five Y, Six Z" notation. */
                elseif ((ord($c[0]) >= 0xC2) && (ord($c[0]) <= 0xDF))
                {
                        $c[1] = shift_byte($string);

                        if ((ord($c[1]) >= 0x80) && (ord($c[1]) <= 0xBF))
                        {
                                $new .= $c[0].$c[1];
                        }
                        else
                        {
                                $new .= $replacement;
                                $valid = FALSE;
                        }
                }

                /* "Four X, Six Y, Six Z" notation. */
                elseif ((ord($c[0]) >= 0xE0) && (ord($c[0]) <= 0xEF))
                {
                        $c[1] = shift_byte($string);
                        $c[2] = shift_byte($string);

                        if ((ord($c[1]) >= 0x80) && (ord($c[1]) <= 0xBF)
                        &&  (ord($c[2]) >= 0x80) && (ord($c[2]) <= 0xBF))
                        {
                                $new .= $c[0].$c[1].$c[2];
                        }
                        else
                        {
                                $new .= $replacement;
                                $valid = FALSE;
                        }
                }

                /* "Three W, Six X, Six Y, Six Z" notation. */
                elseif ((ord($c[0]) >= 0xF0) && (ord($c[0]) <= 0xF4))
                {
                        $c[1] = shift_byte($string);
                        $c[2] = shift_byte($string);
                        $c[3] = shift_byte($string);

                        if ((ord($c[1]) >= 0x80) && (ord($c[1]) <= 0xBF)
                        &&  (ord($c[2]) >= 0x80) && (ord($c[2]) <= 0xBF)
                        &&  (ord($c[3]) >= 0x80) && (ord($c[3]) <= 0xBF))
                        {
                                $new .= $c[0].$c[1].$c[2].$c[3];
                        }
                        else
                        {
                                $new .= $replacement;
                                $valid = FALSE;
                        }
                }

                else
                {
                        $new .= $replacement;
                        $valid = FALSE;
                }

        }

        $string = $new;
        return $valid;
}

1?>