If you're like me and have people who like to write in MS Office first and then post it to the web, you get the curly quotes and other typographic characters that Word uses. These often don't arrive as valid UTF-8, which is how I like to store the data in my database, so it usually dumps out with an error.
I’ve come up with the following Zend_Filter_Interface implementation so you can plug it into anything that can use Zend_Filter.
Updated on 12/10/09 to include more characters and remove the HTML version.
Updated again on 12/10/09 to make it work with UTF-8 submitted forms. I’ve spent half the day validating this and testing it with a hex editor.
<?php
class Util_Filter_WordChars implements Zend_Filter_Interface
{
    /**
     * Replace the UTF-8 encoded punctuation that Word puts in
     * with plain ASCII equivalents.
     *
     * @param string $value
     * @return string
     */
    public function filter($value)
    {
        // Each entry is the three-byte UTF-8 sequence for one character.
        $search = array(
            chr(0xe2) . chr(0x80) . chr(0x98), // U+2018 left single quote
            chr(0xe2) . chr(0x80) . chr(0x99), // U+2019 right single quote
            chr(0xe2) . chr(0x80) . chr(0x9c), // U+201C left double quote
            chr(0xe2) . chr(0x80) . chr(0x9d), // U+201D right double quote
            chr(0xe2) . chr(0x80) . chr(0x93), // U+2013 en dash
            chr(0xe2) . chr(0x80) . chr(0x94), // U+2014 em dash
            chr(0xe2) . chr(0x80) . chr(0xa6), // U+2026 ellipsis
        );
        // Replacements, in the same order as $search.
        $replace = array(
            '\'',
            '\'',
            '"',
            '"',
            '-',
            '-',
            '...',
        );
        return str_replace($search, $replace, $value);
    }
}
This could be changed to emit the matching HTML entities so the original characters still display, but I prefer the standard single and double quotes online.
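For anyone who would rather keep the typography, here is a rough sketch of that HTML-entity variant. The function name `wordCharsToEntities` is hypothetical and not part of the filter above. It is written as a plain function so it runs without Zend Framework, and the same map could be dropped into `filter()` instead.

```php
<?php
/**
 * Hypothetical helper (not part of the filter above): replace Word's
 * UTF-8 punctuation with named HTML entities instead of plain ASCII.
 *
 * @param string $value
 * @return string
 */
function wordCharsToEntities($value)
{
    // Keys are the three-byte UTF-8 sequences; values are HTML entities.
    $map = array(
        "\xe2\x80\x98" => '&lsquo;',  // U+2018 left single quote
        "\xe2\x80\x99" => '&rsquo;',  // U+2019 right single quote
        "\xe2\x80\x9c" => '&ldquo;',  // U+201C left double quote
        "\xe2\x80\x9d" => '&rdquo;',  // U+201D right double quote
        "\xe2\x80\x93" => '&ndash;',  // U+2013 en dash
        "\xe2\x80\x94" => '&mdash;',  // U+2014 em dash
        "\xe2\x80\xa6" => '&hellip;', // U+2026 ellipsis
    );
    // strtr() with an array replaces every key with its value in one pass.
    return strtr($value, $map);
}

echo wordCharsToEntities("\xe2\x80\x9cHi\xe2\x80\x9d \xe2\x80\x93 ok\xe2\x80\xa6");
// prints: &ldquo;Hi&rdquo; &ndash; ok&hellip;
```

The entities only make sense when the value is echoed into HTML, so this version is better applied at output time than before storing to the database.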


Thank you!! It seems like I run into this all the time. I will have to try and use this and see how it works.
Excellent! Man, those Word quotes piss me off. Thanks and cheers!
Here is a list I created to deal with Excel files. I have a few additional chars and seem to be missing a few. Perhaps we can make a more complete list?
$badchr = array(
    "\0",   // null byte
    "\x01", // SOH start of header
    "\x0B", // vertical tab
    "\t",   // tab
    "\x16", // SYN synchronous idle
    "\xc2", // prefix 1
    "\x80", // prefix 2
    "\x92", // single quote
    "\x93", // double quote
    "\x94", // double quote
    "\x96", // dash
    "\x98", // single quote opening
    "\x99", // single quote closing
    "\x8c", // double quote opening
    "\x9d"  // double quote closing
);
Ron
@ron,
Thanks for the info. I have updated it to include all of yours plus mine. Let me know if the replacements are not correct.
This would be golden if it stripped MSO's weird tags too. That kind of filter should be included in ZF. This is a good start though!
Perhaps I’m mistaken, but I believe that all of those characters are proper UTF-8. Most designers will get pretty upset if you convert their nicely designed typography into hash marks.
Example:
echo mb_detect_encoding(chr(0xe2) . chr(0x80) . chr(0x98)); // outputs 'UTF-8'
If what you’re trying to do is convert from UTF-8 to ASCII, iconv’s transliteration is probably a bit better at it:
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
@Andy,
I understand where you are coming from but when I echo out that character on a UTF-8 page I get the question mark, which looks bad.
I can see where converting them to ASCII would work too but I’ve always just converted them to the standard single quote and double quote.
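One way to tell which of the two situations in this thread applies is to check whether the incoming bytes are already valid UTF-8. The sketch below makes two assumptions: the mbstring extension is loaded, and anything that fails the check came from a Windows-1252 source such as a Word paste. The function name `normalizeToUtf8` is made up for illustration.

```php
<?php
/**
 * Hypothetical helper: return $value unchanged if it is already valid
 * UTF-8, otherwise treat it as Windows-1252 (an assumption) and convert.
 *
 * @param string $value
 * @return string
 */
function normalizeToUtf8($value)
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

// Word's curly double quotes as single Windows-1252 bytes.
$fromWord = "\x93Hi\x94";

var_dump(mb_check_encoding($fromWord, 'UTF-8'));
// bool(false): lone 0x93 and 0x94 bytes are not valid UTF-8

var_dump(normalizeToUtf8($fromWord) === "\xe2\x80\x9cHi\xe2\x80\x9d");
// bool(true): the quotes are now the three-byte UTF-8 sequences
```

If the first check prints false for your data, that would explain the question marks on a UTF-8 page. After this step the value contains the same three-byte sequences the filter in the post searches for.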