Problem with encoding for multiselect field.

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Add a multiple choice list in the Custom Content Type Manager admin area 
(with special characters such as: ą, ę, ł, ń, ó, ś, ź, ż), with the 
output filter set to the formatted list type. 
2. Place the list with a template tag <?php print_custom_field('list_points'); 
?>
3. Add a new custom post type with a multiple choice list and choose to show 
values that contain the special characters. 

What is the expected output? What do you see instead?
The expected output would be what you can see in the backend. Instead it 
displays the values with a different encoding, which gives what you can see in 
userside-output.png

Now the funny thing is that they all display correctly in the admin panel when 
adding/editing a new custom post type. Moreover, if you choose one or more 
values with special characters and save, the custom fields revert to their 
original value (unchecked). This is what you can see in the second image, 
adminside-output-after-update.png.

I tried multiple things and managed to understand a few things. The data is 
encoded in a json file (if you change it directly in the json file and reimport 
it, you still get the encoding bug), which means that the issue is not there. 
When I change the custom field directly in the database (UTF-8 encoded) it 
displays perfectly fine, until you check another option with an encoding 
problem and the plugin inserts this meta value into the database again. 

I hope that I was quite clear (or clearer than the last time) about where the 
issue is coming from and what it looks like. 
Of course I might use other fields as a workaround (a simple text field where 
I write the HTML code directly) but that's simply not the point of 
having a CMS. 

Does the problem continue if you disable all other plugins?
Yes it does continue after you disable other plugins
What version of the plugin are you using?
9.0
What version of WordPress?
3.1
Please provide any additional information below (e.g. Browser, version,
OS):
Ubuntu 10.10, Chrome (tested in mozilla as well)

Original issue reported on code.google.com by Mt.Zieli...@gmail.com on 13 May 2011 at 4:06

GoogleCodeExporter commented 8 years ago

Ok... I must have used a different htmlentities function on the front-end.  
I'll look into this as soon as I'm back from vacation.

Original comment by fireproofsocks on 13 May 2011 at 8:26

Changed state: InReview

GoogleCodeExporter commented 8 years ago

Holy smokes, this is way more complicated than I thought.  I can't seem to find 
where this is getting hijacked... WP might be running the KSES filter on this 
thing: the variables coming through the $_POST array are NOT the variables I'm 
writing to the form fields.  WTF?

Original comment by fireproofsocks on 15 May 2011 at 4:21

Changed state: InProgress

GoogleCodeExporter commented 8 years ago

Yes, this is descending into encoding hell: 
http://pa2.php.net/manual/en/function.utf8-decode.php

If the mbstring extension isn't installed on a server, there's going to be no 
reliable way to determine the encoding used.

$x = 'ę';
htmlspecialchars($x); // u0119
html_entity_decode($x); // u00c4ufffd

Original comment by fireproofsocks on 15 May 2011 at 4:48
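
A minimal sketch of one way to check a string without depending on mbstring, assuming the only question is "is this valid UTF-8?" rather than "which encoding is this?". The helper name looks_like_utf8() is hypothetical and not part of the plugin. It uses mb_check_encoding() when available and otherwise falls back on PCRE's /u modifier, which makes preg_match() fail on input that is not valid UTF-8.

```php
<?php
// Hypothetical helper (not part of the plugin): is $string valid UTF-8?
function looks_like_utf8($string)
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($string, 'UTF-8');
    }
    // Fallback without mbstring: with the /u modifier PCRE validates the
    // subject as UTF-8, and preg_match() returns false on invalid input.
    return preg_match('//u', $string) === 1;
}

var_dump(looks_like_utf8("\xC4\x99")); // the two UTF-8 bytes of "ę"
var_dump(looks_like_utf8("\xEA"));     // "ę" as a single ISO-8859-2 byte
```

This only answers yes/no for UTF-8. Telling one single-byte encoding from another (ISO-8859-1 vs ISO-8859-2, say) is still guesswork.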

GoogleCodeExporter commented 8 years ago

This seems to hold some promise:

$z = htmlspecialchars(utf8_encode($y));
print utf8_decode(htmlspecialchars_decode($z));

But I'll have to use mb_detect_encoding() to check whether that should kick 
in.

Original comment by fireproofsocks on 15 May 2011 at 1:29

GoogleCodeExporter commented 8 years ago

Issue 99 has been merged into this issue.

Original comment by fireproofsocks on 9 Jun 2011 at 3:51

GoogleCodeExporter commented 8 years ago

This is truly maddening...  consider the following PHP snippets:

<?php
print 'ę';   // Run via command line, this works... via Apache, it prints Ä™
print htmlspecialchars('ę');   // does nothing... returns 'ę'
print utf8_encode('ę');              // returns Ä<99>  WTF?
print htmlentities('ę');             // returns Ä# WTF?
?>

So this has something to do with the php.ini options (my system has a different 
php.ini for command line and for apache).

Original comment by fireproofsocks on 10 Jul 2011 at 11:08

GoogleCodeExporter commented 8 years ago

So the page headers affect the encoding... but we can't change the page 
headers.    I've been trying some of the functions outlined on 
http://www.php.net/manual/en/function.utf8-encode.php#93162, but so far, 
nothing works.

Original comment by fireproofsocks on 11 Jul 2011 at 5:51
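
For reference, the garbage shown in the snippets above can probably be reproduced byte for byte. This is a sketch that assumes the source file is saved as UTF-8 and that mbstring is available. It shows that "Ä™" is what the two UTF-8 bytes of "ę" look like when read as Windows-1252, and that utf8_encode() re-encodes each of those bytes as if it were an ISO-8859-1 character.

```php
<?php
// "ę" as UTF-8 is the two bytes 0xC4 0x99 (written as escapes so the
// example does not depend on how this file itself is saved).
$utf8 = "\xC4\x99";

// What appears when those bytes are decoded as Windows-1252.
$as_cp1252 = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

// Equivalent of utf8_encode(): treat every byte as an ISO-8859-1
// character and encode it again as UTF-8.
$double_encoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($as_cp1252), "\n";      // c384e284a2, i.e. "Ä" followed by "™"
echo bin2hex($double_encoded), "\n"; // c384c299, i.e. "Ä" followed by U+0099
```

In other words the bytes themselves are fine. What differs between the command line and Apache is which charset the consumer assumes.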

GoogleCodeExporter commented 8 years ago

Aha... something here worked:

function UTF8ToEntities ($string) {
    /* note: apply htmlspecialchars if desired /before/ applying this function */
    /* Only do the slow convert if there are 8-bit characters */
    /* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
    if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
        return $string;

    // reject too-short sequences
    $string = preg_replace("/[\302-\375]([\001-\177])/", "�\\1", $string); 
    $string = preg_replace("/[\340-\375].([\001-\177])/", "�\\1", $string); 
    $string = preg_replace("/[\360-\375]..([\001-\177])/", "�\\1", $string); 
    $string = preg_replace("/[\370-\375]...([\001-\177])/", "�\\1", $string); 
    $string = preg_replace("/[\374-\375]....([\001-\177])/", "�\\1", $string); 

    // reject illegal bytes & sequences
        // 2-byte characters in ASCII range
    $string = preg_replace("/[\300-\301]./", "�", $string);
        // 4-byte illegal codepoints (RFC 3629)
    $string = preg_replace("/\364[\220-\277]../", "�", $string);
        // 4-byte illegal codepoints (RFC 3629)
    $string = preg_replace("/[\365-\367].../", "�", $string);
        // 5-byte illegal codepoints (RFC 3629)
    $string = preg_replace("/[\370-\373]..../", "�", $string);
        // 6-byte illegal codepoints (RFC 3629)
    $string = preg_replace("/[\374-\375]...../", "�", $string);
        // undefined bytes
    $string = preg_replace("/[\376-\377]/", "�", $string); 

    // reject consecutive start-bytes
    $string = preg_replace("/[\302-\364]{2,}/", "�", $string); 

    // decode four byte unicode characters
    $string = preg_replace(
        "/([\360-\364])([\200-\277])([\200-\277])([\200-\277])/e",
        "'&#'.((ord('\\1')&7)<<18 | (ord('\\2')&63)<<12 |" .
        " (ord('\\3')&63)<<6 | (ord('\\4')&63)).';'",
    $string);

    // decode three byte unicode characters
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
"'&#'.((ord('\\1')&15)<<12 | (ord('\\2')&63)<<6 | (ord('\\3')&63)).';'",
    $string);

    // decode two byte unicode characters
    $string = preg_replace("/([\300-\337])([\200-\277])/e",
    "'&#'.((ord('\\1')&31)<<6 | (ord('\\2')&63)).';'",
    $string);

    // reject leftover continuation bytes
    $string = preg_replace("/[\200-\277]/", "�", $string);

    return $string;
}
//------------------------------------------------------------------------------

$opt = 'ę';

$utf = UTF8ToEntities( htmlspecialchars($opt) ); 

print $utf;  // prints the correct output (converted, it is &#281;)

Original comment by fireproofsocks on 11 Jul 2011 at 5:57

GoogleCodeExporter commented 8 years ago

Or more simply:

<?php

function charset_decode_utf_8 ($string) { 
      /* Only do the slow convert if there are 8-bit characters */ 
    /* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */ 
    if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string)) 
        return $string; 

    // decode three byte unicode characters 
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",        
    "'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",    
    $string); 

    // decode two byte unicode characters 
    $string = preg_replace("/([\300-\337])([\200-\277])/e", 
    "'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'", 
    $string); 

    return $string; 
}

?>

Original comment by fireproofsocks on 11 Jul 2011 at 7:03
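
Note for anyone reading this later: ereg() and the /e modifier of preg_replace() were removed in PHP 7, so the functions above no longer run as posted. Below is a rough equivalent using preg_replace_callback(). It is only a sketch under a few assumptions: the name utf8_to_numeric_entities() is made up, the input is expected to be valid UTF-8, and invalid input is returned unchanged instead of being repaired.

```php
<?php
// Hypothetical replacement for charset_decode_utf_8(): turn every
// non-ASCII UTF-8 character into a decimal numeric entity (&#NNN;).
function utf8_to_numeric_entities($string)
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u', // one non-ASCII code point per match
        function ($match) {
            $bytes = array_values(unpack('C*', $match[0]));
            $count = count($bytes);
            // Payload bits of the lead byte: 5 bits for a 2-byte sequence,
            // 4 bits for a 3-byte sequence, 3 bits for a 4-byte sequence.
            $codepoint = $bytes[0] & (0xFF >> ($count + 1));
            // Each continuation byte contributes its low 6 bits.
            for ($i = 1; $i < $count; $i++) {
                $codepoint = ($codepoint << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $codepoint . ';';
        },
        $string
    );

    // preg_replace_callback() returns null when the subject is not valid
    // UTF-8; in that case hand the input back untouched.
    return $result === null ? $string : $result;
}

echo utf8_to_numeric_entities("xyz\xC4\x99egg"), "\n"; // xyz&#281;egg
```

As with the original, apply htmlspecialchars() first if the result is going into an HTML attribute.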

GoogleCodeExporter commented 8 years ago

Hi!
First, excuse my somewhat weird English, I'm French.
I think I figured out a simple fix for this problem, which I ran into as well.

First, you just have to save all PHP files of the plugin with UTF-8 encoding 
(not ISO Latin1).
In fact, I only tested this with a few files (like "includes/CCTM.php" and 
"includes/pages/post_type.php")
and it worked fine for me as far as I can see.

Second, you must get rid of htmlentities in the "includes/pages/*.php" files.

For example, I changed line 143 in post_type.php from
            <textarea name="description" class="cctm_textarea" id="description" rows="4" cols="60"><?php print htmlentities($def['description']); ?></textarea>

to
            <textarea name="description" class="cctm_textarea" id="description" rows="4" cols="60"><?php print ($def['description']); ?></textarea>

And voilà! On the main "custom content types" page and the "edit content 
type" page it did the trick.

I don't know if all this is really clear, but I hope this helps and that this 
fix makes it into the next update! ;)

Have fun !

Original comment by whadaff@gmail.com on 13 Jul 2011 at 2:22
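
One caveat with dropping htmlentities() entirely: the value then goes into the page without any escaping of <, > or quotes. A possible middle ground, sketched here on the assumption that the stored text is UTF-8, is htmlspecialchars() with the charset passed explicitly (before PHP 5.4 the default charset for these functions was ISO-8859-1). The $def array below is a made-up stand-in for the plugin's data.

```php
<?php
// Hypothetical stand-in for the plugin's $def array. The value is "ę"
// (UTF-8 bytes 0xC4 0x99) followed by some markup and quotes.
$def = array('description' => "\xC4\x99 <b>\"x\"</b>");

// Escape only the HTML-significant characters and say explicitly that the
// input is UTF-8, so the multi-byte character is left alone.
$escaped = htmlspecialchars($def['description'], ENT_QUOTES, 'UTF-8');

echo '<textarea name="description">' . $escaped . '</textarea>', "\n";
```

The markup and quotes come out as entities while the bytes of "ę" pass through untouched.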

GoogleCodeExporter commented 8 years ago

Thanks -- that's part of the problem, but it's a bit more complicated than that 
-- it also depends on the settings on your server, so it requires a few other 
changes as well so it can work the same way on multiple servers.  I think I 
have a solution figured out -- I'll post it shortly.

Original comment by fireproofsocks on 13 Jul 2011 at 5:10

GoogleCodeExporter commented 8 years ago

Ugh... still no luck.  I'm able to print the correct html entities into the 
form values, but when it comes through the post array, it still gets converted, 
e.g. "u0119" and "u0142"... so I think I need to write the converse of the 
charset_decode_utf_8() function... one that takes u0119 and outputs the wily 
foreign character...

Original comment by fireproofsocks on 16 Jul 2011 at 8:32
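
A small sketch of the reverse direction using only built-in functions, in case it helps. It assumes the target encoding is UTF-8. Both a numeric entity and an intact JSON \u escape can be turned back into the character, but once the backslash is gone, "u0119" is just five ordinary characters and nothing marks it as an escape any more.

```php
<?php
// Numeric HTML entity back to a UTF-8 character.
$from_entity = html_entity_decode('&#281;', ENT_QUOTES, 'UTF-8');

// Intact JSON escape back to a UTF-8 character. Single quotes keep the
// backslash literal, so json_decode() sees "\u0119".
$from_json = json_decode('"\u0119"');

// With the backslash stripped, the same text decodes to the literal
// string u0119.
$stripped = json_decode('"u0119"');

echo bin2hex($from_entity), ' ', bin2hex($from_json), ' ', $stripped, "\n";
```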

GoogleCodeExporter commented 8 years ago

"I manually encoded the characters to Special HTML Characters, and created a 
field containing those characters. Next I headed to 
/includes/elements/multiselect.php and on line 145 changed 
htmlspecialchars($opt) to htmlspecialchars_decode($opt). In my template file I 
included the field with the following code: 
$g = get_custom_field('genre', ', '); echo htmlspecialchars_decode($g); 
and it worked.

I know it's not a very good idea to encode to HTML Characters, but it's the 
only way to avoid problems with UTF-8 encoding. I believe it's possible to 
encode the field characters to HTML Characters before they go into database 
(though I couldn't find where it's done). The problem lies in including the 
field on the template file, because  get/print_custom_field is not a part of 
the plugin, but I'm sure it's possible to work around this."

Original comment by fireproofsocks on 3 Sep 2011 at 5:25

Added labels: Priority-Critical
Removed labels: Priority-Medium

GoogleCodeExporter commented 8 years ago

I finally found something that worked, at least in an independent test.  Check 
this out:

<?php
    function charset_decode_utf_8($string) { 
        $string = htmlspecialchars($string);

        /* Only do the slow convert if there are 8-bit characters */ 
        /* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */ 
        if (! preg_match("/[\200-\237]/", $string) and ! preg_match("/[\241-\377]/", $string)) {
            return $string;
        }

        // decode three byte unicode characters 
        $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e","'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",$string); 

        // decode two byte unicode characters 
        $string = preg_replace("/([\300-\337])([\200-\277])/e", "'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'", $string); 

        return $string; 
    }
?>
<html>
<head><title>Test form</title></head>
<body>
<?php

if ( !empty($_POST) ){
    print_r($_POST);
}
?>
    <form method="post">
        <div class="cctm_element_wrapper" id="custom_field_mymulti">

        <label for="cctm_mymulti" class="cctm_label cctm_multiselect cctm_multiselect_checkbox" id="cctm_label_mymulti">
            MyMulti
        </label>
        <br/><div class="cctm_muticheckbox_wrapper">

        <input type="checkbox" name="cctm_mymulti[]" class="cctm_mymulti cctm_muticheckbox" id="cctm_mymulti0" value="<?php print charset_decode_utf_8('xyzęegg'); ?>" > <label class="cctm_muticheckbox" for="cctm_mymulti0"><?php print charset_decode_utf_8('xyzęegg'); ?></label></div><br/><div class="cctm_muticheckbox_wrapper">
        <input type="checkbox" name="cctm_mymulti[]" class="cctm_mymulti cctm_muticheckbox" id="cctm_mymulti1" value="<?php print charset_decode_utf_8('xyzüuugh'); ?>" > <label class="cctm_muticheckbox" for="cctm_mymulti1"><?php print charset_decode_utf_8('xyzüuugh'); ?></label></div><br/><div class="cctm_muticheckbox_wrapper">
        <input type="checkbox" name="cctm_mymulti[]" class="cctm_mymulti cctm_muticheckbox" id="cctm_mymulti2" value="<?php print charset_decode_utf_8('normal">'); ?>" > <label class="cctm_muticheckbox" for="cctm_mymulti2"><?php print charset_decode_utf_8('normal">'); ?></label></div><br/><span class="cctm_description">Testing</span>
        </div>  

        <input type="submit" value="Submit" />
    </form>

</body>
</html>

That WORKS.  The foreign characters are properly converted to their HTML-entity 
equivalents.  But the multi-select doesn't want to get that field out of the 
json array correctly...

Original comment by fireproofsocks on 29 Sep 2011 at 2:30

GoogleCodeExporter commented 8 years ago

AHA.  It's WP's get_post_meta() and update_post_meta() that are causing this to 
fail.  Look:

print $value; // ["xyz\u0119egg","xyz\u00fcuugh","normal"]   <--- being sent to the database
print "<hr/>";
update_post_meta( $post_id, $field_name, $value );
$x = get_post_meta($post_id, $field_name, true);
print_r($x); exit; // ["xyzu0119egg","xyzu00fcuugh","normal"] <--- coming back from the database

Original comment by fireproofsocks on 29 Sep 2011 at 2:54
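
The same symptom can be reproduced outside WordPress. This sketch assumes that the meta functions unslash the value on its way into the database in the same way PHP's stripslashes() does, so stripslashes() stands in for that step here. That is an assumption about WP internals, not something shown in this thread.

```php
<?php
// The selected options, with "ę" and "ü" written as UTF-8 byte escapes.
$selected = array("xyz\xC4\x99egg", "xyz\xC3\xBCuugh", 'normal');

// json_encode() escapes non-ASCII characters as \uXXXX by default.
$json = json_encode($selected);

// ASSUMPTION: stand-in for the unslashing done when the meta is saved.
$stored = stripslashes($json);

// The result is still valid JSON, but the escapes are now plain text.
$restored = json_decode($stored, true);

echo $json, "\n";   // ["xyz\u0119egg","xyz\u00fcuugh","normal"]
echo $stored, "\n"; // ["xyzu0119egg","xyzu00fcuugh","normal"]
```

Because the damaged string still decodes without error, nothing fails loudly. The saved options simply stop matching the defined ones, which would explain the checkboxes coming back unchecked.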

GoogleCodeExporter commented 8 years ago

Original comment by fireproofsocks on 29 Sep 2011 at 3:07

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

So it appears the solution is to apply addslashes() to the value before it goes 
into the database.  So I've updated the multiselect.php class and modified its 
save_post_filter() function:

return addslashes(json_encode($posted_data[ CCTMFormElement::post_name_prefix . $field_name ]));

That finally works, and will be available in 0.9.4.  Now to work out the other 
fields that are getting weird quotes now.

Original comment by fireproofsocks on 29 Sep 2011 at 3:41
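
A standalone sketch of why the extra addslashes() helps, again using stripslashes() as an assumed stand-in for the unslashing WordPress applies when saving meta: slashing the JSON first means the unslash step hands back exactly the string that json_encode() produced.

```php
<?php
// Sample selection: one value with "ę" (UTF-8 bytes) and one with quotes.
$selected = array("xyz\xC4\x99egg", 'say "hi"');

$json    = json_encode($selected);
$slashed = addslashes($json);       // what save_post_filter() now returns

// ASSUMPTION: stand-in for the unslashing done when the meta is saved.
$stored  = stripslashes($slashed);

$restored = json_decode($stored, true);

var_dump($stored === $json);        // the JSON survives the round trip
var_dump($restored === $selected);  // and decodes to the original values
```

Passing JSON_UNESCAPED_UNICODE to json_encode() (PHP 5.4+) would keep "ę" out of \u form, but quotes and backslashes inside values are still escaped with backslashes, so the slashing would be needed either way.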

GoogleCodeExporter commented 8 years ago

Way to go !!! It wasn't that obvious after all, Thanks again for your 
dedication and making your plugin better.

Original comment by Mt.Zieli...@gmail.com on 30 Sep 2011 at 3:41

GoogleCodeExporter commented 8 years ago

Found one more glitch with this that has to do with the differences between how 
WP handles updating meta data and creating it.  I had to double up on the 
addslashes when the post is being created.  Craziness.  But it's in 0.9.4.

Original comment by fireproofsocks on 30 Sep 2011 at 4:05

xytroyzy / wordpress-custom-content-type-manager

Problem with encoding for multiselect field. #88