xenocrat / chyrp-lite

An ultra-lightweight blogging engine, written in PHP.
https://chyrplite.net/
BSD 3-Clause "New" or "Revised" License

Unicode emoji characters break posting functionality. #153

Closed djtuBIG-MaliceX closed 2 years ago

djtuBIG-MaliceX commented 2 years ago

Just set up chyrp-lite v2022.01 on my personal VPS. It all works well, but posting breaks as soon as the post includes any Unicode emoji (e.g. posts that include 🦆 😀😁😂🤣😎 etc., as entered via the emoji keyboard shortcuts on most OSes nowadays).

Eg:

image

Results in:

image

In addition, it creates an empty article without post body content.

image

Otherwise, ASCII-only Markdown works fine.

xenocrat commented 2 years ago

Hello there,

Thank you for the report. I will investigate and try to reproduce. What DBMS are you using: MySQL, SQLite, PostgreSQL?

djtuBIG-MaliceX commented 2 years ago

MariaDB 10.3 on Debian Buster, which is basically MySQL.

xenocrat commented 2 years ago

I was able to reproduce the error. I also did a bit of investigating. You need to ensure your database tables have been created with one of the utf8mb4 collations; otherwise they will not be able to store the full range of Unicode characters. More details here. A table collation of "utf8" is actually the half-measure utf8mb3 – fine for storing text in most languages but not good enough to store emoji, because utf8mb3 stores at most three bytes per character and emoji need four.
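To make the byte-width limit concrete, here is a minimal Python sketch (not part of Chyrp Lite; `utf8_width` and `fits_in_utf8mb3` are hypothetical helper names chosen for illustration). It assumes only that utf8mb3 accepts characters needing at most three UTF-8 bytes.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character needs at most 3 UTF-8 bytes.

    Characters outside the Basic Multilingual Plane (most emoji) need
    4 bytes, which a utf8mb3 column cannot hold.
    """
    return all(utf8_width(ch) <= 3 for ch in text)


if __name__ == "__main__":
    # "\U0001F986" is the duck emoji from the original report.
    for sample in ["quack", "héllo", "日本語", "\U0001F986"]:
        print(repr(sample), fits_in_utf8mb3(sample))
```

Plain ASCII, accented Latin, and CJK text all pass the check, while the duck emoji does not, which matches the behaviour reported above: ASCII-only Markdown posts fine, emoji do not.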

In addition to this, I might need to set the connection charset to utf8mb4 instead of utf8. The documentation is not clear on this and I'm not able to test utf8mb4 collation right now due to phpMyAdmin misbehaving, but I've committed the change to SQL.php that implements this.

xenocrat commented 2 years ago

I got phpMyAdmin to play nicely and confirmed the fix. If you patch SQL.php from the develop branch and ensure your database, tables, and columns are all using a utf8mb4 collation then the quacken shall be released. 🦆
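For anyone converting an existing install, the following is a rough Python sketch that only generates the `ALTER TABLE ... CONVERT TO CHARACTER SET` statements; it does not connect to a database. The helper name `conversion_statements` is hypothetical, the `utf8mb4_unicode_ci` collation is an assumed choice (any utf8mb4 collation should do), and the table names are examples that may need a prefix on your installation. Back up the database before running anything like this.

```python
import re


def conversion_statements(tables, collation="utf8mb4_unicode_ci"):
    """Build one ALTER TABLE statement per table (hypothetical helper).

    The collation default is an assumption, not something Chyrp Lite mandates.
    """
    statements = []
    for table in tables:
        # Refuse anything that is not a plain identifier, since the name
        # is interpolated directly into the SQL text.
        if not re.fullmatch(r"[A-Za-z0-9_]+", table):
            raise ValueError(f"unexpected table name: {table!r}")
        statements.append(
            f"ALTER TABLE `{table}` "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    # Example table names only; adjust to match your own schema.
    for stmt in conversion_statements(["posts", "post_attributes"]):
        print(stmt)
```

The printed statements can then be pasted into phpMyAdmin or the `mysql` client. `CONVERT TO CHARACTER SET` changes both the table default and the existing text columns, so it covers the "tables and columns" part in one step.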

djtuBIG-MaliceX commented 2 years ago

Can confirm the fix after updating the relevant columns (name and value) in the post_attributes table, as well as replacing SQL.php with the one from the develop branch.

image

djtuBIG-MaliceX commented 2 years ago

One more thing: Not sure if it's because I didn't switch it for everything but if I search using emoji text like so: image

This occurs: image

Normal search behaviour otherwise works. The above is obviously an edge case 😅

xenocrat commented 2 years ago

Yes I think it's caused by the mix of collations across the posts and post_attributes tables. I'll test that out to be sure. Thanks for reporting!

xenocrat commented 2 years ago

Confirmed: if all tables being searched (posts, post_attributes, and pages) are fully converted to utf8mb4, then the search will succeed.
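To check whether any columns were missed, one option is to query `information_schema.COLUMNS` and filter the result. The Python sketch below is an assumption-laden illustration: `AUDIT_QUERY` uses the `%s` placeholder style of common Python MySQL drivers, `columns_needing_conversion` is a hypothetical helper, and the sample rows are made up for demonstration rather than taken from a real Chyrp Lite database.

```python
# Query text only; run it with the MySQL/MariaDB client library of your choice,
# passing the database name as the single parameter.
AUDIT_QUERY = """
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = %s
  AND CHARACTER_SET_NAME IS NOT NULL
ORDER BY TABLE_NAME, ORDINAL_POSITION
"""


def columns_needing_conversion(rows):
    """Return (table, column) pairs whose character set is not utf8mb4.

    `rows` is an iterable of 4-tuples in the column order of AUDIT_QUERY.
    """
    return [
        (table, column)
        for table, column, charset, _collation in rows
        if charset != "utf8mb4"
    ]


if __name__ == "__main__":
    # Made-up rows for illustration; "some_column" is a placeholder name.
    sample_rows = [
        ("post_attributes", "name", "utf8mb4", "utf8mb4_unicode_ci"),
        ("post_attributes", "value", "utf8mb4", "utf8mb4_unicode_ci"),
        ("posts", "some_column", "utf8", "utf8_general_ci"),
    ]
    print(columns_needing_conversion(sample_rows))
```

An empty result for the posts, post_attributes, and pages tables would indicate that the searched tables no longer mix character sets.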