wp-cli / db-command

Performs basic database operations using credentials stored in wp-config.php.
MIT License

regex search vs. UTF-8 encoding #250

Closed: szepeviktor closed this issue 7 months ago

szepeviktor commented 7 months ago

Bug Report

Describe the current, buggy behavior

wp db search '\p{Cf}' --regex

A regex search for a character class matches individual bytes of a UTF-8 encoded character. For example, for the í in "hírlevél", the result is displayed as "blog h▒▒rlevél feliratkozás".

How to search in UTF-8 encoded text?

BTW, wp db search "$(printf '\xc3')" --regex also finds the first byte of í (and of every other character whose UTF-8 encoding starts with the byte 0xC3).
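To make the byte-level behaviour concrete, here is a minimal sketch that dumps the UTF-8 bytes of "hírlevél". It only uses printf, od, tr, and wc; the variable names are illustrative and not part of wp-cli. Octal escapes are used instead of \x escapes because not every shell's printf understands the latter.

```shell
# Hex-dump the UTF-8 encoding of "hírlevél".
# \303\255 is í (0xC3 0xAD) and \303\251 is é (0xC3 0xA9), written as octal escapes.
word_hex=$(printf 'h\303\255rlev\303\251l' | od -An -tx1 | tr -d ' \t\n')

# Count the bytes: 8 characters, but two of them take 2 bytes each.
byte_count=$(printf 'h\303\255rlev\303\251l' | wc -c | tr -d ' ')

echo "$word_hex"    # 68c3ad726c6576c3a96c
echo "$byte_count"  # 10
```

The byte c3 appears twice, once as the lead byte of í and once as the lead byte of é, which is why a byte-wise match on 0xC3 hits both characters and leaves a dangling continuation byte behind.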

szepeviktor commented 7 months ago

Same goes for search-replace command.

danielbachhuber commented 7 months ago

> How to search in UTF-8 encoded text?

@szepeviktor I'm not sure I follow. Can you share an example of what you tried, what you saw, and what you expected to see?

szepeviktor commented 7 months ago

> Can you share an example of what you tried, what you saw, and what you expected to see?

I issued this command: wp db search '\p{Cf}' --regex

I saw results like: blog h▒▒rlevél feliratkozás

One of the "block" characters was highlighted, so the two-byte UTF-8 character was split into two separate bytes.

szepeviktor commented 7 months ago

The \p{Cf} regular expression matches "Format characters". I am looking for U+200B ZERO WIDTH SPACE and other invisible characters in post content and in post meta. https://unicode.org/charts/PDF/U2000.pdf

szepeviktor commented 7 months ago

Solution

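You need to add --regex-flags=u to make the search UTF-8 aware 😃 The u flag is the PCRE modifier that treats the pattern and the subject strings as UTF-8 instead of single bytes. https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php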
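Combined with the original command, the fix could be wrapped like this. This is a sketch only: the function name find_format_chars is hypothetical, and it assumes wp is on the PATH and is run from inside a WordPress installation. The flags themselves are the ones quoted in this thread.

```shell
# Hypothetical helper: search the database for Unicode "Format" (Cf)
# characters such as U+200B ZERO WIDTH SPACE.
find_format_chars() {
  # --regex-flags=u adds the PCRE "u" modifier, so the pattern and the data
  # are interpreted as UTF-8 rather than as individual bytes.
  # Any extra arguments (for example table names) are passed through to wp.
  wp db search '\p{Cf}' --regex --regex-flags=u "$@"
}
```

Called without arguments it runs the same search as in the bug report, only with the u flag. Passing table names, for example find_format_chars wp_posts wp_postmeta (assuming the default wp_ table prefix), should limit the search to post content and post meta.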

danielbachhuber commented 7 months ago

@szepeviktor Glad I was here to help you figure it out! 😁