Investigate optimized utf8 decoders and validators

thejoshwolfe commented 5 years ago

inspired by some comments in https://github.com/ziglang/zig/pull/2099 , check out these algorithms for UTF-8 processing:

So far I'm concerned that none of the above properly validate UTF-8. None of them explain which validation checks they're doing, and Wikipedia lists several checks that are commonly overlooked. And because the above implementations are optimized, it's difficult to know what they're doing without testing, which is part of the objective for this issue.

In addition to nonsensical byte sequences, we also need to be sure to reject:

Overlong encoding
Surrogate half
Overflow

We already have tests for these in the unicode.zig library (search for testError). Switching to an optimized implementation should not regress those tests.

shawnl commented 5 years ago

Looking at the atomic state graph on this page: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ and comparing it with this chart:


 * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
 *
 * Table 3-7. Well-Formed UTF-8 Byte Sequences
 *
 * +--------------------+------------+-------------+------------+-------------+
 * | Code Points        | First Byte | Second Byte | Third Byte | Fourth Byte |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0000..U+007F     | 00..7F     |             |            |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0080..U+07FF     | C2..DF     | 80..BF      |            |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0800..U+0FFF     | E0         | A0..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+1000..U+CFFF     | E1..EC     | 80..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+D000..U+D7FF     | ED         | 80..9F      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+E000..U+FFFF     | EE..EF     | 80..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+10000..U+3FFFF   | F0         | 90..BF      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+40000..U+FFFFF   | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+100000..U+10FFFF | F4         | 80..8F      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+

They are identical.

matu3ba commented 2 years ago

Notable source for cross-validation libgrapheme from suckless project.
If SIMD is acceptable: simdutf8 or a port.
Article how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8 for own implementation.
Stuff for Benchmarking utf8

ziglang / zig

Investigate optimized utf8 decoders and validators #2100