ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
34.98k stars 2.56k forks source link

Investigate optimized utf8 decoders and validators #2100

Open thejoshwolfe opened 5 years ago

thejoshwolfe commented 5 years ago

inspired by some comments in https://github.com/ziglang/zig/pull/2099 , check out these algorithms for UTF-8 processing:

So far I'm concerned that none of the above properly validate UTF-8. None of them explain which validation checks they're doing, and Wikipedia lists several checks that are commonly overlooked. And because the above implementations are optimized, it's difficult to know what they're doing without testing, which is part of the objective for this issue.

In addition to nonsensical byte sequences, we also need to be sure to reject:

We already have tests for these in the unicode.zig library (search for testError). Switching to an optimized implementation should not regress those tests.

shawnl commented 5 years ago

Looking at the atomic state graph on this page: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ and comparing it with this chart:


 * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
 *
 * Table 3-7. Well-Formed UTF-8 Byte Sequences
 *
 * +--------------------+------------+-------------+------------+-------------+
 * | Code Points        | First Byte | Second Byte | Third Byte | Fourth Byte |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0000..U+007F     | 00..7F     |             |            |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0080..U+07FF     | C2..DF     | 80..BF      |            |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0800..U+0FFF     | E0         | A0..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+1000..U+CFFF     | E1..EC     | 80..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+D000..U+D7FF     | ED         | 80..9F      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+E000..U+FFFF     | EE..EF     | 80..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+10000..U+3FFFF   | F0         | 90..BF      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+40000..U+FFFFF   | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+100000..U+10FFFF | F4         | 80..8F      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+

They are identical.

matu3ba commented 2 years ago
  1. Notable source for cross-validation libgrapheme from suckless project.
  2. If SIMD is acceptable: simdutf8 or a port.
  3. Article how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8 for own implementation.
  4. Stuff for Benchmarking utf8