zwegner / faster-utf8-validator

A very fast library for validating UTF-8 using AVX2/SSE4 instructions

Provide separated functions/files for AVX2/SSE4/SSSE3 #5

Open zufuliu opened 3 years ago

zufuliu commented 3 years ago

Thanks for the great work 👍. I now use the library in Notepad2 (https://github.com/zufuliu/notepad2, as of commit https://github.com/zufuliu/notepad2/commit/481377287679fd85e9c2d16f8f9220e0e30a6ac6). I found that manually expanded/unrolled versions of the two functions are about 2x faster (on an i5, x64 build with MSVC 2017) than the current functions.

The attachment contains the expanded results. It's based on the preprocessor output from cl /EP (similar to gcc -E -P), with some formatting changes:

  1. In the ASCII test (vmask_t high = v_test_bit(bytes, 7);), _mm_slli_epi16 / _mm256_slli_epi16 is removed.
  2. _mm_srli_epi16 / _mm256_srli_epi16 is removed from vec_t e_2 = v_lookup(error_2, shifted_bytes, 0);.
  3. The for (int n = 1; n <= 3; n++) loop is manually unrolled, which also fixes a compiler warning about a potentially uninitialized vmask_t cont;.
  4. Added a scalar fallback for the SSE 4.1 _mm_testz_si128, so the code also works on SSSE3. The scalar version even seems faster:
    #if defined(__SSE4_1__)
    if (!_mm_testz_si128(_mm_and_si128(e_1, e_2), e_3)) {
        return 0;
    }
    #else
    e_3 = _mm_and_si128(_mm_and_si128(e_1, e_2), e_3);
    uint64_t dummy[2];
    _mm_storeu_si128((__m128i *)dummy, e_3);
    dummy[0] |= dummy[1];
    if (dummy[0]) {
        return 0;
    }
    #endif

Applications can use the separated functions for runtime dispatching, and splitting them into separate functions/files would make the library easier to build.

The attachment (z_validate.zip) is nearly identical to the code at https://github.com/zufuliu/notepad2/blob/master/src/EditEncoding.c#L1227.

zwegner commented 3 years ago

Hey Zufu, this is great to see! Glad people are using the library!

A few comments:

1) I have a wip branch that has some nice speedups over the master branch, and supports more architectures (AVX-512 and NEON): https://github.com/zwegner/faster-utf8-validator/blob/wip/z_validate.c

Unfortunately the code got even messier, including a Python script to generate the lookup tables. I've been putting off cleaning up the wip branch for too long (it's easy for me to lose focus with so many incomplete side projects competing for my time). I had almost finished cleaning up the documentation, when Daniel Lemire (or possibly John Keiser? I'm not sure) figured out a trick to save a bit in the lookup tables: https://github.com/simdjson/simdjson/pull/993

...this made a complicated trick I was using obsolete, and means I need to rewrite a bunch more documentation. I'll try and finish this soon, and finally clean up the WIP branch. Writing clear documentation for a complicated algorithm can be pretty tedious for me though, so it might take a bit :) I at least pushed up some changes I had sitting around locally.

2) I'm skeptical of splitting the source file up or expanding macros. I know the code is a bit messy, and relies too much on the preprocessor, but there is a lot of duplicated logic across different architectures, and I've used the abstractions a lot for experimenting/optimizing. C is just not a good language for performance tuning, in my opinion. I'm half-considering converting the library to use Python-generated assembly...

Some changes I've made in the wip branch might help with the issues you're seeing, though: the shift right intrinsic macros have a conditional to make sure the shift is non-zero, since I also saw a compiler not optimize that out:

#   define v_shr(x, shift)  ((shift) ? _mm256_srli_epi16((x), (shift)) : (x))

For handling the separate files/build system issue, if you're not able to compile the same file multiple times with different options, would it work for you to use separate files that just look like this?

#define AVX2
#include "z_validate.c"

3) The note about vmask_t cont being uninitialized and the code for SSSE3 are appreciated. I'll fix these, thanks!

zufuliu commented 3 years ago

Thanks, glad to see there is an ARM NEON version. Indeed, the code is harder to read.

About SSSE3: maybe the following code is better (at least for 32-bit builds):

e_3 = _mm_and_si128(_mm_and_si128(e_1, e_2), e_3);
const __m128i ones = _mm_xor_si128(_mm_setzero_si128(), _mm_setzero_si128());
const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_xor_si128(e_3, ones), ones));
if (mask != 0xFFFF) {
    return 0;
}

it's based on Clang assembler output, see https://godbolt.org/z/M9eEYr

Edit: my head must have been kicked by a mule, I didn't recognize the obvious code:

e_3 = _mm_and_si128(_mm_and_si128(e_1, e_2), e_3);  
const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(e_3, _mm_setzero_si128()));
if (mask != 0xFFFF) {
    return 0;
}