Closed zamgi closed 2 years ago
@zamgi I hope you are aware of the HuggingFace Transformers project.
I hope you also know of this Microsoft initiative: https://github.com/Microsoft/BlingFire
Ideally, BlingFire can handle ALL tokenization needs. Curious what you think of that.
@zamgi Here is more background: https://github.com/zhongkaifu/Seq2SeqSharp/issues/33
@GeorgeS2019
I think that since Seq2SeqSharp is already using interop, why not use SentencePiece?
@zamgi It is always important to be curious and see what options the experts are using. I could be wrong. That is why we are here: to exchange ideas. :-)
@zamgi "sentencepiece.dll" is the Windows build of SentencePiece, but it doesn't work in a Linux environment. Most Seq2SeqSharp users (including myself) train models in a Linux environment and run inference in both Windows and Linux environments, so Seq2SeqSharp has to support both of them.
To build in SentencePiece, first, we need to make it configurable so that users can enable or disable this feature, because some users use word-level tokens rather than subword-level tokens. Second, we have several options for building it in, such as 1) placing SentencePiece binary files in the repo, 2) compiling it at build time, 3) using BlingFire as @GeorgeS2019 suggested above, or other options.
Here is the error from the broken build:
/home/runner/work/Seq2SeqSharp/Seq2SeqSharp/SeqSimilarityConsole/SeqSimilarityConsole.csproj(32,5): error MSB3073: The command "copy "..\dll\win_x64\sentencepiece.dll" "/home/runner/work/Seq2SeqSharp/Seq2SeqSharp/SeqSimilarityConsole/\bin"" exited with code 127.
SeqLabelConsole -> /home/runner/work/Seq2SeqSharp/Seq2SeqSharp/SeqLabelConsole/bin/SeqLabelConsole.dll
/usr/bin/sh: 2: /tmp/tmp12cd6e7f39fb42ac843da133daed1757.exec.cmd: copy: not found
Because the build environment is Linux, file copying there uses "cp" rather than "copy". We need to make sure Seq2SeqSharp can be built and run in both Linux and Windows environments.
@zhongkaifu I can try to build "sentencepiece.dll" for Linux ("sentencepiece.so") and have the C# code use the right one depending on the platform, but where do I get "libgcc_s_seh-1.dll", "libgfortran-3.dll", "libopenblas.dll", "libquadmath-0.dll" for Linux? Do they exist?
Thanks @zamgi. Because Linux has many different distributions, such as Ubuntu (18.04, 20.04...), CentOS and others, and we do have users running Seq2SeqSharp on MacOS, it would be great if we could include the SentencePiece source code (or a git checkout of the source code) in the project and build it along with the rest of the Seq2SeqSharp projects.
As for "libgcc_s_seh-1.dll", "libgfortran-3.dll", "libopenblas.dll", "libquadmath-0.dll", I have no idea about them. "sentencepiece.dll" should be the only file SentencePiece depends on.
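The platform-dependent loading discussed above could be sketched roughly as follows. This is only an illustration, not the code from the pull request: the class name `SentencePieceNative`, the logical name "sentencepiece", and the `RegisterResolver` helper are hypothetical, and the file names are the ones mentioned in this thread. It assumes .NET Core 3.0 or later, where `NativeLibrary` is available, and it does not cover macOS because no macOS build is named here.

```csharp
using System;
using System.Reflection;
using System.Runtime.InteropServices;

public static class SentencePieceNative
{
    // Hypothetical logical name that [DllImport] declarations would refer to.
    public const string LogicalName = "sentencepiece";

    // Maps an OS to the native file name mentioned in this thread.
    public static string FileNameFor(OSPlatform os)
    {
        if (os == OSPlatform.Windows) return "sentencepiece.dll";
        if (os == OSPlatform.Linux) return "libsentencepiece.so";
        // No macOS (or other) build is named in the discussion, so fail loudly.
        throw new PlatformNotSupportedException($"No SentencePiece build known for {os}.");
    }

    // Detects the OS the process is currently running on.
    public static OSPlatform CurrentPlatform()
    {
        foreach (OSPlatform os in new[] { OSPlatform.Windows, OSPlatform.Linux, OSPlatform.OSX })
        {
            if (RuntimeInformation.IsOSPlatform(os)) return os;
        }
        throw new PlatformNotSupportedException("Unrecognized operating system.");
    }

    // Redirects the logical name to the per-platform file.
    // Note: a resolver can be set only once per assembly.
    public static void RegisterResolver(Assembly assembly)
    {
        NativeLibrary.SetDllImportResolver(assembly, (name, asm, searchPath) =>
        {
            // Returning IntPtr.Zero lets the runtime fall back to its default probing.
            if (name != LogicalName) return IntPtr.Zero;
            string file = FileNameFor(CurrentPlatform());
            return NativeLibrary.TryLoad(file, asm, searchPath, out IntPtr handle)
                ? handle
                : IntPtr.Zero;
        });
    }
}
```

With something like this in place, a single `[DllImport("sentencepiece")]` declaration could serve both platforms, provided `SentencePieceNative.RegisterResolver(typeof(SomeInteropClass).Assembly)` is called once at startup (where `SomeInteropClass` stands for whichever class holds the imports).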
Thanks for making these changes. I just left some comments for you and I'm currently running some tests for it.
I tested the SeqWebAPI project on both Windows and Linux.
Windows: I copied the sentencepiece.dll file to the SeqWebAPI bin folder and ran the tests, and they all passed.
Linux: I copied the .so file to the SeqWebAPI bin folder and ran the tests, but it threw an exception: "Unhandled exception. System.DllNotFoundException: Unable to load shared library 'libsentencepiece.so' or one of its dependencies. In order to help diagnose loading problems, consider setting the LD_DEBUG environment variable: liblibsentencepiece.so: cannot open shared object file: No such file or directory". Then I rebuilt your customized SentencePiece project with the following commands (they are from the official SentencePiece repo):
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v
Then I reran the tests and they all passed.
Here are two suggestions:
As far as I know, the "sentencepiece.dll" build for Windows does not use any dependencies (links to other DLLs, including the Windows memory management runtime), but I don't know whether that holds for the Linux build. I built "libsentencepiece.so" for Linux on Ubuntu 21.10. Whether it is suitable for other Linux builds/versions is unknown. These two commands, "sudo make install" and "sudo ldconfig -v", put the .so files into the common Linux library path. Maybe this is the reason. Using interop/native code always brings some problems.
"Above problem can be resolved by rebuild customized SentencePiece, maybe we need to add above build command to README.MD file" - Yes, probably. you know better))
@zhongkaifu
I forgot to say: I have not run the SeqWebAPI app on Linux, but I did run the demo SentencePiece console app (./Seq2SeqSharp/SentencePiece/[sentencepiece_testapp]) on Linux, and it works well. Maybe the reason is in the web-app environment (current folder, search paths for the .so...)?
And one question: I see the following lines in the 'Seq2SeqSharp/SeqWebAPIs/Seq2SeqInstance.cs' file:
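One way to check the "current folder / search paths" hypothesis is a small diagnostic that reports, for a given file name, whether the library is present and loadable from the application base directory and from the current working directory. This is a hedged sketch, not part of Seq2SeqSharp: `NativeProbe`, `Describe`, and `Report` are hypothetical names, and it assumes .NET Core 3.0 or later for `NativeLibrary`.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

public static class NativeProbe
{
    // Describes whether a native library exists at the path and whether it loads.
    public static string Describe(string path)
    {
        if (!File.Exists(path)) return "missing";
        if (!NativeLibrary.TryLoad(path, out IntPtr handle))
        {
            // The file is there, so a dependency or an ABI mismatch is the likely cause.
            return "present but failed to load";
        }
        NativeLibrary.Free(handle); // diagnostic only, so release the handle again
        return "loaded";
    }

    // Prints the result for the two directories a web app and a console app may disagree on.
    public static void Report(string fileName)
    {
        foreach (string dir in new[] { AppContext.BaseDirectory, Environment.CurrentDirectory })
        {
            string path = Path.Combine(dir, fileName);
            Console.WriteLine($"{path}: {Describe(path)}");
        }
    }
}
```

Calling `NativeProbe.Report("libsentencepiece.so")` at startup of both the console app and the web app would show whether the two environments actually look in different places.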
....
lock ( locker )
{ ...
var nrs = m_seq2seq.Test
I tested SeqWebAPIs on Ubuntu 18.04 and got those exceptions, but after rebuilding the customized SentencePiece, it works fine. So the problem may be caused by a mismatch between Linux versions. Anyway, I can add "build SentencePiece" instructions to the README.md file.
For using interop/native code, I found some code here (https://github.com/wang1ang/SentencePieceWrapper/blob/master/SentencePieceWrapper/SentencePieceWrapper.cpp), but it is implemented in C++/CLI rather than C#. Not sure if it would be helpful to you.
For "lock": yes, it's thread safe and you can remove it. But there may be some load-balancing issues you need to be aware of. For the CPU version, it's totally okay, because the operating system schedules which CPUs work on which threads and swaps memory to page files when the workload is very large. However, for the GPU version, things are different, because the operating system doesn't do that kind of scheduling and swapping for GPUs. So when the workload is very large, it may cause 1) OOM issues and 2) imbalanced GPU usage. To resolve this problem, we need to implement GPU load balancing for SeqWebAPIs or other services.
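As a middle ground between one global lock and fully unbounded concurrency, the number of simultaneous inference calls could be capped. The sketch below is an assumption-laden illustration, not real GPU load balancing: it only bounds concurrency (which addresses the OOM risk) and does nothing to spread work across GPUs. `InferenceGate` and its members are hypothetical names and not part of Seq2SeqSharp.

```csharp
using System;
using System.Threading;

// Hypothetical gate that limits how many inference calls run at the same time.
public sealed class InferenceGate
{
    private readonly SemaphoreSlim _slots;

    public InferenceGate(int maxConcurrent)
    {
        if (maxConcurrent < 1) throw new ArgumentOutOfRangeException(nameof(maxConcurrent));
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
    }

    // Blocks the caller until a slot is free, runs the work, then releases the slot.
    public T Run<T>(Func<T> inference)
    {
        _slots.Wait();
        try
        {
            return inference();
        }
        finally
        {
            _slots.Release(); // always give the slot back, even if inference throws
        }
    }
}
```

A request handler would then wrap its call in `gate.Run(() => ...)` instead of `lock (locker) { ... }`. With `maxConcurrent` set to 1 this behaves like the existing lock; larger values trade memory headroom for throughput.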
using SentencePiece from C# code