zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). Its features include automatic differentiation, several network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), and multimodal models for text and images.

using SentencePiece from C# code #43

Closed zamgi closed 2 years ago

zamgi commented 2 years ago

using SentencePiece from C# code

GeorgeS2019 commented 2 years ago

@zamgi I hope you are aware of the HuggingFace Transformers project.

I hope you also know of this Microsoft initiative: https://github.com/Microsoft/BlingFire

Ideally, BlingFire can handle ALL tokenization needs. I'm curious what you think of that.

GeorgeS2019 commented 2 years ago

@zamgi Here is more background https://github.com/zhongkaifu/Seq2SeqSharp/issues/33

zamgi commented 2 years ago

@GeorgeS2019

I think that since Seq2SeqSharp already uses interop, why not use SentencePiece?

GeorgeS2019 commented 2 years ago

@zamgi It is always important to be curious and see what options the experts are using. I could be wrong. That is why we are here: to exchange ideas. :-)

zhongkaifu commented 2 years ago

@zamgi "sentencepiece.dll" is the Windows build of SentencePiece, and it doesn't work in a Linux environment. Most Seq2SeqSharp users (including myself) train models on Linux and run inference on both Windows and Linux, so Seq2SeqSharp has to support both.

To build SentencePiece in, we first need to make it configurable so that users can enable or disable the feature, because some users work with word-level tokens rather than subword-level tokens. Second, there are several ways to build it in: 1) place the SentencePiece binary files in the repo, 2) compile it at build time, 3) use BlingFire as @GeorgeS2019 suggested above, or other options.

zhongkaifu commented 2 years ago

Here is the error from the broken build:

/home/runner/work/Seq2SeqSharp/Seq2SeqSharp/SeqSimilarityConsole/SeqSimilarityConsole.csproj(32,5): error MSB3073: The command "copy "..\dll\win_x64\sentencepiece.dll" "/home/runner/work/Seq2SeqSharp/Seq2SeqSharp/SeqSimilarityConsole/\bin"" exited with code 127.
SeqLabelConsole -> /home/runner/work/Seq2SeqSharp/Seq2SeqSharp/SeqLabelConsole/bin/SeqLabelConsole.dll
/usr/bin/sh: 2: /tmp/tmp12cd6e7f39fb42ac843da133daed1757.exec.cmd: copy: not found

The build environment is Linux, where the command for copying files is "cp" rather than "copy". We need to make sure Seq2SeqSharp can be built and run in both Linux and Windows environments.
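
A build step that has to run on both systems can pick the native library by platform instead of hard-coding a Windows command. The following is only a sketch in POSIX shell: the helper name `native_lib_name` is hypothetical, and the macOS file name is an assumption, since only the Windows and Linux files are named in this thread.

```shell
#!/bin/sh
# Hypothetical helper: map a kernel name (as printed by `uname -s`)
# to the SentencePiece native library file expected on that platform.
native_lib_name() {
    case "$1" in
        Linux*)               echo "libsentencepiece.so" ;;
        Darwin*)              echo "libsentencepiece.dylib" ;;  # assumed name on macOS
        MINGW*|MSYS*|CYGWIN*) echo "sentencepiece.dll" ;;
        *)                    echo "unsupported platform: $1" >&2; return 1 ;;
    esac
}

# Usage: resolve the file name for the machine running the build.
#   lib="$(native_lib_name "$(uname -s)")"
```

Inside the .csproj itself, MSBuild's built-in Copy task is another option, because it does not go through the platform shell at all.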

zamgi commented 2 years ago

@zhongkaifu I can try to build "sentencepiece.dll" for Linux ("sentencepiece.so") and make the C# code pick the right one depending on the platform. But where would we get "libgcc_s_seh-1.dll", "libgfortran-3.dll", "libopenblas.dll" and "libquadmath-0.dll" for Linux? Do they exist?

zhongkaifu commented 2 years ago

Thanks @zamgi. Because Linux has many different distributions, such as Ubuntu (18.04, 20.04...), CentOS and others, and we do have users running Seq2SeqSharp on macOS, it would be great if we could include the SentencePiece source code (or check it out with git) in the project and build it along with the rest of the Seq2SeqSharp projects.

As for "libgcc_s_seh-1.dll", "libgfortran-3.dll", "libopenblas.dll" and "libquadmath-0.dll", I have no idea about them. "sentencepiece.dll" should be the only file SentencePiece depends on.

zhongkaifu commented 2 years ago

Thanks for making these changes. I just left some comments for you and I'm currently running some tests for it.

zhongkaifu commented 2 years ago

I tested the SeqWebAPI project on both Windows and Linux.

Windows: I copied the sentencepiece.dll file to the SeqWebAPI bin folder and ran the tests, and all of them passed.

Linux: I copied the .so file to the SeqWebAPI bin folder and ran the tests, but it threw an exception: "Unhandled exception. System.DllNotFoundException: Unable to load shared library 'libsentencepiece.so' or one of its dependencies. In order to help diagnose loading problems, consider setting the LD_DEBUG environment variable: liblibsentencepiece.so: cannot open shared object file: No such file or directory". Then I rebuilt your customized SentencePiece project with the following commands (they are from the official SentencePiece repo):

% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

Then I reran the tests and they all passed.

Here are two suggestions:

  1. We may need to figure out why the copied .so file threw an exception. Maybe it also depends on some other files?
  2. The problem above can be resolved by rebuilding the customized SentencePiece, so maybe we need to add the build commands above to the README.MD file?
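
For the first suggestion, `ldd` lists the shared libraries a .so file needs and marks the ones the loader cannot resolve as "not found". A minimal sketch, assuming a glibc-based Linux system; the helper name `missing_deps` is hypothetical, and it only filters `ldd` output passed on standard input:

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print the names of
# the libraries that the dynamic loader could not resolve.
missing_deps() {
    awk '/=> not found/ { print $1 }'
}

# Usage (the path is an example; point it at the copied file):
#   ldd ./libsentencepiece.so | missing_deps
```

If nothing is printed, the file's own dependencies resolve, and the search path used to locate the library is the next thing to check, for example with `LD_DEBUG=libs` as the exception message suggests.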

zamgi commented 2 years ago

As far as I know, the Windows build of "sentencepiece.dll" has no dependencies (it does not link to other DLLs, including the Windows memory-management runtime), but I don't know whether that holds for the Linux build. I built "libsentencepiece.so" on Ubuntu 21.10; whether it suits other Linux distributions or versions is unknown. These two commands:

% sudo make install
% sudo ldconfig -v

put the .so files into the common Linux library path. Maybe this is the reason. Using interop/native code always brings some problems.

"Above problem can be resolved by rebuild customized SentencePiece, maybe we need to add above build command to README.MD file" - Yes, probably. you know better))

zamgi commented 2 years ago

@zhongkaifu

I forgot to say: I have not run the SeqWebAPI app on Linux, but I did run the demo SentencePiece console app (./Seq2SeqSharp/SentencePiece/[sentencepiece_testapp]) on Linux, and it works well. Maybe the reason is the web-app environment (current folder, search paths for .so files...)?

And one question: I see the following lines in the 'Seq2SeqSharp/SeqWebAPIs/Seq2SeqInstance.cs' file:

....
lock ( locker )
{
    ...
    var nrs = m_seq2seq.Test( groupBatchTokens ); // main call
    ...
}
...

zhongkaifu commented 2 years ago

I tested SeqWebAPIs on Ubuntu 18.04 and got those exceptions, but after rebuilding the customized SentencePiece it works fine. So the problem may be caused by a mismatch between Linux versions. Anyway, I can add SentencePiece build instructions to the README.md file.

As for using interop/native code, I found some code here (https://github.com/wang1ang/SentencePieceWrapper/blob/master/SentencePieceWrapper/SentencePieceWrapper.cpp), but it is implemented in C++/CLI rather than C#. Not sure if it would be helpful to you.

As for the "lock": yes, it's thread safe and you can remove it. But there may be some load-balancing issues you need to be aware of. For the CPU version it's totally okay, because the operating system schedules threads across CPUs and can swap memory out to page files when the workload is very large. For the GPU version things are different, because the operating system doesn't do that kind of scheduling and swapping for GPUs. So when the workload is very large, it may cause 1) OOM issues and 2) imbalanced GPU usage. To resolve this, we need to implement GPU load balancing for SeqWebAPIs or other services.
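
As a rough illustration of the balancing idea, a service could route each request to the GPU with the least memory in use. The sketch below is not how SeqWebAPIs works; it assumes NVIDIA GPUs with `nvidia-smi` available, and the helper name `least_loaded_gpu` is hypothetical. It reads lines of the form `index, memory_used` on standard input:

```shell
#!/bin/sh
# Hypothetical helper: given "index, memory_used" lines on stdin,
# print the index of the GPU with the smallest memory_used value.
# On a tie, the first GPU listed wins; empty input prints nothing.
least_loaded_gpu() {
    awk -F', *' '
        NR == 1 || $2 + 0 < min { min = $2 + 0; idx = $1 }
        END { if (NR > 0) print idx }
    '
}

# Usage, assuming nvidia-smi is installed:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | least_loaded_gpu
```

Memory in use is only one possible signal; a real balancer would probably also track requests in flight per GPU.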