xiaohuiyan / BTM

Code for Biterm Topic Model (published in WWW 2013)
https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
Apache License 2.0

Python extension #26

Open lucianolorenti opened 4 years ago

lucianolorenti commented 4 years ago

In case anyone is interested: I've made a Python extension out of this code. It is more or less the same code, except that it is wrapped with Boost.Python and avoids all the intermediate files. You can use it like this:

import btm
number_of_topics = 2
alpha = 50/2
beta = 0.0005
n_iters = 50000
btm_model = btm.Model(number_of_topics, alpha, beta, n_iters, 3, True)
btm_model.fit(["sentence 1", "sentence 2", "sentence 2"])
pz = btm_model.get_pz()
pw_z = btm_model.get_pw_z( )
vocabulary = btm_model.vocabulary()
b = btm_model.predict(["ANother sentence"], "sum_b")
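
To make the returned distributions easier to read, the following is a minimal sketch of listing the most probable words per topic. It assumes (this is not confirmed anywhere in the thread) that `get_pw_z()` yields one row of word probabilities per topic and that `vocabulary()` yields words in index order; `top_words` is a hypothetical helper, and the numbers are made up so the snippet runs without the extension installed.

```python
def top_words(pw_z, vocabulary, n=3):
    """Return the n most probable words for each topic.

    Assumption: pw_z[k][i] is P(word i | topic k) and vocabulary[i] is word i.
    """
    result = []
    for topic_row in pw_z:
        # Rank word indices by probability, highest first.
        ranked = sorted(range(len(topic_row)),
                        key=lambda i: topic_row[i],
                        reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Made-up stand-ins for btm_model.get_pw_z() and btm_model.vocabulary().
pw_z = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6]]
vocabulary = ["sentence", "1", "2"]

print(top_words(pw_z, vocabulary, n=2))
```

If the real return types differ (for example a NumPy array or a word-to-id mapping), the helper would need a small conversion step first.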

Logos23333 commented 4 years ago

I tried your version but something is wrong.

output.txt

lucianolorenti commented 4 years ago

I think there are two issues. The first is that I forgot to add the __init__.py file. The second is this compiler error:

This file requires compiler and library support for the ISO C++ 2011 standard.

I was using gcc 9.2.0, which I suppose uses C++11 by default. I have now added the __init__.py file and the explicit -std=c++11 argument. Tell me if it does not work for you.

Logos23333 commented 4 years ago

It works, successfully installed btm-0.1.0, thanks for your solution.

Logos23333 commented 4 years ago

When I run the example code above, I get something like this:

ERR: index=3, size=3
ERR: index=3, size=3
ERR: index=3, size=3
ERR: index=3, size=3
1 of 50001
ERR: index=3, size=3
ERR: index=3, size=3
ERR: index=3, size=3
ERR: index=3, size=3
ERR: index=3, size=3
ERR: index=3, size=3
ERR: index=3, size=3
ERR: index=3, size=3
2 of 50001

Is this the expected result or not?

lucianolorenti commented 4 years ago

No, it is not. Somehow it is accessing pvec at position 3 when it has only 3 elements. I am going to try on another PC to see if I get the same error.

I've tried on another Arch Linux machine and it worked. I'm going to try on Ubuntu.

lucianolorenti commented 4 years ago

I tried on Debian 10, where the version of boost-python was old. I had to recompile boost-python to make it work. Apart from that, I did not have any other problem. I don't know what is happening in your case.

ChangweiZhou commented 4 years ago

Hi!

I tried but the code is not working. It says:

C:\train\B-Python>pip install .
Processing c:\train\b-python
Building wheels for collected packages: btm
Running setup.py bdist_wheel for btm ... error
Complete output from command C:\Users\07390\Anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\07390\AppData\Local\Temp\pip-req-build-odmu1lp8\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d C:\Users\07390\AppData\Local\Temp\pip-wheel-tgike6d6 --python-tag cp37:
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.7
creating build\lib.win-amd64-3.7\btm
copying btm\__init__.py -> build\lib.win-amd64-3.7\btm
running build_ext
building 'btm_cpp' extension
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
creating build\temp.win-amd64-3.7\Release\btm
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MT -DMAJOR_VERSION=1 -DMINOR_VERSION=0 -IC:\Users\07390\Anaconda3\include -IC:\Users\07390\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /EHsc /Tpbtm/model.cpp /Fobuild\temp.win-amd64-3.7\Release\btm/model.obj -std=c++11
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
model.cpp
C:\Users\07390\AppData\Local\Temp\pip-req-build-odmu1lp8\btm\doc.h(24): warning C4267: 'return': conversion from 'size_t' to 'int', possible loss of data
C:\Users\07390\AppData\Local\Temp\pip-req-build-odmu1lp8\btm\model.h(10): fatal error C1083: Cannot open include file: 'boost/python/numpy.hpp': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\cl.exe' failed with exit status 2

Failed building wheel for btm
Running setup.py clean for btm
Failed to build btm
Installing collected packages: btm
Found existing installation: btm 1.0.15
Uninstalling btm-1.0.15:
Successfully uninstalled btm-1.0.15
Running setup.py install for btm ... error
Complete output from command C:\Users\07390\Anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\07390\AppData\Local\Temp\pip-req-build-odmu1lp8\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\07390\AppData\Local\Temp\pip-record-bydnsmqf\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.7
creating build\lib.win-amd64-3.7\btm
copying btm\__init__.py -> build\lib.win-amd64-3.7\btm
running build_ext
building 'btm_cpp' extension
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
creating build\temp.win-amd64-3.7\Release\btm
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MT -DMAJOR_VERSION=1 -DMINOR_VERSION=0 -IC:\Users\07390\Anaconda3\include -IC:\Users\07390\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /EHsc /Tpbtm/model.cpp /Fobuild\temp.win-amd64-3.7\Release\btm/model.obj -std=c++11
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
model.cpp
C:\Users\07390\AppData\Local\Temp\pip-req-build-odmu1lp8\btm\doc.h(24): warning C4267: 'return': conversion from 'size_t' to 'int', possible loss of data
C:\Users\07390\AppData\Local\Temp\pip-req-build-odmu1lp8\btm\model.h(10): fatal error C1083: Cannot open include file: 'boost/python/numpy.hpp': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\cl.exe' failed with exit status 2

----------------------------------------

Rolling back uninstall of btm
Command "C:\Users\07390\Anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\07390\AppData\Local\Temp\pip-req-build-odmu1lp8\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\07390\AppData\Local\Temp\pip-record-bydnsmqf\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\07390\AppData\Local\Temp\pip-req-build-odmu1lp8\

Not sure what is wrong.

lucianolorenti commented 4 years ago

From what I can see, the compiler can't find the Boost NumPy headers:

...model.h(10): fatal error C1083: Cannot open include file: 'boost/python/numpy.hpp': No such file or directory

Do you have Boost correctly installed? And did you add the headers path to the include path?

ChangweiZhou commented 4 years ago

Hi!

I installed boost, but I do not know how to add the header path to the include path directory.

ChangweiZhou commented 4 years ago

So I tried to install boost using anaconda, and again it does not work:

(d2l) C:\train\B-Python>pip install .
Processing c:\train\b-python
Building wheels for collected packages: btm
Building wheel for btm (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: 'C:\Users\07390\Anaconda3\envs\d2l\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\setup.py'"'"'; __file__='"'"'C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\07390\AppData\Local\Temp\pip-wheel-4iss4_6i' --python-tag cp37
cwd: C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\
Complete output (18 lines):
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.7
creating build\lib.win-amd64-3.7\btm
copying btm\__init__.py -> build\lib.win-amd64-3.7\btm
running build_ext
building 'btm_cpp' extension
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
creating build\temp.win-amd64-3.7\Release\btm
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DMAJOR_VERSION=1 -DMINOR_VERSION=0 -IC:\Users\07390\Anaconda3\envs\d2l\include -IC:\Users\07390\Anaconda3\envs\d2l\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /EHsc /Tpbtm/model.cpp /Fobuild\temp.win-amd64-3.7\Release\btm/model.obj -std=c++11
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
model.cpp
C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\btm\doc.h(24): warning C4267: 'return': conversion from 'size_t' to 'int', possible loss of data
C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\btm\model.h(10): fatal error C1083: Cannot open include file: 'boost/python/numpy.hpp': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\cl.exe' failed with exit status 2

ERROR: Failed building wheel for btm
Running setup.py clean for btm
Failed to build btm
Installing collected packages: btm
Found existing installation: btm 1.0.15
Uninstalling btm-1.0.15:
Successfully uninstalled btm-1.0.15
Running setup.py install for btm ... error
ERROR: Command errored out with exit status 1:
command: 'C:\Users\07390\Anaconda3\envs\d2l\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\setup.py'"'"'; __file__='"'"'C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\07390\AppData\Local\Temp\pip-record-hlpib9u3\install-record.txt' --single-version-externally-managed --compile
cwd: C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\
Complete output (18 lines):
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.7
creating build\lib.win-amd64-3.7\btm
copying btm\__init__.py -> build\lib.win-amd64-3.7\btm
running build_ext
building 'btm_cpp' extension
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
creating build\temp.win-amd64-3.7\Release\btm
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DMAJOR_VERSION=1 -DMINOR_VERSION=0 -IC:\Users\07390\Anaconda3\envs\d2l\include -IC:\Users\07390\Anaconda3\envs\d2l\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /EHsc /Tpbtm/model.cpp /Fobuild\temp.win-amd64-3.7\Release\btm/model.obj -std=c++11
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
model.cpp
C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\btm\doc.h(24): warning C4267: 'return': conversion from 'size_t' to 'int', possible loss of data
C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\btm\model.h(10): fatal error C1083: Cannot open include file: 'boost/python/numpy.hpp': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28314\bin\HostX86\x64\cl.exe' failed with exit status 2

Rolling back uninstall of btm
Moving to c:\users\07390\anaconda3\envs\d2l\lib\site-packages\btm-1.0.15.dist-info\
from c:\users\07390\anaconda3\envs\d2l\lib\site-packages\~tm-1.0.15.dist-info
Moving to c:\users\07390\anaconda3\envs\d2l\lib\site-packages\btm\
from c:\users\07390\anaconda3\envs\d2l\lib\site-packages\~tm
ERROR: Command errored out with exit status 1: 'C:\Users\07390\Anaconda3\envs\d2l\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\setup.py'"'"'; __file__='"'"'C:\Users\07390\AppData\Local\Temp\pip-req-build-fq1ue1v3\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\07390\AppData\Local\Temp\pip-record-hlpib9u3\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.

I have also included the directory of boost in the system path variable.

lucianolorenti commented 4 years ago

The include path is the list of directories where the compiler looks for header files (the .h files). It is unrelated to the system path, which lists the directories where the operating system looks for executables. I will try to add a configuration file to specify these paths and make compilation easier.

In the meantime you can edit setup.py and add them yourself:


btm_cpp = Extension('btm_cpp',
                    define_macros = [('MAJOR_VERSION', '1'),
                                     ('MINOR_VERSION', '0')],
                    libraries = ['boost_python3', 'boost_numpy3'],
                    language='c++11',
+                   include_dirs=[ THE_PATH_WHERE_THE_BOOST_HEADERS_ARE_LOCATED ],
+                   library_dirs=[ THE_PATH_WHERE_THE_BOOST_LIBRARIES_ARE_LOCATED],
                    extra_compile_args=extra_compile_args,
                    sources = ['btm/model.cpp','btm/infer.cpp'])

THE_PATH_WHERE_THE_BOOST_HEADERS_ARE_LOCATED should end with include, e.g. C:\something\something\include. THE_PATH_WHERE_THE_BOOST_LIBRARIES_ARE_LOCATED perhaps ends with bin, e.g. C:\something\something\bin. It should be a folder containing a lot of .dll files.

Depending on how Boost was installed, you will probably need to change the names of the libraries ['boost_numpy3', 'boost_python3']. These names refer to library files (in this case .dll files). For example, for boost_numpy3 the last step of the compilation (the linker) will search for libboost_numpy3.dll; if on your machine the file is called libboost_numpy.dll, you should change the entry in libraries in setup.py to 'boost_numpy'.
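
Until such a configuration file exists, one way to avoid hard-coding paths is to read them from environment variables. The sketch below is only an illustration: the variable names (BTM_BOOST_INCLUDE_DIR and so on) and the `boost_build_options` helper are hypothetical and not part of the fork. It also drops the `-std=c++11` flag on Windows, since the logs above show cl.exe ignoring it; checking `sys.platform` is a rough proxy for "the compiler is MSVC".

```python
import os
import sys


def boost_build_options(environ=None, platform=None):
    """Collect keyword arguments for setuptools.Extension.

    All BTM_* variable names are hypothetical, chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    options = {
        # Library names differ between Boost installs, so allow overrides.
        "libraries": [
            environ.get("BTM_BOOST_PYTHON_LIB", "boost_python3"),
            environ.get("BTM_BOOST_NUMPY_LIB", "boost_numpy3"),
        ],
        "include_dirs": [],
        "library_dirs": [],
        # cl.exe does not understand the GCC-style flag (warning D9002).
        "extra_compile_args": [] if platform == "win32" else ["-std=c++11"],
    }
    if environ.get("BTM_BOOST_INCLUDE_DIR"):
        options["include_dirs"].append(environ["BTM_BOOST_INCLUDE_DIR"])
    if environ.get("BTM_BOOST_LIBRARY_DIR"):
        options["library_dirs"].append(environ["BTM_BOOST_LIBRARY_DIR"])
    return options


print(boost_build_options({"BTM_BOOST_INCLUDE_DIR": "/opt/boost/include"}, "linux"))
```

In setup.py the result could then be passed on as `Extension('btm_cpp', sources=[...], **boost_build_options())`, keeping the other arguments from the snippet above.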

ChangweiZhou commented 4 years ago

Hi!

Thanks. I have a question:

In your setup:

btm_model = btm.Model(number_of_topics, alpha, beta, n_iters, 3, True)

What does the 3 mean here? Shouldn't all the parameters already be accounted for?

lucianolorenti commented 4 years ago

Hi! It is a parameter that does nothing :S. It is what used to be save_step in the original code, but in my fork nothing is saved in intermediate iterations.

ChangweiZhou commented 4 years ago

Hi!

Thanks for the prompt reply. Hope you are safe!

I am giving this a try on a large data set. One question: is it possible to display a progress bar, like tqdm? So far I am not seeing any indicator at all. Since training a large model takes a lot of time, I feel this could be useful.

lucianolorenti commented 4 years ago

That's odd. The progress bar is the same as in the original code, and I can see it. I just pushed a few commits that remove the save_step parameter and add a boolean show_progressbar to make the progress bar optional; previously it was always shown.
It is now also possible to do this:

btm_model = btm.Model(number_of_topics, alpha, beta, n_iters, background_topic, show_progressbar)
btm_model.initialize(["sentence 1", "sentence 2", "sentence 2"])
for j in range(500):
    btm_model.fit_step()

This runs the fit steps from Python; each fit_step call performs a single pass of the algorithm.
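
Because the loop now lives in Python, progress can be reported from there as well. The following is a rough sketch, not code from the fork: `fit_with_progress` is a hypothetical helper, and `FakeModel` only stands in for an initialized `btm.Model` so that the snippet runs without the extension.

```python
def fit_with_progress(model, n_steps, report_every=100, report=print):
    """Call model.fit_step() n_steps times, reporting progress periodically."""
    for step in range(1, n_steps + 1):
        model.fit_step()
        # Report at regular intervals and always on the final step.
        if step % report_every == 0 or step == n_steps:
            report(f"{step} of {n_steps}")


class FakeModel:
    """Stand-in for an initialized btm.Model; it only counts the calls."""

    def __init__(self):
        self.steps = 0

    def fit_step(self):
        self.steps += 1


model = FakeModel()
fit_with_progress(model, 250, report_every=100)
```

If tqdm is installed, wrapping the range as `tqdm(range(n_steps))` in a plain loop gives a proper progress bar instead of printed lines.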

ChangweiZhou commented 4 years ago

It is weird. Here is a public ipynb file:

https://colab.research.google.com/drive/1Rr2WsY7MRy3Pin8Eak9HNa6rddBLSn07

I tried your commands but it says


NameError Traceback (most recent call last)
in ()
----> 1 get_ipython().run_cell_magic('time', '', '\nnumber_of_topics = 2\nalpha = 50/2\nbeta = 0.0005\nn_iters = 50000\nbtm_model = btm.Model(number_of_topics, alpha, beta, n_iters, background_topic, show_progressbar)\nbtm_model.initialize(["sentence 1", "sentence 2", "sentence 2"])\nfor j in range(500):\n btm_model.fit_step()')

2 frames
in time(self, line, cell, local_ns)

/usr/local/lib/python3.6/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
   1191 else:
   1192 st = clock2()
-> 1193 exec(code, glob, local_ns)
   1194 end = clock2()
   1195 out = None

in ()

NameError: name 'background_topic' is not defined

ChangweiZhou commented 4 years ago

I am training using Google colab, not windows. So theoretically the issue should be from Google colab.

lucianolorenti commented 4 years ago

You did not define the background_topic variable. Follow the README carefully.

I ran it in Google Colab and it works.

ChangweiZhou commented 4 years ago

Thanks! I figured out how to use it now. The second method works for me.

Quick question: is it possible to speed up training using a GPU/TPU? I know it uses Gibbs sampling in the background. I am just wondering whether the training process can be accelerated, since Colab offers GPU/TPU support.

Huakui-Zhang commented 4 years ago

@Logos23333 I've encountered the same problem. And I found out that this is caused by the following line of code

this->w2id[w] = this->w2id.size();

on line 118 of model.cpp. For example, when this->w2id is empty, i.e. its size is 0, the above code assigns 1 to this->w2id[w]. That is, the resulting word ids are one greater than expected, which causes the index-out-of-bounds error. However, since I am not too familiar with C++, I am not sure why I run into this. The line can be changed to the following to avoid the error:

int new_id = this->w2id.size();
this->w2id[w] = new_id;
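
A likely explanation for why this only fails on some machines: before C++17 the order in which the two sides of an assignment are evaluated is unspecified, so a compiler may run w2id[w] (which inserts the new key) before it reads w2id.size(). Since C++17 the right-hand side is evaluated first. The Python sketch below merely simulates the two orders with a dict; the function names are made up for illustration.

```python
def assign_size_first(w2id, word):
    """Read the size first, then insert (what the two-line fix guarantees)."""
    new_id = len(w2id)
    w2id[word] = new_id


def assign_insert_first(w2id, word):
    """Insert the key first, then read the size (the problematic order)."""
    w2id.setdefault(word, 0)   # mimics operator[] default-inserting the key
    w2id[word] = len(w2id)     # the size already counts the new key


good, bad = {}, {}
for w in ["sentence", "1", "2"]:
    assign_size_first(good, w)
    assign_insert_first(bad, w)

print(good)  # ids start at 0
print(bad)   # ids start at 1, so the largest id equals the vocabulary size
```

With three words, the second order produces an id of 3 for a vocabulary of size 3, which matches the "ERR: index=3, size=3" messages reported earlier in the thread.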