Parallelisation of `TableClassificator.py`

emir-munoz commented 6 years ago

The current script `wtables/feature_extraction/TableClassificator.py` sequentially processes every file inside the compressed .tar archive. Given the large number of files and the time required for decompression, it would be good to add some parallelism.

The task is embarrassingly parallel: each page can be processed by a separate process or thread. For testing purposes, we need a smaller sample of HTML files.

- [x] Generate a smaller .tar file with only 10K samples for testing purposes.
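
A minimal sketch of how such a sample archive could be built with the standard-library `tarfile` module. The function name `write_sample_tar`, its parameters, and the gzip output format are assumptions for illustration, not part of the repository. It copies the first N regular files rather than drawing a random sample, which may be enough for a smoke test.

```python
import tarfile


def write_sample_tar(src_path, dst_path, n_samples=10_000):
    """Copy the first n_samples regular files from src_path into dst_path.

    Hypothetical helper: the output is written as a gzip-compressed tar,
    which is an assumption about the desired format.
    """
    copied = 0
    # "r:*" lets tarfile detect the source compression automatically.
    with tarfile.open(src_path, "r:*") as src, tarfile.open(dst_path, "w:gz") as dst:
        for member in src:
            if copied >= n_samples:
                break
            if not member.isfile():
                continue  # skip directories, links, etc.
            dst.addfile(member, src.extractfile(member))
            copied += 1
    return copied
```

Reading the source archive as a stream and stopping early avoids decompressing the whole file just to produce the sample.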

The parallelisation can be done using joblib; see the example at https://pythonhosted.org/joblib/parallel.html

I will generate such a sample and apply the parallelism over the for loop.
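
As a rough sketch of what applying joblib over the for loop might look like: the archive is read sequentially in the parent process and only the per-page work is handed to workers. `count_tables` is a hypothetical stand-in for whatever `TableClassificator.py` does per page, and `iter_html_members` and `process_archive` are illustrative names, not existing functions in the repository.

```python
import tarfile

from joblib import Parallel, delayed


def count_tables(name, html_bytes):
    """Hypothetical per-page task: count opening <table> tags in one page."""
    return name, html_bytes.lower().count(b"<table")


def iter_html_members(tar_path):
    """Yield (name, bytes) for each regular file, reading the archive in order."""
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


def process_archive(tar_path, n_jobs=2):
    """Run the per-page task over every file in the archive with joblib."""
    # Parallel consumes the generator lazily and returns results in input order.
    return Parallel(n_jobs=n_jobs)(
        delayed(count_tables)(name, data)
        for name, data in iter_html_members(tar_path)
    )
```

Decompression itself stays sequential in this sketch, so the speed-up depends on how much of the time goes into the per-page processing rather than into reading the archive.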

emir-munoz commented 6 years ago

Some relevant links:

jhomaralc commented 5 years ago

`TableClassificator.py` was replaced by `ExtractTables.py`. The method reads article bz2 files and extracts useful tables in HTML format.

emir-munoz commented 5 years ago

Thanks @jhomaralc, good work. I will close this issue then.

wikitables / web-of-data-tables

Parallelisation of `TableClassificator.py` #4