Improve preprocessing component

wri-dssg-omdena / policy-data-analyzer

Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.

Other

34 stars 9 forks

Improve preprocessing component #48

Closed jordiplanescutxi closed 3 years ago

jordiplanescutxi commented 3 years ago

The method we currently use to split sentences produces a large number of incorrectly split sentences. We need to improve it so that we have a good final version by the time we want to use the fine-tuned transformers.
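
To illustrate the kind of failure described above, here is a minimal sketch. It is not the project's actual splitting script: `naive_split` and the sample text are hypothetical, and the assumption is that a splitter cutting on every sentence-final punctuation mark trips over abbreviations that are common in legal text.

```python
import re


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical policy-like text with two abbreviations ("Art." and "Sra.").
text = "El Art. 5 establece incentivos. La Sra. Gómez firmó el decreto."

# The two real sentences come out as four fragments, because the
# abbreviation periods are treated as sentence boundaries.
for fragment in naive_split(text):
    print(repr(fragment))
```

Fragments such as `'El Art.'` carry no meaning on their own, which is why they would be a problem as input for sentence-level labeling or classification.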

jordiplanescutxi commented 3 years ago

Sentence splitting improvement

thefirebanks commented 3 years ago

Polish preprocessing code

thefirebanks commented 3 years ago

Steps:

[x] Sentence splitting pt. 1, experiment with external libraries (try multilingual if possible)
[x] Sentence splitting pt. 2, compare with current sentence splitting script and choose one
[x] Build input pipeline as a script pt. 1, import documents as text from Amazon S3 bucket
[x] Build input pipeline as a script pt. 2, join OCR and other components of input pipeline together
[x] Build output pipeline as a script, sentences ready for assisted labeling component
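
The steps above could be wired together roughly as in the sketch below. This is an assumption about the overall shape, not the repository's code: `simple_splitter`, `build_sentence_records`, the `min_words` filter and the record fields are all hypothetical names. Loading from the S3 bucket and OCR are left out, so documents are passed in as plain `(doc_id, text)` pairs.

```python
import json
import re
from typing import Callable, Dict, Iterable, List, Tuple


def simple_splitter(text: str) -> List[str]:
    """Placeholder splitter; any external library could be swapped in here."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_sentence_records(
    documents: Iterable[Tuple[str, str]],
    splitter: Callable[[str], List[str]] = simple_splitter,
    min_words: int = 3,
) -> List[Dict[str, str]]:
    """Turn (doc_id, text) pairs into one record per usable sentence."""
    records = []
    for doc_id, text in documents:
        # Collapse line breaks and repeated spaces, as often left behind by OCR.
        clean = " ".join(text.split())
        for i, sentence in enumerate(splitter(clean)):
            # Skip very short fragments, which are rarely worth labeling.
            if len(sentence.split()) < min_words:
                continue
            records.append(
                {"sentence_id": f"{doc_id}_{i}", "doc_id": doc_id, "text": sentence}
            )
    return records


# Hypothetical in-memory document standing in for text fetched from storage.
docs = [
    ("doc1", "The program grants subsidies to farmers.\nSee annex.  "
             "Payments are made yearly to each owner."),
]
records = build_sentence_records(docs)
print(json.dumps(records, indent=2, ensure_ascii=False))
```

Passing the splitter in as a parameter keeps the comparison between an external library and the existing script down to swapping one argument, while the output format for the labeling component stays the same.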

thefirebanks commented 3 years ago

[x] Add rules for Mexico
[x] Add rules for El Salvador
[x] Add rules for Chile
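
The thread does not say what the country rules contain. One plausible form, sketched here purely as an assumption, is a per-country list of abbreviations that must not end a sentence. The names `COMMON_ABBREVIATIONS`, `COUNTRY_ABBREVIATIONS` and `split_with_rules`, as well as the abbreviation entries themselves, are illustrative placeholders rather than the rules that were actually added.

```python
import re
from typing import Dict, List, Set

# Illustrative placeholders only; the real rule sets are not given in the thread.
COMMON_ABBREVIATIONS: Set[str] = {"art", "sr", "sra", "núm"}
COUNTRY_ABBREVIATIONS: Dict[str, Set[str]] = {
    "mexico": {"frac"},
    "el_salvador": {"d.l"},
    "chile": {"d.s"},
}


def split_with_rules(text: str, country: str) -> List[str]:
    """Split on sentence-final punctuation, then re-join any fragment
    that ends in a known abbreviation with the fragment that follows it."""
    abbreviations = COMMON_ABBREVIATIONS | COUNTRY_ABBREVIATIONS.get(country, set())
    fragments = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences: List[str] = []
    buffer = ""
    for fragment in fragments:
        if not fragment:
            continue
        buffer = f"{buffer} {fragment}" if buffer else fragment
        last_word = buffer.split()[-1].rstrip(".").lower()
        if buffer.endswith(".") and last_word in abbreviations:
            continue  # abbreviation, not a sentence end: keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text = "Según el D.S. 12 se otorgan bonos. La Sra. Gómez firmó el decreto."
print(split_with_rules(text, "chile"))   # "D.S." is protected for this country
print(split_with_rules(text, "mexico"))  # "D.S." is not in the Mexico set
```

Keeping the rules as data rather than code means a new country only needs a new dictionary entry, although real documents would likely need more than abbreviation lists (numbered clauses and list markers, for example).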