wri-dssg-omdena / policy-data-analyzer

Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.

Improve preprocessing component #48

Closed. jordiplanescutxi closed this issue 3 years ago

jordiplanescutxi commented 3 years ago

The method we currently use to split text into sentences produces a large number of incorrectly split sentences. We need to improve it so that we have a good final version by the time we want to use the fine-tuned transformers.
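To illustrate the kind of failure described here, below is a minimal sketch of a rule-based splitter that avoids breaking after common abbreviations. It is not the project's preprocessing code: the function name `split_sentences`, the `ABBREVIATIONS` set, and the boundary regex are all assumptions made for this example, and a real pipeline would likely need a longer, language-specific abbreviation list.

```python
import re

# Hypothetical abbreviation list for illustration only; not taken from the repository.
ABBREVIATIONS = {"art", "no", "dr", "sr", "sra", "sec", "inc", "etc", "e.g", "i.e"}

# Candidate boundary: sentence-ending punctuation, whitespace, then a character
# that plausibly starts a new sentence (uppercase letter, digit, or ¿/¡).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÁÉÍÓÚÑ0-9¿¡])")


def split_sentences(text):
    """Split text into sentences, skipping boundaries that follow a known abbreviation."""
    # Collapse line breaks and repeated spaces left over from document extraction.
    text = re.sub(r"\s+", " ", text).strip()
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        # Last whitespace-delimited token, without its trailing period, lowercased.
        last_token = candidate.split(" ")[-1].rstrip(".").lower()
        if candidate.endswith(".") and last_token in ABBREVIATIONS:
            # Likely an abbreviation such as "Art." or "Dr.", so do not split here.
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences


example = (
    "The program is managed by Dr. Lopez. Farmers may apply each year.\n"
    "See Art. 5 for details."
)
for sentence in split_sentences(example):
    print(sentence)
```

A naive split on every period would cut this example after "Dr." and "Art."; the abbreviation check keeps those sentences whole. A trained sentence segmenter may handle more cases, but a small rule layer like this is easy to inspect and extend per language.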

jordiplanescutxi commented 3 years ago

Split into sentences improvement

thefirebanks commented 3 years ago

Polish preprocessing code

thefirebanks commented 3 years ago

Steps:

thefirebanks commented 3 years ago