wmgeolab/scope
LM Feature Extraction Proof-of-Concepts on GEF/BRIGHT/GDELT PDFs/URIs #132
Closed by jeremy-swack 1 year ago
jeremy-swack commented 1 year ago
[x] #140
[x] Execute script
[x] Test out first proof-of-concept with PDF data
joegenius98 commented 1 year ago
[ ] Investigate whether there is a standout Python PDF extractor that captures almost all of the important text
[ ] Get a sample set of URIs to test on
[ ] Test different HTML extractors
[ ] Determine which is easier: extracting text from PDFs or from HTML articles
[ ] Choose one of the two for the MVP
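
One possible way to approach the "test different HTML extractors" item is a small harness that runs each candidate on the same sample page and checks how many hand-picked key phrases survive extraction. The sketch below is only an illustration: every name in it (`stdlib_extract`, `regex_extract`, `phrase_recall`, `compare_extractors`) is hypothetical, and the two extractors shown are standard-library baselines rather than the candidates this issue would actually evaluate. Any real extractor could be plugged in as a callable that takes an HTML string and returns text.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def stdlib_extract(html):
    """Baseline extractor (hypothetical name) built on html.parser."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def regex_extract(html):
    """Naive baseline: strip tags with a regex; script/style text leaks through."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def phrase_recall(text, phrases):
    """Fraction of expected phrases found in the text (case-insensitive)."""
    if not phrases:
        return 0.0
    normalized = " ".join(text.split()).lower()
    found = sum(1 for phrase in phrases if phrase.lower() in normalized)
    return found / len(phrases)


def compare_extractors(extractors, html, phrases):
    """Map each extractor name to its phrase recall on one sample page."""
    return {name: phrase_recall(fn(html), phrases) for name, fn in extractors.items()}
```

The same `phrase_recall` check would apply to PDF extractors, since it only looks at the extracted text. Phrase recall does not penalize leaked noise such as script contents, so a separate check for unwanted text may be worth adding before choosing an extractor for the MVP.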