wo / paperscraper

tracking and parsing new philosophy papers on the internet

Evaluate quality of metadata extracted via pdftohtml and ocr2xml #69

Open wo opened 8 years ago

wo commented 8 years ago

At the moment, paperparser.py invokes ocr2xml if the metadata extracted via pdftohtml has low confidence; it then returns the metadata extracted via ocr. Neither is ideal. Sometimes pdftohtml produces much better results than ocr2xml, sometimes it's the other way round, and metadata confidence is not a good way to distinguish between the two cases. It would be better to have a separate quick sanity check of author/title/abstract that decides (1) whether ocr2xml needs to be invoked, and (2) whether to use the metadata extracted via pdftohtml or via ocr.

Here, for example, extraction via pdftohtml gets the title right, while extraction via ocr yields "9103 ‘9 [inV uo qSanuipg J0 AlissoAiuf] 112 filo'spetuno[plogxo'bd/pduq mos; popeopimoq" (perhaps because of the unusual font): http://pq.oxfordjournals.org/content/early/2016/04/04/pq.pqw028.full.pdf

On the other hand, here are some cases where pdftohtml yields really bad titles or authors:
http://web.ics.purdue.edu/~drkelly/MallonKellyMakingRaceNothing2012.pdf ("Making Race Out O f nO thing: Psych O l O gically cO nst R ained sO cial R O les")
http://www3.nd.edu/~dhoward1/Lost%20Wanderers.pdf ("Lost Wandere rs i n the Fore st o f Knowledg e: S ome Thought s on t he Disco very ) Just ifi cat ion Di sti ncti on")
http://www.consciousness.it/Docs/Lavazza,%20Manzotti%20-%202011%20-%20A%20New%20Mind%20for%20a%20New%20Aesthetics.pdf ("A N d REA L AVA zz A * | R I cc AR d O M AN z OTTI *")

It shouldn't be hard to recognize at least ridiculous cases like these.
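
A rough sketch of what such a sanity check might look like, not tied to the actual paperparser.py code. Everything here is an assumption made for illustration: the names `garble_score` and `choose_title`, the two signals (stray single-letter tokens, and tokens containing digits or symbols), and the 0.2 threshold, which would need tuning on real data. `run_ocr` stands in for whatever ends up calling ocr2xml.

```python
import re
import string

# A "clean" token: letters only, optionally joined by internal hyphens/apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")
# Single-letter tokens that are ordinary English words.
ALLOWED_SINGLE = {"a", "i"}
STRIP_CHARS = string.punctuation + "‘’“”"
# Hypothetical cutoff; scores above it count as garbled.
GARBLE_THRESHOLD = 0.2


def garble_score(text):
    """Return a value in [0, 1]; higher means the string looks more garbled."""
    tokens = [t.strip(STRIP_CHARS) for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 1.0  # empty metadata is treated as unusable
    # Signal 1: stray single letters, as in "O f nO thing".
    stray = sum(
        1 for t in tokens
        if len(t) == 1 and t.isalpha() and t.lower() not in ALLOWED_SINGLE
    )
    # Signal 2: tokens with digits or symbols inside, as in "J0" or "pq/xy".
    odd = sum(1 for t in tokens if not WORD_RE.fullmatch(t))
    return max(stray, odd) / len(tokens)


def choose_title(pdftohtml_title, run_ocr):
    """Pick a title; run_ocr is a zero-argument callable, invoked only if needed."""
    html_score = garble_score(pdftohtml_title)
    if html_score <= GARBLE_THRESHOLD:
        return pdftohtml_title, "pdftohtml"  # (1) OCR not needed
    ocr_title = run_ocr()
    if garble_score(ocr_title) < html_score:
        return ocr_title, "ocr"  # (2) OCR looks saner
    return pdftohtml_title, "pdftohtml"
```

With these settings all four garbled strings quoted above score over the threshold, while a cleanly extracted title scores 0. Legitimate titles containing several numbers or formulas would be misjudged by the second signal, so this is only a starting point.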