wo / paperscraper

tracking and parsing new philosophy papers on the internet

Handle constantly changing link URLs #80

Open wo opened 8 years ago

wo commented 8 years ago

On http://www.kantstudiesonline.net/index.php/articles/ the links to papers change on each visit, and can't be split into a session id and a non-trivial stable remainder. As a consequence, all papers get checked every day, and if a paper has been parsed incorrectly and then manually corrected, the fresh incorrect parsing is treated as a new paper (because the corrected paper is not recognized as a duplicate of it).

For the case of Kant Studies Online, it would help to store the link text in addition to the link URL and only process links whose link text is new. But that would not work for other sites, where the link text may be something generic like 'PDF'.

A better solution is probably to store a hash of the PDF file in the Doc table and skip processing of documents whose hash is already in the table. (That would not work if a journal modified the PDF on each retrieval, which fortunately Kant Studies doesn't.)
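
A rough sketch of the hash idea, not the actual paperscraper code: the function names are made up for illustration, SHA-256 is just one reasonable choice of hash, and a plain Python set stands in for a hash column in the Doc table.

```python
import hashlib


def content_hash(pdf_bytes: bytes) -> str:
    """Return a hex digest that identifies the raw bytes of a downloaded file."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def should_process(pdf_bytes: bytes, known_hashes: set) -> bool:
    """Return True only if this exact file has not been seen before.

    `known_hashes` is a stand-in for a lookup against the Doc table
    (hypothetical; the real table layout may differ).
    """
    digest = content_hash(pdf_bytes)
    if digest in known_hashes:
        # Same bytes as an already stored document: skip it,
        # no matter which URL it was fetched from this time.
        return False
    known_hashes.add(digest)
    return True
```

With this, the changing URL no longer matters: a file is only parsed the first time its bytes are seen. As noted above, it breaks down if the server alters the file on every retrieval.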

Another (perhaps complementary) solution would be to improve the duplicate detection in post-processing: if two papers have almost exactly the same content, they should be recognized as duplicates, even if they have different authors or titles.
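
One possible way to compare content, sketched here with word shingles and Jaccard similarity. This is an illustration of a standard technique, not what paperscraper does; the function names, the shingle size `k=5` and the `0.9` threshold are all assumptions that would need tuning.

```python
import re


def shingles(text: str, k: int = 5) -> set:
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole word sequence as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a or not b:
        # Empty text (e.g. failed extraction) is never counted as a match.
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str,
                      threshold: float = 0.9, k: int = 5) -> bool:
    """Compare extracted document text only; authors and titles are ignored."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Because only the extracted text is compared, a paper with a wrongly parsed title or author would still match its manually corrected entry.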

wo commented 8 years ago

I've improved the duplicate detection so that it at least doesn't return None whenever no author has been extracted.