wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

Extracting wrong title in PDF metadata #364

Open TeriForey opened 4 years ago

TeriForey commented 4 years ago

Noticed that some PDFs have a second title within the metadata, for example:

Title:          SEAJPH-April 13.indb
Creator:        Adobe InDesign CS3 (5.0)
Producer:       Adobe PDF Library 8.0
CreationDate:   Mon May 20 04:11:26 2013 IST
ModDate:        Mon May 20 04:11:27 2013 IST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          11
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      616434 bytes
Optimized:      no
PDF version:    1.3
PDF subtype:    PDF/X-3:2002
    Title:         ISO 15930 - Electronic document file format for prepress digital data exchange (PDF/X)
    Abbreviation:  PDF/X-3:2002
    Subtitle:      Part 3: Complete exchange suitable for colour-managed workflows (PDF/X-3)
    Standard:      ISO 15930-3

The second, indented title, which only describes the PDF/X standard the file conforms to, is getting returned instead of the first.
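One possible way to prefer the first title is sketched below. It assumes the metadata is available as text shaped like the output above, where the nested subtype block is indented and the top-level fields are not. The function name `first_top_level_title` is hypothetical and is not taken from the reach codebase.

```python
def first_top_level_title(pdfinfo_text):
    """Return the first unindented 'Title:' value, or None if there is none.

    Assumption: nested blocks (such as the PDF subtype description) are
    indented, while document-level fields start in column 0.
    """
    for line in pdfinfo_text.splitlines():
        # Skip indented lines; they belong to a nested block.
        if line[:1].isspace():
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or None
    return None


sample = """Title:          SEAJPH-April 13.indb
Creator:        Adobe InDesign CS3 (5.0)
PDF subtype:    PDF/X-3:2002
    Title:         ISO 15930 - Electronic document file format for prepress digital data exchange (PDF/X)
    Abbreviation:  PDF/X-3:2002
"""

print(first_top_level_title(sample))  # prints: SEAJPH-April 13.indb
```

If the titles are read some other way (for example from a parsed metadata dictionary where later keys overwrite earlier ones), the same idea would apply: keep the first `Title` seen rather than the last.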

jdu commented 4 years ago

I don't think we'll deal with this at the moment. This code may actually get stripped out, as we seem to be getting the majority of titles pretty consistently from the source page on the target site. If that's the case, then titles were the predominant reason for getting this information from the PDF metadata. Marking as wontfix while we evaluate whether any of the rest of this information would be useful.