Extracting wrong title in PDF metadata

wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning

MIT License

Noticed that some PDFs have a second Title entry within the metadata, nested under the PDF subtype field, for example:

Title:          SEAJPH-April 13.indb
Creator:        Adobe InDesign CS3 (5.0)
Producer:       Adobe PDF Library 8.0
CreationDate:   Mon May 20 04:11:26 2013 IST
ModDate:        Mon May 20 04:11:27 2013 IST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          11
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      616434 bytes
Optimized:      no
PDF version:    1.3
PDF subtype:    PDF/X-3:2002
    Title:         ISO 15930 - Electronic document file format for prepress digital data exchange (PDF/X)
    Abbreviation:  PDF/X-3:2002
    Subtitle:      Part 3: Complete exchange suitable for colour-managed workflows (PDF/X-3)
    Standard:      ISO 15930-3

The second Title, which only describes the PDF/X standard the file conforms to, is being returned instead of the first, which is the document's actual title.
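
One plausible cause, assuming the metadata is read line by line into a dictionary, is that the indented Title under PDF subtype overwrites the top-level one. A minimal sketch of a workaround is to skip indented lines and keep only the first occurrence of each key. The function name `parse_metadata` and the sample text are hypothetical and not taken from the reach codebase:

```python
def parse_metadata(text):
    """Parse 'Key: value' lines, keeping only top-level (unindented) keys."""
    metadata = {}
    for line in text.splitlines():
        # Skip blank lines and indented lines; the latter belong to a
        # nested block such as the one under "PDF subtype".
        if not line.strip() or line[0].isspace():
            continue
        # Split on the first colon only, so values containing colons
        # (dates, "PDF/X-3:2002") stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # setdefault keeps the first occurrence if a key repeats.
        metadata.setdefault(key.strip(), value.strip())
    return metadata


# Shortened copy of the metadata dump from this issue.
sample = """\
Title:          SEAJPH-April 13.indb
Creator:        Adobe InDesign CS3 (5.0)
CreationDate:   Mon May 20 04:11:26 2013 IST
PDF subtype:    PDF/X-3:2002
    Title:         ISO 15930 - Electronic document file format for prepress digital data exchange (PDF/X)
    Abbreviation:  PDF/X-3:2002
"""

print(parse_metadata(sample)["Title"])
```

With this approach the nested Title, Abbreviation, Subtitle and Standard entries are dropped entirely; if they are needed, they could instead be stored under a separate key for the subtype block.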

Extracting wrong title in PDF metadata #364