wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

File hashes are not consistent between Reach runs #48

Open nsorros opened 5 years ago

nsorros commented 5 years ago

Just realized that some documents we scrape have a different file hash even though they represent the same document.

If our intention is not to re-download the same document, this is one problem; but even if we do not care about that, during analysis there needs to be a field that is unique per document in order for the conclusions to be accurate. The solution could be to document that document_uri is the unique key people should use if they want to analyse the document, to alter our implementation of the file hash accordingly, or something else.

lizgzil commented 5 years ago

Also there are scraped documents with the same file hash. Checking in the RDS data just now - there are 205,234 scraped policy docs, but only 143,142 unique file hashes.

lizgzil commented 5 years ago

An example of this is the policy document

http://www.fao.org/3/I9553EN/i9553en.pdf

which has the document ID

818aff942d9813d338fe31828ee9452a

and also

14748f5b61ec161bc226354edbeee7f1

ivyleavedtoadflax commented 4 years ago

I think this issue has reared its head again.

The gold annotated data that I am using to evaluate reach has 81 annotated documents from a scrape done by Reach on 2019-10-8. Not all of these document ids exist in more recent scrapes however:

| Date of evaluation | Gold documents found |
| --- | --- |
| Dec/Jan | 10/81 |
| 2020-01-13 | 7/81 |
| 2020-01-16 | 6/81 |
| 2020-01-28 | 4/81 |

The problem seems to be getting worse with time. Is there any reason why taking an md5 hash of the entire document could yield different results on subsequent scrapes? @SamDepardieu @jdu https://github.com/wellcometrust/reach/blob/d66379b2b40ea6d01e53c4667f98db2cf2ca2896/reach/scraper/wsf_scraping/file_system.py#L11

jdu commented 4 years ago

It's possible it could change, but I'm not sure the cases where this would happen apply here. For instance, if there was a process on their end which opened the file and saved it, even without changing the content of the PDF, there's a trailer in the file which contains CreationDate/ModDate values; if those changed, it could completely change the MD5 hash output.

Are we sure that the hashes changed, and not that the documents just aren't there at all?

jdu commented 4 years ago

If you have the original file from your gold data you could calc a hash of it, pull the same file from S3 and calc the hash on it and if they differ you could compare the file in a diff viewer to see if there are any differences.

ivyleavedtoadflax commented 4 years ago

> It's possible it could change, but I'm not sure the cases where this would happen apply here. For instance, if there was a process on their end which opened the file and saved it, even without changing the content of the PDF, there's a trailer in the file which contains CreationDate/ModDate values; if those changed, it could completely change the MD5 hash output.
>
> Are we sure that the hashes changed, and not that the documents just aren't there at all?

Nope not 100% sure that they are not there at all - that is the other possibility.

jdu commented 4 years ago

Also, if you opened the file in Preview and then calculated the MD5 hash in order to set up the gold data set, your MD5 hash calculated locally might differ, because some data in the PDF might have changed from the open/close. I'll do a quick test in a moment.

ivyleavedtoadflax commented 4 years ago

Just to clarify, I never calculate md5 hashes locally; they are always taken from Reach. So the question is whether, in previous runs of the scraper, the same file is being given a different hash over subsequent scrapes... or whether those files simply aren't being found anymore.

jdu commented 4 years ago

Already tested (I was curious to see for myself in any case) and, at least with Preview on macOS, it doesn't change the file hash, even if I open and save the file. I think that's because it's only hashing the first 65536 bytes of the file, and the trailer is generally at the end of the file.

If there's nothing altering the files then the md5 should be deterministic and shouldn't change at all between scrapes unless the file has changed on their end.
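A minimal sketch of that partial hashing behaviour, assuming (as described above) that only the first 65536 bytes are hashed; the function name is illustrative, not Reach's actual implementation:

```python
import hashlib

BLOCK_SIZE = 65536  # first 64 KiB, as described above


def partial_md5(path, block_size=BLOCK_SIZE):
    """MD5 of only the first `block_size` bytes of a file."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read(block_size))
    return md5.hexdigest()
```

Because only the head of the file is hashed, a change confined to the PDF trailer (which normally sits at the end of the file) leaves this hash unchanged, while a single changed byte inside that first block produces a completely different digest.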

ivyleavedtoadflax commented 4 years ago

Yes... that is what I thought...

ivyleavedtoadflax commented 4 years ago

Going to dig into this a bit further; I think I have some historical files knocking around. I'm also going to add a datestamp to the files output by the new evaluator tasks.

jdu commented 4 years ago

The fact you're getting some as well is weird, because if it was something in the pipeline changing the hash, I would expect it to be all or nothing. So either they aren't being scraped / are missing from the data (a very real possibility) or they've been changed at the source. If even a single byte changes in the first block of bytes in the PDF, it can drastically change the hash.

jdu commented 4 years ago

Another option here is to use the DocumentId embedded in the PDF if it's available, falling back to an MD5 hash if the DocumentId doesn't exist.

That DocumentId shouldn't change.

$ pdfinfo -meta stored_v.pdf
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c015 91.163280, 2018/06/22-11:31:03        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:stRef="http://ns.adobe.com/xap/1.0/sType/ResourceRef#"
            xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <xmp:CreateDate>2018-09-04T16:57:41+02:00</xmp:CreateDate>
         <xmp:MetadataDate>2018-11-23T16:25:13+01:00</xmp:MetadataDate>
         <xmp:ModifyDate>2018-11-23T16:25:13+01:00</xmp:ModifyDate>
         <xmp:CreatorTool>Adobe InDesign CC 13.1 (Macintosh)</xmp:CreatorTool>
         <xmpMM:InstanceID>uuid:96ee9e09-c03e-be42-a57b-49ffadd7d79c</xmpMM:InstanceID>
         <xmpMM:OriginalDocumentID>xmp.did:F77F117407206811822A97A08D940DAA</xmpMM:OriginalDocumentID>

         <!-- HERE -->
         <xmpMM:DocumentID>xmp.id:6b8f1861-33fa-4840-8122-e964de1452e0</xmpMM:DocumentID>
         <!-- HERE -->

         <xmpMM:RenditionClass>proof:pdf</xmpMM:RenditionClass>
         <xmpMM:DerivedFrom rdf:parseType="Resource">
            <stRef:instanceID>xmp.iid:80a4aae9-2c43-41e9-a87b-ecf759b5ca1c</stRef:instanceID>
            <stRef:documentID>xmp.did:845fb7c5-0600-4ff5-9fa1-d370d05b7ed5</stRef:documentID>
            <stRef:originalDocumentID>xmp.did:F77F117407206811822A97A08D940DAA</stRef:originalDocumentID>
            <stRef:renditionClass>default</stRef:renditionClass>
         </xmpMM:DerivedFrom>
         <xmpMM:History>
            <rdf:Seq>
               <rdf:li rdf:parseType="Resource">
                  <stEvt:action>converted</stEvt:action>
                  <stEvt:parameters>from application/x-indesign to application/pdf</stEvt:parameters>
                  <stEvt:softwareAgent>Adobe InDesign CC 13.1 (Macintosh)</stEvt:softwareAgent>
                  <stEvt:changed>/</stEvt:changed>
                  <stEvt:when>2018-09-04T16:57:41+02:00</stEvt:when>
               </rdf:li>
            </rdf:Seq>
         </xmpMM:History>
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">The State of Food Security and Nutrition in the World 2018</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Bag/>
         </dc:creator>
         <pdf:Producer>Adobe PDF Library 15.0</pdf:Producer>
         <pdf:Trapped>False</pdf:Trapped>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
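A hedged sketch of that fallback scheme: scan the raw bytes for the xmpMM:DocumentID shown in the output above, and fall back to a whole-file MD5 when it is absent. A real implementation would more likely use a proper PDF/XMP parser; the regex here is illustrative only.

```python
import hashlib
import re

# Matches e.g. <xmpMM:DocumentID>xmp.id:6b8f...</xmpMM:DocumentID> in the XMP packet
DOCID_RE = re.compile(rb"<xmpMM:DocumentID>([^<]+)</xmpMM:DocumentID>")


def document_identifier(path):
    """Prefer the embedded XMP DocumentID; fall back to an MD5 of the whole file."""
    with open(path, "rb") as f:
        data = f.read()
    match = DOCID_RE.search(data)
    if match:
        return match.group(1).decode("utf-8", errors="replace")
    return hashlib.md5(data).hexdigest()
```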
ivyleavedtoadflax commented 4 years ago

Ahh good to know that's an option. I'm just digging into the most recent scrape from staging to do a comparison of document_ids from a couple of months ago. I just want to get a sense of how bad the problem is first...

ivyleavedtoadflax commented 4 years ago

Would be good to know how many pdfs have this embedded document id, and how unique it is. If we are still falling back on a document hash for the remainder, then we will still need to fix any underlying issue

nsorros commented 4 years ago

This problem is a bit annoying. I don't understand why data would be missing from subsequent scrapes; that is a bit problematic. Whatever we decide, it would be good to have some guarantee from engineering that these ids will be there and that they will not change.

ivyleavedtoadflax commented 4 years ago

So... I've looked back at a scrape from 2019-10-09 and compared it to the scrape that was done on staging overnight (2020-01-29). Here are the scores on the doors:

| Metric | 20191009 | 20200129 |
| --- | --- | --- |
| Unique doc ids | 7033 | 157223 |
| Unique to this scrape | 2969 | 153159 |

This gives an overlap of just 4064 document ids, which means 42% (2969/7033) of the document ids captured in that first scrape no longer appear in the latest one.

Note that it is unclear whether the 2019-10-09 scrape was a complete one - it just happened to be the data that was on staging when I pulled it, but I don't think it matters very much.
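The comparison above reduces to simple set arithmetic; a sketch (function and key names are hypothetical, not part of Reach):

```python
def scrape_overlap(old_ids, new_ids):
    """Compare the document ids seen in two scrapes."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return {
        "overlap": len(old_ids & new_ids),
        "unique_to_old": len(old_ids - new_ids),
        "unique_to_new": len(new_ids - old_ids),
        # share of the old scrape's ids that no longer appear
        "pct_old_missing": round(100 * len(old_ids - new_ids) / len(old_ids), 1),
    }
```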

ivyleavedtoadflax commented 4 years ago

I'll have a look at the scraped pdfs and see if the DocumentID present in pdfs is a reliable alternative, but I think it is going to have to be very reliable to make it worthwhile; otherwise, falling back onto an md5 hash of the document for even a small percentage of the scrape could be problematic, given the above results.

jdu commented 4 years ago

I'll have a look at the DAG tomorrow and see what I can see in there. I suspect there might be a lot of duplicates, but it's also possible we might not have been getting all the docs before; it might be worth checking how many of the docs in the last run are duplicates.

> This problem is a bit annoying. I don't understand why data would be missing from subsequent scrapes; that is a bit problematic. Whatever we decide, it would be good to have some guarantee from engineering that these ids will be there and that they will not change.

@nsorros We wiped data after all the key changes and scraper changes, as the indexes needed to be recreated (and some data was orphaned) and the data repopulated using the new schemas. So the first run we did recently was a fresh run from mostly a blank slate. It's never going to be 100% perfect getting data from the targets, as there's too much that's out of our control from an engineering perspective; what we have to aim for is eventual consistency, where the data's completeness increases steadily over subsequent runs.

ivyleavedtoadflax commented 4 years ago

I should have also said that I extracted these document ids from parsed-pdf json, not from ES.

ivyleavedtoadflax commented 4 years ago

Some analytics on two recent Reach runs on staging, dated 2020-01-29 and 2020-01-31, from the parsed-pdfs jsons. The headline figure is that 57989 unique document ids from the earlier Reach run do not appear in the later run. In the later run, 18872 document ids were new, which is about 16%. That's actually not as bad as I had expected given the results from the Evaluator task, but still a significant amount.

Total doc ids

$ cat 20200129_combined.jsonl | jq '.file_hash' | sort | uniq | wc -l
157223
$ cat 20200131_combined.jsonl | jq '.file_hash' | sort | uniq | wc -l
118106

Total overlap between the Reach runs:

$ comm -1 -2 \
    <(cat 20200131_combined.jsonl | jq '.file_hash' | sort | uniq) \
    <(cat 20200129_combined.jsonl | jq '.file_hash' | sort | uniq) | wc -l
99234

Files unique to 20200129:

$ comm -1 -3 \
    <(cat 20200131_combined.jsonl | jq '.file_hash' | sort | uniq) \
    <(cat 20200129_combined.jsonl | jq '.file_hash' | sort | uniq) | wc -l
57989

Files unique to 20200131:

$ comm -2 -3 \
    <(cat 20200131_combined.jsonl | jq '.file_hash' | sort | uniq) \
    <(cat 20200129_combined.jsonl | jq '.file_hash' | sort | uniq) | wc -l
18872

ivyleavedtoadflax commented 4 years ago

Looking at the new pdf_metadata elements in the 2020-01-31 run, the following elements appear:

$ cat 20200131_combined.jsonl | jq '.pdf_metadata | [paths | join(".")][]' | sort | uniq -c
  75964 "author"
 108479 "creator"
  82309 "title"

but the total number of unique titles is:

$ cat 20200131_combined.jsonl | jq '.pdf_metadata.title' | sort | uniq | wc -l
26090

which is about 22% of the total. These are of extremely varied quality, and would not serve to produce a unique identifier. For example, a good portion of the titles are:

      9 "Written evidence - Willis Towers Watson"      
      3 "Written evidence - Wilson Bio-Chemical"             
      3 "Written evidence - Wiltshire Council"                                                                                         
      3 "Written evidence - Wimborne War on Waste"                   
      3 "Written evidence - Winchester Friends of the Earth"                       
      3 "Written evidence - Windwatch NI"                                   
     15 "Written evidence - Wine and Spirit Trade Association"
      3 "Written evidence - Wine Institute"                   
      3 "Written evidence - Wine & Spirit Trade Association"
      6 "Written evidence - WinVisible (women with visible & invisible disabilities)"            
      3 "Written evidence - Wipe The Slate Clean"                                                                                                    
      3 "Written evidence - Wirral Foodbank"                                                                                                                                                
      3 "Written evidence - Wirral Older People's Parliament"                                               
      3 "Written evidence - Witchford and Area Schools Partnership"                             
      3 "Written evidence - Witness Confident"                                                                                                  
      3 "Written evidence - W L Consulting Ltd"                                                    
      3 "Written evidence - Wm Morrison Supermarkets plc"                                                                    
      3 "Written evidence - Woking Borough Council Overview & Scrutiny Committee"     
      3 "Written evidence - Womankind Worldwide"                                    
      3 "Written evidence - Women And Children First (Uk)"                              
      3 "Written evidence - Women for Independence Midlothian"                                                                
      6 "Written evidence - Women for Refugee Women"                                     
      3 "Written evidence - Women For Women International Uk"                     
      3 "Written evidence - Women in Manufacturing and Engineering"         
ivyleavedtoadflax commented 4 years ago

Looking at the source_metadata.did element, there are 27777 unique doc ids:

$ cat 20200131_combined.jsonl | jq '.source_metadata.did' | sort | uniq | wc -l
27777

A good proportion of these are replicated a large number of times:

$ cat 20200131_combined.jsonl | jq '.source_metadata.did' | sort | uniq -c | sort -k 1 -h  | tail -n 10
     33 "98f2358e-486b-2b47-bbfc-0032f25e4f90"
     36 "6B3DAA810D206811822ADF4D98799A13"
     36 "E4311E9E0E206811822ADDFBCA027983"
     36 "f2f18911-a877-9e4a-88c5-016178588b6e"
     36 "FFCB0B8328206811822AA0B70B0C5E7C"
     42 "7dd5faff-9622-4bd4-b4e3-7b5b9de50ccb"
     51 "923dbd41-10a3-4835-b0d9-aa3259774f18"
     66 "9e716710-7d79-e641-b11c-198ce2405117"
     90 "a8ded3b0-069b-7b49-9d09-b9de6a2b3f1e"

15883 dids appear just once in the dataset, all the others appear at least twice:

$ cat 20200131_combined.jsonl | jq '.source_metadata.did' | sort | uniq -c | awk '$1 == 1' | wc -l
15883

The next thing to look at is the relationship between file_hash and did, but I probably need to open python for that!
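That file_hash/did relationship could be checked with a few lines of Python over the parsed-pdfs jsonl. Field names are taken from the commands above; the function itself is a sketch, not part of Reach:

```python
import json
from collections import defaultdict


def hashes_per_did(jsonl_path):
    """Map each source_metadata.did to the set of file_hash values seen with it,
    keeping only dids associated with more than one hash."""
    mapping = defaultdict(set)
    with open(jsonl_path) as f:
        for line in f:
            doc = json.loads(line)
            did = (doc.get("source_metadata") or {}).get("did")
            if did:
                mapping[did].add(doc.get("file_hash"))
    return {did: hashes for did, hashes in mapping.items() if len(hashes) > 1}
```

A non-empty result would mean the same embedded DocumentId has been seen with several different file hashes, which is exactly the inconsistency this issue is about.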

lizgzil commented 4 years ago

Nice to see the analysis! Just a comment on my idea to use the url as the unique identifier - it would need cleaning and also has uniqueness problems! (Just reminded of this after revisiting my uber-reach comparison work.) E.g. these link to the same doc

jdu commented 4 years ago

@lizgzil We should definitely steer clear of using the URL as an identifier for the document, primarily because the URL that the document lives at could change in future scrapes, which would result in duplication. A given document could also exist at multiple URLs, like your example above, with different query params causing a different hash to be calculated from the URL. In addition, a document could be referenced on more than one site: if a policy, for instance, is published to WHO and gov.co.uk puts up the same document on their site (this might be contrived, but it could be possible), the identifier needs to be able to understand that those two documents are the same.

As well, I think one of the sites has weird POST URLs for some documents, which don't have a unique URL between a group of documents, as the document to retrieve is part of a POST body within the request itself rather than encoded in the URL.

From a functional viewpoint the following would potentially yield a higher number of unique document ids:

Use the DocumentID, if available, along with the title of the document as an id. The title of the document is less likely to change than the location/url, as it is data within the PDF itself.

With the DocumentId, the only problem I can think of is if the document is generated from scratch again and the current version replaced with the new one; because it was regenerated from scratch, the DocumentId will have changed. However, we can mitigate this by storing information about the PDF and the source page itself (base source URL, title, source...): if we find a new document at a URL where we previously found a different document, we can evaluate whether it's the same document with a different hash. I don't think there's going to be a single property we can use that will be consistently unique, but a couple used in conjunction will be, if we have a process to mitigate collisions like the one outlined above.

Based on some other discussions, this might be a good scenario in which to use SQLite instead of a JSON store: we can insert the scrape results into the SQLite database, query the data on multiple properties, and shunt the SQLite file to the next stage in the process, rather than being stuck with a key/value approach because of JSON.

ivyleavedtoadflax commented 4 years ago

Nice ideas @jdu. It should be quite easy to calculate how many pieces of information we need to get a unique ID, I'll have a look at this now while I run a test dag.

Of course, this still doesn't explain why we are not getting a consistent file hash across Reach runs :woman_shrugging:

lizgzil commented 4 years ago

@jdu thanks for the info, good to know. It's tricky in my comparison work, since I need a unique identifier to link the Uber policy documents with the Reach ones. Perhaps it will involve an additional step of scraping the url texts to see if 2 policy documents are the same: Uber url -> scrape Uber url -> hash text from Uber scrape <=?=> hash text from Reach scrape <- scrape Reach url <- Reach url

jdu commented 4 years ago

Just to add to this (at least for Reach's internal IDs): what we could possibly do is gather the following information about a PDF.

{
   "did": "<document_id>",
   "title": "<doc_title>",
   "ref_page": "<page that held the link to download>",
   "url": "<url the file was downloaded from>",
   "scraped_date": "<unix timestamp when this doc was scraped>",
   "aliases": [
        ["<a document id that was found later that matched>", "<scraped_date>"]
    ]
}

So if we move the data store into sqlite, when we come across a PDF, we can do a query:

SELECT * FROM scraped_pdfs WHERE title = <title> AND url = <url>;

using the details from the PDF currently being evaluated. If that returns a match in the SQLite db, instead of updating the record, we append the file we just found to the aliases property of the existing entry.

That way, instead of completely orphaning the document by removing the original entry, we keep some referential integrity, so that in later stages we can see that a single document actually has multiple potential ids; we can do this with the original file hash as well. The document then has persistent state across scrapes, which tells us how its identification has changed over time while keeping the original id intact.
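A minimal sqlite3 sketch of that lookup-and-alias flow; the schema and function names are illustrative, not an agreed design:

```python
import sqlite3


def init_db(conn):
    """Create the (hypothetical) tables for scraped documents and their aliases."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS scraped_pdfs (
            did TEXT PRIMARY KEY,
            title TEXT,
            url TEXT,
            scraped_date INTEGER
        )""")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS aliases (
            did TEXT REFERENCES scraped_pdfs(did),
            alias_id TEXT,
            scraped_date INTEGER
        )""")


def record_document(conn, did, title, url, scraped_date):
    """If a doc with the same title+url already exists, store the new id as an
    alias and keep the original as the canonical id; otherwise insert it new."""
    row = conn.execute(
        "SELECT did FROM scraped_pdfs WHERE title = ? AND url = ?",
        (title, url)).fetchone()
    if row and row[0] != did:
        conn.execute(
            "INSERT INTO aliases (did, alias_id, scraped_date) VALUES (?, ?, ?)",
            (row[0], did, scraped_date))
        return row[0]  # canonical id stays the original
    conn.execute(
        "INSERT OR IGNORE INTO scraped_pdfs (did, title, url, scraped_date) "
        "VALUES (?, ?, ?, ?)",
        (did, title, url, scraped_date))
    return did
```

The key property is that a document whose hash (or DocumentId) changes between scrapes keeps resolving to its original id, with the newer ids preserved as aliases.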

ivyleavedtoadflax commented 4 years ago

Nice idea @jdu. Since we can't rely on the file_hash, would we assign a UUID to each document internally for use in Reach?

jdu commented 4 years ago

@ivyleavedtoadflax We can either generate a UUID4 for it, or (I'm more inclined to) use the initial DocumentId we get for a given document as its identifier. That way anything after that can continue to use that ID for the document, including your evaluations, but we have the aliases stored as secondary identifiers for the document.

ivyleavedtoadflax commented 4 years ago

Less than 25% of the documents in the last scrape had a DocumentID though, and only 13% had a unique one (the non-uniqueness may be because of duplicate documents), so I don't think we can rely on DocumentID for Reach's unique ID.

jdu commented 4 years ago

@ivyleavedtoadflax The implementation to get the DocumentId isn't spectacular at the moment and needs some work. But I suspect that some documents share a DocumentId because they are part of a group, or are derived from a base document; there's an additional InstanceId, and I'll check whether it differs between items. It might be, for instance, that they generated a PDF from one larger document, then split it into multiple subsequent documents all sharing the same id, if the documents that share the id are thematically similar.

ivyleavedtoadflax commented 4 years ago

I have created a csv from the last scrape that includes all these possible unique id candidates: s3://datalabs-dev/reach-airflow/data/20200131_flat.csv, should be useful to inform this issue.

jdu commented 4 years ago

I'll pull that in a bit and have a look. I might write a quick script to batch through the PDFs stored in S3 to see what identifiers they have in their metadata. I know I created a PDF from an MS Word doc on my computer and it didn't seem to add a DocumentId, so there are going to be a few that don't have any id at all. But the DocumentId is no different from the file hash in this respect, so we could do this entire same thing using the PDF hashes instead of the DocumentId. Same result: we store any new file hashes that get generated for an already-scraped document as aliases.

nsorros commented 4 years ago

Is it possible that the solution is to separate the creation of the unique identifier from the collection of the documents? In between the two we could add a deduplication step, where we use some machine learning to find duplicates in our current database of documents; after this, all new documents would get a new id.

I understand that this might break away from our current approach of downloading everything from scratch, but it seems like a longer-term solution for creating a repository of policy documents. These guys had a similar problem for fashion https://www.lyst.com/ and they have a great team of engineers and data scientists; we could chat with them to hear their thoughts if it helps.

A different solution altogether, mainly to Matt's problem, is to create a fixed set of documents for which we run the rest of the DAG. These would be true documents, but their ids would not be changed by subsequent scrapes.

jdu commented 4 years ago

I think it's less an issue of de-duplication, and we can address the problem before de-duplication is necessary; there are ways to work around the duplication before even downloading the file. It's more about handling the case where a document's identification has changed compared to what we have stored. I don't think we need to jump to a more complex machine learning solution at this point: it can easily be addressed by having some persistent state about individual documents. It's less a duplication problem and more about implementing state to keep tabs on a document over time, to ensure referential integrity further down the pipeline.

nsorros commented 4 years ago

> I think it's less an issue of de-duplication, and we can address the problem before de-duplication is necessary; there are ways to work around the duplication before even downloading the file. It's more about handling the case where a document's identification has changed compared to what we have stored. I don't think we need to jump to a more complex machine learning solution at this point: it can easily be addressed by having some persistent state about individual documents. It's less a duplication problem and more about implementing state to keep tabs on a document over time, to ensure referential integrity further down the pipeline.

I was mainly thinking of cases where the same document appears again, or where a document changes the url it comes from. So I was thinking we could do an "easy" check for duplicates before storing a new file and introducing state.

But what you are suggesting, if I understand it correctly - save what has been downloaded and make sure the ids stay the same - also works, and I agree it is a simple solution.

I do think, though, as Matt has pointed out, that this is quite important for data science at this point, as it is blocking both Matt and Liz.