Providing data for OED_Greg's aggregation

zooniverse / shakespeares_world

Full text transcription project for the Folger Shakespeare Library

https://www.shakespearesworld.org

Providing data for OED_Greg's aggregation #196

Closed VVH closed 8 years ago

VVH commented 8 years ago

@ggdhines will run a separate post-processing step on SW data to identify new words and variants in the data submitted by users, and then provide a CSV to James M at OED with the following structure:

A. The wordform found
B. Date
C. Author
D. Title
E. Catalogue record number
F. Image record URL
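A minimal sketch of how rows in that A to F structure could be written out as CSV. The column names and the `rows_to_csv` helper are hypothetical; the thread does not fix exact headers, and this is not the code in the aggregation engine.

```python
import csv
import io

# Hypothetical header names for fields A-F; the thread does not specify them.
FIELDS = [
    "wordform",
    "date",
    "author",
    "title",
    "catalogue_record_number",
    "image_record_url",
]


def rows_to_csv(rows):
    """Serialise a list of dicts (one per wordform found) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    # Fields containing commas or quotes are quoted automatically by the csv module.
    writer.writerows(rows)
    return buffer.getvalue()
```

Using `csv.DictWriter` rather than joining strings by hand matters here because titles of early modern works often contain commas.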

James asks: How feasible would it be to supply the context in which the wordform is found – say, the sentence in which it occurs, or even better, a given number of characters to the left and right? This would enable us to discount some of the noise by eye without having to visit the page image.

VVH: I think this is feasible
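One way the "given number of characters to the left and right" could be produced, as a sketch only. The function name `context_window` and the default `width` are assumptions, and the match is a plain substring search, so it will also hit the wordform inside longer words.

```python
def context_window(text, wordform, width=40):
    """Return a (left, match, right) tuple for each occurrence of wordform in text.

    `width` is the maximum number of characters kept on each side.
    """
    if not wordform:
        return []  # an empty search string would otherwise match everywhere
    results = []
    start = text.find(wordform)
    while start != -1:
        end = start + len(wordform)
        left = text[max(0, start - width):start]  # clipped at the start of the text
        right = text[end:end + width]             # slicing clips at the end of the text
        results.append((left, wordform, right))
        start = text.find(wordform, end)
    return results
```

The left and right strings could then be added as two extra columns alongside fields A to F.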

VVH commented 8 years ago

@ggdhines further to my emails of 8 and 15 Feb.

ggdhines-zz commented 8 years ago

Code just finished running - I haven't found any variants yet. Based on what @simoneduca has told me, the code in aggregation/engine/helper_functions.py line 161 should give us what we need. Then in aggregation/engine/folger.py line 480 - there is an assert statement that checks if any variants have been found. For production, I'll just have any variants emailed out. In answer to James' request - I think that's fine. We can reopen this when we find a variant.

ggdhines-zz commented 8 years ago

Realized that I wasn't looking for the variants in the right place. Corrected that - see above commit. Any spelling mistake will count as a variant, so I am looking for lines where at least 2 people gave the same variant (this is still going to give a lot of false positives with names). Every line blob in the JSON result now has a field called "variants".
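The "at least 2 people gave the same variant" rule can be sketched roughly as below. This is an illustration under assumptions, not the code in aggregation/engine: `agreed_variants`, `known_words`, and the token-list input shape are all hypothetical.

```python
from collections import Counter


def agreed_variants(transcriptions, known_words, min_agreement=2):
    """Return unknown tokens that at least `min_agreement` users transcribed identically.

    transcriptions: one list of tokens per user, all for the same line.
    known_words: set of lowercase words already treated as known (assumed input).
    """
    counts = Counter()
    for tokens in transcriptions:
        # Count each token once per user, so one user's repeats are not agreement.
        for token in set(tokens):
            if token.lower() not in known_words:
                counts[token] += 1
    return sorted(token for token, n in counts.items() if n >= min_agreement)
```

For example, if two of three users transcribe "sonne" and nobody else agrees on "ryses" or "sunne", only "sonne" is kept. Proper names missing from `known_words` would pass the same test, which is the false-positive problem noted above.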