VVH closed 8 years ago
@ggdhines further to my emails of 8 and 15 Feb.
The code just finished running and I haven't found any variants yet. Based on what @simoneduca has told me, the code at aggregation/engine/helper_functions.py line 161 should give us what we need. Then at aggregation/engine/folger.py line 480 there is an assert statement that checks whether any variants have been found. For production, I'll just have any variants emailed out. In answer to James' request: I think that's fine. We can reopen this when we find a variant.
I realized that I wasn't looking for the variants in the right place. Corrected that - see above commit. Any spelling mistake will count as a variant, so I am only looking for lines where at least 2 people gave the same variant (this is still going to give a lot of false positives with names). Every line blob in the JSON result now has a field called "variants".
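For anyone following along, here is a rough sketch of the "at least 2 people gave the same variant" filter. It is not the code from the commit: the function name `shared_variants`, its arguments, and the token-by-token comparison against an agreed reading are all assumptions made for illustration.

```python
from collections import Counter


def shared_variants(consensus_tokens, user_tokens, min_users=2):
    """Return {token position: [variants]} for variants given by >= min_users users.

    Hypothetical helper: assumes each transcription is already split into
    tokens and can be compared position by position with the agreed reading.
    """
    counts = Counter()
    for tokens in user_tokens:
        # Skip transcriptions that do not line up token-for-token.
        if len(tokens) != len(consensus_tokens):
            continue
        for i, (tok, agreed) in enumerate(zip(tokens, consensus_tokens)):
            if tok != agreed:
                counts[(i, tok)] += 1

    result = {}
    for (i, tok), n in sorted(counts.items()):
        # A variant typed by a single user is treated as a likely typo.
        if n >= min_users:
            result.setdefault(i, []).append(tok)
    return result


# Made-up example line: two users agree on "fame", one user alone typed "thee".
consensus = ["the", "same", "day"]
users = [
    ["the", "same", "day"],
    ["the", "fame", "day"],
    ["the", "fame", "day"],
    ["thee", "same", "day"],
]
variants = shared_variants(consensus, users)
```

With this threshold the one-off "thee" is dropped, while "fame" is kept because two users agree on it. Names would still slip through, as noted above.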
@ggdhines will run a separate post-processing step on SW data to identify new words and variants in the data submitted by users and then provide a CSV to James M at OED in the following structure:
A. The wordform found
B. Date
C. Author
D. Title
E. Catalogue record number
F. Image record URL
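A possible shape for that CSV export, as a minimal sketch only. The header strings, the function name `variants_to_csv`, and the placeholder row are assumptions; the real column labels and metadata source would need to be agreed with OED.

```python
import csv
import io

# Column order follows items A-F above; the exact header text is an assumption.
COLUMNS = [
    "Wordform",
    "Date",
    "Author",
    "Title",
    "Catalogue record number",
    "Image record URL",
]


def variants_to_csv(rows):
    """Serialise a list of dicts (one per wordform found) to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder values only, not real catalogue data.
example_rows = [
    {
        "Wordform": "fame",
        "Date": "unknown",
        "Author": "unknown",
        "Title": "placeholder title",
        "Catalogue record number": "placeholder-id",
        "Image record URL": "https://example.org/image/1",
    }
]
csv_text = variants_to_csv(example_rows)
```

Using `csv.DictWriter` means commas or quotes inside titles are escaped automatically, and a row with an unexpected key raises an error instead of silently shifting columns.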
James asks: How feasible would it be to supply the context in which the wordform is found – say, the sentence in which it occurs, or even better, a given number of characters to the left and right? This would enable us to discount some of the noise by eye without having to visit the page image.
VVH: I think this is feasible
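One way the "characters to the left and right" option could look, sketched under the assumption that the transcribed line is available as a plain string. The function name `context_window` and the default `width` are hypothetical, and only the first occurrence of the wordform is handled.

```python
def context_window(line, wordform, width=30):
    """Return (left, wordform, right) with up to `width` characters each side.

    Returns None if the wordform does not occur in the line.
    """
    start = line.find(wordform)
    if start == -1:
        return None
    end = start + len(wordform)
    left = line[max(0, start - width):start]
    right = line[end:end + width]
    return left, wordform, right


# Made-up transcription line for illustration.
context = context_window("on the fame day we met", "fame", width=7)
```

The three parts could go into extra CSV columns (left context, wordform, right context) so the wordforms line up when scanned by eye.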