wdfox / ConfidenceScanner


Handling Funny Characters #1

Closed wdfox closed 7 years ago

wdfox commented 7 years ago

@TomDonoghue -- Just did a little bit of messing around with the way special characters are handled in BeautifulSoup. It looks to me like they simply get preserved in their natural form (although I have only tried it with Greek letters). Do we want to keep words that contain special characters in our body of text for an article or pr, or should I get rid of any words containing non-ascii.lower characters?

My initial thought is to leave them in for the analysis. At worst they will be disregarded, as they don't do much for the 'confidence' of the writing. Also, if we decide to normalize against the length of the abstract or pr, it would be worth having them in there. Let me know what you think.

TomDonoghue commented 7 years ago

@wdfox : assuming there are relatively few such words, I don't think it's too important what we do with them - they are unlikely to have a large effect on the results. Keeping them, for the reasons you mention, makes sense to me.

This is definitely the kind of thing we should keep note of, and we might want to check its influence later (as in, re-run the analysis with and without special character words). Given we might want to examine this, one option for this, and things like it, is to write a bit of code to do it (drop the special character words) as an extra line / function etc, and keep it off, so that it's ready and accessible if and when we need to check it. The benefit of adding these things up front is that it can be much easier to add them in the initial write than to retro-fit them later. In this case, I'd say probably add the function, since it should be fairly straightforward. For other features, it's always case-by-case whether potentially-useful functionality is worth the effort / time of implementing it.
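
For reference, here is a minimal sketch of what such an optional filter could look like. It assumes the text has already been extracted to a plain string and is split on whitespace, and it treats a "special character" as anything outside ASCII. The names `has_special_chars` and `drop_special_words` are hypothetical and are not taken from the ConfidenceScanner code.

```python
def has_special_chars(word):
    """Return True if the word contains any non-ASCII character.

    Assumption: 'special' means non-ASCII (e.g. Greek letters).
    """
    return any(ord(ch) > 127 for ch in word)


def drop_special_words(text, drop=False):
    """Split text into words, optionally dropping words with special characters.

    The filter is off by default (drop=False), so the standard analysis
    keeps every word; pass drop=True to re-run without them.
    """
    words = text.split()
    if not drop:
        return words
    return [word for word in words if not has_special_chars(word)]


sample = "the α-band power increased"
kept = drop_special_words(sample)                # filter off: all words kept
filtered = drop_special_words(sample, drop=True)  # filter on: 'α-band' removed
```

Running the analysis once with `drop=False` and once with `drop=True` on the same texts would give the with / without comparison described above.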