websterParser / WebsterParser

Convert Webster's Unabridged 1913 dictionary into a more usable format
GNU General Public License v3.0

Missing words due to parse errors #73

Closed SimonTeixidor closed 2 years ago

SimonTeixidor commented 2 years ago

I believe that all words following "Roller bearing" from the CIDE.R source file are missing from the resulting dictionary. See "Rut", "Ruta-baga", etc.

I ran a git bisect, and it appears that the breaking change was introduced in 3375fe69ba968f13ddca594475de3a8fc01b2c79. I tried to read the changes introduced there but I haven't been able to figure the issue out yet.

There's a similar issue for words following "Stooge", such as "Sweet". Here the issue seems to be that the source data is missing a closing </p> tag for the "Stooge" entry. I guess this should be reported upstream to GCIDE, but perhaps we could make the parser more robust against things like that?
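As a rough illustration of how such unclosed entries could be spotted before parsing, here is a minimal Python sketch. It is not the project's code: the function name `find_unclosed_paragraphs` is hypothetical, and it assumes that entries in the source are wrapped in plain `<p>` ... `</p>` pairs with no attributes and no nesting, which may not hold everywhere in the GCIDE files.

```python
import re

# Matches an opening <p> or a closing </p>; group 1 is "/" for a closing tag.
# Assumption: paragraph tags carry no attributes and are never nested.
P_TAG = re.compile(r"<(/?)p>")


def find_unclosed_paragraphs(text):
    """Return 1-based line numbers of <p> tags that were never closed.

    A <p> counts as unclosed when another <p> opens (or the input ends)
    before a matching </p> has been seen.
    """
    suspects = []
    open_line = None  # line number of the currently open <p>, if any
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in P_TAG.finditer(line):
            if match.group(1) == "/":
                open_line = None
            else:
                if open_line is not None:
                    # A new entry started while the previous one was still open.
                    suspects.append(open_line)
                open_line = lineno
    if open_line is not None:
        suspects.append(open_line)
    return suspects
```

A parser could use the same signal to close the previous entry implicitly whenever a new `<p>` starts, instead of swallowing everything that follows.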

Given that I just stumbled upon some examples, I suspect that there are quite a few words missing. I wonder if we could come up with an automated way to verify that the resulting dictionary contains all words from the source files?
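One possible shape for such a check, sketched in Python under stated assumptions: headwords in the source files are taken to be marked with `<ent>` ... `</ent>` tags, and the parsed dictionary is assumed to be available as a plain collection of headword strings. The names `source_headwords` and `missing_headwords` are hypothetical and do not come from the project.

```python
import re

# Assumption: each entry's headword appears as <ent>Word</ent> in the source.
ENT_TAG = re.compile(r"<ent>(.*?)</ent>", re.DOTALL)


def source_headwords(text):
    """Collect the distinct, non-empty headwords found in one source file."""
    return {m.strip() for m in ENT_TAG.findall(text) if m.strip()}


def missing_headwords(source_text, parsed_words):
    """Return source headwords absent from the parsed output, sorted.

    The comparison ignores case, since the output may normalise it.
    """
    parsed = {w.casefold() for w in parsed_words}
    return sorted(
        w for w in source_headwords(source_text) if w.casefold() not in parsed
    )
```

Running something like this over every `CIDE.*` file and failing the build on a non-empty result would catch regressions of this kind early.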

jeffbyrnes commented 2 years ago

Knowing that 3375fe69ba968f13ddca594475de3a8fc01b2c79 is where things took a turn is a huge help, thanks @SimonPersson!

I’ll see if I can piece it together, but maybe @nickwynja, who authored that commit, can help?

nickwynja commented 2 years ago

Looks like both this and #56 are caused by my changes. I hope to take a look over the next week or two.

jeffbyrnes commented 2 years ago

Appreciate that @nickwynja! I haven’t had a chance to dig in, so any help is appreciated.

nickwynja commented 2 years ago

Managed to settle on a more direct, though less elegant, solution much quicker than I thought I'd be able to.

jeffbyrnes commented 2 years ago

I’m validating that https://github.com/websterParser/WebsterParser/pull/81 did the trick, and if so, I’ll cut a new release for y’all!

jeffbyrnes commented 2 years ago

Fixed!