websterParser / WebsterParser

Convert Webster's Unabridged 1913 dictionary into a more usable format
GNU General Public License v3.0

[Recently introduced] conversion results in incorrect entries? #20

Closed krackers closed 4 years ago

krackers commented 4 years ago

Sometime between commit 28d2f8d700eeafdce570ea5d9c4864032b89bfce and commit 7e42c21228d85047c36c88ceccfc288166f280b0, conversion started producing incorrect results. If you look at the resulting XML, you'll see:

<d:entry id="A5230vamgytw0" d:title="Condorcet's method">
<d:index d:value="Condorcet's method" d:title="Condorcet's method"/>
<d:index d:value="Condottiere" d:title="Condottiere"/>
<d:index d:value="Conduce" d:title="Conduce"/>
<d:index d:value="Conducent" d:title="Conducent"/>
<d:index d:value="Conducibility" d:title="Conducibility"/>
<d:index d:value="Conducible" d:title="Conducible"/>
<d:index d:value="Conducibleness" d:title="Conducibleness"/>
<d:index d:value="Conducibly" d:title="Conducibly"/>
<d:index d:value="Conducive" d:title="Conducive"/>
<d:index d:value="Conduciveness" d:title="Conduciveness"/>
<d:index d:value="Conduct" d:title="Conduct"/>
<d:index d:value="Conductance" d:title="Conductance"/>
<d:index d:value="Conductibility" d:title="Conductibility"/>
<d:index d:value="Conductible" d:title="Conductible"/>
<d:index d:value="Conduction" d:title="Conduction"/>

and the index continues until <d:index d:value="Czechs" d:title="Czechs"/>. This doesn't seem correct, since most of those should have their own entries, and indeed prior to 28d2f8d700eeafdce570ea5d9c4864032b89bfce things were split up properly. I don't know whether this was introduced when you switched to including GCIDE as a submodule or in some intermediate refactoring.
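A quick way to spot entries like this is to count the index lines under each entry. The following is a rough sketch, not part of the project: `countIndexesPerEntry` and the `sample` string are hypothetical, the sample ids are made up, and it assumes the output consists of `<d:entry>` elements containing `<d:index>` children as in the excerpt above. It uses regular expressions rather than a real XML parser, so treat it as a diagnostic only.

```javascript
// Hypothetical helper: count <d:index> elements under each <d:entry>.
// Regex-based on purpose; a rough diagnostic, not a full XML parser.
function countIndexesPerEntry(xml) {
  const counts = [];
  // Capture the entry title, then everything up to the next <d:entry> or end of input.
  const entryPattern =
    /<d:entry\b[^>]*\bd:title="([^"]*)"[^>]*>([\s\S]*?)(?=<d:entry\b|$)/g;
  for (const match of xml.matchAll(entryPattern)) {
    const indexes = match[2].match(/<d:index\b/g) || [];
    counts.push({ title: match[1], indexCount: indexes.length });
  }
  return counts;
}

// Made-up sample (ids are placeholders) shaped like the excerpt in this issue.
const sample = `
<d:entry id="a" d:title="Conduce">
<d:index d:value="Conduce" d:title="Conduce"/>
</d:entry>
<d:entry id="b" d:title="Condorcet's method">
<d:index d:value="Condorcet's method" d:title="Condorcet's method"/>
<d:index d:value="Condottiere" d:title="Condottiere"/>
<d:index d:value="Conduce" d:title="Conduce"/>
</d:entry>`;

// Flag entries whose index count exceeds an arbitrary threshold.
const suspicious = countIndexesPerEntry(sample).filter((e) => e.indexCount > 2);
console.log(suspicious);
```

On real output, an entry that has swallowed its neighbours would show up with hundreds or thousands of index lines, so the threshold can be set much higher than in this toy sample.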

jeffbyrnes commented 4 years ago

More likely, the intermediate refactoring.

I’ll do a git bisect tomorrow & see if I can chase down what’s going on.

krackers commented 4 years ago

Never mind, I just tried again on the latest commit and it doesn't occur. Not sure why this originally happened.

krackers commented 4 years ago

Reopening; it seems I can reproduce this when I change

    if (src.text().trim() !== '1913 Webster' &&
        src.text().trim() !== 'Webster 1913 Suppl.'
    ) {

to

    if (!src.text().includes('1913')) {

(which is why I saw it in the first place, while I was preparing PR #19). I still think this is a regression, as it didn't occur with 28d2f8d700eeafdce570ea5d9c4864032b89bfce when I included the PJC entries.
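To make the difference between the two conditions concrete, here is a small sketch that evaluates both against a few source labels. The function names are hypothetical, and the labels in `sampleLabels` are illustrative assumptions about what a source label might contain, not values taken from the actual data. It works on plain strings instead of the `src` object so that it runs on its own.

```javascript
// Mirrors the original condition: true when the label is neither exact Webster label.
function failsStrictCheck(label) {
  const text = label.trim();
  return text !== '1913 Webster' && text !== 'Webster 1913 Suppl.';
}

// Mirrors the modified condition: true only when the label lacks "1913" entirely.
function failsLooseCheck(label) {
  return !label.includes('1913');
}

// Illustrative labels (assumed for this example, not read from the data).
const sampleLabels = [
  '1913 Webster',
  'Webster 1913 Suppl.',
  '1913 Webster + PJC',
  'PJC',
];

// Labels that the two conditions treat differently.
const differing = sampleLabels.filter(
  (label) => failsStrictCheck(label) !== failsLooseCheck(label)
);
console.log(differing);
```

Any label that mentions 1913 without being an exact match is handled differently by the two checks, so the loose version lets through entries that the strict one does not.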

jeffbyrnes commented 4 years ago

Interesting! With #19 merged, we can see if this bug still occurs; if not, we can close it out & move on, noting that trying to include the volunteer GCIDE entries is problematic.

jeffbyrnes commented 4 years ago

Seems like this isn’t an issue any more, so I’ll close this out.