websterParser / WebsterParser

Convert Webster's Unabridged 1913 dictionary into a more usable format
GNU General Public License v3.0

Issue with comments in entries #7

Closed piranha closed 4 years ago

piranha commented 4 years ago

Hey! Terrific work on the conversion! I've installed the dictionary in my Dictionary.app and have used it often for quite some time now.

Today, though, I decided to convert it to StarDict format and used pyglossary for that. Pyglossary failed on me with lxml errors, and after some debugging I found that (some) comments are encoded as <!– instead of <!--, using an EN DASH (codepoint 0x2013) rather than two HYPHEN-MINUS characters (codepoint 0x2d).

I suspect this is an issue somewhere in the conversion, since I can't open those words in Dictionary.app; the broken words definitely include sequester, blurry and asbestus.

Hopefully you can fix that! Right now I've patched my copy of pyglossary with this gem of code:

entryFull = entryFull.replace('<!–', '<!--').replace('–>', '-->').replace('&#x2013;&gt;', '-->')

:)
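
For anyone who would rather check or repair the converted entries on the JavaScript side, here is a minimal sketch of the same three substitutions as the patch above. The helper names (repairCommentDelimiters, hasMangledComment) are hypothetical and not part of WebsterParser or pyglossary; only the three patterns come from the patch.

```javascript
// Hypothetical helpers; the three patterns mirror the Python patch above.
// \u2013 is EN DASH, written as an escape so it cannot be confused with "-".
const MANGLED_PATTERNS = [
  [/<!\u2013/g, '<!--'],      // "<!" followed by EN DASH
  [/\u2013>/g, '-->'],        // EN DASH followed by ">"
  [/&#x2013;&gt;/g, '-->'],   // entity-escaped form of the closing delimiter
];

// Apply the substitutions in the same order as the patch.
function repairCommentDelimiters(entry) {
  return MANGLED_PATTERNS.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    entry
  );
}

// True if the entry contains at least one of the mangled delimiters.
function hasMangledComment(entry) {
  return repairCommentDelimiters(entry) !== entry;
}
```

Note that the second pattern would also rewrite a legitimate EN DASH that happens to sit directly before a ">", so this is a workaround for consumers rather than a fix for the conversion itself.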

jeffbyrnes commented 4 years ago

I think I figured out why, check out index.js#L70-L72

Probably needs to be adjusted to avoid changing that if it’s a comment.

Oh, but hmm; right above it is:

  // Remove comments
  string = string.replace(/<\!--.*?-->/g, '');

So… I guess not?

Or perhaps the exclamation point should not be escaped… I don’t think it’s a special character in RegEx?
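
The question about the escaped exclamation point can be checked in isolation with a quick sketch like the one below. The sample strings are made up for illustration, and the multi-line case is an assumption about what the source data might contain. This tests only the regular expression; it says nothing about what index.js or #6 actually do.

```javascript
// The pattern as quoted above, and the same pattern without the backslash.
const escaped = /<\!--.*?-->/g;
const unescaped = /<!--.*?-->/g;
// Variant that also spans line breaks ([\s\S] matches any character).
const spanning = /<!--[\s\S]*?-->/g;

// Made-up sample inputs (assumptions, not taken from the dictionary source).
const oneLine = 'a<!-- note -->b';
const multiLine = 'a<!-- note\ncontinues -->b';

const strip = (pattern, text) => text.replace(pattern, '');

// Without the "u" flag, "\!" is an identity escape, so both patterns agree.
const sameOnOneLine = strip(escaped, oneLine) === strip(unescaped, oneLine);

// "." does not match a newline without the "s" flag, so a comment that
// spans lines is left untouched by the quoted pattern.
const multiLineSurvives = strip(escaped, multiLine) === multiLine;

// The spanning variant removes the multi-line comment as well.
const spanningStripsIt = strip(spanning, multiLine) === 'ab';

console.log({ sameOnOneLine, multiLineSurvives, spanningStripsIt });
```

If the source does contain comments that run across line breaks, they would survive the quoted replace and reach any later dash substitution, but whether that is what happens here would need to be confirmed against the actual data.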

jeffbyrnes commented 4 years ago

Yeah, that should do the trick; #6, in fact, includes this fix, so I think if you pull that into a local clone, and rebuild the dictionary, you should be good to go.

jeffbyrnes commented 4 years ago

I realized there are quite a few things to clean up/tweak here, so I made my own fork, in case that helps you out: https://github.com/jeffbyrnes/WebsterParser.

jeffbyrnes commented 4 years ago

@piranha I can confirm that #8 fixes the dictionary entries for sequester, blurry and asbestus. Wanna give it a try for your needs?

piranha commented 4 years ago

Yeah, I see them now, cool! Thanks a lot!

Maybe it's possible to get more of it fixed? Asbestus seems like a hard word. :)

two more things

A brace there, an excessive (?) newline and some unicode glitches there!

j-f1 commented 4 years ago

Here’s the source for the entry, reformatted for easier reading:

<p>
  <ent>Asbestos</ent><br/>
  <ent>Asbestus</ent><br/>
  <mhw>
    {
      <hw>As*bes"tus</hw> <pr>(<?/)</pr>,
      <hw>As*bes"tos</hw> <pr>(?; 277)</pr>,
    }
  </mhw>
  <pos>n.</pos>
  <ety>
    [L. <ets>asbestos</ets>
      (NL. <ets>asbestus</ets>)
      a kind of mineral unaffected by fire, Gr. <?/ (prop. an adj.)
      inextinguishable;
      <grk>'a</grk> priv. + <?/ to extinguish.
    ]
  </ety>
  <fld>(Min.)</fld>
  <def>
    A variety of amphibole or of pyroxene, occurring in long and delicate
    fibers, or in fibrous masses or seams, usually of a white, gray, or
    green-gray color. The name is also given to a similar variety of serpentine.
  </def>
  <br/>
  [<src>1913 Webster</src>]
</p>

piranha commented 4 years ago

Hm, what are those <?/ though?

j-f1 commented 4 years ago

They represent characters that could not be properly transcribed.
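
Since a bare <?/ is likely to trip up a strict XML parser (it looks like the start of a processing instruction with no valid target), a consumer could count these markers or swap them for a visible placeholder before parsing. This is a minimal sketch; the function names and the choice of U+FFFD as the placeholder are assumptions, not something WebsterParser does.

```javascript
// Matches the literal three-character marker "<?/".
const UNTRANSCRIBED = /<\?\//g;

// Hypothetical helper: how many untranscribed characters an entry contains.
function countUntranscribed(entry) {
  return (entry.match(UNTRANSCRIBED) || []).length;
}

// Hypothetical helper: replace each marker with a placeholder
// (U+FFFD REPLACEMENT CHARACTER by default).
function markUntranscribed(entry, placeholder = '\uFFFD') {
  // A function replacement avoids special "$" handling in the placeholder.
  return entry.replace(UNTRANSCRIBED, () => placeholder);
}

// Fragment adapted from the Asbestus entry above.
const sample = '<pr>(<?/)</pr> Gr. <?/ (prop. an adj.)';
console.log(countUntranscribed(sample));
console.log(markUntranscribed(sample));
```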