piranha closed this 4 years ago
I think I figured out why; check out index.js#L70-L72.
That probably needs to be adjusted so it doesn’t change the text when it’s inside a comment.
Oh, but hmm; right above it is:
// Remove comments
string = string.replace(/<\!--.*?-->/g, '');
So… I guess not?
Or perhaps the exclamation point should not be escaped… I don’t think it’s a special character in RegEx?
Yeah, that should do the trick; #6, in fact, includes this fix, so I think if you pull that into a local clone, and rebuild the dictionary, you should be good to go.
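For anyone who wants to try the comment-stripping pattern on its own, here is a minimal sketch. It is not the code from index.js: `stripComments` and the sample string are made up for illustration, and it assumes a comment may span several lines, which nothing in this thread confirms.

```javascript
// Hypothetical helper, not the code from index.js.
// [\s\S] is used instead of "." so a comment that spans lines is matched too.
function stripComments(string) {
  return string.replace(/<!--[\s\S]*?-->/g, '');
}

// Made-up sample: one single-line comment and one multi-line comment.
const sample = '<p><!-- one-line -->Asbestus<!--\nmulti\nline\n--></p>';
console.log(stripComments(sample)); // <p>Asbestus</p>
```

The lazy `*?` keeps each match from running past the first `-->`, so text between two separate comments is preserved.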
I realized there are quite a few things to clean up/tweak here, so I forked my own, in case that helps you out: https://github.com/jeffbyrnes/WebsterParser.
@piranha I can confirm that #8 fixes the dictionary entries for sequester, blurry and asbestus. Wanna give it a try for your needs?
Yeah, I see them now, cool! Thanks a lot!
Maybe it's possible to fix even more? Asbestus seems like a hard word. :)
A brace here, an excessive (?) newline and some Unicode glitches there!
Here’s the source for the entry, reformatted for easier reading:
<p>
<ent>Asbestos</ent><br/>
<ent>Asbestus</ent><br/>
<mhw>
{
<hw>As*bes"tus</hw> <pr>(<?/)</pr>,
<hw>As*bes"tos</hw> <pr>(?; 277)</pr>,
}
</mhw>
<pos>n.</pos>
<ety>
[L. <ets>asbestos</ets>
(NL. <ets>asbestus</ets>)
a kind of mineral unaffected by fire, Gr. <?/ (prop. an adj.)
inextinguishable;
<grk>'a</grk> priv. + <?/ to extinguish.
]
</ety>
<fld>(Min.)</fld>
<def>
A variety of amphibole or of pyroxene, occurring in long and delicate
fibers, or in fibrous masses or seams, usually of a white, gray, or
green-gray color. The name is also given to a similar variety of serpentine.
</def>
<br/>
[<src>1913 Webster</src>]
</p>
Hm, what are those <?/ though?
They represent characters that could not be properly transcribed.
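To get a feel for how often that happens in an entry, a rough sketch that counts the placeholders. `countUntranscribed` is a hypothetical helper, not part of the parser, and the sample is just the etymology lines from the entry above joined into one string.

```javascript
// Hypothetical helper: count the "<?/" placeholders in one entry's source.
function countUntranscribed(entrySource) {
  // Both "?" and "/" are escaped so they are matched literally.
  const matches = entrySource.match(/<\?\//g);
  return matches === null ? 0 : matches.length;
}

// Etymology lines of the Asbestus entry, joined for the example.
const ety = "Gr. <?/ (prop. an adj.) inextinguishable; <grk>'a</grk> priv. + <?/ to extinguish.";
console.log(countUntranscribed(ety)); // 2
```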
Hey! Terrific work on the conversion! I've installed the dictionary in my Dictionary.app and have been using it often for quite some time now.
Today, though, I decided to convert it to StarDict format, and used pyglossary for that. Pyglossary failed on me with lxml errors, and after some debugging I found that (some) comments are encoded as <!– instead of <!--, using EN DASH (codepoint 0x2013) rather than two HYPHEN-MINUS (codepoint 0x2d). I suspect this is an issue somewhere in the conversion, since I can't open those words in Dictionary.app; the broken words 100% include sequester, blurry and asbestus. Hopefully you can fix that! Right now I've patched my copy of pyglossary with this gem of code:
:)
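A rough sketch of that kind of workaround, and not the actual pyglossary patch: it rewrites a comment opener written with an EN DASH (U+2013) back to two HYPHEN-MINUS characters. It is in JavaScript only to match the converter; `fixCommentOpeners` is a hypothetical name. It deliberately touches openers only, since the report doesn't say whether closers were affected.

```javascript
// Hypothetical helper: repair comment openers where "--" became U+2013 (EN DASH).
function fixCommentOpeners(xml) {
  // "<!" followed by an en dash becomes "<!--"; nothing else is changed.
  return xml.replace(/<!\u2013/g, '<!--');
}

// Made-up sample with a mangled opener.
const broken = '<!\u2013 note --><ent>Sequester</ent>';
console.log(fixCommentOpeners(broken)); // <!-- note --><ent>Sequester</ent>
```

An en dash that appears anywhere else in the text is left alone, because the pattern requires the `<!` prefix.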