yomidevs / yomitan-import

External dictionary importer for Yomitan.
https://foosoft.net/projects/yomichan-import/
MIT License

JMdict and JMdict Forms Do Not Have Valid Revision Dates #9

Closed MarvNC closed 1 year ago

MarvNC commented 1 year ago

[Screenshot]

This happens on a fresh compile of both dictionaries. The same problem is seen in the .zip files distributed in Aquafina-water-bottle/jmdict-english-yomichan and MarvNC/jmdict-yomitan.

stephenmk commented 1 year ago

The code expects the JMdict date entry to be the final entry in the file. A couple of months ago they started including a small selection of JMnedict (name entries) in the JMdict file, so the date entry is no longer the final entry.

Instead of looking for the final entry, I guess you'd want to find the entry with the sequence number equal to 9999999. Or find the entry with the expression JMdict.
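A minimal sketch of the sequence-number lookup, for illustration only. The `entry` type, the `findRevision` helper, and the `"Creation Date: "` gloss prefix are assumptions made for this example, not the types or parsing logic used in jmdict.go, and the sample entries are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a hypothetical, simplified stand-in for a parsed JMdict entry.
type entry struct {
	Sequence int
	Glosses  []string
}

// dateSequence is the sequence number of the date entry mentioned above.
const dateSequence = 9999999

// datePrefix is an assumed marker that precedes the date inside the gloss.
const datePrefix = "Creation Date: "

// findRevision scans every entry instead of trusting the last one, and
// returns the text that follows datePrefix in the date entry's gloss.
func findRevision(entries []entry) (string, bool) {
	for _, e := range entries {
		if e.Sequence != dateSequence {
			continue
		}
		for _, g := range e.Glosses {
			if i := strings.Index(g, datePrefix); i >= 0 {
				return strings.TrimSpace(g[i+len(datePrefix):]), true
			}
		}
	}
	return "", false
}

func main() {
	// Made-up sample data: the date entry is deliberately not the last one.
	entries := []entry{
		{Sequence: 1000000, Glosses: []string{"example gloss"}},
		{Sequence: dateSequence, Glosses: []string{"Creation Date: 2023-09-25"}},
		{Sequence: 5000000, Glosses: []string{"example name entry"}},
	}
	rev, ok := findRevision(entries)
	fmt.Println(rev, ok)
}
```

Matching on the expression instead would follow the same loop, comparing the headword rather than the sequence number.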

https://github.com/themoeway/yomitan-import/blob/73b35ff03a78de0c5bb9881eb1d99af121746dab/jmdict.go#L65-L83

Also, I've been slowly working on adding JMdict to jitenbot. The mdict (MDX/MDD) version is pretty much finished. Eventually I plan to get it working with yomichan too, but that's a pain because yomichan's format is so much more limited. So while it may be many months in the future until everything is ported over to jitenbot (including the name dictionary), you may want to reconsider spending too much time on yomitan-import.

sujou

MarvNC commented 1 year ago

Wow, that looks awesome. Are there any significant improvements planned for the Yomichan version?

And yeah, I'm just hoping to fix the revision date issue for now.

stephenmk commented 1 year ago

> are there any significant improvements planned for the Yomichan version?

There will be only one yomichan JSON "term" per JMdict entry per headword. My current version has one JSON term per JMdict sense multiplied by the number of headwords, which results in an astronomical number of terms[^1]. Merging the JSON terms like this may result in faster validation times when importing the dictionary file, although I won't know until I try[^2].

[^1]: I designed the current version that way because that's how the original one worked, and some small yomichan features (e.g. term tags for parts of speech and miscellaneous info) relied upon each JMdict sense being a separate JSON term. Now that I have a better understanding of how this stuff works, I feel more comfortable breaking with that tradition.

[^2]: It will be nice if it does import faster, but that's not my main goal. This isn't really an issue to be solved on the dictionary side; the JSON validation process in yomichan badly needs to be optimized.
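To make the difference in term counts concrete, here is a rough sketch of the two layouts. The `jmEntry` and `term` types and both helper functions are hypothetical and do not reflect jitenbot's or yomichan's actual data structures; the headwords and senses are placeholders.

```go
package main

import "fmt"

// jmEntry is a hypothetical, simplified JMdict entry.
type jmEntry struct {
	Headwords []string
	Senses    []string
}

// term is a hypothetical, simplified stand-in for one yomichan JSON term.
type term struct {
	Headword string
	Glosses  []string
}

// termsPerSense models the current layout: one term per sense per headword.
func termsPerSense(e jmEntry) []term {
	var out []term
	for _, h := range e.Headwords {
		for _, s := range e.Senses {
			out = append(out, term{Headword: h, Glosses: []string{s}})
		}
	}
	return out
}

// termsPerHeadword models the planned layout: one term per headword,
// carrying all of the entry's senses.
func termsPerHeadword(e jmEntry) []term {
	var out []term
	for _, h := range e.Headwords {
		glosses := append([]string(nil), e.Senses...) // copy to avoid aliasing
		out = append(out, term{Headword: h, Glosses: glosses})
	}
	return out
}

func main() {
	e := jmEntry{
		Headwords: []string{"headword-1", "headword-2"},
		Senses:    []string{"sense-1", "sense-2", "sense-3"},
	}
	// Two headwords and three senses: 6 terms versus 2 terms.
	fmt.Println(len(termsPerSense(e)), len(termsPerHeadword(e)))
}
```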

This will solve the "Merging of terms from separate entries" problem that I wrote about in this pull request.

Since this design means I'll no longer be able to use yomichan's term tags to display part-of-speech and other miscellaneous information, I'm going to use embedded image files to display the information instead. In some ways this is an improvement, because yomichan's term tags do not display this information in the correct order. Most people probably don't know this, but the order of these tags can be important to understanding JMdict entries. If the "adj-no" tag is the first tag, for example, it means that the word is mostly used as 〜の and the definition glosses will be written as adjectives (rather than nouns, adverbs, etc). Sometimes these definition glosses can be interpreted differently (English has plenty of words that can be both nouns and verbs), so the tags are there to resolve that ambiguity.

Using embedded images also means we'll be able to avoid the emoji problem that lots of people have with Chromium-based browsers. I'll also be able to use embedded images instead of weird symbols (🅁, ⚠, ⛬, etc.) in the forms table. Since embedded images support hover-text in yomichan, users will be able to hover over and see additional information if they don't understand the symbols at first.

I'm also now grouping the senses by their part-of-speech tags. So if three senses in a row are all "noun" glosses, then they'll be grouped together under a single noun tag rather than displaying the noun tag on each sense.
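The grouping could be sketched roughly like this. The `sense` and `senseGroup` types and the `groupByPOS` helper are hypothetical names for illustration, not jitenbot's code, and the sketch assumes every sense carries its own explicit tag list. Tags are compared in order, since the order of the tags matters.

```go
package main

import (
	"fmt"
	"strings"
)

// sense is a hypothetical, simplified JMdict sense.
type sense struct {
	POS     []string
	Glosses []string
}

// senseGroup holds consecutive senses that share the same ordered tags.
type senseGroup struct {
	POS    []string
	Senses []sense
}

// samePOS reports whether two tag lists are identical, including order.
func samePOS(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// groupByPOS merges runs of adjacent senses with identical tag lists.
// Senses that are not adjacent stay in separate groups.
func groupByPOS(senses []sense) []senseGroup {
	var groups []senseGroup
	for _, s := range senses {
		if n := len(groups); n > 0 && samePOS(groups[n-1].POS, s.POS) {
			groups[n-1].Senses = append(groups[n-1].Senses, s)
			continue
		}
		groups = append(groups, senseGroup{POS: s.POS, Senses: []sense{s}})
	}
	return groups
}

func main() {
	// Made-up senses: three nouns in a row, then one verb.
	senses := []sense{
		{POS: []string{"n"}, Glosses: []string{"gloss 1"}},
		{POS: []string{"n"}, Glosses: []string{"gloss 2"}},
		{POS: []string{"n"}, Glosses: []string{"gloss 3"}},
		{POS: []string{"vs"}, Glosses: []string{"gloss 4"}},
	}
	for _, g := range groupByPOS(senses) {
		fmt.Printf("[%s]\n", strings.Join(g.POS, ", "))
		for _, s := range g.Senses {
			fmt.Println("  " + strings.Join(s.Glosses, "; "))
		}
	}
}
```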

chuuchou

I'm also using furigana in all cross-referenced words now. I want to add furigana to the example sentences as well by using a variety of different resources, but we'll see how well that goes.

kigo

MarvNC commented 1 year ago

Oh wow, didn't know about the tag order issue. Looks like some great improvements with the image tags and grouping, looking forward to seeing this release for jitenbot!