dawei-dev closed this issue 7 years ago
Yes, that's a tricky part!
Depending on your requirements, you may need to represent dictionaries in different ways. Say, one may want to put a dictionary's contents on a web page, print them in a terminal window (as in your case), or just compute a digest of an entry using machine learning magic :) So depending on the needs, different tools can be employed. That's why I deliberately avoided implementing any rendering functions in the library.
For example, there is a very simple renderer that I use to put dictionary contents on a web page [1].
It preserves the original formatting of `UTF8Text` entries (as they are already formatted plain text) and outputs `XDXF` "as is" (the browser takes care of rendering it).
In your case, you probably want to output `UTF8Text` "as is" and strip the XML tags from `XDXF` entries while keeping the structure.
There is a project similar to yours called `sdcv` (StarDict console version). It's written in C++ and has a function that, I believe, does the job [2]. You can try to port it to Haskell; I guess the `attoparsec` library will help you a lot here.
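A minimal sketch of the tag-stripping idea, using only `base` rather than attoparsec. It is not a port of the sdcv function: `stripTags`, `decodeEntities` and `renderXdxf` are hypothetical names, and the scanner is deliberately naive (it assumes no comments, no CDATA sections and no `>` inside attribute values). Line breaks in the entry are left untouched, so the structure of the text survives.

```haskell
import Data.List (isPrefixOf)

-- | Drop everything between '<' and '>' and keep the text in between.
-- Naive scanner, not a real XML parser (assumption: no comments, no
-- CDATA, no '>' inside attribute values).
stripTags :: String -> String
stripTags [] = []
stripTags ('<' : rest) = stripTags (drop 1 (dropWhile (/= '>') rest))
stripTags (c : rest) = c : stripTags rest

-- | Decode the five predefined XML entities in a single pass.
decodeEntities :: String -> String
decodeEntities [] = []
decodeEntities s@(c : rest) =
  case [ (ch, drop (length name) s)
       | (name, ch) <- entities, name `isPrefixOf` s ] of
    ((ch, remaining) : _) -> ch : decodeEntities remaining
    []                    -> c : decodeEntities rest
  where
    entities =
      [ ("&lt;", '<'), ("&gt;", '>'), ("&amp;", '&')
      , ("&quot;", '"'), ("&apos;", '\'') ]

-- | Strip tags first, then decode entities, so that an escaped '<'
-- in the text is never mistaken for the start of a tag.
renderXdxf :: String -> String
renderXdxf = decodeEntities . stripTags
```

For instance, `renderXdxf "<k>cat</k>\n<ex>a &lt;small&gt; cat</ex>"` evaluates to `"cat\na <small> cat"`. A real renderer would probably treat some tags specially (for example, highlighting the headword) instead of dropping all of them.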
Unfortunately, there are still many other formats to deal with, but most of them are pretty rare (I have never encountered anything except text/xdxf myself). XDXF is one of the most popular, so it may be worth targeting it first :)
[1] https://github.com/zohl/tr/blob/master/src/Main.hs#L58-L61
[2] https://github.com/Dushistov/sdcv/blob/master/src/libwrapper.cpp#L49
Thanks for the reply.
I know the browser can render `XDXF` (`xml`) automatically, since I also put that text on the web. That leads to another question I'd like your opinion on: would you consider dumping the dictionary into a database and running `SQL` queries instead of looking words up in the dictionary? I think the database would perform better.
I think it depends on the usage. Consider the following cases:
1) You are reading an article and occasionally look words up in a dictionary. In terms of the library, this means the following:
- look the word up in `Index` (which gives us the position of the desired entry)
- extract the entry from a byte array
The first operation might not be efficient right now, as `Index` is a simple `Map`. This can be improved by using a more efficient data structure (e.g. a Judy array).
As for the second, there is a trade-off: you can preload the byte array into RAM (faster lookups, but at the cost of memory usage and initialization time).
Given what an SQL database might do (accessing the hard drive, building the response from different blocks/extents scattered over the disk, journaling, locking tables, etc.), you might not want to use one here.
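The two steps in case 1 can be sketched as follows. The types are assumptions made for illustration, not the library's own: `Index` is modelled as a `Map` from word to a hypothetical `(offset, size)` pair, the dictionary data is a strict `ByteString` already loaded into RAM, and `lookupEntry` is a made-up name.

```haskell
import qualified Data.ByteString.Char8 as BS
import qualified Data.Map.Strict as Map

-- Hypothetical types, not taken from the library.
type Offset = Int
type Size = Int
type Index = Map.Map String (Offset, Size)

-- | Step 1: find the position of the entry in the index.
-- Step 2: slice the entry out of the preloaded byte array.
lookupEntry :: Index -> BS.ByteString -> String -> Maybe BS.ByteString
lookupEntry index dictData word = do
  (offset, size) <- Map.lookup word index
  pure (BS.take size (BS.drop offset dictData))

-- Two entries stored back to back, as they might be in a data file.
sampleData :: BS.ByteString
sampleData = BS.pack "a small felinea domestic canine"

sampleIndex :: Index
sampleIndex = Map.fromList [("cat", (0, 14)), ("dog", (14, 17))]
```

Here `lookupEntry sampleIndex sampleData "dog"` yields `Just "a domestic canine"`, and an unknown word yields `Nothing`. `Map.lookup` is logarithmic in the number of words, and slicing a strict `ByteString` does not copy the data, so the per-lookup cost in this sketch is dominated by the index search.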
2) You have a list of words to translate and want to pipe it into the application. Here a database might come out on top thanks to caching and other optimizations. And perhaps that list of words is already stored in a DB :)
3) You want to perform a fuzzy search over a dictionary. This is hard to accomplish with the library right now (only exact matches are supported, and I guess I should fix that). Here you need to (randomly) access the index and provide a custom function to match words. You can do it, for example, with PostgreSQL (it has built-in functions for working with text and can be easily extended).
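As an illustration of what a custom matching function over the index could look like outside a database, here is a rough sketch. It assumes the index is available as a plain `Map` keyed by word and uses Levenshtein edit distance as the matcher; `editDistance` and `fuzzyLookup` are hypothetical names, and none of this is part of the library.

```haskell
import qualified Data.Map.Strict as Map

-- | Levenshtein edit distance, computed one row at a time.
editDistance :: String -> String -> Int
editDistance xs ys = last (foldl nextRow [0 .. length ys] xs)
  where
    -- Build the row for character x from the previous row.
    nextRow prev@(p : ps) x = scanl step (p + 1) (zip3 ys prev ps)
      where
        step left (y, diag, up) =
          minimum [left + 1, up + 1, diag + (if x == y then 0 else 1)]
    nextRow [] _ = []

-- | Return every index entry whose key is within maxDist edits of the
-- query. This scans all keys, so it is linear in the size of the index.
fuzzyLookup :: Int -> String -> Map.Map String v -> [(String, v)]
fuzzyLookup maxDist word =
  Map.toList . Map.filterWithKey (\key _ -> editDistance word key <= maxDist)
```

With an index containing `cart`, `cat` and `dog`, `fuzzyLookup 1 "cat"` returns the entries for `cart` and `cat`. Because every key is visited, this is only practical for small dictionaries or occasional queries; a database with text-search support avoids the full scan.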
Of course, this is pure speculation, so I would recommend writing a micro-benchmark for your case to test it out.
Dear Zohl:
I'm using your library to create a command-line dictionary tool. It simply prints out the definition of a word. I'm wondering how you render the different formats to make them look good in the terminal. For example, many dictionaries use the `XDXF` format.