What renderer function for XDXF format

dawei-dev commented 7 years ago

Dear Zohl:

I'm using your library to create a command-line dictionary tool. It simply prints out the definition of a word. I'm wondering how you render the different formats to make them look good in the terminal. For example, many dictionaries use the XDXF format.

zohl commented 7 years ago

Yes, that's a tricky part!

Depending on your own requirements, you may need to represent dictionaries in different ways. Say, one may want to put a dictionary's contents on a web page, show them in a terminal window (like in your case), or just produce a digest of an entry using machine learning magic :) So depending on the needs, different tools can be employed. That's why I deliberately avoided implementing any rendering functions in the library.

As an example, there is a very simple renderer that I use to put dictionary contents on a webpage [1]. It preserves the original formatting of UTF8Text entries (as they are already formatted plain text) and it outputs XDXF "as is" (the browser takes care of rendering it).

In your case, you probably want to output UTF8Text "as is" and strip the XML tags from XDXF entries while keeping the structure. There is a project similar to yours called sdcv (StarDict console version). It's written in C++ and has a function that, I believe, does the job [2]. You can try to port it to Haskell; I guess the attoparsec library will help you a lot here.
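To give an idea of what "strip the tags yet keep the structure" could look like, here is a minimal sketch using only `base`. It is not a port of the sdcv function and not part of the library. The name `xdxfToPlain` is hypothetical, and the choice of which tags get special treatment (`<k>` for the headword, `<tr>` for the transcription) is an assumption made for illustration; a real renderer would cover more tags and would likely use a proper parser such as attoparsec.

```haskell
import Data.List (isPrefixOf)

-- | Convert an XDXF fragment to plain text (hypothetical helper).
-- Assumed conventions, chosen for this sketch only:
--   </k>         ends the headword, so a line break is emitted
--   <tr>..</tr>  transcription, shown in square brackets
--   other tags   dropped, their text content is kept
xdxfToPlain :: String -> String
xdxfToPlain [] = []
xdxfToPlain s@(c : cs)
  | c == '<' =
      -- Everything up to the next '>' is the tag name plus attributes.
      let (tag, rest) = break (== '>') cs
       in renderTag tag ++ xdxfToPlain (drop 1 rest)
  | c == '&' = decodeEntity s
  | otherwise = c : xdxfToPlain cs

-- | Plain-text replacement for a tag (given without the angle brackets).
renderTag :: String -> String
renderTag "/k" = "\n"
renderTag "tr" = "["
renderTag "/tr" = "]"
renderTag _ = ""

-- | Decode the five predefined XML entities; a lone '&' is kept as is.
decodeEntity :: String -> String
decodeEntity s = go entities
  where
    go [] = '&' : xdxfToPlain (drop 1 s)
    go ((name, ch) : es)
      | name `isPrefixOf` s = ch : xdxfToPlain (drop (length name) s)
      | otherwise = go es
    entities =
      [ ("&lt;", '<'), ("&gt;", '>'), ("&amp;", '&')
      , ("&quot;", '"'), ("&apos;", '\'')
      ]
```

Loaded in GHCi, `xdxfToPlain "<k>cat</k><tr>kat</tr> a small <b>animal</b>"` evaluates to `"cat\n[kat] a small animal"`. UTF8Text entries would bypass this function and be printed unchanged.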

Unfortunately, there are still many other formats to deal with, but most of them are pretty rare (I have never encountered anything except text and XDXF myself). XDXF is one of the most popular, so it may be worth targeting first :)

[1] https://github.com/zohl/tr/blob/master/src/Main.hs#L58-L61
[2] https://github.com/Dushistov/sdcv/blob/master/src/libwrapper.cpp#L49

dawei-dev commented 7 years ago

Thanks for the reply. I know that the browser can render XDXF (XML) automatically, since I also send that text to the web. That leads to another question I'd like your opinion on: would you consider dumping the dictionary into a database and querying it with SQL instead of looking words up in the dictionary files? I think the performance of a database would be better.

zohl commented 7 years ago

I think it depends on the usage. Consider the following cases:

1) You are reading an article and occasionally look words up in a dictionary. In terms of the library, this means the following:

- perform a word lookup in an Index (which gives us the position of the desired entry)
- extract the entry from a byte array

The first operation might not be efficient right now, as Index is a simple Map. This can be improved by using a more efficient data structure (e.g. a Judy array). As for the second, there is a trade-off:

- you can read it from the hard drive every time you need an entry (slower).
- you can preload it into RAM (faster, but at the cost of memory usage and initialization time).
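
The two steps and the disk-versus-RAM trade-off can be sketched roughly as follows. The types and names here (`Index` as a map from headword to offset and length, `locate`, `sliceEntry`, `readEntry`, `lookupInMemory`) are hypothetical and only illustrate the idea; they are not the library's actual API. Only `containers`, `bytestring` and `base` are used.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M
import System.IO (IOMode (ReadMode), SeekMode (AbsoluteSeek), hSeek, withBinaryFile)

-- Hypothetical index: headword -> (byte offset, byte length) of its entry.
type Index = M.Map String (Int, Int)

-- Step 1: find the position of an entry (exact match only).
locate :: Index -> String -> Maybe (Int, Int)
locate index word = M.lookup word index

-- Step 2, RAM variant: cut the entry out of a preloaded byte array.
sliceEntry :: B.ByteString -> (Int, Int) -> B.ByteString
sliceEntry body (off, len) = B.take len (B.drop off body)

-- Step 2, disk variant: open the file, seek and read on every lookup.
readEntry :: FilePath -> (Int, Int) -> IO B.ByteString
readEntry path (off, len) =
  withBinaryFile path ReadMode $ \h -> do
    hSeek h AbsoluteSeek (fromIntegral off)
    B.hGet h len

-- Both steps combined for the preloaded case.
lookupInMemory :: Index -> B.ByteString -> String -> Maybe B.ByteString
lookupInMemory index body word = sliceEntry body <$> locate index word
```

With a body of `B.pack "felinecanine"` and an index mapping `"dog"` to `(6, 6)`, `lookupInMemory` returns `Just "canine"`. The disk variant pays for opening and seeking on each call, while the RAM variant pays once at start-up for loading the whole body.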

Compared to what an SQL database has to do (accessing the hard drive, building a response from different blocks/extents scattered over the disk, journaling, locking tables, etc.), you might not want to use one here.

2) You have a list of words to translate and want to pipe it into the application. Here a database might come out on top due to caching and other optimizations. And, perhaps, that list of words is already stored in a DB :)

3) You want to perform a fuzzy search over a dictionary. This is hard to accomplish with the library right now (only exact matches are supported, and I guess I should fix that). Here you need to (randomly) access the index and provide a custom function to match words. You can do it, for example, with PostgreSQL (it has built-in functions for dealing with text and can be easily extended).
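
As an illustration of "a custom function to match words", here is a small sketch that scans every key of a Map-based index and keeps those within a given edit distance of the query. The names `editDistance` and `fuzzyLookup` are hypothetical, Levenshtein distance is just one possible matching function, and the linear scan is an assumption for simplicity; none of this is provided by the library.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- | Levenshtein distance, computed one row at a time.
editDistance :: String -> String -> Int
editDistance a b = last (foldl' step [0 .. length a] b)
  where
    -- Build the next row from the previous one for character c of b.
    step [] _ = []
    step prev@(p : ps) c = scanl compute (p + 1) (zip3 a prev ps)
      where
        compute left (x, diag, up) =
          minimum [up + 1, left + 1, diag + fromEnum (x /= c)]

-- | All index entries whose headword is within maxDist edits of the query.
-- This visits every key, so the cost grows linearly with the index size.
fuzzyLookup :: Int -> String -> M.Map String v -> [(String, v)]
fuzzyLookup maxDist query =
  filter (\(k, _) -> editDistance query k <= maxDist) . M.toList
```

For an index with the keys `"color"`, `"colour"` and `"cooler"`, `fuzzyLookup 1 "color"` keeps `"color"` and `"colour"` and drops `"cooler"`. A database with text-search extensions would typically avoid the full scan that this sketch performs.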

Of course, this is pure speculation, so I would recommend writing a micro-benchmark for your case to test it out.

zohl / dictionaries

What renderer function for XDXF format #2