michal-h21 closed this 2 years ago
Thanks! I pushed a commit to the `uca-sort` branch based on your patch. However, there are few tests for locale sorting in the test suite, and running the hundreds of test cases is significantly slower with the UCA. So I'll merge it into the `main` branch at some later time.
Of course, it is a bit slower than plain `table.sort`, but it shouldn't be too bad. At least I hope it isn't. I've just updated Lua-UCA on CTAN; I've added all languages that have sorting rules in CLDR, which means most European and Asian languages.
I just wrote a simple Python script to benchmark the test procedure. I checked out 8e70f86 (without lua-uca) and it took 14.3s to run `texlua test/citeproc-test.lua` on all 853 tests. For ca10a8b, the commit on top of 8e70f86 that introduces lua-uca, the same tests took 110.5s, which is nearly 8 times slower. I didn't expect that either. Note that I've also updated lua-uca to the latest version, 0.1a.
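The benchmark script itself isn't shown in the thread. A minimal sketch of how such a comparison could be done is below; the helper names `time_command` and `benchmark_commits` are made up for illustration, and it assumes `git` and `texlua` are on the PATH and the working tree is clean.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd once and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd, cwd=cwd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode


def benchmark_commits(commits, cmd, repo="."):
    """Check out each commit in turn and time cmd in that working tree.

    Hypothetical helper: it switches the checked-out commit in repo,
    so run it only on a clean clone.
    """
    timings = {}
    for commit in commits:
        subprocess.run(
            ["git", "checkout", "--quiet", commit], cwd=repo, check=True
        )
        timings[commit], _ = time_command(cmd, cwd=repo)
    return timings
```

Calling `benchmark_commits(["8e70f86", "ca10a8b"], ["texlua", "test/citeproc-test.lua"])` from the repository root would give one wall-clock figure per commit. A single run per commit is noisy, so repeating each measurement a few times would make the comparison more reliable.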
I've made some tests and the sorting can be a lot slower in some cases, especially for arrays that contain a lot of similar or duplicated strings. The `collator.new()` function is also quite time intensive, because it has to copy a huge table with sorting data. I've tried to add more caching to the sorting function, but it didn't help at all.
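For reference, the general pattern being discussed (build the expensive collator once, and compute one sort key per distinct string) can be sketched as follows. This is not Lua-UCA's API or its caching code: `build_collator` is a hypothetical stand-in whose "collation" is just case folding, and whether this kind of caching pays off depends on where the time actually goes, which in the case above it apparently did not.

```python
from functools import lru_cache


def build_collator(lang):
    """Hypothetical stand-in for an expensive collator constructor."""
    build_collator.calls += 1
    # A real collator would load language-specific collation data here.
    return lambda s: s.casefold()


build_collator.calls = 0


@lru_cache(maxsize=None)
def get_collator(lang):
    """Build the collator for lang once and reuse it afterwards."""
    return build_collator(lang)


def sort_strings(strings, lang):
    """Sort strings, computing one sort key per distinct string."""
    key_of = get_collator(lang)
    keys = {s: key_of(s) for s in set(strings)}
    return sorted(strings, key=keys.__getitem__)
```

With many duplicated strings, the key function runs once per distinct value rather than once per comparison, and repeated calls for the same language skip the constructor entirely.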
Wow, I see that you added a full LaTeX package, it is really nice! And also the BibTeX parser and converter to Citeproc-json! This is actually something that I wanted to do for my other projects too (see Lua-refmanager, which is a bunch of testing scripts for various bibliography-related tasks).
So what is missing is only the multilingual support? The main document language is stored in the `\bbl@main@language` macro by both Babel and Polyglossia.
There still remain two major features not implemented yet: disambiguation and collapsing. See the following table for details.
I'm currently working on the `richtext` module, which causes many failures.
It looks impressive anyways, thank you.
@michal-h21 I'm working on `babel` compatibility. Do you know how to map `\bbl@main@language` to a BCP 47 language code like `en-GB`? I checked the `babel` documentation but did not find anything. I found that `biber` does this conversion in its Perl code, Constants.pm. So there seems to be no way other than providing a mapping table in my project.
@zepinglee yes, I think you can only do that with a mapping. I use the following mapping from Babel names to HTML language codes in TeX4ht:
\Declare:Language{UKenglish}{en}
\Declare:Language{USenglish}{en}
\Declare:Language{latex}{en}
\Declare:Language{acadian}{fr}
\Declare:Language{albanian}{sq}
\Declare:Language{american}{en}
\Declare:Language{amharic}{am}
\Declare:Language{arabic}{ar}
\Declare:Language{armenian}{hy}
\Declare:Language{australian}{en}
\Declare:Language{austrian}{de}
\Declare:Language{basque}{eu}
\Declare:Language{bengali}{bn}
\Declare:Language{brazilian}{pt}
\Declare:Language{brazil}{pt}
\Declare:Language{breton}{br}
\Declare:Language{british}{en}
\Declare:Language{bulgarian}{bg}
\Declare:Language{canadian}{en}
\Declare:Language{canadien}{fr}
\Declare:Language{catalan}{ca}
\Declare:Language{croatian}{hr}
\Declare:Language{czech}{cs}
\Declare:Language{danish}{da}
\Declare:Language{divehi}{dv}
\Declare:Language{dutch}{nl}
\Declare:Language{english}{en}
\Declare:Language{esperanto}{eo}
\Declare:Language{estonian}{et}
\Declare:Language{finnish}{f\/i}
\Declare:Language{francais}{fr}
\Declare:Language{french}{fr}
\Declare:Language{galician}{gl}
\Declare:Language{germanb}{de}
\Declare:Language{german}{de}
\Declare:Language{greek}{el}
\Declare:Language{hebrew}{he}
\Declare:Language{hindi}{hi}
\Declare:Language{hungarian}{hu}
\Declare:Language{icelandic}{is}
\Declare:Language{interlingua}{ia}
\Declare:Language{irish}{ga}
\Declare:Language{italian}{it}
\Declare:Language{kannada}{kn}
\Declare:Language{khmer}{km}
\Declare:Language{korean}{ko}
\Declare:Language{lao}{lo}
\Declare:Language{latin}{la}
\Declare:Language{latvian}{lv}
\Declare:Language{lithuanian}{lt}
\Declare:Language{lowersorbian}{dsb}
\Declare:Language{magyar}{hu}
\Declare:Language{malayalam}{ml}
\Declare:Language{marathi}{mr}
\Declare:Language{naustrian}{de}
\Declare:Language{newzealand}{en}
\Declare:Language{ngerman}{de}
\Declare:Language{norsk}{no}
\Declare:Language{norwegiannynorsk}{nn}
\Declare:Language{nynorsk}{no}
\Declare:Language{occitan}{oc}
\Declare:Language{oldchurchslavonic}{cu}
\Declare:Language{persian}{fa}
\Declare:Language{polish}{pl}
\Declare:Language{polutonikogreek}{el}
\Declare:Language{portuges}{pt}
\Declare:Language{portuguese}{pt}
\Declare:Language{romanian}{ro}
\Declare:Language{romansh}{rm}
\Declare:Language{russian}{ru}
\Declare:Language{samin}{se}
\Declare:Language{sanskrit}{sa}
\Declare:Language{scottish}{gd}
\Declare:Language{serbian}{sr}
\Declare:Language{serbo-croatian}{sh}
\Declare:Language{slovak}{sk}
\Declare:Language{slovene}{sl}
\Declare:Language{slovenian}{sl}
\Declare:Language{spanish}{es}
\Declare:Language{swedish}{sv}
\Declare:Language{tamil}{ta}
\Declare:Language{telugu}{te}
\Declare:Language{thai}{th}
\Declare:Language{tibetan}{bo}
\Declare:Language{turkish}{tr}
\Declare:Language{turkmen}{tk}
\Declare:Language{ukrainian}{uk}
\Declare:Language{uppersorbian}{hsb}
\Declare:Language{urdu}{ur}
\Declare:Language{vietnamese}{vi}
\Declare:Language{welsh}{cy}
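A mapping table like the one above could be turned into a lookup for BCP 47 tags along these lines. This is only a sketch: the dictionary is a small illustrative subset, the region subtags (`en-GB`, `de-AT`, and so on) are my assumptions about what each name implies rather than anything defined by babel, and the names `BABEL_TO_BCP47` and `babel_to_bcp47` are made up.

```python
# Illustrative subset only; the region subtags are assumptions,
# not something babel or TeX4ht defines.
BABEL_TO_BCP47 = {
    "american": "en-US",
    "USenglish": "en-US",
    "british": "en-GB",
    "UKenglish": "en-GB",
    "english": "en",
    "austrian": "de-AT",
    "ngerman": "de-DE",
    "french": "fr-FR",
    "czech": "cs-CZ",
}


def babel_to_bcp47(name, default="en-US"):
    """Return the BCP 47 tag for a babel language name, or default if unknown."""
    return BABEL_TO_BCP47.get(name, default)
```

Falling back to a default keeps an unknown language name from raising an error; warning the user in that case would be a reasonable alternative.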
I've added `lua-uca` for non-English locales only, so it won't much affect the running time of the citeproc test suite.
A simple test case is https://github.com/zepinglee/citeproc-lua/blob/2f99fde61437efce314379cc3a2b211f53cfddf1/test/latex/luatex-2-sort.lvt#L1-L34 along with https://github.com/zepinglee/citeproc-lua/blob/2f99fde61437efce314379cc3a2b211f53cfddf1/test/latex/support/sort.bib#L1-L18
The result is as follows.
These words are from the `lua-uca` package documentation.
That sounds great! I've got an error with your example:
Module csl Error: Failed to find "sort.csl". on input line 23
stack traceback:
[C]: in function 'error'
...ocal/texlive/2021/texmf-dist/tex/latex/base/ltluatex.lua:109: in field 'mod
ule_error'
/home/mint/texmf/scripts/csl/csl-core.lua:27: in function 'csl-core.error'
/home/mint/texmf/scripts/csl/csl-core.lua:56: in function 'csl-core.read_file'
/home/mint/texmf/scripts/csl/csl-core.lua:114: in function 'csl-core.init'
/home/mint/texmf/scripts/csl/csl.lua:35: in function 'csl.init'
[\directlua]:1: in main chunk.
\lua_now:e #1->\__lua_now:n {#1}
l.23 \begin{document}
I've found that it is because I don't have the `sort.csl` file, but the error message is a bit cryptic. Anyway, when I tried it with another CSL file, it seems to work OK.
@michal-h21 It is a unit test file for `l3build`, which should be run with `l3build save --config test/latex/config-luatex-2 luatex-2-sort`. This command copies `test/latex/luatex-2-sort.lvt` and all files in `test/latex/support/`, including `sort.csl`, to `build/test-test/latex/config-luatex-2/` before running lualatex.
@zepinglee ah, it works now, thanks. One issue I noticed is that "čáp" is sorted incorrectly. It is because Lua-UCA doesn't support decomposed Unicode letters. It works when I use the normalized character.
I've not heard of precomposed characters before. Are you going to implement that feature in `lua-uca`?
@zepinglee I've implemented that now in the development version of Lua-UCA. Decomposed characters are now normalized, so "čáp" is correctly sorted after "cihla". I've also added some more caching, I am not sure if it will help for your big test suite, but you can try it.
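To illustrate the decomposed versus precomposed distinction with "čáp": the sketch below uses Python's standard `unicodedata` module rather than Lua-UCA's own normalization code, which isn't shown in this thread. The helper name `normalize_for_sorting` is made up.

```python
import unicodedata

# "čáp" written with base letters followed by combining marks (decomposed).
decomposed = "c\u030ca\u0301p"
# The same word written with precomposed letters.
precomposed = "\u010d\u00e1p"


def normalize_for_sorting(text):
    """Compose base letters and combining marks into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)
```

The two strings render identically but differ as code point sequences, so a collator that only knows the precomposed letters will not recognize the decomposed form unless the input is normalized first.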
Thanks for your support! It greatly helps this project.
I am glad I can help! Does it help with the speed? I suspect that maybe not, as normalization itself can take some time. If everything works, I can update Lua-UCA on CTAN.
@michal-h21 It makes no significant difference in speed with the latest version. It's totally ok to publish it to CTAN.
I thought so. I don't know how to change the code to make it faster. I cache everything that could be potentially slow to calculate. Anyway, I will make a new release.
Hi, here is a patch for `citeproc-node-sort.lua` that adds support for sorting using Lua-UCA. One thing that is missing is the language setting interface. The code expects the bibliography language setting in `context.lang`, but you may want to name it differently. A function that sets it will be necessary.