Lua-UCA support

michal-h21 commented 2 years ago

Hi, here is a patch for citeproc-node-sort.lua that adds support for sorting with Lua-UCA. One thing still missing is an interface for setting the language. The code expects the bibliography language in context.lang, but you may want to name it differently. A function that sets it will be needed.

diff --git a/citeproc/citeproc-node-sort.lua b/citeproc/citeproc-node-sort.lua
index 5370558..85bb6df 100644
--- a/citeproc/citeproc-node-sort.lua
+++ b/citeproc/citeproc-node-sort.lua
@@ -2,6 +2,13 @@ local Element = require("citeproc.citeproc-node-element")
 local Names = require("citeproc.citeproc-node-names").names
 local util = require("citeproc.citeproc-util")

+-- UCA collation support
+local ducet = require "lua-uca.lua-uca-ducet"
+local collator = require "lua-uca.lua-uca-collator"
+local languages = require "lua-uca.lua-uca-languages"
+
+local collator_obj = collator.new(ducet)
+

 local Sort = Element:new()

@@ -26,24 +33,22 @@ function Sort:sort (items, context)
       table.insert(key_dict[item.id], value)
     end
   end
+  local lang = context.lang or "en"
+  local language_sort = languages[lang] 
+  local collator = collator_obj
+  -- use language specific sorting, if available
+  if language_sort then 
+    collator = language_sort(collator_obj)
+  end
   local compare_entry = function (item1, item2)
     for i, value1 in ipairs(key_dict[item1.id]) do
       local descending = descendings[i]
       local value2 = key_dict[item2.id][i]
       if value1 and value2 then
-        if value1 < value2 then
-          if descending then
-            return false
-          else
-            return true
-          end
-        elseif value1 > value2 then
-          if descending then
-            return true
-          else
-            return false
-          end
-        end
+        local result = collator:compare_strings(value1, value2)
+        -- reverse sorting in descending mode
+        if descending then result = not(result) end
+        return result
       elseif value1 then
         return true
       elseif value2 then
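
One thing to watch in the comparison above: it returns as soon as both values are present, even when they compare as equal, so later sort keys are never consulted, and in descending mode not(result) is true for two equal values. The following is a minimal tie-aware sketch, not the code that ended up in citeproc-lua. It assumes `less` is a strict "sorts before" predicate (for example a wrapper around collator:compare_strings) and that `descendings` holds one boolean per key; `make_compare` is a hypothetical helper name.

```lua
-- Sketch of a multi-key comparator for table.sort.
-- `less(a, b)` must return true only when a sorts strictly before b, e.g.
--   function(a, b) return collator:compare_strings(a, b) end
-- `make_compare` is a hypothetical name, not part of citeproc-lua.
local function make_compare(key_dict, descendings, less)
  return function(item1, item2)
    local keys1, keys2 = key_dict[item1.id], key_dict[item2.id]
    for i = 1, #descendings do
      local value1, value2 = keys1[i], keys2[i]
      local descending = descendings[i] == true
      if value1 and value2 then
        if less(value1, value2) then
          return not descending
        elseif less(value2, value1) then
          return descending
        end
        -- equal under `less`: fall through to the next key
      elseif value1 then
        return true    -- an item with a value sorts before one without
      elseif value2 then
        return false
      end
    end
    return false       -- all keys equal: neither item sorts first
  end
end

-- Usage with plain byte comparison standing in for the collator:
-- first key ascending (name), second key descending (year).
local key_dict = {
  a = { "Smith", "2001" },
  b = { "Smith", "1999" },
  c = { "Adams", "2010" },
}
local items = { { id = "a" }, { id = "b" }, { id = "c" } }
table.sort(items, make_compare(key_dict, { false, true },
  function(x, y) return x < y end))
for _, item in ipairs(items) do print(item.id) end
```

Returning false when every key ties keeps the comparator a strict ordering, which is what table.sort expects.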

zepinglee commented 2 years ago

Thanks! I pushed a commit to the uca-sort branch based on your patch. However, there are few tests for locale sorting in the test suite, and running the hundreds of test cases is significantly slower with the UCA. So I'll merge it into the main branch at some later time.

michal-h21 commented 2 years ago

Of course, it is a bit slower than plain table.sort, but it shouldn't be too bad. At least I hope it isn't. I've just updated UCA on CTAN; I've added all languages that have sorting rules in CLDR, which means most European and Asian languages.

zepinglee commented 2 years ago

> but it shouldn't be too bad.

I just wrote a simple Python script to benchmark the test procedure. I checked out 8e70f86 (without lua-uca) and it took 14.3s to run texlua test/citeproc-test.lua on all 853 tests. For ca10a8b, the commit that introduces lua-uca on top of 8e70f86, the same tests took 110.5s, which is nearly 8 times slower. I didn't expect that either. Note that I've also updated lua-uca to the latest version, 0.1a.

michal-h21 commented 2 years ago

I've made some tests and the sorting can be a lot slower in some cases, especially for arrays that contain a lot of similar or duplicated strings. The collator.new() function is also quite time intensive, because it has to copy a huge table with sorting data. I've tried to add more caching to the sorting function, but it didn't help at all.
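
To check how much work lands in the comparison function, one option is to count how often table.sort calls it. The sketch below wraps an arbitrary comparator; `counting` is a hypothetical helper name, and plain `<` stands in for collator:compare_strings so that the snippet runs without Lua-UCA installed. The input deliberately contains many duplicated strings, the case described above as slow.

```lua
-- Wraps a comparator and counts how many times it is invoked.
-- `counting` is a hypothetical helper, not part of Lua-UCA or citeproc-lua.
local function counting(less)
  local calls = 0
  local function wrapped(a, b)
    calls = calls + 1
    return less(a, b)
  end
  return wrapped, function() return calls end
end

-- 1000 strings drawn from only 10 distinct values.
local words = {}
for i = 1, 1000 do
  words[i] = "key" .. (i % 10)
end

local counted_less, get_calls = counting(function(a, b) return a < b end)
table.sort(words, counted_less)
print(("%d comparisons for %d strings"):format(get_calls(), #words))
```

Multiplying the reported count by the cost of one collator comparison gives a rough idea of whether the comparator or the one-time collator.new() setup dominates.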

michal-h21 commented 2 years ago

Wow, I see that you added a full LaTeX package, it is really nice! And also the BibTeX parser and converter to Citeproc-json! This is actually something that I wanted to do for my other projects too (see Lua-refmanager, which is a bunch of testing scripts for various bibliography-related tasks).

So what is missing is only the multilingual support? The main document language is stored in \bbl@main@language macro by both Babel and Polyglossia.

zepinglee commented 2 years ago

There still remain two major features not implemented yet: disambiguation and collapsing. See the following table for details.

https://github.com/zepinglee/citeproc-lua/blob/39e55207e728dfec65515a87a1de6c364ddaaa74/test/citeproc-test.log#L2221-L2260

I'm currently working on the richtext module which causes many failures.

michal-h21 commented 2 years ago

It looks impressive anyways, thank you.

zepinglee commented 2 years ago

> So what is missing is only the multilingual support? The main document language is stored in \bbl@main@language macro by both Babel and Polyglossia.

@michal-h21 I'm working on babel compatibility. Do you know how to map \bbl@main@language to a BCP 47 language code like en-GB? I checked the babel documentation but did not find anything. I found that biber does this conversion in its Perl code, in Constants.pm. So there seems to be no way around providing a mapping table in my project.

michal-h21 commented 2 years ago

@zepinglee yes, I think you can only do that with a mapping. I use the following mapping from Babel names to HTML language codes in TeX4ht:


\Declare:Language{UKenglish}{en}
\Declare:Language{USenglish}{en}
\Declare:Language{latex}{en}
\Declare:Language{acadian}{fr}
\Declare:Language{albanian}{sq}
\Declare:Language{american}{en}
\Declare:Language{amharic}{am}
\Declare:Language{arabic}{ar}
\Declare:Language{armenian}{hy}
\Declare:Language{australian}{en}
\Declare:Language{austrian}{de}
\Declare:Language{basque}{eu}
\Declare:Language{bengali}{bn}
\Declare:Language{brazilian}{pt}
\Declare:Language{brazil}{pt}
\Declare:Language{breton}{br}
\Declare:Language{british}{en}
\Declare:Language{bulgarian}{bg}
\Declare:Language{canadian}{en}
\Declare:Language{canadien}{fr}
\Declare:Language{catalan}{ca}
\Declare:Language{croatian}{hr}
\Declare:Language{czech}{cs}
\Declare:Language{danish}{da}
\Declare:Language{divehi}{dv}
\Declare:Language{dutch}{nl}
\Declare:Language{english}{en}
\Declare:Language{esperanto}{eo}
\Declare:Language{estonian}{et}
\Declare:Language{finnish}{f\/i}
\Declare:Language{francais}{fr}
\Declare:Language{french}{fr}
\Declare:Language{galician}{gl}
\Declare:Language{germanb}{de}
\Declare:Language{german}{de}
\Declare:Language{greek}{el}
\Declare:Language{hebrew}{he}
\Declare:Language{hindi}{hi}
\Declare:Language{hungarian}{hu}
\Declare:Language{icelandic}{is}
\Declare:Language{interlingua}{ia}
\Declare:Language{irish}{ga}
\Declare:Language{italian}{it}
\Declare:Language{kannada}{kn}
\Declare:Language{khmer}{km}
\Declare:Language{korean}{ko}
\Declare:Language{lao}{lo}
\Declare:Language{latin}{la}
\Declare:Language{latvian}{lv}
\Declare:Language{lithuanian}{lt}
\Declare:Language{lowersorbian}{dsb}
\Declare:Language{magyar}{hu}
\Declare:Language{malayalam}{ml}
\Declare:Language{marathi}{mr}
\Declare:Language{naustrian}{de}
\Declare:Language{newzealand}{en}
\Declare:Language{ngerman}{de}
\Declare:Language{norsk}{no}
\Declare:Language{norwegiannynorsk}{nn}
\Declare:Language{nynorsk}{no}
\Declare:Language{occitan}{oc}
\Declare:Language{oldchurchslavonic}{cu}
\Declare:Language{persian}{fa}
\Declare:Language{polish}{pl}
\Declare:Language{polutonikogreek}{el}
\Declare:Language{portuges}{pt}
\Declare:Language{portuguese}{pt}
\Declare:Language{romanian}{ro}
\Declare:Language{romansh}{rm}
\Declare:Language{russian}{ru}
\Declare:Language{samin}{se}
\Declare:Language{sanskrit}{sa}
\Declare:Language{scottish}{gd}
\Declare:Language{serbian}{sr}
\Declare:Language{serbo-croatian}{sh}
\Declare:Language{slovak}{sk}
\Declare:Language{slovene}{sl}
\Declare:Language{slovenian}{sl}
\Declare:Language{spanish}{es}
\Declare:Language{swedish}{sv}
\Declare:Language{tamil}{ta}
\Declare:Language{telugu}{te}
\Declare:Language{thai}{th}
\Declare:Language{tibetan}{bo}
\Declare:Language{turkish}{tr}
\Declare:Language{turkmen}{tk}
\Declare:Language{ukrainian}{uk}
\Declare:Language{uppersorbian}{hsb}
\Declare:Language{urdu}{ur}
\Declare:Language{vietnamese}{vi}
\Declare:Language{welsh}{cy}
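
On the Lua side, the same idea can be sketched as a lookup table. This is only an illustration: the table is a small subset, the region subtags (en-GB, de-AT and so on) are an addition that the TeX4ht list above does not carry, and `babel_to_bcp47` is a hypothetical function name.

```lua
-- Illustrative subset of a Babel-name to BCP 47 mapping.
-- The region subtags are an addition for this sketch; the TeX4ht list above
-- maps to bare language codes only.
local babel_to_bcp47_map = {
  american   = "en-US",
  USenglish  = "en-US",
  british    = "en-GB",
  UKenglish  = "en-GB",
  australian = "en-AU",
  canadian   = "en-CA",
  canadien   = "fr-CA",
  french     = "fr-FR",
  austrian   = "de-AT",
  ngerman    = "de-DE",
  brazilian  = "pt-BR",
  czech      = "cs-CZ",
}

-- Returns the BCP 47 tag for a Babel language name, or nil when the name is
-- not in the table, so the caller can choose its own default.
local function babel_to_bcp47(name)
  return babel_to_bcp47_map[name]
end

-- In LuaTeX the name could be read from the macro mentioned in this thread,
-- e.g. token.get_macro("bbl@main@language"); a fixed string is used here so
-- that the snippet runs in plain Lua.
local main_language = "british"
local lang = babel_to_bcp47(main_language) or "en-US"
print(lang)
```

Returning nil for unknown names keeps the fallback decision in one place on the caller's side.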

zepinglee commented 2 years ago

I've enabled lua-uca only for non-English locales, so it has little effect on the running time of the citeproc test suite.
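
A rough sketch of that kind of locale switch follows. It is not the code from the commit: `uca_less` and `select_less` are hypothetical names, and the Lua-UCA calls are limited to the ones that appear in the patch at the top of this thread. The collator factory is passed in as a parameter so the selection logic can be tried without Lua-UCA installed.

```lua
-- Builds a comparator backed by Lua-UCA, using only the calls shown in the
-- patch at the top of this thread. Requires lua-uca to be installed.
local function uca_less(lang)
  local ducet = require "lua-uca.lua-uca-ducet"
  local collator = require "lua-uca.lua-uca-collator"
  local languages = require "lua-uca.lua-uca-languages"
  local collator_obj = collator.new(ducet)
  local tailoring = languages[lang]
  if tailoring then
    -- use language-specific rules when Lua-UCA provides them
    collator_obj = tailoring(collator_obj)
  end
  return function(a, b)
    return collator_obj:compare_strings(a, b)
  end
end

-- Chooses a comparator for a language tag such as "en-GB" or "cs-CZ".
-- English keeps the cheap byte comparison; everything else goes through the
-- injected factory (pass `uca_less` for real use).
local function select_less(lang, make_uca_less)
  local primary = ((lang or "en"):match("^%a+") or "en"):lower()
  if primary == "en" then
    return function(a, b) return a < b end
  end
  return make_uca_less(primary)
end

-- Usage: English never touches the collator.
local less = select_less("en-GB", uca_less)
print(less("apple", "banana"))
```

Creating the collator only on the non-English path avoids the collator.new() cost mentioned earlier in the thread for the bulk of the test suite.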

A simple test case is https://github.com/zepinglee/citeproc-lua/blob/2f99fde61437efce314379cc3a2b211f53cfddf1/test/latex/luatex-2-sort.lvt#L1-L34 along with https://github.com/zepinglee/citeproc-lua/blob/2f99fde61437efce314379cc3a2b211f53cfddf1/test/latex/support/sort.bib#L1-L18

The result is as follows.

https://github.com/zepinglee/citeproc-lua/blob/2f99fde61437efce314379cc3a2b211f53cfddf1/test/latex/luatex-2-sort.tlg#L6

These words are from the lua-uca package documentation.

michal-h21 commented 2 years ago

That sounds great! I've got an error with your example:

Module csl Error: Failed to find "sort.csl". on input line 23

stack traceback:                                                                               
        [C]: in function 'error'
        ...ocal/texlive/2021/texmf-dist/tex/latex/base/ltluatex.lua:109: in field 'mod
ule_error'
        /home/mint/texmf/scripts/csl/csl-core.lua:27: in function 'csl-core.error'
        /home/mint/texmf/scripts/csl/csl-core.lua:56: in function 'csl-core.read_file'

        /home/mint/texmf/scripts/csl/csl-core.lua:114: in function 'csl-core.init'
        /home/mint/texmf/scripts/csl/csl.lua:35: in function 'csl.init'
        [\directlua]:1: in main chunk.
\lua_now:e #1->\__lua_now:n {#1}                                                               

l.23  \begin{document}

I've found that it is because I don't have the sort.csl file, but the error message is a bit cryptic. Anyway, when I tried it with another CSL file, it seems to work OK.

zepinglee commented 2 years ago

> I've found that it is because I don't have the sort.csl file, but the error message is a bit cryptic. Anyway, when I tried it with another CSL file, it seems to work OK.

@michal-h21 It is a unit test file for l3build which should be run with l3build save --config test/latex/config-luatex-2 luatex-2-sort. This command copies test/latex/luatex-2-sort.lvt and all files in test/latex/support/ including sort.csl to build/test-test/latex/config-luatex-2/ before running lualatex.

michal-h21 commented 2 years ago

@zepinglee ah, it works now, thanks. One issue I noticed is that "čáp" is sorted incorrectly. It is because Lua-UCA doesn't support decomposed Unicode letters. It works when I use the normalized character.
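
For readers unfamiliar with the distinction: the same visible word can be stored either with precomposed letters (one code point per accented letter) or in decomposed form (base letter followed by a combining mark). The snippet below only illustrates that the two are different strings, which is why a collator that looks up code points needs normalization. It assumes Lua 5.3 or later, as in LuaTeX, for the \u{...} escapes and the utf8 library.

```lua
-- Requires Lua 5.3+ (as in LuaTeX) for \u{...} escapes and the utf8 library.
local precomposed = "\u{010D}\u{00E1}p"    -- "čáp": č and á as single code points
local decomposed  = "c\u{030C}a\u{0301}p"  -- "čáp": c + caron, a + acute

-- The two render identically but are different strings.
print(precomposed == decomposed)
print(utf8.len(precomposed), utf8.len(decomposed))

-- List the code points of the decomposed form.
for _, cp in utf8.codes(decomposed) do
  io.write(("U+%04X "):format(cp))
end
print()
```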

zepinglee commented 2 years ago

> @zepinglee ah, it works now, thanks. One issue I noticed is that "čáp" is sorted incorrectly. It is because Lua-UCA doesn't support decomposed Unicode letters. It works when I use the normalized character.

I've not heard of precomposed characters before. Are you going to implement that feature in lua-uca?

michal-h21 commented 2 years ago

@zepinglee I've implemented that now in the development version of Lua-UCA. Decomposed characters are now normalized, so "čáp" is correctly sorted after "cihla". I've also added some more caching, I am not sure if it will help for your big test suite, but you can try it.

zepinglee commented 2 years ago

> @zepinglee I've implemented that now in the development version of Lua-UCA. Decomposed characters are now normalized, so "čáp" is correctly sorted after "cihla". I've also added some more caching, I am not sure if it will help for your big test suite, but you can try it.

Thanks for your support! It greatly helps this project.

michal-h21 commented 2 years ago

I am glad I can help! Does it help with the speed? I suspect that maybe not, as normalization itself can take some time. If everything works, I can update Lua-UCA on CTAN.

zepinglee commented 2 years ago

> I am glad I can help! Does it help with the speed? I suspect that maybe not, as normalization itself can take some time. If everything works, I can update Lua-UCA on CTAN.

@michal-h21 It makes no significant difference in speed with the latest version. It's totally ok to publish it to CTAN.

michal-h21 commented 2 years ago

I thought so. I don't know how to change the code to make it faster. I cache everything that could be potentially slow to calculate. Anyway, I will make a new release.

zepinglee / citeproc-lua

Lua-UCA support #3