Support string cluster ids

adomasven commented 5 years ago

Currently cluster ids are limited to numbers. This is purely a compatibility with citeproc-js issue, but an important one since it is likely that many citeproc-js clients have alphanumeric cluster IDs and switching to citeproc-rs requires additional conversion code to support existing documents.

cormacrelf commented 5 years ago

I’ve been thinking about this one.

Strings are slower. Any strings will be interned, but that’s still bunch of overhead, and still has to allocate before interning. In particular, probably makes renumbering slower, and foils any plans to convert wasm function calls away from JSON-serialisation.
Multiple numeric ID types are confusing in languages like JS without type aliases and newtypes.
How hard would it be to build a JS dictionary to convert these?

adomasven commented 5 years ago

In particular, probably makes renumbering slower, and foils any plans to convert wasm function calls away from JSON-serialisation.

Could you elaborate? Don't clusters already contains Cite objects which have string IDs?

Multiple numeric ID types are confusing in languages like JS without type aliases and newtypes.

Not entirely sure what you mean.

How hard would it be to build a JS dictionary to convert these?

This is not a matter of whether it's hard or not, but rather that each individual client of citeproc-js that switches to citeproc-rs will likely have to do this, which while for each individual project is not a problem, will amount to a significant duplicated amount of work across them all. On the other hand if performance is an issue then a Rust dictionary is an option (or a Dictionary in the Driver API code somewhere to pass on strict numbers to Rust).

cormacrelf commented 5 years ago

Multiple numeric ID types are confusing in languages like JS without type aliases and newtypes. Not entirely sure what you mean.

This is a point in favour of using strings, as they are easier to debug in JS when you come across the following genre of bug. Rust:

#[derive(Copy, Clone, Debug)];
struct CiteId(u32);
#[derive(Copy, Clone, Debug)];
struct ClusterId(u32);

fn needs_cite_id(id: CiteId) {}

needs_cite_id(ClusterId(5)); // compile error, expected `NumericId`, got `OtherId`
needs_cite_id(5); // compile error, expected `NumericId`, got `u32`

JavaScript:

let citeId = 5;
let clusterId = 100;
needsCiteId(clusterId); // No errors ever

Don't clusters already contains Cite objects which have string IDs?

Not in the renumbering call.

Every time you pass a string to Rust from JavaScript you have to allocate. (It's not a fast allocator like jemalloc or libc either, it's the small one bundled in the wasm file.) They're not cheap like the small-string-inline-optimised strings JS engines tend to provide. Using numbers gets you back to JS engine speed. With strings, renumbering would require 50 tiny allocations. Yes, renumbering already does 50 tiny allocations to get the cluster number descriptions from JS objects to Rust structs, but (1) it's the JS allocator, (2) this can be removed with e.g. serde-wasm-bindgen. And yes, clusters already do tiny allocations for the reference ids, but (1) not 50 of them in one call, just once each time you update a cite and (2) the processor actually needs to know the strings as References don't have numeric ids, they have strings. You can improve this with a good string cache, which I already do for cite keys, if you also rely on the wasm-bindgen UTF-16 conversion cache. I might do that. But do you see that it's complicated? I'm not going to go making a call permanently slow just to save literally a minute and a half of coding when there are many other options to explore first. I'm trying hard to avoid allocation, and things like this are why the code is as fast as it is now.

I think it will be worthwhile to have a JavaScript wrapper library, that is both fast and easy to use by doing tedious things like this. (Some bits might be improved with some upcoming WASM reference types stuff.) But to write one, you kinda need to see it in in use in real world code. The wasm driver errs on the side of lower-level building blocks, so everything is at least possible.

adomasven commented 5 years ago

This is a point in favour of using strings, as they are easier to debug in JS when you come across the following genre of bug

I'm sorry, I'm still not entirely sure what you mean. I can see how in the Rust example wrong function calls with strict typing prevents a certain type of bug at comiple time, where that is not available in JS. And it is not available whether you use strings or numbers. So what do you mean when you say that strings are easier to debug in JS?

But do you see that it's complicated? I'm not going to go making a call permanently slow just to save literally a minute and a half of coding when there are many other options to explore first. I'm trying hard to avoid allocation, and things like this are why the code is as fast as it is now.

Makes sense

I think it will be worthwhile to have a JavaScript wrapper library, that is both fast and easy to use by doing tedious things like this. [...] But to write one, you kinda need to see it in in use in real world code.

I'm working on a wrapper for citeproc-js calls to citeproc-rs for Zotero so that we can switch out the processor with a pref in the live client for testing purposes. There will be at least one example of real world use soon. Even then though it would make sense for citeproc-rs to provide a solution for JS clients eventually, whether it's some wrapper code in JS or a lower-level thing.

cormacrelf commented 5 years ago

So what do you mean when you say that strings are easier to debug in JS?

It's not important at all, so don't worry, but think ERROR: no such key exists in get_cluster: "cite-12397898" vs ERROR: no such key exists in get_cluster: 157

Even then though it would make sense for citeproc-rs to provide a solution for JS clients eventually, whether it's some wrapper code in JS or a lower-level thing.

Yes. It's valuable and we will definitely use the real world experience to this end.

cormacrelf commented 4 years ago

https://github.com/cormacrelf/wasm-js-struct-array-bench Looks like it's fine. I'll do this.

zotero / citeproc-rs

Support string cluster ids #20