whisperfish / rust-phonenumber

Library for parsing, formatting and validating international phone numbers.
Apache License 2.0

Refactoring `Database`: `const` and faster loading, dropping bincode-loading #63

Open bheylin opened 10 months ago

bheylin commented 10 months ago

At the moment the Database is quite a complex data structure. I would like to make a single static database, but I'm not sure if that fits with the design goals of the library. So in an effort to clarify what the design goals of the library are, I have some questions:

rubdos commented 10 months ago

Related to #26

bheylin commented 10 months ago

After the conversation in #62 I'll list some things to consider doing here:

rubdos commented 10 months ago

@bheylin In order to coordinate the effort a bit: I will make the separate issues asap (hopefully before noon CET), but I wonder which ones you intend to tackle yourself. Are you willing to take some of them on?

bheylin commented 10 months ago

Oh yea for sure I'll take some on :laughing:

Otherwise I'd just be complaining and delegating work, and I'm just not ready for a toxic management role yet :laughing:

I would like to keep the definition of the Database the same for now, use mod loader and Database::from in build.rs, and see where that leads.

I think that will allow us to avoid making breaking changes to the API while moving the Database a step closer to being static.

So in short, give me this task and I'll make several PRs over time that take steps towards a static DB. We can then decide at what point we change the public API.
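For reference, the build.rs side of that usually boils down to emitting a generated source file into OUT_DIR. Everything below is generic scaffolding, not the crate's actual loader API:

// build.rs sketch: emit a generated database.rs into OUT_DIR, to be
// pulled in with include!(concat!(env!("OUT_DIR"), "/database.rs")).
use std::{env, fs, path::Path};

fn main() {
    let out_dir = env::var("OUT_DIR").expect("OUT_DIR is set by cargo");
    let dest = Path::new(&out_dir).join("database.rs");
    // Placeholder output; the real generator would render match-based
    // lookup fns from the metadata loaded via mod loader / Database::from.
    let generated = "pub(crate) const REGION_COUNT: usize = 250;\n";
    fs::write(&dest, generated).expect("failed to write generated source");
    println!("cargo:rerun-if-changed=build.rs");
}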

rubdos commented 10 months ago

That sounds like a good path to follow. Keep the PRs small and concise, commit often, and ask for feedback early! That way we'll reach a consensus with the least friction possible.

bheylin commented 10 months ago

@rubdos I'm taking a look at how we could represent the database and it would help me out if I knew what the minimal API phonenumber needs to support.

From what I see, the critical/domain API is the set of functions format, is_valid, is_viable and parse, and the associated types PhoneNumber, Carrier, Extension, Formatter, NationalNumber, Mode, Type and Validation.

I mentioned before that I would like to make Database private and most likely restructure it so it can load faster. My question now is:

Does the metadata module need to be public? I don't see a use case for the consumer of the library needing it.

By making the Database and metadata types private, we gain the freedom to restructure the Database / metadata layout completely.

bheylin commented 10 months ago

@rubdos If we use build.rs to generate a database.rs file, the existing Database::regions HashMap can be converted into a match-based fn.

fn regions_by_code(code: u16) -> Option<&'static [&'static str]> {
    match code {
        1 => Some(&[
            "US", "AG", "AI", "AS", "BB", "BM", "BS", "CA", "DM", "DO", "GD", "GU", "JM", "KN",
            "KY", "LC", "MP", "MS", "PR", "SX", "TC", "TT", "VC", "VG", "VI",
        ]),
        7 => Some(&["RU", "KZ"]),
        // ... and many more variants
        998 => Some(&["UZ"]),
        _ => None,
    }
} 

The Database::by_code HashMap can likewise be replaced by a match-based fn.

fn id_by_code(country_code: u16) -> Option<&'static str> {
    match country_code {
        // ...
        247 => Some("AC"),
        // ...
        979 => Some("001"),
        // ...
        _ => None,
    }
}

The country_id can then be given to another match-based fn that has the Metadata hardcoded into the match statement.

fn metadata_by_id(country_id: &str) -> Option<Metadata> {
    match country_id {
        "001" => Some(Metadata { /* ... */ }),
        // ... This match statement will be a big fella
        _ => None,
    }
}
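A full code-to-metadata lookup can then chain the two (a hypothetical convenience fn, not part of the current API):

// Hypothetical glue fn: calling code -> main country id -> metadata.
fn metadata_by_code(country_code: u16) -> Option<Metadata> {
    id_by_code(country_code).and_then(metadata_by_id)
}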

The representation of the Metadata will need to change, of course, as we want all static data in that too. That's something I need to play around with.

#[derive(Clone, Debug)]
pub struct Metadata {
    pub(crate) descriptors: Descriptors,
    pub(crate) id: &'static str,
    pub(crate) country_code: u16,

    pub(crate) international_prefix: Option<CachedRegex>,
    pub(crate) preferred_international_prefix: Option<&'static str>,
    pub(crate) national_prefix: Option<&'static str>,
    pub(crate) preferred_extension_prefix: Option<&'static str>,
    pub(crate) national_prefix_for_parsing: Option<CachedRegex>,
    pub(crate) national_prefix_transform_rule: Option<&'static str>,

    pub(crate) formats: &'static [Format],
    pub(crate) international_formats: &'static [Format],
    pub(crate) main_country_for_code: bool,
    pub(crate) leading_digits: Option<CachedRegex>,
    pub(crate) mobile_number_portable: bool,
}

I can build a parallel version of this database representation and compare the performance to HashMap lookups. My hope is that the match statements are converted to jump tables.

I have to think about how to handle the CachedRegex in general.
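One possible shape, purely as a sketch: keep only the pattern string in the static data and compile it once on first use behind a OnceLock, so no regex work happens at load time.

use std::sync::OnceLock;
use regex::Regex;

// Sketch: a lazily compiled regex whose pattern lives in 'static data.
// Compilation happens once, on first use, instead of at database load.
pub struct LazyRegex {
    pattern: &'static str,
    compiled: OnceLock<Regex>,
}

impl LazyRegex {
    pub const fn new(pattern: &'static str) -> Self {
        Self { pattern, compiled: OnceLock::new() }
    }

    pub fn get(&self) -> &Regex {
        self.compiled
            .get_or_init(|| Regex::new(self.pattern).expect("pattern checked at build time"))
    }
}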

rubdos commented 9 months ago

At first, I would avoid breaking the API. Maybe we can decorate some things with #[deprecated] at some point, but even that I would do in a follow-up version I'd say.

In your proposal, I would be 99% sure that those would become jump tables. Then again, as far as I understood, the performance problem is in loading the Database, not in looking things up? If so, I would do a first iteration where you keep the original format (again, backwards compatibility, although those are private types?), and create some criterion benchmarks around that format.

If the representation of Metadata needs to change, let's definitely not enforce 'static, because then a dynamic database will not be possible and this will break the API.

The regex can probably also be done on compile time; the regex crate is quite good :-)

bheylin commented 9 months ago

Yea I've made a branch locally to test out the ramifications of replacing the HashMaps and as expected it involves a lot of churn.

The point of the match / 'static data as HashMap replacements is not only lookup time: the creation of the various HashMaps involves a fair amount of heap allocations. I'm presuming that this is slowing down the init phase, but I haven't measured that yet.

I'm thinking about making a trimmed-down Database-like type that has no RegexCache, as RegexCache cannot be serialized. This partial database can then be serialized to database.bin. The database.bin then contains a Vec<Metadata> which is processed into a Database at first load. I'm presuming that all the HashMap creation is causing the slowdown.
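A rough sketch of that split, with hypothetical names (the real field set would mirror Metadata, minus anything compiled):

use serde::{Deserialize, Serialize};

// Sketch: the serializable subset of the metadata. Regexes stay as
// pattern strings here and are only compiled when the full, cached
// Database is built on first load.
#[derive(Serialize, Deserialize)]
struct RawMetadata {
    id: String,
    country_code: u16,
    international_prefix: Option<String>, // pattern string, not compiled
    // ... the remaining Metadata fields, all regex-free
}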

What do you think about that as a first step?

rubdos commented 9 months ago

cannot be serialized.

I thought the whole point here was to get rid of the (de)serialization step? Deserialization also does not necessarily imply that the binary representation can be directly worked with in Rust. There's no issue with having some processing time for the deserialization, as long as the default database can be accessed "immediately" by means of const.

Now, a HashMap cannot be const-initialized, as far as I can see, so that might be an actual issue. However, maybe we can make the Database more generic, and replace the HashMap by a generic instantiation of a key-value lookup that we can const-initialize?
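One concrete shape that could take, as a sketch: a sorted 'static slice behind a thin wrapper, which is const-constructible and uses binary search for lookups.

// Sketch: a const-initializable key-value lookup over a sorted slice.
// Entries must be sorted by key for binary_search_by to be valid.
pub struct SliceMap<K: 'static, V: 'static> {
    entries: &'static [(K, V)],
}

impl<K: Ord, V> SliceMap<K, V> {
    pub const fn new(entries: &'static [(K, V)]) -> Self {
        Self { entries }
    }

    pub fn get(&self, key: &K) -> Option<&V> {
        self.entries
            .binary_search_by(|(k, _)| k.cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}

// Usage sketch with made-up entries:
static ID_BY_CODE: SliceMap<u16, &str> = SliceMap::new(&[(1, "US"), (7, "RU"), (247, "AC")]);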

I'm presuming that all the HashMap creation is causing the slowdown.

I would really like to see some flame graphs or other profiling data before I agree with that. You cannot optimize something that you didn't measure. Maybe that would be a good first step: to make a few Criterion benchmarks around the current Database loading code, and then gather some profiling data?

bheylin commented 9 months ago

I thought the whole point here was to get rid of the (de)serialization step?

Yes, the whole point is to get rid of deserialization, but that cannot be done while we use a Database that is HashMap-based. I know what I want to do ultimately, in terms of using matches as outlined above. But that will mangle the API.

Be aware that with the current exposure of so many types, avoiding breaking the API is almost impossible.

That's why I would like to make almost all types/functions private (through deprecation), unless there is a demonstrated need for them to be public.

Any change I make to any public type is an API breaking change. Database, Metadata and all types they depend on are currently public.

I would really like to see some flame graphs or other profiling data before I agree with that. You cannot optimize something that you didn't measure. Maybe that would be a good first step: to make a few Criterion benchmarks around the current Database loading code, and then gather some profiling data?

Yea I agree that profiling the loading of the Database should be performed first.

So the first thing I'll do is make a benchmark of loading the database. I've been using divan instead of criterion lately; the API is friendlier.
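For reference, the skeleton of such a benchmark in divan is tiny. The parse call below is just one way to force the lazy database load on the first iteration; it's a sketch, not the branch's actual benchmark code:

// benches/load_db.rs: divan skeleton sketch.
fn main() {
    divan::main();
}

// Parsing forces the lazy_static database load on the first iteration;
// a dedicated loading entry point would isolate the load itself.
#[divan::bench]
fn parse_number() -> phonenumber::PhoneNumber {
    phonenumber::parse(None, divan::black_box("+14155552671")).unwrap()
}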

rubdos commented 9 months ago

I've been using divan instead of criterion lately; the API is friendlier.

Ack on divan! Haven't used it yet, but I have heard about it enough :-)

bheylin commented 9 months ago

@rubdos I've made a branch with benchmarks for loading Metadata using the existing hashmap impl (benches/load_hashmap_db.rs) and another benchmark loading a hypothetical static db (benches/load_static_db.rs). The static data used in load_static_db.rs is generated using bin/gen_static_db.rs.

This is just to get an idea of the difference in loading time between the same data, but defined in different ways.

Below are the results from the benchmarks on my machine.

The get_metadata_for_country_id benchmarks are interesting as the static versions are much slower than the hashmap versions. I guess that using a string as a key in the match is resulting in less than optimal code.

And the get_metadata_for_country_code_hot shows the difference between the initial lazy_static load (get_metadata_for_country_code) and the performance when 'hot'/loaded.

~ cargo bench

...

     Running benches/load_hashmap_db.rs (target/release/deps/load_hashmap_db-1274ff25527c9f01)
Timer precision: 838 ns
load_hashmap_db                       fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ get_metadata_for_country_code      907.7 ns      │ 59.9 ms       │ 1.395 µs      │ 600.3 µs      │ 100     │ 100
├─ get_metadata_for_country_code_hot  12.82 ns      │ 15.75 ns      │ 13.52 ns      │ 13.64 ns      │ 100     │ 819200
├─ get_metadata_for_country_id        2.702 ns      │ 3.355 ns      │ 2.869 ns      │ 2.876 ns      │ 100     │ 3276800
╰─ get_regionsfor_country_code        14.63 ns      │ 16.83 ns      │ 15.26 ns      │ 15.53 ns      │ 100     │ 819200

     Running benches/load_static_db.rs (target/release/deps/load_static_db-261ff7f611e74b8a)
Timer precision: 838 ns
load_static_db                    fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ get_metadata_for_country_code  1.486 ns      │ 1.821 ns      │ 1.56 ns       │ 1.578 ns      │ 100     │ 6553600
├─ get_metadata_for_country_id    11.53 ns      │ 13.95 ns      │ 12.27 ns      │ 12.33 ns      │ 100     │ 819200
╰─ get_regionsfor_country_code    1.277 ns      │ 1.596 ns      │ 1.375 ns      │ 1.377 ns      │ 100     │ 6553600

I've also made two bins ready for flamegraphing, but for some reason cargo-flamegraph is returning an error. I'll figure that out and post the flamegraphs soon.

bheylin commented 9 months ago

It's a little late here, but I've taken a quick look at the load_hashmap_db.svg. From what I see most of the time is spent parsing/allocating regex. Comparing the load_hashmap_db run to the load_static_db run is unfair, as there's no regex parsing going on in the latter.

load_hashmap_db flamegraph

bheylin commented 9 months ago

The load_static_db::get_metadata_for_country_id is slow because it compiles down to a hardcoded list of str comparisons.

I rewrote the fn to be a two-stage match on the ASCII values of the first and second chars of the country id. This brings the performance in line with the other fns that use a numerical match.
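Roughly, the rewritten dispatch looks like this (a sketch with only a few illustrative ids; the generated version covers all of them):

// Sketch of the two-stage dispatch: match on the first byte, then the
// second, so the compiler can emit jump tables instead of a long chain
// of string comparisons.
fn metadata_index(country_id: &str) -> Option<usize> {
    let &[a, b] = country_id.as_bytes() else {
        return None; // two-letter region ids; ids like "001" handled separately
    };
    match a {
        b'A' => match b {
            b'C' => Some(0), // "AC"
            _ => None,
        },
        b'U' => match b {
            b'S' => Some(1), // "US"
            b'Z' => Some(2), // "UZ"
            _ => None,
        },
        _ => None,
    }
}

Updated benchmark numbers: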

     Running benches/load_hashmap_db.rs (target/release/deps/load_hashmap_db-896eaa6c73f121ae)
Timer precision: 489 ns
load_hashmap_db                       fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ get_metadata_for_country_code      906.7 ns      │ 60.11 ms      │ 1.152 µs      │ 602.2 µs      │ 100     │ 100
├─ get_metadata_for_country_code_hot  12.83 ns      │ 16.51 ns      │ 14.28 ns      │ 14.09 ns      │ 100     │ 409600
├─ get_metadata_for_country_id        2.66 ns       │ 3.233 ns      │ 2.831 ns      │ 2.868 ns      │ 100     │ 3276800
╰─ get_regionsfor_country_code        14.67 ns      │ 19.53 ns      │ 15.91 ns      │ 15.77 ns      │ 100     │ 409600

     Running benches/load_static_db.rs (target/release/deps/load_static_db-8e00b627b68c2d24)
Timer precision: 838 ns
load_static_db                    fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ get_metadata_for_country_code  1.731 ns      │ 1.997 ns      │ 1.809 ns      │ 1.822 ns      │ 100     │ 6553600
├─ get_metadata_for_country_id    1.733 ns      │ 2.057 ns      │ 1.832 ns      │ 1.834 ns      │ 100     │ 6553600
╰─ get_regionsfor_country_code    1.312 ns      │ 1.628 ns      │ 1.389 ns      │ 1.401 ns      │ 100     │ 6553600

nvzqz commented 9 months ago

@bheylin I noticed each benchmark has different timer precision. Please let me know if some numbers seem wonky as a result of benchmarks switching between CPUs: https://github.com/nvzqz/divan/issues/23.

Also, glad to hear you enjoy the API's friendliness 😄

rubdos commented 9 months ago

From what I see most of the time is spent parsing/allocating regex.

Looks like it indeed! What do you propose going forward? I'm still quite sure we can find a way to do more on compile time. Since we will probably break the API slightly, this should probably go together with #66.

bheylin commented 9 months ago

Oh, and sorry about the symbol names; my local install of perf is not demangling symbol names when used with flamegraph (I need to downgrade perf).

Ah yes @ds-cbo notified me about the merge (we work in the same company :laughing:)

I refactored the project into a workspace locally with the aim of building various solutions and profiling them.

I want to try out a const regex solution, as my hunch is that the regexes used in the metadata are pretty basic and so don't require the full, heap-allocating goodness of Regex.

Something like https://crates.io/crates/proc-macro-regex could work, but it doesn't support matches. One way or the other, if we can avoid using Regex and RegexCache, I expect that the HashMap init time will drop significantly. So the only change to the API would be replacing CachedRegex with whatever type the Regex replacement uses.

But in general there is a lot we can do at compile time, it's more a question of how abrupt a change you want to make next release.

I also still have to see what the binary size is when including all the metadata in the Rust code.

Another idea is to provide more features to trim the data down, for example, not including geolocation data. It's also possible to have both a heap-based and stack-based metadata feature.
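In code that would be plain feature gating; the feature name below is invented for illustration:

// Sketch: gate optional metadata behind a hypothetical "geocoding"
// Cargo feature, so builds that don't need it don't carry the tables.
#[cfg(feature = "geocoding")]
pub(crate) static GEO_DESCRIPTIONS: &[(&str, &str)] =
    &[/* generated (prefix, description) pairs */];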

rubdos commented 9 months ago

Ah yes @ds-cbo notified me about the merge (we work in the same company 😆)

Small world :')

But in general there is a lot we can do at compile time, it's more a question of how abrupt a change you want to make next release.

As long as most of the API stays the same, I think we'll be fine. We can make 0.4 a nice and breaking release, while maintaining the most-used API syntactically compatible. I think that should be fine :-)

I also still have to see what the binary size is when including all the metadata in the Rust code.

Well, it's in there either way, no?

Another idea is to provide more features to trim the data down, for example, not including geolocation data.

That sounds useful indeed; hide the APIs behind nice orthogonal feature flags.

It's also possible to have both a heap-based and stack-based metadata feature.

I don't know, I think we will conclude that one specific approach is "the best" in the end.

bheylin commented 9 months ago

I also still have to see what the binary size is when including all the metadata in the Rust code.

Well, it's in there either way, no?

It's more that if someone wants to use the lib on a device with limited stack memory, they may want to use the heap version. But I'm not sure if that's a realistic situation.