bheylin opened this issue 10 months ago
Related to #26
After the conversation in #62 I'll list some things to consider doing here:
Deprecate any use of `Database` in the API, namely the three fns `format_with`, `is_valid_with` and `parse_with`. This will allow us to change the `Database` structure without breaking the API. My concern here is that people are only using a custom `Database` because the default internal `Database` loads so slowly. The deprecation should prompt feedback from the projects/people using this lib, and give the maintainers time to discuss the reasons for deprecation.
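A sketch of what that deprecation could look like (the version, note text and signature here are placeholders written from memory, not the actual code):

```rust
// Sketch: only the attribute is new; the existing body stays unchanged.
#[deprecated(
    since = "0.4.0",
    note = "passing a custom `Database` is being phased out; use `parse` instead"
)]
pub fn parse_with<S: AsRef<str>>(
    database: &Database,
    country: Option<country::Id>,
    string: S,
) -> Result<PhoneNumber, Error> {
    todo!("the real body stays as-is")
}
```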
Remove the use of `lazy_static` / `once_cell` (#60) and generate a const `Database` using the existing `build.rs`. This will embed the `Database` into the lib binary transparently, allowing for:

- Removing the `lazy_static` dependency completely
- Moving `bincode` and `quick-xml` from the `[dependencies]` section to `build-dependencies`
- The `Database` to load as fast as a lib binary can load (solving #26)

@bheylin In order to coordinate the effort a bit: I will make the separate issues asap (hopefully before noon CET), but I wonder which ones you intend to tackle yourself. Are you willing to take some of them on?
Oh yea for sure I'll take some on :laughing:
Otherwise I'd just be complaining and delegating work, and I'm just not ready for a toxic management role yet :laughing:
I would like to keep the definition of the `Database` the same for now, use `mod loader` and `Database::from` in the `build.rs`, and see where that leads. I think that will allow us to avoid making breaking changes to the API while moving the `Database` a step closer to being static.
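Roughly this shape, I mean (a sketch; `loader::load` and `generate_rust_source` are stand-ins for whatever `mod loader` actually exposes):

```rust
// build.rs — sketch: parse the XML at build time and emit Rust source
// into OUT_DIR instead of a bincode blob.
use std::{env, fs, path::Path};

fn main() {
    println!("cargo:rerun-if-changed=assets/PhoneNumberMetadata.xml");

    let metadata = loader::load("assets/PhoneNumberMetadata.xml"); // stand-in
    let source = generate_rust_source(&metadata); // stand-in

    let out = Path::new(&env::var("OUT_DIR").unwrap()).join("database.rs");
    fs::write(out, source).unwrap();
}
```

The lib side would then just `include!(concat!(env!("OUT_DIR"), "/database.rs"));`.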
So in short: give me this task and I'll make several PRs over time that take steps towards a static DB. We can then decide at what point we change the public API.
That sounds like a good path to follow. Keep the PRs small and concise, commit often, and ask for feedback early! That way we'll reach a consensus with the least friction possible.
@rubdos I'm taking a look at how we could represent the database, and it would help me out if I knew what the minimal API `phonenumber` needs to support.

From what I see, the critical/domain API is the set of functions `format`, `is_valid`, `is_viable` and `parse`, and the associated types `PhoneNumber`, `Carrier`, `Extension`, `Formatter`, `NationalNumber`, `Mode`, `Type` and `Validation`.
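i.e. roughly the README example (quoted from memory, so the exact calls may be slightly off):

```rust
use phonenumber::{country, Mode};

fn main() {
    let number = phonenumber::parse(Some(country::NL), "+31612345678").unwrap();
    if phonenumber::is_valid(&number) {
        println!("{}", number.format().mode(Mode::National));
    }
}
```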
I mentioned before that I would like to make `Database` private and most likely restructure it so it can load faster. My question now is: does the `metadata` module need to be public? I don't see a use case for the consumer of the library needing it. Making the `Database` and `metadata` types private gives us the freedom to restructure the `Database` / metadata layout completely.
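In `lib.rs` terms that would just be (assuming the module is currently declared `pub`):

```rust
// lib.rs: drop the metadata module from the public API surface.
pub(crate) mod metadata;
```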
@rubdos If we use `build.rs` to generate a `database.rs` file, the existing `Database::regions` `HashMap` can be converted into a `match`-based fn.
```rust
fn regions_by_code(code: u16) -> Option<&'static [&'static str]> {
    match code {
        1 => Some(&[
            "US", "AG", "AI", "AS", "BB", "BM", "BS", "CA", "DM", "DO", "GD", "GU", "JM", "KN",
            "KY", "LC", "MP", "MS", "PR", "SX", "TC", "TT", "VC", "VG", "VI",
        ]),
        7 => Some(&["RU", "KZ"]),
        // ... and many more variants
        998 => Some(&["UZ"]),
        _ => None,
    }
}
```
The `Database::by_code` `HashMap` can be replaced by a `match`-based fn.
```rust
fn id_by_code(country_code: u16) -> Option<&'static str> {
    match country_code {
        // ...
        247 => Some("AC"),
        // ...
        979 => Some("001"),
        // ...
        _ => None,
    }
}
```
The `country_id` can then be given to another `match`-based fn that has the `Metadata` hardcoded into the match statement.
```rust
fn metadata_by_id(country_id: &str) -> Option<Metadata> {
    match country_id {
        "001" => Some(Metadata { /* ... */ }),
        // ... this match statement will be a big fella
        _ => None,
    }
}
```
The representation of the `Metadata` will need to change ofc, as we want all the static data in there too. That's something I need to play around with.
```rust
#[derive(Clone, Debug)]
pub struct Metadata {
    pub(crate) descriptors: Descriptors,
    pub(crate) id: &'static str,
    pub(crate) country_code: u16,
    pub(crate) international_prefix: Option<CachedRegex>,
    pub(crate) preferred_international_prefix: Option<&'static str>,
    pub(crate) national_prefix: Option<&'static str>,
    pub(crate) preferred_extension_prefix: Option<&'static str>,
    pub(crate) national_prefix_for_parsing: Option<CachedRegex>,
    pub(crate) national_prefix_transform_rule: Option<&'static str>,
    pub(crate) formats: &'static [Format],
    pub(crate) international_formats: &'static [Format],
    pub(crate) main_country_for_code: bool,
    pub(crate) leading_digits: Option<CachedRegex>,
    pub(crate) mobile_number_portable: bool,
}
```
I can build a parallel version of this database representation and compare the performance to `HashMap` lookups. My hope is that the match statements are converted to jump tables.

I have to think about how to handle the `CachedRegex` in general.
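One shape that might work, sketched here with `std::sync::OnceLock` (the `StaticRegex` name and layout are made up for illustration):

```rust
use std::sync::OnceLock;
use regex::Regex;

/// Hypothetical `CachedRegex` replacement: a `&'static` pattern that is
/// compiled once, on first use.
pub struct StaticRegex {
    pattern: &'static str,
    compiled: OnceLock<Regex>,
}

impl StaticRegex {
    // `OnceLock::new` is const, so this can live in generated static data.
    pub const fn new(pattern: &'static str) -> Self {
        StaticRegex { pattern, compiled: OnceLock::new() }
    }

    pub fn get(&self) -> &Regex {
        self.compiled
            .get_or_init(|| Regex::new(self.pattern).expect("pattern validated at build time"))
    }
}
```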
At first, I would avoid breaking the API. Maybe we can decorate some things with `#[deprecated]` at some point, but even that I would do in a follow-up version, I'd say.

In your proposal, I would be 99% sure that those would become jump tables. Then again, as far as I understood, the performance problem is in loading the `Database`, not in looking things up? If so, I would do a first iteration where you keep the original format (again, backwards compatibility, although those are private types?), and create some criterion benchmarks around that format.

If the representation of `Metadata` needs to change, let's definitely not enforce `'static`, because then a dynamic database would not be possible and this would break the API.

The regex can probably also be done at compile time; the `regex` crate is quite good :-)
Yea, I've made a branch locally to test out the ramifications of replacing the `HashMap`s, and as expected it involves a lot of churn.

The point of the `match` / `'static` data as `HashMap` replacements is not only about lookup time; the creation of the various `HashMap`s involves a fair amount of heap allocations. I'm presuming that this is slowing down the init phase, but I haven't measured that yet.

I'm thinking about making a trimmed-down `Database`-like type that has no `RegexCache`, as `RegexCache` cannot be serialized. This partial database can then be serialized to the `database.bin`. The `database.bin` then contains a `Vec<Metadata>`, which is processed into a `Database` at first load. I'm presuming that all the `HashMap` creation is causing the slowdown.
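As a sketch (the exact layout is up for grabs):

```rust
use serde::{Deserialize, Serialize};

// Sketch: a regex-free stand-in for `Database` holding only the
// serializable parts; the `RegexCache` and the lookup maps would be
// rebuilt at first load.
#[derive(Serialize, Deserialize)]
pub(crate) struct PartialDatabase {
    pub(crate) metadata: Vec<Metadata>,
}
```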
What do you think about that as a first step?
> cannot be serialized.

I thought the whole point here was to get rid of the (de)serialization step? Deserialization also does not necessarily imply that the binary representation can be directly worked with in Rust. There's no issue with having some processing time for the deserialization, as long as the default database can be accessed "immediately" by means of `const`.
Now, a `HashMap` cannot be `const`-initialized, as far as I can see, so that might be an actual issue. However, maybe we can make the `Database` more generic, and replace the `HashMap` by a generic instantiation of a key-value lookup that we can `const`-initialize?
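Something in this direction, perhaps (just a sketch of the idea):

```rust
/// A const-initializable key-value lookup: entries sorted by key,
/// looked up with a binary search.
pub struct SortedSliceMap<K, V> {
    entries: &'static [(K, V)],
}

impl<K: Ord, V> SortedSliceMap<K, V> {
    pub const fn new(entries: &'static [(K, V)]) -> Self {
        Self { entries }
    }

    pub fn get(&self, key: &K) -> Option<&V> {
        self.entries
            .binary_search_by(|(k, _)| k.cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}

// Unlike a HashMap, this can be fully static-initialized:
static REGIONS: SortedSliceMap<u16, &[&str]> =
    SortedSliceMap::new(&[(1, &["US", "CA"]), (7, &["RU", "KZ"])]);
```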
> I'm presuming that all the HashMap creation is causing the slowdown.

I would really like to see some flame graphs or other profiling data before I agree with that. You cannot optimize something that you didn't measure. Maybe that would be a good first step: to make a few Criterion benchmarks around the current Database loading code, and then gather some profiling data?
> I thought the whole point here was to get rid of the (de)serialization step?

Yes, the whole point is to get rid of deserialization, but that cannot be done while we use a `Database` that is `HashMap`-based.
I know what I want to do ultimately, in terms of using matches as outlined above. But that will mangle the API.
Be aware that with the current exposure of so many types, avoiding breaking the API is almost impossible.
That's why I would like to make almost all types/functions private (through deprecation), unless there is a demonstrated need for them to be public.
Any change I make to any public type is an API-breaking change. `Database`, `Metadata` and all the types they depend on are currently public.
> I would really like to see some flame graphs or other profiling data before I agree with that. You cannot optimize something that you didn't measure. Maybe that would be a good first step: to make a few Criterion benchmarks around the current Database loading code, and then gather some profiling data?

Yea, I agree that profiling the loading of the `Database` should be performed first. So the first thing I'll do is make a benchmark of loading the database.

I've been using `divan` instead of `criterion` lately; the API is friendlier.
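Shape-wise a divan bench is pleasantly small, something like this (what exactly to measure is still open; the body here is a placeholder):

```rust
// benches/load_db.rs
fn main() {
    divan::main();
}

#[divan::bench]
fn parse_number() {
    // Note: only the very first iteration pays the lazy_static
    // Database initialization cost; later ones run "hot".
    let number = phonenumber::parse(None, "+31612345678").unwrap();
    divan::black_box(number);
}
```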
> I've been using `divan` instead of `criterion` lately; the API is friendlier.

Ack on `divan`! Haven't used it yet, but I have heard about it enough :-)
@rubdos I've made a branch with benchmarks for loading `Metadata` using the existing hashmap impl (`benches/load_hashmap_db.rs`) and another benchmark loading a hypothetical static db (`benches/load_static_db.rs`). The static data used in `load_static_db.rs` is generated using `bin/gen_static_db.rs`.

This is just to get an idea of the difference in loading time between the same data defined in different ways. Below are the results from the benchmarks on my machine.

The `get_metadata_for_country_id` benchmarks are interesting, as the static versions are much slower than the hashmap versions. I guess that using a string as a key in the `match` results in less than optimal code. And `get_metadata_for_country_code_hot` shows the difference between the initial `lazy_static` load (`get_metadata_for_country_code`) and the performance when 'hot'/loaded.
```
~ cargo bench
...
     Running benches/load_hashmap_db.rs (target/release/deps/load_hashmap_db-1274ff25527c9f01)
Timer precision: 838 ns
load_hashmap_db                        fastest   │ slowest  │ median   │ mean     │ samples │ iters
├─ get_metadata_for_country_code       907.7 ns  │ 59.9 ms  │ 1.395 µs │ 600.3 µs │ 100     │ 100
├─ get_metadata_for_country_code_hot   12.82 ns  │ 15.75 ns │ 13.52 ns │ 13.64 ns │ 100     │ 819200
├─ get_metadata_for_country_id         2.702 ns  │ 3.355 ns │ 2.869 ns │ 2.876 ns │ 100     │ 3276800
╰─ get_regionsfor_country_code         14.63 ns  │ 16.83 ns │ 15.26 ns │ 15.53 ns │ 100     │ 819200

     Running benches/load_static_db.rs (target/release/deps/load_static_db-261ff7f611e74b8a)
Timer precision: 838 ns
load_static_db                         fastest   │ slowest  │ median   │ mean     │ samples │ iters
├─ get_metadata_for_country_code       1.486 ns  │ 1.821 ns │ 1.56 ns  │ 1.578 ns │ 100     │ 6553600
├─ get_metadata_for_country_id         11.53 ns  │ 13.95 ns │ 12.27 ns │ 12.33 ns │ 100     │ 819200
╰─ get_regionsfor_country_code         1.277 ns  │ 1.596 ns │ 1.375 ns │ 1.377 ns │ 100     │ 6553600
```
I've also made two bins ready for flamegraphing, but for some reason cargo-flamegraph is returning an error. I'll figure that out and post the flamegraphs soon.
It's a little late here, but I've taken a quick look at the `load_hashmap_db.svg`. From what I see, most of the time is spent parsing/allocating regexes. Comparing the `load_hashmap_db` run to the `load_static_db` run is unfair, as there's no regex parsing going on in the latter.

The `load_static_db::get_metadata_for_country_id` fn is slow because it compiles down to a hardcoded list of str comparisons. I rewrote the fn to be a two-stage match on the ascii values of the first and second chars of the country id. This brings the performance in line with the other fns that use a numerical match.
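i.e. roughly this shape (`metadata_ac` and friends stand in for generated constructors):

```rust
fn metadata_by_id(country_id: &str) -> Option<Metadata> {
    // Non-geographical ids like "001" would need their own arms before this.
    let &[a, b] = country_id.as_bytes() else {
        return None;
    };
    match a {
        b'A' => match b {
            b'C' => Some(metadata_ac()), // hypothetical generated fn
            b'D' => Some(metadata_ad()),
            // ...
            _ => None,
        },
        // ... one outer arm per first letter
        _ => None,
    }
}
```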
```
     Running benches/load_hashmap_db.rs (target/release/deps/load_hashmap_db-896eaa6c73f121ae)
Timer precision: 489 ns
load_hashmap_db                        fastest   │ slowest  │ median   │ mean     │ samples │ iters
├─ get_metadata_for_country_code       906.7 ns  │ 60.11 ms │ 1.152 µs │ 602.2 µs │ 100     │ 100
├─ get_metadata_for_country_code_hot   12.83 ns  │ 16.51 ns │ 14.28 ns │ 14.09 ns │ 100     │ 409600
├─ get_metadata_for_country_id         2.66 ns   │ 3.233 ns │ 2.831 ns │ 2.868 ns │ 100     │ 3276800
╰─ get_regionsfor_country_code         14.67 ns  │ 19.53 ns │ 15.91 ns │ 15.77 ns │ 100     │ 409600

     Running benches/load_static_db.rs (target/release/deps/load_static_db-8e00b627b68c2d24)
Timer precision: 838 ns
load_static_db                         fastest   │ slowest  │ median   │ mean     │ samples │ iters
├─ get_metadata_for_country_code       1.731 ns  │ 1.997 ns │ 1.809 ns │ 1.822 ns │ 100     │ 6553600
├─ get_metadata_for_country_id         1.733 ns  │ 2.057 ns │ 1.832 ns │ 1.834 ns │ 100     │ 6553600
╰─ get_regionsfor_country_code         1.312 ns  │ 1.628 ns │ 1.389 ns │ 1.401 ns │ 100     │ 6553600
```
@bheylin I noticed each benchmark has different timer precision. Please let me know if some numbers seem wonky as a result of benchmarks switching between CPUs: https://github.com/nvzqz/divan/issues/23.
Also, glad to hear you enjoy the API's friendliness 😄
> From what I see, most of the time is spent parsing/allocating regexes.

Looks like it indeed! What do you propose going forward? I'm still quite sure we can find a way to do more at compile time. Since we will probably break the API slightly, this should probably go together with #66.
Oh, and sorry about the symbol names; my local install of `perf` is not demangling symbol names when used with flamegraph (I need to downgrade perf).
Ah yes @ds-cbo notified me about the merge (we work in the same company :laughing:)
I refactored the project into a workspace locally, with the aim of building various solutions and profiling them.

I want to try out a const regex solution, as my hunch is that the regexes used in the metadata are pretty basic, and so they don't require the full, heap-allocating goodness of `Regex`. Something like https://crates.io/crates/proc-macro-regex could work, but it doesn't support matches.

One way or the other, if we can avoid using `Regex` and `RegexCache`, I expect that the `HashMap` init time will drop significantly. Then the only change to the API would be replacing `CachedRegex` with whatever type the `Regex` replacement uses.
But in general there is a lot we can do at compile time; it's more a question of how abrupt a change you want to make in the next release. I also still have to see what the binary size is when including all the metadata in the Rust code.

Another idea is to provide more features to trim the data down, for example, not including geolocation data. It's also possible to have both a heap-based and a stack-based metadata feature.
> Ah yes @ds-cbo notified me about the merge (we work in the same company 😆)

Small world :')
> But in general there is a lot we can do at compile time; it's more a question of how abrupt a change you want to make in the next release.

As long as most of the API stays the same, I think we'll be fine. We can make 0.4 a nice and breaking release, while maintaining the most-used API syntactically compatible. I think that should be fine :-)
> I also still have to see what the binary size is when including all the metadata in the Rust code.

Well, it's in there either way, no?
> Another idea is to provide more features to trim the data down, for example, not including geolocation data.

That sounds useful indeed; hide the APIs behind nice orthogonal feature flags.
> It's also possible to have both a heap-based and a stack-based metadata feature.

I don't know, I think we will conclude that one specific approach is "the best" in the end.
> > I also still have to see what the binary size is when including all the metadata in the Rust code.
>
> Well, it's in there either way, no?

It's more about: if someone wants to use the lib on a device with limited stack memory, they may want to use the heap version. But I'm not sure that's a realistic situation.
At the moment the `Database` is quite a complex data structure. I would like to make a single static database, but I'm not sure if that fits with the design goals of the library. So in an effort to clarify what the design goals of the library are, I have some questions:

- Is there a concrete use case for loading a database that's not the default? The fn `parse_with` accepts a `database` param; is that used by the community?
- Can the `Database` become a static structure represented as Rust source code generated from the `assets/PhoneNumberMetadata.xml`?
- The bincode (`Vec<Metadata>`) is processed on first use (through `lazy_static` at time of writing) into a `Database`, which is a composite of a `RegexCache` and three `HashMap`s. This cache and `HashMap` generation has quite the runtime cost. It looks like it could be created offline and then stored as bincode; then it's simply a matter of loading the database at runtime (see the sketch below).
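For reference, the loading path described in that last point looks roughly like this (a sketch; the exact include path and names differ in the real code):

```rust
use lazy_static::lazy_static;

lazy_static! {
    static ref DATABASE: Database = {
        // The expensive part: deserialize the bincode, then build the
        // RegexCache plus the three HashMaps, all on first access.
        let metadata: Vec<Metadata> =
            bincode::deserialize(include_bytes!(concat!(env!("OUT_DIR"), "/database.bin")))
                .unwrap();
        Database::from(metadata)
    };
}
```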