whisperfish / rust-phonenumber

Library for parsing, formatting and validating international phone numbers.
Apache License 2.0
162 stars 55 forks source link

Move `database.bin` generation to an `xtask` #62

Closed bheylin closed 11 months ago

bheylin commented 11 months ago

The database.bin is currently generated in the build.rs. This is unnecessary and has a few downsides.

I propose that the generation of database.bin is moved into an xtask.

cargo-xtask is way to add free-form automation to a Rust project, a-la make, npm run or bespoke bash scripts.

The two distinguishing features of xtask are:

It doesn't require any other binaries besides cargo and rustc, it fully bootstraps from them
Unlike bash, it can more easily be cross platform, as it doesn't use the shell.

This means restructuring the project into a workspace where one of the members is the xtask package. But the benefits are:

fabricedesre commented 11 months ago
  • build.rs files can be seen as suspicious. A build.rs file executing on my local machine may be outside of the risks I'm willing to take

Maybe playing devil's advocate (I'm not familiar with xtask) but how is that different from running the task from a risk perspective?

bheylin commented 11 months ago
  • build.rs files can be seen as suspicious. A build.rs file executing on my local machine may be outside of the risks I'm willing to take

Maybe playing devil's advocate (I'm not familiar with xtask) but how is that different from running the task from a risk perspective?

Fair question. From a risk perspective...

rubdos commented 11 months ago

I think I disagree with your security model. If you mistrust a rather simple and straightforward build.rs file supplied by the rust-phonenumber developers, then I would also mistrust a checked in binary file that is manually updated (although verifiable), and I would also mistrust anything that remotely touched proc-macros.

I really don't think the benefit in terms of your security model and build time outweighs the mental overhead required for us to maintain an xtask package and to run the xtask whenever the xml changes. If I need to weigh "manually running and maintaining a separate package for the sake of loading a file" against "38 lines of build.rs", I think I know which one wins.

That said, I agree with the sentiment that a bincode-formatted file that is used for nearly all users is not optimal, which you report in #63. I'll comment on that tomorrow!

bheylin commented 11 months ago

Hi @rubdos

I get that the build.rs reduces the maintenance work, and I take what you say about this seriously. I appreciate the work you've put into developing this library and my aim is not to trivialize your time or make your life harder :smile_cat:

My issue with the build.rs is more that it's work that's needlessly happening on my machine, and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

We can talk back and forth about security, but that's more a symptom of the issue. My aim here is to only reference the critical dependencies in the published lib.

This ties in with issue #64 too. As depending on the outcome of that conversation, maybe we can shift the dependencies needed for creating the database into another package. We certainly don't require quick_xml to parse a phonenumber once the database exists. :smile_cat:

In short, I would like to split the existing lib code into two parts;

Let's withhold talking about the implementation details of the data processing task and instead imagine it as a task much like dependabot. That's automatically triggered when the source xml is out of date. It makes a PR and generates a bincode file for you to review and merge.

In that case I think your maintenance burden is not drastically increased.

What do you think?

fabricedesre commented 11 months ago

My issue with the build.rs is more that it's work that's needlessly happening on my machine, and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

Can't that be fixed by outputting cargo:rerun-if-changed=assets/PhoneNumberMetadata.xml and splitting out the meta-data loader to be only part of the build stage?

bheylin commented 11 months ago

My issue with the build.rs is more that it's work that's needlessly happening on my machine, and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

Can't that be fixed by outputting cargo:rerun-if-changed=assets/PhoneNumberMetadata.xml and splitting out the meta-data loader to be only part of the build stage?

From what I understand about build.rs, it will always be run once wherever it's being built. Using cargo:rerun-if-changed will trigger if to be run again if the given file changes.

I just noticed the assets/upgrade.sh script. I presume one of the maintainers is running this every few months.

Is this not the perfect place to trigger the generation of the database?

rubdos commented 11 months ago

My issue with the build.rs is more that it's work that's needlessly happening on my machine, and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

I still find your argumentation ill-posed. Compilation of the library itself is also work that's needlessly happening on your machine. Very many libraries have build.rs scripts.

I still feel like we have an x-y problem here. Do you take issue with the build time that the library poses? Because that, in my eyes, would be a rather good argument to look closely at how we can further optimize the build process. If you indeed take issue with the build time, then I would also like to see numbers and a benchmark, such that we can track the build script performance over time.

As an aside: after #63, I imagine we get rid of the bincode outputted file altogether. Maybe the database can even be put behind a feature flag.

We certainly don't require quick_xml to parse a phonenumber once the database exists. 😸

I did not know we still depended on quick_xml at runtime, that's indeed something to look at. That too could become a feature flag.

Can't that be fixed by outputting cargo:rerun-if-changed=assets/PhoneNumberMetadata.xml and splitting out the meta-data loader to be only part of the build stage?

That fixes reruns, but those shouldn't really happen once the crate has been build as a dependency for your own project anyway. It would probably be a good thing to implement either way, because it speeds up the development of rust-phonenumber by reducing the number of build.rs runs.

Is this not the perfect place to trigger the generation of the database?

It would be the correct place to trigger it. But still:


My suggested approach to clean things up:

bheylin commented 11 months ago

My issue with the build.rs is more that it's work that's needlessly happening on my machine, and to do that work, it pulls in dependencies that are not critical to the domain problem, parsing a phone-number.

I still find your argumentation ill-posed. Compilation of the library itself is also work that's needlessly happening on your machine. Very many libraries have build.rs scripts.

I still feel like we have an x-y problem here. Do you take issue with the build time that the library poses? Because that, in my eyes, would be a rather good argument to look closely at how we can further optimize the build process. If you indeed take issue with the build time, then I would also like to see numbers and a benchmark, such that we can track the build script performance over time.

My problem is the amount of dependencies being brought into a project I maintain. Not build times per se, but of course more dependencies mean longer build times.

We review our dependencies periodically, and the rust-phonenumber came up because it actually caused an issue in a script I use to upgrade deps. This is the only dependency we have the uses metadata in it's semver. But that's another story :laughing: And my script was at fault there.

As an aside: after #63, I imagine we get rid of the bincode outputted file altogether. Maybe the database can even be put behind a feature flag.

Yea I find the Database as part of the public API to be a strange use case. Neither the C++ or Java impl have this feature from what I can see.

I think it's a good idea to at least feature gate it, and consider deprecating it and seeing if you get feedback on usage in the wild.

We certainly don't require quick_xml to parse a phonenumber once the database exists. 😸

I did not know we still depended on quick_xml at runtime, that's indeed something to look at. That too could become a feature flag.

:+1: much appreciated!

Is this not the perfect place to trigger the generation of the database?

It would be the correct place to trigger it. But still:

* Checking in binary files needlessly into version control feels wrong.

* We create a security issue because checking in binary files into version control obfuscates the source tree for easy verification. This could be partially remedied by having CI check the consistency between the XML and the bincode file.

Yes I agree that checking in a binary is opaque. The C++ impl uses protobufs from what I see, which has the sames downsides. I think the most transparent solution is generating Rust code. You could even move the runtime processing of the Vec<Metadata> into a Database into build time. This would presumably solve #26. As the default database is simply embedded into the binary in it's final form.

My suggested approach to clean things up:

* [ ]  Evaluate whether loading a `bincode` file is the most optimal way to load the database (it probably is not), remove the `bincode` dependency altogether, especially at runtime.

:+1:

* [ ]  We find a nice way to fix [Make database static #63](https://github.com/whisperfish/rust-phonenumber/issues/63): refactor the `Database` structure, make sure it can be loadable in `const` time and at compile time, without the need for `quick_xml`.

:+1: If you remove the Database from the API you can simplify the Database even further, by not requiring lazy_static or std::sync::Once. And I think the Database may not even need Arcs or Mutexes if restructured a little. But that's a different issue as such.

* [ ]  Evaluate whether `Database` loading (and thus the runtime `quick_xml` dependency) is used by the community (it probably is, let's not make any assumptions), and consider putting this behind a feature flag

:+1: Like I said above, I would go so far as to deprecate it and gather feedback on resistance to that. If the database loads quickly then I don't see a use case for providing a custom database.

* [ ]  Add `cargo:rerun-if-changed` to the `build.rs`

:+1:

rubdos commented 11 months ago

Cool, I think we reached an agreement about basically all points then. I'm in favour of getting the dependency tree smaller, but it should really not go at the cost of a maintenance burden.

I suggest I close this specific issue, and I create issues for the tasks we discussed. Would you agree with that?

bheylin commented 11 months ago

Cool, I think we reached an agreement about basically all points then. I'm in favour of getting the dependency tree smaller, but it should really not go at the cost of a maintenance burden.

I suggest I close this specific issue, and I create issues for the tasks we discussed. Would you agree with that?

Yep let's close this issue, the issues you outlined are more focused.

Thanks for your openness :wink:

rubdos commented 11 months ago

Closing in favour of #64 and #63