Open jmvalin opened 5 months ago
[Jean-Marc Valin]
This new release brings many improvements, including improved training, SSE4.1 and AVX2 optimizations, and run-time CPU detection (RTCD).
The distributed models are now trained using only publicly available datasets.
This sound very good. How can I learn how to do the training myself? That knowledge would be a great argument for getting this into the Debian main distribution.
<URL: https://bugs.debian.org/980839 >
-- Happy hacking Petter Reinholdtsen
This release expects the build process to execute a arbitrary-shellscript to download an arbitrary file as part of the build process; outside of cmake, in a potentially privileged environment.
If I can find the time, I will PR fix for your Cmake or alike; I'm sorry I'm not doing it now - because it's very exciting to me this project is moving again.
But as it stands, this is a pretty big hole for practical secure CI-pipelines and reliability in isolated or secure environments. In short - do not do this.
THAT SAID:
I'm honestly glad this project is getting some much needed attention; but please remember that it's been relied upon for various uses, on various platforms for a very long time now.
If your not considering the need to build this as a standalone Risc-V library for a non-linux operating-system; You should be. It may need to be built without internet access. And on windows XP; to support cable TV stations using it with Gstreamer.
And many, many more that I'm aware of.
Keep up the good work, do good things, and have pride - just don't forget the use-cases beyond your own. ;)
[John Bargman]
This release expects the build process to execute a arbitrary-shellscript to download an arbitrary file as part of the build process; outside of cmake, in a potentially privileged environment.
This sound like a prospective xz disaster, indeed. Where is the code causing this? At least the Debian build servers do not all have Internet access during build, so it would fail at least some architectures.
-- Happy hacking Petter Reinholdtsen
So there's two things here: one is the download script and the other is the actual content being fetched. In terms of the script itself, it's fairly simple to audit and see that it's not doing anything bad. It's also in git and signed, so I don't really see a trust issue compared to trusting the rest of the source code.
When it comes the the actual model data being fetched, I can see how it's not as safe as the rest of the Git content (sha1 and signature). One possible way to solve that would be the equivalent to what was just added to Opus: https://github.com/xiph/opus/commit/9faf6f071
The idea Is that Git will hold the sha256 of the downloaded content and thus it can check that the checksums actually match before using it. That way compromising the server will just result in a build failure.
The idea Is that Git will hold the sha256 of the downloaded content and thus it can check that the checksums actually match before using it. That way compromising the server will just result in a build failure.
This is without question an acceptable solution - I come from a version of linux called NixOS; our package manager does this for all external downloads as part of the build step; ( which is how I came to be here: https://github.com/NixOS/nixpkgs/issues/304433 )
The shellscript itself is an acceptable compromise if an external download is absolutely required; in my personal ideal external data would be fetched by git submodule (providing a solid revision control), or the dependency treated as such.
It seems the model for RNNoise has always been a dependency;
[Jean-Marc Valin]
So there's two things here: one is the download script and the other is the actual content being fetched. In terms of the script itself, it's fairly simple to audit and see that it's not doing anything bad. It's also in git and signed, so I don't really see a trust issue compared to trusting the rest of the source code.
Most Linux distributions block package building from downloading stuff from the Internet, to ensure reproducibility, consistency and completeness, and I would strongly suggest to not base the RNNoise build on such practice for rnnoise.
-- Happy hacking Petter Reinholdtsen
@petterreinholdtsen Do you have alternatives to suggest for distributing a 20 MB binary file? In version 0.1, the model was much smaller (and it still wasn't great) and didn't include the binary model (which is the preferred form for making modifications in that case).
[Jean-Marc Valin]
@petterreinholdtsen Do you have alternatives to suggest for distributing a 20 MB binary file? In version 0.1, the model was much smaller (and it still wasn't great) and didn't include the binary model (which is the preferred form for making modifications in that case).
I would suggest to include it in the git repository and release tarballs.
-- Happy hacking Petter Reinholdtsen
All package managers I know of have the ability to specify multiple source packages - I'd expect them to specify the model data as one of the sources. The nixpkgs approach seems good.
Having a hash of the data in git should be enough for revision control to ensure that it has the same version of the model as the code was designed for. Submodules remain tricky to use, especially because the model is large and most people would want a shallow clone of it.
OK, the download process should be safer now as the download script will verify the sha256 before using the downloaded binary file.
[Thomas Daede]
Having a hash of the data in git should be enough for revision control to ensure that it has the same version of the model as the code was designed for.
Note, for release tarballs, such approach will break the deserted island test of free software. Those bringing the tarball and all its build dependencies with you to a computer on a deserted island with no Internet connectivity would not be able to build it as the build require Internet connectivity and the abililty to download the model without interferrence.
I heard a variant of the deserted island a few years ago, where the people gathering together at AfricaSource, a free software development confernece in the middle of a savannah, with no Internet connectivity, used a collecting of Debian DVDs as their source of build dependencies.
-- Happy hacking Petter Reinholdtsen
[Thomas Daede] [SNIP] Note, for release tarballs, such approach will break the deserted island test of free software. [SNIP] -- Happy hacking Petter Reinholdtsen
I have to agree with this statement - although one would hope that the dependency will find it's way onto a debian disk too ;)
Unless I missed something, the RNNoise 0.2 passes the desert island test fine. The download mechanism only applies to Git.
[Jean-Marc Valin]
Unless I missed something, the RNNoise 0.2 passes the desert island test fine. The download mechanism only applies to Git.
Sound good. From where can I download a tarball with the model included, to verify no download from Internet take place during build?
-- Happy hacking Petter Reinholdtsen
On the Releases page: https://github.com/xiph/rnnoise/releases/download/v0.2/rnnoise-0.2.tar.gz
This new release brings many improvements, including improved training, SSE4.1 and AVX2 optimizations, and run-time CPU detection (RTCD).
The distributed models are now trained using only publicly available datasets.