New cache format - Githubissues

mattias-p commented 3 years ago

After Engine has performed a run, its cache can be saved to disk. This effectively creates an archive of those pieces of the internet that are needed to perform this particular run. The cache file can be restored from disk and the same run can be performed again with the same result, provided that the version of Engine that restores the cache is the same version that stored it and that a sufficiently similar profile is being used.

However, there are some limitations to the current implementation.

The cache file does not include the parameters of the run(s) that were used to collect the cache data.
The cache file does not include the version number of Engine that was used to create it. This is probably only useful occasionally when debugging.
The cache file does not include ASN lookup requests. Consequently a test that uses ASN lookup makes new ASN queries no matter if you're performing a fresh run or a run from restored cache data.
The cache file does not include AXFR requests. Consequently a test that uses AXFR makes new AXFR queries no matter if you're performing a fresh run or a run from restored cache data.
The file format isn't flexible enough to support MethodsNT. E.g. it does not distinguish between NS responses given from parent zone name servers and responses given from the zone's own name servers.
The file format is tightly coupled with Engine's internal cache representation. This is bad for two reasons. First, it's wasteful because there are lots of repetitions of the strings "Zonemaster::Engine::Packet" and "Zonemaster::LDNS::Packet". Second, if we want to change the internal cache representation, we'll either have to change the cache file format or we'll have to emulate the old internal cache representation just to be able to save and restore the cache files.

A new cache file format should be designed and implemented. All existing data files should be converted to the new format. We should consider supporting both cache formats during a transition period.

mattias-p commented 3 years ago

The serialization format needs to support multiple sets of data of different types. I've found a comparison of different serialization formats from 2016 targeting Python. While the numbers and conclusions may not apply to our use case, the report does highlight how the choice of format impacts performance.

marc-vanderwal commented 2 months ago

However, there are some limitations to the current implementation.

The cache file does not include the parameters of the run(s) that were used to collect the cache data.

We’d need to save:

the version of Zonemaster that generated the saved cache file (because test cases can change);
the domain under test;
which IP protocol versions were enabled;
which test cases were enabled;
other settings from the effective profile that affect the queries that Zonemaster sends.
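To make the list above concrete, here is a minimal sketch of what such a parameters record could look like. The class name, field names and the `mismatches` helper are hypothetical and only illustrate the idea; nothing here reflects an existing Zonemaster structure, and the version string is a placeholder.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class RunParameters:
    """Hypothetical record of the settings that shaped a cached run."""
    zonemaster_version: str          # version that generated the cache file
    domain: str                      # domain under test
    ipv4_enabled: bool               # IP protocol versions that were enabled
    ipv6_enabled: bool
    test_cases: Tuple[str, ...]      # test cases that were enabled
    # Other effective-profile settings that affect the queries being sent.
    profile_settings: Dict[str, object] = field(default_factory=dict)


def mismatches(saved: RunParameters, current: RunParameters) -> list:
    """Return the names of the fields that differ between two runs."""
    a, b = asdict(saved), asdict(current)
    return [name for name in a if a[name] != b[name]]


# "example-version" is a placeholder, not a real Zonemaster version.
saved = RunParameters("example-version", "example.com", True, True, ("Nameserver03",))
current = RunParameters("example-version", "example.com", True, False, ("Nameserver03",))
print(mismatches(saved, current))  # ['ipv6_enabled']
```

With something like this stored next to the cached packets, a restore could warn when the saved parameters and the current ones diverge instead of silently producing different results.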

The cache file does not include the version number of Engine that was used to create it. This is probably only useful occasionally when debugging.

See above.

The cache file does not include ASN lookup requests. Consequently a test that uses ASN lookup makes new ASN queries no matter if you're performing a fresh run or a run from restored cache data.

The ASN lookup data should be saved in an implementation-agnostic format: e.g. a companion structure that just keeps a record of the IP address to ASN mappings that were discovered during that run.
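A rough sketch of such a companion structure, assuming it only needs to remember which ASNs were found for which IP address. The class and method names are made up for illustration, and the addresses and AS number come from the ranges reserved for documentation.

```python
import ipaddress


class AsnRecord:
    """Hypothetical companion structure: IP address -> ASNs seen during a run."""

    def __init__(self):
        self._by_ip = {}

    @staticmethod
    def _normalise(ip):
        # Parse and re-serialise so that equivalent spellings share one key.
        return str(ipaddress.ip_address(ip))

    def record(self, ip, asns):
        """Remember the result of an ASN lookup (an empty list is a valid result)."""
        self._by_ip[self._normalise(ip)] = sorted(set(asns))

    def lookup(self, ip):
        """Return the recorded ASNs, or None if this IP was never looked up."""
        return self._by_ip.get(self._normalise(ip))

    def to_plain(self):
        """Plain dict with sorted keys, ready to hand to any serializer."""
        return dict(sorted(self._by_ip.items()))


asn = AsnRecord()
asn.record("192.0.2.1", [64500])
asn.record("2001:0DB8::1", [64500, 64501])
print(asn.lookup("2001:db8::1"))   # [64500, 64501]
print(asn.lookup("198.51.100.7"))  # None
```

Keeping "looked up, nothing found" (an empty list) distinct from "never looked up" (`None`) matters for replaying a run, since only the latter should trigger a new query.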

The cache file does not include AXFR requests. Consequently a test that uses AXFR makes new AXFR queries no matter if you're performing a fresh run or a run from restored cache data.

We’d have to make sure that all queries go through the cache. Also, the cache should, if it doesn’t already do so, store packets corresponding to negative responses as well as positive ones.

However, it might be useful to consider not caching the answer sections of positive AXFR responses.

The file format isn't flexible enough to support MethodsNT. E.g. it does not distinguish between NS responses given from parent zone name servers and responses given from the zone's own name servers.

Doesn’t the current implementation already use the target name server’s IP address as part of the cache key? Maybe this is a deficiency of something other than the cache.

Either way, the cache needs to store the source IP address of responses.

The file format is tightly coupled with Engine's internal cache representation. This is bad for two reasons. First, it's wasteful because there are lots of repetitions of the strings "Zonemaster::Engine::Packet" and "Zonemaster::LDNS::Packet". Second, if we want to change the internal cache representation, we'll either have to change the cache file format or we'll have to emulate the old internal cache representation just to be able to save and restore the cache files.

It is indeed unnecessarily verbose. Storing a base64-encoded wire format representation of each packet should suffice.
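As an illustration of the idea, the sketch below builds a small DNS query in wire format by hand and round-trips it through base64. The function names are hypothetical, and a real implementation would take the wire bytes from the DNS library rather than assembling them like this.

```python
import base64
import struct


def build_query(name, qtype, msg_id=0):
    """Assemble a minimal DNS query in wire format (illustration only)."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.encode("ascii").split(b".") if label]
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    # Question: QNAME, QTYPE, QCLASS=IN.
    return header + qname + struct.pack("!HH", qtype, 1)


def encode_packet(wire):
    """Wire-format bytes -> text that can be stored in the cache file."""
    return base64.b64encode(wire).decode("ascii")


def decode_packet(text):
    """Stored text -> wire-format bytes, rejecting non-base64 input."""
    return base64.b64decode(text, validate=True)


wire = build_query("example.com.", 6)  # QTYPE 6 = SOA
stored = encode_packet(wire)
assert decode_packet(stored) == wire
print(len(wire), len(stored))
```

The stored value carries no class names or other implementation details, so the internal packet representation could change without touching the file format.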

We should also think about the cache keys. The current implementation applies an MD5 function to turn the actual cache key into a string that is then used as a key in a Perl hash. It might be a good opportunity to question whether that MD5 function really serves a purpose.
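To illustrate the question, here is a small sketch comparing a hashed key with a structured one. The key components (server IP, query name, query type, flags) are an assumption made for the example; the thread does not spell out what the actual cache key contains.

```python
import hashlib


def composite_key(server_ip, qname, qtype, flags):
    """Hypothetical structured cache key; the components are an assumption."""
    return (server_ip, qname.lower().rstrip("."), qtype, tuple(sorted(flags.items())))


def md5_key(key):
    """Hashed variant: fixed length, but the original key cannot be read back."""
    return hashlib.md5(repr(key).encode("utf-8")).hexdigest()


key = composite_key("192.0.2.1", "Example.COM.", "NS", {"dnssec": False})

# A structured key works directly as a dictionary key ...
cache = {key: "<packet>"}
assert composite_key("192.0.2.1", "example.com", "NS", {"dnssec": False}) in cache

# ... while the hashed key is an opaque 32-character string.
print(len(md5_key(key)))  # 32
```

A readable key makes a saved cache easier to inspect and debug. The hash only seems to buy a fixed key length, which may or may not matter for the new format.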

marc-vanderwal commented 2 months ago

We might need to keep the old cache format around, or at least migrate the data files’ contents to a different structure inside the new format.

mattias-p commented 2 months ago

We (@marc-vanderwal, @MichaelTimbert and I) noted that we need to store multiple different sections in the new cache format and discussed which serialization format to choose. In the end we settled on CBOR, which seems to be a good fit for our use case. It's a binary format similar to JSON, but with good performance.

The Perl implementation that people are using seems to be CBOR::XS, and the Rust implementation that people are using seems to be ciborium.

We should make a PoC to ensure we can get reproducible serializations of Map data with CBOR.
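As a starting point for such a PoC, the sketch below is a tiny hand-written CBOR encoder in Python that sorts map keys by their encoded bytes, following the deterministic encoding rules in RFC 8949, § 4.2.1. It is not CBOR::XS or ciborium and says nothing about how those libraries behave; it covers only a handful of types and is meant to show what "reproducible" would mean at the byte level.

```python
import struct


def _head(major, n):
    """Encode a major type and argument using the shortest possible form."""
    if n < 24:
        return bytes([major << 5 | n])
    if n < 0x100:
        return bytes([major << 5 | 24, n])
    if n < 0x10000:
        return bytes([major << 5 | 25]) + struct.pack("!H", n)
    if n < 0x100000000:
        return bytes([major << 5 | 26]) + struct.pack("!I", n)
    return bytes([major << 5 | 27]) + struct.pack("!Q", n)


def encode(value):
    """Deterministically encode bools, ints (up to 64 bits), bytes, str, lists, dicts."""
    if isinstance(value, bool):          # must be checked before int
        return b"\xf5" if value else b"\xf4"
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, (list, tuple)):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        # Sort entries by the bytes of the encoded key, not by insertion order.
        items = sorted((encode(k), encode(v)) for k, v in value.items())
        return _head(5, len(items)) + b"".join(k + v for k, v in items)
    raise TypeError(f"unsupported type: {type(value).__name__}")


first = encode({"b": 1, "a": 2})
second = encode({"a": 2, "b": 1})
assert first == second
print(first.hex())  # a2616102616201
```

The real PoC would need to check the same property with the libraries actually chosen, in particular that two runs over the same Map data yield byte-identical files.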

marc-vanderwal commented 2 months ago

For writing CBOR files, it might be useful to consider prefixing the CBOR-encoded data with the “magic value” described in RFC 8949, § 3.4.6, in order to distinguish them from the old format.
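A minimal sketch of how that could be used for format detection. The magic value is the three-byte encoding of tag 55799 (`0xd9d9f7`); the function names are hypothetical, and the sketch assumes without verification that old-format files never start with those bytes.

```python
# Encoded head of tag 55799 ("self-described CBOR", RFC 8949, § 3.4.6).
SELF_DESCRIBED_CBOR = bytes.fromhex("d9d9f7")


def wrap(cbor_payload):
    """Prefix already-encoded CBOR data with the self-described tag."""
    return SELF_DESCRIBED_CBOR + cbor_payload


def looks_like_new_format(data):
    """Cheap check on the first bytes of a cache file."""
    return data.startswith(SELF_DESCRIBED_CBOR)


new_file = wrap(b"\xa0")  # 0xa0 is an empty CBOR map
old_file = b"placeholder for old-format content"

print(looks_like_new_format(new_file))  # True
print(looks_like_new_format(old_file))  # False
```

Because the tag wraps the data item that follows it, the result is still a single valid CBOR item, so a decoder that understands tags should be able to read the file without stripping the prefix first.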

matsduf commented 2 months ago

The cache file does not include AXFR requests. Consequently a test that uses AXFR makes new AXFR queries no matter if you're performing a fresh run or a run from restored cache data.

We’d have to make sure that all queries go through the cache. Also, the cache should, if it doesn’t already do so, store packets corresponding to negative responses as well as positive ones.

However, it might be useful to consider not caching the answer sections of positive AXFR responses.

According to the specification of Nameserver03, the test case looks for the SOA record first in the answer section, so that should be cached.

zonemaster / zonemaster-engine

New cache format #938