arrowd opened this issue 1 year ago
That's the annoying thing with tar files: they don't have an index. So far this hasn't been much of an issue, as there are only a few large packages, but now that we have started to parse tons of translations it will probably become one.
If libarchive has the right API for that (which I assume it does), we could read through the archive in its entirety once and create a hash map with the right offsets. We could then immediately jump to the correct positions. This would slow down processing slightly for shorter archives, but massively increase speed for large ones.
Or maybe just extract the archive if we detect that getFileData() is called too often?
Skipping over large data chunks is waaaaaay faster than extracting anything and dealing with disk write and read I/O. Also, what would "too often" even be ;-) Creating a hash table for files above a certain size, or even all of them, makes much more sense :-)
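For illustration, here is a minimal C sketch of what that single read-through pass would look like with libarchive (the tarball name is made up, and asgen itself drives libarchive from D): it only inspects the entry headers and skips the payloads, which is the cheap pass an index or hash table could be built from.

```c
#include <stdio.h>
#include <archive.h>
#include <archive_entry.h>

/* Walk the archive once, looking only at entry headers and skipping the
 * file data.  The (name, size) pairs seen here are what an index / hash
 * table would be built from; nothing is written to disk. */
int main(void)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;

    archive_read_support_filter_all(a);   /* gz, xz, zstd, ... */
    archive_read_support_format_all(a);   /* tar, cpio, ... */
    if (archive_read_open_filename(a, "package.tar.xz", 16 * 1024) != ARCHIVE_OK) {
        fprintf(stderr, "%s\n", archive_error_string(a));
        archive_read_free(a);
        return 1;
    }

    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
        printf("%s (%lld bytes)\n",
               archive_entry_pathname(entry),
               (long long)archive_entry_size(entry));
        archive_read_data_skip(a);        /* jump over the payload */
    }

    archive_read_free(a);
    return 0;
}
```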
Should this be done inside ArchiveDecompressor then?
I'd say yes, because the data structure needed would hold very specific information about the individual archive. But I'll have to look again at how the decompressor is used (I haven't looked at that code in detail in ages).
"If libarchive has the right API for that (which I assume it does)"

I was unable to find such an API. The only function that actually consumes an archive_entry returned by archive_read_next_header is archive_read_extract, which extracts the data to disk.
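For what it's worth, the current entry's payload can also be read straight into memory with archive_read_data(), which operates on the archive handle rather than on the archive_entry. The real limitation is that there is no seeking, so every lookup still has to walk the archive from the beginning, which matches what ArchiveDecompressor.readData() does today. A rough sketch of such a per-file lookup (function name and error handling are illustrative):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <archive.h>
#include <archive_entry.h>

/* Sketch of a single lookup: walk the archive until `wanted` is found,
 * then read that entry's data into a malloc'ed buffer.  Every call pays
 * a full sequential scan, which is why repeated lookups get expensive. */
static void *read_one_file(const char *tarball, const char *wanted, size_t *len)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;
    char *buf = NULL;

    *len = 0;
    archive_read_support_filter_all(a);
    archive_read_support_format_all(a);
    if (archive_read_open_filename(a, tarball, 16 * 1024) != ARCHIVE_OK)
        goto out;

    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
        if (strcmp(archive_entry_pathname(entry), wanted) != 0) {
            archive_read_data_skip(a);              /* not the one, skip */
            continue;
        }

        size_t size = (size_t)archive_entry_size(entry);
        buf = malloc(size ? size : 1);
        if (buf == NULL)
            break;

        /* archive_read_data() may return short reads, so loop until the
         * whole entry has been consumed. */
        size_t done = 0;
        while (done < size) {
            ssize_t n = archive_read_data(a, buf + done, size - done);
            if (n <= 0)
                break;
            done += (size_t)n;
        }
        *len = done;
        break;
    }
out:
    archive_read_free(a);
    return buf;
}
```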
Maybe we can count the number of translation files on the first readContents() call and then somehow decide if we should extract to a temporary location?
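A rough sketch of that heuristic in terms of the underlying libarchive calls; the "/locale/" pattern, the function name and the idea of a fixed threshold are all made up for illustration:

```c
#include <string.h>
#include <archive.h>
#include <archive_entry.h>

/* Cheap header-only pass: count entries that look like translation files.
 * The caller could then decide whether per-file lookups are fine or the
 * package should be unpacked to a temporary directory instead (e.g. when
 * the count exceeds some threshold). */
static long count_translation_files(const char *tarball)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;
    long count = -1;

    archive_read_support_filter_all(a);
    archive_read_support_format_all(a);
    if (archive_read_open_filename(a, tarball, 16 * 1024) == ARCHIVE_OK) {
        count = 0;
        while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
            if (strstr(archive_entry_pathname(entry), "/locale/") != NULL)
                count++;
            archive_read_data_skip(a);    /* headers only, no data */
        }
    }
    archive_read_free(a);
    return count;
}
```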
While testing my own asgen backend I tried processing a package for the Stellarium project. It took more than 2 hours to process it. A little debugging revealed that the package ships a very large number of translation files.
When asgen runs compose on the package, its callbacks repeatedly call Package.getFileData(), which is implemented in essentially the same way for all backends: each call ends up in ArchiveDecompressor.readData(). Reading the readData() code reveals that it iterates over the whole archive looking for the requested file. This results in O(n*m) complexity, where n is the number of files in the package and m is the number of translation files.

This clearly needs some caching/optimization, which could be made backend-agnostic, but since this is the first time I'm working with D, I decided to report the issue first and hear back options or suggestions.
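One possible backend-agnostic direction, sketched below against the raw libarchive API (the wanted-list handling and the store callback are hypothetical, and asgen would do this in D inside ArchiveDecompressor): gather all requested files in a single sequential pass and serve later lookups from a cache, so the archive is scanned once instead of once per file.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <archive.h>
#include <archive_entry.h>

/* Hypothetical callback: hand one file's contents to whatever cache the
 * caller maintains (in asgen this would be a hash map keyed by path). */
typedef void (*store_fn)(const char *path, const void *data, size_t len,
                         void *userdata);

/* Single pass over the archive: read every entry whose path is in `wanted`
 * into memory and pass it to `store`, skip everything else.  Each archive
 * byte is visited once, instead of once per requested file. */
static int cache_wanted_files(const char *tarball,
                              const char *const *wanted, size_t n_wanted,
                              store_fn store, void *userdata)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;
    int ret = -1;

    archive_read_support_filter_all(a);
    archive_read_support_format_all(a);
    if (archive_read_open_filename(a, tarball, 16 * 1024) != ARCHIVE_OK)
        goto out;

    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
        const char *path = archive_entry_pathname(entry);
        int found = 0;

        for (size_t i = 0; i < n_wanted; i++)   /* linear scan kept simple; */
            if (strcmp(path, wanted[i]) == 0) { /* a real cache would use a */
                found = 1;                      /* hash table here          */
                break;
            }
        if (!found) {
            archive_read_data_skip(a);          /* uninteresting, skip data */
            continue;
        }

        size_t size = (size_t)archive_entry_size(entry);
        char *buf = malloc(size ? size : 1);
        if (buf == NULL)
            goto out;

        size_t done = 0;                        /* read the payload fully,  */
        while (done < size) {                   /* allowing for short reads */
            ssize_t n = archive_read_data(a, buf + done, size - done);
            if (n <= 0)
                break;
            done += (size_t)n;
        }
        store(path, buf, done, userdata);
        free(buf);
    }
    ret = 0;
out:
    archive_read_free(a);
    return ret;
}
```

Whether everything gets cached, or only files above a certain size, is the policy question discussed in the comments above.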