ximion / appstream-generator

A fast AppStream metadata generator
GNU Lesser General Public License v3.0

Package.getFileData is a bottleneck sometimes #111

Open arrowd opened 1 year ago

arrowd commented 1 year ago

While testing my own asgen backend I tried processing a package for the Stellarium project, which took more than 2 hours. A little debugging revealed that the package

  1. is quite large by itself
  2. has a gazillion files
  3. has a lot of Qt translation files

When asgen runs compose on the package, its callbacks repeatedly call Package.getFileData(), which is implemented in the following way for all backends:

    override
    const(ubyte[]) getFileData (string fname)
    {
        // reopen the package archive on demand, then scan it for the requested file
        if (!pkgArchive.isOpen)
            pkgArchive.open (this.getFilename);

        return pkgArchive.readData(fname);
    }

Reading the ArchiveDecompressor.readData() code reveals that it iterates over the whole archive looking for the requested file. This results in O(n*m) complexity, where n is the number of files in the package and m is the number of translation files being requested.
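For reference, the scan that readData() performs boils down to roughly the following standalone sketch against libarchive's C API (the extern(C) prototypes are written out inline here rather than taken from asgen's own bindings, the function name readDataByScan is made up, and error handling and path normalization are omitted):

    import std.string : fromStringz, toStringz;

    extern (C) {
        struct archive;
        struct archive_entry;
        archive* archive_read_new ();
        int archive_read_support_filter_all (archive*);
        int archive_read_support_format_all (archive*);
        int archive_read_open_filename (archive*, const(char)*, size_t);
        int archive_read_next_header (archive*, archive_entry**);
        const(char)* archive_entry_pathname (archive_entry*);
        long archive_entry_size (archive_entry*);
        ptrdiff_t archive_read_data (archive*, void*, size_t);
        int archive_read_data_skip (archive*);
        int archive_read_free (archive*);
    }
    enum ARCHIVE_OK = 0;

    /// Walk the archive from the start until `fname` shows up, then decompress
    /// just that entry. Every call repeats the walk from the beginning, so m
    /// requested translation files over a package with n entries add up to O(n*m).
    ubyte[] readDataByScan (string archiveFname, string fname)
    {
        auto ar = archive_read_new ();
        scope (exit) archive_read_free (ar);
        archive_read_support_filter_all (ar);
        archive_read_support_format_all (ar);
        if (archive_read_open_filename (ar, archiveFname.toStringz, 16384) != ARCHIVE_OK)
            throw new Exception ("Unable to open " ~ archiveFname);

        archive_entry* entry;
        while (archive_read_next_header (ar, &entry) == ARCHIVE_OK) {
            if (archive_entry_pathname (entry).fromStringz != fname) {
                archive_read_data_skip (ar);  // discard data of entries we don't want
                continue;
            }
            auto data = new ubyte[cast(size_t) archive_entry_size (entry)];
            archive_read_data (ar, data.ptr, data.length);
            return data;
        }
        throw new Exception (fname ~ " was not found in " ~ archiveFname);
    }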

This clearly needs some caching/optimization, which can be made backend-agnostic, but since this is the first time I'm working with D, I decided to report the issue first and hear what options or suggestions you have.

ximion commented 1 year ago

That's the annoying thing with tar files: they don't have an index. So far this hasn't been much of an issue, as there are only a few large packages, but since we have now started to parse tons of translations, it will probably indeed become one.

If libarchive has the right API for that (which I assume it does), we could read through the archive in its entirety once and create a hash map with the right offsets. We could then immediately jump to the correct positions. This would slow down processing slightly for shorter archives, but massively increase speed for large ones.

arrowd commented 1 year ago

Or maybe just extract the archive if we detect that getFileData() is called too often?

ximion commented 1 year ago

Skipping over large data chunks is waaaaaay faster than extracting anything and dealing with disk write and read I/O. Also, what would "too often" even be? ;-) Creating a hash table for files above a certain size, or even all of them, makes much more sense :-)
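Roughly what that single pass could look like — a sketch only, not asgen code: it caches decompressed contents keyed by path rather than raw offsets (jumping back to an offset inside a compressed tar stream would need extra plumbing), the size cap and the helper name buildContentMap are invented, and the extern(C) prototypes are again written out inline instead of using asgen's bindings:

    import std.string : fromStringz, toStringz;

    extern (C) {
        struct archive;
        struct archive_entry;
        archive* archive_read_new ();
        int archive_read_support_filter_all (archive*);
        int archive_read_support_format_all (archive*);
        int archive_read_open_filename (archive*, const(char)*, size_t);
        int archive_read_next_header (archive*, archive_entry**);
        const(char)* archive_entry_pathname (archive_entry*);
        long archive_entry_size (archive_entry*);
        ptrdiff_t archive_read_data (archive*, void*, size_t);
        int archive_read_data_skip (archive*);
        int archive_read_free (archive*);
    }
    enum ARCHIVE_OK = 0;

    /// Walk the archive exactly once and build a hash table mapping entry path
    /// to decompressed contents for everything up to `maxCachedSize`. Later
    /// getFileData() lookups then become hash lookups instead of fresh scans.
    ubyte[][string] buildContentMap (string archiveFname, size_t maxCachedSize = 4 * 1024 * 1024)
    {
        ubyte[][string] cache;
        auto ar = archive_read_new ();
        scope (exit) archive_read_free (ar);
        archive_read_support_filter_all (ar);
        archive_read_support_format_all (ar);
        if (archive_read_open_filename (ar, archiveFname.toStringz, 16384) != ARCHIVE_OK)
            throw new Exception ("Unable to open " ~ archiveFname);

        archive_entry* entry;
        while (archive_read_next_header (ar, &entry) == ARCHIVE_OK) {
            immutable size = cast(size_t) archive_entry_size (entry);
            // directories report size 0; anything above the cap is not worth keeping
            if (size == 0 || size > maxCachedSize) {
                archive_read_data_skip (ar);  // skip over data chunks we don't cache
                continue;
            }
            auto data = new ubyte[size];
            archive_read_data (ar, data.ptr, data.length);
            cache[archive_entry_pathname (entry).fromStringz.idup] = data;
        }
        return cache;
    }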

arrowd commented 1 year ago

Should this be done inside ArchiveDecompressor then?

ximion commented 1 year ago

I'd say yes, because the data structure needed would hold very specific information about the individual archive. But I'll have to look again at how the decompressor is used (I haven't looked at that code in detail in ages).
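A very rough sketch of that placement — the field and helper names below are invented, the real ArchiveDecompressor obviously carries more state and methods, and the single-pass scan itself would be something like the buildContentMap() sketch from the previous comment:

    /// Illustration only: a per-archive content cache living inside the decompressor.
    class ArchiveDecompressor
    {
    private:
        string archiveFname;
        ubyte[][string] contentsCache;  // entry path -> decompressed data
        bool cacheReady;

        // The single-pass scan from the previous sketch; body elided here.
        ubyte[][string] buildContentMap () { assert (0, "see previous sketch"); }

    public:
        void open (string fname)
        {
            archiveFname = fname;
            contentsCache = null;
            cacheReady = false;
        }

        const(ubyte[]) readData (string fname)
        {
            // Populate the hash table lazily on the first lookup; every further
            // getFileData() call from the compose callbacks is then a hash lookup.
            if (!cacheReady) {
                contentsCache = buildContentMap ();
                cacheReady = true;
            }
            if (auto data = fname in contentsCache)
                return *data;
            throw new Exception ("File " ~ fname ~ " was not found in " ~ archiveFname);
        }
    }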

arrowd commented 1 year ago

> If libarchive has the right API for that (which I assume it does)

I was unable to find such an API. The only function that actually uses an archive_entry returned by archive_read_next_header is archive_read_extract, which extracts data to disk.

arrowd commented 1 year ago

Maybe we can count the number of translation files on the first readContents() call and then somehow decide whether we should extract to a temporary location?
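A rough sketch of that heuristic, assuming readContents() hands back the list of entry paths; the file patterns and the threshold are made-up values that would need tuning against real packages:

    import std.algorithm : count, endsWith;

    /// Decide on the first readContents() call whether per-file reads are likely
    /// to become a bottleneck for this package.
    bool shouldExtractWholePackage (const(string)[] contents, size_t threshold = 500)
    {
        // Count the files the compose step will request one by one, e.g.
        // Qt translations (.qm) and gettext catalogs (.mo).
        immutable requested = contents.count!(f => f.endsWith (".qm") || f.endsWith (".mo"));
        return requested > threshold;
    }

If it returns true, the package could be unpacked once into a temporary directory (for instance via the archive_read_extract path mentioned above) and getFileData() could serve further reads straight from disk; otherwise the current per-file behaviour stays in place.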