yaqwsx / jlcparts

Better parametric search for components available for JLC PCB assembly
https://yaqwsx.github.io/jlcparts/
MIT License

Allow fetch requests to read from a single downloaded tar.gz file #111

Closed: dougy83 closed this PR 8 months ago

dougy83 commented 8 months ago

This PR allows all fetch requests to read from a single downloaded all-data.tar.gz file, rather than making thousands of parallel network requests. The file is built at the same time as the rest of the data and is simply a gzipped tar of the data folder (*.json, *.json.gz, *.sha256 files).

It falls back to individual fetch requests if the archive file is unavailable or if the requested file is not within the archive.
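
For illustration, here is a minimal sketch of the approach described above. This is not the PR's actual code; the archive name, the tar-parsing helper, and the use of pako are assumptions based on this thread, and path handling is simplified.

```typescript
import { ungzip } from "pako";

type Archive = Map<string, Uint8Array>;

// Parse an uncompressed tar buffer into a path -> bytes map.
function parseTar(tar: Uint8Array): Archive {
  const files: Archive = new Map();
  const text = new TextDecoder();
  let offset = 0;
  while (offset + 512 <= tar.length) {
    const header = tar.subarray(offset, offset + 512);
    if (header.every(b => b === 0)) break;          // an all-zero block ends the archive
    const name = text.decode(header.subarray(0, 100)).split("\0")[0];
    // The entry size is stored as an octal ASCII string at offset 124 of the header.
    const size = parseInt(text.decode(header.subarray(124, 136)), 8) || 0;
    files.set(name, tar.subarray(offset + 512, offset + 512 + size));
    offset += 512 + Math.ceil(size / 512) * 512;    // entries are aligned to 512-byte blocks
  }
  return files;
}

// Download and unpack the combined archive once; return null if it is unavailable.
async function loadArchive(url: string): Promise<Archive | null> {
  const resp = await fetch(url);
  if (!resp.ok) return null;
  const gz = new Uint8Array(await resp.arrayBuffer());
  return parseTar(ungzip(gz));
}

// Look a file up in the archive first, falling back to an individual request.
async function fetchData(archive: Archive | null, path: string): Promise<Uint8Array> {
  const cached = archive?.get(path);
  if (cached) return cached;
  const resp = await fetch(path);
  return new Uint8Array(await resp.arrayBuffer());
}
```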

yaqwsx commented 8 months ago

Thank you for the PR! At the moment, the database file is around 30 MB, and it is possible that we will soon hit the GH Pages limit (50 MB). Could you also implement splitting the file into multiple parts (just like we already do, e.g., for cache.sqlite3)?

dougy83 commented 8 months ago

The all-data.tar.gz created was 24.5 MB; it would have to more than double in size to hit 50 MB. Do you want the splitting to be done as part of this PR?

dougy83 commented 8 months ago

Combined data tar.gz file is now split into two parts
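
A rough sketch of how the split parts could be re-joined in the browser before decompression, similar to how cache.sqlite3 is already fetched in parts. The part file names below are hypothetical, not necessarily the ones used in the PR.

```typescript
// Download the split archive parts in parallel and concatenate them.
async function fetchSplitArchive(urls: string[]): Promise<Uint8Array> {
  const parts = await Promise.all(
    urls.map(async u => new Uint8Array(await (await fetch(u)).arrayBuffer()))
  );
  const total = parts.reduce((n, p) => n + p.length, 0);
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const p of parts) {
    joined.set(p, offset);   // copy each part back-to-back
    offset += p.length;
  }
  return joined;             // gzipped tar, ready for ungzip + tar parsing
}

// Example usage with assumed file names:
// const gz = await fetchSplitArchive(["all-data.tar.gz.part1", "all-data.tar.gz.part2"]);
```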

maksz42 commented 8 months ago

Combined data tar.gz file is now split into two parts

@yaqwsx probably meant splitting the file into 50MB parts.

Also, if all the data is in a single gzip (which I think is a good idea), then each individual subcategory JSON shouldn't be gzipped on its own: that gives a higher compression ratio and uses less computing power.

As the gzip file is quite big, I'm wondering if it is possible to start ungzipping the file on the fly, while it's still downloading.

And I saw that Firefox added support for DecompressionStream, so we could use native decompression instead of pako.js.
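
A small sketch of that streaming idea, assuming the browser supports DecompressionStream. This is not code from the project, just an illustration of how decompression could overlap with the download.

```typescript
// Decompress the archive on the fly while it downloads, using the native
// DecompressionStream instead of pako.
async function downloadAndUngzip(url: string): Promise<Uint8Array> {
  const resp = await fetch(url);
  if (!resp.ok || !resp.body) throw new Error(`Failed to fetch ${url}`);
  // Pipe the network stream through the browser's gzip decompressor,
  // so bytes are decompressed as they arrive rather than afterwards.
  const stream = resp.body.pipeThrough(new DecompressionStream("gzip"));
  const buffer = await new Response(stream).arrayBuffer();
  return new Uint8Array(buffer);   // uncompressed tar bytes
}
```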

dougy83 commented 8 months ago

Also, if all the data is in a single gzip (which I think is a good idea), then each individual subcategory JSON shouldn't be gzipped on its own: that gives a higher compression ratio and uses less computing power.

If files are not compressed prior to being added to the combined file, they will need to be stored in memory uncompressed. This means 450 MB of memory would be required instead of ~25 MB currently. The .tar.gz of already compressed files is 25,089 KB (25,116 KB as two files); with no pre-compression it is 25,015 KB. There is no appreciable compression advantage.

As the gzip file is quite big, I'm wondering if it is possible to start ungzipping the file on the fly, while it's still downloading.

The time to unzip the two (split) .tar.gz files is 155 ms and 137 ms on my laptop in a Chromium-based browser. I don't think this is a performance bottleneck.

maksz42 commented 8 months ago

This means 450 MB of memory would be required instead of ~25 MB currently.

That makes sense. But double (de)compression is weird. Isn't a tarball enough?

Edit: I checked, and a tarball without compression gives a 29 MB file, so it looks like double gzip is a better approach.

dougy83 commented 8 months ago

That makes sense. But double (de)compression is weird. Isn't a tarball enough?

Gzip has mechanisms to handle incompressible data (e.g., already compressed data) without adding appreciable overhead. I think you'll find that such sections won't require actual decompression either, as they'll be emitted as stored (uncompressed) deflate blocks.

Edit: I checked, and a tarball without compression gives a 29 MB file, so it looks like double gzip is a better approach.

Yes, the tar file format uses 512-byte blocks, with each stored file needing at least two blocks (one header block plus one data block). This means a 12-byte file takes up 1024 bytes in the tar archive. Applying gzip to the tar file removes much of this overhead.
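
As a small illustration of that arithmetic (not code from the project), the on-disk size of a tar entry is one header block plus the data rounded up to whole 512-byte blocks:

```typescript
// Size a single file occupies inside an uncompressed tar archive.
function tarEntrySize(fileSize: number): number {
  const dataBlocks = Math.ceil(fileSize / 512); // a 12-byte file needs 1 data block
  return 512 + dataBlocks * 512;                // header block + padded data blocks
}

// tarEntrySize(12) === 1024, matching the example above.
```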

yaqwsx commented 8 months ago

@dougy83: I am confused right now; why did you withdraw your changes? Does it mean you are abandoning this PR, or do you plan to rework it?

dougy83 commented 8 months ago

Hi, sorry. I put my commits into another branch and reset my master, which inadvertently closed this PR. My master is now following the other branch again.

That said, I'm looking at redoing the whole database so it doesn't query IndexedDB; a select with full-text search is currently down from ~15 seconds to 1.2 seconds, and I think I should be able to get it below 400 ms. That rework removes all the changes in this PR anyway.

dougy83 commented 8 months ago

I'm closing this PR as it's obsolete given the db changes I made, which I'll put in another PR. The changes from this PR will be left in the "fetch-single-file" branch on my repo, but they are no longer useful.