whotracksme / whotracks.me

Data from the largest and longest measurement of online tracking.
https://www.ghostery.com/whotracksme
MIT License
407 stars 73 forks

Conserve Git LFS bandwidth #231

Closed philipp-classen closed 3 years ago

philipp-classen commented 3 years ago

Currently, we exceed our Git LFS limits relatively quickly. Some ideas to reduce the amount of downloaded data:

philipp-classen commented 3 years ago

After some experiments, I have come to the conclusion that LFS creates more problems in our case than it solves. lfs.fetchexclude does not integrate well, and the state of the art for switching the Git LFS backend from GitHub to S3 is not promising (support is experimental at best, and the few solutions that are still actively developed seem to require running your own servers).

My recommendation is to move all data files out of Git and put them on a public mirror. The data should be compressed, and scripts should be provided to download it (maybe rsync-based, to detect existing data) and extract it locally. The current PR workflow could be preserved by updating the download script (adding a new month), so we can run tests for the PR before it has an effect.
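
To make the proposal a bit more concrete, here is a rough sketch of what such a download script could look like. It is only an illustration, not an existing script in the repository: the mirror URL, the `MONTHS` list, the one-archive-per-month naming, and all function names are assumptions. It uses plain HTTP instead of rsync and skips months whose directory already exists locally.

```python
from pathlib import Path
import tarfile
import urllib.request

# Hypothetical mirror location; the real mirror does not exist yet.
MIRROR_URL = "https://example.org/whotracksme-data"

# Hypothetical month list; a PR adding a new month would append an entry here.
MONTHS = ["2020-01", "2020-02"]


def archive_name(month):
    """Assumed naming scheme: one compressed tarball per month."""
    return f"{month}.tar.gz"


def missing_months(months, data_dir):
    """Return the months that have no extracted directory under data_dir."""
    return [m for m in months if not (Path(data_dir) / m).is_dir()]


def download_and_extract(month, data_dir, mirror_url=MIRROR_URL):
    """Fetch one month's archive and unpack it into data_dir.

    Assumes the archive contains a single top-level directory named
    after the month and that the mirror is trusted.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / archive_name(month)
    urllib.request.urlretrieve(f"{mirror_url}/{archive_name(month)}", archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(data_dir)
    archive.unlink()  # keep only the extracted data


def sync(months=MONTHS, data_dir="whotracksme/data/assets"):
    """Download only the months that are not present locally."""
    for month in missing_months(months, data_dir):
        download_and_extract(month, data_dir)
```

Calling `sync()` from a checkout would then fetch only what is missing, so a PR that appends a month to `MONTHS` can be tested before the data is merged anywhere else.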


For documentation, this was my example .lfsconfig:

    [lfs]
        fetchexclude = "whotracksme/data/assets/2017-*/**/*,whotracksme/data/assets/2018-*/**/*,whotracksme/data/assets/2019-*/**/*,whotracksme/data/assets/2020-*/**/*"
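
The value is just a comma-separated list of one glob per year. If anyone wants to regenerate it for a different range of years, a small helper along these lines should do (the function name and the `root` default are my own, purely for illustration):

```python
def fetchexclude_value(years, root="whotracksme/data/assets"):
    """Build a comma-separated lfs.fetchexclude value, one glob per year."""
    return ",".join(f"{root}/{year}-*/**/*" for year in years)


# Reproduces the value from the example .lfsconfig (years 2017 to 2020).
print(fetchexclude_value(range(2017, 2021)))
```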