webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0
1.34k stars 209 forks

do not copy files into archive #408

Open anarcat opened 5 years ago

anarcat commented 5 years ago

Is your feature request related to a problem? Please describe.

I find it difficult to use pywb on large datasets because the files are copied into the collections instead of just "referenced" there.

Describe the solution you'd like

When I add a file to a collection, it should be just treated as if it's in the collection somehow, without having to copy gigantic files around.

Describe alternatives you've considered

I have looked for options in the wb-manager program to see if something could fit the bill, particularly a way to symlink or hardlink files around. I haven't found anything. I also looked at the auto-indexer, but I'm not sure how that works.

I also know about the wb-manager index command, but that only works for files already in the archive directory.
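To make the symlink/hardlink idea concrete, here is a minimal sketch of what "referencing" a file could look like. It is not an existing wb-manager feature; `link_into_archive` is a hypothetical helper, and the `collections/<name>/archive` layout used in the usage note is an assumption about where a collection keeps its files.

```python
import os
from pathlib import Path


def link_into_archive(warc_path, archive_dir):
    """Expose warc_path inside archive_dir without copying its data.

    Hypothetical helper, not part of pywb. Tries a hard link first
    (only possible on the same filesystem) and falls back to a symlink.
    Returns the path of the new directory entry.
    """
    src = Path(warc_path).resolve()
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name

    # Refuse to clobber an existing entry (including a dangling symlink).
    if dest.exists() or dest.is_symlink():
        raise FileExistsError(str(dest))

    try:
        os.link(src, dest)      # hard link: no extra disk space used
    except OSError:
        os.symlink(src, dest)   # e.g. source is on another filesystem
    return dest
```

Under the assumed layout, a call would look like `link_into_archive("/data/big.warc.gz", "collections/my-coll/archive")`. The linked file would presumably still need to be indexed afterwards with the wb-manager index command; whether indexing and replay follow symlinks correctly has not been verified here.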

I also suspect custom user-defined collections might fit the bill, but I haven't figured out how to use those just yet, plus they probably require restarting the wayback process every time since a special configuration needs to be made for every archive...

traverseda commented 4 years ago

On copy-on-write filesystems like btrfs it would be nice if we could use btrfs's reflinks.
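For illustration, a reflink-with-fallback copy could be sketched roughly like this. It assumes Linux and uses the `FICLONE` ioctl directly; `reflink_or_copy` is a hypothetical name, and nothing like it is confirmed to exist in pywb.

```python
import fcntl
import shutil

# Linux ioctl request number for FICLONE (defined in linux/fs.h).
FICLONE = 0x40049409


def reflink_or_copy(src, dest):
    """Clone src to dest, sharing data blocks when the filesystem allows it.

    Hypothetical helper, Linux only. Returns True if a reflink was made,
    False if it fell back to an ordinary full copy.
    """
    with open(src, "rb") as fsrc, open(dest, "wb") as fdst:
        try:
            # Ask the kernel to make dest share src's extents.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # Filesystem without reflink support, or src and dest
            # are on different filesystems.
            pass
    shutil.copyfile(src, dest)
    return False
```

On a filesystem with reflink support the clone should be near-instant and use no additional space until one of the files is modified; elsewhere the function behaves like a normal copy, so it would not by itself solve the disk usage problem described in the issue.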