webis-de / wasp


Consider mounting volumes outside of the container for data persistence #3

Open machawk1 opened 6 years ago

machawk1 commented 6 years ago

> You can then stop the container using docker stop wasp and start it again with docker start wasp. Note that your archive is stored in the container. If you remove the container, your archive is gone.

Docker allows one to mount directories from outside the container (i.e., on the host) as volumes. Doing so would prevent the above scenario, where the data disappears once the container is gone.
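
For example, a hedged sketch; the image name and both paths here are placeholders, not the project's documented values:

```sh
# Bind-mount a host directory over the in-container archive location so that
# the archive survives removal of the container:
docker run -d --name wasp -v /srv/wasp/archive:/home/user/srv wasp

# Even after the container is removed, the data remains on the host:
docker stop wasp && docker rm wasp
ls /srv/wasp/archive
```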

ibnesayeed commented 6 years ago

Persisting data from containers is a well-known and well-documented topic, so we can assume that those familiar with Docker know how to do it, be it with volumes, bind mounts, or third-party storage drivers. However, the documentation of this repo should at least describe all the places where the data of the different services is stored (or declare them as volumes), so that users know where to mount drives for persistence. A simple bind mount example command would not hurt either.
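
For illustration, the two most common variants might look like this (image name, volume name, and paths are assumptions):

```sh
# Named volume, created and managed by Docker:
docker volume create wasp-data
docker run -d --name wasp -v wasp-data:/home/user/srv wasp

# Bind mount: a plain host directory that is easy to inspect and back up:
docker run -d --name wasp -v "$PWD/wasp-data":/home/user/srv wasp
```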

That said, I am not a big fan of monolithic containers that run too many services at once. This might work well when the whole thing is used as a portable desktop application, but for any serious, scalable setup every service should have its own container, orchestrated using a stack file (or docker-compose).

arjenpdevries commented 6 years ago

Mastodon solves this with docker-compose, and that indeed works quite nicely, so we could "borrow" their setup.

johanneskiesel commented 6 years ago

That would be useful indeed. Currently we have the following places:

pywb automatically indexes what it finds in the linked directory, so the pywb index would not need to be stored persistently. This is (currently) not the case for the Elasticsearch index, but that could be changed relatively easily.

It would then be enough to store the WARC files persistently, which would make sense to me. This would also allow you to just add WARC files you recorded with another system.

What do you think?

(As the different services currently "talk" to each other via the file system, separating them into different containers would take some effort. I agree that this is the way to go for scalable setups, but a scalable setup is probably not needed for a one-person archiver.)

ibnesayeed commented 6 years ago

I think we don't need to put applications in deep directory hierarchies when running in containers, because of the file system isolation. I would perhaps suggest placing all the individual apps directly under the / of the container file system, or making a directory at /wasp and placing everything under that. This way, the unnecessary repetition of the /home/user/srv path prefix can be avoided when dealing with volumes.

Alternatively, we should be able to change the data directories of all these applications and place them under something like /data/{warcprox,pywb,elasticsearch}. This way, the code is isolated from the data, and neither is a sub-directory of the other. This structure would also allow mounting either just the one /data directory (if the sub-directory structure on the host is the same) or each application's data directory separately.
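
For example (hypothetical image name and host paths):

```sh
# Mount all data at once; this works if the host mirrors the /data layout:
docker run -d --name wasp -v /srv/wasp-data:/data wasp

# ...or mount each application's data directory separately:
docker run -d --name wasp \
  -v /srv/warcs:/data/warcprox \
  -v /srv/pywb:/data/pywb \
  -v /srv/es:/data/elasticsearch \
  wasp
```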

> pywb automatically indexes what it finds in the linked directory, so the pywb index would not need to be stored persistently.

If I understand it correctly, PyWB automatically indexes WARC files that are not indexed already (i.e., whose CDXJ records are missing). @ikreymer, correct me if I am wrong here. If so, then persisting the PyWB index is also important; otherwise, CDXJ indexing has to happen all over again each time a container is started. This might not be a big deal for small collections, but it will become important as they grow.
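
If so, a sketch assuming pywb's default collection layout and the hypothetical /data/pywb prefix proposed above:

```sh
# pywb keeps WARCs in collections/<coll>/archive/ and CDXJ files in
# collections/<coll>/indexes/; mounting the whole collections directory
# persists both, so nothing needs to be re-indexed after a restart:
docker run -d --name wasp -v /srv/pywb/collections:/data/pywb/collections wasp
```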

> As the different services currently "talk" to each other via the file system, separating them into different containers would take some effort.

If a stack/compose file is provided, it can define the necessary volumes and make them available to each service, so that the services can still share the file system.
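
A minimal sketch of that idea (service, image, and volume names are made up for illustration, not taken from this repo):

```yaml
version: "3"
services:
  warcprox:
    image: wasp/warcprox         # hypothetical image
    volumes:
      - warcs:/data/warcprox
  pywb:
    image: wasp/pywb             # hypothetical image
    volumes:
      - warcs:/data/warcprox:ro  # same files, read-only for the replay side
volumes:
  warcs:
```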

> I agree that this is the way to go for scalable setups, but a scalable setup is probably not needed for a one-person archiver.

If the intent of this project is only single-user, small-scale setups, then this assumption is fair enough.

arjenpdevries commented 6 years ago

Mastodon uses external services such as Postgres and Redis, each running from its own image. This is, for example, how Postgres is used in Mastodon.

In my docker-compose.yml file for idf.social, the Postgres storage directory is mapped to a directory on the host file system:

```yaml
db:
  restart: always
  image: postgres:9.6-alpine
  networks:
    - internal_network
  volumes:
    - /data/mastodon/postgres/postgres:/var/lib/postgresql/data:z
```

The volumes directive causes the data for the Postgres database to reside outside the container, in the host directory /data/mastodon/postgres/postgres. (The :z suffix is necessary for SELinux.)

See also the docker-compose.yml for the full, more complex definition: Mastodon uses more services and also defines some additional volumes to store its own dynamic data on the host file system.

ibnesayeed commented 6 years ago

There are many ways to achieve this. We can even declare volumes and networks as top-level objects in the compose file, then use those to deploy with docker-compose for quick testing or with the built-in docker stack for a more robust, long-running system. Shared volumes allow file-based communication, and shared networks let container services reach each other by service name.
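
For instance, a compose file with top-level volumes and networks can be deployed either way (the stack name is arbitrary, and docker stack requires swarm mode):

```sh
docker-compose up -d                            # quick local testing
docker swarm init                               # one-time swarm setup, if needed
docker stack deploy -c docker-compose.yml wasp  # robust long-running deployment
```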