yigbt / deepFPlearn

Link molecular structures of chemicals (in form of topological fingerprints) with multiple targets

Work on creating the Singularity container and CD #2

Closed: halirutan closed this issue 4 years ago

halirutan commented 4 years ago

TODO:

bernt-matthias commented 4 years ago

Ping @halirutan @mai00fti : Could you create a release of the repos? Maybe v0.1? Maybe set the pre-release checkbox?

I would like to create a conda package.

bernt-matthias commented 4 years ago

Potentially stupid question: why is the tensorflow-gpu conda package installed? Is it really needed in the tensorflow container? Intuitively, I would guess that all the functionality is already in the container.

bernt-matthias commented 4 years ago

One more question: in tests/ there are a bunch of python scripts. How are they called and in which order are they executed?

halirutan commented 4 years ago

Potentially stupid question: why is the tensorflow-gpu conda package installed? Is it really needed in the tensorflow container?

Simple answers: 1. Because we need it in the environment.yml if someone wants to create the conda env. 2. I was too lazy to test whether it works in the singularity container without it. After all, we're setting up a new conda env, and I wasn't totally sure if it actually reuses the packages that are already available. Since the singularity build takes ages, I simply haven't investigated yet.
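If someone wants to check, something like this should tell us whether the base tensorflow container already ships a usable TF (the image name here is just an example):

# does the base image already provide a GPU-enabled tensorflow?
singularity exec tensorflow.sif python -c "import tensorflow as tf; print(tf.__version__, tf.test.is_built_with_cuda())"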

One more question: in tests/ there are a bunch of python scripts. How are they called and in which order are they executed?

The run... files are not for unit-testing, and they cannot be run automatically because they rely on the /data directory, which is not available in the repo. Therefore, only the test_... files are executed with pytest.

Could you create a release of the repos?

Not sure if I'm allowed to do this. I need to check what privileges I have.

halirutan commented 4 years ago

Ping @halirutan @mai00fti : Could you create a release of the repos? Maybe v0.1? Maybe set the pre-release checkbox?

Done: https://github.com/yigbt/deepFPlearn/releases/tag/v0.1

@bernt-matthias Damn, I guess you needed the release from this branch, right?

bernt-matthias commented 4 years ago

I guess you needed the release from this branch, right?

Should be fine.

bernt-matthias commented 4 years ago

One step further: I successfully built an image using pip only: https://github.com/bernt-matthias/deepFPlearn/runs/1166177083?check_suite_focus=true (in my original branch https://github.com/bernt-matthias/deepFPlearn/tree/topic/singularity-CD)

In contrast to the conda based container the image is 'only' 1.5GB (the other one has nearly 4GB). One problem might be that I need to install rdkit via apt-get, which seems to install to python 2.7. Not sure if this leads to problems. Tests would be nice (also in general adding CI for the project).
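A minimal check that the apt-get rdkit is actually usable from python3 could be something like this (the container name is illustrative):

# make sure rdkit imports under python3 and round-trips a SMILES
singularity exec dfpl-pip.sif python3 -c "from rdkit import Chem; print(Chem.MolToSmiles(Chem.MolFromSmiles('c1ccccc1')))"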

Not sure yet why the conda image gets that large. I will play around with multi-stage builds in the style of https://pythonspeed.com/articles/conda-docker-image-size/
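Roughly, a conda-pack based multi-stage def file in the style of that article could look like this (base images and paths are a first guess, not the final file):

Bootstrap: docker
From: continuumio/miniconda3
Stage: build

%files
    environment.yml /environment.yml

%post
    # create the env, then pack it into a relocatable archive
    conda env create -f /environment.yml -p /env
    conda install -y -c conda-forge conda-pack
    conda-pack -p /env -o /tmp/env.tar.gz
    mkdir /venv && tar -xzf /tmp/env.tar.gz -C /venv
    /venv/bin/conda-unpack

Bootstrap: docker
From: debian:buster-slim
Stage: final

%files from build
    # copy only the unpacked env, leaving conda and its caches behind
    /venv /venv

%environment
    export PATH=/venv/bin:$PATH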

bernt-matthias commented 4 years ago

If I'm not wrong, the single-stage build works now because I use the slim version of quay.io/singularity/singularity.

Nevertheless, I also got multi-stage builds to work:

The sizes are from building on my machine; here I have seen different sizes.

I have the three container definitions in my branch https://github.com/bernt-matthias/deepFPlearn/tree/topic/singularity-CD

Not sure whether the commented-out packages in the environment file are really unneeded, but at least I could not find them used with grep.

If the pip version runs, I guess this may be the preferred way.

halirutan commented 4 years ago

@bernt-matthias I need to take a closer look at this.

Tests would be nice

The rdkit installation can be tested with tests/test_fingerprint.py since it uses rdkit to build fingerprints. A call to pytest from the deepFPlearn directory (with the env activated) is sufficient.
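Concretely, with the current (historically named) env, that is:

conda activate conda_rdkit2019
cd deepFPlearn
pytest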

Not sure yet why the conda image gets that large.

My current understanding is that the conda dependency resolution is superior to that of pip. I mean, it's not only important what packages we use but also which packages the dependencies depend on. I don't know what magic conda uses but it takes a very long time to resolve all dependencies.

A definitive test would be to re-run the training. A small subset of our cases should be sufficient. But I need to test this.

Not sure whether the commented-out packages in the environment file are really unneeded, but at least I could not find them used with grep.

The whole thing has grown historically from Jana's first tries, so there is a good chance that not all packages are needed.

Can you push the commits from your branch to this one?

On a side note: We really need to fix the naming of the environment :) This is also a historical thing and conda_rdkit2019 does not really fit anymore :joy:

bernt-matthias commented 4 years ago

The rdkit installation can be tested with tests/test_fingerprint.py since it uses rdkit to build fingerprints. A call to pytest from the deepFPlearn directory (with the env activated) is sufficient.

Then we should add a %test section to the final stage.
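Something like this could work (the source path inside the container is a guess):

%test
    # run the fast unit tests inside the freshly built container
    cd /deepFPlearn
    python -m pytest tests/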

My current understanding is that the conda dependency resolution is superior to that of pip.

True

I mean, it's not only important what packages we use but also which packages the dependencies depend on. I don't know what magic conda uses but it takes a very long time to resolve all dependencies.

Conda computes the complete dependency graph of the packages and their dependencies and finds an optimal solution by solving a SAT problem. That's why it sometimes takes a long time. The problem is known; here are some recommendations: https://github.com/bioconda/bioconda-recipes/issues/13774 (e.g. fixing python / R versions to a reasonable range, for instance py>=3.5)
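For illustration, such a pin in the environment.yml could look like this (the name and the exact range are just examples):

name: dfpl_env          # placeholder name
channels:
  - conda-forge
dependencies:
  - python >=3.5,<3.9   # a bounded range keeps the SAT solver fast
  - rdkit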

A definitive test would be to re-run the training. A small subset of our cases should be sufficient. But I need to test this.

Would be a cool addition. Wondering if the GPU container will run on the GitHub Actions virtual machines, which probably do not have GPUs.

Can you push the commits from your branch to this one?

Done

On a side note: We really need to fix the naming of the environment :) This is also a historical thing and conda_rdkit2019 does not really fit anymore :joy:

Jep. That's an easy one. We just need to decide which to take.

halirutan commented 4 years ago

@bernt-matthias

Not sure whether the commented-out packages in the environment file are really unneeded, but at least I could not find them used with grep.

I was having a look at this while working on another issue with ~Java~ Jana (have to remember her name properly). We definitely use sklearn as a direct import, but I was able to scrap a lot of the packages. I also pinned the versions to the values I'm seeing here because, in the bioconda issue you linked, unpinned versions were mentioned as one major slow-down in the resolution of dependencies. I'm currently running the whole analysis on my machine with the new environment and will do the same on the HPC. Right now, the list of packages looks like this:

  - python=3.8.2
  - pip=20.1.1
  - numpy=1.19.1
  - conda=4.8.4
  - conda-build=3.20.3
  - tensorflow-gpu=2.2.0
  - scikit-learn=0.23.1
  - pandas=1.0.5
  - keras=2.4.3
  - rdkit=2020.03.4
  - matplotlib=3.2.2
  - pytest=5.4.3
  - jsonpickle=1.4.1

Then we should add a %test section to the final stage

Agreed! It would be really nice if we had some sort of test-data that we could use for training. Just to be sure. Will talk to Jana about it.

Would be a cool addition. Wondering if the GPU container will run on the GitHub Actions virtual machines, which probably do not have GPUs.

Should work because we basically have the same situation on your UFZ HPC where we can't access the GPU atm. It just falls back to CPU.

halirutan commented 4 years ago

I tested the adjusted environment locally and ran some tests. You can see that it still builds here. I'd say once the analysis I'm now starting on the HPC runs through with the changed environment, we should discuss what we need to change in the GitHub Action to push the container automatically on each new tagged release on master.
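As a starting point for that discussion, the workflow could look roughly like this; the release trigger and the push step are the essential parts, while installing singularity on the runner, the library path, and the token secret (SYLABS_TOKEN here) are placeholders we still have to sort out:

name: build-and-push-container
on:
  release:
    types: [published]
jobs:
  container:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # (installing singularity on the runner is omitted here)
      - name: Build container
        run: |
          cd singularity_container
          sudo singularity build dfpl.sif dfpl.def
      - name: Push to the sylabs library
        env:
          SYLABS_TOKEN: ${{ secrets.SYLABS_TOKEN }}  # placeholder secret
        run: |
          echo "$SYLABS_TOKEN" > token
          singularity remote login --tokenfile token
          singularity push -U dfpl.sif library://yigbt/default/dfpl:${GITHUB_REF##*/}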

It's really awesome that we made such progress and I highly appreciate your help in all this! :grinning:

halirutan commented 4 years ago

@bernt-matthias Surprisingly we now run into the same error we had before about "no space left on device" :(

bernt-matthias commented 4 years ago

Hrm. I've restarted the job...

Yesterday I briefly went over this with @mai00fti. We think it would be good if the def file(s) could be built independently of CI. For that, one should basically only need to check out the sources via git.

bernt-matthias commented 4 years ago

Attention: I force-pushed to this branch :(

halirutan commented 4 years ago

@bernt-matthias I took some time tonight to work on this, and I got everything running:

I have this on a private repo for now, but I'll work with Jana on it tomorrow. We need to set up a sylabs account for the group and add access tokens to this repo (which I'm not allowed to do). Then we need to fix the description, collection, etc. of the container so that it has the right information on it.

I'll also force-push to this branch once everything is running.

halirutan commented 4 years ago

@bernt-matthias @mai00fti Yeah, works. Right now, we decided to push the containers to Jana's sylabs account until a better solution with more quota is available. I'll push some more documentation to this branch and then it should be ready for merging.

We'll build the container on each release (and not on each push) and after merging this branch, I'll add the latest changes we made to the dfpl package to master and create a first real release.

halirutan commented 4 years ago

@bernt-matthias On my local machine, both the dfpl.def and the multi.def run through smoothly. However, there is a huge difference in size:

[screenshot comparing the sizes of the two built containers]

Have you tried building the multi.def yourself, and do you get different results?

Edit:

Since Jana mentioned that this wasn't clear: the singularity build works out of the box on every machine, and you don't have to set up anything special. Just clone the whole repo, go into the singularity_container folder, and call

SING=$(command -v singularity)
sudo $SING build dfpl-multi.sif multi.def

bernt-matthias commented 4 years ago

For me the local build of the multi_conda def file is 4.1G.

Questions:

bernt-matthias commented 4 years ago

Maybe try building locally on a fresh clone using git clone -b feat_singularity_CD --single-branch --depth 1

halirutan commented 4 years ago

@bernt-matthias You are right. I wasn't paying attention to the fact that the whole file tree is copied into the container, and it happily included the .sif files I had already built. Keeping this in mind, I suggest taking your multi.def to build the container. I will make all the necessary changes. Thanks for looking into it.
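To avoid this in the future, the %files section could also list only the paths the build actually needs instead of the whole tree, e.g. (paths illustrative):

%files
    dfpl /deepFPlearn/dfpl
    tests /deepFPlearn/tests
    environment.yml /deepFPlearn/environment.yml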