zerothi / sisl-files

Test files and other large files part of the sisl-suite
Mozilla Public License 2.0

Structure and what to store #14

Closed: zerothi closed this issue 1 month ago

zerothi commented 5 months ago

@tfrederiksen did #13, which spurred the question of how the files in this repo should be organized.

So far it hasn't been too stringent. But I think we ought to do something now.

While the current repo isn't that big, the bandwidth usage scales very rapidly with the number of CI runs.

E.g. for a 100 MB repo and CI running 100 times (each for 3 Python versions), the bandwidth usage amounts to 100 MB × 100 × 3 = 30 GB. Currently I am paying for 50 GB every 30 days, so we should strive to stay below that amount.
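As a rough sketch of that arithmetic: the function name is made up for illustration, and it assumes every CI job downloads the full repo exactly once, which may not match the actual CI setup.

```python
def ci_bandwidth_gb(repo_mb: float, runs: int, python_versions: int) -> float:
    """Estimate download bandwidth in GB (1 GB = 1000 MB).

    Assumption: each CI run spawns one job per Python version, and each
    job downloads the whole repo once (no caching).
    """
    return repo_mb * runs * python_versions / 1000


estimate = ci_bandwidth_gb(repo_mb=100, runs=100, python_versions=3)
print(f"{estimate:.0f} GB")  # 30 GB
```

Plugging in the actual repo size and the number of CI runs per 30 days shows how close we are to the 50 GB limit.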

My proposal would be the following:

Hmm.. I think that was all I thought about, comments?

tfrederiksen commented 5 months ago

Inspired by what you mentioned in #13,

"If anything, we can partially solve this by creating two PRs, one with the entire content, and one with the reduced content."

what about keeping two branches here, full and minimal? Then one can easily add files to minimal as we go along, if the relevant material already exists in full.

tfrederiksen commented 5 months ago

Maybe energy is better spent by focusing on compressing or replacing the largest files?

$ du -hs tests/
132M    tests/

$ find tests -type f -size +1M -exec ls -lh {} + | awk '{print $5, $9}' | sort -hr
84M tests/sisl/io/tbtrans/1_graphene_all.TBT.nc
8.6M tests/sisl/io/siesta/SrTiO3_noncollinear.PDOS
7.7M tests/sisl/io/siesta/SrTiO3_polarized.PDOS
3.8M tests/sisl/io/siesta/SrTiO3.PDOS
2.8M tests/sisl/io/gulp/FORCE_CONSTANTS_2ND
2.5M tests/sisl/io/vasp/nitric_oxide/soi/CHGCAR.gz
2.4M tests/sisl/io/vasp/graphene_md/OUTCAR
1.7M tests/sisl/io/gulp/zz.gout
1.7M tests/sisl/io/gulp/graphene_8x8.gout
1.6M tests/sisl/io/orca/nitric_oxide/molecule.output
1.2M tests/sisl/io/vasp/nitric_oxide/soi/CHG.gz
1.1M tests/sisl/io/vasp/graphene/LOCPOT
1.1M tests/sisl/io/vasp/graphene/CHGCAR

$ find tests -type f -size +1M -exec stat --format="%s" {} + | awk '{total += $1} END {print total/1024/1024, "MB"}'
119.026 MB

In other words, there are only 13 files larger than 1 MB, and they make up about 90% of the tests/ directory (119 MB of 132 MB).
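The same check could be scripted so it is easy to rerun, e.g. with a small Python sketch like the one below. The function name and the default 1 MiB threshold are only illustrative, and it sums apparent file sizes, so the total will differ slightly from what du reports.

```python
import os


def large_file_share(root: str, threshold_bytes: int = 1024 * 1024):
    """Return (large_files, large_total, total) for all files under root.

    large_files is a list of (size_in_bytes, path) for files strictly
    larger than threshold_bytes, sorted largest first.
    """
    large_files = []
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks, count only regular files
            size = os.path.getsize(path)
            total += size
            if size > threshold_bytes:
                large_files.append((size, path))
    large_files.sort(reverse=True)
    large_total = sum(size for size, _ in large_files)
    return large_files, large_total, total


# "tests" is the directory inspected above; adjust as needed.
files, large_total, total = large_file_share("tests")
for size, path in files:
    print(f"{size / 1024 / 1024:7.1f} MB  {path}")
if total:
    print(f"{len(files)} large files: {100 * large_total / total:.1f}% of total")
```

This makes it simple to see which files to compress or replace first.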

zerothi commented 5 months ago

I think we can do both.

If we have a way of doing things, then we can move older stuff as needed (agreed it would be better sooner than later). If changing them anyways, we might as well fix it :)

I'll add something to the README file to clarify how things should be created. Then we can iterate on that and move things as we go.

tfrederiksen commented 5 months ago

One problem with some of the existing test runs is that the input files are missing or incomplete. It will be a significant effort to redo or tweak these cases. I would therefore argue that we should strive to make all test runs self-contained.

zerothi commented 5 months ago

I have tried to search for the tests, and I think basically everything is pretty easy to get hold of. I will start implementing the full branch once I have a first example. I'll ping you! :) Thanks!

zerothi commented 4 months ago

@tfrederiksen I am slowly making progress here.

Could you be of assistance in doing a complete directory PR for orca with the tests we have? I have never done an ORCA run, so the readme there would clarify some things for me ;)

zerothi commented 1 month ago

This is now completed.

The new README.md contains information on how to add new tests; this will be enforced from now on.