Structure and what to store

zerothi commented 5 months ago

@tfrederiksen did #13 and this spurred the question of how the files should be organized in this repo.

So far it hasn't been too stringent. But I think we ought to do something now.

While the current repo isn't that big, it just scales very rapidly with the CI runners.

E.g. for a 100 MB repo and CI running 100 times (for 3 Python versions), it will amount to a bandwidth usage of 30 GB. Currently I am paying for 50 GB every 30 days. So we should preferably strive to stay below that amount.
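A rough sketch of that arithmetic, assuming every CI job downloads the whole repo exactly once and that 1 GB = 1000 MB (the function name is hypothetical, not something that exists in this repo):

```python
def ci_bandwidth_gb(repo_size_mb: float, ci_runs: int, python_versions: int) -> float:
    """Estimate the download volume in GB.

    Assumption: each CI run starts one job per Python version, and every
    job fetches the complete repository once.
    """
    total_mb = repo_size_mb * ci_runs * python_versions
    return total_mb / 1000  # decimal units: 1 GB = 1000 MB


estimate = ci_bandwidth_gb(repo_size_mb=100, ci_runs=100, python_versions=3)
print(f"{estimate:.0f} GB out of a 50 GB quota")  # 30 GB out of a 50 GB quota
```

The estimate is linear in all three factors, so shrinking the repo is the only one of them that does not cost CI coverage.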

My proposal would be the following:

Define a very fine structure for the output files:
- each run should have a separate folder `<code>/<test-name>/`
- each folder should have a small document describing what it contains, i.e. system information, which version of the code it was run with, and which additional files are required (and where to locate them)
- for codes that require pseudos (or data files in any form, e.g. DFTB requires SK files), I think we should leave these out. BUT we should explain in the `<test-name>/README.md` file where they can be fetched.
- if a run has subsequent dependent runs, they should be located in sub-directories. Consider transiesta, which requires 2 electrode runs, and then a transiesta + tbtrans run. This would result in a folder structure like this:
```
<test-name>/
<test-name>/elec1/
<test-name>/elec2/
<test-name>/transiesta/
<test-name>/transiesta/tbtrans
```
which ensures coherence.
- each folder should only contain files that are to be tested against.
- every time a new file is added to an existing `<test-name>`, the README.md should be updated to reflect when it was added. This may be important since the new file may have been created with a different version of the code, so the version should be listed there as well.
- preferably, to ensure that only needed files are added, a PR against this repo and one against sisl should be made simultaneously so they can easily be cross-referenced; otherwise we can't see which files were used.
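
The README rule in the proposal above could be checked automatically. The following is only a sketch under assumptions: it takes the layout to be exactly `<code>/<test-name>/README.md` directly below some root, and `missing_readmes` as well as the directory names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path


def missing_readmes(root: Path) -> list[str]:
    """Return '<code>/<test-name>' entries below root that lack a README.md.

    Assumption: code folders sit directly under root and test folders
    directly under each code folder. Hidden folders (e.g. .git) are skipped.
    """
    missing = []
    for code_dir in sorted(root.iterdir()):
        if not code_dir.is_dir() or code_dir.name.startswith("."):
            continue
        for test_dir in sorted(code_dir.iterdir()):
            if not test_dir.is_dir() or test_dir.name.startswith("."):
                continue
            if not (test_dir / "README.md").is_file():
                missing.append(f"{code_dir.name}/{test_dir.name}")
    return missing


# Demo on a throwaway tree; "siesta", "h2o" and "graphene" are placeholder names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "siesta" / "h2o").mkdir(parents=True)
    (root / "siesta" / "h2o" / "README.md").write_text("placeholder\n")
    (root / "siesta" / "graphene").mkdir(parents=True)
    demo_missing = missing_readmes(root)

print(demo_missing)  # ['siesta/graphene']
```

Sub-directories for dependent runs (such as `elec1/`) are deliberately not checked here, since the proposal only asks for a README per `<test-name>`.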

Hmm.. I think that was all I thought about, comments?

tfrederiksen commented 5 months ago

Inspired by what you mentioned in #13,

> If anything, we can partially solve this by creating two PR's, one with the entire content, and one with the reduced content.

what about keeping two branches here, `full` and `minimal`? Then one can easily add files to `minimal` as we go along, if relevant material already exists in `full`.

tfrederiksen commented 5 months ago

Maybe energy is better spent by focusing on compressing or replacing the largest files?

```
$ du -hs tests/
132M    tests/

$ find tests -type f -size +1M -exec ls -lh {} + | awk '{print $5, $9}' | sort -hr
84M tests/sisl/io/tbtrans/1_graphene_all.TBT.nc
8.6M tests/sisl/io/siesta/SrTiO3_noncollinear.PDOS
7.7M tests/sisl/io/siesta/SrTiO3_polarized.PDOS
3.8M tests/sisl/io/siesta/SrTiO3.PDOS
2.8M tests/sisl/io/gulp/FORCE_CONSTANTS_2ND
2.5M tests/sisl/io/vasp/nitric_oxide/soi/CHGCAR.gz
2.4M tests/sisl/io/vasp/graphene_md/OUTCAR
1.7M tests/sisl/io/gulp/zz.gout
1.7M tests/sisl/io/gulp/graphene_8x8.gout
1.6M tests/sisl/io/orca/nitric_oxide/molecule.output
1.2M tests/sisl/io/vasp/nitric_oxide/soi/CHG.gz
1.1M tests/sisl/io/vasp/graphene/LOCPOT
1.1M tests/sisl/io/vasp/graphene/CHGCAR

$ find tests -type f -size +1M -exec stat --format="%s" {} + | awk '{total += $1} END {print total/1024/1024, "MB"}'
119.026 MB
```

In other words, there are only 13 files larger than 1MB and they make up more than 90% of the whole repo.
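
For anyone without GNU `find`/`stat` at hand, roughly the same tally can be sketched in Python. This is an illustration, not a script from the repo: `large_file_share` is a hypothetical name, and it sums apparent file sizes, so the total will not match `du` exactly (which counts disk blocks and directories).

```python
import tempfile
from pathlib import Path

ONE_MIB = 1024 * 1024


def large_file_share(root: Path, threshold: int = ONE_MIB) -> tuple[int, int, int]:
    """Return (number of large files, their total bytes, total bytes of all files).

    A file counts as large when its size is strictly above threshold,
    mirroring `find -size +1M` for the default of 1 MiB.
    """
    sizes = [p.stat().st_size for p in root.rglob("*") if p.is_file()]
    large = [s for s in sizes if s > threshold]
    return len(large), sum(large), sum(sizes)


# Demo on a throwaway tree with a small threshold so no big files are written.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "big.bin").write_bytes(b"\0" * 5000)
    (root / "small.txt").write_bytes(b"\0" * 500)
    count, large_bytes, total_bytes = large_file_share(root, threshold=1000)

print(count, large_bytes, total_bytes)  # 1 5000 5500
```

Run against `tests/` with the default threshold, the ratio `large_bytes / total_bytes` would be the share discussed above.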

zerothi commented 5 months ago

I think we can do both.

If we have a way of doing things, then we can move older stuff as needed (agreed it would be better sooner than later). If we are changing them anyway, we might as well fix them :)

I'll add something to the README file to clarify how things should be created. Then we can iterate on that, and move things as we go.

tfrederiksen commented 5 months ago

One problem with some of the existing test runs is that the input files are missing or incomplete. It will be a significant effort to redo or tweak these cases. I would therefore argue that we should strive to make all test runs self-contained.

zerothi commented 5 months ago

I have tried to search for the tests, and I think basically everything is pretty easy to get hold of. I will start implementing the `full` branch once I have a first example ready. I'll ping you! :) Thanks!

zerothi commented 4 months ago

@tfrederiksen I am slowly making progress here.

Could you be of assistance in doing a complete directory PR for orca with the tests we have? I have never done an ORCA run, so the readme there would clarify some things for me ;)

zerothi commented 1 month ago

This is now completed.

The new README.md contains information on how to add new tests; this will be enforced from now on.

zerothi / sisl-files

Structure and what to store #14