weecology / MillionTrees

The MillionTreesBenchmark
https://milliontrees.idtrees.org/
GNU General Public License v3.0

Make docs/index.rst dynamically represent the size of the datasets #11

Closed: bw4sz closed this issue 1 week ago

bw4sz commented 3 weeks ago

The MillionTrees Read the Docs site should dynamically reflect the size of the datasets. For example, the front landing page could read something like:

The MillionTrees Benchmark for Airborne Tree Prediction
=======================================================

The MillionTrees seeks to collect a million tree locations to create a global benchmark for machine learning models for airborne tree prediction. Machine learning models need large amounts of data to generate realistic predictions. Existing benchmarks often have small amounts of data, often less than 10,000 trees, from single geographic locations and resolutions. The MillionTrees will cover a range of backgrounds, taxa, focal views and resolutions. To make this possible, we need your help!

.. figure:: public/open_drone_example.png
  :alt: Image Placeholder
  :width: 50%

Current Status
--------------

There are currently 3 datasets available for the MillionTrees benchmark:

* TreeBoxes: A dataset of X tree crowns from y sources

* TreePolygons: A dataset of X tree crowns from y sources

* TreePoints: A dataset of X tree crowns from y sources

Here X and y would be filled in from the current state of each dataset. We could generate these with a pre-commit GitHub Action or an IPython notebook, or it might be possible for the .rst file to read a substitution.
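One way the substitution option could work is sketched below. This is only an illustration, not the project's implementation: the helper `format_substitutions` and the substitution names (`|TreeBoxes_count|`, `|TreeBoxes_sources|`) are hypothetical. The TreeBoxes numbers are the row count and the number of distinct sources reported for `official.csv` later in this issue.

```python
def format_substitutions(stats):
    """Return reST substitution definitions for each dataset.

    `stats` maps a dataset name to (n_annotations, n_sources).
    The substitution names produced here are hypothetical.
    """
    lines = []
    for name, (n_annotations, n_sources) in sorted(stats.items()):
        lines.append(f".. |{name}_count| replace:: {n_annotations:,}")
        lines.append(f".. |{name}_sources| replace:: {n_sources}")
    return "\n".join(lines) + "\n"


# TreeBoxes: 282,288 rows from 9 sources, as reported in this issue.
example_stats = {"TreeBoxes": (282288, 9)}
rst_epilog = format_substitutions(example_stats)
print(rst_epilog)
```

If the resulting string were assigned to Sphinx's `rst_epilog` setting in `conf.py`, a line in `index.rst` could then read `* TreeBoxes: A dataset of |TreeBoxes_count| tree crowns from |TreeBoxes_sources| sources`, and the numbers would be refreshed on each docs build.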

The current .csv files are:

* /orange/ewhite/DeepForest/MillionTrees/TreeBoxes_v0.0/official.csv
* /orange/ewhite/DeepForest/MillionTrees/TreePoints_v0.0/official.csv
* /orange/ewhite/DeepForest/MillionTrees/TreePolygons_v0.0/official.csv

We could attach these to releases to keep track of them, or we could read them directly from HiPerGator.

>>> import pandas as pd
>>> boxes = pd.read_csv("/orange/ewhite/DeepForest/MillionTrees/TreeBoxes_v0.0/official.csv")
>>> boxes.shape
(282288, 7)
>>> boxes.source.value_counts()
source
Radogoshi et al. 2021            101837
Weecology_University_Florida      93126
Sun et al. 2022                   31228
World Resources Institute         22820
Velasquez-Camacho et al. 2023     14772
NEON_benchmark                     6633
Reiersen et al. 2022               4663
Kwon et al. 2023                   3827
Zamboni et al. 2021                3382
Name: count, dtype: int64
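The same summary could be produced without pandas, for example inside a small script run by a GitHub Action. The sketch below is a rough illustration under stated assumptions: the `summarize` helper is hypothetical, the in-memory sample stands in for a real `official.csv`, and the `filename` column is assumed (only the `source` column is confirmed by the output above).

```python
import csv
import io
from collections import Counter


def summarize(csv_file):
    """Count annotation rows per source in an open CSV file."""
    reader = csv.DictReader(csv_file)
    per_source = Counter(row["source"] for row in reader)
    total = sum(per_source.values())
    return total, per_source


# Tiny in-memory stand-in for official.csv; the real files have more
# columns, and `filename` here is an assumed column name.
sample = io.StringIO(
    "filename,source\n"
    "a.tif,NEON_benchmark\n"
    "a.tif,NEON_benchmark\n"
    "b.tif,Kwon et al. 2023\n"
)
total, per_source = summarize(sample)

# Fill in the X and y placeholders from the proposed landing page text.
print(f"* TreeBoxes: A dataset of {total:,} tree crowns from {len(per_source)} sources")
```

For a real file, `summarize` would be called on `open(path, newline="")` for each of the three `official.csv` paths, and the formatted lines written into the docs before the build.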