weecology / DeepTreeAttention

Implementation of Hang et al. 2020 "Hyperspectral Image Classification with Attention Aided CNNs" for tree species prediction
MIT License

main branch on hipergator #145

Closed mgwein closed 2 years ago

mgwein commented 2 years ago

DeepTreeAttention_22026095.out.txt DeepTreeAttention_22026095.err.txt

Do we need CHM.py? There are issues running start_cluster.py (I'm told it's an outdated method for HiPerGator).

To generate the attached log files, this block was commented out in train.py:

```python
#Generate new data or use previous run
# if config["use_data_commit"]:
#     print("***********************************************************************************************************************************************************")
#     config["crop_dir"] = os.path.join(config["data_dir"], config["use_data_commit"])
#     client = None    
# else:
#     crop_dir = os.path.join(config["data_dir"], comet_logger.experiment.get_key())
#     os.mkdir(crop_dir)
#     client = start_cluster.start(cpus=50, mem_size="4GB")    
#     config["crop_dir"] = crop_dir
```
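For reference, the branch above can be sketched as a small standalone function. `resolve_crop_dir`, the temp-dir paths, and the stubbed-out cluster start are all illustrative names, not the repo's actual API:

```python
import os
import tempfile

def resolve_crop_dir(config, experiment_key, start_cluster=None):
    """Reuse a previous data commit's crop dir, or create a fresh one."""
    if config.get("use_data_commit"):
        # Reuse crops generated by a previous run; no dask cluster needed
        config["crop_dir"] = os.path.join(config["data_dir"], config["use_data_commit"])
        return None
    # Fresh run: make a new crop dir keyed by the experiment, start a cluster
    crop_dir = os.path.join(config["data_dir"], experiment_key)
    os.mkdir(crop_dir)
    config["crop_dir"] = crop_dir
    return start_cluster() if start_cluster else None

config = {"data_dir": tempfile.mkdtemp(),
          "use_data_commit": "103c9c5aa9394f4a9b8b7c95ed1b171b"}
resolve_crop_dir(config, experiment_key="abc123")
print(os.path.basename(config["crop_dir"]))  # → 103c9c5aa9394f4a9b8b7c95ed1b171b
```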
mgwein commented 2 years ago

DeepTreeAttention_22011007.out.txt DeepTreeAttention_22011007.err.txt

### Uncommenting the code in the comment above yields the following error in .err:

```
Task exception was never retrieved
future: <Task finished name='Task-65' coro=<_wrap_awaitable() done, defined at /home/mgwein/.conda/envs/DeepTreeAttention/lib/python3.9/asyncio/tasks.py:681> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 1\nCommand:\nsbatch /scratch/local/22011007/tmpuhl8k9z6.sh\nstdout:\n\nstderr:\nsbatch: error: Batch job submission failed: Invalid account or account/partition combination specified\n\n')>
Traceback (most recent call last):
  File "/home/mgwein/.conda/envs/DeepTreeAttention/lib/python3.9/asyncio/tasks.py", line 688, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home/mgwein/.conda/envs/DeepTreeAttention/lib/python3.9/site-packages/distributed/deploy/spec.py", line 59, in _
    await self.start()
  File "/home/mgwein/.conda/envs/DeepTreeAttention/lib/python3.9/site-packages/dask_jobqueue/core.py", line 325, in start
    out = await self._submit_job(fn)
  File "/home/mgwein/.conda/envs/DeepTreeAttention/lib/python3.9/site-packages/dask_jobqueue/core.py", line 308, in _submit_job
    return self._call(shlex.split(self.submit_command) + [script_filename])
  File "/home/mgwein/.conda/envs/DeepTreeAttention/lib/python3.9/site-packages/dask_jobqueue/core.py", line 403, in _call
    raise RuntimeError(
RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
sbatch /scratch/local/22011007/tmpuhl8k9z6.sh
stdout:

stderr:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
```
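The "Invalid account or account/partition combination" message is SLURM rejecting the `--account=ewhite` / `-p hpg2-compute` pair in the generated job script, which the submitting user evidently cannot use. dask-jobqueue takes these values from the `SLURMCluster` constructor or from a jobqueue YAML override; a hypothetical override might look like the following (key names follow dask-jobqueue's SLURM section, but the values are placeholders that must match an allocation your HiPerGator user can actually submit to):

```yaml
# ~/.config/dask/jobqueue.yaml (hypothetical override; placeholder values)
jobqueue:
  slurm:
    queue: hpg2-compute      # becomes "#SBATCH -p ..."
    project: your-account    # becomes "#SBATCH --account=..."
    cores: 1
    memory: 4GB
    walltime: '24:00:00'
```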

### and the following output in .out:

```bash
#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -p hpg2-compute
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH -t 24:00:00
#SBATCH --error=/orange/idtrees-collab/logs/dask-worker-%j.err
#SBATCH --account=ewhite
#SBATCH --output=/orange/idtrees-collab/logs/dask-worker-%j.out

/home/mgwein/.conda/envs/DeepTreeAttention/bin/python -m distributed.cli.dask_worker tcp://10.13.164.106:41657 --nthreads 1 --memory-limit 3.73GiB --name dummy-name --nanny --death-timeout 300 --local-directory /orange/idtrees-collab/tmp/ --resources cpu=1 --protocol tcp://
```

```
To tunnel into dask dashboard:
For GPU dashboard: ssh -N -L 8787:c45a-s5.ufhpc:8787 -l b.weinstein hpg2.rc.ufl.edu
For CPU dashboard: ssh -N -L 8781:c45a-s5.ufhpc:8781 -l b.weinstein hpg2.rc.ufl.edu
SERC_clark
...
DSNY_townsend
plotID 100_contrib raised: Cannot find CHM path for [ 404169.66624124 3285334.88387813  404199.00012306 3285355.83554369] from plot ['100_contrib'] in lookup_pool: No matches for geoindex 404000_3285000 in sensor pool with bounds [ 404169.66624124 3285334.88387813  404199.00012306 3285355.83554369]
...
plotID OSBS_townsend_contrib_8 raised: Cannot find CHM path for [ 406105.16756718 3288153.11311048  406105.16756718 3288153.11311048] from plot ['OSBS_townsend_contrib_8'] in lookup_pool: No matches for geoindex 406000_3288000 in sensor pool with bounds [ 406105.16756718 3288153.11311048  406105.16756718 3288153.11311048]
```
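The geoindex in these messages (e.g. `404000_3285000`) appears to be the plot's lower-left corner floored to a 1 km tile. A minimal sketch of that derivation, assuming this tiling convention (the function name is hypothetical, not the repo's API):

```python
def bounds_to_geoindex(bounds):
    """Floor the lower-left corner of (xmin, ymin, xmax, ymax) to a 1 km tile key."""
    xmin, ymin, _, _ = bounds
    return "{}_{}".format(int(xmin // 1000) * 1000, int(ymin // 1000) * 1000)

print(bounds_to_geoindex((404169.66624124, 3285334.88387813,
                          404199.00012306, 3285355.83554369)))
# → 404000_3285000
```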
mgwein commented 2 years ago

train.csv

Here is the updated train.csv file with missing files removed

bw4sz commented 2 years ago

I reworked the repo's entry point to make runs more independent, so they aren't always looking in the same place for train.csv and test.csv; this way multiple branches can be run simultaneously. When you comment out that block, you are generating data from scratch, and dask-jobqueue is just one of many permissions you don't have for that. This was the whole reason I copied a directory of crops for you and Ritesh. It should be easy enough to mock the directory structure. In your config file you'll see:

```yaml
#Crop generation, whether to make a new dataset and customize which parts to recreate
#Checkout data artifact from comet
use_data_commit: 103c9c5aa9394f4a9b8b7c95ed1b171b

#Make new dataset
data_dir: /blue/ewhite/b.weinstein/DeepTreeAttention/
```

I just copied over that commit folder for you and you should be good to go.

```bash
cp -r c75914da262947709f47c0f1e328f845 /orange/idtrees-collab/DeepTreeAttention/crops/
```

so set your config to

```yaml
#Crop generation, whether to make a new dataset and customize which parts to recreate
#Checkout data artifact from comet
use_data_commit: c75914da262947709f47c0f1e328f845

#Make new dataset
data_dir: /orange/idtrees-collab/DeepTreeAttention/crops/
```

In the folder c75914da262947709f47c0f1e328f845 you will find the train.csv, test.csv and all the crops.

mgwein commented 2 years ago

DeepTreeAttention_23371042.out.txt DeepTreeAttention_23371042.err.txt

### Surely this is a minor oversight; the error file reads:

```
Traceback (most recent call last):
  File "/blue/azare/mgwein/DeepTreeAttention/DeepTreeAttention/src/neon_paths.py", line 48, in find_sensor_path
    year_match = match[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/blue/azare/mgwein/DeepTreeAttention/DeepTreeAttention/train.py", line 101, in <module>
    results = m.evaluate_crowns(
  File "/blue/azare/mgwein/DeepTreeAttention/DeepTreeAttention/src/main.py", line 373, in evaluate_crowns
    results, features = self.predict_dataloader(
  File "/blue/azare/mgwein/DeepTreeAttention/DeepTreeAttention/src/main.py", line 338, in predict_dataloader
    img_path = neon_paths.find_sensor_path(lookup_pool=rgb_pool, bounds=geom.bounds)
  File "/blue/azare/mgwein/DeepTreeAttention/DeepTreeAttention/src/neon_paths.py", line 50, in find_sensor_path
    raise ValueError("No matches for geoindex {} in sensor pool with bounds {}".format(geo_index, bounds))
ValueError: No matches for geoindex 404000_3286000 in sensor pool with bounds (404025.292555431, 3286382.9618206667, 404033.49255543103, 3286390.9618206667)
COMET INFO: -----------------------------------
COMET INFO: Comet.ml ExistingExperiment Summary
COMET INFO: -----------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/mgwein/deeptreeattention/98ab8e260830423098d9feac8007f969
```
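The chained traceback shows `match[0]` raising IndexError on an empty match list, which `find_sensor_path` then converts into the more informative ValueError. A minimal sketch of that pattern, with the pool filtering stubbed out (this substring match is illustrative, not the repo's real matching logic):

```python
def find_sensor_path(lookup_pool, geo_index):
    """Return the first tile path whose name contains the geoindex key."""
    match = [path for path in lookup_pool if geo_index in path]
    try:
        return match[0]  # IndexError here when nothing matched...
    except IndexError:
        # ...surfaced as a ValueError naming the missing geoindex
        raise ValueError(
            "No matches for geoindex {} in sensor pool".format(geo_index))
```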

### Almost there!!

bw4sz commented 2 years ago

This is unrelated; I'll fix it tomorrow. The `rgb_pool` is `config["rgb_sensor_pool"]` in the .yml; it's set to

```
/orange/ewhite/NeonData/*/DP3.30010.001/**/Camera/**/*.tif
```

which you don't have permissions for; it is the enormous raw archive. I've been thinking I want to refactor this code anyway: it exists just to make a plot and should be its own function. As a quick workaround, pass `experiment=None` into `predict_dataloader` within `evaluate_crowns`,

here: https://github.com/weecology/DeepTreeAttention/blob/a83348adc8071e65243b35053b8be29b755bc827/src/main.py#L378

bw4sz commented 2 years ago

Pull upstream; I have refactored.

In train.py you can turn off

```python
visualize.rgb_plots(
    df=results,
    config=config,
    test_crowns=data_module.crowns,
    test_points=data_module.canopy_points,
    experiment=comet_logger.experiment)
```

that should do it. Close when ready.