weecology / deepforestr

Mac OS current error #5

Closed henrykironde closed 4 months ago

henrykironde commented 3 years ago
> model = df_model()
Reading config file: /Users/henrysenyondo/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/deepforest/data/deepforest_config.yml
> model$use_release()
Model from DeepForest release https://github.com/weecology/DeepForest/releases/tag/1.0.0 was already downloaded. Loading model from file.
Loading pre-built model: https://github.com/weecology/DeepForest/releases/tag/1.0.0
> 
> annotations_file = get_data("testfile_deepforest.csv")
> model$config$cpus = 1L
> model$config$workers = 1L
> model$config$epochs = 1
> model$config["save-snapshot"] = FALSE
> model$config$train$csv_file = annotations_file
> model$config$train$root_dir = get_data(".")
> 
> model$config$train$fast_dev_run = TRUE
> 
> model$create_trainer()
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Running in fast_dev_run mode: will run a full train, val, test and prediction loop using 1 batch(es).
> model$trainer$fit(model)

  | Name  | Type      | Params
------------------------------------
0 | model | RetinaNet | 32.1 M
------------------------------------
31.9 M    Trainable params
222 K     Non-trainable params
32.1 M    Total params
128.592   Total estimated model params size (MB)
Epoch 0:   0%|          | 0/1 [00:00<00:00, 4152.78it/s]  /Users/henrysenyondo/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:106: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
/Users/henrysenyondo/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:327: UserWarning: The number of training samples (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"
/Users/henrysenyondo/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:382: UserWarning: One of given dataloaders is None and it will be skipped.
  rank_zero_warn("One of given dataloaders is None and it will be skipped.")
[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)

henrykironde commented 3 years ago

Here is more error output, this time from the terminal; the output above was from RStudio.

> model$trainer$fit(model)

  | Name  | Type      | Params
------------------------------------
0 | model | RetinaNet | 32.1 M
------------------------------------
31.9 M    Trainable params
222 K     Non-trainable params
32.1 M    Total params
128.592   Total estimated model params size (MB)
/Users/henrysenyondo/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:327: UserWarning: The number of training samples (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"
/Users/henrysenyondo/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:382: UserWarning: One of given dataloaders is None and it will be skipped.
  rank_zero_warn("One of given dataloaders is None and it will be skipped.")
Epoch 0:   0%|                                                                                                | 0/1 [00:00<00:00, 4782.56it/s][W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
OMP: Error #15: Initializing libiomp5.dylib, but found libomp.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
zsh: abort      R

henrykironde commented 3 years ago

Looks like there is a crash in the libiomp binary:

OMP: Error #15: Initializing libiomp5.dylib, but found libomp.dylib already initialized.

Some references:

  1. https://github.com/dmlc/xgboost/issues/1715
  2. https://stackoverflow.com/questions/53014306/error-15-initializing-libiomp5-dylib-but-found-libiomp5-dylib-already-initial

It worked for me after setting Sys.setenv("KMP_DUPLICATE_LIB_OK"="TRUE"). We have to be careful, though, since libiomp5.dylib and libomp.dylib may give us different results.
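For anyone running DeepForest from Python rather than through reticulate, the same workaround can be sketched as below. This is a minimal illustration, not part of deepforestr or DeepForest: the helper name `allow_duplicate_openmp` is hypothetical, and the OMP hint above describes the variable itself as unsafe and unsupported, so it is best treated as a stopgap.

```python
import os


def allow_duplicate_openmp(env=None):
    """Set KMP_DUPLICATE_LIB_OK=TRUE in env (default: os.environ).

    Returns the previous value (or None) so the caller can restore it later.
    """
    target = os.environ if env is None else env
    previous = target.get("KMP_DUPLICATE_LIB_OK")
    target["KMP_DUPLICATE_LIB_OK"] = "TRUE"
    return previous


# Call this before importing torch or deepforest, so the variable is
# already set when the OpenMP runtimes are loaded.
allow_duplicate_openmp()
```

The ordering matters: the variable is read when the OpenMP runtime initializes, so setting it after the heavy imports is likely too late.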

spono commented 2 years ago

Same OMP issue on Windows 10 when running model = df_model():

OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

How do you suggest following the hint "[The best thing to do is] to ensure that only a single OpenMP runtime is linked into the process"?

Your solution using Sys.setenv("KMP_DUPLICATE_LIB_OK"="TRUE") seems risky for actual use in a production environment, since there is no way to know if and when it may cause issues. Thanks in advance.
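A possible first step toward the "single OpenMP runtime" hint is to see which runtimes an environment ships at all. The sketch below is only an illustration in Python (the language DeepForest itself is written in): the function `find_openmp_runtimes` and the prefix list are hypothetical, not part of DeepForest or reticulate, and the list of runtime names is not exhaustive. It matches files by name only, so it shows what is installed, not what is actually loaded into the process.

```python
from collections import defaultdict
from pathlib import Path

# File-name prefixes of common OpenMP runtimes (illustrative, not exhaustive):
# Intel (libiomp5), LLVM (libomp), and GNU (libgomp).
OPENMP_PREFIXES = ("libiomp5", "libomp", "libgomp")


def find_openmp_runtimes(root):
    """Return {prefix: [paths]} for files under root named like an OpenMP runtime."""
    found = defaultdict(list)
    for path in Path(root).rglob("lib*"):
        if not path.is_file():
            continue
        for prefix in OPENMP_PREFIXES:
            if path.name.startswith(prefix):
                found[prefix].append(str(path))
                break
    return {prefix: sorted(paths) for prefix, paths in found.items()}
```

Calling `find_openmp_runtimes(sys.prefix)` from the Python interpreter inside the r-reticulate environment would list the candidates there. More than one prefix in the result, or several copies under one prefix (as the Windows message above suggests), points at the packages worth inspecting.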

ethanwhite commented 1 year ago

I've now fixed the OMP issue via a change in the installation instructions that removes the mkl package, which was causing this issue (commit e10c158cfa846b2b683d8595336eb04219c9960b).

Can someone using macOS follow the new installation instructions and see if the rest of the issues reported here remain? I'm still seeing training issues on Windows, but things now work properly for predicting from the release model.

mirandateats commented 1 year ago

Using macOS, I ran into the following issues during installation:

  1. reticulate::conda_remove('r-reticulate', packages = 'mkl') returned the following:

"+ '~/Library/r-miniconda/bin/conda' 'remove' '--yes' '--name' 'r-reticulate' 'mkl'
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed

PackagesNotFoundError: The following packages are missing from the target environment:

Error: Error 1 occurred removing conda environment r-reticulate"

After this error, I continued with the installation anyway.

  2. I had to run install.packages('devtools') (not included in the installation code) before running devtools::install_github('weecology/deepforestr').

  3. It seems that any code including the df_model() function crashes RStudio. Examples that have caused a crash: model <- df_model() and deepforestr::df_model()

ethanwhite commented 1 year ago

Thanks for the report @mirandateats! Unfortunately we've had ongoing stability issues with reticulate (which is how we run the core Python package from within R) on non-Linux systems. We'll keep trying to address those issues, but at the moment my recommendation is to do the core DeepForest work using the Python package directly and then import the results to R for further analysis and visualization.

ethanwhite commented 11 months ago

@mirandateats - it looks like some of the upstream issues have been resolved now and I have things running properly on Windows 10. Can you try a fresh install and let me know if you're still running into issues?

ethanwhite commented 11 months ago

@spono - after some upstream fixes everything seems to be working on Windows now. Can you try a fresh install and then see if the test code below runs?

library(deepforestr)

model = df_model()
model$use_release()

annotations_file = get_data("testfile_deepforest.csv")

model$config$train$csv_file = annotations_file
model$config$train$root_dir = get_data(".")

model$create_trainer()
model$trainer$fit(model)

ethanwhite commented 11 months ago

@henrykironde - can you test again on macOS, since our upstream issues seem to be resolved now (at least on Windows)?

robAndrus34 commented 6 months ago

@henrykironde and @ethanwhite - I'm curious if you've resolved this issue. I ran into the same problem on macOS yesterday. After a basic install according to the directions on the website, RStudio crashed when I ran model = df_model().

Thank you.

ethanwhite commented 6 months ago

Thanks for the report @robAndrus34! We haven't managed to reproduce this locally, in part because we don't have many Macs in the lab. If you have time to work with us on debugging on macOS we'd be happy to do that. If you need to get something up and running quickly, then it's pretty easy to do in Python even if you don't do much Python work. Let us know which direction you'd like to go and we'll be happy to help.

robAndrus34 commented 6 months ago

Thanks @ethanwhite. I decided to go the Python route for now. At some future date, I may be interested in troubleshooting the R issue. Thanks.

ethanwhite commented 6 months ago

Sounds good @robAndrus34 - let us know if you have any questions as you get things up and running in Python

ethanwhite commented 6 months ago

This failure is now reflected in our failing macOS tests, which may help us explore this further.

ethanwhite commented 4 months ago

Tests are now passing for macOS on non-M1 chips, and everything is working in local tests on Linux and Windows, including RStudio. Therefore I'm going to go ahead and close this issue. Please open a new issue with detailed information if you run into further problems.