Closed: jessicarose00 closed this pull request 4 months ago
The YAML and config files look great --

In the plantations data class -- e.g. the nested `for x in range(subsample):` / `for i in indices:` loops, and lines like `X_train_tmp = copy.deepcopy(self.X_train)` -- it looks to me like you store six copies of the data in this class; why can't they be overwritten or done in place?
Two general points: a) wherever you call `np.zeros`, please make sure to specify `np.float32`, as it will default to `np.float64`, which is not needed -- this goes for many of your functions; and b) as above, I think your code is creating many copies where you could do things in place, without incrementing the ref count of parent objects, which will help prevent future memory leaks.
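To illustrate the dtype default (the shape here is arbitrary):

```python
import numpy as np

a = np.zeros((1024, 1024))                    # defaults to float64: 8 MiB
b = np.zeros((1024, 1024), dtype=np.float32)  # float32: 4 MiB, half the memory
print(a.nbytes, b.nbytes)  # 8388608 4194304
```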
E.g. `med_ard = np.median(varied_median, axis=0).astype(np.float32)` makes a duplicate of `varied_median`, then saves the median to a new array as `np.float64`, then casts it to `np.float32`. If the original array is `np.float32`, this takes 3x as much RAM and increments the ref count of `varied_median`, which does not get used again but now will not be cleared by the GC, since its child is returned. This version uses 3x less RAM and does not have the ref-count problem: `varied_median = np.median(np.float32(varied_median), axis=0, overwrite_input=True)`.
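To make the allocation difference concrete, here is a minimal sketch of the two patterns. The array shape and the integer input dtype are illustrative, not taken from the project; note that it is integer (and float64) input that `np.median` promotes to a `float64` result:

```python
import numpy as np

# Illustrative ARD-like stack of integer bands.
varied_median = np.random.randint(0, 10000, size=(24, 256, 256), dtype=np.int16)

# Flagged pattern: np.median copies the input to partition it, allocates a
# float64 result, and .astype() then allocates a third, float32 array.
med_ard = np.median(varied_median, axis=0).astype(np.float32)

# Suggested pattern: cast once to float32, let np.median partition that
# temporary in place (overwrite_input=True), and get a float32 result back.
med_ard = np.median(np.float32(varied_median), axis=0, overwrite_input=True)
print(med_ard.dtype)  # float32
```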
Thanks for the review! To respond to your comments:
I will review and update the areas you've flagged where duplicative copies are being made.
I've made the following updates to `src/features/create_xy.py`:

- Removed `np.empty()` calls and replaced them with `np.zeros()`
- Updated `np.zeros()` calls to specify `np.float32`
- Updated `scale_X_arrays()` and `filter_features()` in the `PlantationsData` class to modify arrays in place rather than creating redundant copies (see the sketch after this list)

Other additional edits:

- Moved `src/features/create_xy.py` to `src/legacy/load_raw_stack_data.py`
- Updated `params.yaml` accordingly
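For the in-place change, here is a minimal sketch of the pattern. The class and method names come from this PR, but the body, array shapes, and the standardization details are a hypothetical illustration, not the project's actual code:

```python
import numpy as np

class PlantationsData:
    """Container for train/test feature arrays (illustrative subset)."""

    def __init__(self, X_train: np.ndarray, X_test: np.ndarray):
        self.X_train = X_train
        self.X_test = X_test

    def scale_X_arrays(self) -> None:
        """Standardize features in place instead of allocating copies."""
        mean = self.X_train.mean(axis=0, dtype=np.float32)
        std = self.X_train.std(axis=0, dtype=np.float32) + np.float32(1e-8)
        for X in (self.X_train, self.X_test):
            # Augmented assignment writes into the existing float32 buffers,
            # so no duplicate of X_train / X_test is created.
            X -= mean
            X /= std

data = PlantationsData(
    np.random.rand(100, 13).astype(np.float32),
    np.random.rand(20, 13).astype(np.float32),
)
data.scale_X_arrays()
print(data.X_train.mean(axis=0).round(6))  # ~0 for every feature
```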
This PR migrates the code base to an improved method for versioning data and models using DVC. DVC enables us to better track and save data and ML models, create and switch between versions of those data and models, and compare metrics across experiments. This PR restructures all scripts associated with model training and modifies those used for deployment.
General Changes
The repository structure is modified to include a new training pipeline with 5 stages:

1. `stage_load_data`
2. `stage_prep_features`
3. `stage_select_and_tune`
4. `stage_train_model`
5. `stage_evaluate_model`
Parameters are extracted to a dedicated configuration file (`params.yaml`), and reusable code is migrated into separate modules for each stage. The experiment pipeline is automated with DVC and can be reproduced according to changes in the `params.yaml` file. The `dvc.yaml` file configures the machine learning workflow by listing the dependencies and outputs of each stage. The `README.md` is updated to document the new repository structure.
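For reference, a minimal sketch of what the first two stage entries in `dvc.yaml` can look like; only the stage names come from this PR, while the commands, dependency paths, parameter keys, and output directories are hypothetical:

```yaml
stages:
  stage_load_data:
    cmd: python src/stages/load_data.py
    deps:
      - src/stages/load_data.py
    params:
      - data
    outs:
      - data/raw
  stage_prep_features:
    cmd: python src/stages/prep_features.py
    deps:
      - src/stages/prep_features.py
      - data/raw
    params:
      - features
    outs:
      - data/features
```

With stages chained through `deps`/`outs` like this, `dvc repro` re-executes only the stages whose dependencies or parameters have changed.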
Method Changes

In addition to the restructuring of the code base, a few methodological changes are incorporated into this PR and listed below.

- `min_data_in_leaf` hyperparameter

Specific Requests
- @JohnMBrandt could you please review and confirm any edits required to the use of a random seed in the data split, here.
- @JohnMBrandt could you please review and confirm the transition to a subset median for ARD, here.
- @rlrognstad could you please review and confirm the use of CatBoost feature importance to replace a SHAP explainer, here.

Otherwise, general review and comments are more than welcome!
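On the last point, a sketch of the CatBoost-native alternative to a SHAP explainer; the model settings and data below are placeholders, not the project's actual training setup:

```python
import numpy as np
from catboost import CatBoostRegressor

# Placeholder data standing in for the project's feature matrix.
X = np.random.rand(200, 5).astype(np.float32)
y = np.random.rand(200).astype(np.float32)

model = CatBoostRegressor(iterations=50, verbose=False)
model.fit(X, y)

# get_feature_importance() defaults to PredictionValuesChange, which is
# computed from the trained trees and is much cheaper than evaluating
# SHAP values for every sample.
print(model.get_feature_importance())
```

(CatBoost can also return per-sample SHAP values through the same method with `type='ShapValues'`, should a direct comparison ever be needed.)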