Closed: jessicarose00 closed this pull request 4 months ago
The YAML and config files look great --

In the plantations data class -- e.g. the nested `for x in range(subsample):` / `for i in indices:` loops, and lines like `X_train_tmp = copy.deepcopy(self.X_train)` -- it looks to me like you store six copies of the data in this class; why can't they be overwritten or done in place?
Two general points: a) wherever you call `np.zeros`, please make sure to specify `np.float32`, as it will default to `np.float64`, which is not needed -- this goes for many of your functions; and b) as above, I think your code is creating many copies where you could do things in place, without incrementing the ref count of parent objects, which will help prevent future memory leaks.
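To illustrate the dtype default (the shape here is arbitrary):

```python
import numpy as np

a = np.zeros((1024, 1024))                    # defaults to float64: 8 MiB
b = np.zeros((1024, 1024), dtype=np.float32)  # float32: 4 MiB, half the memory
print(a.nbytes, b.nbytes)  # 8388608 4194304
```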
E.g. `med_ard = np.median(varied_median, axis=0).astype(np.float32)` makes a duplicate of `varied_median`, then saves the median to a new array as `np.float64`, then casts it to `np.float32`. If the original array is `np.float32`, this takes 3x as much RAM and increments the ref count of `varied_median`, which does not get used again but now will not be cleared by the GC, since its child is returned. This version uses 3x less RAM and does not have the ref-count problem: `varied_median = np.median(np.float32(varied_median), axis=0, overwrite_input=True)`.
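To make the allocation difference concrete, here is a minimal sketch of the two patterns. The array shape and the integer input dtype are illustrative, not taken from the project; note that it is integer (and float64) input that `np.median` promotes to a `float64` result:

```python
import numpy as np

# Illustrative ARD-like stack of integer bands.
varied_median = np.random.randint(0, 10000, size=(24, 256, 256), dtype=np.int16)

# Flagged pattern: np.median copies the input to partition it, allocates a
# float64 result, and .astype() then allocates a third, float32 array.
med_ard = np.median(varied_median, axis=0).astype(np.float32)

# Suggested pattern: cast once to float32, let np.median partition that
# temporary in place (overwrite_input=True), and get a float32 result back.
med_ard = np.median(np.float32(varied_median), axis=0, overwrite_input=True)
print(med_ard.dtype)  # float32
```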
Thanks for the review! To respond to your comments:
I will review and update the areas you've flagged where duplicative copies are being made.
I've made the following updates to `src/features/create_xy.py`:

- Removed `np.empty()` calls and replaced them with `np.zeros()`
- Updated `np.zeros()` calls to specify `np.float32`
- Updated `scale_X_arrays()` and `filter_features()` in the `PlantationsData` class to modify arrays in place rather than creating redundant copies (see the sketch after this list)

Other additional edits:

- Moved `src/features/create_xy.py` to `src/legacy/load_raw_stack_data.py`
- Updated `params.yaml` accordingly
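For the in-place change, here is a minimal sketch of the pattern. The class and method names come from this PR, but the body, array shapes, and the standardization details are a hypothetical illustration, not the project's actual code:

```python
import numpy as np

class PlantationsData:
    """Container for train/test feature arrays (illustrative subset)."""

    def __init__(self, X_train: np.ndarray, X_test: np.ndarray):
        self.X_train = X_train
        self.X_test = X_test

    def scale_X_arrays(self) -> None:
        """Standardize features in place instead of allocating copies."""
        mean = self.X_train.mean(axis=0, dtype=np.float32)
        std = self.X_train.std(axis=0, dtype=np.float32) + np.float32(1e-8)
        for X in (self.X_train, self.X_test):
            # Augmented assignment writes into the existing float32 buffers,
            # so no duplicate of X_train / X_test is created.
            X -= mean
            X /= std

data = PlantationsData(
    np.random.rand(100, 13).astype(np.float32),
    np.random.rand(20, 13).astype(np.float32),
)
data.scale_X_arrays()
print(data.X_train.mean(axis=0).round(6))  # ~0 for every feature
```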
This PR migrates the code base to an improved method for versioning data and models using DVC. DVC enables us to better track and save data and ML models, create and switch between versions of those data and models, and compare metrics across experiments. This PR restructures all scripts associated with model training and modifies those used for deployment.
General Changes
The repository structure is modified to include a new training pipeline with 5 stages:

1. `stage_load_data`
2. `stage_prep_features`
3. `stage_select_and_tune`
4. `stage_train_model`
5. `stage_evaluate_model`
Parameters are extracted to a dedicated configuration file (`params.yaml`), and reusable code is migrated into separate modules for each stage. The experiment pipeline is automated with DVC and can be reproduced according to changes in the `params.yaml` file. The `dvc.yaml` file configures the machine learning workflow by listing the dependencies and outputs of each stage. The `README.md` is updated to document the new repository structure.
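For reference, a minimal sketch of what the first two stage entries in `dvc.yaml` can look like; only the stage names come from this PR, while the commands, dependency paths, parameter keys, and output directories are hypothetical:

```yaml
stages:
  stage_load_data:
    cmd: python src/stages/load_data.py
    deps:
      - src/stages/load_data.py
    params:
      - data
    outs:
      - data/raw
  stage_prep_features:
    cmd: python src/stages/prep_features.py
    deps:
      - src/stages/prep_features.py
      - data/raw
    params:
      - features
    outs:
      - data/features
```

With stages chained through `deps`/`outs` like this, `dvc repro` re-executes only the stages whose dependencies or parameters have changed.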
Method Changes

In addition to the restructuring of the code base, a few methodological changes are incorporated into this PR and listed below.

- `min_data_in_leaf` hyperparameter

Specific Requests
- @JohnMBrandt could you please review and confirm any edits required to the use of a random seed in the data split, here.
- @JohnMBrandt could you please review and confirm the transition to a subset median for ARD, here.
- @rlrognstad could you please review and confirm the use of CatBoost feature importance to replace a SHAP explainer, here.

Otherwise, general review and comments are more than welcome!
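On the last point, a sketch of the CatBoost-native alternative to a SHAP explainer; the model settings and data below are placeholders, not the project's actual training setup:

```python
import numpy as np
from catboost import CatBoostRegressor

# Placeholder data standing in for the project's feature matrix.
X = np.random.rand(200, 5).astype(np.float32)
y = np.random.rand(200).astype(np.float32)

model = CatBoostRegressor(iterations=50, verbose=False)
model.fit(X, y)

# get_feature_importance() defaults to PredictionValuesChange, which is
# computed from the trained trees and is much cheaper than evaluating
# SHAP values for every sample.
print(model.get_feature_importance())
```

(CatBoost can also return per-sample SHAP values through the same method with `type='ShapValues'`, should a direct comparison ever be needed.)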