SamTov opened 9 months ago
I like the idea that each of the instances put into the recorder can handle their data independently.
How would it get stored? Would you generate one file for each property or does each instance write into the same file?
How do you want to give them the ability to do that? My suggestion would be to add this to each module at the moment the recorder gets initialized, either via a decorator or via subclassing. How were you planning to handle that?
I would have one database for each recorder which themselves are differentiated based on the data they operate on. For example, a recorder that computes metrics on the training dataset is one hdf5 database, regardless of how many things it is computing. The individual measurements would be given a group in the database as they are now, for example, loss, trace, entropy. This allows us to standardize naming for each of the measurements and not have to deal with annoying naming problems like appending train, test, or whatever to groups inside the database.
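To illustrate the layout described above, here is a minimal sketch; a plain dict stands in for an actual `h5py.File`, and the file and group names are hypothetical:

```python
# Sketch of the proposed layout: one database per recorder, one group per
# measurement, with no "train_"/"test_" prefixes baked into the group names.
# A plain dict stands in for an h5py.File here.
databases = {
    "train_recorder.h5": {   # recorder operating on the training data
        "loss": [],          # one group per measurement, standardized names
        "trace": [],
        "entropy": [],
    },
    "test_recorder.h5": {    # recorder operating on the test data
        "loss": [],          # same group name, different database
        "accuracy": [],
    },
}

# Appending a record targets the recorder's database and the measurement's group.
databases["train_recorder.h5"]["loss"].append(0.42)
```

The point of the sketch is that "loss" means the same thing in every database, so which dataset it refers to is encoded by the recorder, not by the group name.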
My implementation approach would be to make a `Measurement` parent class which has certain properties like a `group_name` (the name of the measurement in the hdf5 database) and a `data_shape` (the shape of the data it returns), and methods like `update` which can be called to update the database if it is passed to a recorder. When you create the recorder, it would build the group structure by using the data shapes and names of the measurements. I would avoid decorators as they, in general, are super messy and don't handle complex operations well. The parent class idea is neat as it would also standardize how people add new measurements to the codebase. In the example above, all of those modules would inherit from a `Measurement` class, whereas something like `SimpleTraining` would not.
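A minimal sketch of that parent class; all names here are illustrative, not existing ZnNL code, and a dict stands in for the hdf5 database:

```python
class Measurement:
    """Sketch of the proposed parent class for all measurements."""

    group_name: str          # name of this measurement's group in the database
    data_shape: tuple        # shape of the data the measurement returns

    def apply(self, state):
        """Compute the measurement from the current training state."""
        raise NotImplementedError

    def update(self, database: dict, state) -> None:
        """Append the current value to this measurement's group."""
        database.setdefault(self.group_name, []).append(self.apply(state))


class Loss(Measurement):
    """Example subclass: records the scalar loss."""

    group_name = "loss"
    data_shape = (1,)

    def apply(self, state):
        return state["loss"]
```

A recorder could then build its group structure directly from the `group_name` and `data_shape` attributes of the measurements it is given.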
I like the idea. How would you avoid re-computations of e.g. the NTK or predictions?
Right, requirements are tough. There are a few options, one of which is to introduce the concept of a requirement as an attribute of the measurement classes which is checked elsewhere. For example, the entropy and trace would both depend on the existence of the NTK. We could build some kind of computation graph and have that computed first. But I think we should approach this once we have started implementing it, so we can see how the code works and what the best way to do it is. The dependency checks should come in the recorder, though.
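One way such a requirements attribute could look, with the recorder resolving shared quantities up front so that, e.g., the NTK is only computed once per record step. All class and function names here are hypothetical:

```python
class Measurement:
    """Sketch: each measurement declares which shared quantities it needs."""

    requirements: tuple = ()

    def apply(self, state, resources):
        raise NotImplementedError


class Trace(Measurement):
    requirements = ("ntk",)

    def apply(self, state, resources):
        ntk = resources["ntk"]
        return sum(ntk[i][i] for i in range(len(ntk)))


class Entropy(Measurement):
    requirements = ("ntk",)

    def apply(self, state, resources):
        # placeholder; the real entropy would use the NTK spectrum
        return len(resources["ntk"])


def record(measurements, state, providers):
    """Resolve requirements first, then run every measurement.

    Each required resource (e.g. the NTK) is computed exactly once,
    even if several measurements depend on it.
    """
    needed = {name for m in measurements for name in m.requirements}
    resources = {name: providers[name](state) for name in needed}
    return {m.__class__.__name__: m.apply(state, resources) for m in measurements}
```

A full computation graph would only be needed once requirements themselves have requirements; for a single shared layer like the NTK, this flat resolution in the recorder is enough.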
I see. We can discuss that in detail then. How about checkpointing? I would argue now is also the time to build checkpointing into the recorders, for example by creating a checkpointing recorder. Or would you try handling that outside of the recorders?
Just as a side remark, I would suggest trying to handle data internally as pytrees and writing load and dump methods to communicate with the hdf5 database, since then everything is smoothly jittable. What do you think about that?
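A sketch of what such dump/load round-tripping could look like. Here a nested dict stands in for both the pytree and the hdf5 file; in practice `jax.tree_util.tree_flatten` would flatten the pytree and `h5py` would hold the leaves, so everything below is an assumption, not existing code:

```python
def flatten(tree, prefix=""):
    """Flatten a nested dict of leaves into {"a/b/c": leaf} paths,
    mirroring how pytree leaves could map onto hdf5 dataset paths."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat):
    """Rebuild the nested dict (pytree) from the flat path -> leaf mapping."""
    tree = {}
    for path, value in flat.items():
        node = tree
        *parents, leaf = path.split("/")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree
```

Since the in-memory side stays a pytree the whole time, the jitted computation never has to know the hdf5 layout; only the dump/load boundary does.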
I want this as an issue here so we can all comment on it and consider what the options are moving forward regarding the recorders.
My thought for the recorders was to make all modules in ZnNL that can generate data or metrics also be able to create an hdf5 group and store their record in it. This would essentially allow one to have the following API for recorders:
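The original snippet is not shown in this thread, so the following is only a hypothetical illustration of that API, with a dict standing in for the hdf5 database:

```python
class Loss:
    """Hypothetical measurement module that can record itself."""

    group_name = "loss"

    def record(self, state):
        return state["loss"]


class Recorder:
    """Hypothetical recorder: accepts any object exposing a group_name and
    a record() method, gives each one its own group, and delegates the
    actual value computation to the object itself."""

    def __init__(self, items):
        self.items = items
        self.database = {item.group_name: [] for item in items}  # stand-in for hdf5 groups

    def update(self, state):
        for item in self.items:
            self.database[item.group_name].append(item.record(state))


# Usage during training, e.g. once per epoch:
recorder = Recorder(items=[Loss()])
recorder.update({"loss": 0.3})
```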
In this way, users can pass anything to the recorders and that class will know how to handle the db update when it is called.
Is there any other approach that you had in mind for this problem? One alternative, for example, would be to build something like TensorBoard or MLflow into the backend of the whole software and call hooks during epochs. This isn't trivial from a compression point of view, but it is still possible.