yt-project / yt

Main yt repository
http://yt-project.org

Dataset summary #2737

Open matthewturk opened 4 years ago

matthewturk commented 4 years ago

When yt only did AMR data, it had print_stats(). We don't have this kind of functionality that much anymore.

It would be nice if we could have a base class Dataset implementation of something like summary that gives some good info, and then dispatch to the subclasses to supplement. For instance, for an AMR dataset this might be the "level stats" we used to do. For particle, it might be the particle counts, maybe something about the EWAH index filling fraction, etc.

neutrinoceros commented 4 years ago

Related (and ancient) issue: #823

matthewturk commented 4 years ago

You are absolutely right - very relevant and appropriate.

neutrinoceros commented 4 years ago

Random thought: pandas.DataFrame has a .describe method that I think does a similar job. Using this name would have some benefits over print_stats.

matthewturk commented 4 years ago

+1000 on .describe! Also, I've sketched out some overview ideas in the widgyts repo in a PR.

On Sat, Aug 8, 2020, 12:12 PM Clément Robert wrote:

Random thought: pandas.DataFrame has a .describe method that I think does a similar job. Using this name would have some benefits over print_stats:

  • very discoverable to users coming from pandas
  • reduces the overhead when you're learning both libraries over a short time
  • it doesn't imply that a print statement is involved, which leaves room for... Jupyter widget integration!! (If that's ever going to be useful)


neutrinoceros commented 3 years ago

Note on our private discussion about this earlier today: A different but related improvement would be to change Dataset.__repr__ from

def __repr__(self):
    return self.basename

to something more useful such as

def __repr__(self):
    return f"{self.__class__}: {self.basename}\n{self.field_list}"

The only problem is that it may break downstream code where instances of Dataset are fed directly into formatted strings. A possible workaround relies on the fact that string formatting falls back to __repr__ only when __str__ isn’t implemented: if we move the body of our current __repr__ into a new __str__, formatted strings keep their current output and __repr__ is free to return something richer.
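To make the workaround concrete, here is a minimal sketch using stand-in classes rather than yt's real Dataset. The class names OldDataset and NewDataset, the constructor arguments, and the sample values are assumptions made for illustration only.

```python
class OldDataset:
    """Stand-in for the current behavior: only __repr__ is defined."""

    def __init__(self, basename):
        self.basename = basename

    def __repr__(self):
        return self.basename


class NewDataset:
    """Stand-in for the proposal: the old __repr__ body becomes __str__."""

    def __init__(self, basename, field_list):
        self.basename = basename
        self.field_list = field_list

    def __str__(self):
        # what f-strings, str() and print() will use
        return self.basename

    def __repr__(self):
        # what the interactive prompt and repr() will use
        return f"{type(self).__name__}: {self.basename}\n{self.field_list}"


old = OldDataset("galaxy0030")
new = NewDataset("galaxy0030", [("gas", "density")])

# Plain formatting gives the same text before and after the change.
same_output = f"{old}" == f"{new}"
```

One caveat with this approach: anything that asks for the repr explicitly (an f-string with `!r`, `%r`, or printing a list of datasets) would still pick up the new, longer output.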

neutrinoceros commented 1 year ago

Citing @matthewturk in #4202

Computing the min, max, variance and mean of a field should be accessible through a single call, much like it is in pandas. In pandas, numerical columns output:

count, mean, std, min, max

Since most of these can be computed in a single pass, it would be useful to do the same in yt. As an example, right now this would require calling .mean(), .max(), .min(), and .std() on a data object; min and max are both calculated in the same pass so this would be reusable, but we could batch the whole thing.
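As an illustration of the single-pass idea, here is a small pure-Python sketch that accumulates count, mean, standard deviation, min and max in one loop using Welford's online algorithm. It is not yt code: the names FieldStats and describe_values are hypothetical, it is unweighted, and it assumes the sample standard deviation (n - 1 in the denominator, as pandas reports).

```python
import math
from typing import Iterable, NamedTuple


class FieldStats(NamedTuple):
    count: int
    mean: float
    std: float
    min: float
    max: float


def describe_values(values: Iterable[float]) -> FieldStats:
    """Compute count, mean, std, min and max in a single pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # Welford update
        lo = min(lo, x)
        hi = max(hi, x)
    if count == 0:
        raise ValueError("cannot describe an empty sequence")
    # sample standard deviation; undefined for a single value
    std = math.sqrt(m2 / (count - 1)) if count > 1 else math.nan
    return FieldStats(count, mean, std, lo, hi)


stats = describe_values([1.0, 2.0, 3.0, 4.0])
```

Because the accumulators are updated one value at a time, the same loop could in principle be fed chunk by chunk, which is what makes batching attractive for data that does not fit in memory.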

neutrinoceros commented 1 year ago

It would be nice if we could have a base class Dataset implementation of something like summary that gives some good info, and then dispatch to the subclasses to supplement.

I'm thinking a nice way to do that would be to have two methods

class Dataset:
    def _create_summary(self) -> str:
        ...
        return summary

    def summary(self) -> None:
        print(self._create_summary())

The summary method itself, which has the hard-to-extend side effect of printing, wouldn't need to be extended, and child classes would have the freedom to either override _create_summary entirely or extend the parent's result.

(Of course, my examples are meaningless.)

We could also hide the details of caching the summary in the summary method, to guarantee it is done consistently in all classes (instead of giving this responsibility to implementers of _create_summary).
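A runnable sketch of this two-method design, with caching handled once in summary, might look like the following. Only Dataset, basename, _create_summary and summary come from the thread; the subclasses AMRDataset and ParticleDataset, their constructor arguments, the _summary_cache attribute and the summary text itself are hypothetical and exist only to show the two ways a child class could participate.

```python
class Dataset:
    def __init__(self, basename):
        self.basename = basename
        self._summary_cache = None  # hypothetical cache slot

    def _create_summary(self) -> str:
        # generic information every dataset can provide
        return f"{type(self).__name__}: {self.basename}"

    def summary(self) -> None:
        # caching and printing live here, so subclasses never reimplement them
        if self._summary_cache is None:
            self._summary_cache = self._create_summary()
        print(self._summary_cache)


class AMRDataset(Dataset):  # hypothetical subclass: extends the parent text
    def __init__(self, basename, cells_per_level):
        super().__init__(basename)
        self.cells_per_level = cells_per_level

    def _create_summary(self) -> str:
        base = super()._create_summary()
        levels = [
            f"  level {lvl}: {n} cells"
            for lvl, n in enumerate(self.cells_per_level)
        ]
        return "\n".join([base, *levels])


class ParticleDataset(Dataset):  # hypothetical subclass: replaces the text
    def __init__(self, basename, particle_counts):
        super().__init__(basename)
        self.particle_counts = particle_counts

    def _create_summary(self) -> str:
        total = sum(self.particle_counts.values())
        return f"{self.basename}: {total} particles"


amr = AMRDataset("run0", [8, 64])
amr.summary()
```

Since _create_summary returns a string instead of printing, the same text could also be handed to a widget or a test without capturing standard output.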

chrishavlin commented 1 year ago

Hey! I was just thinking about all this, not knowing an issue existed!

In my case I was thinking about how to separate the generation and the display of summary info (so that I could embed the info in a yt-napari QTWidget without having to parse the existing formatted strings).

After reading through the thread now, I have some thoughts!

It seems to me that there are two related but separate needs here:

  1. A summary method for dataset-wide statistics. This may involve index traversal, as in the current print_stats, but probably does not need to read field data. It can likely be calculated once and cached at the dataset level.
  2. A describe method for generating summary statistics of field data. If we want to follow pandas for inspiration here, we'd want similar 'include' and 'exclude' arguments to allow lists of fields or field types to be specified. And as @matthewturk pointed out in one comment, we want this to operate on data objects, so, similar to other yt functions, we could have a data_source argument (that defaults to all_data()).
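To show how the include and exclude arguments from point 2 could behave, here is a rough sketch that does not touch yt at all. The names describe, include, exclude and data_source come from the comment above; everything else is an assumption: the data source is a plain dict mapping (field type, field name) tuples to lists of numbers, the data_source argument is required here instead of defaulting to all_data(), and the matching rule (a selector matches either a full field tuple or a field type) is just one possible choice.

```python
import math
import statistics
from typing import Dict, Iterable, Mapping, Optional, Sequence, Tuple, Union

FieldName = Tuple[str, str]        # (field type, field name)
Selector = Union[str, FieldName]   # a field type or a full field tuple


def _matches(field: FieldName, selectors: Sequence[Selector]) -> bool:
    # a selector matches the exact field or the field's type
    return any(sel == field or sel == field[0] for sel in selectors)


def describe(
    data_source: Mapping[FieldName, Sequence[float]],
    include: Optional[Iterable[Selector]] = None,
    exclude: Optional[Iterable[Selector]] = None,
) -> Dict[FieldName, Dict[str, float]]:
    """Return per-field summary statistics for the selected fields."""
    include_list = None if include is None else list(include)
    exclude_list = [] if exclude is None else list(exclude)
    result = {}
    for field, values in data_source.items():
        if include_list is not None and not _matches(field, include_list):
            continue
        if _matches(field, exclude_list):
            continue
        result[field] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "std": statistics.stdev(values) if len(values) > 1 else math.nan,
            "min": min(values),
            "max": max(values),
        }
    return result


# toy stand-in for a yt data object
fake_data = {
    ("gas", "density"): [1.0, 3.0],
    ("gas", "temperature"): [10.0, 30.0],
    ("io", "particle_mass"): [2.0, 2.0, 2.0],
}
gas_only = describe(fake_data, include=["gas"], exclude=[("gas", "temperature")])
```

Returning a plain dictionary instead of a formatted string keeps generation separate from display, which is the property needed for embedding the result in a widget.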

Is that a fair characterization?

In playing around with implementing 1., I actually ended up very close to @neutrinoceros's mock-up... but I don't have anything to share quite yet...

Has there been any other related progress in yt or widgyts since this was last brought up?

matthewturk commented 1 year ago

To my knowledge, there haven't been any steps toward it. I think I like both of these points.

neutrinoceros commented 1 year ago

The 4.2.0 release is "imminent", so I'll remove this from the milestone.

matthewturk commented 1 year ago

OK, but for 4.3 we need to discuss which issues are blockers instead of sticking to a time-based schedule. Or, we need to (as a community) decide on a time-based schedule.