Open matthewturk opened 4 years ago
Related (and ancient) issue: #823
You are absolutely right - very relevant and appropriate.
Random thought: pandas.DataFrame has a .describe method that I think does a similar job. Using this name would have some benefits over print_stats.
+1000 on .describe! Also, I've sketched out some overview ideas in a PR in the widgyts repo.
On Sat, Aug 8, 2020, 12:12 PM Clément Robert notifications@github.com wrote:
Random thought: pandas.DataFrame has a .describe method that I think does a similar job. Using this name would have some benefits over print_stats:
- very discoverable to users coming from pandas
- reduces the overhead when you're learning both libraries over a short time
- it doesn't imply that a print statement is involved, which leaves room for... Jupyter widget integration!! (if that's ever going to be useful)
Note on our private discussion about this earlier today:
A different but related improvement would be to change Dataset.__repr__ from

    def __repr__(self):
        return self.basename

to something more useful, such as

    def __repr__(self):
        return f"{self.__class__}: {self.basename}\n{self.field_list}"

The only problem is that it may break downstream code where instances of Dataset are directly fed into formatted strings. A possible workaround would be to take advantage of the fact that __repr__ is used in string formatting only as a fallback when __str__ isn't implemented, so we would only need to rename our current __repr__ to __str__ (keeping the old output for formatted strings) and define the new __repr__ next to it, and that'd do the trick.
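A minimal sketch of that workaround, using a toy stand-in class rather than yt's actual Dataset (the name `ToyDataset`, the sample basename, and the hard-coded field list are assumptions for illustration only). It shows f-strings picking up `__str__` while `repr()` gets the richer output:

```python
class ToyDataset:
    """Toy stand-in for a dataset class; not yt's actual implementation."""

    def __init__(self, basename, field_list):
        self.basename = basename
        self.field_list = field_list

    def __str__(self):
        # Old output, kept for code that interpolates datasets into strings.
        return self.basename

    def __repr__(self):
        # Richer output for interactive inspection.
        return f"{type(self).__name__}: {self.basename}\n{self.field_list}"


ds = ToyDataset("galaxy0030", [("gas", "density")])
formatted = f"output_{ds}.png"  # format() falls through to __str__
inspected = repr(ds)            # what a REPL or notebook cell would show
```

One caveat worth keeping in mind: containers format their elements with `repr()`, so something like `f"{[ds]}"` would still show the new, longer output.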
Citing @matthewturk in #4202
Computing the min, max, variance and mean of a field should be accessible through a single call, much like it is in pandas. In pandas, numerical columns output:

    count mean std min max

Since most of these can be computed in a single pass, it would be useful to do the same in yt. As an example, right now this would require calling .mean(), .max(), .min(), and .std() on a data object; min and max are both calculated in the same pass so this would be reusable, but we could batch the whole thing.
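To illustrate the "single pass" point, here is a rough sketch in plain Python. The helper `describe_values` is hypothetical and not part of yt; it uses Welford's online algorithm for the variance and reports the sample standard deviation (ddof=1), which is the pandas default. A real implementation would also have to combine partial results across chunks and handle weights, which this does not attempt.

```python
import math


def describe_values(values):
    """Return count, mean, std, min and max from one pass over `values`.

    Hypothetical helper for illustration; not a yt function.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    vmin = math.inf
    vmax = -math.inf
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # Welford update
        vmin = min(vmin, x)
        vmax = max(vmax, x)
    if count == 0:
        raise ValueError("cannot describe an empty sequence")
    # Sample standard deviation; undefined for a single value.
    std = math.sqrt(m2 / (count - 1)) if count > 1 else math.nan
    return {"count": count, "mean": mean, "std": std, "min": vmin, "max": vmax}


stats = describe_values([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```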
It would be nice if we could have a base class Dataset implementation of something like summary that gives some good info, and then dispatch to the subclasses to supplement

I'm thinking a nice way to do that would be to have two methods:

    class Dataset:
        def _create_summary(self) -> str:
            ...
            return summary

        def summary(self) -> None:
            print(self._create_summary())

The summary method itself, which has the hard-to-extend side effect of printing, wouldn't need to be extended, and child classes would have the freedom to either append to the base summary or rewrite parts of it:

    class DatasetA(Dataset):
        def _create_summary(self) -> str:
            base = super()._create_summary()
            return base + "\nI'm DatasetA"

    class DatasetB(Dataset):
        def _create_summary(self) -> str:
            base = super()._create_summary()
            return base.replace("cell", "butterfly")

(of course my examples are meaningless)
We could also hide the details of caching the summary in the summary method to guarantee it is done consistently in all classes (instead of giving this responsibility to implementers of _create_summary).
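The caching idea could look roughly like the sketch below. This is a toy class, not yt's Dataset; the attribute name `_summary_cache`, the placeholder summary text, and the `calls` counter are all assumptions made up for the example.

```python
class Dataset:
    """Toy base class; only the summary machinery is sketched."""

    _summary_cache = None  # hypothetical attribute name

    def _create_summary(self) -> str:
        # Subclasses override or extend this; it never prints or caches.
        return "cells: 128"

    def summary(self) -> None:
        # Caching lives here, so every subclass gets it for free.
        if self._summary_cache is None:
            self._summary_cache = self._create_summary()
        print(self._summary_cache)


class DatasetA(Dataset):
    calls = 0  # counts how often the summary is actually built

    def _create_summary(self) -> str:
        DatasetA.calls += 1
        return super()._create_summary() + "\nI'm DatasetA"


ds = DatasetA()
ds.summary()
ds.summary()  # served from the cache; _create_summary is not called again
```

Because subclasses only ever touch `_create_summary`, none of them can forget to cache or cache inconsistently.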
hey! I was just thinking about all this not knowing an issue existed!
In my case I was thinking about how to separate the generation and the display of summary info (so that I could embed the info in a yt-napari QTWidget without having to parse the existing formatted strings).
After reading through the thread now, I have some thoughts!
It seems to me that there are two related but separate needs here:
1. A summary method for dataset-wide statistics. This may involve index traversal as in the current print_stats, but probably does not need to read field data. It can likely be calculated once and cached at the dataset level.
2. A describe method for generating summary statistics of field data. If we want to follow pandas for inspiration here, we'd want similar 'include' and 'exclude' arguments to allow lists of fields or field types to be specified. And as @matthewturk pointed out in one comment, we want this to operate on data objects, so similar to other yt functions we could have a data_source argument (that defaults to all_data()).

Is that a fair characterization?
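To make the describe idea a bit more concrete, here is a rough sketch of include/exclude filtering that returns structured data instead of a formatted string, so a widget could render it without parsing. Everything here is hypothetical: the function signature is only loosely modelled on pandas, and a plain dict keyed by (field_type, field_name) tuples stands in for a yt data object.

```python
import statistics


def describe(fields, include=None, exclude=None):
    """Return {field: stats} for the selected fields.

    Hypothetical signature for illustration; not a yt function.
    `fields` maps (field_type, field_name) tuples to sequences of numbers.
    `include` and `exclude` accept field tuples or bare field types.
    """

    def matches(field, selectors):
        # A selector matches either the full field tuple or its field type.
        return field in selectors or field[0] in selectors

    result = {}
    for field, values in fields.items():
        if include is not None and not matches(field, include):
            continue
        if exclude is not None and matches(field, exclude):
            continue
        result[field] = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "min": min(values),
            "max": max(values),
        }
    return result


data = {
    ("gas", "density"): [1.0, 2.0, 3.0],
    ("gas", "temperature"): [10.0, 30.0],
    ("io", "particle_mass"): [5.0, 5.0],
}
gas_only = describe(data, include=["gas"], exclude=[("gas", "temperature")])
```

A printing front end could then be a thin wrapper that formats the returned dict, which keeps generation and display separate.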
In playing around with implementing 1., I actually ended up very close to @neutrinoceros's mock-up... but I don't have anything to share quite yet...
Has there been any other related progress in yt or widgyts since this was last brought up?
To my knowledge, there haven't been any steps toward it. I think I like both of these points.
The 4.2.0 release is "imminent", so I'll remove this from the milestone.
OK, but for 4.3 we need to discuss what are blockers instead of sticking to a time-based schedule. Or, we need to (as a community) decide on a time-based schedule.
When yt only did AMR data, it had print_stats(). We don't have this kind of functionality that much anymore. It would be nice if we could have a base class Dataset implementation of something like summary that gives some good info, and then dispatch to the subclasses to supplement. For instance, for an AMR dataset this might be the "level stats" we used to do. For particle data, it might be the particle counts, maybe something about the EWAH index filling fraction, etc.