Closed: loopyme closed this issue 4 years ago
As a quick fix, the Bayesian Bins default is set to False. Both proposed features are welcome contributions!
@loopyme as usual, very good feedback and contributions from you.
Since you know the code and understand the data, data types, and data structures (I'm correct here I think, or else let me know), have you also considered a config recommendation system based on the dataset it's given?
This system could be available to both notebook and non-notebook users. Here's what I think the API could look like:
recommended_configs = PandaProfiling().get_recommended_configs_for(dataset)
recommended_configs would be a list of configs to choose from.
Why a list? Because against each config, it could give an estimate of the (min, max, avg) time PP might take to process that data using that config, and then the end user can choose the one they want to use.
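To make the idea concrete, here is a rough sketch of what such a list of recommendations could look like. None of this exists in PP today: the class name RecommendedConfig, the override keys, and every number are made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class RecommendedConfig:
    name: str                                   # e.g. "fast" or "full-detail"
    overrides: Dict[str, object] = field(default_factory=dict)
    est_seconds: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # (min, avg, max)

# What a call like PandaProfiling().get_recommended_configs_for(dataset)
# might hand back (all numbers are illustrative, not measurements):
recommended_configs = [
    RecommendedConfig("fast", {"pool_size": 4, "bayesian_blocks": False},
                      est_seconds=(20.0, 35.0, 60.0)),
    RecommendedConfig("full-detail", {"pool_size": 1, "bayesian_blocks": True},
                      est_seconds=(120.0, 240.0, 600.0)),
]

# The end user inspects the estimates and picks one, e.g. the quickest on average:
chosen = min(recommended_configs, key=lambda c: c.est_seconds[1])
print(chosen.name, chosen.overrides)
```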
Then comes the next question: how do we know this, or find out what to recommend? We can't know it for everyone or every possible machine, so this is the discussable part of the feature.
All of the above could be put on individual coefficients, and the system could tune these coefficients to get closer to the actual observed values. So it learns from usage: the first few times it may be wrong, but then it can correct itself.
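A minimal sketch of that "learns from usage" idea, assuming nothing about PP internals: runtime is estimated as a weighted sum of simple dataset features, and the coefficients are nudged toward each observed runtime. The feature set and update rule here are my assumptions, not an existing implementation.

```python
class RuntimeEstimator:
    """Predicts report runtime from simple dataset features; coefficients are
    nudged toward the observed runtime after every run (normalised LMS step)."""

    def __init__(self, lr=0.1):
        self.coef = {"rows": 1e-5, "cols": 0.05, "bias": 1.0}
        self.lr = lr

    def _features(self, n_rows, n_cols):
        return {"rows": float(n_rows), "cols": float(n_cols), "bias": 1.0}

    def predict(self, n_rows, n_cols):
        f = self._features(n_rows, n_cols)
        return sum(self.coef[k] * f[k] for k in f)

    def update(self, n_rows, n_cols, observed_seconds):
        f = self._features(n_rows, n_cols)
        error = self.predict(n_rows, n_cols) - observed_seconds
        norm = sum(v * v for v in f.values())
        for k in f:                       # move coefficients to shrink the error
            self.coef[k] -= self.lr * error * f[k] / norm

est = RuntimeEstimator()
print(round(est.predict(1_000_000, 30), 1))        # naive first guess
est.update(1_000_000, 30, observed_seconds=95.0)   # learn from an actual run
print(round(est.predict(1_000_000, 30), 1))        # closer after feedback
```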
This might sound like "why do this?", but if we don't start working on it, we will keep making mistakes or doing trial and error with the config to get it right, spending a lot of time and frustrating ourselves.
Trial and error and frustration waste a lot of time, but if we start building this from scratch we can get smarter with performance optimisations, both manual and automatic ones.
I think we can get many of the above values with some level of certainty (and add error buffers for safety), because the steps now occur inside a pipeline-like system, if I'm not mistaken. We can capture the timings generated at each such stage, plus gather hardware metrics and specifications to link to those numbers. Someone else might have other suggestions to improve everything said above.
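For example, a small sketch of the kind of instrumentation meant here (the stage names and the timed_stage helper are assumptions, not part of PP): each stage is wrapped in a timer and the results are stored next to basic hardware facts so they can later be linked to the machine that produced them.

```python
import json
import os
import platform
import time
from contextlib import contextmanager

timings = {
    "hardware": {                       # link the numbers to the machine they came from
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
    },
    "stages": {},
}

@contextmanager
def timed_stage(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings["stages"][name] = time.perf_counter() - start

# Stand-ins for real pipeline stages:
with timed_stage("describe_variables"):
    time.sleep(0.10)
with timed_stage("correlations"):
    time.sleep(0.05)

print(json.dumps(timings, indent=2))
```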
(Happy to discuss this further, as it is not as non-deterministic as it may seem)
Hi, @neomatrix369!
The 'Config Recommendation System' you proposed is very promising, and I've had similar thoughts before, which I called 'auto-config'. The key problem is, just as you mentioned, how do we find out what to recommend? I've also thought about this for a long time and could not find a proper solution until now. You have proposed a nice one, but I think it may still not be quite the right solution.
As far as I know, PP is essentially a cool tool for generating reports, and most of the configuration items are used to describe user needs, not run-time parameters. So from the user's perspective, the config is always fixed for a given demand. For example, if I need correlations between variables and want to use A as the reject threshold, then no matter how much time or memory the computation takes, I will still need them, and the config should not change.
As a result, the recommended_configs strategy may only be applicable to some runtime-related configuration items like pool_size. If we add more run-time control parameters later, maybe it will become a nice move.
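Put differently (an illustrative sketch only; the key names are my own, not PP's config schema): a recommender would leave demand-driven items untouched and only layer runtime-related overrides on top of them.

```python
# Demand-driven items (what the user asked the report to contain) stay fixed;
# a recommender would only ever propose runtime-related overrides on top of them.
user_demand = {"correlations": {"pearson": True}, "reject_threshold": 0.9}
runtime_recommendation = {"pool_size": 4}

effective_config = {**user_demand, **runtime_recommendation}
print(effective_config)
```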
(BTW, I think the root problem with the 'bayesian_blocks' item I mentioned earlier in this issue is that the third-party package implementing this feature does not scale to big data sets.)
I have found two ways which may improve the user experience around config:
The PR and the task-graph work are WIP and currently on hold. I am sorry that this work is stalled, partly because of some mechanism selection and partly because I am occupied with other work related to computational graphs at the moment. Once I have some time, I will continue the previous work.
@loopyme absolutely love the task graph idea. I have been thinking of something similar but using joblib's features; the two could be fused in some form to make an efficient pipeline/job/task execution system. This would be especially helpful with things like PP because of the different params we can or want to change when experimenting to get the various reports we want to generate about our datasets.
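A small sketch of the joblib angle (this is not PP's actual pipeline; the summarize_column helper and cache directory are assumptions): independent per-column work runs in parallel, and expensive results are cached so re-running with a different report config does not recompute everything.

```python
import pandas as pd
from joblib import Memory, Parallel, delayed

memory = Memory("joblib_cache", verbose=0)     # cache expensive results on disk

@memory.cache
def summarize_column(name, values):
    s = pd.Series(values)
    return {"name": name, "n_unique": int(s.nunique()), "n_missing": int(s.isna().sum())}

df = pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", "y"]})

# Independent per-column work runs in parallel; cached results are reused
# if the same column is profiled again under a different report config.
summaries = Parallel(n_jobs=2, prefer="threads")(
    delayed(summarize_column)(col, df[col].tolist()) for col in df.columns
)
print(summaries)
```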
On "how do we find out what to recommend?": interesting that you have meandered around a similar path. This is a very raw idea and needs some PoC and exploratory work before we can nail it, hence a feature-flagged approach will help.
Initially, we will have to collect data, and the right data, both of which could come through iterations. This does not have to come from others: initially it will be from our own setups (machines, environments, etc.). When things mature, we can also get samples from others to help fine-tune the internal model.
I have just put together some ideas after reading your post, so it needs further thinking and experimentation, but I have a feeling the template of the path is more or less fine to walk on.
After reading your task-graph resource, I'm more of the opinion that it's smart optimisation(s) on the pipeline end we might need to make, as opposed to suggesting a single configuration or a list of suitable ones.
I'm still thinking that the system (whatever we call it, recommender or auto-config) can make suggestions/predictions about:
Missing functionality
Configuration is always a big problem for me.
When I was a beginner with PP, I didn't know how to set the various parameters for my data set and my ML case, and the time and memory used were sometimes unacceptable.
Only recently, after reading all of the code and becoming familiar with all the implementations and configuration items, could I choose an efficient configuration. But you can't expect every user to take this approach to learn how to configure their own case.
Some friends of mine always complain about how slow PP is, but I find that PP itself is actually not that slow; it's the constant re-configuration that makes PP slow.
What's worse, some of the default config items become problems when I try to tune performance. For example, here are some test results on running time from performance tests on a dual-core server (the benchmark is to generate HTML reports on some commonly used data sets):
As the table above shows, bayesian_blocks (which defaults to True) takes more than 60% of the time and produces an almost identical histogram on large data sets. What's worse, this problem becomes more serious as the data set grows; on some particular data sets, the ratio even rises to more than 90%. Different data sets should be handled differently to be both fast and effective. Otherwise, user experience and ease of use will be greatly affected, especially for beginners; even complete and detailed documentation is not enough in this case.
In fact, when running on some large data sets, tweaking the config parameters and using parallel scheduling can save about 75%–95% of the time and produce an almost identical report.
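For reference, this is the kind of tweak meant here, heavily hedged: exact option names and their nesting differ between pandas-profiling versions, so treat pool_size and the bayesian-blocks histogram switch below as assumptions to check against your installed version's default config.

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("large_dataset.csv")   # placeholder path

profile = ProfileReport(
    df,
    pool_size=4,                                          # more parallel workers (runtime-only item)
    plot={"histogram": {"bayesian_blocks_bins": False}},  # skip the slow binning step (name may vary by version)
)
profile.to_file("report.html")
```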
So I propose these two features below.
Proposed features
Both of the features are not difficult to implement, but 'Auto config' requires some experience, running some tests, and carefully choosing the strategies and thresholds.
Additional context
Recently, I have been focusing on pipelining the project, performing performance tuning, and fixing related bugs. As a result, I may not be able to implement these two features for now. That's also why I did not send a PR, but left an issue here.