Closed: loopyme closed this issue 4 years ago
As a quick fix, the Bayesian Bins default is set to False. Both proposed features are welcome contributions!
@loopyme as usual, very good feedback and contributions from you.
Since you know the code and understand the data, data types, and data structures (I'm correct here I think, or else let me know), have you also considered a config recommendation system based on the dataset it's given?
This system could be available to both notebook and non-notebook users. Here's what I think the API could look like:
recommended_configs = PandaProfiling().get_recommended_configs_for(dataset)
recommended_configs would be a list of configs to choose from.
Why a list? Because against each config, it could give an estimate of the (min, max, avg) time PP might take to process that data using that config, and then the end user can choose the one they want to use.
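To make the idea concrete, here is a rough sketch of what such a list of recommendations could look like. None of this exists in PP today: the class name RecommendedConfig, the override keys, and every number are made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class RecommendedConfig:
    name: str                                   # e.g. "fast" or "full-detail"
    overrides: Dict[str, object] = field(default_factory=dict)
    est_seconds: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # (min, avg, max)

# What a call like PandaProfiling().get_recommended_configs_for(dataset)
# might hand back (all numbers are illustrative, not measurements):
recommended_configs = [
    RecommendedConfig("fast", {"pool_size": 4, "bayesian_blocks": False},
                      est_seconds=(20.0, 35.0, 60.0)),
    RecommendedConfig("full-detail", {"pool_size": 1, "bayesian_blocks": True},
                      est_seconds=(120.0, 240.0, 600.0)),
]

# The end user inspects the estimates and picks one, e.g. the quickest on average:
chosen = min(recommended_configs, key=lambda c: c.est_seconds[1])
print(chosen.name, chosen.overrides)
```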
Then comes the next question: how do we know this, or find out what to recommend? We can't know it for everyone or every possible machine, so this is the discussable part of the feature.
All of the above could be put on individual coefficients, and the system could tune these coefficients to get closer to the actual observed values. So it learns from usage: the first few times it may be wrong, but then it can correct itself.
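A minimal sketch of that "learns from usage" idea, assuming nothing about PP internals: runtime is estimated as a weighted sum of simple dataset features, and the coefficients are nudged toward each observed runtime. The feature set and update rule here are my assumptions, not an existing implementation.

```python
class RuntimeEstimator:
    """Predicts report runtime from simple dataset features; coefficients are
    nudged toward the observed runtime after every run (normalised LMS step)."""

    def __init__(self, lr=0.1):
        self.coef = {"rows": 1e-5, "cols": 0.05, "bias": 1.0}
        self.lr = lr

    def _features(self, n_rows, n_cols):
        return {"rows": float(n_rows), "cols": float(n_cols), "bias": 1.0}

    def predict(self, n_rows, n_cols):
        f = self._features(n_rows, n_cols)
        return sum(self.coef[k] * f[k] for k in f)

    def update(self, n_rows, n_cols, observed_seconds):
        f = self._features(n_rows, n_cols)
        error = self.predict(n_rows, n_cols) - observed_seconds
        norm = sum(v * v for v in f.values())
        for k in f:                       # move coefficients to shrink the error
            self.coef[k] -= self.lr * error * f[k] / norm

est = RuntimeEstimator()
print(round(est.predict(1_000_000, 30), 1))        # naive first guess
est.update(1_000_000, 30, observed_seconds=95.0)   # learn from an actual run
print(round(est.predict(1_000_000, 30), 1))        # closer after feedback
```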
This might sound like "why do this?", but if we don't start working on it, we will keep making mistakes or doing trial and error with the config to get it right, spending a lot of time and frustrating ourselves.
Trial and error and frustration waste a lot of time, but if we start building this from scratch we can get smarter with performance optimisations, both manual and automatic ones.
I think we can get many of the above values with some level of certainty (and add error buffers for safety), because the steps now occur inside a pipeline-like system, if I'm not mistaken. We can capture the timings generated at each such stage, plus gather hardware metrics and specifications to link to those numbers. Someone else might have other suggestions to improve everything said above.
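For example, a small sketch of the kind of instrumentation meant here (the stage names and the timed_stage helper are assumptions, not part of PP): each stage is wrapped in a timer and the results are stored next to basic hardware facts so they can later be linked to the machine that produced them.

```python
import json
import os
import platform
import time
from contextlib import contextmanager

timings = {
    "hardware": {                       # link the numbers to the machine they came from
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
    },
    "stages": {},
}

@contextmanager
def timed_stage(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings["stages"][name] = time.perf_counter() - start

# Stand-ins for real pipeline stages:
with timed_stage("describe_variables"):
    time.sleep(0.10)
with timed_stage("correlations"):
    time.sleep(0.05)

print(json.dumps(timings, indent=2))
```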
(Happy to discuss this further, as it is not as non-deterministic as it may seem)
Hi, @neomatrix369!
The 'Config Recommendation System' you proposed is very promising, and I've had similar thoughts before, which I called 'auto-config'. The key problem is, just as you mentioned, how do we find out what to recommend? I've also thought about this for a long time and could not find a proper solution until now. You have proposed a nice one, but I think it may still not be quite the right solution.
As far as I know, PP is essentially a cool tool for generating reports, and most of the configuration items are used to describe user needs, not run-time parameters. So from the user's perspective, the config is always fixed for a given demand. For example, if I need correlations between variables and want to use A as the reject threshold, then no matter how much time or memory the computation takes, I will still need them, and the config should not change.
As a result, the recommended_configs strategy may only be applicable to some runtime-related configuration items like pool_size. If we add more run-time control parameters later, maybe it will become a nice move.
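Put differently (an illustrative sketch only; the key names are my own, not PP's config schema): a recommender would leave demand-driven items untouched and only layer runtime-related overrides on top of them.

```python
# Demand-driven items (what the user asked the report to contain) stay fixed;
# a recommender would only ever propose runtime-related overrides on top of them.
user_demand = {"correlations": {"pearson": True}, "reject_threshold": 0.9}
runtime_recommendation = {"pool_size": 4}

effective_config = {**user_demand, **runtime_recommendation}
print(effective_config)
```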
(BTW, I think the root problem with the 'bayesian_blocks' item I mentioned earlier in this issue is that the third-party package implementing this feature does not scale to big data sets.)
I have found two ways which may improve the user experience around config:
The PR and the task-graph work are WIP and currently on hold. I am sorry that this work is stalled, partly because of some mechanism selection and partly because I am occupied with other work related to computational graphs at the moment. Once I have some time, I will continue the previous work.
@loopyme absolutely love the task graph idea. I have been thinking of something similar but using joblib's features; the two could be fused in some form to make an efficient pipeline/job/task execution system. This would be especially helpful with things like PP because of the different params we can or want to change when experimenting to get the various reports we want to generate about our datasets.
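A small sketch of the joblib angle (this is not PP's actual pipeline; the summarize_column helper and cache directory are assumptions): independent per-column work runs in parallel, and expensive results are cached so re-running with a different report config does not recompute everything.

```python
import pandas as pd
from joblib import Memory, Parallel, delayed

memory = Memory("joblib_cache", verbose=0)     # cache expensive results on disk

@memory.cache
def summarize_column(name, values):
    s = pd.Series(values)
    return {"name": name, "n_unique": int(s.nunique()), "n_missing": int(s.isna().sum())}

df = pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", "y"]})

# Independent per-column work runs in parallel; cached results are reused
# if the same column is profiled again under a different report config.
summaries = Parallel(n_jobs=2, prefer="threads")(
    delayed(summarize_column)(col, df[col].tolist()) for col in df.columns
)
print(summaries)
```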
On "how do we find out what to recommend?": interesting that you have meandered around a similar path. This is a very raw idea and needs some PoC and exploratory work before we can nail it, hence a feature-flagged approach will help.
Initially, we will have to collect data, and the right data, both of which could come through iterations. This does not have to come from others: initially it will be from our own setups (machines, environments, etc.). When things mature, we can also get samples from others to help fine-tune the internal model.
I have just put together some ideas after reading your post, so it needs further thinking and experimentation, but I have a feeling the template of the path is more or less fine to walk on.
After reading your task-graph resource, I'm more of the opinion that it's smart optimisation(s) on the pipeline end we might need to make, as opposed to suggesting a single configuration or a list of suitable ones.
I'm still thinking that the system (whatever we call it, recommender or auto-config) can make suggestions/predictions about:
Missing functionality
Configuration is always a big problem for me.
When I was a beginner with PP, I didn't know how to set the various parameters for my data set and my ML case, and the time and memory used were sometimes unacceptable.
Only recently, after reading all of the code and becoming familiar with all the implementations and configuration items, could I choose an efficient configuration. But you can't expect every user to take this approach to learn how to configure their own case.
Some friends of mine always complain about how slow PP is, but I find that PP itself is actually not that slow; it's the constant re-configuration that makes PP slow.
What's worse, some of the default config items become problems when I try to tune performance. For example, here are some test results on running time from performance tests on a dual-core server (the benchmark is to generate HTML reports on some commonly used data sets):
As the table above shows, bayesian_blocks (which defaults to True) takes more than 60% of the time and produces an almost identical histogram on large data sets. What's worse, this problem becomes more serious as the data set grows; on some particular data sets, the ratio even rises to more than 90%. Different data sets should be handled differently to be both fast and effective. Otherwise, user experience and ease of use will be greatly affected, especially for beginners; even complete and detailed documentation is not enough in this case.
In fact, when running on some large data sets, tweaking the config parameters and using parallel scheduling can save about 75%–95% of the time and produce an almost identical report.
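For reference, this is the kind of tweak meant here, heavily hedged: exact option names and their nesting differ between pandas-profiling versions, so treat pool_size and the bayesian-blocks histogram switch below as assumptions to check against your installed version's default config.

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("large_dataset.csv")   # placeholder path

profile = ProfileReport(
    df,
    pool_size=4,                                          # more parallel workers (runtime-only item)
    plot={"histogram": {"bayesian_blocks_bins": False}},  # skip the slow binning step (name may vary by version)
)
profile.to_file("report.html")
```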
So I propose these two features below.
Proposed features
Both of the features are not difficult to implement, but 'Auto config' requires some experience, running some tests, and carefully choosing the strategies and thresholds.
Additional context
Recently, I have been focusing on pipelining the project, performing performance tuning, and fixing related bugs. As a result, I may not be able to implement these two features for now. That's also why I did not send a PR, but left an issue here.