text analysis using pandas_profiling

shahanesanket commented 4 years ago

Is your feature request related to a problem? Please describe.

I would like to analyze text fields the same way numeric and categorical fields are analyzed and reported. Before starting on any NLP problem in particular, it would be very helpful and time-saving to have this analysis done in one line of code.

To start with I would like to see:

1. Missing value analysis
2. Text length analysis
   2.1 min, max, average, quantiles
   2.2 frequent words, infrequent words (could include the deepmoji project's tokenizer; it's very robust)
   2.3 word cloud (if it isn't too far-fetched a goal)
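
As a rough illustration of the items above, here is a minimal standard-library sketch. The function name `profile_text_column`, the choice to count empty strings as missing, and the naive whitespace tokenizer are all assumptions made for this example; it does not use the deepmoji tokenizer and is not part of pandas_profiling.

```python
from collections import Counter
from statistics import mean, quantiles


def profile_text_column(values, top_n=2):
    """Summarise a list of strings; None marks a missing value.

    Hypothetical helper for illustration only.
    """
    # Assumption: empty or whitespace-only strings also count as missing.
    present = [v for v in values if v is not None and v.strip()]
    report = {"n_missing": len(values) - len(present)}
    if len(present) < 2:
        # statistics.quantiles needs at least two data points.
        return report

    lengths = [len(v) for v in present]
    # Naive whitespace tokenizer; a sturdier one could be swapped in here.
    words = Counter(w.lower() for v in present for w in v.split())

    report.update(
        min_len=min(lengths),
        max_len=max(lengths),
        mean_len=mean(lengths),
        quartiles=quantiles(lengths, n=4),
        most_common=words.most_common(top_n),
        # Words seen exactly once stand in for "infrequent words".
        seen_once=sorted(w for w, c in words.items() if c == 1),
    )
    return report


sample = ["good product", None, "bad product", "", "good good value"]
report = profile_text_column(sample)
print(report)
```

A word cloud is left out because it needs a plotting dependency; the `most_common` counts are the input such a plot would take.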

Currently, I rely heavily on pandas_profiling, and my only alternative is to do this text analysis manually. I would like to contribute if this is something the maintainers would consider building into the project.

neomatrix369 commented 4 years ago

Hey @shahanesanket, great idea. I have a library underway (see https://bit.ly/better-nlp-launch) and would love to have these features embedded in it. We could then apply them in pandas-profiling or any other library. Let me know what you think, and whether you'd like to collaborate on this with me and others.

shahanesanket commented 4 years ago

Hi @neomatrix369, I would love to contribute.

neomatrix369 commented 4 years ago

How about you take a look at the library and the notebooks/kernels I have published, and give me a shout if you need any help or have questions?

Otherwise, I'll be happy to receive a PR from you. You can also start a discussion on a related topic, and we can split the work between the two of us.

The only way to get started is to start with it!

sbrugman commented 4 years ago

@neomatrix369 @shahanesanket This discussion is out of scope for this repository; please continue it somewhere else (for example at the repository manu suggested above).

A key design decision in the pandas-profiling package is that analyses should be objective, so that they are useful for a broad audience. This means that analyses relying on non-transparent machine learning models are not considered for data profiling.

That being said, we have developed tangled-up-in-unicode to perform objective analysis based on the Unicode Character Database.

Note that you can always compute model-specific predictions, add them to your DataFrame as columns, and analyse those.
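
A minimal sketch of that workaround, assuming pandas is installed. The column names and the `toy_sentiment` function are hypothetical stand-ins for whatever model or feature extractor you actually use; nothing here is part of pandas-profiling itself.

```python
import pandas as pd

df = pd.DataFrame(
    {"review": ["good product", None, "bad product", "good good value"]}
)

# Transparent derived features: character and word counts per row.
# Missing text stays missing (NaN) in the derived columns.
df["review_n_chars"] = df["review"].str.len()
df["review_n_words"] = df["review"].str.split().str.len()


def toy_sentiment(text):
    """Hypothetical stand-in for a model prediction: count of 'good' minus 'bad'."""
    tokens = text.lower().split()
    return tokens.count("good") - tokens.count("bad")


# na_action="ignore" skips missing values instead of passing them to the function.
df["toy_sentiment"] = df["review"].map(toy_sentiment, na_action="ignore")

# The new columns are ordinary numeric data, so any numeric summary applies.
summary = df.describe()
print(summary)

# With pandas-profiling installed, the same frame can be profiled as usual:
# from pandas_profiling import ProfileReport
# ProfileReport(df).to_file("report.html")
```

The derived columns are then reported like any other numeric field, which keeps the model-dependent step outside the profiler.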

neomatrix369 commented 4 years ago

Sorry about that, @sbrugman. My intent was to produce something that would be useful in general and that could also be incorporated into the pandas-profiling library, so it's a win-win for both sides.

I have yet to understand what you mean in the rest of your comment above, but I trust you know what you are talking about, and I'm happy to wait and see this in the pandas-profiling library.

neomatrix369 commented 4 years ago

In response to this issue, I have started working on a basic NLP profiler project:

Kaggle kernel: https://www.kaggle.com/neomatrix369/nlp-profiler-simple-dataset
Utility script: https://www.kaggle.com/neomatrix369/nlp-profiler-class

It's still early days, and hopefully I (or someone else) can integrate it into pandas-profiling. So far the response has been pretty good; many are recognising its potential and purpose.

I'm happy to invite you to continue discussing this on https://github.com/neomatrix369/awesome-ai-ml-dl/issues/45

ydataai / ydata-profiling

text analysis using pandas_profiling #278