zake7749 / DeepToxic

top 1% solution to toxic comment classification challenge on Kaggle.
MIT License

Question about the code #5

Closed yaxum closed 5 years ago

yaxum commented 5 years ago

Hello!

I am new to Python and NLP. I seem to be missing the input variables for the function "train_folds(train_features, test_features)". Or have I missed something?

Sincerely, Yaxum Cedeno

zake7749 commented 5 years ago

Hi yaxum,

I guess you were talking about LogisticRegression.ipynb. This notebook was for doing experiments at the beginning of the competition and it actually has nothing to do with this project.

The functions train_folds and _train_model are deprecated. I would recommend skipping them and starting from the cells under the title Logistic Regression. If you are curious about the implementation of LogisticRegression with CV, I refer you to this clean and awesome kernel: https://www.kaggle.com/thousandvoices/logistic-regression-with-words-and-char-n-grams.

All the best, Justin Yang

yaxum commented 5 years ago

Hi Zake!

Thank you for the tips! I have had a hard time implementing an NLP process on multiple articles with multiple labels. Besides the link you just gave me, do you have any tips on articles or other code that I can get inspiration from?

Sincerely, Yaxum Cedeno



zake7749 commented 5 years ago

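Sure. I would suggest reading [For Beginners] Tackling Toxic Using Keras (https://www.kaggle.com/sbongo/for-beginners-tackling-toxic-using-keras) and Deep Learning for NLP with Pytorch (https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html).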

yaxum commented 5 years ago

Thank you very much! This helped me a lot.

I took the example from https://www.kaggle.com/thousandvoices/logistic-regression-with-words-and-char-n-grams and tried it out on my data. I have multiple articles with multiple labels (industry, informatics, ...). After 3 hours the model was done. However, the result was a decimal probability for every row and every label, whereas I had thought the output would be binary, with either 0 or 1 per row and label. What am I missing or misunderstanding?

Sincerely, Yaxum Cedeno



zake7749 commented 5 years ago

The outputs are correct since this task is for multi-label classification rather than multi-class classification.
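To make the distinction concrete, here is a minimal sketch in plain Python. It assumes the multi-label setup scores each label with its own independent sigmoid, while a multi-class setup would share one softmax across all labels. The raw scores and the label "other" are hypothetical values chosen for illustration; they do not come from the kernel or from any real data.

```python
import math


def sigmoid(z):
    """Map one raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map raw scores to probabilities that compete and sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for one article over three labels.
labels = ["industry", "informatics", "other"]
scores = [2.0, 1.5, -3.0]

# Multi-label: each label is its own yes/no question, so several
# labels can be likely at once and the values need not sum to 1.
multi_label = [sigmoid(s) for s in scores]

# Multi-class: exactly one label is correct, so the values sum to 1.
multi_class = softmax(scores)

for name, p_label, p_class in zip(labels, multi_label, multi_class):
    print(f"{name:12s} multi-label={p_label:.3f} multi-class={p_class:.3f}")
```

In the multi-label column both "industry" and "informatics" receive a high probability at the same time, which is why the output is one probability per row and label rather than a single chosen class.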

yaxum commented 5 years ago

Sorry, I expressed myself poorly. I had supposed that the output would be either 0 or 1, not a probability. In the end I applied my own cutoff of 0.5, replacing all values below 0.5 with 0 and all values above with 1. Thank you again.
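The cutoff described above can be sketched in a few lines of plain Python. The function name binarize and the sample probabilities are hypothetical, and mapping a value of exactly 0.5 to 1 is an assumption, since the comment only covers values below and above the cutoff.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 decisions.

    Assumption: a value exactly equal to the threshold counts as 1.
    """
    return [
        [1 if p >= threshold else 0 for p in row]
        for row in probabilities
    ]


# Hypothetical model output: one row per article, one column per label.
probs = [
    [0.91, 0.12, 0.50],
    [0.03, 0.77, 0.49],
]

decisions = binarize(probs)
print(decisions)
```

A single cutoff of 0.5 for every label is the simplest choice; with imbalanced labels, a separate threshold per label tuned on validation data may work better.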

