zake7749 / DeepToxic

Top 1% solution to the Toxic Comment Classification Challenge on Kaggle.
MIT License

Please state in which files the preprocessing is performed #1

Closed astro6026 closed 5 years ago

astro6026 commented 5 years ago

Please state in which files the preprocessing is performed. I am a newbie, so I was curious how you actually ran the classifier. Could you give some detail about the files: in which file the POS tagging is done, and the location of Riad's dataset for spelling correction? The whole preprocessing part is a bit difficult to understand, as I cannot work out the sequence in which the files should be run.

zake7749 commented 5 years ago

Hi, astro6026,

Firstly, Riad's work is not included in this repository. If you're interested in the details of his dataset, such as the spelling correction, you have to ask him on Kaggle directly.

As for preprocessing, the main work is in clean_data.ipynb. Before running this notebook, you must download the dataset of this competition from Kaggle along with the pre-trained word embeddings I list in my README file.

Besides, you might find that I did not upload the file cleanwords.txt. I guess it was ignored by my git settings, and I'm sorry for the carelessness. You can rebuild this file yourself; the format should be:

typo, corrected word

For example:

wikibreak,wiki break
wikidata,wiki data

This file should be placed in the folder features. By the way, the pre-trained word vectors should also be placed in this folder.
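As a sketch, a file in that `typo, corrected word` format could be loaded and applied like this. Only the format comes from this thread; the function names and the whole-word replacement strategy are illustrative, not the repository's actual code:

```python
import csv
import re

def load_clean_words(path="features/cleanwords.txt"):
    """Load the typo -> correction mapping, one 'typo,corrected word' pair per line."""
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                mapping[row[0].strip()] = row[1].strip()
    return mapping

def apply_clean_words(text, mapping):
    """Replace every whole-word typo in `text` with its correction."""
    tokens = re.findall(r"\w+|\W+", text)  # split into word / non-word runs
    return "".join(mapping.get(tok, tok) for tok in tokens)

print(apply_clean_words("wikidata entry", {"wikidata": "wiki data"}))  # -> "wiki data entry"
```

Splitting on word boundaries (rather than a plain `str.replace`) avoids corrupting words that merely contain a typo as a substring.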

If the pre-processing works properly, the cleaned dataset will be generated in the folder Dataset, and then you can run train_cleaned_word_level.ipynb to train our models.

POS tagging was written in dirty_pos.ipynb. As you can see, the code and structure there are quite messy, and I have not yet had time to refactor this part. Because we tried this idea at a very late stage, the solution is difficult to understand, and I don't recommend reusing our POS strategy.

I am currently busy with personal affairs, so I cannot recheck this project before this weekend. If you still have problems at that time, please let me know.

astro6026 commented 5 years ago

Thanks, I hope I will be able to understand it.

astro6026 commented 5 years ago

I wanted to ask one more thing: what is the sequence of files I should run to rebuild the project locally? I also wanted to know how you evaluated the results of each model. Did you upload the final prediction CSV files to Kaggle each time?

zake7749 commented 5 years ago

The steps are as follows:

  1. Download the dataset and pre-trained word embeddings, and put them into the corresponding directories.
  2. Execute clean_data.ipynb.
  3. Execute train_cleaned_word_level.ipynb. The predictions will be generated under the path submit_path_prefix.

Please note that predictions generated by train_cleaned_word_level.ipynb are based on the LAST checkpoint rather than the BEST checkpoint.

I evaluate the performance of my models by cross-validation.

astro6026 commented 5 years ago

I have a request: could you please upload the clean text file? Also, could you give a little explanation of the get_av_rnn function used in the model_zoo file, and of the AttentionWeight class in the same file?

zake7749 commented 5 years ago

What's the clean text file? Is it cleanwords.txt or the cleaned dataset?

If you're talking about the former, you can create your own mapping arbitrarily (just follow the format I mentioned before), since this file is not a key part of our solution; you can reach the same performance without it. If you're talking about the latter, I cannot upload that dataset due to the restrictions of the competition rules.

For the details of get_av_rnn, please check our solution. As for AttentionWeight, it is basically a weighted-sum pooling that estimates the importance of the hidden output at each timestep by applying a dot product with a trainable weight matrix.
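That weighted-sum idea can be sketched in plain numpy as follows. This is an illustration of the mechanism, not the repository's AttentionWeight code; the weight vector `w` stands in for the trainable weight matrix:

```python
import numpy as np

def attention_pool(hidden, w):
    """Weighted-sum pooling over timesteps.

    hidden: (timesteps, units) RNN outputs
    w:      (units,) trainable weight vector (random here, learned in a real model)
    """
    scores = hidden @ w                      # one dot-product score per timestep
    scores = np.exp(scores - scores.max())   # numerically stable softmax...
    alpha = scores / scores.sum()            # ...giving attention weights that sum to 1
    return alpha @ hidden                    # (units,) weighted sum over timesteps

rng = np.random.default_rng(0)
h = rng.standard_normal((10, 4))   # 10 timesteps, 4 hidden units
w = rng.standard_normal(4)
pooled = attention_pool(h, w)      # shape (4,)
```

With a zero weight vector all timesteps score equally, so the result reduces to plain average pooling over timesteps.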

astro6026 commented 5 years ago

I want to ask a few more things. In the Medium article you wrote that "my number of feature maps is ranged from 100~400". Firstly, how did you determine it? Secondly, what led you to the get_av_rnn model? Thirdly, can some optimization be done to find the hyperparameters, for example with meta-heuristic algorithms like Genetic Algorithms or Particle Swarm Optimization? If yes, then where?

astro6026 commented 5 years ago

I am trying to understand only the word-level part of the AV_RNN model. How are the features extracted from the text? Does it have to do with the word embeddings? Please help.

zake7749 commented 5 years ago

  1. I determined the hyper-parameters by evaluating model performance with cross-validation.
  2. av stands for all views. I think summarizing the hidden outputs with multiple pooling methods makes the model more generalized.
  3. Sure, there are many methods for hyper-parameter selection, but I did not use any of them in this competition.
  4. I don't quite understand your question. I used the pre-trained word vectors and aggregated them with an RNN.

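The "all views" idea in point 2 can be sketched in numpy as concatenating several poolings of the same RNN output. This is a simplification for intuition, not the actual get_av_rnn implementation:

```python
import numpy as np

def all_views_pool(hidden):
    """Summarize (timesteps, units) RNN outputs from several 'views' at once."""
    max_view = hidden.max(axis=0)    # global max pooling over timesteps
    avg_view = hidden.mean(axis=0)   # global average pooling over timesteps
    last_view = hidden[-1]           # last hidden state
    return np.concatenate([max_view, avg_view, last_view])

h = np.arange(12, dtype=float).reshape(3, 4)  # 3 timesteps, 4 units
summary = all_views_pool(h)                   # shape (12,) = 3 views x 4 units
```

Each view captures a different summary of the sequence, so the classifier downstream sees peaks, averages, and the final state simultaneously.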
astro6026 commented 5 years ago

I meant: how did you determine the size of the feature set, and where in the code have you extracted it? As you wrote in the Medium article, "my number of feature maps is ranged from 100~400".

zake7749 commented 5 years ago

As I said, I determined hyper-parameters by cross-validation.

The term feature maps refers to the outputs of the filters. It is just a hyper-parameter of a CNN.

astro6026 commented 5 years ago

Can you tell me about the GlobalMaxPooling1D, GlobalAveragePooling1D, and SpatialDropout1D layers? Actually, I can't figure out the form in which the data travels between layers.

zake7749 commented 5 years ago

I would recommend you read the Keras documentation, which has detailed descriptions of all layers.
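For quick intuition on the shapes in question, here is a numpy mock-up (not Keras itself, and SpatialDropout1D additionally rescales kept values at training time, which is omitted here): the data between these layers is a (batch, timesteps, features) tensor, the global poolings collapse the timestep axis, and SpatialDropout1D zeroes whole feature channels rather than individual values:

```python
import numpy as np

x = np.ones((2, 5, 8))                 # (batch, timesteps, features)

global_max = x.max(axis=1)             # GlobalMaxPooling1D     -> (2, 8)
global_avg = x.mean(axis=1)            # GlobalAveragePooling1D -> (2, 8)

# SpatialDropout1D: one keep/drop decision per feature channel,
# broadcast identically across all timesteps
rng = np.random.default_rng(0)
keep = rng.random((2, 1, 8)) > 0.5     # (batch, 1, features) mask
spatial_dropped = x * keep             # still (2, 5, 8); dropped channels are all zero
```

Dropping whole channels is useful after an embedding layer, where adjacent timesteps are strongly correlated and element-wise dropout would regularize poorly.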