yyd007 / Computational-Research-Class


Review - Kanyao #7

Open khan1792 opened 6 years ago

khan1792 commented 6 years ago

Cool idea and it can be very useful for public policy! I also have some questions and suggestions about this project.

  1. Judging from the confusion matrix, I think the classes in the response variable are imbalanced. You should report the class proportions so that we can better evaluate the prediction results. Unless the precision for every class is very high, reporting the proportions is necessary.

  2. Your datasets cover a long time span, especially the Chicago crime data. You need to be very cautious when using this kind of data for prediction. If crime patterns changed over time, and especially if the changes did not follow a consistent pattern, using the whole dataset for prediction will heavily reduce the accuracy. There are two possible solutions. First, you can identify how the pattern changes over time, create variables that reflect it, and include them in models fitted on the whole dataset. Second, you can use only recent data (for example, 2015-2018) to build the model. Even with this solution, it helps to make sure the period you choose does not include a large change in pattern.

  3. You randomly split the data into two sets. However, since your data is a time series, a better way is to take past data (for example, 2015-2017) and split it into training and validation sets. Once you know the best parameters, use them to build a model on the whole 2015-2017 dataset (no split). Finally, test the final model using the 2018 data as the test set. This method is called an out-of-sample test, and it guards better against overfitting.

  4. If the class imbalance influences the predictions, you still need to balance the classes in your response and then adjust the results after prediction.

  5. The predictive results are actually a bit disappointing because of their low precision. You can evaluate them with the difference (recall of class i) - (proportion of class i), which shows more precisely how much your models improve the accuracy for each class. If the results are still disappointing after you try the methods mentioned above, I would suggest using some parametric models, such as fixed effects models. They can be more useful for policy making.

  6. The layout of the poster is well organized overall. A minor suggestion: you might remove the description of the three algorithms and say more about the data manipulation steps that lead to better predictive results.
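
To make point 3 more concrete, here is a minimal sketch of a time-based split in plain Python. It assumes each record carries a date. The function names `temporal_split` and `holdout_last_fraction`, the toy records, and the choice to validate on the most recent quarter of the 2015-2017 data are my own assumptions for illustration, not something taken from your poster.

```python
from datetime import date


def temporal_split(records, get_year, dev_years, test_years):
    """Split records by calendar year instead of at random."""
    dev = [r for r in records if get_year(r) in dev_years]
    test = [r for r in records if get_year(r) in test_years]
    return dev, test


def holdout_last_fraction(dev, get_date, valid_fraction=0.25):
    """Inside the development period, validate on the most recent slice."""
    ordered = sorted(dev, key=get_date)
    cut = int(len(ordered) * (1 - valid_fraction))
    return ordered[:cut], ordered[cut:]


# Toy records as (date, crime type); the field layout is hypothetical.
records = [
    (date(2015, 3, 1), "theft"),
    (date(2016, 7, 9), "battery"),
    (date(2017, 1, 20), "theft"),
    (date(2017, 11, 2), "assault"),
    (date(2018, 5, 5), "theft"),
    (date(2018, 9, 30), "battery"),
]

# 2015-2017 is used for tuning; 2018 is kept aside as the out-of-sample test.
dev, test = temporal_split(
    records, lambda r: r[0].year, {2015, 2016, 2017}, {2018}
)
train, valid = holdout_last_fraction(dev, lambda r: r[0])

# Workflow: tune parameters on train/valid, refit on all of `dev`
# with the chosen parameters, then score once on `test`.
print(len(train), len(valid), len(test))
```

The point of the design is that no 2018 record can influence parameter tuning, so the 2018 score reflects how the model would do on data it has never seen.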

I believe your research will produce very good results if you consider more data manipulation methods!
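
P.S. For points 1 and 5, a small sketch of how the class proportions and the quantity (recall of class i) - (proportion of class i) could be computed together. The reasoning is that a model guessing classes at random according to their frequencies would have an expected recall for class i equal to the proportion of class i, so the difference shows what the model adds. The function name `recall_lift` and the toy labels are made up for illustration.

```python
from collections import Counter


def recall_lift(y_true, y_pred):
    """Return {class: (proportion, recall, recall - proportion)}."""
    n = len(y_true)
    support = Counter(y_true)  # number of true cases per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    result = {}
    for cls, count in support.items():
        proportion = count / n
        recall = hits[cls] / count
        result[cls] = (proportion, recall, recall - proportion)
    return result


# Toy, imbalanced labels (not real results).
y_true = ["theft"] * 6 + ["battery"] * 3 + ["assault"] * 1
y_pred = ["theft"] * 5 + ["battery"] + ["battery"] * 2 + ["theft"] + ["theft"]

for cls, (prop, rec, lift) in recall_lift(y_true, y_pred).items():
    print(f"{cls:8s} proportion={prop:.2f} recall={rec:.2f} lift={lift:+.2f}")
```

A class with a lift near zero or below zero is one where the model does no better than frequency-based guessing, which is easy to miss when only overall accuracy is reported.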