This project aims to identify whether someone is infected with COVID-19 given that person’s reported symptoms. The authors tried multiple classification models and experimented with different hyperparameters for each. The purpose of the project is to make it easier to identify risks related to COVID-19 in order to guide policy and decision making. The authors use symptoms and positive/negative diagnoses from a database of PCR results provided by the Israeli Ministry of Health.
One thing I like about this project is how clearly and thoroughly the dataset cleaning is described. It’s incredibly important for any data claims to be backed by transparent data, so being able to follow along with this process is very valuable. I also really liked how the report is written: it makes me feel as though I’m working through the thought process alongside the authors. Including next steps is also great practice, so I like the Future Work section as well.
One thing I don’t like about this report is the tables on page 2. They don’t contain any units or any explanation of what the numbers mean beyond the table titles, which is not enough for me to actually understand what’s going on in the project. I gather these are proportions, but it should be made clearer what the numerators and denominators in these calculations were. Furthermore, proportions normally fall between 0 and 1, yet all of these numbers are greater than 1, which confuses me further. Perhaps labelling them as percentages would make more sense, but again, I don’t know what these numbers mean, so I couldn’t say. Assuming they are percentages, none of the columns adds up to 100, so what does the missing chunk of data represent?
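To make this concrete, a clearly stated numerator and denominator would let the tables stand on their own. Here is a minimal sketch of what I mean, with hypothetical column names ("cough", "test_result") rather than the report’s actual ones:

```python
import pandas as pd

# Toy stand-in for the cleaned dataset: one row per patient.
df = pd.DataFrame({
    "cough":       [1, 0, 1, 1, 0, 0],
    "test_result": ["positive", "negative", "positive",
                    "negative", "positive", "negative"],
})

# Numerator: patients in each test-result group who reported the symptom.
# Denominator: all patients in that test-result group.
pct = df.groupby("test_result")["cough"].mean() * 100
print(pct)
# test_result
# negative    33.333333
# positive    66.666667
# Name: cough, dtype: float64
```

With a caption like "percentage of patients in each test-result group reporting the symptom," readers would know exactly what each cell measures.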
Another suggestion is that there is very little danger of overfitting even if you include all of the available features, which appear to number about a dozen, when you have hundreds of thousands of data points. Overfitting mainly becomes a concern when the number of features approaches the number of training examples, at least for models of this complexity. That said, perhaps you could find something interesting if you included the features you left out! It’s very unlikely to hurt.
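If you want to check this empirically, comparing train and validation scores is a quick sanity check: a near-zero gap means there is little overfitting to worry about. This is only a sketch; the synthetic data is a stand-in shaped like the dataset you describe (about a dozen features, hundreds of thousands of rows), and you would swap in your real X and y:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Stand-in data mirroring the shapes in the report: ~200k rows, 12 features.
X, y = make_classification(n_samples=200_000, n_features=12, random_state=0)

scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=5, return_train_score=True)
gap = scores["train_score"].mean() - scores["test_score"].mean()
print(f"train-validation accuracy gap: {gap:.4f}")  # near zero => little overfitting
```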
Also, it would be nice if the graphs on page 3 were big enough for the labels to be legible. And please label your axes, including units, for clarity, as in your graph on page 1.
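For concreteness, this is the kind of labeling I mean; the numbers and symptom names here are placeholders, not values from the report:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))  # large enough for legible labels
ax.bar(["cough", "fever", "headache"], [42.0, 35.5, 18.2])
ax.set_xlabel("Reported symptom")
ax.set_ylabel("Prevalence among PCR-positive patients (%)")  # units in the label
ax.set_title("Symptom prevalence by reported symptom")
plt.tight_layout()
plt.show()
```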