whotracksme / whotracks.me

Data from the largest and longest measurement of online tracking.
https://www.ghostery.com/whotracksme
MIT License

How can I get top 10,000 sites? #267

Closed shivamidow closed 11 months ago

shivamidow commented 2 years ago

Hi all, Thanks for this invaluable dataset about web trackers.

I am looking into the dataset to get the top 10,000 sites that appear in the tracker data. The monthly sites.csv for the United States contains only about 6,000 websites, and site_categories.csv contains just 8,115. How and where can I get the full list of the top 10,000 sites?

philipp-classen commented 2 years ago

I'm glad to hear that the data is useful to you! :-)

Indeed, the full data set has around 6,000 sites at the moment. I tried to understand why, and the reason seems to be the limited amount of traffic that we get for the long tail. Also, not every site sends traffic that could be tracking the user. That limits the count as well, so it is hard to tell what theoretical limit to expect.

In the aggregation code, I see the constant of 10,000 being used, and we refer to it in the documentation. But as you observed, in practice this limit will not be reached with the input traffic that the system currently gets.
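To illustrate why an upper limit alone does not guarantee 10,000 entries, here is a minimal sketch. It is not the actual WhoTracksMe aggregation code: the function `top_sites`, its parameters, and the `min_traffic` threshold are hypothetical names that only model the idea of "enough traffic" described above.

```python
def top_sites(traffic_by_site, limit=10_000, min_traffic=1):
    """Return up to `limit` sites, ranked by traffic (highest first).

    Hypothetical sketch: `min_traffic` stands in for whatever relevance
    criteria the real pipeline applies; it is an assumption, not the
    project's actual rule.
    """
    # Keep only sites with enough observed traffic to be considered.
    eligible = {
        site: count
        for site, count in traffic_by_site.items()
        if count >= min_traffic
    }
    # Rank by traffic, then cut off at the limit. If fewer sites are
    # eligible than `limit`, the result is simply shorter.
    ranked = sorted(eligible, key=eligible.get, reverse=True)
    return ranked[:limit]
```

With a limit of 10,000 but only about 6,000 eligible sites, a function like this would return about 6,000 entries, since the limit is a cap rather than a target.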

Conversely, if you come across examples of sites that WhoTracksMe misses but that should be included (and that have enough data to be relevant), we can investigate. It is always good to double-check the data.

philipp-classen commented 1 year ago

We recently found a problem in the aggregation code (in one place, it used a bad default of 6,000 instead of 10,000). Starting with the March data (2023-03/us/sites.csv), the generated file now has 10,000 entries instead of 6,000.

Maybe that also solves this issue?
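One way to check this locally is to count the rows in a downloaded copy of the file. The sketch below assumes the CSV has a single header row and one row per site; the helper name `count_sites` and the local path are illustrative, not part of the project.

```python
import csv
from pathlib import Path


def count_sites(csv_path):
    """Return the number of data rows in a CSV file, excluding the header."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for row in reader if row)  # ignore blank lines


if __name__ == "__main__":
    # Assumed location of a local copy of the March 2023 US data.
    path = Path("2023-03/us/sites.csv")
    if path.exists():
        print(count_sites(path))
```

If the assumptions hold, the printed number should be close to 10,000 for the March 2023 file and close to 6,000 for earlier months.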

philipp-classen commented 11 months ago

Should be addressed. Feel free to reopen if not.