whotracksme / whotracks.me

Data from the largest and longest measurement of online tracking.
https://www.ghostery.com/whotracksme
MIT License

Migrate to trackerdb #315

Closed. philipp-classen closed this issue 11 months ago

philipp-classen commented 1 year ago

https://github.com/ghostery/trackerdb was open-sourced in February 2023.

Currently, the data is checked in here (and requires a manual step to keep it in sync): https://github.com/whotracksme/whotracks.me/blob/master/whotracksme/data/assets/trackerdb.sql

Since the data is all public now, we should instead use it directly from the other repository.

philipp-classen commented 1 year ago

This requires some work: multiple tests break with the converted trackerdb.sql file, and the website cannot be generated.

To try it out, the data can be found on the migrated_trackerdb branch.

philipp-classen commented 1 year ago

Inconsistencies: WhoTracks.me expects every tracker to have at least one domain, but some trackerdb entries have none. For instance, this one defines only cosmetic filters:

https://github.com/ghostery/trackerdb/blob/f7fbd12a1a1e16aea9b92c75999bc69971f9604a/db/patterns/tealium_ads.eno#LL1C1-L1C1

name: Tealium Ads
category: advertising
website_url: https://www.tealium.com/
organization: tealium

--- filters
##.tealium-ad
##.tealiumAdSlot
--- filters

ghostery_id: 4055

Compare it with a working one, which has a domains section:

https://github.com/ghostery/trackerdb/blob/f7fbd12a1a1e16aea9b92c75999bc69971f9604a/db/patterns/tiktok_analytics.eno#LL1C1-L1C1

name: TikTok Analytics
category: site_analytics
website_url: https://analytics.tiktok.com
organization: bytedance_inc

--- domains
analytics.tiktok.com
--- domains

--- filters
||analytics.tiktok.com^$3p
--- filters

ghostery_id: 4050

There may be other inconsistencies besides domains being mandatory.
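
To get a rough list of the affected entries, something like the following could scan the pattern files for a missing or empty domains block. This is only a sketch: the helper names (extract_section, patterns_without_domains) are made up for illustration, it handles only the "--- name" delimiter form shown in the two snippets above, and it is not a full eno parser.

```python
from pathlib import Path


def extract_section(eno_text: str, name: str) -> list[str]:
    """Return the non-empty lines between a pair of '--- <name>' markers.

    Hypothetical helper: an absent or unclosed block is treated as empty.
    """
    marker = f"--- {name}"
    lines: list[str] = []
    inside = False
    for raw in eno_text.splitlines():
        line = raw.strip()
        if line == marker:
            if inside:
                # Second marker closes the block.
                return lines
            inside = True
            continue
        if inside and line:
            lines.append(line)
    return []


def patterns_without_domains(patterns_dir: str) -> list[str]:
    """List the .eno files in patterns_dir that declare no domains."""
    missing = []
    for path in sorted(Path(patterns_dir).glob("*.eno")):
        text = path.read_text(encoding="utf-8")
        if not extract_section(text, "domains"):
            missing.append(path.name)
    return missing
```

Assuming a local checkout of trackerdb, calling patterns_without_domains("trackerdb/db/patterns") should report tealium_ads.eno (as of the commit linked above) but not tiktok_analytics.eno.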

philipp-classen commented 1 year ago

Related to https://github.com/ghostery/trackerdb/issues/13

philipp-classen commented 1 year ago

The May release is the first release computed with trackerdb data. In that regard, the migration is done.

Still, I'll leave the ticket open, since the manual step of taking the released trackerdb binary dump and converting it to trackerdb.sql should be automated as well. At least the data model is now consistent.

philipp-classen commented 11 months ago

Added the update_trackerdb.sh script to automate the update of the trackerdb.sql file.