yohhaan / topics_api_analysis

This is the code artifact of the paper "A Public and Reproducible Assessment of the Topics API on Real Data"
https://arxiv.org/abs/2403.19577
GNU General Public License v3.0

Simulator is not actually random, leading to extremely inaccurate results #1

Closed: jkarlin closed this issue 3 months ago

jkarlin commented 4 months ago

Hi. I’m an engineer on the Topics API for Chrome. I took a brief look at your code after seeing rather surprising results in the related paper, and I want to point out an issue I came across, as it has a significant impact on the simulation’s (and therefore the paper’s) results.

You’re using a worker pool to create the topics for each user on sites A and B, but you’re not reseeding the random number generator on each worker (which is forked off the original process). The result is that each worker creates the same stream of random numbers!

This means that in your simulator, sites A and B are getting the same Topics for the same user, rather than chosen at random.
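The behavior described above can be reproduced in isolation. The following is a minimal sketch, not the simulator's code: it assumes a POSIX system where `os.fork` is available, uses a bare `os.fork` in place of the worker pool, and the helper name `draw_in_forked_child` is hypothetical. Two children forked from the same parent draw from NumPy's global generator and return the same values, because each child inherits a copy of the parent's generator state.

```python
import json
import os

import numpy as np


def draw_in_forked_child(n=5):
    """Fork a child, draw n integers from NumPy's global RNG there, return them.

    Hypothetical helper for illustration; it stands in for one pool worker.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: it holds a copy of the parent's RNG state.
        try:
            os.close(read_fd)
            values = np.random.randint(0, 1_000_000, size=n).tolist()
            with os.fdopen(write_fd, "w") as writer:
                json.dump(values, writer)
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent process: read the child's draw and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return json.loads(data)


if __name__ == "__main__":
    site_a = draw_in_forked_child()
    site_b = draw_in_forked_child()
    # Both children started from the same inherited state.
    print(site_a == site_b)
```

The parent never draws from the generator between the two forks, so both children start from an identical state, which mirrors workers forked from one original process.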

This is a significant problem with your published work. For example, fixing this bug in your code reduces the 5-epoch reidentification rate from ~57% to ~3% with the parameters [1] provided in the README.

An easy fix is to add os.register_at_fork(after_in_child=np.random.seed) before creating your worker pool.
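As a rough illustration of that one-line fix, here is a minimal sketch that assumes a POSIX system with `os.fork`; the helper `draw_in_forked_child` is hypothetical and stands in for a pool worker. Registering `np.random.seed` as an after-fork hook makes every child reseed NumPy's global generator from fresh OS entropy, so two children no longer share a stream.

```python
import json
import os

import numpy as np

# Called with no arguments in every forked child, np.random.seed reseeds
# NumPy's global RNG from OS entropy. Register it before any fork happens.
os.register_at_fork(after_in_child=np.random.seed)


def draw_in_forked_child(n=5):
    """Fork a child, draw n integers from NumPy's global RNG there, return them.

    Hypothetical helper for illustration; it stands in for one pool worker.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: the after-fork hook has already reseeded the RNG.
        try:
            os.close(read_fd)
            values = np.random.randint(0, 1_000_000, size=n).tolist()
            with os.fdopen(write_fd, "w") as writer:
                json.dump(values, writer)
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent process: read the child's draw and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return json.loads(data)


if __name__ == "__main__":
    site_a = draw_in_forked_child()
    site_b = draw_in_forked_child()
    # Each child reseeded independently, so the draws should differ.
    print(site_a != site_b)
```

An alternative that does not appear in this thread is to give each worker its own generator created with `np.random.default_rng`, which avoids relying on the global state altogether.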

Josh

[1] python3 topics_simulator.py data/web_data/users_topics_5_weeks.tsv 5 topics_classifier/chrome4/config.json data/crux/crux_202401_chrome4_topics-api.tsv 10 1 data/reidentification_exp/5_weeks_10_unobserved

yohhaan commented 4 months ago

Hello Josh,

Thanks for reaching out and bringing this to our attention!

We looked into this subtle bug regarding the initialization of the random number generator seed across these forked processes. We confirm that numpy preserves the random state across forks and that the proposed solution fixes it by forcing an auto-seed for each new fork. We therefore re-ran our simulation on this real dataset of browsing histories.

While the results that we now obtain have changed quantitatively (2.3%, 2.9%, and 4.1% of these users are uniquely re-identified after 1, 2, and 3 observations of their topics, respectively), our findings do not change qualitatively: real users can be fingerprinted by the Topics API, and the information leakage worsens over time as more users get uniquely re-identified.

Here is our plan: we will modify the simulator code (https://github.com/yohhaan/topics_api_analysis), update the corresponding metrics in the paper, and push a new version to arXiv (https://arxiv.org/abs/2403.19577) in which we will acknowledge your contribution and the help you provided. Thanks again!

Best,

Yohan

yohhaan commented 3 months ago

Hello,

Corrections to the code have been made in commit b20e193 and revisions are posted here.

Thanks again!