wildfish / django-gdpr-assist

Tools to help manage user data in the age of GDPR

Feature discussion: deletion of records from model tables #39

Open ghost opened 3 years ago

ghost commented 3 years ago

We have a use case where a Django application manages some API and session tokens, and we'd like to remove them from the anonymised (manage.py anonymise_db) version of the database. Replacing them with mock data doesn't seem to make sense, especially for the session tokens.

There are ways to do this outside of the library and Django: we could, for example, perform post-anonymization SQL commands to truncate the relevant tables, and/or exclude the API token tables from the application database backup/restore processes. The main benefits to an in-library solution would be convenience and consistency (one command and one layer of configuration + PrivacyMeta to manage bulk data migration from pristine to anonymised).
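As a rough illustration of the out-of-library route, the sketch below empties a list of tables after anonymisation has run. It uses Python's built-in `sqlite3` module purely so that it is runnable on its own; the `api_token` table name and the `clear_tables` helper are hypothetical, and a real deployment would run the equivalent statements (for example `TRUNCATE` on PostgreSQL) against its own database.

```python
import sqlite3

# Hypothetical table names; substitute the tables that hold tokens in your schema.
TABLES_TO_CLEAR = ("django_session", "api_token")


def clear_tables(conn, tables):
    """Delete every row from each named table and return the rows removed per table."""
    existing = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    removed = {}
    with conn:  # commits on success, rolls back if anything below raises
        for table in tables:
            if table not in existing:
                raise ValueError(f"unknown table: {table}")
            # The name was checked against the schema above before being interpolated.
            removed[table] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            conn.execute(f'DELETE FROM "{table}"')
    return removed


def demo():
    """Build a throwaway in-memory database, clear it, and report what happened."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE django_session (session_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE api_token (token TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO django_session VALUES (?)", [("a",), ("b",)])
    conn.execute("INSERT INTO api_token VALUES ('t1')")
    conn.commit()
    removed = clear_tables(conn, TABLES_TO_CLEAR)
    remaining = conn.execute("SELECT COUNT(*) FROM django_session").fetchone()[0]
    return removed, remaining
```

The drawback is the one described above: the list of tables lives in a separate script rather than alongside the PrivacyMeta configuration.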

Has django-gdpr-assist considered adding support to clear the contents of model tables during the manage.py anonymise_db step, and/or does this seem to make sense as a context and feature request?

radiac commented 3 years ago

This would be a useful feature. My first thought was to do something with the PrivacyMeta.can_anonymise option, but there's also the scenario where you want to be able to anonymise data in production, yet purge otherwise anonymisable records elsewhere, e.g. when copying a production database to staging.

Perhaps two new options on the anonymise_db command:

Not exactly snappy switches, but should give flexibility for the two scenarios.

Not sure when we'll be able to get onto this, but we'll add it to the list.

ghost commented 3 years ago

@radiac Perhaps this is overcomplicating things, but I've been thinking about how to make this comprehensible at the command line, and for people re-reading cron jobs etc. after the fact:

Perhaps there are 'bulk anonymization strategies', and three currently-known categories of data: 'anonymisable', 'unanonymisable', and 'untagged'.

At the moment anonymise_db implements a single strategy for anonymisable data: it anonymises it. But 'delete' could be another valid strategy.

The truth table of valid strategies could be something like this:

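| Strategy / Category | Anonymisable | Unanonymisable | Untagged |
|---------------------|--------------|----------------|----------|
| Anonymise           | x            |                |          |
| Retain              |              | x              | x        |
| Delete              | x            | x              | x        |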

Without wanting to introduce the overhead of a configuration file, perhaps these could be specified at the CLI in comma-separated, colon-assignment notation:

manage.py anonymise_db --strategy "unanonymisable:delete,untagged:retain"

(with perhaps a printout of the plan before confirmation, and sensible defaults)
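To make the notation concrete, here is a minimal sketch of how such a `--strategy` value might be parsed and validated. Nothing here exists in django-gdpr-assist: `parse_strategy`, `format_plan`, the `VALID` matrix and the `DEFAULTS` are all hypothetical, and the matrix assumes one reading of the truth table (anonymise only for anonymisable data, retain only for unanonymisable and untagged data, delete for any category).

```python
CATEGORIES = ("anonymisable", "unanonymisable", "untagged")

# Assumed reading of the truth table: strategies allowed for each category.
VALID = {
    "anonymisable": {"anonymise", "delete"},
    "unanonymisable": {"retain", "delete"},
    "untagged": {"retain", "delete"},
}

# Assumed defaults: anonymise what can be anonymised, leave everything else alone.
DEFAULTS = {
    "anonymisable": "anonymise",
    "unanonymisable": "retain",
    "untagged": "retain",
}


def parse_strategy(spec):
    """Turn 'category:strategy,category:strategy' into a complete plan dict."""
    plan = dict(DEFAULTS)
    for item in filter(None, (part.strip() for part in spec.split(","))):
        category, sep, strategy = item.partition(":")
        if not sep:
            raise ValueError(f"expected 'category:strategy', got {item!r}")
        category, strategy = category.strip().lower(), strategy.strip().lower()
        if category not in VALID:
            raise ValueError(f"unknown category {category!r}")
        if strategy not in VALID[category]:
            raise ValueError(f"strategy {strategy!r} is not valid for {category!r}")
        plan[category] = strategy
    return plan


def format_plan(plan):
    """Render the plan as one line per category, for display before confirmation."""
    return "\n".join(f"{category}: {plan[category]}" for category in CATEGORIES)
```

With these assumptions, `parse_strategy("unanonymisable:delete,untagged:retain")` fills in the default for `anonymisable`, and a request such as `untagged:anonymise` is rejected with a `ValueError` before anything touches the database.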

ghost commented 3 years ago

(sorry, thinking aloud a bit here as I inspect the code; this is something that I might be able to offer some time on, depending on how priorities and the scope-of-work here turn out)

It looks like one challenge with the strategy-based approach would be that the registry (which is opt-in for models, based on the presence of PrivacyMeta) may be unaware of the complete list of 'untagged' Django models in an application.

I've drafted some code previously to enumerate a 'complete' list of Django models, but there's a good chance that there would be edge cases with an approach like that. I'll do a bit more of a literature review to see whether Django model enumeration is a well-understood and robustly-solved problem elsewhere.

ghost commented 3 years ago

Great; enumeration of models is built-in to Django (and in fact, already used within the codebase here), so retrieving a complete and reliable list of those shouldn't be a major concern.
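For illustration, a sketch of how a complete model list could be split into the three categories. In a Django project the `models` argument could come from `django.apps.apps.get_models()`; the `is_registered` and `can_anonymise` callables are hypothetical stand-ins for lookups against the gdpr-assist registry, and treating "registered but not anonymisable" as the unanonymisable category is an assumption. The demo uses plain classes so that it runs without Django.

```python
def categorise_models(models, is_registered, can_anonymise):
    """Group models into 'anonymisable', 'unanonymisable' and 'untagged'.

    models: iterable of model classes, e.g. django.apps.apps.get_models().
    is_registered / can_anonymise: hypothetical predicates standing in for
    registry lookups; they are not part of any real library API.
    """
    groups = {"anonymisable": [], "unanonymisable": [], "untagged": []}
    for model in models:
        if not is_registered(model):
            groups["untagged"].append(model)  # no PrivacyMeta at all
        elif can_anonymise(model):
            groups["anonymisable"].append(model)
        else:
            groups["unanonymisable"].append(model)  # registered, opted out
    return groups


# Stand-in classes so the sketch runs on its own.
class User: ...
class AuditLog: ...
class Session: ...


def demo():
    registered = {User, AuditLog}
    anonymisable = {User}
    groups = categorise_models(
        [User, AuditLog, Session],
        is_registered=lambda model: model in registered,
        can_anonymise=lambda model: model in anonymisable,
    )
    return {name: [m.__name__ for m in members] for name, members in groups.items()}
```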

ghost commented 2 years ago

@radiac I might have been a bit too keen jumping straight to a prototype implementation here, although I think it was useful to help me understand the shape of the problem; let me know what you think of the proposed approach when you have a bit of time.