Have participants promise not to share PII, or, if they do, acknowledge that it will appear in the transcripts?
Run named entity recognition on the transcripts and replace each name or location with a hash of it wherever it shows up.
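A minimal sketch of the hashing step, assuming some NER tool has already produced character spans for each name or location. The function names, the `(start, end, label)` span format, and the salt are illustrative assumptions, not something decided in the conversation.

```python
from __future__ import annotations

import hashlib


def pseudonym(text: str, label: str, salt: str) -> str:
    """Return a stable placeholder such as PERSON_1a2b3c4d for an entity."""
    # Normalise case and whitespace so "Alice" and "alice " get the same token.
    normalised = " ".join(text.lower().split())
    digest = hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()
    return f"{label}_{digest[:8]}"


def redact(transcript: str, entities: list[tuple[int, int, str]], salt: str) -> str:
    """Replace each non-overlapping (start, end, label) span with its pseudonym."""
    out = transcript
    # Work from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        out = out[:start] + pseudonym(transcript[start:end], label, salt) + out[end:]
    return out


# Hypothetical example: the spans stand in for the output of an NER model.
text = "Alice flew to Boston. Then alice called."
spans = [(0, 5, "PERSON"), (14, 20, "LOC"), (27, 32, "PERSON")]
print(redact(text, spans, salt="project-secret"))
```

Salting the hash keeps the same person mapped to the same token across transcripts while making it harder to recover the name by hashing a list of candidates. The salt would need to be stored outside the repository.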
Dealing with PII in videos:
Check whether more than one person is in the frame and, if so, blank the frame
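A rough sketch of the frame-blanking rule, assuming frames arrive as raw byte buffers and that a person detector is available as a callable. `count_people` is a hypothetical stand-in for whatever detector ends up being used, and zero-filling is only one possible way to blank a frame.

```python
from typing import Callable, Iterable, Iterator


def blank_multi_person_frames(
    frames: Iterable[bytes],
    count_people: Callable[[bytes], int],
) -> Iterator[bytes]:
    """Yield each frame unchanged, or zero-filled if it shows more than one person."""
    for frame in frames:
        if count_people(frame) > 1:
            # bytes(n) is n zero bytes, so the blanked frame keeps its size.
            yield bytes(len(frame))
        else:
            yield frame


# Hypothetical detector for illustration: pretends each 0xFF byte is a person.
def fake_detector(frame: bytes) -> int:
    return frame.count(0xFF)


frames = [b"\x01\xff\x02", b"\xff\xff\x03", b"\x04\x05\x06"]
cleaned = list(blank_multi_person_frames(frames, fake_detector))
```

Keeping the blanked frame the same size means the video keeps its length and timestamps, so millisecond-level features stay aligned with the transcript.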
Data features fall at different levels:
Millisecond-level raw audio and video features, e.g. where the person's head is pointing
Utterance-level data, transcript-level data
Survey-level data
GitHub will hold the utterance-level and survey-level data. Some of the high-frequency data is large and should either go on Git LFS or stay in S3. Most of the time we'll use the millisecond data to derive higher-level features that are less frequent.
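As an illustration of deriving a lower-frequency feature from the millisecond data, here is a small sketch that averages a raw signal over each utterance's time window. The field names (`id`, `start_ms`, `end_ms`) and the choice of a mean are assumptions for the example only.

```python
from statistics import mean


def utterance_features(samples, utterances):
    """Average a millisecond-level signal over each utterance's time window.

    samples: list of (timestamp_ms, value) pairs, e.g. head yaw in degrees
    utterances: list of dicts with "id", "start_ms", "end_ms"
    """
    rows = []
    for utt in utterances:
        # Half-open window: start is included, end is excluded.
        window = [v for t, v in samples if utt["start_ms"] <= t < utt["end_ms"]]
        rows.append({
            "utterance_id": utt["id"],
            "n_samples": len(window),
            "mean_value": mean(window) if window else None,
        })
    return rows


# Hypothetical data: four samples and two utterances.
samples = [(0, 1.0), (10, 3.0), (20, 5.0), (30, 7.0)]
utterances = [
    {"id": "u1", "start_ms": 0, "end_ms": 20},
    {"id": "u2", "start_ms": 20, "end_ms": 40},
]
rows = utterance_features(samples, utterances)
```

The resulting rows are small enough to live alongside the utterance-level data, while the raw samples stay in S3 or LFS.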
Todo:
[x] Get videos from our pilots moved to a dedicated S3 bucket. @JamesPHoughton @RachelAbigail
Need to use the metadata from the pilots to identify which folders need to move
[x] Get a few videos into this bucket quickly so that we can get some analysis started
[ ] Documentation about what the videos are and where they come from @JamesPHoughton @RachelAbigail
[x] Set up user accounts to share S3 bucket with the group
[ ] Put together a dirty run of the workflow @ChristopherLucas @dcknox
Notes from the conversation:
How do we deal with PII in transcripts: