[Open] iqis opened this issue 5 years ago
@MalloryJfeldman would have more insight on this, but I can tell you that the HRV and EDA output data from our pilot study (N = 67 individuals) add up to about 2-3 GB. Child HRV output data (N = 43) add < 1 GB.
For comparison, the last study I managed had 150 families (nested data for parent-child pairs) at two time points for physio. Not all families had two parents involved, but you can see how the data multiply.
And I am only talking about physio output data. Many GBs would be added if a user started bringing in other data types (e.g., surveys, observational codes) after the psyphr_study was aggregated.
Wow, looks like I need to do some extra thinking.
As I try to figure out the best approach, I need to know some common characteristics of downstream analyses. Some detailed use cases would help. For example, what are some frequently used statistical models? Is modeling usually done for each individual subject, or on some kind of group-level summary?
Pointing @wendtke to `sqlite` and `filehash`.
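For reference, here's a minimal sketch of what either on-disk backend could look like in R. The `study_db` and `study.sqlite` names, the `hrv_summary` table, and the sample values are all made up for illustration:

```r
# Assumes install.packages(c("filehash", "DBI", "RSQLite")).
library(filehash)

# filehash: a key-value store on disk; objects are serialized to files,
# so only the objects you fetch are loaded into memory.
dbCreate("study_db")                      # hypothetical database name
db <- dbInit("study_db")
dbInsert(db, "subj_001_hrv", data.frame(epoch = 1:3, rmssd = c(42, 38, 45)))
dbFetch(db, "subj_001_hrv")

# SQLite via DBI/RSQLite: tabular data queried on disk, so a summary
# can be computed without reading every row into R.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "study.sqlite")
dbWriteTable(con, "hrv_summary",          # hypothetical table name
             data.frame(subject = "001", epoch = 1:3, rmssd = c(42, 38, 45)))
dbGetQuery(con, "SELECT subject, AVG(rmssd) AS mean_rmssd
                 FROM hrv_summary GROUP BY subject")
dbDisconnect(con)
```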
I think I miscalculated; see here (https://media.discordapp.net/attachments/575140896249479184/601979176806776833/Screen_Shot_2019-07-19_at_21.29.02.png) for the HRV output data for 67 individuals read and wrangled in R.
Maybe you were referring to raw ECG signals? That could make more sense.
Maybe. I thought I was checking the properties of only the output files. Oh well.
Hey, sorry I'm coming to this late. Our studies can generate up to ~2 GB in output files. Like I said, we never actually ran our experience-sampling data through proprietary software, so I don't have a good sense of what that might look like (this study is not very representative, but I suspect that if we did run our experience-sampling data through Mindware, we would generate closer to 5-6 GB of output). In general, it's fairly typical to generate output files across 2-5 channels per person for sessions that last between 1 and 4 hours. So that's 2-5 output files per person, each containing summaries of physio data from 1-4 hours of recording. A typical sample is between 50 and 150 subjects, although people are pushing for more these days. For within-subject analyses, these numbers can be lower.
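A quick back-of-envelope in R from those numbers (150 subjects, 5 files each, ~2 GB total output, all taken from the comment above):

```r
# Back-of-envelope using the figures quoted above.
subjects          <- 150   # typical upper end of a sample
files_per_subject <- 5     # 2-5 channels -> up to 5 output files each
total_output_gb   <- 2     # ~2 GB of output for a study at the high end

files_total <- subjects * files_per_subject  # 750 output files
total_output_gb * 1024 / files_total         # ~2.7 MB per output file
```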
Looking at "Mallory Pilot 1" here, out of 600+ MB of raw data comes only 1 MB of .xlsx workbooks.
I know we're only dealing with workbooks at the moment, but it makes me wonder: following the above ratio, would 2 GB of output come from 1.2 TB of input? Wow, that's massive!
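The arithmetic behind that guess, for the record (using the ~600:1 ratio observed above):

```r
# Ratio implied above: ~600 MB of raw input -> ~1 MB of workbook output.
ratio     <- 600          # input-to-output size ratio, roughly 600:1
output_gb <- 2            # output size quoted earlier in the thread
output_gb * ratio / 1024  # ~1.17 TB of raw input, i.e. the "1.2 TB" guess
```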
We want `psyphr` to work on a normal laptop, which nowadays has somewhere between 4 and 12 GB of usable memory, and R normally should not use more than half of that. Currently, `read_study()` reads everything all at once, so a really big study could create a problem. If it does, there are at least two ways to mitigate it: read files lazily on demand, or keep the data on disk in a backend such as `sqlite` or `filehash` (mentioned above).
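As a rough illustration of the lazy route, here is a sketch of a hypothetical `read_study_lazy()` that records file paths up front and reads a workbook only when it is requested. None of this is in psyphr; `fetch_file()` and the file names in the usage comments are made up:

```r
library(readxl)

# Index the study directory without reading any data into memory.
read_study_lazy <- function(study_dir) {
  paths <- list.files(study_dir, pattern = "\\.xlsx$",
                      full.names = TRUE, recursive = TRUE)
  names(paths) <- basename(paths)
  structure(list(paths = paths), class = "lazy_study")
}

# Read a single workbook on demand; nothing else is loaded.
fetch_file <- function(study, file_name) {
  readxl::read_excel(study$paths[[file_name]])
}

# Usage (hypothetical paths):
# study <- read_study_lazy("path/to/study")
# hrv   <- fetch_file(study, "subj_001_HRV.xlsx")
```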
What is the likely total size of a study? I'm looking for a figure at about the 80th percentile, and I surely hope it will be small enough.