wendtke / psyphr

legacy repo for R package suite for psychophysiological data; see github.com/psyphr-dev

Study/file size #58

Open iqis opened 5 years ago

iqis commented 5 years ago

We want psyphr to work on a normal laptop, which nowadays has somewhere between 4 and 12 GB of usable memory, and R should normally not use more than half of the total memory. Currently read_study() reads everything all at once, so a really big study can create a problem.
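
One direction I'm considering is deferring the reads instead of loading everything up front. A very rough sketch, with hypothetical names (read_study_lazy() and read_workbook() are not anything in the package yet):

```r
# Rough sketch only: index files up front, parse each workbook on first access.
# read_study_lazy() and read_workbook() are hypothetical, not current psyphr API.
read_study_lazy <- function(dir) {
  paths <- list.files(dir, pattern = "\\.xlsx$", recursive = TRUE, full.names = TRUE)
  cache <- new.env(parent = emptyenv())
  get_workbook <- function(path) {
    if (!exists(path, envir = cache)) {
      assign(path, read_workbook(path), envir = cache)  # read only when asked for
    }
    get(path, envir = cache)
  }
  list(paths = paths, get = get_workbook)
}
```

Only the index of file paths would live in memory until a specific workbook is requested.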

If the problem exists, there are at least two ways to mitigate it:

What is the likely total size of a study? I'm looking for a figure at about the 80th percentile, and I surely hope it will be small enough.
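
If anyone wants to put a number on it, something like this on an existing study directory would do (the path and file pattern are assumptions about how your output is laid out):

```r
# Total on-disk size of a study's output workbooks, in GB.
study_size_gb <- function(dir, pattern = "\\.xlsx$") {
  paths <- list.files(dir, pattern = pattern, recursive = TRUE, full.names = TRUE)
  sum(file.size(paths)) / 1024^3
}

# study_size_gb("path/to/pilot_study")
```

Keep in mind the in-memory size after parsing is usually larger than the .xlsx files themselves, since .xlsx is compressed XML.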

wendtke commented 5 years ago

@MalloryJfeldman would have more insight on this, but I can tell you that the HRV and EDA output data from our pilot study (N = 67 individuals) adds up to about 2-3 GBs. Child HRV output data (N = 43) adds < 1 GB.

For comparison, the last study I managed had 150 families (nested data for parent-child pairs) at two time points for physio. Not all families had 2 parents involved, but you can see how the data multiply.

And I am only talking about physio output data. Many GBs would be added if a user started bringing in other types (e.g., surveys; observational codes) after the psyphr_study was aggregated.

iqis commented 5 years ago

Wow, looks like I need to do some extra thinking.

iqis commented 5 years ago

As I try to figure out the best approach, I need to know some common characteristics of downstream analyses, and some detailed use cases would help. For example, what are some frequently used statistical models? Is modeling usually done for each individual subject, or on some kind of group-level summary?
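
For concreteness, these are the two patterns I have in mind (hrv_data, rmssd, epoch, and condition are made-up names, purely to illustrate):

```r
library(dplyr)

# Pattern 1: a separate model per subject
per_subject_fits <- hrv_data %>%
  group_by(subject) %>%
  group_map(~ lm(rmssd ~ epoch, data = .x))

# Pattern 2: summarise each subject first, then fit one group-level model
subject_means <- hrv_data %>%
  group_by(subject, condition) %>%
  summarise(mean_rmssd = mean(rmssd), .groups = "drop")

group_fit <- lm(mean_rmssd ~ condition, data = subject_means)
```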

wendtke commented 5 years ago

I think I miscalculated; see here (https://media.discordapp.net/attachments/575140896249479184/601979176806776833/Screen_Shot_2019-07-19_at_21.29.02.png) for the HRV output data for 67 individuals read and wrangled in R.

iqis commented 5 years ago

Maybe you were referring to raw ECG signals? That could make more sense.

wendtke commented 5 years ago

> Maybe you were referring to raw ECG signals? That could make more sense.

Maybe. I thought I was checking the properties of only the output files. Oh well.

wendtke commented 5 years ago

@geanders Do you have any thoughts on rolling our own solution vs. filehash vs. SQLite (via RSQLite) as the underlying data management system for large studies within psyphr?
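
In case it helps the discussion, here is roughly what the RSQLite route could look like (the table names, the subject column, and hrv_tbl/eda_tbl are placeholders for wrangled data frames):

```r
library(DBI)
library(dplyr)  # needs dbplyr installed for tbl() on a database connection

# Write each wrangled component of the study to one on-disk SQLite file...
con <- dbConnect(RSQLite::SQLite(), "psyphr_study.sqlite")
dbWriteTable(con, "hrv", hrv_tbl, overwrite = TRUE)
dbWriteTable(con, "eda", eda_tbl, overwrite = TRUE)

# ...then query it lazily; only the filtered rows ever come back into R.
hrv_subset <- tbl(con, "hrv") %>%
  filter(subject == "P012") %>%
  collect()

dbDisconnect(con)
```

filehash would be similar in spirit (a key-value store on disk), just without SQL-style queries.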

MalloryJfeldman commented 5 years ago

Hey, sorry I'm coming to this late. Our studies can generate close to ~2 GB in output files. Like I said, we never actually ran our experience sampling data through the proprietary software, so I don't have a good sense of what that might look like (I don't think this study is very representative, but I suspect that if we did run our experience sampling data through Mindware, we would generate closer to 5-6 GB of output). In general, it's fairly typical to generate output files across 2-5 channels per person, for sessions that last between 1 and 4 hours. So that's 2-5 output files per person, each containing summaries of physio data from 1-4 hours of recording. I'd say a typical sample is between 50 and 150 subjects, although people are pushing for more these days. For within-subject analyses, these numbers can be lower.
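
Back-of-envelope, those figures translate to something like the following, where the per-file size is purely an assumed knob to adjust:

```r
# Rough study totals from the figures above; mb_per_file is a guess, not a measurement.
files_per_person <- c(2, 5)      # output files per subject
n_subjects       <- c(50, 150)   # typical sample size range
mb_per_file      <- 5            # assumed average size of one output workbook, in MB

range(outer(files_per_person, n_subjects)) * mb_per_file   # total MB, best to worst case
```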

iqis commented 5 years ago

Looking at "Mallory Pilot 1" here, out of 600+ MBs of raw data comes only 1MB of .xlsx workbooks.

I know we're only dealing with workbooks at the moment, but it makes me wonder: following the above ratio, would 2GBs of output be coming from 1.2TB of input? Wow, that's massive!
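
For the record, the arithmetic behind that guess:

```r
raw_mb    <- 600    # raw data in "Mallory Pilot 1"
output_mb <- 1      # .xlsx workbooks produced from it
ratio     <- raw_mb / output_mb      # roughly 600:1 raw-to-output

2 * 1024 * ratio / 1024^2            # 2 GB of output implies ~1.2 TB of raw input
```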