Support for staging data

johnbradley commented 3 years ago

Do we need to have the ability to download data, run the pipeline, and upload the results?

wodanaz commented 3 years ago

I don't think we do, because it often requires logging in to the sftp server, where access is password protected and depends on the info that Nico sends me. I have to navigate the file system, do get *, and exit.

I guess we can do that manually, but it is a very good idea. How would you address this in an automatic way?

johnbradley commented 3 years ago

I thought Nico was going to use DDS for delivering the data. Assuming that's still the case, we could create a script that 1) downloads the data from a DDS project, 2) runs the pipeline, and 3) uploads the results/logs to a DDS project (could be the same or a different project).

With DDS you just need to create your credential file once on HARDAC. There would be no need to deal with a username/password. You only need to provide the name of the DDS project.
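
The three steps above could be sketched roughly as follows. This is only an illustration: the function name, the PIPELINE_CMD variable, and the arguments handed to the pipeline script are assumptions, and the directory names are placeholders. The ddsclient download and upload subcommands are used with the -p project option.

```shell
#!/usr/bin/env bash
# Sketch only: stage data from DDS, run the pipeline, upload the results.
# Assumptions: stage_run_upload, PIPELINE_CMD, and the pipeline's
# argument order are hypothetical and not taken from the repository.

PIPELINE_CMD="${PIPELINE_CMD:-./run-escape-variants.sh}"

stage_run_upload() {
    local input_project="$1"    # DDS project holding the sequencing data
    local output_project="$2"   # DDS project that receives results/logs
    local workdir="$3"          # working directory on the cluster

    local input_dir="$workdir/input/$input_project"
    local output_dir="$workdir/output/$input_project"
    mkdir -p "$workdir/input" "$output_dir" || return 1

    # 1) download the data from a DDS project
    ddsclient download -p "$input_project" "$input_dir" || return 1
    # 2) run the pipeline (input/output arguments are an assumption)
    "$PIPELINE_CMD" "$input_dir" "$output_dir" || return 1
    # 3) upload the results/logs to a DDS project
    ddsclient upload -p "$output_project" "$output_dir" || return 1
}

# Example call (project names are placeholders):
#   module load ddsclient
#   stage_run_upload MyInputProject MyResultsProject "$PWD/data"
```

Each step returns early on failure, so nothing is uploaded if the download or the pipeline fails.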

wodanaz commented 3 years ago

With DDS you just need to create your credential file once on HARDAC. There would be no need to deal with a username/password. You only need to provide the name of the DDS project.

I have never done this before. It seems like a great option too. Looking into it right now.

johnbradley commented 3 years ago

Here are some instructions from Nico on using DDS on HARDAC: http://seqweb.gcb.duke.edu/DataDelivery/NGSequencingCoreDataDownloadInstructionB.html

Basically, set up a .ddsclient config file in your home directory, then from an interactive session:

module load ddsclient
ddsclient download -p <projectName>

wodanaz commented 3 years ago

@johnbradley, there have been some changes to the protocol this week. First, the transfer via sftp to hardac worked really well, because the hospital needed the first batch of data to be analyzed very quickly, and I couldn't figure out how to transfer data from the DDS to hardac directly. Then we learned that the Illumina kit we were using has long amplicons that struggle to cover the genome, and in cases where the RNA is very degraded, many samples fail to amplify. Now we will have a new kit by Swift that should favor shorter amplicons, which allow us to recover more sequence from samples with low viral titer. The other good thing is that it doesn't have human primers.

In addition, there will be a few changes to the pipeline for the adapter removal and the deduplication steps. But I will know more when I get the new data.

wodanaz commented 3 years ago

Here are some instructions from Nico on using DDS on HARDAC: http://seqweb.gcb.duke.edu/DataDelivery/NGSequencingCoreDataDownloadInstructionB.html

Oh thanks! I was just searching Google for this.

wodanaz commented 3 years ago

Here are some instructions from Nico on using DDS on HARDAC: http://seqweb.gcb.duke.edu/DataDelivery/NGSequencingCoreDataDownloadInstructionB.html

Basically, set up a .ddsclient config file in your home directory, then from an interactive session:
module load ddsclient
ddsclient download -p <projectName>

I am doing it.

Thank you!

wodanaz commented 3 years ago

I thought Nico was going to use DDS for delivering the data. Assuming that's still the case, we could create a script that

downloads the data from a DDS project

runs the pipeline

uploads the results/logs to a DDS project ( could be the same or a different project )

With DDS you just need to create your credential file once on HARDAC. There would be no need to deal with a username/password. You only need to provide the name of the DDS project.

Will the results stay in hardac? The final plan is to make phylogenetic trees, but that is something that doesn't need to be automated for now, because some editing of the alignments has to be done by hand.

johnbradley commented 3 years ago

Will the results stay in hardac?

It's up to you. If you prefer to keep the data locally on HARDAC for other processing, we can do that. It's really just a question of running rm or not from a script. If you are running out of space on your HARDAC disk quota, automatically cleaning up the data may be an attractive option.
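
As a rough illustration of "running rm or not", the choice could hang on a single switch. The KEEP_RESULTS variable and the function name below are hypothetical, not part of any existing script.

```shell
# Sketch only: optionally remove a finished run's local output directory.
# KEEP_RESULTS and cleanup_results are hypothetical names.
cleanup_results() {
    local output_dir="$1"

    # Default to keeping data so nothing is deleted by accident.
    if [ "${KEEP_RESULTS:-yes}" = "yes" ]; then
        echo "keeping $output_dir"
        return 0
    fi

    # Refuse empty or root paths before calling rm -rf.
    case "$output_dir" in
        ""|"/") echo "refusing to remove '$output_dir'" >&2; return 1 ;;
    esac

    rm -rf -- "$output_dir"
    echo "removed $output_dir"
}
```

Setting KEEP_RESULTS=no before calling the function turns cleanup on; leaving it unset keeps the files on disk.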

wodanaz commented 3 years ago

Ok, I like this option a lot. Transferring all data back to DDS sounds like a great plan. Keeping the consensus sequences and the summary table of genome coverage on hardac is good enough for further processing, and those files shouldn't be too big.

johnbradley commented 3 years ago

@wodanaz How do you want to store the results in DDS?

One project with all the results
A new project for each set of results
Upload the results to the input project

Do you want to keep the input data on HARDAC or can we delete that data when the pipeline finishes?

Do you have a particular directory on HARDAC where you want to keep result data files? If not, how about something like this:

data/
   input/        <- downloaded projects go here in directories named after the downloaded project
   output/      <- result files and logs directory go into directories named after the downloaded project
run-escape-variants.sh 
...
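
A small helper could build the per-project paths for a layout like this one. DATA_ROOT and the function names are hypothetical and only meant to show the idea.

```shell
# Sketch only: per-project paths for the proposed layout.
# DATA_ROOT and all three function names are hypothetical.
DATA_ROOT="${DATA_ROOT:-data}"

# Print the directory a downloaded project would live in.
project_input_dir() {
    printf '%s/input/%s\n' "$DATA_ROOT" "$1"
}

# Print the directory for that project's result files and logs.
project_output_dir() {
    printf '%s/output/%s\n' "$DATA_ROOT" "$1"
}

# Create both directories for a project if they do not exist yet.
prepare_project_dirs() {
    mkdir -p "$(project_input_dir "$1")" "$(project_output_dir "$1")"
}
```

Calling prepare_project_dirs with a project name creates data/input/ and data/output/ subdirectories named after that project.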

wodanaz commented 3 years ago

A new project for each set of results will help me track the specific sequencing job of a given batch of samples if anything happens. Let's remove the input from hardac when the pipeline finishes, which is good for keeping storage low and for not retaining protected data (though the new sequencing kit will no longer sequence the host's RNA).
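
One way this could look in a script, as a sketch: results go to a project derived from the input project's name, and the input is removed only once the upload has succeeded. The "_results" suffix and both function names are assumptions made for illustration.

```shell
# Sketch only: one results project per input project, with the input
# removed only after a successful upload. The "_results" suffix and the
# function names are hypothetical.
results_project_name() {
    printf '%s_results\n' "$1"
}

upload_then_remove_input() {
    local input_project="$1"
    local input_dir="$2"
    local output_dir="$3"
    local results_project
    results_project="$(results_project_name "$input_project")"

    if ddsclient upload -p "$results_project" "$output_dir"; then
        # Upload succeeded, so the downloaded input can be deleted.
        rm -rf -- "$input_dir"
    else
        echo "upload failed; keeping $input_dir" >&2
        return 1
    fi
}
```

Tying the deletion to the upload's exit status means a failed transfer never costs the downloaded data.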

wodanaz commented 3 years ago

Do you have a particular directory on HARDAC where you want to keep result data files?

I do, but at the moment it is the directory where I was analyzing the Sempowsky lab's data.

I created a new one for these projects:

/data/wraycompute/alejo/sars2_genotype/assembling_results

wodanaz commented 3 years ago

data/
   input/        <- downloaded projects go here in directories named after the downloaded project
   output/      <- result files and logs directory go into directories named after the downloaded project
run-escape-variants.sh

and these can go into that new directory

wodanaz / Assembling_viruses

Support for staging data #12