wodanaz / Assembling_viruses


make supermetadata table and modify titles #38

Closed: johnbradley closed this issue 3 years ago

johnbradley commented 3 years ago

Notes from @wodanaz

Here are two ways I have been changing the name of the consensus sequences manually.

To do this, I need to know whether the samples are from the hospital or the campus, because campus samples are dated by week while hospital samples are dated by collection day. So I have two different methods.

Right now, every run is from the same week, so I just add the week as a string. Very easy, but it may change in case we start sequencing earlier samples.

# convert the pangolin CSV to tab-separated, strip the _NC_045512 suffix from the sample names,
# keep the sample and lineage columns, and drop the header row
cat Devos_6871_210317A5.csv | sed -r 's/,/\t/g' | sed -r 's/_NC_045512//g' | awk '{print $1 "\t" $2 }' | sed '1d' > variants.tab

# keep the sample name and coverage columns from the GATK coverage table, drop the header, sort by sample
sed -r 's/\.bqsr//g' coverage.gatk.tab | sed '1d' | awk '{print $1 "\t" $7 }'  | sort -k1d  > coverage.sort.tab

# join coverage and lineage by sample name and append the week: sample$coverage$lineage$week
join -j 1 <(sort coverage.sort.tab) <( sort variants.tab) | awk '{ print $1 "$" $2 "$" $3 "$Week_29"  }' > supermetadata.tab

for i in `cat supermetadata.tab`; do
root=`echo ${i} | cut -d'$' -f 1`;   # sample name
rootA=`echo ${i} | cut -d'$' -f 2`;  # coverage
rootB=`echo ${i} | cut -d'$' -f 3`;  # lineage
rootC=`echo ${i} | cut -d'$' -f 4`;  # week
result=${rootA/.*}                   # truncate coverage to its integer part for the comparison
if [ "$result" -ge 90 ] ; then  # only modify fasta files with a coverage of at least 90%
sed -r "s/_NC_045512/|${rootA}|${rootB}|${rootC}/g" $root.cleaned.fasta > $root.info.fasta ;
fi ;
done

cat *info.fasta > Devos_6871_210317A5.fasta
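
For example, with a hypothetical supermetadata.tab line like

S01$99.92$B.1.2$Week_29

the sed step would turn a cleaned header such as >S01_NC_045512 into

>S01|99.92|B.1.2|Week_29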

For the hospital, it's just a bit different. I really need the date table.

cat <DDS_project (lineage data)>.csv | sed -r 's/,/\t/g' | sed -r 's/\|/\t/g' | awk '{print $1 "\t" $3 }' | sed '1d' > variants.tab 
# DDS_project (lineage data) is the name of the pangolin output file

sed -r 's/\.bqsr//g' coverage.gatk.tab | sed '1d' | awk '{print $1 "\t" $7 }'  | sort -k1d  > coverage.sort.tab

# join coverage with collection dates by sample name: sample, coverage, date
join -j 1 <(sort coverage.sort.tab) <( sort date.tab) | awk '{ print $1 "\t" $2 "\t" $3 }' > metadata1.tab

# then join in the lineage: sample$coverage$date$lineage
join -j 1 <(sort metadata1.tab) <( sort variants.tab ) | awk '{ print $1 "$" $2 "$" $3 "$" $4 }' > supermetadata.tab

for i in `cat supermetadata.tab`; do
root=`echo ${i} | cut -d'$' -f 1`;   # sample name
rootA=`echo ${i} | cut -d'$' -f 2`;  # coverage
rootB=`echo ${i} | cut -d'$' -f 3`;  # date
rootC=`echo ${i} | cut -d'$' -f 4`;  # lineage
result=${rootA/.*}                   # truncate coverage to its integer part for the comparison
if [ "$result" -ge 90 ] ; then  # only modify fasta files with a coverage of at least 90%
sed -r "s/_NC_045512/|${rootA}|${rootB}|${rootC}/g" $root.cleaned.fasta > $root.info.fasta ;
fi ;
done

cat *info.fasta > hospital.fasta # here "hospital" can just be the same name as in the DDS project

date.tab is a tab-separated file with two columns: sample ID and date.
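
For example (made-up sample IDs and dates):

S01	2021-03-10
S02	2021-03-11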

johnbradley commented 3 years ago

@wodanaz I have a few questions: How can I tell if we are running in campus or hospital mode?

For the campus version:

  • Where is the Devos_6871_210317A5.csv file coming from?
  • Should I create an optional command line argument to specify the week ("Week_29")?

For the hospital version:

  • Will the date.tab file be included in the DDS project that we download?
  • Should I create a command line argument for run-escape-variants.sh to optionally pass in the date.tab file?

wodanaz commented 3 years ago

@wodanaz I have a few questions: How can I tell if we are running in campus or hospital mode?

For the campus version:

  • Where is the Devos_6871_210317A5.csv file coming from?

This is coming from the pangolin output, i.e. $EVDIR/$PROJECTNAME.csv.

  • Should I create an optional command line argument to specify the week ("Week_29")?

Yes, because it varies, but it will depend on whether there is a mix of weeks in the samples. To make it more general, I would just include the date.tab file from the beginning.

For the hospital version:

  • Will the date.tab file be included in the DDS project that we download?

No, I should have that file in the directory (I hope to always get this file before the sequencing is done).

  • Should I create a command line argument for run-escape-variants.sh to optionally pass in the date.tab file?

Yes, I think this is a great idea, just in case I don't have the file at the time of running.

Best,

Alejo

johnbradley commented 3 years ago

@wodanaz How should I tell the difference between campus and hospital mode? One option is to add a command line argument to specify either campus or hospital mode.

wodanaz commented 3 years ago

Great question. I was thinking that we could deprecate the surveillance mode and replace it with -m (mode), taking h (hospital surveillance), c (campus surveillance), or e (experimental) as options. Here, experimental doesn't have patient metadata, so we don't need to change the names in the fasta consensus files.
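
In run-escape-variants.sh that could be parsed with something like this getopts sketch (just a sketch, the details may change):

while getopts "m:" opt; do
  case $opt in
    m) MODE=$OPTARG ;;  # h = hospital, c = campus, e = experimental
    *) echo "usage: $0 -m h|c|e" >&2; exit 1 ;;
  esac
done
case "$MODE" in
  h|c|e) ;;  # valid mode, continue with the rest of the pipeline
  *) echo "usage: $0 -m h|c|e" >&2; exit 1 ;;
esac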

This also means we need to change which files are saved. For now, the reports needed for mutations cover only the spike protein region. So, after the doctors changed their minds on what data is relevant, I run depth_compiler.pl and genotype_compiler.pl to get the mutations and their depths, then I filter the vcf files to the spike region with a new bed file that I call spike.bed, which I made as:

cat spike.bed
NC_045512   21563   25384

To generate the table of genotypes within the spike region, I made the following change:

module load bedtools2
gunzip *gatk.filt.vcf.gz

ls *gatk.filt.vcf > vcfs.list

for i in `cat vcfs.list`; do
root=`basename $i .gatk.filt.vcf`;
echo '#!/usr/bin/env bash' > $root.vcf2tab.sh;
echo "#SBATCH -N 1" >> $root.vcf2tab.sh;
#echo "#SBATCH --mem 500" >> $root.vcf2tab.sh;
# keep only the variants that fall inside the spike region
echo "intersectBed -a $i -b spike.bed > $root.spike.vcf"  >> $root.vcf2tab.sh;
# extract position and alternate allele for the positions of concern
echo "awk '{ print  \$2 \"\\t\" \$5}' $root.spike.vcf  | grep -E '22812|22813|22917|23012|23063|23403|23592|23593' > $root.spike.tab"  >> $root.vcf2tab.sh;
done

for file in *vcf2tab.sh ; do sbatch $file ; done

perl genotype_compiler.pl *.spike.tab > spike_genotypes.tab
cp spike_genotypes.tab spike_genotypes.backup.tab
sed -r 's/ /\t/g'  spike_genotypes.tab | sed -r 's/\.spike\.tab//g' | sort -k1,1n > spike_genotypes.final.tab 

And for read depths, to get an idea of data quality, I run the following:

for i in `cat vcfs.list`; do
root=`basename $i .gatk.filt.vcf`;
echo '#!/usr/bin/env bash' > $root.vcf2tab.sh;
echo "#SBATCH -N 1" >> $root.vcf2tab.sh;
#echo "#SBATCH --mem 500" >> $root.vcf2tab.sh;
echo "intersectBed -a $i -b spike.bed > $root.spike.vcf"  >> $root.vcf2tab.sh;
# extract position and the read-depth subfield of the genotype column for the positions of concern
echo "awk '{ print  \$2 \"\\t\" \$10}' $root.spike.vcf  | grep -E '22812|22813|22917|23012|23063|23403|23592|23593' | sed -r 's/:/\t/g' | awk '{ print  \$1 \"\\t\" \$3}' > $root.depth.tab"  >> $root.vcf2tab.sh;
done

for file in *vcf2tab.sh ; do sbatch $file ; done

perl depth_compiler.pl *.depth.tab > spike_depths.tab
cp spike_depths.tab spike_depths.backup.tab
sed -r 's/ /\t/g'  spike_depths.tab | sed -r 's/\.depth\.tab//g' | sort -k1,1n  > spike_depths.final.tab 
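
The final table then has one row per spike position and one column per sample, something like this (made-up depth values):

        S01     S02
22812   1453    1320
22813   1449    1318
23403   1502    1377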

Now I have two tab files, which I wish I could merge with the metadata from coverage, date, and lineage using bash, but I do it in Excel.

I just don't know how to transpose tables in bash. I have tried some things in R without success. Do you have any ideas?

Thanks so much John!

johnbradley commented 3 years ago

I just don't know how to transpose tables in bash. I have tried some things in R without success. Do you have any ideas?

I've never tried to transpose a table in bash. Bash is good for simple tasks, but it can be rather challenging to use for complicated ones, and that is probably the case for transposing data. I can add a script to merge the tab files. Which programming language are you most familiar with: R, python, perl, other?
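
That said, an untested awk sketch that buffers the whole table in memory and swaps rows and columns might look something like this (the output name is just a placeholder):

awk -F'\t' '
{ for (i = 1; i <= NF; i++) cell[i, NR] = $i; if (NF > maxnf) maxnf = NF }
END {
  for (i = 1; i <= maxnf; i++) {
    row = cell[i, 1];
    for (j = 2; j <= NR; j++) row = row "\t" cell[i, j];
    print row;
  }
}' spike_genotypes.final.tab > spike_genotypes.transposed.tab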

wodanaz commented 3 years ago

I'm more familiar with R, but I can do python as well. Yeah, note that the table generated by the compiler has samples as columns, and transposing it to samples as rows would make the merged table easier to understand.

Yeah, I thought transposing was an easy task for bash but it is not lol. I found it very hard to do, and with the t() function in R it was hard to keep the rownames and colnames in shape.

johnbradley commented 3 years ago

Since you are more familiar with R I'll see if I can create an R script to merge these tables.

wodanaz commented 3 years ago

Thanks, I would love to learn how to do a transposition more efficiently.

;)

johnbradley commented 3 years ago

@wodanaz Should the spike genotype_compiler/depth_compiler code above replace the current genotype_compiler/depth_compiler code? Or should I add the spike code as a new step in addition to the current genotype_compiler/depth_compiler code?

wodanaz commented 3 years ago

No, they are still the same. The difference is the input tables, which have been pre-filtered and customized to search only for the mutations of concern.

johnbradley commented 3 years ago

@wodanaz For merging are you wanting to transpose and add in the coverage, date, and lineage from supermetadata.tab? So if spike_genotypes.final.tab looks something like this:

    S01 
22812   G   
22813   T   
22917   C

And supermetadata.tab looks like this:

S01$99.92$Week_28$B.3

Instead spike_genotypes.final.tab should look like this?

Sample  Coverage  Date     Lineage  22812 22813 22917
S01     99.92     Week_28  B.3      G     T     C

And for spike_depths.final.tab we would follow the same pattern?

wodanaz commented 3 years ago

@wodanaz For merging are you wanting to transpose and add in the coverage, date, and lineage from supermetadata.tab? So if spike_genotypes.final.tab looks something like this:

  S01 
22812 G   
22813 T   
22917 C

And supermetadata.tab looks like this:

S01$99.92$Week_28$B.3

Instead spike_genotypes.final.tab should look like this?

Yes, that's exactly how I make it look after Excel processing.

Sample  Coverage  Date     Lineage  22812 22813 22917
S01     99.92     Week_28  B.3      G     T     C

And for spike_depths.final.tab we would follow the same pattern?

Yes, although I don't merge depths with the supermetadata table, since it goes to another sheet in the Excel document.

Thanks!

johnbradley commented 3 years ago

If the goal is an Excel document, python has a library called pandas (which works similarly to R data frames) that can save an Excel spreadsheet with multiple sheets.

wodanaz commented 3 years ago

That's a great idea and it will help me to save a lot of time :)

johnbradley commented 3 years ago

Cool. I'll use python/pandas to transform/merge the genotype data into an Excel file. Once that's done we can add additional sheets.

wodanaz commented 3 years ago

It sounds like it works like write.xls in R. But I am looking forward to learning the python/pandas way.