yukiteruono / pbsim3

PBSIM3: a simulator for all types of PacBio and ONT long reads
GNU General Public License v2.0
46 stars 5 forks source link

Question about tagging sam files with transcript ID #13

Closed nadjano closed 6 months ago

nadjano commented 6 months ago

Hello,

Thank you for developing pbsim!

I am using pbsim to simulate transcripts and I want to assess if they map back to the right genes and which transcripts map to other matching positions.

For this it would be very helpful if I could tag the output sam with the transcript id from which the read was simulated. Could you please provide recommendations on how I can achieve this?

One approach I'm considering is simulating each transcript individually, adding the corresponding transcript ID tag to the output SAM files, and subsequently merging all transcripts. Do you have any suggestions or alternative methods for achieving this goal?

Thank you for your assistance!

Best, Nadja

yukiteruono commented 6 months ago

Thank you for your using PBSIM3.

The maf file contains alignments between transcripts and reads, and you can create a correspondence list of transcripts and read IDs from the maf file. Your approach is also good. Change --seed each time you run it. Otherwise, the distribution of read start positions on the template and the read length distribution may be biased.

nadjano commented 6 months ago

Thanks a lot @yukiteruono for the quick response! I didn't check the maf before.

If anyone stumbles on this issue and wants to do the same:

1) Here is how I extracted the read ID, transcript pairs from the maf file.

        # Get a list of read and transcript ids 
        awk '/^a\$/ { getline; transcript_id=\$2; getline; read_id=\$2; print read_id, transcript_id }' *maf > read_gene_mapping
        # Get readname prefix
        prefix=\$(awk '{print \$1; exit}' read_gene_mapping | sed 's/\\/[0-9]*\\/[0-9]*\$//')
        # Replace pass number with "ccs" to match read names after running ccs
        sed -i "s/\$prefix\\/\\([0-9]*\\)\\/[0-9]* /\$prefix\\/\\1\\/ccs /g" read_gene_mapping
        # Only keep unique lines 
        uniq read_gene_mapping > read_gene_mapping

2) After mapping to reference: To tag the aligned sam file with the corresponding transcript id, I ran this:

while read -r read_name rg_tag; do
        samtools view -h $aligned_sam | grep \$read_name | awk -v read_name="\$read_name" -v rg_tag="\$rg_tag" 'BEGIN {OFS="\\t"} \$1 == read_name { \$0 = \$0"\\tRT:Z:"rg_tag } { print \$0 }' >> tagged.sam
    done < read_gene_mapping_filter