How do you deal with concatenated reads?

Jia-Xiu commented 1 month ago

concatenate_read_example.pdf

Hi William,

How do you define where the barcode starts and ends in the read? I found that some reads can be assigned to more than one sample because of read concatenation during nanopore sequencing.

I used the Dorado basecaller for basecalling and kept both duplex and simplex reads for demultiplexing. To avoid trimming parts of the barcodes, I kept the adapters (--no-trim). A small proportion of my reads are concatenated, i.e. they contain 16S amplicons from different samples (please see the example in the PDF file). In this case, if I set --barcode_start to 0 and -bc_end to 2200 (I chose 2200 because the adapters differ in length), the program finds two barcodes and assigns the read to two samples.

What do you think about this issue? I am not sure whether trimming adapters is a good idea, as the Dorado documentation says: “The --no-trim option will prevent the trimming of detected barcode sequences as well as the detection and trimming of adapter and primer sequences”. What did you do? Have you used reads after trimming the adapters?

Any insights are welcome.

Thanks, Xiu

willros commented 1 month ago

The --barcode_start and --barcode_end options control where the barcodes are searched for at both ends of the read. So, if the read is 1500 bp, --barcode_start 0 and --barcode_end 300 look for one barcode from position 0 to 300 and for the other barcode from position 1200 to 1500. Does that make sense?
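As a minimal sketch of the two windows described above (this is not nanomux's actual code; `barcode_windows` is a hypothetical helper, and mirroring the window at the 3' end is an assumption chosen to match the 1500 bp example):

```python
def barcode_windows(read_len, barcode_start, barcode_end):
    """Return the (start, end) search windows at the 5' and 3' ends of a read."""
    # 5' window: counted from the start of the read.
    five_prime = (barcode_start, min(barcode_end, read_len))
    # 3' window: the same offsets, counted back from the end of the read (assumed).
    three_prime = (max(read_len - barcode_end, 0), read_len - barcode_start)
    return five_prime, three_prime


print(barcode_windows(1500, 0, 300))
```

For a 1500 bp read this gives the windows 0 to 300 and 1200 to 1500. A wide setting such as 2200 on a read shorter than that makes both windows cover the whole read, which is one way a concatenated read can yield two barcode hits.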

Thanks for the input, William

willros commented 1 month ago

@Jia-Xiu I made another change which makes the program run faster.

For me, this works very well now. Try the new version out and let me know.

Thanks, William

Jia-Xiu commented 1 month ago

@willros Cool!

The greedy mode works very fast with the previous version, but the fuzzy mode does not. I will try the new version out. From the output of greedy (previous version, please see below), I still have 2985 reads found in more than one sample. I might make my read_len_max shorter, say 1800, to avoid concatenated reads.

PS: In the previous version, I noticed that you write the number of reads assigned to each sample in the fastq file names. Would it be possible to also write the read counts to a csv file? I would find that more useful.
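A minimal sketch of what such a csv could look like, assuming the demultiplexer can provide (read_id, sample) pairs. The function name `write_read_counts` and the column names are hypothetical and not part of nanomux:

```python
import csv
import io
from collections import Counter


def write_read_counts(assignments, handle):
    """Write one `sample,reads` row per sample, sorted by sample name.

    `assignments` is a hypothetical iterable of (read_id, sample) pairs.
    """
    counts = Counter(sample for _read_id, sample in assignments)
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample", "reads"])
    for sample, n_reads in sorted(counts.items()):
        writer.writerow([sample, n_reads])
    return counts


demo = [("read1", "sample_A"), ("read2", "sample_B"), ("read3", "sample_A")]
buffer = io.StringIO()
write_read_counts(demo, buffer)
print(buffer.getvalue())
```

A table like this also makes it easy to check how evenly the reads are spread across a large pool.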

Thanks, Xiu

Date: 2024-05-28
Command: nanomux
fastx: results_dorado_no_demultiplexing/all_clean_fastq.gz
output: results_nanomux_all_greedy_fat
barcodes: 16S_barcodes_22-05-2024.csv
mode: greedy
mismatch: 1
barcode_start: 0
barcode_end: 600
read_len_min: 1400
read_len_max: 2000
minimum_reads: 1
parquet: True
trim: False

Demux information:
Number of raw sequences: 12929742
Number of filtered sequences: 9436396
Number of sequences with barcodes: 1810348
Number of reads found in more than one sample: 2985
Number of barcodes found: 271
Time runned: 2:23:47.452883

willros commented 1 month ago

Thanks for the idea! It is now implemented.

Maybe it has to do with the number of samples in your pool. Do you pool 270 samples together and sequence?

Jia-Xiu commented 1 month ago

Yes, I pooled 272 samples together and then did sequencing.

willros commented 1 month ago

Ok, cool! Can I ask how the balancing seems to work out for you?

Jia-Xiu commented 1 month ago

Sure, do you mean balance where the barcodes end in the read as well as minimum and maximum read length? If so, I will let you know later today or tomorrow morning when I have some results.

Reads that are assigned to two different samples look like this:

5' ---- ADAPTER ---- FORWARD_BARCODE_SAMPLE_A ---- 27F_PRIMER --- READ --- 1492R_PRIMER --- REVERSE_BARCODE_SAMPLE_A ---- ADAPTER ---- FORWARD_BARCODE_SAMPLE_B ---- 27F_PRIMER --- READ ---- 3'

There are two barcodes at the end of the read. Is it possible to ask the program to detect only the first barcode after the 16S insert ("REVERSE_BARCODE_SAMPLE_A" in the example above)? If so, I would like to trim that barcode and everything after it, and keep only the part of the read before the first reverse barcode.
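The request above could be sketched roughly as follows. This is an illustration only, not how nanomux works: `trim_at_first_reverse_barcode` is a hypothetical helper, matching is exact (no mismatches allowed), and the barcodes are assumed to be given in the orientation in which they appear in the read.

```python
def trim_at_first_reverse_barcode(read, barcodes, window):
    """Cut the read at the reverse barcode closest to the insert.

    `barcodes` maps a sample name to the barcode as it appears in the read;
    `window` is the number of bases at the 3' end to search.
    Returns (sample, trimmed_read), or (None, read) if nothing matches.
    """
    offset = max(len(read) - window, 0)
    tail = read[offset:]
    hits = []
    for sample, barcode in barcodes.items():
        pos = tail.find(barcode)  # leftmost exact match in the 3' window
        if pos != -1:
            hits.append((offset + pos, sample))
    if not hits:
        return None, read
    cut, sample = min(hits)  # smallest position = first barcode after the insert
    return sample, read[:cut]


# Toy read: insert + barcode of sample A + junction + barcode of sample B.
toy_barcodes = {"sample_A": "AAAAACCCCC", "sample_B": "TTTTTGGGGG"}
toy_read = "ACGT" * 5 + "AAAAACCCCC" + "GG" + "TTTTTGGGGG" + "ACGT"
print(trim_at_first_reverse_barcode(toy_read, toy_barcodes, window=30))
```

On the toy read, only sample A is reported and the second barcode is discarded together with the attached fragment.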

I hope what I have described is clear to you. Let me know if you need more details.

Thanks for all your updates, Xiu

willros commented 1 month ago

Actually, I meant how well balanced the samples in your pool are, i.e. whether there is roughly an equal amount of every sample.

Do you have some comments on the new version?

William

Jia-Xiu commented 1 month ago

My samples should be well balanced. Each sample is expected to get the same number of reads.

Thanks for asking about my experience with the new version! I have only tried 'greedy' so far. I didn't see any reads found in more than one sample, which is good. When checking the reads after demultiplexing, I found that the majority are assigned correctly (TP), which is also good. But a few reads are not (FP). For instance:

Sample A should have FORWARD_BARCODE_1 and REVERSE_BARCODE_1
Sample B should have FORWARD_BARCODE_1 and REVERSE_BARCODE_2

The read below was wrongly assigned to Sample B, although it belongs to Sample A:

5' ---- ADAPTER ---- FORWARD_BARCODE_1 ---- 27F_PRIMER --- READ --- 1492R_PRIMER (complement) --- REVERSE_BARCODE_1 (complement) ---- ADAPTER ---- REVERSE_BARCODE_2 ---- 1492R_PRIMER --- READ ---- 3'

A real read example below:

FORWARD_BARCODE_1 AACGAGTCTCTTGGGACCCATAGA
REVERSE_BARCODE_1 (complement) CCTGGTAACTGGGACACAAGACTC
REVERSE_BARCODE_2 AAGAAAGTTGTCGGTGTCTTTGTG
27F_PRIMER AGGGTTCGATCCTGGCTCAG
1492R_PRIMER CGGTTACCTTGTTACGACTT
1492R_PRIMER (complement) AAGTCGTAACAAGGTAGCCG

full read >e74de677-0b4d-43b6-9d60-b419b267dc3f TGTCCTGTACCTCGTCGGTTGGTCTTTGTTAACGAGTCTCTTGGGACCCATAGAAGGGTTCGATCCTGGCTCAGATTGAACGCTGGCGGCATGCCTTACACATGCAAGTCGAACGGTAACAGGTTAAGCTGACGAGTGGCGAACGGGTGAGTAATATATCGGAACGTGCCCAGTTGTGGGGGATAACTACTCGAAAGAGTGGCTAATACCGCATGAGACCTGAGGGTGAAAGCGGGGGATCGCAAGACCTCGCGCGATTGGAGCGGCCGATGTCAGATTAGCTAGTTGGTGGGGTAAAGGCCCACCAAGGCGACGATCTGTAGCTGGTCTGAGAGGACGACCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGGACAATGGACGAAAGTCTGATCCAGCCATGCCGCGTGCGGGAAGAAGGCCTTTGGGTTGTAAACCGCTTTTGTCAGGGAAGAAACGGGTTTCTCTAATACAGGGACCTAATAACAATCTTGCTGAAAGAAGTAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTACGGTGCGGCTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGCAAGACAGATGTGAAATCCCGGGCTAGAACCTCGGGAACTGCATTTGTAGCTGCATAGGCTAGAGTACGGTAGAGGGGGATGAAATCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCGATCCCCTGGACCTGTACTGACGCTCATGCACGGAGAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGTCAACTAGACTATGGGAGGGTTTCTTCTCAGTAACGAAGCTAACGCGTGAAGTTGACCGCCTGGGGAGTACGGCCGCAAGGTTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGATGATGTGGTTTAATTCGATGCAACGCGAAAAACCTTACCTACCCTTGACATGGACAGAATCCTGAAGAGATTTGGGAGTGCTCGAAAAGAGAACTGTCACACAGGTGCTGCATGGCCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCATTAGTTGCTACGAAAGGGCACTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAGGTCATCATGGCCCTTATGGGTAGGGCTACACACGTGCTACAATGGTACGTACGAGGGAGGCAAGCTGGCGACAGTGAGCGGATCTCTTAAAGCATATCGTAGTCCGGATCGCAGTGTCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCAAATCAGAATGTTGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTCAGGACTACAAAGAAGATAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCGCCTGGTAACTGGGACACAAGACTCTAAGAAAGTTGTCGGTGTCTTTGTGCGGTTACCTTGTTACGACTTAGCCCTAGTTACCAGTTTTACCCTAGGCAGCTCCTTGCGGTCACCGACTTCAGGCACCCCCAGCTTCCATGGCTTGACGGGCGGTGTGTACAAGGCCGGGAACGTATTCACC
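One way to inspect a read like this is to scan it for the sequences listed above and print them in order of position. The sketch below uses exact string matching only, and `locate_motifs` is a hypothetical helper. To keep it short, it is run on the 100 bases around the junction, copied from the read above, rather than on the full read.

```python
MOTIFS = {
    "FORWARD_BARCODE_1": "AACGAGTCTCTTGGGACCCATAGA",
    "REVERSE_BARCODE_1 (complement)": "CCTGGTAACTGGGACACAAGACTC",
    "REVERSE_BARCODE_2": "AAGAAAGTTGTCGGTGTCTTTGTG",
    "27F_PRIMER": "AGGGTTCGATCCTGGCTCAG",
    "1492R_PRIMER": "CGGTTACCTTGTTACGACTT",
    "1492R_PRIMER (complement)": "AAGTCGTAACAAGGTAGCCG",
}


def locate_motifs(read, motifs=MOTIFS):
    """Return (position, name) for every exact motif occurrence, sorted by position."""
    hits = []
    for name, seq in motifs.items():
        pos = read.find(seq)
        while pos != -1:
            hits.append((pos, name))
            pos = read.find(seq, pos + 1)
    return sorted(hits)


# Junction region copied from the read above (100 bases near the 3' end).
junction = (
    "GGGGTGAAGTCGTAACAAGGTAGCCGCCTGGTAACTGGGACACAAGACTC"
    "TAAGAAAGTTGTCGGTGTCTTTGTGCGGTTACCTTGTTACGACTTAGCCC"
)
for pos, name in locate_motifs(junction):
    print(pos, name)
```

In this region, REVERSE_BARCODE_1 (complement) is followed after a single base by REVERSE_BARCODE_2 and then by 1492R_PRIMER in the opposite orientation, which is the signature of the second, attached fragment.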

willros commented 1 month ago

Is this one of the concatenated reads?

Jia-Xiu commented 1 month ago

Yes, it is a concatenated read, but the second part is incomplete.

willros commented 1 month ago

How do you identify the concatenated reads to start with? How did you find this one for example?

willros commented 1 month ago

Maybe I can implement a read splitting function, like a preprocessing step. What could be the mark for the algorithm to split a read? Adapter in the middle or something else?
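A rough sketch of such a preprocessing step, assuming the mark is an exact copy of a known sequence found away from both ends of the read. The marker used here is a made-up placeholder, not a real adapter, and `split_at_internal_marker` and its `edge` parameter are hypothetical:

```python
def split_at_internal_marker(read, marker, edge=100):
    """Split a read at exact marker occurrences that lie away from both ends.

    Markers within `edge` bases of either end are treated as the read's own
    adapter and ignored. Each later piece starts at an internal marker.
    """
    cuts = []
    pos = read.find(marker)
    while pos != -1:
        if edge <= pos <= len(read) - edge - len(marker):
            cuts.append(pos)
        pos = read.find(marker, pos + len(marker))
    bounds = [0] + cuts + [len(read)]
    return [read[a:b] for a, b in zip(bounds, bounds[1:])]


# Placeholder marker (hypothetical, NOT a real adapter sequence).
marker = "GATTACAGATTACA"
toy_read = marker + "A" * 200 + marker + "C" * 200
pieces = split_at_internal_marker(toy_read, marker, edge=50)
print([len(piece) for piece in pieces])
```

Real nanopore reads would need approximate matching and both orientations of the marker; exact matching is used here only to keep the idea visible.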

Jia-Xiu commented 1 month ago

Splitting concatenated reads by adapter sounds like an option. I thought the --trim argument in Dorado basecalling might also trim adapters and thus split the concatenated reads; I will need to check. However, as mentioned, there is a potential risk involved: “Note that if you intend to demultiplex the reads at some later time, trimming adapters and primers may result in some portions of the flanking regions of the barcodes being removed, which could interfere with correct demultiplexing”. (ref)

Another option could be the reverse barcode searching I suggested earlier. (You may have already implemented this. Sorry to bring it up again.) I am not sure about the direction of the barcode search. For example, say the read is 1700 bp and we set --barcode_start 0 and --barcode_end 300. When searching for a reverse barcode, is the search conducted from 1400 to 1700, or from 1700 to 1400? If it is the former (from 1400 to 1700), only the first barcode should be considered. Conversely, if it is the latter (from 1700 to 1400), the second barcode should be considered.

My example is about reverse barcodes, but a concatenated read can be attached before the forward barcode as well, and the same rule should apply to forward barcode searching. Does this make sense?

If the barcodes are assigned correctly, do you think using the "-t --trim" argument of nanomux will effectively remove the barcodes and the concatenated reads attached to them?

Thanks, Xiu
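The direction question can be made concrete with a small sketch. It assumes exact matching and barcodes given as they appear in the read; `innermost_hit` is a hypothetical helper and says nothing about how nanomux actually searches. The rule is symmetric: in the 3' window the barcode closest to the insert is the first hit when scanning from 1400 to 1700, while in the 5' window it is the last hit.

```python
def innermost_hit(read, barcodes, barcode_start, barcode_end, end):
    """Return (position, name) of the barcode closest to the insert, or None.

    end="3prime": search the last window and keep the leftmost hit.
    end="5prime": search the first window and keep the rightmost hit.
    """
    n = len(read)
    if end == "3prime":
        lo, hi = max(n - barcode_end, 0), n - barcode_start
    else:
        lo, hi = barcode_start, min(barcode_end, n)
    hits = []
    for name, seq in barcodes.items():
        pos = read.find(seq, lo, hi)
        while pos != -1:
            hits.append((pos, name))
            pos = read.find(seq, pos + 1, hi)
    if not hits:
        return None
    return min(hits) if end == "3prime" else max(hits)


toy_barcodes = {"A": "AAAAACCCCC", "B": "TTTTTGGGGG"}
insert = "ACGT" * 25
read_3p = insert + "AAAAACCCCC" + "GG" + "TTTTTGGGGG"  # extra barcode after A
read_5p = "TTTTTGGGGG" + "GG" + "AAAAACCCCC" + insert  # extra barcode before A
print(innermost_hit(read_3p, toy_barcodes, 0, 30, "3prime"))
print(innermost_hit(read_5p, toy_barcodes, 0, 30, "5prime"))
```

In both toy reads the barcode next to the insert belongs to sample A, so both calls report A even though barcode B is also inside the search window.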

willros commented 3 weeks ago

Is the adapter sequence known to you? If so, can you provide it, along with its name?

Thanks!

willros / nanomux

How do you deal with concatenated reads? #4