Is it possible to consider primer reads during demultiplexing?

Jia-Xiu commented 4 months ago

Hi William, I tested this package, and it works. Very cool. Thanks a lot! However, I have a few questions regarding my output.

1. I noticed that barcodes are removed during the analysis. Could you please provide an option to keep the barcodes for further verification or for using the output on other platforms?
2. I found that many reads appear in more than one sample. Consequently, the "Number of sequences with barcodes" is higher than the "Number of raw sequences" (please see details below). This could be due to duplication among our forward and reverse barcodes (a design flaw in our barcodes). I was wondering whether primer sequences could be incorporated during demultiplexing to reduce the number of repeated assignments and increase the accuracy of demultiplexing. For example, could we link our barcodes and primers as a single entity [fwBC’ = fwBC + fwPrimer; rvBC’ = rvBC + rvPrimer]? I also have some degenerate bases in my primers. Can degenerate bases be considered during the analysis? Or is it possible to use the primers as a trailing flank for the front barcode, as in Dorado?
3. When I submitted my job via Slurm, I noticed that my dataset requires a lot of memory (the job was killed when I requested 96 GB of memory for an input fastq.gz file of 15 GB). Is there a way to optimize memory usage?
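
The barcode-plus-primer idea raised above can be sketched roughly as follows. This is only an illustration, not how nanomux works: the function names, the toy sequences, the 200-base search window, and the use of plain Hamming distance (substitutions only, no indels, no degenerate bases) are all assumptions made for the example.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_composite(read: str, barcode: str, primer: str,
                   max_mismatch: int = 2, window: int = 200) -> int:
    """Search the first `window` bases of `read` for barcode + primer.

    Returns the start position of the best hit, or -1 if no position
    is within `max_mismatch` substitutions of the composite pattern.
    """
    pattern = barcode + primer          # fwBC' = fwBC + fwPrimer
    region = read[:window]
    best_pos, best_dist = -1, max_mismatch + 1
    for i in range(len(region) - len(pattern) + 1):
        d = hamming(region[i:i + len(pattern)], pattern)
        if d < best_dist:               # keep the closest hit seen so far
            best_pos, best_dist = i, d
    return best_pos


# Toy sequences, invented for this example.
barcode = "ACGTACGTAC"
primer = "GGATTCCA"
read_ok = "TT" + barcode + primer + "AAAAAAA"
read_wrong_primer = "TT" + barcode + "CCTAAGGT" + "AAAAAAA"

print(find_composite(read_ok, barcode, primer))
print(find_composite(read_wrong_primer, barcode, primer))
```

Because the primer extends the pattern, a barcode that occurs by chance without its primer next to it no longer counts as a hit, which is the effect the trailing-flank approach is after.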

Thanks in advance! Xiu

mode: fuzzy
mismatch: 4
barcode_start: 0
barcode_end: 1900
read_len_min: 1400
read_len_max: 1700
minimum_reads: 1
parquet: True
Running nanomux on simplex_test.fastq
Number of raw sequences in simplex_test.fastq: 242093
Number of sequences between 1400bp and 1700bp: 183229
Barcode: PCRcontrol_r3, contained: 40 reads
Number of sequences with barcodes: 1153075
Number of barcodes found: 271
Number of reads found in more than one sample: 1152785
Saving parquet file to results_nanomux_test/simplex_test.parquet
Fasta files saved to: results_nanomux_test
Nanomux is done!

willros commented 4 months ago

Hi,

Thank you for your comments!

1. I will add the option to not remove the barcodes.
2. It is because you have chosen to allow 4 mismatches in fuzzy mode. I think I will put a cap at 1 or 2, because otherwise this will happen. How long are your barcodes?
3. Nice to know! I will make the program work in batches instead of reading everything into memory at once!
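
For readers wondering what working in batches could look like, here is a minimal sketch. It is not the nanomux implementation: the function name `read_fastq_batches`, the batch size, and the assumption that every record occupies exactly four lines are all made up for the example.

```python
import gzip
from itertools import islice
from typing import Iterator, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq_batches(path: str,
                       batch_size: int = 10_000) -> Iterator[List[Record]]:
    """Yield lists of FASTQ records, at most `batch_size` per list.

    Assumes plain four-line records (no line-wrapped sequences).
    Only one batch is held in memory at a time.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            # Pull the next 4 * batch_size lines without loading the rest.
            lines = list(islice(handle, 4 * batch_size))
            if not lines:
                break
            yield [
                (lines[i].rstrip("\n"),
                 lines[i + 1].rstrip("\n"),
                 lines[i + 3].rstrip("\n"))
                for i in range(0, len(lines) - 3, 4)
            ]


# Hypothetical usage: process each batch, then let it be garbage collected.
# for batch in read_fastq_batches("simplex_test.fastq.gz"):
#     for header, seq, qual in batch:
#         ...
```

With this pattern, peak memory depends on the batch size and not on the size of the input file.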

I'll let you know when I am done!

Thanks! William

Jia-Xiu commented 4 months ago

Thanks for considering these options!

I will try to set a cap of 1 or 2 as you suggested and check the results. My barcodes are 24nt long.
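
One rough way to pick a mismatch cap for a given barcode set is to look at the smallest pairwise distance between the barcodes: if the closest two barcodes differ in d positions, a read with up to (d - 1) // 2 substitutions can still be assigned unambiguously. The sketch below assumes equal-length barcodes and plain Hamming distance, which ignores the insertions and deletions common in nanopore reads and may differ from how nanomux scores fuzzy matches. The barcodes are invented 8 nt examples, not real ones.

```python
from itertools import combinations
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def safe_mismatch_cap(barcodes: Dict[str, str]) -> int:
    """Largest substitution count that cannot make two barcodes collide."""
    d_min = min(hamming(a, b)
                for a, b in combinations(barcodes.values(), 2))
    return (d_min - 1) // 2


# Invented barcodes, for illustration only.
toy_barcodes = {
    "sample1": "AAAAAAAA",
    "sample2": "AAAATTTT",
    "sample3": "TTTTTTTT",
}
print(safe_mismatch_cap(toy_barcodes))
```

If the cap computed this way is lower than the mismatch setting used in a run, reads landing in more than one sample are to be expected.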

How about incorporating barcodes with degenerate bases? For example, using R to represent A or G, and Y to represent C or T.
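
As a sketch of what degenerate-base support could mean in practice: in the IUPAC nucleotide code, R stands for A or G and Y stands for C or T, and a degenerate position counts as a match when the read base is one of the allowed bases. The function name and the short pattern below are hypothetical and only illustrate the comparison; they are not taken from nanomux.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def degenerate_mismatches(pattern: str, segment: str) -> int:
    """Count positions where `segment` is not allowed by `pattern`.

    `pattern` may contain IUPAC codes; `segment` is a read fragment of
    the same length made of A, C, G and T.
    """
    if len(pattern) != len(segment):
        raise ValueError("pattern and segment must have the same length")
    return sum(base not in IUPAC[code]
               for code, base in zip(pattern, segment))


# Invented pattern with two degenerate positions (R and Y).
print(degenerate_mismatches("GTRYAC", "GTACAC"))  # R allows A, Y allows C
print(degenerate_mismatches("GTRYAC", "GTCAAC"))  # C and A are not allowed
```

A mismatch threshold can then be applied to this count in the same way as for non-degenerate barcodes.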

Thanks! Xiu

willros commented 4 months ago

I will try.

Please try it with greedy mode as well and check the results!

willros commented 4 months ago

I have now added the three points that you suggested. Thanks for the nice suggestions!

Please reinstall everything and try it out! See the new readme for usage.

Regards, William

willros commented 3 months ago

Hi @Jia-Xiu,

I have made a lot of changes to the fuzzy searching algorithm.

A mismatch setting of 3 works well for me now. It is considerably slower than using 1 or 2, but it finds more barcodes and produces fewer duplicates than before.

Please download the newest version and try again!

Thanks, Ville

Jia-Xiu commented 3 months ago

Thank you @willros!

My job using the previous version is still running. It takes about 2 hrs to write the output of each sample (fuzzy with mismatch of 1). I will run the new version and check the results.

Cheers, Xiu

willros commented 3 months ago

Maybe kill the current job and try again with the newest version.

How many samples do you have to demux?

willros / nanomux

Is it possible to consider primer reads during demultiplexing? #3