zaneveld / organelle_removal


Are ref. sequences all V4? #20

Open lxsteiner opened 1 month ago

lxsteiner commented 1 month ago

Hi,

Thank you for tackling the organelle contamination problem (especially in corals and sponge microbiomes)!

I was wondering whether the reference sequences in your extended silva database files provided here (as .qza):
https://github.com/zaneveld/organelle_removal/tree/main/Tutorial/qiime2_CLI_tutorial/taxonomy_references
https://github.com/zaneveld/organelle_removal/tree/main/Tutorial/qiime2_API_tutorial/input/taxonomy_references

and in the manuscript (as FASTA and taxonomy.txt): https://www.biorxiv.org/content/10.1101/2021.02.23.431501v2.supplementary-material

have all already been trimmed to the V4 region (or to any other region). Have they been processed as described here: https://github.com/zaneveld/organelle_removal/blob/main/Tutorial/qiime2_API_tutorial/procedure/extended_taxonomy_construction_tutorial.ipynb

#import, select V4 region, merge, save
organelle_seqs = Artifact.import_data('FeatureData[Sequence]',
                                      refs_dir + '/organelle_sequences.fasta')
v4_organelle_seqs, = extract_reads(organelle_seqs, 'GTGYCAGCMGCCGCGGTAA',
                                   'GGACTACNVGGGTWTCTAAT', n_jobs = 24,
                                   read_orientation = 'forward')
silva_extended_seqs, = merge_seqs([v4_organelle_seqs,
                                   Artifact.load(refs_dir +
                                                 '/silva_sequences.qza')])
#save the sequence files for both extended files
silva_extended_seqs.save(refs_dir + '/silva_extended_sequences.qza')

I couldn't really find that information specified anywhere in the tutorials or the preprint.

I'm asking because I get unexpected results with every one of your extended silva database files: instead of more chloroplast and mitochondria hits, I get ~50% fewer than with my normal Silva138 classifier. Also, trying to extract the V3V4 region doesn't seem to work on the unprocessed sequences provided in the preprint.

This would lead me to assume they are preprocessed. Could you please confirm? It would also be helpful if the provided files and the preprint stated that your extended database is for V4 only ;)
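One rough way to test that assumption is to look at sequence lengths: a 515F/806R (V4) amplicon with primers removed is roughly 250 nt, while full-length 16S references are around 1,500 nt. The following is only a minimal sketch, assuming the references are available as a plain FASTA file; `read_fasta` and `length_summary` are illustrative helper names, not part of this repository or of QIIME 2, and the demo input is made up.

```python
import io
import statistics


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_summary(handle):
    """Summarize sequence lengths (hypothetical helper for a quick sanity check)."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    return {
        "n": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Toy input standing in for a real reference FASTA (made-up sequences).
demo = io.StringIO(">a\nACGT\nACGT\n>b\nACGTAC\n")
summary = length_summary(demo)
print(summary)
```

On real data, the handle would come from something like `open('organelle_sequences.fasta')`; a `.qza` would first need to be exported to FASTA (for example with `qiime tools export`). A median near 250 nt would suggest V4-trimmed references, though organelle rRNA lengths vary, so this is a hint and not proof.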

Edit: Nvm, the answer is here (the downloaded silva files are already trimmed to 515-806):

if not os.path.isfile(refs_dir + '/silva_sequences.qza'):
    download_file('https://data.qiime2.org/2021.2/common/silva-138-99-seqs-515-806.qza', 
                  refs_dir + '/silva_sequences.qza')
if not os.path.isfile(refs_dir + '/silva_taxonomy.qza'):
    download_file('https://data.qiime2.org/2021.2/common/silva-138-99-tax-515-806.qza', 
                  refs_dir + '/silva_taxonomy.qza')

A general database from which individual hypervariable regions (HVRs) could be extracted would be more useful. I'll try to generate one for V3V4 following https://github.com/zaneveld/organelle_removal/blob/main/Tutorial/qiime2_API_tutorial/procedure/extended_taxonomy_construction_tutorial.ipynb Thanks.
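To illustrate why a V4-only reference cannot be re-trimmed to V3V4, here is a simplified sketch of in-silico primer extraction in plain Python. It is not how QIIME 2's `extract_reads` is implemented (that method tolerates mismatches, whereas this sketch uses exact IUPAC matching). The V4 primers are the ones from the notebook excerpt above; the V3V4 pair (341F/805R) is an assumption, as are the helper names and the made-up toy sequence.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

V4_PRIMERS = ("GTGYCAGCMGCCGCGGTAA", "GGACTACNVGGGTWTCTAAT")  # from the notebook
V3V4_PRIMERS = ("CCTACGGGNGGCWGCAG", "GACTACHVGGGTATCTAATCC")  # assumed 341F/805R


def primer_regex(primer):
    """Turn a degenerate primer into a regex pattern (exact match only)."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def extract_region(seq, f_primer, r_primer):
    """Return the stretch between both primer sites, or None if either is missing."""
    seq = seq.upper().replace("U", "T")
    fwd = re.search(primer_regex(f_primer), seq)
    if fwd is None:
        return None
    rest = seq[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(r_primer)), rest)
    if rev is None:
        return None
    return rest[:rev.start()]


# Made-up "full-length" toy sequence containing both forward sites and a
# shared reverse site; real references are of course much longer.
full = ("AAAA" + "CCTACGGGAGGCAGCAG" + "TTGACGTTAC" + "GTGCCAGCAGCCGCGGTAA"
        + "ACGGAGGGTGCAAGCGTTAATC" + "GGATTAGATACCCTGGTAGTCC" + "AAAA")

v4_only = extract_region(full, *V4_PRIMERS)
v3v4_from_full = extract_region(full, *V3V4_PRIMERS)
v3v4_from_v4 = extract_region(v4_only, *V3V4_PRIMERS)

print("V4 from full length:  ", v4_only)
print("V3V4 from full length:", v3v4_from_full)
print("V3V4 from V4-only:    ", v3v4_from_v4)
```

Once a sequence has been cut down to the 515-806 amplicon, the 341F site lies outside what remains, so the V3V4 extraction finds nothing. A region-agnostic database would therefore need to start from full-length references for both the organelle and the silva sequences.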

sonettd commented 1 month ago

Thank you for pointing this out, and for the digging! I grabbed the QIIME2 preprocessed V4 sequences at some point as a shortcut, but we'd like this to be broadly applicable, so I'll make some changes.