wrpearson / fasta36

Git repository for FASTA36 sequence comparison software
Apache License 2.0

ssearch36 library size limit of 45000 sequences? #60

Closed brush111111111111 closed 1 month ago

brush111111111111 commented 8 months ago

What I originally wanted to do was get all pairwise sequence identities for two relatively large sets of sequences (~100k seqs). What I found was that when I give ssearch36 a query file with 1 sequence and a library file with 100k sequences, ssearch36 only returns (randomly?) the alignment sequence identity for 45000 sequences from the library file. The exact command I use is as follows:

ssearch36 -s BL62 -E 1e+10 -C 10 -T 4 query_1_seq.fa library_100k_seq.fa

The solution I have now is simply to chunk the sequence library file into sets of ~40k sequences.
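For anyone using the same workaround, here is a minimal sketch of the chunking step in Python. The function names, the output file naming scheme, and the default chunk size of 40000 are illustrative assumptions, not part of fasta36; the sketch only assumes a plain FASTA file in which every record starts with a `>` header line.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header plus its sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def chunk_records(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def split_fasta(in_path: str, out_prefix: str, chunk_size: int = 40000) -> List[str]:
    """Write the library to numbered chunk files (hypothetical naming) and return their paths."""
    out_paths: List[str] = []
    with open(in_path) as handle:
        chunks = chunk_records(read_fasta_records(handle), chunk_size)
        for i, chunk in enumerate(chunks):
            out_path = f"{out_prefix}_{i:03d}.fa"
            with open(out_path, "w") as out:
                out.writelines(chunk)
            out_paths.append(out_path)
    return out_paths
```

Calling `split_fasta("library_100k_seq.fa", "library_chunk")` would then produce files such as `library_chunk_000.fa`, each of which can be passed to the same ssearch36 command as the library.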

wrpearson commented 8 months ago

Yes, ssearch36 will only return 45k alignment scores. It looks at all the sequences, and returns the 45K best. There is a parameter in defs.h that sets the 45K. You can make it any number you want.

Bill Pearson
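To illustrate why the reported subset is not random: the behavior described above amounts to scoring every library sequence and keeping only the highest-scoring ones up to a fixed limit. The sketch below shows that general idea in Python with made-up scores; it is not ssearch36's actual code, and `keep_best`, the sequence names, and the scores are all hypothetical.

```python
import heapq
import random


def keep_best(scored, limit):
    """Return the `limit` highest-scoring (name, score) pairs, best first."""
    return heapq.nlargest(limit, scored, key=lambda item: item[1])


# Made-up scores for a toy "library" of 100 sequences.
random.seed(0)
library_scores = [(f"seq{i}", random.randint(0, 1000)) for i in range(100)]

# Every sequence was scored, but only the best 45 are kept for reporting.
best = keep_best(library_scores, 45)
```

With a limit smaller than the library, the low-scoring sequences drop out of the report even though they were compared against the query.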


brush111111111111 commented 8 months ago

Thank you very much for your prompt reply. May I ask what protocol you would recommend if I wanted to calculate the pairwise sequence identity of around 100k sequences? What would be the most efficient way to go about it? Would you recommend changing the defs.h limit? Would this potentially lead to memory problems? Or would you suggest another solution?

Thank you very much.

wrpearson commented 8 months ago

Just modify defs.h. Changing MAXLIB_P to 150000 should do the trick, and should not cause memory problems on modern machines (fasta used to run on 300K (not Meg, not Gig) machines).

Bill Pearson
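If the edit needs to be scripted rather than made by hand, a small Python helper along these lines could rewrite the value. It assumes the limit appears in defs.h as a line of the form `#define MAXLIB_P <number>`; the thread does not show the file, so check the actual definition first. The function name `set_maxlib_p` and the sample text are hypothetical.

```python
import re


def set_maxlib_p(defs_text: str, new_limit: int) -> str:
    """Return defs_text with the numeric value of MAXLIB_P replaced by new_limit.

    Assumes a line of the form '#define MAXLIB_P <number>' (an assumption
    about the layout of defs.h, which is not shown in this thread).
    """
    pattern = re.compile(r"^(\s*#\s*define\s+MAXLIB_P\s+)\d+", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{new_limit}", defs_text
    )
    if count == 0:
        raise ValueError("no '#define MAXLIB_P <number>' line found")
    return new_text


# Hypothetical sample text, not the real contents of defs.h.
sample = "#define MAXLIB 10\n#define MAXLIB_P 45000\n"
patched = set_maxlib_p(sample, 150000)
```

Because defs.h is a C header, the programs have to be recompiled after the change for the new limit to take effect.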
