ssearch36 library size limit of 45000 sequences? (wrpearson/fasta36, Issue #60)
Closed: brush111111111111 closed this issue 1 month ago
Yes, ssearch36 will only return 45k alignment scores. It looks at all the sequences, and returns the 45K best. There is a parameter in defs.h that sets the 45K. You can make it any number you want.
Bill Pearson
On Jan 20, 2024, at 3:49 AM, brush111111111111 wrote:
What I originally wanted to do was get all pairwise sequence identities between 2 relatively large sets of sequences (~100k seqs). What I found was that when I give ssearch36 a query file with 1 sequence and a library file with 100k sequences, ssearch36 only returns (randomly?) the alignment sequence identity for 45000 sequences from the library file. The exact command I use is as follows:
ssearch36 -s BL62 -E 1e+10 -C 10 -T 4 query_1_seq.fa library_100k_seq.fa
Thank you very much for your prompt reply. May I ask what protocol you would recommend if I wanted to calculate the pairwise sequence identity of around 100k sequences? What would be the most efficient way to go about it? Would you recommend changing the defs.h limit? Would this potentially lead to memory problems? Or would you suggest another solution?
Thank you very much.
Just modify defs.h. Changing MAXLIB_P to 150000 should do the trick, and should not cause memory problems on modern machines (fasta used to run on 300K (not Meg, not Gig) machines).
Bill Pearson
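For anyone who prefers to script the defs.h change rather than edit it by hand, here is a minimal sketch. It assumes the limit appears in defs.h as a plain `#define MAXLIB_P <integer>` line; the helper name `set_define` and the sample header text are made up for illustration, and the real file may lay the macro out differently (for example under conditional compilation, in which case every matching line is rewritten). The programs would then need to be rebuilt for the new value to take effect.

```python
import re


def set_define(header_text: str, name: str, value: int) -> str:
    """Return header_text with every '#define <name> <integer>' set to value.

    Hypothetical helper: it only handles simple integer macros.
    """
    pattern = re.compile(
        rf"^([ \t]*#[ \t]*define[ \t]+{re.escape(name)}[ \t]+)\d+",
        re.MULTILINE,
    )
    # A function replacement avoids ambiguity between the group
    # reference and the digits of the new value.
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{value}", header_text)
    if count == 0:
        raise ValueError(f"no '#define {name} <integer>' line found")
    return new_text


# Assumed excerpt, not copied from the real defs.h.
sample = "#define MAXLIB_P 45000\n"
patched = set_define(sample, "MAXLIB_P", 150000)
```

To apply it to a real checkout, read the header into a string, pass it through `set_define`, and write the result back before rebuilding.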
The solution I have now is simply to chunk the sequence library file into sets of ~40k sequences.
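The chunking workaround could be scripted roughly as follows. This is only a sketch, assuming a plain FASTA library in which every record starts with a `>` header line; the function names, the output file naming scheme, and the default of 40000 records per chunk are illustrative choices, not anything provided by the FASTA package.

```python
from typing import Iterable, Iterator, List


def fasta_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group FASTA lines into records (a header line plus its sequence lines)."""
    record: List[str] = []
    for line in lines:
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield record


def chunked(records: Iterable[List[str]], chunk_size: int) -> Iterator[List[List[str]]]:
    """Yield lists of at most chunk_size records."""
    chunk: List[List[str]] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def split_fasta(fasta_path: str, out_prefix: str, chunk_size: int = 40000) -> List[str]:
    """Write each chunk to '<out_prefix>_NNN.fa' and return the paths written."""
    paths: List[str] = []
    with open(fasta_path) as handle:
        for number, chunk in enumerate(chunked(fasta_records(handle), chunk_size)):
            path = f"{out_prefix}_{number:03d}.fa"
            with open(path, "w") as out:
                for record in chunk:
                    out.writelines(record)
            paths.append(path)
    return paths


# Small in-memory demonstration: three records, two per chunk.
demo = [">a\n", "MKT\n", ">b\n", "MKV\n", "LLA\n", ">c\n", "MST\n"]
sizes = [len(chunk) for chunk in chunked(fasta_records(demo), 2)]
```

Each file returned by `split_fasta` would then be passed as the library argument in place of library_100k_seq.fa in the command above, and the per-chunk results combined afterwards.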