wrpearson / fasta36

Git repository for FASTA36 sequence comparison software
Apache License 2.0

Different results for single-sequence FASTA query than same sequence in multi-FASTA query #57

Open jonathandmoore opened 1 year ago

jonathandmoore commented 1 year ago

I searched for a particular ORF sequence against a given database, using ssearch36.

fasta36 -E 10 -f 10 -g 2 orf.fasta library.fasta 2

Then I searched for a multi-FASTA file containing many ORFs, including the original one, against the same database.

fasta36 -E 10 -f 10 -g 2 lots_of_orfs.fasta library.fasta 2

I get similar hits, but they have different bit scores and E-values, which I think is driven by the different statistics. Is this a 'feature'?

Example outputs illustrating the problem. This is the result for a single ORF:

   539214 residues in  2631 sequences

Statistics:  Expectation_n fit: rho(ln(x))= 8.6726+/-0.00279; mu= -2.9805+/- 0.107
 mean_var=53.5146+/-12.132, 0's: 0 Z-trim(86.9): 21  B-trim: 21 in 1/42
 Lambda= 0.175323
 statistics sampled from 722 (728) to 722 sequences
Algorithm: FASTA (3.8 Nov 2011) [optimized]
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
 ktup: 2, E-join: 1 (0.559), E-opt: 0.2 (0.277), width:  16
 Scan time:  0.030

This is the result for the same ORF as part of a list of searches:

   539214 residues in  2631 sequences

Statistics:  Expectation_n fit: rho(ln(x))= 8.7380+/-0.0029; mu= -3.1580+/- 0.110
 mean_var=64.3984+/-15.325, 0's: 0 Z-trim(91.3): 11  B-trim: 24 in 1/42
 Lambda= 0.159822
 statistics sampled from 725 (728) to 725 sequences
Algorithm: FASTA (3.8 Nov 2011) [optimized]
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
 ktup: 2, E-join: 1 (0.778), E-opt: 0.2 (0.49), width:  16
 Scan time:  0.030

wrpearson commented 1 year ago

The statistical estimates provided by the FASTA programs (including SSEARCH) are determined empirically, by sampling up to 60,000 scores from the library searched.

Since your library only has 2631 sequences, the estimates are based on all the scores that were calculated.

But since your library has different sequences, the distribution of scores is slightly different, and thus the statistical estimates are slightly different.

If you look at the numbers after the "Statistics:" line, you see that the rho, mu, mean_var, and Lambda are all slightly different, reflecting the different sets of scores that were found in the two searches. These parameters were determined by fitting the scores that were obtained, and are used to calculate the E()-value and bit score.
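To make the size of these differences concrete, here is a minimal Python sketch that pulls the fitted parameters out of the two "Statistics:" blocks quoted in this issue and reports the relative change in each. The helper names (`parse_statistics`, `relative_change`) are hypothetical and not part of the FASTA package, and the regular expressions assume the output layout shown above; other versions or options may print the block differently.

```python
import re

# Hypothetical helpers for illustration; not part of FASTA36.
_NUM = r"(-?\d+(?:\.\d+)?)"
PATTERNS = {
    "rho": re.compile(r"rho\(ln\(x\)\)=\s*" + _NUM),
    "mu": re.compile(r"mu=\s*" + _NUM),
    "mean_var": re.compile(r"mean_var=\s*" + _NUM),
    "Lambda": re.compile(r"Lambda=\s*" + _NUM),
}


def parse_statistics(text):
    """Return the fitted parameters found in a 'Statistics:' block."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            stats[name] = float(match.group(1))
    return stats


def relative_change(first, second):
    """Fractional change from first to second for parameters present in both."""
    return {
        name: (second[name] - first[name]) / abs(first[name])
        for name in first
        if name in second and first[name] != 0
    }


# The two blocks reported in this issue (single query, then multi-query run).
single_block = """Statistics:  Expectation_n fit: rho(ln(x))= 8.6726+/-0.00279; mu= -2.9805+/- 0.107
 mean_var=53.5146+/-12.132, 0's: 0 Z-trim(86.9): 21  B-trim: 21 in 1/42
 Lambda= 0.175323"""

multi_block = """Statistics:  Expectation_n fit: rho(ln(x))= 8.7380+/-0.0029; mu= -3.1580+/- 0.110
 mean_var=64.3984+/-15.325, 0's: 0 Z-trim(91.3): 11  B-trim: 24 in 1/42
 Lambda= 0.159822"""

single_stats = parse_statistics(single_block)
multi_stats = parse_statistics(multi_block)
change = relative_change(single_stats, multi_stats)

for name, value in change.items():
    print(f"{name}: {single_stats[name]} -> {multi_stats[name]} ({value:+.1%})")
```

With the values above, Lambda drops by roughly 9% and mean_var rises by roughly 20% between the two runs, which is consistent with the shifted bit scores and E-values.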

I do think of it as a "feature", since the estimates reflect the properties of the database that was searched.

Bill Pearson

jonathandmoore commented 1 year ago

Thanks for the super-quick reply. My concern is that both of my searches use exactly the same library, but one has a single query sequence and the other has multiple query sequences including that single sequence. It seems that the statistics are not just dependent on the library, but also on the query sequences: the same query sequence gets different statistics depending on which other query sequences it is submitted with, even though the library does not change. Hope this is clear.

If protein A is searched against library P, it gets different scores than it gets if protein A and protein B are searched against library P.

wrpearson commented 1 year ago

The statistics do depend on both the query and the library sequences. However, searching the same library with a query sequence on its own, or with that same sequence as one of several separate sequences in a multi-query file, should produce the same results (subject to sampling of the library, which only takes place with more than 60,000 sequences).

If a query sequence is embedded in another sequence, then the statistics will be different. But if one search has 1 query A, and another search has 10 queries including A, then the results for A should be the same. The statistics for each query in a multi-query search are calculated independently.

However, I just did a test and learned that I am mistaken -- I get slightly different results when the query sequence is part of a multi-query library. I will look into it.
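For anyone reproducing this comparison, it helps to be sure the single-query file contains exactly the same residues as the corresponding record in the multi-query file. The following is a minimal Python sketch of one way to do that; `read_fasta` and `extract_record` are hypothetical helpers, and the sequence names and residues in the example are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract_record(lines, seq_id):
    """Return one record as FASTA text, matched on the first word of its header."""
    for header, sequence in read_fasta(lines):
        if header.split()[:1] == [seq_id]:
            return f">{header}\n{sequence}\n"
    return None


# Made-up multi-FASTA content standing in for a multi-query file.
multi_fasta = """>orfA hypothetical protein A
MKTAYIAKQR
QISFVKSHFS
>orfB hypothetical protein B
MSDNELLKQA
""".splitlines(keepends=True)

single = extract_record(multi_fasta, "orfA")
print(single, end="")
```

Writing the returned text to its own file gives a single-query input that is guaranteed to match the record used in the multi-query search, so any remaining difference in the statistics cannot come from the query itself.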

Bill Pearson
