xjtu-omics / msisensor-pro

Microsatellite Instability (MSI) detection using high-throughput sequencing data.
Other
98 stars 22 forks source link

Different results for v1.2.0 and v1.2.2 and v1.3.0 #74

Open Stikus opened 5 days ago

Stikus commented 5 days ago

Hello, thanks for updates for your tool.

We're trying to update our docker containers with your tool but registered different results for v1.2.0 and v1.2.2: Run commands:

/soft/msisensor-pro-1.2.0/bin/msisensor-pro msi -d /ref/human_g1k_v37.list -n /inputs/normal.bam -t /inputs/tumor.bam -b 32 -o /outputs/output/test_pair

/soft/msisensor-pro-1.2.2/bin/msisensor-pro msi -d /ref/human_g1k_v37.list -n /inputs/normal.bam -t /inputs/tumor.bam -b 32 -p 8 -o /outputs/output/test_pair

First of all - number of sites and windows changed:

Number of total sites: 1786895 Number of sites with enough coverage: 58 Number of MSI sites: 0


- `v1.2.2`:

Total loading windows: 5696 Total loading homopolymer and microsatellites: 363133 ... Summary information

Number of total sites: 363133 Number of sites with enough coverage: 0 Number of MSI sites: 0


And more important - we have empty germline result now:
- `v1.2.0`:

chromosome location left_flank_bases repeat_times repeat_unit_bases right_flank_bases genotype difference P_value FDR 19 2008302 CACTC 13 T CTAGG 13|12 0.21678 0.54982 1.063 19 2011861 TTTCC 10 T GAGAC 10|10 0 1 1.3488 19 2017120 CGTCT 6 CAAA ACAAA 6|6 0.08 0.28932 1.1986 19 2017151 AAAAC 10 A CTAGG 10|10 0.076167 0.57701 1.0458 19 2018642 TGGAC 12 T AAAGA 12|12 0.10526 0.33506 0.97168 19 2018961 ACAGG 6 CA CGGGA 6|6 0.076923 0.37455 0.90517 19 2019980 CTTTT 6 TTCC TTCTT 6|6 0 1 1.0175 19 2022596 TTTCC 11 T CCCTG 11|11 0.082596 0.81839 1.1577 19 2025188 AATAA 12 T GAGAC 12|12 0.42105 0.037295 2.1631 19 2025522 CCTTC 5 CT CCTTC 5|5 0 1 1.234 19 2027127 ACCAG 7 AC CAGAC 7|7 0.028169 0.80606 1.1688 19 2027144 ACCAG 7 AC CAGAC 7|7 0.028169 0.80606 1.1988 19 2027161 ACCAG 5 AC AGGGC 5|5 0 1 1.3182 19 2027495 GAGCC 10 AG AAAGA 10|10 0.21053 0.30538 1.1808 19 2028158 CCTCT 11 A TTACA 11|11 0 1 1.2609 19 2031306 ATCTC 12 A TAAAA 11|11 0.40209 0.1242 1.2006 19 2032700 GTGCC 5 TTCA ATCTG 5|5 0 1 1.2083 19 2036371 AAAGG 12 A TACAA 12|12 0.34862 0.48119 1.0734 19 2039434 ACAGG 10 A CCCAA 11|10 0.13397 0.46656 1.0824 19 2046939 ACTCA 5 GT GGATT 5|5 0 1 1.16 19 2049907 CAGAG 7 AC AAACA 7|7 0.097561 0.21243 1.369 19 2049923 CACAA 5 AC ATACA 5|5 0 1 1.0943 19 2052598 TAAGA 7 GT GCACG 7|7 0.071429 0.22779 1.101 19 2058828 GGCTT 5 TTTG TTTTG 4|5 0.28571 0.21417 1.1293 19 2068016 CGTTT 12 A TGTGT 12|12 0.19048 0.23128 1.0319 19 2069060 CCTTT 11 A TGGCA 11|11 0.09482 0.59809 1.0203 19 2069767 CTCAG 5 AACA ACTAG 5|5 0 1 1.0741 19 2074245 GAGAT 5 CAAA AAAAC 5|5 0 1 1.0545 19 2074718 GCTAA 11 T CCCCT 11|11 0 1 1.0357 19 2075031 TTTTC 15 T GAGAC 15|15 0.6255 0.12035 1.7451 19 2077400 GTCTC 11 A GAAAA 11|11 0.38462 0.18621 1.35 19 2077787 AATCC 13 T GAGAC 13|13 0.16667 0.51553 1.0679 19 2080383 ACCTC 11 T GAGGC 11|11 0 1 1 19 2083003 ATTTT 5 TTG TATTT 5|4 0.35897 0.071532 2.0744 19 2084890 CCTAG 10 T ATTTA 10|10 0 1 1.1373 19 2090669 CGTTA 11 T AAGAC 11|11 0.0625 0.30973 1.1228 19 2092000 AACAG 10 A GAGAA 10|10 0.076923 0.33953 0.93774 19 2093160 GGGAA 13 T CGAGA 13|13 0.24561 0.60897 1.0092 19 2095176 CTCAA 7 AAAC AAAAA 7|7 0.08 0.6604 1.0352 19 2095756 TGTTC 12 T CCCCC 12|12 0.094845 0.57166 1.0696 19 2103219 TTATT 11 A TGGGA 11|11 0.25 0.10777 2.0835 19 2125834 GCTAG 11 A CAAAA 12|12 0.27338 0.37271 0.93987 19 2133510 CCAGC 12 T GAGAC 12|12 0.086957 0.61346 0.98836 19 2137313 ATGTG 13 T CTTGG 13|13 0.13333 0.31937 1.0291 19 2140479 TTGAC 13 T GAGAC 13|14 0.32558 0.32124 0.98061 19 2142754 TTCTG 12 T CTTTT 11|11 0.10526 0.5009 1.076 19 2143763 GCCTC 10 A TAAAA 10|10 0.094708 0.57759 1.0152 19 2144904 TCCAA 5 AAAC AAAAC 5|5 0 1 1.1154 19 2145758 ACCTC 6 CA CCAAG 6|6 0 1 1.381 19 2147882 GTCTC 11 A GAAAA 11|11 0.13421 0.52212 1.0442 19 2152539 CGTCT 10 A GAGAG 10|10 0 1 1.1837 19 2157097 CCCAT 5 CATC CACCC 5|5 0.086957 0.67485 1.03 19 2158999 TATAA 11 T AATGG 11|10 0.27869 0.31511 1.0751 19 2159210 TTCCA 5 TC TTTTT 5|5 0.30769 0.12186 1.4136 19 2166076 ATTTA 5 TATGT TGTTA 5|5 0 1 1.2889 19 2172506 CTTTC 13 T GAAGA 13|13 0.2439 0.35051 0.92408 19 2177298 AACAA 12 T GGGAG 12|12 0.16 0.17247 1.4291 19 2183629 AGTGC 14 T AAGGC 14|14 0.26087 0.21288 1.2347


- `v1.2.2`:

chromosome location left_flank_bases repeat_times repeat_unit_bases right_flank_bases genotype difference P_value FDR



Is this intended? Is this due to `MinMicrosate` and `MaxMicrosate` change in commit https://github.com/xjtu-omics/msisensor-pro/commit/3d3d7b8105168e522619b86b240623599dabd9bd ? Looks like now zero sites have enough coverage but `covCutoff` is same (15). 
Stikus commented 5 days ago

We've found reason for empty result - we used /ref/human_g1k_v37.list from v1.2.0 scan command for v1.2.2 (we don't know that they will change after update).

==> human_g1k_v37.list_v1.2.0 (71504K size) <==
chromosome      location        repeat_unit_length      repeat_times    repeat_unit_bases       left_flank_bases        right_flank_bases
1       10108   6       6       AACCCT  AACCC   AACCC
1       10147   6       5       CCCTAA  CTAAC   CCTAA
1       10177   6       9       CCTAAC  CCTAA   CCCTA
1       10255   6       5       AACCCT  CCCTA   AACCC
1       10285   6       6       AACCCC  ACCCT   AACCC
1       10352   6       6       ACCCTA  ACCCT   ACCCC
1       10397   6       7       CCCTAA  CTAAC   CCCCT
1       26453   2       6       GT      TGTTA   TTGCT
1       28588   1       15      T       GGTGG   GCATC

==> human_g1k_v37.list_v1.2.2 (54029K size) <==
chromosome      location        repeat_unit_length      repeat_unit_binary      repeat_times    left_flank_binary       right_flank_binary      repeat_unit_bases     left_flank_bases        right_flank_bases
1       26453   2       11      6       956     999     GT      TGTTA   TTGCT
1       28588   1       3       15      698     589     T       GGTGG   GCATC
1       30867   2       7       12      885     319     CT      TCTCC   CATTT
1       30910   2       13      6       847     627     TC      TCATT   GCTAT
1       30935   2       13      6       255     1015    TC      ATTTT   TTTCT
1       31719   1       0       14      466     711     A       CTCAG   GTACT
1       31807   2       1       5       51      275     AC      AATAT   CACAT
1       33449   1       0       15      335     645     A       CCATT   GGACC
1       33530   1       3       11      1021    204     T       TTTTC   ATATA

After rescan we got new results:

Summary information

Number of total sites: 1789437 Number of sites with enough coverage: 63 Number of MSI sites: 0

chromosome location left_flank_bases repeat_times repeat_unit_bases right_flank_bases genotype difference P_value FDR 19 2008302 CACTC 13 T CTAGG 13|12 0.21678 0.54982 0.96218 19 2011861 TTTCC 10 T GAGAC 10|10 0.16667 0.13451 1.6948 19 2017120 CGTCT 6 CAAA ACAAA 6|6 0.08 0.28932 1.0126 19 2017151 AAAAC 10 A CTAGG 10|10 0.1875 0.52981 0.95366 19 2018642 TGGAC 12 T AAAGA 12|12 0.10526 0.33506 0.84436 19 2018961 ACAGG 6 CA CGGGA 6|6 0.076923 0.37455 0.78656 19 2019980 CTTTT 6 TTCC TTCTT 6|6 0 1 1.125 19 2022596 TTTCC 11 T CCCTG 11|11 0.082596 0.81839 1.0312 19 2025188 AATAA 12 T GAGAC 12|12 0.42105 0.037295 2.3496 19 2025522 CCTTC 5 CT CCTTC 5|5 0 1 1.2353 19 2027127 ACCAG 7 AC CAGAC 7|7 0.028169 0.80606 1.0364 19 2027144 ACCAG 7 AC CAGAC 7|7 0.028169 0.80606 1.058 19 2027161 ACCAG 5 AC AGGGC 5|5 0 1 1.0678 19 2027495 GAGCC 10 AG AAAGA 10|10 0.21053 0.30538 0.96194 19 2028158 CCTCT 11 A TTACA 11|11 0 1 1.05 19 2031306 ATCTC 12 A TAAAA 11|11 0.40209 0.1242 2.6082 19 2032700 GTGCC 5 TTCA ATCTG 5|5 0 1 1.0862 19 2036371 AAAGG 12 A TACAA 12|12 0.36184 0.50028 0.95507 19 2039434 ACAGG 10 A CCCAA 11|10 0.15982 0.4243 0.86228 19 2046939 ACTCA 5 GT GGATT 5|5 0 1 1.0328 19 2049907 CAGAG 7 AC AAACA 7|7 0.097561 0.21243 1.1153 19 2049923 CACAA 5 AC ATACA 5|5 0 1 1.1455 19 2052598 TAAGA 7 GT GCACG 7|7 0.071429 0.22779 0.89691 19 2058828 GGCTT 5 TTTG TTTTG 4|5 0.28571 0.21417 0.96376 19 2068016 CGTTT 12 A TGTGT 12|12 0.19048 0.23128 0.85711 19 2069060 CCTTT 11 A TGGCA 11|11 0.09482 0.59809 0.942 19 2069767 CTCAG 5 AACA ACTAG 5|5 0 1 1.1667 19 2074245 GAGAT 5 CAAA AAAAC 5|5 0.069767 0.29665 0.98364 19 2074718 GCTAA 11 T CCCCT 11|11 0 1 1.1887 19 2075031 TTTTC 15 T GAGAC 15|15 0.64407 0.14965 1.3468 19 2077400 GTCTC 11 A GAAAA 11|11 0.46667 0.21991 0.92363 19 2077787 AATCC 13 T GAGAC 13|13 0.16667 0.51553 0.95525 19 2080383 ACCTC 11 T GAGGC 11|11 0.16667 0.19761 1.2449 19 2083003 ATTTT 5 TTG TATTT 5|4 0.37014 0.12715 2.0027 19 2084890 CCTAG 10 T ATTTA 10|10 0 1 1.2115 19 2090669 CGTTA 11 T AAGAC 11|11 0.0625 0.30973 0.9292 19 2092000 AACAG 10 A GAGAA 10|10 0.076923 0.33953 0.8227 19 2093160 GGGAA 13 T CGAGA 13|13 0.24561 0.60897 0.93574 19 2095176 CTCAA 7 AAAC AAAAA 7|7 0.14815 0.6567 0.88025 19 2095756 TGTTC 12 T CCCCC 12|12 0.14894 0.55354 0.94251 19 2103219 TTATT 11 A TGGGA 11|11 0.25 0.10777 3.3947 19 2125834 GCTAG 11 A CAAAA 12|12 0.27338 0.37271 0.80967 19 2133510 CCAGC 12 T GAGAC 12|12 0.086957 0.61346 0.92019 19 2137313 ATGTG 13 T CTTGG 13|13 0.13333 0.31937 0.8748 19 2140479 TTGAC 13 T GAGAC 13|14 0.32558 0.32124 0.84324 19 2142754 TTCTG 12 T CTTTT 11|11 0.13481 0.58299 0.94176 19 2143763 GCCTC 10 A TAAAA 10|10 0.25 0.19776 1.1326 19 2144904 TCCAA 5 AAAC AAAAC 5|5 0 1 1 19 2145758 ACCTC 6 CA CCAAG 6|6 0 1 1.0161 19 2147882 GTCTC 11 A GAAAA 11|11 0.17647 0.57594 0.95484 19 2152539 CGTCT 10 A GAGAG 10|10 0.045195 0.36771 0.82735 19 2157097 CCCAT 5 CATC CACCC 5|5 0.20134 0.45914 0.90394 19 2158999 TATAA 11 T AATGG 11|10 0.27869 0.31511 0.90236 19 2159210 TTCCA 5 TC TTTTT 5|5 0.34545 0.16105 1.2683 19 2162681 CACTC 10 A CCCCA 10|9 0.47898 0.13651 1.4333 19 2166076 ATTTA 5 TATGT TGTTA 5|5 0 1 1.1053 19 2172506 CTTTC 13 T GAAGA 13|13 0.2439 0.35051 0.81787 19 2177298 AACAA 12 T GGGAG 12|12 0.16 0.17247 1.2073 19 2183629 AGTGC 14 T AAGGC 14|14 0.26087 0.21288 1.0317 19 2467148 AAAAC 10 A CAAAA 7|7 0.31579 0.62109 0.85063 19 2467159 AAAAC 10 A CAAAA 7|7 0.31579 0.62109 0.86953 19 2532160 AAAAC 11 A CAAAA 7|7 0.31579 0.62109 0.88929 19 2567130 AAAAC 10 A CAAAA 7|7 0.31579 0.62109 0.90997



Visual comparison (`v1.2.0` - left, `v1.2.2` - right):
![image](https://github.com/user-attachments/assets/6ce1d269-4d05-4964-a2f8-6ff309584524)

Looks like major differences are numerical, but we have 5 new lines (as you can see on attached screenshot).

So we have 2 questions:

1. Are these differences expected?
2. Should we rescan our reference genomes every new msisensor version? 
PengJia6 commented 5 days ago

Thank you very much for your report. In versions prior to v1.2.2, if the repeat count of a read at a given site was less than the minimum repeat count (which can be set via the command line), the read would be filtered out, which is an unreasonable operation. As shown in your screenshot, all these records added in v1.2.2 have genotypes with fewer than 10 repeats of A repeat.

PengJia6 commented 5 days ago

@Stikus For each version, we recommend that you use the same scan and msi, pro commands. Additionally, the latest version of msisensor-pro is v1.3.0. This version will be a long-term support version, and we welcome you to test it.

Stikus commented 5 days ago

We'll test v1.3.0 today, when we test it we got error about same genome for scan and msi commands and decided to downgrade to 1.2.2 first. We completely forgot about separate scan step - now I understand it.

Thanks for detailed answer - I'll post v1.3.0 results here if you don't mind.

PengJia6 commented 5 days ago

We'll test v1.3.0 today, when we test it we got error about same genome for scan and msi commands and decided to downgrade to 1.2.2 first. We completely forgot about separate scan step - now I understand it.

Thanks for detailed answer - I'll post v1.3.0 results here if you don't mind.

I would be very grateful. Please feel free to post here.

Stikus commented 5 days ago

Can you help me with new naming?

Before we have "${outputPrefix}_somatic" and "${outputPrefix}_germline" - now we have "${outputPrefix}_unstable" and "${outputPrefix}_all". Is this correct match?

According to https://github.com/xjtu-omics/msisensor-pro/blob/35c39bb122e579c8fe97d405990a3d126030b850/cpp/sample.cpp#L40 - looks like correct.

PengJia6 commented 5 days ago

Can you help me with new naming?

Before we have "${outputPrefix}_somatic" and "${outputPrefix}_germline" - now we have "${outputPrefix}_unstable" and "${outputPrefix}_all". Is this correct match?

According to

https://github.com/xjtu-omics/msisensor-pro/blob/35c39bb122e579c8fe97d405990a3d126030b850/cpp/sample.cpp#L40

  • looks like correct.

Yes, to align with the results produced by the pro command, the _all table includes all microsatellite sites with sufficient coverage, while the _unstable table indicates unstable sites, meaning sites where somatic mutations have occurred.

Stikus commented 5 days ago

Here is a comparison (v1.3.0 - left, v1.2.2 - right): image

Total loading windows:  5812 
Total loading homopolymer and microsatellites:  2678803 

*** Summary information ***

Number of total sites: 2678803
Number of sites with enough coverage: 116
Number of MSI sites: 0

As you can see - genotype field is missing (but header have leftover tab - here is the reason) - is this expected? New output contain twice more lines - 116 instead of 63.

PengJia6 commented 5 days ago

Here is a comparison (v1.3.0 - left, v1.2.2 - right): image

Total loading windows:  5812 
Total loading homopolymer and microsatellites:  2678803 

*** Summary information ***

Number of total sites: 2678803
Number of sites with enough coverage: 116
Number of MSI sites: 0

As you can see - genotype field is missing (but header have leftover tab - here is the reason) - is this expected? New output contain twice more lines - 116 instead of 63.

Yes, we removed the display of genotypes because many sites have undergone somatic mutations, making the genotypes here inaccurate. Additionally, for those sites, we adjusted the default minimum repeat count for homopolymers from 10 to 8, as in our experience, homopolymers with repeat counts of 8 and 9 also have many somatic mutations. The tab in the header is not as expected; I will remove it.

Stikus commented 5 days ago

Correct me if I'm wrong - all new lines (green on screen) have repeat count 8 or 9. But old lines have repeat counts 5,6,7 and 10,11,12. So - how we get 5,6,7 in results if default was lowered from 10 to 8?

Upd. Or low repeat counts can be only for repeats with length more than one? So resulting length is above threshold?

PengJia6 commented 5 days ago

Correct me if I'm wrong - all new lines (green on screen) have repeat count 8 or 9. But old lines have repeat counts 5,6,7 and 10,11,12. So - how we get 5,6,7 in results if default was lowered from 10 to 8?

Upd. Or low repeat counts can be only for repeats with length more than one? So resulting length is above threshold?

In the scan command, the -l and -r parameters control the minimum repeat count for single nucleotide repeats and other sites (where the repeat unit length is greater than 2), respectively. The default values for -l and -r are 8 and 5, respectively.

Stikus commented 5 days ago

Great thx for answers, we'll use latest version now. The issue can be closed, I think.

FYI - we are using your tool with latest samtools without any problems with this patch before make: sed -i "s|\$(realpath ../vendor/htslib-1.11)|$SOFT/samtools-$SAMTOOLS_VERSION-src/htslib-$SAMTOOLS_VERSION|" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile"

Our full build part from Dockerfile:

ENV MSISENSOR_PRO_VERSION='1.3.0'
RUN cd "$SOFT" \
    && wget -q "https://github.com/xjtu-omics/msisensor-pro/archive/refs/tags/v${MSISENSOR_PRO_VERSION}.tar.gz" -O "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz" \
    && tar -xzf "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz" \
    && mv "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src" \
    && cd "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp" \
    && sed -i "s|\$(realpath ../vendor/htslib-1.11)|$SOFT/samtools-$SAMTOOLS_VERSION-src/htslib-$SAMTOOLS_VERSION|" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile" \
    && sed -ie '/export LD_LIBRARY_PATH/s/^/#/' "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile" \
    && make -j"$(($(nproc)+1))" \
    && mkdir -p "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}/bin" \
    && cp "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/msisensor-pro" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}/bin" \
    && cd "$SOFT" \
    && rm -r "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src" \
    && rm "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz"
PengJia6 commented 5 days ago

Thank you very much for providing the Docker file and for testing it. I will keep this issue open for a while; if you have any other questions, please feel free to post.