yezhengSTAT / CUTTag_tutorial

Tutorial Website
https://yezhengstat.github.io/CUTTag_tutorial/

Cannot run Picard SortSam and/or MarkDuplicates? #8

Open kiddo18 opened 1 year ago

kiddo18 commented 1 year ago

Hi Ye,

I'm not sure whether this is the right place to ask, but in Step 3.3 (Removing duplicates), the Picard MarkDuplicates command reports 0 duplicates. I'm using the same dataset as the tutorial, and the input is IgG_rep1_bowtie2.sam. Would you happen to know how to fix this? Thank you.

12:54:18.412 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/sethilab/opt/anaconda3/envs/cutruntools2.1/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Wed Dec 07 12:54:18 EST 2022] SortSam --INPUT ./alignment/sam/IgG_rep1_bowtie2.sam --OUTPUT ./alignment/sam/IgG_rep1_bowtie2.sorted.sam --SORT_ORDER coordinate --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 07 12:54:18 EST 2022] Executing as sethilab@nilays-mac-mini.dfci.partners.org on Mac OS X 13.0 x86_64; OpenJDK 64-Bit Server VM 17.0.3+7-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
INFO 2022-12-07 12:54:19 SortSam Seen many non-increasing record positions. Printing Read-names as well.
INFO 2022-12-07 12:54:42 SortSam Finished reading inputs, merging and writing to output now.
[Wed Dec 07 12:54:57 EST 2022] picard.sam.SortSam done. Elapsed time: 0.65 minutes.
Runtime.totalMemory()=536870912

12:54:58.671 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/sethilab/opt/anaconda3/envs/cutruntools2.1/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Wed Dec 07 12:54:58 EST 2022] MarkDuplicates --INPUT ./alignment/sam/IgG_rep1_bowtie2.sorted.sam --OUTPUT ./alignment/removeDuplicate/IgG_rep1_bowtie2.sorted.dupMarked.sam --METRICS_FILE ./alignment/removeDuplicate/picard_summary/IgG_rep1_picard.dupMark.txt --ASSUME_SORT_ORDER coordinate --VERBOSITY WARNING --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 07 12:54:58 EST 2022] Executing as sethilab@nilays-mac-mini.dfci.partners.org on Mac OS X 13.0 x86_64; OpenJDK 64-Bit Server VM 17.0.3+7-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
WARNING 2022-12-07 12:54:59 AbstractOpticalDuplicateFinderCommandLineProgra A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR11923224.1466798.2. Cause: String 'SRR11923224.1466798.2' did not start with a parsable number.
[Wed Dec 07 12:55:36 EST 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.63 minutes.
Runtime.totalMemory()=536870912

12:55:37.913 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/sethilab/opt/anaconda3/envs/cutruntools2.1/share/picard-2.27.4-0/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Wed Dec 07 12:55:38 EST 2022] MarkDuplicates --INPUT ./alignment/sam/IgG_rep1_bowtie2.sorted.sam --OUTPUT ./alignment/removeDuplicate/IgG_rep1_bowtie2.sorted.rmDup.sam --METRICS_FILE ./alignment/removeDuplicate/picard_summary/IgG_rep1_picard.rmDup.txt --REMOVE_DUPLICATES true --ASSUME_SORT_ORDER coordinate --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 07 12:55:38 EST 2022] Executing as sethilab@nilays-mac-mini.dfci.partners.org on Mac OS X 13.0 x86_64; OpenJDK 64-Bit Server VM 17.0.3+7-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.27.4-SNAPSHOT
INFO 2022-12-07 12:55:38 MarkDuplicates Start of doWork freeMemory: 529478472; totalMemory: 536870912; maxMemory: 2147483648
INFO 2022-12-07 12:55:38 MarkDuplicates Reading input file and constructing read end information.
INFO 2022-12-07 12:55:38 MarkDuplicates Will retain up to 7780737 data points before spilling to disk.
WARNING 2022-12-07 12:55:38 AbstractOpticalDuplicateFinderCommandLineProgra A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR11923224.1466798.2. Cause: String 'SRR11923224.1466798.2' did not start with a parsable number.
INFO 2022-12-07 12:55:44 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:06s. Time for last 1,000,000: 6s. Last read position: chr5:69,153,488
INFO 2022-12-07 12:55:44 MarkDuplicates Tracking 1000000 as yet unmatched pairs. 67092 records in RAM.
INFO 2022-12-07 12:55:50 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:00:11s. Time for last 1,000,000: 5s. Last read position: chr11:75,183,841
INFO 2022-12-07 12:55:50 MarkDuplicates Tracking 2000000 as yet unmatched pairs. 85106 records in RAM.
INFO 2022-12-07 12:55:55 MarkDuplicates Read 3,000,000 records. Elapsed time: 00:00:17s. Time for last 1,000,000: 5s. Last read position: chrM:855
INFO 2022-12-07 12:55:55 MarkDuplicates Tracking 3000000 as yet unmatched pairs. 24278 records in RAM.
INFO 2022-12-07 12:55:57 MarkDuplicates Read 3386886 records. 3386886 pairs never matched.
INFO 2022-12-07 12:55:57 MarkDuplicates After buildSortedReadEndLists freeMemory: 652251184; totalMemory: 966787072; maxMemory: 2147483648
INFO 2022-12-07 12:55:57 MarkDuplicates Will retain up to 67108864 duplicate indices before spilling to disk.
INFO 2022-12-07 12:55:58 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2022-12-07 12:55:58 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2022-12-07 12:55:58 MarkDuplicates Sorting list of duplicate records.
INFO 2022-12-07 12:55:58 MarkDuplicates After generateDuplicateIndexes freeMemory: 958545280; totalMemory: 1503657984; maxMemory: 2147483648
INFO 2022-12-07 12:55:58 MarkDuplicates Marking 0 records as duplicates.
INFO 2022-12-07 12:55:58 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2022-12-07 12:55:58 MarkDuplicates Reads are assumed to be ordered by: coordinate
INFO 2022-12-07 12:56:18 MarkDuplicates Writing complete. Closing input iterator.
INFO 2022-12-07 12:56:18 MarkDuplicates Duplicate Index cleanup.
INFO 2022-12-07 12:56:18 MarkDuplicates Getting Memory Stats.
INFO 2022-12-07 12:56:18 MarkDuplicates Before output close freeMemory: 529395968; totalMemory: 536870912; maxMemory: 2147483648
INFO 2022-12-07 12:56:18 MarkDuplicates Closed outputs. Getting more Memory Stats.
INFO 2022-12-07 12:56:18 MarkDuplicates After output close freeMemory: 529395968; totalMemory: 536870912; maxMemory: 2147483648
[Wed Dec 07 12:56:18 EST 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.68 minutes.
Runtime.totalMemory()=536870912
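
For readers puzzling over the WARNING in the log: the logged READ_NAME_REGEX setting says the default is to capture the last three ':' separated fields of each read name as numeric values. The Python sketch below only approximates that described behaviour to show why a name like SRR11923224.1466798.2 cannot be parsed. It is not Picard's implementation; the function name and the Illumina-style example name are made up for illustration.

```python
def last_three_numeric_fields(read_name):
    """Approximate the default read-name parsing described in the log:
    take the last three ':'-separated fields and read each as an integer.
    Returns a tuple of three ints, or None if that is not possible.
    (Illustrative only; not Picard's actual code.)"""
    fields = read_name.split(":")[-3:]
    if len(fields) < 3:
        return None  # fewer than three ':'-separated fields
    try:
        return tuple(int(field) for field in fields)
    except ValueError:
        return None  # at least one field is not an integer


# Hypothetical Illumina-style name: the last three fields are numeric.
print(last_three_numeric_fields("INSTR:55:FLOWCELL:1:1101:15589:1332"))
# Read name taken from the log: it contains no ':' at all.
print(last_three_numeric_fields("SRR11923224.1466798.2"))
```

The first call yields three integers and the second yields None. As far as the log shows, this parsing feeds the optical-duplicate step ("Found 0 optical duplicate clusters"), so it may not be the whole explanation for "Marking 0 records as duplicates".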

yezhengSTAT commented 1 year ago

Hello, I do see a warning in your log: "AbstractOpticalDuplicateFinderCommandLineProgra A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR11923224.1466798.2. Cause: String 'SRR11923224.1466798.2' did not start with a parsable number." Does this read name suggest anything to you? I did not observe read names like this on my side, either in IgG_rep1_bowtie2.sam or in IgG_rep1_bowtie2.sorted.sam.

Thanks, Ye
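
One way to follow up on the read-name question is to inspect the names in the SAM file directly. The log reports "3386886 pairs never matched", and the flagged name ends in ".2". A possible reading, not confirmed in this thread, is that the two mates of each pair carry different names (for example ".1" and ".2" suffixes), so they are never paired up. The Python sketch below is a minimal check under that assumption. The function name, the suffix pattern, and the demo records are hypothetical.

```python
import re
from collections import Counter

# Assumed per-mate suffix patterns: "/1", "/2", or SRA-style ".<spot>.1" / ".<spot>.2".
MATE_SUFFIX = re.compile(r"(?:/[12]|\.\d+\.[12])$")


def summarize_read_names(sam_lines):
    """Summarise the read names (QNAME) of primary, paired records in SAM text."""
    name_counts = Counter()
    records_with_suffix = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if not flag & 0x1:
            continue  # read is not flagged as paired
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) alignment
        name_counts[qname] += 1
        if MATE_SUFFIX.search(qname):
            records_with_suffix += 1
    return {
        "paired_records": sum(name_counts.values()),
        "distinct_names": len(name_counts),
        # A paired read whose name occurs only once has no mate with the same name.
        "names_seen_once": sum(1 for n in name_counts.values() if n == 1),
        "records_with_mate_suffix": records_with_suffix,
    }


# Tiny made-up example: one pair with mismatched names, one pair with matching names.
demo = [
    "@HD\tVN:1.0\tSO:unsorted",
    "SRR0.1.1\t99\tchr1\t100\t42\t4M\t=\t200\t104\tACGT\tIIII",
    "SRR0.1.2\t147\tchr1\t200\t42\t4M\t=\t100\t-104\tACGT\tIIII",
    "SRR0.2\t99\tchr1\t300\t42\t4M\t=\t400\t104\tACGT\tIIII",
    "SRR0.2\t147\tchr1\t400\t42\t4M\t=\t300\t-104\tACGT\tIIII",
]
print(summarize_read_names(demo))
```

To run it on real data, pass an open file handle for the SAM file (for instance ./alignment/sam/IgG_rep1_bowtie2.sam) in place of `demo`. If nearly every name is seen only once and most records carry a mate suffix, it would be worth checking how the FASTQ files were downloaded or converted before alignment.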