ydLiu-HIT / Psi-caller

a lightweight short read-based variant caller with high speed and accuracy
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 2: invalid start byte #8

Open Mahmoudbassuoni opened 6 months ago

Mahmoudbassuoni commented 6 months ago

Hi @ydLiu-HIT, I am trying to run variant calling with multiple threads using this script:

#!/bin/bash

set -e

# Constants and paths
REFERENCE_GENOME="/Data/dataflash/Benchmarking/hs37d5.fa"
PYTHON_SCRIPT="/home/mbassyouni/packages/Psi-caller-1.0.1/separeted_task.py"
BASE_WORKSPACE_DIR="/Data/dataflash/Benchmarking/analysis/variant_calling/Psicaller"
NUM_THREADS=16

# Additional central log file definition
CENTRAL_LOG="/Data/dataflash/Benchmarking/analysis/variant_calling/Psicaller/central_time_log.txt"

# Clear or create the central log file at the start
> "${CENTRAL_LOG}"

# BAM files and corresponding workspace directories
declare -A BAM_FILES=(
    ["speedseq_NA24385"]="/Data/dataflash/Benchmarking/analysis/preprocessing/HG002/speedseq_align/speedseq_NA24385.bam"
    ["BWA-GATK"]="/Data/dataflash/Benchmarking/analysis/preprocessing/HG002/BWA_GATK/recal_reads.bam"
)

# Process each BAM file
for SAMPLE in "${!BAM_FILES[@]}"; do
    BAM_FILE="${BAM_FILES[$SAMPLE]}"
    WORKSPACE_DIR="${BASE_WORKSPACE_DIR}/${SAMPLE}"
    TIME_LOG="${WORKSPACE_DIR}/time_measurement.log"
    TASK_PREFIX="${SAMPLE}"

    # Setup workspace
    mkdir -p "${WORKSPACE_DIR}"
    > "${TIME_LOG}"

    # Generating separated subtasks
    START_TIME=$(date +%s)
    echo "Generating subtasks for ${SAMPLE}" >> "${TIME_LOG}"
    python3 "${PYTHON_SCRIPT}" "${BAM_FILE}" "${REFERENCE_GENOME}" "${WORKSPACE_DIR}/" --task_prefix "${TASK_PREFIX}"

    # Candidate recognition
    echo "Running candidate recognition for ${SAMPLE}" >> "${TIME_LOG}"
    cat "${WORKSPACE_DIR}/${TASK_PREFIX}_extract.sh" | parallel -j ${NUM_THREADS}

    # Variants calling
    echo "Running variants calling for ${SAMPLE}" >> "${TIME_LOG}"
    cat "${WORKSPACE_DIR}/${TASK_PREFIX}_call.sh" | parallel -j ${NUM_THREADS}

    END_TIME=$(date +%s)
    ELAPSED_TIME=$((END_TIME - START_TIME))
    echo "${SAMPLE} Total Processing Time: $((ELAPSED_TIME / 60)) minutes, $((ELAPSED_TIME % 60)) seconds" >> "${TIME_LOG}"
done

echo "Variant calling completed for all samples on $(date)." >> "${CENTRAL_LOG}"

I intend to run the pipeline over the two BAM files listed in the script, but I am running into this error:

Traceback (most recent call last):
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 644, in <module>
    run()
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 638, in run
    main_ctrl(args)
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 544, in main_ctrl
    if indL > 20: checksoft_and_realign(reference_sequence, refStart, queries, Start, End, indL, Sig)
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 228, in checksoft_and_realign
    reali = ksw2_aligner(SEQ, tseq, 4)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 2: invalid start byte

gzip: Traceback (most recent call last):
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 644, in <module>
    run()
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 638, in run
    main_ctrl(args)
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 548, in main_ctrl
    canList, cluster_n, query, qual, target = run_poa(candidate, reference_sequence, queries, refStart, flanking, useAllReads, indL, useBaseQuality, args, multiCan, IsID, Sig)
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 457, in run_poa
    canList = ksw_core(res[1], target, args.chrName, name, reference_sequence, refStart, args.shift, args.mismatch2 if multiCan else args.mismatch, multiCan, args.max_merge_dis, Sig)
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 378, in ksw_core
    alignment = ksw2_aligner(msa, target, x_score)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

gzip: stdout: Broken pipe
Traceback (most recent call last):
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 644, in <module>
    run()
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 638, in run
    main_ctrl(args)
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 548, in main_ctrl
    canList, cluster_n, query, qual, target = run_poa(candidate, reference_sequence, queries, refStart, flanking, useAllReads, indL, useBaseQuality, args, multiCan, IsID, Sig)
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 457, in run_poa
    canList = ksw_core(res[1], target, args.chrName, name, reference_sequence, refStart, args.shift, args.mismatch2 if multiCan else args.mismatch, multiCan, args.max_merge_dis, Sig)
  File "/home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py", line 378, in ksw_core
    alignment = ksw2_aligner(msa, target, x_score)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 0: invalid start byte

gzip: stdout: Broken pipe

I am not sure of the reason behind it. N.B.: I am running this on a server with 96 threads and 125 GB of RAM. I started with 90 threads but saw intensive memory usage, then tried 30 threads and finally 16, and the error is the same each time. What is your opinion on this? Thanks,

ydLiu-HIT commented 5 months ago

Hi Mahmoudbassuoni,

Are you using pypy3? The error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 0: invalid start byte" may be caused by the two variables msa and target having different encodings. You can print the types of msa and target in localMSA.py at line 378 to check this first. You can also print the lengths of msa and target; if either length equals 0, the error may occur.

Best.
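For readers unfamiliar with this error message, the minimal sketch below shows what it reports: a `bytes` value that is not valid UTF-8 was being decoded. The byte values 0x81 and 0xa6 are taken from the tracebacks in this thread; the surrounding bytes and the helper name `describe_decode_failure` are made up for illustration. Where the decode happens inside `ksw2_aligner` is not shown in this thread, so this only illustrates the message, not the cause.

```python
# Illustration only: 0x81 and 0xa6 come from the tracebacks in this thread;
# the other bytes and the helper name are hypothetical.
def describe_decode_failure(raw: bytes) -> str:
    """Try to decode raw as UTF-8 and describe the outcome."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"byte 0x{raw[err.start]:02x} at position {err.start}: {err.reason}"
    return "decoded cleanly"


print(describe_decode_failure(b"ACGT"))     # plain ASCII bases decode fine
print(describe_decode_failure(b"AC\x81"))   # 0x81 cannot start a UTF-8 character
print(describe_decode_failure(b"\xa6M"))    # same failure at position 0
```

A Python `str` has already been decoded, so printing the type and length of the `str` arguments may not reveal the bytes that fail to decode.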

Mahmoudbassuoni commented 5 months ago

Hi @ydLiu-HIT, as I mentioned, I am using the parallel method, so I edited localMSA.py to print the type and length like this:

# Debugging print statements
print(f"Type of msa: {type(msa)}, Length of msa: {len(msa)}")
print(f"Type of target: {type(target)}, Length of target: {len(target)}")

alignment = ksw2_aligner(msa, target, x_score)

and I took one line from the *_call.sh and ran it:

 pypy3 /home/mbassyouni/packages/Psi-caller-1.0.1/localMSA.py --fin_bam "/Data/dataflash/Benchmarking/analysis/preprocessing/HG002/BWA_GATK/recal_reads.bam" --fin_ref "/Data/dataflash/Benchmarking/hs37d5.fa" --minMQ "10" --minCNT "3" --perror_for_snp "0.1" --perror_for_indel "0.1" --ratio_identity_snp "0.2" --ratio_identity_indel "0.2" --max_merge_dis "5" --shift "5" --flanking "50" --useBaseQuality --chrName "1" --chrStart "90000001" --chrEnd "100000001" --fin_can "/Data/dataflash/Benchmarking/analysis/variant_calling/Psicaller/BWA-GATK/var.1_90000001_100000001.can" --fout_vcf "/Data/dataflash/Benchmarking/analysis/variant_calling/Psicaller/BWA-GATK/var.1_90000001_100000001.vcf"

and this was part of the output

Type of msa: <class 'str'>, Length of msa: 51
Type of target: <class 'str'>, Length of target: 52
Type of msa: <class 'str'>, Length of msa: 64
Type of target: <class 'str'>, Length of target: 51
Type of msa: <class 'str'>, Length of msa: 44
Type of target: <class 'str'>, Length of target: 51
Type of msa: <class 'str'>, Length of msa: 51
Type of target: <class 'str'>, Length of target: 51
Type of msa: <class 'str'>, Length of msa: 64
Type of target: <class 'str'>, Length of target: 64
Type of msa: <class 'str'>, Length of msa: 56
Type of target: <class 'str'>, Length of target: 64
Type of msa: <class 'str'>, Length of msa: 114
Type of target: <class 'str'>, Length of target: 114
Type of msa: <class 'str'>, Length of msa: 106
Type of target: <class 'str'>, Length of target: 114
Type of msa: <class 'str'>, Length of msa: 51
Type of target: <class 'str'>, Length of target: 51
Type of msa: <class 'str'>, Length of msa: 43
Type of target: <class 'str'>, Length of target: 51

Both were strings, and neither had a length of 0.
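Because the printed output covers every call, not just the failing one, it may help to log the inputs only when the exception is raised. The sketch below is one way to do that under stated assumptions: `call_with_diagnostics`, `VALID_BASES`, and `fake_aligner` are hypothetical names that are not part of Psi-caller, and the assumption that inputs should contain only A/C/G/T/N is mine, not the author's.

```python
import sys

# Assumption: sequences should contain only these characters.
VALID_BASES = set("ACGTNacgtn")


def call_with_diagnostics(aligner, query, target, *args):
    """Call aligner(query, target, *args); on UnicodeDecodeError,
    write both inputs to stderr and re-raise the original exception."""
    try:
        return aligner(query, target, *args)
    except UnicodeDecodeError:
        for name, seq in (("query", query), ("target", target)):
            odd = sorted(set(seq) - VALID_BASES)
            print(
                f"{name}: len={len(seq)} unexpected_chars={odd!r} seq={seq!r}",
                file=sys.stderr,
            )
        raise


def fake_aligner(query, target, score):
    # Stand-in that fails like the traceback; NOT the real ksw2_aligner.
    return b"\xa6".decode("utf-8")


if __name__ == "__main__":
    try:
        call_with_diagnostics(fake_aligner, "ACGT", "ACXT", 4)
    except UnicodeDecodeError as err:
        print(f"caught: {err.reason}")
```

In localMSA.py the call at line 378 could then be written as `alignment = call_with_diagnostics(ksw2_aligner, msa, target, x_score)`, which would leave the exact failing pair in stderr for a standalone reproduction.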

leedchou commented 4 months ago

Hi @Mahmoudbassuoni, I'm running into the same problem. Have you solved it?

Best regards

Mahmoudbassuoni commented 4 months ago

@leedchou Unfortunately not. I have tried reaching @ydLiu-HIT by email multiple times, but he has not answered.