zhengdafangyuan / HiPore-C

We developed a protocol for in situ high-throughput multi-way contact long-read Pore-C sequencing (in situ HiPore-C), a strategy that integrates the preparation of multi-fragment ligation products with third-generation sequencing technology. With the HiPore-C approach, higher-order chromatin interactions can be explored genome-wide.
MIT License

Syntax error when running the generate_pairwise_contact python script on DpnII digested Pore-C data. #4

Closed callumparr closed 2 years ago

callumparr commented 2 years ago
❯ inputfile=/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/Merge_Align_Fragment_RvdF.csv
❯ juice_matrix=/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt
❯ python ./Scripts/Generate_Pairwise_contact_juicematrix.py $inputfile $juice_matrix
  File "./Scripts/Generate_Pairwise_contact_juicematrix.py", line 93
    selectchrs = [ "chr%d"%i for i in range(1,22+1) ]
    ^
SyntaxError: invalid syntax

For context, this is the head of my output table from the first two bash scripts.

read_name,read_length,read_start,read_end,strand,chrom,chrom_length,start,end,Matches,AlignBlock_length,MapQual,subread_length,Identity,align_idx,Note,LvdF_id,LvdF_start,LvdF_end,RvdF_id,RvdF_start,RvdF_end,rF_start,rF_end,LvdF_pdist,RvdF_pdist,LRvdF_pdist,LvdF_pfix,RvdF_pfix,LRvdF_pfix
00000511-0662-4336-8c27-8ba1b83cadb1,3412,31,1416,+,chr9,138394717,100703633,100705034,1354,1419,60,1385,0.954,0,FirstFilter,3856619,100703635,100705032,3856619,100703635,100705032,100703633,100705034,2,2,4,True,True,True
00000511-0662-4336-8c27-8ba1b83cadb1,3412,1415,2104,+,chr9,138394717,101208881,101209575,644,711,60,689,0.906,1,FirstFilter,3857757,101208880,101209577,3857757,101208880,101209577,101208881,101209575,1,2,3,True,True,True
zhengdafangyuan commented 2 years ago

I have updated the Generate_Pairwise_contact_juicematrix.py script so that, in addition to running faster, it can generate contact matrices from both adjacent and non-adjacent fragments. For example, for a read A-B-C-D, A-B, B-C, and C-D are adjacent contacts, while A-C and B-D are non-adjacent contacts. Run: python Generate_Pairwise_contact_juicematrix.py -p $inputfile -o $juice_matrix -s 1 -t 20 -c 1000000
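The adjacent versus non-adjacent distinction can be illustrated with a minimal sketch. It is not the code from Generate_Pairwise_contact_juicematrix.py; the function name `split_contacts` is hypothetical, and treating every non-consecutive pair (including A-D, which the example above does not list) as non-adjacent is an assumption.

```python
from itertools import combinations


def split_contacts(fragments):
    """Split all fragment pairs of one read into adjacent and non-adjacent.

    Hypothetical helper: assumes fragments are given in read order and that
    every pair of non-consecutive fragments counts as non-adjacent.
    """
    adjacent, non_adjacent = [], []
    # combinations() yields index pairs (i, j) with i < j, in read order.
    for i, j in combinations(range(len(fragments)), 2):
        pair = (fragments[i], fragments[j])
        if j - i == 1:
            adjacent.append(pair)      # consecutive on the read
        else:
            non_adjacent.append(pair)  # separated by at least one fragment
    return adjacent, non_adjacent


adj, nonadj = split_contacts(["A", "B", "C", "D"])
print(adj)     # [('A', 'B'), ('B', 'C'), ('C', 'D')]
print(nonadj)  # [('A', 'C'), ('A', 'D'), ('B', 'D')]
```

A read with n fragments yields n-1 adjacent pairs and (n-1)(n-2)/2 non-adjacent pairs under this assumption, which would be consistent with the non-adjacent output being the larger file.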

callumparr commented 2 years ago

There is another requirement, pandarallel. Once it is installed everything is fine, but the README is a little confusing on this point.

❯ inputfile=/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/Merge_Align_Fragment_RvdF.csv
❯ juice_matrix=/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt
❯ python Scripts/Generate_Pairwise_contact_juicematrix.py -p $inputfile -o $juice_matrix -s 1 -t 20 -c 1000000
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Adj_contact_matrix.txt'
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Nonadj_contact_matrix.txt'
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/generate_pair_contact.stat'
Generate Contact Matrix for /home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/Merge_Align_Fragment_RvdF.csv
Traceback (most recent call last):
  File "Scripts/Generate_Pairwise_contact_juicematrix.py", line 212, in <module>
    from pandarallel import pandarallel
ModuleNotFoundError: No module named 'pandarallel'
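A missing dependency like the one in this traceback can be detected before starting a long run. The sketch below is only an illustration: the helper name `missing_modules` is hypothetical, and the list of required modules is an assumption based on the traceback, not taken from the repository's documentation.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed requirement list; only pandarallel is confirmed by the traceback.
required = ["pandas", "pandarallel"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Any module reported as missing can then be installed, for example with `pip install pandarallel`.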
callumparr commented 2 years ago

The data frames cannot be written out from Python because the output directory structure does not match what the script expects. Sorry if I misunderstood the README, but it says to pass an explicit output file path as a variable to the Python script.

From README

inputfile="Merge_Align_Fragment_RvdF.csv"
juice_matrix="juice_matrix.txt"
python ./Scripts/Generate_Contact_juiceMatrix.py -p ${inputpaf} -o ${contact_matrix} -s 0 -t 20 -c 1000000 &
Judging from the script's behaviour, I guess we should provide a directory rather than a filename. The script creates new files within this directory.
❯ python Scripts/Generate_Pairwise_contact_juicematrix.py -p $inputfile -o $juice_matrix -s 1 -t 20 -c 1000000
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Adj_contact_matrix.txt'
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Nonadj_contact_matrix.txt'
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/generate_pair_contact.stat'
Generate Contact Matrix for /home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/Merge_Align_Fragment_RvdF.csv
INFO: Pandarallel will run on 20 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
2022--10--05 11:15:15
Loading 210146 reads and 954266 fragments
Processing reads: 0 - 210145
Traceback (most recent call last):
  File "Scripts/Generate_Pairwise_contact_juicematrix.py", line 248, in <module>
    status_1 = ExportFun(Exportfile_1, adjlist)
  File "Scripts/Generate_Pairwise_contact_juicematrix.py", line 159, in ExportFun
    with open(Exportfile, "a") as fileID:
FileNotFoundError: [Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Adj_contact_matrix.txt'

Now I have changed it to only provide a directory

mkdir /home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix

juice_matrix="/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix"
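The same preparation can be sketched in Python. This is a minimal illustration, not part of the HiPore-C scripts: the helper name `prepare_output_dir` is hypothetical, and the three file names are copied from the error messages above on the assumption that the script always writes those names inside the `-o` directory.

```python
import os


def prepare_output_dir(outdir):
    """Create the output directory if needed and return the expected file paths."""
    # exist_ok=True makes this safe to call when the directory already exists.
    os.makedirs(outdir, exist_ok=True)
    # File names taken from the "[Errno 2]" messages in the log above.
    names = [
        "Adj_contact_matrix.txt",
        "Nonadj_contact_matrix.txt",
        "generate_pair_contact.stat",
    ]
    return [os.path.join(outdir, name) for name in names]
```

Calling `prepare_output_dir(juice_matrix)` before launching the script guarantees that the directory passed to `-o` exists, which avoids the `FileNotFoundError` shown earlier.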

When the script runs, it creates the (initially empty) output files under this directory.

❯ ls -lh Pore-C_test2/vdFAnnotation/juice_matrix
total 916M
-rw-rw-r-- 1 minion minion 272M 10月  5 11:28 Adj_contact_matrix.txt
-rw-rw-r-- 1 minion minion   60 10月  5 11:28 generate_pair_contact.stat
-rw-rw-r-- 1 minion minion 1.2K 10月  5 11:30 juic_matrix.summary
-rw-rw-r-- 1 minion minion 645M 10月  5 11:28 Nonadj_contact_matrix.txt
callumparr commented 2 years ago

I guess the juice matrix is now split into adjacent and non-adjacent outputs instead of a single file.

Is it OK to concatenate the adjacent and non-adjacent matrices?

zhengdafangyuan commented 2 years ago

My recent analysis found that both types of contact reproduce known chromatin conformation and are very similar in terms of compartment, TAD, and loop structure. So for analyses of these classical structures, they can be combined. However, the genomic contact distance of non-adjacent contacts is larger than that of adjacent contacts, so if that matters for your analysis, they should be treated as different.
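If the two outputs are to be combined, one possible way is a plain file concatenation. The sketch below is hedged: the helper name `concatenate_files` is hypothetical, and it assumes both matrices are plain-text files in the same column layout, with no header line and a trailing newline at the end of each file.

```python
import shutil


def concatenate_files(inputs, output):
    """Append the contents of each input file, in order, to one output file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream in chunks so large matrices are not loaded into memory.
                shutil.copyfileobj(src, out)
```

For example, `concatenate_files(["Adj_contact_matrix.txt", "Nonadj_contact_matrix.txt"], "All_contact_matrix.txt")` would write a combined file, where `All_contact_matrix.txt` is an illustrative name only.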