Closed callumparr closed 2 years ago
I have updated the Generate_Pairwise_contact_juicematrix.py script so that, in addition to running faster, it can generate contact matrices from adjacent and non-adjacent fragments (for example, for a read A-B-C-D: A-B, B-C, C-D are adjacent contacts; A-C, B-D are non-adjacent). Type: python Generate_Pairwise_contact_juicematrix.py -p $inputfile -o $juice_matrix -s 1 -t 20 -c 1000000
There is another requirement, pandarallel. Once it is installed the script runs fine, but the README is a little confusing on this point.
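To make the adjacent/non-adjacent distinction concrete, here is a minimal sketch of how the pairs of a single read could be split. It is not the code from Generate_Pairwise_contact_juicematrix.py; the function name `split_contacts` is made up for illustration, and it assumes that every non-consecutive pair counts as non-adjacent. Under that assumption A-D is also reported, although the README example lists only A-C and B-D, so the real script may behave differently.

```python
from itertools import combinations


def split_contacts(fragments):
    """Split all pairwise contacts of one read into adjacent and non-adjacent.

    `fragments` is the ordered list of fragments on a single read.
    Hypothetical helper, not part of the HiPore-C scripts.
    """
    adjacent, non_adjacent = [], []
    # Enumerate so that each fragment carries its position along the read.
    for (i, a), (j, b) in combinations(enumerate(fragments), 2):
        if j - i == 1:
            adjacent.append((a, b))      # neighbours on the read
        else:
            non_adjacent.append((a, b))  # separated by at least one fragment
    return adjacent, non_adjacent


adj, nonadj = split_contacts(["A", "B", "C", "D"])
print(adj)     # [('A', 'B'), ('B', 'C'), ('C', 'D')]
print(nonadj)  # [('A', 'C'), ('A', 'D'), ('B', 'D')]
```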
❯ inputfile=/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/Merge_Align_Fragment_RvdF.csv
❯ juice_matrix=/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt
❯ python Scripts/Generate_Pairwise_contact_juicematrix.py -p $inputfile -o $juice_matrix -s 1 -t 20 -c 1000000
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Adj_contact_matrix.txt'
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Nonadj_contact_matrix.txt'
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/generate_pair_contact.stat'
Generate Contact Matrix for /home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/Merge_Align_Fragment_RvdF.csv
Traceback (most recent call last):
File "Scripts/Generate_Pairwise_contact_juicematrix.py", line 212, in <module>
from pandarallel import pandarallel
ModuleNotFoundError: No module named 'pandarallel'
The data frames cannot be written out because the output path I passed does not match the directory structure the script expects. Sorry if I misunderstood the README, but it says to pass an explicit output file path as a variable to the Python script:
inputfile="Merge_Align_Fragment_RvdF.csv"
juice_matrix="juice_matrix.txt"
python ./Scripts/Generate_Contact_juiceMatrix.py -p ${inputpaf} -o ${contact_matrix} -s 0 -t 20 -c 1000000 &
Judging from the script's behaviour, I guess we should provide a directory rather than a filename. The script creates its own output files inside that directory.
❯ python Scripts/Generate_Pairwise_contact_juicematrix.py -p $inputfile -o $juice_matrix -s 1 -t 20 -c 1000000
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Adj_contact_matrix.txt'
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Nonadj_contact_matrix.txt'
[Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/generate_pair_contact.stat'
Generate Contact Matrix for /home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/Merge_Align_Fragment_RvdF.csv
INFO: Pandarallel will run on 20 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
2022--10--05 11:15:15
Loading 210146 reads and 954266 fragments
Processing reads: 0 - 210145
Traceback (most recent call last):
File "Scripts/Generate_Pairwise_contact_juicematrix.py", line 248, in <module>
status_1 = ExportFun(Exportfile_1, adjlist)
File "Scripts/Generate_Pairwise_contact_juicematrix.py", line 159, in ExportFun
with open(Exportfile, "a") as fileID:
FileNotFoundError: [Errno 2] No such file or directory: '/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix.txt/Adj_contact_matrix.txt'
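The FileNotFoundError above comes from opening a file inside a directory that does not exist. As a sketch of one way a script could guard against this (an assumption about a possible fix, not what Generate_Pairwise_contact_juicematrix.py currently does), the output directory can be created before the file is opened. The helper name `open_output` is hypothetical.

```python
import os


def open_output(out_dir, filename, mode="a"):
    """Open `filename` inside `out_dir`, creating the directory if needed.

    Hypothetical helper for illustration only.
    """
    # exist_ok=True makes this a no-op when the directory is already there.
    os.makedirs(out_dir, exist_ok=True)
    return open(os.path.join(out_dir, filename), mode)
```

With a guard like this, passing a path for a directory that has not been created yet to -o would work without a separate mkdir step.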
Now I have changed it to provide only a directory:
mkdir /home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix
juice_matrix="/home/minion/HiPore-C/Pore-C_test2/vdFAnnotation/juice_matrix"
As the script runs, it creates the (initially empty) placeholder files under this directory.
❯ ls -lh Pore-C_test2/vdFAnnotation/juice_matrix
total 916M
-rw-rw-r-- 1 minion minion 272M 10月 5 11:28 Adj_contact_matrix.txt
-rw-rw-r-- 1 minion minion 60 10月 5 11:28 generate_pair_contact.stat
-rw-rw-r-- 1 minion minion 1.2K 10月 5 11:30 juic_matrix.summary
-rw-rw-r-- 1 minion minion 645M 10月 5 11:28 Nonadj_contact_matrix.txt
I guess now the juice matrix is split into adjacent and non-adjacent instead of just one output.
Is it OK to concatenate the adjacent and non-adjacent matrices?
My recent analysis found that both types of contact reproduce known chromatin conformation and are very similar in terms of compartment, TAD, and loop structure. So for analysis of these classical structures, they can be combined. However, the genomic contact distance of non-adjacent contacts is larger than that of adjacent contacts, so if distance matters for your analysis, the two should be treated as different.
For context, this is the head of my output table from the first two bash scripts.
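If the two matrices are to be combined, a minimal sketch of the concatenation could look like the following. It assumes that Adj_contact_matrix.txt and Nonadj_contact_matrix.txt share the same column layout, have no header line, and each end with a newline; none of this is confirmed in the thread. The function name and the merged file name All_contact_matrix.txt are made up for illustration.

```python
import shutil
from pathlib import Path


def concatenate_matrices(out_dir, merged_name="All_contact_matrix.txt"):
    """Append the adjacent and non-adjacent matrices into one file.

    Assumes both inputs are headerless and share one column layout.
    `merged_name` is a hypothetical output name.
    """
    out_dir = Path(out_dir)
    merged = out_dir / merged_name
    with open(merged, "wb") as dst:
        for name in ("Adj_contact_matrix.txt", "Nonadj_contact_matrix.txt"):
            with open(out_dir / name, "rb") as src:
                # Stream in chunks so large files are not loaded into memory.
                shutil.copyfileobj(src, dst)
    return merged
```

Copying in binary mode leaves the contact lines untouched, which seems preferable for files of several hundred megabytes like the ones listed above.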