non-human data - GitHub Issues

distilledchild commented 2 years ago

Hi, I am using rat data digested with a mixture of 4 enzymes and processed with HiC-Pro (*.bwt2pairs.bam). The fragment file is also from HiC-Pro. I noticed that the chromosome numbers are a little different from human. Can I use HiSIF for non-human data? (A rat has 20 autosomes + X, Y.)

Thank you.

yufanzhouonline commented 2 years ago

Yes, of course you can use HiSIF for any kind of organism.

Please note the following two points for other organisms:

All chromosomes need to be changed to numbers: 1, 2, 3, ...
If you use non-human data, the BED files of enzyme digestion sites (like hindIII.Hg19.HiCPLD.bed for human data, placed under the resources folder) have to be prepared before HiSIF is used. The HiC-Pro Digest Genome tool can make BED files from a genome; please refer to: https://nservant.github.io/HiC-Pro/UTILS.html#digest-genome-py
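As an illustration of the first point, here is a minimal Python sketch that rewrites the chromosome column of a digest BED line from names to numbers. The function names are hypothetical and not part of HiSIF. The mapping of rat chrX to 21 and chrY to 22 is an assumption made by analogy with the human convention (X = 23, Y = 24), and whether HiSIF accepts the remaining BED columns unchanged is not confirmed here.

```python
def chrom_to_number(name, n_autosomes=20):
    """Return the numeric label for a chromosome name, or None if unplaced.

    Assumption: sex chromosomes follow the autosomes, so with 20 autosomes
    (rat) chrX -> 21 and chrY -> 22; with 22 autosomes (human) X -> 23, Y -> 24.
    """
    label = name[3:] if name.lower().startswith("chr") else name
    if label.isdigit():
        number = int(label)
        return number if 1 <= number <= n_autosomes else None
    extras = {"X": n_autosomes + 1, "Y": n_autosomes + 2}
    return extras.get(label.upper())


def renumber_bed_line(line, n_autosomes=20):
    """Replace the first BED column with its numeric label.

    Returns None for blank lines and for contigs that have no number
    (for example chrM or unplaced scaffolds), so the caller can skip them.
    """
    fields = line.split()
    if not fields:
        return None
    number = chrom_to_number(fields[0], n_autosomes)
    if number is None:
        return None
    fields[0] = str(number)
    return "\t".join(fields)
```

For example, `renumber_bed_line("chr12\t31\t36\tHIC_chr12_1\t0\t+")` returns the same line with `12` in the first column.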

Thank you.

distilledchild commented 2 years ago

@yufanzhouonline thank you for your answer! I am just curious: in which file should the chromosomes be changed into numbers (1, 2, 3, ...)? I am using a BAM file that has 'chr1, chr2, ...' and was already aligned against those names in HiC-Pro.

There are 2 input files, ref.fasta and the cutting fragment file. Do you mean I have to change the 'chr' names into plain numbers in both files? I ask because I ran runhisif.sh with the config file, and afterwards I could NOT see any "chr" in the .pairs, '.bwt2pairs.bam.pairs', and 'chrXX.tmp' files.

Also, the tmp files for all chromosomes are generated automatically, so it seems I don't need to make the other chromosomes' tmp files (see troubleshooting #5). I just found that my strain has 20 autosomes + X, Y, but there are 25 tmp files in total (from chr1.tmp to chr25.tmp). Everything seems automated, right?

Also, I keep encountering the “Segmentation fault” error (troubleshooting #3) even though I am running only chromosome 12. (I do have all the other chromosome tmp files.) All parameters are the defaults except read length = 150, and the reference genome is also chr12 only. Could you give me some suggestions, please?

yufanzhouonline commented 2 years ago

It looks like you are running HiSIF as described in "Quick Start". runhisif.sh can only be used for the human genome.

If you run another organism, you have to follow the instructions in "Customized Running".

There are three steps to run HiSIF:

Pre-Processing
Creating the chr-by-chr files
Running HiSIF

In the first step, pre-processing, convert the BAM/SAM files to a 6-column text file yourself:

chr1 pos1 strand1 chr2 pos2 strand2

Strand is 1 for the positive strand and 0 for the negative strand. Each chromosome needs only the number; for human, chrX is 23 and chrY is 24. Similarly, use 1, 2, 3, ... for other organisms.
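The conversion described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the HiSIF pre-processing code: it assumes the two mates of each pair sit on consecutive SAM lines, takes the SAM POS field (1-based leftmost position) as the position, separates columns with a single space, and maps rat chrX to 21 and chrY to 22 by analogy with the human convention. All function names are hypothetical.

```python
def chrom_to_number(name, n_autosomes=20):
    """Map 'chr7' -> 7, 'chrX' -> n_autosomes + 1, 'chrY' -> n_autosomes + 2."""
    label = name[3:] if name.lower().startswith("chr") else name
    if label.isdigit():
        number = int(label)
        return number if 1 <= number <= n_autosomes else None
    return {"X": n_autosomes + 1, "Y": n_autosomes + 2}.get(label.upper())


def strand_code(flag):
    """Return 1 for the positive strand, 0 if SAM flag bit 0x10 (reverse) is set."""
    return 0 if flag & 0x10 else 1


def pair_to_six_columns(mate1, mate2, n_autosomes=20):
    """Build 'chr1 pos1 strand1 chr2 pos2 strand2' from two SAM records."""
    a = mate1.rstrip("\n").split("\t")
    b = mate2.rstrip("\n").split("\t")
    if a[0] != b[0]:
        raise ValueError(f"records are not mates: {a[0]} vs {b[0]}")
    chrom1 = chrom_to_number(a[2], n_autosomes)
    chrom2 = chrom_to_number(b[2], n_autosomes)
    if chrom1 is None or chrom2 is None:
        return None  # skip pairs on contigs that have no numeric label
    return (f"{chrom1} {int(a[3])} {strand_code(int(a[1]))} "
            f"{chrom2} {int(b[3])} {strand_code(int(b[1]))}")


def convert(sam_lines, n_autosomes=20):
    """Yield 6-column rows, pairing consecutive non-header SAM lines."""
    records = [l for l in sam_lines if l.strip() and not l.startswith("@")]
    for mate1, mate2 in zip(records[0::2], records[1::2]):
        row = pair_to_six_columns(mate1, mate2, n_autosomes)
        if row is not None:
            yield row
```

The SAM text itself could come from something like `samtools view -h sample.bwt2pairs.bam`, with the output of `convert` written line by line to the pre-processing file.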

Please refer to the "Customized Running" section at this link: https://github.com/yufanzhouonline/HiSIF

Thanks.

distilledchild commented 2 years ago

@yufanzhouonline Thank you for your guidance. I followed your advice, but HiSIF keeps failing with “Segmentation fault” (troubleshooting #3) and only produces (SAMPLE)_t1_PerChr.txt, even though I run only one chromosome (the shortest one, chr12). I even tried with 1/10 of chr12.tmp and still got the memory error, so I think something is wrong, because I ran this on a machine with 1 TB of RAM.

Case1 working process for chr12:

making a reference file for chr12 only (and an index for chr12 only)
running the HiC-Pro pipeline
making the 4-enzyme fragment file for chr12 only
running HiSIF after proc (empty chr1.tmp~chr22.tmp files were made, except for chr12.tmp, which was created by proc)

HiSIF -g /user/bowtie2_ref -c /user/HiSIF_V1.00/resources/hic_pro_edited_for_hisif.bed -w 36 500 3000 -p 1 29 -t 1 -i 2 ./SHR
Error: could not read the first line : Is a directory
wc: /user/bowtie2_ref/.: Is a directory
Error: could not read the first line : Is a directory
wc: /user/bowtie2_ref/.: Is a directory
(=:...........Start processing files...........:=)
cuttingSiteTotal == 381448
<-----Parsed enzyme cutting site map----->
Segmentation fault (core dumped)

Case2 working process for chr12:

running the HiC-Pro pipeline with a reference containing all chromosomes
converting BAM to SAM using the function in proc and keeping only chr12 (awk '$1 == 12 && $4 == 12')
making the 4-enzyme fragment file for chr12 only
HiSIF -g /user/bowtie2_ref -c /user/HiSIF_V1.00/resources/hic_pro_edited_for_hisif.bed -w 36 500 3000 -p 1 29 -t 1 -i 2 ./SHR
Error: could not read the first line : Is a directory
wc: /user/bowtie2_ref/.: Is a directory
Error: could not read the first line : Is a directory
wc: /user/bowtie2_ref/.: Is a directory
(=:...........Start processing files...........:=)
cuttingSiteTotal == 381448
<-----Parsed enzyme cutting site map----->
Segmentation fault (core dumped)

So in both cases, which use different processes to create chr12.tmp, I get the same errors.

I suspect something is wrong in the CUTTING_FRAGMENTS file from digest_genome.py. The enzyme cutting fragment file has 381448 lines for chromosome 12.

The first 10 lines of the file are:

chr12 31 36 HIC_chr12_1 0 +
chr12 56 61 HIC_chr12_2 0 +
chr12 78 83 HIC_chr12_3 0 +
chr12 103 108 HIC_chr12_4 0 +
chr12 225 230 HIC_chr12_5 0 +
chr12 235 240 HIC_chr12_6 0 +
chr12 300 305 HIC_chr12_7 0 +
chr12 424 429 HIC_chr12_8 0 +
chr12 449 454 HIC_chr12_9 0 +
chr12 471 476 HIC_chr12_10 0 +

Do you have any suggestions for this situation?

Also, one additional question: if I use multiple enzymes, what fragment size would be good? The enzymes I am using are ^GATC, ^ANTC, C^TNAG, and T^TAA.

If possible, could I contact you by email?

Thank you.

I think the memory shortage on 1 TB of RAM happens because there are too many enzyme fragments: 43344947 from 4 enzymes across 22 chromosomes. There are also 3353860 pairs in chr12.

yufanzhouonline commented 2 years ago

If you run only one chromosome, please refer to #5 in the "Troubleshooting" section at this link:

https://github.com/yufanzhouonline/HiSIF

Please contact me via email: zhouy4@uthscsa.edu if you have any further questions.

Thank you.

yufanzhouonline / HiSIF

non-human data #4