xzhang2016 / DeepHE

A deep learning framework for essential gene prediction
MIT License
9 stars, 5 forks

Data is used for the paper experiment #1

Open conglb opened 3 years ago

conglb commented 3 years ago

Dear author,

I am new to bioinformatics, so I don't have much experience finding data. Could you give me the download link for the data you used? I would really appreciate it.

Thanks!

xzhang2016 commented 3 years ago

Hi,

You can find the links in the "Data Availability Statement" (left section on page 1) of the published paper at https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1008229&type=printable

Good luck!

AmirUCR commented 3 years ago

Thank you for your research. In your paper, Fig 2 ("The distribution of essential genes across the 16 datasets") appears to show that one of the datasets contains 3870 essential genes after duplicate genes are removed across the datasets. I have been looking at the essential gene database http://tubic.tju.edu.cn/deg/organism.php?db=e and cannot find any dataset that contains 3870 genes. What is the cause of this discrepancy? I appreciate your clarification.

xzhang2016 commented 3 years ago

I think you misunderstood that figure. The 3870 genes you mentioned are the genes that are contained in only one dataset, but they are not necessarily all in the same dataset.

AmirUCR commented 3 years ago

Thank you for getting back to me. I am a Ph.D. student from the University of California, Riverside and I am trying to replicate your experiment for a class project so I appreciate you helping me with this.

So, from what I understand, some datasets were merged into one after duplicate genes were removed across the datasets. Is this correct?

Also, I tried getting DNA and peptide sequence data from Ensembl. I noticed there are multiple isoforms of the same genes in that dataset. I was wondering, how did you decide which sequence to use in the DeepHE model?

xzhang2016 commented 3 years ago

Fig 2 shows the distribution of essential genes across the 16 datasets from the DEG database. It shows how many genes are contained in only 1 of the 16 datasets, how many are contained in only 2 of the 16 datasets, and so on. The paper used only genes contained in at least 5 of the 16 datasets as essential genes.
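
The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the DeepHE code: the function names, the `min_datasets` parameter, and the toy dataset contents are all hypothetical, and it only covers the positive (essential) label, since this thread does not say how non-essential genes were chosen.

```python
from collections import Counter


def count_dataset_membership(datasets):
    """Map each gene to the number of datasets that list it."""
    counts = Counter()
    for genes in datasets.values():
        # set() so that a gene repeated inside one dataset counts once
        counts.update(set(genes))
    return counts


def membership_histogram(counts):
    """Fig 2 style summary: k -> number of genes found in exactly k datasets."""
    return dict(sorted(Counter(counts.values()).items()))


def select_essential(counts, min_datasets=5):
    """Genes listed in at least `min_datasets` datasets (5 per the reply above)."""
    return {gene for gene, k in counts.items() if k >= min_datasets}


# Toy data (hypothetical): three small datasets instead of the 16 from DEG.
toy = {
    "ds1": ["A", "B", "C"],
    "ds2": ["A", "B"],
    "ds3": ["A", "D"],
}
counts = count_dataset_membership(toy)
print(membership_histogram(counts))                       # {1: 2, 2: 1, 3: 1}
print(sorted(select_essential(counts, min_datasets=2)))   # ['A', 'B']
```

In the toy data, the two genes counted under `1` (C and D) come from different datasets, which is the point made earlier in the thread about the 3870 genes.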

AmirUCR commented 3 years ago

Thank you for your explanation. I understand what fig 2 means now and have made a similar plot on my end.

I tried getting DNA and peptide sequence data from Ensembl. I noticed there are multiple isoforms of the same genes in that dataset (multiple sequences exist for the same gene). I was wondering, how did you decide which sequence to use in the DeepHE model? I appreciate your help and patience.

xzhang2016 commented 3 years ago

I downloaded the DNA and protein sequences from the Ensembl FTP site: the DNA file (Homo_sapiens.GRCh38.cds.all.fa.gz) from ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/cds/ and the protein file (Homo_sapiens.GRCh38.pep.all.fa.gz) from ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/pep/

The DEG database also contains the sequences for essential genes.

AmirUCR commented 3 years ago

Thank you for your reply and for linking the datasets. Yes, I have the DNA sequence from Ensembl. Let me rephrase my question: Each gene in the Ensembl CDS dataset has multiple alternative splicing sequences. Do we use all of them for DeepHE? Or do we choose only 1 sequence for each gene? Here is what I am referring to:

image

For example, for gene FOS, there is more than 1 sequence. Which one do we use, if not all?
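
This thread does not record which isoform DeepHE used. For anyone hitting the same question, the sketch below groups the CDS records by gene and keeps the longest sequence per gene. Keeping the longest CDS is an assumption chosen for illustration, not the author's confirmed method. The sketch also assumes each FASTA header carries a `gene:<id>` field, and the transcript and gene IDs in the toy input are hypothetical.

```python
import re

# Assumption: headers contain a "gene:<id>" field. The trailing colon keeps
# this from matching "gene_symbol:" or "gene_biotype:".
GENE_FIELD = re.compile(r"\bgene:(\S+)")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_cds_per_gene(fasta_text):
    """Return {gene_id: (transcript_id, sequence)}, keeping the longest CDS."""
    best = {}
    for header, seq in parse_fasta(fasta_text):
        match = GENE_FIELD.search(header)
        if match is None:
            continue  # skip records without a gene field
        gene_id = match.group(1)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header.split()[0], seq)
    return best


# Toy input with hypothetical IDs: two isoforms of one gene, one of another.
toy_fasta = """\
>TX1 cds gene:GENE_A gene_symbol:FOS
ATGAAA
TGA
>TX2 cds gene:GENE_A gene_symbol:FOS
ATGTGA
>TX3 cds gene:GENE_B gene_symbol:JUN
ATGCCCTGA
"""
selected = longest_cds_per_gene(toy_fasta)
print(selected["GENE_A"])  # ('TX1', 'ATGAAATGA')
```

For the real file, the text could be read with `gzip.open(path, "rt").read()` before being passed to `longest_cds_per_gene`. Using every isoform as a separate sample is the other obvious option; which one matches the paper would need confirmation from the author.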

Ahmed-Ali88 commented 1 month ago

Thank you for your research. Regarding Fig 2 in your paper (the distribution of essential genes across the 16 datasets): I understand that the paper used only genes contained in at least 5 of the 16 datasets as essential genes. How do you label genes as essential or non-essential? Could you provide a sample?

Ahmed-Ali88 commented 1 month ago

Dear author,

Thank you for your research. I am new to bioinformatics, so I don't have much experience finding data. I am trying to replicate your experiment for a class project.

Could you give me the download link for the data you used from the DEG essential gene database? The link http://tubic.tju.edu.cn/deg/organism.php?db=e gives me an error.

I appreciate your help.