zhouyunyan / PIGC

The construction of reference gene catalog and metagenome-assembled genomes of pig gut microbiome.
25 stars 16 forks source link

Issue in 06_Abundance.sh #2

Open adarobot opened 2 years ago

adarobot commented 2 years ago

In this 06_Abundance.sh file, there is a file assigned to SAF(SAF=./06_Abundance/PIGC90_cds.fna.saf). Could you please explain where it is from and how it is generated? I only found PIGC90_cds.fna in 02_cdhit_cluster. Thanks.

zhouyunyan commented 2 years ago

The .SAF file was generated based on the output from prodigal using a additional script. The SAF file has fixed formats, including "GeneID", "Chr", '"Start", “End”,“Strand”.

------------------ 原始邮件 ------------------ 发件人: "zhouyunyan/PIGC" @.>; 发送时间: 2022年6月18日(星期六) 上午10:45 @.>; @.***>; 主题: [zhouyunyan/PIGC] Issue in 06_Abundance.sh (Issue #2)

In this 06_Abundance.sh file, there is a file assigned to SAF(SAF=./06_Abundance/PIGC90_cds.fna.saf). Could you please explain where it is from and how it is generated? I only found PIGC90_cds.fna in 02_cdhit_cluster. Thanks.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>

adarobot commented 2 years ago

Thanks. Will you be able to share the  script to generate .SAF using the output from  prodigal? Also, what do you mean by "nbsp"?

zhouyunyan commented 2 years ago

Sorry, I didn't save the script generated .SAF and can't find it now. The messages were sent through QQ mailbox, maybe the "nbsp" is an incompatibility problem that leads to garbled characters?

------------------ 原始邮件 ------------------ 发件人: "zhouyunyan/PIGC" @.>; 发送时间: 2022年6月18日(星期六) 晚上9:42 @.>; @.**@.>; 主题: Re: [zhouyunyan/PIGC] Issue in 06_Abundance.sh (Issue #2)

Thanks. Will you be able to share the  script to generate .SAF using the output from  prodigal? Also, what do you mean by "nbsp"?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

jusail commented 2 years ago

can we use allSample500_gff file which got generated from prodigal instead of PIGC90_cds.fna.saf file? beacuse i also have no idea how to create a .saf file to run feature count

zhouyunyan commented 2 years ago

Featurecount seems to only support GTF/SAF files, maybe you can try to convert gff to gtf format.

jusail commented 2 years ago

I have tried using package called agat to convert the gff output file from prodigal to gtf format, but unfornately it didnt work. it shows a error !! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !Your file has less than 8 columns (1). It cannot be a GTF/GFF file. Please ver! ! ify your file ! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! therefore it will really helpful if you can suggest any alternative way to run feature count

zhouyunyan commented 2 years ago

I think it is more convenient to extract the information from the FASTA file (nucleotide or protein sequence) generated by Prodigal to obtain the .SAF file. All needed information including "GeneID", "Chr", '"Start", “End”,“Strand” can be found in the header of FASTA file.

jusail commented 2 years ago

Thanks, I'll check it out.

On Thu, Jul 21, 2022 at 9:39 AM yyz @.***> wrote:

I think it is more convenient to extract the information from the FASTA file (nucleotide or protein sequence) generated by Prodigal to obtain the .SAF file. All needed information including "GeneID", "Chr", '"Start", “End”,“Strand” can be found in the header of FASTA file.

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1191568251, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALQVVZVKQMBRZIMJVVZDF23VVFOKVANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

-- Jusail C P Research Scholar Vaccine Immunology Laboratory, Infectious Disease Department, National Institute of Animal Biotechnology(NIAB), Hyderabad, India.

jusail commented 2 years ago

I have tried learning how to convert the file nucleotide or protein file from prodigal to a SAF format, but i couldn't find a solution. So it will be really helpful if you can provide the command how you have done it.

zhouyunyan commented 2 years ago

I have graduated and did not keep the data and codes for this article. If you can provide a test file, I can try to re-write a script when I am free.

jusail commented 2 years ago

Hi, what kind of test file do you mean i can provide you to help me with?. I didn't get you thats why

On Thu, Aug 4, 2022 at 9:17 PM yyz @.***> wrote:

I have graduated and did not keep the data and codes for this article. If you can provide a test file, I can try to re-write a script when I am free.

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1205965711, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALQVVZTXQ4HZQ55U3PQGDEDVXR2UDANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

-- Jusail C P Research Scholar Vaccine Immunology Laboratory, Infectious Disease Department, National Institute of Animal Biotechnology(NIAB), Hyderabad, India.

zhouyunyan commented 2 years ago

Using the following code, you can try. grep "^>" test.protein.fa |awk -v FS="[>#]" -v OFS="\t" '{print $2,$3,$4,$5}' >test.out.txt

jusail commented 2 years ago

allSample_nucl.fna https://drive.google.com/file/d/1dfVtzn3fLu6WSkmVmEC3EIvfJ3kq2Cgi/view?usp=drive_web allSample_protein.faa https://drive.google.com/file/d/16juv47sonz0OFU4cPh4ERD9zEJOy_Jc6/view?usp=drive_web filter.allSample_protein.faa https://drive.google.com/file/d/1L6-1TBmofLJAA4HMlaREIoCy8zexVsPX/view?usp=drive_web test.out.saf https://drive.google.com/file/d/1XOZOHD9CLWy8pOu4kk9BIA69JsoJWoJr/view?usp=drive_web Hi ,

I have tried using the above mentioned command for the {sampleID}_protein.faa got generated from prodigal. I have done variation in the above command by changing test.out.txt to test.out.saf. When I used this for feature count it showed as invalid parameter. If you could help me once again with this it would be really helpful, i would like to share my nucleotide and protein fasta file generated from prodigal with this mail as a reference .Please let me know if I am doing something wrong.

Thank You,

JUSAIL CP

On Wed, Aug 10, 2022 at 7:47 AM yyz @.***> wrote:

Using the following code, you can try. grep "^>" test.protein.fa |awk -v FS="[>#]" -v OFS="\t" '{print $2,$3,$4,$5}' >test.out.txt

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1210626331, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALQVVZU44RFCYAAW5BD5YCDVYOQG7ANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

-- Jusail C P Research Scholar Vaccine Immunology Laboratory, Infectious Disease Department, National Institute of Animal Biotechnology(NIAB), Hyderabad, India.

Adamyazori commented 2 years ago

Hello, I am wondering if you have finally had a way to convert prodigal predicted nucleotide or protein to SAF format? if not could you assist me on how to annotated prodigal predicted ORF? I tried using prokka on the prodigal annotated nucleotide sequence but it run on for days and just gave me same .fna file.

Adamyazori commented 2 years ago

Well, just as @jusail mentioned I used the code you provided above grep "^>" test.protein.fa |awk -v FS="[>#]" -v OFS="\t" '{print $2,$3,$4,$5}' >test.out.txt

but I had an this: ERROR: no features were loaded in format SAF. The annotation format can be specified by the '-F' option. The porgram has to terminate. Thank you

zhouyunyan commented 2 years ago

A .SAF file needs "GeneID", "Chr", '"Start", “End”,“Strand”. The “test.out.txt” is not a .SAF file,you need to make .SAF file based on the information extracted from "test.protein.fa" by the code I provided above. In fact, just need to add one column "Chr". image

Here, “Chr” recommends using the contig ID from which the gene is derived. If this is difficult, gene ID is alternative.

Adamyazori commented 2 years ago

Alright. I have done that but the question how to make a .SAF file?

I saved the file with a .SAF extension and it looks like the featureCount isn't recognizing it.

Thanks for getting back to me.

[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 09:34:10 PM

On Wed, Aug 17, 2022 at 9:31 PM yyz @.***> wrote:

A .SAF file needs "GeneID", "Chr", '"Start", “End”,“Strand”. The “test.out.txt” is not a .SAF file,you need to make .SAF file based on the information extracted from "test.protein.fa" by the code I provided above. In fact, just need to add one column "Chr". [image: image] https://user-images.githubusercontent.com/42016367/185278976-30edca4e-393c-4807-adff-79bc23553579.png

Here, “Chr” recommends using the contig ID from which the gene is derived. If this is difficult, gene ID is alternative.

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1218929290, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAQF6TBE52RBUXV54CDVZWOBJANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

Adamyazori commented 2 years ago

My prodigal predicted protein.faa does look like below, could you explain to me how you annotated it to have Gene ID and Chr information? thanks

k127_235550740_1 # 116 # 919 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.388 MDGLETLLKEEEHRLLAIKQVVNARLDTAPKGTLRITNCAGRPQFMLCTDETTKIKPYGE YISKSKHELISQLAQKSYDKKVKKLVERRIKQLETICSEYKDTEIMQIYDSMNDIRKSMV IPVEPTWEQRLEKWKSVPYIGKSFGKDVLEIYTKKGERVRSKSEKIIADTFYEMGIEYKY ECPISLSGYGTVYPDFTILSMKLNRVIYWEHEGRMDDPKYVEKAVRKIDSYTKNGIIPGK QLILTYETSTYALSDNTIKELIRQFII k127_235550740_2 # 1086 # 1358 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.418 MADNKNIMLTDEMMAQITGGTEGGGEQPFAFAVGDRVMLQYYDQIGIITEAYRYQWQDNF WNMYAVHWLPDPYNEEHDNRDVLECNLRHA k127_235550740_3 # 1377 # 1622 # 1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.423

[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 09:46:26 PM

On Wed, Aug 17, 2022 at 9:36 PM Adam Seidu @.***> wrote:

Alright. I have done that but the question how to make a .SAF file?

I saved the file with a .SAF extension and it looks like the featureCount isn't recognizing it.

Thanks for getting back to me.

[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 09:34:10 PM

On Wed, Aug 17, 2022 at 9:31 PM yyz @.***> wrote:

A .SAF file needs "GeneID", "Chr", '"Start", “End”,“Strand”. The “test.out.txt” is not a .SAF file,you need to make .SAF file based on the information extracted from "test.protein.fa" by the code I provided above. In fact, just need to add one column "Chr". [image: image] https://user-images.githubusercontent.com/42016367/185278976-30edca4e-393c-4807-adff-79bc23553579.png

Here, “Chr” recommends using the contig ID from which the gene is derived. If this is difficult, gene ID is alternative.

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1218929290, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAQF6TBE52RBUXV54CDVZWOBJANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

zhouyunyan commented 2 years ago

Please refer to SAF format requirements (https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts) image

Adamyazori commented 2 years ago

Got that, thanks!

[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 09:59:17 PM

On Wed, Aug 17, 2022 at 9:56 PM yyz @.***> wrote:

Please refer to SAF format requirements ( https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts ) [image: image] https://user-images.githubusercontent.com/42016367/185282522-9c3b18aa-b53d-4dbc-b3e0-6bdd26ab6f0e.png

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1218973168, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBATOBMHNAJJUCREZ7P3VZWQ6BANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

zhouyunyan commented 2 years ago

The gene "k127_235550740_1" is from contig "k127_235550740", I guess. However, I'm not sure if your contigID or gene ID is unique. In order to ensure their uniqueness, we generally rename contig ID and gene ID.

Adamyazori commented 2 years ago

Ok, thanks. I will rename it.

[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 10:25:26 PM

On Wed, Aug 17, 2022 at 10:16 PM yyz @.***> wrote:

The gene "k127_235550740_1" is from contig "k127_235550740", I guess. However, I'm not sure if your contigID or gene ID is unique. In order to ensure their uniqueness, we generally rename contig ID and gene ID.

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1218998215, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAWAGIO52T2UPR4G5Z3VZWTJFANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

jusail commented 2 years ago

Hi Seidu,

Are you able run the feature count after renaming the "ContigID" into "Chr" and was able to run the feature count.

Adamyazori commented 2 years ago

Hello Zhouyunyan,

Thanks for checking on me. I did rename but couldn’t get it to work. May be it has to do with the conversion to.saf Thank you

On Mon, Aug 22, 2022 at 12:20 PM JUSAIL CP @.***> wrote:

Hi Seidu,

Are you able run the feature count after renaming the "ContigID" into "Chr" and was able to run the feature count.

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1222664956, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAVAPK4VLUNP3G6A5GDV2OZHNANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

Adamyazori commented 2 years ago

I have one question! Renaming the ContigID to Chr; would that affect the recount? Since the gene catelog I mapped the reads to has the ContigID.

On Mon, Aug 22, 2022 at 12:23 PM Adam Seidu @.***> wrote:

Hello Zhouyunyan,

Thanks for checking on me. I did rename but couldn’t get it to work. May be it has to do with the conversion to.saf Thank you

On Mon, Aug 22, 2022 at 12:20 PM JUSAIL CP @.***> wrote:

Hi Seidu,

Are you able run the feature count after renaming the "ContigID" into "Chr" and was able to run the feature count.

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1222664956, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAVAPK4VLUNP3G6A5GDV2OZHNANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

Adamyazori commented 2 years ago

@jusail Hello Jusail,

Sorry for address the name.

jusail commented 2 years ago

Okay, thanks for the update from your side. I also think that if column is renamed then it should also match with bam file used for abundance calculation.

Adamyazori commented 2 years ago

When I run the code above it gives me only the ContigID

k127_235550740_1 k127_235550740_2 k127_235550740_3 k127_235550740_4 k127_235550740_5 k127_235550740_6 k127_170119981_1 k127_19629233_1 k127_45801541_1 k127_45801541_2

jusail commented 2 years ago

I got something like this without a column heading

k87_38_1 1 108 1 k87_38_2 534 881 1 k87_40_1 3 860 -1 k87_42_1 1 534 -1 k87_44_1 1 942 -1 k87_44_2 987 1169 -1 k87_45_1 1 672 1 k87_51_1 3 74 -1 k87_51_2 119 589 -1 k87_51_3 744 821 -1

Adamyazori commented 2 years ago

Alright, got you. Did you use prodigal for your ORF prediction?

jusail commented 2 years ago

yup, i am new to computational stuffs. This PIG paper really helped me to get a good pipeline in one go. So that i have just done similar the pipeline mentioned in this page, nothing extra

Adamyazori commented 2 years ago

I see! yeah. it is an amazing pipeline. I dont know why my gene prediction only has the ContigID without extra information

Adamyazori commented 2 years ago

@jusail I had it now. Do you have any code to renaming the contigID to Chr? I have about 1.5 million genes. Thanks

jusail commented 2 years ago

no, i am just trying to do using sed command

Adamyazori commented 2 years ago

Ok, thanks.

On Mon, Aug 22, 2022 at 4:36 PM JUSAIL CP @.***> wrote:

no, i am just trying to do using sed command

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1223114823, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAWYFFZRIFJQXEQ7KSDV2PXGFANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

zhouyunyan commented 2 years ago

I renamed contig IDs and gene IDs at the beginning (when generating contig and gene sequences, before any other analysis). If you rename it in .SAF file, other related files need to be consistent.

Adamyazori commented 2 years ago

Ok, thanks. Did you use any code to rename it or you did it manually?

On Mon, Aug 22, 2022 at 9:37 PM yyz @.***> wrote:

I renamed contig IDs and gene IDs at the beginning (when generating contig and gene sequences, before any other analysis). If you rename it in .SAF file, other related files need to be consistent.

— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1223454157, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAQPPAJIRBHTNLRRCSTV2Q2OBANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>

jusail commented 2 years ago

Did anyone figured out to prepare the annotation file for feature count. I have tried different methods to get a saf file or gtf file from prodigal outputs, but none of them works. If anyone were able to write the script for create the annotation file please let me know. It would be really helpful.