Open adarobot opened 2 years ago
The .SAF file was generated based on the output from prodigal using a additional script. The SAF file has fixed formats, including "GeneID", "Chr", '"Start", “End”,“Strand”.
------------------ 原始邮件 ------------------ 发件人: "zhouyunyan/PIGC" @.>; 发送时间: 2022年6月18日(星期六) 上午10:45 @.>; @.***>; 主题: [zhouyunyan/PIGC] Issue in 06_Abundance.sh (Issue #2)
In this 06_Abundance.sh file, there is a file assigned to SAF(SAF=./06_Abundance/PIGC90_cds.fna.saf). Could you please explain where it is from and how it is generated? I only found PIGC90_cds.fna in 02_cdhit_cluster. Thanks.
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>
Thanks. Will you be able to share the script to generate .SAF using the output from prodigal? Also, what do you mean by "nbsp"?
Sorry, I didn't save the script generated .SAF and can't find it now. The messages were sent through QQ mailbox, maybe the "nbsp" is an incompatibility problem that leads to garbled characters?
------------------ 原始邮件 ------------------ 发件人: "zhouyunyan/PIGC" @.>; 发送时间: 2022年6月18日(星期六) 晚上9:42 @.>; @.**@.>; 主题: Re: [zhouyunyan/PIGC] Issue in 06_Abundance.sh (Issue #2)
Thanks. Will you be able to share the script to generate .SAF using the output from prodigal? Also, what do you mean by "nbsp"?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>
can we use allSample500_gff file which got generated from prodigal instead of PIGC90_cds.fna.saf file? beacuse i also have no idea how to create a .saf file to run feature count
Featurecount seems to only support GTF/SAF files, maybe you can try to convert gff to gtf format.
I have tried using package called agat to convert the gff output file from prodigal to gtf format, but unfornately it didnt work. it shows a error !! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !Your file has less than 8 columns (1). It cannot be a GTF/GFF file. Please ver! ! ify your file ! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! therefore it will really helpful if you can suggest any alternative way to run feature count
I think it is more convenient to extract the information from the FASTA file (nucleotide or protein sequence) generated by Prodigal to obtain the .SAF file. All needed information including "GeneID", "Chr", '"Start", “End”,“Strand” can be found in the header of FASTA file.
Thanks, I'll check it out.
On Thu, Jul 21, 2022 at 9:39 AM yyz @.***> wrote:
I think it is more convenient to extract the information from the FASTA file (nucleotide or protein sequence) generated by Prodigal to obtain the .SAF file. All needed information including "GeneID", "Chr", '"Start", “End”,“Strand” can be found in the header of FASTA file.
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1191568251, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALQVVZVKQMBRZIMJVVZDF23VVFOKVANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
-- Jusail C P Research Scholar Vaccine Immunology Laboratory, Infectious Disease Department, National Institute of Animal Biotechnology(NIAB), Hyderabad, India.
I have tried learning how to convert the file nucleotide or protein file from prodigal to a SAF format, but i couldn't find a solution. So it will be really helpful if you can provide the command how you have done it.
I have graduated and did not keep the data and codes for this article. If you can provide a test file, I can try to re-write a script when I am free.
Hi, what kind of test file do you mean i can provide you to help me with?. I didn't get you thats why
On Thu, Aug 4, 2022 at 9:17 PM yyz @.***> wrote:
I have graduated and did not keep the data and codes for this article. If you can provide a test file, I can try to re-write a script when I am free.
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1205965711, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALQVVZTXQ4HZQ55U3PQGDEDVXR2UDANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
-- Jusail C P Research Scholar Vaccine Immunology Laboratory, Infectious Disease Department, National Institute of Animal Biotechnology(NIAB), Hyderabad, India.
Using the following code, you can try.
grep "^>" test.protein.fa |awk -v FS="[>#]" -v OFS="\t" '{print $2,$3,$4,$5}' >test.out.txt
allSample_nucl.fna https://drive.google.com/file/d/1dfVtzn3fLu6WSkmVmEC3EIvfJ3kq2Cgi/view?usp=drive_web allSample_protein.faa https://drive.google.com/file/d/16juv47sonz0OFU4cPh4ERD9zEJOy_Jc6/view?usp=drive_web filter.allSample_protein.faa https://drive.google.com/file/d/1L6-1TBmofLJAA4HMlaREIoCy8zexVsPX/view?usp=drive_web test.out.saf https://drive.google.com/file/d/1XOZOHD9CLWy8pOu4kk9BIA69JsoJWoJr/view?usp=drive_web Hi ,
I have tried using the above mentioned command for the {sampleID}_protein.faa got generated from prodigal. I have done variation in the above command by changing test.out.txt to test.out.saf. When I used this for feature count it showed as invalid parameter. If you could help me once again with this it would be really helpful, i would like to share my nucleotide and protein fasta file generated from prodigal with this mail as a reference .Please let me know if I am doing something wrong.
Thank You,
JUSAIL CP
On Wed, Aug 10, 2022 at 7:47 AM yyz @.***> wrote:
Using the following code, you can try. grep "^>" test.protein.fa |awk -v FS="[>#]" -v OFS="\t" '{print $2,$3,$4,$5}' >test.out.txt
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1210626331, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALQVVZU44RFCYAAW5BD5YCDVYOQG7ANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
-- Jusail C P Research Scholar Vaccine Immunology Laboratory, Infectious Disease Department, National Institute of Animal Biotechnology(NIAB), Hyderabad, India.
Hello, I am wondering if you have finally had a way to convert prodigal predicted nucleotide or protein to SAF format? if not could you assist me on how to annotated prodigal predicted ORF? I tried using prokka on the prodigal annotated nucleotide sequence but it run on for days and just gave me same .fna file.
Well, just as @jusail mentioned I used the code you provided above grep "^>" test.protein.fa |awk -v FS="[>#]" -v OFS="\t" '{print $2,$3,$4,$5}' >test.out.txt
but I had an this: ERROR: no features were loaded in format SAF. The annotation format can be specified by the '-F' option. The porgram has to terminate. Thank you
A .SAF file needs "GeneID", "Chr", '"Start", “End”,“Strand”. The “test.out.txt” is not a .SAF file,you need to make .SAF file based on the information extracted from "test.protein.fa" by the code I provided above. In fact, just need to add one column "Chr".
Here, “Chr” recommends using the contig ID from which the gene is derived. If this is difficult, gene ID is alternative.
Alright. I have done that but the question how to make a .SAF file?
I saved the file with a .SAF extension and it looks like the featureCount isn't recognizing it.
Thanks for getting back to me.
[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 09:34:10 PM
On Wed, Aug 17, 2022 at 9:31 PM yyz @.***> wrote:
A .SAF file needs "GeneID", "Chr", '"Start", “End”,“Strand”. The “test.out.txt” is not a .SAF file,you need to make .SAF file based on the information extracted from "test.protein.fa" by the code I provided above. In fact, just need to add one column "Chr". [image: image] https://user-images.githubusercontent.com/42016367/185278976-30edca4e-393c-4807-adff-79bc23553579.png
Here, “Chr” recommends using the contig ID from which the gene is derived. If this is difficult, gene ID is alternative.
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1218929290, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAQF6TBE52RBUXV54CDVZWOBJANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
My prodigal predicted protein.faa does look like below, could you explain to me how you annotated it to have Gene ID and Chr information? thanks
k127_235550740_1 # 116 # 919 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.388 MDGLETLLKEEEHRLLAIKQVVNARLDTAPKGTLRITNCAGRPQFMLCTDETTKIKPYGE YISKSKHELISQLAQKSYDKKVKKLVERRIKQLETICSEYKDTEIMQIYDSMNDIRKSMV IPVEPTWEQRLEKWKSVPYIGKSFGKDVLEIYTKKGERVRSKSEKIIADTFYEMGIEYKY ECPISLSGYGTVYPDFTILSMKLNRVIYWEHEGRMDDPKYVEKAVRKIDSYTKNGIIPGK QLILTYETSTYALSDNTIKELIRQFII k127_235550740_2 # 1086 # 1358 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.418 MADNKNIMLTDEMMAQITGGTEGGGEQPFAFAVGDRVMLQYYDQIGIITEAYRYQWQDNF WNMYAVHWLPDPYNEEHDNRDVLECNLRHA k127_235550740_3 # 1377 # 1622 # 1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.423
[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 09:46:26 PM
On Wed, Aug 17, 2022 at 9:36 PM Adam Seidu @.***> wrote:
Alright. I have done that but the question how to make a .SAF file?
I saved the file with a .SAF extension and it looks like the featureCount isn't recognizing it.
Thanks for getting back to me.
[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 09:34:10 PM
On Wed, Aug 17, 2022 at 9:31 PM yyz @.***> wrote:
A .SAF file needs "GeneID", "Chr", '"Start", “End”,“Strand”. The “test.out.txt” is not a .SAF file,you need to make .SAF file based on the information extracted from "test.protein.fa" by the code I provided above. In fact, just need to add one column "Chr". [image: image] https://user-images.githubusercontent.com/42016367/185278976-30edca4e-393c-4807-adff-79bc23553579.png
Here, “Chr” recommends using the contig ID from which the gene is derived. If this is difficult, gene ID is alternative.
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1218929290, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAQF6TBE52RBUXV54CDVZWOBJANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
Please refer to SAF format requirements (https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts)
Got that, thanks!
[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 09:59:17 PM
On Wed, Aug 17, 2022 at 9:56 PM yyz @.***> wrote:
Please refer to SAF format requirements ( https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts ) [image: image] https://user-images.githubusercontent.com/42016367/185282522-9c3b18aa-b53d-4dbc-b3e0-6bdd26ab6f0e.png
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1218973168, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBATOBMHNAJJUCREZ7P3VZWQ6BANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
The gene "k127_235550740_1" is from contig "k127_235550740", I guess. However, I'm not sure if your contigID or gene ID is unique. In order to ensure their uniqueness, we generally rename contig ID and gene ID.
Ok, thanks. I will rename it.
[image: Mailtrack] https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& Sender notified by Mailtrack https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality11& 08/17/22, 10:25:26 PM
On Wed, Aug 17, 2022 at 10:16 PM yyz @.***> wrote:
The gene "k127_235550740_1" is from contig "k127_235550740", I guess. However, I'm not sure if your contigID or gene ID is unique. In order to ensure their uniqueness, we generally rename contig ID and gene ID.
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1218998215, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAWAGIO52T2UPR4G5Z3VZWTJFANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
Hi Seidu,
Are you able run the feature count after renaming the "ContigID" into "Chr" and was able to run the feature count.
Hello Zhouyunyan,
Thanks for checking on me. I did rename but couldn’t get it to work. May be it has to do with the conversion to.saf Thank you
On Mon, Aug 22, 2022 at 12:20 PM JUSAIL CP @.***> wrote:
Hi Seidu,
Are you able run the feature count after renaming the "ContigID" into "Chr" and was able to run the feature count.
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1222664956, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAVAPK4VLUNP3G6A5GDV2OZHNANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
I have one question! Renaming the ContigID to Chr; would that affect the recount? Since the gene catelog I mapped the reads to has the ContigID.
On Mon, Aug 22, 2022 at 12:23 PM Adam Seidu @.***> wrote:
Hello Zhouyunyan,
Thanks for checking on me. I did rename but couldn’t get it to work. May be it has to do with the conversion to.saf Thank you
On Mon, Aug 22, 2022 at 12:20 PM JUSAIL CP @.***> wrote:
Hi Seidu,
Are you able run the feature count after renaming the "ContigID" into "Chr" and was able to run the feature count.
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1222664956, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAVAPK4VLUNP3G6A5GDV2OZHNANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
@jusail Hello Jusail,
Sorry for address the name.
Okay, thanks for the update from your side. I also think that if column is renamed then it should also match with bam file used for abundance calculation.
When I run the code above it gives me only the ContigID
k127_235550740_1 k127_235550740_2 k127_235550740_3 k127_235550740_4 k127_235550740_5 k127_235550740_6 k127_170119981_1 k127_19629233_1 k127_45801541_1 k127_45801541_2
I got something like this without a column heading
k87_38_1 1 108 1 k87_38_2 534 881 1 k87_40_1 3 860 -1 k87_42_1 1 534 -1 k87_44_1 1 942 -1 k87_44_2 987 1169 -1 k87_45_1 1 672 1 k87_51_1 3 74 -1 k87_51_2 119 589 -1 k87_51_3 744 821 -1
Alright, got you. Did you use prodigal for your ORF prediction?
yup, i am new to computational stuffs. This PIG paper really helped me to get a good pipeline in one go. So that i have just done similar the pipeline mentioned in this page, nothing extra
I see! yeah. it is an amazing pipeline. I dont know why my gene prediction only has the ContigID without extra information
@jusail I had it now. Do you have any code to renaming the contigID to Chr? I have about 1.5 million genes. Thanks
no, i am just trying to do using sed command
Ok, thanks.
On Mon, Aug 22, 2022 at 4:36 PM JUSAIL CP @.***> wrote:
no, i am just trying to do using sed command
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1223114823, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAWYFFZRIFJQXEQ7KSDV2PXGFANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
I renamed contig IDs and gene IDs at the beginning (when generating contig and gene sequences, before any other analysis). If you rename it in .SAF file, other related files need to be consistent.
Ok, thanks. Did you use any code to rename it or you did it manually?
On Mon, Aug 22, 2022 at 9:37 PM yyz @.***> wrote:
I renamed contig IDs and gene IDs at the beginning (when generating contig and gene sequences, before any other analysis). If you rename it in .SAF file, other related files need to be consistent.
— Reply to this email directly, view it on GitHub https://github.com/zhouyunyan/PIGC/issues/2#issuecomment-1223454157, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASEJBAQPPAJIRBHTNLRRCSTV2Q2OBANCNFSM5ZEAUNVA . You are receiving this because you commented.Message ID: @.***>
Did anyone figured out to prepare the annotation file for feature count. I have tried different methods to get a saf file or gtf file from prodigal outputs, but none of them works. If anyone were able to write the script for create the annotation file please let me know. It would be really helpful.
In this 06_Abundance.sh file, there is a file assigned to SAF(SAF=./06_Abundance/PIGC90_cds.fna.saf). Could you please explain where it is from and how it is generated? I only found PIGC90_cds.fna in 02_cdhit_cluster. Thanks.