wyp1125 / MCScanX

MCScanX: Multiple Collinearity Scan toolkit X version. The most popular synteny analysis tool in the world!
http://chibba.pgml.uga.edu/mcscan2/

File format gff or bed and what format? #40


Dearbhaile commented 3 years ago

I am a PhD student doing a collaborative project between Trinity College Dublin and UCLondon, and I am using MCScanX. I am just getting to know the tool and have been trying to run it on the test data available in the package (at_vv.gff, at_vv.blast, etc.), but I am running into a few problems.

Some of the information online says to use a .gff file and some says to use a .bed file, and the documentation is also inconsistent about how these files should be laid out. Could you tell me exactly how the input files for MCScanX should be formatted? Also, is it advisable to have just the two files (the .bed/.gff and the .blast) in a single folder when running the ./MCScanX command?

I hope you can help me out.

jannafierst commented 3 years ago

I am just getting this running but I had to create a file with this information:

Chromosome Gene_name Start End

This is not actually a 'bed' file, because the .bed format is (to my understanding, at least) constrained to Chromosome, Start, End as the first 3 columns, with gene names and other information added in columns 4+. I had

Chromosome Start End Gene_name

and rewrote it with this awk command

awk -F '\t' '{print $1,$4,$2,$3}' OFS='\t' "xyz.bed" > xyz.gff

If you check the test data the format is like this. Good luck!
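
For anyone who prefers to do the same reordering outside of awk, here is a minimal Python sketch of the column swap described above. The function name `bed4_to_mcscanx_gff` and the sample values (`chr1`, `geneA`) are made up for illustration, and it assumes tab-separated input in the order Chromosome, Start, End, Gene_name. Unlike the awk one-liner, it skips lines with fewer than four columns instead of printing empty fields.

```python
def bed4_to_mcscanx_gff(bed_text):
    """Reorder (chrom, start, end, name) rows into (chrom, name, start, end).

    Hypothetical helper: the target column order is the one described in
    this thread, not taken from the MCScanX source code.
    """
    out_lines = []
    for line in bed_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip blank or short lines
        chrom, start, end, name = fields[:4]  # ignore any extra BED columns
        out_lines.append("\t".join([chrom, name, start, end]))
    return "".join(l + "\n" for l in out_lines)


if __name__ == "__main__":
    # Illustrative input only; replace with the contents of your own file.
    sample = "chr1\t100\t200\tgeneA\nchr1\t300\t400\tgeneB\n"
    print(bed4_to_mcscanx_gff(sample), end="")
```

The result would then be written to a file with a .gff suffix, as in the awk command.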

somnya commented 2 years ago

I had the same problem and that definitively solved it!

Thank you so much @jannafierst!

gunjanpandey commented 1 year ago

@somnya Glad to know that you were able to make this program work. Could you please share a few lines of your input file so people know the format? A lot of people are having trouble with file formats. Thanks in advance.

1251531750 commented 2 months ago

@gunjanpandey I'll explain this issue in detail. The strange part is that the author requires a BED file, but this BED file is not standard. The standard format is chromosome name, start, end, gene name. However, what he needs is chromosome name, gene name, start, end, and this file needs to have a .gff suffix. It took me an entire day to figure this out.
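
A quick way to catch the column-order mix-up before running the tool is to check that columns 3 and 4 are integers. The sketch below is only an illustration under the assumptions stated in this thread (four tab-separated columns: chromosome name, gene name, start, end); `check_mcscanx_gff` is a hypothetical helper and not part of MCScanX, and the sample rows are invented.

```python
def check_mcscanx_gff(text):
    """Return a list of (line_number, message) for rows that do not look like
    chromosome, gene name, start, end (the layout described in this thread)."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            problems.append((lineno, "expected 4 tab-separated columns"))
            continue
        _chrom, _gene, start, end = fields
        if not (start.isdigit() and end.isdigit()):
            # Typical symptom of a file still in standard BED order.
            problems.append((lineno, "start/end are not integers; columns may be in BED order"))
    return problems


if __name__ == "__main__":
    # Invented rows: the first is in the expected order, the second in BED order.
    sample = "chr1\tgeneA\t100\t200\nchr1\t300\t400\tgeneB\n"
    for lineno, message in check_mcscanx_gff(sample):
        print(f"line {lineno}: {message}")
```

An empty result only means the rows match this layout; it does not guarantee that the gene names match those in the .blast file.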