weizhongli / cdhit

Automatically exported from code.google.com/p/cdhit
GNU General Public License v2.0

How to deal with a big gene dataset? #136

Open · B-1991-ing opened 1 year ago

B-1991-ing commented 1 year ago

Dear cdhit development team,

I have a big dataset. I predicted genes for each metagenome FASTA using the Prodigal software, which gave me 16 Prodigal-annotated gene files of around 40 GB each. If I merge all 16 .faa files into one, it will be around 640 GB. Is it possible to use cd-hit to remove the duplicated genes from a dataset this big?
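(For reference, the single-pass run being asked about would look something like the sketch below; the 90% identity cutoff, the matching word size -n 5, and the file names are illustrative assumptions rather than values from this thread.)

```bash
# Hypothetical single-pass run on the merged 640 GB protein file.
# -c 0.9 : cluster at 90% identity (-n 5 is the word size for that range)
# -M 0   : no memory limit; -T 0 : use all CPU cores; -d 0 : full deflines
cd-hit -i merged.faa -o merged.nr.faa -c 0.9 -n 5 -M 0 -T 0 -d 0
```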

If that is not possible, could you suggest other ways to remove duplicate genes from such big gene files?

Thank you very much.

Best,

Bing

B-1991-ing commented 1 year ago

Update

I tried to remove duplicated genes from another big dataset (589 GB), but an error occurred after about two hours of running.

(Screenshot 2023-01-27 at 23:56:59, showing the error output)
unavailable-2374 commented 1 year ago

hello

Why don't you try splitting the file, running cd-hit on each piece one by one, and then cat-ing the results together?

In my opinion, there is no bias if you do it that way.
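A minimal sketch of that split-and-merge workflow, assuming the big file has already been split into chunk_*.faa pieces without breaking FASTA records in the middle (all file names and the 90% threshold are illustrative):

```bash
# Cluster each pre-split chunk independently.
for f in chunk_*.faa; do
    cd-hit -i "$f" -o "${f%.faa}.nr.faa" -c 0.9 -n 5 -M 0 -T 0 -d 0
done

# Pool the per-chunk representatives and cluster them once more, so that
# duplicates whose copies landed in different chunks are also removed.
cat chunk_*.nr.faa > pooled.faa
cd-hit -i pooled.faa -o final.nr.faa -c 0.9 -n 5 -M 0 -T 0 -d 0
```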

B-1991-ing commented 1 year ago

Thanks for the reply.

Even if I split my 600 GB file, dereplicate each piece, and then combine the dereplicated files, the combined file still needs a final dereplication pass, and that file is still too big, let's say 200 GB.
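(One way to avoid that single 200 GB final pass is an incremental route using cd-hit-2d, which ships with the cd-hit package: keep a growing non-redundant set and compare each new chunk against it, so no single run ever loads the whole dataset. A sketch, reusing the hypothetical per-chunk outputs chunk_NN.nr.faa from the suggestion above:)

```bash
# Seed the non-redundant set with the first chunk's representatives.
cp chunk_01.nr.faa nr.faa

for f in chunk_{02..16}.nr.faa; do
    # cd-hit-2d writes to novel.faa the sequences of "$f" that have
    # no match at >=90% identity in the current nr.faa
    cd-hit-2d -i nr.faa -i2 "$f" -o novel.faa -c 0.9 -n 5 -M 0 -T 0 -d 0
    cat novel.faa >> nr.faa
done
```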

Bing

KJ-Ma commented 10 months ago

Hello

I have run into the same trouble. My total gene set is about 200 GB. Although no error occurred after 12 hours, it was far too slow to finish: only 1,500,000 of 275,299,588 sequences had been processed.
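(For what it's worth, one way to shrink the input before cd-hit even starts is to strip exact duplicates with a disk-backed Unix sort. A rough sketch, where geneset.faa, the 16G sort buffer, and /scratch are placeholders to adapt:)

```bash
# Linearize FASTA to "header<TAB>sequence", keep one line per unique
# sequence via external-memory sort, then restore the FASTA layout.
# This removes exact duplicates only; near-duplicates are left to cd-hit.
awk '/^>/{if (s) print h "\t" s; h = $0; s = ""; next}
     {s = s $0}
     END{if (s) print h "\t" s}' geneset.faa \
  | sort -u -t$'\t' -k2,2 -S 16G --parallel=8 -T /scratch \
  | awk -F'\t' '{print $1; print $2}' > geneset.exact_nr.faa
```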

So, have you solved this problem? If so, how?

Thanks