How does the #ID field work?

zhangwei2015 / IMonitor

This script is used to analyze immune repertoires sequenced by high-throughput sequencing

How does the #ID field work? #1

Closed (darth-donut closed this issue 7 years ago)

darth-donut commented 7 years ago

In the .structure.gz file, the first field, #ID, has values of the form cpN. Since the values run cp1, cp2, ..., I initially assumed that cp1 refers to the first read and cpN to the Nth read.

However, at one point in the file cpN actually refers to the (N-1)th read, and at another point cpN refers to the (N+1)th read. I know this because I have a reference file against which I can compare the VDJ and CDR3 assignments.

Question is, how does the #ID field work? The numbers don't work as I think they do.

Cheers, Harry

zhangwei2015 commented 7 years ago

Dear Harry,

In the .structure.gz file, the first field, #ID, holds values of the form cp#. The numbering starts at cp1: cp1 is the first read and cpN is the Nth read. For example, if you have 1000 (paired-end) reads, the #ID values will be cp1, ..., cp1000.

Actually, you can check the file "*.change_id.backup.gz" (in the Result directory) to find the relationship between the raw read IDs and the new IDs (cp#).

If you still have some problems, please let me know!

Best Regards! Wei

darth-donut commented 7 years ago

I see, I'll look at *.change_id.backup.gz for the IDs then. My initial assumption matches what you've confirmed, that cpN refers to the Nth read. But according to my reference data, at some point the numbering switches to N+1 (i.e. cp1000 was actually the 1001st read), and later on to N+2. I have yet to see N+3, but that might be because my data isn't large enough (there were 500k reads).
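
For anyone hitting the same thing, here is a rough Python sketch of one way to find where the numbering starts to drift. It is not part of IMonitor. The helper names (`read_id_map`, `offset_changes`) are made up for illustration, and the layout of the backup file is an assumption (tab-separated, new cp ID in the first column, raw read ID in the second), so check a few lines of your own file before relying on it.

```python
import gzip
import re

CP_RE = re.compile(r"^cp(\d+)$")


def read_id_map(path):
    """Yield (new_id, raw_id) pairs from a gzipped mapping file.

    ASSUMPTION: tab-separated text with the new cp ID in column 1 and the
    raw read ID in column 2. Adjust the indices if your file differs.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            fields = line.rstrip("\n").split("\t")
            yield fields[0], fields[1]


def offset_changes(id_pairs, reference_order):
    """Return [(cp_id, offset)] for every point where the offset changes.

    offset = (1-based position of the raw read in reference_order) - N,
    where N is the number in cpN. Pairs whose raw ID is missing from the
    reference, or whose new ID is not of the form cpN, are skipped.
    """
    position = {raw: i for i, raw in enumerate(reference_order, start=1)}
    changes = []
    last_offset = None
    for new_id, raw_id in id_pairs:
        match = CP_RE.match(new_id)
        if match is None or raw_id not in position:
            continue
        offset = position[raw_id] - int(match.group(1))
        if offset != last_offset:
            changes.append((new_id, offset))
            last_offset = offset
    return changes


# Toy data: read r3 is in the reference but never received a cp ID,
# so every later cp number lags the reference position by one.
reference = ["r1", "r2", "r3", "r4", "r5"]
pairs = [("cp1", "r1"), ("cp2", "r2"), ("cp3", "r4"), ("cp4", "r5")]
print(offset_changes(pairs, reference))  # [('cp1', 0), ('cp3', 1)]
```

With real data, the pairs would come from `read_id_map(...)` on the backup file and `reference` from the read order in the reference file. Each reported cp ID marks the first read at which the gap between the cp number and the reference position changes, which is where to look for the cause.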