wdecoster / nanocomp

Comparison of multiple long read datasets
MIT License

Got negative values on weighted histogram plot #73

Closed: najohink closed this 4 weeks ago

najohink commented 1 year ago

Hello,

I am using NanoComp v1.23.1 and got a strange plot after filtering my input FASTQ files (see attached image).

[Image: Screenshot 2023-09-22 163016]

When I ran the same command on the unfiltered FASTQ files, I got normal plots. After filtering the FASTQ files to keep only 1-27 kb reads, I now get negative values in the weighted plots. Is this expected?

Can you also explain the difference between weighted and normalized?

best, S

najohink commented 1 year ago

I forgot to add the photo of the unfiltered fastq output plot:

[Image: Screenshot 2023-09-22 164308]

wdecoster commented 1 year ago

I am very confused and will need to think about this.

najohink commented 1 year ago

I filtered my dataset with FiltLong before running NanoComp and getting the weird result.

In the meantime, I figured out how to do what I wanted by running this:

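import pickle

import matplotlib.pyplot as plt
import numpy

# Load the pickled NanoComp data
with open('barcode03_1-27kb_NanoComp-data.pickle', 'rb') as handle:
    df3 = pickle.load(handle)

# Count reads per 500 bp bin (bin edges from 0 to 29500)
bins = numpy.arange(0, 30000, 500)
h3 = numpy.histogram(df3['lengths'], bins=bins)
plt.bar(h3[1][:-1], height=h3[0], width=450)

# Approximate the bases per bin as bin midpoint * read count
xdata3 = (h3[1][:-1] + h3[1][1:]) / 2
ydata3 = xdata3 * h3[0]
plt.bar(xdata3, ydata3, width=450)

# Fraction of all bases that fall in bins with a midpoint above 25 kb
ydata3[xdata3 > 25000].sum() / ydata3.sum()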

I wanted to know what percentage of the total bases came from my full-length sequence, so I wanted to divide the number of bases in the 26 kb reads by the total number of bases. I also wanted to keep the unexpectedly long reads out of the dataset, hence the filtering with FiltLong.
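The snippet above approximates bases per bin from bin midpoints. As a minimal sketch, the same fraction could also be computed exactly from the individual read lengths, without binning. The function name `fraction_of_bases_above` and the example lengths are made up for illustration and are not part of NanoComp.

```python
import numpy as np


def fraction_of_bases_above(lengths, threshold):
    """Return the fraction of all bases found in reads longer than threshold."""
    lengths = np.asarray(lengths, dtype=np.int64)
    total = lengths.sum()
    if total == 0:
        # No reads (or only empty reads): avoid dividing by zero
        return 0.0
    return float(lengths[lengths > threshold].sum() / total)


# Made-up read lengths, for illustration only
example_lengths = [1200, 5000, 26000, 26100]
fraction = fraction_of_bases_above(example_lengths, 25000)
print(f"{fraction:.3f}")
```

With real data, `df3['lengths']` from the pickle could presumably be passed in place of `example_lengths`, assuming it holds one length per read.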

wdecoster commented 1 year ago

Do the plots without weighting look normal? I will explain what those options mean later, when I'm at the computer...

najohink commented 1 year ago

Yes, the others look normal. Only the two weighted plots have negative values.

wdecoster commented 1 year ago

Normalized plots mean that every dataset in the plot adds up to 1, so datasets with substantial differences in yield can still be compared on length. Without normalization, the raw number of reads is used. Weighted means that the number of bases per bin is used instead of the number of reads per bin (as is also the case in the MinKNOW interface). As such, a read of 25000 bases in the 24000-26000 bin increases that bin's y-axis value by 25000 rather than by 1.
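To make the distinction concrete, here is a minimal sketch of the three variants using `numpy.histogram`. This is not NanoComp's actual plotting code, and the read lengths and bin edges are made up. It only illustrates the definitions given above.

```python
import numpy as np

# Made-up read lengths and bin edges, for illustration only
lengths = np.array([500, 1500, 1500, 25000])
bins = np.array([0, 1000, 2000, 30000])

# Plain histogram: number of reads per bin
read_counts, _ = np.histogram(lengths, bins=bins)

# Weighted histogram: each read contributes its length, giving bases per bin
base_counts, _ = np.histogram(lengths, bins=bins, weights=lengths)

# Normalized: divide by the total so that each dataset sums to 1
normalized_reads = read_counts / read_counts.sum()
normalized_bases = base_counts / base_counts.sum()

print("reads per bin:", read_counts)
print("bases per bin:", base_counts)
print("normalized reads:", normalized_reads)
print("normalized bases:", normalized_bases)
```

Computed this way, the single 25000-base read dominates the weighted histogram even though it is only one of four reads. With positive read lengths, neither the counts nor the normalized values in this sketch can be negative.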

wdecoster commented 1 year ago

Do you think it would be possible to share the data that caused this?

wdecoster commented 4 weeks ago

So I haven't been able to replicate this. Please let me know if someone runs into a similar issue.