Open luidale opened 6 years ago
Dear @luidale, wow you are quick and deeper into the code than I am. I can use the width of the parameter file (instead of the currently hard-coded 15), but I could also add a separate parameter. I think the first option makes most sense, doesn't it? Any preference or suggestion(s)?
This whole v3.0.0 is actually for testing purposes, to see whether the person who requested the feature can now answer the corresponding biological questions. From that perspective, a value of 15 is quite 'safe' and doesn't result in unexpected behaviour for others.
I was also surprised that I noticed the update just 2 hours later (without any tools to follow changes/updates).
I am not sure what is best. Perhaps default to the width of the parameter file, but also offer a parameter with which you can change it? My logic:

A) The idea or benefit of having a wide parameter file width (so that the extreme positions from the peak have a very low percentage) is that low-abundance fragments will be filtered out when they are next to a high-abundance fragment. Right?

B) At the same time, when the length difference is set by the parameter file width (e.g. 15), then for fragments that share one end and differ at the other end by a smaller step (e.g. 10, enough to be biologically significant) only one fragment will be reported (and the other fragment(s) will be lost).

C) An option would be to adjust (decrease the width of) the parameter file, but then (if A is correct) you increase noise by increasing the number of detected low-abundance fragments.

If the logic (A, B, C) makes sense, it would be good if you could also set the parameter smaller than the width of the parameter file. But I am wondering: if the length difference is set smaller (e.g. 6 or 10) than the width (e.g. 15), does it influence the behavior or output of the script? (I am not so deep in the code.)
> A) The idea or benefit of having a wide parameter file width (so that the extreme positions from the peak have a very low percentage) is that low-abundance fragments will be filtered out when they are next to a high-abundance fragment. Right?
That is completely correct. But I never wrote a method that determined these parameters based on training data. Back when I wrote it, I just played with some numbers until it worked out quite well on our datasets.
> B) At the same time, when the length difference is set by the parameter file width (e.g. 15), then for fragments that share one end and differ at the other end by a smaller step (e.g. 10, enough to be biologically significant) only one fragment will be reported (and the other fragment(s) will be lost).
Yes, then only one will be reported like it did before; typically the one that is supported with the highest number of reads. We ran into this issue with a dataset in which snoRNAs and tRNAs were sequenced, including the fragments. And certain fragments have the same start or end as the host but obviously differ in the other position. What distance is kind of acceptable? I honestly don't know.
> C) An option would be to adjust (decrease the width of) the parameter file, but then (if A is correct) you increase noise by increasing the number of detected low-abundance fragments. If the logic (A, B, C) makes sense, it would be good if you could also set the parameter smaller than the width of the parameter file. But I am wondering: if the length difference is set smaller (e.g. 6 or 10) than the width (e.g. 15), does it influence the behavior or output of the script? (I am not so deep in the code.)
Your way of reasoning has guided me to my answer :+1: . Even if your parameter file has a very broad width, let's say 25, it is technically possible that another fragment with a difference of less than 25 (e.g. 15 bp) still passes the filter. This has convinced me that an additional parameter is a better solution than deriving the value from the filter-parameter file.
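To make the discussion concrete, here is a minimal toy sketch of the idea. It is not Flaimapper's actual code: the `Fragment` type, the function `filter_fragments`, and the parameter name `min_length_difference` are all hypothetical, and it assumes that the best-supported fragment wins and that the threshold is exclusive (a difference equal to the threshold is kept).

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical fragment record: genomic start, end and read support."""
    start: int
    end: int
    reads: int


def filter_fragments(fragments, min_length_difference=15):
    """Toy model: among fragments sharing one end, keep the best-supported
    one and drop others whose other end differs by less than the threshold."""
    kept = []
    # Visit fragments from highest to lowest read support.
    for cand in sorted(fragments, key=lambda f: f.reads, reverse=True):
        clash = False
        for k in kept:
            same_start = (cand.start == k.start
                          and abs(cand.end - k.end) < min_length_difference)
            same_end = (cand.end == k.end
                        and abs(cand.start - k.start) < min_length_difference)
            if same_start or same_end:
                clash = True
                break
        if not clash:
            kept.append(cand)
    return sorted(kept)  # ordered by (start, end, reads)


# A host and a fragment sharing the start, with ends 10 nt apart.
host = Fragment(start=100, end=170, reads=500)
frag = Fragment(start=100, end=160, reads=80)

only_host = filter_fragments([host, frag], min_length_difference=15)
both = filter_fragments([host, frag], min_length_difference=10)
```

With the threshold at 15 only `host` survives; lowering it to 10 keeps both, independent of how wide the filter-parameter file is.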
Thanks !
Hi, I just noticed that Flaimapper has been converted to Python 3 and that it can now detect fragments with the same ends. That's cool. Is it possible to add a parameter to change the length difference? I am actually interested in length differences of 5+ nt. If I understand correctly, the current 15 nt length difference comes from the distribution in the parameter file used. That would mean that for a shorter length difference a different parameter file (with a narrower distribution range) should be used. I am using a narrower distribution anyway.