Open BrendanDee opened 9 months ago
df_cts.axo_xy is a prediction from the testdata df_cts.txt. df_cts.txt was a truncated table from a real data table, so it is possible that no true outliers were included. I didn't notice this problem before you told me. I will remake the testdata next week.
I'm unsure whether it's an issue with my implementation or the test data you mentioned. Regarding outlier detection, I know it's primarily done with axo_pred.py, using the LOF, IF, and OC-SVM models, where the script takes the outrider normalization counts and the p-values from Outsingle (.ogs). So far axo_pred.py selects the LOF model for other datasets (GTEx data), but it doesn't seem to detect any outliers: they all return "0" (non-aberrant) in the "label" column of data.axo_xy, and I'm currently unsure what the model deems aberrant genes / outliers. There are genes confirmed aberrant by both outsingle and outrider, yet axo_pred.py seems to detect nothing.
I'm aware that in axo_pred.py the script calculates rank_devi and cts_devi, which axolotl compares against regular gene expression to detect outliers, then generates df_cts.axo_xy, which includes the stats and whether each sample is a non-outlier or an outlier. I could use more information on the outlier detection itself and how axo_pred.py detects outliers.
First, the script axo_pred.py mainly creates a feature table with five features, named *.axo_xy.txt. The "label" column there was only used by me during method development; the 0s in "label" are ignored by the LOF model later. The outlier scores are in axo.txt rather than axo_xy.txt; I keep axo.txt in the same shape as the input file df_cts.txt.
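To make the pipeline described above concrete, here is a minimal sketch (not the actual axo_pred.py code) of fitting an LOF model on a five-feature table like *.axo_xy.txt and extracting per-row scores for an axo.txt-style output. The feature names are illustrative; only the "label" column, which the model ignores, matches the thread.

```python
# Hedged sketch: LOF over a five-feature table, "label" column ignored.
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Toy stand-in for df_cts.axo_xy: five numeric features plus "label",
# which was only used during method development and is dropped here.
df = pd.DataFrame(rng.normal(size=(100, 5)),
                  columns=[f"feat{i}" for i in range(1, 6)])
df["label"] = 0

X = df.drop(columns=["label"]).to_numpy()
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)

# Per-row anomaly scores; in the real pipeline these would go to axo.txt.
scores = lof.negative_outlier_factor_
print(scores.shape)
```

Note that `fit_predict`'s -1/+1 labels are not used here; only the continuous scores matter, which matches the prioritization-over-classification design described below.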
Axo doesn't provide a hard threshold value. The purpose of axo is prioritization rather than classification. You can find the details in our preprint at https://www.biorxiv.org/content/10.1101/2024.01.07.574502v1
Thank you for your feedback.
I see, so the outlier scores are obtained from axo_pred.py and saved in the axo file. The LOF model takes the devi variables (deviations) such as cts_devi and uses them to generate the anomaly scores saved in axo, which then determine the likelihood of a gene exhibiting aberrant expression. Then you manually checked whether they were genuine aberrant/outlier gene events. I hope I understood correctly.
Exactly
I'm interested in how you manually identified the genes responsible for aberrant/outlier gene events. Did you have a database you had to look through manually?
One last question: if an anomaly score is higher or lower, does it increase the likelihood of a gene being aberrant?
> I'm interested in how you manually detected the genes which are responsible for aberrant outlier gene events, did you have a database you had to manually look through?
No database. I searched related publications and collected the testdata.
> 1 last question if an anomaly score is higher or lower does it increase the likelihood of a gene being aberrant
LOF outputs are negative. A lower score means more aberrant.
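This sign convention can be verified directly with sklearn's `LocalOutlierFactor` (a minimal demonstration, not the axolotl pipeline itself): all scores are negative, and a deliberately planted outlier receives the lowest score.

```python
# Demonstration: sklearn LOF scores are negative; lower = more aberrant.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
X[0] = [8.0, 8.0]  # plant one obvious outlier far from the cloud

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = lof.negative_outlier_factor_

print(scores.max())            # still below 0: all scores are negative
print(int(np.argmin(scores)))  # index of the most aberrant point
```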
Hi Xu, I'm on the same project as Brendan and we are trying to re-create your methods on about 50 GTEx datasets. Would you mind sharing the distribution of anomaly scores you obtained from your GTEx datasets? Just to make sure we implemented it correctly: we are getting almost all anomaly scores between [-2, 0]. Thank you.
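For comparing score distributions across datasets without sharing full files, one option is to summarize each axo score vector with a few quantiles. This is a hedged sketch under an assumed workflow, with synthetic scores standing in for real axo output; the `summarize` helper is hypothetical, not part of axolotl.

```python
# Hedged sketch: summarize an anomaly-score vector with quantiles,
# to sanity-check a distribution like the reported [-2, 0] range.
import numpy as np

def summarize(scores):
    """Hypothetical helper: min, 5%, 50%, 95% quantiles and max."""
    qs = np.quantile(scores, [0.0, 0.05, 0.5, 0.95, 1.0])
    return dict(zip(["min", "q05", "median", "q95", "max"], qs))

# Synthetic LOF-like scores: clustered near -1 with a heavier left tail,
# loosely matching the pattern described above.
rng = np.random.default_rng(1)
scores = -1.0 - np.abs(rng.normal(scale=0.2, size=1000))
print(summarize(scores))
```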
Were there any outliers or aberrant genes in the sample data df_cts that axo_pred.py found? I recreated the environment according to the instructions, and the "label" column in df_cts.axo_xy was always 0 (the model's prediction) after running demo.sh on Ubuntu.