How to understand the parameter "Inactive_Gene_ratio_THR"

ZixiangPAN commented 5 months ago

Dear Dr Tomofuji:

Many thanks for your reply to my last issue. This time I would like to ask about the parameter "Inactive_Gene_ratio_THR" in the function run_scLinaX. I ran the QC function run_RefGeneQC on both scRNA-seq and Multiome datasets. However, the minimum values of "Mean_AR_target" and "Mean_AR_reference" in all my datasets are no less than 0.2, so no gene can pass the thresholds mentioned in your bioRxiv manuscript (0.05, 0.075 or 0.1). May I ask:

(1) How should I understand the parameter "Inactive_Gene_ratio_THR"? Does it mean that the lower the threshold, the more potential escapees will be filtered out?

(2) Do you have a specific criterion for setting this parameter? Do you have any suggestions for setting it on 10x scRNA-seq and Multiome data?

(3) I noticed that the higher the threshold (the looser the standard), the more grouped cells appear in the output, but the proportion of unassigned cells (cells whose SNPs are phased to Xa on both Allele 1 and Allele 2) also rises. Does that mean that the stricter the criterion I set, the more cell data will be wasted? How should I interpret this phenomenon, and how should the "Unassigned cells" be understood biologically? Is there a dilemma here? On the one hand, a lower threshold (a tougher standard) leaves fewer genes (and SNPs) for the following steps, so fewer escapees are included and the results become more reliable. On the other hand, from a statistical point of view, a tougher standard means fewer SNPs are fed to the phasing step, which makes the phasing less reliable?

Thank you very much.

Zixiang

ytomofuji commented 5 months ago

Hi Zixiang,

Thank you for using scLinaX!

(1) The Inactive_Gene_ratio_THR is a threshold used to define reliable inactive genes that can be considered as reference. As you mentioned, selecting a lower threshold will filter out more potential escape genes.

(2) Typically, this value is determined by examining the distribution of values for each dataset. This is because dataset-specific factors like doublets or the soup effect can cause this value to increase, leading to variations in Inactive_Gene_ratio_THR across datasets.

(3) Loosening the threshold may result in escapee genes becoming part of the reference, which can interfere with phasing and assignment of the inactive X chromosomes. If you have too many unassigned cells, it could indicate that the definition of the reference gene set is not appropriate, and I would recommend tightening the threshold.
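
To get a feel for points (1) to (3), one can count how many genes would qualify as reference at several candidate thresholds. The sketch below is in Python and uses made-up values. It assumes the run_RefGeneQC output has been exported as a per-gene table with Mean_AR_target and Mean_AR_reference columns, and that a gene qualifies when both values fall below the threshold. Both are assumptions for illustration, not a description of scLinaX internals. The gene names and the helpers reference_genes and sweep are hypothetical.

```python
# Hypothetical per-gene QC values; in practice these would come from the
# table returned by run_RefGeneQC (exported from R, e.g. as CSV).
qc_rows = [
    {"gene": "geneA", "Mean_AR_target": 0.03, "Mean_AR_reference": 0.04},
    {"gene": "geneB", "Mean_AR_target": 0.06, "Mean_AR_reference": 0.09},
    {"gene": "geneC", "Mean_AR_target": 0.21, "Mean_AR_reference": 0.25},
    {"gene": "geneD", "Mean_AR_target": 0.08, "Mean_AR_reference": 0.31},
    {"gene": "geneE", "Mean_AR_target": 0.42, "Mean_AR_reference": 0.38},
]


def reference_genes(rows, thr):
    """Genes whose two ratios are both below thr (assumed rule, not the package's)."""
    return [
        r["gene"]
        for r in rows
        if r["Mean_AR_target"] < thr and r["Mean_AR_reference"] < thr
    ]


def sweep(rows, thresholds):
    """Map each candidate threshold to the number of genes it retains."""
    return {thr: len(reference_genes(rows, thr)) for thr in thresholds}


counts = sweep(qc_rows, [0.05, 0.075, 0.1, 0.3])
for thr, n in counts.items():
    print(f"THR={thr}: {n} candidate reference genes")
```

Checking which genes enter the set as the threshold loosens may help to spot the point where likely escapees start to be included.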

It seems a bit high for both "Mean_AR_target" and "Mean_AR_reference" to have minimum values of 0.2 or higher. Could your data be influenced by a strong soup effect (typically observed in snRNA-seq) or a high number of doublets? Given the nature of scLinaX, it can be affected by factors like these. Trying pre-defined reference genes available within the package might be helpful.

Please feel free to reach out if you have any further questions or concerns!

Thank you so much.

Yoshi

ZixiangPAN commented 4 months ago

Dear Dr Tomofuji:

Thank you for your interpretation. I still have two additional questions:

(1) How do you specify the Inactive_Gene_ratio_THR according to the distribution of genes' Mean_AR_target and Mean_AR_reference values (mean value, median value, q25/q75/q90 value or something else to guarantee the number of genes preserved)?

(2) Given that scLinaX filters out quite a large proportion of cells (cells that fail QC and unassigned cells), is it appropriate to analyse XCI skewing in a group of cells based on scLinaX? Would it introduce some kind of bias into the skewing analysis? Do you have any suggestions for doing this kind of analysis with this method?

Thank you very much.

Zixiang

ytomofuji commented 4 months ago

Dear Zixiang,

(1) Regarding how to set the thresholds for Mean_AR_target and Mean_AR_reference, I suggest determining these values based on their distributions. We actually created plots of Mean_AR_target and Mean_AR_reference to guide us in setting the thresholds. In cases like the AIDA dataset, which is relatively well QC'd with fewer doublets and ambient RNA, most inactive genes aggregate in the low Mean_AR_target and low Mean_AR_reference regions. We set our thresholds accordingly based on these observations. I recommend plotting Mean_AR_target and Mean_AR_reference values to assist in setting your thresholds.

(2) While we haven't used scLinaX for XCI skewing analysis ourselves, if a large number of cells fail QC, it might suggest that there are data quality issues (e.g. doublets/ambient RNA) affecting the scLinaX analysis. Therefore, it might be better to avoid using it for skewing analysis if a substantial number of cells fail QC.
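
For the plotting suggested in (1), a scatter plot or 2D histogram in any plotting tool should work. As a dependency-free sketch in Python, the joint distribution can also be summarised as counts on a coarse grid, which shows whether genes aggregate in the low Mean_AR_target / low Mean_AR_reference corner. The values, the 0.1 bin width, and the helpers bin_index and grid_counts are hypothetical and only illustrate the idea.

```python
from collections import Counter

BIN_WIDTH = 0.1  # assumed bin size for a coarse look at the distribution

# Hypothetical (Mean_AR_target, Mean_AR_reference) pairs, one per gene.
pairs = [
    (0.02, 0.03), (0.04, 0.06), (0.07, 0.02),  # low/low cluster
    (0.22, 0.27), (0.35, 0.41), (0.44, 0.38),
]


def bin_index(value, width=BIN_WIDTH):
    """Index k of the bin [k*width, (k+1)*width) that contains value."""
    return int(value // width)


def grid_counts(points, width=BIN_WIDTH):
    """Count genes per (target bin, reference bin) cell."""
    return Counter((bin_index(t, width), bin_index(r, width)) for t, r in points)


grid = grid_counts(pairs)
for (t_bin, r_bin), n in sorted(grid.items()):
    print(
        f"target [{t_bin * BIN_WIDTH:.1f}, {(t_bin + 1) * BIN_WIDTH:.1f}) "
        f"reference [{r_bin * BIN_WIDTH:.1f}, {(r_bin + 1) * BIN_WIDTH:.1f}): {n}"
    )
```

A dense cell near the origin that is clearly separated from the rest would be one possible place to put the threshold. If no such cluster exists, that may point to the data quality issues discussed in this thread.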

Best, Yoshi

ytomofuji / scLinaX

How to understand the parameter "Inactive_Gene_ratio_THR" #2