Implementing normalized gene-score calculation

Xieeeee commented 8 months ago

The gene-score module is very useful for integrating with RNA-seq datasets. The current implementation calculates the raw sum of interactions without normalization, although the normalized metric shows better performance and may be worth adding to the gene-score module.

DavidWarrenKatz commented 4 months ago

I agree, this pipeline should include total contact normalization for each cell.

Xieeeee commented 4 months ago

A follow-up: I ended up doing the normalization manually on the raw GAD scores, following the scGAD paper's description of global GAD:

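import numpy as np
import pandas as pd

# Raw gene scores (rows: cells, columns: genes)
gene_hdf = pd.read_hdf("../hicluster/imputed_matrix/10kb_resolution/genescore/geneimputescore.hdf", key='data')

# Keep only long genes (at least 10 bins); gene_meta is a gene metadata table loaded elsewhere, with a 'bin_len' column
lgene = gene_meta.loc[gene_meta['bin_len'] >= 10].index
gene_hdf_filt = gene_hdf[np.intersect1d(lgene, gene_hdf.columns)]

# Normalize for read depth: divide each cell (row) by its total score
row_norm = gene_hdf_filt.div(gene_hdf_filt.sum(axis=1), axis=0)

# Z-score each gene (column) across cells, using the sample standard deviation (n - 1)
mean_arr = row_norm.sum(axis=0)/row_norm.shape[0]
sd_arr = np.sqrt(((row_norm - mean_arr) ** 2).sum(axis=0) / (row_norm.shape[0] - 1))
norm_gene_hdf = (row_norm - mean_arr).div(sd_arr, axis=1)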
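
For anyone who wants to try these steps without the HDF file, here is a minimal self-contained sketch on toy data. The function name `global_gad_zscore` and the toy values are made up for illustration. It uses pandas' `mean` and `std(ddof=1)`, which should match the explicit formulas as long as there are no missing values.

```python
import numpy as np
import pandas as pd


def global_gad_zscore(scores: pd.DataFrame) -> pd.DataFrame:
    """Depth-normalize each cell (row), then z-score each gene (column).

    Hypothetical helper for illustration; not part of scHiCluster.
    """
    # Divide every row by its total so that each cell sums to 1
    row_norm = scores.div(scores.sum(axis=1), axis=0)
    # Per-gene mean and sample standard deviation across cells
    mean_arr = row_norm.mean(axis=0)
    sd_arr = row_norm.std(axis=0, ddof=1)
    return (row_norm - mean_arr).div(sd_arr, axis=1)


# Toy table: 4 cells x 3 genes (values are arbitrary)
toy = pd.DataFrame(
    [[1.0, 2.0, 7.0], [4.0, 4.0, 2.0], [3.0, 1.0, 6.0], [2.0, 5.0, 3.0]],
    index=["cell1", "cell2", "cell3", "cell4"],
    columns=["geneA", "geneB", "geneC"],
)
z = global_gad_zscore(toy)

# Each gene should now have mean ~0 and sample standard deviation ~1
print(z.mean(axis=0).round(6))
print(z.std(axis=0, ddof=1).round(6))
```

A gene with identical normalized scores in every cell would have a standard deviation of 0 and produce NaN here, so such columns may need to be dropped first.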

DavidWarrenKatz commented 4 months ago

Thank you. Simply normalizing the imputed matrices from scHiCluster so that each one has a Frobenius norm of 1 improved the clustering a lot for me.
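
A minimal sketch of that kind of scaling, assuming each cell's imputed matrix is available as a dense NumPy array. The function name `frobenius_normalize` is hypothetical and not part of scHiCluster, and the toy matrix is only there to make the example runnable.

```python
import numpy as np


def frobenius_normalize(mat):
    """Scale a matrix so that its Frobenius norm equals 1.

    Hypothetical helper for illustration; an all-zero matrix is returned unchanged.
    """
    mat = np.asarray(mat, dtype=float)
    norm = np.linalg.norm(mat, ord="fro")
    if norm == 0:
        return mat.copy()
    return mat / norm


# Toy 2 x 2 "contact matrix" with Frobenius norm sqrt(3^2 + 4^2) = 5
toy_matrix = np.array([[3.0, 0.0], [0.0, 4.0]])
scaled = frobenius_normalize(toy_matrix)

print(scaled)
print(np.linalg.norm(scaled, ord="fro"))
```

Because every cell is divided by a single scalar, the relative contact pattern within a cell is unchanged; only the overall scale is made comparable across cells.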
