Closed varsh1090 closed 7 years ago
Files in - /ufrc/zhou/share/projects/bioinformatics/SCLC/sclc-scripts/SCLCFileTransform
I used -
library(qdapTools)
X <- t(mtabulate(lapply(split(df$V2, df$V1), `:`, length(unique(df$V2)))))
Reference - http://stackoverflow.com/questions/36381402/convert-2-column-dataframe-into-completed-binary-matrix
It worked but I could not get it to print the row names.
Input -
Gene   Patient
Gene1  1
Gene2  2
Gene3  3
Output -
Gene1  Gene2
0      1
1      1
0      0
Desired output -
Patient  Gene1  Gene2
1        0      1
2        1      1
3        1      1
I used this:
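import pandas as pd
# Read the mutation table; the four column names are supplied explicitly.
mutations = pd.read_table('data/4dataset_nonsilent.txt',
                          names=['gene', 'patient', 'effect', 'categ'])
# Count mutations per (patient, gene): patients become rows, genes become columns.
mat = pd.crosstab(mutations['patient'], mutations['gene'])
# Collapse counts to 0/1 so each gene is counted at most once per patient.
mat = mat.apply(lambda x: x > 0) * 1
# The patient IDs are the index, so they are written as the first column.
mat.to_csv('gene_sample_crosstab.tsv', sep='\t')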
The output file is /ufrc/zhou/share/projects/bioinformatics/SCLC/sclc-scripts/results/gene_sample_crosstab.tsv.
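Since the original problem was missing row names, here is a small sketch of how the crosstab approach keeps them. The toy data frame and the in-memory buffer are assumptions for illustration only (the real data live in the file above); the toy values are chosen to reproduce the "Desired output" table.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the real mutation table.
toy = pd.DataFrame({
    'gene': ['Gene2', 'Gene1', 'Gene2', 'Gene1', 'Gene2'],
    'patient': [1, 2, 2, 3, 3],
})

# Patients become the row index, genes become the columns.
toy_mat = pd.crosstab(toy['patient'], toy['gene'])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_mat.to_csv(buffer, sep='\t')
print(buffer.getvalue())

# Reading back with index_col=0 restores the patient IDs as row labels.
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t', index_col=0)
print(restored)
```

When loading the real TSV, the same `index_col=0` argument should bring the patient column back as the row labels.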
Thanks, does this count each gene mutated for each sample only once? Just making sure.
@varsh1090 yes, that is done by this line:
mat = mat.apply(lambda x: x > 0) * 1
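To illustrate what that line does, here is a minimal sketch on made-up data where one patient has two mutations in the same gene. The toy data frame and the names `counts`, `binary`, and `binary_alt` are hypothetical, not part of the actual script.

```python
import pandas as pd

# Hypothetical toy data: patient 2 has two separate mutations in Gene1.
toy = pd.DataFrame({
    'gene': ['Gene1', 'Gene1', 'Gene2', 'Gene2'],
    'patient': [2, 2, 2, 1],
})

# crosstab counts rows, so the (patient 2, Gene1) cell holds 2 here.
counts = pd.crosstab(toy['patient'], toy['gene'])
print(counts)

# The comparison gives True/False; multiplying by 1 turns that into 1/0.
binary = counts.apply(lambda x: x > 0) * 1
print(binary)

# An equivalent spelling of the same step.
binary_alt = (counts > 0).astype(int)
```

Either spelling should give the same 0/1 matrix; the second one just avoids the column-wise `apply`.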
Got it, can we verify the results by checking a few samples and the genes mutated?
I'm unsure what you want to test specifically, but here is a test to verify that all (patient, gene) pairs in the original data yield a nonzero value in the final matrix:
patients = mutations['patient'].values
genes = mutations['gene'].values
for patient, gene in zip(patients, genes):
    assert mat.loc[patient, gene] != 0
(no errors)
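The test above only checks that observed pairs are nonzero. Two further checks could cover the "only once" requirement: that every entry is 0 or 1, and that the number of ones matches the number of distinct (patient, gene) pairs. This is a sketch on hypothetical toy data; with the real data, `mutations` and `mat` would be the objects built earlier in the thread.

```python
import pandas as pd

# Hypothetical toy data; patient 2 has a duplicated Gene1 entry.
mutations = pd.DataFrame({
    'gene': ['Gene1', 'Gene1', 'Gene2', 'Gene2'],
    'patient': [2, 2, 2, 1],
})
mat = pd.crosstab(mutations['patient'], mutations['gene'])
mat = mat.apply(lambda x: x > 0) * 1

# 1. Every entry is either 0 or 1.
assert mat.isin([0, 1]).all().all()

# 2. The number of ones equals the number of distinct (patient, gene) pairs,
#    so duplicates are not counted twice and no extra ones appear.
n_pairs = len(mutations[['patient', 'gene']].drop_duplicates())
assert mat.values.sum() == n_pairs
```

Together with the loop above, these checks suggest that the ones sit exactly on the observed pairs and nowhere else.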
Ok thanks!
Note: Please count each gene mutated for each sample only once, so for each sample (row) and each gene, the value is either 0 (absence of mutation) or 1 (presence of mutation).