zhoulab / sclc-scripts

scripts for "Significantly mutated genes and regulatory pathways in SCLC—a meta-analysis"
https://doi.org/10.1016/j.cancergen.2017.05.003

Transform nonsilent mutation data #10

varsh1090 closed 7 years ago

varsh1090 commented 7 years ago
  1. Input file - 4dataset_nonsilent.txt
  2. Output file format -

image

Note: Please count each gene mutated for each sample only once, so for each sample (row) and each gene, the value is either 0 (absence of mutation) or 1 (presence of mutation).

varsh1090 commented 7 years ago

Files in - /ufrc/zhou/share/projects/bioinformatics/SCLC/sclc-scripts/SCLCFileTransform

varsh1090 commented 7 years ago

I used -

library(qdapTools)
X <- t(mtabulate(lapply(split(df$V2, df$V1), `:`, length(unique(df$V2)))))

Reference - http://stackoverflow.com/questions/36381402/convert-2-column-dataframe-into-completed-binary-matrix

It worked but I could not get it to print the row names.

Input -

Gene   Patient
Gene1  1
Gene2  2
Gene3  3

Output -

Gene1  Gene2
0      1
1      1
0      0

Desired output -

Patient  Gene1  Gene2
1        0      1
2        1      1
3        1      1

victorlin commented 7 years ago

I used this:

import pandas as pd
mutations = pd.read_table('data/4dataset_nonsilent.txt',
                          names=['gene', 'patient', 'effect', 'categ'])
mat = pd.crosstab(mutations['patient'], mutations['gene'])
mat = mat.apply(lambda x: x > 0) * 1
mat.to_csv('gene_sample_crosstab.tsv', sep='\t')

The output file is /ufrc/zhou/share/projects/bioinformatics/SCLC/sclc-scripts/results/gene_sample_crosstab.tsv.
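
As a minimal sketch of what the crosstab and thresholding steps do, the toy example below uses made-up (gene, patient) rows, not rows from 4dataset_nonsilent.txt, and includes one repeated pair. The column names follow the snippet above; `binary_matrix` is a hypothetical helper name used only for illustration.

```python
import pandas as pd


def binary_matrix(mutations: pd.DataFrame) -> pd.DataFrame:
    """Return a patient-by-gene 0/1 matrix from long-format mutation rows."""
    # crosstab counts how many rows exist for each (patient, gene) pair.
    counts = pd.crosstab(mutations['patient'], mutations['gene'])
    # Any count above zero becomes 1, so repeated mutations collapse to one.
    return (counts > 0).astype(int)


# Made-up data: patient P1 has GeneA listed twice.
toy = pd.DataFrame({
    'gene': ['GeneA', 'GeneA', 'GeneB', 'GeneA'],
    'patient': ['P1', 'P1', 'P1', 'P2'],
})
toy_mat = binary_matrix(toy)
print(toy_mat)
```

In this sketch the repeated (P1, GeneA) row contributes a single 1, and the patient IDs are kept as the row index, so they are written out as the first column when the matrix is saved with `to_csv`.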

varsh1090 commented 7 years ago

Thanks, does this count each gene mutated for each sample only once? Just making sure.

victorlin commented 7 years ago

@varsh1090 yes, that is done by this line:

mat = mat.apply(lambda x: x > 0) * 1

varsh1090 commented 7 years ago

Got it, can we verify the results by checking a few samples and the genes mutated?

victorlin commented 7 years ago

I'm unsure what you want to test specifically, but here is a test to verify that all (patient, gene) pairs in the original data yield a nonzero value in the final matrix:

patients = mutations['patient'].values
genes = mutations['gene'].values

for patient, gene in zip(patients, genes):
    assert mat.loc[patient, gene] != 0

(no errors)
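
The test above checks one direction only: every pair in the input maps to a nonzero cell. A complementary check, sketched here on made-up toy data rather than the real input file, would confirm that the matrix holds only 0 and 1 and that every 1 corresponds to a pair present in the input. The variable names `observed`, `in_matrix`, and `n_unique_pairs` are illustrative assumptions.

```python
import pandas as pd

# Made-up long-format data with one repeated (patient, gene) pair.
mutations = pd.DataFrame({
    'gene': ['GeneA', 'GeneA', 'GeneB', 'GeneA'],
    'patient': ['P1', 'P1', 'P1', 'P2'],
})
mat = (pd.crosstab(mutations['patient'], mutations['gene']) > 0).astype(int)

# 1. Every entry must be either 0 or 1.
assert mat.isin([0, 1]).all().all()

# 2. The cells equal to 1 must be exactly the pairs seen in the input.
observed = set(zip(mutations['patient'], mutations['gene']))
in_matrix = {(p, g) for p in mat.index for g in mat.columns
             if mat.loc[p, g] == 1}
assert in_matrix == observed

# 3. The number of ones must equal the number of distinct pairs.
n_unique_pairs = len(mutations.drop_duplicates(['patient', 'gene']))
assert int(mat.values.sum()) == n_unique_pairs
```

The same three assertions could be run against the real `mutations` and `mat` objects from the earlier snippet; the nested loop in step 2 is simple but may be slow on a large matrix.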

varsh1090 commented 7 years ago

Ok thanks!