Input to impute function

valentinaOpazo commented 1 year ago

Hello @xindong95 , I'm interested in using SCRIP software to identify gene targets of some transcription factors that apparently have an important activity on my samples. I processed my scATAC-seq data with Signac pipeline, so I exported peaks by cell-matrix from a Signac object and then I covert it to .h5 format. I planned to use impute function to later use the function target but when I use this code SCRIP impute -i $input_path/peaks_by_cell.h5 -s hs -f mtx -p Target_FLi1 --factor FLI1 I got the next error:

Traceback (most recent call last): File "/home/vale/.local/bin/SCRIP", line 8, in sys.exit(main()) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/start.py", line 39, in main run_impute(args) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 187, in run_impute feature_matrix = sc.read_10x_h5(feature_matrix_path, gex_only=False) File "/home/vale/.local/lib/python3.8/site-packages/scanpy/readwrite.py", line 195, in read_10x_h5 adata = _read_legacy_10x_h5(filename, genome=genome, start=start) File "/home/vale/.local/lib/python3.8/site-packages/scanpy/readwrite.py", line 221, in _read_legacy_10x_h5 _collect_datasets(dsets, f[genome]) File "/home/vale/.local/lib/python3.8/site-packages/scanpy/readwrite.py", line 253, in _collect_datasets for k, v in group.items(): AttributeError: 'Dataset' object has no attribute 'items'

I don't understand if the error is because the input file isn't correct or something else. Its possible to use another format as an input file for this function?

xindong95 commented 1 year ago

Thanks for your interest in our tool. It seems that the h5 file that you generated failed to be read by scanpy. You can have a try on whether the h5 file can be read directly by the scanpy.read_10x_h5 function.

Another way is to change your Signac object to a SingleCellExperiment object and store it in rds, then read it in Python by following the function:

def read_SingleCellExperiment_rds(input_RDS):
    import anndata2ri
    from rpy2.robjects import r
    anndata2ri.activate()
    rscript = 'readRDS("{RDS_file_path}")'.format(RDS_file_path = input_RDS)
    adata = r(rscript)
    adata.var.columns = [str(i) for i in adata.var.columns]
    adata.obs.columns = [str(i) for i in adata.obs.columns]
    return data

Then you can store the adata to h5ad format that SCRIP impute function supports as well. You might need to install the anndata2ri and rpy2 libraries first.

Sorry for the inconvenience and I will add MTX format support ASAP.

valentinaOpazo commented 1 year ago

Thank you for your quick answer, it is very useful! I'm gonna do what you proposed it

valentinaOpazo commented 1 year ago

Hi again @xindong95 ! I followed your advice and I converted my Signac object to SingleCellExperiment object and then saved it as an h5ad file. But now, when I run the impute function I have another error, specifically with _search_reffactor function (the imputed cell beds files are correctly created). Apparently, the problem is with giggle

My code: SCRIP impute -i combined.subsampled3.h5ad -s hs -p Test2 --factor FLI1

The output message:

INFO 2023-09-13 17:05:52 Generating beds...
INFO 2023-09-13 17:05:52 Start generating cells beds ... /home/vale/.local/lib/python3.8/site-packages/anndata/_core/anndata.py:1117: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_sub[k] = df_sub[k].cat.remove_unused_categories() INFO 2023-09-13 17:06:33 Finished generating cells beds! INFO 2023-09-13 17:06:33 Start searching beds from index ... giggle: Unknown option -i Try `giggle --help' for more information. multiprocessing.pool.RemoteTraceback:

Traceback (most recent call last): File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) File "/usr/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 31, in search_ref_factor subprocess.run(cmd, shell=True, check=True) File "/usr/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command 'giggle search -i "/mnt/c/Users/56966/OneDrive - Universidad de Concepción/FCV/single-BC/MergeAllBreast/TR_human" -q "Test2/imputation//imputed_beds/BC1AAACTGCAGAATCAAC-1.bed.gz" -s -f JUNB > "Test2/imputation//JUNB/imputed_results_JUNB/BC1_AAACTGCAGAATCAAC-1.txt" ' returned non-zero exit status 1. """

The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/vale/.local/bin/SCRIP", line 8, in sys.exit(main()) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/start.py", line 39, in main run_impute(args) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 218, in run_impute impute(input_mat_adata=feature_matrix, File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 97, in impute search_ref_factor_batch(f'{path}/imputed_beds/', f'{path}/{impute_factor}/imputedresults{impute_factor}/', ref_path, impute_factor, n_cores) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 45, in search_ref_factor_batch p.starmap(search_ref_factor, args) File "/usr/lib/python3.8/multiprocessing/pool.py", line 372, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get raise self._value subprocess.CalledProcessError: Command 'giggle search -i "/mnt/c/Users/56966/OneDrive - Universidad de Concepción/FCV/single-BC/MergeAllBreast/TR_human" -q "Test2/imputation//imputed_beds/BC1AAACTGCAGAATCAAC-1.bed.gz" -s -f JUNB > "Test2/imputation//JUNB/imputed_results_JUNB/BC1_AAACTGCAGAATCAAC-1.txt" ' returned non-zero exit status 1.

I would like to mention that I downloaded the TR_human database from zenodo and extracted it with tar xvzf just as the documentation indicates. Also, I run the config function with the proper path (the same path that appears in "giggle search -i ...").

xindong95 commented 1 year ago

Hi, can you try giggle search in the terminal and paste the results?

The giggle version of SCRIP built-in is giggle v0.6.3 (the newest), and the command should output:

giggle, v0.6.3
usage:   giggle search -i <index directory> [options]
         options:
             -i giggle index directory
             -r <regions (CSV)>
             -q <query file>
             -o give reuslts per record in the query file (omits empty results)
             -c give counts by indexed file
             -s give significance by indexed file (requires query file)
             -v give full record results
             -f print results for files that match a pattern (regex CSV)
             -g genome size for significance testing (default 3095677412)
             -l list the files in the index

also, you can try the following command and look through if there are any errors.

giggle search -i "/mnt/c/Users/56966/OneDrive - Universidad de Concepción/FCV/single-BC/MergeAllBreast/TR_human" -q "Test2/imputation//imputed_beds/BC1_AAACTGCAGAATCAAC-1.bed.gz" -s -f JUNB > "Test2/imputation//JUNB/imputed_results_JUNB/BC1_AAACTGCAGAATCAAC-1.txt"

Looking forward to your any feedbacks. Thanks!

valentinaOpazo commented 1 year ago

Thanks for your answer, I really appreciate your help. When I ran giggle search got this message:

Command 'giggle' not found,

So, I assumed that I needed to install giggle independently to SCRIP (it could be helpful to add it to the SCRIP requirement) Finally SCRIP is running well !!

xindong95 commented 1 year ago

Wonderful!

Installing SCRIP with GitHub will install the giggle automatically, but it may not with pip. I'll consider removing the giggle in installation via GitHub and adding a step-by-step giggle install instruction in the README, or making installing giggle as a function of SCRIP.

Thanks for all your feedback and I hope SCRIP is useful in your work!

valentinaOpazo commented 1 year ago

It's me again. Once impute function ended I ran target and everything worked fine, so I obtained a .h5ad file. But now I don't understand what exactly information have this file and what structure it has. It's possible for example have a list of top100 target genes of FLI1(based on some parameters calculated by the target function). Also, I saw that in the example of human fetal organs you converted the .h5ad file into 10x mtx format file to easily read into R, it is possible that you could share how you did it?

Sorry for all the questions, but I'm really interested in the results that I can obtain from SCRIP

xindong95 commented 1 year ago

The .h5ad file contains the values of the targets that you inferred with TF. You can read it in Python using the scanpy.read_h5ad function or in R using the seuratdisk package. It is similar to a single-cell RNA-seq file, where the features are genes and the observations are cells. You can easily check the top targets in each cell or perform some clustering analysis, or even differential targets analysis using variable gene functions. You can use the functions in Seurat or scanpy for these purposes.

If you would like to save .h5ad as 10X MTX files, you can use the following functions:

def write_to_mtx(data, path):
    if not os.path.exists(path):
        os.makedirs(path)
    pd.DataFrame(data.var.index, data.var.index).to_csv(os.path.join(path, "genes.tsv"), sep="\t", header=False)
    pd.DataFrame(data.obs.index).to_csv(os.path.join(path, "barcodes.tsv"), sep = "\t", index=False, header=False)
    data.obs.to_csv(os.path.join(path, "metadata.tsv"), sep = "\t", index=False, header=False)
    io.mmwrite(os.path.join(path, "matrix.mtx"), data.X.T)

or simply from SCRIP.utilities.utils import write_to_mtx to use the function.

If you have any questions or feedback about SCRIP, please feel free to contact me! I am happy to assist you and improve SCRIP to make it more user-friendly. I would appreciate it if you could use SCRIP for your future analysis and recommend it to others who might be interested. Our goal is to make SCRIP a valuable tool for scATAC-seq data analysis and help the communities discover new biological insights.

valentinaOpazo commented 1 year ago

Thank you !! I analyzed my data in parallel with Python (scanpy package) and R (after to convert h5ad to mtx format file), and both showed similar results. However, I ran SCRIP (impute and target) for seven different transcription factors (each one run independently), and the top 20 target genes are very similar between these seven TFs. Also, the top20 targets of two TFs differ only in six genes, even have the same order and value (the six different target genes are marked with "*").

Top20 target genes for Transcription factor "A"

Target Value

RRM2B 1,57402666

SYNCRIP 1,49569685

NTAQ1 1,43606113

UBR5-AS1 1,28829231

PPP2R5A 1,24412231

HBP1* 1,23833557

NUDCD1 1,23052868

PRRC2A* 1,21214999

CDK13 1,19023322

HSP90AB1 1,17828139

SUCO 1,16173209

UBE2D3 1,15968754

DNAI4 1,14542942

RALY 1,14261838

RNF139-AS1 1,13604039

ATF3 1,13588482

TOB1-AS1* 1,13249802

CGGBP1 1,12623713

RBM15-AS1 1,11772619

ZBTB20 1,11714044

Target	Value
RRM2B	1,57402666
SYNCRIP	1,49569685
NTAQ1	1,43606113
UBR5-AS1	1,28829231
PPP2R5A	1,24412231
HBP1*	1,23833557
NUDCD1	1,23052868
PRRC2A*	1,21214999
CDK13	1,19023322
HSP90AB1	1,17828139
SUCO	1,16173209
UBE2D3	1,15968754
DNAI4	1,14542942
RALY	1,14261838
RNF139-AS1	1,13604039
ATF3	1,13588482
TOB1-AS1*	1,13249802
CGGBP1	1,12623713
RBM15-AS1	1,11772619
ZBTB20	1,11714044

Top20 target genes for Transcription factor "B"

Target Value

RRM2B 1,57402666

SYNCRIP 1,49569685

NTAQ1 1,43606113

UBR5-AS1 1,28829231

ZNF687-AS1* 1,25078829

PPP2R5A 1,24412231

NUDCD1 1,23052868

FBRS* 1,20798469

ELL* 1,19485175

CDK13 1,19023322

HSP90AB1 1,17898713

SUCO 1,16173209

UBE2D3 1,15968754

DNAI4 1,14542942

RALY 1,14261838

RNF139-AS1 1,13604039

ATF3 1,13585091

CGGBP1 1,12623713

RBM15-AS1 1,11772619

ZBTB20 1,11714044

Target	Value
RRM2B	1,57402666
SYNCRIP	1,49569685
NTAQ1	1,43606113
UBR5-AS1	1,28829231
ZNF687-AS1*	1,25078829
PPP2R5A	1,24412231
NUDCD1	1,23052868
FBRS*	1,20798469
ELL*	1,19485175
CDK13	1,19023322
HSP90AB1	1,17898713
SUCO	1,16173209
UBE2D3	1,15968754
DNAI4	1,14542942
RALY	1,14261838
RNF139-AS1	1,13604039
ATF3	1,13585091
CGGBP1	1,12623713
RBM15-AS1	1,11772619
ZBTB20	1,11714044

What do you think about this result? It is an intrinsic property of my data or maybe some artifact is interfering with my results? How can I validate that SCRIP ran well and these are correct results? maybe a transcription factor that works like control positive

xindong95 commented 1 year ago

The results you got are reasonable if you selected only one cell of different TFs. The accessibility landscape of a cell is fixed and reflects its intrinsic state. TFs are expected to bind the accessible regions that match their sequence features. SCRIP eliminates the regions that lack the motif or validated ChIP-seq peaks.

These results indicate that many accessible regions in the cell can be bound by multiple TFs of your interest, but there are also some differences among them (which you marked with *). The remaining target genes all have potential binding regions for those TFs, which may suggest coregulation patterns. You can double-check the result at the cluster level and it may generate more reliable results. Also, You can consult some papers, check known databases, or perform some experimental validation.

valentinaOpazo commented 1 year ago

The input .h5ad data that I used to SCRIP included 1000 cells and then when I analyzed the target output file I calculated the mean of all these 1000 cells. In Python, I also checked the scanpy's function highest_expr_genes. In R I followed the example of Human Fetal Organs, specifically these lines

use_qc_GATA3<-use_GATA3[,colnames(use_GATA3)%in%rownames(as.data.frame(sort(colSums(use_GATA3),decreasing=TRUE)[1:500]))]  #cell qc
Lymphoid_use_GATA3<-as.data.frame(sort(rowMeans(use_qc_GATA3),decreasing = TRUE)[1:1000]) #gene qc
eg_gata3_ly <- bitr(rownames(Lymphoid_use_GATA3), fromType="SYMBOL", toType=c("ENTREZID"), OrgDb="org.Hs.eg.db")

Considering that the values are the mean of 1000 cells and not only one selected cell, it's still a reasonable result? Is the mean of all cells an appropriate metric? Additionally, I checked databases of target TF, and 30-40% of the targets predicted by SCRIP are deposited in databases (a value similar to that obtained for GATA3 in LC of the SCRIP paper)

xindong95 commented 1 year ago

Considering that the values are the mean of 1000 cells and not only one selected cell, it's still a reasonable result?

The answer may depend on the cell population and identity. If the 1000 cells share the same cell identity, the result could be reasonable. However, if the cells include multiple cell types, the result may be inaccurate, since the analysis is based on mixed data.

Additionally, I checked databases of target TF, and 30-40% of the targets predicted by SCRIP are deposited in databases (a value similar to that obtained for GATA3 in LC of the SCRIP paper)

The targets of a TF may vary, but I would not expect them to change much. The sequence feature restricts the binding sites of a specific TF. Therefore, if you look at a database such as Cistrome DB, the targets will differ depending on the treatment, condition, or tissue, but there will also be many targets that remain unchanged.

I hope this answers your question.

valentinaOpazo commented 1 year ago

I really appreciate your advice! Thanks for your answers and for developing this useful tool, I'm using it a lot

xindong95 / SCRIP

Input to impute function #2