Closed valentinaOpazo closed 6 months ago
Thanks for your interest in our tool. It seems that the h5 file that you generated failed to be read by scanpy. You can have a try on whether the h5 file can be read directly by the scanpy.read_10x_h5 function.
Another way is to change your Signac object to a SingleCellExperiment object and store it in rds, then read it in Python by following the function:
def read_SingleCellExperiment_rds(input_RDS):
import anndata2ri
from rpy2.robjects import r
anndata2ri.activate()
rscript = 'readRDS("{RDS_file_path}")'.format(RDS_file_path = input_RDS)
adata = r(rscript)
adata.var.columns = [str(i) for i in adata.var.columns]
adata.obs.columns = [str(i) for i in adata.obs.columns]
return data
Then you can store the adata to h5ad format that SCRIP impute function supports as well. You might need to install the anndata2ri and rpy2 libraries first.
Sorry for the inconvenience and I will add MTX format support ASAP.
Thank you for your quick answer, it is very useful! I'm gonna do what you proposed it
Hi again @xindong95 ! I followed your advice and I converted my Signac object to SingleCellExperiment object and then saved it as an h5ad file. But now, when I run the impute function I have another error, specifically with _search_reffactor function (the imputed cell beds files are correctly created). Apparently, the problem is with giggle
My code: SCRIP impute -i combined.subsampled3.h5ad -s hs -p Test2 --factor FLI1
The output message:
INFO 2023-09-13 17:05:52 Generating beds...
INFO 2023-09-13 17:05:52 Start generating cells beds ... /home/vale/.local/lib/python3.8/site-packages/anndata/_core/anndata.py:1117: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value insteadSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_sub[k] = df_sub[k].cat.remove_unused_categories() INFO 2023-09-13 17:06:33 Finished generating cells beds! INFO 2023-09-13 17:06:33 Start searching beds from index ... giggle: Unknown option -i Try `giggle --help' for more information. multiprocessing.pool.RemoteTraceback:
Traceback (most recent call last): File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) File "/usr/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 31, in search_ref_factor subprocess.run(cmd, shell=True, check=True) File "/usr/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command 'giggle search -i "/mnt/c/Users/56966/OneDrive - Universidad de Concepción/FCV/single-BC/MergeAllBreast/TR_human" -q "Test2/imputation//imputed_beds/BC1AAACTGCAGAATCAAC-1.bed.gz" -s -f JUNB > "Test2/imputation//JUNB/imputed_results_JUNB/BC1_AAACTGCAGAATCAAC-1.txt" ' returned non-zero exit status 1. """
The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/vale/.local/bin/SCRIP", line 8, in
sys.exit(main()) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/start.py", line 39, in main run_impute(args) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 218, in run_impute impute(input_mat_adata=feature_matrix, File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 97, in impute search_ref_factor_batch(f'{path}/imputed_beds/', f'{path}/{impute_factor}/imputedresults{impute_factor}/', ref_path, impute_factor, n_cores) File "/home/vale/.local/lib/python3.8/site-packages/SCRIP/imputation/impute.py", line 45, in search_ref_factor_batch p.starmap(search_ref_factor, args) File "/usr/lib/python3.8/multiprocessing/pool.py", line 372, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get raise self._value subprocess.CalledProcessError: Command 'giggle search -i "/mnt/c/Users/56966/OneDrive - Universidad de Concepción/FCV/single-BC/MergeAllBreast/TR_human" -q "Test2/imputation//imputed_beds/BC1AAACTGCAGAATCAAC-1.bed.gz" -s -f JUNB > "Test2/imputation//JUNB/imputed_results_JUNB/BC1_AAACTGCAGAATCAAC-1.txt" ' returned non-zero exit status 1.
I would like to mention that I downloaded the TR_human database from zenodo and extracted it with tar xvzf
just as the documentation indicates. Also, I run the config function with the proper path (the same path that appears in "giggle search -i ...").
Hi, can you try giggle search
in the terminal and paste the results?
The giggle version of SCRIP built-in is giggle v0.6.3 (the newest), and the command should output:
giggle, v0.6.3
usage: giggle search -i <index directory> [options]
options:
-i giggle index directory
-r <regions (CSV)>
-q <query file>
-o give reuslts per record in the query file (omits empty results)
-c give counts by indexed file
-s give significance by indexed file (requires query file)
-v give full record results
-f print results for files that match a pattern (regex CSV)
-g genome size for significance testing (default 3095677412)
-l list the files in the index
also, you can try the following command and look through if there are any errors.
giggle search -i "/mnt/c/Users/56966/OneDrive - Universidad de Concepción/FCV/single-BC/MergeAllBreast/TR_human" -q "Test2/imputation//imputed_beds/BC1_AAACTGCAGAATCAAC-1.bed.gz" -s -f JUNB > "Test2/imputation//JUNB/imputed_results_JUNB/BC1_AAACTGCAGAATCAAC-1.txt"
Looking forward to your any feedbacks. Thanks!
Thanks for your answer, I really appreciate your help. When I ran giggle search
got this message:
Command 'giggle' not found,
So, I assumed that I needed to install giggle independently to SCRIP (it could be helpful to add it to the SCRIP requirement) Finally SCRIP is running well !!
Wonderful!
Installing SCRIP with GitHub will install the giggle automatically, but it may not with pip. I'll consider removing the giggle in installation via GitHub and adding a step-by-step giggle install instruction in the README, or making installing giggle as a function of SCRIP.
Thanks for all your feedback and I hope SCRIP is useful in your work!
It's me again. Once impute function ended I ran target and everything worked fine, so I obtained a .h5ad file. But now I don't understand what exactly information have this file and what structure it has. It's possible for example have a list of top100 target genes of FLI1(based on some parameters calculated by the target function). Also, I saw that in the example of human fetal organs you converted the .h5ad file into 10x mtx format file to easily read into R, it is possible that you could share how you did it?
Sorry for all the questions, but I'm really interested in the results that I can obtain from SCRIP
The .h5ad file contains the values of the targets that you inferred with TF. You can read it in Python using the scanpy.read_h5ad function or in R using the seuratdisk package. It is similar to a single-cell RNA-seq file, where the features are genes and the observations are cells. You can easily check the top targets in each cell or perform some clustering analysis, or even differential targets analysis using variable gene functions. You can use the functions in Seurat or scanpy for these purposes.
If you would like to save .h5ad as 10X MTX files, you can use the following functions:
def write_to_mtx(data, path):
if not os.path.exists(path):
os.makedirs(path)
pd.DataFrame(data.var.index, data.var.index).to_csv(os.path.join(path, "genes.tsv"), sep="\t", header=False)
pd.DataFrame(data.obs.index).to_csv(os.path.join(path, "barcodes.tsv"), sep = "\t", index=False, header=False)
data.obs.to_csv(os.path.join(path, "metadata.tsv"), sep = "\t", index=False, header=False)
io.mmwrite(os.path.join(path, "matrix.mtx"), data.X.T)
or simply from SCRIP.utilities.utils import write_to_mtx
to use the function.
If you have any questions or feedback about SCRIP, please feel free to contact me! I am happy to assist you and improve SCRIP to make it more user-friendly. I would appreciate it if you could use SCRIP for your future analysis and recommend it to others who might be interested. Our goal is to make SCRIP a valuable tool for scATAC-seq data analysis and help the communities discover new biological insights.
Thank you !! I analyzed my data in parallel with Python (scanpy package) and R (after to convert h5ad to mtx format file), and both showed similar results. However, I ran SCRIP (impute and target) for seven different transcription factors (each one run independently), and the top 20 target genes are very similar between these seven TFs. Also, the top20 targets of two TFs differ only in six genes, even have the same order and value (the six different target genes are marked with "*").
Top20 target genes for Transcription factor "A"
Target Value RRM2B 1,57402666 SYNCRIP 1,49569685 NTAQ1 1,43606113 UBR5-AS1 1,28829231 PPP2R5A 1,24412231 HBP1* 1,23833557 NUDCD1 1,23052868 PRRC2A* 1,21214999 CDK13 1,19023322 HSP90AB1 1,17828139 SUCO 1,16173209 UBE2D3 1,15968754 DNAI4 1,14542942 RALY 1,14261838 RNF139-AS1 1,13604039 ATF3 1,13588482 TOB1-AS1* 1,13249802 CGGBP1 1,12623713 RBM15-AS1 1,11772619 ZBTB20 1,11714044
Top20 target genes for Transcription factor "B"
Target Value RRM2B 1,57402666 SYNCRIP 1,49569685 NTAQ1 1,43606113 UBR5-AS1 1,28829231 ZNF687-AS1* 1,25078829 PPP2R5A 1,24412231 NUDCD1 1,23052868 FBRS* 1,20798469 ELL* 1,19485175 CDK13 1,19023322 HSP90AB1 1,17898713 SUCO 1,16173209 UBE2D3 1,15968754 DNAI4 1,14542942 RALY 1,14261838 RNF139-AS1 1,13604039 ATF3 1,13585091 CGGBP1 1,12623713 RBM15-AS1 1,11772619 ZBTB20 1,11714044
What do you think about this result? It is an intrinsic property of my data or maybe some artifact is interfering with my results? How can I validate that SCRIP ran well and these are correct results? maybe a transcription factor that works like control positive
The results you got are reasonable if you selected only one cell of different TFs. The accessibility landscape of a cell is fixed and reflects its intrinsic state. TFs are expected to bind the accessible regions that match their sequence features. SCRIP eliminates the regions that lack the motif or validated ChIP-seq peaks.
These results indicate that many accessible regions in the cell can be bound by multiple TFs of your interest, but there are also some differences among them (which you marked with *). The remaining target genes all have potential binding regions for those TFs, which may suggest coregulation patterns. You can double-check the result at the cluster level and it may generate more reliable results. Also, You can consult some papers, check known databases, or perform some experimental validation.
The input .h5ad data that I used to SCRIP included 1000 cells and then when I analyzed the target
output file I calculated the mean of all these 1000 cells. In Python, I also checked the scanpy's function highest_expr_genes
. In R I followed the example of Human Fetal Organs, specifically these lines
use_qc_GATA3<-use_GATA3[,colnames(use_GATA3)%in%rownames(as.data.frame(sort(colSums(use_GATA3),decreasing=TRUE)[1:500]))] #cell qc
Lymphoid_use_GATA3<-as.data.frame(sort(rowMeans(use_qc_GATA3),decreasing = TRUE)[1:1000]) #gene qc
eg_gata3_ly <- bitr(rownames(Lymphoid_use_GATA3), fromType="SYMBOL", toType=c("ENTREZID"), OrgDb="org.Hs.eg.db")
Considering that the values are the mean of 1000 cells and not only one selected cell, it's still a reasonable result? Is the mean of all cells an appropriate metric? Additionally, I checked databases of target TF, and 30-40% of the targets predicted by SCRIP are deposited in databases (a value similar to that obtained for GATA3 in LC of the SCRIP paper)
Considering that the values are the mean of 1000 cells and not only one selected cell, it's still a reasonable result?
The answer may depend on the cell population and identity. If the 1000 cells share the same cell identity, the result could be reasonable. However, if the cells include multiple cell types, the result may be inaccurate, since the analysis is based on mixed data.
Additionally, I checked databases of target TF, and 30-40% of the targets predicted by SCRIP are deposited in databases (a value similar to that obtained for GATA3 in LC of the SCRIP paper)
The targets of a TF may vary, but I would not expect them to change much. The sequence feature restricts the binding sites of a specific TF. Therefore, if you look at a database such as Cistrome DB, the targets will differ depending on the treatment, condition, or tissue, but there will also be many targets that remain unchanged.
I hope this answers your question.
I really appreciate your advice! Thanks for your answers and for developing this useful tool, I'm using it a lot
Hello @xindong95 , I'm interested in using SCRIP software to identify gene targets of some transcription factors that apparently have an important activity on my samples. I processed my scATAC-seq data with Signac pipeline, so I exported peaks by cell-matrix from a Signac object and then I covert it to .h5 format. I planned to use impute function to later use the function target but when I use this code
SCRIP impute -i $input_path/peaks_by_cell.h5 -s hs -f mtx -p Target_FLi1 --factor FLI1
I got the next error:I don't understand if the error is because the input file isn't correct or something else. Its possible to use another format as an input file for this function?