Replicate Figure2A and Figure3A on manuscript

yzhong36 commented 3 weeks ago

Hi,

I was trying to replicate your results using the same set of fastq files as in your paper, but I could not reproduce the same trends. Which specific parameters did you use in the configuration file? I was using: sites_initial: 1, runs_initial: 20, sites_quality: 10, runs_quality: 1000.

Figure3A seems easier to reproduce (there is still some discrepancy, but the ARI is acceptable), whereas Figure2A comes out very different with the thresholds above. All junction reads were processed carefully (I used both STAR and HISAT2 for benchmarking before running LeafCutter). Thanks.

kokox10 commented 3 weeks ago

Thank you for your interest. The filtering parameters we used are: sites_initial: 10, runs_initial: 5, sites_quality: 10, and runs_quality: 1000. In addition, for Figure2A the imputation parameters were knn=20 and iteration=3. We also adjusted the weighting ratio between the imputation information and the sparsity information used in clustering to 0.6, based on the characteristics of the data (this parameter is typically stable and is not included in the readme, but users can adjust it for their own data at line 108 in normalize.py). For Figure3A, the imputation parameters were knn=10 and iteration=3. Furthermore, parameter settings from the initial mapping step through the clustering step (including seed numbers) can also influence the final clustering results. We hope this information is helpful.
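For reference, the settings reported here can be compared with the ones from the first post using a short sketch. The dictionary and function names are illustrative and not part of SCASL; only the values are taken from this thread, and `impute_weight` is a hypothetical label for the 0.6 weighting ratio.

```python
# Values quoted in this thread; all variable names are illustrative only.
reported = {"sites_initial": 10, "runs_initial": 5,
            "sites_quality": 10, "runs_quality": 1000}
attempted = {"sites_initial": 1, "runs_initial": 20,
             "sites_quality": 10, "runs_quality": 1000}

# Imputation settings per figure ("impute_weight" is a hypothetical name
# for the weighting ratio edited at line 108 of normalize.py).
imputation = {
    "Figure2A": {"knn": 20, "iteration": 3, "impute_weight": 0.6},
    "Figure3A": {"knn": 10, "iteration": 3},
}

def diff_params(a, b):
    """Return {key: (a_value, b_value)} for keys whose values differ."""
    return {k: (a[k], b.get(k)) for k in a if a[k] != b.get(k)}

differences = diff_params(reported, attempted)
print(differences)
```

Only sites_initial and runs_initial differ between the two filtering configurations; the two quality thresholds are identical.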

yzhong36 commented 3 weeks ago

Hi, thanks for your reply. I don't quite understand the weighting ratio at line 108 in normalize.py. There is no adjustable parameter there; the line only assigns the imputed values back to the corresponding dimension.

Also, for Figure2A I used all fastq files (756 in total) from a single patient, the same as you did, and for Figure3A I used all fastq files (1533 in total). I'm not sure whether Table 1 reports only the fastq files that remained after SCASL filtering, because the paper does not seem to describe the processing step explicitly.

kokox10 commented 3 weeks ago

Line 108 in normalize.py defines the imputation information used during clustering. Since the weighting ratio typically doesn't need adjustment, we haven't defined it as a separate parameter. In the code mat[:, :, 0] = df_fillna.values.T * 1, the 1 is the weight, and it can be modified to adjust the weighting ratio.

The involvement of other cells in clustering can indeed have a significant impact on the clustering results. Initially, we did not perform any specific data processing; it was mostly standard filtering methods such as quality screening and selection of cell types (information about cell types is mentioned in the original literature of public data). For Figure2A, we only used cells from patient H010, retaining 422 cells after rigorous cell quality control and splicing profile filtering. In the case of Figure3A, as our main focus was on studying tumor cells, we specifically chose epithelial cells for analysis. The 1533 files you mentioned likely contain cells of all other cell types. Hope this information can help you.

yzhong36 commented 3 weeks ago

Thanks for your reply. If I understand correctly, changing the 1 scales both the imputed and the original values, which seems odd to me, unless you trace back the positions of the imputed values and scale only those (in which case it is no longer a simple scalar).

You reported 422 cells for Figure2A in Table 1, but there were actually only 405 cells for Figure2A (see the supplied data). That's why I'm wondering whether:

You used all fastq files for H010 (756) and relied on SCASL filtering to arrive at the final number? OR
You pre-selected cells somehow before running SCASL and then ran SCASL on those?

I'm not too worried about Figure3A because it seems more reproducible. Thanks.

kokox10 commented 2 weeks ago

For the first question, while 1 indeed applies to imputed and original values, what we adjusted here was the weighting between the overall AS probability matrix information after imputation and the sparsity information of the original matrix used during clustering. Therefore, by keeping the 1 unchanged at line 109 (which returns the sparsity information of the matrix), adjusting only line 108 changes the balance of using these two types of information in the final result.
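The effect described above can be illustrated with a small NumPy sketch. This is not SCASL's actual code: it assumes the second channel (line 109) is a binary mask of observed entries, that the input table has sites as rows and cells as columns, that cells are compared by squared Euclidean distance over both channels, and it uses a constant fill as a placeholder for the real imputation. The names `build_two_channel`, `channel_share`, and `impute_weight` are hypothetical.

```python
import numpy as np
import pandas as pd

def build_two_channel(df, df_fillna, impute_weight=1.0):
    """Stack weighted imputed values and a sparsity mask into (cells, sites, 2)."""
    n_cells, n_sites = df_fillna.values.T.shape
    mat = np.zeros((n_cells, n_sites, 2))
    # Mirrors the quoted line 108, with the literal 1 replaced by a weight.
    mat[:, :, 0] = df_fillna.values.T * impute_weight
    # Assumed stand-in for line 109: 1 where a value was observed, else 0.
    mat[:, :, 1] = df.notna().values.T * 1
    return mat

def channel_share(mat, i, j):
    """Fraction of the squared distance between cells i and j from channel 0."""
    d0 = np.sum((mat[i, :, 0] - mat[j, :, 0]) ** 2)
    d1 = np.sum((mat[i, :, 1] - mat[j, :, 1]) ** 2)
    return d0 / (d0 + d1)

# Toy data: rows are sites, columns are cells, NaN marks a missing value.
df = pd.DataFrame(
    {"cell_a": [0.9, np.nan, 0.2], "cell_b": [0.1, 0.5, np.nan]},
    index=["site1", "site2", "site3"],
)
df_fillna = df.fillna(0.5)  # placeholder for the real imputation step

mat_default = build_two_channel(df, df_fillna, impute_weight=1.0)
mat_scaled = build_two_channel(df, df_fillna, impute_weight=0.6)

share_default = channel_share(mat_default, 0, 1)
share_scaled = channel_share(mat_scaled, 0, 1)
print(share_default, share_scaled)
```

Under these assumptions the scalar does multiply every entry of the first channel, imputed or not, but because the mask channel is left unchanged, the first channel contributes a smaller share of the distance between cells when the weight is reduced.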

Regarding the second question, we did not perform any additional preprocessing; instead, we filtered all fastq files through SCASL. The 422 cells that remained were used in clustering (as shown in Table 1 and Figure 2A). However, a few cells within this set were not used in the results of the data source paper and were not labeled as "tumor" or "metastasis". Consequently, I removed these cells from the final source data, leaving only 405 cells. I apologize for any confusion this may have caused. I hope this information is helpful.
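One way to check this reconciliation is to compare cell IDs directly. The following is a minimal sketch, assuming a list of the cells SCASL retained and a metadata table are available; the column names `cell_id` and `label`, the example IDs, and the label "other" are hypothetical.

```python
import pandas as pd

def split_by_label(retained_ids, meta, keep_labels=("tumor", "metastasis")):
    """Split retained cells into those with a kept label and all others."""
    # Cells missing from the metadata get NaN and end up in `dropped`.
    labels = meta.set_index("cell_id")["label"].reindex(retained_ids)
    is_kept = labels.isin(keep_labels)
    kept = labels[is_kept].index.tolist()
    dropped = labels[~is_kept].index.tolist()
    return kept, dropped

# Toy metadata standing in for the public meta file.
meta = pd.DataFrame({
    "cell_id": ["c1", "c2", "c3", "c4"],
    "label": ["tumor", "metastasis", "other", "tumor"],
})
retained = ["c1", "c2", "c3", "c5"]  # c5 has no metadata entry

kept, dropped = split_by_label(retained, meta)
print(len(kept), len(dropped))
```

If the explanation above holds, running the same split on the 422 retained cells should give 405 kept and 17 dropped.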

sky1-max commented 1 week ago

Hi, I have a question that has been bothering me for a long time. In S9D of the paper, are the clustering results based on gene expression mapped onto the AS-based clustering map (or are the AS clustering results mapped onto the AS-based clustering map)? Are there reference conditions for this, or did you tune the SCASL parameters over multiple runs? Thank you very much!

yzhong36 commented 5 days ago

Hi,

Thanks for your reply. I modified the parameter to balance the relative contributions as you said (0.6), but it made little difference compared with the default, so the results are still significantly different from Fig2A. Also, you said only 405 cells were used because labels were unavailable, which does not seem right to me, because the original GEO resource lists labels for every cell in the meta file.

xryanglab / SCASL

Replicate Figure2A and Figure3A on manuscript #8