yafeng / pan-cancer-proteogenomics-analysis

this repo contains workflow and scripts used for pan-cancer Proteogenomics analysis
0 stars 1 forks source link

About the module used in svmToTSV #1

Open ymatts opened 2 years ago

ymatts commented 2 years ago

I am afraid to contact you at such short notice.

I am interested in the pipeline that you kindly published, and I am running it using label free data, but I am having trouble resolving some of the errors (in the svmToTSV section).

If possible, I would be very happy if you could give me some suggestions on how to solve the problem.

The error seemed to be happening at svmToTSV.

I am trying to test the Python code written there first to confirm that it works.

My question is, are all of the following modules available? I couldn't find them by searching, so if you could tell me how to get them, that would be great.

// The above modules were imported in the python code in svmToTSV in ipaw.hg38.nf, but I could not find them.

L349. from app.readers import pycolator, xml, tsv, mzidplus

Thank you for your time and patience. Best regards Yusuke

For your reference, I would also like to describe the analysis environment, the overall execution status and the error status.

The analysis environment is Ubuntu 18.04 LTS and the version of Nextflow is 21.04.1. The overall execution was as follows

Parameter setup:

export NXF_WORK="~/pan-cancer-proteogenomics-analysis-master/work2"
export NXF_OPTS="-Xms120G -Xmx120G"
path="~/proteogenomics-analysis-workflow/"
path2="~/pan-cancer-proteogenomics-analysis-master/"

Executed command and parameters:

nextflow run ipaw.hg38.nf \
--msgfjar $path2/MSGF/MSGFPlus/MSGFPlus.jar \
--tdb $path2/vardb2/VarDB2_cescMut_ENSEMBL98.revCat.fa \
--gtf $path2/vardb2/VarDB2.gtf \
--inst 1 \
--mods $path2/Mods_220408abe.txt \
--mzmldef $path2/filelist_setnames.txt \
--knownproteins $path2/ENSEMBL98/Homo_sapiens.GRCh38.pep.all.fa \
--blastdb $path2/CANONICALDB/UNIPROT_GENCODE41_ENSEMBL98.fa \
--snpfa $path/MSCanProVar_ensemblV79.filtered.fasta \
--genome $path2/hg38.chr1-22.X.Y.M.fa.masked \
--outdir $path2/results \
--activationFilter HCD \
--MS2error 0.02 \
--qval 0.01 \
-profile docker,standard \
--tasks 4 \
--thread 1 \
--PrecursorMassTolerance 10ppm \
--FragmentMethodID 3 \
--mzid2tsvConverter $path2/MSGF/MzidToTsvConverter/MzidToTsvConverter.exe \
--mono $path2/mono-6.12.0.182/bin/mono \
--verbose 1 \
--tda 1

Output:

executor > local (2)
[91/e2918f] process > makeProtSeq [100%] 1 of 1, cached: 1 ✔
[5d/44709f] process > makeTrypSeq [100%] 1 of 1, cached: 1 ✔
[- ] process > IsobaricQuant -
[9e/274d37] process > createSpectraLookup (1) [100%] 1 of 1, cached: 1 ✔
[0c/6d41da] process > msgfPlus (1) [100%] 1 of 1, cached: 1 ✔
[f3/2c55fc] process > ConvertmzidTotsv (1) [100%] 1 of 1, cached: 1 ✔
[3a/7fe273] process > percolator (1) [100%] 1 of 1, cached: 1 ✔
[ef/a72804] process > getVariantPercolator (1) [100%] 1 of 1, cached: 1 ✔
[88/a4e474] process > getNovelPercolator (1) [100%] 1 of 1, cached: 1 ✔[bb/b3fdf2] process > filterPercolator (1) [100%] 2 of 2, cached: 2 ✔
**[e1/236e28] process > svmToTSV (2) [100%] 1 of 1, failed: 1**
[- ] process > createPSMTables -
[- ] process > mergeSetPSMtable -
[- ] process > prePeptideTable -
[- ] process > createFastaBedGFF -
[- ] process > BlastPNovel -
[- ] process > ParseBlastpOut -
[- ] process > ValidateSingleMismatchNovpeps - [- ] process > novpepSpecAIOutParse -
[- ] process > BLATNovel -
[- ] process > parseBLATout -
[- ] process > labelnsSNP -
[- ] process > phastcons -
[- ] process > phyloCSF -
[- ] process > scanBams -
[- ] process > annovar -
[- ] process > parseAnnovarOut -
[- ] process > combineResults -
[- ] process > prepSpectrumAI -
[- ] process > prepSpectrumAI -
[- ] process > generateVariantPepTable - [- ] process > mergeSetPeptidetable -

Here is the error message.

Error executing process > 'svmToTSV (1)'

Caused by:
  Process `svmToTSV (1)` terminated with an error exit status (1)

Command executed:

  #! /usr/bin/env python3
  from glob import glob
  mzidtsvfns = sorted(glob('mzidtsv*'))
  mzidfns = sorted(glob('mzident*'))
  **from app.readers import pycolator, xml, tsv, mzidplus**
  import os
  ns = xml.get_namespace_from_top('filtprot', None) 
  psms = {p.attrib['{%s}psm_id' % ns['xmlns']]: p for p in pycolator.generate_psms('filtprot', ns)} decoys = {True: 0, False: 0}
  for psm in sorted([(pid, float(p.find('{%s}svm_score' % ns['xmlns']).text), p) for pid, p in psms.items()], reverse=True, key=lambda x:x[1])
      pdecoy = psm[2].attrib['{%s}decoy' % ns['xmlns']] == 'true'
      decoys[pdecoy] += 1
      try:
          psms[psm[0]] = {'decoy': pdecoy, 'svm': psm[1], 'qval': decoys[True]/decoys[False]} # T-TDC
      except ZeroDivisionError:
          psms[psm[0]]] = {'decoy': pdecoy, 'svm': psm[1], 'qval': 1.0} # T-TDC  
  decoys = {'true': 0, 'false': 0}
  for svm, pep in sorted([(float(x.find('{%s}svm_score' % ns['xmlns']).text), x) for x in pycolator.generate_peptides('filtprot', ns)] reverse=True, key=lambda x:x[0]):
      decoys[pep.attrib['{%s}decoy' % ns['xmlns']]] += 1 try:
          [psms[pid.text].update({'pepqval': decoys['true']/decoys['false']}) for pid in pep.find('{%s}psm_ids' % ns['xmlns'])]
      except ZeroDivisionError: [psms[pid.text].update({'pepqval': 1.0}) for pid in pep.find('{%s}psm_ids' % ns['xmlns'])]   
  oldheader = tsv.get_tsv_header(mzidtsvfns[0])
  header = oldheader + ['percolator svm-score', 'PSM q-value', 'peptide q-value']]
  with open('mzidperco', 'w') as fp:
      fp.write('\t'.join(header))
      for fnix, mzidfn in enumerate(mzidfns):
          mzns = mzidplus.get_mzid_namespace(mzidfn)
          siis = (sii for sir in mzidplus.mzid_spec_result_generator(mzidfn, mzns) for sii in sir.findall('{%s}SpectrumIdentificationItem' %) mzns['xmlns']))
          for specidi, psm in zip(siis, tsv.generate_tsv_psms(mzidtsvfns[fnix], oldheader))
              # percolator psm ID is: samplename_SII_scannr_rank_scannr_charge_rank
              print(specidi)
              print(psm)
              scan, rank = specidi.attrib['id'].replace('SII_', '').split('_')
              outpsm = {k: v for k,v in psm.items()}
              spfile = os.path.splitext(psm['#SpecFile'])[0].
              try:
                  percopsm = psms['{fn}_SII_{sc}_{rk}_{sc}_{ch}_{rk}'.format(fn=spfile, sc=scan, rk=rank, ch=psm['Charge'])]
              except KeyError:
                  continue
              if percopsm['decoy']:
                  continue
              fp.write('\n')
              outpsm.update({'percolator svm-score': percopsm['svm'], 'PSM q-value': percopsm['qval'], 'peptide q-value': percopsm['pepqval']})
              fp.write('\t'.join([str(outpsm[k]) for k in header])))

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File ".command.sh", line 10, in <module
      for psm in sorted([(pid, float(p.find('{%s}svm_score' % ns['xmlns']).text), p) for pid, p in psms.items()], reverse=True, key=lambda x:x[1]):
    File ".command.sh", line 10, in <listcomp
      for psm in sorted([(pid, float(p.find('{%s}svm_score' % ns['xmlns']).text), p) for pid, p in psms.items()], reverse=True, key=lambda x:x[1])
  AttributeError: 'NoneType' object has no attribute 'text'
yafeng commented 2 years ago

@ymatts Hi, it seems the input file given to svmToTSV command caused the error. I suggest you locate to the work directory for the specific process **[e1/236e28] process > svmToTSV (2) check the input file, see it looks normal. Try run this command separately in this directory, see if you get same error.

Yafeng

ymatts commented 2 years ago

@yafeng Thank you very much for your response. I have tried executing as you suggested.

I would like to ask your suggestion this time about the following error output when I run $bash .command.sh in the directory where the file in issue is located.


from: can't read /var/mail/glob
.command.sh: line 3: syntax error near unexpected token `('
.command.sh: line 3: `mzidtsvfns = sorted(glob('mzidtsv*'))'

Any advice you can give me when you have time would be greatly appreciated.

Best regards Yusuke


The following is a summary of what I have done for your reference.

1. Go to the work directory in issue and check the files. Three files were created.

2. check the contents of the file Visually, the file appeared to be generated correctly.

3. Independently execute the command in the directory containing the files in issue

I ran the nextflown intermediate script in each of the directories containing the three files above.

The results are as follows.

$ bash .command.run

The same error was output as the last time the whole process was run.

Traceback (most recent call last):
  File "***/work2/43/ea8b0db1c692d9e43736093987c279/.command.sh", line 10, in <module>
    for psm in sorted([(pid, float(p.find('{%s}svm_score' % ns['xmlns']).text), p) for pid, p in psms.items()], reverse=True, key=lambda x:x[1]):
  File "***/work2/43/ea8b0db1c692d9e43736093987c279/.command.sh", line 10, in <listcomp>
    for psm in sorted([(pid, float(p.find('{%s}svm_score' % ns['xmlns']).text), p) for pid, p in psms.items()], reverse=True, key=lambda x:x[1]):
AttributeError: 'NoneType' object has no attribute 'text'

The results of the second command are as follows

$bash .command.sh

This error was newly confirmed in this run.

from: can't read /var/mail/glob
.command.sh: line 3: syntax error near unexpected token `('
.command.sh: line 3: `mzidtsvfns = sorted(glob('mzidtsv*'))'

I checked and it seems that it is common for this error output to occur when the shell script to be executed (in this case .command.sh) does not have the shebang on the first line. I checked the .command.sh script, and the shebang was written correctly.

4. other For reference, here are the results of the run, the several lines of the file, and the script in issue(.command.sh). First, I went to the directory where the problem was occurring and checked the three output files:

$cd work2/43/ea8b0db1c692d9e43736093987c279
$ls -lh 

The following files were found.

*** xyz docker  126 Sep 21 17:03 filtprot -> ***/work2/b9/07aa3d4db11c24e1681a089af75ab3/filtprot
*** xyz docker  152 Sep 21 17:03 mzident1 -> ***/work2/0c/6d41dad785b56a982d05c5219d6d60/20220512_hela200ng_woFAIMS_01.mzid
*** xyz docker  130 Sep 21 17:03 mzidtsv1 -> ***/work2/f3/2c55fc48366f044a1f3585970c630d/out.mzid.tsv

Next, I checked the file contents.

$ head -n20 filtprot

Result:

<?xml version='1.0' encoding='UTF-8'?>
<percolator_output xmlns="http://per-colator.com/percolator_out/15" xmlns:p="http://per-colator.com/percolator_out/15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://per-colator.com/percolator_out/15 https://github.com/percolator/percolator/raw/pout-1-5/src/xml/percolator_out.xsd" p:majorVersion="3" p:minorVersion="04" p:percolator_version="Percolator version 3.04.0">

  <process_info>
    <command_line>percolator -j percoin.xml -X perco.xml -N 500000 --decoy-xml-output -y</command_line>
    <other_command_line/>
    <pi_0_psms>0.467598</pi_0_psms>
    <pi_0_peptides>0.527571</pi_0_peptides>
    <psms_qlevel>37264</psms_qlevel>
    <peptides_qlevel>28472</peptides_qlevel>
  </process_info>

  <psms>
<psm p:psm_id="20220512_hela200ng_woFAIMS_01_SII_102871_1_102871_3_1" p:decoy="false">
      <svm_score>5.973</svm_score>
      <q_value>2.529e-05</q_value>
      <pep>1.003e-21</pep>
      <exp_mass>1104.5200</exp_mass>
      <calc_mass>1104.190</calc_mass>
      <peptide_seq n="-" c="-" seq="QAESC[UNIMOD:4]DC[UNIMOD:4]LQGFQLTHSLGGGTGSGMGTLLLSK"/>
nagoya-bmhi@stratus052:~/projects/proj_abe/pan-cancer-proteogenomics-analysis-master/work2/43/ea8b0db1c692d9e43736093987c279$ head -n10 filtprot
<?xml version='1.0' encoding='UTF-8'?>
<percolator_output xmlns="http://per-colator.com/percolator_out/15" xmlns:p="http://per-colator.com/percolator_out/15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://per-colator.com/percolator_out/15 https://github.com/percolator/percolator/raw/pout-1-5/src/xml/percolator_out.xsd" p:majorVersion="3" p:minorVersion="04" p:percolator_version="Percolator version 3.04.0">

  <process_info>
    <command_line>percolator -j percoin.xml -X perco.xml -N 500000 --decoy-xml-output -y</command_line>
    <other_command_line/>
    <pi_0_psms>0.467598</pi_0_psms>
    <pi_0_peptides>0.527571</pi_0_peptides>
    <psms_qlevel>37264</psms_qlevel>
    <peptides_qlevel>28472</peptides_qlevel>

I checked next file:

head -n20 mzident1

Result:

<?xml version="1.0" encoding="UTF-8"?>
<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1" id="MS-GF+" version="1.1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psidev.info/psi/pi/mzIdentML/1.1 http://www.psidev.info/files/mzIdentML1.1.0.xsd" creationDate="2022-09-21T03:54:23">
<cvList>
  <cv fullName="PSI-MS" version="3.30.0" uri="https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo" id="PSI-MS"/>
  <cv fullName="UNIMOD" uri="http://www.unimod.org/obo/unimod.obo" id="UNIMOD"/>
  <cv fullName="UNIT-ONTOLOGY" uri="https://raw.githubusercontent.com/bio-ontology-research-group/unit-ontology/master/unit.obo" id="UO"/>
</cvList>
<AnalysisSoftwareList>
  <AnalysisSoftware version="Release (v2022.04.18)" id="ID_software" name="MS-GF+">
    <SoftwareName>
      <cvParam cvRef="PSI-MS" accession="MS:1002048" name="MS-GF+"/>
    </SoftwareName>
  </AnalysisSoftware>
</AnalysisSoftwareList>
<SequenceCollection>
  <DBSequence length="32" searchDatabase_ref="SearchDB_1" accession="COSMIC:ENST00000328974:TUBB8:p.E123Q:Substitution-Missense:1" id="DBSeq502834812">
    <cvParam cvRef="PSI-MS" accession="MS:1001088" name="protein description" value="COSMIC:ENST00000328974:TUBB8:p.E123Q:Substitution-Missense:1"/>
  </DBSequence>
  <DBSequence length="187" searchDatabase_ref="SearchDB_1" accession="ENSP00000261313.2" id="DBSeq602042350">
    <cvParam cvRef="PSI-MS" accession="MS:1001088" name="protein description" value="ENSP00000261313.2 pep chromosome:GRCh38:12:118135858:118145584:1 gene:ENSG00000089220.4 transcript:ENST00000261313.2 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:PEBP1 description:phosphatidylethanolamine binding protein 1 [Source:HGNC Symbol;Acc:HGNC:8630]"/>

Finally, I checked third file.

head -n20 mzidtsv1

Result:


#SpecFile       SpecId  ScanNum ScanTime(Min)   FragMethod      Precursor       IsotopeError    PrecursorError(ppm)     Charge  Peptide Protein DeNovoScore     MSGFScore       SpecEValue      EValue  QValue  PepQValue
20220512_hela200ng_woFAIMS_01.mzML      scan=102871     102871  77.36203        HCD     1104.51575      1       -6.28241        3       QAESC+57.021DC+57.021LQGFQLTHSLGGGTGSGMGTLLLSK  COSMIC:ENST00000328974:TUBB8:p.E123Q:Substitution-Missense:1   327     325     7.4168E-39      2.6297E-30      0.0     0.0
20220512_hela200ng_woFAIMS_01.mzML      scan=103252     103252  77.61435        HCD     868.68536       0       0.49183 4       WSGPLSLQEVDEQPQHPLHVTYAGAAVDELGK        ENSP00000261313.2       243     233     1.274E-37     4.517E-29        0.0     0.0
20220512_hela200ng_woFAIMS_01.mzML      scan=101815     101815  76.65645        HCD     1104.51758      1       -4.62413        3       QAESC+57.021DC+57.021LQGFQLTHSLGGGTGSGMGTLLLSK  COSMIC:ENST00000328974:TUBB8:p.E123Q:Substitution-Missense:1   342     338     1.9869E-37      7.0448E-29      0.0     0.0
20220512_hela200ng_woFAIMS_01.mzML      scan=92212      92212   70.25015        HCD     868.68555       0       0.70262 4       WSGPLSLQEVDEQPQHPLHVTYAGAAVDELGK        ENSP00000261313.2       245     239     3.9352E-37    1.3953E-28       0.0     0.0
20220512_hela200ng_woFAIMS_01.mzML      scan=113071     113071  84.21866        HCD     1137.86035      0       -1.39465        3       TTIPEEEEEEEEAAGVVVEEELFHQQGTPR  ENSP00000261366.5       295     289     9.0405E-37    3.1719E-28       0.0     0.0
20220512_hela200ng_woFAIMS_01.mzML      scan=98198      98198   74.24255        HCD     964.8396        0       1.83453 3       PAVVAPAPVVEAVSTPSAAFPSDATAENVK  CanProVar_rs184666535_ILF3_S482L_missense_8     204     200   6.0905E-36       2.1369E-27      0.0     0.0
20220512_hela200ng_woFAIMS_01.mzML      scan=102721     102721  77.26036        HCD     868.68597       0       1.19445 4       WSGPLSLQEVDEQPQHPLHVTYAGAAVDELGK        ENSP00000261313.2       233     216     8.1685E-36    2.8962E-27       0.0     0.0
20220512_hela200ng_woFAIMS_01.mzML      scan=125541     125541  92.64829        HCD     1091.17517      0       -1.56618        3       VADEDDVDNEEAALLHEEATMTIEELLTR   ENSP00000342778.4       368     345     1.2355E-35    4.3101E-27       0.0     0.0
20220512_hela200ng_woFAIMS_01.mzML      scan=72302      72302   56.99773        HCD     758.56421       0       1.52877 5       LVQAFQYTDEHGEVC+57.021PAGWKPGSDTIKPNVDDSK       ENSP00000301522.2       199     178     1.7509E-35     6.268E-27       0.0     0.0

Finally, paste the contents of the shell script ".command.sh". I have confirmed that $/usr/bin/env python3 will launch python correctly.

#!/usr/bin/env python3
from glob import glob
mzidtsvfns = sorted(glob('mzidtsv*'))
mzidfns = sorted(glob('mzident*'))
from app.readers import pycolator, xml, tsv, mzidplus
import os
ns = xml.get_namespace_from_top('filtprot', None) 
psms = {p.attrib['{%s}psm_id' % ns['xmlns']]: p for p in pycolator.generate_psms('filtprot', ns)}
decoys = {True: 0, False: 0}
for psm in sorted([(pid, float(p.find('{%s}svm_score' % ns['xmlns']).text), p) for pid, p in psms.items()], reverse=True, key=lambda x:x[1]):
    pdecoy = psm[2].attrib['{%s}decoy' % ns['xmlns']] == 'true'
    decoys[pdecoy] += 1
    try:
        psms[psm[0]] = {'decoy': pdecoy, 'svm': psm[1], 'qval': decoys[True]/decoys[False]}  # T-TDC
    except ZeroDivisionError:
        psms[psm[0]] = {'decoy': pdecoy, 'svm': psm[1], 'qval': 1.0}  # T-TDC  
decoys = {'true': 0, 'false': 0}
for svm, pep in sorted([(float(x.find('{%s}svm_score' % ns['xmlns']).text), x) for x in pycolator.generate_peptides('filtprot', ns)], reverse=True, key=lambda x:x[0]):
    decoys[pep.attrib['{%s}decoy' % ns['xmlns']]] += 1
    try:
        [psms[pid.text].update({'pepqval': decoys['true']/decoys['false']}) for pid in pep.find('{%s}psm_ids' % ns['xmlns'])]
    except ZeroDivisionError:
        [psms[pid.text].update({'pepqval': 1.0}) for pid in pep.find('{%s}psm_ids' % ns['xmlns'])]   
oldheader = tsv.get_tsv_header(mzidtsvfns[0])
header = oldheader + ['percolator svm-score', 'PSM q-value', 'peptide q-value']
with open('mzidperco', 'w') as fp:
    fp.write('\t'.join(header))
    for fnix, mzidfn in enumerate(mzidfns):
        mzns = mzidplus.get_mzid_namespace(mzidfn)
        siis = (sii for sir in mzidplus.mzid_spec_result_generator(mzidfn, mzns) for sii in sir.findall('{%s}SpectrumIdentificationItem' % mzns['xmlns']))
        for specidi, psm in zip(siis, tsv.generate_tsv_psms(mzidtsvfns[fnix], oldheader)):
            # percolator psm ID is: samplename_SII_scannr_rank_scannr_charge_rank
            print(specidi)
            print(psm)
            scan, rank = specidi.attrib['id'].replace('SII_', '').split('_')
            outpsm = {k: v for k,v in psm.items()}
            spfile = os.path.splitext(psm['#SpecFile'])[0]
            try:
                percopsm = psms['{fn}_SII_{sc}_{rk}_{sc}_{ch}_{rk}'.format(fn=spfile, sc=scan, rk=rank, ch=psm['Charge'])]
            except KeyError:
                continue
            if percopsm['decoy']:
                continue
            fp.write('\n')
            outpsm.update({'percolator svm-score': percopsm['svm'], 'PSM q-value': percopsm['qval'], 'peptide q-value': percopsm['pepqval']})
            fp.write('\t'.join([str(outpsm[k]) for k in header]))
yafeng commented 2 years ago

I am referring it to @glormph. He contributed to the code of this specific process. Hi, Jorrit! Can you have a look at this issue?

glormph commented 2 years ago

Hi @yafeng, nice to hear from you, hope all is OK! It took me a while before realizing this is not our old ipaw, but a new repo! :D

@ymatts So, the from app.readers import ... line is referring to our msstitch library, which we (in our @lehtiolab/proteogenomics-analysis-workflow ) use inside a conda env in a docker container. Judging from your command line -profile docker,standard \ you also run docker, and I don't see any errors about this line, so it should be fine. But I'm not sure which container you're in (in this repo I don't see a nextflow.config file for example).

Indeed, if you rerun, use bash .command.run, which will setup the nextflow environment you have to run .command.sh. Yu can change code inside the .command.sh if you want, .command.run will use it and also sets up any containers you use etc. In this case .command.sh is a python script and not bash, so the error from: can't read /var/mail/glob is no problem.

The error seems to be that the XML parser encounters a None element when sorting PSMs from percolator in the filtprot file. It does .text to get the text from an <svm_score/> element, and that is instead a None. I'm not sure why, and when you showed the results of head -n20 filtprot it looked fine for that one PSM which was in the first lines. So maybe there's a PSM without a percolator svm score, inside the filtprot file, which would be strange. That file is generated inside the filterPercolator process here: https://github.com/yafeng/pan-cancer-proteogenomics-analysis/blob/master/ipaw.hg38.nf#L295

To look at that PSM, insert this code right after decoys = {True: 0, False: 0} into .command.sh and rerun with bash .command.run:

for p in psms.values():
    try:
        p.find('{%s}svm_score' % ns['xmlns']).text
    except AttributeError:
        print(etree.tostring(p, pretty_print=True))

Then you can see how the PSMs look which are erroring.

ymatts commented 2 years ago

@yafeng Thank you for your response.

@glormph Hello and thank you for your time. I inserted the code you suggested into .command.sh and ran it again.

$bash .command.sh

However, I am getting the following error. I would be very grateful if you could tell me how to deal with it when you have time.

File "work2/43/ea8b0db1c692d9e43736093987c279/.command.sh", line 13, in <module>
    p.find('{%s}svm_score' % ns['xmlns']).text
AttributeError: 'NoneType' object has no attribute 'text'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nagoya-bmhi/projects/proj_abe/pan-cancer-proteogenomics-analysis-master/work2/43/ea8b0db1c692d9e43736093987c279/.command.sh", line 15, in <module>
    print(etree.tostring(p, pretty_print=True))
NameError: name 'etree' is not defined

Also, in response to the comment that it is unclear which container was used, the contents of nextflow.config, Dockerfile and environment.yml are pasted below.

you also run docker, and I don't see any errors about this line, so it should be fine. But I'm not sure which container you're in (in this repo I don't see a nextflow.config file for example).

params {
  //container = 'glormph/ipaw:dev' // Container slug. Stable releases should specify release tag!
  container = 'glormph/ipaw:0.4' // Container slug. Stable releases should specify release tag!

  outdir = './results'
  tracedir = "${params.outdir}/pipeline_info"
  awsqueue = false
  awsregion = 'eu-west-1'
  // VarDB v.1 headers
  novheaders = '^PGOHUM;^lnc;^decoy_PGOHUM;^decoy_lnc'
  varheaders = '^COSMIC;^CanProVar;^decoy_COSMIC;^decoy_CanProVar'
}

profiles {

  standard {
    includeConfig 'conf/base.config'
  }
  conda { process.conda = "$baseDir/environment.yml" }
  docker {
    docker.enabled = true
    docker.fixOwnership = true
    docker.runOptions = '-u $(id -u):$(id -g)'
  }
  singularity {
    singularity.enabled = true
  }
  lehtio { 
    includeConfig 'conf/base.config'
    includeConfig 'conf/lehtio.config'
  }
  uppmax { 
    includeConfig 'conf/base.config'
    includeConfig 'conf/uppmax.config'
  }
  testing {
    includeConfig 'conf/base.config'
  }
}

// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
  if(type == 'memory'){
    try {
      if(obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
        return params.max_memory as nextflow.util.MemoryUnit
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max memory '${params.max_memory}' is not valid! Using default value: $obj"
      return obj
    }
  } else if(type == 'time'){
    try {
      if(obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
        return params.max_time as nextflow.util.Duration
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max time '${params.max_time}' is not valid! Using default value: $obj"
      return obj
    }
  } else if(type == 'cpus'){
    try {
      return Math.min( obj, params.max_cpus as int )
    } catch (all) {
      println "   ### ERROR ###   Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
      return obj
    }
  }
}

COPY envs /envs/ RUN conda env create -f /envs/environment.yml && conda clean -a

RUN git clone https://github.com/yafeng/SpectrumAI /SpectrumAI RUN cd /SpectrumAI && git pull && git reset --hard b8e7001807d834db633c30d265ef6e8361cdcb3c

ENV PATH /opt/conda/envs/ipaw-0.4/bin:$PATH


- enviornment.yml:

name: ipaw-0.4 channels:

I also briefly parsed the filtprot and checked the svm_score. In particular, no missing, nulls, etc. were found, and at first glance there appears to be no problem.

A simple table of filtprot:. filtprot_table.txt

R script written to check:.

setwd("work2/43/ea8b0db1c692d9e43736093987c279/")
# scan row by row
rawdat <- scan("filtprot",what = "character",sep = "\n")

# convert table psm
psm_st <- grep("^<psm",rawdat)
psm_ed <- grep("</psm>",rawdat)

length(psm_st)==length(psm_ed) # check whether there is any missing tag
psm_li <- vector("list",length(psm_st))
for(i in seq_along(psm_st)){
  tmp <- gsub(" ","",rawdat[psm_st[i]:psm_ed[i]])
  psm <- tmp[grep("^<psm",tmp)]
  svmscore <- tmp[grep("^<svm_score",tmp)]
  q_value <- tmp[grep("^<q_value",tmp)]
  disp <- c(psm,svmscore,q_value)
  psm_li[[i]] <- disp
}
psm_df <- do.call(rbind,psm_li)
psm_df

# convert table peptide
pep_st <- grep("^<peptide",rawdat)
pep_ed <- grep("</peptide>",rawdat)

length(pep_st)==length(pep_ed) # check whether there is any missing tag

pep_li <- vector("list",length(pep_st))
for(i in seq_along(pep_st)){
  tmp <- gsub(" ","",rawdat[pep_st[i]:pep_ed[i]])
  pep <- tmp[grep("^<peptide",tmp)]
  svmscore <- tmp[grep("^<svm_score",tmp)]
  q_value <- tmp[grep("^<q_value",tmp)]
  disp <- c(pep,svmscore,q_value)
  pep_li[[i]] <- disp
}
pep_df <- do.call(rbind,pep_li)

output <- data.frame(rbind(psm_df,pep_df))
colnames(output) <- c("header","svm_score","q_value")
output[,2] <- gsub("<svm_score>|</svm_score>","",output[,2])
output[,3] <- gsub("<q_value>|</q_value>","",output[,2])
output[,2] <- as.numeric(output[,2])
output[,3] <- as.numeric(output[,3])

all(!is.na(output[,2])) # TRUE
all(!is.nan(output[,2])) #TRUE
all(!is.null(output[,2])) # TRUE

write.table(output,file = "filtprot_table.txt",sep="\t",row.names = FALSE,col.names = TRUE)

Best regards Yusuke

yafeng commented 2 years ago

@glormph Glad to hear your response too! This workflow was modified based it your previous version. I used it to search public datasets. Regarding the svmtoTSV process, i have not seen this error before. @ymatts Since you have the input file, I suggest to copy the python script and run it in standalone version, such as python3 svmtoTSV.py input.id output.tsv

just to see if the error is related to nextflow environment or python library/version ... @glormph may have better advice, he is a more experienced nextflow player :)

glormph commented 2 years ago

Thanks, now I understand it a bit better! The run environment seems to be more or less the same as in our old pipeline. It's still a strange error though. Even if we find the problem, it can have been caused in another NF process upstream, and we may need to look there too :(

As for the new error message from the code we added, etree needs importing, so you can add the line from lxml import etree to the top of the "extra" python code and run bash .command.run again.

ymatts commented 2 years ago

@yafeng Thank you so much for your time. I am trying to attempt python standalone as you suggested, but I am stuck on elementary errors with msstitch as I have been able to install it but cannot resolve the error on import.

As you pointed out , I need to determine if this is a python or NX-specific issue, so now I am continuing to work on addressing the error.

When I executed pip3 install msstitch

Requirement already satisfied: msstitch in **/.local/lib/python3.6/site-packages
Requirement already satisfied: numpy in **/.local/lib/python3.6/site-packages (from msstitch)
Requirement already satisfied: biopython in **/.local/lib/python3.6/site-packages (from msstitch)
Requirement already satisfied: lxml in **/.local/lib/python3.6/site-packages (from msstitch)

,but I cannot import the module in python:

>>> import msstitch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'msstitch'
>>> export PYTHONPATH=*/.local/lib/python3.6/site-packages
  File "<stdin>", line 1
    export PYTHONPATH=*/.local/lib/python3.6/site-packages
           ^
SyntaxError: invalid syntax

@glormph Thank you again for your time.

I am new to NX and apologize for the hassle, but if there is an efficient way to proceed in dealing with the error, please let me know and I will follow your instructions.

Even if we find the problem, it can have been caused in another NF process upstream, and we may need to look there too :(

Now, I added the following line to .comman.sh as you instructed and ran it

from lxml import etree
for p in psms.values():
    try:
        p.find('{%s}svm_score' % ns['xmlns']).text
    except AttributeError:
        print(etree.tostring(p, pretty_print=True))

The output is as follows error.txt

I can't interpret it, but when I looked at the entire line of the file, there was what appeared to be a partially blank line (at the end of line 3 in the example below).

b'<ns40242:psm xmlns:ns40242="http://per-colator.com/percolator_out/15"/>\n'
b'<ns40243:psm xmlns:ns40243="http://per-colator.com/percolator_out/15"/>\n'
b'<ns40252:psm xmlns:ns40252="http://per-colator.com/percolator_out/15"/>\n \n\n\n\n'
b'<ns40261:psm xmlns:ns40261="http://per-colator.com/percolator_out/15"/>Computer'
b'<ns40271:psm xmlns:ns40271="http://per-colator.com/percolator_out/15"/>Computer'
b'<ns40282:psm xmlns:ns40282="http://per-colator.com/percolator_out/15"/>\n'

Is it possible to identify the cause and fix it from here? I would appreciate a reply when you have time.

Best regards Yusuke

glormph commented 2 years ago

This is strange and I'm not sure where the error is. Now it looks like there are PSMs in the perco.xml input file to this process without svm score, or without much at all. Can you attach the filtprot file which contains the percolator XML?