wwood / kingfisher-download

Easier download/extract of FASTA/Q read data and metadata from the ENA, NCBI, AWS or GCP.
https://wwood.github.io/kingfisher-download
GNU General Public License v3.0

kingfisher annotate error when accession has a blank attribute #23

Closed AroneyS closed 2 years ago

AroneyS commented 2 years ago

For example, ERR2178284 has no value for "# of Spots".

Command:

kingfisher annotate -r ERR2178284 -f tsv > kingfisher_metadata.tsv

Error:

10/18/2022 01:51:23 PM INFO: Kingfisher v0.0.1-dev
10/18/2022 01:51:23 PM INFO: Querying NCBI esearch for 1 distinct accessions e.g. ERR2178284
10/18/2022 01:51:25 PM INFO: Querying NCBI efetch for 1 distinct IDs e.g. 5212983
Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/conda/envs/kingfisher/bin/kingfisher", line 290, in <module>
    main()
  File "/mnt/hpccs01/work/microbiome/conda/envs/kingfisher/bin/kingfisher", line 275, in main
    kingfisher.annotate(
  File "/mnt/hpccs01/work/microbiome/sw/kingfisher-download/bin/../kingfisher/__init__.py", line 554, in annotate
    metadata = SraMetadata().efetch_sra_from_accessions(run_identifiers)
  File "/mnt/hpccs01/work/microbiome/sw/kingfisher-download/bin/../kingfisher/sra_metadata.py", line 207, in efetch_sra_from_accessions
    metadata = self.efetch_metadata_from_ids(webenv, accessions, len(sra_ids))
  File "/mnt/hpccs01/work/microbiome/sw/kingfisher-download/bin/../kingfisher/sra_metadata.py", line 142, in efetch_metadata_from_ids
    d2['spots'] = try_get(lambda: int(run.attrib['total_spots']))
  File "/mnt/hpccs01/work/microbiome/sw/kingfisher-download/bin/../kingfisher/sra_metadata.py", line 79, in try_get
    return func()
  File "/mnt/hpccs01/work/microbiome/sw/kingfisher-download/bin/../kingfisher/sra_metadata.py", line 142, in <lambda>
    d2['spots'] = try_get(lambda: int(run.attrib['total_spots']))
KeyError: 'total_spots'

AroneyS commented 2 years ago

Should this error be caught by: https://github.com/wwood/kingfisher-download/blob/cd7b2ed0c2488f10b91a1cf26ad3728ca26eba09/kingfisher/sra_metadata.py#L114 ?
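
For context, here is a minimal sketch of a helper that would swallow this kind of lookup failure. It is a hypothetical stand-in for `try_get`, not the code at the linked line, and the `<RUN>` element is a made-up minimal example rather than real efetch output.

```python
import xml.etree.ElementTree as ET


def try_get(func):
    """Return func(), or None if the lookup or conversion fails.

    Hypothetical implementation for illustration; the real helper in
    kingfisher/sra_metadata.py may catch different exceptions.
    """
    try:
        return func()
    except (KeyError, ValueError, TypeError):
        return None


# Made-up minimal record with no total_spots attribute.
run = ET.fromstring('<RUN accession="ERR2178284"/>')

spots = try_get(lambda: int(run.attrib['total_spots']))
print(spots)  # None
```

With a helper like this, the missing attribute produces `None` instead of a `KeyError`, so any later code that reads the value has to cope with `None`.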

AroneyS commented 2 years ago

I think I was using an old version. With the current commit (cd7b2ed0c2488f10b91a1cf26ad3728ca26eba09), I get a different error. It looks like the missing attribute is now caught, but the resulting None is then used in a calculation.
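
One possible guard, as a sketch only: pass missing values through instead of dividing them. The function name `bases_to_gbp` and the sample values are hypothetical, and this is not necessarily how kingfisher itself handles the case.

```python
def bases_to_gbp(bases):
    """Convert a base count to Gbp, passing None through for missing values."""
    if bases is None:
        return None
    return round(bases / 1e9, 3)


# Illustrative input: one run with a base count, one without.
base_counts = [1_500_000_000, None]
gbp = [bases_to_gbp(b) for b in base_counts]
print(gbp)  # [1.5, None]
```

Keeping `None` in the column rather than substituting 0 preserves the distinction between "no data reported" and "zero bases".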

Command/Error:

kingfisher annotate -r ERR2178284 -f tsv > sra_20221018_kingfisher_metadata.tsv
10/18/2022 03:44:18 PM INFO: Kingfisher v0.0.1-dev
10/18/2022 03:44:18 PM INFO: Querying NCBI esearch for 1 distinct accessions e.g. ERR2178284
10/18/2022 03:44:20 PM INFO: Querying NCBI efetch for 1 distinct IDs e.g. 5212983
Traceback (most recent call last):
  File "/home/aroneys/src/kingfisher-download/bin/kingfisher", line 292, in <module>
    main()
  File "/home/aroneys/src/kingfisher-download/bin/kingfisher", line 276, in main
    kingfisher.annotate(
  File "/mnt/hpccs01/scratch/microbiome/aroneys/src/kingfisher-download/bin/../kingfisher/__init__.py", line 559, in annotate
    _output_formatted_metadata(metadata, output_file, output_format, all_columns)
  File "/mnt/hpccs01/scratch/microbiome/aroneys/src/kingfisher-download/bin/../kingfisher/__init__.py", line 618, in _output_formatted_metadata
    metadata_sorted = prepare_for_tsv_csv(metadata, default_columns, all_columns)
  File "/mnt/hpccs01/scratch/microbiome/aroneys/src/kingfisher-download/bin/../kingfisher/__init__.py", line 578, in prepare_for_tsv_csv
    pd.DataFrame({'Gbp': [round(bases/1e9, 3) for bases in metadata_sorted[BASES_KEY]]})
  File "/mnt/hpccs01/scratch/microbiome/aroneys/src/kingfisher-download/bin/../kingfisher/__init__.py", line 578, in <listcomp>
    pd.DataFrame({'Gbp': [round(bases/1e9, 3) for bases in metadata_sorted[BASES_KEY]]})
TypeError: unsupported operand type(s) for /: 'NoneType' and 'float'

wwood commented 2 years ago

Fixed in 73ddf62 - thanks for the report.