zerospeech / benchmarks

A command-line tool that helps use the "Zero Resource Challenge" benchmarks
https://zerospeech.com/toolbox/
GNU General Public License v3.0

Dataset downloads often timeout #32

Closed ewan closed 11 months ago

ewan commented 11 months ago

The dataset installation often cannot finish on large downloads due to network issues. Specifically, the following command:

zrc datasets:pull -u zrc2017-test-dataset  

typically terminates before the download has finished. This results in one of two errors. Either there is an exception of the following kind:

...
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(7989262420 bytes read, 481286955 more expected)', IncompleteRead(7989262420 bytes read, 481286955 more expected))

or else the code passes to the MD5Sum check and fails due to the incomplete download.

There should be either more robust download code that can resume partially completed downloads or, failing this, a straightforward way to install datasets from offline downloads.
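To illustrate what I mean by resuming, here is a minimal sketch using HTTP Range requests. It is not the toolbox's code: `range_header` and `download_with_resume` are hypothetical names, it uses the standard library's `urllib` rather than `requests` so that it is self-contained, and it assumes the server honours `Range` headers.

```python
import http.client
import urllib.error
import urllib.request
from pathlib import Path


def range_header(dest: Path) -> dict:
    """Return a Range header asking only for the bytes not yet on disk."""
    done = dest.stat().st_size if dest.exists() else 0
    return {"Range": f"bytes={done}-"} if done else {}


def download_with_resume(url: str, dest: Path, retries: int = 5,
                         chunk: int = 1 << 20) -> None:
    """Hypothetical helper: retry a download, resuming from the partial file."""
    for _ in range(retries):
        req = urllib.request.Request(url, headers=range_header(dest))
        try:
            with urllib.request.urlopen(req, timeout=60) as resp:
                # 206: the server honoured the Range, so append.
                # 200: the server sent the whole file again, so overwrite.
                mode = "ab" if resp.status == 206 else "wb"
                with open(dest, mode) as out:
                    while block := resp.read(chunk):
                        out.write(block)
            return
        except urllib.error.HTTPError as err:
            if err.code == 416:  # requested range is empty: file already complete
                return
            raise
        except (OSError, http.client.HTTPException):
            # Connection dropped mid-transfer; bytes already written stay on
            # disk, so the next attempt continues from there.
            continue
    raise RuntimeError(f"download of {url} did not finish after {retries} attempts")
```

The checksum should still be verified afterwards, since a resumed file is only as trustworthy as the server's Range support.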

The datasets:import command, which seems like it could do the latter, does not currently work. In addition to printing a warning about being untested, it crashes with an exception when running the following command (where the last argument is the name of the directory into which zrc2017-test-dataset.zip was extracted):

zrc datasets:import zrc2017-test-dataset zrc2017-test-dataset

Here is the traceback:

Traceback (most recent call last):
  File "/home/emd/python-env/zrc/bin/zrc", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/zerospeech/startup.py", line 39, in main
    cli.run()
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/zerospeech/cmd/cli_lib.py", line 258, in run
    cmd.run_cmd(argv=sys.argv[2:])
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/zerospeech/cmd/cli_lib.py", line 91, in run_cmd
    self.run(args)
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/zerospeech/cmd/datasets.py", line 91, in run
    dataset.import_(location=Path(argv.source), quiet=argv.quiet, show_progress=True)
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/zerospeech/model/datasets.py", line 159, in import_
    download_file(url=self.origin.install_config, dest=(self.location / "install_config.json"))
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/zerospeech/misc.py", line 167, in download_file
    response = requests.get(url, allow_redirects=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/requests/sessions.py", line 575, in request
    prep = self.prepare_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/requests/sessions.py", line 486, in prepare_request
    p.prepare(
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/requests/models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "/home/emd/python-env/zrc/lib/python3.11/site-packages/requests/models.py", line 439, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL 'None': No scheme supplied. Perhaps you meant https://None?

nhamilakis commented 11 months ago

The datasets:import function was only supposed to work with the 2015 dataset. We have since decided to make that benchmark read-only, and I forgot to remove it from the code.

I have never had a timeout during download, but I could repurpose the import command to allow importing a .zip file directly in case the download fails or times out. Users could then download the zip directly from the URL and use the import command to install it.

The MD5 check failure was an error on my part: I had updated the content of the .zip archive without updating the MD5 checksum. This has been fixed.

Note also that the datasets:pull command has an option to skip MD5 checks: -u / --skip-verification.
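As a rough sketch of what a zip-based import could do (this is an assumption about the shape of the feature, not the code that was eventually committed; `md5sum` and `import_zip` are hypothetical names), the archive can be checksummed in chunks and then extracted into the dataset directory:

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Optional


def md5sum(path: Path, chunk: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        while block := fh.read(chunk):
            digest.update(block)
    return digest.hexdigest()


def import_zip(archive: Path, target: Path,
               expected_md5: Optional[str] = None) -> None:
    """Hypothetical import: optionally verify the archive, then extract it."""
    # Passing expected_md5=None plays the role of skipping verification.
    if expected_md5 is not None and md5sum(archive) != expected_md5:
        raise ValueError(f"MD5 mismatch for {archive}")
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
```

Because this path never touches the network, a user who fetched the archive with a browser or a resumable downloader would not hit the timeout at all.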

ewan commented 11 months ago

I tried skipping the MD5 check, but I can't be sure whether that solved the issue, as the dataset had another issue that kept me from using it. In any case, most of the time the script crashed with a network error rather than getting through to the MD5 check. So yes, it is worth adding an option to import the dataset from a .zip in case the download doesn't work.


nhamilakis commented 11 months ago

Added import for the other download functions: 0a80ddb