xkumiyu / imagenet-downloader

Downloader from ImageNet Image URLs
http://www.kumilog.net/entry/imagenet-download
MIT License

Rate limiting with the suggested download script #3

Open nathanin opened 5 years ago

nathanin commented 5 years ago

Hi, this is very helpful now that the ImageNet website seems to be down. It's been several months and they still haven't granted me direct download access. So, URLs are the way to go.

While the suggested download script works well, I found that opening so many wget requests at once crashed my home internet connection. The receive rate skyrockets to the cap, and I get through about 3 GB of download before the connection dies. Instead, I use GNU parallel to limit the number of concurrent downloads. The changes are to keep the wget requests in the foreground in download.sh, and to strip the double quotes from list/urllist.txt; then run the two commands that follow the script:

download.sh:

#!/bin/sh

# Expect exactly two arguments: the output path and the URL.
if [ $# -ne 2 ]; then
  exit 1
fi

# original line
# wget $2 -O $1 -T 1 -t 5 -nc -b -a wget.log

# new line: no -b, so wget stays in the foreground and parallel can cap concurrency
wget "$2" -O "$1" -T 1 -t 5 -nc

Then run:

sed 's/\"//g' list/urllist.txt > list/urllist_noquote.txt
cat list/urllist_noquote.txt | parallel --jobs 12 --colsep ' ' ./download.sh {1} {2}

It's slower, yes, but for people on a limited connection this way lets you keep working during the download :)
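For anyone without GNU parallel, a similar cap on concurrency can probably be had with xargs. The following is only a sketch under assumptions: the two-column "path" "url" list format is inferred from the commands above, the demo_xargs/ directory and fake_download.sh stub are made up for illustration, and -P is a common xargs extension (GNU and BSD) rather than part of POSIX. The stub only records its arguments, so nothing is downloaded.

```shell
#!/bin/sh
# Sketch only: demo_xargs/ and fake_download.sh are hypothetical names.
mkdir -p demo_xargs

# Two sample entries in the assumed "path" "url" format.
cat > demo_xargs/urllist.txt <<'EOF'
"img/a.jpg" "http://example.com/a.jpg"
"img/b.jpg" "http://example.com/b.jpg"
EOF

# Strip the double quotes, as in the sed step above.
sed 's/"//g' demo_xargs/urllist.txt > demo_xargs/urllist_noquote.txt

# Stub standing in for download.sh: it only echoes what it would fetch.
cat > demo_xargs/fake_download.sh <<'EOF'
#!/bin/sh
echo "would fetch $2 -> $1"
EOF

# -n 2 hands two fields (path, URL) to each call; -P 4 runs at most 4 calls at once.
xargs -n 2 -P 4 sh demo_xargs/fake_download.sh \
  < demo_xargs/urllist_noquote.txt > demo_xargs/calls.log

cat demo_xargs/calls.log
```

With the real script the last step would look something like `xargs -n 2 -P 12 ./download.sh < list/urllist_noquote.txt`. One caveat: xargs gives quotes and backslashes special meaning, so URLs containing those characters may need extra care.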

DandiC commented 5 years ago

Hello @nathanin, I am running into a similar problem when I run the download script. I tried your method to see if it solved my problem, but I keep getting these messages for each line of the download list:

: not foundsh: 2: ./download.sh:
./download.sh: 13: ./download.sh: Syntax error: end of file unexpected (expecting "then")

Do you know what might be causing these errors? I am not very familiar with Unix commands and I can't figure out what is preventing me from executing your modified download script.

Thank you!

nathanin commented 5 years ago

@DandiC Can you paste the whole command? Maybe you have to modify the permissions on download.sh, or change the first line to #!/bin/bash.

Alternatively, Kaggle hosts the ImageNet data with bounding box annotation and the original class information. That's actually where I ended up downloading from.

DandiC commented 5 years ago

@nathanin The command that I'm using is the one you provided:

sed 's/\"//g' list/urllist.txt > list/urllist_noquote.txt
cat list/urllist_noquote.txt | parallel --jobs 12 --colsep ' ' ./download.sh {1} {2}

As far as I can tell, the first one works correctly but the second one gives me the error I mentioned before. Here is an extended version of my output:

~/imagenet-downloader-master$ cat list/urllist_noquote.txt | parallel --jobs 12 --colsep ' ' ./download.sh {1} {2}
: not foundsh: 2: ./download.sh:
./download.sh: 13: ./download.sh: Syntax error: end of file unexpected (expecting "then")
(the same two lines repeat 12 more times)

If I change the first line of the downloading script to #!/bin/bash, then I get this error:

~/imagenet-downloader-master$ cat list/urllist_noquote.txt | parallel --jobs 12 --colsep ' ' ./download.sh {1} {2}
./download.sh: line 2: $'\r': command not found
./download.sh: line 13: syntax error: unexpected end of file
(the same two lines repeat 10 more times)

Thanks for pointing out Kaggle; I will check it out if I can't make this code work.

nathanin commented 5 years ago

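Check your line endings; that's the syntax error in your download.sh. The $'\r' in the bash output is a carriage return, which suggests the file was saved with Windows-style (CRLF) line endings.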
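One way to check for stray carriage returns and remove them, sketched on a throwaway file rather than the real script. The names demo_crlf.sh and demo_lf.sh are made up for this example; for the actual file you would substitute download.sh and write the cleaned copy to a new name before replacing the original.

```shell
#!/bin/sh
# Sketch only: demo_crlf.sh is a throwaway file written with CRLF line endings.
printf '#!/bin/sh\r\nif [ 1 -eq 1 ]; then\r\n  echo ok\r\nfi\r\n' > demo_crlf.sh

# Count the lines that end in a carriage return (0 would mean the file is clean).
cr=$(printf '\r')
grep -c "${cr}\$" demo_crlf.sh

# Write a cleaned copy with every carriage return removed.
tr -d '\r' < demo_crlf.sh > demo_lf.sh

# The cleaned copy should now parse and run.
sh demo_lf.sh
```

Tools such as dos2unix do the same job where they are installed; tr is used here only because it is available almost everywhere.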

DandiC commented 5 years ago

Thanks! I rewrote the download script and that made it work. Now I'm getting some 403 and 404 errors, but I'm pretty sure that's the fault of ImageNet. I also checked Kaggle, but it seems like they don't have the dataset public anymore. Hopefully, the images that I manage to download using this method are enough for what I want.
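A possible follow-up for the 403 and 404 cases: as far as I know, wget -O creates the output file before the request finishes, so a failed URL tends to leave a zero-byte file behind, and with -nc a later run may then refuse to retry it. The sketch below lists and deletes empty files. The demo_img/ directory and its contents are placeholders created for the example, not the repository's real output location.

```shell
#!/bin/sh
# Sketch only: demo_img/ and its files are placeholders for illustration.
mkdir -p demo_img
printf 'fake image bytes' > demo_img/good.jpg
: > demo_img/failed.jpg   # empty file, like one left by a failed download

# List the empty files first, to see what would be removed.
find demo_img -type f -size 0 -print

# Then delete them, leaving non-empty files untouched.
find demo_img -type f -size 0 -exec rm -f {} +
```

Pointing the two find commands at the real image directory would do the same cleanup there. It is worth running the -print form on its own first.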