Open mgeerdsen opened 1 year ago
Yeah, this sounds like something we should look at.
I was wondering if we should move everything to the IA CLI. It's a bit flaky: it drops the connection on a handful of items from time to time (I think because of IA servers going down), but those items are usually easy to rerun. The benefit of only having to support one harvest method would probably outweigh this.
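If we went CLI-only, the occasional dropped connection could be absorbed by a small retry wrapper rather than manual reruns. A rough sketch, not what the harvester currently does: `run_with_retries` and `harvest` are hypothetical names, the retry count and delay are made-up defaults, and it assumes `ia download <identifier>` exits non-zero when a download fails, which would need checking.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=5.0, runner=subprocess.run):
    """Run cmd, retrying when it exits non-zero.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError once all attempts are used up. `runner` is injectable so
    the retry logic can be exercised without network access.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give a flaky server a moment before the next try.
            time.sleep(delay)
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")


def harvest(identifier, **kwargs):
    # Hypothetical helper: assumes the `ia` CLI is installed and on PATH.
    return run_with_retries(["ia", "download", identifier], **kwargs)
```

Something like `harvest("b32743725", attempts=5)` would then rerun a flaky item a few times before giving up, and anything that still fails could be collected for a manual rerun.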
The errors look like this, just for the record:
stdout: b'2023-05-31 07:30:56,470 - internetarchive.session - DEBUG - no metadata provided for "b32743725", retrieving now.\n2023-05-31 07:30:56,472 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443\n2023-05-31 07:30:58,919 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /metadata/b32743725 HTTP/1.1" 200 None\n2023-05-31 07:30:58,923 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443\n2023-05-31 07:31:00,562 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b32743725/b32743725_hocr.html HTTP/1.1" 302 None\n2023-05-31 07:31:00,563 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): ia902600.us.archive.org:443\n2023-05-31 07:31:02,188 - urllib3.connectionpool - DEBUG - https://ia902600.us.archive.org:443 "GET /7/items/b32743725/b32743725_hocr.html HTTP/1.1" 200 None\n2023-05-31 07:31:03,408 - internetarchive.files - INFO - downloaded b32743725/b32743725_hocr.html to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/84e4af94-ca4c-4de6-824a-047ec8244648/b32743725_hocr.html\n2023-05-31 07:31:03,410 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443\n2023-05-31 07:31:05,123 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b32743725/b32743725_jp2.zip HTTP/1.1" 302 None\n2023-05-31 07:31:05,125 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): ia802600.us.archive.org:443\n2023-05-31 07:31:06,045 - urllib3.connectionpool - DEBUG - https://ia802600.us.archive.org:443 "GET /7/items/b32743725/b32743725_jp2.zip HTTP/1.1" 200 151951512\n2023-05-31 07:32:29,953 - internetarchive.files - INFO - downloaded b32743725/b32743725_jp2.zip to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/84e4af94-ca4c-4de6-824a-047ec8244648/b32743725_jp2.zip\n2023-05-31 07:32:29,956 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443\n2023-05-31 
07:32:31,452 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b32743725/b32743725_meta.xml HTTP/1.1" 302 None\n2023-05-31 07:32:31,453 - urllib3.connectionpool - DEBUG - Resetting dropped connection: ia802600.us.archive.org\n2023-05-31 07:32:32,060 - urllib3.connectionpool - DEBUG - https://ia802600.us.archive.org:443 "GET /7/items/b32743725/b32743725_meta.xml HTTP/1.1" 200 None\n2023-05-31 07:32:32,061 - internetarchive.files - INFO - downloaded b32743725/b32743725_meta.xml to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/84e4af94-ca4c-4de6-824a-047ec8244648/b32743725_meta.xml\n2023-05-31 07:32:32,063 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443\n2023-05-31 07:32:33,650 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b32743725/b32743725_scandata.xml HTTP/1.1" 302 None\n2023-05-31 07:32:33,652 - urllib3.connectionpool - DEBUG - Resetting dropped connection: ia902600.us.archive.org\n2023-05-31 07:32:34,960 - urllib3.connectionpool - DEBUG - https://ia902600.us.archive.org:443 "GET /7/items/b32743725/b32743725_scandata.xml HTTP/1.1" 200 None\n2023-05-31 07:32:34,962 - internetarchive.files - INFO - downloaded b32743725/b32743725_scandata.xml to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/84e4af94-ca4c-4de6-824a-047ec8244648/b32743725_scandata.xml\n2023-05-31 07:32:42,576 - internetarchive.session - DEBUG - no metadata provided for "b3283441x", retrieving now.\n2023-05-31 07:32:42,577 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443\n2023-05-31 07:32:44,987 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /metadata/b3283441x HTTP/1.1" 200 None\n2023-05-31 07:32:44,991 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443\n2023-05-31 07:32:46,578 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b3283441x/b3283441x_hocr.html HTTP/1.1" 302 None\n2023-05-31 
07:32:46,580 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): ia802600.us.archive.org:443\n2023-05-31 07:32:47,464 - urllib3.connectionpool - DEBUG - https://ia802600.us.archive.org:443 "GET /34/items/b3283441x/b3283441x_hocr.html HTTP/1.1" 200 None\n2023-05-31 07:32:48,211 - internetarchive.files - INFO - downloaded b3283441x/b3283441x_hocr.html to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/ae739ce4-2f46-42b5-a38d-648b451ef200/b3283441x_hocr.html\n2023-05-31 07:32:48,213 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443\n2023-05-31 07:32:49,558 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b3283441x/b3283441x_jp2.zip HTTP/1.1" 302 None\n2023-05-31 07:32:49,559 - urllib3.connectionpool - DEBUG - Resetting dropped connection: ia802600.us.archive.org\n2023-05-31 07:32:50,227 - url
The parts involved in IA harvesting pre-date the cloud migration, and we should re-evaluate the way they work.
some ideas:
possible connections to #350 and #451