zouzanyan / douyin_crawl

Batch crawler for Douyin videos
GNU General Public License v3.0
76 stars, 21 forks

Script stops with an error partway through downloading #9

Open 8kcar opened 2 months ago

8kcar commented 2 months ago

System: Windows 7. Version: Python 3.8.8

While downloading videos, the script stopped at 37% and raised the following error:

Enter the user link here (type exit to quit): https://www.douyin.com/user/MS4wLjABAAAA1CMJ5lw4Gnb62bb19gEt80PR5heH9fJV2F7pcXA9fVM
Number of videos: 180
Number of images: 2
Starting download to local folder MS4wLjABAAAA1CMJ5lw4Gnb62bb19gEt80PR5heH9fJV2F7pcXA9fVM...
Download progress: 1%|▏ | 1/182 [00:06<18:19, 6.08s/file]
........
Download progress: 37%|██████████▊ | 68/182 [06:23<10:42, 5.63s/file]

Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\response.py", line 737, in _error_catcher
    yield
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\response.py", line 883, in _raw_read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
urllib3.exceptions.IncompleteRead: IncompleteRead(9470049 bytes read, 7576267 more expected)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\response.py", line 1043, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\response.py", line 963, in read
    data = self._raw_read(amt)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\response.py", line 891, in _raw_read
    self._fp.close()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\response.py", line 761, in _error_catcher
    raise ProtocolError(arg, e) from e
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(9470049 bytes read, 7576267 more expected)', IncompleteRead(9470049 bytes read, 7576267 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Administrator\douyin\run.py", line 150, in <module>
    crawl_media(user_input)
  File "C:\Users\Administrator\douyin\run.py", line 104, in crawl_media
    download_media(session, sec_uid, video_list, picture_list)
  File "C:\Users\Administrator\douyin\run.py", line 121, in download_media
    for chunk in response.iter_content(chunk_size=8192):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\models.py", line 818, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(9470049 bytes read, 7576267 more expected)', IncompleteRead(9470049 bytes read, 7576267 more expected))
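
For anyone reading the exception: the two numbers inside IncompleteRead add up to the size the server announced for the file. The sketch below is only an illustration; `parse_incomplete_read` is a hypothetical helper name, and it assumes the message keeps the exact format shown in the traceback above.

```python
import re


def parse_incomplete_read(message):
    """Return (bytes_read, total_bytes, fraction_received) from the message."""
    match = re.search(
        r"IncompleteRead\((\d+) bytes read, (\d+) more expected\)", message
    )
    if match is None:
        raise ValueError("not an IncompleteRead message")
    bytes_read = int(match.group(1))
    remaining = int(match.group(2))
    total = bytes_read + remaining  # size announced by the server
    return bytes_read, total, bytes_read / total


msg = "IncompleteRead(9470049 bytes read, 7576267 more expected)"
bytes_read, total, fraction = parse_incomplete_read(msg)
print(f"{bytes_read} of {total} bytes received ({fraction:.1%})")
```

Here the total is 17046316 bytes, so the connection was cut a little more than halfway through one file of about 17 MB. That suggests an interrupted transfer rather than a problem in the loop that writes the chunks.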

8kcar commented 2 months ago

I changed the code to add a retry mechanism:

import os

import requests
from tqdm import tqdm

import my_util  # helper module from this repository


def download_media(session: requests.Session, sec_uid, video_list, picture_list):
    # Create the per-user folder if needed and work inside it
    if not os.path.exists(sec_uid):
        os.mkdir(sec_uid)
    os.chdir(sec_uid)

    def download_with_retries(url, file_name, file_ext, retries=3):
        """Download one file, retrying when the transfer fails."""
        for attempt in range(retries):
            try:
                with session.get(url, stream=True, timeout=(10, 30)) as response:
                    if response.status_code == 200:
                        with open(f'{file_name}.{file_ext}', "wb") as f:
                            for chunk in response.iter_content(chunk_size=8192):
                                if chunk:
                                    f.write(chunk)
                        return True  # download succeeded
                    else:
                        print(f"Network error, status code: {response.status_code}")
            except requests.exceptions.ChunkedEncodingError:
                print(f"ChunkedEncodingError: retrying {attempt+1}/{retries}...")
            except requests.exceptions.RequestException as e:
                print(f"Request error: {e}, retrying {attempt+1}/{retries}...")
        # Every attempt failed
        print(f"Download failed, could not complete file: {file_name}")
        return False

    with tqdm(total=len(video_list) + len(picture_list), desc="Download progress", unit="file") as pbar:

        for i in video_list:
            des = i[0]
            url = i[1]
            file_name = my_util.sanitize_filename(des)
            success = download_with_retries(url, file_name, "mp4")
            if success:
                pbar.update(1)  # this file is done

        for i in picture_list:
            url = i
            file_name = my_util.IDGenerator.generate_unique_id()
            success = download_with_retries(url, file_name, "jpg")
            if success:
                pbar.update(1)  # this file is done

    print('All of the user\'s videos and images have been downloaded')
    os.chdir('..')

zouzanyan commented 2 months ago

After adding the retries, does the connection error still occur?

8kcar commented 1 month ago

No, the problem no longer occurs.
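
One possible refinement of the retry function posted earlier in this thread: when an attempt breaks off mid-transfer, the partly written .mp4 stays on disk. The sketch below writes each attempt to a temporary file and moves it into place only after the last chunk arrives. It is not code from this repository. `download_atomically` and its `fetch_chunks` parameter are hypothetical names, and the function retries on `OSError` on the assumption that the failure surfaces as one (the `requests` exceptions seen in the traceback derive from it).

```python
import os
import tempfile
import time


def download_atomically(fetch_chunks, dest_path, retries=3, delay=1.0):
    """Write the chunks from fetch_chunks() to dest_path, retrying on OSError.

    fetch_chunks is a hypothetical zero-argument callable returning an
    iterable of bytes; every retry calls it again to restart the transfer.
    Returns True on success and False once all attempts have failed.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    for attempt in range(1, retries + 1):
        # Temporary file in the target folder, so the final move stays on one disk
        fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
        try:
            with os.fdopen(fd, "wb") as f:
                for chunk in fetch_chunks():
                    if chunk:
                        f.write(chunk)
            # Reached only when every chunk was written
            os.replace(tmp_path, dest_path)
            return True
        except OSError as exc:
            print(f"Attempt {attempt}/{retries} failed: {exc}")
        finally:
            # Remove the partial file; after a successful move it is already gone
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer before each retry
    return False
```

To use it with the session from the thread, `fetch_chunks` could be a small function that calls `session.get(url, stream=True, timeout=(10, 30))` and returns `response.iter_content(chunk_size=8192)`. Note that catching `OSError` also retries local errors such as a full disk, so a narrower exception type may be preferable in practice.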