wistbean / learn_python3_spider

A Python crawler tutorial series: learn Python web scraping from zero to one. Covers browser packet capture and mobile app packet capture (e.g. fiddler, mitmproxy), the modules crawlers depend on (requests, BeautifulSoup, selenium, appium, scrapy, etc.), plus IP proxies, CAPTCHA recognition, using MySQL and MongoDB from Python, multi-threaded and multi-process crawling, reversing CSS-based anti-crawler obfuscation, JS reverse engineering for crawlers, distributed crawlers, and hands-on crawler project examples.
http://fxxkpython.com
MIT License

The second crawler program throws an error #7

Closed: lovevantt closed this 1 year ago

lovevantt commented 4 years ago

The error is:

Traceback (most recent call last):
  File "D:/coding/Python/PyCharm/test1/test2.py", line 127, in <module>
    main(i)
  File "D:/coding/Python/PyCharm/test1/test2.py", line 119, in main
    soup = BeautifulSoup(html, 'lxml')
  File "C:\Programs\Python\Python38-32\lib\site-packages\bs4\__init__.py", line 287, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
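The error reproduces in isolation whenever None is passed to BeautifulSoup; a minimal sketch, assuming bs4 and lxml are installed:

from bs4 import BeautifulSoup

html = None                         # what a failed request handler hands back
soup = BeautifulSoup(html, 'lxml')
# TypeError: object of type 'NoneType' has no len()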

lovevantt commented 4 years ago

The code is the code provided in the repo: douban_top_250_books.py

Ryyy233 commented 4 years ago

Same problem here. Has anyone solved it?

zhangxy12138 commented 4 years ago

Sorry, I haven't gotten around to it yet.


panhainan commented 4 years ago

Cause: the request to Douban failed and returned None. (Check your own copy: if, as in the author's code, the function returns the string 'None', test with != 'None' instead.) Fix:

html = request_douban(url)
if html is not None:
    soup = BeautifulSoup(html, 'lxml')
    save_to_excel(soup)
else:
    print('request_douban returned None')
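To spell out the parenthetical above: "is not None" only catches the real None object; a version of the code that returns the string 'None' needs the string comparison instead. A tiny illustration:

html = 'None'           # a 4-character string, not the None object
print(html is None)     # False: the "is not None" check would let this through
print(html != 'None')   # False: the string comparison correctly flags the failure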
panhainan commented 4 years ago


But that still doesn't actually solve the problem, because the request itself is failing. The likely cause is that Douban decided our requests came from a crawler and blocked them. In that case we can add headers so the request looks like it comes from a browser rather than a crawler. Rewrite the request_douban function:

import requests

def request_douban(url):
    headers = {
        # Pretend to be a browser instead of the default python-requests client
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    return None  # non-200 responses and request errors both end up here
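Putting the two fixes together, main ends up looking roughly like this. The pagination URL and the save_to_excel helper are assumptions based on the tutorial script, not verbatim code from it:

from bs4 import BeautifulSoup

def main(page):
    # Assumed pagination scheme for douban_top_250_books.py: 25 books per page
    url = 'https://book.douban.com/top250?start=' + str(page * 25)
    html = request_douban(url)      # the rewritten version above; None on failure
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        save_to_excel(soup)         # helper defined in the tutorial script
    else:
        print('request_douban returned None for', url)

if __name__ == '__main__':
    for i in range(10):             # 10 pages x 25 books = top 250
        main(i)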
Wakerrd commented 4 years ago

Just add a request header and it works.

McChickenNuggets commented 3 years ago

Add this; otherwise your User-Agent is the python-requests default and you get blocked outright:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 OPR/66.0.3515.115',
}
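You can check what the server sees when no header is set: requests exposes its default User-Agent string. A minimal check (the exact version string varies with your installed requests):

import requests

# Without an explicit header, every request identifies itself like this,
# which anti-crawler rules can match and block:
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.22.0'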
