wistbean / learn_python3_spider

A Python crawler tutorial series: learn Python web scraping from zero to one. Covers browser packet capture and mobile app packet capture (e.g. fiddler, mitmproxy), the modules crawlers depend on (requests, BeautifulSoup, selenium, appium, scrapy, etc.), plus IP proxies, CAPTCHA recognition, using MySQL and MongoDB from Python, multi-threaded and multi-process crawling, reversing CSS-based anti-crawler obfuscation, JS reverse engineering for crawlers, distributed crawlers, and hands-on crawler project examples.
http://fxxkpython.com
MIT License

The second crawler program throws an error #7

Closed: lovevantt closed this 1 year ago

lovevantt commented 4 years ago

The error is:

Traceback (most recent call last):
  File "D:/coding/Python/PyCharm/test1/test2.py", line 127, in <module>
    main(i)
  File "D:/coding/Python/PyCharm/test1/test2.py", line 119, in main
    soup = BeautifulSoup(html, 'lxml')
  File "C:\Programs\Python\Python38-32\lib\site-packages\bs4\__init__.py", line 287, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
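The error reproduces in isolation whenever None is passed to BeautifulSoup; a minimal sketch, assuming bs4 and lxml are installed:

from bs4 import BeautifulSoup

html = None                         # what a failed request handler hands back
soup = BeautifulSoup(html, 'lxml')
# TypeError: object of type 'NoneType' has no len()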

lovevantt commented 4 years ago

The code is the code provided in the repo: douban_top_250_books.py

Ryyy233 commented 4 years ago

Same problem here. Has anyone solved it?

zhangxy12138 commented 4 years ago

Sorry, I haven't gotten around to it yet.


panhainan commented 4 years ago

Cause: the request to Douban failed and returned None. (Check your own copy: if, as in the author's code, the function returns the string 'None', test with != 'None' instead.) Fix:

html = request_douban(url)
if html is not None:
    soup = BeautifulSoup(html, 'lxml')
    save_to_excel(soup)
else:
    print('request_douban returned None')
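To spell out the parenthetical above: "is not None" only catches the real None object; a version of the code that returns the string 'None' needs the string comparison instead. A tiny illustration:

html = 'None'           # a 4-character string, not the None object
print(html is None)     # False: the "is not None" check would let this through
print(html != 'None')   # False: the string comparison correctly flags the failure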
panhainan commented 4 years ago


But that still doesn't actually solve the problem, because the request itself is failing. The likely cause is that Douban decided our requests came from a crawler and blocked them. In that case we can add headers so the request looks like it comes from a browser rather than a crawler. Rewrite the request_douban function:

import requests

def request_douban(url):
    headers = {
        # Pretend to be a browser instead of the default python-requests client
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    return None  # non-200 responses and request errors both end up here
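Putting the two fixes together, main ends up looking roughly like this. The pagination URL and the save_to_excel helper are assumptions based on the tutorial script, not verbatim code from it:

from bs4 import BeautifulSoup

def main(page):
    # Assumed pagination scheme for douban_top_250_books.py: 25 books per page
    url = 'https://book.douban.com/top250?start=' + str(page * 25)
    html = request_douban(url)      # the rewritten version above; None on failure
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        save_to_excel(soup)         # helper defined in the tutorial script
    else:
        print('request_douban returned None for', url)

if __name__ == '__main__':
    for i in range(10):             # 10 pages x 25 books = top 250
        main(i)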
Wakerrd commented 4 years ago

Just add a request header and it works.

McChickenNuggets commented 3 years ago

Add this; otherwise your User-Agent is the python-requests default and you get blocked outright:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 OPR/66.0.3515.115',
}
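You can check what the server sees when no header is set: requests exposes its default User-Agent string. A minimal check (the exact version string varies with your installed requests):

import requests

# Without an explicit header, every request identifies itself like this,
# which anti-crawler rules can match and block:
print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.22.0'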
