zc3945 / caipanwenshu

Crawler demo for China Judgements Online (wenshu.court.gov.cn), updated 2020-04-23

Tried running with both Python 3.7 and Python 2.7; neither works #5

Open CoderRobin1992 opened 5 years ago

CoderRobin1992 commented 5 years ago

Has the site updated its anti-scraping measures again? The error output is:

2018-12-26 17:13:54 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: wenshu)
2018-12-26 17:13:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0f 25 May 2017), cryptography 2.1, Platform Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.2.1511-Core
2018-12-26 17:13:54 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'wenshu.spiders', 'ROBOTSTXT_OBEY': True, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['wenshu.spiders'], 'BOT_NAME': 'wenshu', 'DOWNLOAD_DELAY': 3}
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled item pipelines: ['wenshu.pipelines.WenshuPipeline']
2018-12-26 17:13:54 [scrapy.core.engine] INFO: Spider opened
2018-12-26 17:13:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-26 17:13:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-26 17:13:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://wenshu.court.gov.cn/robots.txt> (referer: None)
2018-12-26 17:14:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://wenshu.court.gov.cn/List/List?sorttype=1&conditions=searchWord+1+AJLX++%E6%A1%88%E4%BB%B6%E7%B1%BB%E5%9E%8B:%E5%88%91%E4%BA%8B%E6%A1%88%E4%BB%B6> (referer: None)
2018-12-26 17:14:08 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/List?sorttype=1&conditions=searchWord+1+AJLX++%E6%A1%88%E4%BB%B6%E7%B1%BB%E5%9E%8B:%E5%88%91%E4%BA%8B%E6%A1%88%E4%BB%B6)
2018-12-26 17:14:08 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:15 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:17 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:20 [scrapy.core.scraper] ERROR: Spider error processing <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/www/wenshu/wenshu/spiders/doc.py", line 95, in get_doc_list
    key = getkey(format_key_str).encode('utf-8')
  File "/www/wenshu/wenshu/utils/docid_v27.py", line 105, in getkey
    c = execjs.compile(js_str)
  File "/usr/lib/python2.7/site-packages/execjs/__init__.py", line 61, in compile
    return get().compile(source, cwd)
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 21, in get
    return get_from_environment() or _find_available_runtime()
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 49, in _find_available_runtime
    raise exceptions.RuntimeUnavailableError("Could not find an available JavaScript runtime.")
RuntimeUnavailableError: Could not find an available JavaScript runtime.
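The RuntimeUnavailableError comes from PyExecJS failing to locate any JavaScript interpreter on PATH. A minimal standalone check can be sketched as below (a Python 3 sketch, since `shutil.which` does not exist on Python 2.7; the candidate list is an assumption, as the exact set PyExecJS probes depends on its version):

```python
import shutil

# Runtime names PyExecJS commonly probes for (an assumption; depending on
# the version it may also try PyV8, SpiderMonkey, JScript, etc.).
CANDIDATES = ["node", "nodejs", "phantomjs", "jsc"]

def find_js_runtime(candidates=CANDIDATES):
    """Return (name, path) of the first JS runtime found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return name, path
    return None

if __name__ == "__main__":
    found = find_js_runtime()
    if found:
        print("JS runtime available: %s at %s" % found)
    else:
        print("No JS runtime on PATH; execjs will raise RuntimeUnavailableError")
```

If this prints nothing usable, installing any supported runtime (Node.js is the usual choice) should make `execjs.compile` work again.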

zc3945 commented 5 years ago

https://github.com/zc3945/caipanwenshu/tree/master/wenshu/wenshu/utils Please first try running docid.py and vl5x.py from here. If they raise errors, work from the error messages.

CoderRobin1992 commented 5 years ago

vl5x.py runs fine; docid.py fails with:

Traceback (most recent call last):
  File "docid.py", line 115, in <module>
    key = getkey(RunEval).encode('utf-8')
  File "docid.py", line 104, in getkey
    js_str = unzip(str1).replace('_[_][_](', 'return ')[:-4]
  File "docid.py", line 100, in unzip
    return btou(get_js(fromBase64(str1)))
  File "docid.py", line 94, in get_js
    eval_js = execjs.compile(js_data)
  File "/usr/lib/python2.7/site-packages/execjs/__init__.py", line 61, in compile
    return get().compile(source, cwd)
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 21, in get
    return get_from_environment() or _find_available_runtime()
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 49, in _find_available_runtime
    raise exceptions.RuntimeUnavailableError("Could not find an available JavaScript runtime.")
execjs._exceptions.RuntimeUnavailableError: Could not find an available JavaScript runtime.
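Both tracebacks fail at the same point: PyExecJS cannot find a JavaScript interpreter. One way to satisfy it on the CentOS 7 host shown in the log is to install Node.js (the EPEL package route is an assumption; any runtime PyExecJS supports will do):

```shell
# Install Node.js as a JS runtime for PyExecJS (CentOS 7; needs root):
#   yum install -y epel-release nodejs

# Verify that a runtime is now visible on PATH:
if command -v node >/dev/null 2>&1; then
  node -e "console.log('JS runtime ok')"
else
  echo "no node on PATH yet"
fi
```

Once some runtime is on PATH (or selected via the `EXECJS_RUNTIME` environment variable), `execjs.compile` should stop raising RuntimeUnavailableError.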

CoderRobin1992 commented 5 years ago

I switched to crawling with Selenium, entering from the list page. It is simpler, but the throughput is far lower...

CoderRobin1992 commented 5 years ago

Could I ask how to get around the site's IP detection? The crawling part is done, but it now trips the anti-bot check and a CAPTCHA pops up. I tried IPs from one proxy provider without success; the site even warned that it had recorded my MAC address and flagged me as an invalid user.
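For the IP side specifically, Scrapy's standard hook is `request.meta['proxy']`, which the built-in HttpProxyMiddleware honors. A minimal rotating-proxy downloader middleware could be sketched as below (the proxy URLs and the settings path are placeholders, and this alone will not defeat CAPTCHA or fingerprint checks):

```python
import random

# Placeholder proxy endpoints -- substitute addresses from your provider.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]

class RandomProxyMiddleware(object):
    """Scrapy downloader middleware: attach a random proxy to each request."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta['proxy'].
        request.meta["proxy"] = random.choice(PROXY_POOL)

# Enable in settings.py (module path is an assumption about your layout):
# DOWNLOADER_MIDDLEWARES = {"wenshu.middlewares.RandomProxyMiddleware": 543}
```

Combining this with the existing DOWNLOAD_DELAY and a rotated User-Agent may slow down detection, but a site that challenges with CAPTCHAs will still need those challenges handled separately.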