berlinhemi opened 2 years ago
This error means the rotating-proxy configuration is not set up. It isn't needed for a local run, so you can remove (or comment out) this middleware entry in wildsearch_crawler/settings.py:
'wildsearch_crawler.middlewares.RotatingProxyMiddleware': 610,
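For illustration, the change amounts to dropping that entry from the middleware map in settings.py. A minimal sketch, assuming the standard Scrapy settings layout; the `BanDetectionMiddleware` entry and its 620 priority are assumptions for context (the middleware does appear in the enabled-middlewares log below), not copied from the repo:

```python
# wildsearch_crawler/settings.py (sketch; the real file has more settings)
DOWNLOADER_MIDDLEWARES = {
    # No rotating proxies are configured for local runs, so the entry
    # below is commented out rather than pointing at a missing proxy pool.
    # 'wildsearch_crawler.middlewares.RotatingProxyMiddleware': 610,
    'wildsearch_crawler.middlewares.BanDetectionMiddleware': 620,  # assumed priority
}
```

With the entry removed, Scrapy simply never loads the middleware, so the missing-proxy-config error cannot fire.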
That error won't come up again, but I can't guarantee the parser will work end to end; it hasn't been updated in about a year.
Ah yes, agreed, thanks! The error is fixed, but no data was exported (to artifacts/wb.json). I'll attach the log; maybe you'll have some ideas, and if not I'll dig into the page markup and the code :)
scrapy crawl wb -o artifacts/wb.json -a category_url="https://www.wildberries.ru/catalog/zhenshchinam/odezhda/vodolazki"
2021-10-29 13:52:46 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: wildsearch_crawler)
2021-10-29 13:52:46 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.3 (default, Jan 22 2021, 20:04:44) - [GCC 8.3.0], pyOpenSSL 21.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.6.1, Platform Linux-4.19.0-6-amd64-x86_64-with-debian-10.11
2021-10-29 13:52:46 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-10-29 13:52:46 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'wildsearch_crawler',
'NEWSPIDER_MODULE': 'wildsearch_crawler.spiders',
'SPIDER_MODULES': ['wildsearch_crawler.spiders']}
2021-10-29 13:52:46 [scrapy.extensions.telnet] INFO: Telnet Password: 5b207ade21241c37
2021-10-29 13:52:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2021-10-29 13:52:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'wildsearch_crawler.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-10-29 13:52:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-10-29 13:52:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-10-29 13:52:46 [scrapy.core.engine] INFO: Spider opened
2021-10-29 13:52:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-10-29 13:52:46 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-10-29 13:52:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wildberries.ru/catalog/zhenshchinam/odezhda/vodolazki> (referer: None)
2021-10-29 13:52:47 [scrapy.core.engine] INFO: Closing spider (finished)
2021-10-29 13:52:47 [/home/user/.local/lib/python3.7/site-packages/envparse.py] DEBUG: Get 'SCRAPY_JOB' casted as 'None'/'None' with default '0'
2021-10-29 13:52:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 256,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 69415,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.438207,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 10, 29, 17, 52, 47, 153763),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'memusage/max': 62980096,
'memusage/startup': 62980096,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 10, 29, 17, 52, 46, 715556)}
2021-10-29 13:52:47 [scrapy.core.engine] INFO: Spider closed (finished)
Same issue here, +1.
Did you get this resolved?
No, I haven't sat down with the project again. But if I get around to it, I'll report back.
Any luck?
Nope, haven't gotten back to this project.
I ran the example from the README but got no data... Maybe the site markup has changed?