Step 3: Automatically follow page links and process the content

zengsn / name-crawler-python

Chinese name crawler written in Python


Step 3: Automatically follow page links and process the content #4

Open findsomeoneyys opened 8 years ago

findsomeoneyys commented 8 years ago

Use scrapy.spiders.CrawlSpider together with Rule objects to define how links are extracted from the crawled pages.
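Write custom pipelines to decide how each item (the scraped content) is processed:
- Clean the HTML data
- Validate the scraped data
- Check for duplicates (and drop them)
- Save the scraped results to a database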
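The four steps above could be sketched as four small pipeline classes, roughly as follows. This is only an illustration: the "name" field, the class names, the table layout, and the choice of SQLite are assumptions, not what this repository actually does. The fallback DropItem class exists only so the sketch runs without Scrapy installed.

```python
import html
import re
import sqlite3

try:
    from scrapy.exceptions import DropItem
except ImportError:  # stand-in so the sketch also runs without Scrapy
    class DropItem(Exception):
        """Replacement for scrapy.exceptions.DropItem."""


class CleanHtmlPipeline:
    """Strip tags and unescape entities in the (assumed) 'name' field."""

    TAG_RE = re.compile(r"<[^>]+>")

    def process_item(self, item, spider):
        raw = item.get("name") or ""
        item["name"] = html.unescape(self.TAG_RE.sub("", raw)).strip()
        return item


class ValidatePipeline:
    """Drop items whose name is missing or empty."""

    def process_item(self, item, spider):
        if not item.get("name"):
            raise DropItem("missing name")
        return item


class DuplicatesPipeline:
    """Drop items whose name was already seen during this crawl."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item["name"] in self.seen:
            raise DropItem(f"duplicate name: {item['name']}")
        self.seen.add(item["name"])
        return item


class SqlitePipeline:
    """Store names in a SQLite table (database path is hypothetical)."""

    def __init__(self, path="names.db"):
        self.path = path

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS names (name TEXT PRIMARY KEY)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT OR IGNORE INTO names (name) VALUES (?)", (item["name"],)
        )
        return item
```

In a Scrapy project these classes would be enabled through the ITEM_PIPELINES setting, where the class with the lower number runs first, so cleaning would come before validation, deduplication, and storage.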

zengsn commented 8 years ago

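Nicely done! One small suggestion: /Users/yangyunshen/name-crawler-python/src/spider/item.json should be changed to a relative path. Also, shouldn't this file be obtained dynamically? Alternatively, it could go in a configuration file, to be implemented later. One more thing: the configuration file is best designed to be "un-Python-like", so that people who do not know Python can modify the configuration too.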
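One possible way to follow both suggestions is sketched below: the output file is named in a plain INI file and resolved against a base directory instead of being hard-coded as an absolute path. The file name crawler.ini, the [output] section, and the item_file key are hypothetical; nothing in this thread specifies them.

```python
import configparser
from pathlib import Path


def load_output_path(config_file, base_dir):
    """Return the item file named in config_file, resolved against base_dir."""
    parser = configparser.ConfigParser()
    with open(config_file, encoding="utf-8") as fh:
        parser.read_file(fh)
    # Section and key names are assumptions; fall back to item.json if absent.
    relative = parser.get("output", "item_file", fallback="item.json")
    return Path(base_dir) / relative
```

A spider module could then call load_output_path(Path(__file__).with_name("crawler.ini"), Path(__file__).parent), so the result no longer depends on one user's home directory, and the INI file can be edited without any Python knowledge.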