Closed (twangnh closed 7 years ago)
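I'm a beginner. I want to modify the parser, i.e. inst_parse.py, and use the threads approach to scrape download links from the 电影天堂 movie site (using the test_spider() function). Here is the modified parser: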
def htm_parse_2(self, priority: int, url: str, keys: object, deep: int, content: object) -> (int, list, list):
    """
    parse the content of a url, you can rewrite this function, parameters and return refer to self.working()
    """
    *_, html_text = content
    url_list, save_list = [], []

    if (self._max_deep < 0) or (deep < self._max_deep):
        if not re.compile(r"/\d{8}/").search(url):
            # The input URL is a list page: collect the link to each movie's download page
            a_list = re.findall(r"<a href=\"(?P<url>[\w\W]{5,}?)\" class=\"ulink\">[\w\W]+?</a>", html_text, flags=re.IGNORECASE)
            url_list = [(_url, keys, priority+1) for _url in [get_url_legal(href, url) for href in a_list]]
        else:
            # The input URL is a download page: extract the download link
            download_url = re.search(r"<td style=\"WORD-WRAP: break-word\"[\w\W]*?><a href=\"(?P<url>[\w\W]{5,}?)\">", html_text, flags=re.IGNORECASE)
            save_list = [(download_url.group("url").strip(), datetime.datetime.now()), ] if download_url else []

    return 1, url_list, save_list
I also changed the initial url to http://www.ygdy8.net/html/gndy/oumei/list_7_12.html and left everything else unchanged, but the program exits right after it starts. The log output is:
WARNING:root:MonitorThread[monitor] start...
WARNING:root:ThreadPool set_start_url: keys=None, priority=0, deep=0, url=http://www.ygdy8.net/html/gndy/oumei/list_7_12.html
WARNING:root:ThreadPool start: fetcher_num=10, is_over=True
WARNING:root:FetchThread[fetcher-1] start...
WARNING:root:FetchThread[fetcher-2] start...
WARNING:root:FetchThread[fetcher-3] start...
WARNING:root:FetchThread[fetcher-4] start...
WARNING:root:FetchThread[fetcher-5] start...
WARNING:root:FetchThread[fetcher-6] start...
WARNING:root:FetchThread[fetcher-7] start...
WARNING:root:FetchThread[fetcher-8] start...
WARNING:root:FetchThread[fetcher-9] start...
WARNING:root:FetchThread[fetcher-10] start...
WARNING:root:ParseThread[parser] start...
WARNING:root:SaveThread[saver] start...
WARNING:root:ThreadPool status: running_tasks=0; fetch=(0, 0, 0/(5s)); parse=(0, 0, 0/(5s)); save=(0, 0, 0/(5s)); total_seconds=5
WARNING:root:FetchThread[fetcher-1] end...
WARNING:root:FetchThread[fetcher-2] end...
WARNING:root:FetchThread[fetcher-3] end...
WARNING:root:FetchThread[fetcher-4] end...
WARNING:root:FetchThread[fetcher-5] end...
WARNING:root:FetchThread[fetcher-6] end...
WARNING:root:FetchThread[fetcher-7] end...
WARNING:root:FetchThread[fetcher-8] end...
WARNING:root:FetchThread[fetcher-9] end...
WARNING:root:ParseThread[parser] end...
WARNING:root:SaveThread[saver] end...
WARNING:root:FetchThread[fetcher-10] end...
WARNING:root:ThreadPool status: running_tasks=0; fetch=(0, 0, 0/(5s)); parse=(0, 0, 0/(5s)); save=(0, 0, 0/(5s)); total_seconds=10
WARNING:root:MonitorThread[monitor] end...
WARNING:root:ThreadPool end: fetcher_num=10, is_over=True
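I have never debugged a threaded program before and don't know how to go about it; stepping through line by line in debug mode doesn't work either. Where is the problem? Also, could you recommend how to debug thread-related programs? Do I need other modules, such as winpdb? Thanks!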
My code sets a whitelist and a blacklist; see lines 11 and 12 of test.py. You need to change them to fit your site, or simply remove the whitelist.
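To illustrate why a whitelist can make the run end immediately, here is a minimal sketch of a URL filter. It is not the project's actual code: `url_allowed` and both pattern tuples are hypothetical names and values made up for this example, and the real patterns are whatever test.py defines. If the start URL fails such a filter, nothing is ever queued, which would be consistent with the `fetch=(0, 0, 0/(5s))` counters in the log above.

```python
import re

# Hypothetical patterns for illustration only; the real ones live in test.py.
WHITE_PATTERNS = (re.compile(r"^https?://(www\.)?example\.com/"),)
BLACK_PATTERNS = (re.compile(r"\.(jpg|png|css|js)$", re.IGNORECASE),)


def url_allowed(url, white_patterns=WHITE_PATTERNS, black_patterns=BLACK_PATTERNS):
    """Return True if url hits no blacklist pattern and, when a
    whitelist is given, matches at least one whitelist pattern."""
    if any(p.search(url) for p in black_patterns):
        return False
    if white_patterns and not any(p.search(url) for p in white_patterns):
        return False
    return True


start_url = "http://www.ygdy8.net/html/gndy/oumei/list_7_12.html"

# Rejected: the hypothetical whitelist only covers example.com.
print(url_allowed(start_url))

# Accepted once the whitelist is removed...
print(url_allowed(start_url, white_patterns=()))

# ...or once it is changed to cover the new site.
site_whitelist = (re.compile(r"^https?://www\.ygdy8\.net/"),)
print(url_allowed(start_url, white_patterns=site_whitelist))
```

Either change, dropping the whitelist or rewriting it for the new domain, corresponds to the two options suggested in the reply.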
OK, thanks, that solved it.
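On the question of debugging threaded code: one common approach that needs no extra module is to pull the parsing logic out into a plain function and call it directly from the main thread, where `pdb` or an IDE debugger can step through it normally. The sketch below reuses the two regular expressions from the parser in the question. `parse_page` and the sample HTML strings are made up for this example, and `urllib.parse.urljoin` stands in for the project's `get_url_legal`, whose exact behaviour is assumed here.

```python
import re
from urllib.parse import urljoin

# Regular expressions copied from the parser in the question.
DETAIL_URL = re.compile(r"/\d{8}/")
LIST_LINK = re.compile(
    r"<a href=\"(?P<url>[\w\W]{5,}?)\" class=\"ulink\">[\w\W]+?</a>",
    re.IGNORECASE,
)
DOWNLOAD_LINK = re.compile(
    r"<td style=\"WORD-WRAP: break-word\"[\w\W]*?><a href=\"(?P<url>[\w\W]{5,}?)\">",
    re.IGNORECASE,
)


def parse_page(url, html_text):
    """Single-threaded stand-in for the parser: returns (links, downloads)."""
    # To step through interactively, put breakpoint() on the next line.
    if not DETAIL_URL.search(url):
        # List page: resolve every movie link against the page URL.
        links = [urljoin(url, href) for href in LIST_LINK.findall(html_text)]
        return links, []
    # Download page: extract the first download link, if any.
    match = DOWNLOAD_LINK.search(html_text)
    return [], ([match.group("url").strip()] if match else [])


# Hand-written sample pages (assumptions, not real site content).
list_url = "http://www.ygdy8.net/html/gndy/oumei/list_7_12.html"
list_html = '<a href="/html/gndy/dyzz/20170101/12345.html" class="ulink">Some movie</a>'

detail_url = "http://www.ygdy8.net/html/gndy/dyzz/20170101/12345.html"
detail_html = (
    '<td style="WORD-WRAP: break-word" bgcolor="#fdfddf">'
    '<a href="ftp://example.com/movie.mkv">ftp://example.com/movie.mkv</a></td>'
)

print(parse_page(list_url, list_html))
print(parse_page(detail_url, detail_html))
```

Once the function behaves as expected on its own, any remaining problem is more likely in the surrounding setup (such as the URL filters) than in the regular expressions.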