xianhu / PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
https://github.com/xianhu/PSpider
BSD 2-Clause "Simplified" License
1.83k stars, 504 forks

Debugging question #20

Closed: twangnh closed this 7 years ago

twangnh commented 7 years ago

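I'm a beginner. I want to modify the parser, i.e. inst_parse.py, and use the threads approach to scrape download links from 电影天堂 (via the test_spider() function). Here is my modified parser: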

    def htm_parse_2(self, priority: int, url: str, keys: object, deep: int, content: object) -> (int, list, list):
        """
        parse the content of a url, you can rewrite this function, parameters and return refer to self.working()
        """
        *_, html_text = content

        url_list, save_list = [], []
        if (self._max_deep < 0) or (deep < self._max_deep):

            if not re.compile(r"/\d{8}/").search(url):  # list page: collect the links to each movie's download page
                a_list = re.findall(r"<a href=\"(?P<url>[\w\W]{5,}?)\" class=\"ulink\">[\w\W]+?</a>", html_text, flags=re.IGNORECASE)
                url_list = [(_url, keys, priority+1) for _url in [get_url_legal(href, url) for href in a_list]]

            else:  # download page: extract the download link
                download_url = re.search(r"<td style=\"WORD-WRAP: break-word\"[\w\W]*?><a href=\"(?P<url>[\w\W]{5,}?)\">", html_text, flags=re.IGNORECASE)
                save_list = [(download_url.group("url").strip(), datetime.datetime.now()), ] if download_url else []

        return 1, url_list, save_list

I also changed the initial URL to http://www.ygdy8.net/html/gndy/oumei/list_7_12.html and left everything else unchanged, but the program exits right after it starts. The log output is:

WARNING:root:MonitorThread[monitor] start...
WARNING:root:ThreadPool set_start_url: keys=None, priority=0, deep=0, url=http://www.ygdy8.net/html/gndy/oumei/list_7_12.html
WARNING:root:ThreadPool start: fetcher_num=10, is_over=True
WARNING:root:FetchThread[fetcher-1] start...
WARNING:root:FetchThread[fetcher-2] start...
WARNING:root:FetchThread[fetcher-3] start...
WARNING:root:FetchThread[fetcher-4] start...
WARNING:root:FetchThread[fetcher-5] start...
WARNING:root:FetchThread[fetcher-6] start...
WARNING:root:FetchThread[fetcher-7] start...
WARNING:root:FetchThread[fetcher-8] start...
WARNING:root:FetchThread[fetcher-9] start...
WARNING:root:FetchThread[fetcher-10] start...
WARNING:root:ParseThread[parser] start...
WARNING:root:SaveThread[saver] start...
WARNING:root:ThreadPool status: running_tasks=0; fetch=(0, 0, 0/(5s)); parse=(0, 0, 0/(5s)); save=(0, 0, 0/(5s)); total_seconds=5
WARNING:root:FetchThread[fetcher-1] end...
WARNING:root:FetchThread[fetcher-2] end...
WARNING:root:FetchThread[fetcher-3] end...
WARNING:root:FetchThread[fetcher-4] end...
WARNING:root:FetchThread[fetcher-5] end...
WARNING:root:FetchThread[fetcher-6] end...
WARNING:root:FetchThread[fetcher-7] end...
WARNING:root:FetchThread[fetcher-8] end...
WARNING:root:FetchThread[fetcher-9] end...
WARNING:root:ParseThread[parser] end...
WARNING:root:SaveThread[saver] end...
WARNING:root:FetchThread[fetcher-10] end...
WARNING:root:ThreadPool status: running_tasks=0; fetch=(0, 0, 0/(5s)); parse=(0, 0, 0/(5s)); save=(0, 0, 0/(5s)); total_seconds=10
WARNING:root:MonitorThread[monitor] end...
WARNING:root:ThreadPool end: fetcher_num=10, is_over=True

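I have never debugged a threaded program and don't know how to approach it; in debug mode I can't step through it either. Where is the problem? Also, could you recommend a way to debug thread-related programs? Do I need other modules, such as winpdb? Thanks!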

xianhu commented 7 years ago

My code sets a whitelist and a blacklist; see lines 11 and 12 of test.py. You need to adjust them, or simply remove the whitelist.
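
To illustrate why a whitelist can produce the all-zero counters in the log above, here is a minimal sketch of a URL filter. It is not PSpider's actual code: the function `url_allowed` and both pattern tuples are hypothetical, chosen only to show that a start URL matching no whitelist pattern is dropped before anything is fetched.

```python
import re

# Hypothetical patterns for illustration only; the real ones are in test.py.
WHITE_PATTERNS = (r"^https?://(www\.)?example\.(com|cn)",)
BLACK_PATTERNS = (r"\.(jpg|png|gif|zip|exe)$",)


def url_allowed(url, white_patterns=WHITE_PATTERNS, black_patterns=BLACK_PATTERNS):
    """Return True if url passes the blacklist and, when one is set, the whitelist."""
    # Any blacklist match rejects the URL outright.
    if any(re.search(p, url, flags=re.IGNORECASE) for p in black_patterns):
        return False
    # A non-empty whitelist requires at least one match.
    if white_patterns:
        return any(re.search(p, url, flags=re.IGNORECASE) for p in white_patterns)
    # No whitelist: everything not blacklisted is allowed.
    return True


start_url = "http://www.ygdy8.net/html/gndy/oumei/list_7_12.html"

# Rejected: the host is not covered by the (hypothetical) whitelist.
rejected = url_allowed(start_url)
# Accepted once the whitelist is removed.
no_whitelist = url_allowed(start_url, white_patterns=())
# Accepted once the whitelist is changed to cover the target site.
adjusted = url_allowed(start_url, white_patterns=(r"^https?://(www\.)?ygdy8\.net/",))

print(rejected, no_whitelist, adjusted)
```

Under these assumptions, either fix suggested above (changing the whitelist or removing it) lets the start URL through, after which the fetch counters in the status line should no longer stay at zero.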

twangnh commented 7 years ago

OK, thanks, that solved it.