yuvenhol / dataharvest

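An AGI extension toolkit that supports AI search, web crawling, and data cleaning, ready to use out of the box. Covers Tavily, Tiangong, Baidu Baike, Baijiahao, 360 Baike, Toutiao, WeChat Official Accounts, Sohu Baike, Tencent News, NetEase News, Mafengwo, and Xiaohongshu.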
52 stars, 8 forks

Not sure how to use XiaoHongShuSpider? #6

Open bobkingdom opened 2 months ago

bobkingdom commented 2 months ago

As the title says. I also switched to the Mafengwo spider, and it doesn't seem to crawl anything either. How is this supposed to be used?

2024-09-01 23:47:53,010 - INFO - HTTP Request: GET https://www.mafengwo.cn/mdd "HTTP/1.1 301 Moved Permanently"
[ERROR][2024-09-01 23:47:53][main.py:439] - Error occurred while crawling: '__jsluid_s'
INFO: 127.0.0.1:53710 - "POST /fetch_mfw HTTP/1.1" 200 OK


@app.post("/fetch_mfw")
async def crawl_mafengwo_mdd():
    url = "https://www.mafengwo.cn/mdd"
    # proxy_gene_func = MyProxy()
    # config = SpiderConfig(proxy_gene_func=proxy_gene_func)
    config = SpiderConfig()
    # Use MaFengWoSpider (swapped in for XiaoHongShuSpider)
    spider = MaFengWoSpider(config)

    try:
        # Fetch the page content with the async method
        doc = await spider.a_crawl(url)
        logger.info(f"Successfully crawled content: {doc.page_content}")

        return doc.page_content

    except Exception as e:
        logger.error(f"Error occurred while crawling: {str(e)}")
        return {"error": str(e)}
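
A note on reading the log above: the message `'__jsluid_s'`, quotes included, is what `str()` produces for a `KeyError` raised on that key, so the handler is probably hiding the exception type. The sketch below uses only the standard library to show the effect and one way to log more detail. `read_cookie` and `describe_error` are hypothetical helpers written for illustration, not part of dataharvest, and the guess that the spider looks up a `__jsluid_s` cookie that the 301 response did not set is an assumption, not something the log confirms.

```python
import io
import logging


def describe_error(exc: BaseException) -> str:
    """Format an exception as 'TypeName: message' so the type is not lost."""
    return f"{type(exc).__name__}: {exc}"


def read_cookie(cookies: dict, name: str) -> str:
    """Hypothetical stand-in for a cookie lookup; raises KeyError if absent."""
    return cookies[name]


# Capture log output in memory so the example is self-contained.
log_stream = io.StringIO()
logger = logging.getLogger("crawl_demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(log_stream))
logger.propagate = False

try:
    # Assumption: the response carried no '__jsluid_s' cookie.
    read_cookie({}, "__jsluid_s")
except Exception as e:
    plain = str(e)                 # only the quoted key, as in the issue log
    detailed = describe_error(e)   # includes the exception type
    # logger.exception logs at ERROR level and appends the traceback.
    logger.exception("Error occurred while crawling: %s", detailed)

print(plain)
print(detailed)
```

Swapping `logger.error(f"... {str(e)}")` for `logger.exception(...)` in the endpoint would show the line inside the spider where the lookup fails, which should make it clearer whether the redirect or a missing cookie is the cause.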
yuvenhol commented 1 month ago

I've added a Xiaohongshu demo; you can take a look at it in the tests directory.