zhegexiaohuozi / SeimiCrawler

A simple, agile, distributed Java crawler framework with SpringBoot support.
http://seimicrawler.org
Apache License 2.0
1.98k stars 679 forks

Garbled text when crawling Alibaba #21

Closed Dreamerdream closed 7 years ago

Dreamerdream commented 7 years ago

Whenever the search keyword is Chinese, it comes back garbled.

zhegexiaohuozi commented 7 years ago

Could you please provide steps to reproduce? Thanks.

Dreamerdream commented 7 years ago

    @Override
    public List<Request> startRequests() {
        List<Request> requests = new ArrayList<>();
        Request request = new Request();
        request.useSeimiAgent();
        request.setSeimiAgentRenderTime(5000);
        request.setSeimiAgentUseCookie(true);
        // try {
        //     String watch = new String("手表".getBytes("ISO-8859-1"),"UTF-8");
        //
        // } catch (UnsupportedEncodingException e) {
        //     e.printStackTrace();
        // }
        // If the keyword here is not Chinese, nothing is garbled.
        request.setUrl("https://s.1688.com/selloffer/offer_search.htm?keywords=手表");
        request.setCallBack("alibaba");
        requests.add(request);
        return requests;
    }

This is the request. Whenever the search keyword is Chinese, every part of the returned page source that contains the keyword is garbled, while everything else is fine. I don't know whether the problem is in my code, whether 1688 is transcoding the keyword on its side, or whether it is a SeimiCrawler issue.

zhegexiaohuozi commented 7 years ago

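This has nothing to do with SeimiCrawler. It is a very basic scenario in crawler development: a characteristic of the site itself, which the developer has to analyze case by case. Here, for example, the Chinese keyword should be URL-encoded.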
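A minimal sketch of the URL-encoding step suggested above, using only the JDK. The class `KeywordEncoding` and the helper `searchUrl` are hypothetical names, not part of SeimiCrawler. Which charset 1688 expects is an assumption here: the sketch simply shows that the same keyword (手表, written as Unicode escapes in the code) percent-encodes differently under GBK and UTF-8, so the charset has to be chosen deliberately.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class KeywordEncoding {

    // Hypothetical helper: builds the search URL from this thread with the
    // keyword percent-encoded in the given charset.
    static String searchUrl(String keyword, String charsetName)
            throws UnsupportedEncodingException {
        return "https://s.1688.com/selloffer/offer_search.htm?keywords="
                + URLEncoder.encode(keyword, charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The keyword from the thread ("watch"), as Unicode escapes so the
        // source file compiles regardless of its own encoding.
        String keyword = "\u624b\u8868";

        // GBK yields keywords=%CA%D6%B1%ED
        System.out.println(searchUrl(keyword, "GBK"));
        // UTF-8 yields keywords=%E6%89%8B%E8%A1%A8
        System.out.println(searchUrl(keyword, "UTF-8"));
    }
}
```

The resulting string could then be passed to `request.setUrl(...)` in place of the raw Chinese literal.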

Dreamerdream commented 7 years ago

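That doesn't work.
URL after encoding: https://s.1688.com/selloffer/offer_search.htm?keywords=%CA%D6%B1%ED
URL without encoding: https://s.1688.com/selloffer/offer_search.htm?keywords=手表
I also tried encoding and decoding the keywords value in various ways. I have tried everything and none of it works. (PS: when I crawled Tmall I did run into an encoding problem.) This morning I added cookies and headers and tried again; the output is still garbled.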

Dreamerdream commented 7 years ago

Author, I ran this directly with curl:

    curl -X POST \
      -H "Cache-Control: no-cache" \
      -H "accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" \
      -H "accept-encoding:gzip, deflate, br" \
      -H "cookie:sw_newuno_count=2; cna=kcD0ESJCoGoCAXuYADXVToxJ; UM_distinctid=15d63ca39c6e5-073dc583246293-474b0421-1fa400-15d63ca39c7a78; JSESSIONID=8L78bHuu1-053YMBaSqt0AKvd0FF-rcxadQQ-zLy9; _tmp_ck_0="5qZiWZ7kx3yUPYb0LhGwvhmtEXiRRjOOxKa6Y35TmNevOP40i0UEnVHMzwEUFJhpj5bKR8SUWuzPtkpVeLNTKm42RLiPLgBgk63t9P3nVDvF9ABGghT6VKH1utYu1EzA1CUilJ%2BgXm04hCsaAd4oeYoueovb3kWSjhgQCO%2BW1XszwU4wKNL2bOw%2F0mPsBGpOhSoolEopDgO1M7qRdgJ5b8qwd%2BKQ2IXZWiVtvzfKEUhZpyZS6yKboSqnX8kWaE4QNSAaFI3SNgjykUhGTF5xGqA4s1o8BKqzGp%2B10emhafsResMjNMD%2F677jL4zvVBUpP%2FicAsgj1jIi47ZW5%2Fw7gA%3D%3D"; h_keys="%u624b%u8868#%u7f57%u8499"; ad_prefer="2017/07/28 15:14:53"; alisw=swIs1200%3D1%7C; ali_ab=123.152.76.237.1500617692281.4; isg=AqOjlicY_vCESrLPWSyer3MBMueNMDdd2VZAXdUA_4J5FMM2XWjHKoFEeNLh; alicnweb=touch_tb_at%3D1501227529621" \
      -H "referer:https://s.1688.com/selloffer/offer_search.htm?keywords=%C2%DE%C3%C9&n=y&spm=a260k.635.1998096057.d1" \
      -H "user-agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36" \
      -d 'url=https://s.1688.com/selloffer/offer_search.htm?keywords=%CA%D6%B1%ED&n=y&spm=a260k.635.1998096057.d1&renderTime=6000&useCookie=1' \
      "http://localhost:8000/doload" | more

Result:

    <meta name="description" content="阿里巴巴锟街憋拷贸易是全球顶尖的产品交易市场,您可以查看海量精选的锟街憋拷产品供应信息,还可以浏览锟街憋拷公司黄页,与商友在线洽谈,查找最新锟街憋拷行业动态,价格行情,索即时展会信息等。">

Following your suggestion, the output is garbled whether keywords is 手表 or its URL-encoded form. So could it be that SeimiAgent, the simulated browser, does not support parsing 1688 pages? (That is just a wild guess on my part.)
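The garbled run 锟街憋拷 in the result above has a recognizable shape. The sketch below shows one sequence of charset mix-ups that yields exactly those bytes from the keyword 手表: the GBK bytes are misread as UTF-8 (invalid sequences become U+FFFD), the result is written back out as UTF-8, and that output is read as GBK. This is only an illustration of a pattern consistent with the output; the thread does not confirm where in the pipeline such a mix-up happened, and `MojibakeDemo` and `garble` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Simulates a double charset mix-up: GBK bytes read as UTF-8,
    // re-encoded as UTF-8, then read as GBK again.
    static String garble(String text) {
        Charset gbk = Charset.forName("GBK");

        // Step 1: the text as GBK bytes, e.g. CA D6 B1 ED for the keyword.
        byte[] gbkBytes = text.getBytes(gbk);

        // Step 2: misread as UTF-8. Byte sequences that are not valid UTF-8
        // are replaced with U+FFFD, so the original text is lost here.
        String misread = new String(gbkBytes, StandardCharsets.UTF_8);

        // Step 3: the damaged string is serialized as UTF-8
        // (each U+FFFD becomes EF BF BD) and finally displayed as GBK.
        byte[] utf8Bytes = misread.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, gbk);
    }

    public static void main(String[] args) {
        // The keyword from the thread, as Unicode escapes.
        System.out.println(garble("\u624b\u8868"));
    }
}
```

Because step 2 is lossy, no amount of re-encoding on the crawler side can recover the keyword afterwards; the mismatch has to be fixed where the bytes are first decoded.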

zhegexiaohuozi commented 7 years ago

Try this version: http://seimidl.wanghaomiao.cn/seimiagent_linux_v1.3.2_x86_64.tar.gz

Dreamerdream commented 7 years ago

The output is finally no longer garbled. Thank you!