yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Using yacy as a large scale crawler #96

Open cloutier opened 7 years ago

cloutier commented 7 years ago

Hi everyone!

I'm currently testing yacy as a backend that will return every page on certain large sites, which we will use on our end for further processing.

Right now my tests consist of crawling some very large news sites: Buzzfeed, the BBC, Wired, and some others. I have looked at the documentation and come up with this query:

http://localhost:8090/yacysearch.json?query=site%3Abuzzfeed.com&nav=all&startRecord=40000&maximumRecords=1000&verify=false

Unfortunately, for large values of startRecord I start to get no results. Is there a way to change this query or the configuration of yacy to fix this? Latency-wise I don't mind if the query takes up to half an hour, and I don't mind whether there is pagination or not. If we can manage to do this, we will be adding upwards of 100,000,000 documents to the index in the next few weeks.
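
In case it helps to see the intent, here is a rough sketch of how we page through that endpoint. The peer address and parameters are the same as in the query above, and I am assuming the JSON mirrors the usual OpenSearch layout (a "channels" list whose first element holds the page of results under "items"), so adjust the field names if your peer returns something different:

import requests  # third-party HTTP library

BASE = "http://localhost:8090/yacysearch.json"  # local peer, as in the query above
PAGE = 1000  # maximumRecords per request

def fetch_site(site, max_offset=500000):
    """Page through results for one site, yielding raw result items."""
    start = 0
    while start < max_offset:
        params = {
            "query": "site:" + site,
            "startRecord": start,
            "maximumRecords": PAGE,
            "verify": "false",
        }
        data = requests.get(BASE, params=params, timeout=1800).json()
        # Assumption: the first element of "channels" carries this page's "items".
        channels = data.get("channels") or [{}]
        items = channels[0].get("items", [])
        if not items:
            break  # empty page: either done, or the offset limit was reached
        for item in items:
            yield item
        start += PAGE

for doc in fetch_site("buzzfeed.com"):
    print(doc.get("link"))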

Thanks for the help!

luccioman commented 7 years ago

Hi @cloutier, I don't know if you run yacy from the latest sources, but two days ago I pushed a fix to properly handle large startRecord values (see commit https://github.com/yacy/yacy_search_server/commit/c25e48e969f180dcc3c73863acbfcc383a182c8f). This change only affects requests to your local index (queries in web portal mode, or with the parameter resource=local in your query, for example), but given your use case I guess this could help you.
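
For example (borrowing the parameters from your query above, untested on my side), a local-only request would look something like:

http://localhost:8090/yacysearch.json?query=site%3Abuzzfeed.com&resource=local&startRecord=40000&maximumRecords=1000&verify=false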

Unfortunately, for P2P queries (mixing local, DHT, and federated results) the problem is harder, and for now the offset (startRecord) value is still hard-coded to a limit of 10000...

Best regards

cloutier commented 7 years ago

Thank you for the quick reply, that was exactly what I needed.

I am actually on the 1.90 release, which seems to be a few months old now. I can manage to compile from source until the next stable release; that will solve my scaling problems.

Related to this: do you have plans to allow adding trusted peers, so that one could run a cluster of yacy nodes?

luccioman commented 7 years ago

Hmm, I am not sure I get what you mean by "add trusted peers"... But maybe you would be interested in setting your YaCy peers to use an external SolrCloud of your own instead of the default embedded Solr instances. Personally I haven't experimented with this recently, but the YaCy wiki has some related instructions.
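
If you go that route, a quick sanity check that the external Solr collection answers before you point YaCy at it could look like the sketch below. The node and collection names are only placeholders for whatever your SolrCloud uses; only the standard Solr select handler is assumed here:

import requests

# Placeholder node and collection name; replace with your own SolrCloud setup.
SOLR = "http://solr1.example.org:8983/solr/yacy"

resp = requests.get(SOLR + "/select",
                    params={"q": "*:*", "rows": 0, "wt": "json"},
                    timeout=10)
resp.raise_for_status()
print("documents currently in the external collection:",
      resp.json()["response"]["numFound"])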

smokingwheels commented 7 years ago

@luccioman Have you looked at any of the spammy pages from .pl? I think they just cause a spiral of death for the crawler queue: it's just nested page after nested page of links of links. Some peers that offer remote crawls don't seem to be working.

luccioman commented 7 years ago

Hi @smokingwheels, yes, I noticed these spammy results with URLs ending, for example, with "w.interiowo.pl". But for now I have not really had an in-depth look at this. Would you suggest something other than using the YaCy Blacklist feature?

smokingwheels commented 7 years ago

Hi @luccioman I have done a few things in the last few days; how do I send a PM? Anyway, there are something like 108k spammy subdomains/emails in .pl. I found it easy to just add ..pl.. to the block list, but that's not very fair to legit sites in .pl, is it? I have an old quad core, and within 5 minutes I can get YaCy to curl up and stop working.

Anyway, ask around on your peers for a possible solution to the problem; my fix is only a temporary one. I do have the time to look a bit further once I have finished post-processing.

luccioman commented 7 years ago

OK @smokingwheels, if you would like to send me a PM (Private Message?), you can use the YaCy forum.

For sure, 108k spammy subdomains could be difficult to block, especially if domains like interiowo.pl also contain legit content. And of course we do not want to block all Polish websites... Maybe we could define a blocking rule for the crawler based on content, but I guess this could be tricky and, in the end, probably easy to work around.
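
As a rough illustration of the scale problem (not a YaCy feature, just one quick way to spot Blacklist candidates from a URL dump): count how many distinct subdomains each registered domain fans out into, and the link farms tend to float to the top. The input file name is only a placeholder, and the two-label cut is naive (it ignores public-suffix rules):

from collections import defaultdict
from urllib.parse import urlparse

subdomains = defaultdict(set)
with open("crawl_queue_urls.txt") as fh:  # placeholder: one URL per line
    for line in fh:
        host = urlparse(line.strip()).hostname or ""
        parts = host.split(".")
        if len(parts) >= 2:
            registered = ".".join(parts[-2:])  # naive registered-domain guess
            subdomains[registered].add(host)

# Domains that fan out into thousands of subdomains are blacklist candidates.
for domain, hosts in sorted(subdomains.items(), key=lambda kv: -len(kv[1]))[:20]:
    print(len(hosts), domain)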

smokingwheels commented 7 years ago

Hi @luccioman There is an easy way here, still under test conditions ATM (see my peer "12 or 14 bit binary counter 16 pin DIP")... I will add stuff if needed. Check daily and you decide what to add and remove. Let me know. I mean, if anyone has the time to check 80,000 sites/domains (crawler queue stacks, ~4 GB), I could send you the list/files to put into a web portal to try. I will slowly verify sites that I think should be crawled and add them back in.

CC @Low012 @Orbiter

luccioman commented 7 years ago

Sorry @smokingwheels, but I don't understand the meaning of your linked repository... Can you explain your idea a bit more?

smokingwheels commented 7 years ago

Sorry, the repository is a bit off the tracks; disregard it. I filed a bug report with Ubuntu. It just clears any traffic shaping in Linux. I had a Windows 10 box afflict my systems over the network, causing slow transfers; it slows SSH/SFTP transfers to a crawl, like 30-100 kB/s.

Is there any way to output the index browser info for all of the domains to a dump file? For example:

"documents stored for host: 2; documents stored for subpath: 0; unloaded documents detected in subpath: 0"?

smokingwheels commented 7 years ago

Adding a block list to the hosts file may improve crawler speed and also remove spammy domains. The crawler tends not to drop to 0 PPM now; it's more constant.

Yacy Thread on Mantis
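
A sketch of one way to build such a block list (the file names are just examples): turn a plain list of spammy hosts into hosts-file entries pointing at 0.0.0.0, so connections to those hosts fail immediately instead of the crawler fetching spam pages:

# Turn a plain list of spammy hosts (one per line) into hosts-file entries.
with open("blocked_hosts.txt") as src, open("hosts.block", "w") as dst:
    for line in src:
        host = line.strip()
        if host and not host.startswith("#"):
            dst.write("0.0.0.0 " + host + "\n")
# Append hosts.block to /etc/hosts so lookups for those hosts resolve to an
# unroutable address and the crawler gives up on them quickly.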

smokingwheels commented 7 years ago

"add trusted peers" could be a bunch of separate slave webportal clients indexing the sites you want then have 1 that crawls the webportals. You would have to adjust the robots.txt in each slave webportal. I am not sure if this is a good solution. There is good and bad points.

Quix0r commented 6 years ago

An alternative is to run your own cluster in P2P mode with a "closed seed file" (meaning it is not accessible by outside peers). I do this here at work, but you may require my modified version.