Option for the crawler to use fresh dns lookups on all crawls. With a PiHole.

yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance

http://yacy.net


Option for the crawler to use fresh dns lookups on all crawls. With a PiHole. #508

Open smokingwheels opened 1 year ago

smokingwheels commented 1 year ago

This is needed for Pi-hole blocking to work.

I think YaCy should have an option to query the DNS server every time a crawl is started instead of using its cache. That way you can stop the crawler from crawling sites that have a crawler delay of 10000 ms. So far the only way to fix it is to clear all the caches or add the site to the hosts file. This is time-consuming, but worth it if the crawler does not slow down.

I have a shared Pi-hole list on my github.io.

There are about 9000 DNS requests in 10 minutes when starting a crawl of 480 sites with a depth of 2. When YaCy and Pi-hole were on the same device, the Pi-hole reported errors (too many concurrent connections, max 150).

I have 4 Raspberry Pis set up for my DNS: one runs dnsmasq and load-balances to the other 3 Pi-holes.

I have added a post on the Pi-hole forum: https://discourse.pi-hole.net/t/maximum-number-of-concurrent-dns-queries-reached-max-150-when-starting-a-crawl-with-yacy/58263
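One way to stay under a limit like the 150 concurrent queries mentioned above is to put a counting semaphore in front of the resolver. The following is only a sketch under assumptions: `BoundedResolver` is a hypothetical class, not part of YaCy, and the resolver is passed in as a plain function so the idea can be tried without touching a real DNS server.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical sketch: caps how many DNS lookups run at the same time. Not YaCy code. */
public class BoundedResolver {
    private final Semaphore permits;
    private final Function<String, String> resolver; // host -> IP address text (assumed shape)

    public BoundedResolver(int maxConcurrent, Function<String, String> resolver) {
        this.permits = new Semaphore(maxConcurrent);
        this.resolver = resolver;
    }

    /** Blocks while maxConcurrent lookups are already in flight, then resolves. */
    public String resolve(String host) {
        permits.acquireUninterruptibly();
        try {
            return resolver.apply(host);
        } finally {
            permits.release(); // always hand the permit back, even if the lookup throws
        }
    }

    /** Number of lookups that could start right now without waiting. */
    public int availablePermits() {
        return permits.availablePermits();
    }
}
```

With a limit somewhat below 150, crawler threads would queue inside `resolve` instead of flooding the DNS server. A real resolver function could wrap `InetAddress.getByName(host).getHostAddress()`.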

Orbiter commented 1 year ago

The Java-internal DNS cache has a concurrency problem. That is the reason YaCy has an extra DNS cache on top. I had never thought about a limit on concurrent requests to the remote DNS; maybe that is an interesting hint.

Explicitly clearing the DNS cache is not implemented anywhere; the only way right now is to restart YaCy. Maybe we can use this. For now I must put this into the backlog, there are more pressing issues right now...
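To make the request concrete, here is a rough sketch of an application-level DNS cache that offers both a "fresh lookup" switch and an explicit `clear()`. Everything in it is an assumption for illustration: `FreshDnsCache` and its methods are hypothetical names, this is not how YaCy's cache is written, and the resolver is injected as a function rather than calling the network directly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of a DNS cache with a bypass option. Not YaCy's implementation. */
public class FreshDnsCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver; // host -> IP text, or null if unresolved

    public FreshDnsCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** With fresh=true the cached entry is ignored and replaced by a new lookup. */
    public String resolve(String host, boolean fresh) {
        if (fresh) {
            String ip = resolver.apply(host);
            if (ip == null) {
                cache.remove(host); // drop stale entries for hosts that no longer resolve
            } else {
                cache.put(host, ip);
            }
            return ip;
        }
        // computeIfAbsent records nothing when the resolver returns null
        return cache.computeIfAbsent(host, resolver);
    }

    /** Explicitly forget all cached entries without restarting. */
    public void clear() {
        cache.clear();
    }

    public int size() {
        return cache.size();
    }
}
```

Note that the JVM keeps its own cache behind `InetAddress`, controlled by the `networkaddress.cache.ttl` security property (0 means never cache), so a fresh lookup at this level would only reach the Pi-hole if that layer is not caching either.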

smokingwheels commented 1 year ago

OK, thanks for that workaround.

I will keep doing my blocking list for the moment.

smokingwheels commented 1 year ago

YaCy is also querying unknown domains. It's not a real problem.

```
dig www.soundlabsgroup.com.au%20

; <<>> DiG 9.11.3-1ubuntu1.18-Ubuntu <<>> www.soundlabsgroup.com.au%20
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 64103
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;www.soundlabsgroup.com.au%20. IN A

;; Query time: 16 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Oct 03 07:17:54 AWST 2022
;; MSG SIZE rcvd: 57
```
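The queried name ends in `%20`, a percent-encoded space, so it can never resolve. A crawler could filter such names before they reach the DNS server. The sketch below is an illustration only: `HostSanitizer` is a hypothetical helper, not YaCy code, and the pattern is a simplified ASCII hostname check that rejects underscores and does not handle internationalized names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper: normalizes a raw host string and rejects names that cannot resolve. */
public class HostSanitizer {
    // Simplified rule: dot-separated labels of letters, digits and inner hyphens, max 253 chars.
    private static final Pattern HOSTNAME = Pattern.compile(
        "^(?=.{1,253}$)[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?"
        + "(\\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)*$");

    /** Returns the cleaned host, or empty if it is not worth a DNS query. */
    public static Optional<String> clean(String rawHost) {
        String host;
        try {
            // Turns "%20" into a space; note that URLDecoder also maps '+' to a space.
            host = URLDecoder.decode(rawHost, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return Optional.empty(); // malformed percent escape
        }
        host = host.trim().toLowerCase(Locale.ROOT);
        if (host.endsWith(".")) {
            host = host.substring(0, host.length() - 1); // drop a trailing root dot
        }
        return HOSTNAME.matcher(host).matches() ? Optional.of(host) : Optional.empty();
    }
}
```

Under these assumptions, `HostSanitizer.clean("www.soundlabsgroup.com.au%20")` yields `www.soundlabsgroup.com.au`, while a name with an inner space is rejected without any query being sent.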

smokingwheels commented 1 year ago

Stuck in crawler queue.

| Initiator | Depth | Status | URL |
|---|---|---|---|
| agent-gbs | 2 | loading | http:// instantrails.rubyforge.org/svn/trunk/InstantRails-win/InstantRails/mysql/bin/libmySQL.dll< / a > |
| agent-gbs | 2 | loading | http:// instantrails.rubyforge.org/svn/trunk/InstantRails-win/InstantRails/mysql/bin/libmySQL.dll" |

In the actual queue entry, the "< / a >" has no spaces.
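Both stuck entries end in leftover HTML (a closing anchor tag or a quote character) that was picked up along with the link. As a sketch of one possible cleanup, a URL could be cut at the first character that may not appear unescaped in a URI. `UrlTail` is a hypothetical helper and not YaCy's parser; the URL in the usage note is a made-up example.

```java
/** Hypothetical helper: trims HTML residue that got attached to an extracted URL. */
public class UrlTail {
    /** Cuts the URL at the first '<', '>', '"' or whitespace character, if any. */
    public static String stripHtmlResidue(String url) {
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            // None of these characters are allowed unescaped in a URI (RFC 3986).
            if (c == '<' || c == '>' || c == '"' || Character.isWhitespace(c)) {
                return url.substring(0, i);
            }
        }
        return url; // nothing to strip
    }
}
```

For instance, `UrlTail.stripHtmlResidue("http://example.org/bin/libmySQL.dll</a>")` returns `http://example.org/bin/libmySQL.dll`, and the same holds for the variant ending in a double quote.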