yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Un-loadable pages in Yacy that are loadable from the host? #449

Closed ISJ-439 closed 2 years ago

ISJ-439 commented 2 years ago

Hello,

When loading a new page to crawl, which is a locally hosted page on the same server, I'm getting an error 503. However, using Lynx, a CLI web browser for Linux, I can load it fine.

How would I troubleshoot this further?

Yacy Console: https://i.imgur.com/IwpjkuS.png

Kiwix Server Logs:

======================
Requesting : 
full_url  : /w/load.php
method    : GET (0)
version   : HTTP/1.1
request#  : 50
headers   :
 - accept : '*/*'
 - accept-encoding : 'gzip, deflate'
 - accept-language : 'en-US,en;q=0.9'
 - connection : 'keep-alive'
 - dnt : '1'
 - host : 'example.com:1234'
 - referer : 'http://example.com:1234/2016_-_wikipedia_en_all_2016-02/A/Main_Page.html'
 - sec-gpc : '1'
 - user-agent : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
arguments :
 - debug : 'true'
 - lang : 'en'
 - modules : 'jquery,mediawiki'
 - only : 'scripts'
 - skin : 'vector'
 - version : 'vWmIJl0K'
Parsed : 
full_url: /w/load.php
url   : /w/load.php
acceptEncodingDeflate : 1
has_range : 0
is_valid_url : 1
.............
** running handle_content
Response :
httpResponseCode : 404
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
 - Content-Encoding: 'deflate'
 - Vary: 'Accept-Encoding'
Request time : 0.000758s
----------------------
======================
Requesting : 
full_url  : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
method    : GET (0)
version   : HTTP/1.1
request#  : 51
headers   :
 - accept : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
 - accept-charset : 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'
 - accept-encoding : 'gzip'
 - accept-language : 'en-us,en;q=0.5'
 - connection : 'close'
 - host : 'example.com'
 - user-agent : 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html'
arguments :
Parsed : 
full_url: /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
url   : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
acceptEncodingDeflate : 0
has_range : 0
is_valid_url : 1
.............
** running handle_content
Found A/Main_Page.html
mimeType: text/html
Response :
httpResponseCode : 200
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - ETag: '"1643267520242259327/c"'
 - Cache-Control: 'max-age=2723040, public'
Request time : 0.000456s
----------------------
======================
Requesting : 
full_url  : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
method    : GET (0)
version   : HTTP/1.1
request#  : 52
headers   :
 - accept : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
 - accept-charset : 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'
 - accept-encoding : 'gzip'
 - accept-language : 'en-us,en;q=0.5'
 - connection : 'close'
 - host : 'example.com'
 - user-agent : 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html'
arguments :
Parsed : 
full_url: /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
url   : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
acceptEncodingDeflate : 0
has_range : 0
is_valid_url : 1
.............
** running handle_content
Found A/Main_Page.html
mimeType: text/html
Response :
httpResponseCode : 200
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - ETag: '"1643267520242259327/c"'
 - Cache-Control: 'max-age=2723040, public'
Request time : 0.000472s
----------------------
======================
Requesting : 
full_url  : /robots.txt
method    : GET (0)
version   : HTTP/1.1
request#  : 53
headers   :
 - accept : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
 - accept-charset : 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'
 - accept-encoding : 'gzip'
 - accept-language : 'en-us,en;q=0.5'
 - connection : 'close'
 - host : 'example.com'
 - user-agent : 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html'
arguments :
Parsed : 
full_url: /robots.txt
url   : /robots.txt
acceptEncodingDeflate : 0
has_range : 0
is_valid_url : 1
.............
** running handle_content
Response :
httpResponseCode : 404
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
Request time : 0.000521s
----------------------
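
A log like the one above is easier to compare request by request once it is condensed. The following is a minimal sketch, assuming the log keeps the field layout shown here (`request#`, `full_url`, ` - host :`, `httpResponseCode`); the function name `summarize_kiwix_log` is made up for illustration and other kiwix-serve versions may format their output differently.

```shell
#!/bin/sh
# Print one line per logged request: request number, Host header, path, status.
# Reads the named log files, or standard input if none are given.
summarize_kiwix_log() {
  awk -v q="'" '
    /^full_url/          { path = $NF }                     # request path
    /^request#/          { req = $NF }                      # request counter
    /^ *- host :/        { host = $NF; gsub(q, "", host) }  # Host header, quotes stripped
    /^httpResponseCode/  { printf "%s host=%s path=%s status=%s\n", req, host, path, $NF }
  ' "$@"
}
```

Run as `summarize_kiwix_log kiwix.log` (the file name is a placeholder). Applied to the log above, it would put the Host headers side by side: the browser request carries `example.com:1234`, while the yacybot requests carry `example.com` with no port.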
Dalethium commented 2 years ago

Hey! So, I believe your issue is that your crawler is not set to intranet/local. If you want to index LAN pages, you need to either set the server to intranet mode or switch your network to the allip.freedom one (but preferably do not index LAN pages in the P2P network).

ISJ-439 commented 2 years ago

@Dalethium Hello sir,

It's set to "Search portal for your own web pages" (I have an entirely separate node that's public), but it's also using an FQDN that's resolvable by public DNS. To be clear, I'm not using 10.0.0.0/8 or similar.

YaCy is running in a Docker container, so it's possible that it's querying from 172.16.x.x, if that helps.

ISJ-439 commented 2 years ago

What is the detailed step-by-step procedure to reproduce the bug?

If you were to do this yourself:

  1. On a Debian 11 system.
  2. Install the packages and create the data directory:

    sudo apt install kiwix kiwix-tools docker.io
    sudo mkdir /opt/yacy_data
    sudo chmod 777 /opt/yacy_data

https://github.com/yacy/yacy_search_server/blob/master/docker/Readme.md

Default admin account

login: admin

password: yacy

You should change this default password on the /ConfigAccounts_p.html page when exposing your YaCy container publicly.

CONFIG

Use Case & Accounts

Basic Configuration

  1. Search portal for your own web pages
  2. Uncheck SSL and UPnP

Accounts

Admin Account

Select: Access only with qualified account
Peer User: adminuser
Set the passwords

Network Configuration

Distributed Computing Network for Domain

Select: Robinson Mode
Select: Private Peer

RAM/Disk Usage & Updates

Web Cache

HTCache Configuration

The maximum size of the cache: 50MB
Compression level: 0

Access Tracker

Server Access

Local Search access rate limitations

YaCy search

Max searches in 3s: 3
Max searches in 1mn: 30
Max searches in 10mn: 300

Portal Configuration

Generic Search Portal

Greeting Line: Search the archived copies of Wikipedia for removed or changed articles.
URL of Home Page: http://example.com:8090/
Index remote results: uncheck (this system is for searching the crawled pages only)

sudo docker run -d --name yacy -p 8090:8090 -p 8443:8443 -v /opt/yacy_data:/opt/yacy_search_server/DATA --log-opt max-size=200m --log-opt max-file=2 yacy/yacy_search_server:latest

KIWIX Server

sudo adduser kiwixuser

Download (wget) any of the ZIM files from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/ to use.

su - kiwixuser -c 'kiwix-serve --library /home/kiwixuser/example.zim --port 5555 --nosearchbar --daemon'
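
With the Kiwix server from the steps above listening on port 5555, one way to narrow things down is to repeat the crawler's request by hand and look at the status code. This is only a sketch: the headers are copied from the yacybot requests in the Kiwix log, the function name `fetch_as_crawler` and the `CURL` override are made up for illustration, and the URL in the usage line is a placeholder.

```shell
#!/bin/sh
# Fetch a URL with the headers yacybot sent in the Kiwix log and print the
# HTTP status code. Set CURL=echo to print the command instead of running it.
fetch_as_crawler() {
  ${CURL:-curl} -s -o /dev/null -w '%{http_code}\n' \
    -A 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html' \
    -H 'Accept-Encoding: gzip' \
    -H 'Connection: close' \
    "$1"
}
```

For example, `fetch_as_crawler 'http://example.com:5555/'` run on the host shows what the Kiwix server returns for that user agent. It does not show what YaCy sees from inside its Docker container, so a 200 here only rules out the server rejecting the crawler's headers.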

ISJ-439 commented 2 years ago

Hello,

I've found a lead here: is it possible that it's not carrying the custom port used (5555) over into its requests? A tcpdump capture shows it's requesting the right URL but on the wrong port (the default of 80, not 5555).
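
To make the expected behaviour concrete, here is a small sketch of how the port for a start URL is normally derived: the explicit port if the URL has one, otherwise the scheme default. The helper name `url_port` is made up for illustration, it does not handle IPv6 literals, and it says nothing about how YaCy parses URLs internally.

```shell
#!/bin/sh
# Print the port a URL should be fetched on: the explicit port if present,
# otherwise the scheme default (443 for https, 80 for anything else).
url_port() {
  url=$1
  scheme=${url%%://*}            # text before "://"
  rest=${url#*://}               # text after "://"
  authority=${rest%%/*}          # host[:port], up to the first "/"
  authority=${authority##*@}     # drop any user:password@ prefix
  case $authority in
    *:*) printf '%s\n' "${authority##*:}" ;;
    *)   case $scheme in
           https) echo 443 ;;
           *)     echo 80 ;;
         esac ;;
  esac
}
```

`url_port 'http://example.com:5555/A/Main_Page.html'` prints 5555, while the same URL without `:5555` prints 80, which is the port seen in the capture.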

Regards

thkoch2001 commented 2 years ago

Closing this issue as it seems it got solved in the last comment. Please leave a comment if further help is needed.

ISJ-439 commented 2 years ago

> Closing this issue as it seems it got solved in the last comment. Please leave a comment if further help is needed.

That's incorrect, sir. The issue persists: pages with non-standard ports are not indexed properly.

ISJ-439 commented 2 years ago

I made a new issue to reduce "fluff" https://github.com/yacy/yacy_search_server/issues/475