yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Transparent proxy and https pages #371

Open toomyem opened 4 years ago

toomyem commented 4 years ago

It seems that YaCy can auto-index pages via the transparent proxy only when the page is accessed over HTTP. But nowadays almost all pages are served over HTTPS. So what is the use of the transparent proxy?

marcnause commented 4 years ago

When YaCy started in late 2003, it did not have a crawler; it was just a caching proxy that built an index from the data passing through it. Back then (pre-Snowden), HTTPS indicated that a page contained personal data, so such pages were ignored by the indexer for privacy reasons. Other hints of personalized pages are cookies and URL parameters.
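To make the heuristic above concrete, here is a minimal Python sketch of a check that skips pages with any of the three hints mentioned (HTTPS, cookies, URL parameters). It is not YaCy's actual code (YaCy is written in Java); the function name `looks_personalized` and the plain-dict representation of request headers are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def looks_personalized(url, request_headers):
    """Return True if a proxied request hints at personal content.

    Hypothetical helper, not part of YaCy. `request_headers` is assumed
    to be a plain dict mapping header names to values.
    """
    parts = urlsplit(url)
    # Hint 1: the page was requested over HTTPS.
    if parts.scheme == "https":
        return True
    # Hint 2: the URL carries query parameters.
    if parts.query:
        return True
    # Hint 3: the browser sent a cookie (header names are case-insensitive).
    return any(name.lower() == "cookie" for name in request_headers)
```

Under this sketch, a proxy would only hand a page to the indexer when `looks_personalized(...)` returns `False`, which explains why almost nothing gets indexed once most traffic is HTTPS.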

@Orbiter suggested dropping the proxy a long time ago, but several users (including me) did not like the idea. I agree with you that nowadays HTTPS is the default, and that the proxy is mostly useless and only exists for historical reasons.

toomyem commented 4 years ago

Thank you for the answer, it is now clear to me. So what is the typical way to feed an index? Do you mostly run the crawler by hand, now that the proxy is not usable?

marcnause commented 4 years ago

I don't run a peer at the moment, but I used the crawler to index static pages with rare changes. For more dynamic content (e.g. blogs, news) I tried to find RSS or Atom feeds and used the Load RSS Feed page (http://localhost:8090/Load_RSS_p.html) with scheduled re-loading.

There is also a video which explains indexing with RSS feeds: https://www.youtube.com/watch?v=hGwjllUdjU0
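For readers who want to see what feed-based indexing boils down to, the following Python sketch extracts the article URLs from an RSS 2.0 or Atom feed so that they could be handed to a crawler. It is only an illustration under stated assumptions: `feed_links` is a hypothetical helper, not a YaCy function, and it says nothing about how the Load RSS Feed page is implemented internally. RSS 1.0 (RDF) feeds are not handled.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Atom feeds (RFC 4287).
ATOM = "{http://www.w3.org/2005/Atom}"


def feed_links(feed_xml):
    """Return the entry URLs found in an RSS 2.0 or Atom document.

    Hypothetical helper for illustration; `feed_xml` is the feed as a string.
    """
    root = ET.fromstring(feed_xml)
    links = []

    # RSS 2.0: each <item> carries its URL as the text of a <link> child.
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        if url:
            links.append(url)

    # Atom: each <entry> has <link href="..."> elements; take the first
    # one whose rel is "alternate" (the default when rel is absent).
    for entry in root.iter(ATOM + "entry"):
        for link in entry.findall(ATOM + "link"):
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
                break

    return links
```

Scheduled re-loading then amounts to fetching the feed periodically and indexing only the URLs that have not been seen before.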

artemanufrij commented 1 year ago

Without crawling via the proxy, YaCy seems less useful, in my opinion. :thinking:

The idea that I browse the internet and YaCy follows me and grabs the content is awesome!

yacylover commented 1 year ago

It is possible, but YaCy's internal proxy needs to be configured with HTTPS interception:

https://wiki.squid-cache.org/ConfigExamples/Intercept/SslBumpExplicit