toomyem opened this issue 4 years ago:

It seems that YaCy can auto-index pages via the transparent proxy only when a page is accessed over plain HTTP. But nowadays almost all pages are served via HTTPS. So what is the use of the transparent proxy?
When YaCy started in late 2003 it did not have a crawler; it was just a caching proxy which created an index from the data that went through the proxy. Back then (pre Snowden), HTTPS was taken as a hint that a page contained personal data, so such pages were ignored by the indexer for privacy reasons. Other hints for personalized pages were cookies and URL parameters.
@Orbiter suggested dropping the proxy a long time ago, but several users (including me) did not like the idea. I agree with you that nowadays HTTPS is the default and the proxy is mostly useless, existing only for historical reasons.
Thank you for the answer - it is now clear to me. So what is the typical way to feed the index? Do you mostly run the crawler by hand, now that the proxy is not usable?
I don't run a peer at the moment, but I used the crawler to index static pages that rarely change. For more dynamic content (e.g. blogs, news) I tried to find RSS or Atom feeds and used the Load RSS Feed page (http://localhost:8090/Load_RSS_p.html) with scheduled re-loading.
There is also a video which explains indexing with RSS feeds: https://www.youtube.com/watch?v=hGwjllUdjU0
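If you want to drive this from a script instead of the web UI, the same page can be called over plain HTTP. Below is a minimal Python sketch, with the details loudly assumed rather than confirmed: a peer on localhost:8090, HTTP digest auth (admin/yacy is a commonly cited default), and a `url` form field whose name is my guess; check the HTML form on Load_RSS_p.html for the real field names.

```python
# Minimal sketch: submit a feed to YaCy's "Load RSS Feed" page from a script.
# Assumptions (verify against your own peer): the peer listens on
# localhost:8090, admin pages use HTTP digest auth (admin/yacy is a
# commonly cited default; change it), and Load_RSS_p.html accepts a
# "url" form field -- that field name is a guess, check the page's form.
import requests
from requests.auth import HTTPDigestAuth

YACY = "http://localhost:8090"
AUTH = HTTPDigestAuth("admin", "yacy")

resp = requests.get(
    f"{YACY}/Load_RSS_p.html",
    params={"url": "https://example.org/feed.xml"},  # hypothetical field name
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()
print("feed submitted, HTTP", resp.status_code)
```

Run from cron (or any scheduler) this gives you the same effect as the scheduled re-loading in the web UI.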
Without indexing through the proxy it seems less useful - in my opinion. :thinking:
The idea that I browse the web while YaCy follows along and grabs the content is awesome!
It is possible, but YaCy's internal proxy would need to be configured for HTTPS interception, similar to what Squid does with SslBump:
https://wiki.squid-cache.org/ConfigExamples/Intercept/SslBumpExplicit
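For context on why interception is required at all: for an https:// URL a client only sends the proxy a CONNECT request and then shuttles encrypted TLS bytes through it, so there is nothing for an indexer to read. The toy Python proxy below (emphatically not YaCy's actual implementation, which is Java; port 3128 and all other details are illustrative) makes that visible:

```python
# Toy proxy illustrating why HTTPS defeats proxy-based indexing: for an
# https:// URL the client sends "CONNECT host:443" and the proxy then
# relays opaque TLS bytes. Only the hostname is ever visible in cleartext.
# NOT YaCy's implementation (YaCy is Java); port 3128 is arbitrary.
import socket
import threading

def pump(src: socket.socket, dst: socket.socket) -> None:
    try:
        while data := src.recv(65536):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()

def handle(client: socket.socket) -> None:
    request = client.recv(65536)
    if request.startswith(b"CONNECT"):
        # e.g. b"CONNECT example.org:443 HTTP/1.1": host:port is the only
        # cleartext an indexer would ever get for an HTTPS page.
        host, _, port = request.split()[1].partition(b":")
        upstream = socket.create_connection((host.decode(), int(port or b"443")))
        client.sendall(b"HTTP/1.1 200 Connection Established\r\n\r\n")
        # From here on, both directions carry encrypted bytes only.
        threading.Thread(target=pump, args=(upstream, client), daemon=True).start()
        pump(client, upstream)
    else:
        # A plain http:// request would arrive here with the full URL,
        # headers and body readable -- this is what YaCy's proxy indexer
        # relied on. (Forwarding it is omitted in this sketch.)
        client.close()

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 3128))
server.listen()
while True:
    conn, _ = server.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```

Interception (as in the SslBump link above) works around this by terminating TLS at the proxy with a locally trusted CA certificate, at which point the plaintext is available again for indexing.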