yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Suggestion: P2P index harvesting and archiving #15

Closed paraabeli closed 8 years ago

paraabeli commented 9 years ago

Hi, regarding global index harvesting: would it be possible to add extra tuning options and P2P functionality, plus a setting that lets any YaCy peer voluntarily contribute its index to archiving nodes, that is, peers that want to and are able to store huge amounts of data?

This would require an additional peer role alongside junior, senior, and principal. An archivist role would be a nice addition to YaCy.

My suggestion is that crawling peers could activate an option that checks the peer list for the archivist tag and sends all P2P index transmissions to those peers in round-robin fashion, alongside the normal P2P index distribution. This is purely to preserve and protect the P2P index from "erosion" when nodes stop running YaCy before they have fully sent their index into the global index.
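To make the round-robin idea concrete, here is a minimal sketch in Python. It is not YaCy code: the `Peer` class, the `"archivist"` role string, and `ArchivistRoundRobin` are hypothetical names used only for illustration, and it assumes the peer list already exposes each peer's role.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Peer:
    """Hypothetical peer-list entry; not YaCy's actual peer model."""
    name: str
    role: str  # "junior", "senior", "principal", or the proposed "archivist"


class ArchivistRoundRobin:
    """Cycles through the peers tagged as archivists, one per transmission."""

    def __init__(self, peers: Iterable[Peer]) -> None:
        # Keep only peers that advertise the (proposed) archivist tag.
        self._archivists: List[Peer] = [p for p in peers if p.role == "archivist"]
        self._next = 0

    def next_target(self) -> Optional[Peer]:
        """Return the next archivist in rotation, or None if there are none."""
        if not self._archivists:
            return None
        peer = self._archivists[self._next % len(self._archivists)]
        self._next += 1
        return peer


peers = [Peer("a", "senior"), Peer("b", "archivist"), Peer("c", "archivist")]
rotation = ArchivistRoundRobin(peers)
targets = [rotation.next_target().name for _ in range(3)]
```

A crawling peer would call `next_target()` once per index transmission and send a copy there in addition to the normal distribution target, so the load is spread evenly across all known archivists.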

Archivist nodes would distribute the archived index to new and old peers at low priority; their primary priority would be receiving as much of the global index as possible and sharing/syncing it among the archivist nodes.

Edit: To archive as much as possible, it is probably also necessary that P2P chunks are not indexed immediately. For example, an archivist could receive index chunks for 24 hours and then switch to an indexing mode that deactivates index receiving. In that mode the archivist node checks the received chunks for duplicates and indexes all transfers, and only then starts receiving new chunks. The reason is that indexing chunks takes quite a lot of CPU power, so the peers contributing to an archivist node could quite easily DDoS it.
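The receive-then-index cycle could look roughly like the following Python sketch. Everything here is an assumption for illustration: `ArchivistBuffer` and `index_fn` are hypothetical names, chunks are treated as opaque bytes, duplicates are detected by SHA-256 digest of the whole chunk, and the 24-hour timer is left to the caller (who would call `flush()` when the window ends).

```python
import hashlib
from typing import Callable, List, Set


class ArchivistBuffer:
    """Buffers incoming chunks, then deduplicates and indexes them in one batch."""

    def __init__(self, index_fn: Callable[[bytes], None]) -> None:
        self.receiving = True           # False while the batch is being indexed
        self._chunks: List[bytes] = []
        self._index_fn = index_fn       # hypothetical indexing callback

    def receive(self, chunk: bytes) -> bool:
        """Store a chunk; refuse it while the node is in indexing mode."""
        if not self.receiving:
            return False
        self._chunks.append(chunk)
        return True

    def flush(self) -> int:
        """Stop receiving, index each distinct chunk once, then resume receiving."""
        self.receiving = False
        try:
            seen: Set[bytes] = set()
            indexed = 0
            for chunk in self._chunks:
                digest = hashlib.sha256(chunk).digest()
                if digest in seen:
                    continue            # duplicate transfer, skip it
                seen.add(digest)
                self._index_fn(chunk)
                indexed += 1
            self._chunks.clear()
            return indexed
        finally:
            self.receiving = True


indexed_chunks: List[bytes] = []
buffer = ArchivistBuffer(indexed_chunks.append)
for incoming in (b"chunk-1", b"chunk-1", b"chunk-2"):
    buffer.receive(incoming)
count = buffer.flush()
```

Because `receive()` returns `False` during indexing, contributing peers get an explicit refusal and can retry later or pick another archivist, which keeps the CPU-heavy indexing phase from being overloaded by new transfers.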

Br, Paraabeli

Orbiter commented 8 years ago

An archive role is a good idea. However, archiving needs a common archive format. I suggest using either the XML dump format introduced in 2015 or the to-be-implemented WARC files. Details of WARC sharing are already documented at http://kaskelix.de, a proposal for a YaCy2 architecture. However, this is extremely long-term, and implementing the YaCy2 platform would need funding. Therefore this is too far in the future.
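For readers unfamiliar with WARC, the sketch below shows roughly what a single record looks like when serialized by hand in Python. It is only an illustration of the general record layout (version line, named header fields, blank line, payload, two trailing CRLFs) and is not YaCy's or YaCy2's implementation; the function name `warc_resource_record` is hypothetical, and a real archive would normally use an established WARC library.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def warc_resource_record(target_uri: str, payload: bytes, content_type: str,
                         when: Optional[datetime] = None) -> bytes:
    """Serialize one WARC 'resource' record holding the given payload."""
    # WARC-Date is written in UTC; naive datetimes are interpreted as local time.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8")
    # A blank line separates header and payload; two CRLFs end the record.
    return header + b"\r\n\r\n" + payload + b"\r\n\r\n"


record = warc_resource_record(
    "http://yacy.net/", b"hello", "text/plain",
    datetime(2016, 1, 8, 3, 24, tzinfo=timezone.utc),
)
```

Records like this can simply be concatenated into one file, which is what makes the format attractive as a common exchange format between archiving peers.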

yacylover commented 8 years ago

http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf

In reply to Michael Peter Christen's comment of 08.01.2016, 03:24 (quoted above).

yacylover commented 8 years ago

http://blog.commoncrawl.org
