yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

SPA urls that include hash params are truncated #393

Closed andrewheadricke closed 3 years ago

andrewheadricke commented 3 years ago

I was able to manually create an index for my SPA using custom-built WARC files as suggested, but I now appear to have run into a potentially much bigger issue. YaCy appears to truncate URLs after the #, so http://mysite.com/ and http://mysite.com/#!/blog/1 overwrite each other in the index.

Is it possible to change my local YaCy node so that it does not truncate at the hash? Would this break P2P compatibility?

Orbiter commented 3 years ago

The truncation is always done to create unique URLs. Hash parameters are usually used only on the client side and are not passed to the server. That means they never appear in a request to the server, so all URLs must be truncated, because the part after the # is relevant only to client-side (possibly JavaScript) evaluation.
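To illustrate the point about fragments never reaching the server, here is a minimal Python sketch using only the standard library. It is not YaCy's code: `request_target` and `index_key` are hypothetical helpers, and `index_key` only mimics the truncation described above.

```python
from urllib.parse import urldefrag, urlsplit


def request_target(url: str) -> str:
    """Path and query an HTTP client would put in the request line.

    The fragment (everything after '#') is never part of it.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")


def index_key(url: str) -> str:
    """Hypothetical index key that drops the fragment (assumption, not YaCy's API)."""
    return urldefrag(url).url


urls = [
    "http://mysite.com/",
    "http://mysite.com/#!/blog/1",
    "http://mysite.com/#!/blog/2",
]

for u in urls:
    # Each URL yields the same request target and the same truncated key.
    print(u, "->", request_target(u), index_key(u))
```

All three URLs produce the same request target and the same key, which is why entries keyed this way overwrite each other.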

andrewheadricke commented 3 years ago

Thanks @Orbiter. While an HTTP server may not be able to tell the difference between requests for http://mysite.com/#!/blog/1 and http://mysite.com/#!/blog/2, it seems to me that a search engine would want to index both blogs 1 and 2 and return both links in the search results.

The only issue I can think of is crawling. A simple non-rendering crawler would get the same result for both URLs, but I guess you could just disable crawling for hash-param URLs?
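A rough Python sketch of that idea, under stated assumptions: hash-bang URLs keep their full form as index keys (for example when their content comes from manually built WARC files) but are excluded from the crawl queue, while all other URLs are truncated as before. The function `plan_urls` and its return shape are hypothetical and do not correspond to any YaCy setting.

```python
from urllib.parse import urldefrag


def plan_urls(urls):
    """Split URLs into index keys and a crawl queue (hypothetical policy).

    - Fragments starting with '!' are treated as SPA routes: the full URL
      is kept as an index key, but the URL is not queued for crawling.
    - Any other URL is truncated at '#' and queued for crawling.
    """
    index_keys = set()
    crawl_queue = set()
    for url in urls:
        base, fragment = urldefrag(url)
        if fragment.startswith("!"):
            index_keys.add(url)      # distinct entry per SPA route
        else:
            index_keys.add(base)     # ordinary fragment is dropped
            crawl_queue.add(base)
    return index_keys, crawl_queue


keys, queue = plan_urls([
    "http://mysite.com/",
    "http://mysite.com/#!/blog/1",
    "http://mysite.com/#!/blog/2",
    "http://mysite.com/about#team",
])
print(sorted(keys))
print(sorted(queue))
```

With this policy the two blog routes stay separate in the index, and the crawler fetches each underlying document only once.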