Open puigru opened 2 years ago
I've been trying to debug this myself. There are two types of URL in the faulty collection:
The former resolve fine, the latter do not. It all points to an issue in RewriterApp.
By tracing execution to where both diverge, I came to this:
https://github.com/webrecorder/pywb/blob/42445562dab4cfe68cabc82ee94d51e3c70ee037/pywb/warcserver/index/fuzzymatcher.py#L182-L191
For the latter kind of URL, index_source produces an empty new_iter, resulting in no results from get_fuzzy_iter. For the former, it does produce some results, as it finds a urlkey match in match_general_fuzzy_query.
From there, I've been unable to determine why exactly index_source produces an empty iterator for these URLs. Execution is a bit hard to follow as it goes into aggregator, so I'd appreciate it if someone more familiar with the codebase could take a look.
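To make the failure mode above concrete, here is a toy sketch of an "exact lookup, then fuzzy fallback" flow. It is not pywb's code: `lookup`, `fuzzy_lookup`, `resolve`, and the prefix-based fuzzy rule are hypothetical stand-ins, and the index is a plain list of dicts. It only shows how an empty iterator at the fuzzy step leaves the caller with no results at all.

```python
from itertools import chain


def peek_nonempty(it):
    """Return an equivalent iterator, or None if `it` yields nothing."""
    it = iter(it)
    try:
        first = next(it)
    except StopIteration:
        return None
    # Put the consumed first element back in front of the rest.
    return chain([first], it)


def lookup(index, urlkey):
    # Hypothetical exact lookup: the urlkey must match in full.
    return (line for line in index if line["urlkey"] == urlkey)


def fuzzy_lookup(index, urlkey):
    # Hypothetical fuzzy rule: ignore the query string and match by prefix.
    prefix = urlkey.split("?", 1)[0]
    return (line for line in index if line["urlkey"].startswith(prefix))


def resolve(index, urlkey):
    """Try the exact lookup first, then fall back to the fuzzy lookup."""
    exact = peek_nonempty(lookup(index, urlkey))
    if exact is not None:
        return list(exact)
    fuzzy = peek_nonempty(fuzzy_lookup(index, urlkey))
    # An empty fuzzy iterator means "not found", even if the capture exists
    # under a key that the fuzzy rule failed to produce.
    return list(fuzzy) if fuzzy is not None else []
```

In this toy model, a URL "resolves" only if one of the two generators yields at least one line, which mirrors the symptom described above: the capture is in the index, but the query that reaches it comes back empty.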
Describe the bug
Certain URLs appear in the search results but cannot be replayed: pywb reports that the URL is not found even though it was just shown in the search results.
Steps to reproduce the bug
I've created a small collection which exhibits this issue: faulty_collection.zip (2 MB)
Extract it into the collections folder of pywb and run wb-manager init tv3
Search for dinamics.ccma.cat in Domain mode

Alternatively, you can just mount the collection into the Docker version:
$ docker run --rm -e INIT_COLLECTION=tv3 -v /somewhere/tv3:/webarchive/collections/tv3:ro -p 8080:8080 -it webrecorder/pywb wayback
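Once the container is up, one way to check whether a given capture is visible to the index independently of replay is to query the collection's CDX endpoint directly. The sketch below assumes the server is reachable at http://localhost:8080 (as in the command above) and that the endpoint is /tv3/cdx with the url, matchType, output, and limit parameters; the helper names cdx_query_url and fetch_cdx_lines are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def cdx_query_url(base, collection, url, match_type="exact", limit=5):
    """Build a CDX query URL for one collection (endpoint layout is assumed)."""
    params = {
        "url": url,
        "matchType": match_type,  # e.g. "exact", "prefix", "host", "domain"
        "output": "json",
        "limit": limit,
    }
    return f"{base.rstrip('/')}/{collection}/cdx?{urlencode(params)}"


def fetch_cdx_lines(query_url, timeout=10):
    """Fetch the query and return the non-empty response lines as strings."""
    with urlopen(query_url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return [line for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    target = "https://dinamics.ccma.cat/pvideo/FLV_bbd_dadesItem.jsp?idint=3293610"
    query = cdx_query_url("http://localhost:8080", "tv3", target)
    print(query)
    for line in fetch_cdx_lines(query):
        print(line)
```

If the exact query returns lines for a URL that still fails to replay, that would suggest the problem lies in the replay lookup path and not in the index itself.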
Expected behavior
All URLs should resolve, given that they are present in both the CDXJ indexes and the WARC files.
Screenshots
https://dinamics.ccma.cat/pvideo/FLV_bbd_dadesItem.jsp?idint=3293610
Environment