webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

URLs appear in search results but cannot be replayed (URL Not Found) #700

Open puigru opened 2 years ago

puigru commented 2 years ago

Describe the bug

Certain URLs appear in the search results but cannot be replayed: pywb reports "URL Not Found" for a URL that was just listed in the search results.

Steps to reproduce the bug

I've created a small collection which exhibits this issue: faulty_collection.zip (2 MB)

  1. Extract the archive
  2. Move the contents to the collections folder of pywb and run wb-manager init tv3
  3. Search for dinamics.ccma.cat in Domain mode
  4. Click on different URLs

Alternatively, you can just mount the collection into the Docker version: $ docker run --rm -e INIT_COLLECTION=tv3 -v /somewhere/tv3:/webarchive/collections/tv3:ro -p 8080:8080 -it webrecorder/pywb wayback

Expected behavior

All URLs should resolve, given that they are present in both the CDXJ indexes and the WARC files.
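One way to double-check the "present in the CDXJ index" half of that claim outside pywb is to scan the index file directly. The following is only a rough sketch: it assumes the usual CDXJ layout of `urlkey timestamp {json}` per line, the helper names `parse_cdxj_line` and `find_entries` are hypothetical, and the sample lines are made up for illustration (they are not taken from the attached collection).

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields).

    Assumes the layout "<urlkey> <timestamp> <json object>".
    """
    urlkey, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(raw_json)


def find_entries(lines, key_prefix):
    """Return parsed entries whose urlkey starts with key_prefix."""
    entries = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = parse_cdxj_line(line)
        if entry[0].startswith(key_prefix):
            entries.append(entry)
    return entries


# Illustrative sample lines (made up, NOT from the faulty collection).
SAMPLE = [
    'cat,ccma,dinamics)/pvideo/a.jsp?idint=1 20200101000000 '
    '{"url": "https://dinamics.ccma.cat/pvideo/a.jsp?idint=1", '
    '"status": "200", "filename": "sample.warc.gz"}',
    'com,example)/ 20200101000000 '
    '{"url": "http://example.com/", "status": "200", '
    '"filename": "sample.warc.gz"}',
]

matches = find_entries(SAMPLE, "cat,ccma,dinamics)/")
for urlkey, timestamp, fields in matches:
    print(urlkey, timestamp, fields["filename"])
```

Against a real collection, the list of sample lines would be replaced by an open file handle on the collection's index file; the exact path depends on how the collection was set up.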

Screenshots

[screenshot] https://dinamics.ccma.cat/pvideo/FLV_bbd_dadesItem.jsp?idint=3293610

Environment

puigru commented 2 years ago

I've been trying to debug this myself. There are two types of URL in the faulty collection:

The former resolve fine, the latter do not. It all points to an issue in RewriterApp. By tracing execution to where both diverge, I came to this: https://github.com/webrecorder/pywb/blob/42445562dab4cfe68cabc82ee94d51e3c70ee037/pywb/warcserver/index/fuzzymatcher.py#L182-L191

For the latter kind of URL, index_source produces an empty new_iter, so get_fuzzy_iter yields no results. For the former, it does produce some results, because it finds a urlkey match in match_general_fuzzy_query.
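To tell "the iterator is empty" apart from "the iterator has items but they are all filtered out later", a small pass-through wrapper can help while debugging. This is a generic sketch, not pywb code: `trace_iter` is a hypothetical helper, and wrapping `new_iter` with it in a local copy of fuzzymatcher.py is only a suggestion.

```python
import logging


def trace_iter(label, iterable, log=logging.getLogger("trace")):
    """Yield items unchanged, logging each item and the final count.

    Because this is a generator, nothing is logged until the
    caller actually starts consuming it.
    """
    count = 0
    for item in iterable:
        count += 1
        log.debug("%s[%d]: %r", label, count, item)
        yield item
    log.debug("%s: exhausted after %d item(s)", label, count)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    # Stand-in data; in a debugging session the wrapped object would be
    # the iterator under suspicion.
    for _ in trace_iter("demo", iter(["a", "b"])):
        pass
    for _ in trace_iter("empty", iter([])):
        pass
```

If the "exhausted after 0 item(s)" message shows up for the failing URLs, the problem lies upstream of the fuzzy matching loop, in whatever index_source does with the query parameters.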

From there, I've been unable to determine why exactly index_source produces an empty iterator for these URLs. Execution is a bit hard to follow once it goes into the aggregator, so I'd appreciate it if someone more familiar with the codebase could take a look.