webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Sequence gives 0 results, even when the last item in sequence is $live? #589

Open jwest75674 opened 3 years ago

jwest75674 commented 3 years ago

Describe the bug

Edit: three-week review and cleanup.

At the bottom of this report is my config.yaml for reference.

Following the documentation on fallbacks via a "Sequence", I was surprised to get 0 results for requests that I know have results in collections included in this sequence.

Steps to reproduce the bug

I am not confident in the reproducibility of this bug, but I am hoping that my config can shed some light on the situation.
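
One way to compare the collections side by side is to query each one for the same URL and see which return results. Below is a minimal sketch that only builds the query URLs. It assumes pywb is listening on `http://localhost:8080` and that each collection exposes a CDX query endpoint at `/<collection>/cdx`; both are assumptions to check against your own deployment, and `cdx_query_url` is a hypothetical helper, not part of pywb.

```python
from urllib.parse import urlencode

# Assumption: local pywb instance; adjust host and port to match your setup.
BASE = "http://localhost:8080"

# Collection names taken from the config.yaml in this report.
COLLECTIONS = ["daisychain", "homepages", "ca", "memento", "live"]


def cdx_query_url(collection, url, base=BASE):
    """Build a per-collection index query URL (the endpoint path is an assumption)."""
    return f"{base}/{collection}/cdx?{urlencode({'url': url})}"


# Print one query URL per collection so the responses can be compared by hand.
for coll in COLLECTIONS:
    print(cdx_query_url(coll, "http://example.com/"))
```

Fetching each printed URL (with a browser or `curl`) should make it clear whether `daisychain` alone is empty while `homepages` and `ca` return records for the same URL.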

Expected behavior

Normal sequential failover functionality: if the first source in the sequence does not contain a result, fall through to the next, and so on, until eventually pulling from $live.
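
To make the expectation concrete, here is a minimal sketch of the failover semantics I have in mind, in plain Python. It is not pywb's implementation; `first_hit` and the toy lookup functions are hypothetical stand-ins for the index lookups of each sequence entry.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-in for an index lookup: takes a URL, returns matching records.
Lookup = Callable[[str], List[str]]


def first_hit(sources: Iterable[Tuple[str, Lookup]], url: str) -> Tuple[str, List[str]]:
    """Return (source name, results) from the first source that has any results."""
    for name, lookup in sources:
        results = lookup(url)
        if results:
            return name, results
    # Every source came back empty.
    return "", []


# Toy sources in the same order as the daisychain sequence in the config.
sources = [
    ("homepages", lambda url: []),    # no local capture
    ("ca", lambda url: []),           # no local capture
    ("index_group", lambda url: []),  # no remote capture
    ("live", lambda url: [url]),      # the live web always answers in this toy model
]

name, results = first_hit(sources, "http://example.com/")
print(name, results)
```

Under this model, a sequence ending in `$live` should never come back empty for a reachable URL, which is why 0 results for `daisychain` is surprising.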

Environment

config.yaml

# pywb config file -- Added comments specifically for this bug report, not present in actual config.
# ========================================
#

collections:
    all: $all # Returns 0 results
    live: $live # Works
    ia: memento+https://web.archive.org/web/ # Works
    rhiz: memento+http://webenact.rhizome.org/all/ # 0 results for this example test domain
    apt:  memento+http://arquivo.pt/wayback/ # 0 results for this example test domain

    # Sequence
    daisychain: # 0 results
        sequence:
            -
              index: /mnt/commoncrawl/collections/homepages/indexes/
              resource: /mnt/commoncrawl/collections/homepages/archive/
              name: homepages

            -
              index: /mnt/commoncrawl/collections/ca/indexes
              resource: /mnt/commoncrawl/collections/ca/archive
              name: ca

            -
              index_group:
                  rhiz: memento+http://webenact.rhizome.org/all/
                  ia:   cdx+http://web.archive.org/cdx;/web
                  apt:  memento+http://arquivo.pt/wayback/

            -
              index: $live
              name: live

    homepages: # many results
        index_paths: /mnt/commoncrawl/collections/homepages/indexed_sorted/
        archive_paths: /mnt/commoncrawl/collections/homepages/archive/
    ca: # many results
        index_paths: /mnt/commoncrawl/collections/ca/indexes/
        archive_paths: /mnt/commoncrawl/collections/ca/archive/
    memento: # many results
        index_group:
            rhiz:  memento+http://webenact.rhizome.org/all/
            ia:    memento+http://web.archive.org/web/
            local: ./collections/

# Settings for each collection
use_js_obj_proxy: true

# Memento support, enable
enable_memento: true

# Replay content in an iframe
framed_replay: true

#timeout: 20 --> Disabled for testing

jwest75674 commented 3 years ago

One month later check-in.

Still stuck.