webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

[feature request] Way to list all URLs in pywb #165

Open andrew-d opened 8 years ago

andrew-d commented 8 years ago

Hello there,

I'd like to make a feature request - a way to list all URLs in pywb. Web Archive Player has a similar feature, and it would be nice to have a version in this project, for use when I don't want to (or am unable to) run a desktop app.

Thanks, --Andrew

ikreymer commented 8 years ago

Do you mean list all URLs or list all 'pages'? WebArchivePlayer lists what it detects as pages (usually HTML), but the detection is not perfect. It is also possible to search for URLs by prefix or host/domain, but there is no mechanism for listing all of them, though it might be possible to add one (with some sort of pagination support, as it could be a very large list).

N0taN3rd commented 8 years ago

Perhaps an option like this would be applicable when running pywb:

http://localhost:8080/coll-cdx?urls=all&page=1&filter=mime:text/html&limit=10

Currently pywb requires only that the url parameter be set, so a "give me all URLs" option could be another "optional" required parameter.
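Until such an option exists, the host/domain matching mentioned earlier in the thread can approximate it for one domain at a time. The following is a minimal sketch that only builds the query URL and makes no request. It assumes the `coll-cdx` endpoint shape from the URL above and the `matchType`, `filter`, `limit`, `page`, and `output` parameters of the pywb CDX server API; the helper name `build_cdx_query` is hypothetical, and the endpoint path and parameter names should be checked against the installed pywb version.

```python
from urllib.parse import urlencode


def build_cdx_query(base, url, match_type="domain", mime=None, limit=None, page=None):
    """Build a CDX query URL (hypothetical helper; sends nothing over the network)."""
    # 'url' is the one parameter the thread says is required.
    params = [("url", url), ("matchType", match_type), ("output", "json")]
    if mime is not None:
        # Same filter form as in the URL proposed above: filter=mime:text/html
        params.append(("filter", "mime:" + mime))
    if limit is not None:
        params.append(("limit", str(limit)))
    if page is not None:
        params.append(("page", str(page)))
    return base + "?" + urlencode(params)


# Assumed local instance and collection endpoint, taken from the example URL above.
query = build_cdx_query(
    "http://localhost:8080/coll-cdx",
    "example.com",
    match_type="domain",
    mime="text/html",
    limit=10,
    page=0,
)
print(query)
```

This still requires a `url` value, so it lists captures under one domain rather than everything in the collection.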

Serkan-devel commented 6 years ago

How come an index of archived URLs is available on the webrecorder.io frontend [screenshot: 2018-09-29-115237_1600x900_scrot] but not on pywb directly? [screenshot: 2018-09-29-115816_1600x900_scrot]

There's no list of all archived pages on pywb, but when one searches for a particular page, one gets a detailed result. [screenshot: 2018-09-29-115708_1600x900_scrot] Can this UI be used to show all archived pages instead of just one?

?urls=all doesn't seem to work

N0taN3rd commented 6 years ago

@Serkan-devel that level of curatorial specificity is currently a Webrecorder-only feature.

Although replay via pywb is collection-centric, pywb currently only provides the facilities to manage collections of web archives (create, add to, and index) and then replay the contents of a collection.

That is to say pywb is primarily concerned with the replay side of collections.

However, we have been thinking about how to provide some kind of page-level specificity in pywb. But that feature requires heuristic evaluation of a collection's index. If you or anyone else would like to attempt to implement this feature, we would be open to it :smiley:

Serkan-devel commented 6 years ago

But I'm not that good at Python, and I'm afraid to commit directly because someone is watching my steps on GitHub who might ambush me. I really need this feature, though.

Can anyone link me to the exact files responsible for showing indexes in the pywb web UI, and to the indexing scripts in Webrecorder, as a reference?

Serkan-devel commented 6 years ago

Do all searches run through this python script?

I'm afraid to fork this project publicly on github. But could I send patches for better url-querying by email if I do succeed?

ikreymer commented 6 years ago

To be clear, you are looking for a list of pages in the same way as they are listed in Webrecorder?

Are you using WARCs created in Webrecorder specifically or any web archive in general?

We would like to support this in the future, but there are a few issues to resolve as to how best to do this in pywb.

Serkan-devel commented 6 years ago

Yes, I'd like to list urls, even if I haven't entered them completely, like on webrecorder. Listing all pages at the same time would be great too.

I do have warcs both created within a local webrecorder instance and also recorded directly with pywb.

While I'm getting closer to understanding how CDX works, what are the issues blocking the implementation of better listing?

N0taN3rd commented 6 years ago

The primary issue is that we need to come up with a standard way for the search to be done. Are we persisting this list or computing it every time (which requires heuristics that are not always correct when done blind)?

Can this functionality from Webrecorder be used by pywb and/or live in pywb? Do we add a CDX query filter for this?

These are just a few of the questions we have about how to do this in pywb.

If you would like to contribute to this effort, pywb/warcserver/index would be a good place to start, along with following how pywb/apps/frontendapp interacts with the warcserver.
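For the "computing it every time" option, the kind of heuristic in question can be sketched as a pass over CDXJ index lines. This is only an illustration, not how Webrecorder or pywb detects pages: it assumes the CDXJ layout of `urlkey timestamp {json}` with string-valued `url`, `mime`, and `status` fields, and it treats any 200 response with a `text/html` MIME type as a page, which will both miss and over-report pages. The function name `likely_pages` and the sample lines are made up for the example.

```python
import json


def likely_pages(cdxj_lines):
    """Yield (timestamp, url) for the first capture of each URL that looks like a page.

    Heuristic only: keeps captures with status "200" and a text/html MIME type.
    """
    seen = set()
    for line in cdxj_lines:
        # Assumed CDXJ layout: "<urlkey> <timestamp> <json object>"
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue
        urlkey, timestamp, raw = parts
        try:
            fields = json.loads(raw)
        except ValueError:
            continue  # skip lines whose JSON part does not parse
        if not isinstance(fields, dict):
            continue
        if fields.get("status") != "200":
            continue
        if not fields.get("mime", "").startswith("text/html"):
            continue
        if urlkey in seen:
            continue  # list each URL once, using its first matching capture
        seen.add(urlkey)
        yield timestamp, fields.get("url", urlkey)


# Made-up sample index lines, for illustration only.
sample = [
    'com,example)/ 20180101000000 {"url": "http://example.com/", "mime": "text/html", "status": "200"}',
    'com,example)/logo.png 20180101000001 {"url": "http://example.com/logo.png", "mime": "image/png", "status": "200"}',
    'com,example)/old 20180101000002 {"url": "http://example.com/old", "mime": "text/html", "status": "302"}',
    'com,example)/ 20180102000000 {"url": "http://example.com/", "mime": "text/html", "status": "200"}',
]
pages = list(likely_pages(sample))
print(pages)
```

Persisting the result instead would trade this per-request scan for the problem of keeping the stored list in sync with the index.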

Serkan-devel commented 6 years ago

I think I'm unable to do it.

But could anyone open an issue here about managing cookies in archives? It doesn't seem to be documented here, and I don't want another GitHub issue to show up on my timeline. This might be useful when webpages require being logged in.

muramasatheninja commented 3 years ago

Any more thoughts on this? It's a feature I could really use.

Jackster commented 1 year ago

+1