andrew-d opened this issue 8 years ago
Do you mean list all URLs, or list all 'pages'? WebArchivePlayer lists what it detects as pages (usually HTML), but the detection is not perfect. It is also possible to search for URLs by prefix or by host/domain, but there is no mechanism for listing all of them, though it might be possible to add one (with some sort of pagination support, since the result could be a very large list).
Perhaps an option like this would be applicable when running pywb:
http://localhost:8080/coll-cdx?urls=all&page=1&filter=mime:text/html&limit=10
Given that pywb currently requires only the url parameter to be set, a "give me all urls" option could perhaps be accepted as an alternative way of satisfying that required parameter.
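For reference, part of this is already possible through the per-collection CDX endpoint if you can anchor the query on a URL prefix or a domain. These are rough examples assuming standard CDX-server-style parameters (matchType, filter, limit, output), which may vary by pywb version:

http://localhost:8080/coll-cdx?url=http://example.com/&matchType=prefix&filter=mime:text/html&limit=10
http://localhost:8080/coll-cdx?url=example.com&matchType=domain&output=json

What is still missing is the unanchored "everything" query with pagination, which is what urls=all would add.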
How come indexing archived URLs is available on the webrecorder.io frontend but not in pywb directly?
There's no list of all archived pages in pywb, but when one searches for a particular page, one gets a detailed result. Can that UI be used to show all archived pages instead of just one?
?urls=all doesn't seem to work.
@Serkan-devel that level of curatorial specificity is a Webrecorder-only feature currently.
Although replay via pywb is collection-centric, pywb currently only provides the facilities to manage collections of web archives (create, add to, and index) and then replay the contents of a collection.
That is to say, pywb is primarily concerned with the replay side of collections.
However, we have been thinking about how to provide some kind of page-level specificity in pywb. That feature requires heuristic evaluation of a collection's index. If you or anyone else would like to attempt to implement this feature, we would be open to it :smiley:
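To make "heuristic evaluation of a collection's index" a bit more concrete, here is a minimal sketch of the kind of naive page detection that could run over a plain CDXJ index file. The index path and the "HTML plus 2xx status equals page" rule are assumptions for illustration, not pywb's actual logic:

```python
import json

def list_pages(cdxj_path):
    """Naively treat successful HTML captures in a CDXJ index as 'pages'."""
    pages = set()
    with open(cdxj_path, encoding='utf-8') as f:
        for line in f:
            # CDXJ lines look like: "<urlkey> <timestamp> {json fields}"
            parts = line.strip().split(' ', 2)
            if len(parts) != 3:
                continue
            fields = json.loads(parts[2])
            mime = fields.get('mime', '')
            status = str(fields.get('status', ''))
            url = fields.get('url')
            if url and mime.startswith('text/html') and status.startswith('2'):
                pages.add(url)
    return sorted(pages)

if __name__ == '__main__':
    # Example path; adjust to wherever your collection's index lives.
    for url in list_pages('collections/my-coll/indexes/index.cdxj'):
        print(url)
```

In practice the detection is harder than this (redirects, frames, and non-HTML "pages" all break the simple rule), which is exactly why the heuristics need real thought.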
I'm not that good at Python, though, and I'm afraid to commit directly because someone who might ambush me is watching my activity on GitHub, but I really need this feature.
Can anyone link me to the exact files responsible for showing indexes in the pywb web UI, and to the indexing scripts in Webrecorder as a reference?
Do all searches run through this Python script?
I'm afraid to fork this project publicly on GitHub, but could I send patches for better URL querying by email if I do succeed?
To be clear, you are looking for a list of pages in the same way as they are listed in Webrecorder?
Are you using WARCs created in Webrecorder specifically, or any web archive in general?
We would like to support this in the future, but there are a few issues to resolve as to how best to do it in pywb.
Yes, I'd like to list URLs even if I haven't entered them completely, like on Webrecorder. Listing all pages at the same time would be great too.
I have WARCs both created within a local Webrecorder instance and recorded directly with pywb.
While I'm getting closer to understanding how CDX works, what are the issues blocking the implementation of better listing?
The primary issue is that we need to come up with a standard way for this search to work: are we persisting the list of pages, or computing it every time (which requires heuristics that are not always correct when done blind)?
Can this functionality from Webrecorder be used by pywb, or live in pywb itself? Do we add a CDX query filter for this?
These are just a few of the questions we have about how to do this in pywb.
If you would like to contribute to this effort, pywb/warcserver/index would be a good place to start, along with following how pywb/apps/frontendapp interacts with the warcserver.
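Before diving into warcserver or frontendapp, a low-effort way to experiment is to lean on the running instance's existing CDX endpoint over HTTP and dedupe the returned URLs. This is only a sketch: the endpoint path, the matchType parameter, and the one-JSON-object-per-line behaviour of output=json are assumptions about the running pywb version:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def distinct_urls(host, coll, domain):
    """Collect distinct archived URLs for one domain from a pywb CDX endpoint."""
    params = urlencode({'url': domain, 'matchType': 'domain', 'output': 'json'})
    endpoint = '%s/%s-cdx?%s' % (host, coll, params)
    seen = set()
    with urlopen(endpoint) as resp:
        for raw in resp.read().decode('utf-8').splitlines():
            if not raw.strip():
                continue
            url = json.loads(raw).get('url')
            if url:
                seen.add(url)
    return sorted(seen)

if __name__ == '__main__':
    # Placeholder host/collection/domain values.
    for url in distinct_urls('http://localhost:8080', 'my-coll', 'example.com'):
        print(url)
```

The obvious limitation is that you still have to know which domains are in the collection, which is the gap a real "list everything" feature would close.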
I think I'm unable to do it.
Could anyone open an issue here about managing cookies in archives, though? It doesn't seem to be documented here, and I don't want another GitHub issue to show up on my timeline. This would be useful when webpages require you to be logged in.
Any more thoughts on this? It's a feature I could really use.
+1
Hello there,
I'd like to make a feature request: a way to list all URLs in pywb. Web Archive Player has a similar feature, and it would be nice to have a version in this project, for use when I don't want to (or am unable to) run a desktop app.
Thanks,
--Andrew