ibnesayeed closed 6 years ago
This seems like a slippery slope: there are all kinds of metadata that could be of interest in the payload of a WARC response. Personally, I'd like to see a little ecosystem of utilities or plugins you can use in combination with warcio.
@edsu I agree with your point, and that is why I proposed this with so many conditions and so much caution. Many of the metadata attributes are already available as WARC or HTTP headers, but the title is one of the primary attributes that we want even when we are merely listing WARC records (assuming that titles are more readable and meaningful than URIs for humans).
The idea of a plugin (or plugins) is great. I can think of one that either wraps warcio or enriches it by opportunistically adding fields like titles, detected content language, top keywords/entities, estimated creation date (which is sometimes available in the content of a news or a blog post), representative/primary image in the page, description, sans-markup text, primary text (after template removal), outlinks, and many other attributes like that. Looks like I am talking along the lines of oEmbed on steroids.
If you're going to parse the page enough to get the title, you might as well run it through a tool such as the ones that convert WARC to WET or WAT files. warcio provides a very easy interface to write a loop that takes the content in a WARC, does stuff, and then writes out another WARC with a different type of payload. It doesn't really need to be a plugin, since it's using the interface warcio makes available to everyone.
Yeah, I'm not sure this is something that needs to be a plugin, but rather a standalone command-line tool that is built on top of warcio and adds any additional dependencies, e.g. BeautifulSoup for HTML parsing, text extraction, etc. (This is the approach we've taken in creating warcit.)
warcio's main function is to make it easy to read and write WARC records in a standards-compliant way. Everything else should probably be a separate tool.
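For illustration, the standalone-tool approach discussed above might look roughly like the sketch below. It is only a sketch under stated assumptions: the names `TitleParser`, `extract_title`, and `iter_titles` are hypothetical, the standard library's `html.parser` is swapped in for BeautifulSoup to avoid an extra dependency, and pages are assumed to be UTF-8 (undecodable bytes are replaced rather than detected properly).

```python
import sys
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_bytes, encoding="utf-8"):
    """Return the whitespace-normalized page title, or None if there is none."""
    parser = TitleParser()
    # Assumption: UTF-8 content; bad bytes are replaced instead of failing.
    parser.feed(html_bytes.decode(encoding, errors="replace"))
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


def iter_titles(warc_path):
    """Yield (target URI, title) for each HTML response record in a WARC file."""
    # Imported lazily so extract_title() stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "html" not in content_type.lower():
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            # content_stream() yields the payload with HTTP encodings decoded.
            yield uri, extract_title(record.content_stream().read())


if __name__ == "__main__" and len(sys.argv) > 1:
    for uri, title in iter_titles(sys.argv[1]):
        print(f"{uri}\t{title or ''}")
```

Saved under a name of your choosing and run with a WARC path as its only argument, this would print one tab-separated URI and title per HTML response. Keeping the parsing in a separate script means warcio itself pays no extra processing cost for users who never need titles.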
Should we extract titles of applicable records (such as HTML pages) and make them available as an attribute? I can see some usefulness to this, but I understand that it will add some additional processing time. While the same can be done in applications using the warcio package, if its usefulness is widespread, we might as well move the functionality to warcio itself.