webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Include a title attribute to applicable warc records #41

Closed ibnesayeed closed 6 years ago

ibnesayeed commented 6 years ago

Should we extract titles of applicable records (such as HTML pages) and make them available as an attribute? I can see some usefulness in this, but I understand that it will add some processing time. While the same can be done in applications that use the warcio package, if it proves widely useful, we might as well move the functionality into warcio itself.

edsu commented 6 years ago

This seems like a slippery slope: there are all kinds of metadata that could be of interest in the payload of a WARC response. Personally, I'd like to see a little ecosystem of utilities or plugins you can use in combination with warcio.

ibnesayeed commented 6 years ago

@edsu I agree with your point, and that is why I proposed this with so many conditions and so much caution. However, while many metadata attributes are available in the form of WARC or HTTP headers, the title is one of the primary attributes that we want even when we are merely listing WARC records (assuming that titles are more readable and meaningful than URIs for humans).

The idea of a plugin (or plugins) is great. I can think of one that either wraps warcio or enriches it by opportunistically adding fields like titles, detected content language, top keywords/entities, estimated creation date (which is sometimes available in the content of a news or a blog post), representative/primary image in the page, description, sans-markup text, primary text (after template removal), outlinks, and many other attributes like that. Looks like I am talking along the lines of oEmbed on steroids.

wumpus commented 6 years ago

If you're going to parse the page enough to get the title, you might as well run it through a tool such as the ones that convert WARC to WET or WAT files. warcio provides a very easy interface for writing a loop that takes the content in a WARC, does stuff, and then writes out another WARC with a different type of payload. It doesn't really need to be a plugin, since it's using the interface warcio makes available to everyone.
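A loop of that kind might look roughly like the sketch below. It is only an illustration, not how any existing WET or WAT converter works: `strip_markup` and `convert` are hypothetical names, the regex tag stripper is a deliberately crude stand-in for real text extraction, and writing the output as `conversion` records with a `text/plain` payload is an assumption about what the "different type of payload" could be.

```python
import re
from io import BytesIO


def strip_markup(html: bytes) -> bytes:
    """Crude, illustrative tag stripper: drop anything that looks like a tag
    and collapse runs of whitespace. Not a real HTML-to-text converter."""
    text = re.sub(rb"<[^>]+>", b" ", html)
    return b" ".join(text.split())


def convert(in_path: str, out_path: str) -> None:
    """Read response records from one WARC and write plain-text
    'conversion' records to another (hypothetical helper)."""
    # Imported here so strip_markup() can be tried without warcio installed.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        writer = WARCWriter(dst, gzip=True)
        for record in ArchiveIterator(src):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            # content_stream() yields the payload with HTTP transfer and
            # content encodings already removed.
            text = strip_markup(record.content_stream().read())
            out = writer.create_warc_record(
                uri,
                "conversion",
                payload=BytesIO(text),
                warc_content_type="text/plain",
            )
            writer.write_record(out)
```

Calling something like `convert("input.warc.gz", "output.warc.gz")` (placeholder file names) would then produce a second archive; the "does stuff" step is the single call to `strip_markup`, which is where any other processing would go.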

ikreymer commented 6 years ago

Yeah, I'm not sure this is something that needs to be a plugin, but rather a standalone command-line tool that is built on top of warcio and adds any additional dependencies, e.g. BeautifulSoup for HTML parsing or text extraction. (This is the approach we've taken in creating warcit.)

warcio's main function is to make it easy to read and write WARC records in a standards-compliant way. Everything else should probably be a separate tool.
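For the title case that started this thread, the core of such a separate tool could be sketched as follows. This is an assumption-laden illustration rather than anything warcio or warcit ships: `TitleParser`, `extract_title`, and `iter_titles` are hypothetical names, the standard library's `html.parser` is swapped in for BeautifulSoup to keep the sketch dependency-free, and the payload is assumed to decode as UTF-8 unless the caller says otherwise.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse internal whitespace; None means no usable title was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html_bytes, encoding="utf-8"):
    """Return the page title, or None (encoding is an assumption)."""
    parser = TitleParser()
    parser.feed(html_bytes.decode(encoding, errors="replace"))
    parser.close()
    return parser.title


def iter_titles(warc_path):
    """Yield (uri, title) for HTML response records in a WARC file."""
    # Imported here so extract_title() can be tried without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "html" not in ctype.lower():
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            yield uri, extract_title(record.content_stream().read())
```

A listing tool would then just loop over `iter_titles(path)` and print each pair, which keeps the HTML parsing cost and any extra dependencies out of warcio itself.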