webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

Documentation: Clarify that capture_http writer with filename has no get_stream method #143

Open voltagex opened 2 years ago

voltagex commented 2 years ago

I'm using Python 3.10.4 and warcio 1.7.4

Using code based on https://github.com/webrecorder/warcio#writing-warc-records, I get this error:

    for record in ArchiveIterator(writer.get_stream()):
AttributeError: 'WARCWriter' object has no attribute 'get_stream'. Did you mean: '_iter_stream'?

Here is the code:

import os.path
import hashlib

from warcio.capture_http import capture_http
from warcio.archiveiterator import ArchiveIterator
import requests #https://github.com/webrecorder/warcio#writing-warc-records
from bs4 import BeautifulSoup

#https://gist.github.com/edsu/62bc39890806ffd19b597186a3619419

OUTPUT_PATH = 'output/'

def cache_and_return_bs(url):
    if url_already_retrieved(url):
        raise Exception(url + ' already there')
    with capture_http(get_output_filename(url),warc_version='1.1') as writer:
        #TODO: do we want to try to append to a single file?
        requests.get(url)
        for record in ArchiveIterator(writer.get_stream()):
            if record.rec_type == 'response':
                return BeautifulSoup(record.raw_stream)

def get_output_filename(url):
    return OUTPUT_PATH + hashlib.sha256(url.encode()).hexdigest()

def url_already_retrieved(url):
    return os.path.isfile(get_output_filename(url))

if __name__ == '__main__':
    print(cache_and_return_bs('https://example.org'))

I narrowed this down to passing a filename to capture_http. If I don't pass one, the writer it returns does have a get_stream method.

wumpus commented 2 years ago

This is by design. I agree that this isn't obvious and that we can improve the documentation and runtime error messages for this case.

Instead, capture to a file and, once the capture is done, read that file back.

voltagex commented 2 years ago

Thanks @wumpus, should I leave this open as a documentation bug in that case?

wumpus commented 2 years ago

Yes, please leave it open. This is not the only place where we are unclear about streaming vs. files.