webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

`capture_http` fails in tests, but works otherwise #139

Closed maxyousif15 closed 3 years ago

maxyousif15 commented 3 years ago

Subject

I have developed a package which utilises capture_http. I'm observing some very strange behaviour when this method is called directly within a unit test. In particular, when I import the package module and run the scraping function, all responses and requests are recorded (the WARC output file is larger than 0 bytes, and the expected requests and responses are stored in the WARC file). However, when I run the same code in a unit test (pytest), the WARC output file is 0 bytes.

This could be due to Python's import behaviour, as I am aware that capture_http needs to be imported before requests. This is the case in my package: in the module where the scraping methods are defined, capture_http is imported before any requests, urllib3 or http imports. However, this does not explain why the following code, when run as a unit test, still produces a 0-byte WARC file.

# test_random.py
from warcio.capture_http import capture_http
import requests

def test_random():
    test_fn = 'random-test-warc.warc.gz'
    url = 'http://httpbin.org/get'

    with capture_http(test_fn):
        response = requests.get(url=url)

Environment

OS & Python

>>> import platform
>>> print("OS", platform.platform())
OS macOS-10.16-x86_64-i386-64bit
>>> print("Python", platform.python_version())
Python 3.8.5

Relevant Packages

warcio: 1.7.4
requests: 2.25.0
urllib3: 1.26.7

Expected Behaviour

I would expect test_random.py to export a WARC file, named 'random-test-warc.warc.gz' which is larger than 0 bytes and contains a single request and a single response record.
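The expectation above could be checked in the test itself. The following is a minimal sketch, assuming warcio's ArchiveIterator and the rec_type attribute of its records; the helper names count_record_types and count_warc_record_types are made up for illustration.

```python
from collections import Counter


def count_record_types(records):
    """Tally records by their rec_type attribute (e.g. 'request', 'response')."""
    return Counter(record.rec_type for record in records)


def count_warc_record_types(path):
    """Open a WARC file and tally its records by type."""
    # Imported here so the tallying helper above can be used without warcio.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        # Counter consumes the iterator fully while the file is still open.
        return count_record_types(ArchiveIterator(stream))
```

Inside test_random, after the with block, something like `counts = count_warc_record_types(test_fn)` followed by `assert counts["request"] == 1 and counts["response"] == 1` would express the expected single request and single response.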

maxyousif15 commented 3 years ago

The following works flawlessly when executed as python test_random.py:

# test_random.py
from warcio.capture_http import capture_http
import requests

def test_random():
    test_fn = 'random-test-warc.warc.gz'
    url = 'http://httpbin.org/get'

    with capture_http(test_fn):
        response = requests.get(url=url)

if __name__ == '__main__':
    test_random()

It seems to be related to pytest and modules being imported in the wrong order. Happy to close this issue, but it would still be great to get some guidance on this!

maxyousif15 commented 3 years ago

I still cannot figure out why this is happening. I have even added from warcio.capture_http import capture_http # noqa to every single __init__.py in the package, but under pytest the WARC file is still 0 bytes.

Once the package is installed and I run the scraping method in IPython, I see the expected WARC output. Any ideas what pytest could be doing here to disrupt the import order? I am tempted to literally copy capture_http.py into the package and load capture_http from there.
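One way to narrow down an import-order problem like this is to check which HTTP client modules are already loaded at the point where capture_http is about to be imported. This is a rough diagnostic sketch using only sys.modules; the module list and the function name already_loaded are assumptions for illustration.

```python
import sys

# Modules assumed to bind HTTPConnection when they are imported; extend
# this tuple for other HTTP clients as needed.
HTTP_CLIENT_MODULES = ("urllib3", "requests")


def already_loaded(names=HTTP_CLIENT_MODULES, modules=None):
    """Return the names from `names` that are present in the module table."""
    table = sys.modules if modules is None else modules
    return [name for name in names if name in table]
```

Calling `already_loaded()` at the very top of the test module, before the capture_http import, and printing the result may help: if the list is empty under plain python but not under pytest, something loaded earlier (a plugin or a conftest.py, for example) has probably imported those modules first.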

maxyousif15 commented 3 years ago

This is the package structure if it helps:

.
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── playground.ipynb
├── pyproject.toml
├── pytest.ini
├── scrapex
│   ├── __init__.py
│   ├── collectors
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── batch.py
│   ├── exceptions.py
│   ├── headers
│   │   ├── __init__.py
│   │   ├── data
│   │   │   └── user-agents.txt
│   │   └── utils.py
│   ├── sessions
│   │   ├── __init__.py
│   │   ├── adapters.py
│   │   ├── retry.py
│   │   └── sessions.py
│   └── warc
│       ├── __init__.py
│       ├── base.py
│       ├── index.py
│       ├── readers.py
│       └── writers.py
├── setup.cfg
├── setup.py
└── tests
    ├── __init__.py
    ├── integration_tests
    │   ├── __init__.py
    │   ├── test_collectors_batch.py
    │   ├── test_sessions_adapters.py
    │   └── test_sessions_sessions.py
    └── unit_tests
        ├── __init__.py
        ├── resources
        │   ├── test-empty-warc-file.warc.gz
        │   ├── test-get-html.warc.gz
        │   ├── test-get-request.warc.gz
        │   ├── test-post-request.warc.gz
        │   ├── test-redirect-request.warc.gz
        │   └── test-status-codes-get.warc.gz
        ├── test_collectors_base.py
        ├── test_collectors_batch.py
        ├── test_headers_utils.py
        ├── test_sessions_adapters.py
        ├── test_sessions_retry.py
        ├── test_sessions_sessions.py
        ├── test_warc_base.py
        ├── test_warc_index.py
        ├── test_warc_readers.py
        └── test_warc_writers.py

maxyousif15 commented 3 years ago

RESOLVED - As presumed, this has nothing to do with warcio. The issue was pytest and its loaded plugins. In particular, the dash pytest plugin loads HTTPConnection before the patching takes place. Closing the issue for now.
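Why loading HTTPConnection before the patching matters can be shown with a toy model. This is not warcio's code: fake_http, PlainConnection and RecordingConnection are hypothetical stand-ins, and the sketch only assumes that patching means rebinding a module attribute to a recording subclass.

```python
import types

# Stand-in for a library module such as http.client (hypothetical).
fake_http = types.SimpleNamespace()


class PlainConnection:
    recorded = False


fake_http.Connection = PlainConnection

# A client imported *before* patching keeps a reference to the original class.
early_client_base = fake_http.Connection


class RecordingConnection(PlainConnection):
    recorded = True


# Patching rebinds the module attribute; existing references are unaffected.
fake_http.Connection = RecordingConnection

# A client imported *after* patching picks up the recording class.
late_client_base = fake_http.Connection
```

In this model the early client never sees the recording class, which would match the 0-byte WARC file. If a plugin is the early importer, pytest's `-p no:NAME` option can disable it for a run; the name under which the dash plugin is registered would need to be checked.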

wumpus commented 3 years ago

Glad you solved your problem.

Next time, I suggest looking inside the warcio tests to see if warcio has a test for the feature in question. In this case, yes, warcio does successfully use pytest in testing. It wouldn't have solved your problem with dash, but at least you'd know that your environment was where the problem was, and that warcio and pytest do work together.