Closed by maxyousif15 3 years ago
The following works flawlessly when executed as `python test_random.py`:

    # test_random.py
    from warcio.capture_http import capture_http
    import requests


    def test_random():
        test_fn = 'random-test-warc.warc.gz'
        url = 'http://httpbin.org/get'
        with capture_http(test_fn):
            response = requests.get(url=url)


    if __name__ == '__main__':
        test_random()

This seems to be related to pytest and to modules being imported in the wrong order. Happy to close this issue, but it would still be great to get some guidance on this!
I still cannot figure out why this is happening. I have even added `from warcio.capture_http import capture_http  # noqa` to every single `__init__.py` in the package, but I still cannot get a pytest run to produce a WARC file larger than 0 bytes.
Once the package is installed and I run the scraping method in IPython, I see the expected WARC output. Any ideas what pytest could be doing here to disrupt the import order? I'm tempted to literally copy `capture_http.py` into the package and load `capture_http` from there.
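
One way to narrow this down is to check which HTTP-related modules are already loaded at the point where the test module is about to import `capture_http`. The sketch below uses only `sys.modules` from the standard library; the helper name `already_imported` and the list of module names are assumptions made for illustration, not part of warcio or pytest.

```python
import sys

# Modules that, if loaded early, may hold a reference to the unpatched
# connection class. This list is an assumption for illustration.
HTTP_MODULES = ("http.client", "urllib3", "requests")


def already_imported(names=HTTP_MODULES):
    """Return the subset of `names` already present in sys.modules."""
    return [name for name in names if name in sys.modules]


if __name__ == "__main__":
    # Run this before `from warcio.capture_http import capture_http`.
    print("Loaded before capture_http:", already_imported())
```

Calling `already_imported()` at the very top of a test file (or of `conftest.py`) under pytest, and again under plain `python`, should show whether something in the pytest run loads `urllib3` or `requests` first. If a plugin turns out to be responsible, pytest's `-p no:NAME` option can disable it, where `NAME` is whatever name the plugin is registered under.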
This is the package structure if it helps:

    .
    ├── CHANGELOG.md
    ├── CONTRIBUTING.md
    ├── LICENSE
    ├── MANIFEST.in
    ├── README.md
    ├── playground.ipynb
    ├── pyproject.toml
    ├── pytest.ini
    ├── scrapex
    │   ├── __init__.py
    │   ├── collectors
    │   │   ├── __init__.py
    │   │   ├── base.py
    │   │   └── batch.py
    │   ├── exceptions.py
    │   ├── headers
    │   │   ├── __init__.py
    │   │   ├── data
    │   │   │   └── user-agents.txt
    │   │   └── utils.py
    │   ├── sessions
    │   │   ├── __init__.py
    │   │   ├── adapters.py
    │   │   ├── retry.py
    │   │   └── sessions.py
    │   └── warc
    │       ├── __init__.py
    │       ├── base.py
    │       ├── index.py
    │       ├── readers.py
    │       └── writers.py
    ├── setup.cfg
    ├── setup.py
    └── tests
        ├── __init__.py
        ├── integration_tests
        │   ├── __init__.py
        │   ├── test_collectors_batch.py
        │   ├── test_sessions_adapters.py
        │   └── test_sessions_sessions.py
        └── unit_tests
            ├── __init__.py
            ├── resources
            │   ├── test-empty-warc-file.warc.gz
            │   ├── test-get-html.warc.gz
            │   ├── test-get-request.warc.gz
            │   ├── test-post-request.warc.gz
            │   ├── test-redirect-request.warc.gz
            │   └── test-status-codes-get.warc.gz
            ├── test_collectors_base.py
            ├── test_collectors_batch.py
            ├── test_headers_utils.py
            ├── test_sessions_adapters.py
            ├── test_sessions_retry.py
            ├── test_sessions_sessions.py
            ├── test_warc_base.py
            ├── test_warc_index.py
            ├── test_warc_readers.py
            └── test_warc_writers.py

RESOLVED - As presumed, this has nothing to do with `warcio`. The issue was `pytest` and its loaded plugins. In particular, the `dash` pytest plugin loads `HTTPConnection` prior to the patching. Closing issue for now.
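
For anyone hitting the same thing, here is a generic illustration of why load order matters when a class is replaced on a module at import time. It is a toy model, not warcio's actual code: `fake_http`, `Connection` and `RecordingConnection` are hypothetical names standing in for a library module, its connection class, and a recording subclass.

```python
import types

# Hypothetical stand-in for a library module that exposes a connection class.
fake_http = types.ModuleType("fake_http")


class Connection:
    def send(self, data):
        return "sent " + data


fake_http.Connection = Connection

# An early importer (e.g. a plugin) takes its own reference before the patch,
# as `from fake_http import Connection` would.
EarlyConnection = fake_http.Connection


class RecordingConnection(Connection):
    recorded = []

    def send(self, data):
        # Record the payload, then delegate to the original behaviour.
        RecordingConnection.recorded.append(data)
        return super().send(data)


# The patch: replace the attribute on the module.
fake_http.Connection = RecordingConnection

# A late importer looks the name up after the patch.
LateConnection = fake_http.Connection

EarlyConnection().send("a")  # bypasses the recorder
LateConnection().send("b")   # goes through the recorder

print(RecordingConnection.recorded)
```

Only the payload sent through the late reference ends up in `recorded`. Code that bound the original class before the patch keeps using it, which matches the symptom of an empty WARC file when something loads the HTTP stack first.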
Glad you solved your problem.
Next time, I suggest looking inside the warcio tests to see if warcio has a test. In this case, yes, warcio does successfully use pytest in testing. It wouldn't have solved your problem with `dash`, but at least you'd know that your environment was where the problem was, and that warcio and pytest do work together.
Subject
I have developed a package which utilises `capture_http`. I'm observing some very strange behaviour when this method is called directly within a unit test. In particular, when I import the package module and run the scraping function, all requests and responses are recorded (the WARC output file is larger than 0 bytes, and the expected requests and responses are stored in it). However, when I run the same code in a unit test (pytest), the WARC output file is 0 bytes.

This could be due to Python's import behaviour, as I am aware that `capture_http` needs to be imported prior to `requests`. This is the case in my package: in the module where the scraping methods are defined, `capture_http` is imported prior to any `requests`, `urllib3` or `http` imports. However, this does not explain why the following code, when run as a unit test, still only produces a WARC file with 0 bytes.

Environment
OS & Python
Relevant Packages
Expected Behaviour
I would expect `test_random.py` to export a WARC file named 'random-test-warc.warc.gz', which is larger than 0 bytes and contains a single request and a single response record.
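
The expectation above could be checked in a test roughly as follows. This is a sketch, assuming warcio's `ArchiveIterator` and the `rec_type` attribute on its records; the helper names `count_record_types` and `count_warc_record_types` are made up for illustration.

```python
from collections import Counter


def count_record_types(records):
    """Count records by their `rec_type` attribute (e.g. 'request', 'response')."""
    return Counter(record.rec_type for record in records)


def count_warc_record_types(path):
    """Open a WARC file and count its records by type."""
    # Imported lazily so that capture_http can be imported first elsewhere.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        # The Counter is built while the file is still open.
        return count_record_types(ArchiveIterator(stream))
```

A test could then assert that `count_warc_record_types('random-test-warc.warc.gz')` reports one `request` and one `response` record, which fails clearly when the file is empty rather than relying on file size alone.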