zytedata / zyte-spider-templates

Spider templates for automatic crawlers.

No errors on Python 3.12 when none of `url`, `urls`, or `urls_file` is given #59

Open BurnzZ opened 3 months ago

BurnzZ commented 3 months ago

Overview

In recent PRs, we respectively introduced `urls_file` and `urls` as new parameters for specifying the input URLs of a crawl, alongside the existing `url` parameter.

Should none of these 3 parameters be supplied to a crawl, the expected behavior is the following error message:

  File "/some_dir/zyte-spider-templates-project/venv/lib/python/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for EcommerceSpiderParams
  Value error, No input parameter defined. Please, define one of: url, urls, urls_file. [type=value_error, input_value={}, input_type=dict]

However, it seems that on Python 3.12 the error is not reported at all.

Code to Reproduce

from scrapy_spider_metadata import Args
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams
from zyte_spider_templates.spiders.base import BaseSpider

class Spider(Args[EcommerceSpiderParams], BaseSpider):
    name = "spider"

if __name__ == "__main__":
    Spider()  # expected to raise the ValidationError shown above

Python 3.11 (the ValidationError is raised and reported)

Python 3.12 (no error is reported)

wRAR commented 3 months ago

It's basically https://github.com/scrapy/scrapy/issues/6047

The exception is bubbled up to the deferred created with self.crawler_process.crawl() in the crawl or runspider command, but that deferred has no errback.

(No idea why this situation is handled differently on 3.11 and 3.12 :shrug:)
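
For illustration, here is a minimal sketch (not from the issue) of the pattern described above: a Deferred whose failure has no errback attached. Twisted only reports such a failure when the Deferred is garbage-collected, not at the moment the exception occurs, which is one place interpreter-version differences could plausibly creep in.

# Standalone sketch, assuming only Twisted is installed.
from twisted.internet import defer

d = defer.Deferred()
d.addCallback(lambda _: 1 / 0)  # the callback raises; no errback is attached
d.callback(None)                # the Deferred now holds an unhandled Failure

# Twisted logs "Unhandled error in Deferred" only when the Deferred
# holding the failure is garbage-collected.
del d
import gc
gc.collect()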

wRAR commented 3 months ago

So ideally we just shouldn't rely on unhandled exceptions, unless we fix Scrapy.
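
One way to avoid relying on unhandled exceptions would be to validate the parameters explicitly and report failures ourselves. A hypothetical sketch (validate_params is not an existing helper in this project):

import sys

from pydantic import ValidationError
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams

def validate_params(kwargs: dict) -> EcommerceSpiderParams:
    # Hypothetical helper: fail loudly instead of relying on an
    # unhandled exception bubbling out of an errback-less deferred.
    try:
        return EcommerceSpiderParams(**kwargs)
    except ValidationError as error:
        print(error, file=sys.stderr)  # reported on every Python version
        raise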

Gallaecio commented 3 months ago

So in Python 3.12+ Twisted no longer reports (exceptions in) unhandled deferreds?

wRAR commented 3 months ago

Not sure what could have changed.

from scrapy import Spider

class MySpider(Spider):
    name = "spider"

    def __init__(self, *args, **kwargs):
        1 / 0  # raises ZeroDivisionError during __init__

This shows an unhandled exception on both Python versions.

wRAR commented 3 months ago

So far I have been able to minimize it to this:

import scrapy
from pydantic import BaseModel, model_validator

class Model(BaseModel):
    @model_validator(mode="after")
    def foo(self):
        # Unconditionally fail validation, like the input-parameter check.
        raise ValueError()

class Spider(scrapy.Spider):
    name = "spider"

    def __init__(self, *args, **kwargs) -> None:
        Model()  # raises pydantic.ValidationError
        super().__init__(*args, **kwargs)

Just having e.g. a required field is not enough to trigger this.
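
For what it's worth, both variants raise the same exception type when the model is instantiated directly, so the exception type alone does not explain why only the validator variant goes unreported. A quick sketch (not from the issue) confirming this:

# Both failure modes raise pydantic.ValidationError when instantiated
# directly; the divergence must lie in how the failure propagates.
from pydantic import BaseModel, ValidationError, model_validator

class ValidatorModel(BaseModel):
    @model_validator(mode="after")
    def foo(self):
        raise ValueError()

class RequiredFieldModel(BaseModel):
    field: str  # required field, no validator

for cls in (ValidatorModel, RequiredFieldModel):
    try:
        cls()
    except ValidationError as error:
        print(cls.__name__, "->", type(error).__name__, len(error.errors()), "error(s)")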