zytedata / zyte-smartproxy-plugin

A plugin for playwright-extra and puppeteer-extra to provide Smart Proxy Manager specific functionalities.
https://www.npmjs.com/package/zyte-smartproxy-plugin
MIT License

[Question] Python version of smartproxy plugin #4

Open adudew852 opened 1 year ago

adudew852 commented 1 year ago

Hi - my script is written in python. Is there a python version of this plugin, so that I can easily integrate the smartproxy into my existing program? Thanks.

storymode7 commented 1 year ago

Hi @adudew852, you can take a look at the following:

  1. scrapy-playwright
  2. Zyte smartproxy selenium
  3. scrapy-headless
  4. Pyppeteer Integration
  5. scrapy-zyte-smartproxy: Middleware for SPM if you are not using a browser automation.

adudew852 commented 1 year ago

Thanks. I just tried the most traditional way to add the proxy to my python + playwright code but the page is not loading... it's fine if I remove the proxy.

I used the API key as the username and left the password as 'empty'. Any idea why? Many thanks in advance.

Code below for your reference.


async def main():
    async with async_playwright() as p:
        browser = await p.webkit.launch(
            headless=False,
            slow_mo=50,
            proxy={
                "server": 'proxy.zyte.com:8011',
                "username": 'API Key',
                "password": '',
            }
        )
        context = await browser.new_context()
        page = await context.new_page()
        await stealth_async(page)
        response = await page.goto(url, timeout=60 * 1000)
        print(response.headers)

        await page.screenshot(path="demo.png")
        await asyncio.sleep(500)
        # browser.close()

asyncio.run(main())

storymode7 commented 1 year ago

Thanks for sharing the code @adudew852. The above code works for me as expected.

What is the error you receive when using proxy?

adudew852 commented 1 year ago

I run into the below error...

playwright._impl._api_types.Error: Failure when receiving data from the peer

storymode7 commented 1 year ago

Is that the complete error message? Also, have you installed the certificate to access HTTPS pages via SPM?

adudew852 commented 1 year ago

yes, I have installed the certificate. Here's the error message. Thanks.


Traceback (most recent call last):
  File "C:\Users\admin\PycharmProjects\ABC\pw_buy-zyte.py", line 93, in <module>
    asyncio.run(main())
  File "C:\Program Files\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python310\lib\asyncio\base_events.py", line 646, in run_until_complete
    return future.result()
  File "C:\Users\admin\PycharmProjects\ABC\pw_buy-zyte.py", line 70, in main
    response = await page.goto(url, timeout=60 * 1000)
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\async_api\_generated.py", line 8913, in goto
    await self._impl_obj.goto(
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_page.py", line 491, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_frame.py", line 147, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_connection.py", line 44, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_connection.py", line 419, in wrap_api_call
    return await cb()
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_connection.py", line 79, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Failure when receiving data from the peer

=========================== logs ===========================

navigating to "https://www.google.com", waiting until "load"

============================================================

storymode7 commented 1 year ago

Do you receive the same error with other URLs?

adudew852 commented 1 year ago

@storymode7 Yes, same error with other URLs. I was wondering if the way I pass an empty password is incorrect. I currently do this.

"password": '',

storymode7 commented 1 year ago

That shouldn't be a problem. I'm using the script below to try to reproduce the issue. Could you confirm you receive an error with this too?

import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

url = "https://quotes.toscrape.com/"
async def main():
    async with async_playwright() as p:
        browser = await p.webkit.launch(
            headless=False,
            slow_mo=50,
            proxy={
                "server": "proxy.zyte.com:8011",
                "username": "API Key",
                "password": "",
            }
        )
        context = await browser.new_context()
        page = await context.new_page()
        await stealth_async(page)
        response = await page.goto(url)
        print(response.headers)
        await page.screenshot(path="demo.png")
        # await asyncio.sleep(50)
        # browser.close()

asyncio.run(main())
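
On the empty-password question: assuming the proxy accepts HTTP Basic credentials, an empty password simply yields the pair `API_KEY:` before encoding, so it is a valid value rather than a malformed one. A minimal sketch of that encoding follows; the function name `proxy_basic_auth` is illustrative and not part of Playwright or any Zyte library.

```python
import base64


def proxy_basic_auth(username: str, password: str = "") -> str:
    """Return the value of a Basic Proxy-Authorization header.

    Basic credentials are "username:password" encoded as base64, so an
    empty password just leaves nothing after the colon.
    """
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Hypothetical usage; "API_KEY" stands in for a real key:
# header_value = proxy_basic_auth("API_KEY")
```

Decoding the token again gives back `API_KEY:`, which is what the proxy would see as the username with no password.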

adudew852 commented 1 year ago

Thanks for following up. Not sure what went wrong. I tried pip installing all the relevant packages again, ran your code and still got the same error. I tried changing the browser to Firefox and got the same error. Strangely, when I tried Chromium the page loaded, but with a timeout error. Error log below. Thanks again for the help on this.


Traceback (most recent call last):
  File "C:\Users\admin\PycharmProjects\ABC\zyte-test.py", line 28, in <module>
    asyncio.run(main())
  File "C:\Program Files\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python310\lib\asyncio\base_events.py", line 646, in run_until_complete
    return future.result()
  File "C:\Users\admin\PycharmProjects\ABC\zyte-test.py", line 22, in main
    response = await page.goto(url)
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\async_api\_generated.py", line 8913, in goto
    await self._impl_obj.goto(
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_page.py", line 491, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_frame.py", line 147, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_connection.py", line 44, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_connection.py", line 419, in wrap_api_call
    return await cb()
  File "C:\Users\admin\PycharmProjects\ABC\venv\lib\site-packages\playwright\_impl\_connection.py", line 79, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.

=========================== logs ===========================

navigating to "https://quotes.toscrape.com/", waiting until "load"

============================================================

Process finished with exit code 1

storymode7 commented 1 year ago

I'm sorry but I'm unable to reproduce the issue.

Could you try the following things:

  1. Use the following curl to establish that you are able to use the proxy: curl -LvU API_KEY: -x proxy.zyte.com:8011 'http://quotes.toscrape.com'
  2. Use http instead of https
  3. Remove timeouts from the code
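
The curl check in step 1 can also be approximated from Python using only the standard library. This is a rough sketch, not an official Zyte client: the helper names are made up for illustration, and it assumes the proxy accepts Basic credentials with the API key as username and an empty password, as the curl `-U API_KEY:` form does. The credentials are set as an explicit header rather than embedded in the proxy URL.

```python
import base64
import urllib.request

PROXY = "proxy.zyte.com:8011"  # host:port as used in the thread


def build_proxy_request(api_key: str, url: str) -> urllib.request.Request:
    """Build a request carrying Basic proxy credentials (API key, empty password)."""
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Proxy-Authorization", f"Basic {token}")
    return request


def build_proxy_opener(proxy: str = PROXY) -> urllib.request.OpenerDirector:
    """Return an opener that routes both http and https traffic through the proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    return urllib.request.build_opener(handler)


def check_proxy(api_key: str, url: str = "http://quotes.toscrape.com",
                timeout: float = 60.0) -> int:
    """Fetch url through the proxy and return the HTTP status code."""
    opener = build_proxy_opener()
    with opener.open(build_proxy_request(api_key, url), timeout=timeout) as response:
        return response.status


# Hypothetical usage (performs a real network request):
# print(check_proxy("API_KEY"))
```

If `check_proxy` raises here as well, the problem is more likely in the proxy credentials, certificate, or network path than in Playwright itself.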

adudew852 commented 1 year ago

Thanks for this. I reinstalled all the relevant packages and the SSL cert and finally got the proxy working, but I have now run into the issues below... are you able to help?

1) CORS issues: Playwright blocks a fetch request that gets authentication tokens from a different domain. I would imagine the Zyte proxy should be able to solve this type of CORS issue.

2) I added a number of X-Crawlera parameters to the request headers per https://docs.zyte.com/smart-proxy-manager.html#request-headers, but these headers are passed directly to the target website, which caused the HTTP request to be blocked...

3) Not sure if it is because of issues 1 and 2, but JavaScript on the page is not loaded by Playwright.

Here's my code...

import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

url = "https:// < url > "

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            slow_mo=50,
            proxy={
                "server": "proxy.zyte.com:8011",
                "username": "<api key>",
                "password": "",
            }
        )
        context = await browser.new_context()
        await context.set_extra_http_headers({
            'X-Crawlera-Profile': 'mobile',
            'X-Crawlera-Profile-Pass': 'it_IT',
            'X-Crawlera-No-Bancheck': '1',
            'X-Crawlera-Cookies': 'disable',
            'X-Crawlera-Session': 'create',
        })
        page = await context.new_page()
        await stealth_async(page)
        response = await page.goto(url)
        print(response.headers)
        await page.screenshot(path="demo.png")
        await asyncio.sleep(500 * 1000)
        # browser.close()

asyncio.run(main())

storymode7 commented 1 year ago

Hi @adudew852, I'm able to run the above script as is with "http://quotes.toscrape.com/" after commenting the asyncio.sleep.

  1. The script is not leading to any CORS issue
  2. Crawlera params are not necessary
  3. Changing the URL in the above script to "http://quotes.toscrape.com/js/" I can see that JS is loaded as expected.

Were you able to reproduce the issue with the suggestions from my last comment?

This could be a playwright issue AFAIK. Have you tried any other proxies with this setup?
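
For the report that X-Crawlera headers reach the target site, it may help to first confirm which headers each outgoing request actually carries. The helper below is only a sketch for that kind of inspection: `split_proxy_headers` is a hypothetical name, and the commented usage assumes Playwright's `page.on("request", ...)` event and the `request.headers` mapping.

```python
from typing import Dict, Mapping, Tuple

PROXY_HEADER_PREFIX = "x-crawlera-"


def split_proxy_headers(
    headers: Mapping[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split headers into (proxy-specific, everything else).

    Header names are compared case-insensitively, since browsers may
    report them in lowercase.
    """
    proxy_specific: Dict[str, str] = {}
    other: Dict[str, str] = {}
    for name, value in headers.items():
        if name.lower().startswith(PROXY_HEADER_PREFIX):
            proxy_specific[name] = value
        else:
            other[name] = value
    return proxy_specific, other


# Hypothetical usage inside the Playwright script, before page.goto(url):
# page.on("request", lambda request: print(
#     request.url, split_proxy_headers(request.headers)[0]))
```

This only shows what the browser sends; whether the proxy consumes or forwards those headers still has to be checked on the proxy side.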