zfcsoftware / cf-clearance-scraper

This library was created for testing and training purposes to retrieve the page source of websites, create Cloudflare Turnstile tokens and create Cloudflare WAF sessions.
MIT License

Not working with some sites #4

Closed Gujal00 closed 2 months ago

Gujal00 commented 3 months ago

Thanks a lot for developing and sharing this solution; it works nicely for a lot of sites. However, there are some sites where it doesn't solve the challenge and gives up with {"code":500,"message":"Request Timeout"}

This is one of the sites; can you please check and advise: https://apnetv.to/Hindi-Serials

zfcsoftware commented 3 months ago

It seems to be caused by a devtools detector. The browser has been tested against devtools detectors and is normally not caught, but this site must have taken additional precautions. I will investigate this issue in more detail, but in the meantime you can scrape it without any problem with the following method.

Step 1: https://github.com/zfcsoftware/cf-clearance-scraper/blob/4a3bb4e86d084bf4a426eced5645d5d46cec6eeb/module/scrape.js#L93 Replace this line with the following.

waitUntil: 'domcontentloaded'

Step 2: Intercept the request. Add the following code at the location below.

https://github.com/zfcsoftware/cf-clearance-scraper/blob/4a3bb4e86d084bf4a426eced5645d5d46cec6eeb/module/browser.js#L113

    const { RequestInterceptionManager } = await import('puppeteer-intercept-and-modify-requests')
    const client = await page.target().createCDPSession()
    const interceptManager = new RequestInterceptionManager(client)
    await interceptManager.intercept({
        urlPattern: `https://apnetv.to/Hindi-Serials`,
        resourceType: 'Document',
        modifyResponse({ body }) {
            // Strip the devtools-detector redirect before the page executes it.
            return {
                body: body.replace(`window.location.href = 'https://apnetv.to/indexnow.html';`, ''),
            }
        },
    })
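The modifyResponse hook above boils down to a single string replacement. As a standalone sketch (the stripDevtoolsRedirect name is hypothetical, not part of the library), the same logic can be exercised in isolation:

```javascript
// Hypothetical helper mirroring the modifyResponse hook above: it removes the
// devtools-detector redirect from the HTML body before the browser can run it.
function stripDevtoolsRedirect(body) {
    return body.replace(`window.location.href = 'https://apnetv.to/indexnow.html';`, '')
}

// A fragment resembling what the detector injects into the page.
const html = `<script>if (devtoolsOpen) { window.location.href = 'https://apnetv.to/indexnow.html'; }</script>`
console.log(stripDevtoolsRedirect(html).includes('indexnow')) // → false
```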
Gujal00 commented 3 months ago

Thanks a lot for taking the time to look at this issue. I made the changes and ran it on my Ubuntu server VM. It looks like, even though I ran npm install, some dependencies are missing:

gujal@tux:~/cf-clearance-scraper$ npm run start

> cf-clearance-scraper@1.0.0 start
> node index.js

Server running on port 3000
Failed to launch the browser process! undefined
[1590:1590:0609/185637.942529:ERROR:ozone_platform_x11.cc(243)] Missing X server or $DISPLAY
[1590:1590:0609/185637.942605:ERROR:env.cc(258)] The platform failed to initialize.  Exiting.

TROUBLESHOOTING: https://pptr.dev/troubleshooting
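As an aside, the "Missing X server or $DISPLAY" error usually means Chromium was launched in headed mode on a machine with no display. On a headless server, one common workaround (assuming the xvfb package is available; this is a general suggestion, not a step from this project's docs) is to run the start script under a virtual framebuffer:

```shell
# Assumes Xvfb is installed, e.g.: sudo apt-get install -y xvfb
# xvfb-run starts a virtual X display and sets $DISPLAY for the child process.
xvfb-run --auto-servernum npm run start
```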

Anyway, I will wait for you to check and release the Docker image, and then I will test with that, since it will be fully self-contained. Thanks

zfcsoftware commented 3 months ago

> puppeteer-intercept-and-modify-requests

Just run npm i puppeteer-intercept-and-modify-requests. After making the changes and restarting, scraping will work on the relevant site. Thank you for your feedback.

jairoxyz commented 3 months ago

I added your code and tried it, and it gets cf_clearance for that site. Do you think there is a way to intercept the devtools detector for any site, rather than hard-coding the intercept-and-modify for specific sites?

zfcsoftware commented 2 months ago

Can you try again with the latest version?

Gujal00 commented 2 months ago

> Can you try again with the latest version?

Yes, I tried with the latest version, using the following Python code. Even though cf-clearance-scraper gets the cookie, the subsequent call with the cookie results in a 403:

import requests

url = 'https://apnetv.to/Hindi-Serials'
cs_url = 'http://localhost:3000/cf-clearance-scraper'
data = {'url': url}
res = requests.post(cs_url, json=data)
if res.status_code == 200:
    result = res.json()
    h = result.get('headers')
    headers = {
        'User-Agent': h.get('user-agent'),
        'Referer': url,
        'Cookie': h.get('cookie')
    }
    res2 = requests.get(url, headers=headers)
zfcsoftware commented 2 months ago

> Yes, I tried with the latest version, using the following Python code. Even though cf-clearance-scraper gets the cookie, the subsequent call with the cookie results in a 403

Cloudflare does not only check the user-agent; it also checks values such as accept-language. If you forward the whole header set, the problem will be solved.
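To make the point concrete, here is a minimal sketch (the buildForwardHeaders helper is hypothetical) of forwarding the scraper's complete header object instead of cherry-picking individual fields:

```javascript
// Hypothetical sketch: reuse the scraper's entire header set so values such as
// accept-language match the browser that actually solved the challenge.
function buildForwardHeaders(scraperHeaders, referer) {
    // Copy every header as-is, then pin the referer for the follow-up request.
    return { ...scraperHeaders, referer }
}

const h = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) ...',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'cf_clearance=example'
}
const headers = buildForwardHeaders(h, 'https://apnetv.to/Hindi-Serials')
console.log(headers['accept-language']) // → en-US,en;q=0.9
```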

Gujal00 commented 2 months ago

Yes, I tried with the complete headers from cf-clearance-scraper, like this:

res2 = requests.get(url, headers=h)

It is still getting 403. That site seems to be doing something more; it may not be in the scope of your project. Thanks for looking.

EDIT: I tried with FlareSolverr; even there, using the headers from the response for subsequent calls doesn't work, so this is probably some other protection on the site. However, FlareSolverr returns the retrieved page HTML in its response, so I could use that, but it is not ideal to put every request through FlareSolverr.

zfcsoftware commented 2 months ago

> Yes, I tried with the complete headers from cf-clearance-scraper, like this:
>
>     res2 = requests.get(url, headers=h)
>
> It is still getting 403. That site seems to be doing something more; it may not be in the scope of your project. Thanks for looking.

The library had a bug in the last update and didn't send the headers. I don't know how they did it, but that site is very good at bot detection. When I wanted to analyse the request with Burp, it blocked it instantly; the Cloudflare settings are very strict. When you send a request it somehow catches it, probably by doing a TLS fingerprint check. I am closing this because the problem with the browser and library has been solved. Good luck scraping this site.

zfcsoftware commented 1 month ago

> Yes, I tried with the complete headers from cf-clearance-scraper, like this:
>
>     res2 = requests.get(url, headers=h)
>
> It is still getting 403. That site seems to be doing something more; it may not be in the scope of your project. Thanks for looking.

Hello, this problem is related to the TLS fingerprint. Since I had only just encountered TLS fingerprinting, I could not offer a solution at the time. I looked into it because some of my projects had the same problem. You can scrape the site without any problem with the code below.

const initCycleTLS = require('cycletls');

(async () => {
    const cycleTLS = await initCycleTLS();

    const response = await cycleTLS('https://apnetv.to/Hindi-Serials', {
        ja3: '772,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,13-0-18-16-65281-45-43-51-5-27-23-17513-65037-35-11-10,25497-29-23-24,0',
        userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
        method: "get",
        headers: {
            "authority": "apnetv.to",
            "host": "apnetv.to",
            "origin": "https://apnetv.to",
            "referer": "https://apnetv.to/Hindi-Serials",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "accept-language": "tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7",
            "cache-control": "no-cache",
            "pragma": "no-cache",
            "priority": "u=0, i",
            "sec-ch-ua": "\"Chromium\";v=\"127\", \"Not)A;Brand\";v=\"99\"",
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": "\"Linux\"",
            "sec-fetch-dest": "document",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "none",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "cookie": "",
            "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
        }
    }, 'get');

    console.log(response.status);
    cycleTLS.exit();
})();
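For context on the ja3 option above: a JA3 string is five comma-separated ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, EC point formats), each field dash-separated. A small sketch (parseJa3 is a hypothetical helper, not part of cycletls) makes the fingerprint readable:

```javascript
// Hypothetical helper: split a JA3 fingerprint into its five named fields.
function parseJa3(ja3) {
    const [version, ciphers, extensions, curves, pointFormats] = ja3.split(',')
    const nums = (s) => s === '' ? [] : s.split('-').map(Number)
    return {
        version: Number(version),       // 772 = 0x0304, i.e. TLS 1.3
        ciphers: nums(ciphers),         // cipher suite IDs, in ClientHello order
        extensions: nums(extensions),   // extension IDs; order matters for the hash
        curves: nums(curves),           // supported elliptic curves
        pointFormats: nums(pointFormats)
    }
}

const ja3 = '772,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,13-0-18-16-65281-45-43-51-5-27-23-17513-65037-35-11-10,25497-29-23-24,0'
const parsed = parseJa3(ja3)
console.log(parsed.version, parsed.ciphers.length) // → 772 15
```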