Closed Gujal00 closed 2 months ago
It seems to be caused by a devtools detector. Normally the browser is tested against devtools detectors and is not caught, but this site must have taken additional precautions. I will investigate this issue in more detail, but in the meantime you can scrape it without any problem with the following method.
Step 1: https://github.com/zfcsoftware/cf-clearance-scraper/blob/4a3bb4e86d084bf4a426eced5645d5d46cec6eeb/module/scrape.js#L93 Replace this line with the following.
waitUntil: 'domcontentloaded'
Step 2: Intercept the request. Add the following code in the section below.

const { RequestInterceptionManager } = await import('puppeteer-intercept-and-modify-requests')
// attach a CDP session to the page so the response body can be rewritten in flight
const client = await page.target().createCDPSession()
const interceptManager = new RequestInterceptionManager(client)
await interceptManager.intercept({
    urlPattern: `https://apnetv.to/Hindi-Serials`,
    resourceType: "Document",
    modifyResponse({ body }) {
        return {
            // strip the devtools-detector redirect before the page executes it
            body: body.replace(`window.location.href = 'https://apnetv.to/indexnow.html';`, ''),
        };
    },
});
Thanks a lot for taking the time to look at this issue.
I made the changes and ran it on my Ubuntu server VM. Looks like even though I did npm install, it is missing some dependencies:
gujal@tux:~/cf-clearance-scraper$ npm run start
> cf-clearance-scraper@1.0.0 start
> node index.js
Server running on port 3000
Failed to launch the browser process! undefined
[1590:1590:0609/185637.942529:ERROR:ozone_platform_x11.cc(243)] Missing X server or $DISPLAY
[1590:1590:0609/185637.942605:ERROR:env.cc(258)] The platform failed to initialize. Exiting.
TROUBLESHOOTING: https://pptr.dev/troubleshooting
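I could probably work around the missing X server by running under a virtual framebuffer (untested on my side, assuming a Debian/Ubuntu host):

```shell
# install Xvfb and start the scraper under a virtual display
sudo apt-get install -y xvfb
# -a picks a free display number automatically
xvfb-run -a npm run start
```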
Anyway, I will wait for you to check and release the Docker image, then will test with the image as it will be fully self-contained. Thanks
puppeteer-intercept-and-modify-requests
npm i puppeteer-intercept-and-modify-requests
Just install it and run. After making the changes and running it, scraping will work on the relevant site.
Thank you for your feedback.
I added your code and tried it, and it gets cf_clearance for that site. Do you think there is a way to intercept the devtools detector for any site, rather than hard-coding the intercept-and-modify for specific sites?
Can you try again with the latest version?
Yes, tried with the latest version, with the following Python code. Even though cf-clearance-scraper gets the cookie, a subsequent call with the cookie results in 403:
import requests

url = 'https://apnetv.to/Hindi-Serials'
cs_url = 'http://localhost:3000/cf-clearance-scraper'
data = {'url': url}
res = requests.post(cs_url, json=data)
result = res.json()
if res.status_code == 200:
    h = result.get('headers')
    headers = {
        'User-Agent': h.get('user-agent'),
        'Referer': url,
        'Cookie': h.get('cookie')
    }
    res2 = requests.get(url, headers=headers)
Cloudflare does not only check the user-agent; it also checks values such as accept-language. If you send the whole header set, the problem will be solved.
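For example, a minimal sketch (assuming the scraper's response contains the full browser header set under "headers", as in your snippet; `build_headers` is a hypothetical helper):

```python
def build_headers(scraper_headers, referer):
    """Forward the scraper's complete header set instead of cherry-picking.

    scraper_headers is assumed to be the "headers" object returned by
    cf-clearance-scraper. Keeping accept-language, sec-ch-ua, etc. identical
    to the headers used during the solve avoids one class of 403s.
    """
    headers = dict(scraper_headers)
    headers.setdefault('Referer', referer)  # add Referer only if absent
    return headers
```

Then the follow-up request would be `res2 = requests.get(url, headers=build_headers(h, url))` instead of rebuilding the dict by hand.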
Yes, I tried with the complete headers from cf-clearance-scraper, like this:
res2 = requests.get(url, headers=h)
It is still getting 403. That site seems to be doing something more; it may not be in the scope of your project. Thanks for looking.
EDIT: Tried with FlareSolverr; even there, using headers from the response for subsequent calls doesn't work, so this is probably some other protection on the site. However, FlareSolverr returns the retrieved page's HTML in the response, so I could use that, but it's not ideal putting every request through FlareSolverr.
The library had a bug in the last update and didn't send the headers. I don't know how they did it, but that site is very good at bot detection. When I tried to analyse the request with Burp, it was blocked instantly; the Cloudflare settings are very good. When a request is sent, it somehow catches it, probably by doing a TLS fingerprint check. I am closing this because the problem with the browser and the library has been solved. Good luck scraping this site.
Hello, this problem is related to TLS fingerprinting. Since I had only just come across TLS fingerprinting at the time, I could not offer a solution. I looked into it because some of my own projects had the same problem. You can scrape it without any problem with the code below.
const initCycleTLS = require('cycletls');

(async () => {
    const cycleTLS = await initCycleTLS();
    const response = await cycleTLS('https://apnetv.to/Hindi-Serials', {
        ja3: '772,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,13-0-18-16-65281-45-43-51-5-27-23-17513-65037-35-11-10,25497-29-23-24,0',
        userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
        method: "get",
        headers: {
            "authority": "apnetv.to",
            "host": "apnetv.to",
            "origin": "https://apnetv.to",
            "referer": "https://apnetv.to/Hindi-Serials",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "accept-language": "tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7",
            "cache-control": "no-cache",
            "pragma": "no-cache",
            "priority": "u=0, i",
            "sec-ch-ua": "\"Chromium\";v=\"127\", \"Not)A;Brand\";v=\"99\"",
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": "\"Linux\"",
            "sec-fetch-dest": "document",
            "sec-fetch-mode": "navigate",
            "sec-fetch-site": "none",
            "sec-fetch-user": "?1",
            "upgrade-insecure-requests": "1",
            "cookie": "",
            "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
        }
    }, 'get');
    console.log(response.status);
    cycleTLS.exit();
})();
Thanks a lot for developing and sharing this solution; it works nicely for a lot of sites. However, there are some sites where it doesn't solve the challenge and gives up with
{"code":500,"message":"Request Timeout"}
This is one of those sites, can you please check and advise:
https://apnetv.to/Hindi-Serials