ziplokk1 / incapsula-cracker

Use to bypass sites which use incapsula to block access to webscraping bots.
The Unlicense

Problem with whoscored.com #4

Closed goktugerce closed 7 years ago

goktugerce commented 8 years ago

I am trying to scrape this link as an example, but I am getting errors. I installed the hotfix branch and tried the requests solution described in the wiki, but I get the following.

With the "from incapsula import crack" approach:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/miniconda2/lib/python2.7/site-packages/incapsula_cracker-0.1.3-py2.7.egg/incapsula/requests_.py", line 67, in crack
    _load_encapsula_resource(sess, r)
  File "/home/username/miniconda2/lib/python2.7/site-packages/incapsula_cracker-0.1.3-py2.7.egg/incapsula/requests_.py", line 18, in _load_encapsula_resource
    code = get_obfuscated_code(response.content)
  File "/home/username/miniconda2/lib/python2.7/site-packages/incapsula_cracker-0.1.3-py2.7.egg/incapsula/methods.py", line 136, in get_obfuscated_code
    return code[0]
IndexError: list index out of range

If I use IncapSession, I get this:

  File "incapsula/requests_.py", line 65, in crack
    r = sess.get('{scheme}://{host}/_IncapsulaResource?{url_params}'.format(scheme=scheme, host=host, url_params=url_params), headers={'Referer': response.url})
  File "incapsula/requests_.py", line 113, in get
    return crack(self, r)
...
RuntimeError: maximum recursion depth exceeded in cmp

If I install the module via "pip install incapsula-cracker" and try the first solution, I get this as response.text, which is not what I should get.

rodolphopivetta commented 8 years ago

Same problem here. I discovered that when I make more than one request to the locked site, the response doesn't contain the incapsula token, just a blank page like:

In [5]: response.content
Out[5]: '<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=8-334279410-0 0NNN RT(1477053474013 6) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U19&incident_id=133001920800788328-1938568076429954056&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133001920800788328-1938568076429954056</iframe></body></html>'

But if I use a proxy, it sends me a new token and incapsula_cracker can unlock the site again.

I suggest raising an exception for this case.
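
A check like the one suggested here could be sketched roughly as follows. This is only an illustration, not incapsula-cracker's actual code: IncapBlocked and raise_if_blocked are hypothetical names, and the marker string is taken from the blocked response quoted above, so it may not match other Incapsula versions.

```python
import re

# Marker text taken from the blocked response quoted above (assumption:
# other Incapsula versions may word this differently).
_BLOCK_PATTERN = re.compile(
    r"Request unsuccessful\. Incapsula incident ID: ([\d-]+)"
)


class IncapBlocked(Exception):
    """Hypothetical exception: the response is a block page with no token."""

    def __init__(self, incident_id):
        super().__init__("Blocked by Incapsula, incident ID: %s" % incident_id)
        self.incident_id = incident_id


def raise_if_blocked(content):
    """Raise IncapBlocked if content (bytes or str) looks like a block page.

    Returns the decoded text unchanged when no block marker is found.
    """
    if isinstance(content, bytes):
        content = content.decode("utf-8", errors="replace")
    match = _BLOCK_PATTERN.search(content)
    if match:
        raise IncapBlocked(match.group(1))
    return content
```

Calling raise_if_blocked(response.content) before looking for the obfuscated code would turn the IndexError above into an error that names the actual cause.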

soncco commented 7 years ago

Same problem. In my case, I don't have any problems in my local tests, but on my production site I get this error. Any suggestions?

ziplokk1 commented 7 years ago

This version is now outdated. Please see here for a version that works with py2.7 and py3. I have tested the new version with whoscored.com and it works.

Thank you

roy187 commented 7 years ago

Do you have an incapsula cracker for Java? Thank you.

ziplokk1 commented 7 years ago

Sorry mate. It's been years since I've messed with Java. If you know of a good HTML parser and http/s requests library which can store session data, I would be glad to muck about making one.

roy187 commented 7 years ago

I am using jsoup for web scraping in Java, but I couldn't get past Incapsula. I don't know if it is possible to make the cracker with jsoup.

ziplokk1 commented 7 years ago

It doesn't look like it's possible simply with jsoup. I'll see what I can do (probably not soon though) using a combination of jsoup and Apache HTTPClient. Incapsula just uses a simple cookie to "verify" that you're using a browser so if you can set that cookie before every request, it will get you through most of the checks.
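
The idea of sending the same cookie with every request can be illustrated in Python (the language of this project) with only the standard library. This is a sketch under assumptions: the cookie name verification_cookie, its value, and example.com are placeholders, and working out the real cookie value is the part the cracker handles, which is not shown here.

```python
import http.cookiejar
import urllib.request


def make_cookie(name, value, domain):
    """Build a session cookie for `domain` (all values here are placeholders)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def make_opener(jar):
    """Return an opener that attaches cookies from `jar` to every request."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


# Hypothetical cookie name and value, for illustration only.
jar = http.cookiejar.CookieJar()
jar.set_cookie(make_cookie("verification_cookie", "placeholder", "example.com"))
opener = make_opener(jar)

# Show what the jar would attach to a request, without touching the network.
request = urllib.request.Request("http://example.com/some/page")
jar.add_cookie_header(request)
cookie_header = request.get_header("Cookie")
```

Every opener.open(...) call made with this opener would carry the stored cookie and would also keep any cookies the server sets in later responses.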

Though be aware that even if you get past the simple check, if you're scraping too fast, they will just simply serve a recaptcha and there's no easy way to get around that as far as I know.
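
One way to avoid scraping too fast is to enforce a minimum delay between requests. The following Python sketch is only an illustration: Throttle is a hypothetical helper, and the interval you pass in is an arbitrary choice, since no threshold for the recaptcha is known here.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to keep calls at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() immediately before each request keeps requests at least min_interval seconds apart, regardless of how long each request itself takes.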

A few tips though:

Assumptions:

AGAIN, the above tips are based on assumptions that I've made and on my past experience, and may not hold true for current or future versions of incapsula.

Hope these tips help in the meantime!

roy187 commented 7 years ago

Thank you for these tips, man, appreciated. I think using persistent cookies will work if we can somehow use them consistently: websites like whoscored.com are sometimes fooled by them once I have answered the question, and sometimes they are not. I'm not really sure why; Incapsula does not seem so stable after all. So if there is a way to crack it using persistent cookies, that will work, I guess.