Problem with whoscored.com

goktugerce commented 8 years ago

I am trying to scrape this link as an example, but I am getting errors. I installed the hotfix branch and tried the requests solution described in the wiki, but I get these:

With the from incapsula import crack approach:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/miniconda2/lib/python2.7/site-packages/incapsula_cracker-0.1.3-py2.7.egg/incapsula/requests_.py", line 67, in crack
    _load_encapsula_resource(sess, r)
  File "/home/username/miniconda2/lib/python2.7/site-packages/incapsula_cracker-0.1.3-py2.7.egg/incapsula/requests_.py", line 18, in _load_encapsula_resource
    code = get_obfuscated_code(response.content)
  File "/home/username/miniconda2/lib/python2.7/site-packages/incapsula_cracker-0.1.3-py2.7.egg/incapsula/methods.py", line 136, in get_obfuscated_code
    return code[0]
IndexError: list index out of range

If I use IncapSession, I get this:

  File "incapsula/requests_.py", line 65, in crack
    r = sess.get('{scheme}://{host}/_IncapsulaResource?{url_params}'.format(scheme=scheme, host=host, url_params=url_params), headers={'Referer': response.url})
  File "incapsula/requests_.py", line 113, in get
    return crack(self, r)
...
RuntimeError: maximum recursion depth exceeded in cmp

If I install the module via pip install incapsula-cracker and try the first solution, I get this as response.text, which is not what I should get.

rodolphopivetta commented 8 years ago

Same problem here. I discovered that when I make more than one request to the protected site, the response doesn't contain the Incapsula token, just a blank page like:

In [5]: response.content
Out[5]: '<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=8-334279410-0 0NNN RT(1477053474013 6) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U19&incident_id=133001920800788328-1938568076429954056&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133001920800788328-1938568076429954056</iframe></body></html>'

But if I use a proxy, it sends me a new token again and incapsula_cracker can unlock the site.

I suggest raising an exception for this case.
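A minimal sketch of what such a check could look like, written for Python 3. It is not part of incapsula-cracker: the exception name IncapsulaBlocked and the helper raise_if_blocked are hypothetical, and the detection relies only on the "Request unsuccessful. Incapsula incident ID:" text visible in the blocked page above, which may differ in other Incapsula versions.

```python
import re


class IncapsulaBlocked(Exception):
    """Hypothetical exception: the response is an Incapsula incident page."""

    def __init__(self, incident_id):
        super().__init__("Blocked by Incapsula, incident ID: %s" % incident_id)
        self.incident_id = incident_id


# Matches the marker text seen in the blocked page quoted in this thread.
_INCIDENT_RE = re.compile(
    r"Request unsuccessful\. Incapsula incident ID: ([\d-]+)"
)


def raise_if_blocked(html):
    """Return the page text, or raise IncapsulaBlocked for an incident page."""
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")
    match = _INCIDENT_RE.search(html)
    if match:
        raise IncapsulaBlocked(match.group(1))
    return html
```

Calling raise_if_blocked(response.content) before looking for the obfuscated code would turn the IndexError into an error that names the actual cause, so the caller can switch proxies or back off.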

soncco commented 7 years ago

Same problem. In my case I don't have any problems in my local tests, but on my production site I get this error. Any suggestions?

ziplokk1 commented 7 years ago

This version is now outdated. Please see here for a version which works with py2.7 and py3. I have tested the new version with whoscored.com and it works.

Thank you

roy187 commented 7 years ago

Do you have an Incapsula cracker for Java? Thank you.

ziplokk1 commented 7 years ago

Sorry mate. It's been years since I've messed with Java. If you know of a good HTML parser and http/s requests library which can store session data, I would be glad to muck about making one.

roy187 commented 7 years ago

I'm using jsoup for web scraping in Java, but couldn't get past Incapsula. I don't know if it is possible to make the cracker with jsoup.

ziplokk1 commented 7 years ago

It doesn't look like it's possible simply with jsoup. I'll see what I can do (probably not soon though) using a combination of jsoup and Apache HTTPClient. Incapsula just uses a simple cookie to "verify" that you're using a browser so if you can set that cookie before every request, it will get you through most of the checks.

Though be aware that even if you get past the simple check, if you're scraping too fast, they will just simply serve a recaptcha and there's no easy way to get around that as far as I know.
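One way to avoid scraping too fast is to enforce a minimum delay between requests. The sketch below is only an illustration: the Throttle class is hypothetical, and the right interval is an assumption, since this thread gives no figure for when Incapsula starts serving a recaptcha.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum gap between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; the value is a guess
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() immediately before each request keeps the request rate below one per min_interval seconds.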

A few tips though:

Change your user agent string regularly. Simply changing this from time to time will be enough to fool Incapsula for light scraping jobs.
Use persistent cookies. There are a couple of cookies which (I assume) will give your scraping more longevity if sent out with every request. (I'm not too sure how true this is since I'm just assuming based on past experience.)
A simple GET request without a browser's user agent will be blocked almost immediately.
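The first two tips could be sketched with the Python 3 standard library as follows. The user agent strings are illustrative examples, the helper names build_opener_with_cookies and next_request are hypothetical, and whether this is enough to satisfy Incapsula is, as said above, an assumption.

```python
import itertools
import urllib.request
from http.cookiejar import LWPCookieJar

# Illustrative browser user agent strings; swap in current ones as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)


def build_opener_with_cookies(cookie_file):
    """Return (opener, jar); cookies are loaded from cookie_file if it exists."""
    jar = LWPCookieJar(cookie_file)
    try:
        jar.load(ignore_discard=True)  # also restore session cookies
    except FileNotFoundError:
        pass  # first run: nothing saved yet
    handler = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(handler), jar


def next_request(url):
    """Build a request that carries the next user agent in the rotation."""
    return urllib.request.Request(url, headers={"User-Agent": next(_ua_cycle)})
```

A typical run would call opener.open(next_request(url)) for each page and jar.save(ignore_discard=True) at the end, so the cookies survive between runs.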

Assumptions:

Incapsula has an IP blacklist which, until the IP is verified by actually completing the browser challenge, will continuously serve the Incapsula blocked HTML page. This means that if you are blocked consistently, you can pull up the page in a browser, complete the challenge, and resume scraping for a short time. (This used to work for me, but that was about two years ago, so things may have changed.)
One of the cookies which seems to be necessary is something like incap_sess_id. I can't quite remember the full name.
One thing to try is to copy the cookie called ___utmvc from your browser and send it out with your requests. I have not actually tried this, but if I recall correctly there is no request-unique data in it, so you should be able to reuse the same cookie value.
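The last idea, reusing a ___utmvc value copied from the browser, could be tried with something like the sketch below. It is untested against Incapsula, the helper name request_with_browser_cookie is hypothetical, and the cookie value and user agent are whatever you copy from your own browser session.

```python
import urllib.request


def request_with_browser_cookie(url, utmvc_value, user_agent):
    """Build a request that sends a browser-copied ___utmvc cookie."""
    headers = {
        # Send the same user agent as the browser the cookie came from.
        "User-Agent": user_agent,
        # The raw value is passed through unchanged.
        "Cookie": "___utmvc=" + utmvc_value,
    }
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to urllib.request.urlopen. If the response is still the blocked page, the assumption that the cookie holds no request-unique data probably does not hold for that site.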

AGAIN the above tips are based on assumptions that I've made and from my experiences in the past and may not hold true to current or future versions of incapsula.

Hope these tips help in the meantime!

roy187 commented 7 years ago

Thank you for these tips, man, appreciated. I think persistent cookies will work if we can somehow use them consistently. Websites like whoscored.com are sometimes fooled by them once I have completed the challenge question, and sometimes they aren't. I'm not really sure why; Incapsula doesn't seem so stable after all. So if there is a way to crack it using persistent cookies, that will work, I guess.

ziplokk1 / incapsula-cracker

Problem with whoscored.com #4