ziplokk1 / incapsula-cracker-py3

Python3 compatible way to bypass sites guarded with Incapsula
https://ziplokk1.github.io/incapsula-cracker-py3/
The Unlicense

Incapsula changed cookie value creation algorithm for whoscored.com #4

Open ziplokk1 opened 7 years ago

ziplokk1 commented 7 years ago

The original JavaScript method looked like this:

function setIncapCookie(vArray) {
    var res;
    try {
        var cookies = getSessionCookies();
        var digests = new Array(cookies.length);
        for (var i = 0; i < cookies.length; i++) {
            digests[i] = simpleDigest((vArray) + cookies[i])
        }
        res = vArray + ",digest=" + (digests.join())
    } catch (e) {
        res = vArray + ",digest=" + (encodeURIComponent(e.toString()))
    }
    createCookie("___utmvc", res, 20)
}

Now they have changed it to this:

function setIncapCookie(vArray) {
    var res;
    try {
        var cookies = getSessionCookies();
        var digests = new Array(cookies.length);
        for (var i = 0; i < cookies.length; i++) {
            digests[i] = simpleDigest((vArray) + cookies[i]);
        }
        var sl = "jcMQV+ffvh2BmAcW8nq2a1HZRZcsB5poBUV2Ew==";
        var dd = digests.join();
        var asl = '';
        for (var i = 0; i < sl.length; i++) {
            asl += (sl.charCodeAt(i) + dd.charCodeAt(i % dd.length)).toString(16);
        }
        res = vArray + ",digest=" + dd + ",s=" + asl;
    } catch (e) {
        res = vArray + ",digest=" + (encodeURIComponent(e.toString()));
    }
    createCookie("___utmvc", res, 20);
}

Here is the relevant diff:

var sl = "jcMQV+ffvh2BmAcW8nq2a1HZRZcsB5poBUV2Ew==";
var dd = digests.join();
var asl = '';
for (var i=0;i<sl.length;i++) {
    asl += (sl.charCodeAt(i) + dd.charCodeAt(i % dd.length)).toString(16);
}

ziplokk1 commented 7 years ago

The code diff needs to be translated into Python and inserted into the method below, which can be found at incapsula.session.IncapSession._set_incap_cookie():

def _set_incap_cookie(self, v_array, domain=''):
    """
    Calculate the final value for the cookie needed to bypass incapsula.

    .. note:: Translated from:
        function setIncapCookie(vArray) {
            var res;
            try {
                var cookies = getSessionCookies();
                var digests = new Array(cookies.length);
                for (var i = 0; i < cookies.length; i++) {
                    digests[i] = simpleDigest((vArray) + cookies[i]);
                }
                res = vArray + ",digest=" + (digests.join());
            } catch (e) {
                res = vArray + ",digest=" + (encodeURIComponent(e.toString()));
            }
            createCookie("___utmvc", res, 20);
        }

    :param v_array: Comma delimited, urlencoded string which was returned from :func:`simple_digest`.
    :param domain: Cookie domain.
    :return:
    """
    cookies = self._get_session_cookies()
    digests = []
    for cookie_val in cookies:
        digests.append(simple_digest(v_array + cookie_val))
    # Translated code must be applied here.
    res = v_array + ',digest=' + ','.join(digests)
    logger.debug('setting ___utmvc cookie to {}'.format(res))
    self._create_cookie('___utmvc', res, 20, domain=domain)
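
A minimal sketch of how the new part of the JavaScript could be translated, kept separate from the method above so it can be checked on its own. The function names compute_asl and build_utmvc_value are hypothetical and not part of this repo, and sl is assumed to have been obtained elsewhere. The empty-digest case is rejected here rather than mirrored from JavaScript.

```python
def compute_asl(sl, dd):
    """Mirror the JS loop: append hex(sl[i] + dd[i % len(dd)]) for every char of sl."""
    if not dd:
        # Assumption: an empty digest string is treated as an error in this sketch.
        raise ValueError("dd (the joined digests) must not be empty")
    return ''.join(
        format(ord(sl[i]) + ord(dd[i % len(dd)]), 'x')
        for i in range(len(sl))
    )


def build_utmvc_value(v_array, digests, sl):
    """Assemble the cookie value the way the new setIncapCookie does."""
    # JS Array.prototype.join() with no argument uses "," as the separator.
    dd = ','.join(str(d) for d in digests)
    return v_array + ',digest=' + dd + ',s=' + compute_asl(sl, dd)
```

Inside _set_incap_cookie, the line that builds res would then become something like res = build_utmvc_value(v_array, digests, sl), once sl is available.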

ziplokk1 commented 7 years ago

The full de-obfuscated .js file from whoscored.com can be found here.

ziplokk1 commented 7 years ago

The variable sl changes with every request for the obfuscated code, regardless of the host and the endpoint/query string. Originally I thought it was some site-specific base64-encoded key, but that proved to be wrong. I also thought the new query string in the Incapsula resource URL might have something to do with it, but even after requesting this URL (https://www.coursehero.com/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05) three times in a row and deobfuscating the code each time, the variable still changed.

The most straightforward solution I can think of right now is:

  1. Parse the original page to obtain the resource URL.
  2. Use a regex or a JS engine to parse the JS document at that URL, which contains (and deobfuscates) the obfuscated code, simply labeled var b.
  3. Deobfuscate the JS code and, again with a regex or a JS engine, get the value of var sl from the setIncapCookie method.
  4. Use that value to create the new digest which Incapsula is expecting.

yalopov commented 7 years ago

Hi! I'm not sure if it works (I haven't tested it yet), but I think it can at least get the value of the sl variable.

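import re

b = ""  # placeholder: the hex string assigned to `var b` in the resource script

# Code equivalent to:
# for (var i = 0; i < b.length; i += 2) {
#     z = z + parseInt(b.substring(i, i + 2), 16) + ",";
# }
# z = z.substring(0, z.length - 1);
# eval(eval('String.fromCharCode(' + z + ')'));

# Each pair of hex digits in b encodes one character code.
char_list = []
for i in range(0, len(b), 2):
    char_list.append(int(b[i:i+2], base=16))

code = ""
for char in char_list:
    code = code + chr(char)

# Regex to match the value of the sl variable (stops at the closing quote)
sl_var = re.search('sl = "([^"]+)";', code).group(1)

dd = ""  # placeholder: must hold the joined digests (digests.join() in the JS)
asl = ""

# Code equivalent to:
# for (var i=0;i<sl.length;i++) {
#     asl += (sl.charCodeAt(i) + dd.charCodeAt(i % dd.length)).toString(16);
# }

# Iterate over the length of sl_var, not over the string itself.
for i in range(len(sl_var)):
    asl = asl + format(ord(sl_var[i]) + ord(dd[i % len(dd)]), 'x')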

I'm also not sure whether matching the _Incapsula_Resource URL in the script tag of the original page with a regex is a good idea. Something like:

re.search("(/_Incapsula_Resource.+)'",response.text).group(1)
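
For anyone who wants to try the decoding step without a live page, here is a self-contained sketch of the same idea. The helper names decode_hex_payload and extract_sl are made up for illustration, and the sample payload is built locally by hex-encoding the sl line quoted earlier in this issue; it is not a captured response.

```python
import re


def decode_hex_payload(b):
    """Turn the hex string stored in `var b` into the JavaScript source it encodes."""
    return ''.join(chr(int(b[i:i + 2], 16)) for i in range(0, len(b), 2))


def extract_sl(js_source):
    """Return the value assigned to `sl`, or None if the script has no such variable."""
    match = re.search(r'\bsl\s*=\s*"([^"]*)"', js_source)
    return match.group(1) if match else None


# Hypothetical payload built locally; real payloads come from the resource script.
sample_js = 'var sl = "jcMQV+ffvh2BmAcW8nq2a1HZRZcsB5poBUV2Ew==";'
sample_b = ''.join(format(ord(ch), '02x') for ch in sample_js)

print(extract_sl(decode_hex_payload(sample_b)))
```

Returning None instead of raising makes it easier to notice when the served script no longer contains an sl assignment.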

brianzinn commented 6 years ago

@Hades1996 - That logic will work. I ended up with a working solution, but not in Python. Used this repo to get most of the logic :) @ziplokk1 I went with the straightforward solution you listed - the deobfuscation of var b to get the JavaScript is the same as the other inline script on the landing page.

LuisUrrutia commented 6 years ago

Thanks @Hades1996, I updated the script using your code. I hope that @ziplokk1 accepts my PR.

ziplokk1 commented 6 years ago

Sorry, I have been a bit busy lately. I will try to review the changes this weekend and create an experimental branch so that I can merge your changes without affecting the master branch until I can verify that it won't break scrapers for other sites. Which sites have you tested the changes against?

andresarslanian commented 6 years ago

Just to add to @LuisUrrutia's answer, I had to do something different in parser.py to make it work.

In incapsula_script_url I've added:

        m = re.search(r"_analytics_scr.src = '(.*)';", self.response.text)

        if m:
            src = m.group().split("'")
            for s in src:
                if '_Incapsula_Resource' in s:
                    return s

So the function is now:

    def incapsula_script_url(self):
        """
        The script url to get the b var value

        :rtype: str
        """
        # print self.response.text
        m = re.search(r"_analytics_scr.src = '(.*)';", self.response.text)

        if m:
            src = m.group().split("'")
            for s in src:
                if '_Incapsula_Resource' in s:
                    return s

        scripts = self.soup.find_all('script')
        if len(scripts) > 1:
            return scripts[0].get('src')

        return None
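
A standalone sketch of the same extraction, in case it helps to test the pattern outside the parser class. This is a stricter variant of the regex above (escaped dot, capture stops at the closing quote), not the code in the repo; find_incapsula_script_url and the sample HTML fragment are hypothetical.

```python
import re

# Stricter variant of the pattern above: the dot is escaped and the captured
# group cannot run past the closing single quote.
_SCRIPT_SRC = re.compile(
    r"_analytics_scr\.src\s*=\s*'([^']*_Incapsula_Resource[^']*)'"
)


def find_incapsula_script_url(html):
    """Return the _Incapsula_Resource URL assigned to _analytics_scr.src, or None."""
    match = _SCRIPT_SRC.search(html)
    return match.group(1) if match else None


# Hypothetical page fragment, for illustration only.
sample_html = (
    "<script>_analytics_scr.src = "
    "'/_Incapsula_Resource?SWJIYLWA=abc123';</script>"
)
print(find_incapsula_script_url(sample_html))
```

Capturing the URL directly avoids the split("'") loop, at the cost of assuming the assignment always uses single quotes.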

Hope it helps :)

EDIT: I had to do some more stuff to make it work.

First in session.py I had to change the constructor:

class IncapSession(Session):
    """
    Session object to bypass sites which are guarded by incapsula.

    :param max_retries: The number of times to attempt to get the incapsula resource before
        raising a :class:`MaxRetriesExceeded` error. Set this to `None` to never give up.
    :param user_agent: Change the default user agent when sending requests.
    :param cookie_domain: Use this param to change the domain which is set in the cookie.
        Sometimes the domain set for the cookie isn't the same as the actual host.
        i.e. .domain.com instead of www.domain.com.
    :param resource_parser: :class:`ResourceParser` to use when checking whether the website served back a page which
        is blocked by incapsula. Default: :class:`WebsiteResourceParser`.
    :param iframe_parser: :class:`ResourceParser` class (not instance) to use when checking whether the iframe
        contains a captcha. Default: :class:`IframeResourceParser`.
    """

    def __init__(self, max_retries=3, user_agent=None, cookie_domain='', resource_parser=WebsiteResourceParser,
                 iframe_parser=IframeResourceParser, host=None):
        super(IncapSession, self).__init__()

        default_useragent = 'IncapUnblockSession (https://github.com/ziplokk1/incapsula-cracker-py3)'
        user_agent = user_agent or default_useragent

        self.max_retries = max_retries
        self.cookie_domain = cookie_domain
        self.headers['User-Agent'] = user_agent
        if host:
            self.headers['Host'] = host

I've added the host param so that the Host header matches the site host. I was being blocked because of that single header. So I init the IncapSession like this:

    session = IncapSession(host="www.mysite.com", user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')

Then I modified the crack method to look like this:

    def crack(self, resp, org=None, tries=0):
        """
        If the response is blocked by incapsula then set the necessary cookies and attempt to bypass it.

        :param resp: Response to check.
        :param org: Original response. Used only when called recursively.
        :param tries: Number of attempts. Used only when called recursively.
        :return:
        """
        # Use to hold the original request so that when attempting the new unblocked request, we have a reference
        # to the original url.
        org = org or resp

        # Return original response after too many tries to bypass incap.
        # If max_retries is None then this part will never get executed allowing a continuous retry.
        if self.max_retries is not None and tries >= self.max_retries:
            raise MaxRetriesExceeded(resp, 'max retries exceeded when attempting to crack incapsula')

        resource = self.ResourceParser(resp)

        if resource.is_blocked():
            logger.debug('Resource is blocked. attempt={} url={}'.format(tries, resp.url))
            # Raise if the response content's iframe contains a recaptcha.
            self._raise_for_recaptcha(resource)

            # Apply cookies and send GET request to apply them.
            self._apply_cookies(org.url, resource.incapsula_script_url)

            # Recursively call crack() again since if the request isn't blocked after the above cookie-set and request,
            # then it will just return the unblocked resource.
            return self.crack(self.get(org.url, bypass_crack=True), org=org, tries=tries + 1)
        else:
            if resource.incapsula_script_url and not tries:
                self._apply_cookies(org.url, resource.incapsula_script_url)
                return self.crack(self.get(org.url, bypass_crack=True), org=org, tries=tries + 1)

        return resp

I've added the else branch so that, on the first attempt, the cookies are also applied when the response isn't blocked but incapsula_script_url is present.

I hope this helps someone! Cheers

lemm-leto commented 6 years ago

@andresarslanian Your solution doesn't seem to work. I got blocked by a reCAPTCHA error.

BTW, I've changed

resource = self.ResourceParser(resp)

to

resource = WebsiteResourceParser(resp)

@ziplokk1 any updates here? Does it work for you now?

lemm-leto commented 6 years ago

Also, as far as I can see we now have a new problem: Incapsula says 'Request unsuccessful. Incapsula incident ID: 108002140047883972-116655804302232934'. This is something new.

lemm-leto commented 6 years ago

Another few hours of research led me to conclude that it is almost impossible to deobfuscate the whoscored script. Instead of some meaningful JS I'm just getting the following (not the full code):

(function(_0x2b5b4a,_0x4ecf16){var _0x5ef947=function(_0x1d6e77){while(--_0x1d6e77){_0x2b5b4a['\x70\x75\x73\x68'](_0x2b5b4a['\x73\x68\x69\x66\x74']());}};var _0x3f7379=function(){var _0x2d7e04={'\x64\x61\x74\x61':{'\x6b\x65\x79':'\x63\x6f\x6f\x6b\x69\x65','\x76\x61\x6c\x75\x65':'\x74\x69\x6d\x65\x6f\x75\x74'},'\x73\x65\x74\x43\x6f\x6f\x6b\x69\x65':function(_0x3c1e49,_0x1da525,_0xb81f58,_0x54b02c){_0x54b02c=_0x54b02c||{};var _0x46b4c9=_0x1da525+'\x3d'+_0xb81f58;var _0x28f133=0x0;for(var _0x28f133=0x0,_0x5533ae=_0x3c1e49['\x6c\x65\x6e\x67\x74\x68'];_0x28f133<_0x5533ae;_0x28f133++){var _0xd0dd68=_0x3c1e49[_0x28f133];_0x46b4c9+='\x3b\x20'+_0xd0dd68;var _0x6de593=_0x3c1e49[_0xd0dd68];_0x3c1e49['\x70\x75\x73\x68'](_0x6de593);_0x5533ae=_0x3c1e49['\x6c\x65\x6e\x67\x74\x68'];if(_0x6de593!==!![]){_0x46b4c9+='\x3d'+_0x6de593;}}_0x54b02c['\x63\x6f\x6f\x6b\x69\x65']=_0x46b4c9;},'\x72\x65\x6d\x6f\x76\x65\x43\x6f\x6f\x6b\x69\x65':function(){return'\x64\x65\x76';},'\x67\x65\x74\x43\x6f\x6f\x6b\x69\x65':function(_0x19e734,_0x2cd600){_0x19e734=_0x19e734||function(_0x2ca07b){return _0x2ca07b;};var _0x1abcfd=_0x19e734(new RegExp('\x28\x3f\x3a\x5e\x7c\x3b\x20\x29'+_0x2cd600['\x72\x65\x70\x6c\x61\x63\x65'](/([.$?*|{}()[]\/+^])/g,'\x24\x31')+'\x3d\x28\x5b\x5e\x3b\x5d\x2a\x29'));var _0x973354=function(_0x57beb8,_0x23fede){_0x57beb8(++_0x23fede);};_0x973354(_0x5ef947,_0x4ecf16);return _0x1abcfd?decodeURIComponent(_0x1abcfd[0x1]):undefined;}};var _0x449ec7=function(){var _0x45b6c6=new RegExp('\x5c\x77\x2b\x20\x2a\x5c\x28\x5c\x29\x20\x2a\x7b\x5c\x77\x2b\x20\x2a\x5b\x27\x7c\x22\x5d\x2e\x2b\x5b\x27\x7c\x22\x5d\x3b\x3f\x20\x2a\x7d');return _0x45b6c6['\x74\x65\x73\x74'](_0x2d7e04['\x72\x65\x6d\x6f\x76\x65\x43\x6f\x6f\x6b\x69\x65']['\x74\x6f\x53\x74\x72\x69\x6e\x67']());};_0x2d7e04['\x75\x70\x64\x61\x74\x65\x43\x6f\x6f\x6b\x69\x65']=_0x449ec7;

It seems to be executable JavaScript; however, there is no sl variable inside it.
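
The \xNN sequences in that output are ordinary hex escapes (for example '\x70\x75\x73\x68' is 'push'), so a first readability pass can be sketched as below. This only rewrites string escapes; it does not undo the renamed identifiers or the rest of the obfuscation. The function name decode_hex_escapes is made up for this example, and the regex is a naive textual pass rather than a real JS parser.

```python
import re

_HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')


def decode_hex_escapes(js_source):
    """Replace \\xNN escapes with the character they encode, for readability only."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Keep quotes, backslashes and non-printable characters escaped so the
        # string literals in the output are not broken.
        if char in '\'"\\' or not char.isprintable():
            return match.group(0)
        return char
    return _HEX_ESCAPE.sub(replace, js_source)


snippet = r"_0x2b5b4a['\x70\x75\x73\x68'](_0x2b5b4a['\x73\x68\x69\x66\x74']());"
print(decode_hex_escapes(snippet))
```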

yalopov commented 6 years ago

@lemm-leto I tried to investigate how Incapsula does client-side verification a few weeks ago, but I couldn't find anything useful.

You're right: it seems Incapsula's client challenge is now being obfuscated, which makes it very hard to understand.

I did some research, and the obfuscated scripts users receive from Incapsula look like the ones generated by this tool: https://javascriptobfuscator.herokuapp.com/

If Incapsula is obfuscating their code using random seeds and adding debug protection, we are pretty much scrubbed.

malexovic commented 5 years ago

I was able to decode it to the following format (apparently variable and function names are non-recoverable):

(function(_0x171531, _0x41e00e) { var _0x32e077 = function(_0x4a5e48) { while (--_0x4a5e48) { _0x171531['push'](_0x171531['shift']()); } }; var _0x3fb675 = function() { var _0x3f9da5 = { 'data': { 'key': 'cookie', 'value': 'timeout' }, 'setCookie': function(_0x467b90, _0x257b76, _0x4eef65, _0x119d7a) { _0x119d7a = _0x119d7a || {}; var _0x5bb956 = _0x257b76 + '=' + _0x4eef65; var _0x221dea = 0x0; for (var _0x221dea = 0x0, _0x10a956 = _0x467b90['length']; _0x221dea < _0x10a956; _0x221dea++) { var _0x376fec = _0x467b90[_0x221dea]; _0x5bb956 += '; ' + _0x376fec; var _0x4d287c = _0x467b90[_0x376fec]; _0x467b90'push'; _0x10a956 = _0x467b90['length']; if (_0x4d287c !== !![]) { _0x5bb956 += '=' + _0x4d287c; } } _0x119d7a['cookie'] = _0x5bb956; }, 'removeCookie': function() { return 'dev'; }, 'getCookie': function(_0x24a402, _0x17be0f) { _0x24a402 = _0x24a402 || function(_0x429cfc) { return _0x429cfc; }; var _0x3e276b = _0x24a402(new RegExp('(?:^|; )' + _0x17be0f'replace'/g, '$1') + '=([^;])')); var _0x4bec58 = function(_0x392f91, _0x497325) { _0x392f91(++_0x497325); }; _0x4bec58(_0x32e077, _0x41e00e); return _0x3e276b ? decodeURIComponent(_0x3e276b[0x1]) : undefined; } }; var _0x37c64f = function() { var _0x360f3e = new RegExp('\w+ () {\w+ ['|"].+['|"];? 
}'); return _0x360f3e['test'](_0x3f9da5['removeCookie']['toString']()); }; _0x3f9da5['updateCookie'] = _0x37c64f; var _0x3d0d0e = ''; var _0x4b53f0 = _0x3f9da5['updateCookie'](); if (!_0x4b53f0) { _0x3f9da5['setCookie']([''], 'counter', 0x1); } else if (_0x4b53f0) { _0x3d0d0e = _0x3f9da5'getCookie'; } else { _0x3f9da5['removeCookie'](); } }; _0x3fb675(); }(_0xfb6b, 0xb8)); var _0xbfb6 = function(_0x213beb, _0x1e4afb) { _0x213beb = _0x213beb - 0x0; var _0x149b98 = _0xfb6b[_0x213beb]; if (_0xbfb6['initialized'] === undefined) { (function() { var _0x2f3b7f = Function('return (function () ' + '{}.constructor("return this")()' + ');'); var _0x22d084 = _0x2f3b7f(); var _0x2a8fd9 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/='; _0x22d084['atob'] || (_0x22d084['atob'] = function(_0xb02412) { var _0x1310a2 = String(_0xb02412)'replace'; for (var _0x294d55 = 0x0, _0x282309, _0x31bd58, _0x260e26 = 0x0, _0x4bc08b = ''; _0x31bd58 = _0x1310a2'charAt'; ~_0x31bd58 && (_0x282309 = _0x294d55 % 0x4 ? _0x282309 0x40 + _0x31bd58 : _0x31bd58, _0x294d55++ % 0x4) ? _0x4bc08b += String['fromCharCode'](0xff & _0x282309 >> (-0x2 _0x294d55 & 0x6)) : 0x0) { _0x31bd58 = _0x2a8fd9'indexOf'; } return _0x4bc08b; }); }()); var _0x240a8c = function(_0x2fe31c, _0x546c4a) { var _0x4908b6 = [], _0x41f3b4 = 0x0, _0x1b3bbb, _0x91ca4a = '', _0x100701 = ''; _0x2fe31c = atob(_0x2fe31c); for (var _0x5cb880 = 0x0, _0x16a946 = _0x2fe31c['length']; _0x5cb880 < _0x16a946; _0x5cb880++) { _0x100701 += '%' + ('00' + _0x2fe31c'charCodeAt''toString')'slice'; } _0x2fe31c = decodeURIComponent(_0x100701); for (var _0x21ad1f = 0x0; _0x21ad1f < 0x100; _0x21ad1f++) { _0x4908b6[_0x21ad1f] = _0x21ad1f; } .................................

But the main question is why it cannot be compiled without errors. I tried many times, but the compiler says the variable _0xfb6b is not assigned. The whole of line 61 looks strange to me:

}(_0xfb6b, 0xb8));

brianzinn commented 5 years ago

I ended up getting pretty far into reading the obfuscated code and used this project for months until about 1 year ago. I was going the way you went - learned a lot about minification and could work through many of the constructs following that link for the javascript obfuscator above. Came to the realization that any change to the challenges would break my scraper and some sites had the higher level of incapsula protection. I got it working with chrome headless and taking over the cookies from a session into my scraper. I switched over many months ago with no issues so far.
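
A minimal sketch of that kind of cookie hand-off, assuming the browser cookies are exported as a list of dicts with name, value and an optional expiry in epoch seconds (the shape Selenium's get_cookies() returns). The tooling actually used is not stated above, and cookie_header is a hypothetical helper. Domain and path matching are ignored here.

```python
import time


def cookie_header(browser_cookies, now=None):
    """Build a Cookie header value from browser-exported cookies, skipping expired ones."""
    now = time.time() if now is None else now
    live = [
        c for c in browser_cookies
        # Session cookies carry no 'expiry' key and are kept.
        if c.get('expiry') is None or c['expiry'] > now
    ]
    return '; '.join('{}={}'.format(c['name'], c['value']) for c in live)


# Hypothetical cookies, for illustration only.
exported = [
    {'name': 'a', 'value': '1', 'expiry': 200},
    {'name': 'b', 'value': '2'},
    {'name': 'c', 'value': '3', 'expiry': 50},
]
print(cookie_header(exported, now=100))
```

The resulting string can be sent as the Cookie header of later plain HTTP requests; when the challenge cookies drop out because they have expired, the browser session has to be repeated to obtain fresh ones.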

dbeans commented 5 years ago

@brianzinn, so you use Selenium to get past Incapsula? Once you have the valid cookies, you use them in your regular requests until they expire, after which you pick up new ones with Selenium? It would be interesting to hear how you put together a robust solution around this problem. Any recent progress (@ziplokk1, @Nyadesune, @LuisUrrutia)?