crash when hooking sceIoClose() in a kernel module

VitaSmith commented 6 years ago

Issue & replication

As the title indicates, I'm encountering a crash whenever I try to hook sceIoClose() that I am beginning to think might be a taiHEN issue rather than something with my code.

You can find a complete simple project to demonstrate the issue, with a simple kernel module and a test application that takes care of loading/unloading it (so that you don't have to crash the whole boot process by adding the module to tai/config.txt) at: https://github.com/VitaSmith/sceIoClose

The instructions on how to recompile and run the module and test application to replicate the crash are in the Readme.

I have been testing with taiHen on fw 3.60 and I really can't figure out what it is I might be doing wrong in my code to produce the crash.

Code

Basically, the code for the hook is:

int hook_user_close(SceUID fd)
{
    printf("sceIoClose(0x%08X) (Before TAI_CONTINUE)\n", fd);
    int r = TAI_CONTINUE(int, close_ref, fd);
    printf("r = 0x%08X (After TAI_CONTINUE)\n", r);
    return r;
}

And the installation code:

    // sceIoClose Hook
    close_id = taiHookFunctionExportForKernel(KERNEL_PID, &close_ref,
                                   "SceIofilemgr", TAI_ANY_LIBRARY,
                                   0xC70B8886, hook_user_close);
    printf("- sceIoClose Hook ID: 0x%08X\n", close_id);

However, the end result is that a crash occurs in TAI_CONTINUE as per the log output below:

Loading IO logger kernel driver...
- sceIoOpen Hook ID: 0x00052C8D
- sceIoClose Hook ID: 0x00052BEF
sceIoOpen('ux0:app/IO_LOGGER/eboot.bin') = 0x400100A3
sceIoClose(0x400100A3) (Before TAI_CONTINUE)

Things I tried

Removing logging altogether, in case the problem had to do with calling ksceIoClose() → Still crashes!
Overriding sceIoCloseForDriver() instead sceIoClose() → This doesn't crash but of course the override I need is sceIoClose(). This does seem to confirm that the problem appears to be only with sceIoClose() override however.
Using the actual SceIofilemgr NID (0xF2FF276E) instead of TAI_ANY_LIBRARY → Same issue.
Updating VitaSDK to latest → Still crashes!
Not calling sceIoClose() in the test app, but waiting to unload the module → Still crashes as soon as a background app calls sceIoClose().
Adding void *args as extra parameters to sceIoClose() and TAI_CONTINUE() just in case → Same issue.
Using SceUID as return value instead of int → Same issue.

After having spent the last 24 hours trying to figure this out, I wouldn't mind having some expert input on this issue, as I really can't figure out what may be wrong in the sceIoClose() override when the sceIoOpen() one is working just fine. At this stage, I have reason to believe that the crash might have to do with taiHEN itself, which is why I am logging this issue.

yifanlu commented 6 years ago

Is this a syscall? If so you need the syscall macros.

VitaSmith commented 6 years ago

Hi yifanlu. Thanks for replying!

If you mean adding:

    int state;
    ENTER_SYSCALL(state);
    (...)
    EXIT_SYSCALL(state);

in the hooks then I'm afraid I've already tried that, after posting this issue, with no success.

For the record, I have now updated the code in the github repository with the sample to replicate the issue, to invoke those macros.

Another thing I tried is break down the TAI_CONTINUE() macro and use the _tai_hook_user struct directly for the invocation, like this:

typedef int (sceIoClose_t)(SceUID fd);

int hook_user_close(SceUID fd)
{
    int state;
    ENTER_SYSCALL(state);
    printf("sceIoClose(0x%08X) (Before TAI_CONTINUE)\n", fd);
    struct _tai_hook_user *cur = (struct _tai_hook_user *)close_ref;
    printf("sceIoClose: cur = %p, cur->old = %p\n", cur, cur->old);
    sceIoClose_t* psceIoClose = (sceIoClose_t*)cur->old;
    int r = psceIoClose(fd);
    printf("r = 0x%08X (After TAI_CONTINUE)\n", r);
    EXIT_SYSCALL(state);
    return r;
}

But there again, even as cur and cur->old look properly set, it looks like as soon as the call to cur->old is issued, the system crashes...

VitaSmith commented 6 years ago

I think that, at the very least, I have found a workaround for this issue.

The inspiration for it comes from what dots-tb does in ioPlus.

Considering that original sceIoClose() appears to be screwed, the trick is not to call it through TAI_CONTINUE() but instead convert the userland fd to a kernelland one, and then call ksceIoClose() directly.

Thus, if you replace the TAI_CONTINUE() line with

int r = ksceIoClose(ksceKernelKernelUidForUserUid(ksceKernelGetProcessId(), fd));

then fds can be closed, from within the hook, and without incurring a crash.

Of course, you then lose the ability to chain sceIoClose() hooks. But since those are crashing the system, there shouldn't be any anyway...

yifanlu commented 6 years ago

That’s good to know!

CelesteBlue-dev commented 6 years ago

Here is the explanation:

you hook ksceIoClose export
the patch function calls log_printf -> log_print -> ksceIoOpen + ksceIoWrite + ksceIoClose BAM ! the hook is a new time called, and this is an infinite loop. A workaround is not to log using ksceIoClose anymore. You can either use a sub by using module_get_offset or use different logging way like UART or network. Another workaround (used in rePatch) is to hook IMPORTS and not exports. This way when your logging function calls sceIoClose, the hook is not triggered. The last workaround is not to call TAI_CONTINUE, as you told, but I don't see how it can fix that... For me it works in ioPlus because:
define printf ksceDebugPrintf

so there is no infinite loop like in your plugin.

yifanlu commented 6 years ago

Good diagnosis, I think this issue can be closed.

VitaSmith commented 6 years ago

@CelesteBlue-dev

Thanks for trying to come up with an explanation. However I do have to mention that, of course, I too thought about infinite loops due to logging in open/close, and the first thing I tried was disabling logging altogether. I actually mentioned this in the Things I tried part of the issue, as the very first item!

But I still got the crash.

Besides, it would make little sense for sceIoOpen() (which does logging too) not to crash in the exact same fashion, but only sceIoClose(). Also, please bear in mind that what is being hooked is sceIoClose() not ksceIoClose().

Again, the first thing I tried was disabling logging and it still failed. So I have to disagree with your diagnosis. The issue is not related to calling log_printf at all.

The last workaround is not to call TAI_CONTINUE, as you told, but I don't see how it can fix that...

That's because I'm afraid your explanation is not correct. There is a bug somewhere when it comes to hooking sceIoClose() and I'm pretty sure it doesn't come from my code.

Now, please also bear in mind that the logging module is just an example to demonstrate the issue. I am not interested in logging at all. Instead, I am developing a kernel plugin that allows "mounting" Sony pkg files as if they were already extracted/installed on the file system. And to be able to do that, I MUST be able to override sceIoClose().

Finally, I also have to report that the workaround still produces a freezout when trying to launch an application from the shell, when added to tai/config.txt.

That is, if you don't do any logging at all, but simply override sceIoClose() and just issue return ksceIoClose(ksceKernelKernelUidForUserUid(ksceKernelGetProcessId(), fd)); (whether guarded by ENTER_SYSCALL/CLOSE_SYSCALL or not), you will not be able to launch any application from the shell without observing a freeze. In some cases, you may even observe a freezout during shell launch.

So, the module does seem to work when launched on an app basis, as was done in the test app, but when used as a tai/config.txt plugin, even as the Vita seems to be able to boot to the UI, as soon as you try to launch an app, everything freezes.

This is not expected. Therefore, I still believe that there is some kind of underlying bug in taiHEN, for which we don't have yet an explanation, that needs to be investigated, as it is a complete showstopper for the creation of a whole category of very useful plugins, such as virtual file systems.

So can this issue be re-opened please?

And please, for people who want to provide a potential explanation or run some analysis (which I very much appreciate), please try to use and modify the code I shared to validate you proposal, as it should make things a lot clearer as to what does or doesn't cause the issue.

CelesteBlue-dev commented 6 years ago

Idea: try changing TAI_ANY_LIBRARY to the library NID you want you hook. Maybe the issue comes because SceIoFileMgr has same NIDs for userland and for kernel exported functions.

VitaSmith commented 6 years ago

Things I tried

Using the actual SceIofilemgr NID (0xF2FF276E) instead of TAI_ANY_LIBRARY → Same issue.

Can you please read the original issue?

I must admit I'm starting to get a bit annoyed having to dispell all of the obvious stuff that I already tried, which I made sure to document when I opened the original issue.

That's not to say I don't appreciate your willingness to help (I really do, coz you are a lot more familiar with the system than I am), but at this stage, if you think you have an idea about what in the code is causing the issue with sceIoClose(), I would prefer if you simply forked the sample project, tested the .skprx with your added modification, as a tai/config.txt plugin, and then post your code modifications to make it work either here or as a pull request on the project.

Note that I don't care if the sceIoClose() hook logs anything, as long as it does close the user fds that it is presented with.

yifanlu commented 6 years ago

Okay I'll re-open the issue.

VitaSmith commented 6 years ago

Thanks.

VitaSmith commented 6 years ago

Sooo, since I'm stuck on this issue, and considering building my own hooker for sceIoClose ("With blackjack and theme parks. In fact, forget the hooker!"), I've started poking around at the what might be going on behind the scenes.

Basically I've been comparing the sceIoClose function data before and after the hook, to try to both find ideas of how I could construct my own hooking and find some clues that could explain the crash, and here is what I can see (NB: I am obviously clearing bit 0 in the addresses below, as they are only there to indicate that the code should be executed in THUMB mode) :

Loading VPFS kernel driver...
sceIoClose @ 0x00c22918 (before hook):
00000000  38 b5 04 46 14 f0 56 ef 00 28 16 dc 00 20 21 46 
00000010  14 f0 98 ee 03 1e 04 db fd f7 6e fd 05 46 80 b1 
sceIoClose @ 0x00c22918 (after hook):
00000000  df f8 00 f0 65 00 71 02 00 28 16 dc 00 20 21 46 
00000010  14 f0 98 ee 03 1e 04 db fd f7 6e fd 05 46 80 b1 
content of 0x02710064:
00000000  2d e9 f0 43 81 46 83 b0 1d ee 70 8f 4f ea 08 43 
00000010  0d ee 70 3f 48 f2 00 04 00 22 c0 f2 de 14 01 21

This tells us (using an ARM Thumb disassembler such as this one) that, before the hook, we have the original sceIoClose() doing this:

0x0000: push {r3, r4, r5, lr}
0x0002: mov r4, r0
0x0004: blx #0x14eb4
0x0008: cmp r0, #0
0x000a: bgt #0x3a
0x000c: movs r0, #0
0x000e: mov r1, r4
(...)

Then, after the hook is set:

0x0000: ldr.w pc, [pc, #0]
0x0004: .long 0x02710065
0x0008: cmp r0, #0
0x000a: bgt #0x3a
0x000c: movs r0, #0
0x000e: mov r1, r4
(...)

Which tells us that the hook forces the call to jump to address 0x02710064 (in THUMB mode), with the disassembly of address 0x02710064 being something like:

0x0000: push.w {r4, r5, r6, r7, r8, sb, lr}
0x0004: mov sb, r0
0x0006: sub sp, #0xc
0x0008: mrc p15, #0, r8, c13, c0, #3
0x000c: lsl.w r3, r8, #0x10
0x0010: mcr p15, #0, r3, c13, c0, #3
0x0014: movw r4, #0x8000
0x0018: movs r2, #0
0x001a: movt r4, #0x1de
0x001e: movs r1, #1
(...)

I suspect the code above comes from the substitute "injector" used by taiHEN, and that at some stage, this "injector" executes a copy of the code that it replaced, namely:

push {r3, r4, r5, lr}
mov r4, r0
blx #0x14eb4

before jumping back into the original code.

So, of course, the presence of the blx branch instruction in the substituted section would be my prime suspect with regards to why TAI_CONTINUE may crash: if that relocated blx is not handled properly, it will never jump to where it's supposed to, especially as my understanding is that this instruction will change the instruction set from THUMB to ARM.

Now, from a very quick look at substitute, I can see that it had some provisions for branch handling and relocation. But I can't help but have some doubts as to whether it is properly able to handle a relocated blx, which would most certainly explain the crash...

yifanlu commented 6 years ago

0x02710064 Should be your function. If you disassemble it to the end, you’ll see it jump to the reproduced code for the replaced instructions. This replaced code should be the same behavior as the code it overwrote but with branches translated to absolute addresses.

VitaSmith commented 6 years ago

Well, here's the disassembly I get when I simply issue a return 0 in my sceIoClose() hook:

0x0000: movs r0, #0
0x0002: bx lr

So far so good. Obviously, since we're not calling TAI_CONTINUE() the ldr.w pc ... jump should just zero r0 and branch back to the caller, which is exactly what we are observing.

Now, let's have a look at a call that simply does return TAI_CONTINUE(int, close_ref, fd); (Without ENTER/EXIT_SYSCALL, since it'll simplify our disassembly and we don't care about them for this exercise):

0x0000: movw r3, #0xc000
0x0004: movt r3, #0x1ee
0x0008: ldr r2, [r3, #0x6c]
0x000a: ldr r3, [r2]
0x000c: cbz r3, #0x12
0x000e: ldr r3, [r3, #4]
0x0010: bx r3
0x0012: ldr r3, [r2, #8]
0x0014: bx r3
0x0016: nop 
0x0018: push {r4, r5, r6, r7, lr}
0x001a: sub sp, #0x10c
0x001c: cbz r0, #0x74
0x001e: movw r3, #0xc088
0x0022: movt r3, #0x1ee
0x0026: ldr r6, [r3]
0x0028: cbz r6, #0x74
0x002a: ldrb r3, [r0]
0x002c: mov r5, r0
0x002e: cbnz r3, #0x4a
0x0030: movw r5, #0xb080
0x0034: mov r4, r1
0x0036: movt r5, #0x27e
0x003a: mov r6, r3
0x003c: ldm r5!, {r0, r1, r2, r3}
0x003e: ldr r5, [r5]
0x0040: stm r4!, {r0, r1, r2, r3}
0x0042: str r5, [r4]
0x0044: mov r0, r6
0x0046: add sp, #0x10c
0x0048: pop {r4, r5, r6, r7, pc}
0x004a: mov r4, r1
0x004c: blx #0x2fa0
0x0050: mov r1, r5
0x0052: mov r7, r0
0x0054: mov r2, r0
0x0056: add r0, sp, #8
0x0058: blx #0x2f60
0x005c: movs r5, #1
0x005e: movs r3, #0
0x0060: mov r2, r7
0x0062: add r0, sp, #8
0x0064: mov r1, r4
0x0066: strd r5, r3, [sp]
0x006a: blx r6
0x006c: mov r6, r0
0x006e: mov r0, r6
0x0070: add sp, #0x10c
0x0072: pop {r4, r5, r6, r7, pc}
0x0074: movs r6, #3
0x0076: movt r6, #0x8002
0x007a: b #0x44

The first thing we notice is that it trashes r3, whereas the original call made sure to restore it on exit (push {r3, r4, r5, lr} balanced with a matching pop {r3, r4, r5, pc} on exit).

Now I understand that, as a general rule, code should expect r[0-3] to be trashed on an ARM function call, since these registers are meant to be used for parameters as well as return values. However, it seems to me that Sony may have forgotten to get that memo for their sceIoClose() implementation, as the they have designed their code to preserve r3...

I'm also a bit surprised at how happily the override seems to be willing to use r2 and r3 as scratch registers, apparently without worrying as to whether these may be parameters. My guess is that substitute is smart enough to work with the linker, and figures out how many parameter a function call actually has (in our case, only r0 is used as a parameter), but if it were me, I think I'd want to be more careful about preserving registers just in case...

At any rate, what is clear from this is that, r3 is definitely trashed beyond recovery as soon as TAI_CONTINUE is invoked, whereas the original call was designed to preserve r3. Thus, if there is any code running on the Vita that expects r3 to have kept its value after a call to sceIoClose(), we're going to have a problem...

yifanlu commented 6 years ago

Thus, if there is any code running on the Vita that expects r3 to have kept its value after a call to sceIoClose(), we're going to have a problem...

I don’t think there is such code. ARM ABI prohibits it.

VitaSmith commented 6 years ago

Well, Sony have a long history of completely disregarding whatever existing conventions or specs other people use, and introducing their own... 😄

yifanlu commented 6 years ago

Their compiler is gcc based though.

VitaSmith commented 6 years ago

I'm pretty sure you can achieve non ARM ABI compliant behaviour with vanilla gcc through inline assembly.

The fact is that nobody here knows why r3 is being pushed and popped in sceIoClose(), in an apparent violation of the ARM ABI specifications.

My understanding is that, if Sony's code was compiled with an ABI compliant toolchain, you should never observe a push/pop r3 at the beginning and end of a function call. Yet, this is what we are seeing. So, unfortunately, there is a nonstandard behaviour that we have no choice but to take into account.

Also, I can confirm that when I apply my own hooker (Boy is that a complete pain in the ass to replicate! - I have even more respect for what you must have gone through, when developing taiHEN, now that I've seen just how much work is required to simply get hooks that work), I get a stable sceIoClose() override, with a kernel module that can be loaded from tai/config.txt and doesn't make the Vita crash or freeze.

Thus, as far as I am concerned, there is something that should be improved or fixed in the manner taiHEN applies hooks, because I can demonstrate that, if you use an alternate method to create your hooks, you won't observe the crash that you observe with taiHEN... and I doubt everybody will want to go through the ordeal of creating their own custom hooker, just to work around what seems to be a problematic taiHEN behaviour.

I should also point out that one thing I tried, without success, was to create a taiHEN hook, with TAI_CONTINUE for an sceIoClose that takes 4 parameters, in the hope that maybe the taiHEN hooker would preserve r3 by the time it jumped back to the original code. However, that doesn't seem to work either...

I guess the one way to prove or disprove the r3 hypothesis is to just change all the the push {r3, r4, r5, lr} and pop {r3, r4, r5, pc} in sceIoClose() to push {r4, r5, lr} / pop {r4, r5, pc}, and see what happens. I haven't tried that, first because it's a bit tricky (there are a few caveats to doing that properly so that you don't get a crash for other reasons), and second because, now that I've lost about a week on this issue, I'd rather get back to making some progress with what I was doing before I got stuck with this, and I estimate that I have provided more than enough information for whoever is willing to investigate this further...

yifanlu commented 6 years ago

Thanks for looking into this. It’s not non-compliant to save R3 even if it doesn’t have to be saved. However I agree it’s a suspicious piece of code. If you make TAI_CONTINUE take 4 args, make sure your patched function also takes 4 args and no other hooks are installed on it. Someone may want to try this out to confirm.

Princess-of-Sleeping commented 4 years ago

The hook system basically redirects to my_hook by replacing the first 12 bytes ( 1) of the function. 1. The size varies depending on the opcode

The opcode before being replaced with the hook opcode is copied to the RX memory by taiHEN

sceIoClose
|------------|
|   opcode1  |
|------------|
|   opcode2  |
|------------|
|   opcode3  |
|------------|
|   opcode4  |
|------------|
...

hooked sceIoClose
|--------------|
| hook opcode1 |
|--------------|
| hook opcode2 |
|--------------|
| hook opcode3 | -> redirect hook_function
|--------------|
|    opcode4   |
|--------------|
...

taiHEN RX mem
|------------------------|
| org sceIoClose opcode1 |
|------------------------|
| org sceIoClose opcode2 |
|------------------------|
| org sceIoClose opcode3 |
|------------------------|
|         opcode4        | -> continue to sceIoClose opcode4
|------------------------|
...

// Assuming that this is the only function hooking sceIoClose
int hook_function(SceUID fd){
    return TAI_CONTINUE(int, io_close_ref, fd); // -> continue to taiHEN RX mem
}

The problem at this time is to call the function with org sceIoClose opcodeX within taiHEN RX mem

org sceIoClose opcode 12byte(3.60)

SceIofilemgr + 0x2918 0x0000: push {r3, r4, r5, lr}
SceIofilemgr + 0x291A 0x0002: mov r4, r0
SceIofilemgr + 0x291C 0x0004: blx #0x14eb4 // SceSysmemForKernel_D514BB56, import entry : SceIofilemgr + 0x177cc
SceIofilemgr + 0x2910 0x0008: cmp r0, #0
SceIofilemgr + 0x2912 0x000a: bgt #0x3a

The mechanism to call the function is realized by jumping to the offset determined by the following calculation

pc       bl/blx offset   idk
0x291C + 0x14EB4       - 4   = 0x177CC

But when hooked, taiHEN RX mem + 4 + 0x14EB4 - 4 doesn't exist and crashes with invalid access In rare cases, even if some module exists in taiHEN RX mem + 4 + 0x14EB4 - 4, it crashes due to register inconsistency

taiHEN RX mem
|------------------------|
| org sceIoClose opcode1 |
|------------------------|
| org sceIoClose opcode2 | // blx #0x14EB4
|------------------------|
| org sceIoClose opcode3 |
|------------------------|
|         opcode4        | -> continue to sceIoClose opcode4
|------------------------|
~~~
|---------------------------------|
| taiHEN RX mem + 4 + 0x14EB4 - 4 |
|         Invalid memory          |
|---------------------------------|

Therefore, when hooking a function such as sceIoClose, it is necessary to re-implement it by yourself. By the way, jump instructions such as bge and beq have the same problem.

yifanlu / taiHEN