yasukata / zpoline

system call hook for Linux
Apache License 2.0

Red-zone Integrity #9

Open chiache opened 11 months ago

chiache commented 11 months ago

It was a pleasure chatting with the authors at OSDI/ATC! We are currently considering integrating zpoline into Gramine (formerly known as Graphene): https://github.com/gramineproject/gramine

Just wondering if you have any thoughts on replacing the syscall instruction with call when the red zone (the 128-byte region below %rsp) is in use, which is likely in leaf functions. Did this ever occur in your experiments? If so, how did you handle this case?
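For concreteness, a minimal sketch of the hazard (hypothetical code, modeled on a leaf function):

leaf:
    movq $20, -8(%rsp)    # temporary kept in the red zone
    movq $39, %rax        # SYS_getpid
    syscall               # 2 bytes (0f 05)
    movq -8(%rsp), %rax   # expects 20
    ret

If the 2-byte syscall is rewritten to the equally sized call *%rax (ff d0), the call pushes an 8-byte return address at -8(%rsp) and silently destroys the temporary.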

Thank you!

yasukata commented 11 months ago

Thank you for your message.

It was great to chat with you at the conference.

Regarding the issue, I am sorry, but we did not know about the red zone while we were writing the paper.

As reported in a previous issue, our implementation causes faults in some cases; now I suspect the red zone is the reason.

I will try to resolve this issue and update the implementation once I find a solution.

Thank you for letting me know about this issue.

chiache commented 10 months ago

Thank you so much for the response. I understand that conducting research can be difficult and sometimes there are unforeseen issues.

Off the top of my head, one (not so ideal) solution is to replace "syscall" with "jmp *%rax" instead of "call *%rax", but only serve one code spot at a time. The other "syscall" locations can be replaced with some illegal instruction, so that executing them raises an exception that we capture (this naturally happens on SGX), at which point we swap the serving target.
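A rough sketch (my illustration, not a tested design) of such an exception-driven swap; it assumes the rewritten pages stay writable and executable, ignores multi-threading, and uses ud2 (0f 0b) as the illegal instruction since it has the same 2-byte length as syscall:

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

static uint8_t *active_site; /* the one site currently holding jmp *%rax */

static void swap_serving_target(int sig, siginfo_t *si, void *uctx)
{
	ucontext_t *uc = (ucontext_t *) uctx;
	uint8_t *rip = (uint8_t *) uc->uc_mcontext.gregs[REG_RIP];

	(void) sig; (void) si;
	if (active_site)
		memcpy(active_site, "\x0f\x0b", 2); /* demote the old site to ud2 */
	memcpy(rip, "\xff\xe0", 2);              /* promote the faulting site to jmp *%rax */
	active_site = rip;
	/* returning from the handler re-executes the now-patched bytes */
}

static void install_swap_handler(void)
{
	struct sigaction sa = { 0 };

	sa.sa_flags = SA_SIGINFO;
	sa.sa_sigaction = swap_serving_target;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGILL, &sa, NULL);
}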

To assess this idea, I'm just wondering whether you have any statistics on how many "syscall" instructions typically exist in a static binary and how often they are executed at runtime.

Thank you!

yasukata commented 10 months ago

Thank you very much for your great idea.

Regarding the number of syscall instructions in binaries, I found there are 544 in glibc-2.35 and 50 in ld.so.
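For reference, one way to approximate such a count (my example, not necessarily how the numbers above were obtained; the path is typical for Ubuntu, and the count is slightly inflated by symbol names containing "syscall"):

objdump -d /usr/lib/x86_64-linux-gnu/libc.so.6 | grep -cw syscall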

I checked the frequency of system call invocation by counting the number of system calls triggered in one second during the simple HTTP server and Redis experiments; the following shows the results.

Simple HTTP server

system call    invocations in one second
read           1185497
write          1185497
epoll_wait     1185497

Redis

system call    invocations in one second
read           665983
write          665973
close          10
epoll_wait     665997
openat         10

The server programs execute a loop of epoll_wait, read, and write system calls, so we may be able to guess which system call comes next at runtime, though this depends on the application.

I deeply appreciate your bringing up an idea to address the issue.

chiache commented 10 months ago

Thanks for the data! Given the number of syscall instructions and the frequency of system calls, swapping the jmp target may be kind of expensive.

One optimization I can think of is that we can detect whether the syscall instruction is in a leaf function (only leaf functions are expected to use the red zone; elsewhere "call *%rax" remains safe). We then only have to use "jmp *%rax" for syscalls in leaf functions, and that should reduce the number of syscall instructions that need to be handled with the exception handler.

yasukata commented 10 months ago

Thank you very much for your idea; it makes good sense.

Please let me try the approach you introduced, and I will get back here when I have an update.

Again, I really appreciate your kind help in resolving the issue.

sidkshatriya commented 10 months ago

As a temporary workaround, could compiling with -mno-red-zone (gcc) be useful?

Note that every component/library used by the program must be compiled with -mno-red-zone. This could be made fully self-contained if you are able to build a statically linked version of the program with every component compiled with -mno-red-zone.

It would be useful to see if a crashing program starts working when compiled with -mno-red-zone.

But then again, if your program contains the output of compilers other than C/C++ that use the red zone, or hand-written assembly that uses the red zone, even this workaround may not be fully helpful.

yasukata commented 10 months ago

I believe your workaround, compiling with -mno-red-zone, is definitely useful, though the limitation you describe remains.

Thank you very much for providing the information.

chiache commented 9 months ago

Using -mno-red-zone could be a good way to verify whether the use of the red zone is indeed the root cause of the recently reported crashes.

In general, I think the assumption is that users cannot or do not want to recompile the binaries, and thus we have to resort to dynamic binary modification for intercepting system calls. If recompilation were possible, we wouldn't have this problem, since we could just replace system calls at the source code or IR level, or add more empty bytes around the instruction to allow modification.

Yangff commented 9 months ago

Maybe... add a pop %rcx before call *%rax and add 0x60; int3; int3 after it. This allows fixing the red zone using the value in %rcx. The sequence is pop %rcx; call *%rax; 0x60; int3; int3 = 0x59 0xff 0xd0 0x60 0xcc 0xcc.

This yields illegal or trapping instructions if execution jumps to byte +2, +3, +4, or +5, but the fix-up does have some complexity...

This will require %rax to be small enough that 0xd0 0x60 0xcc = shlb 0xcc(%rax) crashes as expected if execution jumps to byte +2. And even if it does not crash, at least the shifted-out bit will be kept in CF, so the next int3 can raise a signal and the modified bit can be restored using the value in CF. This may still be a problem if CF is used immediately afterward, but I don't think that is common in syscall-related code..? Some analysis will also be needed to decide, when it traps, whether it came from a direct jump or from a shlb that did not crash.

Because the ModR/M byte must be 0x10 ([rax]), 0x50 ([rax]+disp8), 0x90 ([rax]+disp32), or 0xD0 (rax) to be related to rax... 0x10 and 0x50 would give adc and push, not as good as shlb, and 0x90 would cause a [rax]+disp32 access, which is even more uncontrollable (given the endianness).

In any case, it is still better than a normal 5-byte direct jmp, because that would contain an uncontrollable constant displacement and may cause problems if any of those bytes are used as jump targets, which may require rewriting preceding instructions or may even be impossible to rewrite.
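To summarize how the 6-byte sequence behaves depending on where execution lands (my reading of the idea above):

+0: 59       pop %rcx
+1: ff d0    call *%rax
+3: 60       invalid opcode in 64-bit mode -> SIGILL
+4: cc       int3 -> SIGTRAP
+5: cc       int3 -> SIGTRAP

Landing at +2 decodes d0 60 cc as shlb 0xcc(%rax) (disp8 0xcc = -0x34), which is expected to fault for a syscall-number-sized %rax; if it does not fault, the shifted-out bit lands in CF and the int3 at +5 still traps.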

yasukata commented 9 months ago

@Yangff

Thank you very much for your very cool idea.

I have experimentally implemented some foundational parts; please let me share the current status.

patch

The following is the diff against commit 0a349e65c102f8f9bdbbf6da0a52c4006589178b.

diff --git a/main.c b/main.c
index 281a235..a9376ff 100644
--- a/main.c
+++ b/main.c
@@ -24,6 +24,7 @@
 #include <stdbool.h>
 #include <stdlib.h>
 #include <string.h>
+#include <signal.h>
 #include <errno.h>
 #include <assert.h>
 #include <sys/syscall.h>
@@ -111,6 +112,16 @@ static bool is_replaced_instruction_addr(uintptr_t addr)

 #endif

+#define MAX_REPLACED_SYSCALL (2048)
+
+struct replaced_syscall {
+   uintptr_t addr;
+   uint8_t code[6 + 16 + 12];
+};
+
+static struct replaced_syscall *replaced_syscall_list = NULL;
+static unsigned int replaced_sycall_cnt = 0;
+
 extern void syscall_addr(void);
 extern long enter_syscall(int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t);
 extern void asm_syscall_hook(void);
@@ -156,18 +167,6 @@ void ____asm_impl(void)
    "asm_syscall_hook: \n\t"
    "popq %rax \n\t" /* restore %rax saved in the trampoline code */

-   /* discard pushed 0x90 for 0xeb 0x6a 0x90 if rax is n * 3 + 1 */
-   "pushq %rdi \n\t"
-   "pushq %rax \n\t"
-   "movabs $0xaaaaaaaaaaaaaaab, %rdi \n\t"
-   "imul %rdi, %rax \n\t"
-   "cmp %rdi, %rax \n\t"
-   "popq %rax \n\t"
-   "popq %rdi \n\t"
-   "jb skip_pop \n\t"
-   "addq $8, %rsp \n\t"
-   "skip_pop: \n\t"
-
    "cmpq $15, %rax \n\t" // rt_sigreturn
    "je do_rt_sigreturn \n\t"
    "pushq %rbp \n\t"
@@ -188,10 +187,10 @@ void ____asm_impl(void)
    "pushq %rdi \n\t"
    "pushq %rsi \n\t"
    "pushq %rdx \n\t"
-   "pushq %rcx \n\t"

    /* arguments for syscall_hook */

+   "pushq %rbp \n\t"
    "pushq 8(%rbp) \n\t"    // return address
    "pushq %rax \n\t"
    "pushq %r10 \n\t"
@@ -201,9 +200,8 @@ void ____asm_impl(void)
    "callq syscall_hook \n\t"

    "popq %r10 \n\t"
-   "addq $16, %rsp \n\t"   // discard arg7 and arg8
+   "addq $24, %rsp \n\t"   // discard arg7 and arg8 and rcx

-   "popq %rcx \n\t"
    "popq %rdx \n\t"
    "popq %rsi \n\t"
    "popq %rdi \n\t"
@@ -213,10 +211,14 @@ void ____asm_impl(void)

    "leaveq \n\t"

-   "retq \n\t"
+   "popq %rcx \n\t"
+   "addq $128, %rsp \n\t"
+
+   "jmpq *%rcx \n\t"

    "do_rt_sigreturn:"
-   "addq $8, %rsp \n\t"
+   "popq %rcx \n\t"
+   "addq $128, %rsp \n\t"
    "jmp syscall_addr \n\t"
    );
 }
@@ -225,19 +227,23 @@ static long (*hook_fn)(int64_t a1, int64_t a2, int64_t a3,
               int64_t a4, int64_t a5, int64_t a6,
               int64_t a7) = enter_syscall;

+static unsigned int init_hook_cnt = 0;
+
 long syscall_hook(int64_t rdi, int64_t rsi,
          int64_t rdx, int64_t __rcx __attribute__((unused)),
          int64_t r8, int64_t r9,
          int64_t r10_on_stack /* 4th arg for syscall */,
          int64_t rax_on_stack,
-         int64_t retptr)
+         int64_t retptr,
+         int64_t rbp)
 {
+   init_hook_cnt++;
 #ifdef SUPPLEMENTAL__REWRITTEN_ADDR_CHECK
    /*
     * retptr is the caller's address, namely.
     * "supposedly", it should be callq *%rax that we replaced.
     */
-   if (!is_replaced_instruction_addr(retptr - 2 /* 2 is the size of syscall/sysenter */)) {
+   if (!is_replaced_instruction_addr(retptr - 2 - 1 /* 2 is the size of syscall/sysenter */)) {
        /*
         * here, we detected that the program comes here
         * without going through our replaced callq *%rax.
@@ -261,12 +267,25 @@ long syscall_hook(int64_t rdi, int64_t rsi,
        }
    }

+   {
+       unsigned int i;
+       for (i = 0; i < replaced_sycall_cnt; i++) {
+           if (retptr == (int64_t) replaced_syscall_list[i].addr) {
+               *((int64_t *)(rbp + 8)) = (uintptr_t) replaced_syscall_list[i].code;
+               break;
+           }
+       }
+       assert(i != replaced_sycall_cnt);
+   }
+
    return hook_fn(rax_on_stack, rdi, rsi, rdx, r10_on_stack, r8, r9);
 }

 struct disassembly_state {
    char *code;
    size_t off;
+   uintptr_t ptr;
+   uint16_t inst_off;
 };

 /*
@@ -295,13 +314,48 @@ static int do_rewrite(void *data, const char *fmt, ...)
             */
            goto skip;
        }
-       ptr[0] = 0xff; // callq
-       ptr[1] = 0xd0; // *%rax
+       s->ptr = (uintptr_t) ptr;
 #ifdef SUPPLEMENTAL__REWRITTEN_ADDR_CHECK
        record_replaced_instruction_addr((uintptr_t) ptr);
 #endif
    }
 skip:
+   if (s->ptr == ((uintptr_t) s->code) + s->off - 2)
+       s->inst_off = 2;
+   else if (s->inst_off) {
+       struct replaced_syscall *rc = &replaced_syscall_list[replaced_sycall_cnt];
+       unsigned int inst_size = ((uintptr_t) s->code) + s->off - s->ptr - s->inst_off;
+       assert(s->inst_off - 2 + inst_size + 12 <= sizeof(rc->code));
+       memcpy(&rc->code[s->inst_off - 2], (void *) (s->ptr + s->inst_off), inst_size); /* save original code */
+       s->inst_off += inst_size;
+       if (6 <= s->inst_off) {
+           // 48 b9 [64-bit addr (8-byte)]   movabs [asm_syscall_hook],%rcx
+           rc->code[s->inst_off - 2 + 0x0] = 0x48;
+           rc->code[s->inst_off - 2 + 0x1] = 0xb9;
+           rc->code[s->inst_off - 2 + 0x2] = ((uint64_t) (s->ptr + s->inst_off) >> (8 * 0)) & 0xff;
+           rc->code[s->inst_off - 2 + 0x3] = ((uint64_t) (s->ptr + s->inst_off) >> (8 * 1)) & 0xff;
+           rc->code[s->inst_off - 2 + 0x4] = ((uint64_t) (s->ptr + s->inst_off) >> (8 * 2)) & 0xff;
+           rc->code[s->inst_off - 2 + 0x5] = ((uint64_t) (s->ptr + s->inst_off) >> (8 * 3)) & 0xff;
+           rc->code[s->inst_off - 2 + 0x6] = ((uint64_t) (s->ptr + s->inst_off) >> (8 * 4)) & 0xff;
+           rc->code[s->inst_off - 2 + 0x7] = ((uint64_t) (s->ptr + s->inst_off) >> (8 * 5)) & 0xff;
+           rc->code[s->inst_off - 2 + 0x8] = ((uint64_t) (s->ptr + s->inst_off) >> (8 * 6)) & 0xff;
+           rc->code[s->inst_off - 2 + 0x9] = ((uint64_t) (s->ptr + s->inst_off) >> (8 * 7)) & 0xff;
+           // ff e1                   jmp    *%rcx
+           rc->code[s->inst_off - 2 + 0xa] = 0xff;
+           rc->code[s->inst_off - 2 + 0xb] = 0xe1;
+           /* overwrite */
+           ((uint8_t *) s->ptr)[0] = 0x59; // pop %rcx
+           ((uint8_t *) s->ptr)[1] = 0xff; // callq
+           ((uint8_t *) s->ptr)[2] = 0xd0; // *%rax
+           ((uint8_t *) s->ptr)[3] = 0x60; //
+           ((uint8_t *) s->ptr)[4] = 0xcc; // int3
+           ((uint8_t *) s->ptr)[5] = 0xcc; // int3
+           rc->addr = s->ptr + 3;
+           memset((void *) rc->addr + 3, 0xcc, s->inst_off - 6);
+           s->inst_off = 0;
+           assert(++replaced_sycall_cnt < MAX_REPLACED_SYSCALL);
+       }
+   }
    va_end(arg);
    return 0;
 }
@@ -422,50 +476,9 @@ static void setup_trampoline(void)
    }

    {
-       /*
-        * optimized instructions to slide down
-        * repeat of 0xeb 0x6a 0x90
-        *
-        * case 1 : jmp to n * 3 + 0
-        * jmp 0x6a
-        * nop
-        * jmp 0x6a
-        * nop
-        *
-        * case 2 : jmp to n * 3 + 1
-        * push 0x90
-        * jmp 0x6a
-        * nop
-        * jmp 0x6a
-        *
-        * case 3 : jmp to n * 3 + 2
-        * nop
-        * jmp 0x6a
-        * nop
-        * jmp 0x6a
-        *
-        * for case 2, we discard 0x90 pushed to stack
-        *
-        */
        int i;
-       for (i = 0; i < NR_syscalls; i++) {
-           if (NR_syscalls - 0x6a - 2 < i)
-               ((uint8_t *) mem)[i] = 0x90;
-           else {
-               int x = i % 3;
-               switch (x) {
-               case 0:
-                   ((uint8_t *) mem)[i] = 0xeb;
-                   break;
-               case 1:
-                   ((uint8_t *) mem)[i] = 0x6a;
-                   break;
-               case 2:
-                   ((uint8_t *) mem)[i] = 0x90;
-                   break;
-               }
-           }
-       }
+       for (i = 0; i < NR_syscalls; i++)
+           ((uint8_t *) mem)[i] = 0x90;
    }

    /* 
@@ -473,35 +486,68 @@ static void setup_trampoline(void)
     *
     * here we embed the following code.
     *
+    * sub    $0x80,%rsp
+    * push   0x80(%rsp)
+    * mov    %rcx,0x88(%rsp)
     * push   %rax
     * movabs [asm_syscall_hook],%rax
     * jmpq   *%rax
     *
     */

+   // 48 81 ec 80 00 00 00    sub    $0x80,%rsp
+   ((uint8_t *) mem)[NR_syscalls + 0x00] = 0x48;
+   ((uint8_t *) mem)[NR_syscalls + 0x01] = 0x81;
+   ((uint8_t *) mem)[NR_syscalls + 0x02] = 0xec;
+   ((uint8_t *) mem)[NR_syscalls + 0x03] = 0x80;
+   ((uint8_t *) mem)[NR_syscalls + 0x04] = 0x00;
+   ((uint8_t *) mem)[NR_syscalls + 0x05] = 0x00;
+   ((uint8_t *) mem)[NR_syscalls + 0x06] = 0x00;
+
+   /* save retptr of callq *%rax */
+   // ff b4 24 80 00 00 00    push   0x80(%rsp)
+   ((uint8_t *) mem)[NR_syscalls + 0x07] = 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x08] = 0xb4;
+   ((uint8_t *) mem)[NR_syscalls + 0x09] = 0x24;
+   ((uint8_t *) mem)[NR_syscalls + 0x0a] = 0x80;
+   ((uint8_t *) mem)[NR_syscalls + 0x0b] = 0x00;
+   ((uint8_t *) mem)[NR_syscalls + 0x0c] = 0x00;
+   ((uint8_t *) mem)[NR_syscalls + 0x0d] = 0x00;
+
+   /* repair 1st 8-byte of the stack */
+   // 48 89 8c 24 88 00 00 00 mov    %rcx,0x88(%rsp)
+   ((uint8_t *) mem)[NR_syscalls + 0x0e] = 0x48;
+   ((uint8_t *) mem)[NR_syscalls + 0x0f] = 0x89;
+   ((uint8_t *) mem)[NR_syscalls + 0x10] = 0x8c;
+   ((uint8_t *) mem)[NR_syscalls + 0x11] = 0x24;
+   ((uint8_t *) mem)[NR_syscalls + 0x12] = 0x88;
+   ((uint8_t *) mem)[NR_syscalls + 0x13] = 0x00;
+   ((uint8_t *) mem)[NR_syscalls + 0x14] = 0x00;
+   ((uint8_t *) mem)[NR_syscalls + 0x15] = 0x00;
+
    /*
     * save %rax on stack before overwriting
     * with "movabs [asm_syscall_hook],%rax",
     * and the saved value is resumed in asm_syscall_hook.
     */
    // 50                      push   %rax
-   ((uint8_t *) mem)[NR_syscalls + 0x0] = 0x50;
+   ((uint8_t *) mem)[NR_syscalls + 0x16] = 0x50;

    // 48 b8 [64-bit addr (8-byte)]   movabs [asm_syscall_hook],%rax
-   ((uint8_t *) mem)[NR_syscalls + 0x1] = 0x48;
-   ((uint8_t *) mem)[NR_syscalls + 0x2] = 0xb8;
-   ((uint8_t *) mem)[NR_syscalls + 0x3] = ((uint64_t) asm_syscall_hook >> (8 * 0)) & 0xff;
-   ((uint8_t *) mem)[NR_syscalls + 0x4] = ((uint64_t) asm_syscall_hook >> (8 * 1)) & 0xff;
-   ((uint8_t *) mem)[NR_syscalls + 0x5] = ((uint64_t) asm_syscall_hook >> (8 * 2)) & 0xff;
-   ((uint8_t *) mem)[NR_syscalls + 0x6] = ((uint64_t) asm_syscall_hook >> (8 * 3)) & 0xff;
-   ((uint8_t *) mem)[NR_syscalls + 0x7] = ((uint64_t) asm_syscall_hook >> (8 * 4)) & 0xff;
-   ((uint8_t *) mem)[NR_syscalls + 0x8] = ((uint64_t) asm_syscall_hook >> (8 * 5)) & 0xff;
-   ((uint8_t *) mem)[NR_syscalls + 0x9] = ((uint64_t) asm_syscall_hook >> (8 * 6)) & 0xff;
-   ((uint8_t *) mem)[NR_syscalls + 0xa] = ((uint64_t) asm_syscall_hook >> (8 * 7)) & 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x17] = 0x48;
+   ((uint8_t *) mem)[NR_syscalls + 0x18] = 0xb8;
+   ((uint8_t *) mem)[NR_syscalls + 0x19] = ((uint64_t) asm_syscall_hook >> (8 * 0)) & 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x1a] = ((uint64_t) asm_syscall_hook >> (8 * 1)) & 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x1b] = ((uint64_t) asm_syscall_hook >> (8 * 2)) & 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x1c] = ((uint64_t) asm_syscall_hook >> (8 * 3)) & 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x1d] = ((uint64_t) asm_syscall_hook >> (8 * 4)) & 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x1e] = ((uint64_t) asm_syscall_hook >> (8 * 5)) & 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x1f] = ((uint64_t) asm_syscall_hook >> (8 * 6)) & 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x20] = ((uint64_t) asm_syscall_hook >> (8 * 7)) & 0xff;

    // ff e0                   jmpq   *%rax
-   ((uint8_t *) mem)[NR_syscalls + 0xb] = 0xff;
-   ((uint8_t *) mem)[NR_syscalls + 0xc] = 0xe0;
+   ((uint8_t *) mem)[NR_syscalls + 0x21] = 0xff;
+   ((uint8_t *) mem)[NR_syscalls + 0x22] = 0xe0;

    /*
     * mprotect(PROT_EXEC without PROT_READ), executed
@@ -542,6 +588,80 @@ static void load_hook_lib(void)
    }
 }

+static void sig_h(int sig, siginfo_t *si __attribute__((unused)), void *ptr)
+{
+   ucontext_t *uc = (ucontext_t *) ptr;
+   switch (sig) {
+   case SIGSEGV:
+       {
+           unsigned int i;
+           for (i = 0; i < replaced_sycall_cnt; i++) {
+               if (replaced_syscall_list[i].addr - 1 <= (uint64_t) uc->uc_mcontext.gregs[REG_RIP]
+                       && (uint64_t) uc->uc_mcontext.gregs[REG_RIP] <= replaced_syscall_list[i].addr + 3) {
+                   uc->uc_mcontext.gregs[REG_RIP] = (uint64_t) &replaced_syscall_list[i].code[0];
+                   break;
+               }
+           }
+           assert(i != replaced_sycall_cnt);
+       }
+       break;
+   case SIGILL:
+       {
+           unsigned i;
+           for (i = 0; i < replaced_sycall_cnt; i++) {
+               if (replaced_syscall_list[i].addr <= (uint64_t) uc->uc_mcontext.gregs[REG_RIP]
+                       && (uint64_t) uc->uc_mcontext.gregs[REG_RIP] <= replaced_syscall_list[i].addr + 3) {
+                   uc->uc_mcontext.gregs[REG_RIP] = (uint64_t) &replaced_syscall_list[i].code[1];
+                   break;
+               }
+           }
+           assert(i != replaced_sycall_cnt);
+       }
+       break;
+   case SIGTRAP:
+       {
+           unsigned i;
+           for (i = 0; i < replaced_sycall_cnt; i++) {
+               if (replaced_syscall_list[i].addr <= (uint64_t) uc->uc_mcontext.gregs[REG_RIP]
+                       && (uint64_t) uc->uc_mcontext.gregs[REG_RIP] <= replaced_syscall_list[i].addr + 3) {
+                   uc->uc_mcontext.gregs[REG_RIP] = (uint64_t) &replaced_syscall_list[i].code[uc->uc_mcontext.gregs[REG_RIP] - replaced_syscall_list[i].addr];
+                   break;
+               }
+           }
+           assert(i != replaced_sycall_cnt);
+       }
+       break;
+   }
+}
+
+static void setup_signal(void)
+{
+   { /* SIGSEGV */
+       struct sigaction sa = { 0 };
+       sa.sa_flags = SA_SIGINFO | SA_RESTART;
+       sa.sa_sigaction = sig_h;
+       assert(!sigemptyset(&sa.sa_mask));
+       assert(!sigaddset(&sa.sa_mask, SIGSEGV));
+       assert(!sigaction(SIGSEGV, &sa, NULL));
+   }
+   { /* SIGILL */
+       struct sigaction sa = { 0 };
+       sa.sa_flags = SA_SIGINFO | SA_RESTART;
+       sa.sa_sigaction = sig_h;
+       assert(!sigemptyset(&sa.sa_mask));
+       assert(!sigaddset(&sa.sa_mask, SIGSEGV));
+       assert(!sigaction(SIGILL, &sa, NULL));
+   }
+   { /* SIGTRAP */
+       struct sigaction sa = { 0 };
+       sa.sa_flags = SA_SIGINFO | SA_RESTART;
+       sa.sa_sigaction = sig_h;
+       assert(!sigemptyset(&sa.sa_mask));
+       assert(!sigaddset(&sa.sa_mask, SIGTRAP));
+       assert(!sigaction(SIGTRAP, &sa, NULL));
+   }
+}
+
 __attribute__((constructor(0xffff))) static void __zpoline_init(void)
 {
 #ifdef SUPPLEMENTAL__REWRITTEN_ADDR_CHECK
@@ -550,6 +670,12 @@ __attribute__((constructor(0xffff))) static void __zpoline_init(void)
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
            -1, 0)) != MAP_FAILED);
 #endif
+   assert((replaced_syscall_list = mmap(NULL,
+           0x1000 * (((sizeof(struct replaced_syscall) * MAX_REPLACED_SYSCALL) / 0x1000) + ((sizeof(struct replaced_syscall) * MAX_REPLACED_SYSCALL) % 0x1000 ? 1 : 0)),
+           PROT_READ | PROT_WRITE | PROT_EXEC,
+           MAP_PRIVATE | MAP_ANONYMOUS,
+           -1, 0)) != MAP_FAILED);
+   setup_signal();
    setup_trampoline();
    rewrite_code();
    load_hook_lib();

The patch can be applied by the following commands.

git clone https://github.com/yasukata/zpoline.git
cd zpoline
git checkout 0a349e65c102f8f9bdbbf6da0a52c4006589178b main.c

Let's say the patch above is saved in diff.patch.

patch main.c diff.patch
make

Then, we will have a binary libzpoline.so.

test program

The following is a simple test program.

#include <stdio.h>

extern long skip_syscall_0(void);
extern long skip_syscall_1(void);
extern long skip_syscall_2(void);
extern long skip_syscall_3(void);
extern long leaf_fn(void);

void __impl(void)
{
    asm volatile (
    ".globl skip_syscall_0 \n\t"
    "skip_syscall_0: \n\t"
    "mov $5,%rax \n\t"
    "jmp skip0 \n\t"
    "syscall \n\t"
    "skip0: \n\t"
    "mov $100,%rax\n\t"
    "nop \n\t"
    "nop \n\t"
    "nop \n\t"
    "ret \n\t"
    );
    asm volatile (
    ".globl skip_syscall_1 \n\t"
    "skip_syscall_1: \n\t"
    "mov $5,%rax \n\t"
    "jmp skip1 \n\t"
    "syscall \n\t"
    "nop \n\t"
    "skip1: \n\t"
    "mov $100,%rax\n\t"
    "nop \n\t"
    "ret \n\t"
    );
    asm volatile (
    ".globl skip_syscall_2 \n\t"
    "skip_syscall_2: \n\t"
    "mov $5,%rax \n\t"
    "jmp skip2 \n\t"
    "syscall \n\t"
    "nop \n\t"
    "nop \n\t"
    "skip2: \n\t"
    "mov $100,%rax\n\t"
    "nop \n\t"
    "ret \n\t"
    );
    asm volatile (
    ".globl skip_syscall_3 \n\t"
    "skip_syscall_3: \n\t"
    "mov $5,%rax \n\t"
    "jmp skip3 \n\t"
    "syscall \n\t"
    "nop \n\t"
    "nop \n\t"
    "nop \n\t"
    "skip3: \n\t"
    "mov $100,%rax\n\t"
    "nop \n\t"
    "ret \n\t"
    );
    asm volatile (
    ".globl leaf_fn \n\t"
    "leaf_fn: \n\t"
    "movq $20, -8(%rsp) \n\t"
    "movq $39,%rax \n\t"
    "syscall \n\t"
    "nop \n\t"
    "nop \n\t"
    "movq -8(%rsp), %rax \n\t"
    "nop \n\t"
    "nop \n\t"
    "ret \n\t"
    );
}

int main(void)
{
    printf("0: %lu\n", skip_syscall_0());
    printf("1: %lu\n", skip_syscall_1());
    printf("2: %lu\n", skip_syscall_2());
    printf("3: %lu\n", skip_syscall_3());
    printf("%lu\n", leaf_fn());
    return 0;
}

If the test program above is saved in a file named program_above.c, the following command will generate an executable file a.out.

gcc program_above.c

output

The expected output from the test program looks like this.

$ ./a.out
0: 100
1: 100
2: 100
3: 100
20

The output of version 0a349e65c102f8f9bdbbf6da0a52c4006589178b is as follows.

$ LD_PRELOAD=./libzpoline.so ./a.out
env LIBZPHOOK is empty, so skip to load a hook library
0: 100
1: 100
2: 100
3: 100
93908857225659

The following is the output of the version with the patch above.

$ LD_PRELOAD=./libzpoline.so ./a.out
env LIBZPHOOK is empty, so skip to load a hook library
0: 100
1: 100
2: 100
3: 100
20

Here, the newer version seems to be able to handle the +2, +3, and +4 cases and to repair the value in the red zone using %rcx.

So, in summary, I think the direction of your approach is super nice.

The patch above is not perfect yet; I still have several to-do items in mind.

I will update the code here when I have progress.

I really appreciate that you provided an excellent solution. Thank you very much.

sidkshatriya commented 9 months ago

Maybe... add a pop %rcx before call *%rax and add 0x60; int3; int3 after it. This allows fixing the red zone using the value in %rcx. The sequence is pop %rcx; call *%rax; 0x60; int3; int3 = 0x59 0xff 0xd0 0x60 0xcc 0xcc.

Naive question about pop %rcx: doesn't %rcx itself get clobbered then? Shouldn't we save %rcx somewhere before the pop %rcx?

Yangff commented 9 months ago

Maybe... add a pop %rcx before call *%rax and add 0x60; int3; int3 after it. This allows fixing the red zone using the value in %rcx. The sequence is pop %rcx; call *%rax; 0x60; int3; int3 = 0x59 0xff 0xd0 0x60 0xcc 0xcc.

Naive question about pop %rcx: doesn't %rcx itself get clobbered then? Shouldn't we save %rcx somewhere before the pop %rcx?

A system call clobbers %rcx and %r11 anyway, and they are not among the registers used to pass arguments to the kernel (%rdi, %rsi, %rdx, %r10, %r8, and %r9, plus %rax); thus, as long as execution starts from the location of the original syscall, it is fine. %r11 would need an extra prefix, making the pop two bytes.

yasukata commented 8 months ago

Please let me report some updates.

I am sorry, but I am still working on the solution presented by @Yangff and have not finished it yet.

The current direction is that, during the initialization phase, I save the code that is overwritten with 0x59 0xff 0xd0 0x60 0xcc 0xcc to some other place and execute it just before the hook function returns; here, I need to translate any saved instruction that accesses data via an offset from %rip so that it accesses the appropriate address, and this conversion is a bit more complicated than I thought.
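To make the %rip-relative problem concrete, here is my own illustration (the helper below is hypothetical, not part of the patch):

#include <stdint.h>
#include <string.h>

/*
 * at address A:  48 8b 05 d2 04 00 00   mov 0x4d2(%rip),%rax
 * the operand address is A + 7 + 0x4d2 (7 = instruction length);
 * the same bytes executed from a save area at address B would read
 * B + 7 + 0x4d2 instead, so the disp32 must be adjusted by (A - B).
 */
static int fix_rip_disp32(uint8_t *dst, const uint8_t *src, int disp_off)
{
	int32_t disp;
	int64_t adj;

	memcpy(&disp, src + disp_off, sizeof(disp));
	adj = (int64_t) disp + (int64_t) (src - dst);
	if (adj < INT32_MIN || adj > INT32_MAX)
		return -1; /* target out of disp32 reach; needs another strategy */
	disp = (int32_t) adj;
	memcpy(dst + disp_off, &disp, sizeof(disp));
	return 0;
}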

Essentially, I am still exploring a simple way to translate the saved code, and please let me continue the investigation.

Besides this, I made a temporary workaround that preserves the first 8 bytes of the red zone for our callq *%rax by rewriting code that accesses the red zone from N(%rsp) to N-8(%rsp), where N is a negative value (a single-instruction illustration follows the list below).

The limitations of this workaround are:

  1. Because we effectively use 136 bytes as the red zone, the kernel may overwrite the 8 bytes beyond the official 128-byte region if a process executing a leaf function is preempted.
  2. We cannot apply this code replacement if the N of N(%rsp) is smaller than -120, because the shifted offset (e.g., -129) requires a 4-byte disp32, which does not fit in the 1-byte disp8 slot that holds the original offset.
  3. The current implementation does not track accesses to the red zone made through a different register (e.g., mov %rsp,%rax; mov $1,N(%rax)).
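As a concrete illustration of the rewrite mentioned above (my example, showing a single instruction):

before:  c7 44 24 f8 14 00 00 00    movl $0x14,-0x8(%rsp)
after:   c7 44 24 f0 14 00 00 00    movl $0x14,-0x10(%rsp)

Only the 1-byte disp8 changes (0xf8 -> 0xf0), so the instruction length is preserved; an original offset below -120 would need a disp32 after the -8 shift, hence limitation 2.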

Despite these limitations, I applied this workaround to the main branch because I found that the issue reported in https://github.com/yasukata/zpoline/issues/1 had disappeared, and I thought it was slightly better than nothing.

Anyway, I will continue the exploration.

Again, I deeply appreciate all of your kind help.

Yangff commented 8 months ago

Hmmm... just another thought: is it possible to simply enable the last branch record to capture the return address instead of using call? Using jmp *%rax, restricting the max depth to 1, and limiting the recorded types to near jmp only, it should be possible to retrieve the RIP using RDMSR after the NOPs? (Although this may need another system call... or would XSAVE work? Not sure...) I'm not sure what could happen if there is a CPU interrupt in between; I assume the kernel should be able to keep the records purely for user events?

yasukata commented 8 months ago

@Yangff

Thank you very much for your idea.

I did not know that the CPUs have the Last Branch Record (LBR) feature, and it looks very useful.

While I am still looking through the CPU manual and relevant materials, please let me report my thoughts.

I'm not sure what could happen if there is a CPU interrupt in between; I assume the kernel should be able to keep the records purely for user events?

I guess the MSR_LBR_SELECT field would allow us to configure the LBR feature to omit branch records while the CPU is in ring 0; so, I expect we can omit the branch records for interrupts.

I thought of the case where multiple processes run on the same CPU core; my concern was that the branch records in the LBR storage of the CPU would be mixed (i.e., some branch records are for process A and some are for process B). I am not sure yet, but the Linux kernel seems to implement a feature that preserves an LBR state for each process/thread at every context switch ( https://github.com/torvalds/linux/commit/47125db27e47e9d44c878bf8925aa057824bb0d5 ), which would allow us to get the branch records of a specific process even when multiple processes share a CPU core. On the other hand, to activate this feature, we may need to do something in user space to communicate with the kernel perf subsystem through the perf_event_open system call interface; essentially, I need to get more familiar with this.
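For reference, the following is my minimal, untested sketch of how user space might request user-only branch records from the perf subsystem; the event type and sample period are illustrative assumptions, not a verified configuration.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* ask perf to attach branch-stack (LBR) sampling to the calling thread;
 * reading the records back requires mapping the perf ring buffer,
 * which is omitted here */
static int open_lbr_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 1;
	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
	attr.exclude_kernel = 1;
	/* pid = 0 (this thread), cpu = -1 (any), group_fd = -1, flags = 0 */
	return (int) syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}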

Anyway, I think that the approach using the LBR feature is feasible.

Please let me continue to work on your solution embedding 0x59 0xff 0xd0 0x60 0xcc 0xcc, and besides this, I will investigate the LBR-based approach.

I really appreciate that you provided a very cool idea along with useful information.

Yangff commented 8 months ago

Yeah. The LBR is per-CPU, so the kernel needs to do something when switching processes. I think the biggest problem is that there is no easy way to read this MSR; it requires privileged instructions, RDMSR or XSAVES, in order to read it.

yasukata commented 8 months ago

@Yangff

I think the biggest problem is that there is no easy way to read this MSR; it requires privileged instructions, RDMSR or XSAVES, in order to read it.

I see, but I guess executing one additional system call for RDMSR would be acceptable, especially if we limit MSR access to the leaf-function cases; so the use of the LBR feature looks like a plausible option to me.
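For reference, here is a minimal sketch of one existing path for reading an MSR from user space, assuming the msr kernel module is loaded and the process has the required privileges (whether the LBR MSRs are readable this way depends on the kernel):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* the msr driver exposes MSRs as /dev/cpu/<n>/msr; pread with the MSR
 * index as the file offset performs the RDMSR in the kernel */
static int rdmsr_on_cpu0(uint32_t msr, uint64_t *val)
{
	int fd = open("/dev/cpu/0/msr", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, val, sizeof(*val), msr);
	close(fd);
	return n == (ssize_t) sizeof(*val) ? 0 : -1;
}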

At least, you have expanded the option list for me; I really appreciate your kind help.

pdlan commented 4 months ago

How about doing some simple static analysis to detect whether the red zone is used? For example, starting from the instruction after the syscall, check whether the red zone is referenced in the following instructions and branches until a ret/call/push is found (basically building a CFG, maybe with the help of a decoding library such as Intel XED). If a jmp whose target cannot be resolved statically is found, just give up rewriting (because we can't build a CFG in that case). This solution can't handle the case in which the red zone is referenced through other registers (aliases of %rsp), but that shouldn't happen a lot.
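A simplified sketch of such a scan (my illustration; decode_one() is a hypothetical wrapper around a real decoder such as Intel XED, and a full version would follow branch targets to build the CFG instead of scanning linearly):

#include <stdbool.h>
#include <stdint.h>

struct insn {
	int len;
	bool is_ret, is_call, is_push, is_indirect_jmp;
	bool touches_red_zone; /* negative displacement off %rsp */
};

/* hypothetical: decode one instruction at p, return false on failure */
extern bool decode_one(const uint8_t *p, struct insn *out);

static bool red_zone_may_be_live(const uint8_t *after_syscall)
{
	const uint8_t *p = after_syscall;
	int i;

	for (i = 0; i < 64; i++) { /* bounded forward scan */
		struct insn in;
		if (!decode_one(p, &in) || in.is_indirect_jmp)
			return true; /* cannot analyze: assume live, give up rewriting */
		if (in.touches_red_zone)
			return true;
		if (in.is_ret || in.is_call || in.is_push)
			return false; /* the red zone is dead past this point */
		p += in.len;
	}
	return true; /* scan budget exhausted: be conservative */
}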

yasukata commented 3 months ago

@pdlan

Thank you for your message. The current implementation already has a quick check to detect red-zone use and tries to avoid overwriting it ( https://github.com/yasukata/zpoline/issues/9#issuecomment-1855093788 ). But I think there is room for improvement, particularly in detection accuracy. I expect your idea, along with one of the existing static analysis tools, can improve the check mechanism. So, I have put this on my to-do list. Thank you very much for providing me with your idea and the pointer to the tool.