Ambiguous traces. - Githubissues

vext01 commented 3 years ago

TLDR: A single instruction in the final (i.e. post-platform-specific passes) LLVM IR can generate many machine blocks and this can lead to multiple interpretations of a trace through that instruction.

To understand this requires a lot of details. Sorry!

I discovered this with the following test:

  int x = 1, res = 0;                                                                                                                                                                                                                        
  void *tt = __yktrace_start_tracing(HW_TRACING, &x, &res);                                                                                                                                                                                  
  //NOOPT_VAL(x);                                                                                                                                                                                                                            
  switch (x) {                                                                                                                                                                                                                               
    case 1:                                                                                                                                                                                                                                  
      res = 5;                                                                                                                                                                                                                               
      break;                                                                                                                                                                                                                                 
    case 2:                                                                                                                                                                                                                                  
      res = 12;                                                                                                                                                                                                                              
      break;                                                                                                                                                                                                                                 
    case 3:                                                                                                                                                                                                                                  
      res = 4;                                                                                                                                                                                                                               
      break;                                                                                                                                                                                                                                 
    default:                                                                                                                                                                                                                                 
      res += 1;                                                                                                                                                                                                                              
  }                                                                                                                                                                                                                                          
  void *tr = __yktrace_stop_tracing(tt);                                                                                                                                                                                                     
  assert(res == 5);

This gives the following IR at -O3:

; Function Attrs: nounwind uwtable
define dso_local i32 @main(i32 %0, i8** nocapture readnone %1) local_unnamed_addr #0 !dbg !16 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  call void @llvm.dbg.value(metadata i32 %0, metadata !24, metadata !DIExpression()), !dbg !33
  call void @llvm.dbg.value(metadata i8** %1, metadata !25, metadata !DIExpression()), !dbg !33
  %5 = bitcast i32* %3 to i8*, !dbg !34
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %5) #6, !dbg !34
  call void @llvm.dbg.value(metadata i32 1, metadata !26, metadata !DIExpression()), !dbg !33
  store i32 1, i32* %3, align 4, !dbg !35, !tbaa !36
  %6 = bitcast i32* %4 to i8*, !dbg !34
  call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %6) #6, !dbg !34
  call void @llvm.dbg.value(metadata i32 0, metadata !27, metadata !DIExpression()), !dbg !33
  store i32 0, i32* %4, align 4, !dbg !40, !tbaa !36
  call void @llvm.dbg.value(metadata i32* %3, metadata !26, metadata !DIExpression(DW_OP_deref)), !dbg !33
  call void @llvm.dbg.value(metadata i32* %4, metadata !27, metadata !DIExpression(DW_OP_deref)), !dbg !33
  %7 = call i8* (i64, ...) @__yktrace_start_tracing(i64 1, i32* nonnull %3, i32* nonnull %4) #6, !dbg !41
  call void @llvm.dbg.value(metadata i8* %7, metadata !28, metadata !DIExpression()), !dbg !33
  %8 = load i32, i32* %3, align 4, !dbg !42, !tbaa !36
  call void @llvm.dbg.value(metadata i32 %8, metadata !26, metadata !DIExpression()), !dbg !33
  switch i32 %8, label %12 [
    i32 1, label %9
    i32 2, label %10
    i32 3, label %11
  ], !dbg !43

9:                                                ; preds = %2
  call void @llvm.dbg.value(metadata i32 5, metadata !27, metadata !DIExpression()), !dbg !33
  store i32 5, i32* %4, align 4, !dbg !44, !tbaa !36
  br label %15, !dbg !46

10:                                               ; preds = %2
  call void @llvm.dbg.value(metadata i32 12, metadata !27, metadata !DIExpression()), !dbg !33
  store i32 12, i32* %4, align 4, !dbg !47, !tbaa !36
  br label %15, !dbg !48

11:                                               ; preds = %2
  call void @llvm.dbg.value(metadata i32 4, metadata !27, metadata !DIExpression()), !dbg !33
  store i32 4, i32* %4, align 4, !dbg !49, !tbaa !36
  br label %15, !dbg !50

12:                                               ; preds = %2
  %13 = load i32, i32* %4, align 4, !dbg !51, !tbaa !36
  call void @llvm.dbg.value(metadata i32 %13, metadata !27, metadata !DIExpression()), !dbg !33
  %14 = add nsw i32 %13, 1, !dbg !51
  call void @llvm.dbg.value(metadata i32 %14, metadata !27, metadata !DIExpression()), !dbg !33
  store i32 %14, i32* %4, align 4, !dbg !51, !tbaa !36
  br label %15, !dbg !52

15:                                               ; preds = %12, %11, %10, %9
  %16 = call i8* @__yktrace_stop_tracing(i8* %7) #6, !dbg !53
  call void @llvm.dbg.value(metadata i8* %16, metadata !30, metadata !DIExpression()), !dbg !33
  %17 = load i32, i32* %4, align 4, !dbg !54, !tbaa !36
  call void @llvm.dbg.value(metadata i32 %17, metadata !27, metadata !DIExpression()), !dbg !33
  %18 = icmp eq i32 %17, 5, !dbg !54
  br i1 %18, label %20, label %19, !dbg !57

What's important here is the use of switch to dispatch to the correct case:

  switch i32 %8, label %12 [
    i32 1, label %9
    i32 2, label %10
    i32 3, label %11
  ], !dbg !43

When we trace this, we see the following blocks indices reported: 0, 0, 0, 1, 5:

mapper: 201b27-201b2f -> [Element { range: 2104064..2104112, value: BlockMapEntry { f_off: 2104064, corr_bbs: [0] } }]
mapper: 201b30-201b34 -> [Element { range: 2104112..2104117, value: BlockMapEntry { f_off: 2104064, corr_bbs: [0] } }]
mapper: 201b35-201b39 -> [Element { range: 2104117..2104122, value: BlockMapEntry { f_off: 2104064, corr_bbs: [0] } }]
mapper: 201b3a-201b42 -> [Element { range: 2104122..2104131, value: BlockMapEntry { f_off: 2104064, corr_bbs: [1] } }]
mapper: 201b59-201b5d -> [Element { range: 2104153..2104167, value: BlockMapEntry { f_off: 2104064, corr_bbs: [5] } }]

^ These debug prints show the addresses being queried and the IR block(s) that each address range corresponds with (corr_bbs). The block numbers are indices, so that's actually the entry block (three times) then the block labeled 9: followed by the block labeled 15:.

We can see that several block all mapped to BB0. Why? It's to do with the machine blocks that the LLVM switch statement decomposed to.

Disassembling the binary we see:

            0x00201b1b      bf01000000     mov edi, 1                  ; switch.c:17   void *tt = __yktrace_start_tracing(HW_TRACING, &x, &res);                                                                                             
            0x00201b20      31c0           xor eax, eax                                                                                                                                                                                      
            0x00201b22      e8f9000000     call sym.imp.__yktrace_start_tracing ;[1]                                                                                                                                                         
            0x00201b27      8b4c2404       mov ecx, dword [rsp + 4]    ; switch.c:19   switch (x) {                                                                                                                                          
            0x00201b2b      83f903         cmp ecx, 3                  ; 3                                                                                                                                                                   
        ┌─< 0x00201b2e      7413           je 0x201b43                                                                                                                                                                                       
        │   0x00201b30   *  83f902         cmp ecx, 2                  ; 2                                                                                                                                                                   
       ┌──< 0x00201b33      7417           je 0x201b4c                                                                                                                                                                                       
       ││   0x00201b35      83f901         cmp ecx, 1                  ; 1                                                                                                                                                                   
      ┌───< 0x00201b38      751b           jne 0x201b55                                                                                                                                                                                      
      │││   0x00201b3a      c70424050000.  mov dword [rsp], 5          ; switch.c:21       res = 5;                                                                                                                                          
     ┌────< 0x00201b41      eb16           jmp 0x201b59                                                                                                                                                                                      
     │││└─> 0x00201b43      c70424040000.  mov dword [rsp], 4          ; switch.c:27       res = 4;                                                                                                                                          
     │││┌─< 0x00201b4a      eb0d           jmp 0x201b59                                                                                                                                                                                      
     ││└──> 0x00201b4c      c704240c0000.  mov dword [rsp], 0xc        ; switch.c:24       res = 12;    ; [0xc:4]=-1 ; 12                                                                                                                    
     ││┌──< 0x00201b53      eb04           jmp 0x201b59                                                                                                                                                                                      
     │└───> 0x00201b55      83042401       add dword [rsp], 1          ; switch.c:30       res += 1;                                                                                                                                         
     └─└└─> 0x00201b59      4889c7         mov rdi, rax                ; switch.c:32   void *tr = __yktrace_stop_tracing(tt);                                                                                                                
            0x00201b5c      e8cf000000     call sym.imp.__yktrace_stop_tracing ;[2]

Although the test checks the switch cases in the order 1, 2, 3, default, LLVM is free to, and has, re-ordered the checks. It tests 3, 2, 1, default using cascading cmp and jxx instructions.

In the eyes of PT, a jxx terminates a block, so when x=1 we pass through the blocks for testing the 3 and 2 cases before eventually taking the 1 case. So that explains why we see block 0 appearing three times.

The block map (in the .llvm_bb_addr_map section) confirms that LLVM treats each of these micro-blocks as a distinct machine block:

blockmap: 201b00-201b2f <- First MB corresponding with BB0.
blockmap: 201b30-201b34 <- Second
blockmap: 201b35-201b39 <- Third
blockmap: 201b3a-201b42
blockmap: 201b43-201b4b
blockmap: 201b4c-201b54
blockmap: 201b55-201b58
blockmap: 201b59-201b66
blockmap: 201b67-201bab
blockmap: 201bac-201bb5
blockmap: 201bb6-201bce
blockmap: 201bcf-201be7

OK, so far so good. What's the problem?

What's bothering me about this is that a block sequence 0, 0, 0 is ambiguous. It either expresses progression through MBBs which all correspond with the BB, or it expresses that BB0 was run three times.

One potential fix is to pair each BB number in our trace with an MBB address. So perhaps our trace is instead of 0, 0, 0, rather (0, 0x123), (0, 0x456), (0, 0x789), where the 0x numbers are the addresses of the machine blocks. By doing this we can clearly see that BB0 has not been executed thrice.

What I'm thinking about now is whether this fix is sufficient under the assumption that any IR instruction could decompose to arbitrary control flow (perhaps including loops and unstructured jumps). We only consider a block "re-executed" if there's a jump to the address of the first MBB in the block.

Any thoughts?

[Side note: Since @ptersilie moved the IR encoding stage later, I did not expect there to still be a one-to-many mapping from a "final" IR block to it's machine blocks. I thought that the code-gen would canonicalise the control flow in such a way to make a 1-to-1 mapping. Clearly this is not the case. Therefore, in an ideal world, we'd serialise the machine basic blocks and use those as our IR, but this would be highly impractical, as LLVM doesn't offer an API to serialise/deserialise MBBs, I don't think]

ltratt commented 3 years ago

LLVM doesn't offer an API to serialise/deserialise MBBs

Could / should we add one?

vext01 commented 3 years ago

Anything's possible I guess. But it will take time to implement, and it had better be worthwhile if we are to diverge further from LLVM upstream.

On a related note, another scenario under which we see "repeated basic blocks" is function calls. Since a LLVM call doesn't terminate a block, you see the same block again after returning. We've had to hack around that in our code, but using an MBB-derived IR would simplify this scenario too I think.

OK, so what I'd like to do is find out next is if:

LLVM has a public API for manipulating MBB CFGs. Presumably it does for implementing code-gen backends.
We can re-use the byte-streamer thingy to serialise MBBs.

@ptersilie any thoughts?

ptersilie commented 3 years ago

Yes, working on MBBs instead of LLVM IR was one of the things we've discussed yesterday, but it's no small undertaking. Since we have some ideas to work around the switch-problem, my immediate thoughts are to try that first and see how far we get. I wouldn't be surprised if there are other unexpected codegen shenanigans that make mapping back to IR difficult. Ideally, we have a 1-1 mapping from machine code to something we can work with.

Another problem I see with working with MBB CFGs is how much we can still optimise those. The reason we decided to work on LLVM IR was that we can construct a trace and then pass it to LLVM for optimisation. I'm unsure how much that is possible if we work on different IR.

vext01 commented 3 years ago

I'm unsure how much that is possible if we work on different IR.

This is a very good point.

bjorn3 commented 3 years ago

Would creating and adding a switch lowering pass at the end of the llvm ir passes work? And maybe disable or restrict jump threading if necessary?

vext01 commented 3 years ago

Hi @bjorn3!

The thought did cross my mind. The problem is, we don't know how many other IR instructions might have hidden control flow, and we may end up with many passes. It would be nice to have a small and general fix.

ykjit / yk

Ambiguous traces. #383