ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
34.51k stars 2.52k forks source link

MSVC Compiler 19.22.xxxxx generates invalid code which leads to inability to build libuserland #3024

Closed dimenus closed 1 year ago

dimenus commented 5 years ago

Related to #2740

With the upgrade from 19.21 (VS 2017) to 19.22, Microsoft seems to have broken something related to our FNV hash implementation. Building libuserland leads to this error:

invalid builtin function: 'hasDecl'

The workaround is to use VS 2017 or build tools older than 19.22.

mikdusan commented 5 years ago

going to put this here for posterity:

STATUS CL.EXE VS CMAKE INLINE
19.21.27702 2019 16.0 Release /Ob2
19.22.27905 2019 16.2 Release /Ob2
MinSizeRel /Ob1
Release /Ob1 (hacked)
19.23.28105.4 2019 16.3.0-pre.4.0 MinSizeRel /Ob1
Release /Ob2
2019 16.3.2 MinSizeRel /Ob1
Release /Ob2
19.24.28117.0 2019 16.4.0-pre.1.0 MinSizeRel /Ob1
Release /Ob2
19.24.28314.0 2019 16.4.2 MinSizeRel /Ob1
Release /Ob2
2019 16.5.0-pre.1.0 MinSizeRel /Ob1
Release /Ob2
19.24.28316.0 2019 16.4.4 Release /Ob2
19.25.28508.3 2019 16.5.0-pre.2.0 Release /Ob2
XVilka commented 5 years ago

Can it be related to the optimizations? In radare2 we met with memmove optimization bug: https://developercommunity.visualstudio.com/content/problem/583227/vs-2019-cl-1921277022-memmove-instrinsic-optimizat.html

andrewrk commented 5 years ago

Based on @XVilka's link, that could very well be our issue. Microsoft has acknowledged the bug and so eventually Azure will be updated with a patched MSVC. Until then I am testing the MinSizeRel workaround.

andrewrk commented 5 years ago

The CI is using MinSizeRel to work around this issue. Thanks @mikdusan.

I'm not actually able to reproduce it locally. Let's keep this issue open until the fix is widely available.

Sahnvour commented 5 years ago

I investigated this with

$ cl.exe
Microsoft (R) C/C++ Optimizing Compiler Version 19.23.28105.4 for x64

at 17f2af10b56dd508ebd4a22834fd594cb5a49864.

The root cause is that building the global builtin function table corrupts memory while adding entries. They will then later not be available when trying to access them.

The bug only occurs with CMake's Release target, so in order to get debug symbols I had to hack a bit using https://gitlab.kitware.com/cmake/community/wikis/FAQ#make-override-files. In my case I used a RelWithDebInfo and overriding flags to be sure to have /O2 /Ob2.

set(CMAKE_USER_MAKE_RULES_OVERRIDE_CXX
   ${CMAKE_CURRENT_SOURCE_DIR}/cxx_flag_overrides.cmake)
if(MSVC)
    set(CMAKE_CXX_FLAGS_RELWITHDEBINFO_INIT "/MD /Zi /O2 /Ob2 /D NDEBUG")
endif()

Building target libuserland (with the command zig0.exe build --override-lib-dir <path>/<to>/zig/lib libuserland install -Doutput-dir=<path>/<to>/zig/build -Drelease=true -Dlib-files-only --prefix <path>/<to>/Install/zig) will then fail.

So I launched it via the debugger. Now, while running define_builtin_fns from codegen.cpp, we can observe g->builtin_fn_table._entries being corrupted (it's a sparse array in memory) when executing create_builtin_fn(..., "errorReturnTrace");.

The disasm is

                  create_builtin_fn(g, BuiltinFnIdErrorReturnTrace, "errorReturnTrace", 0);
00007FF625EE5219  lea         edx,[r14+28h]  
00007FF625EE521D  lea         ecx,[rdx-27h]  
00007FF625EE5220  call        qword ptr [__imp_calloc (07FF62C995030h)]  
00007FF625EE5226  mov         rsi,rax  
00007FF625EE5229  test        rax,rax  
00007FF625EE522C  je          define_builtin_fns+2999h (07FF625EE5B59h)  
00007FF625EE5232  lea         rbx,[rax+8]  
00007FF625EE5236  mov         qword ptr [rbp+28h],rax  
00007FF625EE523A  mov         rcx,rbx  
00007FF625EE523D  lea         r8d,[r14+10h]  
00007FF625EE5241  lea         rdx,[string "errorReturnTrace" (07FF62A907A28h)]  
00007FF625EE5248  call        buf_init_from_mem (07FF625EDCE10h)  
00007FF625EE524D  lea         r8,[rbp+28h]  
00007FF625EE5251  mov         qword ptr [rbp+30h],rbx  
00007FF625EE5255  lea         rdx,[rbp+30h]  
00007FF625EE5259  mov         rcx,rdi  
00007FF625EE525C  mov         dword ptr [rax],65h  
00007FF625EE5262  mov         qword ptr [rax+20h],r14  
00007FF625EE5266  call        HashMap<Buf * __ptr64,BuiltinFnEntry * __ptr64,&buf_hash,&buf_eql_buf>::put (07FF625DA17F2h)  

This is basically create_builtin_fn inlined, see we're creating the builtin entry, its name and inserting it into the hashmap. Near the end, at mov dword ptr [rax],65h the memory pointed to by builtin_fn->name.list.items (rax) is stomped and so the key for the hashmap is garbage.

If you look at the asm generated for the previous builtins, the part after the call to buf_init_from_mem is nearly identical, and then changes slightly starting from "errorReturnTrace".

I'm pretty sure this is a codegen bug, but maybe not the one linked above. An easy workaround is to surround one of these two functions with #pragma optimize("", off/on) and the bug will disappear. I guess splitting defined_builtin_fns in two might also help. Anyways we should probably file a bug report to Microsoft.

Sahnvour commented 4 years ago

Tested this again with

set(CMAKE_CXX_FLAGS_RELEASE_INIT
        "/MT /O2 /Ob2 /D NDEBUG")

and

set(CMAKE_CXX_FLAGS_RELEASE_INIT
        "/MT /O2 /Ob3 /D NDEBUG")

(Ob3 is a new flag in VS 2019) on compilers 19.24.28314 and 19.24.28316 without walking into the issue anymore. Maybe we can close this (I'll make a PR to reset the optimization flags to default) ?

mikdusan commented 4 years ago

@Sahnvour,

compilers 19.24.28314 and 19.24.28316

Something odd here. At my last update I tested 19.24.28314 and the issue remained. This was building against the same LLVM binary bundle used by CI.

I speculated that maybe this issue can be resolved by rebuilding LLVM with new compiler. Unfortunately I ran into difficulties rebuilding LLVM on my system and left that unresolved.

Are you using your own LLVM dependency?

update: 19.24.28316.0 and 19.25.28508.3 work for me with Release and /Ob2 so assuming CI is running minimal version we should be fine going back to /Ob2.

andrewrk commented 1 year ago

This issue no longer plagues us since we use the wasm kernel for building from source rather than C++ code.