xenia-project / xenia

Xbox 360 Emulator Research Project
https://xenia.jp

Consideration for ARM64 Windows native support #2002

Open holaimscott opened 2 years ago

holaimscott commented 2 years ago

Howdy! I wanted to say that the work done on Xenia is amazing <3

It would mean a lot if you could consider adding native ARM64 support for Windows devices.

While I know these sorts of things are very complicated and time-consuming, I think it would be great to make this amazing emulator friendly to more Windows platforms, and also to lay the foundation for ARM compatibility now that more ARM devices are starting to show up and promise great performance.

Several emulators have adopted ARM compilers or are capable of having ARM builds and are starting to shine on these new ARM devices, like Dolphin, Duckstation and now PS2.

Masamune3210 commented 2 years ago

Is there an ARM64 device with sufficient CPU and GPU oomph? Most of them, if not almost all, are tablets, which don't have the best track record when it comes to power.

Masamune3210 commented 2 years ago

Other than the M1 which has its own host of issues thanks to Apple being Apple

holaimscott commented 2 years ago

> Is there an ARM64 device with sufficient CPU and GPU oomph? Most of them, if not almost all, are tablets, which don't have the best track record when it comes to power.

I believe most of the newer ARM devices out might be at least solid to emulate some of the less demanding games. For instance, let me put here some samples:

The upcoming Snapdragon 8cx 3rd Generation might be the most powerful device to come, along with the existing 2nd-generation model, or the Snapdragon 850.

I don't believe they are powerful like the M1 chips, but ARM64 implementation would be a nice thing to start setting foundation as these devices keep popping off.

JoelLinn commented 2 years ago

From personal experience I believe SM8350 and up should be good to emulate games that run at 60fps on 7-gen i7s at acceptable framerates. SM8450 and SM8475 should be significantly faster though. M1 would eat games for breakfast.

But all this requires a proper implementation of an aarch64 CPU backend and Vulkan (and maybe Metal) GPU backends that are at least as good as the current x86-64 and D3D12 ones.

Masamune3210 commented 2 years ago

It would likely require some engine to convert or handle x86 instructions in a way that gets the same result as native code, which is a colossal undertaking in and of itself

JoelLinn commented 2 years ago

> It would likely require some engine to convert or handle x86 instructions in a way that gets the same result as native code, which is a colossal undertaking in and of itself

No. Currently on x86-64 we do PPC -> xenia custom IR -> x86-64

We just need to replace the part that translates the xenia IR to host code with one that targets aarch64. That would be common to all aarch64 operating systems.
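To make the "only the last stage changes" point concrete, here is a minimal sketch of a shared IR feeding interchangeable backends. Everything in it (`IrInstr`, `Backend`, `X64Backend`, `A64Backend`, `Compile`, the `v<N>` virtual register names) is hypothetical and invented for illustration; it is not Xenia's actual HIR or backend interface, and it emits text rather than machine code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for one IR instruction.
// v<N> names are virtual registers, not real host registers.
enum class Opcode { kAdd, kReturn };

struct IrInstr {
  Opcode opcode;
  int dest;
  int src1;
  int src2;
};

// The only architecture-specific piece: turning IR into host code.
class Backend {
 public:
  virtual ~Backend() = default;
  virtual std::string Emit(const IrInstr& instr) const = 0;
};

static std::string V(int reg) { return "v" + std::to_string(reg); }

class X64Backend final : public Backend {
 public:
  std::string Emit(const IrInstr& i) const override {
    if (i.opcode == Opcode::kReturn) return "ret";
    // x86-64 ADD is two-operand, so copy the first source into dest first.
    // This naive lowering is only valid when dest and src2 differ.
    assert(i.dest != i.src2);
    return "mov " + V(i.dest) + ", " + V(i.src1) + "\nadd " + V(i.dest) +
           ", " + V(i.src2);
  }
};

class A64Backend final : public Backend {
 public:
  std::string Emit(const IrInstr& i) const override {
    if (i.opcode == Opcode::kReturn) return "ret";
    // AArch64 ADD is three-operand, so no extra move is needed.
    return "add " + V(i.dest) + ", " + V(i.src1) + ", " + V(i.src2);
  }
};

// The frontend output (the IR) is identical for every host architecture.
std::vector<std::string> Compile(const std::vector<IrInstr>& ir,
                                 const Backend& backend) {
  std::vector<std::string> out;
  for (const IrInstr& instr : ir) out.push_back(backend.Emit(instr));
  return out;
}
```

Passing the same `ir` vector to `Compile` with either backend object yields host-specific output without touching the code that produced the IR.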

Masamune3210 commented 2 years ago

Fair, for some reason it completely slipped my mind we are already going from PPC to x86

iMacker2020 commented 2 years ago

I am thinking if we used LLVM we would be able to support as many architectures as LLVM supports. So PowerPC -> LLVM IR -> (ARM64, X86-64, X86, SPARC, MIPS, and RISC-V).

JoelLinn commented 2 years ago

That is indeed appealing, but LLVM is a poor choice for other reasons (see wiki/docs). Plus, even with LLVM we would only be able to support a small number of architectures with enough compute power: say amd64, aarch64, maybe risc-v if it picks up, and elbrus if the war goes on.

Triang3l commented 2 years ago

LLVM may be a more or less quick route (though will still require a lot of work), but aside from long recompilation (I also don't know what the runtime performance implications of it are when it's used for the purpose of recompilation of already compiled/optimized code), I see a few maintenance and architectural issues with it as well:

First of all, its structure is extremely complex (at least at first glance), making it very difficult to add specific optimized sequences for certain operations. @Wunkolo has been able to very quickly add some GFNI and BMI paths for certain PowerPC instructions, and for my first commit I also easily traced the path vpkd3d128 and vupkd3d128 went from PowerPC to x86.

And while the previous issue I listed is pretty subjective and probably can be solved by just learning more about how LLVM works (and having PowerPC operations decomposed into relatively primitive LLVM ones somewhat even eliminates the need to pay much attention to the instructions emitted by the backend at all), I expect this one to be quite fundamental to how freely we'll be able to optimize things when they actually need optimizing.

The IR of LLVM is pretty distant from both PowerPC and x86, basically being a description of a quite abstract RISC ALU and memory access units, designed for hand-written code input. As I said, this of course allows us to, for instance, easily describe all the edge cases of some PowerPC operation in terms of those primitive operations, and just forget about everything beyond the IR, letting LLVM handle the rest. But I don't think that's truly what we want all the time.

Some semantic context may definitely be helpful. Some PowerPC operations may map directly to host operations on some host architectures but not on others; in that case, we shouldn't be adding explicit special-case handling on the hosts where we don't need it. We may, for instance, want to treat host flags as guest flags to some extent.

Note that this issue of an abstract IR is what we already have to some extent with our HIR — and we already have some pretty complex "macro operations" in it, sometimes pretty intrusive, like the way we implement vsldoi (vector rotation) explicitly via a lookup table at the frontend side, rather than letting the backend use a lookup table if it has no other way of implementing it — this is what we should eliminate in the future, I think.
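For reference, the guest-visible behaviour of vsldoi is small enough to state directly: it concatenates two 128-bit registers and extracts 16 bytes starting at a byte offset. The sketch below is a plain scalar reference, with `Vec128` and `Vsldoi` as hypothetical names and byte 0 treated as the most significant byte (PowerPC ordering); it is not how Xenia implements the instruction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 16 bytes in PowerPC (big-endian) order: index 0 is the most significant.
using Vec128 = std::array<uint8_t, 16>;

// Reference semantics of vsldoi vD, vA, vB, SH:
// take bytes SH..SH+15 of the 32-byte concatenation vA || vB.
Vec128 Vsldoi(const Vec128& a, const Vec128& b, unsigned sh) {
  assert(sh < 16);  // SH is a 4-bit immediate
  Vec128 result{};
  for (unsigned i = 0; i < 16; ++i) {
    const unsigned src = i + sh;
    result[i] = src < 16 ? a[src] : b[src - 16];
  }
  return result;
}
```

If the IR kept this as a single operation, a backend could presumably pick its own lowering (for example something like PALIGNR on x86 or EXT on AArch64, after accounting for byte order) and fall back to a lookup table only where nothing better exists.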

In my opinion, a more or less clean and hack-free translation chain, without the constraints of a "greatest common divisor", should look as follows:

  1. Original PowerPC code.
  2. Control flow analysis.
  3. PowerPC instructions, including their status flags and floating-point control register values, chained as SSA. "Global" processor state (status flags, floating-point and vector control registers) can be passed as a special kind of argument and return value for the purpose of dependency chaining, with whether it has been changed explicitly or just preserved across a function call boundary being visible to the host code. For example, on x86 one MXCSR is used for both FPU and VMX, as they're both implemented via AVX, so we reload FPSCR and VSCR into MXCSR before every floating-point operation that is either the first in the current function or the first after a function called in the same block or in the nearest predecessor block. On Arm, however, the control register has effect only on scalar floating-point operations (sadly it doesn't have non-denormal-flushing vector operations at all for emulation of the Java mode of VMX; if that's needed, we'll have to fall back to scalar implementations with proper FPSCR/VSCR switching in a "very high accuracy mode", which is another reason for context awareness of operations).
  4. Possibly some optimizations on the PowerPC SSA IR level (like flag-affecting constant propagation — only as long as the dynamic floating-point rounding mode can't have effect on the results).
  5. SSA IR for host instructions.
  6. Advanced host-specific optimizations, possibly involving more supersets/subsets of the host SSA IR, and host register allocation, maybe followed by optimizations amortizing the impact of register spilling. Function call inlining (which is trivial on IR level) may also be done at this stage.
  7. Final host instructions.
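The control-register rule mentioned in step 3 can be sketched for the simplest case, a single straight-line block. `OpKind` and `MarkControlReloads` are hypothetical names; the sketch assumes that the host control register is unknown at entry and after every call, and it deliberately ignores predecessor blocks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical classification of ops in one straight-line block.
enum class OpKind { kInteger, kFloat, kCall };

// For each op, reports whether the host floating-point control register
// would have to be reloaded from guest state right before it.
std::vector<bool> MarkControlReloads(const std::vector<OpKind>& ops) {
  std::vector<bool> reload(ops.size(), false);
  bool host_state_valid = false;  // unknown at function entry
  for (std::size_t i = 0; i < ops.size(); ++i) {
    switch (ops[i]) {
      case OpKind::kFloat:
        if (!host_state_valid) {
          reload[i] = true;  // first FP op since entry or since a call
          host_state_valid = true;
        }
        break;
      case OpKind::kCall:
        host_state_valid = false;  // the callee may have changed it
        break;
      case OpKind::kInteger:
        break;  // integer ops neither need nor disturb the control register
    }
  }
  return reload;
}
```

With the state threaded through the IR as an explicit SSA value, as the list proposes, the same information would fall out of ordinary dependency analysis instead of a dedicated pass like this one.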

As for the Elbrus E2K architecture specifically, while an LLVM fork exists for it and is being developed with some speed (primarily for the D language if I remember correctly), from what I know, currently LLVM generates pretty subpar code for it. E2K is way more demanding when it comes to optimizations performed by the compiler than out-of-order architectures.

As a first (but huge) step towards proper E2K support, it should be interesting to research EPIC-like code generation for other platforms. I'm not sure about the details of Arm when it comes to scheduling, but x86 has pretty well-documented instruction port allocation and latencies. Possibly in the first pass, we can write a somewhat naïve scheduler that assumes that all memory can be aliased (a hard constraint for execution unit allocation on E2K) and that everything is in the L1 cache (making assumptions about the latencies of instructions accordingly — on x86 we don't have much space for separating memory requests and dependencies anyway due to a relatively tiny register count in the ISA). I expect that we'll have a lot of scheduling-related code that can be shared between implementations — an execution port on x86 may roughly correspond to an ALU/predicate/control flow operation slot on E2K, so allocation will definitely have something in common between them.

Another fun thing to research would be background optimizations, maybe even somewhat profile-guided (can we measure cache hit rate somehow, or is that not possible especially due to Meltdown/Spectre?), and trying to perform whole-program optimization to some extent, maybe even aliasing analysis? Though I don't know how possible it is at all when you don't actually have the whole program visible (due to indirect jumps also), and can't be 100% sure that, like, some memory address will be accessed only through this or that pointer, and no other way. But the OS-level Lintel translator of x86 code on Elbrus (as well as the app-level RTC translator, but I don't know how exactly that's handled there) reserves 2 cores entirely for translation work. I don't know the details, but I've heard that it performs very advanced optimizations, probably sharing a lot with the LCC C/C++/Fortran compiler. Maybe post-optimizing already translated code, and pre-translating code that's likely to be accessed soon, can also be done at some point in the future in Xenia.

Wunkolo commented 2 years ago

Might be worth the data-point but there is an xbyak for aarch64 that may be utilized similar to how we currently utilize xbyak.

Triang3l commented 2 years ago

> Might be worth the data-point but there is an xbyak for aarch64 that may be utilized similar to how we currently utilize xbyak.

Alternatively, we can use the official VIXL library, but it will likely need some changes to the code buffer allocation logic. Particularly, one inconvenient thing about VIXL that I remember, if I understand correctly, is that it manages the memory for the code by itself, and even contains Linux-specific memory management calls, rather than letting the user handle the allocation.

However, Xbyak is just the final point in the chain. The mere existence of an assembler unfortunately doesn't solve the primary issues, though of course one is necessary for being able to emit anything. Armv8 has fixed-length 32-bit instructions if I recall correctly, so writing our own assembler, if we face significant issues with the existing ones, should be even easier than for x86.

iMacker2020 commented 2 years ago

@Triang3l I assume you are talking about this VIXL: https://github.com/Linaro/vixl. After looking at xbyak aarch64 and VIXL's GitHub pages, my vote goes for xbyak. Its documentation is better and its x86 version is already in use in Xenia.

holaimscott commented 2 years ago

This is exciting! If there is any site where I can support devs for yalls work, let me know. Means a lot to see this possibly coming to fruition <3

Wunkolo commented 2 years ago

Bumping to mention another AArch64 (ARMv8.0) JIT emitter that can be used, with a very light header-only implementation: https://github.com/merryhime/oaknut

iMacker2020 commented 2 years ago

@Wunkolo I looked at the project page and couldn't find any mention of it working on M1 Macs. Would you have any information about that?

The example program on the project page did look easy to understand and write, so it might be a good project to use.

Wunkolo commented 2 years ago

The only platform-specific code that Oaknut has is for marking a memory region as executable, which handles the Windows, Apple and Linux cases: https://github.com/merryhime/oaknut/blob/9acafdcdd9b4c1140b1f7a125844a22405a7774d/include/oaknut/code_block.hpp#L29-L35
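For readers unfamiliar with what that platform-specific part involves, here is a rough sketch of allocating writable memory and later flipping it to executable. The function names are hypothetical and this is not Oaknut's code; it uses only `VirtualAlloc`/`VirtualProtect`/`VirtualFree` on Windows and `mmap`/`mprotect`/`munmap` elsewhere, and it leaves out details a real JIT needs, such as instruction cache invalidation on Arm hosts and the `MAP_JIT` requirements on Apple silicon.

```cpp
#include <cstddef>
#include <cstring>
#if defined(_WIN32)
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// Allocates read/write memory that generated code can be written into.
// Returns nullptr on failure.
void* AllocateCodeMemory(std::size_t size) {
#if defined(_WIN32)
  return VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
#else
  void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
#endif
}

// Switches the region from writable to read/execute once code is in place.
bool MakeExecutable(void* p, std::size_t size) {
#if defined(_WIN32)
  DWORD old_protect = 0;
  return VirtualProtect(p, size, PAGE_EXECUTE_READ, &old_protect) != 0;
#else
  return mprotect(p, size, PROT_READ | PROT_EXEC) == 0;
#endif
}

// Releases a region obtained from AllocateCodeMemory.
void FreeCodeMemory(void* p, std::size_t size) {
#if defined(_WIN32)
  (void)size;  // VirtualFree with MEM_RELEASE requires a size of 0
  VirtualFree(p, 0, MEM_RELEASE);
#else
  munmap(p, size);
#endif
}
```

Keeping this in a separate, optional layer is what lets the emitter itself stay platform-independent.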

I believe the developer (merryhime) also develops and tests their ARMv8.0 code primarily on an M1 Mac as part of porting dynarmic to AArch64, which is currently the primary client of Oaknut.

iMacker2020 commented 2 years ago

Oaknut documentation is very lacking. When I was searching for documentation for it I actually found another project called dynarmic (https://github.com/merryhime/dynarmic). There are a lot of options available to us.

xbyak_aarch64 has a very well documented GitHub page. It also looks very complete and easy to use.

The solution we pick has to work for Windows on ARM and macOS on ARM64. I think both solutions meet that requirement.

@Triang3l You are a maintainer of this project. Could you or another maintainer pick one?

Triang3l commented 2 years ago

> @Triang3l You are a maintainer of this project. Could you or another maintainer pick one?

Shifts and ORs? 😜 All instructions are 32-bit anyway.
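The "shifts and ORs" remark is close to literal: an A64 instruction is one 32-bit word built from fixed opcode bits and register fields. A small sketch with hypothetical helper names, covering only `ADD Xd, Xn, Xm` (shifted-register form with no shift) and `RET`:

```cpp
#include <cassert>
#include <cstdint>

// ADD Xd, Xn, Xm (64-bit, shifted-register form, LSL #0).
// Rm sits in bits 20:16, Rn in bits 9:5, Rd in bits 4:0.
inline uint32_t EncodeAddX(uint32_t rd, uint32_t rn, uint32_t rm) {
  assert(rd < 32 && rn < 32 && rm < 32);
  return 0x8B000000u | (rm << 16) | (rn << 5) | rd;
}

// RET Xn; the link register X30 is the conventional operand.
inline uint32_t EncodeRet(uint32_t rn = 30) {
  assert(rn < 32);
  return 0xD65F0000u | (rn << 5);
}
```

For instance, `EncodeAddX(0, 1, 2)` gives `0x8B020020` (`add x0, x1, x2`) and `EncodeRet()` gives `0xD65F03C0`.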

iMacker2020 commented 2 years ago

So you don't have a preference...

JoelLinn commented 2 years ago

I believe nobody is gonna decide anything based on a coarse overview. In the ideal case, whoever decides to work on this first would make some POCs with the hottest candidate for our special use case. But in the end it will be decided when aarch64 is merged to master, since this is open source and there is no steering committee.

Triang3l commented 2 years ago

> So you don't have a preference...

I think the final step of recompilation is the least complex part, any way to generate bit fields would work, while all the fun will be at earlier stages, before and during register allocation and scheduling — where we'll probably have some other representations of Arm instructions that won't be final Arm instruction encodings.

merryhime commented 2 years ago

Hola. I wrote oaknut primarily because I looked at vixl and didn't like how it wanted control of memory, and I just wanted a very very dumb code emitter. Importantly, I wanted something that just emits one instruction per emission function, without anything fancy at all. In other words, something which basically just does *m_ptr++ = instruction;.

oaknut::CodeGenerator accepts a user-provided uint32_t* and writes to that. It's your responsibility to do protection/unprotection/icache invalidation/allocation/deallocation.
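The "very dumb emitter over a user-provided pointer" idea can be illustrated in a few lines. `TinyEmitter` and its methods are hypothetical and are not oaknut's API; the only real facts used are the encodings of `MOVZ Xd, #imm16` and `RET`. The caller owns the buffer, and there is deliberately no bounds checking, memory protection or cache handling.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal emitter: one 32-bit word per emitted instruction,
// written to memory the caller allocated and will manage.
class TinyEmitter {
 public:
  explicit TinyEmitter(uint32_t* buffer) : begin_(buffer), ptr_(buffer) {}

  // The whole "backend": store the word and advance.
  void Emit(uint32_t instruction) { *ptr_++ = instruction; }

  // MOVZ Xd, #imm16 (64-bit, no shift).
  void MovzX(uint32_t rd, uint32_t imm16) {
    Emit(0xD2800000u | ((imm16 & 0xFFFFu) << 5) | (rd & 31u));
  }

  // RET (returns via X30).
  void Ret() { Emit(0xD65F03C0u); }

  // Number of instructions written so far.
  std::size_t size() const { return static_cast<std::size_t>(ptr_ - begin_); }

 private:
  uint32_t* begin_;
  uint32_t* ptr_;
};
```

Emitting `MovzX(0, 42)` followed by `Ret()` into a two-word buffer produces the body of a function that returns 42 on an AArch64 host, once the buffer is made executable by whatever mechanism the caller prefers.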

There is an optional header <oaknut/code_block.hpp> that provides utilities for mmap-ing a region, but it is completely ignorable.

Updated documentation recently for reference. See also a simple inefficient fibonacci example on using labels.

I didn't use xbyak_aarch64 because of licensing reasons (did not want to include something Apache licensed in a 0BSD project), otherwise I would likely have used it.

I appreciate this is a very small part of a recompiler, just wanted to throw in my 2¢.

iMacker2020 commented 2 years ago

@merryhime Wow! You really improved the documentation. It looks great. Thank you. You made the documentation so good I don't know which project to support.

Razzile commented 2 years ago

I wonder if Project Volterra would be useful for this task, depending on what GPU it has

Wunkolo commented 2 years ago

> I wonder if Project Volterra would be useful for this task, depending on what GPU it has

The latest ARM-based Surface Pros and Project Volterra (which use the SQ series of chips) and the ThinkPad X13s all use the same Snapdragon 8cx Gen 3 SoC afaik. The ThinkPad X13s is also getting official Linux support from Lenovo.

These devices only provide DirectX drivers on Windows which lends itself perfectly to Xenia's current usage of DX12. There is an OpenGL compatibility pack that turns OpenGL calls into DirectX12 calls under the hood but for Vulkan you will have to utilize something like Dozen.

They all feature the Adreno 690 but may clock them differently.

Wunkolo commented 6 months ago

To help capture some compatibility info, here's some D3D12 info from my ThinkPad x13s. https://d3d12infodb.boolka.dev/?ID=110

Additionally I have some Vulkan info captured from their native Vulkan drivers here. https://vulkan.gpuinfo.org/displayreport.php?id=30457