Dudeplayz closed this issue 1 year ago
That's a good question!
One challenge with AssemblyScript is that the code can no longer be easily extended or modified from the JavaScript realm, and communication between JS and the AssemblyScript code can be quite costly, so we'll want to keep it to a minimum. For starters, I'd suggest trying to create a version of the demo project that works with AssemblyScript, so we can compare the performance.
If AssemblyScript does bring a notable performance improvement that can't be achieved by tuning the TS code, we can probably find a way to include an AssemblyScript binary in the releases, either in this repo or a dedicated repo. Once we have more data about the performance we can consider the best course of action.
This sounds like a good plan. I will take a look at it!
Thanks!
Another idea that I had would be to come up with an AVR → WebAssembly compiler, that is, to convert the raw AVR binary into WebAssembly code that does the same, so we don't have to pay the overhead of decoding each instruction as the program is executing, and perhaps the JIT will be able to do a better job at optimizing the generated code.
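As a purely illustrative sketch of that direction (the function name and the assumption that the AVR registers sit at the start of WASM linear memory are invented here, not anything existing in avr8js): each program word would be decoded once, at translation time, and emitted as equivalent WAT.

```typescript
// Hypothetical sketch of static AVR -> WAT translation. Decoding the
// opcode happens once, here, instead of on every executed instruction.
// Assumes the 32 AVR registers occupy addresses 0..31 of linear memory.
function translateAdd(opcode: number): string {
  // ADD encoding: 0000 11rd dddd rrrr
  const d = (opcode & 0x1f0) >> 4;
  const r = (opcode & 0xf) | ((opcode & 0x200) >> 5);
  return [
    `(i32.store8 (i32.const ${d})`,
    `  (i32.and (i32.add (i32.load8_u (i32.const ${d}))`,
    `                    (i32.load8_u (i32.const ${r})))`,
    `           (i32.const 255)))`,
  ].join('\n');
}
```

For example, `translateAdd(0x0c01)` emits the store for `ADD r0, r1`; the SREG flag computation (and everything else) is left out of this sketch.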
Oh wow. This sounds really interesting! The question would be, how to integrate the peripherals.
That's a good question. I'd imagine having a bitmap or so that will indicate which memory addresses are mapped to peripherals. Whenever you update a memory address, you'd check the bitmap. If there is a peripheral mapped to it, then you'd call a WebAssembly function that resembles the `writeData()` function that we currently have...
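A minimal sketch of that idea (all names here are hypothetical, not the actual avr8js API): one bit per address marks peripheral-mapped locations, so the common case stays a plain memory store.

```typescript
// Hypothetical sketch: one bit per data-memory address marks whether a
// peripheral is mapped there. The hot path (plain RAM write) costs only
// a single bit test; the peripheral bridge is called on the rare path.
const MEM_SIZE = 0x900; // example size, roughly the ATmega328p data space

const peripheralBitmap = new Uint8Array(MEM_SIZE >> 3);

function mapPeripheral(addr: number): void {
  peripheralBitmap[addr >> 3] |= 1 << (addr & 7);
}

// Stand-in for the writeData()-like bridge into the peripheral code.
type WriteHook = (addr: number, value: number) => void;

function memoryWrite(
  data: Uint8Array,
  addr: number,
  value: number,
  hook: WriteHook
): void {
  data[addr] = value;
  if (peripheralBitmap[addr >> 3] & (1 << (addr & 7))) {
    hook(addr, value); // only mapped addresses pay the call overhead
  }
}
```

In the WebAssembly scenario, `hook` would be the imported function crossing back into JS land, so the expensive JS↔WASM boundary stays off the hot path.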
That should work. Will the peripherals be in WebAssembly too? In that case, everything would run in WebAssembly, except the visuals, which have to stay in JavaScript. But then we again have the problem that all peripherals have to be converted to WebAssembly :/.
We could also mix-and-match. I believe stuff like the timers, which have to run constantly (in some cases after every CPU instruction or so), will have to live in WebAssembly land. But some other, less frequent peripherals could possibly be bridged over to JavaScript land.
There's definitely much to explore here...
The thing is, it always depends on the workload. So yes, it is really interesting, but we need to start exploring at some point. Maybe we will reach a point where we can say we are already faster than expected. On the whole, the CPU doing the simulation will always be a lot faster than the simulated MCU; the gain from optimization is mainly for slow end devices.
So, should we create a plan where to start with testing and exploring the possibilities each approach has?
Yes, and as you say, it's good to have some baseline to compare to.
Right now, the JS simulation has some things that can already be improved (e.g. the lookup table for instructions), and it runs pretty okay on modern hardware, achieving between 50% speed on mid-range mobile phones and 160%+ speed on higher-end laptops.
However, lower end devices (such as Raspberry Pi) only achieve simulation speed of 5% to 10%.
So there is definitely room for improvement, especially if we consider the use case of simulating more than one Arduino board at the same time (e.g. two boards communicating with each other).
Ok yes. For some playgrounds etc. it would be really fun and interesting to have multiple boards running at once. So we only need to improve it if this use case would otherwise be locked to higher-end devices. I also had some interesting ideas about simulating such boards with Node.js and other JavaScript runtimes.
I am actually not familiar with the benchmark code. Is it possible to run the benchmark under all the mentioned approaches, or do we need a new benchmark to make a meaningful comparison?
The current benchmark is pretty minimal - it runs a single-instruction program many, many times to compare different approaches for decoding.
I think a better benchmark would need:
I'd probably start with just the 1st, simpler benchmark, to get a feeling if the direction seem promising, and if it is, then we can devise a more extensive benchmark that will allow us to do a comprehensive comparison.
What do you think?
Starting with the 1st should be the best approach. Should I start looking at AssemblyScript or at WebAssembly directly? With WebAssembly directly, we also have to decide between AVR instruction translation and full interpretation (like the JavaScript one) in WebAssembly.
Maybe we can focus on a few of the most-needed assembly instructions, to reduce the number of initial instructions and focus the benchmark on those. That way we can see first results faster and then decide where to dig deeper, or whether there is already a clear winner?
I believe that a WebAssembly interpreter (written in C or Rust) wouldn't be much different from AssemblyScript, but it's pretty easy to write one or two instructions, as you suggest, and compare the generated WAT (WebAssembly Text) between the different implementations. Ideally, if there is no significant difference, using AssemblyScript means we can probably keep one code base, which is preferable.
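The shared source being compared can be as small as a single function. AssemblyScript uses TypeScript syntax with concrete WebAssembly value types; the `i32` alias below is only there so the identical snippet also runs as plain TypeScript.

```typescript
// AssemblyScript shares TypeScript's syntax; it mainly swaps `number`
// for WebAssembly value types such as i32. Aliasing i32 lets the same
// source compile as plain TypeScript too (AssemblyScript has it built in).
type i32 = number;

function add(a: i32, b: i32): i32 {
  return a + b;
}

add(2, 3); // → 5
```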
Here are some useful resources:
@gfeun (author of #19), you may find this discussion interesting :)
Hey, I've been summoned. Yeah, I'm following along :smile:
Yes, I agree. If there is no real breaking point that requires translating the TS code, we should stay with TS and use the AssemblyScript compiler.
The WebAssembly Studio looks great. I remember finding it before in my research. If you can wait, I will try to create a first comparison between the Rust, C and AssemblyScript approaches in the next 2 days. These languages are not my preferred ones, so I have to get familiar with the tooling 😅. I will try to create some sharable WebAssembly Studio workspaces and post them here. (And yes, the generated wasm code is visible.)
If we get to the point of translating the AVR binaries to WebAssembly, we can overcome the 30 WASM instructions. I think this project has some really interesting possibilities to get the best out of it. If we can get a good combination of all these possibilities, we get the opportunity to support a really wide range of end devices and possible runtimes! Also, 1000 MIPS is no problem for a modern CPU, and an RPi 4 could maybe run at something around 70-100% or even more (measured on some synthetic benchmark).
And hello @gfeun. Nice to meet you!
Yes, definitely, sounds like a good plan. It's going to be interesting :)
Hey, I've created 3 WebAssembly Studio (WAS) workspaces for C, Rust and AssemblyScript: C - https://webassembly.studio/?f=u627pgs1r5q Rust - https://webassembly.studio/?f=h80i9yrgjxa AssemblyScript - https://webassembly.studio/?f=j5dlh1kn5s
The code and folder structure is based on the empty template workspace for each language. I've tried to bring them all together in one workspace, but this is more complicated.
For a start, I have implemented the same basic functions in each of them, with an identical file structure. All `wat` files (WAS transforms them transparently to and from `wasm`) look basically the same, except for some different orderings and the funny point that Rust and C swap the variables for the `add` function.
Surprisingly, the AssemblyScript version is the smallest, measured by line count. Rust and C have the additional line `(table $T0 1 1 anyfunc)`, which tells me nothing because I currently can't understand the file scheme. There is also a difference between Rust and AssemblyScript in the line `(memory $memory (export "memory") 17))`, where AssemblyScript has `0` instead of `17`.
C also has a few lines more, which are maybe only meta information that the others leave out.
I know, the example functions are too simple to leave much room for the implementations to differ. So can you take a look at it and tell me what you think? What would be a good example function to implement to find out something more meaningful?
Currently, my opinion is that even if AssemblyScript turns out to have a serious performance gap at some point, we would always have the option to implement that specific feature in a different language.
(To see the `wasm`/`wat` code, run Build in the WebAssembly Studio.)
For reference, see https://developer.mozilla.org/en-US/docs/WebAssembly/Understanding_the_text_format.
Oh, wow. I just realized that all compilers have pre-evaluated the method calls in the `main` function 😅.
Yes, the compiler seems to be very good at optimizing: it inlines the functions and then also precalculates the result of the expression. Pretty smart!
So far, it seems like for basic arithmetic we get roughly the same number of opcodes. What I'd try next is to implement a complete opcode: define an array to hold the program data, and then implement an opcode that also reads and updates the data memory, such as `ADD` (the registers reside in the data memory, and it also updates the SREG register, so that would be a memory write).
Does that make sense?
I would say it makes sense 😄. I will try it first with the AssemblyScript version, doing some copy and paste of the original code. After that, I will convert it to the other two.
So for correct understanding:
Indeed, we'd need some basic opcode decoding to extract the target registers out of `ADD`. Also, the existing TS code can probably be used without too many changes...
Ok, I will do so. Should I create an additional repo/project for this testing code? I have also moved from WebAssembly Studio to IntelliJ (or any other IDE).
I had tonight another idea: Would it be possible to add some async command prefetching to reduce the time for opcode decoding?
Yes, we can either create a different repo (if the code is entirely different), or a new branch here, in case the code is still the same (or auto-generated from the current code, like I did with the benchmark).
As for IDE, that makes sense. I use VSCode for this repo, so if you open it with VSCode you should get a list of recommended extensions (prettier, eslint, etc).
What do you mean by async command prefetching?
I currently mean it only for the testing code, i.e. the example AssemblyScript, Rust and C code. For the "real" implementation, I would prefer a separate branch with the goal of bringing it to master.
I will try to get comfortable with VSCode for dev purposes ;D.
We recently discovered the problem with the big if-else statement. The idea would be to evaluate this statement for the next opcode asynchronously. But this requires bringing the code into a format where this is possible.
Yes, it makes sense to create a new repo for experiments. Would you prefer to have it under your github user or here (under the wokwi organization)?
I'm not sure how you'd go about evaluating the next opcode asynchronously; maybe I need to see an example to understand?
I think that because it is related to this project, it should stay under wokwi - also because I may copy some code over. And maybe there will be more experiments in the future.
Ok, I think I see the misunderstanding. I do not mean the full evaluation, only the "prefetching" of the next operation to execute. Currently, all operations live under their specific if-statement; a refactor into their own methods would be required. Then we could evaluate the lookup of the next command asynchronously and, for example, save the resolved function in a variable, so the "running" code only needs to call that function instead of evaluating the big if-else block first and executing afterwards. It would be something like a 2-stage command pipeline.
But this would only bring an improvement if the decoding stays the same. If the decoding is rebuilt with lookup tables or something else, this would bring only a (I think) small improvement.
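A rough sketch of that two-stage idea (all names invented; real AVR control flow such as jumps would invalidate the prefetch and is ignored here): the handler for the next opcode is looked up while the current one "executes", so the run loop never re-enters the big if-else.

```typescript
// Sketch of a 2-stage pipeline: stage 1 resolves the next opcode to its
// handler ahead of time; stage 2 just calls the prefetched handler.
type InstructionFn = (opcode: number) => void;

const trace: number[] = []; // records executed opcodes, for illustration

// Stand-in for the big if-else decoder: maps an opcode to a handler.
function decode(opcode: number): InstructionFn {
  return (op) => trace.push(op);
}

function run(progMem: Uint16Array, steps: number): void {
  let pc = 0;
  let next: InstructionFn = decode(progMem[pc]); // stage 1: prefetch
  for (let i = 0; i < steps; i++) {
    const execute = next;
    const opcode = progMem[pc];
    pc = (pc + 1) % progMem.length;
    next = decode(progMem[pc]); // decode the following opcode ahead of time
    execute(opcode);            // stage 2: run the prefetched handler
  }
}
```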
That makes sense! What would you like to call the new repository?
Refactoring the code into separate functions is already done by the benchmarking code: if you run `npm run benchmark:prepare`, it creates a new `instructions-fn.ts` file that contains one function per instruction and a lookup table at the end. Here is an excerpt from that file:
export function instADD(cpu: ICPU, opcode: number) {
  /* 0000 11rd dddd rrrr */
  const d = cpu.data[(opcode & 0x1f0) >> 4]; // destination register value
  const r = cpu.data[(opcode & 0xf) | ((opcode & 0x200) >> 5)]; // source register value
  const R = (d + r) & 255; // 8-bit result
  cpu.data[(opcode & 0x1f0) >> 4] = R;
  let sreg = cpu.data[95] & 0xc0; // SREG lives at data[95]; keep I and T flags
  sreg |= R ? 0 : 2; // Z: zero flag
  sreg |= 128 & R ? 4 : 0; // N: negative flag
  sreg |= (R ^ r) & (R ^ d) & 128 ? 8 : 0; // V: two's complement overflow
  sreg |= ((sreg >> 2) & 1) ^ ((sreg >> 3) & 1) ? 0x10 : 0; // S: N xor V
  sreg |= (d + r) & 256 ? 1 : 0; // C: carry flag
  sreg |= 1 & ((d & r) | (r & ~R) | (~R & d)) ? 0x20 : 0; // H: half carry
  cpu.data[95] = sreg;
  cpu.cycles++;
  if (++cpu.pc >= cpu.progMem.length) {
    cpu.pc = 0;
  }
}
Good question :D. Something like WebAssembly-(Test/Comparison/Examples), or something more abstract and related to avr8js? Do you have any suggestions?
Ah ok, that looks great. So is there already a lookup table now? Did you refactor this by hand, or is there a script that does it for you?
How about `avr8js-perf`, `avr8js-wasm`, or `avr8js-research`?
Yes, this script does the magic, a lot of text parsing :-)
There is also a preliminary implementation of the lookup table, I just never got around to measuring how it performs and integrating this into the main code.
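For illustration, the lookup-table direction could look roughly like this (a simplified sketch, not the actual generated table; the SREG flag update is omitted): every program word is resolved to its handler once, at load time, so stepping becomes a single array index instead of an if-else cascade.

```typescript
// Minimal sketch of the lookup-table idea (hypothetical, simplified).
interface MiniCPU {
  data: Uint8Array;
  cycles: number;
}
type InstructionFn = (cpu: MiniCPU, opcode: number) => void;

function instNOP(cpu: MiniCPU): void {
  cpu.cycles++;
}

// ADD matches the bit pattern 0000 11rd dddd rrrr,
// i.e. the top six opcode bits are 000011.
function isADD(opcode: number): boolean {
  return (opcode & 0xfc00) === 0x0c00;
}

// Simplified ADD: updates the destination register, skips the SREG flags.
function instADD(cpu: MiniCPU, opcode: number): void {
  const d = (opcode & 0x1f0) >> 4;
  const r = (opcode & 0xf) | ((opcode & 0x200) >> 5);
  cpu.data[d] = (cpu.data[d] + cpu.data[r]) & 255;
  cpu.cycles++;
}

// Resolve each program word to its handler once, up front.
function buildLookup(progMem: Uint16Array): InstructionFn[] {
  return Array.from(progMem, (opcode) => (isADD(opcode) ? instADD : instNOP));
}
```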
`avr8js-wasm` and `avr8js-research` both sound great, but I would prefer research, because `avr8js-wasm` sounds like an alternative implementation of the current avr8js; for such an implementation, `avr8js-wasm` would be the better match.
Ok, nice. Genius to solve it this way!
So the baseline is standing.
Let's start with `avr8js-research` then; we can always rename it later if we decide to.
You can join the new repo here: https://github.com/wokwi/avr8js-research/invitations I made it public, is that ok?
Yes, it's absolutely fine 👍.
Should I try to get all three languages into one project? Then additional comparison code would be possible.
Awesome! I'd suggest starting by establishing the baseline in AssemblyScript, and then, if we feel like there's room for improvement, we'll see how to add either Rust or C++, so we don't have the overhead of setting up a project that supports all the languages right now.
Yes, that was my intention. I'm currently trying to get the code exported from WebAssembly Studio running in my IDE, but there are some pitfalls. I will try to set up a clean and empty project. The reason for keeping the WebAssembly Studio overhead was to be able to share it there, but with this repo I think it can be safely dropped.
Yes, I think the main advantage of the Web Assembly Studio is that you can get quick results, just like you did, and then, once you decide to go in a specific direction, spend the time needed for the local setup.
On a different topic, I don't know if you noticed, but we have a small AVR assembler in this repo, which can be quite useful when writing small programs for the benchmark. I originally created it for the TWI interface tests, so I could test the example code from the datasheet and quickly tweak the program under test without having to manually translate the instructions...
Yes, I agree.
I noticed it, but was wondering what its purpose is. So the assembler is just a simple compiler? I will take a look at it. For starting with the simple `ADD` instruction, it should be too much :D.
I have made an initial commit to the repo. The question now is: how do we want to run the tests? With some visual output, or only with Node?
Yes, the assembler just translates the instructions from something readable into AVR machine code. I also started using it for the single instruction tests, as it makes things more readable.
I don't think visual output is a requirement right now. As long as we have the numbers, we can always put them in Excel later and produce some visuals if we want.
Yes, this sounds reasonable.
Ok, so I will set up a node environment.
Here is an impressive demo of some WebAssembly compilations. If you turn up the passes, you will see that AssemblyScript performs the best of all!
Where do we want to track incompatibilities and experiences with AssemblyScript? For the incompatibilities, we will later need separate issues to fix them. We could also track them in the research project, either with separate issues or, for a start, with a collection issue.
What do you mean by incompatibilities?
I have already found some things which do not work with the current implementation, for example the explicit declaration of types, and the problem that interfaces are not handled correctly, e.g. the ICPU in CPU.
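To illustrate the kind of change that implies (a simplified, hypothetical example, not the actual avr8js classes): AssemblyScript does not support TypeScript interfaces, so code typed against an `ICPU` interface has to target a concrete class instead.

```typescript
// Hypothetical, simplified: the interface-typed parameter becomes a
// concrete class so the same code compiles under AssemblyScript.
class CPU {
  data: Uint8Array;
  pc = 0;
  cycles = 0;
  constructor(sramBytes: number) {
    this.data = new Uint8Array(sramBytes);
  }
}

// Instruction handlers then accept the class instead of the interface:
function instNOP(cpu: CPU): void {
  cpu.cycles++;
  cpu.pc++;
}
```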
I see. Let's have a different branch that makes the necessary changes, so we can later turn it into a PR if we decide to go down this path.
Sounds good. I will continue with benchmarking and testing before we change anything in the project. Now, back to the question of where to store these experiences?
By experiences, you mean a log of your findings? If so, I think a GitHub issue can be a good place, or we can also use the GitHub Wiki for that.
Yes, correct. In this repo or the other?
Hey Uri, we have already spoken about the possibilities of speeding up the simulation. I am interested in adding support for AssemblyScript. I have followed the recent discussions and speed-ups. The question is whether you still think it can bring performance gains, even after the enhancements that were made?