zuloloxi / mecrisp-ice

http://mecrisp.sourceforge.net/ Mecrisp-Ice is an enhanced version of Swapforth and the J1a stack processor by James Bowman, featuring three MSP430 style IO ports, a tick counter, constant folding, inlining and tail-call optimisations
BSD 3-Clause "New" or "Revised" License

Mecrisp Peripherals? #5

Open PythonLinks opened 11 months ago

PythonLinks commented 11 months ago

The irresistible draw of the Mecrisp family is the large installed base, and the promise that many of the peripherals already implemented will also work on Mecrisp-Ice. The problem is that I do not know what will work on Mecrisp-Ice and what will not. I do not know which Mecrisp libraries depend on C routines or on specific microcontroller hardware. Are the different flavors of Mecrisp Forth really compatible? A list of which peripherals actually work on this FPGA would be a killer elevator pitch!

I recently returned to graduate school, and my master's thesis will be on a Forth processor. Step 1 was to review what is out there, and Mecrisp-Ice looks very promising. Here are the slides for a talk on "Review of Forth Processors", which I will be presenting at the August SVFIG meeting and at the September FPGA World Conference in Stockholm.
https://pythonlinks.info/presentations/ForthPresentation.pdf

If you had a list of what peripherals work with Mecrisp-Ice I would include it, and that would be a big reason for people to use Mecrisp-Ice.

Even better would be a list of peripherals which could connect to Mecrisp-Ice, along with the size of the Mecrisp solution and the size of the dedicated hardware solution. I suspect that there are a number of FPGA cores which could be made smaller or more flexible with a Mecrisp Forth core.

Mecrisp commented 11 months ago

On the list of notable peripherals, these vary by target, and include:

LCD with text mode in logic, and atomic clear/set/invert on GPIO pins:

mecrisp-ice-2.6c/mch2022/icestorm/mch2022.v

Configurable interrupts on GPIO pins:

mecrisp-ice-2.6c/icebreaker/icestorm/j1a.v

USB terminal, taken from https://github.com/ulixxe/usb_cdc

mecrisp-ice-2.6c/fomu/icestorm/j1a.v
mecrisp-ice-2.6c/common-verilog/usb_cdc

FOMU also comes with Ledcomm, a bidirectional optical terminal with handshake over a single LED, nice for adding a debug connection somewhere.

mecrisp-ice-2.6c/fomu-ledcomm/icestorm/j1a.v
mecrisp-ice-2.6c/common-verilog/ledcommflow.v

One could use VGA text mode as featured in Mecrisp-Quintus (RISC-V) together with Mecrisp-Ice, similar to the LCD code for the MCH2022 badge, but the Nandland Go HX1K FPGA just does not have enough BRAM memory for Mecrisp-Ice and VGA together:

mecrisp-quintus-1.0.5/nandland/nandland-verilog/nandland.v

Yes, I follow the issues, but by the way, the zuloloxi/mecrisp-ice repository is not the upstream branch of Mecrisp-Ice. Find the official packages here: https://sourceforge.net/projects/mecrisp/files/ We are at Mecrisp-Ice 2.6c at the moment.

On the slides... I forgot to update the passage in the README. I would not recommend Nandland Go anymore, because it is a very small FPGA, and everything is a sizecode challenge on that. My current recommendation is to use the Icebreaker (with UP5K FPGA).

PythonLinks commented 11 months ago

Thanks for the great answer. Sadly currently the Icebreaker is sold out. https://1bitsquared.com/collections/fpga/products/icebreaker

I am still digging through your answer. I think part of the problem is that I was not clear in my own mind. Many of your links were .v files. For someone interested in Mecrisp-Ice, connections to various hardware circuitry are critical.

I have a different perspective. If someone wants to use software in Forth for bit-banging control of peripherals, what they are looking for is the software which runs on Mecrisp-Ice cores. Any FPGA designer can find IP cores for I2C or UART. What would be very interesting to them is whether they could do that on a Forth core, and either save space or get more functionality.

Mecrisp commented 11 months ago

I understand your desire - I usually hook up new peripherals to GPIO lines for bit-banging while doing experiments and determining my requirements before I rewrite it in Verilog for exact timing. The prototypes are just hacked together Forth files that I drop as soon as the Verilog one works as expected, with Forth doing the high-level or non-critical stuff, gluing the action of the peripherals together. That is the point of an FPGA with a softcore.

Because of the lack of gates and memory, I bit-banged a valid VGA signal on Nandland Go, with a 16x16 resolution. Bit-banging video signals is hard when the CPU runs at the pixel clock but needs more than one cycle to generate each pixel, and RAM is very short.

mecrisp-ice-2.6c/nandland/examples/vga.fs

More useful is a mixed design with low-level parts reading an analog-digital converter in Verilog and all the other parts done in Forth:

mecrisp-ice-2.6c/ulx3s/signallab/capture.txt
mecrisp-ice-2.6c/ulx3s/signallab/asciisignal.txt
mecrisp-ice-2.6c/ulx3s/signallab/live.txt

Also see bit-banged access to a SD card:

mecrisp-ice-2.6c/ulx3s/examples/sdcard.fs

Technically you just put nicely choreographed values into GPIO registers from within Forth.
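To make "choreographed values into GPIO registers" concrete, here is a minimal Python simulation of bit-banging one SPI byte (mode 0, MSB first), the kind of transfer an SD card accepts in SPI mode. It is only a sketch: the pin masks, the `GpioPort` and `SpiDevice` classes and the function name are hypothetical stand-ins, not the register layout or the code used in sdcard.fs.

```python
# Hypothetical pin masks; real Mecrisp-Ice port layouts differ per board.
SCK, MOSI, MISO = 0x01, 0x02, 0x04


class SpiDevice:
    """Simulated SPI mode 0 slave: samples MOSI on rising SCK, shifts on falling SCK."""

    def __init__(self, reply):
        self.shift_out = reply   # byte the device answers with, MSB first
        self.received = 0
        self.prev_sck = 0

    def miso(self):
        return (self.shift_out >> 7) & 1

    def on_write(self, value):
        sck = 1 if value & SCK else 0
        if sck and not self.prev_sck:        # rising edge: sample MOSI
            bit = 1 if value & MOSI else 0
            self.received = ((self.received << 1) | bit) & 0xFF
        if not sck and self.prev_sck:        # falling edge: present next bit
            self.shift_out = (self.shift_out << 1) & 0xFF
        self.prev_sck = sck


class GpioPort:
    """Stand-in for one output register and one input register."""

    def __init__(self, device):
        self.device = device
        self.writes = []                     # every value written, in order

    def write(self, value):
        self.writes.append(value)
        self.device.on_write(value)

    def read(self):
        return MISO if self.device.miso() else 0


def spi_transfer(port, byte):
    """Send one byte and return the byte clocked in at the same time."""
    result = 0
    for bit in range(7, -1, -1):
        level = MOSI if (byte >> bit) & 1 else 0
        port.write(level)                    # data valid, clock low
        port.write(level | SCK)              # rising edge: device samples
        result = (result << 1) | (1 if port.read() & MISO else 0)
        port.write(level)                    # falling edge: device shifts
    return result
```

Each byte costs three register writes and one read per bit, which is why the exact timing depends entirely on how fast the core executes the loop.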

By the way, I am curious about how you are going to implement communication between multiple cores, as the Mecrisp-Ice processor does not feature wait/busy signaling.

Both the Transputer and the Greenarrays GA144 provide nice features for interprocessor communication, have a look at these! I love the idea of pointing the CPU to a peripheral port on reset that does handshaking in logic, and disables the address increment, so one just feeds an instruction stream into a port, and concludes with a jump instruction into the freshly loaded RAM area. But this obviously requires memory mapped IO and wait/busy signaling.

Mecrisp commented 11 months ago

PS: If you cannot find an Icebreaker for a first start, try getting the ECP5-based ULX3S board from Radiona. That one will be large enough to support multiple instances of Mecrisp-Ice.

Mecrisp commented 11 months ago

PS2: The Nandland Go itself is capable of up to 800x600 video output... But then Forth does not fit into BRAM anymore: https://github.com/Mecrisp/Nandland-RISC-V/tree/main/800x600

PythonLinks commented 11 months ago

More useful is a mixed design with low-level parts reading an analog-digital converter in Verilog and all the other parts done in Forth:

Thank you. As a newbie to digital design that is just the kind of advice I need. Yes, bit-banging sounds like a great idea, but a more careful study suggests that it is subject to glitches. So dedicated hardware makes more sense. And so your previous response with Verilog links also makes more sense.

And maybe that explains why the GA144 had communication problems. It tried to do bit banging. I think you have it right. Use dedicated circuits, working closely with a higher level processor. I am converted.

Thank you for the links to files. The three files look really interesting. I will be digging into them. Just the kind of examples I need. Where does one submit pull requests? The URL mecrisp-ice-2.6c does not seem to work.

The Transputer, cooperating sequential processes, coroutines and pauses feature strongly in my thinking. I will look more closely at the Transputer and the GA144. I like how AXI does handshaking: both the sender and the receiver have to be ready. Sure, I need to make some changes to your CPU. The reputation of the J1 is that it is not many lines of code and quite easy to customize. I have already read the Verilog code very carefully. It should be easy to add a pause. Both the microcore and the Core-1 have pauses.
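The "both sides have to be ready" rule can be sketched in a few lines of Python. This is a cycle-by-cycle toy model, not AXI itself: a word moves only in a cycle where the sender asserts valid and the receiver asserts ready. The function name and the sets of busy cycles are assumptions made for the illustration.

```python
def simulate(words, sender_busy, receiver_busy):
    """Move `words` across a valid/ready link, one clock cycle per loop pass.

    sender_busy / receiver_busy are finite sets of cycle numbers in which
    that side is stalled (valid low or ready low).
    Returns the received words and the number of cycles used.
    """
    pending = list(words)
    received = []
    cycle = 0
    while pending:
        valid = cycle not in sender_busy
        ready = cycle not in receiver_busy
        if valid and ready:                  # the only condition for a transfer
            received.append(pending.pop(0))
        cycle += 1
    return received, cycle
```

Stalls on either side only cost cycles and never lose data, which is the property a pause input on the core would need to preserve.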

I love the idea of pointing the CPU to a peripheral port on reset that does handshaking in logic, and disables the address increment, so one just feeds an instruction stream into a port, and concludes with a jump instruction into the freshly loaded RAM area. But this obviously requires memory mapped IO and wait/busy signalling.

Thank you. I had no idea how the details worked. But I am not quite sure why memory mapped IO would be needed. The J1 can read and write to the TOS, NOS, return stack, and memory. It should be easy to create another port and have it read from there.

As for the GA144, I was quite inspired by it. But too little memory, and reportedly communication problems, which I have yet to fully understand.

As for the GA144 matrix, the lesson from the leader of the Core-1 group is that these are FPGAs. No need to do a fixed matrix of cores, just connect them however your application wants to connect them.

Mecrisp commented 11 months ago

Thank you. As a newbie to digital design that is just the kind of advice I need. Yes bit-banging sounds like a great idea, but a more careful study suggests that it is subject to glitches. So dedicated hardware makes more sense. And so your previous response with Verilog links also makes more sense.

A well designed GPIO port for bit-banging does not glitch either, that isn't the problem. It is all about timing in an FPGA. See, you can easily bit-bang one analog-digital converter or handle communication over RS232, but what about handling a terminal, five ADCs and three DACs with a stable sample rate generated by a numerically controlled oscillator for signal processing applications? You need either wizard-level programming skills on a microcontroller (and hope that the requirements never change), or a bit of additional logic that simply generates the control signals and puts all the data into ring buffers for a Forth program to handle without the need for exact timing, as long as it is fast enough in total to handle all buffers without over- and underflows.
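The ring buffer argument can be illustrated with a small Python model. The producer stands in for the fixed-function logic that delivers one sample per tick, and the consumer stands in for a Forth program that only wakes up now and then and drains what has accumulated. All names and numbers here are assumptions chosen for the sketch.

```python
class RingBuffer:
    def __init__(self, size):
        self.data = [0] * size
        self.size = size
        self.head = 0        # next write position (producer: the logic)
        self.tail = 0        # next read position (consumer: the program)
        self.count = 0
        self.overflows = 0

    def push(self, sample):
        if self.count == self.size:
            self.overflows += 1      # consumer was too slow, sample is lost
            return False
        self.data[self.head] = sample
        self.head = (self.head + 1) % self.size
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None              # underflow: nothing to read yet
        sample = self.data[self.tail]
        self.tail = (self.tail + 1) % self.size
        self.count -= 1
        return sample


def run(total_ticks, wake_every, drain_max, size):
    """One sample arrives per tick; the consumer wakes every `wake_every` ticks."""
    buf = RingBuffer(size)
    consumed = []
    for tick in range(total_ticks):
        buf.push(tick)
        if (tick + 1) % wake_every == 0:
            for _ in range(drain_max):
                sample = buf.pop()
                if sample is None:
                    break
                consumed.append(sample)
    return consumed, buf.overflows
```

With a buffer of 8 entries a consumer that wakes every 8 ticks loses nothing, while the same consumer with a buffer of 4 entries drops half the samples: the program needs no exact timing, only enough throughput for the buffer depth.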

Most FPGA designs therefore combine low-level, fixed function logic specific for driving the electronics that one soldered on, and a softcore processor that orchestrates these parts and handles user-configurable settings and choices.

Yes, it is possible to have multiple processor cores, each bit-banging its own peripheral, but this takes far more resources than a counter, a few gates and a buffer. Why do that if you are already using an FPGA that can implement any logic you wish? The idea of multiple processors becomes more interesting in an ASIC, where users cannot add their own logic. A very good example is the Parallax Propeller! This is a multicore microcontroller (look at both the old and the new one) with a really clever mechanism for communication between cores, which are intended precisely for bit-banging as many peripherals as you have cores available. There are many great bit-bang examples for the Propeller, like Ethernet, S/PDIF audio, RFID...

The J4A (the barrel Forth processor by James Bowman) steps in a similar direction. These designs are very useful if you cannot simply load a new bitstream if you want to change the functionality available in logic.

Also interesting: See the RP2040 microcontroller, which has small programmable IO helper state machines with buffers (called "PIO", programmable IO) to offload timing-critical tasks from the two main processor cores. These are all variations on the theme.

Thank you. I had no idea how the details worked. But I am not quite sure why memory mapped IO would be needed. The J1 can read and write to the TOS, NOS, return stack, and from memory. Should be easy to create another port, and have it read from there.

Yes, that's perfectly possible. Then you have initialised memory with a classic bootloader that runs on the core on boot and takes a memory image from whatever means of communication you choose. This is what the Transputer does. See chapter 2.10 in http://transputer.net/iset/isbn-063201689-2/inside.pdf

What I tried to describe is the idea used in the GA144 of directly executing a memory mapped IO port, without copying the instructions to RAM first. Then the core does not need initialised memory (expensive in ASIC), and logic for that is minimal - disable incrementing program counter when executing address in IO range, and the port needs to provide handshake, stopping the core until the next instruction arrives at the port.

https://mschuldt.github.io/www.colorforth.com/b900.htm
https://mschuldt.github.io/www.colorforth.com/ef.htm
https://mschuldt.github.io/www.colorforth.com/etherCode.htm
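The port-execution idea can be modelled with a toy core in Python. This is a sketch under loud assumptions: the instruction set (`store`, `emit`, `jump`, `halt`), the `PORT` address and the `run` function are invented for the illustration and are neither the GA144 nor the J1a instruction set. What it does show is the mechanism described above: the program counter starts at the port, is not incremented while it points there, and the stream ends with a jump into the RAM it has just filled.

```python
from collections import deque

PORT = 0x100   # hypothetical address of the boot port, outside RAM


def run(stream, ram_size=16):
    """Boot a toy core from an instruction stream arriving at a port."""
    ram = [None] * ram_size        # RAM starts uninitialised
    port = deque(stream)
    pc = PORT                      # reset vector points at the port
    trace = []
    while True:
        if pc == PORT:
            if not port:
                # Real hardware would stall here until the next word arrives.
                raise RuntimeError("core stalled waiting for the port")
            instr = port.popleft() # pc stays at PORT: no increment in IO range
        else:
            instr = ram[pc]
            pc += 1
        op = instr[0]
        if op == "store":          # write an instruction into RAM
            _, addr, value = instr
            ram[addr] = value
        elif op == "jump":
            pc = instr[1]
        elif op == "emit":         # stand-in for any visible side effect
            trace.append(instr[1])
        elif op == "halt":
            return trace


boot_stream = [
    ("store", 0, ("emit", "hello")),
    ("store", 1, ("emit", "world")),
    ("store", 2, ("halt",)),
    ("jump", 0),                   # leave the port, run the loaded code
]
```

No bootloader lives in RAM beforehand; the sender effectively is the bootloader, which is why the core needs no initialised memory.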

As for the GA144, I was quite inspired by it. But too little memory, and reportedly communication problems, which I have yet to fully understand.

As for the GA144 matrix, the lesson from the leader of the Core-1 group is that these are FPGAs. No need to do a fixed matrix of cores, just connect them however your application wants to connect them.

I have not used a real GA144, but only studied its datasheets. In my opinion, the problem of the GA144 is that the individual core memories are too small to run common algorithms that are not really suited for parallelisation, and there are too few cores to think of these at the (multiple) gate level. It is somewhere in between a multicore microcontroller and programmable logic. While its capabilities are outstanding, getting it to work is solving a complicated puzzle. In addition, there are many perfectly valid but oddball choices everywhere that one needs to wrap the mind around. Toolchain research would be necessary to fully unlock its potential. My recommendation: Study it in detail for very interesting ideas!

Thank you for the links to files. The three files look really interesting. I will be digging into them. Just the kind of examples I need. Where does one submit pull requests? The URL mecrisp-ice-2.6c does not seem to work.

These are relative paths into the upstream releases you can get in the download section at mecrisp.sourceforge.net and extract on your local storage. For the equivalent of a pull request, send me an E-Mail to the official address m-atthias@users.sf.net with the files to be included. If you want to discuss your changes first, go here: https://sourceforge.net/p/mecrisp/discussion/general/

PythonLinks commented 11 months ago

Thank you for the brilliant GA144 and Transputer links. And the most needed lattice board link. You went way beyond providing Mecrisp tech support and went a long way towards supporting my project. Most appreciated. There are so few people who have any idea at all about what I am talking about. Let alone who can provide me with good technical guidance.

So here is the million dollar question. Do you know of any good applications for 100s of Forth cores?
Chris

Mecrisp commented 11 months ago

I had to spin my head on this for a while.

For high performance parallel scientific computing, one needs shared memory and/or extremely high performance crosslinks - see "NUMAlink" for example. The Mecrisp-Ice cores are not suited by design for large and/or shared memories, so high performance computing is out.

For peak IO performance, dedicated logic is hard to beat.

We still have applications on the table that are complicated to drive, need reconfigurability on-the-fly, require data processing that does not need a lot of data exchange with neighbour cores, or benefit from computation-in-fabric.

Ideas:

PythonLinks commented 11 months ago

Particle accelerator detectors, in which many, many sensors need to be monitored and realtime data needs to be synced and compressed with fancy algorithms before it can be transferred to data storage

Agreed, although physics is not that big a market. Basically real-time control, where every I/O gets a dedicated tiny processor and no interrupts. That makes life easy for the programmer, and timing becomes simple. This will be my second CPU, the Hana II: a 3 x 3 grid, with one supervisory computer in the middle talking to all of the others. The others can talk to their two neighbors, but mostly just to their dedicated I/O ports.

The obvious competition is the Parallax Propeller. It only has 8 CPUs, and while Forth is in the Propeller's ROM, each core only gets 1024 x 32 bits of dedicated memory.
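One way to read the 3 x 3 layout is as a supervisor in the centre plus a ring of eight outer cores. The Python sketch below builds that link table. The ring interpretation of "their two neighbors" is an assumption, as is the function name.

```python
def hana_grid_links():
    """Return {cell: set of cells it can talk to} for a 3 x 3 grid."""
    # Outer cells listed clockwise, so consecutive entries are adjacent.
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    center = (1, 1)
    links = {center: set(ring)}            # supervisor talks to everyone
    for i, cell in enumerate(ring):
        before = ring[i - 1]               # index -1 wraps around the ring
        after = ring[(i + 1) % len(ring)]
        links[cell] = {center, before, after}
    return links
```

Under this reading the supervisor needs eight link ports while every outer core needs three, so the centre core is the one whose port count grows with the grid.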

Special purpose image sensors, maybe for hyperspectral imaging

Totally agreed. But it needs to be real time to justify an FPGA. I was only thinking of RGB images. Anything where computation is on a plane, so that the cores only have to talk to their neighbors is a great place to be. And for demos, nothing beats vision. Should be quite doable to do a 20x20 grid. Each core can do convolution algorithms at the same time. With the Altera boards, I can get 444 hard core multipliers, so I could even do a 21 x 21 grid. 441 cores.
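As a rough illustration of computation on a plane where cores only talk to their neighbors, the following Python sketch lets every grid cell compute a 4-neighbour Laplacian from its own value and the values next to it. The kernel choice, the border handling and the function name are assumptions; on real hardware each cell would be one core running this step in parallel.

```python
def laplacian_step(grid):
    """Each cell computes (sum of its 4 neighbours) - 4 * (own value)."""
    rows, cols = len(grid), len(grid[0])

    def at(r, c, fallback):
        # A border cell has no neighbour on that side and reuses its own value.
        if 0 <= r < rows and 0 <= c < cols:
            return grid[r][c]
        return fallback

    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            here = grid[r][c]
            neighbours = (at(r - 1, c, here) + at(r + 1, c, here)
                          + at(r, c - 1, here) + at(r, c + 1, here))
            row.append(neighbours - 4 * here)   # one multiply per core
        out.append(row)
    return out


# The multiplier budget mentioned above: a 21 x 21 grid with one multiplier each.
CORES = 21 * 21
```

A flat region yields zeros everywhere, while a vertical step between two columns yields a positive and a negative response on either side of the edge.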

I used to be vision impaired, and once tried edge-detection glasses. A great idea, but the CV libraries do not work that well. It is quite easy to do edge detection by recognizing regions of the same color. Basically, map from RGB to polar coordinates (amplitude and angles), strip out the amplitude, and you can segment the world by color. That goes a very long way towards providing big blobs for vision-impaired people to get around. And if not, someone else will figure out some other vision algorithm. The processors can interrupt their neighbors once they have figured out something they need to share, like a sharp edge going in the direction of the neighbor.

Vision impaired is not a big market, but a very poorly served market, and any niche is enough to get one's foot in the door.

Swarm simulations, multi-agent behavioral models in bioinformatics.
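A minimal Python sketch of the color idea, assuming that the "angle" can be approximated by the hue of the HSV color model from the standard `colorsys` module. Brightness is discarded, and nearly grey pixels get no label because their hue is meaningless. The bucket count, the saturation threshold and the function names are assumptions.

```python
import colorsys


def hue_bucket(rgb, buckets=6, min_saturation=0.2):
    """Map an (r, g, b) pixel with 0..255 channels to a hue bucket, or None if grey."""
    r, g, b = (channel / 255.0 for channel in rgb)
    hue, saturation, _value = colorsys.rgb_to_hsv(r, g, b)   # _value is dropped
    if saturation < min_saturation:
        return None
    # Round to the nearest bucket centre; the hue circle wraps around.
    return int(hue * buckets + 0.5) % buckets


def segment(image, buckets=6):
    """Label every pixel of a 2D list of RGB tuples by hue bucket."""
    return [[hue_bucket(pixel, buckets) for pixel in row] for row in image]
```

Bright red and dark red land in the same bucket, so a shadow across a red object does not split it into two blobs, and an edge is simply a place where neighbouring labels differ.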

And of course that is the other one I came up with. The actor model, lots of actors interacting with each other, even if it is not nearest neighbors. The conventional microprocessors do not pass messages to each other quickly, and furthermore they have a cache coherency problem, when many are operating on the same data. I wish I knew how GoLang dealt with all of this.

I am not quite sure how to do the networking on such a device. Maybe a hypercube? It would need something like a DNS server and a way to shovel actors back and forth from DDRAM.
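For reference, hypercube addressing needs very little logic. In a d-dimensional hypercube each node number is d bits wide, neighbours differ in exactly one bit, and a message reaches its target by fixing the differing bits one at a time. The Python sketch below shows that textbook scheme; the function names are assumptions, and it says nothing about how actors would be moved to and from external memory.

```python
def neighbours(node, dimensions):
    """All nodes whose number differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def route(source, target):
    """Dimension-order routing: flip differing bits from lowest to highest."""
    path = [source]
    node = source
    diff = source ^ target
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit
            path.append(node)
        diff >>= 1
        bit += 1
    return path
```

The hop count equals the number of differing address bits, so 256 cores would need 8 links per core and at most 8 hops between any two of them.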

In any case it is such a pleasure to be able to chat with people who have good instincts about all of these issues.

Mecrisp commented 11 months ago

It happens that I am a biophysicist, so applications in physics are my natural habitat :-)

Yeah, on how and in which pattern to link cores to each other... That is a research question. The solution of the Parallax Propeller does not scale beyond a handful of cores. You could take inspiration from the Transputer manuals or the early Cray designs. Another source of inspiration might be wireless mesh networking. Of course, you do not have to handle moving nodes with unreliable connections, but the routing algorithms are worth a look. Maybe https://en.wikipedia.org/wiki/B.A.T.M.A.N. is a starting point, and you will find active research in that area.

I think your first task is to implement the basic point-to-point link between two Mecrisp-Ice cores. Experiment a little with it, see how it integrates into Forth, make sure it works under all circumstances, and find an elegant way of handling the link in a handful of definitions. Keep in mind that every core will have more than one link later. I am interested in this, and I am happy to discuss link implementation with you.

Then scale up, arrange a handful of cores in, let's randomly say, a hexagon pattern, and start thinking about routing. While you are at the routing part, you will discover which link topology fits your chosen application best.

FPGAs for example have links to neighbours, and links further down. Lattice ECP5 comes with rectangular grid, with span 1 (neighbour), span 2, and span 6 wired: https://prjtrellis.readthedocs.io/en/latest/architecture/general_routing.html
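To see what longer spans buy when the same idea is applied to links between cores, the Python sketch below counts the minimum number of hops needed to cover a distance along one axis using links of span 1, 2 and 6. It is only an analogy under stated assumptions: it considers forward steps only and is not a description of how the ECP5 routing fabric or its tools work.

```python
def min_hops(distance, spans=(1, 2, 6)):
    """Fewest forward hops that add up to `distance` (spans must include 1)."""
    best = [0] + [None] * distance
    for d in range(1, distance + 1):
        options = [best[d - s] for s in spans
                   if s <= d and best[d - s] is not None]
        best[d] = min(options) + 1
    return best[distance]
```

With neighbour links alone a distance of 12 costs 12 hops; with the three spans it costs 2, at the price of extra wiring per core.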