System pseudo module [DRAFT]

zaskar9 / oberon-lang

An LLVM frontend for the Oberon programming language

MIT License

20 stars 3 forks source link

System pseudo module [DRAFT] #16

Closed tenko closed 8 months ago

tenko commented 8 months ago

Minimal implementation of the SYSTEM module described in the Oberon-07 language report.

(NOTE : This used to work OK, until the last changes with removal of the STRING type.)

SYSTEM.ADR, SYSTEM.GET, SYSTEM.PUT and SYSTEM.COPY currently implemented. SYSTEM.VAL I am not sure is possible to implement currently due to the TYPE argument expected. Same for the pervasive procedures MIN & MAX.

This is marked [DRAFT] as I suspect this could be integrated in a better way.

zaskar9 commented 8 months ago

Hey, thanks for working on this! The code looks mostly fine to me as a prototype how this could be implemented. I have two remarks and one question.

I feel like it should be possible to build predefined modules without having to treat them explicitly in Sema and CodeGen (e.g., if (name == "SYSTEM") { ... }) by actually creating an instance of ModuleNode and dropping that into the symbol table.
It would be good to have unit tests for this functionality that demonstrates what works and what doesn't work (but should work). I'm not sure that I understand how the removal of the made-up type STRING would have affected these procedures. But I could imagine that it has something to do with ARRAY and RECORD values being passed by reference now. That means that they need an extra load to get their addresses.
I've been thinking about how to support types as arguments to procedures. To do so, a type needs to be wrapped as an expression, where the value is the type symbol and the type is TYPE. This will work without much effort for named types by simply adding another case to Sema::onQualifiedExpression, i.e., without modification of the parser. My question is whether this is general enough or if anonymous types with a syntax such as the following are also required be supported: s := SIZE(ARRAY 10 OF RECORD x, y: INTEGER END);? This syntax would require (significant) changes to the parser and potentially introduce unwanted ambiguity.

zaskar9 commented 8 months ago

I had a quick look at prototyping what I proposed in point no. 3 of my previous comment today. Unfortunately, even this solution requires a change to the interplay of Parser and Sema: There is an unfortunate ambiguity in the Oberon grammar due to (a) optional parentheses after procedure calls and (b) both actual parameters and type guards being placed in parentheses after an ident.

Currently, the parser does not have enough context to properly distinguish these different cases. This lack of context stems from the fact that, currently, all selectors of a designator are parsed without intermittent call-backs to Sema. As a result, the parser flies mostly blind and will parse an input such as SIZE(INTEGER) as a type guard instead of an actual parameter. I will need to couple Parser and Sema much more closely and do a call-back after each selector to have the Parser correctly maneuver through these inputs. While this is going to take a bit of work, I think it'll make the design more consistent and, hopefully, the parsing more robust.

tenko commented 8 months ago

Yes, it feels a bit bolted-on and is probably sensitive to any changes in the syst, m as it is not a designed feature.
I added a couple of unittests covering ADR, GET, PUT and COPY. The wording in the Oberon-07 report for COPY is "copy n consecutive words from src to dst". I implemented this as a BYTE copy as this makes most sense. I will check other implementations. As of now this fails due to some changes in the CHAR merge. I traced this to LLVMIRBuilder::qualifiedName and node->getIdentifier(). Probably related to the ProcedureNode::isImported() change. I will try further to find the error.
As far as I understand the type parameter is here similar to the parameter in the type guard and extended case statement.

zaskar9 commented 8 months ago

Indeed! Also, eventually the SYSTEM module will need to be able to change depending on whether the compiler is in Oberon, Oberon-2, or Oberon-07 mode. As you will have noticed, I already refactored all the built-in procedures into their own methods. This refactoring is intended as a first step towards moving this code out of LLVMIRBuilder. The same will need to happen for code related to the SYSTEM module.
I'm having a though time imagining how these things interact with each other. Is it just COPY that fails? If so, have a look at void LLVMIRBuilder::visit(AssignmentNode &), where I handle assignment of records and arrays using a blanket LLVM memcpy. Note that this code is still super sketchy. One issue is the naive size calculation, another is that, in case of array assignment, both the length and the content of the target array are overwritten, which can lead to weird behavior. My suggestion would be to write one dependable implementation for COPY and then call that code for the handling of assignments, too.
So, it has the be a named type, right? With a bit of luck, I can have that going through Parser and Sema in the next days. 😎

zaskar9 commented 8 months ago

A question after looking at the code of your unit tests: does the compiler need to guarantee that the following assertions holds?

VAR a: ARRAY 10 OF INTEGER;
ASSERT(SYSTEM.ADR(a) = SYSTEM.ADR(a[0]));

This is true in C/C++, but without additional effort it will not be true in the current implementation of arrays in CodeGen: SYSTEM.ADR(a) is the address of the length of the array, whereas the address of a[0] is SYSTEM.ADR(a) + 8. Recall that internally an array is a struct { i64, [ len x memTy ] }.

zaskar9 commented 8 months ago

Points no. 1 and 2 of my original list are actually linked. It is also related to your observation that aliasing the SYSTEM module does not work as expected.

Your code uses SymbolTable::import() to import a procedure from the SYSTEM module. This method accepts the name (or alias) of a module, but works under the assumption that an instance of ModuleNode (with that name) exists and has been set on the declaration. The following code from SymbolImporter shows the intended use of the import() method and also documents how aliasing is currently implemented.

auto decl = make_unique<ProcedureNode>(std::move(ident), dynamic_cast<ProcedureTypeNode *>(type));
symbols_->import(module->getAlias(), name, decl.get());
decl->setModule(module);
module->procedures().push_back(std::move(decl));

Using the ctor for ProcedureNode shown above, the procedure will be marked automatically as imported as it is the constructor designed to be used for SymbolImporter. This property then triggers a chain reaction leading to the problem you pointed out. Luckily, there's the quick fix to stop Sema from adding predefined procedures as external procedures, which seems to fix the two unit tests that you provided.

Making the aliasing work will, however, require a bit more work into the direction that I outlined in points no. 1 and 2. Nevertheless, I will merge your changes and my fixes into the main branch. While this functionality is not really production ready, it also doesn't interfere with the rest of the compiler.

zaskar9 commented 8 months ago

If fixed the unit tests and adapted Parser and Sema to support type names as arguments to procedures (as qualified expressions). I also implemented SIZE, which is available as built-in procedure (Oberon) and through the SYSTEM module (Oberon 07). I uncommented the following line from the second unit test as it now compiles and works as expected.

SYSTEM.COPY(xadr, yadr, LEN(x) * SYSTEM.SIZE(INTEGER));

Hope this helps to add more functionality to the SYSTEM module!

tenko commented 8 months ago

Excellent,

It now works as expected.

I will do some small adjustments after checking the Project Oberon source at link:

"copy n consecutive words from src to dst". Here word has the meaning INTEGER size in the Project Oberon implementation.
The construct SYSTEM.ADR(a) = SYSTEM.ADR(a[0]) is supported.
Add some further tests of above changes.

zaskar9 commented 8 months ago

Note that my implementation of the SIZE built-in procedure is rather naive as it simply returns the value calculated by Sema during checking. This is a logical size and does not take into consideration physical aspects such as alignment and padding. Probably this is better implemented by returning values obtained via llvm::DataLayout, llvm::StructLayout, etc.

Regarding point no. 2 in your comment: I'm thinking of slightly changing the handling of ARRAYs in the backend (oh, no, not again!). Currently, I model them as a struct a: { int len, type vec[] } and pass a pointer to the beginning of the struct, i.e., a, around. The change I'm planning to do is to use a pointer to a.vec as the handle for the array and access len with a negative offset from that pointer. Recently, I finally figured out how to correctly compute that offset and the relative addresses of struct members. With this change, I believe I can get rid of a lot of special treatment required to skip the len field and still have it there for the LEN procedure. This is especially relevant as I'm planning to use the same approach to extend records with an additional (hidden) field to store the type tag (as record extension is the next thing on my list to add to the compiler).