zaskar9 / oberon-lang

An LLVM frontend for the Oberon programming language
MIT License
17 stars 2 forks source link

Crash with assignment of string literal reference returned from procedure #40

Open tenko opened 5 months ago

tenko commented 5 months ago

This is not so easy to reproduce. I have a custom procedure defined which return a string literal.

From OberonSystem.cpp:

this->createProcedure(ProcKind::COMPILER_ENV, "Env", {{stringType, false}}, stringType, false, true);

And in LLVMIRBuilder.cpp:

Value *
LLVMIRBuilder::createCompilerEnvCall(vector<unique_ptr<ExpressionNode>> &actuals) {
    auto param = actuals[0].get();
    auto arg = dynamic_cast<StringLiteralNode*>(param);
    if (!arg) {
        logger_.error(param->pos(), "expected constant string");
        return value_;
    }
    std::string val;
    auto str = arg->value();
    if (str == "HOST") {
        val = sys::getProcessTriple();
    } else if (str == "FILE") {
        val = param->pos().fileName;
    } else {
        logger_.error(param->pos(), "unknown argumen '" + str + "'");
        return value_;
    }
    auto len = val.size() + 1;
    auto type = StructType::get(builder_.getInt64Ty(), ArrayType::get(builder_.getInt8Ty(), len));
    value_ = strings_[val];
    if (!value_) {
        auto initializer = ConstantStruct::get(type, {builder_.getInt64(len), ConstantDataArray::getRaw(val, len, builder_.getInt8Ty())});
        auto str = new GlobalVariable(*module_, type, true, GlobalValue::InternalLinkage, initializer, ".str");
        str->setAlignment(module_->getDataLayout().getPrefTypeAlign(type));
        strings_[val] = str;
        value_ = strings_[val];
    }
    return value_;
}

The following works until the code crashes on assignment in global scope.

MODULE Test;
IMPORT SYSTEM, COMPILER, RTS := RUNTIME;

VAR
    str : ARRAY 64 OF CHAR;

PROCEDURE test(s : ARRAY OF CHAR);
BEGIN
    str := s;
END test;

BEGIN
    RTS.Println(COMPILER.Env("FILE")); (* OK *)
    test(COMPILER.Env("FILE")); (* OK *)
    RTS.Println(str);
    str := COMPILER.Env("FILE"); (* Crash *)
    RTS.Println(str);
END Test.

Assignment in the test procedure works.

zaskar9 commented 5 months ago

The problem is in the title of the issue. O07.10.1 specifies that "[t]he result type of a procedure can be neither a record nor an array". Since a string literal is a value that is implicitly mapped to ARRAY OF CHAR, this is and should not be supported by the backend. Since the procedure is created via OberonSystem.cpp, the Sema checks that would stop this procedure from being defined are bypassed. Type STRING should never be (explicitly) used in any declaration regardless of whether it is in source code (where it cannot be used) or internally in pseudo-modules. It only exists for Sema to type check string literals.

The best way forward in my opinion would be to model COMPILER.Env in a way that passes and returns the string via a VAR-parameter.

tenko commented 5 months ago

You are right with Oberon-07.

I see XDS the compiler allow return POINTER TO construct. Here it would be POINTER TO ARRAY OF CHAR. This is I believe allowed according to Oberon-2 report?

Edit: I see the language report MODULE Trees example show return of POINTER TO construct.

I will modify here this to take a VAR argument as suggested to be in line with Oberon-07.

zaskar9 commented 5 months ago

Once there is a complete, tested, and stable implementation of Oberon-07, I think such extensions to the language definitely make sense. The fact that the base type of a pointer type can only be a record is currently only enforced in Sema. I don't think lifting that restriction will lead to problems or even changes in CodeGen as I added that Sema check quite late and already had code that made use of types like POINTER TO INTEGER or POINTER TO ARRAY.

I agree that modeling sting literals as POINTER TO ARRAY OF CHAR (rather than the internal type STRING) is a good solution as it is essentially how LLVM-IR represents these literals. But in order to add this extension, there first needs to be a specification what implicit conversions will happen and when a string is passed by value rather than reference. Consider the following four possible ways to define a variable that can be assigned a string literal.

VAR str1: ARRAY 10 OF CHAR;
    str2: ARRAY OF CHAR;
    str3: POINTER TO ARRAY OF CHAR;
    str4: POINTER TO ARRAY 10 OF CHAR;
  1. The first question would be whether there is even a (semantic) difference between str2 and str3 or wether the syntax used for str2 is just syntactic sugar for the same effect as in the case of str3? Open array variables will essentially boil down to pointers as the compiler cannot know how much space to allocate.
  2. The second question is then whether it makes sense to even have the cases for str3 and str4, or whether they are already subsumed by the second case? If they are not necessary, it probably makes sense to keep Wirth's restriction of limiting pointers to record base types.
  3. The third question is whether it is okay (as in expected by and understandable to the developer) to do value-based assignment (copy) for str1 but reference-based assignment (aliasing) for str2, str3, and str4?
  4. The fourth question is whether these four types are compatible to one another and, if so, how? For example, would str1 := str2 be legal? How about str1 := str3 or str1 := str4? These could be implemented as syntactic sugar for str1 := ^str3 and str1 := ^str4. Does going the other way, i.e.,str2 := str1, make sense? Is it necessary or desirable to have this? Should the compiler rewrite this to str2 := SYSTEM.ADR(str1)?
  5. Finally, what happens if these variables are assigned string literals?

I will re-open this issue and mark it as an enhancement for later down the line. I want to keep this on the radar as I believe elegant and efficient string handling to be a key factor to the ease-of-use and utility of a programming language.

tenko commented 5 months ago

Excellent.

I believe it makes sense to check how the XDS compiler behaves. Oberon-2 does not have the assignment copy of arrays and rely on the COPY procedure, so some problematics is avoided here. I have implemented a dynamic string in String for the XDS compiler. Here the string type is POINTER TO ARRAY OF CHAR. Example of use in OSPath. When using the standard string procedures taking and open array as argument the string pointer type is dereferenced in the argument list. The compiler will give you an error message hinting if you forget this dereference. I think this works out quite well in practice.

Also another open question is regarding strings is that many "modern" languages now implement the short string optimization trick, where the pointer and length data is reused for a static string. This needs to be on the compiler side as this is in no way type safe programming, but maybe this only makes sense for languages used for Web style programming where strings are everywhere?