wenyan-lang / wenyan

文言文編程語言 A programming language for the ancient Chinese.
https://wy-lang.org/
MIT License
19.57k stars 1.1k forks source link

[ANNOUNCEMENT] Static Typecheck & Type Inference #486

Open LingDong- opened 4 years ago

LingDong- commented 4 years ago

As another step toward a strongly-typed language with a functional flavor, wenyan-lang now supports ML style static typechecking and inference. When the option is turned on, the compiler will now raise exceptions if your code does not typecheck. Also it is capable of producing type signatures for inspection, e.g. ./example/quicksort.wy produces the following:

[0-347] {
  快排 : (('a) arr) -> (('a) arr)
  己 : (num) arr
  [33-285] {
    首 : ('a) arr
    頷 : ('a) arr
    尾 : ('a) arr
    甲一 : 'a
    甲餘 : ('a) arr
    乙 : ('a) arr
    [136-201] {
      丁 : 'a
    }
  }
}

Each scope is wrapped by indentation and {}. The numbers in [] are character ranges of the scope in the source file.

wenyan-lang uses bi-directional type inference. For example, when it reads that you declared an array, it will be recognized as a polymorphic array 'a arr (alpha-array). It then reads on to find out, say, you're pushing numbers into it, it automatically turns the array into a num arr. Similarly when you start doing multiplication * with two polymorphic types, it will realize that these types must have been nums and change accordingly.

When you have not done anything explicit to hint the type of the array, it will stay as polymorphic. This is very useful if your function should work for any array, e.g. reversing, sorting, splitting, etc.

Usage

Since this feature is still experimental, it is turned off by default. You can turn it on by changing the strict flag to true for parser.compiler. Imported libraries will not be checked to avoid flooding the log, but all the standard libraries is already known to pass all the typechecks so you don't need to worry.

Motivation

This kind of type inference is very suitable for wenyan-lang, because explicit type declarations are not very readable in natural language. 数列 is fine, 数列列列 is harder to read, and if your type signature involve any kind of nested structure, it will not be possible to express without the aid of brakcets, (e.g. 数 -> (数->数列) -> 数, is different from, say, 数 -> (数->数) 列 -> 数). In comparison, the static type inference system only requires you to be self-consistent, achieving type safty without your labor of spelling out all the types.

Details

Limitations

Example

Here is the type signature of the turing machine example ./example/turing.wy. Notice how 2 dimensional arrays are inferred, when user only anotated them as .

[0-1382] {
  圖靈機 : ((('a) arr) arr) -> ((str) -> ((str) -> (nil)))
  製律 : (((str) arr) arr) -> ((str) -> ((str) -> ((str) -> ((str) -> ((str) -> (nil))))))
  諸律 : ('a) arr
  陽符 : str
  陰符 : str
  [51-909] {
    帶 : (str) arr
    針 : num
    左疆 : num
    右疆 : num
    態 : str
    圖靈機畫法 : (nil) -> (nil)
    [160-423] {
      畫帶 : str
      筆 : num
      [197-383] {
        符 : str
      }
    }
      [459-740] {
        律 : ('a) arr
        甲態 : 'a
        讀符 : 'a
        乙態 : 'a
        畫符 : 'a
        移 : 'a
      }
  }
  [990-1055] {
    律 : (str) arr
  }
}

There is still some work to be done to hone this type system, but I hope one day it will be as good as Haskell or SML. Maybe in the extremely distant future we'll be able to make a formal verification :)

Please let me know if you find any issues, and feel free to discuss in the comments!

lymslive commented 4 years ago

初读,感觉“量”这个字不太好。 与“数言列爻物术”放一起看,“量”好像是“变量”的缩写?有种出戏的感觉。 提议用“器”代表任意类型。出自“形上谓道,形下谓器”。 程序两大构件,函数与变量。“术”字的意思与“道”接近,所以可用“器”表示变量。通俗点也能理解为容器,可以放下任意其他“数言列爻物”的容器。

比较读一下: 吾有一量 吾有一器

还有一点,“数言列爻物”读来都是比较具象的名词,但“量”就突然抽象了。“器”也是具象的名字,又可通用。

statementreply commented 4 years ago

It doesn't work for function parameter of object type.

Code:

吾有一術。名之曰「座標之逆」。欲行是術。必先得一物。曰「座標」。是術曰。
    夫「座標」之「「南」」。乘其以負一。名之曰「北」。
    夫「座標」之「「東」」。乘其以負一。名之曰「西」。
    吾有一物。名之曰「逆」。其物如是。
        物之「「南」」者。數曰「北」。
        物之「「東」」者。數曰「西」。
    是謂「逆」之物也。
    乃得「逆」。
是謂「座標之逆」之術也。

Error message:

[Syntax Error] [Type] Property "南" does not exist in obj  }
Line 2, Character 4:    夫「座標」之「「南」」。乘其以負一。名之曰「北」。
[Syntax Error] [Type] * operator expecting (num), found 'undefined' : undefined
Line 2, Character 16:    夫「座標」之「「南」」。乘其以負一。名之曰「北」。
[Syntax Error] [Type] Property "東" does not exist in obj  }
Line 3, Character 4:    夫「座標」之「「東」」。乘其以負一。名之曰「西」。
[Syntax Error] [Type] * operator expecting (num), found 'undefined' : undefined
Line 3, Character 16:    夫「座標」之「「東」」。乘其以負一。名之曰「西」。

Output:

[0-193] {
  座標之逆 : ( }) -> ({ 南 : (num), 東 : (num) })
  [34-193] {
    北 : num
    西 : num
    逆 : { 南 : (num), 東 : (num) }
  }
}
LingDong- commented 4 years ago

@lymslive , thanks for the suggestion!

Polymorphic types are meant to sound abstract, because they ARE abstract ideas. I agree that 量 may not necessarily be the best word, but I don't think 器 is a big improvement either. Like you said, 器 could mean containers, and user might easily confuse with container types, such as arrays, objects, etc. 量 seems like a word commonly used to denote unspecified types, e.g. 標量 向量 应/自變量. I've also considered other words such as 元, 端, 象, 實. As always, I'm happy to hear more wording suggestions.

LingDong- commented 4 years ago

@statementreply , thanks for spotting the bug! I'll look into it right away.

lymslive commented 4 years ago

@LingDong- 当“容器”表达 "container" 之意时,关键在“容”字而不在“器”字,我不觉得它会与 array (列)object(物)混淆。之前我将其解释为“容器”是不太准确的,是通俗地想向新人强行解释。 “器”在古汉语境中将其理解为与“道”相对的概念就比较文雅了。

“標量 向量 应/自變量” 这几个词都是现代汉语词汇,所以我才说会有出戏感。 在你提的另外四字“元, 端, 象, 實“,我觉得”元“与”象“比”量“稍好,但不如”器“好。

LingDong- commented 4 years ago

@lymslive

Notice that 容器 is also a relatively modern word. 器 is more commonly used in classical times to mean containers.

Like you mentioned in your other post, those early translators of western concepts into the Chinese language are well-versed in Classical Chinese. 意譯, unlike 音譯, often draw inspiration from the traditional usage of words. I think the tricky problem of translating the concept of "variable" is to certain extent solved by those who came before us. Unless there is an enormous advantage, there is not much need to reinvent.

LingDong- commented 4 years ago

@statementreply , the bug is now fixed in the newest commit 182d100ed3982cfd69e9dcc860d6484ec3203838

It should now compile without errors and spit out the (more) correct signature:

[0-193] {
  座標之逆 : ({ 南 : ('a), 東 : ('b) }) -> ({ 南 : (num), 東 : (num) })
  [34-193] {
    北 : num
    西 : num
    逆 : { 南 : (num), 東 : (num) }
  }
}

Theoretically the 南 : ('a), 東 : ('b) can be further inferred to be 南 : (num), 東 : (num). I'll investigate.

Implementation detail: For obj in arguments, I now set a flag in its type to true, indicating that the fields should be inferred by the content of the function. More beautiful solution ideas are welcome!

Thanks again for spotting the bug! Let me know if this (or any other) issue remains.

lymslive commented 4 years ago

@LingDong- It may be trivial discussion on this keyword too deep, after all it not effect the compile parser.

You mean not understand the underlying thought of “形上谓之道,形下谓之器” quiet well. I just think the two concept in it can map to the key concenpt of progromming world well: algorithm(function) and data struct(variable), where function if implement or wrapper of algorithm, and variable can represent or reference data struct.

We've used “术" to express function concept, that's good and can found it's origin from 《九章算术》and other math books. It's meaning is close to “道", but ”道" occupies on a higher level, that may corresponding to algorithm in programming. That's why I prefer "器" to express variable concept.

By the way, ”象" is also acceptable, it origin from 《易经》。“易” not means random, but changing, that's also close to variable concept in some aspect。

If you really dislike “器” for variable, you may used for class. Though I donot think you have to implement the full complex of class. OOP is not best nowadays.

statementreply commented 4 years ago

@statementreply , the bug is now fixed in the newest commit 182d100

It should now compile without errors and spit out the (more) correct signature:

I got the same result and error messages. It appears to me that the commit 182d100 contains only changes to test code.

LingDong- commented 4 years ago

@lymslive , I think it is actually great to have an in-depth discussion on keywords, as they’re integral parts of the language. It is better to make good decision on keywords now, than to have the pain of changing them later.

I am totally aware of the meaning of 形上谓之道,形下谓之器, and I enjoyed the way you interpreted it from programming perspective. Similarly, you’re definitely not the only person who love 易經. However, to supplement what I already said, although 器 could be used in this abstract way, once you put a numeral before it, e.g. 吾有二器, it starts to have a concrete meaning, and to readers, this probably means instrument or container.

As for the other keywords I mentioned earlier, here are some thought process.

Here’s another idea. Instead of using pre-Qin philosophical terminology, maybe we can also consider going for the 傳奇戲曲 style, using 奇 and 寶 for the types we not yet know what they actually are. e.g. 吾有一奇 or 必先得一寶. More light-hearted, but could also work.

This discussion is rather unexpected since this post is mainly about static type inference and the benefits it brings for the language (which nobody seems to care :), but I think keywords are equally important. I’ll consider more about this, and let you know if I ever decide to change it.

LingDong- commented 4 years ago

@statementreply Oops, I committed in the wrong folder (again :). It should be fixed now with this fresh commit: 0195caa38623f0ffc6630f0310eb1b08cd793878 , thanks!

lymslive commented 4 years ago

@LingDong- I think it's a bit early to introduce static type if you feel a bit depressed that few people care, since for script language, week type is just OK, especially not in practical large project. While our discussion may also have week relation to this issue. Because of the changing for week/strong type, we have to seek another keyword for generic /polymorphic types.

indeed confused with “人民币”. Maybe it can be used fro class member. 吾有一宝 sound interesting and funny, I think also acceptable in wenyan-lang.

Then I want to talk about another topic. I think you don't worry about the problem with “吾有二”. I think in wenyan-lang, it is not necessary to support “吾有二” and “吾有三” ect, just keep “吾有一”. It is totally different reading fell between “吾有一物” and “吾有二物”。And in practical use of current popular language, many coding guide recommend only define one variable in one statement, seldom use as int i, j, k;.

In traditional Chinese, “吾有一物” not means "I have one" exactly, but "I have a" or just "I have", or "I have the object named ...". So it hardly extend this pattern only increase the apparent number “一” to “二” or “三”.

If we really want define multiple variable in one statement, we may say multiple “一”, such as 吾有一数(名曰),一数(名曰),一言(名曰),一物(名曰)。That means "I have a, and have another". It is bad to have to count the number of variables.

Sometimes only “一” can express the meaning "have", I still remember in one textbook has a sentence “一桌一椅一抚尺而已”. "一" is much more meaningful than number "one" in traditional Chinese. Then I suddenly think up that wan only use “一” to express the unknown type variable, just omit the type keyword if unknown or hardly describe exactly.

吾有一「桌」一「椅」一「抚尺」
吾有一「甲」一「乙」一「丙」一「丁」

Since it's type is unknown, it's value must be undefined either, so no need to say “曰” any more.

Addition argue “一” to “二” comes from “术”. We can only define “吾有一术”, but nearly unreadable to say “吾有二术”. So we should depress using “吾有二” to define normal variable type, only encourage “吾有一” consistently. Otherwise new comer definitely will ask how to define “吾有二术” as “吾有二物”.

Hope helpful fro you.


怎么说呢,虽然“吾有二数”比“吾有一数”多了一数,但它的“逼格”降了不止一个数量级。

LingDong- commented 4 years ago

周棄曰。少昊有四叔。曰重。曰該。曰修。曰熙。

吳王壽夢有四子。長曰謁。次曰餘祭。次曰夷昧。次曰季札。

以荒政十有二聚萬民:一曰散利,二曰薄徵,三曰緩刑,四曰弛力,五曰捨禁,六曰去幾,七曰眚禮,八曰殺哀,九曰蕃樂,十曰多昏,十有一曰索鬼神,十有二曰除盜賊。

凡祭有四時:春祭曰礿,夏祭曰禘,秋祭曰嘗,冬祭曰烝。

子夏曰。君子有三變。望之儼然。即之也溫。聽其言也厲。

故言必有三表。何謂三表。子墨子言曰。有本之者。有原之者。有用之者。

孔子曰。人有五儀。有庸人。有士人。有君子。有賢人。有聖人。

:)

lymslive commented 4 years ago

Yes, many usage can be found in traditional Chinese book, but we have to select and balance (权衡). Syntax in programming language must be stricter than nature language. And I only suggest discourage “吾有二” not diable right now.

Actually in my last post the mainly purpose is to provide another solution for unknown type, if we cannot seek a perfect character, how about just omit type?

In my opinion, repeat “一” is more simple and handy than “曰”. And someone in another issue also mentioned than use “曰” both for variable name and value is not very good, then I prefer to only for value.

In programming world, it is better to distinguish literal number and keyword, where literal number is indeed value just as literal string, for and respectively in wenyan-lang. So it will be nicer to bind Chinese number character to number value, except and only except “一”, since “一” is so fantasy and not only server as number.

When a programmer can write “吾有二数” and “吾有三数”, he may also, in theory, write “吾有十数” and “吾有百数”, oh, that's terrible. And more the “百” “千” “ 万” character sometimes only means many, not exact number. You may limit the number in this case below ten “十”, at most one character.

Then refer to the syntax of “术”, for argument, your first design is “必先得六 数”. When problem came up for different type, you had to patch, say “必先得一 数曰……一言曰” etc. Then back to the same type, it may also possible to split to say multiple “一”. Try and compare like following:

吾有一术……必先得二数曰……曰……
吾有一术……必先得一数曰……一言曰……
吾有一术……必先得一数曰……一数曰……

It seams to have tow syntax pattern for tow same type argument, and much more possible pattern for more same type argument. OK, it may not serious problem but only faw (白璧微瑕). But I want to point out that, if you only allow “一” not “二” or more at first, then there won't have the patch story, and can freely extend to unknown type since no worry how to say “二 unknown type”.

LingDong- commented 4 years ago

Sorry to contradict, but I think the current mechanism makes sense.

As stated in the 1st line of the Feature section of the README, one of the goals of wenyan-lang is to simulate the natural language used by the ancient Chinese people. It is one of the fundamental reasons this project exists at all.

To supplement my examples taken from the classics, here is a simpler example for you: when you're cooking stir-fried tomatos and scrambled eggs, you would say I need five eggs and two tomatos. You wouldn't say I need a tomato, I need a tomato, I need an egg, I need an egg, I need an egg, I need an egg, I need an egg. This is why we have 必先得五數曰⋯⋯二言曰⋯⋯

When in doubt, please assume the language features are carefully planned, thought out and implemented, and not rashly put together and "patched" later. Then we can have a more meaningful discussion about how to improve them further. I would really appreciate it, thanks!

lymslive commented 4 years ago

Yes, current syntax make sense, that why I'm here and cannot restrain myself to discuss with you in detail, or too detail.

But I don't think your cook example above match what I mean exactly. If you need two tomatos and five eggs, it will be tomato[2] and egg[5] in programming language, which are “列” in wenyan-lang, . We just cannot have two variables with the same name. Or if we only concern the count, what we need may be only two int variables named "tomato" and "egg" with value 2 and 5 respectively.

Only when the elements, say the five eggs can be distinguished from each other or have different purpose in the cook, then we can name them separately, for example, 生鸡蛋,熟鸡蛋,鸭蛋,鹌鹑蛋,凤凰蛋. In this case, we can list them one by one. Of course we can first announce the summary count, then list each after that. Actually this is current syntax in wenyan-lang.

吾有五蛋,曰「生鸡蛋」曰「熟鸡蛋」曰「鸭蛋」曰「鹌鹑蛋」曰「凤凰蛋」。

It is supposed we have a type “蛋”, or make “蛋” generic type keyword. What I means is that we can also omit the summary prefix “五蛋”, only list each different variables which may belong to type “蛋”.

吾有一「生鸡蛋」一「熟鸡蛋」一「鸭蛋」一「鹌鹑蛋」一「凤凰蛋」。

This is also valid grammar in traditional Chinese, and in some context it may have more 文采 than plain 曰.

Now that we don't have type “蛋”, and impossible use it as generic or unknown type, then we have to say with another keyword, for example:

吾有五量,曰「生鸡蛋」曰「熟鸡蛋」曰「鸭蛋」曰「鹌鹑蛋」曰「凤凰蛋」。
吾有五器,曰「生鸡蛋」曰「熟鸡蛋」曰「鸭蛋」曰「鹌鹑蛋」曰「凤凰蛋」。
吾有五象,曰「生鸡蛋」曰「熟鸡蛋」曰「鸭蛋」曰「鹌鹑蛋」曰「凤凰蛋」。
吾有五元,曰「生鸡蛋」曰「熟鸡蛋」曰「鸭蛋」曰「鹌鹑蛋」曰「凤凰蛋」。

吾有一「生鸡蛋」一「熟鸡蛋」一「鸭蛋」一「鹌鹑蛋」一「凤凰蛋」。

In my option, any candidate keyword sound strange. So I propose just omit the prefix summary count and unknown type. While in concrete type case, it is still nice optionally prefix with summary count and postfix with value.

Another trivial benefit omit summary count number is a bit easier to modify code, if we add or delete another “蛋”, we only need modify one place, no need to remember modify to count accordingly. It will let programmer make less mistake in practical coding. I think one purpose of strong type/check is also advancing to practical coding, make programmer less mistake.

Sorry my post may pollute this issue. I also expect more technique discussion under this issue.

LingDong- commented 4 years ago

I think omitting type for polymorphic types using only is a valid proposal and is implementable. However, it might be an over-complication to introduce new syntax just to bring some syntactic sugar to only 1 feature. Let us consider more about this before deciding.