zbraniecki / message-format-2.0-rs

MessageFormat 2.0 Prototype in Rust
https://github.com/unicode-org/message-format-wg/issues/93

Initial AST dump #2

Open zbraniecki opened 4 years ago

zbraniecki commented 4 years ago

For the initial work, I suggest we take the fluent-rs AST: https://github.com/projectfluent/fluent-rs/blob/master/fluent-syntax/src/ast.rs

and design a vastly simplified subset of it that captures a single Message.

Something along the lines of:

pub struct Message {
    pub value: Pattern,
    pub comment: Option<String>,
}

pub struct Pattern {
    pub elements: Vec<PatternElement>,
}

pub enum PatternElement {
    TextElement(String),
    Placeable(Expression),
}

pub struct Variant {
    pub key: VariantKey,
    pub value: Pattern,
    pub default: bool,
}

pub enum VariantKey {
    Identifier(Identifier),
    NumberLiteral(String),
}

pub enum InlineExpression {
    StringLiteral {
        value: String,
    },
    NumberLiteral {
        value: String,
    },
    FunctionReference {
        id: String,
        argument: Option<Identifier>,
    },
    VariableReference {
        id: Identifier,
    },
}

pub struct Identifier {
    pub name: String,
}

pub enum Expression {
    InlineExpression(InlineExpression),
    SelectExpression {
        selector: InlineExpression,
        variants: Vec<Variant>,
    },
}
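
To make the proposed subset concrete, here is a minimal sketch of how a message could be built and walked with these types. It is illustrative only: the message `Hello, { $user }!`, the helper names `hello_message` and `format_message`, and the string-to-string argument map are assumptions, and the enums are trimmed to the variants this example needs.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub struct Message {
    pub value: Pattern,
    pub comment: Option<String>,
}

#[derive(Debug, PartialEq)]
pub struct Pattern {
    pub elements: Vec<PatternElement>,
}

#[derive(Debug, PartialEq)]
pub enum PatternElement {
    TextElement(String),
    Placeable(Expression),
}

#[derive(Debug, PartialEq)]
pub struct Identifier {
    pub name: String,
}

// Trimmed for brevity: only the variants this example needs.
#[derive(Debug, PartialEq)]
pub enum InlineExpression {
    StringLiteral { value: String },
    VariableReference { id: Identifier },
}

// Trimmed for brevity: SelectExpression is left out of this sketch.
#[derive(Debug, PartialEq)]
pub enum Expression {
    InlineExpression(InlineExpression),
}

/// Builds the AST for the hypothetical message `Hello, { $user }!`.
pub fn hello_message() -> Message {
    Message {
        value: Pattern {
            elements: vec![
                PatternElement::TextElement("Hello, ".to_string()),
                PatternElement::Placeable(Expression::InlineExpression(
                    InlineExpression::VariableReference {
                        id: Identifier { name: "user".to_string() },
                    },
                )),
                PatternElement::TextElement("!".to_string()),
            ],
        },
        comment: Some("Greeting shown on the home screen".to_string()),
    }
}

/// Hypothetical resolver: concatenates text and looks variables up in `args`.
/// Returns `None` if a referenced variable is missing.
pub fn format_message(msg: &Message, args: &HashMap<&str, &str>) -> Option<String> {
    let mut out = String::new();
    for element in &msg.value.elements {
        match element {
            PatternElement::TextElement(text) => out.push_str(text),
            PatternElement::Placeable(Expression::InlineExpression(expr)) => match expr {
                InlineExpression::StringLiteral { value } => out.push_str(value),
                InlineExpression::VariableReference { id } => {
                    out.push_str(args.get(id.name.as_str())?)
                }
            },
        }
    }
    Some(out)
}
```

Calling `format_message(&hello_message(), &args)` with `user` bound to `Anna` would yield `Hello, Anna!`; a missing argument yields `None` rather than a partially formatted string, which is just one possible error policy.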
zbraniecki commented 4 years ago

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

zbraniecki commented 4 years ago

I chose this subset because I think it captures the essence of multiple valuable traits of Fluent that I would like to offer for consideration for MF 2.0:

This allows per-environment to do:

// Function as a formatter
Today is { DATETIME($now) }.

and

// Function as a selector
You have { PLURAL($emailCount) ->
    [one] one email
  *[other] { $emailCount } emails
}

which addresses the part of the MF2.0 purpose of "being more flexible" - https://github.com/unicode-org/message-format-wg/pull/84

In particular, it makes PLURAL just one of many possible formatters/selectors, ensuring that any system that supports PLURAL will support all functions. I'm not strongly opinionated about whether functions used as formatters and as selectors should be the same thing, but I haven't found a reason for them not to be, so I'm initially offering them as the same AST node.
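
A rough sketch of what "the same AST node" could mean in practice: a single `FunctionReference` is resolved once, and the result is either printed (formatter) or compared against variant keys (selector). Everything beyond the `FunctionReference` shape is an assumption made for illustration: the toy function table `call_function`, the naive English-only PLURAL rule, the hypothetical `UPPER` formatter, and the simplified string-based `Variant`.

```rust
use std::collections::HashMap;

pub struct Identifier {
    pub name: String,
}

// Only the node under discussion is kept; the other variants are omitted.
pub enum InlineExpression {
    FunctionReference { id: String, argument: Option<Identifier> },
}

pub struct Variant {
    pub key: String,   // simplified stand-in for VariantKey
    pub value: String, // simplified stand-in for Pattern
    pub default: bool,
}

/// Toy function table (an assumption for illustration, not a real API).
/// PLURAL uses a naive English-only rule; UPPER is a plain formatter.
fn call_function(id: &str, arg: &str) -> Option<String> {
    match id {
        "PLURAL" => {
            let category = if arg == "1" { "one" } else { "other" };
            Some(category.to_string())
        }
        "UPPER" => Some(arg.to_uppercase()),
        _ => None,
    }
}

/// Resolves a FunctionReference to a string, regardless of where it is used.
pub fn resolve(expr: &InlineExpression, args: &HashMap<&str, &str>) -> Option<String> {
    match expr {
        InlineExpression::FunctionReference { id, argument } => {
            let name = argument.as_ref()?.name.as_str();
            call_function(id, args.get(name)?)
        }
    }
}

/// Uses the resolved value as a selector: exact key match, else the default variant.
pub fn select<'a>(
    selector: &InlineExpression,
    variants: &'a [Variant],
    args: &HashMap<&str, &str>,
) -> Option<&'a str> {
    let value = resolve(selector, args)?;
    variants
        .iter()
        .find(|v| v.key == value)
        .or_else(|| variants.iter().find(|v| v.default))
        .map(|v| v.value.as_str())
}
```

With this shape, `resolve` is the only place that knows about functions; `select` is a thin layer on top of it, so adding a new function makes it available in both positions at once.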

echeran commented 4 years ago

I have some comments:

I'll stop there, and hopefully some of that makes sense. I may have misunderstood things about Fluent, so please correct (and @mihnita, chime in on corrections).

filmil commented 4 years ago

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

How about if we started by example, in terms of the use cases we'd like to handle? I personally find it hard to figure out whether an AST or EBNF actually supports what I'd like to do by staring at a wall of text. :)

filmil commented 4 years ago
  • I have been thinking (#3) about the API input data in a way that I think allows us to decouple serialization concerns (file -> syntax -> parsing -> AST). EBNF seems like a cleaner, better approach to address serialization concerns, but the task of structuring the data comes first, I think.

IMHO, examples of what we want structured come even before that.

  • I think we want to generalize the notion of Variant to support the possibility of a Message having more than one placeholder that triggers the "switch/case" behavior. [...] Maybe the "switch" (select) should be explicit, in which case are we able to support 2 select placeholders/args/vars of different types (ex: one plural, one gender)?

There is a practical point to keeping the "computational" part of a format string separate from human-readable (human-translatable) string as well.

At some point (looking back to the ICU conference last October), it seemed to make sense to separate out parameter binding, values based on those parameters and pattern matching. Especially because I'd like to expand the set of possible transformations beyond plural and gender into inflections and then things get increasingly more interesting.

* I'm not sure if or how often Fluent's selector functions operate > 1 placeholder.  The alternate way I'm suggesting here assumes that the formatting function operates on just 1 placeholder, and the formatting fn is determined by the placeholder type attached to a `Placeholder`.
zbraniecki commented 4 years ago
  • Could the type of Variant.value be Message? I think that better captures the relationship of Variant being a superset/wrapper of a Message in a particular situation -- when that message belongs to a group of messages connected to each other as the "cases" of a "switch/case". And of course, the "switch/case" is implicitly triggered when the placeholders' types have enumerated categories of values (plurals, gender, etc.)

The main difference, in this mini-AST, would be that then each variant could have its own comments. I don't know if there's a value to that?

  • I think we want to generalize the notion of Variant to support the possibility of a Message having more than one placeholder that triggers the "switch/case" behavior.

Good point. We can achieve it by doing:

#[derive(Debug, PartialEq)]
pub struct Variant {
-    pub key: VariantKey,
+    pub key: Vec<VariantKey>,
    pub value: Pattern,
    pub default: bool,
}

pub enum Expression {
    InlineExpression(InlineExpression),
    SelectExpression {
-        selector: InlineExpression,
+        selector: Vec<InlineExpression>,
        variants: Vec<Variant>,
    },
}

Does it sound good?

  • I'm not sure if or how often Fluent's selector functions operate > 1 placeholder.

They don't yet in Fluent :( We so far only got to do it via nested selectors:

key = { PLURAL($num) ->
    [one] { GENDER($user) ->
        [masculine] Foo
       *[other] Bar
    }
   *[other] Baz
}

and we plan to get back to flattening selectors here: https://github.com/projectfluent/fluent/issues/4 to get

key = { PLURAL($num), GENDER($user) ->
    [one, masculine] Foo
    [one, other] Bar
   *[other] Baz
}

or

key = { PLURAL($num), GENDER($user) ->
    [one, masculine] Foo
    [one, *other] Bar
    [*other] Baz
}

I believe we should support the flattened approach in MF 2.0.
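
A minimal sketch of how flattened selection could work against `key: Vec<VariantKey>`, assuming the selectors have already been resolved to strings. The matching rule here is an assumption, not settled semantics: the first variant whose keys all equal the corresponding selector values wins, a variant with fewer keys than selectors matches on that prefix, and otherwise the variant flagged `default` is used. `Pattern` and `Identifier` are simplified to plain strings.

```rust
#[derive(Debug, PartialEq)]
pub enum VariantKey {
    Identifier(String), // simplified: the proposal wraps this in `Identifier`
    NumberLiteral(String),
}

#[derive(Debug, PartialEq)]
pub struct Variant {
    pub key: Vec<VariantKey>,
    pub value: String, // simplified stand-in for `Pattern`
    pub default: bool,
}

/// Returns the textual form of a key so it can be compared to a selector value.
fn key_text(key: &VariantKey) -> &str {
    match key {
        VariantKey::Identifier(s) | VariantKey::NumberLiteral(s) => s.as_str(),
    }
}

/// Picks the first variant whose keys all equal the corresponding selector
/// values; a variant with fewer keys than selectors matches on that prefix.
/// Falls back to the first variant flagged `default`.
pub fn select_variant<'a>(selected: &[&str], variants: &'a [Variant]) -> Option<&'a Variant> {
    variants
        .iter()
        .find(|v| {
            v.key.len() <= selected.len()
                && v.key.iter().zip(selected).all(|(k, s)| key_text(k) == *s)
        })
        .or_else(|| variants.iter().find(|v| v.default))
}
```

For the three variants `[one, masculine] Foo`, `[one, other] Bar` and `*[other] Baz`, the selector values `("one", "masculine")` pick Foo, `("one", "other")` pick Bar, and anything unmatched such as `("few", "masculine")` falls through to Baz.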

zbraniecki commented 4 years ago

@stasm

stasm commented 4 years ago

Some high-level thoughts about the things mentioned in this thread so far:

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax.

It allows placeables to be selector expressions or inline expressions

It would be interesting to experiment with a different approach than the one we know from MessageFormat and Fluent where select expressions go into placeables. I mean the approach where the branching logic happens first, before patterns are defined. I call this the exploded message approach; I'm sure there are better names ;)

Rather than allow (text 1), (select with text 2a, text 2b, text 2c), (text 3), the exploded approach would encode the translation as (select with text 1, text 2, text 3).
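
As a rough illustration of that transformation, the sketch below takes a leading text, a set of select variants, and a trailing text, and produces variants that each carry the whole sentence. The `explode` helper and the string-based `Variant` are hypothetical simplifications; a real data model would carry patterns rather than plain strings.

```rust
/// One branch of a select: a key and the text chosen for it.
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    pub key: String,
    pub text: String, // simplified stand-in for `Pattern`
}

/// Hypothetical helper: turns (prefix), (select with variants), (suffix)
/// into a single select whose variants each carry the full sentence.
pub fn explode(prefix: &str, variants: &[Variant], suffix: &str) -> Vec<Variant> {
    variants
        .iter()
        .map(|v| Variant {
            key: v.key.clone(),
            text: format!("{}{}{}", prefix, v.text, suffix),
        })
        .collect()
}
```

Exploding `"You have "`, the variants `one -> "one email"` and `other -> "many emails"`, and `"."` gives two complete sentences, one per key, so a translator sees each sentence whole instead of in fragments.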

The main difference, in this mini-AST, would be that then each variant could have its own comments. I don't know if there's a value to that?

I think there is! In fact, I think it would be interesting to consider what happens if all or most data nodes can have metadata attached to them. Things like: context, comments, examples, whether it can be translated, whether it can be re-positioned in the sentence, which grammatical case is used, etc.
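
One possible shape for this, sketched under assumptions: a generic wrapper that pairs any data-model node with a metadata bag. The `Node` and `Meta` names and the particular fields are hypothetical and cover only a few of the items listed above.

```rust
/// Hypothetical metadata bag; the field names are illustrative only.
#[derive(Debug, Default, PartialEq)]
pub struct Meta {
    pub comment: Option<String>,
    pub example: Option<String>,
    pub translatable: bool,
}

/// Wraps any data-model node together with its metadata.
#[derive(Debug, PartialEq)]
pub struct Node<T> {
    pub value: T,
    pub meta: Meta,
}

impl<T> Node<T> {
    /// A node with no comment or example that is translatable by default.
    pub fn new(value: T) -> Self {
        Node {
            value,
            meta: Meta { translatable: true, ..Meta::default() },
        }
    }

    /// Attaches a comment, returning the node for chaining.
    pub fn with_comment(mut self, comment: &str) -> Self {
        self.meta.comment = Some(comment.to_string());
        self
    }
}
```

A variant's pattern could then be stored as `Node<Pattern>` and a message as `Node<Message>`, so comments on individual variants fall out of the same mechanism as comments on whole messages.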

filmil commented 4 years ago

Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf

I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax.

Thinking aloud: is there a requirement that MessageFormat 2.0 be encodable as a string? If it were encoded as a struct, it seems like the parsing machinery would not even be needed; or could reuse existing generic parsers like YAML.

zbraniecki commented 4 years ago

Rather than allow (text 1), (select with text 2a, text 2b, text 2c), (text 3), the exploded approach would encode the translation as (select with text 1, text 2, text 3).

I agree that it would be interesting to try that. But we need to answer the question about nested selections in such a case.

What happens when you have PLURAL and GENDER selectors, and GENDER differs only in category one? What happens when you have three selectors (I know, edge case), say, PLURAL, PLURAL, GENDER?

I think there is! In fact, I think it would be interesting to consider what happens if all or most data nodes can have metadata attached to them. Things like: context, comments, examples, whether it can be translated, whether it can be re-positioned in the sentence, which grammatical case is used, etc.

This may be relatively easy to represent in the datamodel, but may be very very hard to represent in textual form. Maybe it's ok to have a more open datamodel, and let the textual representation be capable of expressing just some of the metadata.

Thinking aloud: is there a requirement that MessageFormat 2.0 be encodable as a string? If it were encoded as a struct, it seems like the parsing machinery would not even be needed; or could reuse existing generic parsers like YAML.

We're not certain yet. For now we focus on a non-textual representation, but I expect that for Web usage we'll want a resource format, similarly to how we don't encode CSS in JSON/YAML, but rather have its own dedicated textual format. There are many reasons why YAML/JSON is not really the best target for an l10n resource format, and I think we'll want to have an l10n-tailored one later on, maybe even multiple, but the one that gets standardized for the Web is likely to be the dominant one for the foreseeable future.

Bottom line is - I think for now we should focus on AST and data model, but the way we imagine what we want to express should take into account that one day we'll want to express it in a human-readable/writable format.

zbraniecki commented 4 years ago

I opened #6 to discuss AST of selectors vs placeholders.

stasm commented 4 years ago

What happens when you have PLURAL and GENDER selectors, and GENDER differs only in category one? What happens when you have three selectors (I know, edge case), say, PLURAL, PLURAL, GENDER?

That's a great question, and I think it's something we can answer with a prototype :) Thanks for filing #6, I'll continue there.

mihnita commented 4 years ago

each variant could have its own comments. I don't know if there's a value to that?

I think there is value.

mihnita commented 4 years ago

Should we overall start with AST or EBNF? Fluent's EBNF is here:

TLDR: I am with stasm@ on this one

"I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax."

So just data model + examples to show that it works.


I think that EBNF focuses too much on the syntax part.

It says stuff like:

  foo := '[' listItems ']';
  listItems := item [',' listItems];

when what we want is really: foo is an array of item(s)

The EBNF doc at https://www.ics.uci.edu/~pattis/ICS-33/lectures/ebnf.pdf has a section named "1.6 Syntax versus Semantics" that starts with "EBNF descriptions specify only syntax: the form in which something is written. They do not specify semantics: the meaning of what is written".

So in this respect the Rust code is more readable:

 pub elements: Vec<PatternElement>,

(or the same thing in proto syntax, repeated PatternElement elements)

mihnita commented 4 years ago

is there a requirement that MessageFormat 2.0 be encodable as a string

I think it is. But likely not at this stage. My hope is that we can come up with a data model, and then define one / several string representations.

That would have several benefits: