Open zbraniecki opened 4 years ago
Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf
I chose this subset because I think it captures the essence of multiple valuable traits of Fluent that I would like to offer for consideration for MF 2.0:
- `Pattern` as a list of textual parts and placeables (see #4)
This allows per-environment to do:
// Function as a formatter
Today is { DATETIME($now) }.
and
// Function as a selector
You have { PLURAL($emailCount) ->
    [one] one email
   *[other] { $emailCount } emails
}
which addresses the part of the MF2.0 purpose of "being more flexible" - https://github.com/unicode-org/message-format-wg/pull/84
In particular, it makes `PLURAL` just one of many possible formatters/selectors, ensuring that any system that supports `PLURAL` will support all functions.
I'm not strongly opinionated about whether functions as formatters and functions as selectors should be the same thing, but I haven't found a reason for them not to be, so I'm initially offering them as the same AST node.
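To make the "same AST node" idea concrete, here is a minimal Rust sketch. It is only an illustration: the type and variant names are hypothetical and simplified, not the actual fluent-rs definitions. The same `FunctionReference` node appears once as an inline (formatter) expression and once as the selector of a select expression.

```rust
// Hypothetical mini-AST; the names are illustrative and are not fluent-rs's API.

#[derive(Debug, Clone, PartialEq)]
pub enum InlineExpression {
    /// A variable such as `$now`.
    VariableReference { id: String },
    /// A function call such as `DATETIME($now)` or `PLURAL($emailCount)`.
    FunctionReference { id: String, arguments: Vec<InlineExpression> },
}

#[derive(Debug, Clone, PartialEq)]
pub enum Expression {
    /// The function result is formatted into the message.
    Inline(InlineExpression),
    /// The function result picks one of the variants.
    Select { selector: InlineExpression, variants: Vec<Variant> },
}

#[derive(Debug, Clone, PartialEq)]
pub enum PatternElement {
    Text(String),
    Placeable(Expression),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    pub key: String,
    pub value: Vec<PatternElement>,
    pub default: bool,
}

/// Builds a one-argument call such as `PLURAL($emailCount)`.
pub fn call(function: &str, variable: &str) -> InlineExpression {
    InlineExpression::FunctionReference {
        id: function.to_string(),
        arguments: vec![InlineExpression::VariableReference { id: variable.to_string() }],
    }
}

fn text(s: &str) -> PatternElement {
    PatternElement::Text(s.to_string())
}

/// `Today is { DATETIME($now) }.` (function used as a formatter)
pub fn today_message() -> Vec<PatternElement> {
    vec![
        text("Today is "),
        PatternElement::Placeable(Expression::Inline(call("DATETIME", "now"))),
        text("."),
    ]
}

/// `You have { PLURAL($emailCount) -> ... }` (function used as a selector)
pub fn email_message() -> Vec<PatternElement> {
    let count = InlineExpression::VariableReference { id: "emailCount".to_string() };
    vec![
        text("You have "),
        PatternElement::Placeable(Expression::Select {
            selector: call("PLURAL", "emailCount"),
            variants: vec![
                Variant { key: "one".to_string(), value: vec![text("one email")], default: false },
                Variant {
                    key: "other".to_string(),
                    value: vec![PatternElement::Placeable(Expression::Inline(count)), text(" emails")],
                    default: true,
                },
            ],
        }),
    ]
}
```

Nothing in the data model distinguishes `DATETIME` from `PLURAL`; only the position of the node (inline or selector) differs.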
I have some comments:
- Could the type of `Variant.value` be `Message`? I think that better captures the relationship of `Variant` being a superset/wrapper of a `Message` in a particular situation -- when that message belongs to a group of messages connected to each other as the "cases" of a "switch/case". And of course, the "switch/case" is implicitly triggered when the placeholders' types have enumerated categories of values (plurals, gender, etc.)
- I think we want to generalize the notion of `Variant` to support the possibility of a `Message` having more than one placeholder that triggers the "switch/case" behavior. If we have 2 plurals, or a plural and a gender, etc. in one message, then our "cases" correspond to the Cartesian product of the possible values that the placeholders can take on (ex: #{ [ONE, female], [ONE, male], [ONE, other], [OTHER, female], [OTHER, male], [OTHER, other] } ). So instead of `Variant.key: VariantKey`, maybe `Variant.case_vals: HashMap<Identifier, String>`? This assumes that we ensure that there is a concept of `Placeholder` that has a field of type `Identifier`. And if that makes sense so far, in this scenario I'm describing, the "switch" (select) part of the "switch/case" scenario to which `Variant` belongs is implicitly defined by the use of placeholders whose types take on a finite enumerated set of values. Maybe the "switch" (select) should be explicit, in which case are we able to support 2 select placeholders/args/vars of different types (ex: one plural, one gender)?
* I'm not sure if or how often Fluent's selector functions operate > 1 placeholder. The alternate way I'm suggesting here assumes that the formatting function operates on just 1 placeholder, and the formatting fn is determined by the placeholder type attached to a `Placeholder`.
I'll stop there, and hopefully some of that makes sense. I may have misunderstood things about Fluent, so please correct me (and @mihnita, chime in with corrections).
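The `case_vals` suggestion can be sketched roughly as follows. This is only an illustration under stated assumptions: `Identifier` is reduced to a `String`, the variant's value is a plain `String` standing in for a `Message`, and the helper names `all_cases` and `find_variant` are hypothetical.

```rust
use std::collections::HashMap;

/// Assumption: an identifier is just a string naming a placeholder.
pub type Identifier = String;

#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    /// Maps each selecting placeholder to the category this case covers.
    pub case_vals: HashMap<Identifier, String>,
    /// Stand-in for a `Message` or `Pattern`.
    pub value: String,
}

/// Enumerates the Cartesian product of the categories of all placeholders.
pub fn all_cases(placeholders: &[(&str, &[&str])]) -> Vec<HashMap<Identifier, String>> {
    // Start with one empty case and extend it once per placeholder.
    let mut cases = vec![HashMap::new()];
    for (id, categories) in placeholders {
        let mut next = Vec::with_capacity(cases.len() * categories.len());
        for case in &cases {
            for category in categories.iter() {
                let mut extended = case.clone();
                extended.insert(id.to_string(), category.to_string());
                next.push(extended);
            }
        }
        cases = next;
    }
    cases
}

/// Finds the variant whose case values equal the resolved categories exactly.
pub fn find_variant<'a>(
    variants: &'a [Variant],
    resolved: &HashMap<Identifier, String>,
) -> Option<&'a Variant> {
    variants.iter().find(|variant| &variant.case_vals == resolved)
}
```

With a plural placeholder taking `ONE`/`OTHER` and a gender placeholder taking `female`/`male`/`other`, `all_cases` yields the six combinations listed above. The sketch does not address defaults or partial matches.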
> Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf
How about if we started by example, in terms of the use cases we'd like to handle? I personally find it hard to figure out whether an AST or EBNF actually supports what I'd like to do by staring at a wall of text. :)
> - I have been thinking (#3) about the API input data in a way that I think allows us to decouple serialization concerns (file -> syntax -> parsing -> AST). EBNF seems like a cleaner, better approach to address serialization concerns, but the task of structuring the data comes first, I think.
IMHO, examples of what we want structured come even before that.
> - I think we want to generalize the notion of `Variant` to support the possibility of a `Message` having more than one placeholder that triggers the "switch/case" behavior. [...] Maybe the "switch" (select) should be explicit, in which case are we able to support 2 select placeholders/args/vars of different types (ex: one plural, one gender)?
There is a practical point to keeping the "computational" part of a format string separate from the human-readable (human-translatable) string as well.
At some point (looking back to the ICU conference last October), it seemed to make sense to separate out parameter binding, values based on those parameters and pattern matching. Especially because I'd like to expand the set of possible transformations beyond plural and gender into inflections and then things get increasingly more interesting.
* I'm not sure if or how often Fluent's selector functions operate > 1 placeholder. The alternate way I'm suggesting here assumes that the formatting function operates on just 1 placeholder, and the formatting fn is determined by the placeholder type attached to a `Placeholder`.
> - Could the type of `Variant.value` be `Message`? I think that better captures the relationship of `Variant` being a superset/wrapper of a `Message` in a particular situation -- when that message belongs to a group of messages connected to each other as the "cases" of a "switch/case". And of course, the "switch/case" is implicitly triggered when the placeholders' types have enumerated categories of values (plurals, gender, etc.)
The main difference, in this mini-AST, would be that then each variant could have its own comments. I don't know if there's a value to that?
> - I think we want to generalize the notion of `Variant` to support the possibility of a `Message` having more than one placeholder that triggers the "switch/case" behavior.
Good point. We can achieve it by doing:
  #[derive(Debug, PartialEq)]
  pub struct Variant {
-     pub key: VariantKey,
+     pub key: Vec<VariantKey>,
      pub value: Pattern,
      pub default: bool,
  }

  pub enum Expression {
      InlineExpression(InlineExpression),
      SelectExpression {
-         selector: InlineExpression,
+         selector: Vec<InlineExpression>,
          variants: Vec<Variant>,
      },
  }
Does it sound good?
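For illustration, here is one way variant selection could work once `key` is a list. This is a sketch under assumptions, not Fluent's behavior: `VariantKey` is simplified to `String`, `Pattern` to a plain `String`, and the `select` helper is hypothetical. It picks the first variant whose keys equal the resolved selector values and otherwise falls back to the variant marked as default.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    /// One key per selector (VariantKey simplified to String).
    pub key: Vec<String>,
    /// Stand-in for `Pattern`.
    pub value: String,
    pub default: bool,
}

/// Returns the first variant whose keys match `resolved` position by position,
/// or the default variant if no variant matches.
pub fn select<'a>(variants: &'a [Variant], resolved: &[&str]) -> Option<&'a Variant> {
    variants
        .iter()
        .find(|variant| {
            variant.key.len() == resolved.len()
                && variant.key.iter().zip(resolved).all(|(key, value)| key.as_str() == *value)
        })
        .or_else(|| variants.iter().find(|variant| variant.default))
}
```

With a single message-level default, an input such as `["one", "feminine"]` falls through to the default variant rather than to a `[one, other]` variant; per-selector defaults would need extra information in the key.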
> - I'm not sure if or how often Fluent's selector functions operate > 1 placeholder.
They don't yet in Fluent :( We so far only got to do it via nested selectors:
key = { PLURAL($num) ->
    [one] { GENDER($user) ->
        [masculine] Foo
       *[other] Bar
    }
   *[other] Baz
}
and we plan to get back to flattening selectors here: https://github.com/projectfluent/fluent/issues/4 to get
key = { PLURAL($num), GENDER($user) ->
    [one, masculine] Foo
    [one, other] Bar
   *[other] Baz
}
or
key = { PLURAL($num), GENDER($user) ->
    [one, masculine] Foo
    [one, *other] Bar
    [*other] Baz
}
I believe we should support the flattened approach in MF 2.0.
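The relationship between the nested and flattened forms can be sketched as a tree walk. This is a minimal illustration: the `Node` and `FlatVariant` types and the `flatten` function are hypothetical, and default markers are left out for brevity.

```rust
/// Hypothetical nested form: a variant's value is either text or another select.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Select { selector: String, variants: Vec<(String, Node)> },
}

/// Hypothetical flattened form: one list of keys per leaf message.
#[derive(Debug, Clone, PartialEq)]
pub struct FlatVariant {
    pub keys: Vec<String>,
    pub value: String,
}

/// Walks the nested tree depth-first, accumulating the keys along each path.
pub fn flatten(node: &Node) -> Vec<FlatVariant> {
    fn walk(node: &Node, path: &mut Vec<String>, out: &mut Vec<FlatVariant>) {
        match node {
            Node::Text(text) => out.push(FlatVariant { keys: path.clone(), value: text.clone() }),
            Node::Select { variants, .. } => {
                for (key, child) in variants {
                    path.push(key.clone());
                    walk(child, path, out);
                    path.pop();
                }
            }
        }
    }
    let mut out = Vec::new();
    walk(node, &mut Vec::new(), &mut out);
    out
}
```

Applied to the nested `PLURAL`/`GENDER` example, this produces the key lists `[one, masculine]`, `[one, other]` and `[other]`, matching the first flattened form shown above.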
@stasm
Some high-level thoughts about the things mentioned in this thread so far:
> Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf
I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax.
> It allows placeables to be selector expressions or inline expressions
It would be interesting to experiment with a different approach than the one we know from MessageFormat and Fluent where select expressions go into placeables. I mean the approach where the branching logic happens first, before patterns are defined. I call this the exploded message approach; I'm sure there are better names ;)
Rather than allow `(text 1), (select with text 2a, text 2b, text 2c), (text 3)`, the exploded approach would encode the translation as `(select with text 1, text 2, text 3)`.
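As a rough sketch of what "exploding" could mean, the following Rust function turns a pattern containing inline selects into a flat list of complete messages, one per combination of variant keys. The `Part` type and the `explode` function are hypothetical, and variant values are plain strings for brevity.

```rust
/// Hypothetical pattern part: plain text, or a select with (key, text) variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Part {
    Text(String),
    Select(Vec<(String, String)>),
}

/// Expands a pattern into one complete message per combination of variant keys.
pub fn explode(parts: &[Part]) -> Vec<(Vec<String>, String)> {
    // Start with a single message that has no keys and no text yet.
    let mut exploded = vec![(Vec::new(), String::new())];
    for part in parts {
        match part {
            // Text is appended to every message built so far.
            Part::Text(text) => {
                for (_, message) in exploded.iter_mut() {
                    message.push_str(text);
                }
            }
            // A select multiplies the messages built so far by its variants.
            Part::Select(variants) => {
                let mut next = Vec::with_capacity(exploded.len() * variants.len());
                for (keys, message) in &exploded {
                    for (key, text) in variants {
                        let mut keys = keys.clone();
                        keys.push(key.clone());
                        next.push((keys, format!("{}{}", message, text)));
                    }
                }
                exploded = next;
            }
        }
    }
    exploded
}
```

The surrounding text is repeated in every variant, so each exploded entry is a full sentence that can be translated on its own. With several selects the number of entries grows multiplicatively.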
> The main difference, in this mini-AST, would be that then each variant could have its own comments. I don't know if there's a value to that?
I think there is! In fact, I think it would be interesting to consider what happens if all or most data nodes can have metadata attached to them. Things like: context, comments, examples, whether it can be translated, whether it can be re-positioned in the sentence, which grammatical case is used, etc.
> Should we overall start with AST or EBNF? Fluent's EBNF is here: https://github.com/projectfluent/fluent/blob/master/spec/fluent.ebnf
> I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax.
Thinking aloud: is there a requirement that MessageFormat 2.0 be encodable as a string? If it were encoded as a struct, it seems like the parsing machinery would not even be needed; or could reuse existing generic parsers like YAML.
> Rather than allow `(text 1), (select with text 2a, text 2b, text 2c), (text 3)`, the exploded approach would encode the translation as `(select with text 1, text 2, text 3)`.
I agree that it would be interesting to try that. But we need to answer the question about nested selections in such a case.
What happens when you have a `PLURAL, GENDER` selector, and `GENDER` differs only in category `one`? What happens when you have three selectors (I know, edge case), say, `PLURAL, PLURAL, GENDER`?
> I think there is! In fact, I think it would be interesting to consider what happens if all or most data nodes can have metadata attached to them. Things like: context, comments, examples, whether it can be translated, whether it can be re-positioned in the sentence, which grammatical case is used, etc.
This may be relatively easy to represent in the datamodel, but may be very very hard to represent in textual form. Maybe it's ok to have a more open datamodel, and let the textual representation be capable of expressing just some of the metadata.
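One possible shape for this, sketched in Rust: a generic wrapper carries metadata for any node, while a textual serializer emits only the part it has syntax for. Everything here is hypothetical (`Meta`, `Annotated`, `to_text`, and the `# comment` text form); it only illustrates an open data model paired with a narrower textual form.

```rust
use std::collections::BTreeMap;

/// Hypothetical metadata that any node in the data model could carry.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Meta {
    pub comment: Option<String>,
    pub translatable: Option<bool>,
    /// Open-ended extras, e.g. grammatical case or examples.
    pub extra: BTreeMap<String, String>,
}

/// Wraps any node type together with its metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Annotated<T> {
    pub node: T,
    pub meta: Meta,
}

impl<T> Annotated<T> {
    /// Wraps a node with empty metadata.
    pub fn bare(node: T) -> Self {
        Annotated { node, meta: Meta::default() }
    }

    /// Attaches a comment to the node.
    pub fn with_comment(mut self, comment: &str) -> Self {
        self.meta.comment = Some(comment.to_string());
        self
    }
}

/// A textual form that can express only comments; other metadata is dropped.
pub fn to_text(variant: &Annotated<String>) -> String {
    match &variant.meta.comment {
        Some(comment) => format!("# {}\n{}", comment, variant.node),
        None => variant.node.clone(),
    }
}
```

In this sketch the `extra` map survives in the data model but is silently lost by `to_text`, which is the trade-off described above.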
> Thinking aloud: is there a requirement that MessageFormat 2.0 be encodable as a string? If it were encoded as a struct, it seems like the parsing machinery would not even be needed; or could reuse existing generic parsers like YAML.
We're not certain yet. For now we focus on non-textual representation, but I expect that for Web usage we'll want a resource format, similarly to how we don't encode CSS in JSON/YAML, but rather have its own dedicated textual format. There are many reasons why YAML/JSON is not really the best target for an l10n resource format, and I think we'll want to have an l10n-tailored one later on, maybe even multiple, but the one that gets standardized for the Web is likely to be the dominant one for the foreseeable future.
Bottom line is - I think for now we should focus on AST and data model, but the way we imagine what we want to express should take into account that one day we'll want to express it in a human-readable/writable format.
I opened #6 to discuss AST of selectors vs placeholders.
> What happens when you have PLURAL, GENDER selector, and GENDER differs only in category one. What happens when you have three selectors (I know, edge case), say, PLURAL, PLURAL, GENDER?
That's a great question, and I think it's something we can answer with a prototype :) Thanks for filing #6, I'll continue there.
> each variant could have its own comments. I don't know if there's a value to that?
I think there is value.
> Should we overall start with AST or EBNF? Fluent's EBNF is here:
TLDR: I am with stasm@ on this one
"I'd suggest starting with the data model alone. No parsing, no EBNF. I think the prototype should be a vehicle for discussion about semantics, use-cases and requirements rather than about the syntax."
So just data model + examples to show that it works.
I think that EBNF focuses too much on the syntax part.
It says stuff like:
foo := '[' listItems ']';
listItems := item [',' listItems];
when what we want is really: foo is an array of item(s)
If we look at the EBNF doc at https://www.ics.uci.edu/~pattis/ICS-33/lectures/ebnf.pdf, it has a section named "1.6 Syntax versus Semantics" that starts with "EBNF descriptions specify only syntax: the form in which something is written. They do not specify semantics: the meaning of what is written"
So in this respect the Rust code is more readable:
pub elements: Vec<PatternElement>,
(or the same thing in proto syntax, `repeated PatternElement elements`)
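To show how little machinery the data-model view needs, here is a small sketch in which a pattern is built directly as a `Vec<PatternElement>` and formatted without any grammar or parser. The `PatternElement` variants, the `Pattern` struct and the `format_pattern` helper are hypothetical simplifications, not the fluent-rs types.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified element: literal text or a named variable.
#[derive(Debug, Clone, PartialEq)]
pub enum PatternElement {
    Text(String),
    Variable(String),
}

/// "foo is an array of item(s)": a pattern is just a list of elements.
#[derive(Debug, Clone, PartialEq)]
pub struct Pattern {
    pub elements: Vec<PatternElement>,
}

/// Concatenates the elements, substituting variables from `args`.
/// Unknown variables are rendered as `{$name}` so the gap stays visible.
pub fn format_pattern(pattern: &Pattern, args: &HashMap<String, String>) -> String {
    let mut out = String::new();
    for element in &pattern.elements {
        match element {
            PatternElement::Text(text) => out.push_str(text),
            PatternElement::Variable(name) => match args.get(name) {
                Some(value) => out.push_str(value),
                None => {
                    out.push_str("{$");
                    out.push_str(name);
                    out.push('}');
                }
            },
        }
    }
    out
}
```

The semantics ("a list of parts, some of which are variables") are carried by the types themselves, and any string syntax can be layered on later.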
> is there a requirement that MessageFormat 2.0 be encodable as a string
I think there is. But likely not at this stage. My hope is that we can come up with a data model, and then define one / several string representations.
That would have several benefits:
For the initial work, I suggest we take the fluent-rs AST: https://github.com/projectfluent/fluent-rs/blob/master/fluent-syntax/src/ast.rs
and design a vastly simplified subset of it that captures a single Message.
Something along the lines of: