whatwg / infra

Infra Standard
https://infra.spec.whatwg.org/

Define numbers (waiting on Number / BigInt) #87

Open annevk opened 7 years ago

annevk commented 7 years ago

Should we define numbers and their various notation schemes? (Mathematical operators?)

It might also make sense to define null as being roughly analogous to JavaScript's null and a good initial value for variables.

domenic commented 7 years ago

We may want to wait to see how https://github.com/littledan/proposal-integer/issues/10 goes with regard to ES conventions, where this is more of an issue.

I agree something to make it clearer when e.g. integer division or floating-point addition is being used would be helpful. Not sure what a good solution looks like.
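To make the ambiguity concrete, here is a minimal Python sketch (Python is used purely for illustration, and the function names are hypothetical, not proposed Infra terms). "Divide a by b" has at least three plausible readings, and they disagree as soon as an operand is negative:

```python
def divide_float(a: int, b: int) -> float:
    # Floating-point division: the result may have a fractional part.
    return a / b


def divide_floor(a: int, b: int) -> int:
    # Integer division that rounds toward negative infinity.
    return a // b


def divide_truncate(a: int, b: int) -> int:
    # Integer division that rounds toward zero (the C convention).
    quotient = abs(a) // abs(b)
    return quotient if (a < 0) == (b < 0) else -quotient


print(divide_float(-7, 2))     # fractional result
print(divide_floor(-7, 2))     # rounds down
print(divide_truncate(-7, 2))  # rounds toward zero
```

A specification that only says "divide" leaves the choice between these to the implementer.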

Null is probably harmless to add.

domenic commented 7 years ago

On "null": in https://github.com/whatwg/html/pull/2421 I actually went with "Let serialized be an uninitialized value" because I wanted it to be very clear that every branch of the algorithm was going to end up initializing it somehow. That felt clearer than "null", especially for the other direction where null is a valid JavaScript value. Not sure if we want to take that as precedent though.

annevk commented 7 years ago

That kind of edge case we could also catch with an Assert perhaps (which we should also add).
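A rough Python sketch of the two ideas combined, under the assumption that the algorithm is a toy deserializer (the `UNINITIALIZED` sentinel and `deserialize` function are hypothetical, not taken from the HTML pull request): the variable starts as a dedicated "uninitialized" marker because null is itself a valid result, and an assert checks that some branch set it.

```python
# Hypothetical marker meaning "not yet initialized". It is distinct from
# None, which stands in here for a legitimate value such as JavaScript's null.
UNINITIALIZED = object()


def deserialize(token: str):
    value = UNINITIALIZED
    if token == "null":
        value = None  # a valid result, so None cannot double as "unset"
    elif token == "true":
        value = True
    elif token == "false":
        value = False
    # Assert: one of the branches above initialized the variable.
    assert value is not UNINITIALIZED, f"unhandled token: {token!r}"
    return value
```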

annevk commented 7 years ago

Also, when this happens, revisit whether we need to move anything from https://html.spec.whatwg.org/#numbers.

annevk commented 6 years ago

https://github.com/tc39/ecma262/pull/1135 is pretty close to done it seems. Once that lands we can start aligning things.

littledan commented 4 years ago

The BigInt PRs have all landed. In the final notation, subscripts are used on values, not operations. Values without subscripts are Numbers. I'm not sure exactly how this should apply to web specs. cc @caiolima @ms2ger

annevk commented 4 years ago

An issue is that IDL talks about many distinct numeric types. In prose this mostly comes down to integers with range restrictions or float/double. Perhaps if we can make that more concrete we could have integers, with range restrictions at the IDL boundaries, and Numbers, with range restrictions at the IDL boundaries. And the IDL boundaries would also take care of conversion (e.g., some integers become a Number, others a BigNum).

The type in specification-prose could be inferred from the IDL, but we could copy the explicit notation in case there's room for ambiguity.

littledan commented 4 years ago

Hmm, I'm not sure if we ended up with any notation for those sorts of restrictions that would be meaningful to copy here, unfortunately.

annevk commented 4 years ago

That's fine, I was thinking of those more as IDL-level Asserts that specification algorithms would have to abide by. User agents could turn them into more dedicated types as they see fit.
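One way to picture such IDL-level asserts is the sketch below. It assumes the algorithm itself works with plain, unbounded integers and that the range restriction is checked only where a value crosses the IDL boundary. The helper names are hypothetical; the bit widths are those of the corresponding Web IDL integer types.

```python
# Bit width and signedness of a few Web IDL integer types.
IDL_INTEGER_TYPES = {
    "octet": (8, False),
    "unsigned short": (16, False),
    "long": (32, True),
    "unsigned long long": (64, False),
}


def idl_range(type_name: str) -> tuple:
    # Returns the inclusive (low, high) range of the named IDL type.
    bits, signed = IDL_INTEGER_TYPES[type_name]
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def assert_at_idl_boundary(value: int, type_name: str) -> int:
    # Hypothetical boundary check: the value must be an integer that the
    # IDL type can hold. Algorithms are expected never to violate this.
    low, high = idl_range(type_name)
    assert isinstance(value, int) and not isinstance(value, bool)
    assert low <= value <= high, f"{value} is out of range for {type_name}"
    return value
```

An implementation remains free to store the value in a dedicated machine type once the assert holds.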

msporny commented 4 years ago

First, thank you for the Infra spec, it's making my life as a spec Editor that has to manage the expectations of a WG when writing spec text wrt. abstract data models much easier. We're using it for non-browser use cases, and things are working out fairly well.

As a data point, we've converted the entire W3C Decentralized Identifier (DID) specification to use Infra and it addressed some gnashing of teeth the group was having around the abstract data model in the specification (we have serializations to pure JSON, JSON-LD, CBOR, CBOR-LD... so needed a way to talk about primitive types w/o giving preference to those serializations). Example of how we're using Infra here (still a bit rough, but it's getting there):

https://w3c.github.io/did-core/#metadata-structure

We're probably going to follow suit for the next Verifiable Credentials specification, and many of the other "Decentralized Identity" specifications as well.

The only sharp bump we hit while using Infra was the lack of a primitive "number" type. Reviewers of the specification kept raising an issue because we didn't link to the Infra spec for a number type. Once we explained that Infra didn't have number, it kicked off concern about the stability of Infra: "How could it not define a number type while it goes into excruciating detail about UTF-16 code points!?"

Is the Infra spec ready for a PR for a "number" primitive type? If you give me some general guidelines, I can write a first cut at a PR and we can revise from there. Ideally, it would be done so we could reference it via the DID spec, which goes to W3C CR in a month or so.

domenic commented 4 years ago

Is the Infra spec ready for a PR for a "number" primitive type? If you give me some general guidelines, I can write a first cut at a PR and we can revise from there. Ideally, it would be done so we could reference it via the DID spec, which goes to W3C CR in a month or so.

It's hard to say. Ideally we'd like to get multi-stakeholder discussion on the various requirements, and nail them down. But that might be hard with only you having a pressing need. I'll try to ask around internally and drum up that interest... Read on for what we might do even in the absence of other feedback.

To give a sample of the questions at hand, I think we have at least two number types that are important for specifications: mathematical numbers and IEEE 754 floating-point numbers.

Should these be separate types? They probably have to be, since NaN and +/-Infinity are not mathematical numbers. (Especially not NaN.) That is, you don't necessarily need to separate out an "integer" type or uint32 type from mathematical numbers, but you do need to separate out floats.

Should we be concerned that almost all mathematical numbers are not computable, and thus it's strange to use a mathematical number type in specifications meant for implementation on computers? I initially was concerned about this point, but the JavaScript specification editors have gone back to mathematical numbers, so maybe I should reconsider.

If we do have mathematical numbers, how should we help Infra-using specifications convert them into machine types, like IEEE floats or integers with C-style overflow behavior? Or do we implicitly require that every Infra-using specification use a BigDecimal (BigDouble? BigInteger?) implementation? Browsers certainly won't do that, and I suspect DID software wouldn't appreciate it either.

Similarly, if we don't add a specific type for 32-bit single-precision floats, how do we help specifications work with them, for cases where the intention is to implement using those?

Existing specifications just kind of assume nothing will overflow, or if they're good, they take special care to specify how to process things when they could. (E.g., they specify detailed processing when parsing a user-supplied value which could be arbitrarily large, and similarly they specify what happens when doing math on arbitrarily-large values.)

With all these things in mind, the best proposal I have is having Infra supply some basic concepts that can be referenced, and some guidance for how specifications can be precise. For example, something like:

A number is a mathematical real number. TODO maybe link to Wikipedia or something.

A float is an IEEE 754 floating point value, assumed by default to be 64-bit double-precision. Specifications using floats must specify if they want to use any of the other variants provided by IEEE 754.

Specifications must take care when converting user input into either of these Infra types, to make sure that the values are representable using a reasonable machine-storage technology. For example, not all input digit strings are representable as [floats], so any conversion algorithm needs to deal with finding the "closest" float. (TODO maybe Infra can supply this algorithm? HTML has one already.) Similarly, JavaScript Number values represent [floats], so if a specification wants to convert them to a [number], it needs to detail the handling of NaN and infinities.
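As a hedged illustration of both conversions, the Python sketch below uses `float()` to pick the double nearest to a digit string, and `fractions.Fraction` as a stand-in for an exact mathematical number. The function names are hypothetical, and rejecting NaN and the infinities is just one possible policy, assumed here for the example.

```python
import math
from fractions import Fraction


def digits_to_float(digits: str) -> float:
    # float() returns the double closest to the decimal string; strings
    # too large for a double become infinity.
    return float(digits)


def float_to_number(value: float) -> Fraction:
    # NaN and the infinities have no real-number counterpart, so this
    # sketch rejects them explicitly (an assumed policy, not a rule).
    if math.isnan(value) or math.isinf(value):
        raise ValueError("NaN and infinities are not mathematical numbers")
    # Fraction(value) is the exact real value of a finite double.
    return Fraction(value)


# The double closest to "0.1" is not exactly one tenth.
print(float_to_number(digits_to_float("0.1")) == Fraction(1, 10))
```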

Most specifications are implemented by machines without infinite memory, and so [numbers] usually do not have unbounded range, but instead are limited to being in some power-of-two range. A specification using numbers should indicate how to handle very positive or very negative values, or ensure that they could never arise.

Arithmetic on [floats] is well-defined by IEEE 754. Arithmetic on [numbers] is more problematic, as mathematical arithmetic definitions do not always map straightforwardly to implementations, due to considerations like overflow or rounding. For any cases that are potentially ambiguous, e.g. those which could result in numbers outside the range (-2**32, 2**32), or those involving division in a way that could leave fractional parts, specifications should state in more detail how the operation is to be implemented. TODO maybe define integer division.
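To show why "state in more detail" matters, here is a small Python sketch of three ways a specification could resolve an out-of-range sum. It assumes a signed 32-bit target range purely for the example, and the function names are hypothetical.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def add_checked(a: int, b: int) -> int:
    # Policy 1: an out-of-range result is an error.
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in 32 bits")
    return result


def add_clamped(a: int, b: int) -> int:
    # Policy 2: saturate at the nearest bound.
    return max(INT32_MIN, min(INT32_MAX, a + b))


def add_wrapped(a: int, b: int) -> int:
    # Policy 3: two's-complement wraparound.
    return (a + b - INT32_MIN) % (2 ** 32) + INT32_MIN
```

All three agree while the mathematical result stays in range and differ only at the edges, which is exactly where implementations tend to diverge.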

What do folks think of that?

syg commented 4 years ago

A related data point: ecma262 currently is trying to fix its definition of arithmetic so all actual operations are done on the reals.

Waldemar lists good reasons to do this in https://github.com/tc39/ecma262/issues/1964. It basically comes down to the fact that the properties you want of math operations do not hold on IEEE 754 doubles.

Depending on the operations web specs want to do, especially if transcendental functions are involved, I imagine this pitfall applies to other web specs as well.
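Associativity of addition is one such property. In the sketch below, Python floats stand in for IEEE 754 doubles and `fractions.Fraction` stands in for exact real arithmetic; the variable names are only for illustration.

```python
from fractions import Fraction

a, b, c = 0.1, 0.2, 0.3

# With doubles, every addition rounds, so grouping changes the result.
float_left = (a + b) + c
float_right = a + (b + c)

# With exact values of the same three doubles, grouping is irrelevant.
ea, eb, ec = Fraction(a), Fraction(b), Fraction(c)
exact_left = (ea + eb) + ec
exact_right = ea + (eb + ec)

print(float_left == float_right)  # False
print(exact_left == exact_right)  # True
```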

msporny commented 3 years ago

What do folks think of that?

It's a solid start, and would probably meet our needs in the W3C Verifiable Credentials WG, W3C JSON-LD WG, and W3C Decentralized Identifiers WG. A few minor comments:

BigDecimal (BigDouble? BigInteger?) implementation? Browsers certainly won't do that, and I suspect DID software wouldn't appreciate it either.

DID software deals with crypto all of the time, so asking to do BigInteger isn't a big deal. BigDouble would raise eyebrows.

I think we have at least two number types that are important for specifications

Yes, and we'd need both to be defined. The way that INFRA handles that today, specifically with https://infra.spec.whatwg.org/#code-points, provides a solid path forward.

Start with the most abstract concept (mathematical numbers), and then talk about the serializations and their limitations wrt. mathematical numbers. So, start abstract and whittle down to implementations. For example:

A number is a mathematical real number.

A float is a number that is expressed as an IEEE 754 floating point value, assumed by default to be 64-bit double-precision. Specifications using floats must specify if they want to use any of the other variants provided by IEEE 754.

An integer is a number that is an integral type expressed as an n-bit two's-complement value, assumed by default to be 64-bits in size. Specifications using integers must specify if they want to use any other variants, such as limiting the bit size to 8-bits or 32-bits.

There is a question around where we stop (why not short integer or unsigned integer), but perhaps we can just start with the three items above (and your explanatory text) and add things if spec writers need it later. We can always expand the base number types.

dlongley commented 3 years ago

DID software deals with crypto all of the time, so asking to do BigInteger isn't a big deal. BigDouble would raise eyebrows.

While that's true, I would expect there to be problems if everything had to be a BigInteger. Not every number people may use will be part of some crypto scheme/protocol. I would expect us to also keep in mind that there will be transformations to/from serializations like JSON -- and most parsers don't use BigInteger. I would expect some kind of advice or requirement around numbers in specs that allow for large ranges or high precision to be serialized as strings when traveling in JSON. For numbers of this sort, BigInteger or BigDecimal would be acceptable.
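A short Python sketch of the concern. Python's own `json` module keeps integers exact, so `parse_int=float` is used here to imitate a parser that stores every JSON number as a double; that imitation is an assumption of the example, not a description of any particular parser.

```python
import json

big = 2 ** 53 + 1  # 9007199254740993, not representable as a double

# Imitate a double-only parser: the value silently changes.
as_double = json.loads(json.dumps({"n": big}), parse_int=float)["n"]

# Carrying the value as a string survives the round trip.
as_string = json.loads(json.dumps({"n": str(big)}))["n"]
restored = int(as_string)

print(int(as_double) == big)  # False
print(restored == big)        # True
```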

An integer is a number that is an integral type expressed as an n-bit two's-complement value, assumed by default to be 64-bits in size. Specifications using integers must specify if they want to use any other variants, such as limiting the bit size to 8-bits or 32-bits.

This doesn't really work for JS, right? Perhaps integers either need to fall into the range supported by JS or they have to be represented by a BigInteger and considerations around serialization should be discussed. Or we may need more classes of integers. Perhaps the integer classes should be differentiated by whether or not they can be represented precisely as a 64-bit double precision float -- as this maps more to implementation decisions around serialization and internal representation in JS, since JS doesn't expose integer primitives that are available in other languages. We should consider these sorts of interoperability concerns as mentioned in the I-JSON spec. It may be simplest for the Infra spec to only offer up number representation primitives/classes that are going to encourage interoperability in this way.

Number boundaries are real to implementations, so the Infra spec should provide some assistance and guidance to spec writers according to the tools that are available to implementers, even if such boundaries don't exist in the abstract.

Edit to hopefully add a bit more clarification: I think specs will want to define numbers according to their behavior (to avoid issues as highlighted by @syg), but also need to consider the ranges imposed on them by their actual representations and other interoperability concerns so, something like: "Class 1: it behaves like an integer, but its range is such that it fits in a 64-bit float, Class 2: it's a 64-bit float, Class 3: It's a BigInteger, Class 4: It's a BigDecimal" ... or something along those lines.
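The four classes could be sketched roughly as follows. This is only an illustration under stated assumptions: Python's `int`, `float`, and `decimal.Decimal` stand in for the representations, the `classify` function is hypothetical, and the "fits in a 64-bit float" test uses the I-JSON interoperable range of ±(2**53 - 1).

```python
from decimal import Decimal

# Largest magnitude for which every integer is exactly representable
# as a 64-bit double (the range I-JSON recommends staying within).
MAX_SAFE_INTEGER = 2 ** 53 - 1


def classify(value) -> str:
    if isinstance(value, bool):
        raise TypeError("booleans are not numbers in this sketch")
    if isinstance(value, int):
        if -MAX_SAFE_INTEGER <= value <= MAX_SAFE_INTEGER:
            return "class 1: integer that fits in a 64-bit float"
        return "class 3: BigInteger"
    if isinstance(value, float):
        return "class 2: 64-bit float"
    if isinstance(value, Decimal):
        return "class 4: BigDecimal"
    raise TypeError(f"unsupported value: {value!r}")
```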

domenic commented 3 years ago

My hope was that Infra wouldn't need to define integers, since they are so situation-specific. (unlimited precision vs. 64-bit vs. 32-bit, unsigned vs. signed vs. positive vs. negative, ...). Instead they would just fall into a catch-all category of "if you have specific requirements, your spec will need to enforce them".

I'm still struggling with whether we should include mathematical definitions at all. The issues @syg points out are precisely what I'm concerned about: in particular, it's literally impossible to implement associative addition, or non-overflowing counters, or similar, on computers with finite memory. And the majority of the time the issue isn't absurdly large numbers that overflow all 64 GiB of your RAM, it's somewhat-large numbers that overflow your 64-bit int64_t storage, or just normal IEEE 754 weirdness when people don't want to break out their BigAlgebraicNumber library but still need non-integral values.

syg commented 3 years ago

Recapping some off-thread discussion: it seems to come down to whether specs should mandate rounding be cumulative on intermediate results, or only observable on the final result if the intermediate states aren't observable.

For JS, we have math equations, so we are using the reals. For web specs, that might not translate. I'll stay out of recommending a default, since I'm not sure what kind of math operations web specs may be doing.

However, I do think it'd be good to make web spec authors aware that should they opt to use IEEE754 doubles to define a computation using a series of steps, they're signing up for requiring cumulative rounding errors and things like 2**1000 being non-representable, etc. (Specifically, I'm imagining something like specifying an algorithm that uses real numbers for simplicity of understanding with one rounding step at the end, but there are well-known closed form approximations that implementations would use.)
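The difference between the two models can be seen in a small Python sketch. Floats stand in for IEEE 754 doubles, `fractions.Fraction` stands in for real-number arithmetic, and both function names are hypothetical.

```python
from fractions import Fraction


def sum_rounding_each_step(values):
    # Every addition rounds to the nearest double, so errors accumulate.
    total = 0.0
    for v in values:
        total = total + v
    return total


def sum_rounding_once(values):
    # Additions are exact; only the final conversion rounds.
    total = Fraction(0)
    for v in values:
        total += Fraction(v)
    return float(total)


values = [0.1] * 10
print(sum_rounding_each_step(values))  # 0.9999999999999999
print(sum_rounding_once(values))       # 1.0
```

A specification written as a series of double-precision steps mandates the first result; one written over the reals with a final rounding step permits only the second.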

OR13 commented 3 years ago

Anyone trying to use Infra for cryptography related specs is likely to encounter this issue, as others have noted... here is a helpful table of the complexity we are walking into just for CBOR:

https://github.com/StableLib/stablelib/blob/b2cd7b5b42eba8d6c7bbdb2263d8808e267e8d42/packages/cbor/cbor.test.ts#L12

I don't see infra escaping this issue without some kind of table for "number-like" things... that gets really specific....

As a spec editor and developer... I prefer to be as precise as possible, and infra is a tool I want to use... I can hear my father yelling at me right now not to use vicegrips when the wrench set is right there....

We don't need all the number types at once... but we need support for JSON number types for sure...

Related:

annevk commented 3 years ago

I think Infra should have (unsigned) integers and IDL can convert its IDL integer values to these Infra integers. Specifications would be responsible for staying within limits. I think we should also allow specifications to state limits (e.g., 16-bit unsigned integer) if that can help with clarity, e.g., consider a URL's port.

As for mathematical operations, it might be good to study canvas/WebGL/SVG/CSS/Web Audio to see what the expectations are. (Looking at https://webaudio.github.io/web-audio-api/ just now it does mention a number of mathematical operations and defines features in terms of them.)

domenic commented 3 years ago

I definitely don't think we should have unsigned integers, i.e. integers which wrap at specific powers of 2.

I'm unsure whether we should have integers at all, or just generically let specs specify their restrictions on real numbers---which could be integer, or could be nonnegative, or could be between 0 and 2**64-1, or could be between 0 and 360... I think IDL would work fine saying "these are real numbers which are always integers and always between 0 and 2**(type-dependent value)".

annevk commented 3 years ago

That seems fine, but shorthand phrases would be useful and if we don't make them, others will.

bakkot commented 3 years ago

For JS, we have math equations, so we are using the reals. For web specs, that might not translate.

A little more detail here: there are a few kinds of arithmetic in the JS spec.

I don't think you can really get away from having to worry about this stuff. Notably, when we were re-introducing the distinction between mathematical values and JS Numbers to ECMA-262, we came across multiple divergences in engines which we believe to have been a direct result of the previous conflation of those types, on top of issues with nonsensical definitions. From my experience working on this in the JS spec, I think that having this distinction makes it more likely that authors and readers will consider edge cases with floating point arithmetic which must be explicitly handled before any arithmetic can occur.