Looking for scanf() (formatted input on streams)

hdante commented 2 years ago

Zig Version

0.9.1

Steps to Reproduce

Hello, is there a formatted input function in the standard library ?

Example in C:

#include <stdio.h>

int main(void)
{
        unsigned int age;

        puts("How old are you ?");

        if (scanf("%u", &age) > 0) {
                printf("Your age: %u\n", age);
        }

        return 0;
}

Expected Behavior

Example in Zig:

const std = @import("std");
const stdout = std.io.getStdOut().writer();
const stdin = std.io.getStdIn().parser();

pub fn main() void {
    var age: u32 = undefined;

    stdout.writeAll("How old are you ?\n") catch {};

    if (stdin.parse("{}", .{&age}) > 0) {
            stdout.print("Your age: {}\n", .{age}) catch {};
    }
}

Actual Behavior

I can't find a parser API.

tauoverpi commented 2 years ago

Questions like this are better to ask in one of the communities rather than the issue tracker as the issue tracker is only for bug reports and this is not a bug.

hdante commented 2 years ago

Hello, I've noticed the distinction and decided in the end that lacking formatted input is a bug in the standard library. If there is already support for it, then it's ok to close the bug report.

pfgithub commented 2 years ago

There are integer parsing functions available in the standard library.

const std = @import("std");
const stdout = std.io.getStdOut().writer();
const stdin = std.io.getStdIn().reader();

pub fn main() !void {
    const alloc = std.testing.allocator;

    stdout.writeAll("How old are you ?\n") catch {};

    const age_str = try stdin.readUntilDelimiterAlloc(alloc, '\n', 100);
    defer alloc.free(age_str);
    const age = try std.fmt.parseInt(u32, age_str, 10);
    stdout.print("Your age: {}\n", .{age}) catch {};
}

Do you have an example of a use case where a scanf like function would be useful?

wooster0 commented 2 years ago

I actually wanted to look into making such a function a while ago because it can actually be pretty useful if you know how your input will look and you just want to parse some specific values inbetween some specific characters. Currently you have to do all that manually using std.io and std.mem functions etc. and parsing it that way can become quite long and hard to read. Compared to that scanf is very easy to read and understand. Often it's probably pretty efficient too. We could call it std.fmt.scan or std.fmt.parse.

@pfgithub I do: https://github.com/ziglang/zig/blob/8ad67cf30864d0087424a681bb1f61c2c51950c2/lib/std/Progress.zig#L240-L245 So if we had scanf and we called it std.fmt.scan, I could've done something like this:

var row: u8 = undefined;
var column: u8 = undefined;
std.fmt.scan(file, "\x1b[{d};{d}R", .{&row, &column});

which I think would be very cool and would be infinitely more readable than what I wrote in that PR.

We might want to put it into std.io though because it takes an std.fs.File in my example. Having a version that accepts a buffer would also be useful though.

My final proposal is to have two versions:

std.io.Reader.scan
std.fmt.scan - takes a buffer and internally uses std.io.fixedBufferStream + std.io.Reader.scan.

So my example above would become:

var row: u8 = undefined;
var column: u8 = undefined;
file.reader().scan("\x1b[{d};{d}R", .{&row, &column});

and if I had a buffer I could do std.fmt.scan(buf, "\x1b[{d};{d}R", .{&row, &column});.

FireFox317 commented 2 years ago

If somebody starts implementing this, you can use the following closed PR as a starting point (or do it all over again): #7158

hdante commented 2 years ago

Yes, I'm currently using the individual buffer parsing functions, like std.fmt.parseUnsigned(). But they're different, though, scanf() parses formatted input, not only the individual data types (it "parses a format", not the data types). To make it clear what I'm looking for, I'll define what I mean by "formatted input on streams":

Parse a format, not just the data types
Support implicit buffer handling (optional, but required for minimally fast text parsing)
Formatted output easily parsed by the formatted input functions ("reversibility", optional, but it would make the formatted input functions useless for certain types of programs)

For example, in the original C code, scanf() is parsing this following input description: skip any leading whitespaces (including spaces, tabs, newlines, etc.), interpret input as a modular number, interpret one unsigned integer:

[hdante@dragonmount tmp]$ ./age 
How old are you ?

      1001
Your age: 1001
[hdante@dragonmount tmp]$ ./age 
How old are you ?
                -40
Your age: 4294967256
[hdante@dragonmount tmp]$

So, besides being visually very different, and very different to write too, @pfgithub's example is also fundamentally different, because it's parsing the format in an ad-hoc way, instead of parsing it by a description. It's also not equivalent to the C version in its implied input description.

In the specific case of scanf(), the format is described by a domain-specific mini-language (%-something), but so that we don't get stuck with scanf() as the only example, one notorious formatted input parser is available in the iostreams C++ library:

input >> name >> age >> coffee;

In this case, parsing the format does not use a mini-language, it uses language features (shift operator, operator overloading, friend classes) for the format description. The iostream libraries accepts primitive types, I/O manipulators and custom types to describe the input format.

I think the original C example is already a good example for comparision between parsing a format and parsing primitive data types, but it's possible to expand it slightly and complicate the equivalent manual parsing considerably (if you need to parse the same input format), similar to the example given for the terminal cursor position query:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char name[30];
        unsigned age;
        char coffee[30];

        puts("What's your name ? How old are you ? Do you want coffee ?");
        if (scanf(" %29s %u %29s", name, &age, coffee) == 3) {
                printf("Your name is: %s.\n", name);
                printf("Your age is: %u.\n", age);
                if (strstr(coffee, "yes"))
                        puts("Here is your coffee: ☕.");
        }

        return 0;
}

Off course, there's no need to be able to parse exactly the same language as the C version, but the idea is that text input parsing has some goals and some applications, which explain this partly free-form way of interpreting text input. I can think of at least 5 use cases (some more historical than others) there will probably be more:

Implementing interactive terminal programs (examples: text-mode games, cryptographic key and certificate files generation)
Parsing some types of configuration file formats and command line arguments (examples: parsing host and service names, resolv.conf, some files in /etc suitable for textual parsing)
Parsing textual data files (graph plot, 3D mesh data)
Implementing text filter and processing programs (examples: any Unix filter program, such as sed, awk, grep, which have their own domain specific languages)
Parsing textual communication protocols (example: the GPIB/SCPI command set is fully textual and might, with some effort, be parsed with scanf())

I think the last example is really good, because it may be used in embedded systems with very little effort and very low buffering requirements (and no need for an allocator). Here's an example a single shot oscilloscope data acquisition using GPIB commands:

*RST
*IDN?
DATA:SOURCE CH2
DATA:ENC SRI; DATA:WIDTH 2; DATA:START 1; DATA STOP 2500
SELECT:CH1 ON
CURVE?
*WAI

Nobody gave a definitive answer, but right now I'm inferring by the previous comments that the Zig standard library does not have functions for doing formatted text input. I think having something like fmt.parseFormatBuffer() that only implements rule number 1 described above would be a great start. Also, maybe some function that can match the scanf() mini-language, like C++ does, would be a good idea ?

pfgithub commented 2 years ago

An alternative possible approach to a dsl/formatted scan function could be like this:

char name[30];
unsigned age;
char coffee[30];

puts("What's your name ? How old are you ? Do you want coffee ?");
if (scanf(" %29s %u %29s", name, &age, coffee) == 3) {

→

var name_buf: [30]u8 = undefined;
var coffee_buf: [30]u8 = undefined;

try scan.whitespace(reader);
const name = try scanString(reader, &name_buf);
try scan.whitespace(reader);
const age = try scan.int(reader, u64);
try scan.whitespace(reader);
const coffee = try scan.str(reader, &coffee_buf);

This would likely be easier to implement and extend but might not be as nice to code with

Update: I realise now that scan.whitespace() would be impossible to implement on top of the reader api. This would either have to only support fixedBufferStreams or be changed so methods like scanString allow you to specify that they should ignore preceding whitespace.

hdante commented 2 years ago

Yes, it would be pretty good already,

wooster0 commented 2 years ago

One thing I wanted to note is that I think the way C does it in that scanf("%u", &age) implicitly ignores leading and trailing whitespace is bad and we should make it explicit. So I think what we're really looking for here is just a powerful way of expressing the input. I think we want something flexible. And honestly regular expressions come to mind here because it would allow us to describe the input in a really versatile way.

@pfgithub I think that is also an interesting approach. We would really describe it in terms of language constructs here rather than a string that we have to parse and then interpret. It is however quite long of course and I think it's too far from the way the input looks. "{d};{d}" to match "256;256" is just much easier to understand and to read so ultimately I think we'll want to go with a format described in text.

So with regex we could do scanf("\s*\d\s*", &age) instead of scanf("%u", &age) in the original. This would be a very versatile way to describe the format. We could entirely parse the regex format at comptime of course. However I don't think this is great anyway. We can come up with something better. It would be very complicated. And also I think regex is very much a write once read never thing.

@hdante what did you think of my approach above, anyway? Maybe we can extend it to allow the specification of "ignore any whitespace here" etc.? Maybe we can add special specifier types? For example maybe std.fmt.scan("%w*{d}%w*", .{&age}) can be the scanf("%u", &age) from your example and %w means whitespace and * means "any length". However, actually, which characters exactly does %w ignore? That is not clear. I don't think this is good either.

Anyway if we even want to go this direction, we should create a specification for the actual format of course. Maybe we can come up with something simple but powerful.

However maybe we'll come to the conclusion that we might just reinvent regex.

Conclusion and my actual proposal

Ultimately I think we'll just want a simple format that doesn't let you do those advanced things in a text format. It should be a similar format to the one std.fmt.format has. So for simple things like my example, std.fmt.scan("\x1b[{d};{d}R", .{&row, &column}); will work perfectly. But then let's say you want to do something advanced like ignore any whitespace after the first semicolon in that format. We could just do this:

    var row: u8 = undefined;
    var column: u8 = undefined;
    var scanner = std.io.fixedBufferStream("hello").reader();

    try scanner.scan("\x1b[{d}", .{&row});
    while (true) {
        const byte = try scanner.readByte();
        if (!std.ascii.isSpace(byte)) {
            scanner.context.pos -= 1;
            break;
        }
    }
    try scanner.scan(";{d}R", .{&column});

Which as you can see makes use of the std.io.Reader.scan variant I proposed above. So I stay with my idea to introduce std.io.Reader.scan(fmt, args) and std.fmt.scan(buf, fmt, args).

Of course you can always abstract the while that skips whitespace but that's a different thing.

So this is actually a hybrid of describing the input format through language constructors and strings. I think this'll take the best of both worlds and be a very simple solution. Any thoughts on this?

hdante commented 2 years ago

@r00ster91, in my opinion there's room for a complete regex API, plus a strict compatible scanf(), plus the more typical, default, easy to use, expressive and recommended API that maps well to the language features, all three either in the standard library, or in third party libraries, depending on the design goals of the standard library.

In my opinion, the input API should closely resemble the output API, so, giving an example for all the cases cited:

// Trivial data type parsing
n = try stdin.scan("{}", .{&age});
// Skip specific characters and parse 2 values
n = try stdin.scan("\x1b[{};{}R", .{&row, &column});
// 3 values, skipping whitespace before each one
n = try stdin.scan(" {} {} {}", .{&name, &age, &coffee});
// 1 value, skip whitespace before and after
n = try stdin.scan(" {} ", .{&age});
// 2 values, skipping fixed characters and whitespace before second
n = try stdin.scan("\x1b[{}; {}R", .{&row, &column});

In my opinion, strict scanf() compatibility should be avoided, because it would require carrying historical design bugs forward:

[hdante@dragonmount tmp]$ ./age 
How old are you ?
10
Your age: 10
[hdante@dragonmount tmp]$ ./age 
How old are you ?
4294967306
Your age: 10
[hdante@dragonmount tmp]$ ./age 
How old are you ?
18446744073709551626
Your age: 4294967295

Notice in this example that, since scanf() parses an unsigned int with an unsigned long parser internally, it has modular behavior between UINT_MAX (32-bit) and ULONG_MAX (64-bit), but saturating behavior after ULONG_MAX. This is no design goal, it fullfils no one's needs, it's just a design bug. That's why I think it makes sense to have a separate strict compatibility scanf() parser for those who might need to port code, for example, even though I'm not sure if it would be better to just link with libc for that. To implement something like this explicitly, it would be necessary to add some attributes to the input format, so, if scanf() compatibility is actually desired, an explicit description for the original C program would be something like this:

n = try stdin.scan(" {lu%}", .{&age});

The parameter description should state which parser it should use, in this case, "lu" for strtoul() and the percentage symbol for modular behavior within the parsed range. Leading whitespace is also necessary to skip initial whitespace.

On the other hand, things that scanf() does right should be preserved, the obvious aspect is the compact, expressive syntax of the mini-language, and the other relevant example in this case is that it gives enough importance to skipping whitespace that it has its own trivial explicit symbol: a single whitespace:

scanf(" %c", &not_whitespace);

Whitespace skipped by scanf() is defined by isspace(). isspace() only supports single-byte characters, so Unicode spaces are not supported. isspace() for ascii means ' ', '\f', '\n', '\r', '\t' and '\v'. scanf() is really well explained in the linux manual pages, it describes the POSIX version very easily, plus GNU extensions. I recommend everyone reading it: scanf(3).

In my opinion giving access to the parser cursor and mixing API functions with hand-written code would be a great plus, as long as it doesn't complicate or slow down the API code (which I don't think it will). Example code is ok, but it shouldn't be necessary for skipping whitespace though, which is important enough to have built-in support.

wooster0 commented 2 years ago

As a side note I think it might be useful if we refactor std.fmt.format so that we can reuse most of its parsing code etc. in std.fmt.scan.

ziglang / zig