yshavit / mdq

like jq but for Markdown: find specific elements in a md doc
Apache License 2.0

Add a way to find and filter tabular data #141

Closed: themccubbins closed 1 month ago

themccubbins commented 1 month ago

Find: there are no titles for tables, so maybe the selector could match by column titles, or by some selector on the data.

Select: It would be really cool if you could select particular rows, columns and cells of tables.

yshavit commented 1 month ago

Two questions for you:

  1. How did you want to select particular rows, columns, cells? By index, by searching within them? If searching, do you then want to return only the cell that matched, or something like "return all rows that have foo anywhere in any cell"?
  2. Does any particular syntax come to mind for you? That's something I really struggle with for this.

One thing that just occurred to me is that since bash strings are easy to make multiline, I could potentially have the table selector syntax be multiline somehow. Maybe something like:

mdq '|-|
     | /regex for columns, by header/ |
     | /regex for row; output the row if any column matches/ |'

For example, if I had this table:

| hello | fizz  | world |
|-------|-------|-------|
| one   | three | two   |
| four  | six   | five  |

then this:

|-|
| /o/  |
| five |

would result in:

| hello | world |
|-------|-------|
| four  | five  |
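
To make the semantics of the example above concrete, here is a minimal Python sketch. It is not mdq's implementation, and `select_table` and its parameters are hypothetical names. It assumes the column matcher is a regex tested against each header cell, the row matcher is a plain substring tested against every cell of a data row (not only the selected columns), and that a matching row is output with only the selected columns.

```python
import re

def select_table(header, rows, column_pattern, row_text):
    """Hypothetical sketch of the proposed table selector semantics."""
    # Keep the indices of columns whose header matches the regex.
    keep = [i for i, title in enumerate(header) if re.search(column_pattern, title)]
    # Keep a row if any of its cells contains the row text.
    # Assumption: all cells are checked, not just the kept columns.
    matched = [row for row in rows if any(row_text in cell for cell in row)]
    # Project the header and the matching rows onto the kept columns.
    return [header[i] for i in keep], [[row[i] for i in keep] for row in matched]

header = ["hello", "fizz", "world"]
rows = [["one", "three", "two"], ["four", "six", "five"]]
print(select_table(header, rows, r"o", "five"))
# prints (['hello', 'world'], [['four', 'five']])
```

Whether the row matcher should look at all cells or only at the selected columns is an open design choice; both give the same result for this table.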

In this proposed syntax, |-| is a table selector, but that alone would just match all tables, and return all their data; if you want the additional selector, you have to specify both the row and column selector. That makes it a lot more obvious which is which. Of course, you can always use the empty or * matcher for "any":

mdq '|-|
     | /o/ |
     | *   |'

One additional wrinkle is that right now, I intentionally have the syntax such that every character is unambiguous as you read it -- the only real exception being escapes within quotes. This would break that, since |-| could either be "one token: table selector" or "three tokens: [pipe, list with empty matcher, pipe]". My concern isn't the parsing complexity, but rather the obviousness to a human reader. One option is to use curlies, which I've already considered for advanced #56. That would look something like:

mdq '{|-|}'  # just select tables

mdq '{|-|
      | /o/ |
      | *   |
     }'

How does that strike you?

yshavit commented 1 month ago

Oh, I could do :-: maybe? That mirrors the separator between the header row and data rows:

| Name | Value |
|:----:|:-----:|  <-- this bit
| Foo  | 123   |

The colons on both sides mean "column is centered", so it's a bit of a misnomer to use them for "tables" in general -- but that's probably okay.

Since I always have two matchers, maybe I can do away with all the other table markdown.

mdq ':-: /o/ *'

Hm, maybe I should repeat that between the row and column matchers?

mdq ':-: /o/ :-: five'
mdq ':-: /o/ :-: *'

I think that may be the winner so far.
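
As a rough illustration of how the repeated-marker form could be split into its two matchers, here is a minimal Python sketch. `parse_table_selector` is a hypothetical helper, not part of mdq, and it assumes the matchers themselves never contain the text `:-:` (a real parser would need to respect regex and quote boundaries).

```python
def parse_table_selector(selector):
    """Hypothetical parser for the ':-: <columns> :-: <rows>' form."""
    marker = ":-:"
    if not selector.startswith(marker):
        raise ValueError("table selector must start with ':-:'")
    # Split the remainder on the marker: first part is the column matcher,
    # second part is the row matcher.
    parts = [part.strip() for part in selector[len(marker):].split(marker)]
    if len(parts) != 2:
        raise ValueError("expected exactly one column matcher and one row matcher")
    column_matcher, row_matcher = parts
    return column_matcher, row_matcher

print(parse_table_selector(":-: /o/ :-: five"))  # prints ('/o/', 'five')
print(parse_table_selector(":-: /o/ :-: *"))     # prints ('/o/', '*')
```

Repeating the marker makes the boundary between the two matchers explicit, so the split does not depend on guessing where a regex or unquoted string ends.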