dabrahams opened this issue 3 years ago
This kind of information (line number, column number) will be added to the `Token` structure in the next versions. It is important for error localisation in the parser.
SwiLex's default `Token` type uses `Substring`s as values (rather than `String`s); the goal was to keep a relation (a view of the base string with shared indices) between the given input and the tokens, in both the lexer and the parser.
As you mentioned, this solution cannot be used with SwiParse since the lexing is done manually, and using a lexer (without separators, etc.) adds some unnecessary constraints.
I suggest the following solution. First, we need the indices of all the `\n` characters in the input string:

```swift
// the indices of all the newline characters in the base string
let newlineIndices = input.indices.filter { input[$0] == "\n" }
```
Then, with `position(of:)`, we can get the line and column numbers (starting from (0, 0)) of a given token:
```swift
/// - Returns: A tuple of the line number and the column number of
///   a token in its base string.
func position(of token: Substring) -> (Int, Int) {
    // the index of the previous newline character, or -1 if no
    // newline is found before the token
    let previousNewlineIndex = previousIndex(of: token.startIndex,
                                             in: newlineIndices)
    let lineStartIndex: String.Index
    if previousNewlineIndex >= 0 {
        // start counting the column number from the character after
        // the previous newline character in the base string
        lineStartIndex = token.base.index(after: newlineIndices[previousNewlineIndex])
    } else {
        // if no previous newline is found, the column number
        // counting starts from the beginning of the base string
        lineStartIndex = token.base.startIndex
    }
    let line = previousNewlineIndex + 1
    let column = token.base.distance(from: lineStartIndex,
                                     to: token.startIndex)
    return (line, column)
}
```
```swift
/// Binary search.
/// - Parameters:
///   - element: a `Comparable` item.
///   - array: an array of `Comparable` items, sorted in ascending order.
/// - Returns: the index of the element of `array` preceding `element`,
///   or -1 if no element in `array` precedes it.
func previousIndex<T: Comparable>(of element: T, in array: [T]) -> Int {
    var beg = 0
    var end = array.count - 1
    while end >= beg {
        let mid = beg + (end - beg) / 2
        if array[mid] == element { return mid }
        else if array[mid] < element { beg = mid + 1 }
        else { end = mid - 1 }
    }
    return beg - 1
}
```
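To sanity-check the two helpers outside of SwiParse, here is a self-contained snippet that repeats them verbatim; the sample input and the manual token extraction at the end are mine, added purely for illustration:

```swift
let input = """
Hello
this is
1 test
"""

let newlineIndices = input.indices.filter { input[$0] == "\n" }

// Same binary search as above: index of the largest element that
// precedes `element`, or -1 if none does.
func previousIndex<T: Comparable>(of element: T, in array: [T]) -> Int {
    var beg = 0
    var end = array.count - 1
    while end >= beg {
        let mid = beg + (end - beg) / 2
        if array[mid] == element { return mid }
        else if array[mid] < element { beg = mid + 1 }
        else { end = mid - 1 }
    }
    return beg - 1
}

// Same position computation as above.
func position(of token: Substring) -> (Int, Int) {
    let previousNewlineIndex = previousIndex(of: token.startIndex,
                                             in: newlineIndices)
    let lineStartIndex: String.Index
    if previousNewlineIndex >= 0 {
        lineStartIndex = token.base.index(after: newlineIndices[previousNewlineIndex])
    } else {
        lineStartIndex = token.base.startIndex
    }
    let line = previousNewlineIndex + 1
    let column = token.base.distance(from: lineStartIndex, to: token.startIndex)
    return (line, column)
}

// "test" is the last 4 characters: third line (index 2), third column (index 2).
let tokenStart = input.index(input.endIndex, offsetBy: -4)
let token = input[tokenStart...]   // "test"
print(position(of: token))         // (2, 2)
```

Because `token` is a `Substring` sharing `input` as its base, no index translation is needed; the same property is what makes this work inside reduction actions.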
This solution can be used directly in the reduction actions of the SwiParse parser. In fact, each token is by default a terminal (since it was generated by a lexer), so it will be reduced into a non-terminal using a grammar rule (and a reduction action). The reduction action takes the token's substring as a parameter, so the position (line, column) can be computed at that point, during the parsing.
Here is a simple example using this solution. The parser concatenates the positions (line number, column number) of each token into a String:
```swift
enum Terminal: String, SwiLexable {
    static var separators: Set<Character> = [" ", "\n"]

    case text = "[A-Za-z]*"
    case number = "[0-9]*"
    case eof
    case none
}

enum NonTerminal: SwiParsable {
    case start  // starting token
    case concat
}

let input = """
Hello
this is
1 test
"""

let newlineIndices = input.indices.filter { input[$0] == "\n" }

func singlePosition(s: Substring) -> String { "\(position(of: s))" }
func concatPosition(s1: String, s2: Substring) -> String { s1 + "\(position(of: s2))" }

let parser = try SwiParse<Terminal, NonTerminal>(rules: [
    .start => [Word(.concat)],  // accepting rule
    .concat => [Word(.text)] >>>> singlePosition,
    .concat => [Word(.number)] >>>> singlePosition,
    .concat => [Word(.concat), Word(.text)] >>>> concatPosition,
    .concat => [Word(.concat), Word(.number)] >>>> concatPosition,
])

let result = try parser.parse(input: input)
print(result)
// Optional("(0, 0)(1, 0)(1, 5)(2, 0)(2, 2)")
```
As mentioned before, this information (line number, column number, size, ...) will be in the `Token` struct in the next versions (especially for localised parsing errors). I just need to figure out how the user could interact with it, since the user only provides reduction actions on the substring (and not on the `Token` itself). The goal is also to keep the API as simple as possible.
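One possible shape for that interaction, sketched here purely as a hypothetical design and not as anything SwiParse actually provides, is to let reduction actions opt in to a richer token type while the substring-only signature keeps working:

```swift
// Hypothetical sketch only: a Token carrying its position, plus two
// styles of reduction action. None of these names exist in SwiParse.
struct Token {
    let value: Substring
    let line: Int
    let column: Int
}

// A user who only cares about the text keeps the simple signature...
func plainAction(s: Substring) -> String { String(s) }

// ...while a user who wants positions opts in to the richer type.
func locatedAction(t: Token) -> String { "\(t.value)@(\(t.line), \(t.column))" }

let input = "Hello"
let token = Token(value: input[input.startIndex...], line: 0, column: 0)
print(locatedAction(t: token))   // Hello@(0, 0)
```

An overload-based design like this keeps the common case as simple as today's API, at the cost of a second accepted action signature per rule.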
An abstraction of the Lexer (with a generic Token), support for different parsers, debugging tools (with graphviz), visualisation of the AST, support for files, ... are all planned. I just need some free time (and maybe contributors 😉)
For example, I want to keep track of source location (line, column, etc.) and bundle that in with my tokens. One way I can do that is to not have any separators, recognize whitespace explicitly, and then filter the tokens myself:
But there's really no way to interpose any of this into the process. Ideally the parser would be able to operate on a generic token type and could take a token stream rather than a string as input.
I can probably get out of this particular jam by mapping substring indices back into source locations, but it's quite awkward, and I'm pretty sure people will want to make their own lexers.
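The workaround described above (no separators, whitespace as explicit tokens, filtered out afterwards) can be sketched in plain Swift, independent of SwiLex; the tokenizing loop here is illustrative, not library API:

```swift
let input = "Hello\nworld"

// Tokenize into maximal runs of whitespace / non-whitespace characters,
// keeping each run as a Substring of the original input.
var tokens: [Substring] = []
var start = input.startIndex
for i in input.indices {
    if input[i].isWhitespace != input[start].isWhitespace {
        tokens.append(input[start..<i])
        start = i
    }
}
tokens.append(input[start...])

// Drop the whitespace tokens, but keep the survivors as Substrings so
// their indices still map back into the original source for position lookup.
let meaningful = tokens.filter { $0.first?.isWhitespace == false }
print(meaningful)   // ["Hello", "world"]
```

Since the filtered tokens are still `Substring`s of `input`, a position helper like the one above continues to work on them; the awkward part is only that this bookkeeping lives outside the parser.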