dabrahams opened this issue 3 years ago
This kind of information (line number, column number) will be added to the `Token` structure in the next versions. It is important for error localisation in the parser.
SwiLex's default `Token` type uses `Substring`s as values (rather than `String`s); the goal was to keep a relation (a view of the base string with shared indices) between the given input and the tokens, in both the lexer and the parser.
As you mentioned, this solution cannot be used with SwiParse since the lexing is done manually, and using a lexer (without separators, etc.) adds some unnecessary constraints.
I suggest the following solution. First, we need the indices of all the `\n` characters in the input string:

```swift
// the indices of all the newline characters in the base string
let newlineIndices = input.indices.filter { input[$0] == "\n" }
```
Then, with `position(of:)`, we can get the line and column numbers (starting from (0, 0)) of a given token:
```swift
/// - Returns: A tuple of the line number and the column number of
///   a token in its base string.
func position(of token: Substring) -> (Int, Int) {
    // the index of the previous newline character, or -1 if no
    // newline is found before the token
    let previousNewlineIndex = previousIndex(of: token.startIndex,
                                             in: newlineIndices)
    let lineStartIndex: String.Index
    if previousNewlineIndex >= 0 {
        // start counting the column number from the character after
        // the previous newline character in the base string
        lineStartIndex = token.base.index(after: newlineIndices[previousNewlineIndex])
    } else {
        // if no previous newline is found, the column number
        // counting starts from the beginning of the base string
        lineStartIndex = token.base.startIndex
    }
    let line = previousNewlineIndex + 1
    let column = token.base.distance(from: lineStartIndex,
                                     to: token.startIndex)
    return (line, column)
}
```
```swift
/// Binary search.
/// - Parameters:
///   - element: a `Comparable` item.
///   - array: an array of `Comparable` items, sorted in ascending order.
/// - Returns: the index of the element of `array` preceding `element`,
///   or -1 if no element in `array` precedes it.
func previousIndex<T: Comparable>(of element: T, in array: [T]) -> Int {
    var beg = 0
    var end = array.count - 1
    while end >= beg {
        let mid = beg + (end - beg) / 2
        if array[mid] == element { return mid }
        else if array[mid] < element { beg = mid + 1 }
        else { end = mid - 1 }
    }
    return beg - 1
}
```
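To sanity-check the two helpers outside of SwiParse, here is a self-contained snippet that repeats them verbatim; the sample input and the manual token extraction at the end are mine, added purely for illustration:

```swift
let input = """
Hello
this is
1 test
"""

let newlineIndices = input.indices.filter { input[$0] == "\n" }

// Same binary search as above: index of the largest element that
// precedes `element`, or -1 if none does.
func previousIndex<T: Comparable>(of element: T, in array: [T]) -> Int {
    var beg = 0
    var end = array.count - 1
    while end >= beg {
        let mid = beg + (end - beg) / 2
        if array[mid] == element { return mid }
        else if array[mid] < element { beg = mid + 1 }
        else { end = mid - 1 }
    }
    return beg - 1
}

// Same position computation as above.
func position(of token: Substring) -> (Int, Int) {
    let previousNewlineIndex = previousIndex(of: token.startIndex,
                                             in: newlineIndices)
    let lineStartIndex: String.Index
    if previousNewlineIndex >= 0 {
        lineStartIndex = token.base.index(after: newlineIndices[previousNewlineIndex])
    } else {
        lineStartIndex = token.base.startIndex
    }
    let line = previousNewlineIndex + 1
    let column = token.base.distance(from: lineStartIndex, to: token.startIndex)
    return (line, column)
}

// "test" is the last 4 characters: third line (index 2), third column (index 2).
let tokenStart = input.index(input.endIndex, offsetBy: -4)
let token = input[tokenStart...]   // "test"
print(position(of: token))         // (2, 2)
```

Because `token` is a `Substring` sharing `input` as its base, no index translation is needed; the same property is what makes this work inside reduction actions.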
This solution can be used directly in the reduction actions of the SwiParse parser. In fact, each token is by default a terminal (since it was generated by a lexer), so it will be reduced into a non-terminal using a grammar rule (and a reduction action). The reduction action takes the token's substring as a parameter, so the position (line, column) can be computed at that point, during the parsing.
Here is a simple example using this solution. The parser concatenates the positions (line number, column number) of each token into a String:
```swift
enum Terminal: String, SwiLexable {
    static var separators: Set<Character> = [" ", "\n"]

    case text = "[A-Za-z]*"
    case number = "[0-9]*"
    case eof
    case none
}

enum NonTerminal: SwiParsable {
    case start  // starting token
    case concat
}

let input = """
Hello
this is
1 test
"""

let newlineIndices = input.indices.filter { input[$0] == "\n" }

func singlePosition(s: Substring) -> String { "\(position(of: s))" }
func concatPosition(s1: String, s2: Substring) -> String { s1 + "\(position(of: s2))" }

let parser = try SwiParse<Terminal, NonTerminal>(rules: [
    .start => [Word(.concat)],  // accepting rule
    .concat => [Word(.text)] >>>> singlePosition,
    .concat => [Word(.number)] >>>> singlePosition,
    .concat => [Word(.concat), Word(.text)] >>>> concatPosition,
    .concat => [Word(.concat), Word(.number)] >>>> concatPosition,
])

let result = try parser.parse(input: input)
print(result)
// Optional("(0, 0)(1, 0)(1, 5)(2, 0)(2, 2)")
```
As mentioned before, this information (line number, column number, size, ...) will be in the `Token` struct in the next versions (especially for localised parsing errors). I just need to figure out how the user could interact with it, since the user only provides reduction actions on the substring (and not on the `Token` itself). The goal is also to keep the API as simple as possible.
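One possible shape for that interaction, sketched here purely as a hypothetical design and not as anything SwiParse actually provides, is to let reduction actions opt in to a richer token type while the substring-only signature keeps working:

```swift
// Hypothetical sketch only: a Token carrying its position, plus two
// styles of reduction action. None of these names exist in SwiParse.
struct Token {
    let value: Substring
    let line: Int
    let column: Int
}

// A user who only cares about the text keeps the simple signature...
func plainAction(s: Substring) -> String { String(s) }

// ...while a user who wants positions opts in to the richer type.
func locatedAction(t: Token) -> String { "\(t.value)@(\(t.line), \(t.column))" }

let input = "Hello"
let token = Token(value: input[input.startIndex...], line: 0, column: 0)
print(locatedAction(t: token))   // Hello@(0, 0)
```

An overload-based design like this keeps the common case as simple as today's API, at the cost of a second accepted action signature per rule.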
An abstraction of the Lexer (with a generic Token), support for different parsers, debugging tools (with graphviz), visualisation of the AST, support for files, ... are all planned. I just need some free time (and maybe contributors 😉)
For example, I want to keep track of source location (line, column, etc.) and bundle that in with my tokens. One way I can do that is to not have any separators, recognize whitespace explicitly, and then filter the tokens myself:
But there's really no way to interpose any of this into the process. Ideally the parser would be able to operate on a generic token type and could take a token stream rather than a string as input.
I can probably get out of this particular jam by mapping substring indices back into source locations, but it's quite awkward, and I'm pretty sure people will want to make their own lexers.
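The workaround described above (no separators, whitespace as explicit tokens, filtered out afterwards) can be sketched in plain Swift, independent of SwiLex; the tokenizing loop here is illustrative, not library API:

```swift
let input = "Hello\nworld"

// Tokenize into maximal runs of whitespace / non-whitespace characters,
// keeping each run as a Substring of the original input.
var tokens: [Substring] = []
var start = input.startIndex
for i in input.indices {
    if input[i].isWhitespace != input[start].isWhitespace {
        tokens.append(input[start..<i])
        start = i
    }
}
tokens.append(input[start...])

// Drop the whitespace tokens, but keep the survivors as Substrings so
// their indices still map back into the original source for position lookup.
let meaningful = tokens.filter { $0.first?.isWhitespace == false }
print(meaningful)   // ["Hello", "world"]
```

Since the filtered tokens are still `Substring`s of `input`, a position helper like the one above continues to work on them; the awkward part is only that this bookkeeping lives outside the parser.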