One evening I though it would be cool to implement parser combinators in Scheme. The results are explained in this blog post.
Download: parser-combinators.scm
The code will run in Chicken Scheme.
We can define a parser as a procedure which takes an input string and an index (i.e. the index of the current character on the input). If the parse succeeds, it returns a value and an index ≥ the original index to the unconsumed input. On failure it returns #f for the both the value and index (the index is checked for failure and the value is ignored).
The two simplest parsers are fail, which always fails, and (return v), which always succeeds and returns v without consuming any input.
(define fail (lambda (s i) (values #f #f)))
(define (return v) (lambda (s i) (values v i)))
The parser any-char removes one character of the input or fails if the end of the input is reached.
(define any-char
(lambda (s i)
(if (< i (string-length s))
(values (string-ref s i) (+ i 1))
(values #f #f))))
On their own, parser procedures are cumbersome. Haskell provides a neat solution: do notation. Now for some Macro Magic. Let’s write a macro called parser which allows us to write Haskell-style code.
(define-for-syntax *v-name* (gensym 'v))
(define-for-syntax *s-name* (gensym 's))
(define-for-syntax *i-name* (gensym 'i))
(define-for-syntax (expand-parser-body forms)
(match forms
[(_ '<- _)
(error "parser must end with non-binding form")]
[(p)
`(,p ,*s-name* ,*i-name*)]
[(v '<- p . xs)
`(let-values ([(,v ,*i-name*) (,p ,*s-name* ,*i-name*)])
(if ,*i-name*
,(expand-parser-body xs)
(values #f #f)))]
[(p . xs)
`(let-values ([(,*v-name* ,*i-name*) (,p ,*s-name* ,*i-name*)])
(if ,*i-name*
,(expand-parser-body xs)
(values #f #f)))]))
(define-macro (parser . body)
`(lambda (,*s-name* ,*i-name*)
,(expand-parser-body body)))
Now we can write parsers which look and work just like Haskell code!
(define two-chars-swap
(parser
a <- any-char
b <- any-char
(return (string b a))))
This example shows how we can use the any-char parser to write a parser to read two characters returning the string of the characters in reverse order. The results of any-char are bound to local variables which can be used immediately after definition.
How does this work? Let’s look at what two-chars-swap expands into.
1: (lambda (s3 i4)
2: (let-values ([(a i4) (any-char s3 i4)])
3: (if i4
4: (let-values ([(b i4) (any-char s3 i4)])
5: (if i4
6: ((return (string b a)) s3 i4)
7: (values #f #f)))
8: (values #f #f))))
In line 1 we see the parser is actually a function, which takes a string (s3) and an index (i4), as expected (these are gensym’d variable names). Line 2 binds the result of calling any-char with the string and index to the a variable defined by the user. Notice how the call is implicit in the unexpanded form. The new index is re-bound to i4, shadowing the original index. Line 3 tests the index to see if the parse failed. Failure here causes failure for the whole parser (line 8). Otherwise we continue and call any-char again with the new index, binding the result to b and shadowing i4 just like in line 2. Line 5 checks for failure (line 7). Finally we call the (return (string b a)) parser, whose result is also the result of the whole parser.
No parser demo is complete without an implementation of an infix calculator language. One disadvantage of parser combinators is that left-recursion is not allowed (it will cause an infinite loop). This can be overcome by using a loop to slurp up sequences of operators of the same precedence, e.g. 1 + 2 + 3. See this page for a good explanation. Curiously, most examples I found on the web were wrong (they would parse "1 + 2" ignoring "+ 3"). I guess nobody actually tests their grammars.
Before we can write interesting parsers we need a few low-level utilities. The choice procedure takes a list of parsers and returns a parser which tries each parser in sequence until one of them succeeds. This gives us basic backtracking.
(define (choice . ps)
(lambda (s i)
(let loop ([p ps])
(if (pair? p)
(let-values ([(v i) ((car p) s i)])
(if i
(values v i)
(loop (cdr p))))
(values #f #f)))))
The matches procedure returns a parser which matches a particular string or fails. This is useful for matching symbols or keywords.
(define (matches m)
(lambda (s i)
(let ([n (string-length m)])
(if (and (<= (+ i n) (string-length s))
(string=? m (substring s i (+ i n))))
(values (substring s i (+ i n)) (+ i n))
(values #f #f)))))
The while-char procedure returns a parser which accepts characters while the character predicate holds. while1-char works similarly but requires at least one character.
(define (while-char pred)
(lambda (s i)
(let ([len (string-length s)])
(let loop ([j i])
(if (and (< j len) (pred (string-ref s j)))
(loop (+ j 1))
(values (substring s i j) j))))))
(define (while1-char pred)
(parser
s <- (while-char pred)
(if (> (string-length s) 0) (return s) fail)))
In the calculator language we need to match decimal integers. Thanks to while1-char this is very easy!
(define decimal
(parser
s <- (while1-char digit?)
(return (string->number s))))
To accept spaces around numbers and operators we use token which slurps up spaces before calling the given parser.
(define (token p)
(parser
(while-char space?)
x <- p
(return x)))
Finally we have enough to define the classical term/factor/expr parser. This version returns the s-expression instead of calculating the result.
(define expr
(parser
lhs <- term
(let loop ([lhs lhs])
(choice
(parser
opr <- (token (choice (matches "+") (matches "-")))
rhs <- term
(loop (list (string->symbol opr) lhs rhs)))
(return lhs)))))
(define term
(parser
lhs <- factor
(let loop ([lhs lhs])
(choice
(parser
opr <- (token (choice (matches "*") (matches "/")))
rhs <- factor
(loop (list (string->symbol opr) lhs rhs)))
(return lhs)))))
(define factor
(choice
(token decimal)
(parser
(token (matches "("))
e <- expr
(token (matches ")"))
(return e))
(parser
(token (matches "-"))
e <- factor
(return (list '- e)))))
Now we can use test-parser (see the code file for details) to test the expression parser.
#;1> (test-parser expr)
>> 1 + 5/3 * (8 + (9 - -4)) / (7*7 + 6) + 2
Parsed : "1 + 5/3 * (8 + (9 - -4)) / (7*7 + 6) + 2" (40 characters)
Returned : (+ (+ 1 (/ (* (/ 5 3) (+ 8 (- 9 (- 4)))) (+ (* 7 7) 6))) 2)