Perpetual Weekend

11 March 2008

Haskell-style Parser Combinators in Scheme

Filed under: Uncategorized — Tags: , , , , , , — shaurz @ 2:53

One evening I though it would be cool to implement parser combinators in Scheme. The results are explained in this blog post.

Download: parser-combinators.scm

The code will run in Chicken Scheme.

We can define a parser as a procedure which takes an input string and an index (i.e. the index of the current character on the input). If the parse succeeds, it returns a value and an index ≥ the original index to the unconsumed input. On failure it returns #f for the both the value and index (the index is checked for failure and the value is ignored).

The two simplest parsers are fail, which always fails, and (return v), which always succeeds and returns v without consuming any input.

(define fail (lambda (s i) (values #f #f)))
(define (return v) (lambda (s i) (values v i)))

The parser any-char removes one character of the input or fails if the end of the input is reached.

(define any-char
  (lambda (s i)
    (if (< i (string-length s))
        (values (string-ref s i) (+ i 1))
        (values #f #f))))

On their own, parser procedures are cumbersome. Haskell provides a neat solution: do notation. Now for some Macro Magic. Let’s write a macro called parser which allows us to write Haskell-style code.

(define-for-syntax *v-name* (gensym 'v))
(define-for-syntax *s-name* (gensym 's))
(define-for-syntax *i-name* (gensym 'i))

(define-for-syntax (expand-parser-body forms)
  (match forms
    [(_ '<- _)
       (error "parser must end with non-binding form")]
    [(p)
       `(,p ,*s-name* ,*i-name*)]
    [(v '<- p . xs)
       `(let-values ([(,v ,*i-name*) (,p ,*s-name* ,*i-name*)])
          (if ,*i-name*
              ,(expand-parser-body xs)
              (values #f #f)))]
    [(p . xs)
       `(let-values ([(,*v-name* ,*i-name*) (,p ,*s-name* ,*i-name*)])
          (if ,*i-name*
              ,(expand-parser-body xs)
              (values #f #f)))]))

(define-macro (parser . body)
  `(lambda (,*s-name* ,*i-name*)
     ,(expand-parser-body body)))

Now we can write parsers which look and work just like Haskell code!

(define two-chars-swap
  (parser
    a <- any-char
    b <- any-char
    (return (string b a))))

This example shows how we can use the any-char parser to write a parser to read two characters returning the string of the characters in reverse order. The results of any-char are bound to local variables which can be used immediately after definition.

How does this work? Let’s look at what two-chars-swap expands into.

1:  (lambda (s3 i4)
2:    (let-values ([(a i4) (any-char s3 i4)])
3:      (if i4
4:          (let-values ([(b i4) (any-char s3 i4)])
5:            (if i4
6:                ((return (string b a)) s3 i4)
7:                (values #f #f)))
8:          (values #f #f))))

In line 1 we see the parser is actually a function, which takes a string (s3) and an index (i4), as expected (these are gensym’d variable names). Line 2 binds the result of calling any-char with the string and index to the a variable defined by the user. Notice how the call is implicit in the unexpanded form. The new index is re-bound to i4, shadowing the original index. Line 3 tests the index to see if the parse failed. Failure here causes failure for the whole parser (line 8). Otherwise we continue and call any-char again with the new index, binding the result to b and shadowing i4 just like in line 2. Line 5 checks for failure (line 7). Finally we call the (return (string b a)) parser, whose result is also the result of the whole parser.

No parser demo is complete without an implementation of an infix calculator language. One disadvantage of parser combinators is that left-recursion is not allowed (it will cause an infinite loop). This can be overcome by using a loop to slurp up sequences of operators of the same precedence, e.g. 1 + 2 + 3. See this page for a good explanation. Curiously, most examples I found on the web were wrong (they would parse "1 + 2" ignoring "+ 3"). I guess nobody actually tests their grammars.

Before we can write interesting parsers we need a few low-level utilities. The choice procedure takes a list of parsers and returns a parser which tries each parser in sequence until one of them succeeds. This gives us basic backtracking.

(define (choice . ps)
  (lambda (s i)
    (let loop ([p ps])
      (if (pair? p)
          (let-values ([(v i) ((car p) s i)])
            (if i
                (values v i)
                (loop (cdr p))))
          (values #f #f)))))

The matches procedure returns a parser which matches a particular string or fails. This is useful for matching symbols or keywords.

(define (matches m)
  (lambda (s i)
    (let ([n (string-length m)])
      (if (and (<= (+ i n) (string-length s))
               (string=? m (substring s i (+ i n))))
          (values (substring s i (+ i n)) (+ i n))
          (values #f #f)))))

The while-char procedure returns a parser which accepts characters while the character predicate holds. while1-char works similarly but requires at least one character.

(define (while-char pred)
  (lambda (s i)
    (let ([len (string-length s)])
      (let loop ([j i])
        (if (and (< j len) (pred (string-ref s j)))
            (loop (+ j 1))
            (values (substring s i j) j))))))

(define (while1-char pred)
  (parser
    s <- (while-char pred)
    (if (> (string-length s) 0) (return s) fail)))

In the calculator language we need to match decimal integers. Thanks to while1-char this is very easy!

(define decimal
  (parser
    s <- (while1-char digit?)
    (return (string->number s))))

To accept spaces around numbers and operators we use token which slurps up spaces before calling the given parser.

(define (token p)
  (parser
    (while-char space?)
    x <- p
    (return x)))

Finally we have enough to define the classical term/factor/expr parser. This version returns the s-expression instead of calculating the result.

(define expr
  (parser
    lhs <- term
    (let loop ([lhs lhs])
      (choice
        (parser
          opr <- (token (choice (matches "+") (matches "-")))
          rhs <- term
          (loop (list (string->symbol opr) lhs rhs)))
        (return lhs)))))

(define term
  (parser
    lhs <- factor
    (let loop ([lhs lhs])
      (choice
        (parser
          opr <- (token (choice (matches "*") (matches "/")))
          rhs <- factor
          (loop (list (string->symbol opr) lhs rhs)))
        (return lhs)))))

(define factor
  (choice
    (token decimal)
    (parser
      (token (matches "("))
      e <- expr
      (token (matches ")"))
      (return e))
    (parser
      (token (matches "-"))
      e <- factor
      (return (list '- e)))))

Now we can use test-parser (see the code file for details) to test the expression parser.

#;1> (test-parser expr)
>> 1 + 5/3 * (8 + (9 - -4)) / (7*7 + 6) + 2
Parsed    : "1 + 5/3 * (8 + (9 - -4)) / (7*7 + 6) + 2" (40 characters)
Returned  : (+ (+ 1 (/ (* (/ 5 3) (+ 8 (- 9 (- 4)))) (+ (* 7 7) 6))) 2)
About these ads

6 Comments »

  1. Hehe, these parser combinators really seem to generate quite some noise these days. Cool to see that people like them … I have written all my parsers that way since years, but only learned of the proper term some months ago via gilad.

    Comment by Adrian Kuhn — 29 April 2008 @ 1:33

  2. I haven’t used them in practice (this is my first stab at implementing them), but I like the idea. Most of the parsers I’ve written were good old recursive descent. PEGs also interest me (especially the work of VPRI & co), and they which have a nice solution to the left-recursion problem.

    Comment by shaurz — 4 May 2008 @ 5:19

  3. [Scheme] Monadic Parser Combinators – Haskell 風パーサー・コンビネーターの実装…

    Abstract for English readers: This article describes an implementation of a purely functional, monadic parser combinator library in PLT Scheme. With this library, one can easily build non-ambiguous, recursive-descent style parsers for string of charac…

    Trackback by (new Hatena).blog() — 12 August 2008 @ 7:30

  4. That’s very elegant, but it lacks a key feature of Haskell parser combinators: limited lookahead. One of the nice things about Parsec is that its “choice” combinator only does 1-token lookahead by default, so it’s usually as memory-efficient as a linear parser, but it also provides a “try” combinator which allows unlimited lookahead for the rare cases it’s necessary.

    Comment by Adrian Quark — 20 February 2009 @ 4:39

  5. Here is a modernized macro by a person learning macro programming:

    (define-syntax parser
    (syntax-rules (<-)
    ((parser)
    (error "empty parser"))
    ((parser v <- p)
    (error "parser must end with non-binding form"))
    ((parser p)
    (error "parser without binding"))
    ((parser v <- p f)
    (lambda (s i)
    (let-values (((v i) (p s i)))
    (if i
    (f s i)
    (values #f #f)))))
    ((parser v <- p . …)
    (lambda (s i)
    (let-values (((v i) (p s i)))
    (if i
    ((parser . …) s i)
    (values #f #f)))))
    ((parser p f)
    (lambda (s i)
    (let-values (((v i) (p s i)))
    (if i
    (f s i)
    (values #f #f)))))
    ((parser p . …)
    (lambda (s i)
    (let-values (((v i) (p s i)))
    (if i
    ((parser . …) s i)
    (values #f #f)))))))

    Comment by Karel — 24 March 2012 @ 22:53

  6. Actually, you don’t need do-notation to write nice parsers. You just need the right combinators.

    Comment by Matt — 8 October 2012 @ 16:24


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Shocking Blue Green Theme. Create a free website or blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: