Perpetual Weekend

2 June 2008

Finding in Assembly

Filed under: Uncategorized — Tags: , , , , , , , — shaurz @ 11:03

A question was posted to reddit and I couldn’t help but indulge my curiosity. System calls on Linux x86 traditionally use the interrupt vector 0x80. Modern processors have special system call instructions which are much faster, so Linux has an entry point called “″ (a virtual DSO) which will choose the best system call method for your machine. I found a webpage which explains about linux-gate and another which explains how to find the ELF auxv (auxiliary vector) which contains the address of the entry point. So I put together a small assembly program which finds the linux-gate syscall entry point and uses it to print a message to stdout. I haven’t written any assembly for quite a while :-) Use FASM to assemble the code.

Download: linuxgate.asm

11 March 2008

Haskell-style Parser Combinators in Scheme

Filed under: Uncategorized — Tags: , , , , , , — shaurz @ 2:53

One evening I though it would be cool to implement parser combinators in Scheme. The results are explained in this blog post.

Download: parser-combinators.scm

The code will run in Chicken Scheme.

We can define a parser as a procedure which takes an input string and an index (i.e. the index of the current character on the input). If the parse succeeds, it returns a value and an index ≥ the original index to the unconsumed input. On failure it returns #f for the both the value and index (the index is checked for failure and the value is ignored).

The two simplest parsers are fail, which always fails, and (return v), which always succeeds and returns v without consuming any input.

(define fail (lambda (s i) (values #f #f)))
(define (return v) (lambda (s i) (values v i)))

The parser any-char removes one character of the input or fails if the end of the input is reached.

(define any-char
  (lambda (s i)
    (if (< i (string-length s))
        (values (string-ref s i) (+ i 1))
        (values #f #f))))

On their own, parser procedures are cumbersome. Haskell provides a neat solution: do notation. Now for some Macro Magic. Let’s write a macro called parser which allows us to write Haskell-style code.

(define-for-syntax *v-name* (gensym 'v))
(define-for-syntax *s-name* (gensym 's))
(define-for-syntax *i-name* (gensym 'i))

(define-for-syntax (expand-parser-body forms)
  (match forms
    [(_ '<- _)
       (error "parser must end with non-binding form")]
       `(,p ,*s-name* ,*i-name*)]
    [(v '<- p . xs)
       `(let-values ([(,v ,*i-name*) (,p ,*s-name* ,*i-name*)])
          (if ,*i-name*
              ,(expand-parser-body xs)
              (values #f #f)))]
    [(p . xs)
       `(let-values ([(,*v-name* ,*i-name*) (,p ,*s-name* ,*i-name*)])
          (if ,*i-name*
              ,(expand-parser-body xs)
              (values #f #f)))]))

(define-macro (parser . body)
  `(lambda (,*s-name* ,*i-name*)
     ,(expand-parser-body body)))

Now we can write parsers which look and work just like Haskell code!

(define two-chars-swap
    a <- any-char
    b <- any-char
    (return (string b a))))

This example shows how we can use the any-char parser to write a parser to read two characters returning the string of the characters in reverse order. The results of any-char are bound to local variables which can be used immediately after definition.

How does this work? Let’s look at what two-chars-swap expands into.

1:  (lambda (s3 i4)
2:    (let-values ([(a i4) (any-char s3 i4)])
3:      (if i4
4:          (let-values ([(b i4) (any-char s3 i4)])
5:            (if i4
6:                ((return (string b a)) s3 i4)
7:                (values #f #f)))
8:          (values #f #f))))

In line 1 we see the parser is actually a function, which takes a string (s3) and an index (i4), as expected (these are gensym’d variable names). Line 2 binds the result of calling any-char with the string and index to the a variable defined by the user. Notice how the call is implicit in the unexpanded form. The new index is re-bound to i4, shadowing the original index. Line 3 tests the index to see if the parse failed. Failure here causes failure for the whole parser (line 8). Otherwise we continue and call any-char again with the new index, binding the result to b and shadowing i4 just like in line 2. Line 5 checks for failure (line 7). Finally we call the (return (string b a)) parser, whose result is also the result of the whole parser.

No parser demo is complete without an implementation of an infix calculator language. One disadvantage of parser combinators is that left-recursion is not allowed (it will cause an infinite loop). This can be overcome by using a loop to slurp up sequences of operators of the same precedence, e.g. 1 + 2 + 3. See this page for a good explanation. Curiously, most examples I found on the web were wrong (they would parse "1 + 2" ignoring "+ 3"). I guess nobody actually tests their grammars.

Before we can write interesting parsers we need a few low-level utilities. The choice procedure takes a list of parsers and returns a parser which tries each parser in sequence until one of them succeeds. This gives us basic backtracking.

(define (choice . ps)
  (lambda (s i)
    (let loop ([p ps])
      (if (pair? p)
          (let-values ([(v i) ((car p) s i)])
            (if i
                (values v i)
                (loop (cdr p))))
          (values #f #f)))))

The matches procedure returns a parser which matches a particular string or fails. This is useful for matching symbols or keywords.

(define (matches m)
  (lambda (s i)
    (let ([n (string-length m)])
      (if (and (<= (+ i n) (string-length s))
               (string=? m (substring s i (+ i n))))
          (values (substring s i (+ i n)) (+ i n))
          (values #f #f)))))

The while-char procedure returns a parser which accepts characters while the character predicate holds. while1-char works similarly but requires at least one character.

(define (while-char pred)
  (lambda (s i)
    (let ([len (string-length s)])
      (let loop ([j i])
        (if (and (< j len) (pred (string-ref s j)))
            (loop (+ j 1))
            (values (substring s i j) j))))))

(define (while1-char pred)
    s <- (while-char pred)
    (if (> (string-length s) 0) (return s) fail)))

In the calculator language we need to match decimal integers. Thanks to while1-char this is very easy!

(define decimal
    s <- (while1-char digit?)
    (return (string->number s))))

To accept spaces around numbers and operators we use token which slurps up spaces before calling the given parser.

(define (token p)
    (while-char space?)
    x <- p
    (return x)))

Finally we have enough to define the classical term/factor/expr parser. This version returns the s-expression instead of calculating the result.

(define expr
    lhs <- term
    (let loop ([lhs lhs])
          opr <- (token (choice (matches "+") (matches "-")))
          rhs <- term
          (loop (list (string->symbol opr) lhs rhs)))
        (return lhs)))))

(define term
    lhs <- factor
    (let loop ([lhs lhs])
          opr <- (token (choice (matches "*") (matches "/")))
          rhs <- factor
          (loop (list (string->symbol opr) lhs rhs)))
        (return lhs)))))

(define factor
    (token decimal)
      (token (matches "("))
      e <- expr
      (token (matches ")"))
      (return e))
      (token (matches "-"))
      e <- factor
      (return (list '- e)))))

Now we can use test-parser (see the code file for details) to test the expression parser.

#;1> (test-parser expr)
>> 1 + 5/3 * (8 + (9 - -4)) / (7*7 + 6) + 2
Parsed    : "1 + 5/3 * (8 + (9 - -4)) / (7*7 + 6) + 2" (40 characters)
Returned  : (+ (+ 1 (/ (* (/ 5 3) (+ 8 (- 9 (- 4)))) (+ (* 7 7) 6))) 2)

26 January 2008

Bit Twiddling abs() with GCC

Filed under: Uncategorized — shaurz @ 14:05

Recently on Reddit there was a link to a page which listed a lot of clever “bit twiddling” hacks. I decided to try the hack for computing abs() against the standard library function and the naive implementation:

/* Clever version */
static inline int32_t
abs_1(int32_t v)
    const int32_t m = v >> sizeof(int32_t) * 8 - 1;
    return (v ^ m) - m;

/* Naive version */
static inline int32_t
abs_2(int32_t v)
    return v < 0 ? -v : v;

Compiling with gcc -O3 produces exactly the same code for all three versions!

  c1 fa 1f    sar    $0x1f,%edx
  31 d0       xor    %edx,%eax
  29 d0       sub    %edx,%eax

If I also specify -march=athlon-xp to optimize for my processor, the sar instruction is replaced with cltd (also known as cdq).

  99          cltd
  31 d0       xor    %edx,%eax
  29 d0       sub    %edx,%eax

23 January 2008

Script: check_dvd

Filed under: Uncategorized — shaurz @ 21:12

The main operating system I use is Arch Linux. My favourite feature of the distribution is the package management system which, although glacially slow and insanely inefficient, makes it very easy to make your own packages and has quite a lot of obscure but useful packages in the AUR. I am still using the original installation (version 0.5 if I remember correctly) which I have continually kept up-to-date with pacman. The distribution has undergone several big changes (such as switching from devfs to udev) and many small ones which sometimes go undetected. The accumulation of old cruft (such as out-of-date configuration files) means that things don’t work as well as a fresh installation unless you are willing to spend hours tracking down every little change.

For example, K3B used to correctly re-insert and check a DVD after burning but now it just hangs waiting for the tray to close (closing the tray does not help). So I wrote a little script to help me check DVDs. It takes a list of directories and checks the contents of the DVD against files with the same name in those directories. It does not scan recursively since I use this mainly for large video files which I put on the top-level directory of the disc.



if [ $# -le 0 ]; then
    echo "usage: $0 DIRS..."
    exit 1

mount | grep $DVD >/dev/null || mount $DVD
pushd $DVD >/dev/null

for file in *; do
    for dir; do
        if [ -f "$dir/$file" ]; then
            echo "checking: $file"
            diff "$DVD/$file" "$dir/$file" >/dev/null
            if [ $? -ne 0 ]; then
                echo "*** error in file: $file"

popd >/dev/null
umount $DVD

I really should just fix the problem with K3B :-)

Edit: It’s a bug in K3B. Thankfully there is a simple workaround.

18 January 2008

Argument decoupling in concatenative languages

Filed under: Uncategorized — shaurz @ 2:56

An interesting point about so-called concatenative languages is the decoupling of arguments from procedure calls due to the stack. Here is a simple example:

  : foo ( x c -- x' ) if bar else baz then ;  ( Forth )
  : foo ( x c -- x' ) [ bar ] [ baz ] if ;    ( Factor )

The word foo selectively executes bar or baz depending on the value of the condition argument. Notice the argument x is decoupled from the application of bar or baz.

The applicative example might be:

  (define (foo x c)
    ((if c bar baz) x))   ; Scheme

This procedure is equivalent to the Forth word, but is subtly different. The conditional expression evaluates to a procedure value (bar or baz) which is then used inside the application expression. The Forth version does not evaluate the words as values, it simply executes them directly!

What if we wanted to change the definition of baz so that it takes two arguments. Say we want to give it x twice, we could change the definition to:

  : foo ( x c -- x' ) if bar else dup baz end ;  ( Forth )
  : foo ( x c -- x' ) [ bar ] [ dup baz ] if ;   ( Factor )

Very easily done. In Scheme we would have to change the procedure to:

  (define (foo x c)
    (if c (bar x) (baz x x)))   ; Scheme

Which now means we have to use two separate application expressions, or a more direct translation:

  (define (foo x c)
    ((if c bar (lambda (a) (baz a a))) x))   ; Scheme

Which is quite bizarre and forces us to create a superfluous lambda. Both solutions are unpleasant and I expect most people would opt for the first version for clarity.

There are advantages and disadvantages to this decoupling. On the plus side, it allows greater factoring opportunities and conciseness. But we pay for this by having to deal with the complexity of stack management and reduced readability since it is not obvious at-a-glance what arguments feed into what words.

Is it possible to invent a syntax which combines the advantages of applicative and concatenative languages? I would really like to see what that would look like.

1 July 2007

Filed under: Uncategorized — shaurz @ 19:07

The last stable version of Amp (from 15th April) is available here. I am planning a new direction so I will not be working on it any more.

6 April 2007

Integrated Record Types

Filed under: Uncategorized — shaurz @ 17:30

The interpreter for Amp (the code name for my Lisp-like language) has been largely re-written since my first release. The code is not yet available. I am going to set up a darcs repo to reduce the effort of distributing new versions.

I have switched back to an AST representation of programs. My next goal is to use explicit continuations in the interpreter. This will stop the C stack from blowing up and allow implementation of first-class continuations and tail-call optimisation.

I decided to switch the implementation language from C to C++ (eventually it will be self-compiling). I’m not sure if this was a good idea or not but it has worked out quite well so far. I have lightweight classes abstracting values (pointers or fixnums) and heap objects (which have a header with type information) so I can easily change the low-level representation when needed.

I have implemented a rudimentary record type protocol similar in style to R6RS. They are integrated in the implementation rather than botched on top of vectors as a second-class feature. The current interface is simplistic but will suffice for now.

(make-record-type name slot-names)
(constructor type)
(accessor type slot-name)
(mutator type slot-name)

Pairs (a.k.a. “cons cells”) are simply records with two slots, car and cdr. The functions cons, car and cdr are no longer defined as primitives since they are merely constructors and accessors of the pair type.

(define cons (constructor <pair>))
(define car (accessor <pair> 'car))
(define cdr (accessor <pair> 'cdr))
(define set-car! (mutator <pair> 'car))
(define set-cdr! (mutator <pair> 'cdr))

The pair type must still be defined as a primitive since knowledge of its definition is required by the implementation – in other words it is also defined as a C struct. We could define it like this:

(define <pair> (make-record-type 'pair '(car cdr))

Please disregard the quasiquote code from my previous posts. After reading the standard I realised that nesting worked quite differently than I thought. In the latest unreleased code the examples in R5RS now work as advertised. Hopefully I got it right this time.

Scheme’s quasiquote seems to be subtly different than Common Lisp’s. For example, ``,,0 evaluates to 0 in SBCL and (quasiquote (unquote 0)) in MzScheme.

3 March 2007

Forth as a Haskell DSL, or Lambda: The Ultimate Stack?

Filed under: Uncategorized — shaurz @ 20:05

I had an idea that passing arguments on the stack, in a language like Forth, is like partially applying a continuation with the contents of the stack (does that make any sense?) We represent a Forth program in continuation-passing style, passing extra arguments to the continuation to “push the stack”, taking arguments and not passing them on to “pop the stack”. In Haskell we can define the following “words”:

> push x k = k x
> pop k a = k
> dup k a = k a a
> swap k b a = k a b
> over k b a = k a b a
> nip k b a = k b
> cond k f t p = (if p then t else f) k
> when k t p = if p then t k else k

Haskell provides us the correct polymorphic types automatically.

push :: a -> (a -> b) -> b
pop  :: a -> b -> a
dup  :: (a -> a -> b) -> a -> b
swap :: (a -> b -> c) -> b -> a -> c
over :: (a -> b -> a -> c) -> b -> a -> c
nip  :: (a -> b) -> a -> c -> b
cond :: a -> (a -> b) -> (a -> b) -> Bool -> b
when :: a -> (a -> a) -> Bool -> a

Now we can write Forth-like programs by nested partial application. To avoid getting Lost In Superfluous Parentheses we use the dollar ($) operator. Welcome to “DollarForth”.

> binop op k b a = k (op a b)
> add = binop (+)
> sub = binop (-)
> mul = binop (*)
> lt k b a = k (a < b)
> neg k x = k (-x)

> square k = dup $ mul $ k

: square dup * ;

> absolute k = dup $ push 0 $ lt $ push (neg) $ when $ k

: absolute dup 0 < if negate then ;

> boolToStr k = push (push "yes") $ push (push "no") $ cond $ k

: boolToStr if "yes" else "no" then ;

> main = do
>     print $ push 8 $ square $ id
>     print $ push 10 $ push 20 $ dup $ add $ swap $ sub $ id
>     print $ push 4 $ absolute $ push (-4) $ absolute $ binop (,) $ id
>     putStrLn $ push "hello, " $ push "world!" $ binop (++) $ id
>     print $ push True $ boolToStr $ push False $ boolToStr $ binop (,) $ id
>     print $ push "x" $ push "y" $ over $ binop (++) $ binop (++) $ id
>     print $ push 1 $ push 2 $ swap $ nip $ id

The good news: we now have a statically-typed Forth, with static stack inference, for free. The bad news: it is very difficult to program in this style – type errors are common and utterly indecipherable. It is also very hard to understand how it works – I have only a vague idea myself.

I have failed to define a loop operator or use recursion in DollarForth. Is it even possible within the bounds of Haskell’s type system?

18 January 2007

Adventures in Booting Up

Filed under: Uncategorized — shaurz @ 20:31

There was a power-cut today due to the 70 mph winds. I always know when this has happened because I come home to find my PC powered off.

…Which for me is more of a pain than you would first think. My PC has a strange problem where it only boots up about 1% of the time. It takes roughly half an hour of power cycling until I can get the BIOS to appear, so normally I never switch it off.

I think the problem is with the graphics card since I get cyan, white and magenta lines flashing on the screen instead of the video BIOS message (and it goes no further from there). The fan died on it years ago anyway. I can get something about 10 times faster for next to nothing these days.

2 December 2006

A Better Let

Filed under: Uncategorized — shaurz @ 19:42

The “let” form is, in my humble opinion, the worst bit of syntax in Lisp, requiring lots of superfluous parenthesis. I have devised a new syntax which removes these redundancies and improves readability.

<let> = ( let [<name> <exp>]* <exp>+ )

For example, the below expression:

(let ((x (* a a))
      (y (+ a a)))
  (- x y))

Can be expressed as:

(let x (* a a)
     y (+ a a)
  (- x y))

Which is much more pleasant. It is also quite easy to parse – here’s my implementation, which converts the bindings into an immediate lambda application:

(define (expand-let x vars vals)
  (if (or (pair? (car x)) (nil? (cdr x)))
      `((lambda ,(reverse vars) ,@x) ,@(reverse vals))
      (expand-let (cddr x)
                  (cons (car x) vars)
                  (cons (cadr x) vals))))

(define-macro (let . x)
  (expand-let x '() '()))

Another approach, which I am quite fond of, is extending the basic syntax of lambda so that we can introduce new bindings using an infix equality. The bindings can appear anywhere in the body except the very end since it must evaluate to a value.

<lambda> = ( lambda <formals> <stmt>* <exp> )
<stmt>   = <name> = <exp>
         | <exp>

The syntax is quite cute and convenient. It is mostly unambigous since there is no valid reason to evaluate a variable followed by an “=” in a lambda body. For example:

(lambda (a)
  x = (* a a)
  y = (+ x x)
  (- x y))

When we introduce a binding in the lambda body, it is as if the rest of the body is wrapped in another lambda taking the new binding as an argument. The above expression is equivalent to:

(lambda (a)
  ((lambda (x)
    ((lambda (y) (- x y))
     (+ x x)))
   (* a a)))

It would make sense to allow the binding syntax in other block forms such as begin. This is pretty trivial:

(define-macro (begin . exps)
  `((lambda () ,@exps)))

At the moment I am reading many papers (Citeseer and ACM Portal can be addictive!) on functional language implementation. My original interpreter which I wrote before the one I released here used an internal pre-compiled AST with parent-linked array activation frames. My current rewritten version uses a much less efficient direct interpretation of the list structure with association list environments. This was mainly done as a fun exercise – now I want to move onto something better.

Programs will again be represented by an AST. Variable nodes were previously represented as a (level, offset) pair where the level was the number of parent links you had to follow up from the current environment. I will now have three types of variable: locally bound (simply a stack offset), free (an offset into a free variable frame – see below), or global.

Closures will be represented by a code pointer and an array of free variables. Bound variables will be passed on the stack which we can deterministically push and pop, so we don’t need to garbage-collect normal activation frames, only free variable frames. If you don’t used closed variables you don’t have to pay the price. Mutable closed variables will in general require a ref cell.

A nice optimisation, if we create multiple lambdas which share the same free variables, would be to share a single free variable frame. The compiler could determine which implementation will use less space. In some cases this will eliminate the need for ref cells. This will make object-oriented techniques built on lambdas acceptably efficient.

I am also going to change my object representation so I can have unboxed tagged types such as fixed-sized integers, chars, booleans and nil.

Older Posts »

The Shocking Blue Green Theme. Blog at


Get every new post delivered to your Inbox.