Alex Aiken
CS 264
Spring, 1995

Code Generation via Conversion to Continuation-Passing Style
------------------------------------------------------------

I. The Problem with Stacks

In C or FORTRAN, all procedure calls can be stack allocated.  This
remains true even in a language such as Pascal, where procedure
definitions can be nested, so long as one does not allow procedures to
be returned as the result of procedures.  In a language in which
functions are first class (i.e., can be passed as arguments and
returned as results) and function definitions can be nested, it is no
longer clear how to stack allocate all values.  The following ML
example illustrates why:

    (let fun f x = let fun g y = x in g end
     in f 3
     end) 5

Now, the function "g" is returned as a result of evaluating "f x".
Note that even though "g" is defined lexically inside the scope of
"f", the value returned by the outer let-expression is actually "g" in
an environment where x = 3.  The result of the entire expression is,
of course, 3.  The important point of this example is that a value
referring to "g" has greater extent than the function "f" that creates
it, so the calls to "f" and "g" cannot be stack allocated.  In LISP,
this situation has been named (not very mnemonically) the "upward
funarg problem".

The standard implementation of escaping functions is to create a
*closure*.  Formally, a closure captures the function and the
environment in which it was defined.  Operationally, a closure is a
pair consisting of a pointer to the code for the function and a
pointer to the activation record defining the variables free in the
function definition.  A more refined implementation is to make a
closure a flat record, with the first element being the code pointer
and the remaining elements the environment variables.  For example, in
the execution of the code above the sequence of events might be:

    evaluate (...) 5
        evaluate f 3
            create activation record for f, slot x = 3
            evaluate body of f
                define function g with free variable x.  Since g will
                escape, create closure C with a pointer to g and a
                pointer to x.
            return C
            pop AR for f
        evaluate 5 (returns 5)
        invoke C with y = 5
            evaluate body of g
    end

Note that neither the environment containing "x" nor the closure for
"g" can be allocated in "f"'s activation record; these values must be
allocated on the heap.

Compiling languages with functions that escape can be painful.  To do
a good job, one must distinguish between functions that may escape
(e.g., g) and those that cannot escape (e.g., f), because more
efficient implementations are possible for the latter than for the
former.  In addition, there are many choices to be made in the
representation of a closure and its environment.  For example, it may
be desirable to avoid allocating the closure at all if the value will
be used immediately; in this case, it is better to store the two
values of the closure in separate registers.

Traditionally, the decisions about the representations of closures and
environments have been made during code generation.  This design has
at least two disadvantages: (1) it makes the code generator
complicated and therefore more difficult to debug and retarget, and
(2) it prevents optimizations from being applied to the choice of
closure and environment representations, because the optimizer works
on a higher-level form in which these representation decisions are
hidden.  A consistent theme of modern compiler technology is:

    MAKE RUNTIME REPRESENTATIONS EXPLICIT IN THE COMPILER'S
    INTERMEDIATE FORM.

That is, if one is going to use closures, have a representation for
closures in the intermediate language.
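To make the upward funarg concrete, here is the ML example transcribed
into Python (a sketch; the names g_code and f_explicit are mine), first
as an ordinary escaping closure and then with the closure represented
explicitly as a code-pointer/environment pair:

```python
# The ML example: f returns g, which captures f's argument x, so x
# must outlive f's activation record.
def f(x):
    def g(y):
        return x          # x is free in g; g escapes as f's result
    return g

assert f(3)(5) == 3       # the whole expression evaluates to 3

# The same computation with the closure made explicit as a pair:
# a code pointer plus the captured free variable.
def g_code(env_x, y):
    return env_x

def f_explicit(x):
    return (g_code, x)    # "flat" closure: code first, environment after

code, env = f_explicit(3)
assert code(env, 5) == 3
```

Note that the pair built by f_explicit lives on the heap (as all Python
values do), exactly because it outlives the call that created it.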
Making all representation decisions explicit makes code generation
very simple (it's just a change of syntax from the intermediate form
to object code) and, most importantly, exposes the runtime
representations to the compiler, where program analysis and
optimization can be used to make improvements.

II. Conversion to Lambda Calculus

The first step in the Appel/Jim compiler for ML is to translate ML
into a version of the lambda calculus.  The reason for this
translation is that ML is a big language, with many constructs.  The
translation reduces the number of constructs drastically, while
preserving the semantics of the language.  Thus, the translation
simply reduces the number of cases that subsequent phases must
consider.  For example:

    fn x => e          ==>  \x.e
    let x = e in e'    ==>  (\x.e') e

III. Continuation-Passing Style

In the case of closures, the runtime property we are making explicit
is control flow.  That is, during execution control jumps around from
point to point in the assembly code.  In C or Fortran, the order is
easy to compute from the program text.  In a language like ML, the
order is not so easy to see, because a closure may be created, passed
around in a data structure, and then invoked at some place far removed
from the context where it was defined.

Continuation-passing style makes this arbitrary control flow explicit.
The idea is simple: every function takes an additional argument, the
address of the instruction it should jump to when the function
completes.  In very concrete terms, one can think of a programming
language in which the completion of every function involves a "goto"
to jump to the next instruction to be executed.  The difference is
that the continuation is an *argument* to the function, so the
function may jump to different locations in different evaluations.
What have we just described?  The return address of a function
invocation!  Normally this is a "hidden" argument, known only to the
code generator, but CPS makes it explicit.
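Both ideas above can be sanity-checked in a few lines of Python (a
sketch; add_cps is a name I've made up): the let-translation from
section II, and a function whose "return address" is an explicit
argument:

```python
# Section II: let x = e in e'  ==>  (\x.e') e
v_let = (lambda x: x + 1)(3)    # the translated form of: let x = 3 in x + 1
assert v_let == 4

# Section III: the continuation k is an explicit return-address argument.
def add_cps(a, b, k):
    return k(a + b)             # "return" = jump to k with the result

assert add_cps(1, 2, lambda v: v * 10) == 30
```

The caller decides where add_cps "returns" by choosing k; different
call sites can pass different continuations.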
We'll consider a little lambda-calculus language for our presentation
of CPS:

    e ::= x | e1 e2 | \x.e1 | if e1 e2 e3 | e1 + e2 | i

CPS conversion takes an expression e and produces a function that
takes a continuation k.  The idea is that we should evaluate e, and
then continue with the execution of k.  But what happens to the value
that e produces?  The answer is that it is the input to the
continuation.  That is:

    CPS(e,k) = k e

This definition tells us what we want: evaluate e and pass its result
to the rest of the computation k.  Of course, we also want to
recursively convert subexpressions of e.  We will allow ourselves a
little bit of syntactic sugar in the target language to help make
things clear.  We add the expression let x = e1 in e2.  The rule is
that e1 must be a syntactic value---either an abstraction \z.e or an
integer.  The standard conversion algorithm is:

    CPS(x,k)           = k x
    CPS(\x.e,k)        = let y = \(x,k').CPS(e,k') in k y
    CPS(e1 e2,k)       = CPS(e1, \f.CPS(e2, \x.f(x,k)))
    CPS(if e1 e2 e3,k) = CPS(e1, \b.if b = 0 then CPS(e2,k) else CPS(e3,k))
    CPS(e1 + e2,k)     = CPS(e1, \a.CPS(e2, \b.let c = a + b in k c))
    CPS(i,k)           = k i

Note that as a result of the conversion, every function takes an extra
continuation argument k'.  This extra argument is exactly the return
address of the call---where the function should return when it is
done.  Note also that the CPS conversion doesn't get rid of the
primitive operations such as conditionals, addition, and function
application, but those primitives now appear in a very restricted
form: their arguments are always variables.  In addition, lets always
bind syntactic values; nothing needs to be evaluated.  In operational
terms, this representation forces all intermediate values to be given
names, which can be conveniently thought of as registers or memory
locations.  Because of their restricted form, code generation is very
easy for CPS-converted terms.
Here is an example:

    (\x.x + 1) 3

    CPS((\x.x + 1) 3, k)
    = CPS(\x.x + 1, \f.CPS(3, \x.f(x,k)))
    = let g = \(x,k').CPS(x + 1, k') in (\f.CPS(3, \x.f(x,k))) g
    = let g = \(x,k').CPS(x + 1, k') in CPS(3, \x.g(x,k))
    = let g = \(x,k').CPS(x + 1, k') in (\x.g(x,k)) 3
    = let g = \(x,k').CPS(x + 1, k') in g(3,k)
    = let g = \(x,k').CPS(x, \a.CPS(1, \b.let c = a + b in k' c)) in g(3,k)
    = let g = \(x,k').(\a.CPS(1, \b.let c = a + b in k' c)) x in g(3,k)
    = let g = \(x,k').CPS(1, \b.let c = x + b in k' c) in g(3,k)
    = let g = \(x,k').(\b.let c = x + b in k' c) 1 in g(3,k)
    = let g = \(x,k').(let c = x + 1 in k' c) in g(3,k)

This CPS term would convert into the following pseudo-machine code:

    (* k contains the address of the current continuation on entry;
       the result of a function is always stored at address a *)

           g := LABEL
           (* code for g(3,k) *)
           store 3 as first argument of g
           store k as second argument of g
           jump LABEL
           ...
    LABEL: (* code for \(x,k').let c = x + 1 in k' c *)
           a := first argument + 1
           jump to second argument

The CPS conversion given above is a sugared version of the traditional
one in the literature.  The vanilla CPS conversion does not use let
expressions, which, while operationally equivalent, makes the code
quite a bit harder to read.  Appel also uses a different syntax, which
is also hard to read but more suited for implementation as a datatype
in a compiler.

IV. Closure Conversion

Although CPS conversion makes call/return addresses for functions
explicit, it hasn't done anything to help with generating code for
closures.  A function with free variables must be able to access those
free variables at runtime.  The idea behind closure conversion is to
make the environment explicit by passing it as an extra argument to
each function.  This basic idea can be improved in the following way.

Def.  A function f is "known" if the only uses of f are as f x; i.e.,
f always appears in function position.  Known functions can have all
of their uses statically identified, and they do not escape.
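The final term of the derivation can be checked directly in Python (a
sketch; the identity function plays the role of the top-level
continuation k):

```python
# g = \(x,k').let c = x + 1 in k' c, invoked as g(3,k)
def g(x, k):
    c = x + 1             # the let binds the named intermediate value
    return k(c)           # "jump to the return address" with the result

assert g(3, lambda v: v) == 4
```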
Thus, at every call site of a known function, the free variables of
that function are also in scope.  To see this, consider the expression

    let f = ... in .... f x ....

Clearly, all the free variables in the body of f are still accessible
inside the body of the let, modulo values that are shadowed by
bindings of the same name.

Two handy constructs are used heavily in closure conversion: record
and select.  record(x1,...,xn) makes a record of n things and may be
bound in a CPS-let.  select(i,m) selects the ith component of record m
and may also be bound in a CPS-let.  Records are used heavily to
represent the closure value of a function.

To make closures explicit, every function is modified as follows:

- For an unknown function f, an extra closure argument is added to the
  definition of f.  That is,

      let f = \(x,k).e in ...

  is converted to

      let f = <\(c,x,k).e,v1,...,vn> in ...

  where v1,...,vn are the free variables of the function.  Every use
  f(x,k) of an unknown function (i.e., application of a variable not
  syntactically bound to a function body) is replaced by

      f.0(f,x,k)

  By storing the code pointer at offset 0 inside the closure, we don't
  have to know how big the record is to find the pointer.  References
  to free variables in the body of f are replaced by appropriate
  selections from the closure record.

- For a known function f, as many additional arguments are added to
  the function as it has free variables.  That is,

      let f = \(x,k).e

  is converted to

      let f = \(x,k,v1,...,vn).e

  where v1,...,vn are the free variables of the function.  Every use
  f(x,k) of a known function is replaced by

      f(x,k,x1,...,xn)

  where x1,...,xn are the names for the free variables of f in the
  current context.  Note that x1,...,xn will in general be different
  names, because all definitions and uses are converted to have no
  free variables.

V. Optimizations and Simplifications

Once CPS conversion and closure conversion are performed, a large
number of optimizations/simplifications can be usefully performed.
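Both conversions can be sketched in Python, using tuples as records
(the continuation arguments are omitted for brevity, and the helper
names make_f and f_known are mine):

```python
# Unknown function: let f = \y. x + y, with free variable x, becomes a
# closure record <code, x>, and the call f(y) becomes f.0(f, y).  The
# code pointer sits at offset 0, so callers need not know the record's
# size.
def f_code(c, y):
    x = c[1]              # select(1, c): fetch the free variable
    return x + y

def make_f(x):
    return (f_code, x)    # record(f_code, x)

f = make_f(10)
assert f[0](f, 5) == 15   # f.0(f, y)

# Known function: the free variable is passed as an extra argument
# instead, so no record need be allocated.
def f_known(y, x):
    return x + y

assert f_known(5, 10) == 15
```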
The most important are:

1. Unnest all nested function definitions.  Since there are no longer
   any free variables, this is trivially correct.

2. Replace every

       let x = record(...,vi,...) in
       .....
       let y = select(i,x) in e

   by

       let x = record(...,vi,...) in
       .....
       e[vi/y]

3. Beta-reduce any function that is called only once or whose body is
   small.

4. Perform eta-reduction:

       f(x,y) = g(x,y)  =>  replace all uses of f by g

5. Perform register allocation.  Here the idea is that an expression
   requires a number of registers equal to the number of free
   variables it refers to.  If it has more free variables than the
   number of available registers, then some free variables must be
   "spilled" by packing them into a record and referencing them using
   selects (note that these selects cannot be optimized using rule
   2!).

VI. Register Allocation

The code generator is easy except for register allocation.  In
principle, this phase could also be performed on the CPS and
closure-converted code by renaming all variables to names from a
fixed set r1,...,rn (the names of actual machine registers).  The
principal ideas here are:

1. Register targeting: arrange for a function to expect an argument in
   a particular register where all calls can place it.  Successful
   register targeting makes a function call as cheap as a jump
   instruction.

2. Register anti-targeting: if an expression will call function f and
   f expects a particular argument in register i, avoid putting
   anything else in register i, to avoid a potential move instruction
   prior to the function call.
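Two of the simplifications can be illustrated concretely in Python (a
semantic sketch of rules 2 and 4 on values, not a term rewriter; the
names are mine):

```python
# Rule 2: a select from a record built in the same scope can be
# replaced by the variable stored in that field.
vi = 42
x = (1, vi, 3)            # let x = record(1, vi, 3)
y = x[1]                  # let y = select(1, x)
assert y == vi            # ...so every use of y can be rewritten to vi

# Rule 4 (eta-reduction): f merely forwards its arguments to g, so
# every use of f can be replaced by g.
def g(a, b):
    return a * b

def f(a, b):
    return g(a, b)

assert f(3, 4) == g(3, 4)
```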