Elsa Design
Work in Progress
By Daniel Wilkerson and Scott McPeak
This file is an attempt to articulate the design of Elsa to someone who would like to work on it.
Elsa attempts to parse C and C++:
Note that Elsa does not try to reject all invalid programs. The checking that Elsa does is primarily for the purpose of ensuring that its interpretation of the code is correct. For example, Elsa does check that argument types match parameter types (which helps to catch bugs in Elsa's type computation logic), but does not enforce the access control rules ("public", "private", etc.) of C++.
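For instance, the following fragment is ill-formed because it violates access control, yet Elsa accepts it, since access control is not among the rules it enforces:

    // Rejected by a conforming compiler, accepted by Elsa:
    class C {
    private:
      int x;
    public:
      C() : x(0) {}
    };
    int get(C &c) { return c.x; }   // error elsewhere: x is private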
Elsa is extensible; that is, one may add additional syntactic features to the language being parsed without directly modifying the files that define the base language. This "base-and-extension design pattern" occurs frequently in the design of Elsa, and in fact is used to support the C99 and GNU extensions to C++ (consequently, it is easy to remove such support).
Processing of an input file proceeds in the following three stages.
Lexing (a.k.a. scanning) is the process of partitioning a flat sequence of characters into a sequence of tokens. In addition to partitioning the characters, the lexer classifies each token: the character sequence "123" might be classified as an "integer literal", and the sequence "abc" as an "identifier". The Lexer discards comments and whitespace (rather than passing them on to the parser).
As mentioned above, much of the Elsa design involves extension mechanisms, and the Lexer is no exception. A base description is combined with one or more extension descriptions to arrive at the full lexical language:
Above, solid lines indicate direct flow, and dashed lines indicate where one file is #included by another (so both files effectively flow into a program further down, which is not shown). Files are shown in ellipses and programs are shown in rectangles.
cc.lex is the base lexer description. It is written in the Flex language. gnu.lex is an extension lexer description; it contains definitions for GNU and C99-specific lexical elements. These two descriptions are combined with merge-lexer-exts.pl to produce lexer.lex. lexer.lex is subsequently read by Flex to generate lexer.yy.cc, a C++ module that does the actual scanning.
Build process invariant: Any time a script or tool produces a text file, the build process marks it read-only. This makes it harder to accidentally edit a file that is automatically generated. Thus, all the files in the diagram above that are the output of a rectangle are marked read-only.
Unlike Bison, Elkhound does not automatically choose the mapping from lexer token codes (like "4") to conceptual lexical elements (like "integer literal"). So the Elsa build process uses a script called make-token-files to assign the mapping. It uses the token descriptions supplied by cc_tokens.tok and gnu_ext.tok.
lexer.lex specifies how to partition the input characters into tokens. Most of the actions are straightforward. One tricky point is the notion of "separating tokens" and "nonseparating tokens", which is explained at the top of lexer.cc. Another is that "#line" directives are handled by recording them with the hashline module, which can then be used to map a raw input source location to the location designated by the #line directives.
The baselexer module is responsible for coordinating the activity of a Flex lexer and an Elkhound parser. It inherits from LexerInterface (lexerint.h), which defines three fields (type, sval, and loc) that the Elkhound parser reads. BaseLexer updates these fields during lexing according to what the lexer actions do.
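The following stand-alone sketch shows the shape of that interface; the typedefs and exact member declarations are illustrative stand-ins, not the real declarations in lexerint.h:

    // Sketch of the LexerInterface contract: the parser reads these three
    // fields, and the lexer updates them each time it advances to the
    // next token.
    typedef unsigned long SemanticValue;   // opaque value for reduction actions
    typedef int SourceLoc;                 // encoded source location

    class LexerInterface {
    public:
      int type;            // token classification code
      SemanticValue sval;  // semantic value of the current token
      SourceLoc loc;       // location of the current token
      virtual ~LexerInterface() {}
    };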
BaseLexer is extended by the lexer module, which defines the Lexer class, which contains the methods that the lexer actions invoke.
If you would like to see the results of just lexing an input file, the tlexer program (tlexer.cc) will read in a preprocessed C/C++ source file and print out the sequence of tokens that would be yielded to the parser.
Parsing is the process of converting a token stream into an Abstract Syntax Tree (AST). In Elsa, the AST produced by the parser is not necessarily a tree at all, but a Directed Acyclic Graph (DAG) in general, because of ambiguities. However, we still call it an AST.
The parser is written in a mixture of Elkhound and C++; the Elkhound language is used to declare the terminals, nonterminals, and productions, while the reduction actions associated with the productions are written in C++. The parser description is a combination of three files:
There are three output files from Elkhound:
The AST is described by a language that is input to the astgen tool. The description is comprised of several files:
The output files are:
Most of the parser actions are straightforward: combine the AST elements that comprise the production right-hand side (RHS) into a single AST element that represents the left-hand side (LHS).
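As a hypothetical illustration (using simplified stand-ins for Elsa's AST classes, not actual cc.gr code), the action for a production like E -> E "+" E just builds the LHS node from the RHS subtrees:

    // Simplified stand-ins for astgen-generated AST classes.
    struct Expression { virtual ~Expression() {} };
    enum BinaryOp { BIN_PLUS };
    struct E_binary : Expression {
      Expression *e1; BinaryOp op; Expression *e2;
      E_binary(Expression *e1, BinaryOp op, Expression *e2)
        : e1(e1), op(op), e2(e2) {}
    };

    // Reduction action for  E -> E "+" E : combine the two RHS subtrees
    // into a single LHS node.
    Expression *make_plus(Expression *lhs, Expression *rhs)
    {
      return new E_binary(lhs, BIN_PLUS, rhs);
    }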
Two issues explained at the top of cc.gr are the various sources of names, and the handling of destructive actions.
Most instances of syntactic ambiguity are handled by using the ambiguity fields of certain AST nodes to explicitly represent the different alternatives. The type checker then has the responsibility for picking one alternative. For example, the ambiguous syntax "return (x)(y);" would be represented as shown in the following diagram, where the cast interpretation "return (x)y;" is shown in green, and the function call interpretation "return x(y);" is shown in purple. Note that the nodes for "x" and "y" are shared by both interpretations.
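In code terms, the alternatives are chained through an ambiguity pointer; this sketch uses simplified stand-ins for the astgen-generated classes:

    // Each node carries a link to the next alternative parse of the same
    // token span; NULL means the node is unambiguous.
    struct Expression {
      Expression *ambiguity;
      Expression() : ambiguity(0) {}
      virtual ~Expression() {}
    };
    struct E_cast : Expression { /* the "(x)y" reading */ };
    struct E_funCall : Expression { /* the "x(y)" reading */ };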
A few instances of ambiguity are handled at parse time, rather than deferring to the type checker as in the diagram above. This is done by writing keep functions that cancel a reduction if it can be determined that the syntax giving rise to the reduction has another (better) interpretation. For example, there is an ambiguity in template parameters because "<class T>" could be a type parameter called "T", or it could be a non-type parameter of (existing) type "class T" but with no parameter name. As the Standard specifies that this is always a type parameter, the reduction for non-type parameters cancels itself if the type is like "class T" (see the TemplateParameter -> ParameterDeclaration reduction and the associated keep function).
The tracing flag "printAST" to ccparse will print the (possibly ambiguous) AST as it exists before type checking.
The tracing flag "parseTree" will print the full parse tree. The parse tree shows the structure of reduction action calls by replacing every reduction action with one that builds a generic parse tree node out of its subtrees. This is useful for debugging ambiguities, since it shows exactly what happens in the parser, without interference from the actual reduction actions.
The type checker (cc_tcheck.cc) does five major jobs:
The fundamental data structure on which the type checker does its work is the AST, documented in cc.ast.html.
The tracing flag "printTypedAST" will print the AST after type checking.
AST disambiguation means choosing a single interpretation for AST nodes that have more than one (i.e. a non-NULL ambiguity field). The surrounding AST nodes are then modified to reflect the chosen alternative and forget about the others. The tracing flag "disamb" will report disambiguation activity.
Most disambiguation is done by the generic ambiguity resolver, in generic_amb.h. The resolveAmbiguity function simply invokes the type checker recursively on each alternative by invoking the mid_tcheck method. If exactly one alternative successfully type-checks (i.e. does not report any errors), then that alternative is selected: the ambiguity link for the selected node is nullified, and the selected node is returned so the caller can update its AST pointer accordingly. It is an error if more or fewer than one alternative type-checks.
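The resolver's algorithm is roughly the following. This is a simplified sketch with stand-in types; the real templatized code in generic_amb.h also takes care of removing the error messages produced by rejected alternatives:

    #include <cassert>
    #include <cstddef>

    struct Env {                       // stand-in for Elsa's Env
      int errors;
      int numErrors() const { return errors; }
    };

    struct Node {                      // stand-in for an ambiguous AST node
      Node *ambiguity;                 // linked list of alternatives
      void mid_tcheck(Env &env);       // type-check one alternative
    };

    Node *resolveAmbiguity(Node *list, Env &env)
    {
      Node *winner = NULL;
      int numOk = 0;
      for (Node *alt = list; alt; alt = alt->ambiguity) {
        int before = env.numErrors();
        alt->mid_tcheck(env);
        if (env.numErrors() == before) {   // no new errors: it type-checked
          winner = alt;
          numOk++;
        }
      }
      assert(numOk == 1);              // exactly one alternative must survive
      winner->ambiguity = NULL;        // the selected node forgets the others
      return winner;
    }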
As recursive type checking can sometimes involve doing computations unrelated to the disambiguation, such as template instantiation, at certain points the type checker uses the InstantiationContextIsolator class (cc_env.h) to isolate those computations. They will only be done once (regardless of how many times they occur in ambiguous alternatives), and any errors generated are not considered by the ambiguity resolver.
Not every ambiguous situation will be resolved by the generic resolver. In particular, there is a very common ambiguity between E_funCall and E_constructor, since the parser can almost never tell whether the "f" in "f(1,2,3)" is a type or a function. If the generic procedure were used, this would lead to exponential type-checking time for expressions like "f1(f2(f3(...(fN())...)))". Since this disambiguation choice depends only on the function/type and not the arguments, Expression::tcheck explicitly checks for this case and then just checks the first component by invoking the inner1_tcheck method. Once the selection is made, inner2_tcheck is invoked to finish checking the argument list.
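The ambiguity itself is visible in ordinary C++ source; only the lookup of the head name distinguishes the two node kinds:

    int g(int x) { return x; }
    typedef int T;
    int main() {
      int a = g(1);   // E_funCall: g names a function
      int b = T(1);   // E_constructor: T names a type
      return a + b;
    }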
There are a few other ad-hoc disambiguation strategies here and there, such as for deciding between statements and declarations, and resolving uses of implicit int (when K&R support is enabled).
Declared entities are represented by Variable objects (variable.h). In general, lookup is the process of mapping a name (which is a string of characters) and a context (scopes, etc.) to a Variable. AST nodes that contain names subject to lookup, such as E_variable or E_fieldAcc, contain a var field. The var field is initially NULL, and the type checker sets it to some non-NULL value once it figures out which Variable the name refers to.
There are many kinds of entities represented by Variables, as shown in this diagram:
On the left half of the diagram are names corresponding to types, and on the right half are non-type entities. Types are introduced by a typedef or by a class declaration. Non-types are introduced by a variable declaration, or a function prototype or definition. A few oddballs, such as enumerators and namespaces, round out the set. The neighborhoods of the class and function template boxes are expanded in a later diagram, below.
Every Variable has a name (but it might be NULL), a type (only NULL for namespaces), and some flags. The name is how the entity is found by lookup. The type is either the denoted type (for type entities) or the type of the variable (for non-types). The flags say what kind of entity a given Variable is; by interrogating the flags, one can determine (for any given Variable object) to which box in the diagram it belongs.
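A stand-alone sketch of that core follows; the real Variable in variable.h has many more members, and the flag values shown here are illustrative:

    enum DeclFlags {
      DF_NONE      = 0x0,
      DF_TYPEDEF   = 0x1,   // the name denotes a type
      DF_NAMESPACE = 0x2    // the name denotes a namespace
      // ... many more kinds of entities
    };

    struct Type;            // internal structure described in cc_type.html

    struct Variable {
      char const *name;     // may be NULL
      Type *type;           // NULL only for namespaces
      DeclFlags flags;      // which box in the diagram this entity occupies
    };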
It may seem odd that so many kinds of entities are represented with the same Variable class. The reason is that all of these entities are looked up in the same way, and all of these entities' names hide each other (when scopes are nested), so the Variable is the fundamental element of a Scope (cc_scope.h). The word "name" in quotes suggests this connection, as all of these entities correspond to what the C++ Standard simply calls a "name".
In C++, function names and operators can be overloaded, meaning there is more than one entity with a given name. The name is mapped to an entity by considering the context in which it is used: for a function call, the argument types determine the overloaded entity; when taking the address of a function, the type of the variable receiving the address determines which entity has its address taken; etc.
Elsa represents overload sets by having a single representative Variable contain a linked list, the contents of which are the overload set (including the representative itself). Initially, type-checking an E_variable or E_fieldAcc that refers to an overloaded name will set the node's var field to point at the set representative. Later, the type checker uses the call-site arguments to pick the correct entity, and the E_variable or E_fieldAcc node's var field is modified to point directly at the chosen element.
At the moment, there is no way to distinguish between a Variable object denoting an overloaded set, and a Variable object denoting just the specific entity that happens to be the set representative, so this distinction must be inferred by context (i.e. before overload resolution has occurred, or after it). This design might be changed at some point.
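A small input illustrating the resolution process described above:

    // The E_variable node for "f" in "f(1.5)" initially points at the
    // overload-set representative; overload resolution then repoints its
    // var field directly at f(double).
    void f(int) {}
    void f(double) {}
    int main() { f(1.5); return 0; }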
When Elsa finds that an operator is overloaded, it again uses the arguments to select the proper operator. If the selected operator is a built-in operator, the (say) E_binary node is left alone. But if a user-defined operator is chosen, then the node is changed into an E_funCall to reflect that, semantically, a function call occurs at that point. One way to observe this change is to pretty-print the AST (see pretty printing).
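For example, in the following input the E_binary node for "a + b" is changed into an E_funCall, as if "operator+(a, b)" had been written:

    struct A { int v; };
    A operator+(A x, A y) { A r; r.v = x.v + y.v; return r; }
    int main() { A a, b; a.v = 1; b.v = 2; return (a + b).v; }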
Expressions (and a few other nodes) have a type associated with them. The type checker computes this type, and stores it in the type field of the node.
Types themselves have internal structure, which is explained in cc_type.html.
When an object is (say) passed as an argument to a function, depending on the types of the argument and parameter, an implicit conversion may be required to make the argument compatible with the parameter. This determination is made by the implconv module. Among the kinds of implicit conversions there are user-defined conversions, conversions accomplished by calling a user-defined function. When Elsa finds that user-defined conversion is required, it modifies the AST to reflect the use of the conversion function, as if it had been written explicitly.
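For example, in the following input the call "f(s)" requires the user-defined conversion S::operator int(); Elsa rewrites the call as if it had been written "f(s.operator int())":

    struct S { operator int() const { return 7; } };
    void f(int) {}
    int main() { S s; f(s); return 0; }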
Bug: While Elsa currently (10/12/04) does all the necessary computation to determine if a user-defined conversion is needed, in some cases it fails to rewrite the AST accordingly. This will be fixed at some point (hopefully soon).
Elsa does template instantiation for two reasons. First, instantiation of class template declarations is required in order to compute annotations such as expression types, since the type of an expression involving a member of a template class depends on that template class's definition. Second, instantiation of function templates (including members of class templates) lets analyses ignore the (polymorphic) template definitions and concentrate on the (monomorphic) instantiations, which are usually easier to analyze.
Function templates are represented with a Variable (variable.h) to stand for the function template, and an associated TemplateInfo (template.h) structure to remember the template parameters (including default arguments), and any instantiations that have been created:
Class templates are also represented by a Variable/TemplateInfo pair. The wrinkle is that template classes can have explicit specializations, user-provided classes for use when certain template arguments are supplied (for example, a generic Vector template might have an explicit specialization for Vector<char> that uses a more efficient representation):
Function templates are instantiated as soon as a definition and a use (the use supplying the template arguments) have been seen. This is done by calling Env::instantiateFunctionTemplate (template.cc), which returns a Variable/TemplateInfo pair that represents the instantiation. If the instantiation has already been made, the existing one is returned. If not, the template definition AST is cloned (deeply copied), the template parameters are bound to their arguments, and the entire definition re-type-checked.
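For example, in the following input the use "max(1, 2)" supplies T=int; at that point the definition's AST is cloned, T is bound to int, and the clone is type-checked:

    template <class T> T max(T a, T b) { return a < b ? b : a; }
    int main() { return max(1, 2); }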
Class templates are instantiated as soon as a use is seen; a program is ill-formed if a definition has not been seen by the time of first use. Instantiation is done by calling Env::instantiateClassTemplate (template.cc). As above, if the instantiation already exists it is re-used; otherwise the template definition is cloned and re-type-checked.
Function members of class templates are not instantiated until a use of the member is seen. For members whose definition appears "inline" in the class body, the MR_func::f field points at the uninstantiated template body. The body will be cloned and type-checked only when it is instantiated. One consequence of this design is that analyses (usually) need to avoid looking at such uninstantiated members; one way to do this is by using ASTTemplVisitor (cc_ast_aux) to do the traversal, as it automatically skips such methods.
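For example, in the following input A<int>::f is cloned and type-checked because it is used, while g is never used, so its body is never instantiated:

    template <class T> struct A {
      T f() { return T(); }
      T g() { return this->undeclared(); }  // only checked if instantiated
    };
    int main() { A<int> a; return a.f(); }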
The C++ Standard has fairly elaborate rules for deciding when a type or a name in a template definition is dependent on the template parameters. Furthermore, it specifies that names and types that are not dependent must be looked up in the context of the original template definition, not the instantiation context (as is the case for dependent names).
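For example, in the following input "g(0)" and "n" do not depend on T, so they are looked up in the definition context, while "t.f()" is dependent and is resolved separately for each instantiation:

    int g(int x) { return x; }
    int n = 5;
    template <class T> int h(T t) {
      return t.f()   // dependent: resolved per instantiation
           + g(0)    // non-dependent: bound at definition time
           + n;      // non-dependent: bound at definition time
    }
    struct S { int f() { return 1; } };
    int main() { return h(S()); }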
To implement this (and to disambiguate template definition ASTs), Elsa type-checks function template definitions in advance of any instantiation. A dependent type is represented by the ST_DEPENDENT pseudo-type (see enum SimpleTypeId in cc_flags.h).
Furthermore, while type checking the template definition, if a name lookup is determined to not be dependent, the nondependentVar field is set to the same thing as the var field (both are fields of AST nodes that have names subject to lookup). Later, when an instantiation is created, the nondependentVar value is preserved by cloning, and used instead of doing a new lookup, if it is not NULL.
When a class template instantiation is requested but one or more arguments is dependent, a PseudoInstantiation type (template.h) is created. This is more precise than simply yielding ST_DEPENDENT (a precision that is necessary in some cases), and much cleaner than doing a full "instantiation" with incomplete information.
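For example, in the following input the type A<T> inside "outer" cannot be fully instantiated because T is still a template parameter, so it is represented as a PseudoInstantiation of A:

    template <class T> struct A { T x; };
    template <class T> A<T> outer(T t) {
      A<T> a;    // type: PseudoInstantiation of A with argument T
      a.x = t;
      return a;
    }
    int main() { return outer(0).x; }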
Similarly, when type checking a template definition, the template type parameters are bound to (unique) instances of TypeVariable (template.h) objects.
Bug: There are additional cases where Elsa needs to use something more precise than ST_DEPENDENT, but does not do so currently. An example of code that fails because of this bug is in/t0290.cc.
(sm: 10/12/04: stopped here)
Elsa will perform various post-processing on request.
Tracing flag "printHierarchies" will print inheritance hierarchies in Dot format. Interesting in that virtual inheritance is represented properly; for example in/std/3.4.5.cc yields 3.4.5.png.
Tracing flag "prettyPrint" will print out the AST as C++. This is still somewhat incomplete. Maybe say something here about how this can be used an an extension to do source to source translation the way oink/cc_qual infers dataflow annotations and then prints them out again; this is worth mentioning as it is a requested feature
Oink can also print the control flow graph as a Dot file, and it will soon contain a way to print the data flow graph at both the expression and type-component granularities.
Elsa was designed to be extended with various backends, such as program analysis tools; one might easily extend it to be a compiler. Oink is one such extension: a collection of program analysis tools built on top of Elsa.