Initial Outline
By Daniel Wilkerson and Scott McPeak
This file is an attempt to articulate the design of elsa to someone who would like to work on it.
The goals of elsa are as follows.
To accommodate the multiple goals, elsa is extensible; that is, one may arrange one's build process to build elsa and incorporate extensions to lexing, parsing, and to some extent typechecking without modifying the original elsa distribution files. This "base-and-extension design pattern" occurs frequently in the elsa build process. The ability to read C99 and GNU extensions to C++ is implemented within elsa using these extension mechanisms.
Processing of an input file proceeds in the following three stages.
cc.lex is the base lexer file, written in flex. To extend elsa one may write a file containing a flex lexer fragment, say an "_ext.lex" file, and have it merged into cc.lex by merge-lexer-exts.pl to produce lexer.lex; gnu.lex is one such lexer extension file, used to add the lexing of GNU token extensions.
lexer.lex is subsequently compiled into lexer.yy.cc by flex; note that we make flex generate a C++ lexer.
Multiple parts of the system need to know various properties of the tokens that the lexer may produce.
These three files share partly redundant information, so they are all generated by elkhound/make-token-files from ".tok" files. The ".tok" language is so simple that the base cc_tokens.tok and any extension ".tok" files can simply all be passed in at once or concatenated first.
The elsa parser is written in elkhound. Elkhound is a language reminiscent of yacc/bison and with the same purpose: to allow the client to describe a grammar declaratively, with user actions at each parsing stage, and to have a parser that executes those user actions generated automatically from that description. Elkhound is different in that it allows ambiguous grammars; it uses the GLR algorithm to handle this. Elkhound was written and is maintained by Scott McPeak.
The elkhound source files in elsa end in ".gr". The base file is cc.gr which is meant to be a manifestation of strict C++ 98. The extension files for gcc are in gnu.gr. If your favorite editor has a C++ mode it is likely to work well for ".gr" files.
How much about elkhound should go here and how much in a similar file for users of elkhound that should probably be in the elkhound tree? Need to educate the user about verbatim sections, merge, dup and del actions, how to detect when ambiguity has exploded, how precedence is represented. sm: see elkhound/index.html, linked above
There are times when the GLR algorithm cannot resolve all ambiguity itself within the parser, but instead comes up with two valid parses of a substring of the input. This results in calling the "merge" action on a non-terminal. This merge action can do anything, but in elsa it always simply appends one subtree onto the 'ambiguity' linked-list of the other. These ambiguities will be resolved during the typechecking phase.
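The merge behavior described above can be sketched in plain C++. This is only an illustration under stated assumptions: Node, mergeAlternatives and countAlternatives are hypothetical names invented here, not the classes or actions that elsa actually uses; only the idea of an 'ambiguity' link that chains alternative parses comes from the text.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for an AST node class that may be ambiguous.
// Only the 'ambiguity' link is taken from the description above.
struct Node {
  std::string desc;           // label for this particular parse
  Node *ambiguity = nullptr;  // next alternative parse of the same input, or null

  explicit Node(std::string d) : desc(std::move(d)) {}
};

// Sketch of a merge action: append 'right' (together with any alternatives
// already chained to it) to the end of the ambiguity list headed by 'left',
// and return 'left' as the representative of the merged subtree.
Node *mergeAlternatives(Node *left, Node *right) {
  assert(left && right);
  Node *tail = left;
  while (tail->ambiguity) {
    tail = tail->ambiguity;
  }
  tail->ambiguity = right;
  return left;
}

// Count the alternatives reachable from 'head', including 'head' itself.
int countAlternatives(const Node *head) {
  int n = 0;
  for (const Node *p = head; p; p = p->ambiguity) {
    ++n;
  }
  return n;
}
```

Returning the left node keeps a single entry point into the list, so whatever holds the merged subtree needs no special knowledge of how many alternatives hang off it.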
You will notice that most of the rules in the parsing user actions make Abstract Syntax Tree (AST) nodes on the heap; the general pattern is that they
The AST classes so constructed are rather tedious to write as traditional C++, as one soon discovers upon attempting it. In elsa they are built using Scott McPeak's AST description language generator called astgen; the input files end in ".ast". If your favorite editor has a C++ mode it is likely to work well for ".ast" files too.
The base C++ AST is defined in cc.ast (and the gnu extension is in gnu.ast) and is documented separately. Talk about how .ast files merge together, how the argument lists concatenate and hierarchies merge, the purpose of each .ast file in elsa, such as how typechecking concerns are factored from raw parsing concerns; warn the user of the screwy feature that you can leave out the pointer and that is still legal and also means something else, point out how it will only generate a 2-level deep hierarchy.
Tracing flag "printAST" will print the (possibly ambiguous) AST before type checking.
The typechecking phase accomplishes five semantically independent results.
Type annotation: Some objects, such as expressions, have a well-defined type associated with them. After typechecking, these objects have been annotated with type objects. Type objects are declared in cc_type.h and are documented separately. The notion of identity for types is quite messy and subtle; mention that the type factory may be overloaded. The tracing flag "env" will print typechecking environment modifications as they happen.
Variable resolution: Identifiers name a variable. This is how different parts of a file may refer to the same thing; therefore all occurrences of a variable that are meant to refer to the same thing are annotated with a pointer to the same Variable object.
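As a rough illustration of what "annotated with a pointer to the same Variable object" means, here is a minimal sketch with a single flat scope. Everything except the name Variable is an assumption made for the example: NameUse and Scope are hypothetical, and the real environment has nested scopes, overloading and much more.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the Variable object mentioned above.
struct Variable {
  std::string name;
  explicit Variable(std::string n) : name(std::move(n)) {}
};

// Hypothetical AST node for one occurrence of an identifier.
struct NameUse {
  std::string name;
  Variable *var = nullptr;  // filled in by resolution
  explicit NameUse(std::string n) : name(std::move(n)) {}
};

// Hypothetical single-level scope (an assumption; real scoping is nested).
class Scope {
public:
  // Declare 'name'. As a simplification, redeclaring an existing name
  // returns the Variable that is already there.
  Variable *declare(const std::string &name) {
    std::map<std::string, Variable *>::iterator it = table.find(name);
    if (it != table.end()) {
      return it->second;
    }
    owned.push_back(std::unique_ptr<Variable>(new Variable(name)));
    Variable *v = owned.back().get();
    table[name] = v;
    return v;
  }

  // Return the Variable declared under 'name', or null if there is none.
  Variable *lookup(const std::string &name) const {
    std::map<std::string, Variable *>::const_iterator it = table.find(name);
    return it == table.end() ? nullptr : it->second;
  }

  // Annotate one occurrence; returns false if the name is undeclared.
  bool resolve(NameUse &use) const {
    use.var = lookup(use.name);
    return use.var != nullptr;
  }

private:
  std::map<std::string, Variable *> table;
  std::vector<std::unique_ptr<Variable>> owned;  // keeps Variables alive
};
```

After resolution, asking whether two occurrences mean the same thing reduces to a pointer comparison on their var fields.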
Disambiguation: As was mentioned in the parsing section, the parse may be ambiguous: certain classes of AST nodes have a linked list of alternative parses. One of these must be chosen; the general strategy is to typecheck all of them and keep the one that results in no errors. It should not be possible for more than one to typecheck, and it is simply a user error if none of them typecheck. Much of this is generic. (Describe generic ambiguity resolution.) Some of it is special-cased for efficiency, to handle common ambiguities that tend to cause an exponential blowup. (Mention function call / ctor call ambiguity.) The tracing flag "disamb" will print disambiguation activity as it occurs. The tracing flag "mustBeUnambiguous" will cause elsa to verify after type checking that there are no remaining ambiguities in the AST and, if there are, abort.
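The general strategy (typecheck every alternative, keep the one with no errors) can be sketched as follows. This is an illustration only, not elsa's implementation: Alt, Checker, Outcome, Result and resolveAmbiguity are hypothetical names, and the checker is modeled as a callback that returns an error count.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical ambiguous AST node: alternatives are chained via 'ambiguity'.
struct Alt {
  std::string desc;
  Alt *ambiguity = nullptr;
  explicit Alt(std::string d) : desc(std::move(d)) {}
};

// Assumed interface: typecheck one alternative and return its error count.
typedef std::function<int(const Alt &)> Checker;

enum class Outcome { Resolved, NoneTypecheck, StillAmbiguous };

struct Result {
  Outcome outcome;
  Alt *chosen;  // non-null only when outcome == Outcome::Resolved
};

// Typecheck every alternative and keep the single one that has no errors.
Result resolveAmbiguity(Alt *head, const Checker &check) {
  Alt *chosen = nullptr;
  int successes = 0;
  for (Alt *p = head; p; p = p->ambiguity) {
    if (check(*p) == 0) {
      ++successes;
      if (!chosen) {
        chosen = p;
      }
    }
  }
  if (successes == 0) {
    return {Outcome::NoneTypecheck, nullptr};   // user error in the input
  }
  if (successes > 1) {
    return {Outcome::StillAmbiguous, nullptr};  // should not be possible
  }
  chosen->ambiguity = nullptr;  // detach the winner from the losing parses
  return {Outcome::Resolved, chosen};
}
```

Note that this generic loop checks every alternative in full, which is the behavior that can blow up when ambiguities nest; that cost is the motivation for the special cases mentioned above.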
There is a fifth thing but I can't remember what it is; it goes here.
Elaboration: Unlike in C, in C++, much syntax is implied.
The tracing flag "printTypedAST" will print the AST after type checking.
Elsa will perform various post-processing on request.
Tracing flag "printHierarchies" will print inheritance hierarchies in Dot format. This is interesting in that virtual inheritance is represented properly; for example, in/std/3.4.5.cc yields 3.4.5.png.
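For readers unfamiliar with Dot, the following sketch shows one way an inheritance hierarchy could be written out as a Dot digraph. It is not elsa's printer and does not reproduce its output: BaseEdge and hierarchyToDot are hypothetical names, and drawing virtual inheritance as a dashed edge is an assumption made purely for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record of one "derived inherits from base" relationship.
struct BaseEdge {
  std::string derived;
  std::string base;
  bool isVirtual;  // true for virtual inheritance
};

// Render the edges as a Dot digraph. Virtual inheritance is drawn with a
// dashed edge (an illustrative choice, not necessarily what elsa emits).
std::string hierarchyToDot(const std::string &graphName,
                           const std::vector<BaseEdge> &edges) {
  std::ostringstream out;
  out << "digraph \"" << graphName << "\" {\n";
  for (const BaseEdge &e : edges) {
    out << "  \"" << e.derived << "\" -> \"" << e.base << "\"";
    if (e.isVirtual) {
      out << " [style=dashed]";
    }
    out << ";\n";
  }
  out << "}\n";
  return out.str();
}
```

With a virtual diamond (B and C each inheriting virtually from A, and D from both), the shared base A appears as a single node, which is the property that makes such a picture useful.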
Tracing flag "prettyPrint" will print out the AST as C++. This is still somewhat incomplete. Maybe say something here about how this can be used as an extension to do source-to-source translation, the way oink/cc_qual infers dataflow annotations and then prints them out again; this is worth mentioning as it is a requested feature.
Maybe I should point out that in oink you can print out the control flow graph as a dot file as well; oink will also soon contain a way to print the data flow graph at both the expression and the type-component granularities.
Say something here about how to extend elsa.
Elsa was designed to be extended with various backends, such as program analysis tools; one might easily extend it to be a compiler. Not to self-advertise, but it might be helpful to mention Oink here and any other extensions you know about.