Elsa: The Elkhound-based C/C++ Parser
Elsa is a C and C++ parser. It is based on the
Elkhound parser generator.
It lexes and parses input C/C++ code into an abstract syntax tree.
It also does some type checking, but (currently) only insofar as that
is required to disambiguate the syntax. The only major C++ features
still not implemented are namespaces and template partial
specializations.
To download Elkhound and Elsa, see the
Elkhound distribution page.
Additional documentation:
- cc.ast.html: The C/C++ abstract syntax tree
created by the parser.
- cc_type.html: The type representation
objects created by the type checker.
Elsa requires the following external software:
- elkhound, a GLR parser generator.
- ast, a system for making abstract syntax trees.
- smbase, my utility library.
- Flex,
a lexical analyzer generator.
Build instructions:
$ ./configure
$ make
$ make check
./configure understands
these options. You can also
look at the Makefile.
Parsing some sample input:
$ ./ccparse in/t0001.cc
The above command will parse and type check the given file. To
make it print the annotated, post-type-check AST, say
$ ./ccparse -tr printTypedAST in/t0001.cc
Additional -tr flags of interest:
- printAST: Print the (possibly ambiguous) AST before type checking.
- printTypedAST: Print the AST after type checking.
- env: Print environment modifications as they happen.
- disamb: Print disambiguation activity.
- printHierarchies: Print inheritance hierarchies in
Dot format.
This output is interesting in that virtual inheritance is represented
properly; for example, in/std/3.4.5.cc yields
3.4.5.png.
- mustBeUnambiguous: After type checking, scan the AST to verify there
are no remaining ambiguities. If there are, abort.
- prettyPrint: Print out the AST as C++. This is still somewhat incomplete.
The -tr flags can be passed separately, or strung together
separated by commas (e.g. "-tr env,disamb,printAST").
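To illustrate how separate and comma-joined -tr arguments can end up naming the same set of flags, here is a minimal sketch. The helpers splitTraceFlags and collectTraceFlags are hypothetical names invented for this example; they are not part of Elsa or smbase, and the real option handling may well differ.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not Elsa's code): split a comma-separated flag
// list such as "env,disamb,printAST" into a set of flag names.
std::set<std::string> splitTraceFlags(const std::string &arg)
{
  std::set<std::string> flags;
  std::istringstream in(arg);
  std::string name;
  while (std::getline(in, name, ',')) {
    if (!name.empty()) {        // skip empty pieces, e.g. from "env,,disamb"
      flags.insert(name);
    }
  }
  return flags;
}

// Hypothetical helper: gather the flags from every "-tr <list>" pair
// found on a command line.
std::set<std::string> collectTraceFlags(int argc, const char *const *argv)
{
  std::set<std::string> all;
  for (int i = 1; i + 1 < argc; i++) {
    if (std::string(argv[i]) == "-tr") {
      std::set<std::string> f = splitTraceFlags(argv[i + 1]);
      all.insert(f.begin(), f.end());
      i++;                      // skip the flag list just consumed
    }
  }
  return all;
}
```

Under this sketch, "-tr env -tr disamb,printAST" and "-tr env,disamb,printAST" produce the same three-element set.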
Module List:
- cc.ast:
C/C++ Abstract Syntax Tree. This file is the most important
one in the parser, since it defines the interface between
the parser and everything else that comes after it. It is
documented separately in cc.ast.html.
- cc.gr:
C/C++ parsing grammar. This is the second-most important file,
as it tells Elkhound how to parse the token stream. This grammar
is based on that in the C++ Standard document, but then modified
to remove unnecessary ambiguities and improve the grammar's ability
to extract structure.
- cc_ast_aux.cc:
Some auxiliary functions for cc.ast.
- cc_env.h,
cc_env.cc:
Env, the type checking environment. Fundamentally just a stack of
Scopes (cc_scope.h), plus some global
type checking state.
- cc_err.h,
cc_err.cc:
ErrorMsg, an object for representing type checking errors. For now
it's just an error string plus some metadata (like source location),
but I plan to evolve it to include more structured data like pointers
to (instead of just string representations of) the types involved in
the error.
- cc_flags.h,
cc_flags.cc:
This module defines a variety of enums relevant to parsing and
type checking C++, including enums for all the built-in types,
operators, etc.
- cc_lang.h,
cc_lang.cc:
CCLang, a package of language dialect options. Setting flags in
this class tells the lexer, parser and type checker what language
options to support (e.g. C vs. C++).
- cc_print.ast,
cc_print.h,
cc_print.cc:
cc_print is a module to pretty-print the AST using C++ syntax. It
extends the AST with entry points for printing.
- cc_scope.h,
cc_scope.cc:
A Scope is three maps: variables, compounds, and enums. The
environment (cc_env.h) consists of a stack
of them.
- cc_tcheck.ast,
cc_tcheck.cc:
cc_tcheck is the type checker. It consists of an AST extension to
add type checking entry points and annotations, and an implementation
of all of those type checking functions. It's the most complicated
part of the parser.
- cc_tokens.tok:
This file lists all of the kinds of tokens the lexer recognizes. It's
designed to be extended simply by appending. The script
elkhound/make-token-files
takes this as input, and generates
cc_tokens.h,
cc_tokens.cc and
cc_tokens.ids. This last file is then
included into cc.gr (the others participate in
compilation in the obvious way).
- cc_type.h,
cc_type.cc:
cc_type defines a hierarchy of type representation objects. These
form the core data structure manipulated by the type checker.
They are documented separately in
cc_type.html.
- ccparse.h,
ccparse.cc:
This module defines part of the parser context class, and assists
minimally with parsing.
- lexer.lex,
lexer.h,
lexer.cc:
This module chops up a given C++ source file into tokens. It does
not do any preprocessing, so one must use an external preprocessor
first.
- main.cc:
This module contains the main() function of the parser. It's a simple
driver around the other modules, and can be extended to invoke
other tasks that come after parsing.
- parssppt.h,
parssppt.cc:
This is a poorly-designed module intended to abstract some of the
functionality otherwise common to main()-providing modules. It
needs to die.
- tlexer.cc:
Simple test driver program for the lexer.
- variable.h,
variable.cc:
Variable, a class for holding information about names in the
"variable" namespace. See
variable.h for a list of the kinds
of things that get represented with Variables. This module
is closely related to cc_type.
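The cc_scope and cc_env entries above describe a Scope as three maps and the environment as a stack of Scopes. The following is a rough sketch of that shape only. Every type and member name here is an assumption made for illustration; the real declarations are in cc_scope.h, cc_env.h and variable.h and carry much more state.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the entities stored in a scope.
struct Variable     { std::string name; std::string type; };
struct CompoundType { std::string name; };
struct EnumType     { std::string name; };

// A scope as three maps: variables, compounds, and enums.
struct Scope {
  std::map<std::string, Variable>     variables;
  std::map<std::string, CompoundType> compounds;
  std::map<std::string, EnumType>     enums;
};

// The environment as a stack of scopes; lookups search innermost first.
class Env {
public:
  Env() { enterScope(); }                 // start with a global scope

  void enterScope() { scopes.push_back(Scope()); }

  void exitScope() {
    assert(scopes.size() > 1);            // never pop the global scope
    scopes.pop_back();
  }

  void addVariable(const Variable &v) {
    scopes.back().variables[v.name] = v;
  }

  // Returns the innermost binding of 'name', or nullptr if there is none.
  const Variable *lookupVariable(const std::string &name) const {
    for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
      auto found = it->variables.find(name);
      if (found != it->variables.end()) {
        return &found->second;
      }
    }
    return nullptr;
  }

private:
  std::vector<Scope> scopes;
};
```

Searching from the top of the stack downward is what lets an inner declaration shadow an outer one with the same name, and popping the scope makes the outer binding visible again.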
Module dependency diagram:
[Figure: module dependency diagram; also available in PostScript.]
Miscellaneous files:
- chop_out:
This script extracts pretty-printed C++ syntax from the other
debugging output produced by ccparse.
- extradep.mk:
Build-time dependencies among auto-generated source files.
Produced by
elkhound/find-extra-deps.
- idemcheck:
Script to verify that parsing then pretty-printing is idempotent.
- in:
Directory with testcases.
- include:
When preprocessing, add this directory to the preprocessor's
search path. It contains compiler-specific headers. Generally
I just use gcc's headers, but some of gcc's headers use syntax
that Elsa doesn't (yet?) understand, so this directory contains
my replacements.
- regrtest:
Regression tests.
- test-parse:
Script to parse a file, making sure the parse is unambiguous.
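The property that idemcheck verifies can be stated as: running the transformation on its own output changes nothing further. Below is a small sketch of that check. It does not reproduce the idemcheck script; isIdempotentOn is a hypothetical name, and collapseWhitespace is only a toy stand-in for "parse, then pretty-print".

```cpp
#include <cctype>
#include <functional>
#include <string>

// Returns true if applying 'transform' a second time changes nothing,
// i.e. transform(transform(input)) == transform(input).
bool isIdempotentOn(
  const std::function<std::string(const std::string &)> &transform,
  const std::string &input)
{
  std::string once  = transform(input);
  std::string twice = transform(once);
  return once == twice;
}

// Toy stand-in for "parse, then pretty-print": collapse each run of
// whitespace to a single space and drop leading/trailing whitespace.
std::string collapseWhitespace(const std::string &src)
{
  std::string out;
  bool pendingSpace = false;
  for (unsigned char c : src) {
    if (std::isspace(c)) {
      pendingSpace = !out.empty();   // ignore leading whitespace
    } else {
      if (pendingSpace) {
        out += ' ';
      }
      pendingSpace = false;
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

With the real tool, the transformation would presumably be a ccparse run with pretty-printing enabled, with chop_out extracting the C++ text from the other debugging output.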