Research objectives
The research objectives of the CompCert project are twofold: (1) to investigate the development and correctness proof of verified compilers, and (2) to improve the tools needed by such a proof, such as program proof technology and mechanized semantics. More specifically, the research conducted in CompCert falls in the following four themes.
A formally verified C compiler
We have developed and formally verified a realistic compiler for a large subset of the ISO C 1999 language, called CompCert C. More details on this subset of C and on the compiler itself can be found on this separate page. We use the Coq proof assistant not only to conduct the correctness proof, but also to program most of the compiler; the code extraction facility of Coq then generates executable Caml code from the Coq functional specifications.
The compiler is structured in 16 passes and uses 10 intermediate language. One intermediate language called Cminor is a much simplified version of C, typeless but still structured. Other intermediate languages are variations on register transfer language (RTL), also known as "3-address code". The compiler implements state-of-the-art register allocation by graph coloring (using the Appel-George coloring algorithm), as well as several dataflow-based optimizations: constant propagation, common subexpression elimination (by value numbering applied to extended basic blocks), dead code elimination.
To deploy additional, more ambitious optimizations, we have studied the "verified verifier" approach, where advanced optimizations such as trace-based instruction scheduling and lazy code motion are performed by untrusted Caml code, then verified a posteriori for correctness using symbolic interpretation. Only the verifier needs to be written and proved correct in Coq.
Mechanized semantics
A central issue in compiler verification is to define formal semantics for the source, intermediate and target languages that lend themselves to machine-assisted reasoning over program transformations. Initially, we used natural semantics, a.k.a "big-step" semantics, for most of our languages. These semantics could only describe terminating executions and would let us observe only the exit code of the whole program, resulting in rather weak semantic preservation theorems. We then added traces to the semantics, to define the behavior of a program as the trace of all input/output events it performs, and developed coinductive natural semantics, a novel form of big-step semantics that describe non-terminating executions.
We now use small-step semantics throughout the CompCert compiler, so that all languages and semantic preservation proofs can be cast in a common framework of labeled transition systems (LTS). For the source and high-level intermediate languages, which feature structured control, we use a "focused" form of small-step semantics where we reduce not full terms, but pairs of a subterm under focus and a context (a.k.a. a continuation). The small-step semantics for CompCert C was made executable, so that it serves as the basis for a reference interpreter for CompCert C, which animates the semantics on actual C programs.
To validate these operational semantics and pave the way to connections with program provers, we also studied axiomatic semantics for Cminor and Clight, based on separation logic.
Improving Coq as a programming language
The Coq proof assistant plays a crucial role in the CompCert development: it is both the programming language in which the verified parts of the compiler are written, and the prover used to establish semantic preservation results.
We participated in the development of the Function
mechanism introduced in version 8.1 of Coq to facilitate definition
and reasoning over functions defined by deep pattern-matching and
structural or well-founded recursion. This mechanism automates the
generation of rich, easy-to-use induction and inversion principles
over such definitions.
Coq is based on a logic of total (terminating) functions. However, it is difficult to prove termination of some of the functions involved in a compiler, and such termination proofs are not really useful since we are mostly interested in the partial correctness of the compiler. We investigate an approach to defining and reasoning over partial functions defined in Coq by general recursion. This approach is based on Tarski's fixpoint theorem and involves classical logic (an axiom of description) that can nonetheless be realized by a Caml fixpoint combinator.
The availability of efficient, verified, purely functional ("persistent") data structures plays a crucial role in the CompCert development. We improved the FSet library of finite sets (implemented efficiently as balanced binary trees), and developed and verified an FMap library of finite maps based on similar principles. Both are now part of the Coq standard library. We developed and verified several other useful data structures: radix-2 trees, heaps, union-find, ...
Other trusted execution paths
We investigate the use of CompCert's parts and methodology in verified compilers and code generators for other languages.
We have developed and verified a compiler from mini-ML (the pure, strict functional programming language that is the target of Coq's extraction mechanism) to the Cminor intermediate language of CompCert. This compiler can then reuse the back-end of the CompCert C compiler to generate provably correct PowerPC executable code. However, the run-time system (memory allocator, garbage collector) was not verified.
We also investigated several approaches to proving the correctness of a code extraction mechanism similar to that of Coq. One approach relies on proving directly the correctness of the extractor. Another approach is to provide tools to establish semantics preservation for one particular run of extraction.
More recently, we participated in the Velus project, which builds a verified compiler for the Lustre reactive language, producing C code in the form of Clight abstract syntax that can be compiled directly by CompCert. The end-to-end formal verification of Velus + CompCert proves that the trace of observable events for the executable machine code is a solution to the set of equations described by the Lustre source code.