GCC Architectural Goals
Towards a more hackable compiler
Joseph Myers Diego Novillo
6 Dec 2011
GCC is now a mature and established code base, but parts of it have grown crusty and difficult to maintain and modify. Additionally, its sheer complexity makes it hard for contributors to understand the code, often requiring a significant and concerted effort before they are able to make changes to the compiler.
We argue that GCC must become easier to hack on; it should adopt modern software maintenance standards. In particular, it needs to follow modern software engineering design principles and be documented well enough that it never depends on the specific knowledge or skills of an individual developer.
This will improve the effectiveness of current contributors and help new and casual developers become proficient more quickly. New developers and casual developers are important as they spread the workload among more people; new features and fixes are easier to develop; and improving the core technology keeps the compiler relevant and competitive.
In this document we propose a set of design and development principles geared towards simplifying GCC development. Specifically, we aim to:
- Reduce the learning curve for new and casual contributors
For a project to succeed in the long term, it needs a continuous influx of new full time developers and casual developers (note that we are explicitly ignoring users, as that is not the focus of this document).
New developers are those who have the potential for becoming maintainers, while casual developers are those who approach the project infrequently with specific contributions. Both groups are important and their needs are not very different from the needs of established and experienced contributors.
- Improve the effectiveness of current contributors
Current contributors, particularly experienced ones, have developed their own work flow and are generally content with the status quo. However, that does not mean that it could not be improved. Some aspects of their work flow have been automated via ad hoc methods that are not commonly available. They use knowledge that is often undocumented and rarely automated. They have learnt how to deal with quirks and inefficiencies of the code base and, as a consequence, are rarely motivated to change it because (a) it works for them, and (b) their focus is getting that feature implemented or that bug fixed.
Achieving these goals will require some substantive changes to the organization of GCC’s code base. Several of the changes we propose will not be transparent and may even be controversial. Our intent is to reach consensus with the GCC developer community about the new development principles and document them in the GCC home page.
There is an underlying assumption here that where there is a mixture of styles in GCC, some code is operating in the preferred way and some in the less preferred way. New code should use the preferred way where possible and old code should be converted. Additionally, the existing Coding Standards should be updated to reflect new C++ coding guidelines.
In this section we describe, at a high level, the main design principles we propose. These serve as a reference framework for all the specific guidelines and projects discussed later on. The discussion in this section tries to stay away from specifics. Instead, a general overview of each principle is discussed, leaving the gory details for the pages dedicated to each principle.
Modularity is one of the key weaknesses of GCC. There is little or no clear separation between major components and they all need to be built together as a monolithic unit. This causes several issues:
- Developers working on one module are forced to build everything else. Modules cannot be tested or developed in isolation. No module-based or unit testing is possible. The full compiler binary is built at all times and the only testing possible is integration testing.
- There are few clear APIs, many functions perform multiple actions and often have hidden side-effects (e.g., DECL_ASSEMBLER_NAME not only computes the assembler name for a declaration, it also sets it, which has code generation consequences).
- The code is riddled with global state. Many global variables are used to coordinate the actions of independent modules of the compiler.
- There is a heavy reliance on pre-processor macros. These macros are not always documented, unnecessarily expose the fields of the data structures and increase debugging complexity, since it is generally not possible to call macros from gdb to query state (-g3 helps somewhat, but not always) and it is impossible to set breakpoints at invocation points.
- The central data structures are organized in a non-standard object-based scheme with no static typing. Every function handling an instance of these data structures receives the base class as an argument and needs to decide, at run time, whether the argument is valid. For instance, a function that only handles TREE_DECL trees ought to declare its argument as tree_decl, not tree.
The above issues introduce additional complexities on an already complex code base. They make re-factoring harder and adding new features becomes an exercise in frustration. GCC should be re-organized to address the above issues:
- The source code should be organized into modules, each with a well-defined interface, independently built and tested. This approach reduces the interaction between components, reduces the time needed to fix secondary effects of a change, and leads to better testability. Where possible, modules should be designed to support modern test practices (e.g., test-driven development, dependency injection and unit tests).
- Macros should not be used so heavily for referencing data structure fields, nor as placeholders for target- or language-specific hooks. They should be replaced with functions or function callbacks, depending on the context.
- The central data structures used for the intermediate representations (tree, gimple, rtl) should become their own language with its own syntax, type system and serializable representation.
- Inline functions inside headers should be replaced with out-of-line functions in C files, with inlining decided at link time. If one component of GCC cannot see the internals of another component at compile time, it cannot use them and must go through that component's actual interface; inline functions, however, often require those internals to be visible. Note that this becomes less important when using C++, thanks to private class members.
Code generated by GCC for a given target should not depend on the host system. In particular, either it should not depend on whether HOST_WIDE_INT is 32-bit or 64-bit, or HOST_WIDE_INT should be forced to 64-bit. Host floating point operations should not be used in any way that could affect GCC's output. Similarly, the results of compiling with debug info then stripping that info should be the same as the results of compiling without debug info.
To facilitate implementation, the code base needs to be accessible. This includes several layers of documentation, from high-level overviews to detailed API documents. Additionally, tools should exist to automate canonical ways of managing common activities: configuration, building, patch submission, testing, etc.
Accessibility also requires using common design and development frameworks, programming idioms, etc. New and casual developers should be able to apply what they learned elsewhere to GCC development. This reduces the learning curve.
GCC lacks high-level architectural documentation. It is hard to understand how different modules tie together. This information is spread throughout the minds of several maintainers, but it is not written down or maintained.
While internal documentation exists in the doc/ directory, it is not properly maintained and is generally limited to describing some aspects of a specific module of the compiler. It is also limited to specific implementation issues, with little or no design description.
GCC also lacks accessible API documentation. It is hard for a new developer to decide what functions or data structures to use for various tasks. The advice generally given is to copy from an existing similar feature or pass. This is sub-optimal and does not always work: developers do not understand the code they are copying, which in turn leads to common mistakes. Two types of documentation would help with this issue:
- API documentation. Automatically generated from the source tree. Several tools exist to simplify this process, notably doxygen, which is also used by libstdc++. The important feature of this kind of documentation is that it needs no manual maintenance: it is generated automatically and provides a quick reference for everyday use. In fact, it does not even need changes to the source code; these tools extract documentation from source code comments and declarations.
- Tutorials. The GCC wiki contains several of these already. More should be added and a section organized with documents outlining common idioms, patterns, debugging tips, building, etc. A good amount of documentation has been accumulating in GCC’s wiki at http://gcc.gnu.org/wiki/Document%20GCC%20Internals and http://gcc.gnu.org/wiki/GettingStarted. However, this needs better organization and maintenance.
Some of the development processes used by GCC developers require too much manual intervention. We need more sophisticated tools to assist the full development cycle:
- Building. In general, building GCC is a slow process but it is one of the better documented and supported aspects of development. What we are missing are automated bots to perform continuous or one-off builds to allow developers to more quickly test their changes across a set of machines.
- Patch submission and review. This is one of the most time-consuming and irritating aspects of development. Our development conventions require that patches be posted on gcc-patches, and the review process takes place via an e-mail conversation. This satisfies the requirement of keeping an archived log of the patch and related discussion, but it does not mean that the process has to be manual. The several requirements that a patch must meet currently demand too much manual work from the patch submitter; many of these actions could be automated by scripts. Patch review systems exist that facilitate the mechanical aspects of the review (while maintaining the archival copies on gcc-patches), but none of them are completely suited to GCC's development workflow. Currently, some contributors use Rietveld, but it has a major deficiency: review emails do not include meaningful context (they do not quote the diff hunks to which a review comment relates).
- Debugging. While some progress has been made in this area, debugging the compiler is still unnecessarily complicated. For instance, when the compiler ICEs, emits an error message, or hits any other exceptional event, it would be useful if, instead of just showing the location of the ICE, the compiler showed a stack dump or offered the option of attaching a debugger when the event triggers.
When debugging the compiler, pretty printers could be developed to inspect data structures in the compiler. Replacing macros with function calls would allow better support for inspecting compiler state from the debugger.
- Lint tools. Reviewers need to rely on manual inspection to determine whether a patch follows the GCC coding conventions. A lot of unnecessary noise during a patch review could be saved if there existed an automatic tool that checked for such problems. In turn, reviewers would be able to do more substantive reviews if they were not distracted by simple coding convention violations. A unification of tools to address patch submission, review and linting would make it possible to offer a single script that developers would use to start the review. This script would mail the patch, register for review, check coding conventions, etc.
Current testing guidelines are too lax with respect to testsuite failures. Developers are trained to ignore certain failures and they only watch out for introducing no new failures. This practice is counterproductive and wasteful:
- It forces developers to have two builds of the compiler: the clean build and the patched build. The patched build is declared to pass if no new failures were introduced compared to the clean one.
Instead, the execution of make check should unequivocally determine whether the patch introduced new regressions: if the exit code from make check is 0, the testsuite succeeded. That is all a developer should need to be concerned about.
It is certainly the case that the testsuite is not always clean for all targets, but the testsuites for primary platforms should always remain clean. Primary platforms are release-dependent; the current list comes from GCC 4.7's release criteria:
Related to this issue is the problem of committed patches that introduce failures on primary platforms other than the one tested by the developer. When such a regression is reported, the developer should fix their patch to address the new failures; in the meantime, the new failures are marked XFAIL until a fix is available.
Additionally, patches that break the testsuite on primary platforms are subject to the 48 hour reversal rule.
See the GCC Development Conventions page.
See the GCC Improvement Projects page.