Final Report

RTEMS Fault Tolerance: Get a Fault Injection Tool to Work with RTEMS

ESA Summer of Code in Space Program 2015 Project Final Report

Saeed Ehteshamifar

salpha.2004@gmail.com

Technical University of Darmstadt

Darmstadt, Germany

September 2015

This document reorganizes the blog posts at http://rtems-fi.blogspot.de/ into a more structured report. Compared to the blog, it may also cover some aspects in more detail.

Table of Contents

1. Introduction

1.1. Basics of fault injection and terminology

1.1.1 CRASH scale

2. Slingshot

2.1. Adaption of Slingshot to RTEMS

2.1.1. Challenge one: Different process model

2.1.2. Challenge two: No dynamic linking

2.2. Configuring RTEMS for Slingshot use

3. GRINDER

3.1. Extending GRINDER to new targets

4. Future works

4.1. Distribute Slingshot-RTEMS as a stand-alone repository

4.2. Integrate GRINDER-RTEMS’ functions into Slingshot-RTEMS

4.3. Use dynamic linking for tests generation and execution

4.4. Continue tests execution in case of failure

4.5. Statistics gathering and analysis

4.6. Cover more POSIX functions

4.7. Extend to Classic API

4.8. Use RTEMS Source Builder to facilitate invoking the injection procedure

4.9. Extend set of supported fault models, e.g., by resource exhaustion models or hardware single event upsets

5. Conclusion

1. Introduction

RTEMS^[1] was chosen as a mentoring organization for the European Space Agency’s Summer of Code in Space (SOCIS) 2015. Among its many open projects^[2] is an RTEMS fault-tolerance project^[3], which is relevant to space applications because fault tolerance is a means of achieving dependability.

The one-sentence goal of this open project is to “get a fault injection tool to work with RTEMS and create tutorials and examples.”^[4] During this SOCIS^[5], I have proudly worked on this project to achieve this goal by introducing Slingshot^[6] and GRINDER^[7], two fault injection tools.

1.1. Basics of fault injection and terminology

The ultimate goal of fault injection is to measure how tolerant a system, usually called the system under test (SUT), is against faults. In other words, it measures to what extent the system can tolerate triggered faults and continue failure-free operation.

Consider a system, defined as any machine running RTEMS, intended to provide a service. As long as the system performs its intended function (service), it is working correctly. A failure is an event in which the delivered service deviates from the intended service. This deviation can take different forms, called service failure modes, which are ranked according to failure severity. In this project, the CRASH scale is used to model failure modes; it is explained in detail in Section 1.1.1.

Two steps precede a failure: a fault and an error. A fault is a defect in the system that might lead to a failure. For instance, consider this piece of code:

#include <iostream>

unsigned int sum (int a, int b) {
    return (a + b);
}

int main () {
    std::cout << sum (-1, 4) << std::endl;
    return 0;
}

The function “sum” is supposed to add both positive and negative numbers. But since it returns the result of the addition as an unsigned integer, passing negative numbers to this function triggers the fault. An error is a triggered fault. In the code above, the result of the addition is positive, so the error is masked. Therefore calling sum (-1, 4) does not lead to a failure and outputs the correct result:

3

But when we call sum (-1, 0), the fault in the implementation of “sum” makes the result deviate from what the function is supposed to return, and therefore a failure occurs. This time the output (on a platform with a 32-bit unsigned int) is:

4294967295

which is definitely wrong.

We call this type of fault injection, which aims to trigger faults in the system by passing special values as function arguments, data-type based fault injection. There are alternatives, such as code mutation, which flips bits in an executable binary file. In this project we took the data-type based approach.
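As a rough illustration of the data-type based approach, the sketch below feeds a small pool of special int values to the “sum” function from the example above and compares each result against a reference computed in a wider type. The value pool, the Failure struct, and the inject_boundary_values name are assumptions made for this illustration; they are not taken from Slingshot or Ballista. Extreme values such as INT_MAX are deliberately left out of the pool because a + b would then overflow, which is undefined behaviour in C++.

```cpp
#include <vector>

// Function under test, copied from the example above.
// The fault is intentional: the return type is unsigned.
unsigned int sum(int a, int b) {
    return (a + b);
}

// One detected deviation between the expected and the actual result.
struct Failure {
    int a;
    int b;
    long long expected;   // reference result computed in a wider type
    unsigned int actual;  // what the function under test returned
};

// Calls sum() with every pair from a small pool of special values
// (hypothetical pool, chosen for illustration) and records each
// call whose result deviates from the reference.
std::vector<Failure> inject_boundary_values() {
    const std::vector<int> pool = {-100, -1, 0, 1, 100};
    std::vector<Failure> failures;
    for (int a : pool) {
        for (int b : pool) {
            const long long expected = static_cast<long long>(a) + b;
            const unsigned int actual = sum(a, b);
            if (static_cast<long long>(actual) != expected) {
                failures.push_back({a, b, expected, actual});
            }
        }
    }
    return failures;
}
```

With this pool, the recorded failures are exactly the argument pairs whose sum is negative; pairs such as (-1, 4) pass because the error is masked.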

In brief, Figure 1, taken from Basic Concepts and Taxonomy of Dependable and Secure Computing^[8], puts these concepts together.

1.1.1 CRASH scale

As mentioned in the previous section, failures of a SUT can be modeled using a scale called CRASH (taken from the Ballista project^[9]), which is an acronym for the following:

C: Catastrophic
R: Restart
A: Abort
S: Silent
H: Hindering

Consider a task (process), called the Test Case Executor, which consists of several subtasks, each called a Test Case.

If the execution of any Test Case leads to a system crash, we put the failure in the Catastrophic class. In other words, a Catastrophic failure crashes or hangs not only the test application (i.e. the Test Case Executor plus all Test Cases), but also the operating system, and thus the whole SUT.

The Restart class denotes failures that time out: a Test Case neither finishes execution within a certain amount of time nor halts. This type of failure can be detected via a watchdog timer deployed by the Test Case Executor.

When a Test Case terminates abnormally, we put it under the Abort class. Note that the termination of a Test Case in the Abort class does not lead to a system crash, which distinguishes it from the Catastrophic class.

The first three classes (Catastrophic, Restart, Abort) indicate a “robustness failure” in the SUT. The remaining two (Silent, Hindering) indicate a problem in the SUT specification but are not counted as a “robustness failure”.

The Silent class is for Test Cases in which an error occurred during execution but passed silently, without an error code being generated.

The Hindering class is like the Silent class, except that an error code is generated, but it is the wrong one.
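The five classes can be summarized as a small decision procedure. The sketch below is only an illustration of the definitions above: the Observation fields, the NoFailure value, and the order in which the checks are applied (most severe first) are assumptions of this sketch, not part of Ballista or Slingshot.

```cpp
// CRASH classes as described above, plus a value for "no failure observed".
enum class CrashClass { Catastrophic, Restart, Abort, Silent, Hindering, NoFailure };

// What the Test Case Executor observed for one Test Case.
// All field names are hypothetical.
struct Observation {
    bool system_crashed;  // the whole SUT, operating system included, crashed or hung
    bool timed_out;       // the watchdog timer fired
    bool abnormal_exit;   // the Test Case terminated abnormally
    int expected_error;   // error code the call should report (0 = none expected)
    int reported_error;   // error code the call actually reported (0 = none)
};

// Maps an observation to a CRASH class, checking the most severe class first.
CrashClass classify(const Observation& o) {
    if (o.system_crashed) return CrashClass::Catastrophic;
    if (o.timed_out) return CrashClass::Restart;
    if (o.abnormal_exit) return CrashClass::Abort;
    if (o.expected_error != 0 && o.reported_error == 0) {
        return CrashClass::Silent;     // an error occurred but none was reported
    }
    if (o.expected_error != 0 && o.reported_error != o.expected_error) {
        return CrashClass::Hindering;  // an error was reported, but the wrong one
    }
    // Includes the case of an error reported although none was expected,
    // which the scale as described above does not cover.
    return CrashClass::NoFailure;
}
```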

We will refer to the CRASH scale again in the Slingshot chapter.

2. Slingshot

Slingshot is a lightweight fault generation framework based on CMU's Ballista. It is essentially a set of Python scripts that follow the same approach as Ballista to generate data type faults. However, Slingshot trades Ballista's portability across many POSIX implementations for lower code complexity. Slingshot-RTEMS^[10] is the version of Slingshot adapted to RTEMS, which was developed during this SOCIS. In the rest of this document, we refer to Slingshot-RTEMS simply as Slingshot.

Slingshot runs on a Linux host system to produce test suites for an RTEMS target, currently i386. The target can be changed to any new platform as long as the cross-compiler and QEMU support it. Test suites are C++ applications.

An overview of Slingshot is given in Figure 2.

As illustrated in Figure 2, Slingshot runs in two phases:

Database initialization phase: In this phase, test case stubs are generated from the test case list, and stored in database. ………...database tables

2.1. Adaption of Slingshot to RTEMS

2.1.1. Challenge one: Different process model

2.1.2. Challenge two: No dynamic linking

2.2. Configuring RTEMS for Slingshot use

3. GRINDER

3.1. Extending GRINDER to new targets

4. Future works

This chapter lists future work for this project. Generally, future work divides into two types of tasks:

Tasks required to finish the automatic procedure of fault generation, execution, and statistics gathering (basic tasks).
Tasks to further develop the tool to cover more functions of POSIX API, to cover RTEMS Classic API, etc. (features).

As the implementation of features depends on the basic tasks, the future work in this chapter is prioritized with the basic tasks first, followed by the feature tasks.

4.1. Distribute Slingshot-RTEMS as a stand-alone repository

Slingshot-RTEMS^[11] is an adaption of Slingshot^[12] to RTEMS. This means that test case generation and the skeleton of Slingshot-RTEMS are based on Slingshot, but test case execution has been tailored to the RTEMS model (Slingshot was designed for Linux Standard Base).

Currently Slingshot-RTEMS is released with Slingshot as a Git submodule. That is, Slingshot-RTEMS is released as a patch to be applied to Slingshot, which is referenced via a Git submodule. This method has two drawbacks:

Increased overhead and complexity in build instructions.
Possible incompatibilities: the two projects are developed separately, so changes to Slingshot are not necessarily compatible with Slingshot-RTEMS.

For these reasons, it is better to distribute Slingshot-RTEMS as a stand-alone repository. This is possible because Slingshot is released under GPL v2 (or higher). This task mostly involves legal considerations; the technical effort is small.

4.2. Integrate GRINDER-RTEMS’ functions into Slingshot-RTEMS

Currently Slingshot-RTEMS generates test campaigns, and GRINDER-RTEMS is responsible for executing them and writing the outputs back to the database. There are several reasons to integrate GRINDER-RTEMS’ functionality into Slingshot-RTEMS:

GRINDER-RTEMS is a Java Maven project and Slingshot-RTEMS is a Python project. Having two tools requires additional maintenance overhead.
GRINDER-RTEMS is an extension of GRINDER, which is developed at DEEDS^[13]. This, again, might cause incompatibility issues as DEEDS develops GRINDER further.
It is easier to automate test case generation and execution in a single tool than in separate tools.

GRINDER-RTEMS’ functionality is as simple as invoking QEMU to run the test program, capturing QEMU’s console output, parsing the output, and storing the results in the database. Therefore it is relatively cheap to implement these functions in Slingshot-RTEMS to obtain an integrated tool.
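The parsing step of this pipeline could look roughly like the sketch below. The console line format "TC <name> <outcome>" is purely hypothetical and chosen for illustration; the format actually emitted by the generated test programs is not described in this report. Invoking QEMU and writing to the database are left out, and a std::map stands in for the results table.

```cpp
#include <map>
#include <sstream>
#include <string>

// Extracts per-test-case outcomes from captured console output.
// Assumed (hypothetical) line format: "TC <name> <outcome>".
// All other lines, e.g. boot messages, are ignored.
std::map<std::string, std::string> parse_console_output(const std::string& console) {
    std::map<std::string, std::string> results;
    std::istringstream lines(console);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string tag, name, outcome;
        // A result line needs all three fields and the "TC" tag.
        if (fields >> tag >> name >> outcome && tag == "TC") {
            results[name] = outcome;
        }
    }
    return results;
}
```

For example, a captured console containing the lines "TC WAIT_1 PASS" and "TC WAIT_4 ABORT" would yield two entries, one per test case.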

4.3. Use dynamic linking for tests generation and execution

Since a test campaign may encompass a huge list of test cases, putting all test cases in a single C++ program may produce a GCC overflow error, simply because the source file might grow to a hundred megabytes. Therefore Slingshot uses dynamic linking to run test cases one by one and avoid compiling one big source file. Slingshot-RTEMS, however, uses static linking because at the time of development the author was unaware of RTEMS libdl, which provides dynamic linking support.

Although static linking works fine for relatively small test campaigns (perhaps up to thousands of test cases per campaign), it would be preferable to use RTEMS dynamic linking for test generation and execution.

NOTE: In the author’s view, this task is not critical because, as an alternative, the size of test campaigns can be reduced to thousands of test cases rather than hundreds of thousands.
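The alternative mentioned in the note, keeping each campaign small, amounts to splitting one long test case list into several campaigns of bounded size. A minimal sketch follows; the function name and the idea of a fixed upper bound per campaign are assumptions of this illustration, not existing Slingshot-RTEMS features.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Splits a long list of test case names into campaigns that each hold
// at most max_per_campaign entries, preserving the original order.
std::vector<std::vector<std::string>> split_campaign(
        const std::vector<std::string>& cases, std::size_t max_per_campaign) {
    std::vector<std::vector<std::string>> campaigns;
    if (max_per_campaign == 0) {
        return campaigns;  // nothing sensible to do with an empty bound
    }
    for (std::size_t begin = 0; begin < cases.size(); begin += max_per_campaign) {
        const std::size_t end = std::min(begin + max_per_campaign, cases.size());
        campaigns.emplace_back(
            cases.begin() + static_cast<std::ptrdiff_t>(begin),
            cases.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return campaigns;
}
```

Each resulting campaign would then be compiled and statically linked into its own test program.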

4.4. Continue tests execution in case of failure

Since invoking QEMU involves much more overhead than executing a single test case, test campaigns (test programs) encompass many test cases to make the QEMU invocation time negligible in comparison to the test execution time. This, however, raises two considerations:

The state of the system under test changes after the execution of each test case, so the order of test cases in a test campaign matters. For instance, the WAIT_1 test case might pass if executed first in the test campaign, but fail if executed after the WAIT_4 test case.
If a test program fails due to the failure of a test case in the middle, the test execution process should be resumed automatically to enable nightly test execution.

For the first point, each test campaign should be executed several times, with the test cases placed in a different random order each time. The results should then be analyzed to see whether different orderings produced different outputs.
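A minimal sketch of this idea is given below. It assumes that each run of a campaign yields a map from test case name to outcome; the function names and the string outcomes are hypothetical. A seeded generator is used so that an ordering that exposes a problem can be reproduced later.

```cpp
#include <algorithm>
#include <map>
#include <random>
#include <set>
#include <string>
#include <vector>

// Returns the campaign in a pseudo-random order determined by the seed.
std::vector<std::string> shuffled_campaign(std::vector<std::string> cases, unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(cases.begin(), cases.end(), rng);
    return cases;
}

// Given the outcomes of several runs (test case name -> outcome), returns
// the test cases whose outcome was not the same in every run.
std::set<std::string> order_dependent_cases(
        const std::vector<std::map<std::string, std::string>>& runs) {
    std::map<std::string, std::string> first_seen;
    std::set<std::string> differing;
    for (const auto& run : runs) {
        for (const auto& entry : run) {
            auto it = first_seen.find(entry.first);
            if (it == first_seen.end()) {
                first_seen.emplace(entry.first, entry.second);
            } else if (it->second != entry.second) {
                differing.insert(entry.first);
            }
        }
    }
    return differing;
}
```

A test case reported by order_dependent_cases is a candidate for the kind of state dependence described above, such as WAIT_1 behaving differently after WAIT_4.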

The second point is the actual subject of this sub-section. Test program execution should resume automatically in case of a test case failure. The failure could be detected via a timer on the host (which executes the tests under QEMU). Then QEMU should be restarted and a list containing the rest of the test cases should be passed to RTEMS to continue execution. The end of each test program is marked by a specific string, so this process should continue until that string appears.
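The host-side bookkeeping for such a resume step might be sketched as follows. Both the "START <name>" line printed before each test case and the END_OF_CAMPAIGN marker are hypothetical placeholders; the actual end-of-program string is not given in this report. The function only computes which test cases are still to be run; restarting QEMU and passing the list to RTEMS are outside the scope of the sketch.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical end-of-program marker; the real string is not specified here.
const char kEndMarker[] = "END_OF_CAMPAIGN";

// Given the ordered campaign and the console output captured before the
// host timer fired, returns the test cases that still have to be executed.
// Assumes (hypothetically) that the test program prints "START <name>"
// before running each test case.
std::vector<std::string> remaining_cases(const std::vector<std::string>& campaign,
                                         const std::string& console) {
    if (console.find(kEndMarker) != std::string::npos) {
        return {};  // the program reached its end; nothing left to run
    }
    std::string last_started;
    std::istringstream lines(console);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string tag, name;
        if (fields >> tag >> name && tag == "START") {
            last_started = name;
        }
    }
    auto it = std::find(campaign.begin(), campaign.end(), last_started);
    if (it == campaign.end()) {
        // No known test case was started. A real driver would need a retry
        // limit here to avoid looping forever if QEMU never boots.
        return campaign;
    }
    // The last started test case is the one that failed; skip it.
    return std::vector<std::string>(it + 1, campaign.end());
}
```

A driver loop would call this after every QEMU run and stop once the returned list is empty.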

4.5. Statistics gathering and analysis

Initially, R^[14] was proposed^[15] as a free tool for statistics gathering and results analysis. During SOCIS 2015, no work was done on this due to unforeseen issues in adapting Slingshot to RTEMS and a shortage of time. Therefore the tool could be changed to any other at the contributor’s discretion.

Statistics gathering is the last step needed to make the whole process useful and meaningful. In other words, it is the last basic task. The rest of this chapter discusses additional features.

4.6. Cover more POSIX functions

Many POSIX^[16] functions, such as fork and exec*, are unimplementable on RTEMS due to its different process model. Apart from these, many POSIX functions were omitted to limit the scope of this SOCIS project.

One feature is to test more of the POSIX functions implemented on RTEMS, eventually covering the whole API (except unimplementable functions). Slingshot-RTEMS provides utility scripts to generate test case lists based on the functions defined in Function Signatures (refer to the Slingshot chapter for more details).

4.7. Extend to Classic API

Even if all POSIX functions are tested, many RTEMS users use the RTEMS Classic API for application development. Therefore it is crucial to extend the tests to the Classic API as well. However, this might be a little tricky since Slingshot’s code generation module is based on Ballista, which targets the POSIX API. This means that Function Signatures, the definitions of compound data types in the type hierarchy, and possibly some other basic elements are specific to POSIX and must be tailored to the Classic API to support it.

4.8. Use RTEMS Source Builder to facilitate invoking the injection procedure

This feature was proposed as Further Improvements in the proposal^[17].

Slingshot-RTEMS is an external tool which aims to be easily deployable on many Linux host systems.

However, the author believes that adopting this as a feature should be reconsidered.

4.9. Extend set of supported fault models, e.g., by resource exhaustion models or hardware single event upsets

5. Conclusion

[1] "RTEMS Real Time Operating System (RTOS)." 2007. 1 Sep. 2015 <https://www.rtems.org/>

[2] "Developer/OpenProjects – RTEMS Project." 2014. 1 Sep. 2015 <https://devel.rtems.org/wiki/Developer/OpenProjects>

[3] "Developer/Projects/Open/Fault_injection – RTEMS Project." 2015. 1 Sep. 2015 <https://devel.rtems.org/wiki/Developer/Projects/Open/Fault_injection>

[4] "Developer/OpenProjects – RTEMS Project." 2014. 20 Oct. 2015 <https://devel.rtems.org/wiki/Developer/OpenProjects>

[5] "SOCIS/2015 – RTEMS Project." 2015. 1 Sep. 2015 <https://devel.rtems.org/wiki/SOCIS/2015>

[6] "DEEDS-TUD/Slingshot · GitHub." 2015. 1 Sep. 2015 <https://github.com/DEEDS-TUD/Slingshot>

[7] "DEEDS-TUD/GRINDER · GitHub." 2015. 1 Sep. 2015 <https://github.com/DEEDS-TUD/GRINDER>

[8] Avižienis, A. 2004. Basic concepts and taxonomy of dependable and secure ... http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1335465.

[9] 2014. Ballista: COTS Software Robustness Hardening. http://users.ece.cmu.edu/~koopman/ballista/.

[10] "salpha2004/slingshot-rtems · GitHub." 2015. 6 Sep. 2015 <https://github.com/salpha2004/slingshot-rtems>

[11] "salpha2004/slingshot-rtems · GitHub." 2015. 29 Sep. 2015 <https://github.com/salpha2004/slingshot-rtems>

[12] "DEEDS-TUD/Slingshot · GitHub." 2015. 29 Sep. 2015 <https://github.com/DEEDS-TUD/Slingshot>

[13] "DEEDS." 2014. 29 Sep. 2015 <https://www.deeds.informatik.tu-darmstadt.de/>

[14] "R: The R Project for Statistical Computing." 2015. 29 Sep. 2015 <https://www.r-project.org/>

[15] "RTEMS Fault Tolerance Project Proposal." 30 Apr. 2015 <https://docs.google.com/document/d/1rVEhyHXajAVMcqgrOR-0t84596QUkYOnifoaYpa8QYM/pub?usp=sharing>

[16] "Posix - The Open Group." 2011. 29 Sep. 2015 <http://pubs.opengroup.org/onlinepubs/9699919799/>

[17] "RTEMS Fault Tolerance Project Proposal." 30 Apr. 2015 <https://docs.google.com/document/d/1rVEhyHXajAVMcqgrOR-0t84596QUkYOnifoaYpa8QYM/pub?usp=sharing>