1 of 14

Vallari Agrawal <val.agl002@gmail.com>

Outreachy Summer Intern, 2022

Making Teuthology a Better Detective

2 of 14

Outreachy and GSoC


3 of 14

Outreachy Project


4 of 14

Agenda

  1. An overview of teuthology workflow
  2. Problem
  3. Solution
  4. Benefits of the solution
  5. Implementation
  6. How to adopt this feature for other tests
  7. Future improvements


5 of 14

An overview of teuthology workflow

Teuthology is an automation framework for Ceph.

Components:

  1. Teuthology scheduler
  2. Teuthology dispatcher
  3. Beanstalk Queue
  4. Shaman
  5. Paddles
  6. Pulpito
  7. Jenkins

A job consists of multiple tasks.

A collection of jobs makes up a test run.


6 of 14

Problem

Current behaviour: when a unit-test task fails, teuthology raises a generic error such as CommandFailure, with no information about which test failed.

Due to this:

  • Test run reviewers must search the teuthology log file to find which test failed.
  • The data captured is of little use in Sentry, since it does not point to the actual error.


7 of 14

Error Message - Before


8 of 14

Solution

Create an opt-in feature to scan teuthology logs for unit-test errors.

  1. In the job’s YAML file, add scan_logs: True
  2. Teuthology will:
    1. Scan the teuthology log of that unit-test job (with ErrorScanner)
    2. Look for errors according to the kind of unit test (e.g. nose for Python)
    3. Raise UnitTestError with information about which specific test is failing
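
The opt-in flow above can be sketched roughly as follows. This is an illustration only, not teuthology's code: the functions first_unit_test_failure and report_failure are hypothetical, the job config is assumed to be a plain dict with scan_logs at the top level, and RuntimeError stands in for the generic command failure.

```python
class UnitTestError(Exception):
    """Raised when a scanned log names a specific failing unit test."""


def first_unit_test_failure(log_lines, markers=("ERROR: ", "FAIL: ")):
    """Return the first line that contains a failure marker, or None."""
    for line in log_lines:
        if any(marker in line for marker in markers):
            return line.strip()
    return None


def report_failure(job_config, log_lines):
    """Raise the most specific error available for a failed unit-test task."""
    # The scan only happens when the job opted in (assumed top-level key).
    if job_config.get("scan_logs"):
        failure = first_unit_test_failure(log_lines)
        if failure is not None:
            raise UnitTestError(failure)
    # Opt-in disabled, or nothing matched: fall back to a generic error.
    raise RuntimeError("command failed")
```

With scan_logs enabled, a log containing "FAIL: test_bucket_create" would surface that line in the exception; without it, the reviewer still sees only the generic failure.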


9 of 14

Error Message - After


10 of 14

Benefits of the Solution

Enabling this feature will:

  • Improve productivity of the engineer:

Before, it took about 15 seconds to find each failing test in the logs.

Now, for a run with 20 failures, this could save about 5 minutes of the engineer’s time.

  • Improve Sentry data for error tracking:

Instead of a generic CommandFailure error, Sentry now stores information about the failing unit test, which is more meaningful and accurate.

  • No significant difference in job’s runtime:

ErrorScanner reads a 0.5GB log file in 1.5 seconds.
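
The figures quoted above can be checked with a few lines of arithmetic. The inputs are the numbers from this slide; the conversion 1 GB = 1000 MB is an assumption made for the throughput estimate.

```python
SECONDS_PER_FAILURE = 15  # manual log search per failing test (from the slide)
FAILURES_IN_RUN = 20      # example run size (from the slide)

# Time an engineer no longer spends searching the logs by hand.
saved_minutes = SECONDS_PER_FAILURE * FAILURES_IN_RUN / 60

LOG_SIZE_MB = 500         # 0.5 GB, assuming 1 GB = 1000 MB
SCAN_SECONDS = 1.5        # measured scan time (from the slide)

# Implied scan speed of ErrorScanner.
throughput_mb_per_s = LOG_SIZE_MB / SCAN_SECONDS

print(f"saved: {saved_minutes:.0f} min, scan speed: {throughput_mb_per_s:.0f} MB/s")
```

So 20 failures at 15 seconds each come to 5 minutes, and the scan runs at roughly 333 MB/s, which is why it adds so little to a job's runtime.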


11 of 14

Implementation

  • Read teuthology log files and search for the regex that matches a test failure’s error message.

Example: nose test failure messages start with “ERROR:” or “FAIL:”

  • Keep a flag which points to the last byte read by the ErrorScanner.

ErrorScanner does not re-read anything before this flag index.

  • When a test failure is found, raise UnitTestError with the error message from the logs.
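
A minimal sketch of these three steps, assuming the log is a local file that keeps growing. The class and exception names follow the slides, but the constructor arguments, the offset attribute, and the methods scan and scan_and_raise are hypothetical and may differ from the real teuthology code.

```python
import re


class UnitTestError(Exception):
    """Raised when the scanner finds a unit-test failure in the log."""


class ErrorScanner:
    # Failure patterns per test framework; the nose entries come from the slides.
    ERROR_PATTERN = {
        "nose": [re.compile(rb"ERROR:\s"), re.compile(rb"FAIL:\s")],
    }

    def __init__(self, log_path, test_types):
        self.log_path = log_path
        self.patterns = [p for t in test_types for p in self.ERROR_PATTERN[t]]
        self.offset = 0  # byte position just past the last line already scanned

    def scan(self):
        """Scan only the bytes after self.offset; return the first match or None."""
        with open(self.log_path, "rb") as log:
            log.seek(self.offset)  # skip everything read by earlier calls
            for line in iter(log.readline, b""):
                self.offset = log.tell()
                for pattern in self.patterns:
                    match = pattern.search(line)
                    if match:
                        # Keep the marker and the rest of that line.
                        found = line[match.start():]
                        return found.decode(errors="replace").strip()
        return None

    def scan_and_raise(self):
        """Raise UnitTestError carrying the failure message, if one is found."""
        failure = self.scan()
        if failure:
            raise UnitTestError(failure)
```

Storing the offset on the scanner means repeated calls on a growing log cost only as much as the newly appended bytes.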


12 of 14

How to adopt this feature for other tests?

  1. Add the test’s error message regex to class ErrorScanner in teuthology.

    import re

    ERROR_PATTERN = {
        "nose": [
            re.compile(r"ERROR:\s"),
            re.compile(r"FAIL:\s"),
        ],
    }

  2. Ensure the run() calls in ceph/ceph-ci include the test error type as an argument.

    scan_tests_errors = ["nose"]

    remote.run(args, scan_tests_errors=scan_tests_errors)
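
As an illustration of the two steps above, the sketch below registers a second framework and selects patterns by the names passed in scan_tests_errors. The "gtest" regex is an assumption based on gtest's "[  FAILED  ] Suite.Test" output lines, not a pattern taken from teuthology, and find_failures is a hypothetical helper.

```python
import re

ERROR_PATTERN = {
    "nose": [re.compile(r"ERROR:\s"), re.compile(r"FAIL:\s")],
    # Assumed entry: gtest reports failures as "[  FAILED  ] Suite.Test".
    "gtest": [re.compile(r"\[\s+FAILED\s+\]")],
}


def find_failures(lines, scan_tests_errors):
    """Return the lines that match a pattern of any requested test type."""
    patterns = [p for kind in scan_tests_errors for p in ERROR_PATTERN[kind]]
    return [line for line in lines if any(p.search(line) for p in patterns)]


log = [
    "[ RUN      ] Suite.test_ok",
    "[  FAILED  ] Suite.test_bad",
    "FAIL: test_python_case",
]
gtest_only = find_failures(log, ["gtest"])
both = find_failures(log, ["nose", "gtest"])
```

Keying the patterns by framework name lets each task request only the scans that apply to it, so a gtest job is not matched against nose markers.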


13 of 14

Future Improvements

  • Extend coverage to more kinds of unit tests.

Currently implemented only for nose (a Python unit-test library) and gtest (a C++ unit-test library).


14 of 14

References
