Launchpad FY2011 Q1 & Q2 Critical Bugs Analysis

Introduction

This analysis was prompted by the fact that the long-term trend of our Critical bugs pool is flat (y=252): we roughly get as many new criticals as we fix. This suggests that we have serious quality issues and that we need to change things in order to quench the source of new criticals.

The Data

For Canonical folk, the original spreadsheet containing the data used in this analysis is available as the Launchpad FY2011 Critical Bugs spreadsheet.

For others, you can follow along on the published version.

You can switch the output parameter to output=xls, output=ods, or output=csv to retrieve the data in a machine-readable format.
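For example, here is a rough way to pull the CSV into a script. The URL below is a placeholder (substitute the published spreadsheet's key); the pattern is simply the published URL with output=csv.

    import urllib2

    # Placeholder URL: substitute the key of the published spreadsheet.
    URL = ('https://spreadsheets.google.com/pub'
           '?key=YOUR_KEY_HERE&output=csv')
    with open('critical-bugs.csv', 'w') as f:
        f.write(urllib2.urlopen(URL).read())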

Methodology

I randomly selected 50 bugs from the 389 that were filed since April 1st. (I selected April 1st because I thought that we were done with legacy issues by then and that this would expose quality issues in _new_ development.)

(To take the random sample, I used =RANDOM(), sorted the bugs on that column, and selected the first 50, after removing Invalid bugs and the one from the results-tracker.)
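The same sampling can be done outside the spreadsheet; here is a rough Python equivalent (the column names are made up for the sketch, not the actual spreadsheet headers):

    import csv
    import random

    with open('critical-bugs.csv') as f:
        rows = list(csv.DictReader(f))

    # Drop the excluded rows, then shuffle and keep 50 -- which is what the
    # =RANDOM() column plus a sort achieves in the spreadsheet.
    rows = [r for r in rows
            if r['status'] != 'Invalid' and r['project'] != 'results-tracker']
    random.shuffle(rows)
    sample = rows[:50]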

I then looked into each bug, reading the related merge proposals to find where the bug was introduced.

Several columns were filled during that process:

Three fields were filled automatically based on the closing date and the assignee:


Results

Two pivot tables were created to summarize the data.

Introduced vs Fixed

(Each row lists its non-empty cells, as a count and a percentage of the 50-bug sample, in fixed-by column order: unfixed, community, feature, maintenance, support.)

introduced by    cells (count, % of sample)              row total
unknown          1 (2%), 1 (2%), 2 (4%)                  4 (8%)
community        1 (2%)                                  1 (2%)
feature          6 (12%), 1 (2%)                         7 (14%)
legacy           14 (28%), 1 (2%), 13 (26%), 1 (2%)      29 (58%)
maintenance      2 (4%), 3 (6%), 1 (2%)                  6 (12%)
thunderdome      1 (2%), 2 (4%)                          3 (6%)

Grand Total (by fixer): unfixed 18 (36%), community 1 (2%), feature 9 (18%), maintenance 17 (34%), support 5 (10%); 50 bugs (100%) overall.

This table shows in rows the rotation-type that introduced the bug and in columns the rotation-type that fixed the bug. From that table, we can see that in our sample:


Root cause by source

(Each row lists its non-empty cells, as a count and a percentage of the 50-bug sample, in source column order: unknown, community, feature, legacy, maintenance, thunderdome.)

root cause                               cells (count, % of sample)    row total
critical_by_inheritance                  1 (2%)                        1 (2%)
data_corruption                          1 (2%)                        1 (2%)
deployment_interaction                   1 (2%), 3 (6%), 1 (2%)        5 (10%)
development_production_drift             1 (2%)                        1 (2%)
failed_to_handle_user_generated_error    1 (2%)                        1 (2%)
flaky_tests                              2 (4%)                        2 (4%)
incomplete_refactoring                   1 (2%)                        1 (2%)
insufficient_scaling                     2 (4%), 10 (20%)              12 (24%)
invalid_priority                         2 (4%)                        2 (4%)
leaky_abstraction                        1 (2%)                        1 (2%)
merge_conflict                           1 (2%)                        1 (2%)
missing_integration_test                 2 (4%), 3 (6%), 2 (4%)        7 (14%)
missing_interface_test                   1 (2%), 1 (2%)                2 (4%)
missing_unit_test                        1 (2%), 4 (8%)                5 (10%)
open_transaction                         1 (2%)                        1 (2%)
requirements_misunderstanding            1 (2%)                        1 (2%)
script_db_permission                     2 (4%), 1 (2%)                3 (6%)
unexpected_interaction                   1 (2%)                        1 (2%)
unknown                                  1 (2%)                        1 (2%)
view_model_validation                    1 (2%)                        1 (2%)

Grand Total (by source): unknown 4 (8%), community 1 (2%), feature 7 (14%), legacy 29 (58%), maintenance 6 (12%), thunderdome 3 (6%); 50 bugs (100%) overall.

This table shows in rows the class of bug and in columns the rotation-type that introduced the bug. From it, we can see that in our sample:

Recommendations

Based on these results, I can think of the following recommendations.

Launchpad has a lot of legacy

This is not a recommendation, it's more a summary of the background in which we are operating: really, Launchpad is a legacy application. 58% of the new bugs were actual legacy bugs! That means that we probably have hundreds of bugs matching our critical criteria lurking in various parts of our code base. The inconsistency in our test coverage also means that we are likely to introduce many more bugs than we should in the future. (Of the 14 bugs related to spotty test coverage, 7 were legacy and 7 were part of new work.)

Given this, I think that focusing on performance and on testing gives us our two best strategies for reducing the incoming flow of criticals.

We are not done with performance

The single most common specific cause of bugs was performance-related. So we should renew our efforts to improve our model to make it performant. Performance bugs are not trivial to fix: in many cases, we'll either need to rework the model to make it possible to operate on sets instead of individual objects (the death-by-a-thousand-sql-queries problem) or rework the schema. Fortunately, it's now easier than ever to iterate on schema changes, and we have good patterns for the operate-on-set case.
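To illustrate the operate-on-set pattern, here is a generic DB-API-style sketch (not Launchpad's actual Storm code; the table and column names are only illustrative):

    # Per-object access: one query per row, the death-by-a-thousand-sql-queries shape.
    def assignees_per_object(cur, bugtask_ids):
        result = {}
        for bugtask_id in bugtask_ids:
            cur.execute(
                "SELECT assignee FROM BugTask WHERE id = %s", (bugtask_id,))
            result[bugtask_id] = cur.fetchone()[0]
        return result

    # Set-based access: one query for the whole batch, whatever its size.
    def assignees_for_set(cur, bugtask_ids):
        cur.execute(
            "SELECT id, assignee FROM BugTask WHERE id = ANY(%s)",
            (list(bugtask_ids),))
        return dict(cur.fetchall())

The first shape times out as soon as the context gets large; the second keeps the query count constant.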

So I would recommend that maintenance squads focus on timeout-related issues first and foremost.

This will have compounding effects:

So while working on maintenance rotation, a good rule of thumb could be to ask yourself: in the bugs I fixed recently, was there a performance-related issue? If not, please pick one as your next target.

Test or die

We can do better on our testing story. Missing tests are the other big source of new criticals. In this category, I would suggest things along the following lines:

With the LaunchpadObjectFactory, it's now easier than ever to write good unit tests, so there is no reason we shouldn't be doing it. We also have much better JavaScript testing infrastructure than ever. The parallelisation of the test suite (which will be started by Yellow once Orange completes Custom bug listings) should also help kill the meme that we shouldn't add too many tests because they slow down the test suite.
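As a reminder of how cheap such a test is to write with the factory, here is a minimal sketch (treat the exact import paths and layer name as an approximation, they may differ from what is in the tree):

    from lp.testing import TestCaseWithFactory
    from lp.testing.layers import DatabaseFunctionalLayer


    class TestBugReporter(TestCaseWithFactory):

        layer = DatabaseFunctionalLayer

        def test_reporter_is_recorded(self):
            # No hand-built fixtures: the factory fabricates the objects.
            person = self.factory.makePerson()
            bug = self.factory.makeBug(owner=person)
            self.assertEqual(person, bug.owner)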

Deployment-related issues

We should probably remove the policy about fine-grained DB permissions for scripts. It's not giving us much, and it trips us up very often. Let's use the same permissions for scripts as for the web app (while maintaining a separate user).
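Concretely, something along these lines (the role names are hypothetical; the point is one shared set of grants with a separate login for scripts):

    # Sketch of the intended shape, issued through any DB-API cursor.
    def give_scripts_webapp_grants(cur):
        # Scripts keep their own login role for accountability...
        cur.execute("CREATE ROLE script_runner LOGIN")
        # ...but inherit the web application's permissions wholesale.
        cur.execute("GRANT webapp_role TO script_runner")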

I'm not sure there is anything we can do about the deployment-interactions class of issue, apart from learning to be more defensive when writing code and thinking about how this code will interact with older versions of the code running concurrently. We will need to develop this awareness more and more anyway as we move to a SOA architecture. Fastdowntime also introduces more conditions of this kind. I don't have anything more concrete here than: let's remind each other about this aspect.
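As a sketch of the kind of defensiveness meant here (entirely made-up names, not a real Launchpad change): while old and new revisions run side by side, readers should tolerate data written in the old form.

    def effective_visibility(row):
        # 'information_type' stands in for a hypothetical new column; during a
        # fastdowntime or rolling deploy, appservers on the previous revision
        # may still write rows that only carry the old 'private' flag.
        new_value = row.get('information_type')
        if new_value is not None:
            return new_value
        return 'PROPRIETARY' if row.get('private') else 'PUBLIC'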

Recommendations and observations (TA)

Incoming vs outgoing rates

maintenance+support squads together are paying down 14/29 = 48% of the tech-debt listed as 'legacy', and doing that is taking 14/22 = 63% of their combined output. To stay on top of the legacy critical bug source then, we need a 100% increase in the legacy fix rate, and that isn't available from the existing maintenance squads no matter whether we ask them to drop other sources of criticals or not. If we did not have maintenance-added criticals (6 items) and that translated 1:1 into legacy fixes, we'd still be short 9 legacy bugfixes (29 incoming less 14 + 6 fixes) to keep the legacy component flat.
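Restating that arithmetic explicitly, with the counts from the sample:

    legacy_introduced = 29   # legacy criticals in the 50-bug sample
    legacy_fixed = 14        # of those, fixed by the maintenance + support squads
    squads_output = 22       # everything those squads fixed in the sample
    maintenance_added = 6    # criticals the maintenance squads themselves added

    print legacy_fixed / float(legacy_introduced)                 # ~0.48
    print legacy_fixed / float(squads_output)                     # ~0.63
    print legacy_introduced - (legacy_fixed + maintenance_added)  # shortfall: 9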

So this says to me that we are really mining things we didn't do well enough in the past, and it takes long enough to fix each one that, until we hit the bottom of the mine, it's going to be a standing feature for us.

Here I recommend that we keep, or increase in future, resourcing for maintenance work (unless and until we reduce the cost of these fixes).

Testing

I agree with the recommendations to spend some more effort on the safety nets of testing; the decreased use of doctests and increased use of unit tests should aid with maintenance overhead, and avoiding known problems is generally a positive thing. The SOA initiative will also help us decouple things as we go, which should help with maintainability and reaction times.

Plugging the hole

What troubles me a bit is the unknown size of the legacy mine, and that from the analysis we added 25% of the legacy volume in criticals from feature work. The great news is that all the ones you examined were fixed. I'd like us to make sure, though, that we don't end up adding performance debt, which can be particularly hard to fix.

The numbers don't really say we're safe from this - 26% of criticals coming from changes (feature + maintenance, discounting the thunderdome) - is a large amount, and features in particular are not just tweaking things, they are making large changes, which adds up to a lot of risk. There are two aspects to the feature rotation that have been worrying me for a while; one is performance testing of new work (browser performance, data scaling - the works), the other is that we rotate off right after users get the feature. I think we should allow 10% of the feature time, or something like that, so that after-release-adoption issues can be fixed from the resources allocated to the feature. One way to do this would be to say that:

 - After release, feature squads spend 1-2 weeks doing polish and/or general bugs (in the area, or even just criticals->high etc). At the end of that time, they stay on the feature, doing this same stuff, until all the critical bugs introduced/uncovered by the feature work are fixed.

 - The maintenance squad about to rotate onto feature work does not rotate either, until the feature squad has fixed all those criticals.