Refactoring Large Functions

Robert Muth (robert@muth.org)


This post was inspired by C++ but should apply to other procedural languages, e.g Java.

Note, that C++ uses the term function where other languages may use more precise terms like procedure, routine, method.

Because this document is backed by an embedded google doc the links may be messed up. Use [open link in new window] if you have problems.


A (Not So) Hypothetical Scenario

So you have inherited a > 1000 line function and are supposed to add a new feature.

Your company’s style guide mandates that a function should be no longer than one

 “screen height” but clearly this has not been enforced very well - even with the smallest font you struggle to get more than 100 lines to display at once.

Several if-blocks in the function body span more than 100 lines and make it hard to match the end of the block to its beginning.

The function has almost two dozen variables, many with life-ranges that pretty much span the entire function. Closures which were added to your programming language of choice a couple of years ago have not lost their novelty appeal and are used liberally.

Furthermore, indentation levels are out of control causing most of the function body to be perched on the right side of the screen which is aggravating the problem:

Unlike the limits on function length, line length restrictions are automatically enforced by a code reformatter. So combined with high indentation levels this forces line breaks all over the place decreasing readability and making the function even longer.

Clearly your best hope is to refactor the function by decomposing it into smaller ones.

This post offers some strategies for how to go about this.

What Functions Are And Are Not About

Before we start, I would like to address one common misconception: when we break up a function, many of the new functions will have exactly one caller. This rubs some people the wrong way, who view functions primarily as a means of code sharing and re-use.

But code sharing is only one of several benefits of functions and arguably not the main one.

For me the main benefits center around improving readability/understandability:

A good function name lets us raise the level of abstraction away from implementation details. This is why even single line function can be a good idea.

Once we have created a function we can move code out of line. This affords a person reading the code a choice: to either keep reading at the call site or drill down into the implementation of the called function. If the called function is well named the reader will likely just keep reading. If she chooses to drill down it should only take a click in her favorite IDE to see the implementation or a mouse over to pop up the function comment.

  1. All the information used by a function has to be passed in. This makes it very clear what the computation represented by the function does and does not depend on.
  2. The values of local variables do not escape the function body unless explicitly returned. So the length of the function is an upper bound for the length of variable life ranges. This allows us to use shorter variable names.

Note, we ignore global variables here, which should be avoided if possible.

Also note that in OO languages, from the perspective of a member function, member variables are not that different from global variables which is why lots of them are problematic and indicate that a class has become too large.

Note, that most of these benefits do not apply to closures.

 

Another concern against breaking up functions is performance, in particular call overhead. In most cases this is unwarranted. These days functions are mostly a zero cost abstraction:

The compiler will likely inline all the functions that have been broken out.

But as always with performance questions: in case of doubt, measure!

Strategies For Breaking Up Functions

Low Hanging Fruit

Closures are easy to break out because they are functions themselves, assuming they explicitly enumerate what they capture. Closures that are longer than a few lines and/or are not anonymous are prime candidates. Closures potentially capturing everything should be rewritten to a more explicit form regardless.

 

Other prime candidates are any code blocks, i.e. stuff enclosed in “{“ and “}”.

This includes code blocks introduced purely for scoping, if-then-else blocks and

loop bodies (see next sub-section).

Loops

Loops are an especially good candidate for being factored out because of an additional benefit: Many programming languages do not support exiting nested loops directly but once the nested loops has been wrapped into a function you can exit from anywhere with a simple return - no need for goto’s or hard to follow booleans.

If you don’t need to jump out of nested loops, you have a choice: do you move the entire loop or just the loop body into its own function. Often the latter is the better choice, because it can enable further transformations, e.g. Map or fold patterns.

Other Clues

Suppose your function has the structure shown below, i.e. white space separated lines of code usually preceded by a comment stating what the next few lines do:

// first do  x

// then do y

Likely, do_x() and do_y() are good candidates for being broken out.

Similarly, when asked to describe the purpose of a function the answer is “this function does a and b and c” maybe a, b, and c should become separate functions.

How To Deal With Long Parameter Lists

When we mechanically break out functions all the free variables in the function body have to be passed in as parameters (reminder: we strongly eschew global variables).

Such parameter lists can get rather large. There is some debate on how many parameters are acceptable ( [2] mentions 7 while [4] advocates for 3 - more discussion  in [5]) but chances are that refactoring will, at least temporarily, exceed any such number.

The first thing to note here is that this does not reflect a problem of the function we are trying to factor out but rather of the underlying code. Any code that uses a lot of variables close together is problematic in itself. But what can be done? A few suggestions:

Ideally such loop fission would also be a zero cost abstraction and the compiler would just fuse the loop back together. Unfortunately, you cannot count on that. Again, if you are worried about performance, measure!

Sequential Coupling

If the initial function consists of mostly straight line code. Refactoring might result in the following code:

Type1 var1 =...

Type2 var2 = ...

Type3 var3 = ...

ParameterObject po = ...

var4 = Phase1(po, var2)     // PhaseX will have a more meaningful name

var5 = Phase2(po, var3, var2, var4)

Phase3(po, var1, var5)

return var1;

When functions have to be called in sequence we call them sequential coupled which some consider to be a borderline anti-pattern.

If the functions called in sequence are primarily communicating via side-effects the critics have a point. But if the flow of information between the calls can be highlighted by argument (and result) passing as in the example above the code has readability benefits over just a lot of straight line code. Also,  Phase1, Phase2, etc. will be more of  an implementation detail and not part of the API which is less problematic.

Extracting Computations

Computations of individual values are excellent candidates for breaking out, e.g.:

type x

if (some condition) {

  …

  x = …

  …

} else {

  …

  x = …  

  …

}

Or

type x = …

for (...) {

    if (....) {

      x = …

      break            

    }

}

Become:

const x = compute_x(...)

Not only have you broken out some code but what used to be variable is now a constant.

The problem is that the situation is not always that clear-cut. Some other computations might be intermingled and need to be moved out first, e.g via loop fission.

Complex If Conditions

Complex if conditions are good candidates for being factored out.

This will not reduce the line count of the originating function much but can improve readability by switching to a higher level of abstraction.

Compare

if (p.x^2 + p.y^2 + p.z^2 < r^2)  {

    ...

}

To

if (IsInsideSphere(r, p)) {

        ...

}

Often it is possible to simplify complex boolean expressions to use fewer negations and improve commentability:  

Compare:

if ((A || !B) &&  !C) {

    ...

}

To:

bool pred(...) {

  if ( C ) return false;  // comment

  else if ( A ) return true;   // comment

  else If ( B ) return false;  // comment

  return true;

}

Opportunities for Polymorphism

If the code contains repeated switch statements (or equivalent if-then-else sequences) over the same enum type, check if those could be avoided by using polymorphism.

“Flag Parameters” (Logical Cohesion)

Some functions accept boolean arguments that determine which execution path is taken through the function. This is considered an anti-pattern which can be avoided or at least ameliorated by factoring out code. Example:

Foo (... bool add_prologue_and_epilogue) {
        if (add_prologue_and_epilogue) {
         // emit prologue
         ...
     }

     // emit main portion
     ...

     if (add_prologue_and_epilogue) {
         // emit epilogue
         ...

      }
}

After applying techniques from above we get

Foo (..., bool add_prologue_and_epilogue) {
   if (add_prologue_and epilogue) {
       EmitProloguePortion(...)      

    }  
   EmitMainPortion(...)
   if (add_prologue_and_epilogue) {
       EmitEpiloguePortion(...)
   }
}

Taking this one step further eliminates the boolean and offers two API functions instead:

EmitMainPortion(...)  // should probably be renamed to Foo.

FooWithPrologueAndEpilogue(...) {

EmitProloguePortion(...)

EmitMainPortion(...)

EmitEpiloguePortion(...)

}

Early Out Programming

Often indentations levels can be reduced by employing a technique which moves parameter checking and the handling of trivial and pathological cases to the beginning of the function.

This usually simplifies the code since it no longer has to deal with corner cases.

Here is an example of what an rpc handler using this technique might look like:

Response RpcHandler(Request request) {

    if (!Authorized(request.user)) {

             return Error(...);          

    }

    if (request.num_items == 0) {

        return Response();

    }

    // main code  

}

More discussion on this topic can be found here.

Early returns can cause duplication of cleanup code. The next section points out ways to avoid that.

Cleanup Code

The need to cleanup before function exit often leads to patterns like this:

 

file = open(...)

...

if (condition1) {

       close(file)

       return;

}

 …

 acquire(lock)

 ...

if (condition2) {

      release(lock)

      close(file)

      return;

}

...

Sometimes the code duplication is avoided by moving the cleanup to the end of the function

and then using if statements or goto/labels to control what cleanup is performed.

Fortunately many programming languages have mechanisms that make this error-prone practice largely unnecessary:

C++

RAII

Go

Defer (uses function scope)

Java

 try-with-resources

Python

Context Managers

The best time to arrange for cleanup is at the point where it becomes necessary.

The code above becomes shorter and more maintainable when re-rewritten as:

 file = open(...)

 FileCloser closer(file)   // closes the file when going out of scope

...

if (condition1) {

     return;

}

acquire(lock)

LockReleaser releaser(lock)  // releases the lock when going out of scope

 if (condition2) {

       return;

}

...

Uniform Level Of Abstraction

Refactoring is a good occasion to make sure that all the code inside the function ends up at roughly the same high level of abstraction. Here is an example of what we want to avoid:

IncreaseSalaries(employee_list_filename, factor) {

     name_list = []

     file = open(employee_list_filename)

     for (line in file.ReadLines()) {

        name_list.append(line)

     }

     for (name in name_list) {

         employee = db.FetchEmployeeByName(name)

         employee.salary *= factor

         if (employee.salary > T2) {

             employee.withholding = W2  

         } else if(employee.salary > T1) {

             employee.withholding = W1  

         } else {

             employee.withholding = W0

         }

         db.SaveEmployee(employee)

     }

}

A better solution would be to decompose this as follows:

ReadEmployeeNamesFromFile(employee_list_filename) {

     name_list = []

     file = open(employee_list_filename)

     for (line in file.ReadLines()) {

         name_list.append(line)

     }

     return name_list

}

IncreaseSalaries(name_list, factor) {

    for (name in name_list) {

        employee = db.FetchEmployeeByName(name)

        employee.IncreaseSalaryAndUpdateTaxWitholding(factor)

        db.SaveEmployee(employee)

    }

}

A Few Words On Process

Add Or Improve Tests

Tests are your safety-net when refactoring. Improving test will not only make this safety-net more effective, it will also improve your understanding of the code.

Don’t start any refactoring until you have a proper test suite in place.

Start Small

Modifying a large function is daunting, so warm up to the task by starting small. Do simple and less risky things first, e.g.

Once you get more comfortable with the code base advance to more challenging refactorings

Lean On Your Tools

If your IDE has refactoring tools such as “extract function” use them. It will be far more accurate than doing it by hand and save you time.

In the absence of more sophisticated tools, the compiler will be your best friend. You can lean on it for help with tedious work. For example, if you want to extract a function, just move out the block of code and wrap it in preliminary function without any parameters. Next, let the compiler tell you which parameters you need to add.

Iterate

Don’t do everything at once. Start small and iterate. Make sure the tests pass and check in your changes frequently. There is no need to be a perfectionist either . If you temporarily create functions that have way too many parameters, that is ok. You can fix this later.

Don’t Feel Bad About Undoing Earlier Changes

Refactoring a large function is a learning process. Some early refactoring steps may not make sense later after you understand things better. Don’t be ashamed to revert them.

References

[1] https://stackoverflow.com/questions/475675/when-is-a-function-too-long

[2] Code Complete (2nd Ed) - Steve McConnel

[3] Working Effectively with Legacy Code - Michael Feathers

[4] Clean Code - Robert C. Martin

[5] https://stackoverflow.com/questions/174968/how-many-parameters-are-too-many

[6] Big Ball of Mud - Brian Foote, Joseph Yoder