COS320, Spring 2015: Compiling Techniques

Superblock Formation

Warning: This assignment can be challenging (considerably more so than AS4 and AS5) and is worth more points. Start early! You can’t use any late days on this assignment, because it is due on Dean’s date.

Implement superblock formation.

Download the archived file for as6 and extract it to the project directory.

cd $COS320_SRC_ROOT
wget http://www.cs.princeton.edu/courses/archive/spring15/cos320/homeworks/as6/as6.tar.gz
tar xfvz as6.tar.gz
rm as6.tar.gz

This assignment uses the basic block and edge profilers and the profile information loaders we built in the previous assignment. If your solution works well, use it. If you would rather use the reference solution, first modify $COS320_SRC_ROOT/lib/Makefile so that it does not build your own Profile library. Change this line

PARALLEL_DIRS = AST Visitor CodeGen Profile Superblock

to this:

PARALLEL_DIRS = AST Visitor CodeGen Superblock

Then you can download the shared library here.

cd $COS320_SRC_ROOT
wget http://www.cs.princeton.edu/courses/archive/spring15/cos320/homeworks/as6_ref_profile_lib/Profile.so.ref
make    # creates the Debug+Asserts/ directory
cd Debug+Asserts/lib
ln -s ../../Profile.so.ref Profile.so

Note that you need to create the symbolic link again after you run ‘make clean’.

Files relevant to this assignment are in lib/Superblock/. You will modify the SuperblockFormation class defined in lib/Superblock/SuperblockFormation.cpp and lib/Superblock/SuperblockFormation.h. In the previous assignment, the profile information printer (ProfilePrinter) used the two profile information loaders (BBProfileLoader and EdgeProfileLoader) to print the profiling information. In this assignment, your SuperblockFormation class needs to use the two loaders again to implement the profile-feedback-directed optimization. In the SuperblockFormation::getAnalysisUsage method, the two loaders have already been added as required using AnalysisUsage::addRequired<>. You need to use getAnalysis<> to get the instances of these loaders. (You can also refer to the ProfilePrinter class.)

The assignment can be roughly divided into two parts:

a) For each innermost loop, select traces that are frequently executed
b) Perform tail duplication on those selected traces

There can be various detailed strategies on how to do a), but in this assignment we will use the algorithm covered in the lecture. Please follow this algorithm closely; otherwise it will be very hard to check the correctness of your code. You will need the profile information loaders to do this.

You are going to select traces only within innermost loops in this assignment. Below is the pseudocode for the trace selection algorithm. This has to be done for every innermost loop that exists in a module.

i = 0;
mark all BBs unvisited
while (there are unvisited nodes) do
    seed = unvisited BB with largest execution frequency
    traces[i].push_back(seed)
    mark seed visited
    current = seed

    // Grow trace forward
    while (true) do
        next = best_successor_of(current)
        if (next == NULL) then break
        traces[i].push_back(next)
        mark next visited
        current = next
    endwhile

    current = seed // set current back to seed

    // Grow trace backward
    while (true) do
        prev = best_predecessor_of(current)
        if (prev == NULL) then break
        traces[i].push_front(prev)
        mark prev visited
        current = prev
    endwhile

    i++;
endwhile

The pseudocode to select best successor/predecessor is as follows:

best_successor_of(BB)
    e = control flow edge with highest probability leaving BB
    if (e is a backedge) then
        return NULL
    endif
    if (probability < THRESHOLD) then
        return NULL
    endif
    d = destination of e
    if (d is visited) then
        return NULL
    endif
    if (d is outside of the loop) then
        return NULL
    endif
    return d
endprocedure

best_predecessor_of(BB)
    if (BB is the loop header) then
        return NULL
    endif
    e = control flow edge with highest probability entering BB
    if (probability < THRESHOLD) then
        return NULL
    endif
    s = source of e
    if (s is visited) then
        return NULL
    endif
    if (s is outside of the loop) then
        return NULL
    endif
    return s
endprocedure
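The pseudocode above can be sketched in plain C++ without any LLVM types, which may help you reason about the algorithm before wiring it to the real loaders. Everything in this sketch is a hypothetical stand-in: blocks are integer indices in loop iteration order, block 0 is assumed to be the loop header, any index outside the range counts as outside the loop, and LoopProfile, Edge, and the function names are invented for illustration only. The strict ‘>’ comparisons keep the first candidate encountered when there is a tie.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical profile data for one innermost loop (not an LLVM type).
struct Edge {
  int src, dst;   // block indices; an index outside [0, n) is outside the loop
  double prob;    // edge weight taken from the edge profile
  bool backedge;  // true if the edge jumps back to the loop header
};

struct LoopProfile {
  std::vector<double> freq;  // execution frequency per block, block 0 = header
  std::vector<Edge> edges;   // edges in iteration order
  double threshold;          // branch bias threshold, e.g. 0.7
};

static bool inLoop(const LoopProfile &p, int bb) {
  return bb >= 0 && bb < static_cast<int>(p.freq.size());
}

// Returns the best successor of bb, or -1 in place of the pseudocode's NULL.
int bestSuccessorOf(const LoopProfile &p, const std::vector<bool> &visited, int bb) {
  const Edge *best = nullptr;
  for (const Edge &e : p.edges)
    if (e.src == bb && (!best || e.prob > best->prob)) best = &e;
  if (!best || best->backedge || best->prob < p.threshold) return -1;
  // Range check first so that visited[] is never indexed out of bounds.
  if (!inLoop(p, best->dst) || visited[best->dst]) return -1;
  return best->dst;
}

// Returns the best predecessor of bb, or -1 in place of the pseudocode's NULL.
int bestPredecessorOf(const LoopProfile &p, const std::vector<bool> &visited, int bb) {
  if (bb == 0) return -1;  // never grow backward past the loop header
  const Edge *best = nullptr;
  for (const Edge &e : p.edges)
    if (e.dst == bb && (!best || e.prob > best->prob)) best = &e;
  if (!best || best->prob < p.threshold) return -1;
  if (!inLoop(p, best->src) || visited[best->src]) return -1;
  return best->src;
}

// Selects traces for one loop; each trace lists block indices from top to bottom.
std::vector<std::deque<int>> selectTraces(const LoopProfile &p) {
  assert(p.threshold >= 0.0 && p.threshold <= 1.0);
  const int n = static_cast<int>(p.freq.size());
  std::vector<bool> visited(n, false);
  std::vector<std::deque<int>> traces;
  while (true) {
    int seed = -1;  // unvisited block with the largest execution frequency
    for (int b = 0; b < n; ++b)
      if (!visited[b] && (seed < 0 || p.freq[b] > p.freq[seed])) seed = b;
    if (seed < 0) break;  // every block is visited
    std::deque<int> trace{seed};
    visited[seed] = true;
    int cur = seed;
    while (true) {  // grow trace forward
      int next = bestSuccessorOf(p, visited, cur);
      if (next < 0) break;
      trace.push_back(next);
      visited[next] = true;
      cur = next;
    }
    cur = seed;
    while (true) {  // grow trace backward
      int prev = bestPredecessorOf(p, visited, cur);
      if (prev < 0) break;
      trace.push_front(prev);
      visited[prev] = true;
      cur = prev;
    }
    traces.push_back(trace);
  }
  return traces;
}
```

In your pass, the frequencies and edge weights would come from the two profile loaders, and the block order from Loop::block_iterator, rather than from a hand-built structure like this one.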

In this assignment, you will make heavy use of the Loop class, including methods inherited from its base class LoopBase. When selecting the seed BB with the highest execution frequency, iterate over the BBs within a loop using Loop::block_iterator; if there is a tie, select the first one encountered when iterating with LoopBase::block_begin/LoopBase::block_end. Below is template code showing how to iterate over the BBs within a loop using Loop::block_iterator.

for (Loop::block_iterator bi = l->block_begin(), be = l->block_end(); bi != be; ++bi) {
  BasicBlock *bb = *bi;
  ...
}

When iterating predecessors and successors of a BB, you can use pred_iterator (or const_pred_iterator) and succ_iterator (or succ_const_iterator) with pred_begin/pred_end and succ_begin/succ_end functions. If there is a tie between predecessors/successors with the highest probability, you should select the first one encountered in the iteration as well. Below is an example of iterating predecessors and successors.

The branch bias threshold is configurable through the option ‘-superblock-branch-bias-threshold’ defined in Superblock.cpp and is currently set to 0.7. You can use ‘biasThreshold’ to access this value. Don’t change or delete this option or its default value. Note that a basic block can belong to at most one trace, because once a BB is marked visited, it is no longer a candidate for seed selection or for successor/predecessor selection.

The next step is tail duplication. For every trace you selected in the previous step, identify the first side entrance, and replicate all BBs from the target to the bottom of the trace. Even if you have multiple side entrances to a BB, you should duplicate each BB only once. After duplicating BBs, fix side entrances. The last duplicated BB (BB3’ in the figure below) in a trace should branch to the next BB in the original CFG (BB4 below). Note that BB4 does not belong to the trace in this case.

When duplicating a BB, give the cloned BB’s name the suffix ‘.clone’. So, ‘mybb’ will be cloned to ‘mybb.clone’. Every time you clone a BB, increment the ‘numClonedBBs’ statistics variable defined in SuperblockFormation.cpp to collect statistics. You are not allowed to create any BBs other than cloned BBs, or to delete any existing BBs, even if they are empty (= have only one TerminatorInst).
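To make the tail duplication step concrete, here is a minimal sketch on a toy CFG represented as a map from block names to successor lists. It is not LLVM code: the CFG alias, tailDuplicate, and the block names are hypothetical, and the sketch treats any edge into a non-head trace block whose source is not the preceding trace block as a side entrance. It only rewires edges; the instructions and PHI nodes that a real pass must also handle are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy CFG: block name -> ordered successor names (hypothetical, not LLVM).
using CFG = std::map<std::string, std::vector<std::string>>;

// Duplicates the tail of `trace` starting at the first side-entrance target
// and redirects all side entrances to the clones. Returns the number of
// cloned blocks (what numClonedBBs would be incremented by).
int tailDuplicate(CFG &cfg, const std::vector<std::string> &trace) {
  assert(!trace.empty());
  std::map<std::string, std::size_t> pos;  // trace block -> position in trace
  for (std::size_t i = 0; i < trace.size(); ++i) pos[trace[i]] = i;

  // Returns the trace position of dst if src -> dst is a side entrance,
  // or trace.size() if it is not.
  auto sideEntranceTarget = [&](const std::string &src, const std::string &dst) {
    auto it = pos.find(dst);
    if (it == pos.end() || it->second == 0) return trace.size();  // not a side entrance
    if (src == trace[it->second - 1]) return trace.size();        // on-trace edge
    return it->second;
  };

  // Find the first (topmost) block in the trace that has a side entrance.
  std::size_t first = trace.size();
  for (const auto &kv : cfg)
    for (const std::string &dst : kv.second)
      first = std::min(first, sideEntranceTarget(kv.first, dst));
  if (first == trace.size()) return 0;  // nothing to duplicate

  const CFG original = cfg;  // snapshot so that new clones are not rescanned
  // Clone every block from the first side-entrance target to the trace bottom,
  // exactly once each. Edges that stay inside the tail go to the clones; all
  // other successors (such as the block after the trace) are kept as they are.
  for (std::size_t i = first; i < trace.size(); ++i) {
    std::vector<std::string> succs = original.at(trace[i]);
    for (std::string &s : succs) {
      auto it = pos.find(s);
      if (it != pos.end() && it->second >= first) s += ".clone";
    }
    cfg[trace[i] + ".clone"] = succs;
  }
  // Fix side entrances: they now enter the duplicated tail instead.
  for (const auto &kv : original)
    for (std::size_t k = 0; k < kv.second.size(); ++k)
      if (sideEntranceTarget(kv.first, kv.second[k]) < trace.size())
        cfg[kv.first][k] += ".clone";
  return static_cast<int>(trace.size() - first);
}
```

For a trace BB1, BB2, BB3 with one side entrance into BB2 from some other block, this clones BB2 and BB3, leaves BB3.clone branching to the original successor of BB3, and makes the side entrance jump to BB2.clone.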

Here we are not going to impose any limit on code size expansion, because it is nearly impossible to grow the code size by more than 2x in the current setting anyway.

Because you have duplicated BBs and altered a CFG, you need to fix various PHI nodes and reconstruct a valid SSA form. Suppose, in the example above, BB2 has another successor:

If BB2 has a definition of a variable ‘myvar’, its duplicate BB2’ has a definition of its duplicate variable ‘myvar.clone’. If BBX has a use of ‘myvar’, BBX should have a PHI node in the beginning and its use of ‘myvar’ should be converted to that PHI node. This is just one example; there can be many possible cases which can be tricky. Think about how to reconstruct a valid SSA form again and implement it. Describe how you solved the problem in your README.
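As a small illustration of the simplest case described above, the sketch below builds the incoming list of a PHI node for a block whose predecessors are the defining block and its clone. It is a toy string model, not LLVM API: Phi, buildMergePhi, toString, and the ‘.phi’ result name are all hypothetical, the value type is omitted from the printed form, and it assumes each predecessor either defines the variable itself or is the ‘.clone’ copy of the defining block. The trickier cases the text warns about are deliberately not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy PHI node: result name plus (incoming value, predecessor block) pairs.
struct Phi {
  std::string result;
  std::vector<std::pair<std::string, std::string>> incoming;
};

// True if the block name carries the ".clone" suffix used for cloned BBs.
static bool isClone(const std::string &name) {
  const std::string suffix = ".clone";
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Builds the PHI that merges `var` and `var.clone` in a block whose
// predecessors are `preds`. Assumes each predecessor is either the block
// that defines `var` or the clone of that block.
Phi buildMergePhi(const std::string &var, const std::vector<std::string> &preds) {
  assert(!preds.empty());
  Phi phi{var + ".phi", {}};
  for (const std::string &pred : preds)
    phi.incoming.emplace_back(isClone(pred) ? var + ".clone" : var, pred);
  return phi;
}

// Renders the PHI in an LLVM-like textual form (type omitted for brevity).
std::string toString(const Phi &phi) {
  std::string s = "%" + phi.result + " = phi ";
  for (std::size_t i = 0; i < phi.incoming.size(); ++i) {
    if (i > 0) s += ", ";
    s += "[ %" + phi.incoming[i].first + ", %" + phi.incoming[i].second + " ]";
  }
  return s;
}
```

With predecessors BB2 and BB2.clone, every use of ‘myvar’ in BBX would then be rewritten to use the PHI result instead of ‘myvar’ directly.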

As usual, you can build your program using ‘make’.

cd $COS320_SRC_ROOT
make

If your program compiles successfully, there will be Debug+Asserts/lib/Superblock.so. Let’s first test with our proverbial all.fun.

./Debug+Asserts/bin/codegen tests/all.fun

Before applying superblock formation, we need to run BB/edge profilers we made in AS5 to generate CFG profiling data. We add ‘-mem2reg’ to see whether your PHI node handling is correct or not.

opt -break-crit-edges -mem2reg -S -o bench.ll all.ll

opt -load Debug+Asserts/lib/Profile.so -bb-profiler -edge-profiler -stats -S -o bench.prof.ll bench.ll
clang -o bench.prof.exe bench.prof.ll
./bench.prof.exe

opt -load Debug+Asserts/lib/Profile.so -bb-profile-loader -edge-profile-loader -profile-printer -dump-bb-info -dump-edge-info -stats -disable-output bench.ll

Now you have bb_info.prof, edge_info.prof, and prof_dump.txt.

It’s time to run your superblock formation pass.

opt bench.ll -load Debug+Asserts/lib/Profile.so -load Debug+Asserts/lib/Superblock.so -bb-profile-loader -edge-profile-loader -superblock-formation -stats -S -debug-only=superblock-formation -o bench.sb.ll

-debug-only=superblock-formation prints only the debug messages enclosed in DEBUG macros in your superblock formation pass. Next, we compile and run this superblock-formed program:

clang -o bench.sb.exe bench.sb.ll
./bench.sb.exe

Your program output and the return value should be the same as the original program. all.fun does not take any input file, but if the program takes an input file, profiling input is usually smaller than the final testing input.

To compare execution times consistently, we optimize both the unmodified LLVM bitcode file and the superblock-formed file with -O3 and then time them.

opt bench.ll -O3 -unroll-allow-partial -S -o bench.opti.ll
opt bench.sb.ll -O3 -unroll-allow-partial -S -o bench.sb.opti.ll
clang -o bench.opti.exe bench.opti.ll
clang -o bench.sb.opti.exe bench.sb.opti.ll
time ./bench.opti.exe
time ./bench.sb.opti.exe

all.fun is too simple to show any difference in execution time. In fact, even with bigger programs that run longer, the chances of seeing a noticeable difference are small. Possible reasons are:

While superblock formation alone can give some scheduling freedom to the backend phases, we do not implement superblock scheduling itself. Doing so would require modifying the compiler’s Intel processor backend, which is not feasible in an assignment, and the processor itself would need to support the required features.
We didn’t apply all the enabling optimizations after the superblock formation pass (-O3 includes some of them, though).
Most importantly, superblock formation is especially effective on VLIW-like processors, where the compiler determines the order of execution of operations (including which operations can execute simultaneously). Our FC lab machines have very complex Intel i7 processors, which do most of the scheduling in hardware through out-of-order (OOO) execution, so it is hard to expect a meaningful speedup on these machines.

So don’t worry if your program does not show any speedup; you will not be graded on your program’s speedup.

One way to check the correctness of your pass is to look at your new CFG. opt has the -view-cfg and -view-cfg-only options for this; they display your CFGs as pictures. See the Tips and Tricks section for detailed instructions.

The other way is to look at the output of the profile information printer we used in AS5. Above, we profiled the unoptimized bitcode file to generate profiling data for superblock formation; this time, we profile the superblock-formed file to check whether the superblock formation pass behaves as expected.

opt -load Debug+Asserts/lib/Profile.so -bb-profiler -edge-profiler -bb-info-output-file=bb_info.sb.prof -edge-info-output-file=edge_info.sb.prof -stats -S -o bench.sb.prof.ll bench.sb.ll
clang -o bench.sb.prof.exe bench.sb.prof.ll
./bench.sb.prof.exe

opt -load Debug+Asserts/lib/Profile.so -bb-profile-loader -edge-profile-loader -profile-printer -bb-info-input-file=bb_info.sb.prof -edge-info-input-file=edge_info.sb.prof -profile-printer-dump-file=prof_dump.sb.txt -dump-bb-info -dump-edge-info -stats -disable-output bench.sb.ll

We specified different filenames for the *.prof and prof_dump.txt files so as not to overwrite the files generated above. Compare prof_dump.sb.txt with the original prof_dump.txt and check that the differences make sense.

As in AS5, we are not limited to FUN files for testing. Running all these commands manually is tedious, so we provide you with a few benchmarks and a Makefile, as in the previous assignment.

cd $COS320_SRC_ROOT/benches/BENCH_NAME/src
make

Here replace BENCH_NAME with a real directory name. Don’t use parallel options like -j8 here, because several tests may clobber each other’s outputs. This will do all the steps listed above for you. prof_dump.sb.txt will be generated automatically as well.

(Note that this is only applicable if you use the LLVM version 3.5 and compile the code with the same options as in the provided Makefile)

Now you can add several more programs to benches/ directory to test. What you need to do is make a similar directory structure as one of the benchmarks, and edit benches/BENCH_NAME/exec_info file to specify command-line arguments, testing commands, and reference outputs. You can also edit benches/BENCH_NAME/compile_info if you need special cflags.

Submit a README, SuperblockFormation.cpp, and SuperblockFormation.h, to dropbox here.

Tips and Tricks

Below is the list of miscellaneous tips and tricks. The listed order is not the order of importance.

The classes you might want to additionally use in this assignment include, but are not limited to:

When f is a variable of Function*, you can get a LoopInfo instance of a function by

LoopInfo &li = getAnalysis<LoopInfo>(*f);

Note that LoopInfo has been already declared as a required analysis in SuperblockFormation::getAnalysisUsage.

You can print debug messages within a pass using one of outs(), errs(), or dbgs() stream. But make sure you delete or comment out all those messages when you submit your work. Or, a better way is to enclose those statements within DEBUG macros. In this way your debug messages will be printed only with the -debug or -debug-only=superblock-formation option.

When cloning basic blocks, CloneBasicBlock function will be helpful.

To use the -view-cfg or -view-cfg-only option in opt, you need an X server installed on your computer. If you are using a Mac, install XQuartz if you haven’t already. If you are using Windows, Cygwin/X or Xming is a good option. If you are a Linux user, you generally don’t need to do anything; chances are you already have X11 installed.
After installing one of those, use the -X option to ssh when you log in to the FC lab servers, to enable X11 forwarding:

ssh -X NETID@labpc-proxy.cs.princeton.edu

If you are using the forwarding routine to one of FC lab servers in .bashrc (it’s in AS1 page), add ‘-X’ option to the ssh command there too. If you are not using it, add ‘-X’ option to your second ssh command too.

ssh -X NETID@fc010-labpc-01.cs.princeton.edu    # you can use any of labpc-01 to labpc-23

To see if your X forwarding works, run xclock after you log in to one of FC lab servers. If a clock window appears, you are all set.

xclock

-view-cfg option displays the CFG with all the instructions, while -view-cfg-only does not show any instructions. -view-cfg-only is often easier to use because it is simpler. To run it:

opt -view-cfg-only -disable-output whatever.ll    # or whatever.bc

This displays the CFG for each function within the .ll or .bc file one by one. If you want to view a specific function only or view the CFG in the middle of a pass execution, you can use Function::viewCFG or Function::viewCFGOnly function.

When computing edge weights for the best predecessor, consider the following edges and their weights:

bb1 -> bb2 : 0.9
bb1 -> bb3 : 0.1
bb4 -> bb2 : 1.0

In this case, when selecting the best predecessor of bb2, you should select bb4. Its edge weight (1.0) exceeds the threshold (0.7), so it satisfies the requirement. Don't compute the edge weight of bb4->bb2 as 1.0 / (0.9 + 1.0). Just use EdgeProfileLoader::getWeight(edge).
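The example above can be checked with a few lines of standalone C++. This is only a sketch: the Edge struct and bestPredecessorSource are hypothetical, and the weight field stands in for the value that EdgeProfileLoader::getWeight(edge) would return. Each incoming edge is compared using its own weight as-is. Renormalizing over the incoming edges would give 1.0 / (0.9 + 1.0), roughly 0.53, which would wrongly fall below the 0.7 threshold.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical edge record; `weight` plays the role of the loader's edge weight.
struct Edge {
  std::string src, dst;
  double weight;
};

// Returns the source of the highest-weight edge entering `bb`, or an empty
// string if there is no such edge or its weight is below `threshold`.
// The strict '>' keeps the first edge encountered when weights tie.
std::string bestPredecessorSource(const std::vector<Edge> &edges,
                                  const std::string &bb, double threshold) {
  assert(threshold >= 0.0 && threshold <= 1.0);
  const Edge *best = nullptr;
  for (const Edge &e : edges)
    if (e.dst == bb && (!best || e.weight > best->weight)) best = &e;
  if (!best || best->weight < threshold) return "";
  return best->src;  // weight used directly, no renormalization
}
```

With the three edges listed above and a threshold of 0.7, the call for bb2 yields bb4, while bb3 has no acceptable predecessor because its only incoming edge has weight 0.1.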

Here are the reference outputs for the given benchmarks:

sum does not have superblock formation candidates
wc
yacc

Note that this is only applicable if you use the LLVM version 3.5 and compile the code with the same options as in the provided Makefile.

Sometimes it is helpful to explore LLVM source code to get some information. (For example, you might want to know how a certain function is being used) You can do that using Doxygen generated documentation. But you can also directly browse the source code in $LLVM_SRC_ROOT/. ctags can be convenient to browse the code.

References

You can find all the LLVM-related documentation on the LLVM 3.5 documentation page.

Available from: Friday, 24 April 2015

Due date: Tuesday, 12 May 2015, 5:00pm (Dean’s date)
