1 of 37

Boa

A Language and Infrastructure for Analyzing Ultra-Large-Scale Software Repositories

{rdyer,hoan,hridesh,tien}@iastate.edu

Iowa State University

Robert Dyer

Tien N. Nguyen

Hoan Anh Nguyen

Hridesh Rajan

2 of 37

Why mine software repositories?

Learn from the past

Empirical validation

To find better designs

Inform the future

Spot (anti-)patterns

What is actually practiced

Keep doing what works

3 of 37

4 of 37

Consider a task that answers

"What is the average churn rate for Java projects on SourceForge?"

Note: churn rate is the average number of files changed per revision

5 of 37

Is Java project?

Has repository?

Access repository

Calculate project's churn rate

mine project metadata

Yes

Yes

mine revision data

foreach project

Calculate average churn rate

6 of 37

A solution in Java...

public class GetChurnRates {

public static void main(String[] args) { new GetChurnRates().getRates(args[0]); }

public void getRates(String cachePath) {

for (File file : (File[])FileIO.readObjectFromFile(cachePath)) {

String url = getSVNUrl(file);

if (url != null && !url.isEmpty())

System.out.println(url + "," + getChurnRateForProject(url));

}

}

private String getSVNUrl(File file) {

String jsonTxt = "";

... // read the file contents into jsonTxt

JSONObject json = null, jsonProj = null;

... // parse the text, get the project data

if (!jsonProj.has("programming-languages")) return "";

if (!jsonProj.has("SVNRepository")) return "";

boolean hasJava = false;

... // is the project a Java project?

if (!hasJava) return "";

JSONObject svnRep = jsonProj.getJSONObject("SVNRepository");

if (!svnRep.has("location")) return "";

return svnRep.getString("location");

}

private double getChurnRateForProject(String url) {

double rate = 0;

SVNURL svnUrl;

... // connect to SVN and compute churn rate

return rate;

}

}

Full program�over 70 lines of code

Uses JSON and SVN libraries

Runs sequentially

Takes over 24 hrs

Takes almost 3 hrs - with data locally cached!

Too much code!

Do not read!

7 of 37

A better solution...

p: Project = input;

rates: output mean[string] of int;

exists (i: int; lowercase(p.programming_languages[i]) == "java")

foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)

foreach (k: int; def(p.code_repositories[j].revisions[k]))

rates[p.id] << len(p.code_repositories[j].revisions[k].files);

Full program 6 lines of code!

No external libraries needed!

Automatically parallelized!

Results in about 1 minute!

8 of 37

A better solution...

p: Project = input;

rates: output mean[string] of int;

exists (i: int; lowercase(p.programming_languages[i]) == "java")

foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)

foreach (k: int; def(p.code_repositories[j].revisions[k]))

rates[p.id] << len(p.code_repositories[j].revisions[k].files);

9 of 37

The Boa language and data-intensive infrastructure

http://boa.cs.iastate.edu/

10 of 37

Research Questions

  • Can we abstract and simplify the software mining process to make it more accessible to non-experts?

  • Can software repository mining be done efficiently at a large scale?

11 of 37

Design goals

Easy to use

Scalable and efficient

Reproducible research results

12 of 37

Design goals

Easy to use

  • Simple language

  • No need to know details of
    • Software repository mining
    • Data parallelization

13 of 37

Design goals

Scalable and efficient

  • Study millions of projects

  • Results in minutes, not days

14 of 37

Design goals

Reproducible research results

Robles, MSR'10

Studied 171 papers

Only 2 were "replication friendly"

15 of 37

Boa architecture

Boa's Data Infrastructure

Local Cache

Replicator

Caching Translator

SF.net

Compile

Execute on

Hadoop Cluster

Deploy

Query Program

Query Plan

Query Result

Boa's Compiler

MapReduce2

Domain-specific Types/Functions

Quantifiers

Runtime

Cached Data

input reader

User Functions

Boa Language

MapReduce1

Domain-specific Types/Functions

1 Pike et al, Scientific Prog. Journal, Vol 13, No 4, 2005

2 Anthony Urso, http://github.com/anthonyu/Sizzle

16 of 37

Design goals

Easy to use

Scalable and efficient

Reproducible research results

17 of 37

Domain-specific types

http://boa.cs.iastate.edu/docs/dsl-types.php

p: Project = input;

rates: output mean[string] of int;

exists (i: int; lowercase(p.programming_languages[i]) == "java")

foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)

foreach (k: int; def(p.code_repositories[j].revisions[k]))

rates[p.id] << len(p.code_repositories[j].revisions[k].files);

Abstracts details of how to mine software repositories

p: Project = input;

rates: output mean[string] of int;

exists (i: int; lowercase(p.programming_languages[i]) == "java")

foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)

foreach (k: int; def(p.code_repositories[j].revisions[k]))

rates[p.id] << len(p.code_repositories[j].revisions[k].files);

18 of 37

Domain-specific types

http://boa.cs.iastate.edu/docs/dsl-types.php

Project

id

: string

name

: string

description

: string

homepage_url

: string

programming_languages

: array of string

licenses

: array of string

maintainers

: array of Person

....

code_repositories

: array of CodeRepository

19 of 37

Domain-specific types

http://boa.cs.iastate.edu/docs/dsl-types.php

CodeRepository

url

: string

kind

: RepositoryKind

revisions

: array of Revision

Revision

id

: int

committer

: Person

commit_date

: time

log

: string

files

: array of File

File

name

: string

kind

: FileKind

change

: ChangeKind

20 of 37

Domain-specific functions

http://boa.cs.iastate.edu/docs/dsl-functions.php

Mines a revision to see if it contains any files of the type specified.

hasfiletype := function (rev: Revision, ext: string) : bool {

exists (i: int; matches(format(`\.%s$`, ext), rev.files[i].name))

return true;

return false;

}

21 of 37

Domain-specific functions

http://boa.cs.iastate.edu/docs/dsl-functions.php

Mines a revision log to see if it fixed a bug.

isfixingrevision := function (log: string) : bool {

if (matches(`\s+fix(es|ing|ed)?\s+`, log)) return true;

if (matches(`(bug|issue)(s)?[\s]+(#)?\s*[0-9]+`, log)) return true;

if (matches(`(bug|issue)\s+id(s)?\s*=\s*[0-9]+`, log)) return true;

return false;

}

22 of 37

User-defined functions

http://boa.cs.iastate.edu/docs/user-functions.php

id := function (a1: t1, ..., an: tn) [: ret] {

... # body

[return ...;]

};

  • Allows for complex algorithms and code re-use
  • Users can provide their own mining algorithms

23 of 37

Quantifiers

http://boa.cs.iastate.edu/docs/quantifiers.php

p: Project = input;

rates: output mean[string] of int;

exists (i: int; lowercase(p.programming_languages[i]) == "java")

foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)

foreach (k: int; def(p.code_repositories[j].revisions[k]))

rates[p.id] << len(p.code_repositories[j].revisions[k].files);

  • foreach, exists, ifall
  • Bounds are inferred from the conditional

p: Project = input;

rates: output mean[string] of int;

exists (i: int; lowercase(p.programming_languages[i]) == "java")

foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)

foreach (k: int; def(p.code_repositories[j].revisions[k]))

rates[p.id] << len(p.code_repositories[j].revisions[k].files);

24 of 37

Output and aggregation

http://boa.cs.iastate.edu/docs/aggregators.php

p: Project = input;

rates: output mean[string] of int;

exists (i: int; lowercase(p.programming_languages[i]) == "java")

foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)

foreach (k: int; def(p.code_repositories[j].revisions[k]))

rates[p.id] << len(p.code_repositories[j].revisions[k].files);

  • Output can be indexed
  • Output defined in terms of predefined data aggregators
    • sum, set, mean, maximum, minimum, etc
  • Values sent to output aggregation variables

p: Project = input;

rates: output mean[string] of int;

exists (i: int; lowercase(p.programming_languages[i]) == "java")

foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)

foreach (k: int; def(p.code_repositories[j].revisions[k]))

rates[p.id] << len(p.code_repositories[j].revisions[k].files);

25 of 37

Design goals

Easy to use

Scalable and efficient

Reproducible research results

26 of 37

Let's see it in action!

<<demo>>

27 of 37

Why are we waiting for results?

Program is analyzing...

699,332 projects

494,159 repositories

6,385,666 revisions

57,304,233 files

28 of 37

Let's check the results!

<<demo>>

29 of 37

Efficient execution

Task4

Task3

Task2

Task1

30 of 37

Scalability of input size

Task1

Task2

Task3

Task4

6k

60k

620k

6k

60k

620k

6k

60k

620k

6k

60k

620k

31 of 37

Design goals

Easy to use

Scalable and efficient

Reproducible research results

32 of 37

Controlled Experiment

  • Published artifacts (on Boa website)
    • Boa source code
    • Dataset used (timestamp of data)
    • Results file

33 of 37

Related Works

Sourcerer [Linstead et al. Data Mining Know. Disc.'09]

  • SQL database on 18k projects

Kenyon [Bevan et al. ESEC/FSE'05]

  • Centralized database of metadata and source code

PROMISE [Boetticher, Menzies, Ostrand 2007]

  • Online data repository for SE datasets
  • Boa provides raw, un-processed data

Boa provides better scalability

34 of 37

Related Works

Sawzall [Pike et al. Sci.Prog.'05]

  • Similar syntax to Boa
  • Abstracts details of the MapReduce runtime

Pig Latin [Olston et al. SIGMOD'08]

  • Declarative syntax, similar to SQL

DryadLINQ [Yu et al. OSDI'08]

  • Syntax based on .Net's LINQ
  • Compiles to Dryad framework, a DAG of processes

None provide direct support

for mining software repositories

35 of 37

Ongoing work

cvs

git

hg

bzr

GitHub

Google Code

Launchpad

Other artifacts

Language abstractions

Infrastructure improvements

36 of 37

Recent Work

  • Support for mining source code
    • Down to expression level

  • Currently for Java
    • Over 23k projects, with full history
    • Over 14 Billion AST nodes

37 of 37

Conclusions

  • Domain-specific language and infrastructure for software repository mining

    • Easy to use

    • Efficient and scalable

    • Allows reproducing prior results

http://boa.cs.iastate.edu/request/