Boa
A Language and Infrastructure for Analyzing Ultra-Large-Scale Software Repositories
Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, Tien N. Nguyen
Iowa State University
{rdyer,hoan,hridesh,tien}@iastate.edu
Why mine software repositories?
Learn from the past
Empirical validation
To find better designs
Inform the future
Spot (anti-)patterns
What is actually practiced
Keep doing what works
Consider a task that answers
"What is the average churn rate for Java projects on SourceForge?"
Note: churn rate is the average number of files changed per revision
foreach project:
    mine project metadata: Is Java project? (Yes) -> Has repository? (Yes)
    mine revision data: Access repository -> Calculate project's churn rate
Calculate average churn rate
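The arithmetic behind the task can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the input is a made-up list of per-revision file counts rather than mined data, and a project with no revisions is assumed to have a rate of 0.

```java
import java.util.List;

public class ChurnSketch {
    /** Churn rate of one project: mean number of files changed per revision. */
    public static double projectChurnRate(int[] filesChangedPerRevision) {
        if (filesChangedPerRevision.length == 0) {
            return 0.0; // assumption of this sketch: no revisions means a rate of 0
        }
        long total = 0;
        for (int n : filesChangedPerRevision) {
            total += n;
        }
        return (double) total / filesChangedPerRevision.length;
    }

    /** Average of the per-project churn rates. */
    public static double averageChurnRate(List<int[]> projects) {
        if (projects.isEmpty()) {
            return 0.0;
        }
        double sum = 0;
        for (int[] revisions : projects) {
            sum += projectChurnRate(revisions);
        }
        return sum / projects.size();
    }

    public static void main(String[] args) {
        // Two made-up projects: rates 2.0 and 5.0, so the average is 3.5.
        List<int[]> projects = List.of(new int[] {1, 3, 2}, new int[] {4, 6});
        System.out.println(averageChurnRate(projects)); // prints 3.5
    }
}
```

The hard part of the real task is not this arithmetic but obtaining the per-revision file counts for every project.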
A solution in Java...
public class GetChurnRates {
    public static void main(String[] args) { new GetChurnRates().getRates(args[0]); }
    public void getRates(String cachePath) {
        for (File file : (File[])FileIO.readObjectFromFile(cachePath)) {
            String url = getSVNUrl(file);
            if (url != null && !url.isEmpty())
                System.out.println(url + "," + getChurnRateForProject(url));
        }
    }
    private String getSVNUrl(File file) {
        String jsonTxt = "";
        ... // read the file contents into jsonTxt
        JSONObject json = null, jsonProj = null;
        ... // parse the text, get the project data
        if (!jsonProj.has("programming-languages")) return "";
        if (!jsonProj.has("SVNRepository")) return "";
        boolean hasJava = false;
        ... // is the project a Java project?
        if (!hasJava) return "";
        JSONObject svnRep = jsonProj.getJSONObject("SVNRepository");
        if (!svnRep.has("location")) return "";
        return svnRep.getString("location");
    }
    private double getChurnRateForProject(String url) {
        double rate = 0;
        SVNURL svnUrl;
        ... // connect to SVN and compute churn rate
        return rate;
    }
}
Full program: over 70 lines of code
Uses JSON and SVN libraries
Runs sequentially
Takes over 24 hrs
Takes almost 3 hrs - with data locally cached!
Too much code!
Do not read!
A better solution...
p: Project = input;
rates: output mean[string] of int;
exists (i: int; lowercase(p.programming_languages[i]) == "java")
    foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)
        foreach (k: int; def(p.code_repositories[j].revisions[k]))
            rates[p.id] << len(p.code_repositories[j].revisions[k].files);
Full program: 6 lines of code!
No external libraries needed!
Automatically parallelized!
Results in about 1 minute!
The Boa language and data-intensive infrastructure
http://boa.cs.iastate.edu/
Research Questions
Design goals
Easy to use
Scalable and efficient
Reproducible research results
Robles, MSR'10
Studied 171 papers
Only 2 were "replication friendly"
Boa architecture
[Figure: Boa architecture. Labels shown: Boa's Data Infrastructure (SF.net, Replicator, Caching Translator, Local Cache); Query Program, Compile, Boa's Compiler, Query Plan, Deploy, Execute on Hadoop Cluster, Query Result.]
[Figure: language and runtime layers. Labels shown: Boa Language, Runtime, MapReduce [1], MapReduce [2], Domain-specific Types/Functions, Quantifiers, User Functions, input reader, Cached Data.]
[1] Pike et al., Scientific Prog. Journal, Vol. 13, No. 4, 2005
[2] Anthony Urso, http://github.com/anthonyu/Sizzle
Domain-specific types
http://boa.cs.iastate.edu/docs/dsl-types.php
p: Project = input;
rates: output mean[string] of int;
exists (i: int; lowercase(p.programming_languages[i]) == "java")
    foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)
        foreach (k: int; def(p.code_repositories[j].revisions[k]))
            rates[p.id] << len(p.code_repositories[j].revisions[k].files);
Abstracts details of how to mine software repositories
Project
    id: string
    name: string
    description: string
    homepage_url: string
    programming_languages: array of string
    licenses: array of string
    maintainers: array of Person
    ....
    code_repositories: array of CodeRepository
CodeRepository
    url: string
    kind: RepositoryKind
    revisions: array of Revision
Revision
    id: int
    committer: Person
    commit_date: time
    log: string
    files: array of File
File
    name: string
    kind: FileKind
    change: ChangeKind
Domain-specific functions
http://boa.cs.iastate.edu/docs/dsl-functions.php
Mines a revision to see if it contains any files of the type specified.
hasfiletype := function (rev: Revision, ext: string) : bool {
    exists (i: int; matches(format(`\.%s$`, ext), rev.files[i].name))
        return true;
    return false;
};
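For comparison, the same check can be sketched in plain Java with java.util.regex. The class and method names are hypothetical, file names are passed as plain strings instead of a Revision, and the sketch assumes the Boa matcher searches anywhere in the string (hence find() rather than matches()).

```java
import java.util.List;
import java.util.regex.Pattern;

public class FileTypeSketch {
    /** True if any file name ends with ".<ext>". */
    public static boolean hasFileType(List<String> fileNames, String ext) {
        // Pattern.quote is an addition of this sketch; the Boa version
        // formats ext into the regular expression directly.
        Pattern suffix = Pattern.compile("\\." + Pattern.quote(ext) + "$");
        for (String name : fileNames) {
            if (suffix.matcher(name).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasFileType(List.of("src/Main.java", "README"), "java")); // prints true
        System.out.println(hasFileType(List.of("app.javascript"), "java"));          // prints false
    }
}
```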
Mines a revision log to see if it fixed a bug.
isfixingrevision := function (log: string) : bool {
    if (matches(`\s+fix(es|ing|ed)?\s+`, log)) return true;
    if (matches(`(bug|issue)(s)?[\s]+(#)?\s*[0-9]+`, log)) return true;
    if (matches(`(bug|issue)\s+id(s)?\s*=\s*[0-9]+`, log)) return true;
    return false;
};
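To see what these three patterns accept, here is a rough Java equivalent. The class and method names are hypothetical, and the sketch assumes search semantics (a match anywhere in the log), which is what the leading and trailing \s+ in the first pattern suggest.

```java
import java.util.regex.Pattern;

public class FixingRevisionSketch {
    // The three patterns from the slide, copied unchanged into Java string syntax.
    private static final Pattern[] FIX_PATTERNS = {
        Pattern.compile("\\s+fix(es|ing|ed)?\\s+"),
        Pattern.compile("(bug|issue)(s)?[\\s]+(#)?\\s*[0-9]+"),
        Pattern.compile("(bug|issue)\\s+id(s)?\\s*=\\s*[0-9]+"),
    };

    /** True if any of the patterns occurs somewhere in the log message. */
    public static boolean isFixingRevision(String log) {
        for (Pattern p : FIX_PATTERNS) {
            if (p.matcher(log).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isFixingRevision("This commit fixes the parser crash")); // prints true
        System.out.println(isFixingRevision("Closes bug #123"));                    // prints true
        System.out.println(isFixingRevision("fix typo"));                           // prints false
    }
}
```

The last call returns false because the first pattern requires whitespace on both sides of the word, so a log that starts with "fix" is not recognized by the patterns as written.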
User-defined functions
http://boa.cs.iastate.edu/docs/user-functions.php
id := function (a1: t1, ..., an: tn) [: ret] {
... # body
[return ...;]
};
Quantifiers
http://boa.cs.iastate.edu/docs/quantifiers.php
p: Project = input;
rates: output mean[string] of int;
exists (i: int; lowercase(p.programming_languages[i]) == "java")
    foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)
        foreach (k: int; def(p.code_repositories[j].revisions[k]))
            rates[p.id] << len(p.code_repositories[j].revisions[k].files);
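One way to read the two quantifiers used in this program, sketched as ordinary Java loops: exists runs its body once if some index satisfies the condition, while foreach runs its body for every index that does. The class and method names are hypothetical, and plain strings stand in for the Boa types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class QuantifierSketch {
    /** exists: is there at least one index i whose language is "java"? */
    public static boolean existsJava(List<String> languages) {
        for (String lang : languages) {
            if (lang.toLowerCase(Locale.ROOT).equals("java")) {
                return true; // one witness is enough
            }
        }
        return false;
    }

    /** foreach: visit every index j whose repository kind is "SVN". */
    public static List<Integer> foreachSvn(List<String> repositoryKinds) {
        List<Integer> matching = new ArrayList<>();
        for (int j = 0; j < repositoryKinds.size(); j++) {
            if (repositoryKinds.get(j).equals("SVN")) {
                matching.add(j); // the Boa body would run here, once per matching j
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        System.out.println(existsJava(List.of("C++", "Java")));       // prints true
        System.out.println(foreachSvn(List.of("SVN", "GIT", "SVN"))); // prints [0, 2]
    }
}
```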
Output and aggregation
http://boa.cs.iastate.edu/docs/aggregators.php
p: Project = input;
rates: output mean[string] of int;
exists (i: int; lowercase(p.programming_languages[i]) == "java")
    foreach (j: int; p.code_repositories[j].kind == RepositoryKind.SVN)
        foreach (k: int; def(p.code_repositories[j].revisions[k]))
            rates[p.id] << len(p.code_repositories[j].revisions[k].files);
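What an output declared as mean[string] of int computes can be approximated in a single Java process: each emit (the << operator) adds a value under a string index, and the result is the mean per index. The class below is a hypothetical sketch, not how Boa implements aggregation on Hadoop.

```java
import java.util.Map;
import java.util.TreeMap;

public class MeanAggregatorSketch {
    // For each index: element 0 holds the running sum, element 1 the count.
    private final Map<String, long[]> sumAndCount = new TreeMap<>();

    /** Plays the role of `rates[key] << value`. */
    public void emit(String key, long value) {
        long[] entry = sumAndCount.computeIfAbsent(key, k -> new long[2]);
        entry[0] += value;
        entry[1] += 1;
    }

    /** Mean of all values emitted under each index. */
    public Map<String, Double> results() {
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, long[]> e : sumAndCount.entrySet()) {
            means.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        MeanAggregatorSketch rates = new MeanAggregatorSketch();
        rates.emit("project-1", 1); // made-up file counts, one emit per revision
        rates.emit("project-1", 3);
        rates.emit("project-2", 5);
        System.out.println(rates.results()); // prints {project-1=2.0, project-2=5.0}
    }
}
```

Keeping only a sum and a count per index is what makes a mean easy to compute in parallel: partial sums and counts from different workers can simply be added together.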
Let's see it in action!
<<demo>>
Why are we waiting for results?
Program is analyzing...
699,332 projects
494,159 repositories
6,385,666 revisions
57,304,233 files
Let's check the results!
<<demo>>
Efficient execution
[Figure: chart with one entry each for Task1, Task2, Task3, Task4]
Scalability of input size
[Figure: Task1, Task2, Task3, Task4, each at input sizes 6k, 60k, 620k]
Controlled Experiment
Related Works
Sourcerer [Linstead et al. Data Mining Know. Disc.'09]
Kenyon [Bevan et al. ESEC/FSE'05]
PROMISE [Boetticher, Menzies, Ostrand 2007]
Boa provides better scalability
Related Works
Sawzall [Pike et al. Sci.Prog.'05]
Pig Latin [Olston et al. SIGMOD'08]
DryadLINQ [Yu et al. OSDI'08]
None provide direct support for mining software repositories
Ongoing work
cvs, git, hg, bzr
GitHub, Google Code, Launchpad
Other artifacts
Language abstractions
Infrastructure improvements
Recent Work
Conclusions
http://boa.cs.iastate.edu/request/