Workshop: Introduction to ML-on-code
Alexander Bezzubov
source{d}
Intro
Justification & background
What is ML-on-Code?
justification & background
The amount of source code is rapidly growing
&
We need better tools to deal with all aspects of it
justification & background
Code size keeps growing across every category of software: DevTools, GUI, Browsers, RDBMS, OS
(charts: https://www.openhub.net/)
justification & background
Bigger proprietary codebases inside companies
https://informationisbeautiful.net/visualizations/million-lines-of-code/
justification & background
We need better tools to deal with all aspects of it
What is ML-on-code?
Approach:
ML research directions:
OSS Tech stack
How to get enough data?
tech stack
Ad-hoc scripts to collect/process/analyze
Necessary evil.
Unmaintained custom Python scripts, written by a single researcher
Task-specific, <150k files
Research
Dataset
tech stack
OSS tools to collect/process/analyze
Research
Dataset
Production-grade OSS.
Shared cost of ownership of the infrastructure among the community
>180k repositories >400 languages >50m files
tech stack
infrastructure: CoreOS, K8s
collection: Rovers, go-git, Borges
storage: HDFS, śiva
processing: Apache Spark, source{d} Engine
analysis: Bblfsh, Enry
infrastructure
More details in the talk at CfgMgmtCamp: http://cfgmgmtcamp.eu/schedule/terraform/CoreOS.html
collection
go-git: how to talk Git. Last year's talk at FOSDEM: https://archive.fosdem.org/2017/schedule/event/go_git/
GIT LIBRARY FOR GO
motivation
go-git A HIGHLY EXTENSIBLE IMPLEMENTATION OF GIT IN GO
GO-GIT IN ACTION
example
PURE GO SOURCE CODE
features
usage
resources
YOUR NEXT STEPS
TRY IT YOURSELF
• need to clone and analyze tens of millions of repositories in our core language, Go
• be able to do so in memory, and by using custom filesystem implementations
• easy to use and stable API for the Go community
• used in production by companies, e.g.: keybase.io
• the most complete git library for any language after libgit2 and jgit
• highly extensible by design
• idiomatic API for plumbing and porcelain commands
• 2+ years of continuous development
• used by a significant number of open source projects
example mimicking `git clone` using go-git:
• https://github.com/src-d/go-git
• go-git presentation at FOSDEM 2017
• go-git presentation at Git Merge 2017
• list of more go-git usage examples
# installation
$ go get -u gopkg.in/src-d/go-git.v4/...

// Clone the repo to the given directory
url := "https://github.com/src-d/go-git"
_, err := git.PlainClone(
	"/tmp/foo", false,
	&git.CloneOptions{
		URL:      url,
		Progress: os.Stdout,
	},
)
CheckIfError(err)

output:
Counting objects: 4924, done.
Compressing objects: 100% (1333/1333), done.
Total 4924 (delta 530), reused 6 (delta 6), pack-reused 3533
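The porcelain API also covers history traversal. A minimal sketch of walking the log of the repository cloned above (illustrative, not from the deck; `object` comes from gopkg.in/src-d/go-git.v4/plumbing/object):

// Open the repository cloned above and walk its history from HEAD
r, err := git.PlainOpen("/tmp/foo")
CheckIfError(err)

ref, err := r.Head()
CheckIfError(err)

// Log returns an iterator over the commits reachable from HEAD
cIter, err := r.Log(&git.LogOptions{From: ref.Hash()})
CheckIfError(err)

CheckIfError(cIter.ForEach(func(c *object.Commit) error {
	fmt.Println(c.Hash, c.Author.Name)
	return nil
}))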
CODE COLLECTION AT SCALE
motivation
rovers & borges LARGE SCALE CODE REPOSITORY COLLECTION AND STORAGE
KEY CONCEPT
architecture
SEEK, FETCH, STORE
architecture
usage
resources
YOUR NEXT STEPS
SETUP & RUN
• collection and storage of repositories at large scale
• automated process
• optimal usage of storage
• efficient at keeping repositories up-to-date with their origin
• distributed system similar to a search engine
• src-d/rovers retrieves URLs from git hosting providers via API, plus self-hosted git repositories
• src-d/borges producer reads URL list, schedules fetching
• borges consumer fetches and pushes repo to storage
• borges packer also available as a standalone command, transforming repository URLs into śiva files
• stores using src-d/śiva repository storage file format
• optimized for storage and keeping repos up-to-date
• rooted repositories are standard git repositories that store all objects from all repositories that share a common history, identified by the same initial commit:
• a rooted repository is saved in a single śiva file
• updates stored in concatenated siva files: no need to rewrite the whole repository file
• distributed-file-system backed, supports GCS & HDFS
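To make the rooted-repository concept concrete, here is an illustrative go-git sketch (not borges' actual code; rootCommits is a hypothetical helper) that finds the initial commits a repository descends from; repositories sharing a root end up in the same śiva file:

// rootCommits returns the parentless (initial) commits reachable from
// HEAD; repositories sharing such a root share history and would be
// grouped into one rooted repository (hypothetical helper, for
// illustration only; imports assumed: git "gopkg.in/src-d/go-git.v4",
// "gopkg.in/src-d/go-git.v4/plumbing" and ".../plumbing/object").
func rootCommits(path string) ([]plumbing.Hash, error) {
	r, err := git.PlainOpen(path)
	if err != nil {
		return nil, err
	}
	ref, err := r.Head()
	if err != nil {
		return nil, err
	}
	iter, err := r.Log(&git.LogOptions{From: ref.Hash()})
	if err != nil {
		return nil, err
	}
	var roots []plumbing.Hash
	err = iter.ForEach(func(c *object.Commit) error {
		if c.NumParents() == 0 {
			roots = append(roots, c.Hash)
		}
		return nil
	})
	return roots, err
}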
storage
SMART REPO STORAGE
motivation
śiva SEEKABLE INDEXED BLOCK ARCHIVER FILE FORMAT
SIVA FILE BLOCK SCHEMA
architecture
CHARACTERISTICS
architecture
usage
resources
YOUR NEXT STEPS
APPENDING FILES
• store a git repository in a single file
• updates possible without rewriting the whole file
• friendly to distributed file systems
• seekable to allow random access to any file position
• src-d/go-siva is an archiving format similar to tar or zip
• allows constant-time random file access
• allows seekable read access to the contained files
• allows file concatenation given the block-based design
• command-line tool + implementations in Go and Java
# pack into siva file
$ siva pack example.siva qux

# append into siva file
$ siva pack --append example.siva bar

# list siva file contents
$ siva list example.siva
Sep 20 13:04 4 B qux -rw-r--r--
Sep 20 13:07 4 B bar -rw-r--r--
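go-siva can also be used as a Go library with a tar-like Writer; a minimal sketch of appending one file to an archive (names and signatures assumed from the go-siva README, so treat them as approximate):

// appendFile adds one file to a śiva archive; thanks to the
// block-based design, a new block is written at the end without
// touching existing data (Writer API assumed from the go-siva
// README; import assumed: siva "gopkg.in/src-d/go-siva.v1").
func appendFile(archive, name string, content []byte) error {
	f, err := os.OpenFile(archive, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		return err
	}
	defer f.Close()

	w := siva.NewWriter(f)
	defer w.Close()

	if err := w.WriteHeader(&siva.Header{Name: name}); err != nil {
		return err
	}
	_, err = w.Write(content)
	return err
}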
processing
Apache Spark
Engine
YOUR NEXT STEPS
UNIFIED SCALABLE PIPELINE
motivation
engine UNIFIED SCALABLE CODE ANALYSIS PIPELINE ON SPARK
APACHE SPARK DATAFRAME
architecture
PREPARATION
architecture
usage sample
resources
• easy-to-use pipeline for git repository analysis
• integrated with standard tools for large scale data analysis
• avoid custom code in operations across millions of repos
• listing and retrieval of git repositories
• Apache Spark datasource on top of git repositories
• iterators over any git object, references
• code exploration and querying using XPath expressions
• language identification and source code parsing
• feature extraction for machine learning at scale
• https://github.com/src-d/engine
• early example Jupyter notebook: https://github.com/src-d/spark-api/blob/master/examples/notebooks/Example.ipynb
• extends Apache SparkSQL
• git repositories stored as siva files or standard repositories in HDFS
• metadata caching for faster lookups over the whole dataset
• fetches repositories in batches and on demand
• available APIs for Spark and PySpark
• can run either locally or in a distributed cluster
EngineAPI(spark, 'siva', '/path/to/siva-files') \
    .repositories \
    .references \
    .head_ref \
    .files \
    .classify_languages() \
    .extract_uasts() \
    .query_uast('//*[@roleImport and @roleDeclaration]', 'imports') \
    .filter("lang = 'java'") \
    .select('imports', 'path', 'repository_id') \
    .write \
    .parquet("hdfs://...")
analysis
Enry
Project Babelfish
Talk in the Source Code Analysis devroom
Room UD2.119, Sunday 12:40: https://fosdem.org/2018/schedule/event/code_babelfish_a_universal_code_parser_for_source_code_analysis/
UNIVERSAL CODE ANALYSIS
motivation
babelfish A SELF-HOSTED SERVER FOR UNIVERSAL SOURCE CODE PARSING
CONTAINER-BASED
architecture
POWERFUL OPPORTUNITIES
use cases
usage
resources
YOUR NEXT STEPS
UNIVERSAL AST
architecture
• was born as a solution for massive code analysis
• parsing single files in any programming language
• analyze all source code from all repositories in the world
• analyze many languages using a shared structure/format
• AST-based diffing: understanding changes made to code at a finer granularity
• extract features for Machine Learning on Source Code.
• statistics of language features
• detecting similar coding patterns across languages
• language drivers as the main building blocks
• parsing service via one driver per language
• language drivers can be written in any language and are packaged as standard Docker containers
• containers are executed by the babelfish server in a specific runtime built on top of libcontainer
• UAST is a universal (normalized and annotated) form of Abstract Syntax Tree (AST)
• language-independent annotations (roles) such as Expression, Statement, Operator, Arithmetic, etc.
• can be easily ported to many languages using gogo/protobuf
• or run babelfish server & dashboard locally:
$ docker run --privileged -d -p \
9432:9432 --name bblfsh \
bblfsh/server
$ docker run -p 8080:80 --link \
bblfsh bblfsh/dashboard \
--bblfsh-addr bblfsh:9432
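Once the server is running, it can be queried from Go. A minimal sketch using the v2 Go client (illustrative; import paths and the XPath query are assumptions based on gopkg.in/bblfsh/client-go.v2):

// Connect to the locally running babelfish server
// (imports assumed: bblfsh "gopkg.in/bblfsh/client-go.v2"
//  and "gopkg.in/bblfsh/client-go.v2/tools").
client, err := bblfsh.NewClient("localhost:9432")
CheckIfError(err)

// Parse a snippet into a UAST; the language could also be auto-detected
res, err := client.NewParseRequest().
	Language("python").
	Content("import foo").
	Do()
CheckIfError(err)

// Query the UAST with XPath over the language-independent roles
nodes, err := tools.Filter(res.UAST, "//*[@roleImport]")
CheckIfError(err)
fmt.Println("import nodes:", len(nodes))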
LANG DETECTION AT SCALE
motivation
enry A FASTER FILE PROGRAMMING LANGUAGE DETECTOR
benchmarks
COMPATIBLE AND FLEXIBLE
architecture
usage
resources
YOUR NEXT STEPS
GO FASTER
usable in Go as a native library, in Java as a shared library, and as a CLI tool
• need to detect programming languages of every file in a git repository
• initially used github/linguist, but needed more performance for large scale applications
• keep compatibility with the original linguist project
• linguist as source of information on language detection
• ignores binary and vendored files
• command line tool mimics the original linguist one
• can be used in Go (native library) or Java (shared library)
• enry's speed improvement over linguist, measured on the files in the linguist/samples folder:
• src-d/enry is at least 4x faster than linguist
• 5x (larger repos) to 20x faster (smaller repos)
$ enry /path/to/src-d/go-git
98.28% Go
0.69% Shell
0.34% Makefile
0.34% Markdown
0.34% Text
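The same detection is available to Go programs via the library API; a minimal sketch (file name and content are illustrative; GetLanguage, IsVendor, and IsBinary correspond to the filters mentioned above):

// Detect the language of a single file
// (import assumed: "gopkg.in/src-d/enry.v1").
content, err := ioutil.ReadFile("main.go") // illustrative file
CheckIfError(err)

lang := enry.GetLanguage("main.go", content)
fmt.Println("language:", lang)

// The filters the CLI applies are exposed as plain functions
fmt.Println("vendored:", enry.IsVendor("vendor/foo/bar.go"))
fmt.Println("binary:", enry.IsBinary(content))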
tech stack
Production-grade OSS.
Shared cost of ownership of the infrastructure among the community
>180k repositories >400 languages >50m files
OSS tools to collect/process/analyze
Research
Public Github Archive
tech stack
2TB, 180k repos, 400 langs, 50m files
C: 5m files, 2.6b LOC, 84GB
CLI interface: pga list, pga get
1. Collect: rovers / borges / śiva
2. Process: source{d} Engine
3. Analyze: project Babelfish (UAST)

Engine(spark, 'siva', '/path/to/siva-files') \
    .repositories \
    .references \
    .head_ref \
    .files \
    .classify_languages() \
    .extract_uasts() \
    .query_uast('//*[@roleImport and @roleDeclaration]', 'imports') \
    .filter("lang = 'java'") \
    .select('imports', 'path', 'repository_id') \
    .write \
    .parquet("hdfs://...")
OSS tools to collect/process/analyze
Research
Public Github Archive
thank you.
https://github.com/bzz/ml-on-code