1 of 37

Alexander Bezzubov

Workshop: Introduction to ML-on-code

2 of 37

Alexander Bezzubov

source{d}

Intro

  • PMC member at Apache Zeppelin
  • lead engineer at source{d}
  • @seoul_engineer
  • a startup in the EU, based in Madrid, Spain
  • builds open-source software for Machine Learning on source code

3 of 37

Justification & background

What is ML-on-Code?

4 of 37

justification & background

The amount of source code is growing rapidly

&

We need better tools to deal with all aspects of it

5 of 37

justification & background

DevTools

https://www.openhub.net/

6 of 37

justification & background

GUI

https://www.openhub.net/

7 of 37

justification & background

Browsers

https://www.openhub.net/

8 of 37

justification & background

RDBMS

https://www.openhub.net/

9 of 37

justification & background

OS

https://www.openhub.net/

10 of 37

justification & background

Even bigger proprietary codebases inside companies

https://informationisbeautiful.net/visualizations/million-lines-of-code/

11 of 37

justification & background

We need better tools to deal with all aspects of it

  • Discover: search, catalog, recommend projects
  • Read: search, navigate, understand, comprehend code
  • Write: suggest, generate
    • ranking code completion suggestions
    • learning code style/conventions
  • Test: generate, validate, spec
  • Compile/execute: optimize, fine-tune, validate
  • Hiring: candidate sourcing, ranking contributors
  • Legal: copyright, licensing, plagiarism detection

12 of 37

justification & background

We need better tools to deal with all aspects of it

  • Security & Compliance
    • Malicious Actor Detection, Vulnerability Detection, Malicious Code Detection
  • Automated Code Review
    • Style Conventions, Idiomatic Code, Naming Suggestions, Architecture Suggestions
  • QA & Testing
    • Test case suggestions, Test generation
  • Bug Detection & Prediction
    • Issues, Code Evolution & Commit data as labels
  • Performance
    • Metrics as labels for code

13 of 37

justification & background

What is ML-on-code?

  • A research direction and business opportunity to improve the way we deal with code (by building better tools!)

Approach:

  • Code = Data
  • Naturalness hypothesis *
  • Apply NLP-like techniques to source code

ML research directions:

  • Project similarity
  • Code similarity
  • Contributor similarity

14 of 37

OSS Tech stack

How to get enough data?

15 of 37

tech stack

Ad-hoc scripts to collect/process/analyze

Necessary evil.

Unmaintained custom Python scripts, written by a single researcher

Task-specific, <150k files

https://informationisbeautiful.net/visualizations/million-lines-of-code/

Research

Dataset

16 of 37

tech stack

OSS tools to collect/process/analyze

Research

Dataset

Production-grade OSS.

Shared cost of ownership of the infrastructure among the community

>180k repositories >400 languages >50m files

17 of 37

tech stack

[pipeline diagram]

  • infrastructure: CoreOS, K8s
  • collection: Rovers, Borges, go-git
  • storage: HDFS, śiva
  • processing: Apache Spark, source{d} Engine
  • analysis: Bblfsh, Enry

18 of 37

(tech stack pipeline diagram)

19 of 37

infrastructure

  • Dedicated cluster (cloud becomes prohibitively expensive for storing ~100s of TB)
  • CoreOS provisioned on bare-metal with Terraform
  • Booting and OS configuration with Matchbox and Ignition
  • K8s deployed on top of that

More details in the talk at CfgMgmtCamp: http://cfgmgmtcamp.eu/schedule/terraform/CoreOS.html

20 of 37

(tech stack pipeline diagram)

21 of 37

collection

  • Rovers: search for Git repository URLs
  • Borges: fetching repositories with “git pull”
  • Git storage format & protocol implementation
  • Optimized for on-disk size: forks that share history are saved together

go-git is used to talk Git. Last year's FOSDEM talk: https://archive.fosdem.org/2017/schedule/event/go_git/

22 of 37

go-git — A HIGHLY EXTENSIBLE IMPLEMENTATION OF GIT IN GO

motivation: GIT LIBRARY FOR GO

  • need to clone and analyze tens of millions of repositories with our core language, Go
  • be able to do so in memory, and by using custom filesystem implementations
  • an easy-to-use and stable API for the Go community

features: PURE GO SOURCE CODE

  • used in production by companies, e.g. keybase.io
  • the most complete git library for any language after libgit2 and jgit
  • highly extensible by design
  • idiomatic API for plumbing and porcelain commands
  • 2+ years of continuous development
  • used by a significant number of open source projects

usage: GO-GIT IN ACTION — example mimicking `git clone` using go-git:

# installation
$ go get -u gopkg.in/src-d/go-git.v4/...

// Clone the repo to the given directory
url := "https://github.com/src-d/go-git"
_, err := git.PlainClone("/tmp/foo", false, &git.CloneOptions{
	URL:      url,
	Progress: os.Stdout,
})
CheckIfError(err)

output:

Counting objects: 4924, done.
Compressing objects: 100% (1333/1333), done.
Total 4924 (delta 530), reused 6 (delta 6), pack-reused 3533

resources: YOUR NEXT STEPS — TRY IT YOURSELF, see the list of more go-git usage examples

23 of 37

rovers & borges — LARGE SCALE CODE REPOSITORY COLLECTION AND STORAGE

motivation: CODE COLLECTION AT SCALE

  • collection and storage of repositories at large scale
  • automated process
  • optimal usage of storage
  • optimal to keep repositories up-to-date with the origin

architecture: SEEK, FETCH, STORE

  • distributed system similar to a search engine
  • src-d/rovers retrieves URLs from git hosting providers via API, plus self-hosted git repositories
  • src-d/borges producer reads the URL list and schedules fetching
  • borges consumer fetches repos and pushes them to storage
  • borges packer is also available as a standalone command, transforming repository URLs into siva files
  • stores using the src-d/śiva repository storage file format

architecture: KEY CONCEPT — optimized for storage and for keeping repos up-to-date

  • rooted repositories are standard git repositories that store all objects from all repositories that share a common history, identified by the same initial commit
  • a rooted repository is saved in a single śiva file
  • updates are stored in concatenated siva files: no need to rewrite the whole repository file
  • distributed-file-system backed, supports GCS & HDFS

usage: YOUR NEXT STEPS — SETUP & RUN
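The rooted-repository idea can be sketched as grouping repository URLs by the hash of their initial commit, so that forks sharing history land in the same bucket (and thus, in the same siva file). The code below is an illustrative toy, not the actual borges implementation; the repository names and commit hashes are made up:

```go
package main

import "fmt"

// rootCommit maps a repository URL to the hash of its initial commit.
// In borges this would come from walking the repository's history;
// here it is hard-coded, hypothetical data.
var rootCommit = map[string]string{
	"github.com/src-d/go-git":  "5d7303c",
	"github.com/fork-a/go-git": "5d7303c", // a fork: same initial commit
	"github.com/other/project": "a1b2c3d",
}

// groupByRoot buckets repositories that share a common history. Each
// bucket corresponds to one rooted repository, stored in one siva file.
func groupByRoot(roots map[string]string) map[string][]string {
	groups := make(map[string][]string)
	for url, root := range roots {
		groups[root] = append(groups[root], url)
	}
	return groups
}

func main() {
	for root, urls := range groupByRoot(rootCommit) {
		fmt.Printf("rooted repo %s <- %d repos\n", root, len(urls))
	}
}
```

Deduplicating at the level of shared history is what keeps on-disk size manageable when most popular projects have thousands of forks.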

24 of 37

(tech stack pipeline diagram)

25 of 37

storage

  • Metadata: PostgreSQL
  • Built a small type-safe ORM for Go <-> Postgres: https://github.com/src-d/go-kallax

  • Data: Apache Hadoop HDFS
  • Custom (seekable, appendable) archive format: Siva; 1 RootedRepository <-> 1 Siva file

26 of 37

śiva — SEEKABLE INDEXED BLOCK ARCHIVER FILE FORMAT

motivation: SMART REPO STORAGE

  • store a git repository in a single file
  • updates possible without rewriting the whole file
  • friendly to distributed file systems
  • seekable, to allow random access to any file position

architecture: CHARACTERISTICS (SIVA FILE BLOCK SCHEMA)

  • src-d/go-siva is an archiving format similar to tar or zip
  • allows constant-time random file access
  • allows seekable read access to the contained files
  • allows file concatenation, given the block-based design
  • command-line tool + implementations in Go and Java

usage: APPENDING FILES

# pack into siva file
$ siva pack example.siva qux

# append into siva file
$ siva pack --append example.siva bar

# list siva file contents
$ siva list example.siva
-rw-r--r--  Sep 20 13:04  4 B  qux
-rw-r--r--  Sep 20 13:07  4 B  bar
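The block-based design that makes appends and concatenation cheap can be illustrated with a toy in-memory model: each append writes a block of file contents followed by an index, and a reader only needs the index near the end of the archive. This is a sketch of the idea only; the real src-d/go-siva format uses binary headers, CRCs, and indexes that chain back to earlier blocks:

```go
package main

import "fmt"

// entry describes one file inside the archive.
type entry struct {
	name   string
	offset int // where the content starts in the archive
	size   int
}

// archive is a toy siva-like container: raw bytes plus one index per
// appended block. Old blocks are never rewritten.
type archive struct {
	data    []byte
	indexes [][]entry
}

// appendBlock adds a set of files as one new block at the end.
func (a *archive) appendBlock(files map[string]string) {
	var idx []entry
	for name, content := range files {
		idx = append(idx, entry{name, len(a.data), len(content)})
		a.data = append(a.data, content...)
	}
	a.indexes = append(a.indexes, idx)
}

// lastIndex models what a reader finds by seeking to the archive's end.
func (a *archive) lastIndex() []entry { return a.indexes[len(a.indexes)-1] }

func main() {
	var ar archive
	ar.appendBlock(map[string]string{"qux": "qux\n"})
	ar.appendBlock(map[string]string{"bar": "bar\n"}) // block 1 untouched
	for _, e := range ar.lastIndex() {
		fmt.Printf("%s  %d B at offset %d\n", e.name, e.size, e.offset)
	}
}
```

Because appends only ever add bytes at the end, two valid archives can be concatenated byte-for-byte, which is exactly what makes the format friendly to HDFS-style append-oriented file systems.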

27 of 37

(tech stack pipeline diagram)

28 of 37

processing

Apache Spark

Engine

  • For batch processing, SparkSQL
  • A library with a custom DataSource implementation, GitDataSource
  • Reads repositories from Siva archives in HDFS, exposes them through a DataFrame
  • API for accessing refs/commits/files/blobs
  • Talks to external services through gRPC for parsing/lexing and other analysis

29 of 37

engine — UNIFIED SCALABLE CODE ANALYSIS PIPELINE ON SPARK

motivation: UNIFIED SCALABLE PIPELINE

  • easy-to-use pipeline for git repository analysis
  • integrated with standard tools for large scale data analysis
  • avoid custom code in operations across millions of repos

architecture: APACHE SPARK DATAFRAME

  • listing and retrieval of git repositories
  • Apache Spark datasource on top of git repositories
  • iterators over any git object, references
  • code exploration and querying using XPath expressions
  • language identification and source code parsing
  • feature extraction for machine learning at scale

architecture: PREPARATION

  • extends Apache SparkSQL
  • git repositories stored as siva files or standard repositories in HDFS
  • metadata caching for faster lookups over the whole dataset
  • fetches repositories in batches and on demand
  • available APIs for Spark and PySpark
  • can run either locally or in a distributed cluster

usage sample:

(EngineAPI(spark, 'siva', '/path/to/siva-files')
    .repositories
    .references
    .head_ref
    .files
    .classify_languages()
    .extract_uasts()
    .query_uast('//*[@roleImport and @roleDeclaration]', 'imports')
    .filter("lang = 'java'")
    .select('imports', 'path', 'repository_id')
    .write
    .parquet("hdfs://..."))

30 of 37

(tech stack pipeline diagram)

31 of 37

analysis

Enry

  • Programming language identification
  • A rewrite of github/linguist in Go, ~370 languages

Project Babelfish

  • Distributed parser infrastructure for source code analysis
  • Unified interface through gRPC to native parsers in containers: source -> UAST
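A drastically simplified sketch of the strategy chain that linguist-style detectors such as Enry use: try a cheap strategy first (file extension), then fall back to content-based ones (here, the shebang line). The function name and lookup tables are hypothetical; the real enry ships linguist's full detection data and many more strategies:

```go
package main

import (
	"fmt"
	"strings"
)

// byExtension is a tiny, hypothetical extension table.
var byExtension = map[string]string{
	".go": "Go", ".py": "Python", ".sh": "Shell",
}

// byShebang maps interpreter names to languages.
var byShebang = map[string]string{
	"python": "Python", "bash": "Shell",
}

// detectLanguage runs the strategies in order of cost and returns the
// first confident answer.
func detectLanguage(filename, content string) string {
	// strategy 1: file extension
	if i := strings.LastIndex(filename, "."); i >= 0 {
		if lang, ok := byExtension[filename[i:]]; ok {
			return lang
		}
	}
	// strategy 2: shebang on the first line of the content
	if strings.HasPrefix(content, "#!") {
		first := strings.SplitN(content, "\n", 2)[0]
		for interp, lang := range byShebang {
			if strings.HasSuffix(first, interp) {
				return lang
			}
		}
	}
	return "Unknown"
}

func main() {
	fmt.Println(detectLanguage("main.go", "package main"))
	fmt.Println(detectLanguage("run", "#!/usr/bin/env python\nprint(1)"))
}
```

Ordering strategies from cheapest to most expensive is what makes this approach fast at scale: for most files the extension alone decides, and content is never read.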

32 of 37

babelfish — A SELF-HOSTED SERVER FOR UNIVERSAL SOURCE CODE PARSING

motivation: UNIVERSAL CODE ANALYSIS

  • was born as a solution for massive code analysis
  • parsing single files in any programming language
  • analyze all source code from all repositories in the world
  • analyze many languages using a shared structure/format

use cases: POWERFUL OPPORTUNITIES

  • AST-based diffing: understanding changes made to code at a finer granularity
  • extracting features for Machine Learning on Source Code
  • statistics of language features
  • detecting similar coding patterns across languages

architecture: CONTAINER-BASED

  • language drivers as the main building blocks
  • parsing service via one driver per language
  • language drivers can be written in any language and are packaged as standard Docker containers
  • containers are executed by the babelfish server in a specific runtime built on top of libcontainer

architecture: UNIVERSAL AST

  • UAST is a universal (normalized and annotated) form of the Abstract Syntax Tree (AST)
  • language-independent annotations (roles) such as Expression, Statement, Operator, Arithmetic, etc.
  • can be easily ported to many languages using gogo/protobuf

usage: YOUR NEXT STEPS — run the babelfish server & dashboard locally:

$ docker run --privileged -d -p 9432:9432 --name bblfsh bblfsh/server

$ docker run -p 8080:80 --link bblfsh bblfsh/dashboard --bblfsh-addr bblfsh:9432
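The role-annotated UAST and a query in the spirit of //*[@roleImport and @roleDeclaration] can be illustrated with a toy tree walk. The Node type and helpers below are hypothetical illustrations, not the actual bblfsh client API:

```go
package main

import "fmt"

// Node is a toy UAST node: a token plus language-independent roles
// (Import, Declaration, Expression, ...) and children.
type Node struct {
	Token    string
	Roles    []string
	Children []*Node
}

// hasRoles reports whether the node carries all the requested roles.
func (n *Node) hasRoles(want ...string) bool {
	set := map[string]bool{}
	for _, r := range n.Roles {
		set[r] = true
	}
	for _, w := range want {
		if !set[w] {
			return false
		}
	}
	return true
}

// queryRoles walks the tree and returns every node carrying all the
// given roles, mirroring the spirit of an XPath role query.
func queryRoles(n *Node, roles ...string) []*Node {
	var out []*Node
	if n.hasRoles(roles...) {
		out = append(out, n)
	}
	for _, c := range n.Children {
		out = append(out, queryRoles(c, roles...)...)
	}
	return out
}

func main() {
	tree := &Node{Token: "file", Children: []*Node{
		{Token: "import java.util.List", Roles: []string{"Import", "Declaration"}},
		{Token: "x + 1", Roles: []string{"Expression", "Arithmetic"}},
	}}
	for _, n := range queryRoles(tree, "Import", "Declaration") {
		fmt.Println(n.Token)
	}
}
```

Because roles are language-independent, the same query finds import declarations whether the file was Java, Python, or Go, which is what lets one pipeline span hundreds of languages.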

33 of 37

enry — A FASTER FILE PROGRAMMING LANGUAGE DETECTOR

motivation: LANG DETECTION AT SCALE

  • need to detect the programming language of every file in a git repository
  • initially used github/linguist, but needed more performance for large scale applications
  • keep compatibility with the original linguist project

architecture: COMPATIBLE AND FLEXIBLE

  • linguist as the source of information on language detection
  • ignores binary and vendored files
  • command line tool mimics the original linguist one
  • usable in Go as a native library, in Java as a shared library, and as a CLI tool

benchmarks: GO FASTER — enry speed improvement over linguist when applied to the linguist/samples folder:

  • src-d/enry is at least 4x faster than linguist
  • 5x (larger repos) to 20x (smaller repos) faster

usage:

$ enry /path/to/src-d/go-git
98.28%  Go
0.69%   Shell
0.34%   Makefile
0.34%   Markdown
0.34%   Text

34 of 37

(tech stack pipeline diagram)

35 of 37

tech stack

Production-grade OSS.

Shared cost of ownership of the infrastructure among the community

>180k repositories >400 languages >50m files

OSS tools to collect/process/analyze

Research

Public Github Archive

36 of 37

tech stack

2 TB, 180k repos, 400 languages, 50m files

C: 5m files, 2.6B LOC, 84 GB

CLI interface: pga list, pga get

1. Collect: rovers/borges/siva

(Engine(spark, 'siva', '/path/to/siva-files')
    .repositories
    .references
    .head_ref
    .files
    .classify_languages()
    .extract_uasts()
    .query_uast('//*[@roleImport and @roleDeclaration]', 'imports')
    .filter("lang = 'java'")
    .select('imports', 'path', 'repository_id')
    .write
    .parquet("hdfs://..."))

2. Process: source{d} Engine

3. Analyze: project Babelfish

UAST

OSS tools to collect/process/analyze

Research

Public Github Archive

37 of 37

thank you.

https://github.com/bzz/ml-on-code