1 of 37

Alexander Bezzubov

Workshop: Introduction to ML-on-code

2 of 37

Alexander Bezzubov

source{d}

Intro

  • PMC member at Apache Zeppelin
  • lead engineer at source{d}
  • @seoul_engineer
  • a startup in the EU, based in Madrid, Spain
  • builds open-source software for Machine Learning on source code

3 of 37

Justification & background

What is ML-on-Code?

4 of 37

justification & background

The amount of source code is growing rapidly

&

We need better tools to deal with all aspects of it

5 of 37

justification & background

DevTools

https://www.openhub.net/

6 of 37

justification & background

GUI

https://www.openhub.net/

7 of 37

justification & background

Browsers

https://www.openhub.net/

8 of 37

justification & background

RDBMS

https://www.openhub.net/

9 of 37

justification & background

OS

https://www.openhub.net/

10 of 37

justification & background

Even bigger proprietary codebases inside companies

https://informationisbeautiful.net/visualizations/million-lines-of-code/

11 of 37

justification & background

We need better tools to deal with all aspects of it

  • Discover: search, catalog, recommend projects
  • Read: search, navigate, understand, comprehend code
  • Write: suggest, generate
    • ranking code completion suggestions
    • learning code style/conventions
  • Test: generate, validate, spec
  • Compile/execute: optimize, fine-tune, validate
  • Hiring: candidate sourcing, ranking contributors
  • Legal: copyright, licensing, plagiarism detection

12 of 37

justification & background

We need better tools to deal with all aspects of it

  • Security & Compliance
    • Malicious Actor Detection, Vulnerability Detection, Malicious Code Detection
  • Automated Code Review
    • Style Conventions, Idiomatic Code, Naming Suggestions, Architecture Suggestions
  • QA & Testing
    • Test case suggestions, Test generation
  • Bug Detection & Prediction
    • Issues, Code Evolution & Commit data as labels
  • Performance
    • Metrics as labels for code

13 of 37

justification & background

What is ML-on-code?

  • A research direction and business opportunity to improve the way we deal with code (by building better tools!)

Approach:

  • Code = Data
  • Naturalness hypothesis *
  • Apply NLP-like techniques to source code

ML research directions:

  • Project similarity
  • Code similarity
  • Contributor similarity

14 of 37

OSS Tech stack

How to get enough data?

15 of 37

tech stack

Ad-hoc scripts to collect/process/analyze

Necessary evil.

Unmaintained custom Python scripts, written by a single researcher

Task-specific, <150k files

https://informationisbeautiful.net/visualizations/million-lines-of-code/

Research

Dataset

16 of 37

tech stack

OSS tools to collect/process/analyze

Research

Dataset

Production-grade OSS.

Shared cost of ownership of the infrastructure among the community

>180k repositories >400 languages >50m files

17 of 37

tech stack

[pipeline diagram]

  • infrastructure: CoreOS, K8s
  • collection: Rovers, Borges, go-git
  • storage: HDFS, śiva
  • processing: Apache Spark, source{d} Engine
  • analysis: Bblfsh, Enry

18 of 37

(tech stack pipeline diagram)

19 of 37

infrastructure

  • Dedicated cluster (cloud becomes prohibitively expensive for storing ~100s of TB)
  • CoreOS provisioned on bare-metal with Terraform
  • Booting and OS configuration with Matchbox and Ignition
  • K8s deployed on top of that

More details in the talk at CfgMgmtCamp: http://cfgmgmtcamp.eu/schedule/terraform/CoreOS.html

20 of 37

(tech stack pipeline diagram)

21 of 37

collection

  • Rovers: search for Git repository URLs
  • Borges: fetching repositories with “git pull”
  • Git storage format & protocol implementation
  • Optimized for on-disk size: forks that share history are saved together

go-git is used to talk Git. Last year's FOSDEM talk: https://archive.fosdem.org/2017/schedule/event/go_git/

22 of 37

go-git — A HIGHLY EXTENSIBLE IMPLEMENTATION OF GIT IN GO

motivation: GIT LIBRARY FOR GO

  • need to clone and analyze tens of millions of repositories with our core language, Go
  • be able to do so in memory, and by using custom filesystem implementations
  • an easy-to-use and stable API for the Go community

features: PURE GO SOURCE CODE

  • used in production by companies, e.g. keybase.io
  • the most complete git library for any language after libgit2 and jgit
  • highly extensible by design
  • idiomatic API for plumbing and porcelain commands
  • 2+ years of continuous development
  • used by a significant number of open source projects

usage: GO-GIT IN ACTION — example mimicking `git clone` using go-git:

# installation
$ go get -u gopkg.in/src-d/go-git.v4/...

// Clone the repo to the given directory
url := "https://github.com/src-d/go-git"
_, err := git.PlainClone("/tmp/foo", false, &git.CloneOptions{
	URL:      url,
	Progress: os.Stdout,
})
CheckIfError(err)

output:

Counting objects: 4924, done.
Compressing objects: 100% (1333/1333), done.
Total 4924 (delta 530), reused 6 (delta 6), pack-reused 3533

resources: YOUR NEXT STEPS — TRY IT YOURSELF, see the list of more go-git usage examples

23 of 37

rovers & borges — LARGE SCALE CODE REPOSITORY COLLECTION AND STORAGE

motivation: CODE COLLECTION AT SCALE

  • collection and storage of repositories at large scale
  • automated process
  • optimal usage of storage
  • optimal to keep repositories up-to-date with the origin

architecture: SEEK, FETCH, STORE

  • distributed system similar to a search engine
  • src-d/rovers retrieves URLs from git hosting providers via API, plus self-hosted git repositories
  • src-d/borges producer reads the URL list and schedules fetching
  • borges consumer fetches repos and pushes them to storage
  • borges packer is also available as a standalone command, transforming repository URLs into siva files
  • stores using the src-d/śiva repository storage file format

architecture: KEY CONCEPT — optimized for storage and for keeping repos up-to-date

  • rooted repositories are standard git repositories that store all objects from all repositories that share a common history, identified by the same initial commit
  • a rooted repository is saved in a single śiva file
  • updates are stored in concatenated siva files: no need to rewrite the whole repository file
  • distributed-file-system backed, supports GCS & HDFS

usage: YOUR NEXT STEPS — SETUP & RUN
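The rooted-repository idea can be sketched as grouping repository URLs by the hash of their initial commit, so that forks sharing history land in the same bucket (and thus, in the same siva file). The code below is an illustrative toy, not the actual borges implementation; the repository names and commit hashes are made up:

```go
package main

import "fmt"

// rootCommit maps a repository URL to the hash of its initial commit.
// In borges this would come from walking the repository's history;
// here it is hard-coded, hypothetical data.
var rootCommit = map[string]string{
	"github.com/src-d/go-git":  "5d7303c",
	"github.com/fork-a/go-git": "5d7303c", // a fork: same initial commit
	"github.com/other/project": "a1b2c3d",
}

// groupByRoot buckets repositories that share a common history. Each
// bucket corresponds to one rooted repository, stored in one siva file.
func groupByRoot(roots map[string]string) map[string][]string {
	groups := make(map[string][]string)
	for url, root := range roots {
		groups[root] = append(groups[root], url)
	}
	return groups
}

func main() {
	for root, urls := range groupByRoot(rootCommit) {
		fmt.Printf("rooted repo %s <- %d repos\n", root, len(urls))
	}
}
```

Deduplicating at the level of shared history is what keeps on-disk size manageable when most popular projects have thousands of forks.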

24 of 37

(tech stack pipeline diagram)

25 of 37

storage

  • Metadata: PostgreSQL
  • Built a small type-safe ORM for Go <-> Postgres: https://github.com/src-d/go-kallax

  • Data: Apache Hadoop HDFS
  • Custom (seekable, appendable) archive format: Siva; 1 RootedRepository <-> 1 Siva file

26 of 37

śiva — SEEKABLE INDEXED BLOCK ARCHIVER FILE FORMAT

motivation: SMART REPO STORAGE

  • store a git repository in a single file
  • updates possible without rewriting the whole file
  • friendly to distributed file systems
  • seekable, to allow random access to any file position

architecture: CHARACTERISTICS (SIVA FILE BLOCK SCHEMA)

  • src-d/go-siva is an archiving format similar to tar or zip
  • allows constant-time random file access
  • allows seekable read access to the contained files
  • allows file concatenation, given the block-based design
  • command-line tool + implementations in Go and Java

usage: APPENDING FILES

# pack into siva file
$ siva pack example.siva qux

# append into siva file
$ siva pack --append example.siva bar

# list siva file contents
$ siva list example.siva
-rw-r--r--  Sep 20 13:04  4 B  qux
-rw-r--r--  Sep 20 13:07  4 B  bar
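The block-based design that makes appends and concatenation cheap can be illustrated with a toy in-memory model: each append writes a block of file contents followed by an index, and a reader only needs the index near the end of the archive. This is a sketch of the idea only; the real src-d/go-siva format uses binary headers, CRCs, and indexes that chain back to earlier blocks:

```go
package main

import "fmt"

// entry describes one file inside the archive.
type entry struct {
	name   string
	offset int // where the content starts in the archive
	size   int
}

// archive is a toy siva-like container: raw bytes plus one index per
// appended block. Old blocks are never rewritten.
type archive struct {
	data    []byte
	indexes [][]entry
}

// appendBlock adds a set of files as one new block at the end.
func (a *archive) appendBlock(files map[string]string) {
	var idx []entry
	for name, content := range files {
		idx = append(idx, entry{name, len(a.data), len(content)})
		a.data = append(a.data, content...)
	}
	a.indexes = append(a.indexes, idx)
}

// lastIndex models what a reader finds by seeking to the archive's end.
func (a *archive) lastIndex() []entry { return a.indexes[len(a.indexes)-1] }

func main() {
	var ar archive
	ar.appendBlock(map[string]string{"qux": "qux\n"})
	ar.appendBlock(map[string]string{"bar": "bar\n"}) // block 1 untouched
	for _, e := range ar.lastIndex() {
		fmt.Printf("%s  %d B at offset %d\n", e.name, e.size, e.offset)
	}
}
```

Because appends only ever add bytes at the end, two valid archives can be concatenated byte-for-byte, which is exactly what makes the format friendly to HDFS-style append-oriented file systems.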

27 of 37

(tech stack pipeline diagram)

28 of 37

processing

Apache Spark

Engine

  • For batch processing, SparkSQL
  • A library with a custom DataSource implementation, GitDataSource
  • Reads repositories from Siva archives in HDFS, exposes them through a DataFrame
  • API for accessing refs/commits/files/blobs
  • Talks to external services through gRPC for parsing/lexing and other analysis

29 of 37

engine — UNIFIED SCALABLE CODE ANALYSIS PIPELINE ON SPARK

motivation: UNIFIED SCALABLE PIPELINE

  • easy-to-use pipeline for git repository analysis
  • integrated with standard tools for large scale data analysis
  • avoid custom code in operations across millions of repos

architecture: APACHE SPARK DATAFRAME

  • listing and retrieval of git repositories
  • Apache Spark datasource on top of git repositories
  • iterators over any git object, references
  • code exploration and querying using XPath expressions
  • language identification and source code parsing
  • feature extraction for machine learning at scale

architecture: PREPARATION

  • extends Apache SparkSQL
  • git repositories stored as siva files or standard repositories in HDFS
  • metadata caching for faster lookups over the whole dataset
  • fetches repositories in batches and on demand
  • available APIs for Spark and PySpark
  • can run either locally or in a distributed cluster

usage sample:

(EngineAPI(spark, 'siva', '/path/to/siva-files')
    .repositories
    .references
    .head_ref
    .files
    .classify_languages()
    .extract_uasts()
    .query_uast('//*[@roleImport and @roleDeclaration]', 'imports')
    .filter("lang = 'java'")
    .select('imports', 'path', 'repository_id')
    .write
    .parquet("hdfs://..."))

30 of 37

(tech stack pipeline diagram)

31 of 37

analysis

Enry

  • Programming language identification
  • A rewrite of github/linguist in Go, ~370 languages

Project Babelfish

  • Distributed parser infrastructure for source code analysis
  • Unified interface through gRPC to native parsers in containers: source -> UAST
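A drastically simplified sketch of the strategy chain that linguist-style detectors such as Enry use: try a cheap strategy first (file extension), then fall back to content-based ones (here, the shebang line). The function name and lookup tables are hypothetical; the real enry ships linguist's full detection data and many more strategies:

```go
package main

import (
	"fmt"
	"strings"
)

// byExtension is a tiny, hypothetical extension table.
var byExtension = map[string]string{
	".go": "Go", ".py": "Python", ".sh": "Shell",
}

// byShebang maps interpreter names to languages.
var byShebang = map[string]string{
	"python": "Python", "bash": "Shell",
}

// detectLanguage runs the strategies in order of cost and returns the
// first confident answer.
func detectLanguage(filename, content string) string {
	// strategy 1: file extension
	if i := strings.LastIndex(filename, "."); i >= 0 {
		if lang, ok := byExtension[filename[i:]]; ok {
			return lang
		}
	}
	// strategy 2: shebang on the first line of the content
	if strings.HasPrefix(content, "#!") {
		first := strings.SplitN(content, "\n", 2)[0]
		for interp, lang := range byShebang {
			if strings.HasSuffix(first, interp) {
				return lang
			}
		}
	}
	return "Unknown"
}

func main() {
	fmt.Println(detectLanguage("main.go", "package main"))
	fmt.Println(detectLanguage("run", "#!/usr/bin/env python\nprint(1)"))
}
```

Ordering strategies from cheapest to most expensive is what makes this approach fast at scale: for most files the extension alone decides, and content is never read.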

32 of 37

babelfish — A SELF-HOSTED SERVER FOR UNIVERSAL SOURCE CODE PARSING

motivation: UNIVERSAL CODE ANALYSIS

  • was born as a solution for massive code analysis
  • parsing single files in any programming language
  • analyze all source code from all repositories in the world
  • analyze many languages using a shared structure/format

use cases: POWERFUL OPPORTUNITIES

  • AST-based diffing: understanding changes made to code at a finer granularity
  • extracting features for Machine Learning on Source Code
  • statistics of language features
  • detecting similar coding patterns across languages

architecture: CONTAINER-BASED

  • language drivers as the main building blocks
  • parsing service via one driver per language
  • language drivers can be written in any language and are packaged as standard Docker containers
  • containers are executed by the babelfish server in a specific runtime built on top of libcontainer

architecture: UNIVERSAL AST

  • UAST is a universal (normalized and annotated) form of the Abstract Syntax Tree (AST)
  • language-independent annotations (roles) such as Expression, Statement, Operator, Arithmetic, etc.
  • can be easily ported to many languages using gogo/protobuf

usage: YOUR NEXT STEPS — run the babelfish server & dashboard locally:

$ docker run --privileged -d -p 9432:9432 --name bblfsh bblfsh/server

$ docker run -p 8080:80 --link bblfsh bblfsh/dashboard --bblfsh-addr bblfsh:9432
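The role-annotated UAST and a query in the spirit of //*[@roleImport and @roleDeclaration] can be illustrated with a toy tree walk. The Node type and helpers below are hypothetical illustrations, not the actual bblfsh client API:

```go
package main

import "fmt"

// Node is a toy UAST node: a token plus language-independent roles
// (Import, Declaration, Expression, ...) and children.
type Node struct {
	Token    string
	Roles    []string
	Children []*Node
}

// hasRoles reports whether the node carries all the requested roles.
func (n *Node) hasRoles(want ...string) bool {
	set := map[string]bool{}
	for _, r := range n.Roles {
		set[r] = true
	}
	for _, w := range want {
		if !set[w] {
			return false
		}
	}
	return true
}

// queryRoles walks the tree and returns every node carrying all the
// given roles, mirroring the spirit of an XPath role query.
func queryRoles(n *Node, roles ...string) []*Node {
	var out []*Node
	if n.hasRoles(roles...) {
		out = append(out, n)
	}
	for _, c := range n.Children {
		out = append(out, queryRoles(c, roles...)...)
	}
	return out
}

func main() {
	tree := &Node{Token: "file", Children: []*Node{
		{Token: "import java.util.List", Roles: []string{"Import", "Declaration"}},
		{Token: "x + 1", Roles: []string{"Expression", "Arithmetic"}},
	}}
	for _, n := range queryRoles(tree, "Import", "Declaration") {
		fmt.Println(n.Token)
	}
}
```

Because roles are language-independent, the same query finds import declarations whether the file was Java, Python, or Go, which is what lets one pipeline span hundreds of languages.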

33 of 37

enry — A FASTER FILE PROGRAMMING LANGUAGE DETECTOR

motivation: LANG DETECTION AT SCALE

  • need to detect the programming language of every file in a git repository
  • initially used github/linguist, but needed more performance for large scale applications
  • keep compatibility with the original linguist project

architecture: COMPATIBLE AND FLEXIBLE

  • linguist as the source of information on language detection
  • ignores binary and vendored files
  • command line tool mimics the original linguist one
  • usable in Go as a native library, in Java as a shared library, and as a CLI tool

benchmarks: GO FASTER — enry speed improvement over linguist when applied to the linguist/samples folder:

  • src-d/enry is at least 4x faster than linguist
  • 5x (larger repos) to 20x (smaller repos) faster

usage:

$ enry /path/to/src-d/go-git
98.28%  Go
0.69%   Shell
0.34%   Makefile
0.34%   Markdown
0.34%   Text

34 of 37

(tech stack pipeline diagram)

35 of 37

tech stack

Production-grade OSS.

Shared cost of ownership of the infrastructure among the community

>180k repositories >400 languages >50m files

OSS tools to collect/process/analyze

Research

Public Github Archive

36 of 37

tech stack

2 TB, 180k repos, 400 languages, 50m files

C: 5m files, 2.6B LOC, 84 GB

CLI interface: pga list, pga get

1. Collect: rovers/borges/siva

(Engine(spark, 'siva', '/path/to/siva-files')
    .repositories
    .references
    .head_ref
    .files
    .classify_languages()
    .extract_uasts()
    .query_uast('//*[@roleImport and @roleDeclaration]', 'imports')
    .filter("lang = 'java'")
    .select('imports', 'path', 'repository_id')
    .write
    .parquet("hdfs://..."))

2. Process: source{d} Engine

3. Analyze: project Babelfish

UAST

OSS tools to collect/process/analyze

Research

Public Github Archive

37 of 37

thank you.

https://github.com/bzz/ml-on-code