1 of 8

Predicting GitHub repo language using NLP

A quest to predict programming language from a single aspect of an entire body of work

By Shay Altshue, Ravinder Singh, and Nick Joseph

2 of 8

“You shall know a word by the company it keeps.”

- J.R. Firth, Professor of General Linguistics, 1957

“Hope it ain’t this company.”

- Ravinder Singh, Co-Founder, The Tree Musketeers, 2020

3 of 8

Executive Summary

  • We scraped the list of the 280 “most forked” repositories on GitHub.
  • After cleanup, 219 README files remained, and we used these to train the models.
  • Our baseline model always predicts JavaScript (the most common language) and was correct 26% of the time.
  • Our best model (Naive Bayes) correctly predicted the programming language of a repository 68% of the time.

4 of 8

Data Acquisition and Preparation Pipeline

  • Scrape the names of GitHub’s 280 “most forked” repositories.
  • Acquire the README files using Zach’s acquire.py.
  • Drop nulls and non-English repos, leaving 219 repos.
  • Clean the README text: basic cleaning, normalization, tokenization, lemmatization, and stopword removal.
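
The cleaning steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code used in the project: the function names and the tiny stopword list are hypothetical, and lemmatization is left out because it needs an external library (for example NLTK's WordNetLemmatizer).

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "this"}


def is_usable(text):
    """Return False for null or empty READMEs, which the pipeline drops."""
    return text is not None and text.strip() != ""


def basic_clean(text):
    """Lowercase, strip accents, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9'\s]", " ", text)


def tokenize(text):
    """Split cleaned text on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def clean_readme(text):
    """Run the cleaning steps in order and return a single cleaned string."""
    return " ".join(remove_stopwords(tokenize(basic_clean(text))))


cleaned = clean_readme("This is the Café-Installer for Python!")
print(cleaned)
```

Keeping each step as its own small function makes it easy to swap in a real lemmatizer or a larger stopword list later.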

Exploration and Modeling

5 of 8

Exploration

Takeaways:

  • JavaScript is the most frequently used language in the repos analyzed.

  • ‘Python’, ‘yes’ and ‘unknown’ are the most frequently used words in Python READMEs.

  • The top 10 words for each language are quite distinct.
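
One way to produce per-language top-word lists like the ones behind these takeaways is sketched below. The sample rows and the function name are made up for illustration, and the input is assumed to be already-cleaned README text.

```python
from collections import Counter, defaultdict


def top_words_by_language(rows, n=10):
    """Map each language to its n most frequent words.

    rows is an iterable of (language, cleaned_readme_text) pairs.
    """
    counts = defaultdict(Counter)
    for language, text in rows:
        counts[language].update(text.split())
    return {
        language: [word for word, _ in counter.most_common(n)]
        for language, counter in counts.items()
    }


# Hypothetical sample data, not taken from the scraped repos.
rows = [
    ("Python", "python install pip python"),
    ("JavaScript", "npm install node npm npm"),
    ("Python", "pip python"),
]
top = top_words_by_language(rows, n=2)
print(top)
```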

6 of 8

Feature Exploration

Of all the features explored, Python has the highest ‘spread’ (IQR) and HTML has the lowest.
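
For reference, the IQR (interquartile range) is the distance between the 25th and 75th percentiles of a feature. The sketch below compares it across languages; the feature (README word count) and all numbers are hypothetical and only show the calculation.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical README word counts per language, for illustration only.
word_counts = {
    "Python": [120, 400, 950, 2100, 3300],
    "HTML": [80, 95, 110, 130, 150],
}

spread = {language: iqr(values) for language, values in word_counts.items()}
widest = max(spread, key=spread.get)
narrowest = min(spread, key=spread.get)
print(widest, narrowest)
```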

7 of 8

Modeling

Naive Bayes: How it Works

[Figure: Naive Bayes illustrated with an “Easier Math Textbook” and a “Harder Math Textbook”; symbols shown include 1, 2, 2, 4, 6 and ξ, θ, π.]
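
The idea behind Naive Bayes can be sketched in a few lines: count how often each word appears in each class, then pick the class whose word counts best explain a new document. The version below is a from-scratch multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration only, not the model used in the project, and the class name and toy READMEs are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior: how common the class is overall.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in doc.split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                # Add-one smoothing avoids log(0) for words unseen in this class.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy READMEs, for illustration only.
docs = ["pip install python", "import numpy python",
        "npm install node", "node express npm"]
labels = ["Python", "Python", "JavaScript", "JavaScript"]

model = TinyNaiveBayes().fit(docs, labels)
print(model.predict("pip python"), model.predict("npm node"))
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities can simply be added.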

Model                    Accuracy of Predictions
Naive Bayes              68%
Logistic Regression      61%
Decision Tree            56%
Random Forest            56%
K Nearest Neighbor       54%
Baseline (JavaScript)    26%
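
A most-frequent-class baseline like the one in the last row can be computed as sketched below. The labels are hypothetical and do not reproduce the 26% figure; the function name is also made up for illustration.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Hypothetical labels, for illustration only.
labels = ["JavaScript"] * 3 + ["Python"] * 2 + ["Java"]
label, accuracy = baseline_accuracy(labels)
print(label, accuracy)
```

Any real model is only useful if it clearly beats this number.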

8 of 8

Conclusion

  • Getting a good sample of GitHub repos is difficult.

  • READMEs are chaotic, which makes it challenging to fine-tune predictions from them.

  • Next step: find new ways to differentiate the languages.

  • Next step: scrape the files in a repo, rather than the READMEs, to predict language.