Predicting GitHub repo language using NLP

A quest to predict programming language from a single aspect of an entire body of work

By Shay Altshue, Ravinder Singh, and Nick Joseph

“You shall know a word by the company it keeps.”

- J.R. Firth, Professor of General Linguistics, 1957

“Hope it ain’t this company.”

- Ravinder Singh, Co-Founder, The Tree Musketeers, 2020

Executive Summary

Our model correctly predicted the programming language used in a given repository 68% of the time.

We scraped the list of the 280 “most forked” repositories on GitHub.

After cleanup, we had 219 README files, which we used to train the model.

We created a baseline model that always predicts JavaScript (the most common language). It was correct 26% of the time.
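The baseline described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `baseline_accuracy` and the sample labels are hypothetical.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Hypothetical labels, for illustration only (not the project's data).
sample_labels = ["JavaScript", "JavaScript", "Python", "Java"]
label, accuracy = baseline_accuracy(sample_labels)
print(label, accuracy)
```

Any real model has to beat this number to be worth using.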

Data Acquisition and Preparation Pipeline

Scrape the names of the 280 ‘most forked’ repositories on GitHub

Acquire README files using Zach’s acquire.py

Drop nulls and non-English repos, leaving 219 repos

Clean the README text: basic cleaning, normalization, tokenization, lemmatization, stopword removal

Exploration and modeling
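The cleaning steps in the pipeline can be sketched roughly as follows. This is a standard-library illustration under stated assumptions: the stopword list is a tiny hand-picked set, and `crude_lemmatize` is a hypothetical stand-in that only strips a trailing plural "s". A real pipeline would more likely use a proper lemmatizer and a full stopword list.

```python
import re
import unicodedata
from typing import List

# Assumption: a tiny illustrative stopword list, not the one the project used.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "this", "it"}


def basic_clean(text: str) -> str:
    """Lowercase, drop non-ASCII characters, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9'\s]", " ", text)


def tokenize(text: str) -> List[str]:
    """Split on whitespace."""
    return text.split()


def crude_lemmatize(token: str) -> str:
    """Stand-in for a real lemmatizer: strips a trailing plural 's' only."""
    if len(token) > 4 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def clean_readme(text: str) -> List[str]:
    """Run the steps in pipeline order: clean, tokenize, lemmatize, drop stopwords."""
    tokens = tokenize(basic_clean(text))
    lemmas = [crude_lemmatize(token) for token in tokens]
    return [token for token in lemmas if token not in STOPWORDS]


print(clean_readme("This is the README for Café scripts!"))
```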

Exploration

Takeaways:

JavaScript is the most frequently used language in the repos analyzed.

‘Python’, ‘yes’ and ‘unknown’ are the most frequently used words in Python READMEs.

The top 10 words for each language are quite distinct.
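One way to produce per-language top-word lists like those summarized above is a simple frequency count. This is a sketch only: `top_words_by_language` and the sample tokens are hypothetical, and the input is assumed to be already cleaned and tokenized.

```python
from collections import Counter, defaultdict


def top_words_by_language(docs, n=10):
    """docs is an iterable of (language, tokens) pairs; returns the n most
    frequent words per language, most frequent first."""
    counts = defaultdict(Counter)
    for language, tokens in docs:
        counts[language].update(tokens)
    return {
        language: [word for word, _ in counter.most_common(n)]
        for language, counter in counts.items()
    }


# Hypothetical tokenized READMEs, for illustration only.
sample_docs = [
    ("Python", ["python", "pip", "python"]),
    ("JavaScript", ["npm", "node", "npm"]),
    ("Python", ["pip", "python"]),
]
print(top_words_by_language(sample_docs, n=2))
```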

Feature Exploration

Of all the features explored:

Python has the highest ‘spread’ (IQR).

HTML has the lowest.
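The spread comparison can be reproduced with an interquartile range per language. The slide does not say which feature was measured, so this sketch assumes a numeric feature such as README word count. The helper names and the numbers are hypothetical.

```python
import statistics


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def iqr_by_language(feature_by_language):
    """Map each language to the IQR of its feature values."""
    return {language: iqr(values) for language, values in feature_by_language.items()}


# Hypothetical README word counts, for illustration only.
sample = {
    "Python": [10, 50, 200, 400, 900],
    "HTML": [90, 100, 110, 120, 130],
}
print(iqr_by_language(sample))
```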

Modeling

Naive Bayes: How it Works

Easier Math Textbook

Harder Math Textbook

Model	Accuracy of Predictions
Naive Bayes	68%
Logistic Regression	61%
Decision Tree	56%
Random Forest	56%
K Nearest Neighbor	54%
Baseline (JavaScript)	26%
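For the Naive Bayes model at the top of the table, a minimal sketch of how such a classifier works: it scores each language by its share of the training documents, multiplied by the probability of each word under that language, and picks the highest score. This is a small standard-library illustration with add-one smoothing, not the model that produced the figures above. The class name `TinyNaiveBayes` and the training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior: how common this class is in the training data.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                # Log likelihood with add-one smoothing.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical tokenized READMEs, for illustration only.
train_docs = [
    ["pip", "install", "python"],
    ["import", "pandas", "python"],
    ["npm", "install", "node"],
    ["npm", "run", "build"],
]
train_labels = ["Python", "Python", "JavaScript", "JavaScript"]

model = TinyNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["npm", "install"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.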

Conclusion

Getting a good sample of GitHub repos is difficult

READMEs are chaotic, which makes it challenging to fine-tune predictions from them

Finding new ways to differentiate the languages

Scraping the files in a repo, rather than the READMEs, to predict language
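As a rough illustration of the last idea, a repo's language could be guessed from its file extensions. This is a sketch under stated assumptions, not something the project implemented: the extension map is a small hand-picked sample, and `guess_language` is a hypothetical helper.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: a small, hand-picked extension map, for illustration only.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".js": "JavaScript",
    ".java": "Java",
    ".html": "HTML",
}


def guess_language(file_paths):
    """Return the language with the most matching files, or None if none match."""
    counts = Counter()
    for path in file_paths:
        suffix = PurePosixPath(path).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(suffix)
        if language is not None:
            counts[language] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(guess_language(["src/app.js", "src/util.js", "setup.py", "README.md"]))
```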