Predicting GitHub repo language using NLP
A quest to predict programming language from a single aspect of an entire body of work
By Shay Altshue, Ravinder Singh, and Nick Joseph
“You shall know a word by the company it keeps.”
- J.R. Firth, Professor of General Linguistics, 1957
“Hope it ain’t this company.”
Executive Summary
We scraped the READMEs of GitHub’s most-forked repositories and used NLP to predict each repo’s primary programming language. Our best model, Naive Bayes, reached 68% accuracy against a 26% baseline of always guessing JavaScript.
Data Acquisition and Preparation Pipeline
1. Scrape the names of GitHub’s 280 ‘most forked’ repositories.
2. Acquire their README files using Zach’s acquire.py.
3. Drop nulls and non-English repos, leaving 219 repos.
4. Produce ‘cleaned’ README text: basic cleaning, normalizing, tokenizing, lemmatizing, and stopword removal (a sketch of this step follows below).
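A minimal sketch of the cleaning step, assuming NLTK; the helper name `clean_readme` is hypothetical, and acquisition itself happens in Zach’s acquire.py (not shown):

```python
import re
import unicodedata

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time NLTK downloads, if not already present:
# import nltk; nltk.download("wordnet"); nltk.download("stopwords")

def clean_readme(text: str) -> str:
    """Normalize, tokenize, lemmatize, and strip stopwords from raw README text."""
    # Normalize: lowercase, drop accents, keep only letters, digits, and spaces.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())

    # Tokenize on whitespace, then lemmatize each token.
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in text.split()]

    # Remove English stopwords and rejoin into cleaned text.
    stops = set(stopwords.words("english"))
    return " ".join(token for token in tokens if token not in stops)
```

Applied to every surviving README, this yields the ‘cleaned’ text that exploration and modeling work from.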
Exploration and Modeling
Exploration
Takeaways:
[Figure: language distribution across the sampled repos; less common languages grouped as ‘Other’.]
Feature Exploration
Of all the features explored, Python repos show the highest ‘spread’ (IQR) and HTML repos the lowest.
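That spread could be measured along these lines with pandas; the word-count feature and the toy frame below are illustrative assumptions, not our exact feature set:

```python
import pandas as pd

# Hypothetical frame: one row per repo with its language and cleaned README.
df = pd.DataFrame({
    "language": ["Python", "Python", "Python", "HTML", "HTML", "HTML"],
    "clean_readme": [
        "install run test deploy pipeline", "usage example",
        "train model evaluate score notebook data", "page layout",
        "div span template", "stylesheet link",
    ],
})
df["word_count"] = df["clean_readme"].str.split().str.len()

# Interquartile range (IQR) of the feature per language.
quartiles = df.groupby("language")["word_count"].quantile([0.25, 0.75]).unstack()
iqr = (quartiles[0.75] - quartiles[0.25]).sort_values(ascending=False)
print(iqr)
```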
Modeling
Naive Bayes: How it Works
[Figure: tallies of math symbols such as ξ, θ, and π on pages from an “Easier Math Textbook” versus a “Harder Math Textbook”, illustrating how Naive Bayes classifies by comparing per-class term frequencies.]
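The idea: just as you could guess which textbook a page came from by counting how often symbols like ξ, θ, and π appear on it, Naive Bayes scores each candidate language by how probable the README’s words are under that language, then picks the highest-scoring class. A minimal scikit-learn sketch (the training snippets here are made-up stand-ins for real READMEs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up cleaned READMEs and their repo languages.
readmes = [
    "pip install virtualenv pandas dataframe",
    "import numpy train model script",
    "div class stylesheet css page",
    "html head body template layout",
]
languages = ["Python", "Python", "HTML", "HTML"]

# TF-IDF features -> Multinomial Naive Bayes, which combines per-class
# term probabilities (naively assuming words are independent) into a score.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(readmes, languages)

print(model.predict(["pip install requests"]))  # expected: ['Python']
```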
| Model | Accuracy of Predictions |
| --- | --- |
| Naive Bayes | 68% |
| Logistic Regression | 61% |
| Decision Tree | 56% |
| Random Forest | 56% |
| K Nearest Neighbor | 54% |
| Baseline (JavaScript) | 26% |
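A comparison like the one above could be generated roughly as follows; the toy data, TF-IDF vectorizer, and hyperparameters are assumptions for illustration, not our exact setup:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the 219 cleaned READMEs and their language labels.
X = [
    "pip install pandas dataframe", "import numpy train model",
    "python script virtualenv setup", "def function return value",
    "div class css stylesheet", "html head body template",
    "span layout page style", "anchor href link tag",
]
y = ["Python"] * 4 + ["HTML"] * 4

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K Nearest Neighbor": KNeighborsClassifier(n_neighbors=3),
}

# Score each classifier on TF-IDF features with 4-fold cross-validation.
for name, classifier in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    accuracy = cross_val_score(pipeline, X, y, cv=4).mean()
    print(f"{name}: {accuracy:.0%}")
```

The baseline simply predicts the most common language in the sample (JavaScript) for every repo.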
Conclusion
1. Getting a representative sample of GitHub repos is difficult.
2. READMEs are chaotic, which makes fine-tuning predictions from them challenging.
3. Next step: find new ways to differentiate the languages.
4. Next step: predict language by scraping the files in a repo rather than its README.