In machine learning, model selection is more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. Human intuition and guidance, however, can home in on quality models more effectively than exhaustive search. By visualizing the model selection process, data scientists can steer towards final, explainable models and avoid common pitfalls.
The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process, providing visual diagnostics throughout the transformation of high-dimensional data. Yellowbrick is written in Python, uses matplotlib for drawing, and is currently in beta.
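To make the fit-and-transform idea concrete, here is a minimal, dependency-free sketch of a visualizer-style step. It is not Yellowbrick's implementation: the class name TextRankVisualizer, its draw method, and the run_pipeline helper are all hypothetical, and text bars stand in for the matplotlib drawing a real Visualizer would do. The sketch only assumes the Scikit-Learn convention that fit returns the object itself and transform returns data for the next step.

```python
class TextRankVisualizer:
    """Hypothetical visualizer: ranks features by variance as text bars.

    Mimics the Scikit-Learn estimator convention (fit returns self,
    transform returns data) without importing Scikit-Learn or Yellowbrick.
    """

    def __init__(self, width=20):
        self.width = width  # maximum bar length in characters

    def fit(self, X, y=None):
        # Learn the per-column (population) variance that will be displayed.
        n_rows = len(X)
        self.variances_ = []
        for col in zip(*X):
            mean = sum(col) / n_rows
            self.variances_.append(sum((v - mean) ** 2 for v in col) / n_rows)
        return self

    def transform(self, X):
        # Pass the data through unchanged so later steps still receive it.
        return X

    def draw(self):
        # Render one bar per feature, scaled to the largest variance.
        largest = max(self.variances_) or 1.0
        lines = []
        for i, var in enumerate(self.variances_):
            bar = "#" * round(self.width * var / largest)
            lines.append(f"feature {i}: {bar} ({var:.2f})")
        return "\n".join(lines)


def run_pipeline(steps, X, y=None):
    # Hypothetical stand-in for a pipeline: fit, then transform, each step in order.
    for step in steps:
        X = step.fit(X, y).transform(X)
    return X


X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
viz = TextRankVisualizer(width=20)
out = run_pipeline([viz], X)
print(viz.draw())
```

Because transform hands the data back untouched, a step like this can sit between real transformers and report on the data as it flows past, which is the role the paragraph above describes for Visualizers.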
In this research lab, we will focus on extending Yellowbrick with new features and functionality, from adding text visualizations to optimizing parallel coordinates. Because of the Visualizer API, this essentially means developing different kinds of model visualization techniques and writing custom Visualizer objects that implement them. The lab will select a core group of 6-8 developers to work in an agile fashion on implementing Python code, writing tutorials as blog posts, and producing documentation. Sprints will be two weeks long, and at our research meetings we will review progress from the previous sprint and plan the next one. The implementation itself will be done in pairs throughout each sprint.