Python: A Tool for the Practical Data Scientist

Why Python for Practical Data Science?

The practice of data science involves many interrelated but different activities, including accessing data, manipulating data, computing statistics about data, plotting/graphing/visualizing data, building predictive and explanatory models from data, evaluating those models on yet more data, and integrating models into production systems. One option for the data scientist is to learn several different software packages, each of which specializes in one or two of these things but does not do them all well, and then learn a programming language to tie them together. (Or do a lot of manual work.)

An alternative is to use a general-purpose, high-level programming language that provides libraries to do all these things. Python is an excellent choice for this. It has a diverse range of open source libraries for just about everything the data scientist will do. It is available everywhere; high-performance Python interpreters exist for running your code on almost any operating system or architecture. Python and most of its libraries are both open source and free. Contrast this with common software packages that are available in a course via an academic license, yet are extremely expensive to license and use in industry.

How important for data science is the ability to program?

Programming is vital to data science.  It is not necessary to be “a programmer” or a software engineer.  However, there will always be things you want to do that the developers of any particular analytics software package have not envisioned.  You cannot put your enterprise at the mercy of the software package.  

Moreover, data science involves exploration. It is inherently an iterative process of discovery. You will want to run analyses over and over with slight variations. If you have to do these manually, even with the help of a software package, you will get sick and tired and eventually not do as much analysis as you ought to. A new, good idea for how you ought to have done things differently will evoke feelings of dread rather than excitement. On the other hand, if you’ve programmed everything up, you just go in and change a few lines of code and push the button. Voila! The new analysis -- complete with revised graphics, with axes still labeled properly and everything. And you need not even tell your boss that the new analysis will take only 30 minutes of work. Let her think you need two days to redo it all. Take a bike ride. Go to the spa for the afternoon.

Having analytical processes fully automated is also important for repeatability. Running a script multiple times on the same data will produce the same results, provided any random steps are seeded. When you change the data being considered or some input parameters, you can be sure that the rest of the process remains the same. This high level of control is vital to maintaining the precise nature of statistical experiments, yet it is difficult to achieve in a manually driven workflow, where there are many small opportunities to make mistakes that might alter the results.
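As a minimal sketch of what repeatability can look like in code, the example below wraps a small analysis in a function whose only inputs are the data and a few parameters. The function name bootstrap_mean, the made-up measurements, and the seed value are all hypothetical, chosen only for illustration.

```python
import random

def bootstrap_mean(values, n_resamples=200, seed=0):
    """Average of the means of n_resamples bootstrap resamples of values."""
    rng = random.Random(seed)  # private generator: same seed, same draws
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Draw n items with replacement from the original data.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / float(n))
    return sum(means) / len(means)

data = [2.0, 3.5, 1.0, 4.0, 2.5]  # made-up measurements

first = bootstrap_mean(data, seed=42)
second = bootstrap_mean(data, seed=42)                    # identical inputs
variant = bootstrap_mean(data, n_resamples=1000, seed=42)  # one parameter changed

print("same result on rerun: %s" % (first == second))
print("estimate with more resamples: %.3f" % variant)
```

Because the random generator is created inside the function from an explicit seed, rerunning with the same arguments reproduces the same number, and changing one argument changes nothing else about the procedure.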

Additionally, in practice data is seldom available in the format required for input by a spreadsheet program or the format favored by a statistical or mathematical environment such as R or Matlab. To a user with a little programming background, this won’t be a problem: parsers can be written to slurp up almost any data format, enabling processing through all of your data-driven systems. The same is true of system output; just a little bit of programming should allow you to format data in whatever way you desire. If some other system is going to consume your data or analyses, matching its data format will be no problem.
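To make this concrete, here is a small sketch of a parser and two output formatters. The input format (fields written as key=value and separated by a vertical bar), the sample records, and the function names are invented for illustration; a real data source will need its own parsing rules.

```python
import json

# Made-up input in an ad hoc format: key=value fields separated by "|".
RAW = """\
# sensor dump, one record per line
id=1|temp=21.5|site=north
id=2|temp=19.0|site=south
"""

def parse_records(text):
    """Turn each non-comment line into a dict of field name -> string value."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        records.append(dict(field.split("=", 1) for field in line.split("|")))
    return records

def to_tsv(records, columns):
    """Format the chosen columns as tab-separated text with a header row."""
    lines = ["\t".join(columns)]
    for record in records:
        lines.append("\t".join(record[c] for c in columns))
    return "\n".join(lines)

def to_json(records):
    """Format the records as a JSON array for another system to consume."""
    return json.dumps(records, sort_keys=True)

records = parse_records(RAW)
print(to_tsv(records, ["id", "temp"]))
print(to_json(records))
```

Values are kept as strings here; converting them to numbers would be one more small step once you know which columns are numeric.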

While a bit beyond the scope of this class (though not too far beyond!), operationalizing a data-driven system only becomes possible when the data scientist has some programming knowledge. Embedding a predictive model behind a web API or setting a recommendation system to consume new entries in a production database will require some programming effort. The difference between academic laboratory work and data science in practice is this process of implementation.
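As a rough sketch of what "a predictive model behind a web API" might mean, the example below exposes a scoring function through a WSGI application, the standard interface between Python web servers and applications. The model here is a stand-in: the predict function, its weights, and the feature names are made up, and a real system would load a trained model instead.

```python
import io
import json

def predict(features):
    """Stand-in for a trained model: a fixed linear score (made-up weights)."""
    weights = {"age": 0.02, "visits": 0.1}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def app(environ, start_response):
    """WSGI application: accept a JSON object of features, return a JSON score."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(length).decode("utf-8"))
        body = json.dumps({"score": predict(features)})
        status = "200 OK"
    except (ValueError, KeyError, AttributeError, TypeError):
        body = json.dumps({"error": "expected a JSON object of numeric features"})
        status = "400 Bad Request"
    payload = body.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]

# Exercise the application without a server by handing it a hand-built request.
request_body = json.dumps({"age": 50, "visits": 3}).encode("utf-8")
environ = {"CONTENT_LENGTH": str(len(request_body)),
           "wsgi.input": io.BytesIO(request_body)}
seen = {}
response = app(environ, lambda status, headers: seen.update(status=status))
print("%s %s" % (seen["status"], response[0].decode("utf-8")))
```

To try it over HTTP on your own machine, one option is the standard library's wsgiref.simple_server.make_server, for example make_server("localhost", 8000, app).serve_forever(); the host and port are arbitrary choices, and that server is meant for experimentation rather than production.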

Nothing is more important to practical data science than iteration. Taking your idea, implementing it, testing it out, evaluating results, understanding the consequences of the choices you have made, and revising all take time. The more tools you have at your disposal, the faster you can iterate. This faster iteration translates to more ideas getting tested, yielding a greater understanding of the system and model being considered and increasing the chances for success within some given time period.

Finally, knowing how to program, at least a little, is simply good for the soul.  You will forever know you are able to do many things you couldn’t do before.

If you are very much a beginner to programming, or if some of what follows starts to freak you out, go to the following resource and go through it.  It will have you programming some Python very quickly, and will begin to build some of the cognitive tools you will need:

Learn Python the Hard Way (http://learnpythonthehardway.org)

What exactly is Python?

Python is a high-level programming language that has become widely used in a variety of settings; for example, it is used increasingly in production systems at companies such as Google. Python’s design favors readability and clarity over flexibility, which makes sharing code among a group of developers much easier, enables a user to readily understand what third-party code does, and eases the pain of debugging. Python is interpreted rather than compiled ahead of time, which gives a faster turnaround in the development cycle.

Importantly, Python has an active user base. This makes it relatively easy to find others who can help with development problems, and means there is rich online literature illustrating others’ experiences engineering Python systems of all kinds. Another important manifestation of Python’s popularity is its wide variety of libraries. There are currently mature open source libraries for numerical and statistical computation, data analysis, web programming, data processing, interacting with databases, and just about any other task a data scientist is likely to encounter.

Features of Python

Python supports a variety of common programming paradigms, including procedural, object-oriented, and functional programming.
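As a small illustration of three such paradigms (procedural, functional, and object-oriented), the sketch below computes the same quantity, the sum of the squares of the odd numbers in a list, in each style. The class name OddSquareSummer and the sample list are invented for this example.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Procedural: an explicit loop that updates an accumulator.
total = 0
for n in numbers:
    if n % 2 == 1:
        total += n * n

# Functional: compose filter, map and reduce without mutating any state.
odd_squares = map(lambda n: n * n, filter(lambda n: n % 2 == 1, numbers))
functional_total = reduce(lambda acc, x: acc + x, odd_squares, 0)

# Object-oriented: bundle the data and the operation together in a class.
class OddSquareSummer(object):
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(n * n for n in self.values if n % 2 == 1)

oo_total = OddSquareSummer(numbers).total()

print("%d %d %d" % (total, functional_total, oo_total))  # prints "35 35 35"
```

In practice most Python code mixes these styles freely, using whichever reads most clearly for the task at hand.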

Installing Python

One of the primary complaints about Python is that installs can be a challenge. This is particularly true of some libraries and extensions that depend on system libraries written in C or FORTRAN. Additionally, there are several versions of Python, and there isn’t a strong consensus on which version to use; the most recent versions of Python are not the most widely used. For the purposes of this course, we will generally prefer Python 2.7.
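Since several versions of Python may coexist on one machine, it can be worth confirming which interpreter is actually running your code. The sketch below uses sys.version_info from the standard library; the helper names describe_interpreter and meets_minimum are hypothetical.

```python
import sys

def describe_interpreter(version_info=sys.version_info):
    """Return a short label such as 'Python 2.7' for the given version tuple."""
    return "Python %d.%d" % (version_info[0], version_info[1])

def meets_minimum(minimum, version_info=sys.version_info):
    """True if the (major, minor) version is at least minimum, e.g. (2, 7)."""
    return tuple(version_info[:2]) >= tuple(minimum)

print(describe_interpreter())
print("at least 2.7: %s" % meets_minimum((2, 7)))
```

A script could call meets_minimum near the top and stop with a clear message, rather than failing later with a confusing syntax or import error.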

What follows is a variety of options for installing both Python and common libraries used by the data scientist. Note that most system-level installs require “root” or “superuser” access. On a Linux system, to issue a command as root, you typically type “sudo [command here]”. You will then be prompted to enter your password, assuming that you have superuser (su) privileges on your system.

Guides to Programming with Python

Resources for Installing Python Libraries

As mentioned above, one of the primary difficulties with Python is getting your libraries set up correctly. This section lists several options for getting libraries set up and resolving dependencies with, hopefully, minimal effort from the user.
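Before installing anything, it can help to see which libraries your interpreter can already import. The following is a minimal sketch using importlib from the standard library; the helper name check_libraries and the list of module names are illustrative only, and "not_a_real_library" is deliberately bogus to show the missing case.

```python
import importlib

def check_libraries(names):
    """Map each module name to its version string, 'unknown', or None if missing."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None  # not installed, or not importable here
            continue
        # Many, but not all, libraries expose a __version__ attribute.
        report[name] = getattr(module, "__version__", "unknown")
    return report

report = check_libraries(["json", "numpy", "not_a_real_library"])
for name in sorted(report):
    version = report[name]
    print("%-20s %s" % (name, version if version is not None else "MISSING"))
```

Anything reported as missing is a candidate for one of the installation routes described in this section.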

First, OS-level package management systems can install most major Python libraries with very little effort. Additionally, there are several Python-specific library installers, including:

Crucial Data Science Libraries

Other Useful Data Science Libraries

Web Programming

Django is a powerful and popular web programming framework. Django provides tools for managing a database based on the object models used in a website. Additionally, it has a simple templating language that enables programmatic construction of HTML pages. Generally, Django greatly reduces the effort required to build complex web sites, and makes exposing a data-driven product to the web much easier. Generating APIs to access data in tables or to access a model or data-driven service is very simple.

Several common database backends are supported. On large-scale web sites, Django is typically run on the ubiquitous Apache web server; however, Django comes bundled with a simple development server implemented in Python so that you can get up and running quickly. The tutorials are easy to read and explain a variety of Django’s functionality, but if you want to learn Django, you may want to start here.