Steve O'Grady (@sogrady), a developer-focused analyst from RedMonk, views large-scale data collection and aggregation as a problem that has largely been solved. The tools and techniques required for the Googles and Facebooks of the world to handle what he calls "data sets of extraordinary sizes" have matured. In O'Grady's analysis, what hasn't matured are methods for teasing meaning out of this data that are accessible to "ordinary users".
In this interview I ask O'Grady to explain this "Last Mile" for large-scale data infrastructure, clarify the evolving role of Data Scientist, talk about some of the characteristics of successful Data Scientists, and provide guidance to companies looking to adopt large-scale data infrastructure in 2012.
Tim O'Brien (O'Reilly Media): We're talking to Steve O'Grady (@sogrady) from RedMonk. RedMonk is a company that's focused on developers and on helping companies understand developers better. In the blog post you wrote on January 13th, there's a section on Data & The Last Mile, and a statement that I found interesting was, "It's not technically correct to assert that large-scale data is a solved problem, decades of innovation remain."
Could you talk a little bit about what you expect in 2012?
Stephen O'Grady (RedMonk): The point, essentially, is that from an infrastructure perspective, the tools that we have now to collect, aggregate, and operate on data sets -- data sets of all types, data sets, in many cases, of extraordinary sizes -- that tool set is actually, really, pretty good. It's not to say that, as I put it there, there's no innovation remaining. There is quite a bit and we see that, certainly, reflected in the volume of projects that are targeting different opportunities that are in the data management/database space. But certainly relative to other segments of the market, that looks increasingly like not the primary problem, I guess is the simplest way to put it.
TO: So in terms of the projects that provide the foundation, things like Hadoop, or smaller, I'd say less important projects like MongoDB. Those are solved problems [storage and data management]. It's the user interface level that you're looking to for this year.
SO: Yeah. I think everybody, including the authors of those projects, whether we're talking Hadoop or whether we're talking Mongo, would admit that they have obvious needs to work on in the year and, more importantly, the years ahead. But certainly when we look at things, again, on a relative basis, the biggest problem that we see today is that last mile. It's trying to make sense of this information.
Kevin Weil (@kevinweil) from Twitter put it pretty well, I think it was a year ago, basically saying that it's hard to ask the right question. Essentially one of the implications of that statement is that even if we had perfect access to perfect data, it's very, very difficult to try and determine what you would want to ask, how you would want to ask it. More importantly, once you get that answer, what are the questions that derive from that?
So there are a bunch of different levels to the problem. We have basic difficulties in terms of exposing these data sets to regular users as opposed to, for example, people who are trained in writing MapReduce jobs.
TO: If you look at the people who are doing that, we call them, now, Data Scientists. And, it seems like that's a new job description. For the most part, is a data scientist a technical user or is a data scientist someone who sits outside of the IT department?
SO: Probably a little bit of both. I actually would sort of argue that a data scientist typically isn't a user. Users tend to be substantially less skilled. Most of the people that I know that would characterize themselves as data scientists -- or even if they were too modest to do so, other people would cast them in that light -- most of these people are very skilled.
Most of these people have a wide range of exposure to technology. Certainly they have programming expertise of some sort, and that'll range from things like Java, which can be very valuable, particularly coming from the Hadoop community.
Some of the folks are skilled purely in things like Python or R. So the actual mix of skills varies pretty widely, but what is common to all of them is a unique mix of programming talent and an understanding of the basic mechanics of different database styles.
One of the things that I think is frequently lost in conversations around NoSQL is that there is a variety of not only different projects, but also different styles.
A key-value store, for example, doesn't have a ton in common with a MapReduce engine. Those are two different tools for two different use cases. Certainly bunching them all under a single NoSQL term doesn't really do anybody any favors in that context.
So I think from a project perspective, there are, again, a variety of tools. That's where the data scientists come in, as understanding when and how you might apply one. Then, obviously, having the skills to basically leverage that and ask and answer the questions they have.
The difficulty for basically every business on the planet is that there just aren't many of these people. This is, at present anyhow, a relatively rare skill set and therefore one that the market tends to place a pretty hefty premium on.
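Editor's note: the contrast O'Grady draws between a key-value store and a MapReduce engine can be illustrated with a minimal sketch. It is not taken from the interview and does not use Hadoop or any real database; a plain Python dictionary stands in for a key-value store, and the functions `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, along with the sample records, are hypothetical names chosen only for illustration.

```python
from collections import defaultdict

# Key-value style: store and fetch one record by its key.
# A plain dict stands in for a real key-value store (assumption).
kv_store = {}
kv_store["user:42"] = {"name": "Ada", "plan": "pro"}
profile = kv_store["user:42"]  # direct lookup, no scan of other records


# MapReduce style: scan every record, emit (key, value) pairs,
# group the pairs by key, then reduce each group to one result.
def map_phase(records):
    """Emit (word, 1) for every word in every record."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample data for the batch job.
logs = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(logs)
```

The first half answers "what is the value for this key?" with a single lookup, while the second answers "what is true across all the records?" by processing the whole data set, which is why the two styles suit different use cases.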
TO: Another thing that you wrote in this post, I'll just quote you directly. "With data driven decision making on the rise, premiums are being placed on tooling which can expose in a sensible fashion data to those without degrees in computer science."
SO: That's right.
TO: What I'm hearing from you is that you still need a degree in computer science to be a Data Scientist at this point.
SO: Yes, I think that's true. It doesn't necessarily have to be an official degree, but you certainly, in my opinion, need to have equivalent experience. You need to be very skilled.
For example, there are a variety of people who would be on the cusp of being called a data scientist who don't have the traditional computer science background.
The most logical category of these people would be that class of user who is very skilled, traditionally has been considered a power user of Excel. So these are people, for example, who are familiar with things like pivot tables, fusion tables, and so on, and really can get the most out of the spreadsheet tools. And they probably have some experience in terms of programming macros and so on.
But even that skill set, as valuable as it is, certainly as valuable as it has traditionally been, really isn't enough, in my opinion certainly, to qualify for the data scientist title. You really have to think about data in brand new ways. As opposed to leveraging existing skill sets like spreadsheets, it is a more broad and comprehensive set of skills.
TO: OK. Last question. Say you work at a company and you are on the borderline: you use a huge relational database that's not scaling well, and you are looking at a new technology. What are the most important questions you need to ask yourself if you are going to make the transition to being a company that uses one of these large-scale data products?
SO: Well, I think the first question really should be -- it frequently isn't, unfortunately -- what are the problems that you're having with your relational database? The answer to that may be there are no problems. One of the misconceptions, I think, around NoSQL is that because it's new it is inherently better for every use case. And that, in my opinion, is false.
There are many, many tasks, many, many workloads where a relational database is not only sufficient, but the ideal solution. So really, if you're looking at these tools for your organization, I think the first question you have to ask is what is the problem that we're trying to solve? What are the problems? How is the relational database that we're using not meeting our needs?
There are certainly instances where the relational database is not sufficient. We've certainly seen that.
For example, a lot of people are employing Hadoop and MapReduce engines as a complementary approach to their database, because basically it offers, among other things, the ability to work on unstructured information. It offers certain linear scalability and a variety of other advantages as opposed to the traditional relational database model.
So there are certainly use cases where it comes in handy, but yeah, I think that the one thing that I would want to know would be just that. It would be, OK, what is the problem that we're trying to solve? Because if you are going down the NoSQL route for the sake of going down the NoSQL route, that's the wrong way to do things.
You're likely to end up with a solution that may not even improve things. It may actively harm your production process moving forward because you didn't implement it for the right reasons in the first place.