Opportunities and Perils in Data Science (Slides and Transcript)

Dr. Alfred Z. Spector

Presentations at Cornell, Harvard & Rice

Fall and Winter 2016-2017

Slides and Edited Transcript


Over the last few decades, empiricism has become the third leg of computer science, adding to the field’s traditional bases in mathematical analysis and engineering.  This shift has occurred due to the sheer growth in the scale of computation, networking, and usage as well as progress in machine learning and related technologies.  Resulting data-driven approaches have led to extremely powerful prediction and optimization techniques and hold great promise, even in the humanities and social sciences.  

However, no new technology arrives without complications.  In this presentation, I will balance the opportunities provided by big data and associated A.I. approaches with a discussion of the various challenges.  I’ll enumerate ten plus one categories including those which are technical (e.g., resilience and complexity), societal (e.g., difficulties in setting objective functions or understanding causation), and humanist (e.g., issues relating to free will or privacy).  I’ll provide many example problems, and make suggestions on how to address some of the unanticipated consequences of big data.

About Alfred Spector

Alfred Spector is Chief Technology Officer and Head of Engineering at Two Sigma, a firm dedicated to using information to optimize diverse economic challenges.  Prior to joining Two Sigma, Dr. Spector spent nearly eight years as Vice President of Research and Special Initiatives at Google, where his teams delivered a range of successful technologies including machine learning, speech recognition, and translation. Prior to Google, Dr. Spector held various senior-level positions at IBM, including Vice President of Strategy and Technology (or CTO) for IBM Software and Vice President of Services and Software research across the company.  He previously founded and served as CEO of Transarc Corporation, a pioneer in distributed transaction processing and wide-area file systems, and he was a professor of computer science at Carnegie Mellon University. Dr. Spector received a bachelor's degree in Applied Mathematics from Harvard University and a Ph.D. in computer science from Stanford University. He is an active member of the National Academy of Engineering and the American Academy of Arts and Sciences, where he serves on the Council.

Preface to Transcript

This is an edited and hopefully readable transcript of this presentation, based on a recording made when I gave the first version of this talk at the INFORMS Annual Meeting in November 2015[1].   Please note that several graphics in the slides were provided to me by teams at Google for use in publicly presented research talks, and that I have edited my spoken words with the goal of enhancing readability and completeness.  I am grateful to Nikki Ho-Shing at Two Sigma for many helpful editorial improvements.  

[Slide 1 - Title: Opportunities and Perils in Data Science]


Thanks to all of you for coming to hear me. I am really delighted to be here.  And, as someone who quantifies things, I’m sensitive to the aggregate amount of time you will spend listening.  I'll do my very best to make this interesting and useful.

I recognize that in the fields I’ll be discussing, particularly given the breadth of this presentation, all of you collectively know much more than I know.  Nonetheless, I hope you'll find the aggregate perspective that I provide to be useful, in part because I've seen these issues of data science, optimization, and big data from many different perspectives and helped apply big data approaches to many domains.

[Slide 2 - Abstract]

No abstract presented during the talk.

[Slide 3 - Data Science]

To begin, we need to think about what we mean by data science.  I define data science as the study of new approaches to computing based on the application of large amounts of data, approaches themselves known by the term, big data.  I believe data science is a transdisciplinary study involving computer science and engineering, applied mathematics, statistics, as well as the humanities and social sciences considering all technology and processes, human and computational, for the set-up, capture, storage, and valuable use of large amounts of data with goals such as predictive analytics, optimization, and understanding.  Because of the broad collection of techniques, the involvement of societal systems, and the vast number of implications, the field is both involved and impactful.  I believe data science will induce sufficient change throughout the world that it will define new challenges in many seemingly unrelated disciplines, even the humanities.

As an example of the scale of today’s data, it’s safe to assume that Google is doing at least five billion searches per day, each of which stores its query and is matched against a vast store of information.  That number of searches works out to be over fifty thousand a second.  Two Sigma loads about one petabyte of data per month for its search businesses.  It's been commonly reported that YouTube users upload three hundred hours of video per minute.  In the scientific realm, the Sloan Digital Sky Survey, which is scanning the heavens continuously and comparing the images to see what's changed, generates thirty trillion bytes a day.  And there are so many more examples.  
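To make these rates concrete, here is a small back-of-the-envelope sketch in Python. It only restates the figures quoted above; the variable names are illustrative.

```python
# Back-of-the-envelope check of the scale figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60

searches_per_day = 5_000_000_000      # "at least five billion searches per day"
searches_per_second = searches_per_day / SECONDS_PER_DAY

video_hours_per_minute = 300          # commonly reported YouTube figure
video_hours_per_day = video_hours_per_minute * 60 * 24

print(f"{searches_per_second:,.0f} searches per second")
print(f"{video_hours_per_day:,} hours of video uploaded per day")
```

The first figure comes out a little under fifty-eight thousand per second, consistent with "over fifty thousand a second."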

I call out big data and machine learning as the two most fundamental approaches underlying data science.  The term big data applies to a body of techniques for conceiving, designing, developing, and operating systems that gather and process vast amounts of information.  To summarize, I use the term big data to refer to the core engineering approaches underlying data science.

I call out the term machine learning because it refers to a rapidly evolving, highly useful set of techniques with which computers can usefully learn from large amounts of data.  There are other mathematical and statistical techniques of great importance, but machine learning has been advancing rapidly and to great effect, as I will illustrate later in this talk.

[Slide 4 - Prodigiousness Realized]

Here is an illustration of scale.  These are data centers from the big three cloud computing providers:  Amazon, Google, and Microsoft.  I’ve been in one of Google’s warehouse computers, so I attest to the accuracy of this type of picture, albeit perhaps without the mood lighting.  Google described its data centers in a 2014 video, which can be located by typing “Google Data Center Video” into Google’s search engine.

[Slide 5 - Vocabulary of Prodigiousness]

Some of the words we use to describe scale are also illustrative.  Large cloud computing platforms today store many exabytes (10^18 bytes), and we can imagine them having a zettabyte (10^21 bytes) in the 2020-2025 timeframe.  As another example of scale, it’s quite conceivable to imagine a warehouse computer with sixteen million processors.  To illustrate this, imagine a chip with 100 processors, or cores, per chip.  This is not at all far-fetched, as Intel announced its Knights Landing chip with 72 cores way back in 2013.  Now imagine placing just ten chips on a single circuit board and then twenty boards in a stack.  So, each stack would contain 20,000 processing cores.  If there were then 40 rows and 20 columns of stacks in a warehouse computer, this would comprise sixteen million processors!  Such a system can both support vast numbers of users and reduce the computing time for many important and challenging computations to fractions of a second.
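The sixteen-million-processor figure is simple multiplication, sketched below in Python using only the numbers given above. The variable names are illustrative.

```python
# Hypothetical warehouse computer, using the figures from the talk.
cores_per_chip = 100
chips_per_board = 10
boards_per_stack = 20
rows, columns = 40, 20

cores_per_stack = cores_per_chip * chips_per_board * boards_per_stack
total_cores = cores_per_stack * rows * columns

EXABYTE = 10**18    # bytes
ZETTABYTE = 10**21  # bytes

print(f"{cores_per_stack:,} cores per stack")
print(f"{total_cores:,} cores in the warehouse computer")
print(f"one zettabyte is {ZETTABYTE // EXABYTE:,} exabytes")
```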

[Slide 6 - Growing Empiricism in Computer Science Leading to Data Science]

Let me make a brief detour and discuss the impact of large scale data and empiricism on my own field of computer science.  

Historically, computer science had only two bases:

  1. Computer science has always been, in part, an analytical (or mathematical) discipline.  The subject of computability, that is, the basic nature of what is computable, goes back to the 1930s.  The creation and analysis of the time and space bounds of algorithms has been a core part of the field since then, as well.  Very many of the Turing Awards, which are computer science’s Nobel Prizes, and related prizes (e.g., the Gödel Prize) have been given out for such work.
  2. Computer science has also been, in part, an engineering discipline.  Like all engineering, it is based on abstraction (where one invents reusable components of a system), encapsulation (enabling the diverse use and reuse of those components), and then assembly of those components into ever larger building components or perhaps systems that are useful in themselves.  The field has been not only concerned with the abstractions and components but the tools and disciplines for their construction, management, maintenance, and reuse.  These include programming languages, code repositories, editor frameworks, and others.  Certainly, tens of millions of programmers have built vast quantities of software in ever more elaborate structures.  For the most part, the overwhelming remainder of the Turing Awards have been given out for work in this domain.

What about the empiricism that the more traditional sciences have at their core?  Clearly, there has always been some in computer science, relating to the testing and performance evaluation of programs, and a little early thinking about learning.  But, the field had been primarily mathematical analysis and engineering in nature.  

But, with the advent of big data all of this changed.  With vast quantities of data and the kinds of problems that they enable, computer science has now also become an empirical discipline.   By empirical, I mean that systems can learn from data and that programmers routinely experiment with their programs, measure their impact on users or their broader environment, and then modify the programs to achieve the desired result.

The mid-1980s marks the period when this third leg began its rapid growth. With empiricism, programmers routinely use large datasets to test and continually refine their programs to achieve their usability, functionality, or economic goals.  With vast amounts of usage and data from which to learn, computer programs may themselves automatically learn and improve their own operation.  This change has occurred for many reasons, including the improvement in computing, networking, and storage technology, the growth in machine learning, the web’s catalysis of enormous growth in data, and the application of computing to a range of problems for which non-empirical solutions make no sense.

As this empirical portion of computer science has grown, it has gained inputs from other disciplines (e.g., statistics) and its implications have broadened considerably.  

But before leaving this partial detour, I do think it’s worth mentioning that the impact of all aspects of computer science will continue to grow, so the field will play a key role in many, if not all, disciplines.  Thus, I like to say that when we combine computer science and X, for all disciplines represented by the variable X, we uncover the sweet spot for vast amounts of contemporary innovation.  CS+X, for all X, is a message I give pretty much everywhere I go.  It’s a bit off-topic, but I often say that while we need to train many computer scientists, we need to train even more people who understand many aspects of computer science, because of the enormous impact of CS on every discipline.  Jeannette Wing, previously leading computer science at the National Science Foundation and now the leader of Microsoft Basic Research, coined the term “Computational Thinking,” to describe what we need to teach.

[Slide 7 - Why Growth in Empiricism?]

Why the growth in empiricism?  I've mentioned many of the causes in passing, but this slide lists them together.  

Certainly, most of this change is a result of the growth in scale of systems and usage.  Suffice it to say that essentially everything in computing (networking, user community, storage, processing speed) has grown with Moore’s Law, which correctly predicted a doubling of the number of transistors per unit area every 12-18 months, at least until recently.  What we often miss in thinking about Moore’s Law, however, is that this gradual growth has resulted in a factor of one million or more since, say, the early 1970s, when I began in the field.  I don’t know of anything else where things have become one million-fold greater.  Consider even the printed Directory of the Arpanet, the predecessor to the Internet.  In the late 70’s, this directory contained all the users and was at most a few hundred pages of listings.  Contrast the few thousand users of the time to the few billion today!
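As a rough check on that million-fold figure, the Python sketch below counts the doublings involved. The 18-month doubling period and the 1971 and 2016 endpoints are illustrative assumptions, not measured values.

```python
import math

target_growth = 1_000_000
doublings_needed = math.log2(target_growth)   # just under 20 doublings

doubling_period_years = 1.5                   # assumption: 18 months per doubling
years_needed = doublings_needed * doubling_period_years

elapsed_years = 2016 - 1971                   # assumption: "early 1970s" to this talk
doublings_elapsed = elapsed_years / doubling_period_years

print(f"doublings needed for a million-fold gain: {doublings_needed:.1f}")
print(f"years needed at one doubling per 18 months: {years_needed:.0f}")
print(f"growth factor over {elapsed_years} years: {2 ** doublings_elapsed:,.0f}")
```

Under these assumptions, a million-fold gain takes about thirty years, so "one million or more" over the period described is, if anything, conservative.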

All of this scale, and particularly the scale in usage, allowed for A/B experimentation. On the web, websites are continually experimenting (either automatically or under human control) with many variants to determine what users like more.  Colors, placement, form, and content can be rapidly changed to meet usage, user-satisfaction, or monetary goals, most of which are measurable.

Here’s a most basic example:  It’s not hard on the web to do a, say, 1% experiment regarding the color scheme of a website or some other aspect of a website’s user interface, and then to gather statistics on whether this is achieving some desired goal (often in comparison with the other 99%).  And, if the A/B experiment shows that the changed 1% has the desired outcome, the experiment can gradually be ramped up to 5% and more, and if it keeps working, it can be universally rolled out.  Whether automatically or with a human in the loop mediating the evolution, a system can keep improving indefinitely.  And the more it improves, the more usage it gains, allowing for more improvement and more usage.  The classic virtuous circle.
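One way such a 1% experiment might be evaluated is sketched below in Python. This is a minimal illustration, not a description of any particular company's system: it uses a standard two-proportion z-test, and the click counts, the `next_exposure` ramp-up rule, and the 0.05 threshold are all hypothetical.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Return (lift, z, two_sided_p) comparing arm B's success rate to arm A's.

    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p_b - p_a, z, p_value

def next_exposure(current_fraction, lift, p_value, alpha=0.05):
    """Hypothetical ramp-up rule: widen exposure only after a significant win."""
    ramp = [0.01, 0.05, 0.25, 1.0]
    if lift > 0 and p_value < alpha:
        larger = [f for f in ramp if f > current_fraction]
        return larger[0] if larger else 1.0
    return current_fraction

# Made-up counts: 99% of traffic sees the old design (A), 1% sees the new one (B).
lift, z, p_value = two_proportion_z(49_500, 990_000, 560, 10_000)
print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.4f}")
print("next exposure:", next_exposure(0.01, lift, p_value))
```

A production system would also have to guard against peeking at results repeatedly and against testing many variants at once, both of which this sketch ignores.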

[Slide 8 - AI: Turing Test ... & Learning]

Journeying back historically, Alan Turing did so many things.  He was clearly a war hero, who led critical work to decrypt Nazi codes.  In computer science, he also had his feet in both the analytical and engineering bases of the field.

Regarding learning and empiricism, in particular, he wrote a bit about the subject.  As shown on this slide, he wondered if you could learn from the outcomes of particular chess moves as a computer played a human.  But, forward thinking as this was, he didn’t get very deep into learning.

Of course, Turing is also known for the Turing Test—an interesting, but not very fundamental, operational model for deciding when a computer can be determined to be intelligent.  

[Slide 9 - Letter from Strachey on Learning]

Following Turing, Christopher Strachey, another early computer scientist who lived until 1975, also talked about learning systems.  He thought about how to recognize and learn from relationships.  I’m sure Dr. Strachey would be very pleased to be working in this era of big data.

[Slide 10 - Empiricism in CS, Historically]

While I’ve argued empiricism at scale is a largely new phenomenon, I should mention for completeness some places where it did play a role in computer science prior to the mid-1980s.

Arthur Samuel, in 1959, actually did write a checkers-playing program that appears to be the world's first serious, self-learning program. However, machine learning remained the exception, not the rule.

Experts in artificial intelligence did write about the role of using several experts to teach systems.  “Rules-based systems” were the result and, while brittle, they showed computers could, at least partially, solve some very challenging problems like infection diagnosis.

Certainly, programmers had been debugging computer programs, via test cases, since the beginning.  On the other hand, luminaries in the field like Edsger Dijkstra didn’t believe much in testing; he actively denigrated it.  He felt that computer programs were a mathematical form, and that they should be analyzed mathematically to prove whether or not they would work as expected. He is known for having said that “testing shows the presence, not absence, of bugs.”  Certainly true, but most of the time, there was little alternative as our analytical capabilities were, and remain, limited.

In performance evaluation of programs, programmers have always gathered data on program execution; for example, with the profil command that many programmers have used in Unix or Linux.

As a final example, when people really began to think about making computer programs more intuitive to use, one almost inevitably had to measure human responses to them.  In the 1990s, Don Norman (now at UCSD) and others pioneered the field of Human Computer Interaction (HCI), which is by nature a very empirical discipline.  HCI, now a major field within computer science, I submit, has been an important part of the major movement towards empiricism beginning in the 1980s.

In summary, it’s fair to say there has always been some empiricism in the field, but it was a small part compared to mathematics and engineering. This has now changed, and that change has had the biggest recent impact on computing, other than perhaps the continued long-term impacts of Moore’s Law.

[Slide 11 - Big Data: Technical Successes]

Consider the successes of empirical computer science:  vast websites, all sorts of large businesses, ever more applications of computing, the dominant force impacting new business formation, and more.  We see it everywhere.

[Slide 12 - Google books Ngram Viewer]

Turning to a very simple use of big data (without the machine learning), let me discuss the Google Ngram Viewer.  This tool looks for occurrences of Ngrams (one or multi-word sequences) in the Google scanned books corpus and then plots the frequency with which those Ngrams appear in the corpus over time.  As a sanity check, as expected, the Ngram viewer reports that the word “the” appears very frequently, about 5% to 6% of the time from 1800 to 2000 in the English books corpus.  Notably, the Ngram viewer fairly recently gained the capability to discriminate among homonyms with different parts of speech, so one can better determine word usage:  that is, the relative frequency of “effect” as either a noun or a verb.  
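The core computation behind such a viewer can be sketched in a few lines of Python. This is only an illustration of counting Ngram frequencies by year; the tiny `toy_corpus`, the whitespace tokenization, and the function names are my assumptions and have nothing to do with how the actual tool is built.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-word sequences in a list of tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency_by_year(corpus, phrase):
    """Map each year to the share of same-length Ngrams that equal `phrase`."""
    n = len(phrase.split())
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[phrase.lower()] / total if total else 0.0
    return result

# Hypothetical two-year corpus, standing in for millions of scanned books.
toy_corpus = {
    1975: ["software engineering is a young discipline"],
    1990: ["machine learning and software engineering both matter",
           "machine learning is growing"],
}

for phrase in ("software engineering", "machine learning"):
    print(phrase, relative_frequency_by_year(toy_corpus, phrase))
```

At real scale, the difficulty lies less in the counting than in the scanning, text recognition, and metadata cleanup that precede it.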

I show this chart also to illustrate a point I made previously about the relatively recent rise of empiricism as a third leg of computer science.  The chart on the screen shows that “software engineering,” a term of art related to the engineering aspect of computer science, began to take off in the mid-1970s, while the newer empirical aspect of the field, as evidenced by the use of the words “machine learning” only began to rise in the mid-1980s.  The term, “computability,” which is related to the mathematical aspect of the field, has a low and fairly constant frequency of use, for it is practiced by a much smaller part of the computing community.

[Slide 13 - Machine Learning in More Detail]

Carnegie Mellon Professor Tom Mitchell’s well-accepted definition of machine learning is this: “A computer program is said to (practice machine learning), or learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Supervised learning refers to learning systems that are trained on examples with known, valid results.  Unsupervised learning refers to systems that find structure in data without first being trained on such results.  And, finally, reinforcement learning systems are systems that learn from feedback on their actions as they are increasingly used.

The major problems solved are:

[Slide 14 - Typical Machine Learning at Scale]

Almost every time data scientists apply machine learning to a new problem area, they get significant improvements on important metrics.  These metrics may be related to outcomes as diverse as revenue, consumer satisfaction, increased time spent engaging with an application, or increased efficiency in completing a task, among other things.

[Slide 15 - Machine Learning Architecture]

Allow me to summarize the primary machine learning paradigm in use on the web, as illustrated in this block diagram: (1) Many users interact with a system.  (2) The system saves lots of data relating to the individual interactions.  (3) The data is combined with other databases of exogenous information in what are termed “join logs.”  (4) Those data are used by a machine learning system to create a model (and related logic) that will “improve” the system’s operation.  (5) That machine learned model is periodically incorporated into the server, further improving the system’s operation.  

This can happen repeatedly, generating the previously mentioned virtuous circle.  Machine learning makes the system better so more people use it; more use results in more training data; the system gets even better; so even more people use it.  And, the circle continues. Recommendations on Netflix or YouTube get better and better the more use there is.  This is even true of speech recognition, so the era where computer software learns, at least in somewhat constrained arenas, is upon us.
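The loop just described can be caricatured in a few lines of Python. This is a deliberately trivial sketch: the `CATALOG` of exogenous data, the log records, and the "model" (a click-through rate per category) are all hypothetical stand-ins for far more elaborate production components.

```python
from collections import defaultdict

# Hypothetical exogenous data: item -> category.
CATALOG = {"a": "jazz", "b": "jazz", "c": "rock"}

def join_logs(interaction_log, catalog):
    """Step 3: attach exogenous attributes to each logged interaction."""
    return [{**event, "category": catalog[event["item"]]} for event in interaction_log]

def train(joined_log):
    """Step 4: a trivial 'model', the click-through rate per category."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for row in joined_log:
        shown[row["category"]] += 1
        clicked[row["category"]] += row["clicked"]
    return {category: clicked[category] / shown[category] for category in shown}

def serve(model, catalog):
    """Step 5: the server ranks items using the most recently deployed model."""
    return sorted(catalog, key=lambda item: model.get(catalog[item], 0.0), reverse=True)

# Steps 1 and 2: users interact and the system saves what happened.
log = [
    {"user": "u1", "item": "a", "clicked": 0},
    {"user": "u2", "item": "c", "clicked": 1},
    {"user": "u3", "item": "c", "clicked": 1},
    {"user": "u4", "item": "b", "clicked": 1},
]

model = train(join_logs(log, CATALOG))
ranking = serve(model, CATALOG)
print(model)
print(ranking)
```

Each pass through the loop appends new interactions to the log and retrains, which is where the virtuous circle comes from.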

While it all seems simple, don't underestimate the complex software and data gathering, cleaning, and transformation, as well as the computer science, mathematics, and statistics needed to make this work.  There is also need for very careful attention to issues of privacy and the specific types of improvement, or a user community (or society) will reject the system.   These are some of the constituent elements that make data science so interesting.

[Slide 16 - A Probabilistic Transduction]

Big data approaches have been applied repeatedly to solving a big part of the grand challenge problem of the computer transcription of human speech, sometimes called “speech recognition.”  Today, every step of the speech recognition pipeline has machine learning models in it.  Whether it's the extraction of the individual phones from a digital stream of sound, the combination of those phones into phonemes, the piecing together of phonemes into words, capitalization, or punctuation, machine learning algorithms trained on large data sets are central.  Machine learning is equally important in speech synthesis and in related problems like language translation, among many others.

[Slide 17 - Speech Recognition]

Showing the large changes afoot in machine learning, Google, based on research done in many places, deployed the first large-scale production speech recognition system using neural networks for the acoustic modeling stage of the speech recognition process. The improvement in accuracy was about 10% on an absolute scale, which made the system almost 40% more accurate than it had previously been.  This one change in the type of machine learning led to a greater improvement in one year than had been achieved in more than 10 years of work!  This was a striking demonstration of the power of vast speech training data and neural networks, a pairing that has become far more common since.

[Slide 18 - Music Recommendations]

As another example, consider music recommendations.  With vast cloud-based libraries of music at our disposal, we need recommendations to help us make our way through the possibilities.  So, how can a music application give us useful recommendations?

One approach is called “collaborative filtering.”  If a system notes that someone who has listened to music we have liked also likes certain other music, we may like that music as well.  This is an approach that Amazon and Netflix have used with enormous success.
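Here is a minimal Python sketch of user-based collaborative filtering, one common variant of the idea. The listeners, tracks, and ratings are invented, and cosine similarity is just one reasonable choice of similarity measure; none of this describes how Amazon or Netflix actually implement it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(target, ratings, top_n=2):
    """Score tracks the target has not heard by similar listeners' ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine(ratings[target], other_ratings)
        if similarity <= 0:
            continue  # ignore listeners with no taste in common
        for track, rating in other_ratings.items():
            if track not in ratings[target]:
                scores[track] = scores.get(track, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical listening data: listener -> {track: rating from 1 to 5}.
ratings = {
    "ann": {"haydn_94": 5, "beethoven_5": 4},
    "bob": {"haydn_94": 5, "beethoven_5": 5, "mozart_40": 4},
    "cat": {"punk_track": 5, "mozart_40": 1},
}

print(recommend("ann", ratings))
```

Because bob's tastes overlap with ann's and cat's do not, only bob's extra track is suggested to ann.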

A second approach uses semantic information about music and composers to make recommendations.  So, for example, you could imagine extracting from Wikipedia all the relationships between musicians and composers over time.  You could follow the relationships, and possibly make recommendations in this way, as well.  Beethoven studied with Haydn, so if one likes Haydn, perhaps one might like Beethoven.  If one likes a rock 'n' roll group with a particular drummer, one might like another group with the same drummer.   One could call this “semantic connection.”

A third approach uses audio signal processing and a trained machine learning model to cluster music that sounds complementary.  The resulting system could recognize tempo, beat, timbre, and use that data to make recommendations for songs that are similar or perhaps even contrasting.

It’s natural to combine all these techniques, creating an ensemble method.

Interestingly, I feel there are analogs between music recommendations and investment management, a key focus of my current company.  Approaches based on collaborative filtering, semantic knowledge, and raw musical signal are akin to momentum, fundamental, and technical approaches to the markets.

[Slide 19 - Dictionary]

Big data has even affected the dictionary results Google Search returns when it determines a user is likely looking for a word’s meaning; Search will display not only the meaning (from a dictionary) but also a chart of word usage over time.  As you can guess, this is based on how frequently the word has appeared in the Google Books corpus – an Ngram viewer-like approach.  Google also uses other information from the vast web corpus to provide more information than is possible with just a dictionary.

[Slide 20 - More Areas of Applicability]

The previous examples I gave are just a few of the many possible.  There is a nearly endless list of applications, which is part of the reason so many universities are creating data science programs.  It is also one of the drivers of this decade’s rapid growth in the number of students choosing computer science and related disciplines.

You’ll be glad to know that I’m not going to methodically go through this list, but I’ll highlight just three more examples:

Self-driving cars. This challenge is substantially, though by no means totally, a big data problem.   Clearly, self-driving cars will need to understand road networks, lanes, turns, danger areas, where traffic slows down, etc.  With all the GPSs and cell phones, this data, even when well-anonymized, can provide a detailed view of the collective action of the world’s roads and drivers’ responses to them, and thus allow computer systems to model how humans have driven.  While I personally don’t think current techniques, in themselves, have developed quite enough to permit totally autonomous cars (except perhaps in geo-fenced areas), they go a long way.

Consider also political campaigns now and what's being done to slice and dice populations so that candidates can better appeal to fine-grained constituencies.  For better or for worse, big data is a large part of this.

Personalized, or targeted, advertisements are often discussed.  They have been crucial to the growth of the large internet companies.  The industry believes, or at least hopes, that such ads are better for the consumer (as the consumer is much more likely to be interested in the advertised product or service), the content provider (the property on which the advertising is done), and the advertiser, though some do feel the ads may elicit consumer behavior not in the self-interest of the consumer.  More on this latter point later.

[Slide 21 - Characteristics of Successes (spelling)]

I think it’s worth stepping back a bit to see the characteristics of the big data successes so as to begin to illustrate where big data approaches work well and where they are more challenging.  

The clearest example of a really good machine learning application is spelling correction, as practiced by the big search engines.  Please note that search engine companies do NOT have analysts in every language who know about the analogs of “s’s” versus “z’s”, dyslexic typing, or incorrectly positioned fingers.   The system learns correct spelling merely by being reinforced upon matches between what people type and useful (that is, clicked on) search results.  A system also learns that if a search term results in no clicks but is textually very similar to a follow-on search term that does result in a click, the first term is related to the second.
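The second signal, a query with no click followed by a very similar query with a click, can be sketched in Python as follows. This is a toy illustration under my own assumptions: the session format, the 0.8 similarity threshold, and the function names are hypothetical, and the similarity measure is the standard library's `difflib.SequenceMatcher` rather than whatever a real search engine uses.

```python
from collections import Counter
from difflib import SequenceMatcher

def mine_corrections(sessions, min_similarity=0.8):
    """Count (typed, corrected) pairs: an unclicked query followed by a
    textually similar query that did lead to a click."""
    pairs = Counter()
    for session in sessions:
        for (q1, clicked1), (q2, clicked2) in zip(session, session[1:]):
            if clicked1 or not clicked2 or q1 == q2:
                continue
            if SequenceMatcher(None, q1, q2).ratio() >= min_similarity:
                pairs[(q1, q2)] += 1
    return pairs

def best_correction(query, pairs):
    """Return the most frequently observed correction, or the query itself."""
    candidates = {fixed: n for (typed, fixed), n in pairs.items() if typed == query}
    return max(candidates, key=candidates.get) if candidates else query

# Hypothetical search log: each session is a list of (query, was_clicked).
sessions = [
    [("barrack obama", False), ("barack obama", True)],
    [("barrack obama", False), ("barack obama", True)],
    [("barrack obama", False), ("weather boston", True)],
]

pairs = mine_corrections(sessions)
print(best_correction("barrack obama", pairs))
```

Note that nothing in the sketch knows English, or why the misspelling occurred; it relies only on what users typed and clicked.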

The problem has these characteristics:  

  1. A coherent data repository that stores the signals (often called features) from which to train a model.  In the case of spelling correction, for example, a search engine’s history function can provide a list of everything people have ever typed, recording both incorrect and correct spellings. (This is called the “search log.”)  Generally, the data repository works best when there are a modest number of different types of signals (e.g., tens), and for the system to really work, there must be access to the features that actually imply the results.  It’s clearly true in spelling correction, but not necessarily in large systems, say, the global economy.
  2. A clear objective function, or statement of what constitutes a good result.  In this case, the goal of correct spelling is simple and clear.
  3. Tolerable failures. If a spelling corrector almost always does the right thing, but occasionally cannot correct a word’s spelling (or guesses a word incorrectly), it’s acceptable to almost all users.
  4. No need for transparent reproducibility. A spelling corrector’s internal operation can be opaque and need not be opened up to public scrutiny.  In particular, the signal dataset (e.g., the search history) does not need to be published to enable others to scrutinize the understandings gleaned from the spelling corrector.  That’s a good thing because search logs are exceedingly private and cannot possibly be released.  
  5. While there is a lot of data, it’s not overwhelming.  There are only billions of searches per day.  (Not that this would ever be done, but if everyone in the world typed one character per second all day and all night, the storage requirement would be large but tractable--at most 10^15 bytes per day.)  Contrast this with the vast, combinatorial collection of health-related data from all humans, which proponents of personalized medicine would like to understand.  While this is an extremely interesting data set, it’s much larger and less tractable.  
  6. And, finally, one need not understand or explain cause and effect.  If a spelling corrector corrects ‘Barrack’ to ‘Barack’, it does not need to explain, "here's the rationale I went through to generate the correct spelling.”  The user is glad to see the automatic correction.  In fact, as noted, correctors may fix misspellings due to very different underlying causes, but the algorithm need have no explicable rationale as to the root cause of the error.
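The storage bound mentioned in item 5 above can be checked with a few lines of Python. The world population figure and the one-byte-per-character encoding are assumptions made for the sake of the estimate.

```python
# Upper bound: everyone on Earth types one character per second, all day.
world_population = 7_400_000_000   # assumption: roughly the 2016 world population
bytes_per_character = 1            # assumption: one byte per typed character
seconds_per_day = 24 * 60 * 60

bytes_per_day = world_population * bytes_per_character * seconds_per_day
petabytes_per_day = bytes_per_day / 10**15

print(f"{bytes_per_day:.2e} bytes per day ({petabytes_per_day:.2f} petabytes)")
```

Under these assumptions the total stays below 10^15 bytes per day.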

[Slide 22 - Characteristics of Successes (recommendations)]

I’ll run through one other example quickly – that of music recommendations, and I’ll reflect on the same considerations that I just used.

  1. There are a few coherent data repositories which provide the signals from which to train a model.  
     a. The history of click data indicating the tracks users selected and, perhaps, how long they have listened.
     b. A database of semantic information about music, musicians, and musical periods.
     c. Information about the recording and the musical signals themselves.  
  2. There is a clear objective function, or understanding, of what constitutes a good result.  In this case, the goal is to play a track of interest to the user, perhaps as evidenced by how long the user, in fact, listens to it.
  3. It’s acceptable to have some occasional music recommendations a user doesn’t like, particularly if they are “plausible.”
  4. There is no need for transparent reproducibility.  The system isn’t trying to convince anyone of a scientific result.
  5. The amount of data is large yet tractable.  For example, there have been some millions of separate CD titles, and even without further compression, that’s not so much data.
  6. Finally, there is also no reason to explain cause and effect.  But, on this point, I note that a music recommendation engine might be able to do a little explanation: like, this piece is being recommended because the lead singer was in common with a previously recommended piece, because the timbre was similar to the last piece, or because users who listened to the previous piece also liked the current piece.

 [Slide 23 - Data Science Challenges]

So far, I’ve focused on what’s happening with big data and a few of the anticipated direct and mostly beneficial consequences.  But, I hope I’ve also set the stage for this second part of the talk: a more direct focus on the unanticipated consequences of big data.  In this section, I’ll try to illustrate the challenges as data science grows in its technical foundation and its societal impact.  As a society of mathematicians, scientists, and engineers, or as owners and operators of businesses, or even just as members of the body politic, we should keep these challenges in mind.

[Slide 24 - Decomposition of Challenges]

As I’ve been saying, there are many domains where big data approaches work extremely well.   So, let’s step back and see how we characterize the space where big data approaches work well, and where they are, shall we say, challenged.

To my thinking, the three biggest obstacles relate to:

  1. Setting the objective function.  If the objective function is very complex or contentious, then it’s both hard to design a computer system to function properly, and society may have significant concerns about its goals.
  2. How error-tolerant the application is.  
  3. Whether understanding a correlation is sufficient, or whether a system must explain cause and effect.

Let’s discuss setting objectives first.  Consider personalized advertisements on a news publisher’s site.  What are the right goals?

Similarly, in book or video recommendation systems, some could say good recommendation systems distract society from more important pursuits by enticing people to read yet another novel, watch another, say, cute cat video, or have an erroneous fact reinforced.  So, big data approaches vary in their value according to how unambiguously good their objective functions are.

Even spelling correction, as benign as it seems to be, may have doubters like Nicholas Carr in his 2008 piece “Is Google Making Us Stupid?”  I truly doubt, though, that the disadvantages of spelling correction outweigh its advantages, particularly since auto-correction may also teach us spellings.

At this point, given the recent election cycle, it’s also worth noting how clearly this cycle demonstrated great differences in setting objectives.  Populism, progressivism, conservatism, and libertarianism are perhaps just different overarching objective functions, so the difficulty in reaching societal consensus certainly impacts our ability to build or train some types of computer systems.

Tolerance of error is my second most important determinant of the applicability of many big data approaches.  Recall that information systems were initially sought after because of their precision in dealing with large amounts of data (e.g., census, payroll, trajectory, etc).  But, big data approaches typically provide answers that are only approximate, so the applications that work best should tolerate imprecision.

On a mapping system, the approximate speed of cars on nearby congested roadways is very valuable to drivers, but it’s not a matter of life and death if estimates were to be wrong or out-of-date.   But, on the other hand, medical diagnoses need to be very accurate.  While it’s true that doctors are imperfect, we need to strive for machines that nearly always make the right call from the available data.  Furthermore, I’m pretty sure society will hold our machines to a higher standard than it would hold humans for all sorts of reasons.

The specific use to which a system is put may govern how tolerant of failures it is.  Many years ago, I was approached by the New York Department of Health inquiring as to whether Google Translate (then supporting about 50 languages) could help New York City’s multilingual population understand drug labels, which were typically provided in only a few languages.  While we at Google felt that Translate would usually provide a valuable service to people, we ultimately demurred as we did not then feel our automatic translation was accurate enough for instructions on medication.  I believe Google Translate is far more accurate today, but I think it’s still questionable whether it’s “good enough” for drug advice.

As another example, web search, as perfected as it seems to many, is a very complex product, made even more complex due to consideration of objectives and failure tolerance.  Here are just two challenges:

  1. What should a search engine return in response to a political query?  What balance?  What focus on the likely veracity of the source?  Should the pre-existing views of the searcher be taken into account?  Many would disagree on the objectives.
  2. In terms of poor results, the reputation of a search engine would go way downhill if even a small percentage of the search results were atrocious.  For this reason, many search engines have used algorithms, tuned with considerable human labor, that are designed to prevent problematic results.  Certain types of errors are easily acceptable; for example, if I type “cataract” and get information on eye disease, not waterfalls.  But, others are quite bad.  In 2004, “miserable failure” returned “George Bush” as a result, due to Google’s inability to prevent a kind of abuse, termed a “Google Bomb.”  This no doubt offended a large part of the public.  This illustrates that big data approaches need to consider rare downsides and, increasingly, ones perpetrated by bad actors.

The third topic I’ve mentioned previously is causality.  With lots of data, it is relatively easy to find correlation, but much more challenging to get to causality and understanding. It's hard in matters of health and disease; it's hard in physics; and frankly, it’s just hard.  In health, many factors may be influenced by some initial cause, so these factors will prove correlated.  But, this doesn’t mean that one factor causes another.  In issues of health, diseases may have many contributing causes (genetic predisposition, as well as patient history) making the underlying analysis even more challenging.  Also, causal sequences may be very long, but medicine may need to understand the various steps to understand how to effect treatment.

Answers from big data approaches are even harder to come by when one is trying to optimize an outcome and when it’s hard to gain agreement on the best outcome and one needs to know cause and effect to achieve the outcome.  Take global warming.  It was fairly easy to show warming trends, and it was clear these trends were correlated with CO2 levels, but there were other correlates, as well. (Some had previously believed that solar activity was implicated.)  There has been disagreement among economists as to how serious the problem is in economic terms, due to an almost philosophical debate about setting the discount rate that should be applied to the impact of global warming effects.  That is, if we care less about future problems (perhaps, because in decades or centuries, cheaper solutions or mitigations may arise), then we would worry much less about future adverse impacts, if eliminating them were to affect our well-being today.  If, on the other hand, as Nicholas Stern argues, we have a moral obligation to use a low discount rate, the optimizations needed would be very different.  Finally, many interventions are extremely difficult to understand due to the breadth of implications and costs.  
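
The sensitivity to the discount rate described above can be illustrated with a short present-value calculation. This is a minimal sketch: the damage figure, the 100-year horizon, and the two rates are hypothetical values chosen for illustration, and they are not attributed to any economist mentioned in the talk.

```python
def present_value(future_cost, annual_rate, years):
    """Value today of a cost incurred `years` from now, discounted annually."""
    return future_cost / (1.0 + annual_rate) ** years


FUTURE_DAMAGE = 1_000_000_000_000   # hypothetical: $1 trillion of future damage
HORIZON_YEARS = 100                 # hypothetical time horizon

for rate in (0.01, 0.05):           # hypothetical low and high discount rates
    pv = present_value(FUTURE_DAMAGE, rate, HORIZON_YEARS)
    print(f"rate {rate:.0%}: present value about ${pv / 1e9:,.0f} billion")
```

With these made-up inputs, the same future damage is worth about $370 billion today at a 1% rate but only about $8 billion at 5%, so the choice of rate alone can change by a large factor how much present-day sacrifice appears justified.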

[Slide 25 - Eleven Data Science Challenges]

Considerations like these and others lead to my list of the Ten big data Science Challenges.  But, for this series of talks this Fall of 2016, I have added a bonus eleventh challenge.  There may be more I should include, and I would love to hear from you about these.  My email is

[Slide 26 - Risk 1: Privacy, Security, and Resilience]

First on my list of ten is, “Privacy, Security, and Resilience.”   It’s not surprising that I put privacy first, as many refer to it as the major problem in big data.  But, actually, I don’t think privacy is the leading problem, in part, because it gets so much attention already.  However, I acknowledge great privacy challenges, particularly given the significant security threats I’ll discuss shortly.

But turning to privacy per se, the challenges all start with the fact that big data systems thrive, as I discussed earlier in the talk (Slide 5), on having lots of information.  While data can be moderately well-anonymized, and their uses carefully vetted, both of these activities are challenging.  For example, while a recommendation system may learn my (potentially private) taste in movies, that information can be used in collaborative filtering applications in ways that do not reveal anything about me.  For example, a system would only use recommendations from me anonymously and then only in conjunction with recommendations from a sufficient number of other people so that my identity would be masked.  The latter is important so that a recommendation, when coupled with exogenous information about me, would not allow the deduction of personal information about me.  Without question, it takes very careful analysis by privacy experts to prevent the leak of private information.  As someone who has been responsible for the launches of numerous systems based on big data, I can say that the techniques for maintaining privacy are becoming better, but the problem is still quite irregular and not so amenable to cookbook solutions.  Thus, significant creativity and care need be used in each application.  This does indeed make for the potential for error.
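
The idea of using one person's recommendations only "in conjunction with recommendations from a sufficient number of other people" can be sketched as a minimum-contributor threshold on aggregated ratings. The function name, the data layout, and the threshold of five are hypothetical, and a threshold like this is only one ingredient; as noted above, real systems need careful analysis by privacy experts.

```python
from collections import defaultdict

# Hypothetical threshold; a real system would choose this with privacy review.
MIN_CONTRIBUTORS = 5


def aggregate_ratings(ratings, min_contributors=MIN_CONTRIBUTORS):
    """Average ratings per item, releasing only items rated by enough distinct users.

    `ratings` is an iterable of (user_id, item, score) tuples. User ids are used
    only to count distinct contributors and never appear in the output.
    """
    scores = defaultdict(list)
    contributors = defaultdict(set)
    for user_id, item, score in ratings:
        scores[item].append(score)
        contributors[item].add(user_id)
    return {
        item: sum(values) / len(values)
        for item, values in scores.items()
        if len(contributors[item]) >= min_contributors
    }


ratings = [(f"user{i}", "popular_film", 4) for i in range(6)]
ratings.append(("user0", "obscure_film", 5))   # only one person rated this item
print(aggregate_ratings(ratings))               # the obscure item is suppressed
```

The item rated by a single user is withheld because publishing its "average" would reveal that user's individual rating.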

Many privacy advocates would like a system to discard the bulk of the data it keeps, because if data is truly discarded, there would inevitably be fewer risks of leakage.  Big data systems developers, however, have a hard time accepting the deletion of data for two reasons.  First, lots of historical data can be important for reasons that are not immediately apparent.  For example, comparing present with historical activity patterns can reveal whether a system is exhibiting suspicious activity – perhaps a cyber-attack.  Second, people devise new, valuable uses of data that were not considered at first.  The best example is that Google used search log data in 2008 to predict the severity of flu outbreaks based on crowd-sourced search terms.  (See “Detecting influenza epidemics using search engine query data”. Nature 457 (7232): 1012–1014.)

A great complication in issues relating to privacy is that some people are not very concerned about certain things, but are very concerned about others that are in some ways very similar.  Privacy concerns are individual and situational.  There are situations when we would want to control the dissemination of almost any information, like a location, an activity, or a purchase, and other situations when we would readily broadcast it.  Dating someone or seeing a particular movie might be completely public or highly private.  So, it is very hard to make ethical, legal, and technical evaluations of privacy objectives.

Security is a different, but related, issue.  While privacy usually deals with the controlled protection and release of data, security focuses on issues relating to attacks or system misuse that would deny or interfere with a system’s proper operation or would cause the data to be divulged.  Providing security is truly hard for many reasons including (1) political and economic motivations for bad actors, (2) the complexity of computer systems, and (3) fallible humans who program, use, or operate systems.  So, with so much data and processing, there is always the possibility of theft, interference with a system’s correct operation (even downright sabotage) or attacks that cause outages.  As systems become ever more important to the world, these attacks become ever more worrisome, and perhaps, the motivations for perpetrating them are increasing concomitantly.

Relating to the growing importance of computer systems in more and more endeavors, the last issue on this slide is resilience.  Here optimization can lead, in some cases, to less resilient systems because they're operating near their theoretical maxima.  When there is not enough dynamic headroom, failures such as the Tacoma Narrows Bridge disaster can ensue.

I have a personal example that really happened last night.  Last night, before giving this talk on big data, I went into the hotel at 11:50 p.m. to find it overbooked.  I assume the hotel must have used historical data to optimize hotel revenue with, shall we say, a little less resilience than was needed. I believe I’ve been subject to overbooking only about three times in circa 1,000 nights staying in a hotel over the last 30 years, so maybe the hotel chain got this right.  And, I don’t know how modern the hotel’s algorithm was, but it wasn’t resilient enough for me to sleep well.
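
The trade-off between revenue optimization and headroom in the overbooking anecdote can be made concrete with a simple binomial model. This is only a sketch under stated assumptions: guests are assumed to show up independently with the same probability, and the capacity and show-up probability are hypothetical values, not data from any hotel.

```python
from math import comb


def prob_overbooked(capacity, bookings, show_probability):
    """P(more guests arrive than there are rooms), assuming independent guests.

    The number of arrivals is modeled as Binomial(bookings, show_probability).
    """
    return sum(
        comb(bookings, k) * show_probability**k * (1 - show_probability) ** (bookings - k)
        for k in range(capacity + 1, bookings + 1)
    )


CAPACITY = 100            # hypothetical number of rooms
SHOW_PROBABILITY = 0.9    # hypothetical chance that a booked guest arrives

for bookings in (100, 105, 110, 115):
    risk = prob_overbooked(CAPACITY, bookings, SHOW_PROBABILITY)
    print(f"{bookings} bookings -> P(overbooked) = {risk:.4f}")
```

Each additional booking beyond capacity raises expected revenue but also raises the chance that someone arrives to find no room, which is the loss of resilience described above.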

Relating to both privacy and security, the debates in 2015 and 2016 on NSA or FBI metadata collection (end-point information on phone calls or chats) or the pros and cons of robust, on-device encryption of user data on iPhones illustrate the complexity of the issues.

Many are concerned about potential attacks on the automatic control systems for critical infrastructure on which we all depend.  As information technology backbones based on large-scale data are also optimizing systems operations in this realm, there are possible concomitant risks due to reduction in excess capacity, or headroom.  For example, will automated, delay-minimizing routing systems, which sometimes reduce congestion by diverting traffic to secondary roads, use up the margins of safety on the road system so that there is insufficient spare capacity to deal with serious perturbations?

[Slide 27 - Risk 2: Technical Challenges]

Second on my list of ten are the “Technical Challenges” relating to machine learning.  

Many issues arise.  For example, learning from past data presupposes the future will be somewhat like the past.  The past is often a great predictor of the future, until something happens, such as a so-called regime change in my finance-related world.  Then, it no longer is.  And even if the past is generally predictive of the future, there is always the risk of learning too much and allowing noise to influence outcomes.  This is an instance of what is referred to as “overfitting.”
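
A small experiment can illustrate what "learning too much" looks like. The sketch below compares a model that memorizes every noisy training label with a straight-line fit; the data-generating rule (y = 2x plus Gaussian noise), the sample size, and the seed are all assumptions made for this illustration.

```python
import random


def mean_squared_error(predict, points):
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


def sample(n, rng):
    """Noisy observations of the hypothetical rule y = 2x at x = 0..n-1."""
    return [(x, 2 * x + rng.gauss(0, 1)) for x in range(n)]


rng = random.Random(0)
train = sample(200, rng)
test = sample(200, rng)      # same x values, fresh noise

# Model A "learns too much": it memorizes every training label, noise included.
memory = dict(train)


def memorizer(x):
    return memory[x]


# Model B fits a straight line by least squares, which averages the noise away.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x


def line(x):
    return slope * x + intercept


for name, model in (("memorizer", memorizer), ("line", line)):
    print(name,
          "train MSE:", round(mean_squared_error(model, train), 3),
          "test MSE:", round(mean_squared_error(model, test), 3))
```

The memorizer has zero error on the data it was trained on, yet in expectation its error on fresh data is about twice that of the line, because it reproduces the old noise in addition to facing the new noise.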

There are also technical issues associated with machine learning algorithms themselves.  In many cases, scientists don’t fully understand why the algorithms work, something particularly true in the case of neural networks.  

Another problem is that data scientists often focus on optimizing a mean, or average, result.  In many problems, we can find good average results, but it's very difficult to prevent poor outcomes, and the mathematics often don’t really support a good understanding of results’ distributions.  But, it may not be okay for even a small percentage of results, at least in some domains, to be truly awful.  
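
To see how an average can hide a truly awful outcome, consider two hypothetical systems whose error lists are invented for this illustration. Both have the same mean error, and the second even has the better median.

```python
import statistics


def summarize(errors):
    """Return (mean, median, worst-case) for a list of per-result errors."""
    return statistics.mean(errors), statistics.median(errors), max(errors)


# Hypothetical per-result errors for two systems (made-up numbers).
steady_system = [1.0] * 100                # always slightly off
spiky_system = [0.5] * 99 + [50.5]         # usually better, occasionally terrible

for name, errors in (("steady", steady_system), ("spiky", spiky_system)):
    mean, median, worst = summarize(errors)
    print(f"{name}: mean={mean}, median={median}, worst={worst}")
```

Judged by the mean alone the two systems are indistinguishable; only the worst case reveals the one result in a hundred that may be unacceptable in some domains.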

Relatedly, statisticians may assume that input data is selected from a normal distribution or that their predictions should follow such a distribution.  But as I mentioned previously and Taleb states even more clearly in The Black Swan, this is problematic.  Real-world distributions are often quite different than ones which are mathematically tractable.

Turning to a completely different category of problem, there is the challenge related to the speed at which systems can learn.  Somehow, we humans can often learn the right lessons from a small set of experiences.  When we teach a child, we may only need to say something once or twice for the child to learn.  

However, machine learning systems may require thousands of cases to be trained.  And, if the systems could be tuned to learn very rapidly, it is possible they would learn many wrong things; in any case, they are certainly at risk of abuse.  Microsoft’s Chatbot in Spring 2016 apparently was too quick to learn, and learned hate speech, before being shut down by Microsoft.  

The scale of systems may, in some instances, also be a technical impediment.  There are challenges relating to the acquisition of data (say, at a vast rate of elements or bytes per second), the storage of the data (perhaps, requiring peta or exabytes of storage), the processing (measured in numbers of CPU cores or a correlate such as power consumed or data center square footage), or the number of people needed to design, build, and then operate the system.  In many domains, issues of privacy, regulatory compliance, system integrity, and availability requirements add to the engineering complexity of big data systems.

Traditionally, Moore's Law (the ability to reduce the size of transistors geometrically) has allowed scale issues to be surmounted.  However, it’s proving much harder to shrink transistors for many reasons, so it’s getting harder and harder to eke out such easy cost-effectiveness improvements.  Amazingly enough, the amount of data may be growing even faster than Moore's Law in certain domains, as this NIH picture shows.  Time precludes me from discussing this too much. 

[Slide 28 - Risk 2: Technical Challenges (2)]

This slide shows challenges related to accuracy, some of which can be amusing.

Imagine a computer transcription system that transcribes a voice message after a human repeats a phone number to gain more certainty.  One can easily imagine an erroneous transcription (with an extraneous sound interpreted as the digit “5”) to be: “My number is area code (626) 523-8023.  Once again that number is (562) 652-3802 free.”  

One week into a new job, I was using gesture-typing (another place where machine learning is used) and I wanted to tell my new assistant that I would call him from the car after leaving a meeting.  The message, which the system recorded as “I'll call you when I'm on the can,” was rather embarrassing.  

I’m sure you can find better examples of these on the web, but they are usually due to a lack of some commonsense processing by machine learning algorithms.

Perhaps you saw this problem with neural networks: Nguyen et al. published a paper entitled “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” showing how easy it is to fool deep neural networks used for image recognition.  The one with yellow and black stripes, for example, is deemed to be a school bus.  That these systems can be easily spammed is problematic in some domains, and they invite abuse.  For example, it may be possible to flood a spam detection algorithm with false positives, thereby preventing it from finding truly abusive posts.  In fact, I’m told, though I have not independently confirmed, that kids in the Bay Area are holding up road signs that confuse self-driving cars.

I do believe there will be solutions to these problems, but they may require the addition of other approaches beyond the machine learning typically used.  Since 2001, I have been terming my call for this need to combine multiple approaches (including more semantically-based algorithms), “The Combination Hypothesis,” which states that intelligence requires a collection of somewhat orthogonal approaches to be applied.  Like in multi-drug therapy, each approach gets only so far in getting the right answer, but when combined, the residual error from the approaches tends towards zero.  The use of multiple approaches certainly reduces brittleness.
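
One way to read the claim that residual error "tends towards zero" is the following simplified model, which is an assumption of this sketch rather than a statement of the Combination Hypothesis itself: each approach fails independently, and the combined system fails only when every approach fails on the same input. The miss rates are hypothetical.

```python
from math import prod


def residual_error(miss_rates):
    """Probability that every approach fails at once, assuming independence."""
    return prod(miss_rates)


# Hypothetical error rates of three somewhat orthogonal approaches.
MISS_RATES = [0.10, 0.20, 0.30]

for count in range(1, len(MISS_RATES) + 1):
    combined = residual_error(MISS_RATES[:count])
    print(f"{count} approach(es): residual error {combined:.3f}")
```

In practice the approaches are rarely fully independent, and a system also needs a way to tell which approach is right, so real gains would be smaller than this idealized product suggests.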

[Slide 29 - Risk 3: Problematic Data]

A potential problem with all big data approaches is that of problematic data.  An old 1950s adage in computing is “garbage in, garbage out.”  While this is not exactly true with some big data approaches which tolerate very noisy data, one still needs the incoming data to meet certain criteria if good outcomes are to result.

Here are some of the problems:

One last concern is that learning from data available today may create a kind of inertia based on the actions and norms of today, even if they are not what we ultimately want.  I observe, metaphorically, that if you learn from the present, you may imprison the future.  Amit Datta et al. wrote about this in “Automated Experiments on Ad Privacy Settings: A Tale of Opacity, Choice, and Discrimination.”  In their article, they noted that if you look at click rates on ads, rates for certain ads may be higher for males.  It would be natural for ad selection algorithms to use this information to bias the presentation of ads on properties more likely to be of interest to males.  If those ads were for higher-paying jobs, a big data approach to ad recommendations could unintentionally perpetuate an existing societal tendency.  I found it a most interesting article.

[Slide 30 - Risk 4: Ownership and Liability]

The fourth issue I illustrate concerns questions of ownership rights and liability.  

The first is the ownership of collected data.  Each of us, every time we use most any computer application, will likely leave behind a few tidbits of information.   As I mentioned, whenever we interact with a search engine, we are almost certainly contributing some small amount of information that adds to the spelling correction system.  

So, the question arises, should we have some ownership stake in that information?   On the one hand, the little bit of information is, after all, just a little bit of information; perhaps a tiny reinforcement of a word’s correct spelling, which may already have been reinforced by countless interactions.  But, on the other hand, there may be great value in the aggregate.  The debate on this has real implications, particularly in the realm of medical research, where a modest amount of data from a moderate number of patients could result in a highly valuable, and profitable, therapeutic advance.  Some medical researchers are arguing that restrictive policies here are greatly slowing the rate of progress in medicine.

The second topic relates to liability.  Programmers will create algorithms, data will be generated and collected, and systems will combine the algorithms and data, and humans will own/use/operate the systems.  Resulting operation won’t always be perfect, and very harmful results may sometimes ensue.

Where will liability fall?  Self-driving cars or medical treatments are good domains to consider.  In the case of autonomous vehicles, data will be gathered from many sources to accurately model road networks, traffic patterns, and regulatory rules.  Algorithms will utilize this data to produce the logic to operate vehicles.  The logic will then be bundled by producers into vehicles, which are sold to owners who will use them.  There are analogs in medicine.  How will our legal system attribute blame and allocate damages in this complex, multi-stage system?

Relatedly, there is another problem.  Humans make mistakes all the time and reasonable liability accrues.  And while big data-advised computer systems may make far fewer mistakes, they are likely to still be imperfect.  But, because those systems are now automated (with concomitantly large societal expectations) and many deep-pocketed companies may build them, there may be a societal tendency to allocate excessive liability to failure.

Again, consider self-driving cars.  While one can easily imagine a factor of 10 reduction in fatalities in the United States, what would the liability be for those remaining 3,000 deaths?  If a jury tries Tesla, General Motors, or Google (as opposed to an individual), could the liabilities be so great that the penalties dis-incentivize innovations that are, on the whole, quite valuable?

Finally, if almost everyone on a highway speeds, a big data system may learn a speed limit that is different from the posted speed limit.  The higher speed may be safer and necessary for an autonomous vehicle to use.  But, when the vehicle is stopped by the police, I suppose the vehicle could list all the others it has learned from and perhaps the fine could be allocated fairly to us all.  In fact, perhaps, we are all a bit guilty and the system is just making that explicit!

[Slide 31 - Risk 5: Explanation]

As we consider intelligence, we commonly want to know why a particular decision has been reached.  It’s common in day-to-day human interactions to ask why someone has reached a conclusion, and that reasoning is often as important to us as the conclusion itself.    

This directly leads to one of the most common problems in many systems that utilize big data:  they cannot answer the question of “why.”  Some have wondered if you can have intelligence without explanation.  Looking a little deeper, there are many reasons why “why” is important.

  1. If a system is not functioning well, from a developer perspective, not being able to ask “why” a machine learning system is making an error is a significant problem.  In response to someone trying to debug an erroneous result, one can only inspect a largely uninterpretable, vast matrix or network with lots of coefficients.  To fix errors, one would need to retrain a whole system with new (and perhaps better) data and features or to manually add rules or other patches to its operation.   This is a true challenge that has reduced the spread of machine learning in some domains.  

  2. As a related problem, with no true understanding of how some machine learning systems work, it’s difficult to augment them with other semantic knowledge.  How does a developer of a system with a big neural network with a lot of coefficients instruct it with additional common sense knowledge?  For example, we might want to add, "by the way, a school bus must be a vehicle and have characteristics beyond yellow and black lines", or "don't recommend pork to someone who's observant of the rules of Kosher food."

  3. In many applications, it’s essential to also explain to users why a system is operating as it is.  In recommendation systems, a user will be much more likely to benefit from good recommendations and to forgive bad ones if the system can explain the “why.”  In the latter case, a user could react, “that’s a really silly conclusion, but I see why you made that inference.”  While some rudimentary explanation may occur in limited domains (e.g., as mentioned previously regarding music recommendations), such explanations are quite simplistic.

It’s worth noting that when many big data algorithms learn from a relatively small set of features, explanation becomes much easier.  In the simplest of cases, if one built a system to recommend restaurants based merely on the price of previous restaurants someone had frequented, the rationale is pretty obvious.  It’s when the space is more complex that the “why” becomes so difficult.

  4. Societal acceptability of conclusions is also weakened if there is no “why.”  For example, machine learning researchers have looked at whether machine learning algorithms can do better than parole boards at predicting recidivism. Researchers have shown that in some circumstances, automated systems have proven better, but they cannot justify their conclusions.  I question whether society or prisoners should accept such decisions, particularly where it’s inevitable that things will sometimes go wrong.  

As a footnote, I do note that if one creates very simple machine learning systems with relatively few signals, they may be very interpretable as to why they are making their decisions.  Also, some, including Cynthia Rudin at MIT, are very concerned with explanation, even focusing on this recidivism example, and are looking into solutions.

  5. From a legal perspective, a system might be allowed to undertake some action only if that action is done without the use of an inadmissible or illegal rationale.  In addition, legal liabilities are often based on the care with which an action or a decision is made, not just the particular decision or action.  Or consider a self-driving car which unavoidably does some damage.  Mightn’t it be necessary in a court of law to explain how it reasoned through its alternatives and how it chose the least harmful one?

[Slide 32 - Risk 6: Replicability]

Turning to one important use of big data: hypothesis generation and confirmation in science. One needs to publish results and inspire others to validate them to better ascertain the truth.  Thus, data must be available, interpretable, and usable. But there are challenges:  data sets are huge and they may be very hard for others to interpret, as they have many unstated assumptions.

Furthermore, data is often proprietary or private.  For example, Google Flu Trends demonstrated that search requests could illustrate in near real-time the incidence of flu outbreaks.  While there's been some debate about its value, this type of surveillance, perhaps for other problems, would seem useful.  However, Google found it difficult to release the raw data it used for validation or experimentation by others for two reasons.  The search logs are exceedingly private.  And, the details of which search terms were correlated with the flu (and what abuse-rejection mechanisms were used) would open the system up to abuse.  Thus, even though the idea has much potential, it had these two flaws that were difficult to surmount.  I note that with lots of additional work, Google did eventually find a way of releasing aggregate data to scientists, but problems remained.

So, in summary, big data systems hold much promise for science, but privacy, commercial concerns, and the difficulty in truly understanding the precise meaning of data add challenges.

[Slide 33 - Risk 7: Causation]

It’s time now to turn to the topic of cause and effect.

It goes without saying that correlation does not demonstrate cause and effect.  All sorts of correlations have been shown in medical studies, but in many cases changing one correlate has no effect on the other, so that correlate is not causal.  And even where there may be cause and effect, understanding a causal chain may be needed to fully understand what is happening and to address a problem completely.  Even thousands of years ago, Plato in the Phaedo wrote about this problem, discussing “the causes of each thing; why each thing comes into existence, why it goes out of existence, why it exists.”

When we look at medicine, everyone is hoping big data will be the key to solving many health problems, and it will help, but it’s not so easy.  As some examples, it’s nice to know who is at risk for heart disease, but it’s much more interesting to know how to intervene.  You almost certainly need to know the causal chain, the sequence that leads up to plaque, and then see which steps could be interrupted.  At minimum, understanding will require a lot of data at many levels of biomedicine.  The words on this slide, “Genome”, “Epigenome”, “Transcriptome”, “Proteome”, “Cytokines”, “Metabolome”, “Autoantibody-ome”, “Microbiome”, “Exome”, are just some of the bases for disease.  The list of possibilities is growing, regrettably, very fast, making big data approaches much more complex.

[Slide 34 - Risk 7: Causation (2)]

In history, there are lots of examples of significant errors which illustrate the problem.  The Nurses’ Health Study was a well-funded, long-term observational study that had many good results.  However, one bad one was that it led to the view that estrogen replacement therapy would effectively lower the risk of heart disease in postmenopausal women.  The conclusion would seem to have made sense: (1) It seemed plausible that a change in estrogen could be implicated in heart disease occurring after menopause, and (2) the data seemed to show that women who had estrogen replacement had less heart disease.  But, it wasn’t right; there must have been another reason (most probably, selection bias) and the medical profession no longer recommends hormone replacement therapy for this purpose.  
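
How selection bias can manufacture such a correlation can be shown with a small stratified table. All counts below are hypothetical and invented purely for illustration; they are not data from any study. In this made-up cohort the therapy has no effect at all within either group, but health-conscious patients are both more likely to take it and less likely to develop heart disease.

```python
# Hypothetical counts, invented for illustration: (patients, heart-disease cases).
COHORT = {
    "health-conscious": {"treated": (800, 40), "untreated": (200, 10)},
    "less health-conscious": {"treated": (200, 40), "untreated": (800, 160)},
}


def rate(patients, cases):
    return cases / patients


def pooled_rate(arm):
    """Disease rate for one arm when the two groups are lumped together."""
    patients = sum(group[arm][0] for group in COHORT.values())
    cases = sum(group[arm][1] for group in COHORT.values())
    return rate(patients, cases)


for name, group in COHORT.items():
    print(name, {arm: rate(*counts) for arm, counts in group.items()})
print("pooled", {arm: pooled_rate(arm) for arm in ("treated", "untreated")})
```

Within each group the treated and untreated rates are identical (5% and 20%), yet the pooled comparison shows 8% for treated versus 17% for untreated, an apparent benefit that comes entirely from who chose the therapy.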

The challenge seems just as great with diet.  In Congress, in mid-2015, there was (amazingly) a bipartisan collection of complaints about federal dietary guidelines, due to their frequent and inconsistent changes.  In effect, Congress asked why we are advising Americans to do different things every few years with respect to food.  Interestingly, the 2015 dietary guidelines will not bring forward the recommendation to lower cholesterol intake because there's no appreciable connection between consumption of dietary cholesterol and serum cholesterol.  We had been told for years to worry about butter, but we are now told not to worry, at least as it relates to butter’s cholesterol content.

As recognizing the difference between correlation and cause and effect is so important, it’s worth a few summary points:

  1. As I’ve said earlier, sometimes correlation with no understanding of cause and effect is sufficient.  My prototypical example is spelling correction but there are very many others.
  2. But, frequently we need to know cause and effect, either because we genuinely want to understand, or because the intervention we want to make will not otherwise be successful.

As I will describe in a little while, a lack of understanding and diligence on this topic (by the public, the press, and scientists alike) is as big a problem as are many underlying scientific challenges.

[Slide 35 - Risk 8: Free Will]

We are all probably interested in the topic of free will.  To what extent are our actions driven by our innate biology, our training, and events up to the present, and to what extent are we really free actors?  While we’ve all wondered about this, we almost always believe in freedom to choose, because (1) we frequently see the various options available to us, (2) we feel we can choose worse or better actions, and (3) it’s a far more pleasant mindset.

Now, of course, much of the use of big data (in fact, its single most valuable use to date) has been to make computer systems present information to a user that is customized to his or her wants or needs, particularly ads.  This is called behavioral targeting, and it is interesting to consider whether it is good, bad, or some of each.

When our son Ben was born in 2000, only 17 months after our twins, Emily and Asher, my wife and I were just not interested in sports car ads.  I wouldn’t have believed it earlier in life, but we wanted to see mini-van ads.  No doubt, we would have relished behavioral targeting had more of it been happening in this fairly early internet era.

On the other hand, some feel that behavioral targeting can tempt us to do things we shouldn’t be doing.  The previously mentioned book, Phishing for Phools, details this argument.  Should we reduce temptation at supermarket checkouts so that (perhaps) we have fewer obesity issues in America?  In the realm of politics, many wonder whether big data perniciously influences the views of the population, perhaps polarizing subgroups by reinforcing views they already have or perhaps influencing voter turnout.

In a sense, this is an old story.  There have been newspapers publishing polarizing points of view for a very long time, but something about using lots of data to craft a highly-customized point of view feels different to me.  Differences might include (1) the scale at which we can now do this, (2) a great reduction in the barriers to entry of publishers, (3) a greater ability to customize a story with much knowledge of the individual who will read it, (4) the cross-border nature of the internet with a resultant reduction in impact of cultural norms, and (5) the ability to quantify impacts and hence rapidly tune a system to be ever more effective.

A Facebook experiment in 2012 is a well-publicized example.  Facebook manipulated its news feed to see if it could quantify the impact of different types of stories (happy or sad) on individuals.  That ability to quantify the impact seemed to make it much more manipulative than a newspaper trying to sell issues by putting a particular headline on the front page.

One thing to note, and I think this is a subtle and not-so-well-understood point:  the issue of big data’s impact on free will is largely different from the issue of privacy.  As a proof point, I note there was no issue in the Facebook controversy regarding data privacy.  No individual information leaked.  And despite the immense number of cookies stored on our systems and the ensuing behavioral targeting (with many billions of dollars of impact), very few privacy problems have occurred.  So, I argue one must consider issues of privacy separately from the general topic of behavioral targeting.

When I said “largely different” above, I do note that someone could watch us and learn about us from the way a system interacts with us; someone might observe us being presented with a minivan ad and deduce we have children.  In this instance, a system’s behavioral adaptation would have privacy implications, so many of these issues are slightly interlinked.

I do hope I’m conveying neutrality on the topic of big data’s impact on free will, as I’m very mixed about it.  It’s hard to argue against optimization.  But, while Socrates said, "the unexamined life is not worth living," in modern parlance, we could ask whether we should all be so examined, instrumented, measured, and optimized.

[Slide 36 - Risk 9:  Societal Understanding]

When I pick up the newspaper or magazine every day, I now see a vast number of stories using new data purporting to prove a point, often accompanied by wonderful charts and statistics.   I wish I had a data-oriented journalism student who could precisely document this change in quantitative reporting and the growth in wonderful charts and statistics.  (And, yes, I see the recursive challenge.)

A few years back, I was very hopeful that the availability of data would yield more truth in reporting.  But, I fear we’ve just created more noise in the form of either meaningless data or unwarranted, but seemingly compelling, conclusions.  There was Darrell Huff’s book, which I read a long time ago, called How to Lie with Statistics.  I’m afraid that with the proliferation of data and new techniques for analysis, presentation, and dissemination, we have even more serious problems of being misled.

On the morning when I first gave this talk, I was working out in the hotel gym and was trying to distract myself from boredom by watching the cable news channels.  Remarkably, on the four cable channels that were available, there was virtually nothing being broadcast except for reporting on the poll results for the various Republicans running to be President. There was nothing about policy.  There was nothing about the candidates’ experience or truthfulness.  Instead, the commentary was typically of the form that a poll had said such-and-such, and because the poll had changed by so-and-so, some pundit did or didn’t believe in some future scenario for a candidate’s success.  Polls are not new, but perhaps they are now so (relatively) easy to undertake and process in real time that there is a tendency to report on them rather than on matters of deeper substance.  And, I do fear that many of us can be taken in by leaderboards and statistics.

For sure, the morning reporting did practically nothing to inform me, and it seems to me that it turned the Fall 2015-2016 primary cycle into a kind of game show.  So, when watching TV news (with due respect to Karl Marx), it seemed that the ability to do incessant polling and instantaneous reporting had led to a potentially dangerous result:  big data being a present-day “opiate of the masses.”

Even more seriously, few really focus on the difference between correlation and cause and effect.  This is true of reporters, the public, and even scholars.  Most everyone has a bias -- if only the natural bias of wanting to reach some conclusion.  People naturally make assumptions, generalize, and use what data they have to reason their way to an answer.  While sometimes the behavior might be deemed misleading or manipulative, it is often well-meaning, but nonetheless erroneous.  The scientists behind the Nurses’ Health Study were trying to help (albeit, perhaps, also to get their papers published), but the standard of truth they applied was apparently not enough.

Little of this is new, but perhaps the stakes are higher because so many have so much data, and interpretation is hard.  You may know all of these problems, but perhaps they are worth repeating.  

  1. Statisticians know the challenges of testing for a serious, rare disease, for example, one that occurs in only one in ten thousand people.  Even a terrific test that always correctly identifies people with the disease and reports a false positive only 0.1% of the time will nonetheless be wrong about roughly 90% of the patients it identifies as having the disease: among ten thousand people tested, it flags the one true case along with about ten healthy people.  Such patients will be subject at least to stress and inconvenience, but possibly also to the expense and risk of follow-on testing and treatment for a disease they don’t have.  Very few of us would know to question a test with such high accuracy and selectivity, though we need to do so.
  2. Similar issues arise in the class of problems termed the prosecutor’s fallacy.  A prosecutor could argue that the chance of a suspect’s DNA coincidentally matching forensic evidence is only one in a thousand.  However, if a prosecutor screens thousands of potential perpetrators’ DNA, he or she will almost certainly find a match.  Such a match would obviously not imply culpability.  Similarly, a prosecutor could justifiably argue that the probability of winning the lottery without cheating is very low, and then go on to say that anyone who wins must therefore have cheated.  Unfortunately, this argument does not take into account the large number of people who play the lottery, making the likelihood of someone winning without cheating very high.
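
The arithmetic behind both items can be sketched in a few lines of Python.  This is an illustrative calculation rather than anything shown in the talk: the function names are hypothetical, and the figure of 3,000 people screened is an assumption standing in for the "thousands" mentioned above.

```python
def false_share_of_positives(prevalence, sensitivity, false_positive_rate):
    """Fraction of positive test results that belong to healthy people."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return false_positives / (true_positives + false_positives)


def chance_of_some_match(match_probability, people_screened):
    """Probability that at least one screened person matches purely by coincidence."""
    return 1 - (1 - match_probability) ** people_screened


# Item 1: prevalence of 1 in 10,000, a test that never misses a true case,
# and a 0.1% false positive rate.
share = false_share_of_positives(
    prevalence=1 / 10_000, sensitivity=1.0, false_positive_rate=0.001
)
print(f"Share of positive results that are false: {share:.1%}")

# Item 2: a 1-in-1,000 coincidental match probability.  The 3,000 people
# screened is an assumed number, chosen only to represent "thousands".
chance = chance_of_some_match(match_probability=1 / 1_000, people_screened=3_000)
print(f"Chance of at least one coincidental match: {chance:.1%}")
```

Under these assumptions, roughly ten of every eleven positive test results are false, and a coincidental DNA match becomes very likely once a few thousand people are screened.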

With all the data now available, it’s unsurprising that these types of issues are prevalent.

There is one more problem I’d like to identify.  Both scientists and the press are eager to reach a conclusion or to publish an article.  Thus, they use language that, while careful, is nonetheless misleading.  Even if scientists are careful to not let biases induce false conclusions, they frequently say that something “may imply” something else, or they “hypothesize” something as true.  The press picks this up and then writes that a particularly respected scientist hypothesizes something consistent with some point of view.  While the reporter might be guilty of at least mild sophistry, we data scientists need to be particularly careful in what we say, and the public needs to read more critically.  These types of issues have big consequences in many ways, whether relating to the denial of climate change, or to the promulgation of other terrible policies.  

All in all, I do very much worry that big data could lead to the reverse of what we all desire:  that is, less informed debate and more erroneous conclusions backed up by various arguments based, at least to some degree, on reams of data and processing.

[Slide 37 - Magnificent Comic from SMBC-Comics ]

To keep this light-hearted, here is a relevant Saturday Morning Breakfast Cereal (SMBC) web comic that my younger son gave me.  Purportedly from the Department of Education, it starts with the (likely realistic) statement that “people are more afraid of shark attacks than of car accidents despite the fact that car wrecks are millions of times more likely.”  The Department of Education boldly proposes to remove the disconnect between belief and reality by working with a consortium of marine biologists to add 100,000 more shark attacks per year, at a cost of just a few million.  The resultant additional shark attacks, if carefully distributed around the world, including to critically under-sharked nations like Switzerland and Nigeria, would ensure the public’s fear of shark attacks is based on rationality.  This satire illustrates the difficulty many have in putting statistics into perspective, and it humorously inverts cause and effect by proposing to modify the effect to make the cause correct.

I’m a good deal less cynical than this and feel that with care and education, we can mitigate this type of risk; but, all of us in the big data world must remember we have a potent force that can be used for both good and bad.  This little humor should again emphasize to us that we scientists have to be very, very careful with what we say, and we have a strong responsibility to contextualize what we say.

[Slide 38 -  Risk 10: Complexity in Setting Objective Functions]

The penultimate issue I’ll illustrate may be the most surprising one of all.   The issue is, simply, the challenge of determining the right optimization goal.  One might think this would be easy, but it’s frequently not.  

Here’s the simplest example I know: consider the problem of applying big data techniques to online advertising.  One might think the objective is clear—say, maximizing revenue, which is related to the number of clicks on ads.   But it’s not at all clear, as one needs to think about what it really means to maximize revenue over a particular period of time.  

If an advertising property, such as a newspaper, were to show ads that are visually alluring, but actually were just “bait and switch,” the advertising property might very well gain many clicks. But the clicks would come at the expense of (1) a reduction in reputation of the property, (2) a decreased propensity for users to click on ads in the long-term (perhaps, fueling the growth of ad-blockers), and (3) perhaps even Federal Trade Commission scrutiny.  The unintended consequences of maximizing clicks may take a very long time to show up.  In general, many objective functions may be very long-term in nature, and it may be very hard to obtain the information needed to create a well-informed, enlightened objective function.

In recent elections, many sites have been accused of misinformation.  Clicks, or revenue, may be a direct goal on those sites, but the targeted spreading of misinformation may well have resulted in misinformed electorates and other problems.  Ethically, these sites would seem highly problematic.  But, those of us who believe in freedom of the press/speech would argue centralized regulation of them would likely be a cure worse than the disease.

A search engine has an even greater problem.  There are many queries where big data approaches cannot completely guide a system to provide the best possible search results.  In response to a query such as, “origin of life,” the weighting of biblical versus scientific sources is a complex problem, even if the search engine creators fully understand the preponderance of scientific evidence, the page-rank weighting, the click-through frequency, and very many other signals.

Setting objective functions might even seem easy when it’s not.  For example, we might think that with self-driving cars (or perhaps any cars), we could use our large-scale data capture capabilities to read all the speed limit signs and just govern cars to go no faster than the speed limit.  I think this would be wrong, because the speed limits were never really intended to be taken so literally in the first place.  As another example, I learned about an NYU project to use big data and machine learning to do large-scale noise monitoring.  While this seems like a good idea, I’m not sure that laws on the books were necessarily meant to be interpreted universally and mechanistically, so I would expect this study will generate, well, some noise.

Both of these examples relate to the problem that the norms and laws of society were, perhaps, meant to be applied with some considerable discretion.  But, big data may allow for the literal enforcement of ever more laws, whether or not that’s best. And, as we have learned, members of society may have differential abilities to avoid trouble, making this type of enforcement potentially unfair.  Using a computer science metaphor, big data may allow law enforcement to break through object (societal) boundaries that used to be opaque and instead scrutinize an ever-larger number of detailed human interactions.

Generally, I fear that many optimization problems are essentially zero-sum games, where one can achieve a particular objective while leaving another objective wanting.  Here is one that was in the press recently.  The Office of Management and Budget has been trying to decide whether it should look at the indirect effects of tax legislation, say, reductions, on economic growth.  That is, they are considering whether they should take into account any offsetting growth in tax revenue that might be due to such a cut.  It would seem plausible that it makes sense to do this, but it’s highly politicized.  Generally, those who want a larger role of government don’t want this effect taken into account, because it might encourage tax decreases and smaller government.  But those who believe in a smaller government will want to take such indirect effects into account because it may encourage tax cuts.  So, the decision comes down not to the question of whether large-scale econometric modeling permits better predictions, but rather to politics.

[Slide 39 - Risk 10: Complexity in Setting Objective Functions (2)]

There are very many areas where this type of dilemma occurs.  Consider educational attainment in schools.  With big data and good measurement, we can hope to truly understand the benefits of certain educational interventions.  We could imagine some interventions that improve average reading scores, some that improve the ability of talented young scientists to achieve their potential, and many others.  But, what’s the goal?  This has always been an issue in school boards, but imagine now that with big data approaches and educational technology we can accurately quantify outcomes. I suspect a school board, if given precise predictions on the impact of differing alternatives for their budget, would have very fractious discussions that might tend to tear it apart.

So, I’m trying to think ahead: even if we could give correct information to policymakers, there is a very real question of whether policymakers could be harmed by data science.

Somewhat relatedly, the Obama Administration wanted to put out scorecards measuring the value of a university education stratified by major.  This did not go over well in universities, as many believed measuring value by any single set of metrics (e.g., earnings potential) would be too one-dimensional and not a good basis for influencing the education of our young.  Many in the humanities and social sciences felt it would be particularly harmful to their area.  

I agree with my university colleagues that this was a poor idea, and I think the project has been mostly scrapped.  One could imagine the authors of the concept had the idea of helping people, say, maximize compensation (perhaps to pay back government-supported student loans), but is the near-term salary maximization best for the individual, the universities, and our civilization?  And, of course, if our government really did motivate students to pursue a particular career focus, would our college population veer figuratively from one side of the ship (as many rush to a certain field) to some other (as the initial recommendation becomes too crowded) in a destabilizing oscillation?

Taken a few steps further, China has been experimenting with a “Social Credit System,” as described in the Economist’s Dec 15, 2016 article, “China invents the digital totalitarian state.”  With this system, government groups have [...]  While it’s unclear how far China intends to go with this, it is clearly considering whether to use big data approaches to influence the breadth of citizen behavior.  While Chinese administrators may intend to optimize some pan-China utility function, the impact on individual liberty could be extreme.

I’ve listed a few other topics where optimization objectives are less clear than one might think.  I’ll sample just a few of these, given time limitations:

[Slide 40 - Differentiated Benefit]


In 1958, a British sociologist named Michael Young wrote a short novel called The Rise of the Meritocracy.  In the book, he coined the term, and defined it as “intelligence plus effort.”  Perhaps, we’d say “intelligence plus skill and effort,” but I think that’s what he meant.

The book was definitely in the mainstream of liberal political philosophy.  Contrasting a merit-based system with hereditary aristocracy, the topic must have been of considerable interest in the UK at that time.  Here on this side of the pond, it was only two decades earlier that Harvard President James Conant advanced the SAT aptitude tests as a basis for moving elite college admissions to a more meritocratic system.

Merit seemed a perfectly reasonable organizational principle.  The traditional farmer, who got up early and was a little stronger, may have gleaned somewhat more food, but others would not begrudge him, because his benefits seemed proportional.  Contrast this with a few entrepreneurs armed with talent and skills in computing and data science.  With minimal capital and the huge leverage of technology, many ventures have enormous economies of scale and winner-take-all attributes.  While I may not agree, some may judge the huge return to such entrepreneurs to be, in some sense, unfair.

As some of you know, Michael Young’s book was actually a satire about meritocracy run amok.  He was worried about sociological implications if meritocracy were pushed too far.  I posit that computation (with rapidly growing opportunities for automation) and data science need to be considered in light of their effect on our meritocracy.  Meritocracy seems so very right to me, but we technologists need to think long-term about our impact on it.

[Slide 41 - Big Data Leads to Societal Challenges]

I hope in the previous sections, I’ve illustrated how big data, or empiricism, is changing computer science.   And, that change is permitting all sorts of new applications of computing technology to benefit the world.  I hope I’ve also illustrated that big data and machine learning will also create a lot of societal challenges.  

I hope, by now, you could ask many more such questions, and that you will agree that good outcomes, at least in many domains, are difficult to determine.

[Slide 42 - Mitigation and Recap]

I have limited time, but I’d like to wrap up with two remaining slides.

[Slide 43 - Mitigation]

To understand data science well, I believe we have a truly transdisciplinary venture.  Many of us, humanists, ethicists, lawyers, social scientists of all types, statisticians, computer and other scientists (and more) need to look at the opportunities and risks and help come up with the best possible solutions.  We will never put the genie back in the bottle, nor should we.  But, we should cause as few unintended, negative consequences as possible.

When there are discussions of big data, or more CS+X programs, the discussions should not center solely on how computer science or data science can provide tools to those disciplines; rather, we must remember those disciplines are themselves greatly changing because they are gaining entirely new problem domains on which to focus.  I look forward to their contributions to this ongoing journey.

[Slide 44 - Recap]

I hope that my talk has illustrated how Moore’s Law and prodigiousness have enabled the growth of empiricism in computer science, leading to all manner of new applications, yet also many technical and societal challenges.  I’ve listed my group of eleven challenges, but I suspect there are many others for you all to think about.

I’ll finish, as a computer scientist would naturally finish, with a final plea for a strong education for almost everyone in computer science.  Adding to the usual growing sphere of computer science applicability, empiricism is ensuring the field will increasingly affect every discipline.  While I, by no means, want too much growth in computer science or statistics majors per se, I do believe we must ensure that all have the right, appropriately focused background in computer science and related disciplines.

[Slide 45 - Thank You]

Thank you so much for your time.

Alfred Z. Spector

[1] Empiricism and Optimization in the World of Big Data, INFORMS, Philadelphia, 11/2015