Data Science: from Half-Baked Ideas to Data-Driven Insights
Big data is leading to such a measurement-driven revolution, brought about by the new digital tools all around us, including our mobile phones; searches and web links; social media interactions; payments and transactions; and the myriads of smart sensors keeping track of the physical world. These digital tools are enabling us to collect massive amounts of information on who we are, what we do and how we interact as individuals, communities and institutions.
The explosive growth of big data is in turn giving rise to data science, one of the most exciting new professions and academic disciplines. Data science is a mashup of several different fields. Its data part deals with acquiring, ingesting, transforming, storing and retrieving vast volumes and varieties of information. Its science part seeks to extract insights from the data by applying tried-and-true scientific methods, that is, empirical and measurable evidence subject to testable explanations and predictions.
One of the most exciting parts of data science is that it can be applied to many domains of knowledge, given our newfound ability to gather valuable data on almost any topic. But doing so effectively requires domain expertise to identify the important problems to solve in a given area, the kinds of questions we should be asking and the kinds of answers we should be looking for, as well as how to best present whatever insights are discovered so they can be understood by domain practitioners in their own terms.
NYU’s new Center for Urban Science and Progress (CUSP), with which I’m associated as executive-in-residence, represents one such concrete application of data science. Our research and educational programs are centered on urban informatics: the use of data to better understand how cities work, and the use of that understanding to help make cities more livable, efficient, and resilient.
Another interesting new application is data journalism, a term recently in the news because Nate Silver just relaunched his FiveThirtyEight website as a data journalism site. Silver is the founder and editor-in-chief of FiveThirtyEight, which he originally launched in March of 2008 as a polling aggregation website and political blog. The site was affiliated with the New York Times from 2010 to 2013 and is now owned and operated by ESPN.
Silver is one of the data scientists I most admire. In the 2012 presidential election, he correctly predicted the winner in all 50 states, including all nine highly contested swing states. He also correctly predicted the winner in 31 of the 33 Senate races. His excellent book The Signal and the Noise: Why Most Predictions Fail but Some Don’t was published in September of 2012. The book not only describes his own particular approach to information-based predictions, but also examines the growing field of predictions and why so many fail in spite of, or perhaps because of, the vast quantities of information we now have available.
As part of the FiveThirtyEight relaunch, Silver wrote an article explaining what he means by data journalism. He starts out by noting that FiveThirtyEight is best known for its election forecasting, and in particular, for having correctly called 50 out of 50 states in the 2012 presidential election. “Certainly we had a good night,” he says. “But this was and remains a tremendously overrated accomplishment.” Other forecasters, using broadly similar methods, did just as well or nearly as well. He feels that if you paid attention to the overwhelming majority of nonpartisan polls, it was not all that hard to figure out that President Obama was the favorite to win the Electoral College.
What made FiveThirtyEight and Nate Silver famous is that their data-driven predictions stood out in contrast to what a number of prestigious commentators in the mainstream media were saying. For example, a few days before the November election, George Will predicted that Romney would win big, 321 to 217. At the time, FiveThirtyEight was giving Obama a 90% probability of winning the election.
Author and columnist Peggy Noonan wrote in the Wall Street Journal on the eve of the election: “Nobody knows anything. Everyone is guessing. . . I think it’s Romney. I think he’s stealing in like a thief with good tools, in Walker Percy’s old words. While everyone is looking at the polls and the storm, Romney’s slipping into the presidency. He’s quietly rising, and he’s been rising for a while. . . All the vibrations are right.” In the end, it came down to a contest between vibrations and data science, and data science won.
In the article, Silver discusses the strengths and weaknesses of conventional journalism along two dimensions: quantitative versus qualitative, and rigorous versus anecdotal. “[D]ata journalism isn’t just about using numbers as opposed to words,” he writes. “To be clear, our approach at FiveThirtyEight will be quantitative - there will be plenty of numbers at this site. But using numbers is neither necessary nor sufficient to produce good works of journalism. Indeed, as more human behaviors are being measured, the line between the quantitative and the qualitative has blurred. . . The problem is not the failure to cite quantitative evidence. It’s doing so in a way that can be anecdotal and ad-hoc, rather than rigorous and empirical, and failing to ask the right questions of the data.”
Can an article or project be rigorous and empirical without being quantitative, whether in journalism, the humanities, social sciences, business or other fields? Absolutely. Silver cites solid investigative reporting and explanatory journalism as examples of well-researched, rigorous journalism that is not quantitative in nature, as well as the works of historians and biographers like Robert Caro and Richard Ben Cramer.
Is there a role for anecdotal evidence in journalism, in science and in other disciplines? Again the answer is yes. In fact, imagination and creativity are highly desirable qualities in people who think outside the box. It’s what inspires them to take the next, increasingly rigorous steps. “Data does not have a virgin birth,” writes Silver. “It comes to us from somewhere. Someone set up a procedure to collect and record it. Sometimes this person is a scientist, but she also could be a journalist.”
What is needed is a process to transform the anecdotal evidence or half-baked idea into data-driven insights. Silver breaks down this process into four main steps:
Collection of data or evidence: “For a traditional journalist, this is likely to involve some combination of interviewing, documentary research and first-person observation. But data journalists also have ways of collecting information, such as by commissioning polls, performing experiments or scraping data from websites.”
Organization: “Traditional journalists have a well-established means of organizing information: They formulate a news story. The story might proceed chronologically, in order of importance (the inverted pyramid) or in some other fashion. Data journalists, meanwhile, can organize information by running descriptive statistics on it, by placing it into a relational database or by building a data visualization from it. Whether or not a picture is worth a thousand words, there is value in these approaches both as additional modes of storytelling and as foundations for further analysis.”
Explanation: “In journalistic terms, this might mean going beyond the who, what, where and when questions to those of why and how. In traditional journalism, stories of this nature are sometimes referred to as news analysis or explanatory journalism. Data journalists, again, have their own set of techniques - principally running various types of statistical tests to look for relationships in the data.”
Generalization: “No matter how well you understand a discrete event, it can be difficult to tell how much of it was unique to the circumstances, and how many of its lessons are generalizable into principles. But data journalism at least has some coherent methods of generalization. They are borrowed from the scientific method. Generalization is a fundamental concern of science, and it’s achieved by verifying hypotheses through predictions or repeated experiments.”
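To make the four steps a bit more concrete, here is a minimal Python sketch of how they might look on a toy problem. It is only an illustration under stated assumptions, not Silver's actual method: the data set, the variable names, and the choice of a simple least-squares line with a hold-out check are all mine, and the numbers are invented rather than real polling results.

```python
from statistics import mean, stdev

# Step 1, collection. Hypothetical (poll_margin, result_margin) pairs in
# percentage points. These numbers are invented for illustration only.
OBSERVATIONS = [
    (-8.0, -7.1), (-5.0, -4.6), (-2.0, -2.8), (-1.0, -1.3), (0.5, 1.1),
    (2.0, 2.5), (3.5, 2.8), (6.0, 5.6), (7.5, 8.4), (10.0, 9.0),
]


def describe(values):
    """Step 2, organization: summarize a column with descriptive statistics."""
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}


def fit_line(pairs):
    """Step 3, explanation: least-squares fit of y = intercept + slope * x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope


def pearson_r(pairs):
    """Step 3, explanation: strength of the linear relationship (-1 to 1)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def holdout_error(pairs):
    """Step 4, generalization: fit on every other pair, then measure the
    mean absolute prediction error on the pairs that were held out."""
    train, test = pairs[::2], pairs[1::2]
    intercept, slope = fit_line(train)
    return mean([abs(y - (intercept + slope * x)) for x, y in test])


if __name__ == "__main__":
    print("poll margins:", describe([x for x, _ in OBSERVATIONS]))
    print("fitted (intercept, slope):", fit_line(OBSERVATIONS))
    print("correlation:", pearson_r(OBSERVATIONS))
    print("held-out mean absolute error:", holdout_error(OBSERVATIONS))
```

The last function is the one that matters most for generalization: a relationship that only fits the data it was built from has not yet been tested, while one that also predicts held-out observations has at least survived a first check.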
Communication is one of the key challenges faced by data scientists, in particular, how to best explain their data-driven insights to domain experts such as business executives, government officials and medical practitioners. This is an even bigger challenge for data journalists who are trying to reach a general audience: “one of the challenges that FiveThirtyEight faces is figuring out how to make data journalism vivid and accessible to a broad audience without sacrificing rigor and accuracy.” This will take considerable experimentation.
I hope that FiveThirtyEight succeeds. It will be another application of data science that can help us learn how to be quantitative, empirical and rigorous without sacrificing creativity, imagination and compelling storytelling. I have no doubt that we will get there, but it will likely be a long, tough journey.