Who Will Drive Innovation around "Big Data"?
It is a curious pattern: when new technologies arise and enable access to new forms of data, it is often not the domain experts who drive innovation, but people with deep technical expertise who pick up the required domain knowledge along the way. Ignoring new methods, in other words, can be a costly mistake. Let me illustrate this pattern with three examples before making the connection to "big data."
- Until recently, yellow cab companies controlled access to taxis: you call them, they dispatch, you wait … sometimes in vain. The recent ubiquity of smartphones, digitized maps, geolocation, credit cards, and online social ratings has created new ways of connecting drivers with riders. Whereas cab service was abysmal five years ago in cities like Pittsburgh and Seattle, today (thanks to Lyft and Uber) comparable services are cheap and reliable. Notice that cab companies did not create today's technology for hailing a driver with your cell phone.
- Library science was born of the need to organize access to knowledge (books, in particular) and to help users find what they are looking for. Its organizing principles for physical goods rely on catalogues and on curated, top-down, hierarchically organized repositories. Yahoo, accordingly, started with a portal that organized web resources into a hierarchy. Soon, search engines began using keywords to help users navigate the Web and, later, keywords morphed into tags. Today, keyword search (including tags) is the dominant way users find resources. Notice that library science did not create today's principal mechanism for accessing humanity's knowledge.
- Applied linguists and cognitive psychologists have long studied second-language acquisition, the process by which people learn a new language. Their studies have influenced the way high schools teach foreign languages, as well as commercial tools such as the Pimsleur programs. Starting from the perspective of "gamification," Carnegie Mellon University (CMU) computer science professor Luis von Ahn and his PhD student Severin Hacker founded Duolingo in 2011. Today, an estimated 120 million people use this free app to learn a new language. Notice that neither linguists nor cognitive psychologists created today's most widely used free language-learning app.
The pattern common to all three examples is that domain experts are often too entrenched in the way they go about their daily jobs. This leaves space for entrants with novel technical expertise to leverage new advances and find simpler or better ways to get those jobs done. Obviously, yellow cab companies knew how to dispatch cabs, librarians knew how to organize knowledge into hierarchies, and language experts knew how people learn languages – but those domain experts did not drive the innovations that matter today. (The examples above were strongly influenced by Clayton Christensen's theory of "disruptive innovation," although Christensen emphasizes somewhat different conceptual distinctions.)
Back to “Big Data”
So, what does this have to do with "big data"? In all three examples, new forms of data (or new forms of access to data and people) have opened new ways to solve problems in areas previously "protected" from data-driven decision making. In a way, the concept of "big data" is just a continuation of our centuries-long quest to use the scientific method and more information to make better-informed decisions. But the difference – and it's a big one – is that we now have so much more data and so many more forms of data (digital behavioral traces, geolocation, voluntary sharing of preferences, all the unstructured data on the Web, etc.). Domain knowledge (which took others years to acquire) remains very important in the process, yet the new entrants are adept at learning the necessary context precisely because they embrace a process of data-driven experimentation. The appetite for more and more data becomes understandable when you consider two facts: a) simple algorithms trained on large amounts of data often beat more sophisticated algorithms trained on less data, and b) very large data sets and new GPU hardware allow training neural networks to solve problems with an accuracy that was unimaginable only five years ago.
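Point (a) is easy to demonstrate on synthetic data. The sketch below is my own toy illustration (not drawn from any system mentioned in this article): a deliberately simple learner – a nearest-centroid classifier on two noisy Gaussian classes – improves substantially when its training set grows from 10 points to 10,000, with no change to the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NOISE = 5, 3.0  # dimensionality and noise level of the synthetic data

def make_data(n_per_class):
    # Two classes: Gaussian clouds centered at -1 and +1 in every dimension.
    X = np.vstack([rng.normal(-1.0, NOISE, (n_per_class, DIM)),
                   rng.normal(+1.0, NOISE, (n_per_class, DIM))])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def fit_centroids(X, y):
    # The "simple algorithm": remember one mean vector per class.
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, X, y):
    # Predict the class whose centroid is nearest; score against labels.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.argmin(axis=1) == y).mean()

X_test, y_test = make_data(2000)

def mean_accuracy(n_train, trials=20):
    # Average over several training draws to smooth out sampling noise.
    return float(np.mean([
        accuracy(fit_centroids(*make_data(n_train)), X_test, y_test)
        for _ in range(trials)
    ]))

acc_small = mean_accuracy(5)      # 10 training points in total
acc_large = mean_accuracy(5000)   # 10,000 training points in total
print(f"accuracy with 10 points:     {acc_small:.3f}")
print(f"accuracy with 10,000 points: {acc_large:.3f}")
```

With the fixed seed, the large-data run reliably scores higher: the algorithm stays trivial, but its class-mean estimates sharpen as data accumulates.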
The lesson for domain experts who wonder whether they should engage with this curious trend is: if you don't embrace new data management methods fast enough, others will – and, in the process, they may acquire enough of your domain expertise to render you obsolete! Don't believe that driving these changes requires only a marginal understanding of the various techniques; rather, it requires deep familiarity with the methods and with their underlying assumptions, biases, and limitations. It also requires the skill to continually step back and ask "so what?" (What higher-level insights have we learned?). One cannot simply throw loads of data at cool new algorithms and expect meaningful solutions to arise.
A Data-driven Approach to Education
Let's see how this trend affects university faculty with respect to what and how they teach.
First, what we teach: At our school, the Tepper School of Business at CMU, the faculty has historically shared a belief in an analytical, data-driven approach to problem solving. This approach includes a focus on teaching principal methods, as little can later compensate for the lack of a solid grounding in the fundamentals (e.g., statistics, linear algebra, or understanding concepts such as "confounding factors"). Data-driven decision making requires familiarity with advanced data tools and their shortcomings, the ability to question assumptions, and the capacity to synthesize an analysis into meaningful, evidence-based insights. The better you understand the fundamentals, the more you get your hands "dirty" with actual data, and the more real-life case studies involving data (both good and bad) you see, the better prepared you will be. Thus, all our MBA students are well trained in probabilistic and statistical methods. All first-year PhD students in our research group (the "Business Technology" group) are required to take advanced machine learning and cloud computing classes from the computer science department.
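The "confounding factors" fundamental mentioned above can be made concrete in a few lines. The following toy Python sketch (my own illustration, not from the article's curriculum) uses the oft-cited kidney-stone figures from Charig et al. (1986) to show Simpson's paradox: treatment A has the higher success rate within each severity group, yet pooling the groups makes B look better, because severity – the confounder – is unevenly distributed across treatments.

```python
# (successes, patients) for two treatments, split by stone size (severity).
data = {
    "A": {"small stones": (81, 87),   "large stones": (192, 263)},
    "B": {"small stones": (234, 270), "large stones": (55, 80)},
}

def rate(successes, n):
    return successes / n

# Within each severity group, treatment A has the higher success rate...
for size in ("small stones", "large stones"):
    print(f"{size}: A={rate(*data['A'][size]):.1%}  B={rate(*data['B'][size]):.1%}")

# ...yet pooling the groups reverses the ranking, because A was given
# mostly to the harder (large-stone) cases and B to the easier ones.
totals = {t: tuple(map(sum, zip(*data[t].values()))) for t in data}
a_all, b_all = rate(*totals["A"]), rate(*totals["B"])
print(f"pooled: A={a_all:.1%}  B={b_all:.1%}")
```

A student who has only learned to "manipulate numbers" would pick B from the pooled table; one who understands confounding knows to ask how patients were assigned before trusting either comparison.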
Second, how we teach: Think about the hours students spend doing homework and solving test problems. How much of this work then feeds back into a process that helps improve the way they learn? Similarly for instructors: Think about the time spent giving feedback to students. How much of this effort is directed at activities that could help other students learn from their peers’ experiences and our feedback? Typically, after a homework assignment is over, this information is lost. Thus, a natural question for data management and design innovation is: Can we do better at capturing and reusing these data to personalize the learning experience and improve the learning rate? Motivated by this question, we have been developing a prototype of a learning management system at CMU that engages students in a series of complementary learning tasks in which each of the individual tasks helps the students learn and, at the same time, contributes to other students' learning. In parallel, instructors can observe and engage in the learning process, with each instructor’s feedback leveraged to help the learning of not just one but several students. After a course is over, the instructor has acquired valuable digital artifacts that can be used immediately to help new students learn.
This and similar projects at our school are driven by our desire to optimize the use of students' and instructors' time, as well as by our conviction that if we don't find ways to use learning data to continually improve our approach to teaching, someone else will. Although the "year of the MOOC" (2012) is past and many people now recall it as "that trend that was going to kill colleges but didn't," it is obvious that a data-driven approach to education will considerably change (and improve) the way we learn over the next 20 years.
Let me add a reminder from the late W. Edwards Deming, who helped engineer the rise of the Japanese automotive sector's competitiveness. In his 1982 book Out of the Crisis (listed by TIME magazine among the 25 most influential business management books), Deming warned against management by numbers alone, an exclusive focus on outcomes, and obsession with short-term profits and annual merit ratings, all of which work against success in the long run. Thus, we must keep in mind that "data literacy" is not just manipulating numbers, but also asking the right questions, challenging assumptions, understanding the nature of evidence, and focusing on evidence-based insights that inform but do not dictate our conclusions. Just don't make the fatal mistake of treating "big data" simply as a nice-to-have add-on for the job you've been doing so far.