This one’s already been passed around the blogosphere like a cheap bottle of wine, but I feel the urge to comment on it myself. A recent issue of Wired Magazine has an essay by Chris Anderson with the very provocative title, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” The argument, in short, is that computers have gotten so good at processing massive amounts of data that we don’t have to understand the underlying processes that produce the data anymore: we can just let computer algorithms sort through the data and give us explanations. Anderson labels this new, model-free age “the Petabyte Age.”
This article, and the attitude it embodies, really irk me, and not just because I am a theorist by trade: the attitude presented here is, to my mind, so wrong-headed that it is actually harmful to the progress of science. Let’s take a look at some of the claims.
At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn’t pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.
(Emphasis mine.) Google is a bad example of a new kind of science, for a simple reason: they’re not doing science. As the article itself states, “Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough.” Science is, pretty much by definition, the attempt to understand complicated phenomena. This is not a knock on Google, which is an amazing technical achievement; it is simply the observation that coming up with a “brute force” solution to a problem does not make it science, no matter how complicated the problem.
Scientists themselves use “brute force” computation all the time; for instance, electromagnetics researchers use genetic algorithms to design better communications antennas. The computer optimizes the system for them, but none of these researchers, I believe, would consider the raw output alone to be a scientific result. (If it were, so would be the genetic algorithms driving the artificial intelligence when I play Quake III.) Those antenna researchers know that they’ve found an optimal solution, but not necessarily the optimal solution. Similarly, Google has found an excellent search engine strategy, but I don’t believe they’ve proven it is the best strategy: they’ve found one that makes customers happy enough to make the company a lot of money.
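The antenna example can be sketched in miniature. Everything below is a toy, and all the numbers are invented for illustration: a real antenna optimizer would score each candidate design with an electromagnetic simulation, not a one-line objective function. The point is only that the algorithm finds a good solution without ever “understanding” the problem.

```python
import random

random.seed(0)

def fitness(x):
    # Toy objective with a peak at x = 3. A real antenna design would
    # score gain or bandwidth from an electromagnetic simulation instead.
    return -(x - 3.0) ** 2

def evolve(generations=100, pop_size=20, mutation=0.5):
    population = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half (elitism), then refill with mutated copies.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = [x + random.gauss(0, mutation) for x in survivors]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print(best)  # converges near 3 for this run
```

Note what the output does and does not tell you: the algorithm converges on a solution near the peak, but nothing in the process proves it is the global optimum, which is exactly the distinction between an optimal solution and the optimal solution.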
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
There are very good reasons why scientists emphasize the difference between correlation and causation: one may not have, and may never have, enough raw data to distinguish the two cases, and drawing the wrong conclusion could be disastrous. The author seems to assume that an infinite amount of data of any desired kind is available to the researcher, which is a flawed and unrealistic assumption. A big reason we build models in the first place is that they allow us to generalize from the data we have to all the situations for which we have no data.
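To see how easily pure correlation misleads, here is a toy sketch with invented names and numbers: two quantities that are strongly correlated only because a hidden third variable drives them both. No amount of extra data of the same kind would reveal the confounder; only a model of the underlying mechanism does that.

```python
import random
from math import sqrt

random.seed(1)

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A hidden confounder (temperature) drives both quantities; neither
# causes the other.
temperature = [random.uniform(0, 30) for _ in range(1000)]
ice_cream_sales = [2.0 * t + random.gauss(0, 2) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 2) for t in temperature]

r = pearson(ice_cream_sales, drownings)
print(round(r, 2))  # strongly correlated despite no causal link
```

A statistical algorithm mining this data set would dutifully report that ice cream sales “predict” drownings, and it would even generalize correctly as long as the confounder stays in place; the moment conditions change, the correlation evaporates, which is precisely what a mechanistic model guards against.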
There is another reason why models are so important: they serve as a check on the data themselves. The reality is that in science, data and models form a symbiotic relationship. Data lead to a model, which in turn serves both as a check on future data and as a tool for recognizing new scientific discoveries.
This recalls an example from my own line of work: nano-optics, the study of the interactions of light with structures smaller than the wavelength of the light itself. (Some background can be found in a post I did on peer-reviewed research some time ago.) One of the challenges of nano-optics is that none of the simplifying approximations used in traditional optics apply at the nanoscale.
In 1998, a French group discovered what became known as ‘extraordinary optical transmission’. In short, they found that an array of subwavelength-size holes in a metal plate can transmit much more light than traditional theoretical results predict. They attributed this extraordinary transmission to the presence of ‘surface plasmons’, electron density waves on the surface of the metal.
This is where things got difficult. Probably a dozen research groups performed experiments and numerical simulations to study the transmission through such hole arrays, and they often found conflicting results: one group would find enhanced transmission, another suppressed transmission. There was even debate about whether plasmons were the cause of the enhanced transmission at all.
I became involved with a research group that simplified the experiments considerably: we looked at the extremely basic case of two holes in a metal plate. We did numerical simulations of the phenomenon, and found in our simulations that the transmission was sometimes enhanced, sometimes suppressed, depending on the separation of the holes.
Was this a real effect? In any numerical simulation, or any physical experiment for that matter, there is always the possibility of error. To check our results, we derived a very simple model of the enhancement; in fact, it was an equation with only two terms on the right-hand side. This model agreed almost perfectly with the numerical results, as well as with the experiments done at about the same time.
What is the advantage of the model? Without the model, or any interpretation of the experimental results, the best we can do is describe precisely what happens in the two-hole case. Without the model, we cannot generalize our results to an arbitrary number of holes in a metal plate. Could we simply do more experiments? Sure, but with a model we can proceed much more quickly and predict the outcome of those experiments before we expend the time and effort.
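To give the flavor of how a two-term model can capture both behaviors, here is a purely illustrative sketch. The coefficients and the surface-plasmon wavelength below are invented for the demonstration, not taken from our paper; the point is only that a direct term plus a separation-dependent surface-wave term naturally oscillates between enhanced and suppressed transmission.

```python
import cmath
from math import pi

# Illustrative values only; the real model has its own coefficients.
wavelength_sp = 0.5   # surface-plasmon wavelength (arbitrary units)
coupling = 0.4        # amplitude of the hole-to-hole surface-wave term

def enhancement(d):
    # Two terms on the right-hand side: light transmitted directly
    # through one hole, plus a surface wave launched by the other hole,
    # arriving with a phase set by the separation d.
    k_sp = 2 * pi / wavelength_sp
    amplitude = 1.0 + coupling * cmath.exp(1j * k_sp * d)
    return abs(amplitude) ** 2  # relative to an isolated hole (= 1.0)

for d in (0.50, 0.75, 1.00, 1.25):
    label = "enhanced" if enhancement(d) > 1 else "suppressed"
    print(f"d = {d}: {enhancement(d):.2f} ({label})")
```

A formula this simple immediately predicts the result for any separation, which is exactly the generalizing power that a pile of individual simulations, however large, cannot provide on its own.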
The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.
If the words “discover a new species” call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn’t know what they look like, how they live, or much of anything else about their morphology. He doesn’t even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.
By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.
I’m guessing biologists would be pretty upset by this last statement. The author spends an entire paragraph explaining that all Venter has is data: he can tell you almost nothing about the species he’s discovered. Then he follows up with the completely unjustified claim that this is the biggest advance in biology in a generation. Why? What have we learned that we didn’t know before? We aren’t told. Again, there’s nothing wrong with Venter’s work: it sounds like remarkable research, but that alone doesn’t make it the best research done in a generation.
All this misplaced enthusiasm reminds me that mass data collection has not only been a blessing for science but has also added practical complications. Every research group has massive amounts of data that are typically difficult for other research groups to decipher or reproduce, and typically only a single group is working on a given project. Can one tell, from reading a four-page scientific article about an experiment or numerical simulation, whether the mass of data presented has been interpreted properly? Sometimes yes, sometimes no, and sometimes researchers use mounds of generic-looking data to commit outright scientific fraud. One of the only readily available tools a researcher has for vetting complicated results for errors or fraud is an understanding of the underlying science, i.e., the models.
To conclude, high-powered computing is certainly part of the future of science, but reams of data without interpretation or understanding are a scientific dead end.
A final thought: I can summarize my problem with the Wired essay with an analogy. The author’s “If you’ve got a computer, you don’t need a theory” sounds disturbingly like a grade-schooler claiming, “If I’ve got a calculator, I don’t need to learn arithmetic.” The result of the latter attitude is a depressingly large number of cashiers who can’t make fifty cents in change for a $1.50 purchase paid with two dollars. I can hardly imagine what the consequences of “calculate, don’t think” would be for science.