Marc van Meel

Hi, I’m Marc!

I speak and write about the intersections of technology, society and philosophy.

Why Data Science is NOT Science

It’s like living a double life – being both an engineer and a Data Scientist. As the former, I pledged to practice scientific integrity in service of humanity. As the latter, I frequently found myself tuning AI hyperparameters and selecting candidate models, in what can best be described as a game of trial-and-error.

In this blog I define science as the project of uncovering objective truths, or facts, about the world by observation and experimentation. I define Data Science as the application of mathematics, statistics and AI to extract actionable insights from data, often in service of a business case.

Let’s explore why Data Science is a misnomer, why Data Science has little to do with science, why Data Science is best understood by Darwinian evolution, why we can never know science, and why Data Scientists will be the final scientists.

Data Science – Nail in the Coffin of Science

Induction lies at the root of Data Science, AI and data-driven decision-making. Induction holds that a sufficient number of confirmatory observations can be generalized into a universal claim. As the famous example goes, if we only ever observe white swans, we inductively conclude that ALL swans must be white. AI, and Machine Learning in particular, is deeply rooted in inductive statistics: it generalizes classifications and predictions from large quantities of historical data. Inductive reasoning dates back at least to Aristotle, and the idea that science is based on induction was later popularized by the likes of Francis Bacon and John Stuart Mill.

Induction has shortcomings. First, induction rests on the assumption that the future will be analogous to the past. David Hume criticized this assumption in his Problem of Induction.[1] At some point, we can encounter a non-white swan. Environments are seldom static and are frequently subject to change. In Data Science this is known as data drift. Second, induction can only be justified by appealing to induction itself: we trust it because it has worked before. This results in a circular argument – science cannot be proven scientifically. Because we don’t have access to all possible observations, we can’t be assured that contrary evidence will never be found. Therefore, we are not justified in deriving universal statements from confirmatory observations alone. In the same fashion, Data Science and Big Data do not automatically translate to new truths.
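To make the swan example concrete, here is a minimal sketch in Python. The "model" is simply the most common colour observed so far, and the drifted population (30% black swans) is a made-up assumption for illustration, not real data.

```python
from collections import Counter


def fit_majority(observations):
    """Induce a 'universal' rule: predict the most common colour seen so far."""
    return Counter(observations).most_common(1)[0][0]


def accuracy(rule, observations):
    """Fraction of observations that the induced rule gets right."""
    return sum(1 for o in observations if o == rule) / len(observations)


# Every swan observed so far is white (hypothetical historical data).
past = ["white"] * 1000

# Hypothetical drifted environment: black swans start to appear.
future = ["white"] * 700 + ["black"] * 300

rule = fit_majority(past)
past_acc = accuracy(rule, past)
future_acc = accuracy(rule, future)

print(f"induced rule: all swans are {rule}")
print(f"accuracy on past data:   {past_acc:.2f}")
print(f"accuracy on future data: {future_acc:.2f}")
```

The rule is flawless on everything it has seen, and nothing in the historical data hints that it will break. Only new observations can reveal that.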

Would Karl Popper have liked Data Science?

The problem of induction was later addressed by Karl Popper, who argued that science does not progress by induction, but by falsification – we can never show hypotheses to be true, we can only show them to be false. For a theory to be scientific, it must be empirically testable and conceivably proven false. To Popper, inductive theories and models are pernicious. They are difficult to falsify, leave us vulnerable to randomness in data, and don’t perform well on edge cases. Furthermore, we run into trouble when we operate under false assumptions, for instance about racial or gender class distributions. The biased and discriminatory AI Systems making headlines today suggest that Popper had valid concerns.

Nothing lasts forever, and Popper’s falsificationism became contested as well, because it sets unrealistic standards for science: not all theories can be falsified (Thomas Kuhn), and scientists shouldn’t be bound by the scientific method, or at least don’t act as if they are (Paul Feyerabend). Moreover, falsification doesn’t provide us with a rational basis for preferring one candidate model or hypothesis over another. Popper would adopt causal models over prediction models, because they can be falsified more easily. But causal inference has its own limitations. We generally examine aggregated causal effects, because it’s impossible to model causal effects at the level of the individual.

Regardless, Popper’s falsificationism was highly influential in the development of modern science and laid the foundation for peer review and the reproducibility of experiments. Data Science also contains elements of falsification: we test and validate model performance against unseen data.
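Validation on unseen data can be read as an attempted falsification. The sketch below is an illustration under stated assumptions: the data is synthetic (a noisy line), the two candidate models and the error threshold of 1.0 are arbitrary choices, and the helper names are hypothetical. Each model makes the claim "my mean absolute error on unseen data stays below the threshold", and the held-out set gets a chance to refute it.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mae(predict, xs, ys):
    """Mean absolute error of a prediction function on a data set."""
    return sum(abs(predict(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: y = 2x + 1 plus Gaussian noise (an assumption for this sketch).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# Candidate 1: a fitted line.
slope, intercept = fit_line(train_x, train_y)
line = lambda x: slope * x + intercept

# Candidate 2: a memorizer that recalls training points and otherwise
# falls back to the average training target.
memory = dict(zip(train_x, train_y))
fallback = sum(train_y) / len(train_y)
memorizer = lambda x: memory.get(x, fallback)

THRESHOLD = 1.0  # arbitrary bar each model claims to stay under
for name, model in [("line", line), ("memorizer", memorizer)]:
    train_err = mae(model, train_x, train_y)
    test_err = mae(model, test_x, test_y)
    verdict = "survives" if test_err < THRESHOLD else "falsified"
    print(f"{name}: train MAE={train_err:.2f}, test MAE={test_err:.2f} -> {verdict}")
```

The memorizer looks perfect on the data it was built from, which is exactly the kind of confirmation induction rewards. Only the unseen data exposes it. Surviving the test does not prove the line is true either; it has simply not been refuted yet.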

‘Science Works’ is an inductive argument to justify the scientific method.

Digital Darwinism

Charles Darwin was a poor mathematician, but his half-cousin Francis Galton certainly was not. Galton, one of the founding fathers of statistics, introduced many of its foundational concepts, including correlation, the ‘Wisdom of the Crowd’, regression and regression to the mean. These concepts were deeply inspired by the theory of natural selection that his cousin Darwin put forward in On the Origin of Species. Statistics finds its roots in Darwinism. Truths produced by Data Science should therefore be viewed in the same light: not as universal truths in a materialist realist sense, but rather as local pragmatic truths.

AI Systems and models are best understood as task agents[2] seeking to optimize an objective function, by finding (local) minima or maxima through evolutionary trial-and-error processes in a task environment (solution space). Any resulting insight or truth should therefore be evaluated within a Darwinian framework, instead of a materialist realist one. Data Science doesn’t produce objective truths; it produces truths which have to be true enough to act on for a given business case. According to Popper, the scientific method is also based on Darwinism. His falsificationism essentially proposes a survival-of-the-fittest scheme for scientific theories. Popper believed that the survival of the best theory is itself all the justification required. He was just too quick to exclude induction.
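A toy version of such a trial-and-error process, as a sketch rather than a description of any particular AI System: a single candidate is randomly mutated, and the mutation is kept only if it scores better on the objective. The two-peaked objective function, the step size and the starting points are all assumptions chosen for illustration.

```python
import math
import random


def objective(x):
    """Toy fitness landscape: a small peak near x=1 and a higher peak near x=5."""
    return math.exp(-((x - 1.0) ** 2)) + 2.0 * math.exp(-((x - 5.0) ** 2))


def evolve(start, steps=2000, sigma=0.1, seed=0):
    """Mutate-and-select search: keep a random mutation only if it is fitter."""
    rng = random.Random(seed)
    best = start
    for _ in range(steps):
        candidate = best + rng.gauss(0.0, sigma)  # random variation
        if objective(candidate) > objective(best):  # selection
            best = candidate
    return best


from_left = evolve(start=0.0)   # begins on the slope of the small peak
from_right = evolve(start=4.0)  # begins on the slope of the high peak

print(f"start 0.0 -> x={from_left:.2f}, fitness={objective(from_left):.2f}")
print(f"start 4.0 -> x={from_right:.2f}, fitness={objective(from_right):.2f}")
```

Both runs end on a peak, and each is "true enough" for the environment it started in. The search that began at 0.0 settles on the lower peak and has no way of knowing that a better one exists: a local, pragmatic truth rather than a universal one.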

Looking closely, Data Science and AI are drenched in Darwinism! Competing candidate models. Gradient Descent. Evolutionary Algorithms (obviously). Reinforcement learning. Search algorithms. And much more. The same holds at the level of the organization. The Red Queen Hypothesis[3], a concept from evolutionary biology, holds true for organizations competing through data today. Organizations must adapt and evolve, not just for competitive advantage but for survival, because competing organizations are evolving as well. Companies for which society and technology evolve faster than the companies’ ability to adapt are going the way of the dinosaur. It’s survival of the most disruptive and the most data-driven. Now that is a truth.

True story.

Science Doesn’t Speak for Itself

Evolution could be the primary way to obtain knowledge. In that case, Science and Data Science are not that different. Data Science just operates on a lower level of abstraction. However, evolution poses a problem for science. If we accept evolution to be true, we lose reason to assume that we have evolved for objective truth-finding. Our cognitive faculties have evolved solely for survival and reproduction. This undermines the simultaneous belief in evolution and naturalism, as Alvin Plantinga argues in his Evolutionary Argument Against Naturalism.[4] Many facts appear counter-intuitive to us and our senses are easily deceived. We are not the kind of monkey that has evolved to understand the universe. It could even be the opposite: knowing certain truths might be maladaptive.

So what is our relationship to truth? Is evolution nested inside scientific realism, or is science nested within Darwinism? Do objective truths exist, or are truths contingent upon their utility for survival and reproduction? Whichever scenario we find ourselves in, we are never in direct contact with the truth. Evolution has produced a priori implicit structures, which mediate our interaction with reality. For this reason, Immanuel Kant believed that nature could only be accessed through culture – without an a priori structure we cannot make sense of experience. Thomas Kuhn even argued that we each live in different worlds, because we experience reality differently. Essence might only exist to us as appearance, making Plato’s Allegory of the Cave[5] a false dichotomy.

These structures, through which we evaluate and account for truth and science, cannot be scientific themselves. Therefore, we have no choice but to believe in science, because we can never know science. However, we have elevated science to be objective and universal, unaffected by human subjectivity. ‘Science Speaks for Itself’, we often hear. But science cannot speak for itself. It’s the opposite. Scientific truths require interpretation and articulation, by us. Cracks appear in our idea of a universal science when science is subjectivized and articulated by scientists or scientific institutions. Science can then be disavowed, and surplus knowledge and conspiracy theories[6] emerge, as they did during the COVID-19 pandemic. Our subjective structures collectively manifest our interpretation of science, and the idea of a universal science emerges only retroactively. Hence, when we change, science changes.

The End of Science

The insights from data and data-driven decisions will determine future scientific research and breakthroughs. Many ‘unscientific’ theories are already regarded as truisms today. Data Science is flowing over from data-driven fields, such as finance, economics, epidemiology and biology, to various other domains and scientific disciplines. Data Science will produce novel experimental tools for scientists. AI Systems will enable us to understand the universe better, challenge our facts and assumptions, and allow us to test hypotheses about complex systems. The predictive power and pragmatic utility of models will be too great to ignore – even if all possible observations are inaccessible to us. Just as organizations will become more and more data-driven, scientists will become more like Data Scientists.

Paul Feyerabend was famously dubbed ‘The Worst Enemy of Science’ for stating that science shouldn’t be restricted to one specific methodology. But he might be science’s greatest ally after all! We should resist the urge to resort to causal mechanisms alone when statistics offers reasonable explanations. At the same time, falsificationism should keep us humble and skeptical. We should rigorously test our hypotheses and models. Data Science has a lot to learn from science and engineering, and as the AI market matures, organizations need to shift their focus towards the reproducibility and auditability of their models and AI Systems.

Data Science complements science, but requires us to update our concept of a universal science to something akin to ‘realistic pragmatism’ or ‘pragmatic realism’. Sometimes we need to be realistic, sometimes we need to be practical, sometimes both.

In the end, all we monkeys have is tools.

Links
