Data Science is Learning from Data

Monday October 14, 2013

There are a lot of unhelpful definitions of data science. To be a useful term, it needs a sensible meaning. This is what I mean by data science:

Data science is learning from data.

The check for whether you've just done data science is a two-part test:

In some ways, then, data science is generalized science: science without a specific field. On the other hand, a data scientist sometimes does not design the "experiments" that generate the data, and in that case data science corresponds more closely to a subset or specialization of scientific skills.

An irate student of mine once argued that if data science is at all sensibly named, there should be hypotheses being tested. In practice, that often doesn't seem to be the case. I agree that the scientific method is a beautiful thing, but I also think that a lot of good science has been and continues to be observational. The reason for an expedition to the Galapagos was never to test a hypothesis about the existence of the Blue-footed Booby, for example. Often there's much to be learned just in describing data.

If your machine is learning but you're not, you aren't doing data science. You may certainly be doing very good data engineering and solving real problems. Algorithms and their development are the domain of computer science. Their application happens in the fields of software engineering and data engineering. Deep learning and similar techniques work very nicely and solve real problems, but if there isn't a finding that humans can understand, it isn't data science. Data science can certainly use statistics and machine learning, but black box techniques are not generally helpful for human understanding.

Data science certainly isn't about the size of the data. There are techniques you need in order to work with big data, but these are just techniques. The scientific method is not obsolete. It is true that more is different, but the way that it's different is that it's more. If there's any change to science, it's that there's a backlog of analysis due to the large amount of data. But then again, the data that's backing up is usually not the kind that comes from proper experiments. We'll still need experiments.

This definition of a "data scientist" is not far at all from "data analyst". It may be that the reasons to use the name "data scientist" instead range only from "sexier buzzword" to "distance field from know-nothing low-level and business analysts".

Business abuses the term data science in two main ways. The first, treating data science as product engineering, is understandable, since data science requires the use of some computer science and engineering techniques. But data science is not primarily about engineering (i.e., building) products. Data science could be involved in building a product, but most of that is engineering work. The second, less useful view from business folks is well explained by IBM: "[w]hat sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings". "Data scientist" should not just be a higher-ranking business title than "data analyst", and everyone should be able to communicate, unless what you're really looking for is something like a data journalist, somehow parallel to the way science journalists communicate about science.

Let us go forth and learn about the world.


This post was originally hosted elsewhere.