Data and Analysis: another five-part model

Tuesday October 8, 2013

Learning from data is not a simple linear process. The following five parts are all essential and interconnected, with the data itself central but far from the most important component.

phenomenon of interest - data creation - data - analysis - system of beliefs

If your goal is to achieve a correct system of beliefs about the phenomenon of interest, it is important to attend to all the components of this superficially simple model and work to understand their connections.

Your system of beliefs is involved from the very start. Among the components of Conway's data science Venn diagram, subject matter expertise is sometimes viewed as unnecessary by those more interested in algorithms than science, but if the goal is learning about something in the world on the basis of data, you have to know something to start with. How absolute this claim is might be a little subtle, and some people disagree - Hofstadter for one, I think - while others seem to be generally on board.

A simple example: you look at a data set of library visit timestamps and see that there are no visits on Wednesdays.

If you think everything that you see in your data is an accurate signal about the phenomenon you're interested in, you are likely to be wrong often and in ways you would much rather avoid.
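To make the Wednesday example concrete, here is a minimal sketch of how such a gap might show up in an analysis. Everything in it is an assumption for illustration: the helper names `visits_by_weekday` and `silent_weekdays` are hypothetical, and the timestamps are synthetic, not real library data. The point is only that the code can report a zero count; it cannot say whether the library is closed on Wednesdays, the door counter fails that day, or an export step dropped those rows.

```python
from collections import Counter
from datetime import datetime, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def visits_by_weekday(timestamps):
    """Count visits per weekday, including weekdays with zero visits."""
    counts = Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)
    return {day: counts.get(day, 0) for day in WEEKDAYS}


def silent_weekdays(timestamps):
    """Return the weekdays that never appear in the data.

    A weekday listed here is a fact about the data set, not yet a
    fact about the library: it still has to be checked against how
    the data was created.
    """
    return [day for day, n in visits_by_weekday(timestamps).items() if n == 0]


# Synthetic data (an assumption): one visit at 10:00 on each day of a
# four-week span starting Monday 2013-09-02, with Wednesdays left out.
start = datetime(2013, 9, 2, 10, 0)
visits = [
    start + timedelta(days=d)
    for d in range(28)
    if (start + timedelta(days=d)).weekday() != 2  # 2 is Wednesday
]

print(visits_by_weekday(visits))
print(silent_weekdays(visits))
```

The analysis step ends with a list of silent weekdays. Deciding what that list means belongs to the data creation and system of beliefs parts of the model, not to the code.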

I was thinking about this just before I had the pleasure of reading Victor's Magic Ink, where I found this similar model, itself "* Adapted from Claude Shannon, A Mathematical Theory of Communication (1948), p2.":


This is interestingly parallel to the data/analysis model above. One difference worth articulating is that, for the data analyst, the phenomenon of interest is commonly not trying to communicate. And the "tool" is not necessarily designed for the purpose it finds itself being used for. And of course, the analyst is usually, at least to some extent, both building the platform and acting as its user, on an ongoing and iterative basis.

In all cases, successful learning on the right-hand side hinges on appropriate meta-awareness of the whole process.

This post was originally hosted elsewhere.