Big Data or Big Distraction

Contrary to what you have heard, the unfolding technological transformation we are witnessing isn’t really about data, not directly at any rate. It’s not that data isn’t important, but the focus on data is obscuring the real nature of change, which is the transition from a world driven by essentially static and reactive systems to one driven by hyper-localized, adaptive control systems.

These controllers are already in our cars, homes, and offices, and will be in our clothing and our parks, literally woven into the fabric of our physical environment. The future will not be defined by how much data is collected, but by the complexity and responsiveness of our localized environments.

Data sounds nicer than control

Unfortunately, control and control systems aren’t commonly used terms or ideas, even in many of the applied data fields (Marketing, that’s you I am talking about), but they really should be. So what is control, and why is it important? Control is a process of making decisions and accepting feedback in order to achieve some objective. In other words, a controller is something that senses and acts; it isn’t inert like data.

Let’s use a simple example of a common controller: your basic thermostat. Your thermostat’s objective is to maintain a certain temperature in a room, or in your house. It does this, in the simplest case, by checking the temperature of the room (this is data collection) and then, based on its reading, choosing to Heat, Cool, or do Nothing.
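To make that concrete, here is a minimal sketch of the thermostat’s decision rule in Python. The function name, the target temperature, and the tolerance band are assumptions made for illustration; a real thermostat would add things like a minimum run time.

```python
from enum import Enum


class Action(Enum):
    HEAT = "heat"
    COOL = "cool"
    NOTHING = "nothing"


def thermostat_action(reading: float, target: float, tolerance: float = 0.5) -> Action:
    """Pick an action from a single temperature reading.

    `tolerance` is an assumed dead band around the target so the
    controller does not flip between heating and cooling on tiny changes.
    """
    if reading < target - tolerance:
        return Action.HEAT
    if reading > target + tolerance:
        return Action.COOL
    return Action.NOTHING
```

The reading is the data, but the decision is what changes the room: sense, decide, act, then sense again.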

The rules that govern how the controller behaves are called the control logic. In simple cases, like our thermostat, the control logic can be easily written out by a human. However, more advanced applications, like self-driving cars, are so complex that we will often need to learn much of the control logic from data, rather than have it directly programmed by people.

Why write it when the machine can learn it?

This is where data plays one of its major roles: helping to learn the control logic. By employing machine learning (see our data science posts here and here), we can learn the basic logic required for a particular controller. We can then hone and optimize the efficacy of the controller by embedding additional systems for updating the controller’s logic after it has been deployed. These adaptive systems use the current data from the system’s environment to continuously update and improve the control logic.
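One simple way to picture such an adaptive system is a controller that keeps a running estimate of how well each action has worked and revises it as feedback arrives. The sketch below uses an epsilon-greedy rule, which is just one illustrative choice among many; the class name, the epsilon value, and the idea of a single numeric reward are all assumptions made for the example.

```python
import random


class AdaptiveController:
    """Epsilon-greedy controller: mostly exploit the best-known action,
    occasionally explore, and update estimates from feedback."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon  # assumed exploration rate
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}  # running mean reward
        self.rng = random.Random(seed)

    def choose(self):
        # With probability epsilon, try a random action to keep learning.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        # Otherwise take the action with the highest estimated reward.
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental mean: the control logic changes after deployment.
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n
```

Each call to `update` nudges the control logic using data from the live environment, which is the sense in which the controller keeps learning after it ships.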

Big Data is afraid of its shadow prices

Folks who are excited about Big Data should start to think less about data per se, and more about how data will drive 1) the creation of more powerful control logic; and 2) improved precision, by giving control systems access to more precise and higher-dimensional data.

Framing data in terms of the control problem naturally leads to real data questions, like: what if I didn’t have this bit of data? How much less effective would the system be? In other words, you can start to think about the marginal value of each new bit of data, so that you can move toward having an optimal volume and precision of data with respect to your goals and objectives.
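As a rough illustration of that marginal-value question, the sketch below scores a toy lookup-table controller with and without a given field and reports the difference. The log, the field names, and the scoring rule are all hypothetical, and the score is computed on the same rows the policy is built from, so treat it as a picture of the idea rather than a sound evaluation method.

```python
from collections import Counter, defaultdict

# Hypothetical log: each row is (context, best_action).
LOG = [
    ({"cold": True, "weekend": False}, "heat"),
    ({"cold": True, "weekend": True}, "heat"),
    ({"cold": False, "weekend": False}, "nothing"),
    ({"cold": False, "weekend": True}, "nothing"),
]


def score(fields, log=LOG):
    """Fraction of rows where a lookup policy that sees only `fields`
    picks the best action."""
    votes = defaultdict(Counter)
    for context, best in log:
        key = tuple(context[f] for f in fields)
        votes[key][best] += 1
    # For each distinct key the policy picks the most common best action.
    correct = sum(c.most_common(1)[0][1] for c in votes.values())
    return correct / len(log)


def marginal_value(field, fields, log=LOG):
    """How much the score drops when `field` is withheld."""
    rest = [f for f in fields if f != field]
    return score(fields, log) - score(rest, log)
```

In this toy log, withholding `cold` costs the controller half of its correct decisions, while withholding `weekend` costs nothing, so only one of the two fields is worth paying to collect.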

Pearls of Wisdom or ‘Correlation isn’t Causation’

While true, “Correlation isn’t Causation” is often proudly exclaimed without any real follow-up about what it actually means. By taking a control perspective, we can begin to get a little clarity on how to differentiate data that provides correlations from data that provides causal relationships.

Data that is passively gathered will tend to give you correlations. The data that you gather from your controller’s actions, however, will give you causal relationships, at least with respect to the actions that the controller takes. In fact, you can think of A/B testing as employing a type of dumb controller, one that takes random actions. If you want to learn a bit more about the topic from an actual expert, take a look at Judea Pearl’s work (opens a PDF).
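A small simulation can make the difference visible. In the sketch below, a hidden trait (`engaged`) drives both who gets the action and who converts. In the passive data the action looks far more effective than it is; when the "dumb controller" assigns the action at random, the estimate lands near the true effect. Every number here (the probabilities, the effect size, the sample size) is made up for illustration.

```python
import random

TRUE_EFFECT = 0.05  # assumed lift in conversion probability caused by the action


def _rate(outcomes):
    return sum(outcomes) / len(outcomes)


def estimated_lift(n, randomized, seed=0):
    """Difference in conversion rate between treated and untreated users."""
    rng = random.Random(seed)
    outcomes = {True: [], False: []}
    for _ in range(n):
        engaged = rng.random() < 0.5  # hidden confounder
        if randomized:
            # A/B test: the action ignores everything about the user.
            treated = rng.random() < 0.5
        else:
            # Passive data: engaged users are far more likely to be treated.
            treated = rng.random() < (0.8 if engaged else 0.2)
        p_convert = 0.10 + 0.30 * engaged + TRUE_EFFECT * treated
        outcomes[treated].append(rng.random() < p_convert)
    return _rate(outcomes[True]) - _rate(outcomes[False])


passive = estimated_lift(100_000, randomized=False)
ab_test = estimated_lift(100_000, randomized=True)
print(f"passive estimate: {passive:.3f}")
print(f"A/B estimate:     {ab_test:.3f} (true effect {TRUE_EFFECT})")
```

The passive estimate mixes the effect of the action with the effect of being an engaged user. Randomizing the action breaks that link, which is why the controller’s own actions yield causal information.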

Data is Lazy, and leads to lazy thinking.

Here is the thing: data is passive. That makes it easy to collect and talk about. Integrating it into a working system or process is the hard part. Control, by definition, is active, and that makes it hard, because you now have to think about how the entire system is going to respond to each control action. That is probably one of the main reasons there is so much attention on data: you get to dodge the hard, but ultimately most valuable, questions.

*Edited 6/2/2018

Tags: Bigdata, DataScience

2 Comments

Ivan Sucharski

Posted May 26, 2013 at 6:56 am

While I agree with a good deal of your post, I disagree with the wrap-up, and believe you are being short-sighted, or singularly focused. Data has many purposes, including system optimization (part of what you refer to as “control”). The short-sighted portion is in the expectation that it is the only thing the data is useful for. In other words, it appears that you propose to only capture that data which allows you to tune your system towards the current goal (e.g. an automatic thermostat or other “smart” type of sensor). That suggests a singular focus on what I refer to as the “what”: what happened, and how can sensor x make that “what” happen more efficiently or with less or no user input, etc. Awesome. You now have the smartest sensor in the universe. However, you have no idea about the why. The human element (in a thermostat example) can be important for a lot of reasons, including guiding the creation of the next product, understanding who to market to, etc. In other words, things beyond the basic product function. That said, collecting data for the sake of collection is absurd (and all too common). Curating quality data sets around the variety of ways that the environment (including humans) interacts with your systems and products, extending beyond whatever optimizes the functioning of the current project, is, in my opinion, a valuable pursuit that guides a huge number of initiatives beyond the local optimum.

- Matt Gershoff
  
  Posted May 29, 2013 at 5:31 pm
  
  Hi Ivan, great comment, and you bring up a good point that data is also useful for supporting innovation.
  I have to think a bit about whether this is really outside of a control framework, since at a high level, I think it is fair to think about some optimal level of innovation, in which you might ask, ‘hey, how is innovation (rate, value, etc.) affected by the current state of collected data?’ We then ask what the marginal regret might be with respect to this optimum if we did not have access to this bit of data for the innovation process.
  I’m not sure though, maybe that isn’t the right way to think about it. Let me give that some thought.
  
  As to the ‘Why’ rather than ‘What’ – which is obviously important – I am not sure one necessarily gets that from passively collected data. That’s why I stuck in that section on correlation / causation with a reference to Pearl. While I think he prefers to think in terms of interventions (controlled events on the system taken from outside) rather than Actions (events from within the system), the point I was attempting to make is that causal data is collected via the measurement of responses to explicit actions we take via the control process. So you can think of split (A/B) testing as a way of collecting data about causal relationships.
  My guess is that we don’t really disagree. On your blog (and in some email communications) you mention curating data. I think your idea of curation gets right to the point – that one needs to make resource allocation decisions at data collection time. The point of thinking about control is that it hopefully provides a framework for guiding the curation process as well as trying to get folks to think explicitly about what it is they actually might do with the data.
  Great comment – looking forward to more!

Conductrics Blog