
Tools of a Data Scientist: The Right Question

While working with data, the most powerful tool we have isn’t a model or an algorithm; it is the question we ask. It is easy to start “model shopping” without pausing to ask, “What exactly am I trying to do with this data?”
Asking that question, and answering it thoroughly, tends to eliminate a lot of unnecessary work.

Sometimes, it might even turn out that the question we are investigating can be answered more rigorously with existing scientific or mathematical theory, letting us sidestep some of the uncertainty inherent in data-driven methods.

The most common mistake in data science isn’t using the wrong model – it’s using a powerful model to answer the wrong question.

To show you what I mean, let’s work through a dataset that is notoriously difficult: a financial time series.

The Scenario

Imagine we have ten years of daily price data for a single asset, like the S&P 500. The data is a time series of open, high, low and close prices. A hedge fund manager brings you this data and says: “I need insights”.
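To make this concrete, here is a minimal sketch, in pandas, of the shape such a dataset takes. The prices below are synthetic placeholders (we obviously don’t have the manager’s real feed), so only the structure matters, not the numbers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
dates = pd.bdate_range("2014-01-01", periods=2520)  # roughly ten years of trading days

# Random-walk closing prices; open/high/low derived with small perturbations.
close = 2000 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, len(dates))))
df = pd.DataFrame({
    "open": close * (1 + rng.normal(0, 0.002, len(dates))),
    "high": close * (1 + np.abs(rng.normal(0, 0.005, len(dates)))),
    "low": close * (1 - np.abs(rng.normal(0, 0.005, len(dates)))),
    "close": close,
}, index=dates)

print(df.head())
```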

What is the first thing you should ask?

“What do you want to know about this data?”

Depending on their answer, the path we take can look very different.

Path 1: The Predictive Question


Suppose the hedge fund asks: “Will the market go up or down tomorrow? I need a ‘buy/sell’ signal”.

Here, they want to predict the future direction of the market, so we need a model that maps today’s data to tomorrow’s direction.

This is the classic Classification problem in Data Science.

The models that solve this problem take in the dataset, D, and output a discrete label: Up or Down. I will examine one such model in a separate blog post.
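As a rough illustration of this framing (not of the model I’ll cover later), here is a hedged sketch using scikit-learn’s LogisticRegression, with lagged returns as an assumed, purely illustrative feature set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

returns = df["close"].pct_change().dropna()  # continuing from the df above

# Features: the previous five daily returns. Label: 1 if today's return is positive.
window = 5
X = np.column_stack([returns.shift(i) for i in range(1, window + 1)])[window:]
y = (returns.values[window:] > 0).astype(int)

# Respect time order: fit on the earliest 80%, score on the most recent 20%.
split = int(0.8 * len(X))
model = LogisticRegression().fit(X[:split], y[:split])
print("out-of-sample accuracy:", model.score(X[split:], y[split:]))
```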

Path 2: The Inference Question


Now imagine the manager doesn’t just want a buy/sell signal, they want to understand risk. This time around, they might ask, “What is the underlying volatility of this asset? How does it change over time? How correlated is it to my other assets?”

This is an inference and parameter estimation problem.

Here, the goal is to quantify a hidden, unobservable variable: in this case, volatility.

The general solution is to apply or develop a stochastic model that explains how the observed data depends on volatility, and then use that model to infer the volatility. I will demonstrate this inference process in a future post.
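As one simple, assumed instance of such a model (not necessarily the one I’ll cover in that post), here is a sketch that treats returns as draws from a distribution with slowly varying variance, and recovers that hidden volatility with an exponentially weighted estimator, the idea behind the classic RiskMetrics approach:

```python
import numpy as np

returns = df["close"].pct_change().dropna()

# Exponentially weighted variance: today's belief is mostly yesterday's belief,
# nudged by today's squared return. lam = 0.94 is the classic RiskMetrics value.
lam = 0.94
sigma2 = returns.iloc[0] ** 2
vol_path = []
for r in returns:
    sigma2 = lam * sigma2 + (1 - lam) * r**2
    vol_path.append(np.sqrt(sigma2))

annualised = np.array(vol_path) * np.sqrt(252)
print("latest annualised volatility estimate:", round(annualised[-1], 4))
```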

Interestingly, inference-based models can be applied to classification problems by appending a discrete decision function (for example, a threshold) to their output. That’s another concept I will explore soon.
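To illustrate what I mean, here is a toy sketch, assuming a Normal model for tomorrow’s return (an assumption of mine, not a claim about the right model): infer a probability first, then collapse it to a label with a threshold:

```python
import numpy as np
from scipy.stats import norm

# Placeholder inferred parameters for tomorrow's return distribution.
mu_hat = 0.0004
sigma_hat = annualised[-1] / np.sqrt(252)  # daily vol from the EWMA sketch

p_up = 1 - norm.cdf(0, loc=mu_hat, scale=sigma_hat)  # P(next return > 0)
signal = "up" if p_up > 0.5 else "down"              # the discrete step
print(f"P(up) = {p_up:.2f} -> {signal}")
```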

Path 3: The Generative Question


In a third scenario, the same manager might want to stress-test the firm’s risk model.

This risk model is a large, complex model designed to answer a question like “What is our firm’s total exposure to a market crash?” or “What is our 99% Value at Risk (VaR)?”

This risk model takes future price paths for all the firm’s assets as its input and forecasts the likely losses that could arise based on certain risks.
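For intuition, here is a minimal sketch of what the “99% VaR” part means operationally: given a sample of simulated portfolio losses, take the 99th percentile. The loss sample below is made up, standing in for the risk model’s output:

```python
import numpy as np

# Placeholder loss sample standing in for the risk model's output.
rng = np.random.default_rng(1)
simulated_losses = rng.normal(0, 1_000_000, size=10_000)

var_99 = np.percentile(simulated_losses, 99)
print(f"99% VaR: {var_99:,.0f}")
```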

In order to stress-test the model, one needs to run thousands of plausible futures through it. The problem, though, is that we only have one real future.

So, here we would need to synthetically generate thousands of realistic one-year price paths for the firm’s assets.

This is, roughly speaking, a Generative Question. It is generative because, in order to answer it, we need to create new, plausible futures that haven’t happened yet.

Now, there are several ways to do this. We could use the inference model from above as a simulator to generate thousands of new paths. Alternatively, we could employ machine learning models built specifically for Generation.
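As a bare-bones illustration of the first route, here is a sketch that treats the drift and volatility estimated from the historical returns as parameters of a random-walk simulator. That is a strong simplifying assumption on my part; a real stress test would use a richer model:

```python
import numpy as np

returns = df["close"].pct_change().dropna()
mu, sigma = returns.mean(), returns.std()

rng = np.random.default_rng(2)
n_paths, horizon = 1000, 252  # a thousand one-year paths of daily steps

# Draw daily simple returns, convert to log space, and compound forward.
shocks = rng.normal(mu, sigma, size=(n_paths, horizon))
paths = df["close"].iloc[-1] * np.exp(np.cumsum(np.log1p(shocks), axis=1))

print(paths.shape)  # (1000, 252): a thousand plausible futures
```

Each row of `paths` is one plausible year; feeding all of them through the risk model is the stress test.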

Again, I will explore this in a future blog post.

The main gist, however, is that a single dataset can provoke a plethora of questions, and each one requires a different mindset, not just a different model.

Before writing a line of code or training a neural network, pause and ask yourself:

“What kind of question am I really asking?” Everything else follows from there.

See you in the next post!