
In data science, we wield sophisticated tools that can make data bend and twist in remarkable ways.
Data itself is amoral; it simply reflects the world as it was measured, not as we wish it to be.
And yet, we can make data say almost anything if we try hard enough.
Every data scientist and statistician understands this. With the right tools and determination, you could contort a dataset to support nearly any narrative.
Statistics and probability theory provide a rigorous framework for studying data, but even they demand humility. We must occasionally zoom out and consider the big picture.
There will always be assumptions. And a good scientist should examine not only those assumptions but also their implications.
Speaking about mathematical models, George Box famously said: “All models are wrong, but some are useful.”
I think this statement embodies the humility essential to doing good data work. Real-world data carries uncertainty, and our models add their own: limited computation, imperfect design, and the simplifying assumptions we choose. The key is to acknowledge these uncertainties and still treat the data honestly, with as much rigor as we possibly can.
This is very hard to do, particularly in a click-bait society that prizes flashy results and the best numbers, but it is the right thing to do.
Recently, I have been reflecting on the way we anthropomorphize some of these tools, especially generative models like LLMs and image/video generators. They are extraordinary tools, able to create human-like content so convincingly that the boundary between reality and simulation begins to blur.
Yet we should not forget that behind the fancy chatbot and the well-appareled image generator lies a collection of mathematical operators, data objects, and pseudo-random number generators.
They are complex and extremely useful; however, they are closer to a calculator than to a conscious mind.
For the data scientist, this distinction matters. There are two ways to deceive in our field:
- With the data – by misrepresenting or cherry-picking it.
- With the model – by overstating its power or reliability.
The first is a violation of honesty; the second, a sin of marketing.
While we may never fully eliminate either kind of deception, we can mitigate both through transparency, documentation, and thorough quantification of uncertainty.
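To make "quantification of uncertainty" a little more concrete, here is a minimal sketch of one common approach: a percentile bootstrap around a sample mean. This is only an illustration, not a recommended recipe. The function name `bootstrap_mean_ci`, its parameters, and the sample values are all hypothetical and invented for this example, and the percentile bootstrap carries its own assumptions (for instance, that the sample is reasonably representative of the population).

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Return (mean, lower, upper) using a simple percentile bootstrap.

    Hypothetical helper written for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(values), means[lower_index], means[upper_index]


# Hypothetical measurements, invented purely for illustration.
sample = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.9, 3.6, 5.2]
mean, lower, upper = bootstrap_mean_ci(sample)
print(f"mean = {mean:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside the point estimate is a small act of transparency: it tells the reader how much the number could plausibly move if the data had been sampled differently.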
Humility, then, is not weakness.
It is a posture of respect – for data, for models, and for the limits of our understanding.
It reminds us that while models may be wrong, honesty about their wrongness is what makes them useful.