
Business leaders love to talk about data like it’s a vault of buried treasure.
They assume that if they can just ask the right question, the system will deliver a clean, perfect answer.
But anyone who’s actually tried to use raw data knows the truth. Getting answers isn’t like digging for gold. It’s more like digging through a landfill with a spoon.
Let’s start with what we mean by “messy” data.
It’s not always an unstructured blob. Messy can mean inconsistent. It can mean half-documented. It can mean semi-structured logs from a vendor’s tool, HTML scraped from a website, or five different spreadsheets from five different departments, each with its own idea of what “customer” means.
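To make that concrete, here’s a minimal sketch in Python with pandas, using invented file and column names, of what reconciling just two departments’ definitions of “customer” can look like:

```python
import pandas as pd

# Hypothetical exports: sales counts anyone who ever signed a contract,
# support counts anyone who opened a ticket in the last twelve months.
sales = pd.read_csv("sales_customers.csv")      # columns: CustID, Company Name, signed_date
support = pd.read_csv("support_customers.csv")  # columns: customer_id, company, last_ticket

# The join keys don't even share a name or a type until we normalize them.
sales["customer_id"] = sales["CustID"].astype(str).str.strip()
support["customer_id"] = support["customer_id"].astype(str).str.strip()

# An outer join exposes the disagreement: rows that exist in only one file
# are exactly the "customers" the two departments define differently.
merged = sales.merge(support, on="customer_id", how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

None of that work shows up in the final headcount, but all of it has to happen before the number means anything.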
The problem isn’t just that data is messy.
It’s that the mess hides itself until you try to ask something meaningful.
Most execs assume that because data exists, it must be ready to use. The reality is closer to this: someone asks for last quarter’s revenue by product line, and three weeks later an analyst is still reconciling exports that refuse to agree.
Every data team has their version of this story.
It’s not rare. It’s normal.
And it gets worse when you try to hand that mess off to an AI or a visualization tool that expects a clean schema.
Most platforms aren’t built to untangle chaos. They expect the groundwork to be done already. And most of the time, it isn’t.
When Target expanded into Canada, they made that same assumption—that the backend inventory data they migrated would "just work."
Instead, mismatches across item records, warehouse systems, and shelf data meant stores opened with empty shelves while the system claimed they were fully stocked.
What looked like a software issue was actually a data integration failure. They shut the whole operation down in under two years.
The cost? Over $2 billion. Clean-looking dashboards didn’t save them from messy underlying truth.
The biggest misconception about messy data is that it’s a one-time fix.
Just clean it once, standardize your tables, and move on. But real businesses don’t stop evolving.
New columns get added. Teams create workarounds. A new data source gets bolted on during an acquisition.
The mess regenerates.
Meanwhile, the questions keep coming: How many active customers do we have? Which channel drove last quarter’s growth? Why did churn spike last month?
Each one sounds simple.
But answering them often means tracing through raw logs, deciphering business logic no one wrote down, and loading schemas just to figure out what the system is actually doing.
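To give a sense of what that tracing looks like in practice, here’s a rough sketch in Python with SQLAlchemy, against a hypothetical warehouse connection. “Loading schemas” often just means dumping every table to find the column that actually holds the thing you were asked about:

```python
from sqlalchemy import create_engine, inspect

# Hypothetical connection string; the database details aren't the point.
engine = create_engine("postgresql://analyst@warehouse/prod")
inspector = inspect(engine)

# Walk every table in the schema looking for anything that might hold order status,
# because the documentation (if any exists) no longer matches the live system.
for table in inspector.get_table_names(schema="public"):
    columns = [col["name"] for col in inspector.get_columns(table, schema="public")]
    if any("status" in name or "state" in name for name in columns):
        print(table, "->", columns)
```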
This is where things fall apart.
Not because the data is impossible to clean, but because the cost to clean it every time is exhausting.
Even the best analysts get tired of repeating the same mental gymnastics.
And the wrong answer—especially when it sounds confident—can cost more than silence.
In 2023, Air Canada’s chatbot promised refunds that didn’t exist, pulling from old, unstructured documentation buried in the system.
The courts ruled the airline was still responsible.
That’s the danger of asking questions when you don’t actually know what’s powering the answers.
Bad data doesn’t just waste time. It erodes trust.
If your exec team has been burned by one bad report, they start to second-guess every future one.
If your frontline staff see dashboards that never match what they experience day to day, they stop using the tools entirely.
Eventually, teams stop asking hard questions.
Or worse, they revert to gut instinct because it feels safer than sorting through a mess that never gets better.
That’s exactly how organizations fall into cycles of finger-pointing and analysis paralysis.
Here’s something most BI decks won’t show you: every clean answer someone pulls from messy data represents hours of hidden labor.
That includes manual joins, inferred mappings, guessed aliases, column cleanups, and sanity checks.
Even seasoned analysts are often stuck double-checking if a column labeled creation_date means account creation or something else entirely.
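As an illustration of that hidden labor, here’s a sketch in Python with pandas, with a hypothetical accounts table and column names, of the sanity checks behind a single trusted number:

```python
import pandas as pd

accounts = pd.read_parquet("accounts.parquet")  # hypothetical export

# Before trusting creation_date as "account creation," an analyst typically checks:
# 1. Does it ever fall after the first recorded login? If so, it probably means something else.
post_login = (accounts["creation_date"] > accounts["first_login_at"]).sum()

# 2. Are there placeholder values that suggest backfilled or defaulted rows?
placeholders = accounts["creation_date"].isin(
    [pd.Timestamp("1970-01-01"), pd.Timestamp("2099-12-31")]
).sum()

# 3. How much of the column is simply missing?
missing = accounts["creation_date"].isna().mean()

print(f"{post_login} rows post-date first login, "
      f"{placeholders} placeholder dates, {missing:.1%} missing")
```

Only after those checks pass does the “simple” answer get written down, and none of that checking is visible in the chart it feeds.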
We like to pretend this is all solved with tools. But most tools just visualize what you give them.
If the foundation is cracked, the dashboard still stands—but the numbers don’t mean what you think they mean.
And if that dashboard breaks?
It’s never the platform’s fault. It’s the data person’s job to explain why the insights are suddenly wrong.
Sometimes, that breakdown goes far beyond reporting.
In Equifax’s 2017 breach, system logs flagged a vulnerability. But those logs were siloed, inconsistently tracked, and effectively invisible.
The cost of that buried insight?
A data breach affecting over 140 million people.
The raw deal with raw data isn’t that it’s broken.
It’s that we pretend it’s not. We cover the mess with charts and workflows, hoping no one notices the duct tape underneath.
But progress comes from honesty. From systems that show their work, that stream updates, that adapt to aliases and schema shifts without punishing the user. From tools that understand context, not just structure.
Transparency isn’t a nice-to-have. It’s the foundation of trust. And trust is the only thing that makes messy data usable.
We believe that business is built on transparency and trust.
We believe that good software is built the same way.
