How to Deal with a Problematic ETL Flow
My Playbook for Fixing and Future-Proofing Data Pipelines
Picture this: It’s a quiet Tuesday. You’re immersed in building your next report when—boom!—your inbox lights up with messages and Jira tickets. One of your key reports is broken. Again.
Every Data Engineer has faced it: a broken ETL (Extract, Transform, Load) flow that used to work just fine—until it didn’t.
Whether you're dealing with legacy flows, missing records, incorrect joins, or processes that silently fail, a problematic ETL pipeline can throw off your entire data infrastructure.
Here's how I approach these situations—step-by-step—from root cause analysis to future-proofing systems.
Step 1: Understand What Could Break
ETL flows are fragile and break easily, so knowing what might go wrong is half the battle of maintaining them.
Database Schema Changes: New columns or modified formats in source tables can disrupt downstream processes.
Data Quality Issues: Incomplete, inconsistent, or incorrect data can crash transformations or produce misleading results.
Scalability Limits: Processes built for 1K records may choke on 10M.
Performance Bottlenecks: Heavy loads or inefficient queries can quietly degrade performance or bring flows to a halt.
Once, I tried to join 2 empty columns and nearly crashed our server. In my defence, I didn’t know the columns were empty.
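A lightweight pre-flight check can catch several of these failure modes before any transformation runs. Here is a minimal sketch using pandas; the expected column names are hypothetical and should be replaced with your own schema.

```python
import pandas as pd

# Hypothetical expectations for an incoming extract; adjust to your own schema.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_date", "amount"}

def preflight_check(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found before any transformation runs."""
    problems = []

    # Schema changes: columns that disappeared or were renamed upstream
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"Schema change: missing columns {sorted(missing)}")

    # Data quality: columns that are entirely empty (the kind that once nearly crashed my server)
    empty_cols = [c for c in df.columns if df[c].isna().all()]
    if empty_cols:
        problems.append(f"Data quality: completely empty columns {empty_cols}")

    if df.empty:
        problems.append("Data quality: extract returned zero rows")

    return problems
```

Run it right after extraction and fail fast (or fire an alert) whenever the returned list is not empty.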
Step 2: Diagnosing the Problem
Debugging a broken pipeline can feel like finding a needle in a haystack—unless you’re set up for it. Here are techniques that save me time and effort:
Validate Raw Data
Sometimes the issue isn't how you're processing the data, but how it's being stored.
Before diving into complex transformations, check your raw input. Invalid or inconsistent source data is a common root cause of ETL failures. Implement data profiling to assess the structure, content, and quality of source data before processing.
For example, check for missing values, incorrect formats, or outliers that could disrupt transformations.
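As a minimal profiling sketch with pandas (the input file and column names here are assumptions for illustration):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Quick profile of each column: type, null percentage, and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_pct": (df.isna().mean() * 100).round(1),
        "distinct": df.nunique(),
    })

# Hypothetical sales extract
raw = pd.read_csv("raw_sales.csv")
print(profile(raw))

# Flag rows with unparseable dates or negative amounts before they break a transformation
bad_dates = raw[pd.to_datetime(raw["order_date"], errors="coerce").isna()]
outliers = raw[raw["amount"] < 0]
print(f"{len(bad_dates)} rows with unparseable dates, {len(outliers)} negative amounts")
```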
Comprehensive Logging
A robust logging mechanism is essential for tracking problematic records at critical stages of the ETL process. Log details such as:
Start and end times of each step.
Number of records processed.
Error messages and their context.
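A minimal sketch of this, using Python's standard logging module and a generic step-runner (the step names and log file are placeholders):

```python
import logging
import time

logging.basicConfig(
    filename="etl.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("etl")

def run_step(name, func, *args, **kwargs):
    """Run one ETL step and log its start/end times, record count, and any error."""
    log.info("START %s", name)
    start = time.time()
    try:
        result = func(*args, **kwargs)
        count = len(result) if hasattr(result, "__len__") else "n/a"
        log.info("END %s | %.1fs | %s records", name, time.time() - start, count)
        return result
    except Exception:
        log.exception("ERROR in step %s", name)
        raise
```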
I recommend creating a system that scans logs for keywords like "error" or "warning" and triggers automated alerts—emails, Teams messages, whatever works.
The goal: you find out something’s broken before the users do.
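A simple keyword scan over the log file is often enough to get there. The sketch below emails the offending lines; the SMTP host and addresses are placeholders, and Teams webhooks or any other channel would work just as well.

```python
import smtplib
from email.message import EmailMessage

KEYWORDS = ("ERROR", "WARNING")

def scan_and_alert(log_path="etl.log", recipient="data-team@example.com"):
    """Scan the log for keywords and email the offending lines (settings are placeholders)."""
    with open(log_path) as fh:
        hits = [line for line in fh if any(k in line.upper() for k in KEYWORDS)]
    if not hits:
        return

    msg = EmailMessage()
    msg["Subject"] = f"ETL alert: {len(hits)} suspicious log lines"
    msg["From"] = "etl-monitor@example.com"
    msg["To"] = recipient
    msg.set_content("".join(hits[:50]))  # first 50 matches keeps the email readable

    with smtplib.SMTP("smtp.example.com") as server:  # assumed internal SMTP host
        server.send_message(msg)
```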
Thorough Documentation
The one thing I'm constantly talking about with my team is documentation: "Document everything and version everything."
Inline comments for code clarity
Flow diagrams for high-level understanding
Version-controlled documentation for auditability
Documentation helps identify redundant or unnecessary steps, facilitates optimisation, and ensures knowledge transfer among team members.
I strongly suggest documenting inline. A well-documented process can serve as its own documentation.
Reconciliation Mechanism
Finally, once the pipeline runs, don’t assume it worked—verify it.
Implement a reconciliation process to compare raw and final data, ensuring data integrity. Simple counts (e.g., the number of records) and aggregations (e.g., the sum of key fields) can reveal discrepancies.
For example, if an ETL process extracts sales data, verify that the total sales amount in the raw data matches the final output.
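A minimal reconciliation sketch, assuming both ends are available as pandas DataFrames and that "amount" is the key field to compare:

```python
import pandas as pd

def reconcile(raw: pd.DataFrame, final: pd.DataFrame, amount_col: str = "amount") -> bool:
    """Compare record counts and the total amount between raw input and final output."""
    checks = {
        "row_count": (len(raw), len(final)),
        "total_amount": (round(raw[amount_col].sum(), 2), round(final[amount_col].sum(), 2)),
    }
    ok = True
    for name, (expected, actual) in checks.items():
        if expected != actual:
            ok = False
            print(f"MISMATCH {name}: raw={expected}, final={actual}")
    return ok

# Example: stop the pipeline (or raise an alert) if the numbers don't line up
# if not reconcile(raw_sales, loaded_sales):
#     raise RuntimeError("Reconciliation failed - investigate before publishing reports")
```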
Step 3: Proactive Optimisation
I know… your monolithic ETL process might work, for now. But could it run better? Cleaner? Faster? The answer is always YES!
Optimising it can radically improve its performance, maintainability, and scalability. Here are a few tips to turn legacy code into well-optimised flows:
Modular Design
Since I started learning about software development, the concepts of encapsulation and modularity have fascinated me. Being able to quickly change, update, or even remove a section of your code is incredibly useful.
Break down your ETL process into smaller, reusable modules, such as separate extraction, transformation, and loading steps. Your flow will be easier to test, scale, and maintain.
For example, if extracting data from three databases, create distinct extraction processes for each, format the data consistently, and then merge them.
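A sketch of what that split might look like; the connection strings, source names, and queries are assumptions, and the point is only that each piece can be tested, scaled, or replaced on its own.

```python
import pandas as pd
import sqlalchemy

# Hypothetical connection strings - one per source database
SOURCES = {
    "crm": "postgresql://user:pass@crm-db/crm",
    "billing": "postgresql://user:pass@billing-db/billing",
    "claims": "postgresql://user:pass@claims-db/claims",
}

def extract(source: str, query: str) -> pd.DataFrame:
    """One small, testable extraction step per source."""
    engine = sqlalchemy.create_engine(SOURCES[source])
    return pd.read_sql(query, engine)

def standardise(df: pd.DataFrame) -> pd.DataFrame:
    """Bring every source to the same column naming convention."""
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df

def merge(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Combine the consistently formatted extracts into one dataset."""
    return pd.concat(frames, ignore_index=True)

# customers = merge([standardise(extract(s, "SELECT * FROM customers")) for s in SOURCES])
```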
Unified Coding Standards
Consistency is key. Adopt universal formatting and coding standards across all your flows. This improves code readability and reduces confusion and errors during development and maintenance.
For example, use snake_case for column names and prefix temporary tables with tmp_.
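Small helpers can enforce conventions like these automatically. A sketch of two such helpers (the exact rules are an example, not a standard you must adopt):

```python
import re

def to_snake_case(name: str) -> str:
    """Normalise a column name like 'Order Date' or 'CustomerID' to snake_case."""
    name = re.sub(r"[\s\-]+", "_", name.strip())
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()

def tmp_table(name: str) -> str:
    """Apply the agreed prefix for temporary/staging tables."""
    return name if name.startswith("tmp_") else f"tmp_{name}"

print(to_snake_case("Order Date"))   # order_date
print(to_snake_case("CustomerID"))   # customer_id
print(tmp_table("daily_sales"))      # tmp_daily_sales
```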
Smart Alerting
Don’t wait for someone to tell you your dashboard is empty! Set up automated alerts for job failures, delayed runs, and unexpected record counts.
With alerts, you can start debugging immediately when issues arise, enabling rapid response without affecting your reports.
For example, if a transformation step fails, an alert can include the error message and affected step, allowing developers to start debugging promptly.
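Beyond outright failures, a few simple threshold rules catch delayed runs and suspicious record counts. A minimal sketch, with entirely hypothetical flow names and limits:

```python
from datetime import datetime, timedelta

# Hypothetical expectations per flow - tune these to your own history
EXPECTATIONS = {
    "daily_claims": {"min_rows": 10_000, "max_runtime": timedelta(hours=2)},
}

def check_run(flow: str, rows_loaded: int, started: datetime, finished: datetime) -> list[str]:
    """Return alert messages for unexpected record counts or delayed runs."""
    rules = EXPECTATIONS.get(flow, {})
    alerts = []
    if rows_loaded < rules.get("min_rows", 0):
        alerts.append(f"{flow}: only {rows_loaded} rows loaded (expected >= {rules['min_rows']})")
    if "max_runtime" in rules and finished - started > rules["max_runtime"]:
        alerts.append(f"{flow}: run took {finished - started}, over the {rules['max_runtime']} limit")
    return alerts

# Any messages returned can be pushed through the same email/Teams channel as the log scanner above.
```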
Insights at a Glance
Build a central dashboard to monitor the status of all your ETL flows, providing insights into their health and performance.
You can build it in your existing visualisation tool, such as Power BI, and track metrics like run times, error rates, data volumes, and overall status.
Having a dashboard like this can help you stay proactive and address potential issues before they escalate.
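The dashboard only needs one thing from the pipelines themselves: a small run-history table to read from. A sketch of how each flow might record its own metrics; SQLite is just a stand-in here for a table in your warehouse.

```python
import sqlite3
from datetime import datetime

def record_run(db_path: str, flow: str, status: str, rows: int, seconds: float) -> None:
    """Append one row of run metadata that the monitoring dashboard can chart."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS etl_runs (
                flow TEXT, run_at TEXT, status TEXT, rows INTEGER, seconds REAL
            )
        """)
        conn.execute(
            "INSERT INTO etl_runs VALUES (?, ?, ?, ?, ?)",
            (flow, datetime.now().isoformat(), status, rows, seconds),
        )

# record_run("monitoring.db", flow="daily_claims", status="success", rows=125_000, seconds=840.0)
```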
A Real Example: From 7 Hours to 3
Let me share a story that still sticks with me.
I was handed ownership of one of the most critical reports at my company—one that covers health insurance policies, claims, payments, SLAs, fraud detection mechanisms, and more.
The flow had a total runtime of 7 hours, generated 10 datasets, and loaded 30 reports.
New features and reports kept being requested, which challenged the stability of both the flow and our system. The performance issues generated many errors, and because the report was so critical, the complaints were frequent.
So I decided to refactor and optimise the whole process.
Here’s what I did:
Documented the entire flow — visually and technically.
Identified and removed unnecessary steps and heavy filters.
Introduced temporary staging tables to avoid redundant processing.
Split complex queries into smaller, more efficient ones.
Added a logging and alerting system to track the flow in real time.
All this took around 6 months, but the results were worth it. The total runtime dropped from 7 hours to just 3, and errors were rare.
Now, whenever I start a new process, I apply all of these principles, so debugging is easy, calm, and peaceful when an error does come up.
P.S.: Errors are Rare now 😁
Let’s Connect!
💼 LinkedIn: alex-kazos
👨🏻💻 GitHub: alex-kazos