Data Lineage
Data lineage refers to the life cycle of data—where it originates, how it moves through systems, what transformations it undergoes, and where it ultimately ends up. It provides a clear, traceable path from the source to the destination, offering visibility into how data is collected, processed, modified, and consumed across the organization.
Also known as: Data traceability, data flow tracking
Comparisons
- Data Lineage vs. Data Provenance: Provenance focuses on the original source and creation of the data; lineage includes the entire journey, including transformations and interactions across systems.
- Data Lineage vs. Data Catalog: A catalog organizes metadata for discovery, while lineage focuses on tracking the flow and dependencies of data.
Pros
- Improves transparency: Helps developers and analysts understand how data was derived, which is critical for troubleshooting issues, debugging pipelines, and interpreting results.
- Strengthens compliance: Essential for audits and regulatory requirements (e.g., GDPR, SOX) by demonstrating how sensitive data is used, changed, or stored.
- Enhances trust: Increases confidence in data quality and decision-making by allowing stakeholders to verify sources and transformation logic.
Cons
- Complex to implement: Capturing end-to-end lineage in real-time across diverse tools, formats, and platforms often requires deep integration and advanced tooling.
- Hard to maintain: As systems evolve, pipelines change, and new tools are added, lineage documentation can become outdated unless automated tracking is in place.
- Limited visibility in black-box systems: Some tools or external APIs don’t expose internal transformations, creating blind spots in the lineage graph.
Example
A machine learning model shows unexpected results. A data engineer uses data lineage tools to trace the input dataset back to its source and discovers a pipeline update had introduced a silent schema change two steps upstream—pinpointing the root cause without guesswork.