Why Do We Need It
Machine learning practitioners have long treated data quality as a preprocessing concern — something to resolve before training begins. Clean labels, remove outliers, balance classes, then train. This assumption breaks down in practice.
Training runs fail not because data was dirty at rest, but because data was poorly distributed across the learning process. A dataset with one million samples concentrated in a narrow value range teaches a model one corner of the world with extreme redundancy. A dataset with one thousand samples spread across the full domain teaches the model its shape. Volume is not quality. Distribution is quality.
Existing approaches to data quality — Cleanlab, EL2N, GraNd — operate on the data itself. They require access to raw samples, labels, or embeddings. This creates two problems. First, they cannot be computed without exposing the dataset, which is a real constraint for privacy-sensitive domains. Second, they produce per-sample scores, not a scalar that reflects the health of the entire training trajectory.
There is currently no single, model-agnostic, real-time scalar that answers the question a practitioner actually asks mid-training: is my data feeding this model well, right now, at this phase of learning?
DQI is that scalar.
The Four Failure Modes
Training data quality is not a single problem. It decomposes into at least four distinct failure modes, each invisible to the others.
- Coverage Failure (DS): A model trained on data that never explores the full input domain will generalize poorly outside the region it saw. This is not a label problem or a noise problem — the labels can be perfect and the model still fails. The domain was simply never covered.
- Variation Failure (VS): Data that exists across a wide domain but carries no meaningful signal variation teaches the model nothing about the relationship between input and output. Flat data produces flat models.
- Structural Failure (SC): Even with good coverage and good variation, data that lacks complexity in its underlying shape — no curvature, no regime changes, no transitions — cannot teach a model to handle the non-linearities it will encounter in production.
- Density Failure (DU): Data clustered in one region of the domain, regardless of total volume, creates a model that is locally expert and globally ignorant. This is the mathematical expression of the intuition: ninety percent of the samples landing in two percent of the buckets means the remaining ninety-eight percent of the world was never taught.
DQI solves all four simultaneously through a single multiplicative scalar. Each failure mode maps to one component — DS, VS, SC, DU respectively. The multiplicative structure enforces that all four must be healthy for DQI to be high. A perfect score on three components cannot compensate for collapse in the fourth. This models the AND logic of real data quality: coverage AND variation AND structure AND density.
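As a minimal illustration of this AND logic (the component values here are made up for the example):

```python
# Hypothetical component scores: three healthy, one collapsed
ds, vs, sc, du = 0.9, 0.9, 0.9, 0.0

# Multiplicative fusion: every component must be healthy,
# so collapse in any one component zeroes the phase score
phase_dqi = ds * vs * sc * du
assert phase_dqi == 0.0
```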
Critically, DQI computes these four properties not from the raw data itself but from the training dynamics — the loss curve and gradient behavior that any training run produces regardless of modality. This makes DQI model agnostic by construction. An LLM and a vision model and a tabular classifier all produce a loss curve. DQI reads that curve. It never sees a token, a pixel, or a feature.
Methodology
DQI is computed in four sequential stages: signal extraction, phase detection, component scoring, and phase fusion.
STAGE 1: Signal Extraction
The first stage converts any training run into a universal two-dimensional signal (x, y) regardless of model architecture or task type.
- x is the training step index. It forms the domain axis — the temporal backbone of the training run.
- y is the primary quality signal. By default this is the loss value at each step. When loss is unavailable, gradient norm is used as a fallback.
This translation layer is what makes DQI model agnostic. Once the training run exists as (x, y), the architecture that produced it is irrelevant.
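A minimal sketch of this translation layer (the function name and argument handling are our assumptions, not a reference implementation):

```python
import numpy as np

def extract_signal(losses=None, grad_norms=None):
    """Map a training run to the universal (x, y) signal.

    y is the per-step loss by default; gradient norm is the
    fallback when loss is unavailable. (Illustrative helper.)
    """
    y = losses if losses is not None else grad_norms
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)  # step index = domain axis
    return x, y
```

Any architecture that logs a loss per step can feed this function; nothing downstream needs to know what produced the numbers.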
STAGE 2: Phase Detection
A single DQI score computed across an entire training run obscures phase-specific behavior. DQI therefore decomposes every run into three phases — Early, Mid, Late — and scores each independently before fusing.
- Smooth y using a moving average window of size max(3, n/20) to suppress step-level noise
- Compute the absolute first derivative of the smoothed signal
- Build the cumulative sum of that derivative — representing total accumulated change
- Place the Early/Mid boundary where 40% of total change has occurred
- Place the Mid/Late boundary where 75% of total change has occurred
This means phase boundaries shift naturally with the run. The math follows the data.
STAGE 3: Component Scoring
Each phase slice is scored independently across four components. All four operate on the (x, y) slice for that phase.
- DS (coverage): Applies the Nyquist-Shannon sampling intuition: effective coverage equals span divided by mean spacing, normalized via tanh to [0, 1].
- VS (variation): Computed as 2σ/range. A flat loss — no learning — collapses VS to zero and therefore collapses DQI to zero.
- SC (structure): Computes the variance of the second derivative normalized by the mean absolute first derivative, passed through tanh.
- DU (density): Uses the coefficient of variation of inter-step gaps: DU = 1 / (1 + CV). Perfectly uniform spacing gives DU = 1.
STAGE 4: Phase Fusion
The three phase scores combine into a single total DQI via weighted sum:

DQI_total = 0.50 · DQI_Early + 0.30 · DQI_Mid + 0.20 · DQI_Late
The weights reflect an asymmetry in how training phases contribute to final model quality. The Early phase establishes the foundation. The 0.50 / 0.30 / 0.20 split encodes this asymmetry explicitly.
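The weighted sum is a one-liner; the phase scores in the usage example are illustrative values, not real run output:

```python
def fuse_phases(early, mid, late):
    # Early-weighted sum: early training decisions compound
    return 0.50 * early + 0.30 * mid + 0.20 * late

# Example: a run that starts strong and degrades still scores
# higher than its unweighted mean would suggest
total = fuse_phases(0.8, 0.6, 0.4)
```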
Edge Cases
DQI behaves predictably under pathological inputs. The following cases are documented explicitly because they represent real training scenarios.
- Flat loss. Loss remains constant across all steps — variation score VS collapses to zero. A DQI of zero is the right answer.
- Extremely short runs. Runs with fewer than three steps cannot produce meaningful component scores. DQI returns zero and exits cleanly.
- Duplicate steps. All component functions include a 1e-12 epsilon guard on denominators, preventing division by zero.
- Extreme values. The tanh normalization in DS and SC saturates smoothly. The clip operations on all outputs enforce the [0, 1] boundary.
- Negative loss. DQI handles negative y values correctly because all components operate on relative quantities. Absolute scale is irrelevant.
- NaN propagation. If loss values contain NaN, NaN propagates through the math and DQI returns NaN. This is mathematically honest, but a future version should strip or interpolate NaN values.
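Two of these cases can be checked directly; the epsilon value below matches the guard described above:

```python
import numpy as np

EPS = 1e-12

# Flat loss: std and range are both zero, so the guarded
# variation score returns 0 instead of dividing by zero
flat = np.full(100, 1.2345)
vs = 2.0 * flat.std() / ((flat.max() - flat.min()) + EPS)
assert vs == 0.0

# NaN propagation: one NaN poisons every downstream derivative
noisy = np.array([1.0, np.nan, 0.5, 0.25])
assert np.isnan(np.abs(np.diff(noisy)).sum())
```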
Limitations
Intellectual honesty requires stating what DQI does not measure and where its current implementation has known gaps.
It does not see the data itself. Two training runs can produce identical DQI scores from completely different underlying data realities. DQI tells you the training trajectory was healthy or unhealthy. It does not tell you why.
The 0.50 / 0.30 / 0.20 weighting across Early, Mid, and Late phases is grounded in the intuition that early training decisions compound. However, the specific numerical values have not been validated across a large population of diverse training runs.
When loss values contain NaN, NaN propagates. A future implementation will strip or interpolate NaN values inside the signal extractor.
Cross-modality comparability of DQI scores — whether a DQI of 0.72 on an LLM fine-tune is comparable to a DQI of 0.72 on a vision model — is an open question.
Future Possible Additions
DQI in its current form is a foundation. What follows are the natural extensions that become possible once real-world usage data accumulates.
- Learning the optimal phase weights empirically via regression across run outcomes and phase DQI scores.
- A pre-processing step inside the signal extractor that identifies NaN regions, strips them, and interpolates.
- Accepting schedule annotations to align phase boundaries with the actual intended structure of the training run.
- Contextualizing scores against a population of similar runs to produce actionable percentiles.
- Fusing loss and gradient norm simultaneously to distinguish between getting stuck and genuine convergence.
- Computing a DQI profile across layers for transformer models to identify which layers received rich gradient signals.