How does the PLARV Argus Anchor Point work?

The Anchor Point Protocol identifies the last genuinely stable step in a training run, allowing researchers to restore optimizer states and weights to a known clean baseline instead of restarting from scratch.

What is the Forensic Social Hub?

The Forensic Social Hub is a collaborative workspace where investigators and researchers can audit training runs together, deep-diving into signal breaches and tracking resolutions across the fleet.

PLARV Argus | ML Training Protection System

Training machine learning models is increasingly expensive in both compute and time. A single undetected failure in a multi-day run can waste thousands of dollars and set a project back by weeks. Yet the tools practitioners rely on for training oversight — experiment trackers, metric dashboards, manual threshold alerts — are fundamentally passive. They record what happened. They do not detect that something is going wrong.

Argus is a self-calibrating training anomaly engine that monitors active training runs without requiring threshold configuration, architecture-specific tuning, or prior training history. Rather than comparing observed values against fixed boundaries, Argus uses a proprietary calibration mechanism that adapts to each run. This makes detection relative rather than absolute, a property necessary for monitoring across architectures and training regimes where no universal definition of "anomalous" exists.

The system operates across two tiers. A cloud-based telemetry engine detects eleven failure modes observable from aggregated scalar signals — including gradient explosion, vanishing gradients, loss divergence, training stagnation, oscillation, and distribution shift — without receiving model weights, gradients, or training data. A companion open-source local detector runs entirely on the user's machine and covers failure modes that require raw model access, including dead neurons, activation saturation, and optimizer state corruption.

Argus reports which signals are anomalous and for how long. It does not assign failure mode labels, because identical signal patterns are consistent with multiple root causes and a wrong automated diagnosis is worse than no diagnosis. Intervention recommendations are surfaced to the user rather than applied automatically by default, reflecting the same epistemic honesty: detection is strong, causal attribution is not.

We validate Argus across five benchmarks covering ghost recovery, slow divergence, dual-signature catastrophic failure, three architecture families, and real-world fine-tuning simulation. Argus maintains a false positive rate below one percent on healthy runs while detecting injected failures with a mean lead time of tens to hundreds of steps before terminal collapse — providing actionable warning within a window where intervention is still meaningful.

Introduction

Every machine learning practitioner has experienced a version of the same failure. A training run is configured, launched, and left to execute. Hours or days later, the run is dead — loss diverged, gradients exploded, or the model simply stopped learning without any visible signal. By the time the failure is discovered, the compute budget is spent and the timeline has slipped. The question is never whether the failure was detectable in principle. It almost always was. The question is whether anyone or anything was watching in the right way.

The standard answer to this problem is threshold-based alerting. Set a gradient norm ceiling. Set a loss patience window. If the value crosses the number, fire an alert. This approach has a fundamental flaw that practitioners encounter immediately: there is no universal number. A gradient norm of 5.0 is catastrophic for one architecture and completely normal for another. A loss of 2.5 can be healthy for a transformer and a sign that an MLP has stopped learning. A threshold calibrated for one run will produce false alarms on the next and miss real failures on the one after that. The result is one of two failure modes: alerts that fire constantly and get ignored, or alerts so conservative they miss the failures they were designed to catch. Both outcomes are worse than no alerting at all.

Experiment tracking platforms like Weights and Biases and Comet address a related but different problem. They log and visualize training metrics with high fidelity, enabling post-run analysis and cross-run comparison. What they do not do is detect anomalies within a run in progress. A user watching a gradient norm chart can see an explosion. The platform does not tell them it is an explosion — they have to recognize it themselves. Detection remains a human responsibility.

Recent research has begun to address real-time training failure detection more directly. The R-Metric combines hardware signals, gradient variance, and validation loss into a composite reliability score, demonstrating that multi-signal approaches outperform single-metric monitoring. The L4 framework provides automated log analysis for diagnosing failures after they occur in large-scale distributed training. Both represent meaningful progress. Both retain assumptions that limit their generality — fixed signal weights, dependence on validation loss, or focus on post-failure diagnosis rather than real-time prevention.

The underlying problem none of these approaches fully solve is that training dynamics are run-specific. What constitutes an anomalous gradient norm, an unusual loss trajectory, or a suspicious direction similarity score depends entirely on what that specific run established as normal during its own early steps. A monitoring system that imports thresholds from outside the run — from the user, from a prior run, from a paper — is making an assumption that may be wrong for this model, this dataset, this hardware configuration, and this random seed. The assumption is invisible until it fails.

Argus is built on a different premise. Every detection boundary in the system is derived internally without user configuration. No threshold is imported from outside. No architecture-specific configuration is required. The engine asks not whether a value is large, but whether it is anomalous relative to what this specific run has established as normal. This makes the detection signal meaningful by construction, regardless of architecture or training regime.

This paper describes the design of Argus, its empirical validation across five benchmarks and three architecture families, and its honest limitations. We make no claim that Argus catches every failure — the signal taxonomy covers eleven observable failure modes out of a larger universe, and causal attribution from signals alone is not scientifically defensible. What we claim is that Argus catches the failures that matter most, catches them before terminal collapse, and reports what it knows without overstating it.

Background and Related Work

Training monitoring is not a new problem. The tools practitioners use today fall into three categories: experiment trackers, threshold alerting systems, and recent academic approaches to automated failure detection. Understanding what each does — and does not do — establishes the gap Argus occupies.

Experiment Trackers

Weights and Biases and Comet are the dominant platforms for training oversight. Both provide high-fidelity logging of gradients, losses, learning rates, and system metrics throughout a run, with rich visualization and cross-run comparison. Their value is real and well-established for experiment management. What they do not provide is detection. A user watching a gradient norm chart can see an explosion after it has happened. The platform records it. It does not identify it as anomalous, classify it, or act on it. Gradient norm logging, as the Weights and Biases documentation describes it, "helps debug training issues" — meaning a human must do the debugging. Detection remains a manual responsibility.

Threshold Alerting

The standard programmatic approach to automated detection is threshold alerting — fire when a metric crosses a user-defined boundary. PyTorch Lightning's built-in callbacks, custom training loop guards, and MLOps platform alert rules all follow this pattern. The structural limitation is well understood by practitioners even if rarely stated explicitly in the literature: thresholds require the user to know upfront what anomalous looks like for their specific run. A gradient norm ceiling calibrated for a transformer will produce false alarms on an LSTM and miss real failures on a diffusion model. A loss patience window appropriate for a fast-converging CNN will time out too early on a slow language model warmup. The threshold is always wrong for someone.

A further limitation is that threshold systems detect events, not trends. A gradient norm that climbs from 0.8 to 0.9 to 1.1 to 1.4 over three hundred steps never crosses a ceiling of 10.0. The run is dying. No alert fires. This is the slow divergence failure mode — silent, gradual, and consistently missed by threshold-based approaches because no single step produces a number large enough to trigger.
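The slow divergence example above can be made concrete with a few lines of Python. This is a minimal sketch, not Argus's detection logic: the helper names `crosses_ceiling` and `relative_drift` and the window size are assumptions chosen for illustration. It contrasts a fixed ceiling of 10.0 with a simple comparison of the run's recent values against its own early values.

```python
from statistics import mean


def crosses_ceiling(norms, ceiling=10.0):
    """Fixed-threshold check: fires only if some single value exceeds the ceiling."""
    return any(n > ceiling for n in norms)


def relative_drift(norms, window=50):
    """Mean of the last `window` values divided by the mean of the first `window`.

    Illustrative trend measure only; values well above 1.0 indicate a sustained climb.
    """
    if len(norms) < 2 * window:
        raise ValueError("need at least two full windows of data")
    return mean(norms[-window:]) / mean(norms[:window])


# Synthetic run: gradient norm climbs linearly from 0.8 to 1.4 over 300 steps.
norms = [0.8 + 0.6 * i / 299 for i in range(300)]

ceiling_fired = crosses_ceiling(norms)  # the ceiling never fires
drift = relative_drift(norms)           # recent norms are roughly 1.6x the early ones
```

The ceiling check stays silent for the whole run because no single value approaches 10.0, while the drift ratio shows the norm has grown substantially relative to where the run started.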

Recent Academic Work

Two recent papers address training-time failure detection more directly and represent the closest prior work to Argus.

The R-Metric, introduced at EMNLP 2025, combines hardware monitoring signals, gradient variance, and validation loss change into a composite reliability score. Evaluated across 720 simulated runs and four real architectures, it achieves strong detection performance with a mean lead time of 255 steps before terminal failure. The R-Metric demonstrates that multi-signal detection substantially outperforms single-metric monitoring. Its limitations relative to Argus are three: the signal weights are fixed across architectures rather than derived from each run, it requires validation loss which many training configurations do not produce at step frequency, and it produces a single composite score rather than a signal-level breakdown that practitioners can investigate.

The L4 framework, published at FSE 2025, addresses a different point in the failure lifecycle. Rather than detecting failures in progress, L4 automates the diagnosis of failures after they have occurred in large-scale distributed training, analyzing log patterns across nodes and training stages. Its empirical study of 428 production LLM training failures establishes that 74.1% of failures occur during the iterative training process itself and that 89.9% of cases require detailed manual log analysis for resolution. These figures establish both the prevalence of training failures and the cost of the current manual diagnosis process that motivates automated approaches.

The Gap

What the existing landscape lacks is a system that derives its detection boundaries from the run being monitored rather than importing them from outside. Experiment trackers visualize but do not detect. Threshold systems detect events but not trends, and require configuration that cannot generalize across architectures. The R-Metric advances multi-signal detection but retains fixed weights and external dependencies. L4 diagnoses failures but does not prevent them.

Argus occupies the space between active training and post-failure diagnosis — watching a run in progress, establishing what normal looks like without user input, and detecting deviation before the cliff arrives. No prior system fully occupies this space with zero threshold configuration and no dependence on validation loss or prior training history.

The Argus Architecture

Argus is designed around a single organizing principle: a monitoring system should never require the user to define what failure looks like. The user does not know upfront. The system should learn it from the run itself.

This principle has architectural consequences. It rules out any design where detection boundaries are configured before training begins. It rules out cross-run baselines, because what is normal for one run may be anomalous for another. It requires that the system establish what healthy dynamics look like for this specific run and measure all subsequent steps against that reference. Everything else in the architecture follows from this.
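One way to picture a run-relative reference is sketched below. Argus's calibration mechanism is proprietary and is not described in this paper, so this sketch substitutes a generic robust baseline (median and median absolute deviation over an assumed warmup window). The class name `RunBaseline`, the `warmup_steps` parameter, and the scoring rule are all hypothetical.

```python
from statistics import median


class RunBaseline:
    """Robust baseline learned from a run's own early steps (hypothetical helper)."""

    def __init__(self, warmup_steps=100):
        self.warmup_steps = warmup_steps
        self._warmup = []
        self.center = None
        self.scale = None

    def update(self, value):
        """Feed one observation. Returns None during warmup, else a deviation score."""
        if self.center is None:
            self._warmup.append(value)
            if len(self._warmup) == self.warmup_steps:
                self.center = median(self._warmup)
                mad = median(abs(v - self.center) for v in self._warmup)
                self.scale = max(mad, 1e-12)  # guard against a zero spread
            return None
        return (value - self.center) / self.scale


def score_after_warmup(warmup_values, probe):
    """Calibrate on `warmup_values`, then score a single later observation."""
    baseline = RunBaseline(warmup_steps=len(warmup_values))
    for v in warmup_values:
        baseline.update(v)
    return baseline.update(probe)


# Two synthetic runs whose healthy gradient norms live on different scales.
run_a = [5.0 + 0.1 * ((i % 5) - 2) for i in range(100)]   # hovers around 5.0
run_b = [0.5 + 0.01 * ((i % 5) - 2) for i in range(100)]  # hovers around 0.5

score_a = score_after_warmup(run_a, 5.0)  # unremarkable for run A
score_b = score_after_warmup(run_b, 5.0)  # far outside anything run B has seen
```

The same observed value of 5.0 scores near zero against run A's baseline and several hundred spread-units away from run B's, which is the sense in which detection is relative rather than absolute.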

Two Tiers, One Boundary

Argus operates across two distinct tiers separated by a deliberate boundary: what can be detected from aggregated scalar telemetry, and what requires raw access to model weights and activations.

The boundary is not arbitrary. It reflects a real constraint in production machine learning: practitioners training valuable models on proprietary data cannot be expected to send weight tensors or gradient vectors to an external service. The information required to reconstruct a model's learned representations, or to infer properties of the training data, must never leave the user's machine. This is not a limitation Argus works around — it is a design constraint Argus is built to respect.

Supporting diverse architectures requires understanding that a transformer's healthy gradient dynamics differ fundamentally from a CNN's, which differ from an LSTM's. Argus handles this through internal architecture-aware signal interpretation that requires no user configuration.

The consequence is a two-tier system. The cloud-based telemetry engine receives only aggregated scalar signals — loss values, gradient norms, similarity scores, per-sample statistics — and performs all detection from these. The local detector runs entirely on the user's machine as open-source software, has full access to model weights and activations, and covers the failure modes that aggregated scalars cannot observe. No data from the local detector is transmitted anywhere.

The Telemetry Engine

The telemetry engine is the cloud component. It receives per-step telemetry from the SDK, maintains run state in a persistent store, and returns a harm pressure score and signal breakdown with each response.

What it receives per step is deliberately minimal. Loss, loss delta, gradient norm, gradient similarity, optional per-sample loss and confidence distributions, and lightweight architecture metadata. No weight tensors. No raw gradients. No activation values. No training data. The complete list of fields is documented in Section 7.
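To illustrate how small these quantities are, the sketch below derives four of the listed scalars from two consecutive gradients. It assumes that "gradient similarity" means cosine similarity between the current and previous flattened gradient, which the text does not specify. The function names and dictionary keys are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat sequence of numbers."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    na, nb = l2_norm(a), l2_norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def step_scalars(loss, prev_loss, grad, prev_grad):
    """Reduce one training step to a handful of scalars (hypothetical key names)."""
    return {
        "loss": loss,
        "loss_delta": loss - prev_loss,
        "grad_norm": l2_norm(grad),
        "grad_similarity": cosine_similarity(grad, prev_grad),
    }


# Toy two-parameter "model": only the four scalars would leave the machine.
scalars = step_scalars(loss=2.10, prev_loss=2.25, grad=[3.0, 4.0], prev_grad=[4.0, 3.0])
```

However many parameters the model has, the result is the same four numbers; the gradient lists themselves are never part of the returned value.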

The engine maintains state across steps for each run and continuously monitors training dynamics.

The engine returns a harm pressure value from zero to three with each step response. Zero means all signals within expected range. One means developing pattern — monitoring closely. Two means confirmed anomaly — investigation recommended. Three means critical — immediate action required. It also returns the specific signals that are active, their severity relative to the run's own baseline, and how many steps they have been active. It does not return a failure mode label, for reasons discussed in Section 5.
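A client might model this response along the following lines. The four levels and their meanings come from the description above. The type names, field names, and the `needs_attention` helper are assumptions for illustration and are not the SDK's actual interface.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class HarmPressure(IntEnum):
    NOMINAL = 0     # all signals within expected range
    DEVELOPING = 1  # developing pattern, monitoring closely
    CONFIRMED = 2   # confirmed anomaly, investigation recommended
    CRITICAL = 3    # critical, immediate action required


@dataclass
class ActiveSignal:
    name: str           # which signal fired
    severity: float     # magnitude relative to the run's own baseline
    steps_active: int   # how long it has been active


@dataclass
class StepResponse:
    harm_pressure: HarmPressure
    signals: list = field(default_factory=list)

    def needs_attention(self):
        """True once the engine reports a confirmed or critical anomaly."""
        return self.harm_pressure >= HarmPressure.CONFIRMED


response = StepResponse(
    harm_pressure=HarmPressure(2),
    signals=[ActiveSignal("gradient_explosion", severity=3.2, steps_active=40)],
)
```

Note that the structure carries signal identity, magnitude, and duration, and has no field for a failure mode label.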

The Local Detector

The local detector is a Python library distributed under the MIT license. It runs entirely within the user's training process. Nothing it observes or computes is transmitted anywhere.

It attaches forward and backward hooks to the model at training start and collects per-layer statistics on each step. From these it detects nine failure modes that require raw model access: dead neurons, activation saturation, gradient flow blockage, weight norm imbalance, optimizer state corruption, attention collapse, hardware integrity anomalies, precision erosion, and representation collapse.

The local detector is designed to add negligible overhead on healthy runs. Hooks are lightweight, tensors are never copied, sampling is deterministic and strided rather than random, and deep analytical scans run on a stride of one hundred steps rather than every step. The design target is under one percent training time overhead on large models.
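The two cost-control ideas in this paragraph, strided deep scans and deterministic strided sampling, can be sketched as follows. This is not the local detector's code: `StridedScanner`, `sample_every`, and the "exactly zero means dead" rule are simplifying assumptions, and a plain list stands in for an activation tensor.

```python
class StridedScanner:
    """Runs a deep scan only every `stride` steps (hypothetical helper)."""

    def __init__(self, stride=100, sample_every=8):
        self.stride = stride
        self.sample_every = sample_every
        self.deep_scans = 0

    def dead_fraction(self, activations):
        """Fraction of sampled activations that are exactly zero.

        Sampling is deterministic and strided: every `sample_every`-th element.
        Slicing a Python list copies the sampled elements; a tensor library
        would typically return a view instead.
        """
        sampled = activations[:: self.sample_every]
        if not sampled:
            return 0.0
        return sum(1 for a in sampled if a == 0.0) / len(sampled)

    def on_step(self, step, activations):
        """Return a dead fraction on scan steps, None on all other steps."""
        if step % self.stride != 0:
            return None  # skip the deep scan on most steps
        self.deep_scans += 1
        return self.dead_fraction(activations)


scanner = StridedScanner()
# 64 activations where every 16th unit outputs zero.
activations = [0.0 if i % 16 == 0 else 1.0 for i in range(64)]
results = [scanner.on_step(step, activations) for step in range(1, 301)]
```

Over 300 steps the deep scan runs three times, and because the sample positions are fixed, repeated scans of the same layer are directly comparable.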

Results from the local detector are summarized into a lightweight health report — worst level, trend direction, affected fraction, and key scalar summaries — which the SDK can optionally include in the telemetry payload. The engine uses this summary as a corroborating signal. The raw layer-level data never leaves the machine.

The SDK

The SDK is the integration layer. It sits between the user's training loop and the telemetry engine, computes the required scalar signals from the model and optimizer at each step, packages them into a validated payload, and sends them asynchronously so training is never blocked on a network call.

The SDK handles run registration, adaptive batch packing based on measured network latency, local disk spooling during network outages, and a rolling checkpoint system driven by engine signals rather than a fixed schedule. In AUTO mode it also applies interventions returned by the engine — learning rate adjustments, optimizer resets, sample filtering — directly to the optimizer and model. In MANUAL mode, the default, all candidate interventions are surfaced as recommendations without being applied. The reasons for this default are discussed in Section 6.
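The non-blocking send and disk spooling behavior can be sketched with a queue and a background thread. This is an illustration under stated assumptions rather than the SDK's implementation: `AsyncSender`, `send_fn`, the JSON-lines spool format, and the simulated outage are all hypothetical, and adaptive batch packing is omitted.

```python
import json
import os
import queue
import tempfile
import threading


class AsyncSender:
    """Sends payloads on a background thread and spools to disk on failure."""

    def __init__(self, send_fn, spool_path):
        self._send_fn = send_fn        # hypothetical transport callable
        self._spool_path = spool_path
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def submit(self, payload):
        """Called from the training loop; enqueues and returns immediately."""
        self._queue.put(payload)

    def _worker(self):
        while True:
            payload = self._queue.get()
            if payload is None:        # sentinel: shut down
                break
            try:
                self._send_fn(payload)
            except OSError:
                # Network failure: append to a local spool file for later replay.
                with open(self._spool_path, "a", encoding="utf-8") as f:
                    f.write(json.dumps(payload) + "\n")

    def close(self):
        """Flush everything queued so far and stop the worker."""
        self._queue.put(None)
        self._thread.join()


sent = []


def flaky_send(payload):
    """Stand-in transport that fails on step 2 to simulate an outage."""
    if payload["step"] == 2:
        raise ConnectionError("simulated outage")
    sent.append(payload["step"])


spool_file = os.path.join(tempfile.mkdtemp(), "spool.jsonl")
sender = AsyncSender(flaky_send, spool_file)
for step in (1, 2, 3):
    sender.submit({"step": step, "loss": 1.0 / step})
sender.close()

with open(spool_file, encoding="utf-8") as f:
    spooled = [json.loads(line)["step"] for line in f]
```

The training loop only ever calls `submit`, which does no network I/O, so a slow or unavailable endpoint delays delivery but never delays a training step.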

What the Architecture Does Not Do

The two-tier boundary means certain failure modes are outside the telemetry engine's scope by design. Dead neurons require per-layer activation statistics. Attention collapse requires the attention weight matrices. Precision erosion requires per-parameter gradient values. These are covered by the local detector, not the telemetry engine. A user who does not attach the local detector will not receive these signals from the cloud component.

The architecture also does not cover overfitting. Validation loss is tracked in run state when provided, but the engine does not fire signals on the train-validation gap in version one. This is a conscious deferral, not an oversight, and is discussed further in Section 9.

Signal Taxonomy

The telemetry engine detects eleven signals. Each signal monitors a distinct aspect of training dynamics through architecture-aware interpretation, and fires when that aspect deviates anomalously from expected behavior for that run's configuration. This section describes each signal in user-facing terms — what it observes, what it means when it fires, and what it does not tell you.

One thing this section does not do is map signals to failure modes. The mapping is many-to-many. Loss Divergence and Gradient Explosion firing together is consistent with a learning rate that is too high, with optimizer momentum corruption following a previous explosion, with a corrupted batch cluster, and with several other root causes. Assigning a confident failure mode label from signal co-occurrence alone is not defensible — the same signal pattern is the evidence for multiple competing hypotheses. Argus reports the evidence. Investigation of root cause is the practitioner's responsibility. The deeper reason this attribution is not attempted — and cannot be — is discussed in Section 9.

Loss Divergence

Observes the loss trajectory. Fires when loss has been rising anomalously. When this signal fires, loss is doing something this run has not done before during healthy phases. It does not tell you why.

Gradient Explosion

Observes the gradient norm. Fires when the gradient norm is anomalously high. When this signal fires alongside Loss Divergence, the combination is consistent with learning rate pressure. When it fires alone, it may indicate a bad batch or a transient spike. Duration matters.

Vanishing Gradients

Observes the gradient norm collapsing toward zero. This signal is directionally opposite to Gradient Explosion — it fires when gradients are too small, not too large. When this fires, early layers are likely receiving insufficient signal for meaningful parameter updates. Reducing learning rate further would worsen this condition, not improve it. This signal requires different remediation than gradient explosion and is one reason Argus does not apply a single automated response to all gradient anomalies.

Oscillation

Observes the loss delta sequence for rapid reversals without net progress. Fires when loss is bouncing rather than trending in either direction — improving or worsening. Oscillation is distinct from noise. A noisy but improving loss does not fire this signal. A loss that improves and regresses in alternating steps without covering ground fires it. This pattern is consistent with a learning rate that is overshooting the local loss landscape.
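The distinction between oscillation and noise can be illustrated with two scale-free quantities computed from the loss deltas: how often the sign flips, and how much net ground is covered. The functions and cutoffs below are assumptions for illustration only. The engine derives its boundaries per run rather than using fixed cutoffs like these.

```python
def reversal_rate(deltas):
    """Fraction of consecutive delta pairs whose signs differ."""
    pairs = list(zip(deltas, deltas[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a * b < 0) / len(pairs)


def net_progress_ratio(deltas):
    """|sum of deltas| / sum of |deltas|. Near 0 means movement without progress."""
    total_movement = sum(abs(d) for d in deltas)
    if total_movement == 0.0:
        return 0.0
    return abs(sum(deltas)) / total_movement


def looks_oscillatory(deltas, min_reversal=0.8, max_progress=0.1):
    """Illustrative rule: frequent reversals combined with little net movement."""
    return (reversal_rate(deltas) >= min_reversal
            and net_progress_ratio(deltas) <= max_progress)


bouncing = [0.05, -0.05] * 10         # alternates and ends where it started
noisy_improving = [-0.05, 0.01] * 10  # alternates but trends downward
steady = [-0.02] * 20                 # monotone improvement

bouncing_flag = looks_oscillatory(bouncing)
noisy_flag = looks_oscillatory(noisy_improving)
steady_flag = looks_oscillatory(steady)
```

Both the bouncing and the noisy sequences reverse sign on every step, but only the bouncing one covers no ground, so only it is flagged.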

Training Stagnation

Observes loss improvement over time. Fires when loss has stopped improving for a sustained period, meaning the model appears to have stopped learning. When this signal fires, the model may have saturated its current capacity, exhausted useful signal in the data, or reached a flat region the current learning rate cannot escape.

Distribution Shift

Observes the spread of per-sample losses within a batch. Fires when batch loss spread has widened significantly. This signal requires per-sample loss data from the SDK, which is part of the optional Layer 2 (per-sample) telemetry documented in Section 7. When it fires, the composition of batches has changed in a way that makes individual samples much more heterogeneous than the run previously observed. This is consistent with data pipeline issues, domain shift in a streaming dataset, or batch sampler failures.

Confidence Collapse

Observes mean model confidence across samples in a batch. Fires when confidence has dropped significantly. This signal also requires Layer 2 telemetry. When confidence collapses, the model is becoming less certain about predictions it was previously making confidently, which is consistent with its discriminative representations degrading. This is a representation-level signal, not a loss-level signal, and can fire while loss looks superficially acceptable.

Entropy Drift

Observes prediction entropy over time. Fires when entropy has been rising consistently. Rising entropy means the model's output distribution is becoming more uniform — the model is becoming less confident across the board. This is consistent with label noise entering the training stream or out-of-distribution samples accumulating. It also requires Layer 2 telemetry.
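For readers less familiar with the per-sample quantities behind Confidence Collapse and Entropy Drift, the sketch below computes both for a single prediction. It assumes confidence means the largest predicted class probability and entropy means Shannon entropy in nats. The text does not define either term, so both definitions are assumptions.

```python
import math


def confidence(probs):
    """Largest class probability (one common definition of confidence)."""
    return max(probs)


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


sharp = [0.97, 0.01, 0.01, 0.01]    # model strongly prefers one class
uniform = [0.25, 0.25, 0.25, 0.25]  # model has no preference

max_entropy = math.log(len(uniform))  # upper bound for four classes
sharp_entropy = entropy(sharp)
uniform_entropy = entropy(uniform)
```

As the output distribution flattens from `sharp` toward `uniform`, confidence falls and entropy rises toward its maximum of log(number of classes), which is the movement these two signals track.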

Outlier Batch

Observes the ratio of unusually high-loss samples within a batch. Fires when an anomalous number of samples are producing losses far above the batch mean. This is the sample-level signal — it identifies that specific samples within a batch are destabilizing training, rather than the batch as a whole drifting. When this fires, the spike samples returned by the engine identify which sample IDs are responsible.
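A sample-level check of this kind might look like the sketch below. The rule used here (loss more than `k` population standard deviations above the batch mean) and the name `spike_samples` are assumptions for illustration. The engine's actual criterion is not described in this paper.

```python
from statistics import mean, pstdev


def spike_samples(sample_ids, losses, k=3.0):
    """Return (ids of high-loss samples, fraction of the batch they represent)."""
    mu = mean(losses)
    sigma = pstdev(losses)
    if sigma == 0.0:
        return [], 0.0  # every sample has the same loss: nothing stands out
    spikes = [sid for sid, loss in zip(sample_ids, losses) if loss > mu + k * sigma]
    return spikes, len(spikes) / len(losses)


# A batch of 16 samples in which one sample has a far higher loss than the rest.
ids = [f"s{i}" for i in range(16)]
losses = [1.0] * 15 + [9.0]

spikes, spike_ratio = spike_samples(ids, losses)
```

Returning sample IDs rather than only a ratio is what lets a practitioner go back to the data pipeline and inspect the specific offending examples.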

Label Noise

Observes the correct prediction rate across samples. Fires when the rate at which the model correctly predicts its training labels has dropped significantly. This is distinct from loss rising — a model can have rising loss and stable correct rate, or stable loss and falling correct rate. When this signal fires alongside Distribution Shift, data quality is the primary suspect.

Representational Collapse

Observes the ratio of L1 to L2 weight norms, which captures the degree to which the network's weight distributions are becoming sparse and concentrated. Fires when this ratio has drifted significantly, indicating that the network is over-specializing — concentrating its representational capacity in a small number of pathways. This is a slow, structural signal. It does not produce a visible loss spike. When it fires it typically indicates a brittleness that will manifest as failure under distribution shift or in later training phases.
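The behavior of the L1-to-L2 ratio is easy to verify on toy vectors. For a vector with n entries the ratio always lies between 1 (all mass in a single entry) and the square root of n (mass spread evenly). The function name below is hypothetical, and how the engine tracks drift in this ratio is not shown.

```python
import math


def l1_l2_ratio(weights):
    """Ratio of the L1 norm to the L2 norm; 0.0 for an all-zero vector."""
    l1 = sum(abs(w) for w in weights)
    l2 = math.sqrt(sum(w * w for w in weights))
    return l1 / l2 if l2 > 0.0 else 0.0


n = 16
dense = [0.5] * n                       # magnitude spread evenly over all entries
concentrated = [2.0] + [0.0] * (n - 1)  # all magnitude in a single entry

dense_ratio = l1_l2_ratio(dense)                # equals sqrt(16) = 4.0
concentrated_ratio = l1_l2_ratio(concentrated)  # equals 1.0
```

Both vectors have the same L2 norm, so a norm-only monitor cannot tell them apart, while the ratio separates them by a factor of four.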

Reading the Signal Breakdown

When Argus returns a signal breakdown, each active signal carries three pieces of information: which signal fired, how far above the established baseline it is currently sitting, and how many steps it has been active. These three together — identity, magnitude, and duration — are what the practitioner uses to form a hypothesis about root cause.

A single signal active for two steps is different from three signals active for forty steps. The former may be noise or a transient batch artifact. The latter is a systemic pattern that warrants investigation. Argus does not collapse this information into a single label. It preserves the evidence so the practitioner can reason about it.

6. Intervention Honesty

Detection and intervention are separate problems. Argus is confident about the first. It is honest about the second.

When a signal fires, the engine knows that something is anomalous relative to the internally established baseline for this run. What it does not know is why. And the why determines the correct response. A learning rate that is too high and an optimizer whose momentum buffers are corrupted from a previous explosion can produce identical signal patterns. The right intervention for high learning rate is to reduce it. The right intervention for corrupted momentum is to reset the optimizer state. Applying the wrong intervention does not leave the run where it was — it actively makes things worse. A momentum reset on a run that simply needs a learning rate cut clears state that would have been useful. A learning rate cut on a run with corrupted momentum slows the dying process without addressing the cause.

This is the fundamental reason Argus defaults to MANUAL mode. The engine has the signal evidence. The practitioner has the context — what hyperparameters changed recently, what the data pipeline looks like, whether a similar failure has occurred in previous runs on this model family. The combination of signal evidence from Argus and domain context from the practitioner produces better intervention decisions than either alone. MANUAL mode is not a safety feature bolted on after the fact. It is the correct default given what the engine can and cannot know.

What MANUAL Mode Means

In MANUAL mode, which is the default for all users, Argus detects anomalies and returns a recommended intervention with each step response. The recommendation is based on which signals are active, their severity, and their duration. It is a starting point for the practitioner's investigation, not an instruction to be followed automatically.

The practitioner sees which signals are responsible, how long they have been active, and what intervention the engine considers most appropriate given the signal pattern. They can act on this recommendation, modify it based on their own context, or investigate further before doing anything. Training continues uninterrupted. Nothing is applied to the optimizer or model without the practitioner's explicit decision.

What AUTO Mode Means

AUTO mode is available and is explicitly documented as experimental. In AUTO mode, Argus applies intervention recommendations directly to the optimizer and model without waiting for practitioner confirmation. Learning rate adjustments, optimizer state resets, and sample filtering all fire automatically when harm pressure reaches critical levels.
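The difference between the two modes comes down to a single branch, sketched below. The `Mode` enum, the `handle_recommendation` function, and the shape of the recommendation dictionary are hypothetical. The only real-library detail assumed is that a PyTorch optimizer exposes `param_groups` as a list of dictionaries with an `"lr"` key, which a plain list of dictionaries imitates here.

```python
from enum import Enum


class Mode(Enum):
    MANUAL = "manual"
    AUTO = "auto"


def handle_recommendation(mode, param_groups, recommendation):
    """Surface a recommendation in MANUAL mode; apply it in AUTO mode.

    `recommendation` is a hypothetical dict such as
    {"kind": "lr_scale", "factor": 0.5}.
    """
    if mode is Mode.MANUAL:
        # Nothing touches the optimizer without the practitioner's decision.
        return f"recommended: {recommendation['kind']} x{recommendation['factor']}"
    if recommendation["kind"] == "lr_scale":
        for group in param_groups:
            group["lr"] *= recommendation["factor"]
        return "applied"
    raise ValueError(f"unsupported intervention: {recommendation['kind']}")


rec = {"kind": "lr_scale", "factor": 0.5}

manual_groups = [{"lr": 0.001}]
manual_result = handle_recommendation(Mode.MANUAL, manual_groups, rec)

auto_groups = [{"lr": 0.001}]
auto_result = handle_recommendation(Mode.AUTO, auto_groups, rec)
```

In this sketch the same recommendation object is produced either way; the mode only decides whether it is reported or applied.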

AUTO mode is appropriate for unattended long-running jobs where the cost of a missed failure — hours or days of wasted compute — outweighs the cost of an occasional incorrect intervention. It is not appropriate as a general default because the signal-to-cause mapping is ambiguous, as described above.

Users who enable AUTO mode should understand that automated interventions are probabilistic. The engine is making a structured inference from incomplete information. It will be right most of the time on common failure patterns. It will be wrong on unusual combinations of signals, on failure modes that produce ambiguous evidence, and on runs where the practitioner's domain context would have pointed to a different root cause.

The Closed Loop

When AUTO mode fires an intervention, the engine does not fire and forget. It continues monitoring the run after the intervention and classifies the outcome. If signals clear and training stabilizes, the engine recognizes the intervention as successful and releases the applied constraints gradually. If signals persist or worsen despite the intervention, the engine escalates — either applying stronger pressure on the same lever or shifting to a different intervention class entirely.
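The outcome classification described here can be pictured as a small decision function. This is a loose sketch only: the function name, the integer pressure levels, and the rule for when to switch intervention class are assumptions. The engine's actual escalation policy is not specified in this paper.

```python
def classify_outcome(signals_active_after, pressure_level, max_level=3):
    """Decide the next move after an intervention (hypothetical policy).

    Returns (action, new_pressure_level) where action is one of
    "release", "escalate", or "switch_class".
    """
    if signals_active_after == 0:
        # Signals cleared: relax the applied constraint one level at a time.
        return ("release", max(pressure_level - 1, 0))
    if pressure_level < max_level:
        # Signals persist: push harder on the same lever.
        return ("escalate", pressure_level + 1)
    # Same lever is exhausted: start over with a different intervention class.
    return ("switch_class", 1)


cleared = classify_outcome(signals_active_after=0, pressure_level=2)
persisting = classify_outcome(signals_active_after=3, pressure_level=2)
exhausted = classify_outcome(signals_active_after=3, pressure_level=3)
```

Nothing in this function knows why signals cleared or persisted. It reacts only to whether they did.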

This closed loop is meaningful but does not resolve the fundamental ambiguity. The engine can observe that an intervention worked or did not work. It cannot observe why it worked, or whether a different intervention would have worked better and faster. The outcome classification improves the engine's ability to avoid repeating failed interventions within a run. It does not provide the causal understanding that would make automated intervention fully reliable.

The Honest Position

Argus's detection is strong. The benchmarks in Section 8 demonstrate recall across diverse failure modes and architectures with low false positive rates. The signal evidence the engine produces is reliable and actionable.

Argus's intervention is probabilistic. The mapping from signal pattern to root cause is many-to-many. The mapping from root cause to correct intervention requires domain knowledge the engine does not have. Automated intervention improves outcomes in aggregate: the benchmarks show that runs in AUTO mode survive failures that kill unmonitored runs. But it does so through structured inference, not certain knowledge, and practitioners who treat its recommendations as diagnoses rather than hypotheses will occasionally be misled.

We consider this honesty more valuable than false confidence. A system that claims to know what it does not know erodes trust the first time it is wrong. A system that accurately represents its own uncertainty gives practitioners the information they need to use it correctly.

7. Data Transparency

Argus operates across two data streams with fundamentally different properties. This section documents both completely. Users should know exactly what leaves their machine, what does not, and why.

Stream One — Per-Step Detection Telemetry

This stream flows from the SDK to the telemetry engine with each training step. It is the data the engine uses for detection.

Every field is a scalar or small structured value derived from aggregated training statistics. No field contains or can be used to reconstruct model weights, gradient vectors, activation values, or training data samples.

The complete field list:

Run identifiers: run ID, step number, epoch number, model type string, SDK version string.

Training dynamics: loss value, loss delta from previous step, gradient norm, gradient direction similarity to previous step, current learning rate, whether a learning rate restart occurred this step.

Optional per-sample statistics (Layer 2): per-sample loss values, per-sample confidence scores, per-sample prediction margins, per-sample entropy values, per-sample correctness flags, corresponding sample IDs. These are only transmitted when the user explicitly provides them to the SDK. They are scalar values per sample — not embeddings, not raw logits, not activations.

Architecture metadata: batch size, sequence length, vocabulary size, number of layers, hidden dimension, parameter dtype. These describe the model's configuration, not its learned state.

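Gradient distribution summary: L1 norm, L2 norm, gradient norm by layer depth bucket. These are aggregate scalar summaries of the gradient distribution, not the gradients themselves. A handful of norms cannot determine a vector with many components: permuting the entries within a bucket or flipping their signs leaves every one of these values unchanged, so the gradient vectors cannot be reconstructed from them.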

Timing metadata: forward pass milliseconds, backward pass milliseconds, when provided.

Local health summary: when the local detector is attached, a summary of its current findings — worst severity level, trend direction, fraction of layers affected, key scalar summaries. Raw layer-level data is never included.

Control fields: operating mode, NaN detection flag, run registration metadata.

Several fields in the acceptor schema are reserved for future SDK versions and are currently transmitted as zero or null. These include gradient distribution statistics beyond norm and per-layer gradient segment breakdowns. They are documented here for completeness.
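A client-side check that only scalars enter the Stream One payload might look like the sketch below. It covers a subset of the fields listed above. The class name, field names, and validation rules are assumptions for illustration and do not reproduce the actual acceptor schema.

```python
import math
from dataclasses import asdict, dataclass
from typing import List, Optional

SCALAR_FIELDS = ("loss", "loss_delta", "grad_norm", "grad_similarity", "learning_rate")


@dataclass
class StepPayload:
    run_id: str
    step: int
    loss: float
    loss_delta: float
    grad_norm: float
    grad_similarity: float
    learning_rate: float
    per_sample_loss: Optional[List[float]] = None  # Layer 2, opt-in only
    sample_ids: Optional[List[str]] = None         # Layer 2, opt-in only

    def validate(self):
        """Reject non-scalar values and return a dict ready to serialize."""
        scalars = {name: getattr(self, name) for name in SCALAR_FIELDS}
        for name, value in scalars.items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"{name} must be a scalar number")
        if self.per_sample_loss is not None:
            if self.sample_ids is None or len(self.sample_ids) != len(self.per_sample_loss):
                raise ValueError("per-sample losses need one sample ID each")
        body = asdict(self)
        body["nan_detected"] = any(math.isnan(v) for v in scalars.values())
        return body


def is_valid(payload):
    """True if the payload passes validation."""
    try:
        payload.validate()
        return True
    except (TypeError, ValueError):
        return False


good = StepPayload("run-1", 10, 2.10, -0.15, 5.0, 0.96, 3e-4)
leaky = StepPayload("run-1", 11, 2.05, -0.05, [0.1, 0.2, 0.3], 0.96, 3e-4)
diverged = StepPayload("run-1", 12, float("nan"), 0.0, 5.0, 0.96, 3e-4)

good_body = good.validate()
diverged_body = diverged.validate()
```

A payload that tries to carry a vector where a scalar belongs is rejected before it is sent, and a NaN loss is passed through with the flag set rather than dropped.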

Stream Two — Run Analytics Telemetry

This stream flows from the SDK to a separate analytics endpoint on a periodic heartbeat and at run milestones. It is used for platform analytics and product improvement, not for detection.

The complete field list:

Hardware: GPU vendor name and VRAM capacity per device. Exact GPU model is masked — the SDK reports vendor and memory only, not the specific product name.

Software environment: Python version, PyTorch version, Transformers library version if installed.

Model metadata: total parameter count, trainable parameter count, model name or type string derived from model configuration. No weights.

Training throughput: steps per second, samples per second, elapsed hours, current loss, current step, batch size. These are operational metrics, not model internals.

Run lifecycle events: timestamps for run start, milestone steps, and run end. Final run status. Error message on failure, if any.

Data Quality Index: a scalar score computed from the loss trajectory over the full run, transmitted at run end. This is a summary statistic, not raw training data.

What Is Never Transmitted

To be explicit:

Model weights are never transmitted. At no point does any component of Argus receive, request, or store parameter tensors from the user's model.

Raw gradients are never transmitted. The engine receives gradient norm — a single scalar — and gradient direction similarity — another single scalar. These are derived statistics. The gradient vectors themselves stay on the user's machine.
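
To make the reduction concrete, the sketch below collapses two gradient vectors into the two scalars named above. It assumes that "gradient direction similarity" means cosine similarity between the flattened gradients of consecutive steps. The text does not define the metric, so treat that choice and the function names as assumptions.

```python
import math
from typing import Sequence


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean norm of a flattened gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def direction_similarity(current: Sequence[float], previous: Sequence[float]) -> float:
    """Cosine similarity between two gradients; 0.0 if either vector is all zeros."""
    denom = l2_norm(current) * l2_norm(previous)
    if denom == 0.0:
        return 0.0
    return sum(c * p for c, p in zip(current, previous)) / denom


# Toy gradients standing in for the flattened gradients of two consecutive steps.
prev_grad = [3.0, 4.0, 0.0]
curr_grad = [6.0, 8.0, 0.0]

# Only these two scalars would leave the machine; the vectors stay local.
summary = {
    "grad_norm": l2_norm(curr_grad),
    "grad_direction_similarity": direction_similarity(curr_grad, prev_grad),
}
print(summary)
```

Whatever the dimensionality of the gradient, the output is two numbers, which is why the vectors themselves cannot be recovered from it.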

Training data is never transmitted. The engine receives per-sample loss values when the user opts into Layer 2 telemetry. It does not receive the input samples, labels, or any representation of the training data content.

Activation values are never transmitted. The local detector observes activations entirely on the user's machine. Summaries — sparsity fractions, variance estimates — may be included in the local health summary field. Raw activation tensors are never sent anywhere.
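
A minimal sketch of the kind of summary described here, under the assumption that a sparsity fraction counts near-zero activations and that variance is the population variance. The function name, the `eps` tolerance, and the dictionary keys are hypothetical and are not taken from the local detector's source.

```python
from statistics import pvariance
from typing import Dict, Sequence


def activation_summary(activations: Sequence[float], eps: float = 1e-8) -> Dict[str, float]:
    """Reduce a non-empty list of raw activations to two scalars."""
    near_zero = sum(1 for a in activations if abs(a) <= eps)
    return {
        "sparsity_fraction": near_zero / len(activations),
        "variance": pvariance(activations),
    }


# Toy layer output, for example after a ReLU; the raw values never leave this scope.
layer_output = [0.0, 0.0, 1.5, 0.0, 2.5]
summary = activation_summary(layer_output)
print(summary)
```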

The Local Detector

The local detector is open source under the MIT license. Its complete source code is publicly available. It runs entirely within the user's training process. It makes no network calls. It has no mechanism for transmitting data anywhere. Users who require full auditability of what runs inside their training process can inspect the source directly.

The local detector's findings are summarized into a small structured report per step. The SDK can include a condensed version of this report in the per-step detection payload. This is opt-in — users who do not attach the local detector transmit nothing from it. Users who do attach it transmit only the summary fields described above, never raw layer statistics.

Why This Design

The data boundary is not a compliance checkbox. It is a design constraint that follows from who uses Argus and what they are training. Practitioners training proprietary models on sensitive data cannot send weight tensors or gradient vectors to an external service. A monitoring system that required this would be unusable for the majority of serious production training workloads.

The detection engine is specifically designed to work within this constraint — to extract meaningful signal from the aggregated scalar telemetry that practitioners can safely transmit, without requiring the information that would compromise their model IP or data privacy. The local detector exists precisely to cover the failure modes that require raw access, while keeping that raw access entirely on the user's machine.

---

8. Empirical Validation

Argus was validated across six benchmarks. Each benchmark was designed to test a distinct property of the system: detection of a specific failure mode, generalization across architectures, behavior on clean runs, end-to-end integration, or runtime overhead. This section reports what each benchmark tests, how it is structured, and what the results demonstrate.

All benchmarks are publicly available in the Argus GitHub repository. Seeds are randomized per run to prevent overfitting to specific initialization conditions. Results reported here represent typical outcomes across multiple seeds — exact figures vary by seed as expected for any stochastic system.

Benchmark 1 — Ghost Recovery

Ghost recovery is a failure mode where a training run explodes, loss subsequently returns to an acceptable range, and the run appears to have recovered. It has not. The optimizer's momentum buffers accumulated the explosion and continue to carry it forward. The run collapses again forty to sixty steps later, typically more severely than the first time, because the corrupted momentum has been compounding.

This failure mode is invisible to threshold-based systems because the recovery phase produces metrics that look healthy. Loss is falling. Gradient norm has returned to range. No threshold fires. The second collapse arrives without warning.

The benchmark runs three scenarios against the same failure sequence across multiple random seeds. Scenario one applies no protection — the run explodes, fake-recovers, and dies permanently, establishing that the failure is real and naturally occurring. Scenario two attaches Argus — the engine detects the anomaly during the explosion phase, the SDK restores the model to a pre-explosion checkpoint and resets the optimizer state, and training completes successfully. Scenario three runs clean training with Argus watching silently, establishing that the engine does not intervene on healthy runs.
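
The recovery step in scenario two has two parts: restoring weights and resetting optimizer state. The toy below sketches why both matter. `ToyTrainer` and its methods are hypothetical stand-ins, not the SDK's API. In a PyTorch loop the analogous actions would presumably be loading a saved `state_dict` and rebuilding the optimizer.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyTrainer:
    """Hypothetical stand-in for a model plus optimizer."""
    weights: List[float]
    momentum: List[float] = field(default_factory=list)  # optimizer state

    def snapshot(self) -> Dict[str, List[float]]:
        """Save a copy of the weights taken at a step believed to be healthy."""
        return {"weights": copy.deepcopy(self.weights)}

    def restore_and_reset(self, checkpoint: Dict[str, List[float]]) -> None:
        """Restore weights and zero the optimizer state."""
        self.weights = copy.deepcopy(checkpoint["weights"])
        # Restoring weights alone would keep the momentum accumulated
        # during the explosion, so the buffers are cleared as well.
        self.momentum = [0.0] * len(self.weights)


trainer = ToyTrainer(weights=[0.5, -0.2], momentum=[0.01, 0.02])
clean = trainer.snapshot()

# Simulated explosion: both the weights and the momentum buffers blow up.
trainer.weights = [120.0, -340.0]
trainer.momentum = [900.0, -1500.0]

trainer.restore_and_reset(clean)
print(trainer.weights, trainer.momentum)
```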

Across three seeds with randomized explosion steps and learning rate spike magnitudes, Argus detected the explosion in all cases and completed the run successfully in all cases. The healthy baseline scenario produced no false interventions.

Standard threshold implementations cannot distinguish the fake recovery phase from genuine recovery. They have no mechanism to detect that optimizer momentum remains corrupted after loss returns to range. This is a structural limitation of what threshold systems can observe from scalar metrics alone.
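
The persistence of optimizer state after a spike can be illustrated with a single exponential moving average of squared gradients, the kind of running statistic Adam-style optimizers keep. This is an illustrative toy, not the benchmark and not Argus's detection logic. The threshold value and the spike magnitude are made up, and the sketch does not model the second collapse itself.

```python
def ema_update(state: float, grad: float, beta: float = 0.999) -> float:
    """One update of an exponential moving average of squared gradients."""
    return beta * state + (1.0 - beta) * grad * grad


THRESHOLD = 10.0     # hypothetical fixed alert threshold on gradient norm
second_moment = 1.0  # steady-state value for a healthy gradient norm of 1.0

# A single explosion step with gradient norm 1000.
second_moment = ema_update(second_moment, 1000.0)

# The gradient norm then returns to 1.0 for fifty steps.
alerts_after_recovery = 0
for _ in range(50):
    grad_norm = 1.0
    if grad_norm > THRESHOLD:
        alerts_after_recovery += 1
    second_moment = ema_update(second_moment, grad_norm)

print(f"alerts after recovery: {alerts_after_recovery}")
print(f"second-moment estimate: {second_moment:.1f} (healthy value is 1.0)")
```

The threshold check sees nothing during the fifty recovered steps, while the optimizer's internal statistic remains hundreds of times its healthy value. That gap between the observable scalar and the hidden state is the blind spot described above.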

Benchmark 2 — Slow Divergence

Slow divergence is the failure mode that most consistently defeats threshold-based monitoring in practice. Gradient norm climbs two to four times its healthy baseline over one hundred fifty to two hundred fifty steps. Loss continues falling throughout — the run looks completely healthy by any loss-based metric. Eventually weight magnitudes cross a chaotic regime boundary and the loss cliff arrives. By this point the run has been dying for hundreds of steps while appearing healthy.

The benchmark injects drift at a randomized step with a randomized rate across multiple seeds. Scenario one confirms the failure is real and silent — the cliff arrives while loss is still improving. Scenario two attaches Argus — the engine detects the trend before the cliff and the SDK intervenes. Scenario three runs clean training with Argus watching silently.

The key measurement is lead time — how many steps before the cliff did Argus fire. Across five seeds the engine consistently detected the developing anomaly with significant lead time before terminal collapse, providing a window where intervention is meaningful.
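
Lead time can be made concrete with a synthetic trace. The sketch below compares a fixed threshold against a simple run-relative check on a gradient norm that climbs to three times its baseline. The relative check is a deliberately naive stand-in with hand-picked constants (`warmup`, `window`, `ratio`). It is not Argus's proprietary calibration, and the trace is synthetic, not benchmark data.

```python
from typing import List, Optional

CLIFF = 300  # step at which the synthetic run would fall off the loss cliff


def grad_norm_at(step: int, drift_start: int = 100) -> float:
    """Synthetic gradient norm: flat at 1.0, then a linear climb toward 3.0."""
    if step < drift_start:
        return 1.0
    return 1.0 + 2.0 * (step - drift_start) / (CLIFF - drift_start)


def fixed_threshold_fire(norms: List[float], threshold: float) -> Optional[int]:
    """First step whose value exceeds an absolute threshold, or None."""
    for step, value in enumerate(norms):
        if value > threshold:
            return step
    return None


def relative_fire(norms: List[float], warmup: int = 50,
                  window: int = 20, ratio: float = 1.5) -> Optional[int]:
    """First step whose recent mean exceeds ratio times the run's own early mean."""
    baseline = sum(norms[:warmup]) / warmup
    for step in range(warmup + window, len(norms)):
        recent = norms[step - window + 1: step + 1]
        if sum(recent) / window > ratio * baseline:
            return step
    return None


norms = [grad_norm_at(s) for s in range(CLIFF)]
fixed_step = fixed_threshold_fire(norms, threshold=5.0)
relative_step = relative_fire(norms)
lead_time = CLIFF - relative_step if relative_step is not None else None
print(fixed_step, relative_step, lead_time)
```

In this trace the norm never reaches the absolute threshold of 5.0, so the fixed check stays silent all the way to the cliff, while the run-relative check fires well before it.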

Benchmark 3 — Dual Catastrophe

The dual catastrophe benchmark is the most adversarial test in the suite. It injects two collapses with deliberately different signatures into the same run — a fast spike followed by a slow drift — to test whether the engine can handle failure mode transitions within a single run.

Step forty introduces a learning rate spike causing fast gradient explosion. Loss spikes hard and then comes back down — ghost recovery phase. Steps eighty to one hundred forty are a clean interlude where the engine must stay silent. Step one hundred fifty begins slow weight drift, producing silent divergence trending toward a cliff at approximately step two hundred eighty.
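
The injection timeline can be written down as a small lookup, which is convenient when checking that a detector fires and stays silent in the right phases. The phase names and helper functions are hypothetical. Steps 141 to 149 are not specified in the text and are assumed clean here, as are the steps before the first spike.

```python
def injected_phase(step: int) -> str:
    """Map a step of the 400-step run to the injected phase described above."""
    if step < 40:
        return "healthy_warmup"  # assumed clean before the first spike
    if step < 80:
        return "fast_spike_and_ghost_recovery"
    if step < 150:
        # The text gives steps 80 to 140 as the clean interlude; the
        # unspecified steps 141 to 149 are treated as clean as well.
        return "clean_interlude"
    if step < 280:
        return "slow_drift"
    return "cliff_if_unprotected"  # approximate cliff location


def detector_should_stay_silent(step: int) -> bool:
    """True for phases in which any detector alert would be a false positive."""
    return injected_phase(step) in ("healthy_warmup", "clean_interlude")


RUN_LENGTH = 400
timeline = [injected_phase(s) for s in range(RUN_LENGTH)]
print(timeline[40], timeline[100], timeline[150])
```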

The benchmark tests four properties simultaneously: detection of the fast spike, silence during the clean interlude, detection of the slow drift before the cliff, and completion of a four-hundred-step run in good health. A system that only detects one type of failure will fail this benchmark. A system with a high false positive rate will fire during the clean interlude.

Argus detected both collapses, remained silent during the clean interlude, and completed the run with healthy final loss.

Benchmark 4 — Three Architecture Test

This benchmark tests model-agnosticism. Three architectures are trained to convergence with Argus monitoring in MANUAL mode: GPT-2 Small on WikiText-2, ResNet-18 on CIFAR-10, and a two-layer LSTM on WikiText-2. All three use identical Argus configuration — because there is no configuration to change.

The benchmark measures two things: whether Argus stays calm on genuine convergence across all three architectures, and whether checkpoints are issued at appropriate intervals. False positive rate below one percent on all three architectures confirms that the self-calibrating baseline approach generalizes without architecture-specific tuning. Checkpoint issuance confirms the engine is actively tracking run health rather than silently doing nothing.

All three architectures passed with false positive rates below one percent. All three received checkpoint signals during training. The same engine, with no configuration, handled transformer, CNN, and recurrent architectures correctly.

Benchmark 5 — Real User Simulation

This benchmark tests the end-to-end path a new user hits on their first day. A GPT-2 Small model is fine-tuned for two hundred steps using AdamW with cosine learning rate schedule — a standard configuration representative of the majority of practical fine-tuning workloads.

The benchmark measures false intervention rate, checkpoint signal issuance, API latency, and successful run completion with certificate generation. It is not an adversarial test. It is a confirmation that a clean, well-configured run produces no false alarms and completes correctly.

False intervention rate was below one percent across runs. Average API latency was below three hundred milliseconds. P95 latency was below five hundred milliseconds. Checkpoint signals were issued at healthy intervals. The run completed and generated a certificate.
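
For readers reproducing the latency figures, average and P95 can be computed from recorded per-call latencies as sketched below. The nearest-rank percentile definition is an assumption, since the benchmark's exact method is not stated here, and the sample data is synthetic rather than measured.

```python
from statistics import mean
from typing import List


def p95(samples: List[float]) -> float:
    """95th percentile by the nearest-rank method on a non-empty sample."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 101, 102, ..., 300 (not measured data).
latencies_ms = [float(v) for v in range(101, 301)]

avg_ms = mean(latencies_ms)
p95_ms = p95(latencies_ms)
meets_targets = avg_ms < 300.0 and p95_ms < 500.0
print(avg_ms, p95_ms, meets_targets)
```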

Benchmark 6 — Overhead Validation

Two overhead benchmarks are provided in the public repository under `tests/test_local_overhead.py` and `tests/test_api_overhead.py`.

`test_local_overhead.py` measures the wall-clock cost of attaching the local detector to a training loop. It runs a baseline training loop without the detector, then the same loop with the detector attached, and reports the percentage overhead per step across multiple model sizes. The target is under one percent. Users can run this benchmark directly against their own model to verify overhead on their specific hardware and architecture before committing to production use.
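
The measurement pattern described here, a baseline loop against the same loop with extra per-step work, can be sketched as follows. This is not the contents of `tests/test_local_overhead.py`. The two step functions are stand-ins for a real training step and for the detector's per-step work, and small workloads like these are dominated by timing noise.

```python
import time
from typing import Callable


def time_per_step(step_fn: Callable[[], None], steps: int = 200) -> float:
    """Average wall-clock seconds per call to step_fn."""
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    return (time.perf_counter() - start) / steps


def overhead_percent(baseline_s: float, instrumented_s: float) -> float:
    """Extra time per step as a percentage of the baseline step time."""
    return 100.0 * (instrumented_s - baseline_s) / baseline_s


def train_step() -> None:
    # Stand-in for a real forward and backward pass.
    sum(i * i for i in range(20_000))


def train_step_with_detector() -> None:
    train_step()
    # Stand-in for the detector's per-step work (hypothetical).
    sum(range(100))


baseline = time_per_step(train_step)
instrumented = time_per_step(train_step_with_detector)
print(f"overhead: {overhead_percent(baseline, instrumented):.2f}%")
```

On a real model, both loops should run for enough steps, after a warm-up, that the difference rises above timer noise.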

`test_api_overhead.py` measures the latency introduced by the per-step telemetry call to the detection engine. It reports average latency, P95 latency, and the impact of adaptive batch packing at different network conditions. The targets are average latency below three hundred milliseconds and P95 below five hundred milliseconds.

Measured results: local detector overhead was 0.49ms at zero batch time, falling below the measurement noise floor at batch times of 50ms and above. SDK telemetry overhead was a flat 4-5ms regardless of batch size, reported as 1.38% overhead at a 250ms step time and below 0.5% at a 500ms step time, the regime of production 7B+ training workloads. This is consistent with the real user simulation results in Benchmark 5.

Both benchmarks are runnable with a valid API key and a standard PyTorch installation. They are included not as controlled experiments but as reproducibility tools — users who observe different overhead numbers on their own hardware have a starting point for investigation.

Summary

Taken together, the six benchmarks establish five properties. First, Argus detects failure modes that threshold systems miss or catch too late, specifically slow divergence and ghost recovery. Second, Argus generalizes across architectures without reconfiguration. Third, Argus maintains low false positive rates on healthy runs. Fourth, Argus operates within practical latency constraints for production training workloads. Fifth, users have concrete reproducibility tools to verify these claims on their own hardware.

No benchmark establishes that Argus catches every failure. The coverage of the signal taxonomy is discussed in Section 9. No benchmark tests Argus on architectures beyond the three validated — diffusion models, reinforcement learning agents, and mixture-of-experts architectures are outside the empirically validated scope of version one.

---

9. Limitations

Argus is an honest system. It follows that the paper describing it should be equally honest about what it does not do, what it has not been tested on, and where its claims are bounded by the absence of public ground truth data.

Before enumerating specific limitations it is worth establishing the context in which they exist. There are exactly two approaches available to any training monitoring system. The first is fixed thresholds defined before the run begins. The second is adaptive baselines established automatically during training. Every limitation in this section is a property of one of these two approaches, not a unique flaw in Argus. Where Argus inherits limitations from the second approach, we note them honestly. Where the first approach would have the same or worse limitations, we say so. The choice between the two approaches is not a choice between a flawed system and a perfect one. It is a choice between two imperfect systems with different failure modes. We believe the evidence in this paper demonstrates that Argus's approach fails less often, fails more gracefully, and requires less from the practitioner than fixed thresholds. But it is not without cost. What follows is an honest account of that cost.

No Public Benchmark for Training Failure Detection

The empirical validation in Section 8 uses benchmarks we designed and injected. This is the current state of the field — there is no publicly available labeled dataset of real training failures with ground truth annotations that would allow standardized evaluation across systems. The L4 paper's 428 production failures are the closest approximation in the literature, and they cover post-failure diagnosis in distributed LLM training rather than real-time detection in the training loop.

This means our recall and precision figures are measured against controlled injected failures, not a representative sample of failures encountered in production. Injected failures are cleaner than real ones. Real failures arrive with confounding factors — hardware noise, data pipeline irregularities, checkpoint restores mid-run, learning rate schedule interactions — that the benchmarks do not fully capture. We believe the results are directionally valid, but we cannot claim they are predictive of exact performance on arbitrary production workloads.

The absence of a public benchmark is itself a limitation of the field, not only of Argus. We note it explicitly because any system that reports detection metrics without acknowledging this gap is overstating the certainty of its claims.

Coverage Percentage Is Unknown

Section 5 documents eleven signals covering eleven observable failure modes. We mapped these against the failure taxonomy in the literature and concluded they cover the majority of failure modes detectable from step-level scalar telemetry. We cannot state a precise coverage percentage.

The reason is the same as above — there is no labeled ground truth for what percentage of real production failures fall into each failure mode category. The L4 study documents that hardware and user faults dominate in large-scale distributed training. For single-node fine-tuning workloads, which represent the majority of Argus's target users, the distribution of algorithm-level failures is not empirically established in the public literature.

What we can say is that the signal taxonomy was designed to cover the failure modes that appear repeatedly across the literature on training instability — gradient pathologies, loss dynamics, data quality degradation, optimizer state corruption, and representational collapse. Whether these account for sixty percent or ninety percent of real failures in practice is an open empirical question.

Tested on Three Architectures

The three-architecture benchmark covers transformers, CNNs, and LSTMs. Architecture priors exist in the engine for eight model families including diffusion models, reinforcement learning agents, and MLPs. These priors seed the baseline accumulators to accelerate calibration on architectures with known dynamics.

However, empirical validation was conducted on three families only. The behavior of the engine on diffusion models, RL agents, mixture-of-experts architectures, and state space models has not been benchmarked. The self-calibrating design means the engine should generalize — it does not rely on architecture-specific thresholds — but should generalize and has been tested to generalize are different claims. Users training on untested architectures should treat version one as experimental for their use case.

Tested on Small and Medium Models Only

The benchmarks in Section 8 use small to medium models: GPT-2 Small, ResNet-18, a two-layer LSTM, and micro transformer variants for the failure injection benchmarks. This is not a design choice. Validating detection behavior on models beyond 7B parameters requires multi-hour training runs at significant compute cost. As an early-stage system we have not conducted these runs.

The self-calibrating design is scale-invariant in principle — the engine operates on scalar aggregates regardless of model size. This was validated empirically at 3B and 7B scale using OpenLLaMA-3B and Mistral-7B-v0.1 under QLoRA fine-tuning on WikiText-2 and Alpaca instruction data respectively. Detection behavior was consistent across both scales and both architectures with false positive rates below one percent on healthy runs. Users running training beyond 7B parameters should treat version one as promising but unvalidated at their scale.

Overfitting Is Not Detected

The engine tracks validation loss in run state when it is provided. It does not fire signals on the train-validation gap in version one. Overfitting — the failure mode where training loss continues improving while generalization degrades — is outside the current detection scope. This is a conscious deferral. Version two will address this.

Failure Mode Attribution Is Not Attempted

Argus does not assign failure mode labels. The signal pattern is reported. The root cause is not inferred. The reason is physical, not pragmatic. Training dynamics exist on a continuous spectrum. The boundary between loss divergence and oscillation is not a line. It is a gradation, like the boundary between green and blue in the visible spectrum. Two observers will draw it differently. A single value can sit in both categories simultaneously. The same signal pattern is consistent with multiple root causes requiring different interventions. Assigning a confident label from signal co-occurrence alone is not scientifically defensible. Argus reports the evidence. The practitioner supplies the domain knowledge required to form a hypothesis.

Intervention Is Probabilistic

Section 6 covers this in detail. Identical signal patterns are consistent with multiple root causes requiring different interventions. AUTO mode makes structured inferences and will be wrong on ambiguous cases. MANUAL mode is the correct default for any run where the cost of a wrong intervention is significant.

The Local Detector Requires Attachment

The nine additional failure modes covered by the local detector are only available when the user explicitly attaches it to their training loop. A user who integrates only the SDK telemetry will not receive these signals. This is by design given the IP boundary and data transparency commitments described in Sections 4 and 7. It is nonetheless a friction point in practice.

Scale Has Not Been Tested

The benchmarks in Section 8 use small to medium models, and the largest validated runs are at 7B parameters. The physics of training instability are scale-invariant in the sense that gradient explosion, vanishing gradients, and loss divergence occur across model scales. However, the specific dynamics of very large models have not been tested. Argus's applicability to hundred-billion-parameter training runs is theoretically supported but empirically unvalidated.

10. Conclusion

Argus is a self-calibrating training anomaly engine. It monitors active training runs, establishes what healthy looks like through proprietary internal calibration, and detects deviation from that baseline without requiring threshold configuration, architecture-specific tuning, or prior training history. It is not an experiment tracker. It is not a dashboard. It is a detection system with an honest account of what detection can and cannot tell you.

The core technical contribution is the elimination of hardcoded thresholds from the detection path. Every boundary the engine uses is derived internally without user input. This makes the system model-agnostic by construction: the same engine handles transformers, CNNs, and recurrent architectures without any configuration change. It also rules out one class of error by construction: a threshold that is wrong for this run cannot exist if no threshold was ever set.

The two-tier architecture reflects a deliberate commitment to user trust. The telemetry engine works entirely from aggregated scalar signals that carry no model IP and no training data. The local detector, which requires raw model access, runs entirely on the user's machine as open-source software. The boundary between these tiers is not a technical compromise — it is the right design for a monitoring system that practitioners can actually use on proprietary workloads.

The empirical validation demonstrates that Argus catches the failures that matter most in practice: the ones that look healthy until they do not. Ghost recovery, slow divergence, and dual-signature catastrophic failure are the expensive failures. They cost days of compute and weeks of project time. Threshold systems miss them because the monitored metrics never cross a fixed number while the failure is developing. Argus catches them because it watches trends relative to internally established baselines, not absolute values relative to a number someone guessed before training began.

The honest limitations are equally part of the contribution. There is no public benchmark for training failure detection, so coverage claims are bounded by the absence of labeled ground truth. The signal taxonomy covers eleven observable failure modes — we believe this covers the majority of algorithm-level failures detectable from scalar telemetry, but the precise percentage is an open empirical question. Failure mode attribution from signals alone is not defensible, and we do not attempt it. Intervention is probabilistic, not certain, and MANUAL mode is the correct default for any run where a wrong automated action would cause meaningful harm.

These limitations are not temporary embarrassments to be quietly fixed in the next version. They reflect genuine epistemological boundaries on what automated monitoring can know from the information it has access to. A system that pretended to know more would not be more useful — it would be less trustworthy.

What comes next is shaped by what version one proved and what it honestly could not do. Validation loss integration and overfitting detection are the highest-priority additions for version two. Broader architecture coverage — diffusion models, reinforcement learning, mixture-of-experts — requires empirical benchmarking rather than design work, since the self-calibrating engine already supports these families in principle. Failure mode attribution, if it is ever attempted, requires a fundamentally different approach — either labeled training data from real production failures, or causal intervention experiments that go beyond what passive monitoring can provide.

Argus began from a simple observation: the information needed to detect a training failure is always present in the training dynamics. Loss trajectories, gradient behavior, and prediction statistics all carry signal about what is happening inside a run. The problem was never that the signal was absent. The problem was that existing tools required users to know upfront what the signal should look like — and nobody knows that before the run begins.

Argus removes that requirement. The engine establishes what normal looks like for each run without user input. It watches for anything that is not normal. The user is told what was found, not what the engine guessed the cause to be. This is the correct division of responsibility between an automated system and the practitioner who built the model, knows the data, and will ultimately decide what to do.

---