The Domain Gap Is Bigger Than It Looks
When a software team proposes to "apply computer vision" to foundry inspection, they typically start with a pretrained model - ResNet-50, EfficientNet, or a more recent architecture trained on ImageNet or a similar large-scale dataset. The assumption is that transfer learning will handle the domain gap: fine-tune on a few thousand foundry images and the model adapts.
This assumption works reasonably well for many industrial domains. Printed circuit board inspection, label verification, and package integrity checking are domains where the defects have high contrast relative to the background, appear consistently, and involve classification problems (present/absent, correct/incorrect) similar to the kinds of distinctions the pretrained features already capture.
Metal casting surfaces are different in ways that matter for model performance. The defects being detected - surface cracks, porosity clusters, cold shuts, lap marks - are low-contrast features on surfaces with complex, variable textures. The texture of an aluminum die casting surface varies with die temperature, lubricant coverage, alloy grain structure, and ejector pin contact marks. The appearance of a cold shut on a hot die casting taken at the beginning of a run is visually different from the same defect type on the same part at the end of a 4-hour run when die temperature has stabilized. Generic pretrained features trained on object classification do not encode the subtle texture gradient information that distinguishes these cases.
The Specific Failure Modes
The performance gap manifests in predictable ways when generic vision models are applied to casting inspection. The three most common failure modes are: high false-positive rates on normal surface texture variation, low recall on low-contrast defect types, and poor generalization across alloy changes or die changes.
High false-positive rates occur because the model has not learned to distinguish defect texture from normal variation. An HPDC aluminum surface has parting line marks, ejector pin witness marks, mold release film traces, and grain flow patterns that are all expected and acceptable. A generic model trained without sufficient negative examples of these specific texture types will flag them at elevated rates. The consequence is either high escape rate (if detection threshold is raised to reduce false positives) or operator alarm fatigue (if thresholds are left sensitive and operators start ignoring alerts).
Low recall on low-contrast defects is a second systematic failure. Surface sinks, micro-porosity clusters, and cold shuts in partial wall sections often appear as subtle intensity variations - 5-8 gray-level differences over a field of complex surface texture. A generic model has not been trained to be sensitive to this range of variation in this specific surface context. It will miss defects that an experienced quality inspector would catch on a light table because it has no learned representation of what "subtle anomaly in foundry surface texture" looks like.
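The contrast-to-noise argument can be made concrete with a toy sketch. The numbers below are assumed for illustration (a texture standard deviation of 12 gray levels, a 6 gray-level defect shift), not measured values from any real line:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative numbers (assumed, not measured): normal surface texture
# varies by ~12 gray levels (std dev) while a cold shut shifts local
# intensity by only 6 gray levels.
texture_noise_std = 12.0
defect_contrast = 6.0

surface = rng.normal(128.0, texture_noise_std, size=(64, 64))
surface[20:30, 20:30] -= defect_contrast  # subtle dark band, the "defect"

# Contrast-to-noise ratio: the defect signal sits well below normal
# texture variation, so per-pixel thresholding cannot separate it;
# detection has to come from learned spatial texture context.
cnr = defect_contrast / texture_noise_std
print(f"contrast-to-noise ratio: {cnr:.2f}")
```

With a contrast-to-noise ratio below 1, no intensity threshold separates defect pixels from texture pixels; only a representation sensitive to spatial texture structure can.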
Generalization failures appear when the part alloy, die lubricant formulation, or surface finish changes. A model tuned for A380 die casting will not transfer to 6061 sand casting without retraining. The surface texture, reflectance, and typical defect appearance differ enough that the model's learned representations no longer apply. This matters operationally because foundries change alloys, run different customer parts on the same line, and periodically change lubricant formulations.
What Purpose-Built Training Data Requires
A model capable of reliable casting defect detection needs a training dataset built around the specific characteristics of the application domain. The minimum requirements are substantially more than a few thousand images:
First, the dataset needs defect examples across the full range of defect severity - not just obvious, high-contrast defects that are easy to label, but also marginal cases that are at or near the acceptance limit. Models trained primarily on severe defects learn to detect severe defects. The defects that escape inspection are typically marginal ones, and marginal examples must be represented in training data.
Second, the dataset needs negative examples (acceptable parts) that represent the full range of normal surface variation: different positions in the die cycle, different operators applying lubricant, different ambient temperatures, different conveyor speeds affecting part cooling before inspection. Generic negative examples that don't represent the specific sources of variation in the deployment environment produce models that flag those sources of variation as defects.
Third, ground truth labels need to come from domain experts who have validated defect classification against downstream quality outcomes - not from crowdsource labeling platforms. A crack that is acceptable per the customer specification (below minimum acceptance criteria depth) and one that requires rejection look similar to an annotator without metallurgical context. Label quality directly determines model precision ceiling.
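The first requirement, covering marginal severities, translates in practice into stratified sampling that oversamples at-limit cases relative to their pool frequency. A minimal sketch follows; the record schema and quota numbers are hypothetical, not ForgePuls's actual pipeline:

```python
import random
from collections import Counter

# Hypothetical labeled pool: each record carries a severity bucket
# ("none" for acceptable parts, "marginal" for at-limit defects,
# "severe" for obvious ones). Field names are illustrative only.
pool = (
    [{"id": i, "severity": "none"} for i in range(9000)]
    + [{"id": i, "severity": "severe"} for i in range(9000, 9800)]
    + [{"id": i, "severity": "marginal"} for i in range(9800, 10000)]
)

def stratified_sample(pool, quotas, seed=0):
    """Sample fixed per-severity quotas so marginal defects are not
    drowned out by the far more numerous severe and clean examples."""
    rng = random.Random(seed)
    by_sev = {}
    for rec in pool:
        by_sev.setdefault(rec["severity"], []).append(rec)
    sample = []
    for sev, n in quotas.items():
        sample.extend(rng.sample(by_sev[sev], min(n, len(by_sev[sev]))))
    return sample

# Marginal cases are 2% of the pool but 20% of the training sample.
train = stratified_sample(pool, {"none": 600, "severe": 200, "marginal": 200})
print(Counter(r["severity"] for r in train))
```

The design point is the quota on "marginal": without it, random sampling would reproduce the pool's skew and the model would again learn mostly from severe defects.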
ForgePuls's training data for casting defect detection contains 4.2 million labeled images accumulated across sand casting, HPDC, LPDC, and closed-die forging deployments. The dataset is continuously augmented with new images from production deployments, and labels are validated by application engineers with casting backgrounds, not general-purpose annotators.
The Lighting and Optics Problem
Software is only part of the problem. A foundry vision inspection failure attributed to model limitations is often actually a failure of optical design - the wrong lighting for the defect type being detected.
Surface cracks require grazing-angle illumination to be visible. Illumination at 10-15 degrees from the surface creates shadow contrast across crack openings, making them visible even when their width is below 0.1 mm. Overhead illumination - the default configuration for many general-purpose vision systems - does not produce this shadow contrast and misses the very cracks the system was deployed to detect.
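The shadow-contrast effect follows from simple geometry. The sketch below uses an idealized straight-walled crack under collimated light, with assumed crack dimensions (0.05 mm deep, 0.10 mm wide); it is an illustration of the principle, not a metrology model:

```python
import math

def shadowed_width_mm(crack_depth_mm, elevation_deg):
    """Width of the crack floor left in shadow when lit at the given
    elevation angle above the surface. Simplified geometry: straight
    walls, collimated light (illustrative, not a metrology model)."""
    return crack_depth_mm / math.tan(math.radians(elevation_deg))

crack_depth = 0.05   # mm, assumed
crack_width = 0.10   # mm opening, assumed

for angle in (12, 45, 80):
    s = shadowed_width_mm(crack_depth, angle)
    dark = min(s, crack_width)
    print(f"{angle:>2} deg elevation: {dark:.3f} mm of the opening is dark")
```

At 12 degrees the entire 0.10 mm opening falls in shadow and the crack reads as a dark line; at near-overhead angles almost none of it does, which is why the same crack disappears under overhead lighting.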
Porosity cluster detection benefits from structured light or coaxial illumination that highlights surface texture variation differently from ambient reflections. Under coaxial LED ring illumination, die-cast aluminum surfaces clearly show porosity signatures that are invisible under dome illumination, because coaxial light reflects differently off porous surfaces than off smooth ones.
Generic vision systems deployed without lighting engineered for the specific defect type and surface finish will fail regardless of model quality. This is an area where foundry-specific application knowledge is not optional.
Why Pre-Training on Foundry Data Changes the Picture
The alternative to fine-tuning generic pretrained models is to pre-train on large-scale foundry inspection data. The premise is that a model trained from scratch (or from a general visual representation model) on millions of casting images learns the relevant low-level texture representations - grain structure gradients, solidification patterns, surface finish variation by alloy - before any fine-tuning on specific defect categories.
Pre-training on domain-specific data changes the transfer learning baseline. Instead of starting with features tuned for object classification in natural images, the model starts with features tuned for metallic surface texture analysis. Fine-tuning on a specific casting line then requires far fewer examples to reach acceptable performance - because the feature extractor already knows what "normal casting surface" looks like.
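Why a domain-tuned feature extractor cuts the fine-tuning data requirement can be shown with a toy example. Here synthetic 16-dimensional vectors stand in for frozen pretrained embeddings in which defect and non-defect surfaces are already somewhat separated; fine-tuning then reduces to fitting a small linear head. All numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a frozen, domain-pretrained feature extractor: synthetic
# 16-d "embeddings" where class separation already exists, mimicking an
# extractor that encodes casting surface texture. Purely illustrative.
def frozen_features(n, defect):
    center = 0.8 if defect else -0.8
    return rng.normal(center, 1.0, size=(n, 16))

X = np.vstack([frozen_features(150, False), frozen_features(150, True)])
y = np.array([0] * 150 + [1] * 150)

# With the extractor frozen, "fine-tuning" is logistic regression on a
# small head - which is why a few hundred calibration parts can suffice.
w = np.zeros(16)
b = 0.0
for _ in range(500):  # plain gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
acc = (preds == y).mean()
print(f"training accuracy with {len(y)} parts: {acc:.2f}")
```

If the frozen features did not already separate the classes - the generic-pretrained case - no amount of head training on 300 examples would reach this performance; the full network would need retraining on orders of magnitude more data.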
This is the architectural approach behind ForgePuls's detection models. The base feature extractor is trained on casting and forging surface images before any defect-specific fine-tuning. The result is that a new deployment on a casting line geometry we have not seen before can reach 95% detection rate with 200-300 calibration parts rather than the 10,000+ images a generic model requires for adequate fine-tuning.
What "97% Accuracy" Actually Means in This Context
Accuracy metrics for casting inspection need careful interpretation. A model that predicts "no defect" for every part achieves 95-99% accuracy in most foundry environments because defect rates are typically 1-5% of production volume. Accuracy is not the relevant metric. Detection rate (recall on actual defects) and false-positive rate (false alerts as a fraction of good parts inspected) are the metrics that determine whether the system is useful.
A system with 99% accuracy, 70% recall, and 0.5% false-positive rate will miss 30% of defects while flagging 1 in 200 good parts for rejection. That is not acceptable quality performance for most automotive supply chain applications. The targets that actually matter are recall above 95% on defects at or above the acceptance limit, and false-positive rate below 0.2% to maintain operator confidence in alerts.
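The arithmetic behind that example is worth making explicit. The run size and 2% defect rate below are hypothetical values chosen to reproduce the figures in the text:

```python
# Hypothetical run of 10,000 parts at a 2% defect rate, with the
# article's stated recall and false-positive rate.
total = 10_000
defects = 200          # 2% defect rate (assumed)
good = total - defects

recall = 0.70          # detection rate on true defects
fpr = 0.005            # false alerts per good part inspected

tp = int(defects * recall)   # 140 defects caught
fn = defects - tp            # 60 defects escape
fp = int(good * fpr)         # 49 good parts falsely flagged
tn = good - fp

accuracy = (tp + tn) / total
print(f"accuracy: {accuracy:.3f}  recall: {tp / defects:.2f}  "
      f"FPR: {fp / good:.3f}")
# Accuracy rounds to ~0.99 even though 30% of real defects escape.
```

The headline accuracy is dominated by the 98% of parts that are good, which is exactly why it hides a 30% escape rate.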
When evaluating any vision inspection system - generic or purpose-built - ask for performance metrics stated as recall and false-positive rate on a held-out test set that represents your specific part geometry, alloy, and defect distribution. Claims stated only as accuracy are not informative for inspection decisions.
For more on how defect type classification improves process control, see our article on shrinkage versus gas porosity detection approaches.
Learn how ForgePuls models are trained for casting-specific performance: Detection Model Overview