Targeted Deep Learning System Boundary Testing

1Technical University of Munich, 2fortiss GmbH, 3North Carolina State University, 4University of Udine

Mimicry mixes features of a blue car with those of a white truck, producing different outputs depending on the latent layers used.

Abstract

Evaluating the behavioral boundaries of deep learning (DL) systems is crucial for understanding their reliability across diverse, unseen inputs. Existing solutions fall short as they rely on untargeted, random perturbations that offer limited control over input variations. In this work, we introduce Mimicry, a novel black-box test generator for fine-grained, targeted exploration of DL system boundaries. Mimicry performs boundary testing by leveraging the probabilistic nature of DL outputs to identify promising directions for exploration. By using style-based GANs to disentangle inputs into content and style components, Mimicry generates boundary test inputs by mimicking features from both source and target classes. We evaluated Mimicry’s effectiveness in generating boundary inputs for five DL image classification systems, comparing it to two baselines from the literature. Our results show that Mimicry consistently identifies inputs up to 25× closer to the true decision boundary, outperforming the baselines with statistical significance. Moreover, it generates semantically meaningful boundary test cases that reveal new functional misbehaviors, while the baselines mostly produce corrupted or invalid inputs. Thanks to its enhanced control over latent space manipulations, Mimicry remains effective as dataset complexity grows, resulting in an up to 36% higher validity rate and competitive diversity, as supported by a comprehensive human assessment.
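To make the mechanism described above concrete, the sketch below illustrates layer-wise latent mixing between a source-class seed and a target-class seed, with black-box SUT feedback selecting the most boundary-like candidate. The generator interface, layer count, and helper names are hypothetical placeholders for a style-based GAN with per-layer latent injection; this is a minimal illustration of the idea, not the authors’ implementation.

```python
NUM_LAYERS = 14  # hypothetical number of style-injection layers in the GAN


def mix_latents(w_source, w_target, swap_layers):
    """Start from the source-class latent at every layer and substitute the
    target-class latent at the selected layers (coarse layers tend to alter
    structure, fine layers tend to alter appearance)."""
    return [w_target if i in swap_layers else w_source for i in range(NUM_LAYERS)]


def best_boundary_candidate(generator, sut_softmax, w_source, w_target,
                            source_cls, target_cls):
    """Progressively mimic more target-class layers and keep the image whose
    source/target probabilities are most balanced (black-box SUT feedback)."""
    best_image, best_gap = None, float("inf")
    for k in range(1, NUM_LAYERS + 1):
        image = generator(mix_latents(w_source, w_target, set(range(k))))
        probs = sut_softmax(image)
        gap = abs(probs[source_cls] - probs[target_cls])
        if gap < best_gap:
            best_image, best_gap = image, gap
    return best_image, best_gap
```

In a real setting, the per-class latents would come from class-conditional sampling or GAN inversion, and the search would explore which layers to swap rather than following a fixed coarse-to-fine sweep as in this sketch.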

Boundary Testing Approaches

Most boundary-testing methodologies focus on the untargeted case, where the goal is simply to find any test case that crosses the decision boundary, regardless of where it lies. In contrast, Mimicry can operate not only in an untargeted but also in a targeted fashion, directing tests toward specific classes. This targeted capability is particularly valuable when only a limited subset of class pairs yields meaningful semantic boundaries.
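The distinction between the two modes can be expressed as two toy fitness functions over the SUT’s softmax output. These are not the exact objectives Mimicry optimizes; they only illustrate the difference in search direction.

```python
def untargeted_fitness(probs, source_cls):
    """Lower is better: the strongest runner-up class closing the gap to the
    source class means the input is approaching *some* decision boundary."""
    runner_up = max(p for cls, p in enumerate(probs) if cls != source_cls)
    return probs[source_cls] - runner_up


def targeted_fitness(probs, source_cls, target_cls):
    """Lower is better: only the gap to one chosen target class counts,
    steering the search toward that specific class-to-class boundary."""
    return abs(probs[source_cls] - probs[target_cls])
```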

Untargeted boundary testing.

Targeted boundary testing.

Comparing Mimicry with Baselines

We compare Mimicry against two established baselines. DeepJanus is a model-based approach that generates candidate pairs near decision boundaries, while Sinvad is a generative approach that uses variational autoencoders (VAEs) to produce test cases. In our experiments, Mimicry not only produced test cases closer to the decision boundary, but also generated images that were generally more semantically meaningful. DeepJanus is constrained by its model-based nature, which can push its outputs out of distribution on certain datasets. Sinvad, on the other hand, quickly loses realism, as its manipulation strategy tends to blur the final outputs. Mimicry avoids both limitations: it is not restricted by domain models, and instead of simply blurring test cases, it introduces structural changes that preserve realism while probing the boundary.
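One plausible way to operationalize “closer to the decision boundary” when comparing generators is the SUT’s top-2 softmax margin on each generated input: the smaller the margin, the less decided the classifier. The harness below is an illustrative sketch of such a comparison, not the measurement protocol used in the paper.

```python
import statistics


def top2_margin(probs):
    """Gap between the two largest class probabilities; a value near zero
    means the input lies close to one of the SUT's decision boundaries."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def compare_generators(sut_softmax, suites):
    """suites maps a generator name (e.g. 'Mimicry', 'DeepJanus', 'Sinvad')
    to a list of generated test inputs; returns the mean margin per suite."""
    return {name: statistics.mean(top2_margin(sut_softmax(x)) for x in inputs)
            for name, inputs in suites.items()}
```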

SVHN example for Mimicry.

SVHN example for Sinvad.

SVHN example for DeepJanus.

ImageNet example for Mimicry.

ImageNet example for Sinvad.

Executive Summary

We present Mimicry, a targeted boundary-testing method for deep learning classifiers that locates decision-boundary inputs through latent feature mixing and SUT feedback. By leveraging disentangled latent-space representations, Mimicry delivers high-control, high-fidelity test cases that remain in-distribution, outperforming DeepJanus and Sinvad across all benchmarks, especially on complex datasets. Its targeted exploration enables meaningful class-to-class boundary testing, improving functional coverage while avoiding unrealistic or overly blurred outputs. Looking ahead, Mimicry’s realism and boundary precision make it particularly promising for safety-critical domains such as autonomous driving and medical imaging, where valid, semantically rich test cases are essential.

BibTeX

@article{weissl2024targeted,
  title={Targeted Deep Learning System Boundary Testing},
  author={Wei{\ss}l, Oliver and Abdellatif, Amr and Chen, Xingcheng and Merabishvili, Giorgi and Riccio, Vincenzo and Kacianka, Severin and Stocco, Andrea},
  journal={arXiv preprint arXiv:2408.06258},
  year={2024}
}