Artificial active

Improving datasets for artificial intelligence through model-based methods

By Dirk Mayer and Ulf Wetzker

Factories and industrial processes are now digitized and networked, and AI can be used to evaluate the data generated by these facilities to increase productivity and quality.

Machine learning (ML) methods can be applied to:

  • Classification of product quality in complex production processes.
  • Condition monitoring of technical systems, used for example to diagnose drive systems and production facilities, as well as the wireless communication of critical automation components.
  • Detection of anomalies in sensor data or process information. Early initiation of countermeasures reduces system failures.
  • Prediction of results or events based on previous measurements.
  • Optimization of production processes, taking into account the flow of materials.
  • Training of intelligent robotic systems.

Some of the most popular machine learning applications are based on smartphone user data or internet sources (social media, Wikipedia, image databases, etc.). For the latter, very large training datasets are used, for example about 45 TB of text data for training OpenAI's GPT-3 [1].

Real industrial applications operate on much smaller datasets. This makes it difficult to train high-performance ML models and therefore to fully exploit the potential added value. Datasets are often incomplete for the following reasons:

  • In industrial measurement campaigns, it is often not possible to comprehensively record all important conditions that need to be classified. This applies, in particular, to data from failures or faults in the system.
  • It is often not possible to collect data on all units of a machine model, so mechanical or electronic differences and environmental influences (e.g. temperature fluctuations) are not reflected in the data.
  • Data is usually sampled, filtered and compressed, so information is lost.
  • In addition, the data is often not completely labelled, i.e. assigned to a state for later classification.

These incomplete datasets lead directly to over-fitted AI models and a lack of generalization. At the same time, measurement campaigns covering all possible variations are not economically feasible.


To overcome these problems, insufficient datasets need to be cleaned up and augmented.

In industrial processes, time series data plays a particularly important role (e.g. sensor data, process parameters, log files, communication protocols). Such data is available in very different time resolutions: a temperature sensor may provide one value per minute, while a spectral analysis of a wireless network needs more than 100 million samples per second.
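To use signals with such different resolutions together, the faster one usually has to be brought onto the time grid of the slower one. The following is a minimal sketch of one simple option, block averaging, assuming NumPy is available. The function name `block_average` and the 10 Hz example signal are hypothetical and only serve as illustration.

```python
import numpy as np


def block_average(signal, factor):
    """Reduce the sampling rate by averaging non-overlapping blocks of `factor` samples."""
    signal = np.asarray(signal, dtype=float)
    usable = (len(signal) // factor) * factor  # drop an incomplete last block
    return signal[:usable].reshape(-1, factor).mean(axis=1)


# Hypothetical example: three minutes of a 10 Hz signal reduced to one value per minute
fast_rate_hz = 10
fast = np.arange(3 * 60 * fast_rate_hz, dtype=float)
per_minute = block_average(fast, factor=60 * fast_rate_hz)
```

Averaging discards information, which is exactly the loss mentioned in the list of reasons for incomplete datasets, so the reduction factor should be chosen with the later classification task in mind.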

The goal is to reflect all relevant process states, as well as uncertainties due to stochastic effects, in the augmented time series. To add values to the measured time series of an industrial process, information about the process is useful. Such a representation of the physical background may be called a model. In terms of model building, a distinction can be made between three levels: black box, gray box and white box.

This allows us to derive strategies for model-based data generation. To generate longer and more suitable time series for training AI models, the following strategies should be combined in an application-specific way:

  • Black box. Unsupervised learning can be used to generate artificial time series. This creates new “similar” sections of data without a deeper physical understanding of the waveforms. However, a relatively large amount of data is required and the relationship between the sections is not physically motivated.
  • Gray box. Sections of the time series are generated from physical understanding, for example by superimposing certain patterns that belong to relevant classes or by distorting the measured time series. This requires many measurements and a basic understanding of which waveforms are assigned to which states or classes.
  • White box. Time series are generated from a system simulation, which theoretically does not require any measurement. In reality, however, completely white (“snow white”) models are usually not possible, since the model parameters must always be matched to reality.
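The white box idea can be illustrated with a very small simulation. The sketch below assumes a damped oscillator as a stand-in for the system model; the model itself, the parameter ranges and the function names are hypothetical and not taken from the article. Each generated series uses slightly different parameters, which mimics unit-to-unit differences between machines.

```python
import numpy as np


def simulate_damped_oscillator(freq_hz, damping, duration_s, sample_rate_hz, noise_std, rng):
    """Simulate a noisy damped sine wave as a toy white box system model."""
    n_samples = int(round(duration_s * sample_rate_hz))
    t = np.arange(n_samples) / sample_rate_hz
    clean = np.exp(-damping * t) * np.sin(2 * np.pi * freq_hz * t)
    return t, clean + rng.normal(0.0, noise_std, size=t.shape)


def generate_dataset(n_series, rng):
    """Draw model parameters from assumed ranges and simulate one series per draw."""
    series = []
    for _ in range(n_series):
        freq_hz = rng.uniform(4.5, 5.5)   # assumed spread of the natural frequency
        damping = rng.uniform(0.5, 1.5)   # assumed spread of the damping constant
        _, y = simulate_damped_oscillator(freq_hz, damping, 2.0, 100.0, 0.02, rng)
        series.append(y)
    return np.stack(series)


rng = np.random.default_rng(seed=0)
dataset = generate_dataset(20, rng)  # array with one simulated series per row
```

The parameter ranges are the critical part: in line with the remark on "snow white" models, they would normally have to be estimated from at least a few real measurements.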

In the field of image processing, data augmentation may be intuitively easier. In contrast, augmenting time series primarily requires an understanding of the underlying process. Depending on the depth of prior knowledge, model-based or synthetic data can be used. The optimal strategy for extending data generally follows economic considerations. Depending on the problem, collecting a full set of measurement data or building physically meaningful models can be very expensive. In industrial practice, methods of the “gray box” category, which need only limited experimental and analytical effort, will mainly be used.
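A gray box augmentation step can be sketched in a few lines. The example below assumes NumPy and shows two simple operations: distorting a measured series (random amplitude scaling plus additive noise) and superimposing a short pattern that is assumed to be typical for a fault class. The function names, the sine "measurement" and the Hann-shaped fault pattern are hypothetical placeholders.

```python
import numpy as np


def scale_and_jitter(series, rng, scale_std=0.05, noise_std=0.01):
    """Distort a series with a random amplitude factor and additive Gaussian noise."""
    scale = rng.normal(1.0, scale_std)
    return series * scale + rng.normal(0.0, noise_std, size=series.shape)


def overlay_pattern(series, pattern, rng):
    """Add a class-specific pattern at a random position; return the new series and the position."""
    start = int(rng.integers(0, len(series) - len(pattern) + 1))
    out = series.copy()
    out[start:start + len(pattern)] += pattern
    return out, start


rng = np.random.default_rng(seed=1)
measured = np.sin(np.linspace(0.0, 8.0 * np.pi, 400))  # placeholder for a real measurement
fault_pattern = 0.5 * np.hanning(20)                    # assumed shape of a fault signature

augmented_normal = [scale_and_jitter(measured, rng) for _ in range(5)]
augmented_fault = [
    overlay_pattern(scale_and_jitter(measured, rng), fault_pattern, rng)[0]
    for _ in range(5)
]
```

Because the pattern is inserted deliberately, the class label of every augmented series is known, which also helps with the incomplete labelling described earlier.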

Interesting perspectives for interdisciplinary approaches also arise. Time series can be found in completely different processes, even outside of technology and industry. The underlying processes differ, but the characteristics of the time series can be very similar. In the image below, two time series are shown that resemble each other because both oscillate, although they are generated by completely different processes. On the left is the periodic oscillation of solar activity (period of about 10 years, abscissa in years from the year 1700, one sample per year [2]). On the right is a human ECG (period approx. 1 s, sampling interval 1/300 s [3]). This offers potential for cross-domain method transfer, for example using sophisticated speech and text processing models for data augmentation in the medical field [4].
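That the same analysis can be applied to series from unrelated domains can be shown with a small sketch. The code below estimates the dominant period of a series from its amplitude spectrum, assuming NumPy. The two sine waves are synthetic stand-ins with the periods and sampling intervals quoted above; they are not the solar activity or ECG data from [2] and [3].

```python
import numpy as np


def dominant_period(series, sample_interval):
    """Estimate the dominant period from the largest peak of the amplitude spectrum."""
    series = np.asarray(series, dtype=float)
    centered = series - series.mean()            # remove the constant offset
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(centered), d=sample_interval)
    peak = np.argmax(spectrum[1:]) + 1           # skip the zero-frequency bin
    return 1.0 / freqs[peak]


# Synthetic stand-in for a slow process: 300 yearly samples, period of 10 years
years = np.arange(300)
slow_series = np.sin(2.0 * np.pi * years / 10.0)

# Synthetic stand-in for a fast process: 10 s sampled every 1/300 s, period of 1 s
seconds = np.arange(3000) / 300.0
fast_series = np.sin(2.0 * np.pi * seconds / 1.0)

slow_period = dominant_period(slow_series, sample_interval=1.0)        # in years
fast_period = dominant_period(fast_series, sample_interval=1.0 / 300)  # in seconds
```

Only the sampling interval changes between the two calls, while the method stays the same. Real signals such as an ECG are far from sinusoidal, so this only illustrates the principle.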

To achieve a sustained increase in the performance of a trained model, it is necessary to integrate human knowledge about the process. Methods from the domain of human-in-the-loop ML, such as active learning, offer the possibility of moving from a black box approach to a gray box model.
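One common active learning strategy is uncertainty sampling: the model asks a human expert to label those samples it is least sure about. The sketch below is a minimal illustration under that assumption, using the entropy of predicted class probabilities as the uncertainty measure. The function name and the probability values are hypothetical.

```python
import numpy as np


def select_for_labelling(probabilities, n_queries):
    """Return the indices of the samples with the highest prediction entropy."""
    p = np.clip(np.asarray(probabilities, dtype=float), 1e-12, 1.0)  # avoid log(0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return np.argsort(entropy)[::-1][:n_queries]


# Hypothetical class probabilities predicted for four unlabelled time series sections
probs = np.array([
    [0.98, 0.02],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],
])
queries = select_for_labelling(probs, n_queries=2)  # indices to show to the expert
```

The expert's labels for the selected sections are then added to the training set and the model is retrained, which is one way process knowledge enters the model step by step.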

Currently, there is no systematic approach or simple tools for industrial applications that combine the above methods in a meaningful way to enable efficient data augmentation. This is the subject of ongoing research.

References



[3] Richter, A.: Entwicklung eines Systems zur Erfassung affektiver Zustände auf der Grundlage von Vitalparametersensordaten, Master Thesis, TU Chemnitz, July 2021.

[4] Bird, J. J., Pritchard, M., Fratini, A., Ekart, A. and Faria, D. R.: Machine-generated synthetic biological signals by GPT-2 improve EEG and EMG classification through data augmentation, IEEE Robotics and Automation Letters, 6(2), 3498-3504, 2021.

Ulf Wetzker is part of the Industrial Wireless Communication working group at Fraunhofer IIS EAS.

Dirk Mayer is responsible for the distributed data processing and control department in the Adaptive Systems Engineering division of Fraunhofer IIS.