Data pre-processing methods for NPP equipment diagnostics algorithms: an overview

The main tasks of diagnostics at nuclear power plants are the detection, localization, diagnosis, and prognosis of malfunctions. Analytical algorithms of varying complexity are used to solve these tasks. Many of these algorithms require pre-processed input data to operate accurately and efficiently. The pre-processing stage can reduce the volume of the analyzed data, generate additional informative diagnostic features, reveal complex dependencies and hidden patterns, discard uninformative source signals, and remove noise. As a result, it can improve the quality of detection, localization and prognosis. This overview briefly describes the data collected at nuclear power plants and presents methods for their pre-processing. The pre-processing techniques are systematized according to the tasks they perform. Their advantages and disadvantages are presented, and the requirements for the raw data are considered. The references include both fundamental scientific works and applied industrial research. The paper also indicates how the signal pre-processing methods can be applied in real time. The result is an overview of data pre-processing methods as applied to nuclear power plants, with their classification, characteristics, and a comparative analysis.


Introduction
Modern nuclear power plants (NPPs) generate large amounts of data. Intelligent data analysis methods make it possible to use these data to detect malfunctions, determine the operating lifetime of equipment, and solve other pressing problems in NPP operation.
Such data contain valuable information about incipient faults, but it can be extremely difficult to use the so-called raw, or unprocessed, data in analytical algorithms. The algorithms of fault detection, pattern recognition, fault localization, prognosis of fault development, etc. require signal pre-processing for high-quality output. The pre-processing techniques include both machine learning methods (Bishop 2006, Hastie et al. 2009) and classical signal processing methods (Chiang et al. 2001, Sergienko 2011). Modern diagnostic systems at NPPs use such pre-processing methods as spectral analysis, filtering, moving averages, generation of diagnostic features from recorded signals, and others. The academic literature on technical diagnostics has described the application of such methods to NPPs (Arkadov et al. 2004, 2019, 2020).
The pre-processing stage is very important in detection algorithms. Its relevance is evident, since it is an integral part of the overwhelming majority of the methods mentioned in this overview and in other reviews of data processing methods (Venkatasubramanian et al. 2003a, 2003b, Qin 2009, Ma and Jiang 2011, Si et al. 2011, Dai and Gao 2013, Patel and Shah 2018). In Fig. 1 we propose a taxonomy of data pre-processing methods, which summarizes many such works. Fig. 2 shows the flow diagram of equipment diagnostics according to GOST R ISO 13381-1-2016 (2017).
The main path of the equipment diagnostics is the sequential execution of all stages, starting with data acquisition, followed by pre-processing, fault detection, localization, diagnosis or root cause identification and prognosis of how the detected faults may develop. The dashed line indicates an auxiliary path of equipment diagnostics, in which the stages do not follow from one another. The auxiliary path can be taken either in deferred analysis when any stage is considered separately from the others; or when using the original data in its unprocessed form or adding new data at any stage; or in other pre-processing methods to prepare the original data and thus ensure algorithm operation.
It is necessary here to clarify some of the terms used in this article. The offline mode refers to working with the full data sample; in this case, the full realization of the signals is available for analysis. The online mode means working in real time; in this case, the full data sample is unavailable for analysis. Data objects (vectors) can arrive one after another as streaming data, in which case the analysis is called pointwise analysis, or there can be a buffer with batch data, in which case the analysis is called batch analysis.
Supervised learning refers to tasks in which all the operating modes of equipment are known and the data classes are marked; in other words, the data on both the normal mode of operation and the abnormal mode of operation (preferably also on all types of abnormalities) are available. Semi-supervised learning refers to tasks in which only the data on normal mode of operation is available; this means that only the part of data describing normal operation of equipment has a class mark. Unsupervised learning refers to tasks in which there is no data on either normal or abnormal operation and no class marks for any data.
This article focuses on the Data and Pre-Processing stages, traced with a heavy line in Fig. 2. It discusses the signal pre-processing methods that help cleanse time series data and transform, extract and select data features, with respect to NPPs and other complex technical systems.

Data
An NPP may have tens of thousands of instrument channels (Akimov et al. 2015, Arkadov et al. 2019). These include approximately 3,000 temperature signals, 450 electrical signals, 4,700 binary input signals, and 3,200 pressure, level, consumption and other signals. In addition, monitoring, control and diagnostics systems generate a large amount of useful data and, in most cases, transmit only aggregated information to the Supervisory Control And Data Acquisition (SCADA) system. Arkadov et al. (2020) distinguished the following main groups of raw data parameters:
• geometric quantities (measurements of length, position, angle of inclination, etc.);
• thermotechnical quantities (temperature, pressure, flow rate, volume of working fluid);
• electrical quantities (current, voltage, power, frequency, induction, etc.);
• mechanical quantities (deformation, forces, torques, vibration, noise level, etc.);
• chemical composition (concentration, chemical properties, etc.);
• physical properties (humidity, electrical conductivity, viscosity, radioactivity);
• parameters of ionizing radiation (radiation fields inside and outside of zoned fluxes of neutrons and gamma radiation);
• other parameters.
Most of the generated and aggregated signals are raw data in the form of time series. Asynchronous generation and acquisition of data present a problem in data analysis. Malfunctions of measurement channels result in data omissions, inaccurate readings and noise contamination. Moreover, self-monitoring or self-diagnostic systems of measuring equipment can either detect invalid values or skip them. However, various pre-processing methods make it possible to minimize the impact of such factors on the quality of technical diagnostics.

Data Pre-Processing
In general, the Pre-Processing stage consists of the four main steps shown in Fig. 1: Data Cleansing, Feature Transformation, Feature Engineering and Feature Selection. The following sub-sections give a more detailed account of each step.

Data cleansing
Data cleansing eliminates invalid values and outliers by removing or correcting them. At this stage, either the missing data are filled in, or the data objects containing such gaps are deleted if their share is small. The features with a large number of data gaps or invalid values can also be excluded from further analysis.
All measurements affecting NPP safety should be promptly diagnosed and marked with a validity indicator (Arkadov et al. 2019) that shows the degree of information reliability. This makes it possible to eliminate invalid data in the SCADA system. However, not all measurements come with reliable self-monitoring. A growing body of studies aims at diagnosing the measuring equipment and controlling the reliability of measurements, for example (Zavaljevski and Gross 2000, Li et al. 2018a, 2018c, Arkadov et al. 2020). Data gaps appear due to the imperfection of modern measuring systems, communication channels and other infrastructure. This poses a problem when working with anomaly detection methods and other techniques. The simplest approaches here are to ignore features with gaps or to replace the gaps with specially assigned values, for example, 0 or −1. Missing values can also be filled in by standard methods, such as the moving average or median over a selected window; the average (for quantitative features), mode (for categorical features) or median value over the entire time series; or the last value obtained before the gap. Alternatively, there are advanced methods to fill in missing data, for example, machine learning methods (for regression, see Honghai et al. (2005); for the nearest neighbor method, see Batista and Monard (2002), Jonsson and Wohlin (2004); for neural networks, see Gupta and Lam (1998); for the k-means and fuzzy k-means methods, see Li et al. (2004), etc.). Batista and Monard (2003) and Wohlrab and Fürnkranz (2009) compared different gap-filling procedures. Zagoruyko (1999) and Marlin (2008) reviewed gap-filling techniques based on different approaches.
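Two of the standard gap-filling rules listed above (last value before the gap, and the median over a window) can be sketched as follows. This is a minimal illustration, assuming that gaps are encoded as NaN in a one-dimensional NumPy array; the function names, the window length and the sample signal are hypothetical and are not taken from the cited works.

```python
import numpy as np

def fill_last_value(x):
    """Replace each NaN with the last valid value before it (leading NaNs stay NaN)."""
    x = np.asarray(x, dtype=float)
    # Index of the most recent valid sample at every position.
    idx = np.where(~np.isnan(x), np.arange(len(x)), 0)
    np.maximum.accumulate(idx, out=idx)
    return x[idx]

def fill_window_median(x, window=5):
    """Replace each NaN with the median of the valid measured values
    among the `window` samples that precede it (if there are any)."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for i in np.flatnonzero(np.isnan(x)):
        past = x[max(0, i - window):i]
        valid = past[~np.isnan(past)]
        if valid.size:
            out[i] = np.median(valid)
    return out

# Hypothetical signal with three missing readings.
signal = np.array([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan])
filled_last = fill_last_value(signal)
filled_median = fill_window_median(signal, window=3)
```

Both functions look only at past samples, so the same rules remain usable in pointwise online analysis.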
To tackle the problem of outliers, one can either apply conventional methods, for example, remove values that contradict the laws of physics or that deviate from the feature mean by more than a chosen multiple of the standard deviation, or resort to modern methods of data mining and machine learning. In most cases, however, finding anomalies in data is an unsupervised learning task, and hence unsupervised learning methods are the natural choice. In his textbook on models for detecting outliers and anomalies, Aggarwal (2015) identified six main approaches, each corresponding to a class of models: 1. extreme value analysis; 2. clustering; 3. distance-based models; 4. density-based models; 5. probabilistic models; 6. information-theoretical models. Zhao et al. (2019a) described PyOD, a Python library that includes twenty outlier detection methods.
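The two conventional checks mentioned above can be sketched as follows. This is an illustrative example only: the temperature signal, the physical limits and the three-sigma threshold are assumptions chosen for the demonstration. The physical check is applied first, because a grossly invalid value inflates the standard deviation and can hide smaller outliers from the statistical check.

```python
import numpy as np

def physical_range_outliers(x, lower, upper):
    """Flag values outside physically possible limits."""
    x = np.asarray(x, dtype=float)
    return (x < lower) | (x > upper)

def sigma_rule_outliers(x, n_sigmas=3.0):
    """Flag values farther than n_sigmas standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - x.mean()) > n_sigmas * std

# Hypothetical coolant temperature in degrees Celsius with two corrupted samples.
temperature = 290.0 + 0.5 * np.sin(np.linspace(0.0, 20.0, 200))
temperature[50] = 310.0     # implausible spike
temperature[120] = -300.0   # below absolute zero, physically impossible

impossible = physical_range_outliers(temperature, lower=-273.15, upper=1000.0)
statistical = np.zeros_like(impossible)
# Apply the sigma rule only to the samples that passed the physical check.
statistical[~impossible] = sigma_rule_outliers(temperature[~impossible])
```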
Another approach to solving the problem of outlier detection is the use of ensembles (Aggarwal 2013, Aggarwal and Sathe 2015, Aggarwal and Sathe 2017, Zhao et al. 2019b). Ensembles are based on the sequential or parallel application of a single base algorithm or a set of base algorithms to data subsamples or feature subspaces, followed by an evaluation of the resulting response sets. Gradient boosting, random forest, bagging and some other common methods are founded on building such ensembles. Turning now to support vector machines (SVM), there are two principal SVM-based methods for detecting anomalies in data (Scholkopf et al. 2000). The first one, One-Class Support Vector Machine, is used to detect novelties (Scholkopf et al. 2000) and anomalies (Amer et al. 2013) in data. The idea behind this method is to transform the feature space so that, in the new space, a hyperplane separates the objects from the origin of coordinates with the largest possible margin. Zhang et al. (2009) presented an online application of One-Class Support Vector Machine for outlier detection. The second one is Support Vector Data Description (Tax and Duin 2004). It transforms the feature space and then draws a boundary sphere around the data, enclosing the maximum number of objects inside the sphere while keeping its radius as small as possible. Note that Support Vector Data Description is sometimes referred to as the SVM-based one-class classifier, which causes the two methods to be confused. These methods are computationally complex and often show weak results, though their advantage is a clear mathematical and statistical basis.
Isolation Forest, or iForest, identifies outliers by the low depth of outlying values in the constructed tree. The method cannot be applied to streaming data in real time, since building a tree and selecting outlying values require a data sample. Tan et al. (2011) and Ding and Fei (2013) gave examples of the algorithm operating in the online mode with a buffer. The advantages of the method are low computational complexity and the ability to work with heterogeneous input data. The disadvantage is the inability to treat the data as a time series: they are perceived as a non-temporal set of states or instances.
Cluster analysis is the process of categorizing a set of objects into groups (clusters) so that objects in one group are similar in some of their attributes. The study by Jiang et al. (2001) was one of the first to employ cluster analysis to detect outliers in data. Breunig et al. (2000) introduced a degree of being an outlier, called the Local Outlier Factor (LOF), which depends on the local point density. In a follow-up study, He et al. (2003) presented the Cluster-Based Local Outlier Factor and an outlier detection algorithm based on cluster analysis. Such algorithms as ROCK (Guha et al. 2000) and DBSCAN (Ester et al. 1996, Duan et al. 2009) are able to detect outliers, but these algorithms regard noise, i.e. objects that are not assigned to any selected cluster, as outliers. Loureiro et al. (2004), Pachgade and Dhande (2012) and many other studies also report approaches and algorithms for detecting outliers in data using cluster analysis. Both the initial signals and the diagnostic features generated from them can be used as input for clustering algorithms. In addition, clustering algorithms generally impose no requirements on input data, which is one of the advantages of this method. A disadvantage is the use of heuristics in most clustering methods at various stages of solving the problem. Katser et al. (2019) give a more detailed description of One-Class Support Vector Machine, Isolation Forest, and cluster analysis in terms of detecting data anomalies and equipment faults.
Let us now consider the minimum covariance determinant (MCD), another method to control outliers in data (Rousseeuw 1984). Its objective is to find the data subsample whose covariance matrix has the lowest determinant. Thus, the values considered to be outliers are excluded when the covariance matrix is calculated. This improves the quality of methods that rely on estimating the covariance matrix (Principal Component Analysis, Independent Component Analysis, etc.). The FAST-MCD algorithm, developed to search for this subsample quickly, selects at least half of the observations from the total pool and requires an acceptable number of operations, which allows using the method in practice (Rousseeuw and Driessen 1999). Hubert and Debruyne (2009) presented the advantages, disadvantages, limitations and examples of MCD application in various fields. Similarly, Hardin and Rocke (2004), Fauconnier and Haesbroeck (2009) and Leys et al. (2018) examined the application of this outlier detection method to some practical problems. It can also be used for fault detection, for example, in conjunction with Independent Component Analysis (Cai and Tian 2014).

Feature transformation
At the Feature Transformation stage, the transformation affects the feature values (scaling, change in the sampling rate), their type (categorization of discrete and continuous values), modality (videos are converted into a sequence of pictures, pictures into tables of numerical data), etc.
Most of the pre-processing algorithms require input data whose features are on the same scale, since the mean value and variance of features affect their significance for algorithms (Bishop 2006, Hastie et al. 2009). Among numerous scaling methods, the most common ones are the following (Shalabi et al. 2006):
• Linear:
  • Z-Normalization/Standardization: normalizing the mean to 0 and standardizing the variance to 1;
  • Min-Max Normalization: rescaling the range of features to bring data to the 0 to 1 scale, with zero corresponding to the minimum value before normalization and one corresponding to the maximum.
In addition to scaling, the Box-Cox transformation (a family of power transformations that includes the logarithm as a special case) is often applied to features (Sakia 1992) to make their distribution closer to normal. The transformation can be applied multiple times but only to positive values.
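The scaling rules and the Box-Cox transformation can be written down compactly. The sketch below assumes a two-dimensional array with samples in rows and features in columns; the function names and the sample values are hypothetical. The Box-Cox formula used is the standard one-parameter form, (x^λ − 1)/λ for λ ≠ 0 and log x for λ = 0.

```python
import numpy as np

def z_normalize(X):
    """Column-wise standardization: zero mean, unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return (X - X.mean(axis=0)) / std

def min_max_normalize(X):
    """Column-wise rescaling to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features map to 0
    return (X - lo) / span

def box_cox(x, lam):
    """One-parameter Box-Cox transform; defined for strictly positive values only."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

# Hypothetical samples: temperature and pressure on very different scales.
X = np.array([[280.0, 15.5], [290.0, 15.7], [300.0, 16.2]])
Z = z_normalize(X)
M = min_max_normalize(X)
```

In online operation the scaling parameters (mean, standard deviation, minimum, maximum) would normally be estimated once on reference data and then reused for new samples.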
Another important problem is to bring signals with different sampling rates to a single rate. In their monograph, Arkadov et al. (2020) described the main approaches in the chapter Combining Measurement Information of Different Systems:
• reducing the sampling rate of all processes to the minimum;
• increasing the sampling rate of all processes to the maximum;
• converting to an intermediate or any other sampling rate.
The choice of the specific rate to which all signals must be converted should be based on the characteristic rate of the analyzed process and be consistent with the subsequent stages of diagnostics. A significant decrease in the rate can lead to the loss of information in the signals, while an unreasonable increase in the rate raises the computational complexity of subsequent data analysis. Arkadov et al. (2020) outlined the conditions of applicability, advantages and disadvantages of the approaches, but only with respect to spectral analysis. It is worth supplementing the chapter with several observations. Firstly, now that machine learning methods are gaining popularity, partly owing to their ability to work with Big Data, it sometimes pays to bring signals to a lower sampling rate to reduce the total computational complexity of the problem. It may also be necessary to reduce the sampling rate if the set of sequentially applied methods is large, so that problems can still be solved in real time.
Secondly, the monograph does not address the application of the above approaches in real-time mode. Since interpolation is not applicable in real-time mode (in pointwise analysis) and extrapolation is complex and rarely used, simpler methods can deliver the reduction to a single sampling rate, namely:
• increasing the sampling rate by filling the current interval with the last received value, with subsequent sampling;
• increasing the sampling rate by filling the current interval with the average or median value over the last interval, with subsequent sampling;
• decreasing the sampling rate by selecting the extremum, mean or median value in the interval.
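The first and the last of these real-time rules can be sketched with two small stateful helpers. This is an illustration under stated assumptions: the class names, the tick-based clock and the sample readings are hypothetical, and the rates are assumed to differ by an integer factor.

```python
from statistics import mean, median

class LastValueHold:
    """Upsampling: repeat the last received value on every tick of the faster clock."""
    def __init__(self):
        self.last = None  # None until the first reading arrives

    def update(self, value):
        self.last = value

    def sample(self):
        return self.last

class BlockDownsampler:
    """Downsampling: emit one aggregated value per `factor` incoming samples."""
    def __init__(self, factor, reducer=mean):
        self.factor = factor
        self.reducer = reducer  # e.g. mean, median, min or max
        self.buffer = []

    def push(self, value):
        self.buffer.append(value)
        if len(self.buffer) < self.factor:
            return None  # interval not complete yet
        out = self.reducer(self.buffer)
        self.buffer = []
        return out

# A slow sensor delivers a reading on ticks 0 and 3 of a faster clock.
hold = LastValueHold()
slow_readings = {0: 10.0, 3: 13.0}
upsampled = []
for tick in range(6):
    if tick in slow_readings:
        hold.update(slow_readings[tick])
    upsampled.append(hold.sample())

# A fast sensor is reduced by a factor of 3 using the median of each interval.
down = BlockDownsampler(factor=3, reducer=median)
downsampled = []
for value in [1.0, 5.0, 2.0, 4.0, 4.0, 9.0]:
    result = down.push(value)
    if result is not None:
        downsampled.append(result)
```

Both helpers use only values that have already arrived, which is what makes them applicable in pointwise analysis where interpolation is not.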

Feature selection and generation
Feature selection can be generally understood as reducing the number of features, for example, by searching for a subspace of a lower dimension using dimensionality reduction methods or by simply discarding some of the uninformative features. Feature selection simplifies models, reduces the complexity of model training, and helps avoid the curse of dimensionality.
Zagoruyko (1999), Bishop (2006) and Hastie et al. (2009) discussed the problem of selecting a system of informative features and the variety of methods for that purpose. According to these authors, the most common algorithms are as follows:
• exhaustive search over all feature sets;
• sequential feature addition (Add);
• sequential feature elimination (Del);
• genetic algorithm;
• random search;
• clustering of features.
Well-known extensions of some of these algorithms, such as SHAP (Lipovetsky et al. 2001) and LIME (Ribeiro et al. 2016), are now successfully used for interpreting machine learning model predictions and measuring feature importance. A variety of such methods are presented by Lundberg et al. (2017) and in the references therein.
Regularization, which imposes a penalty on the complexity of the model, is often applied to machine learning problems (Bishop 2006, Hastie et al. 2009). L1 regularization, as used in the least absolute shrinkage and selection operator (LASSO; see Tibshirani (1996)), solves the problem of feature selection by excluding some of the original uninformative features from the set used for training and operation of the model.
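For reference, the LASSO estimate can be written in one common form as shown below. The notation is an assumption introduced here for illustration: X is the n × p feature matrix, y is the response vector, β is the vector of model coefficients, and λ is the regularization weight. The 1/(2n) scaling of the quadratic term is one of several equivalent conventions.

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^{p}}
\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda \,\lVert \beta \rVert_1 ,
\qquad \lambda \ge 0 .
```

As λ grows, more coefficients become exactly zero, and the features with zero coefficients are the ones excluded from the model.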
Features can be generated from the logic and physics of the process or by standard transformations, e.g. raising feature values to a power or multiplying them by one another. Engineering of new diagnostic features also includes computing characteristics of a single signal over a sliding buffer, characteristics of all kinds of correlated signal pairs, and other rather simple transformations. With respect to NPPs, they are discussed in the monographs by Arkadov et al. (2004, 2018, 2020). Most techniques of dimensionality reduction solve both the problem of reducing the number of features and the problem of engineering new diagnostic features. The techniques of dimensionality reduction project data into a lower-dimensional space and, unlike selection methods, consider all the original information, thus making it possible to simplify and improve the procedure for monitoring and searching for anomalies in signals. The dimensionality reduction problem has many applications (Chiang et al. 2001). A notable example of using dimensionality reduction is visualization, i.e. representing a dataset in a two- or three-dimensional space.
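The simple transformations mentioned above can be sketched as follows. This is an illustrative example: the particular statistics (mean, standard deviation and peak-to-peak range over a sliding buffer) and the product and square features are assumptions chosen for the demonstration, not a list prescribed by the cited monographs.

```python
import numpy as np

def sliding_windows(x, window):
    """Return a 2-D array whose rows are consecutive overlapping windows of x."""
    x = np.asarray(x, dtype=float)
    starts = np.arange(len(x) - window + 1)[:, None]
    return x[starts + np.arange(window)[None, :]]

def window_features(x, window):
    """Per-window mean, standard deviation and peak-to-peak range."""
    w = sliding_windows(x, window)
    return np.column_stack([w.mean(axis=1), w.std(axis=1),
                            w.max(axis=1) - w.min(axis=1)])

def interaction_features(a, b):
    """Product and squares of two signals sampled at the same instants."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.column_stack([a * b, a ** 2, b ** 2])

feats = window_features([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
pairs = interaction_features([1.0, 2.0], [3.0, 4.0])
```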
Principal Component Analysis (PCA) is a widely used technique for reducing the dimensionality of datasets. The idea of the method is to search for a hyperplane of a given dimensionality in the original space and then project the data onto the found hyperplane. The axes of the new space are linear combinations of the original ones and are selected based on the variance of the data along them. The transformation of the measurement space into a new orthogonal space is performed by bringing the covariance (correlation) matrix to a diagonal form; for this reason, the features in the new space are uncorrelated. Li et al. (2018a, b, c, 2019) and Ayodeji et al. (2018) studied applications of Principal Component Analysis for signal pre-processing and feature generation in problems of diagnosing equipment and sensors.
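The diagonalization described above can be sketched with an eigendecomposition of the sample covariance matrix. This is a minimal illustration: the function names and the synthetic three-signal data set are hypothetical, and a production implementation would more likely rely on a singular value decomposition or on a library routine.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the data mean, the principal axes and the explained variance ratio."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals[:n_components] / eigvals.sum()
    return mean, eigvecs[:, :n_components], explained

def pca_transform(X, mean, components):
    """Project centered data onto the principal axes."""
    return (np.asarray(X, dtype=float) - mean) @ components

# Synthetic example: three strongly correlated signals driven by one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = np.hstack([latent, 2.0 * latent, -latent]) + 0.05 * rng.normal(size=(500, 3))

mean, components, explained = pca_fit(X, n_components=2)
scores = pca_transform(X, mean, components)
```

Here almost all of the variance falls on the first component, so a single new feature could replace the three original signals.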
Independent Component Analysis (ICA), unlike Principal Component Analysis, finds a space in which the features are not only uncorrelated, but also independent in terms of statistical moments of a higher order. In other words, Independent Component Analysis solves the problem of finding any space, including a non-orthogonal one, where the axes are linear combinations of the original ones. The goal is to transform the original signals so that in the new space they are as statistically independent from each other as possible (Kano et al. 2003, Lee et al. 2004a). Both PCA and ICA build transformations into a new space based only on the matrix of features, without taking into account the response vector. This solves the problem of the mutual dependence of features, but fails to tackle the presence of features that do not affect the target variable (response vector). As a result, such features are still carried into further analysis.
In contrast to PCA, where the axes of the new space are selected based on the variance of the original features, the Partial Least Squares (PLS) method, or Projection to Latent Structures, selects the axes of the new space by maximizing the covariance between the matrix of features and the matrix of responses. New spaces are found for both matrices. The new axes for the feature space are calculated to provide the maximum variance along the axes in the new space for the matrix of responses. Using the data on equipment faults as responses, one can obtain a lower-dimensional space for the matrix of features and hence determine various faults more accurately (MacGregor and Kourti 1995, Chiang et al. 2001, Wang et al. 2003, Ma and Jiang 2011). The application of the PLS method is limited by the need to know the classes of events (faults) when training the model. For that reason, the method is often used at the pre-processing stage when solving the problem of making a diagnosis or determining the causes.
The wide applicability of these techniques is explained by the fact that they can handle multidimensional, noisy data with correlated parameters by translating the data into a lower-dimensional space that contains most of the Cumulative Percentage Variance of the original data (Chiang et al. 2001, Jiang and Yan 2014, Xu et al. 2017). However, the standard PCA, ICA and PLS methods can only find linear relationships between features and sometimes fail to solve problems efficiently enough. Hence, a number of modifications improving them have appeared:
• kernel methods: for PCA, see Lee et al. (2004a) and Choi and Lee (2004); for ICA, see Zhang and Qin (2007); for PLS, see Zhang et al. (2010), Zhang and Hu (2011), Jiao et al. (2017). Unlike the linear methods of dimensionality reduction, the non-linear ones achieve an effective dimensionality reduction by building the new lower-dimensional space from non-linear combinations of features;
• dynamic methods: for PCA, see Ku et al. (1995), Russell et al. (2000); for ICA, see Lee et al. (2004b); for PLS, see Chen and Liu (2002). The dynamic methods, used for the analysis of transient phenomena, supplement the studied sample with a certain number of previous observations and take into account autocorrelations and cross-correlations with time lags;
• probabilistic methods: for PCA, see Tipping and Bishop (1999), Kim and Lee (2003); for ICA, see Zhu et al. (2017); for PLS, see Li et al. (2011). The probabilistic methods model the data distribution as a multivariate Gaussian distribution.
With probabilistic PCA (PPCA), it is possible to construct a PPCA mixture model, which consists of several local PPCAs and detects faults in data with multimodal or complex non-Gaussian distributions.
Linear Discriminant Analysis (LDA), or Fisher Discriminant Analysis (FDA), is a statistical analysis method that searches for a linear combination of features able to separate events from different classes (determining different faults) in the best way possible (Chiang et al. 2001). It is used for the problems of classification and dimensionality reduction of the original feature space. de Lazaro et al. (2015) demonstrated that the kernel LDA (FDA with kernels in Mika et al. (1999)) showed better results than the kernel PCA. By analogy with the above methods, the probabilistic version of LDA was developed and presented by Prince and Elder (2007). The method has proven itself well in many fields, including the nuclear industry.
Canonical Correlation Analysis (CCA), or Canonical Variate Analysis (CVA), is a technique that searches for lower-dimensional spaces for two sets of variables (features and responses) such that, after the data are projected, the cross-correlations between the two sets of variables are maximal among all possible spaces (Chiang et al. 2001, Hardoon et al. 2004, Manly and Alberto 2016). The basis of the variables in the new space is a linear combination of the original variables. CCA is used as a method of dimensionality reduction, but it can also be applied to informative feature selection (Kaya et al. 2014). Chen et al. (2016b, 2016c) used CCA to monitor industrial processes, and Chen et al. (2018b) applied a modification of this technique for monitoring processes with a non-Gaussian distribution. CCA is similar to PLS and LDA in that it requires a response vector (Chiang et al. 2001).
Factor Analysis is a multivariate statistical analysis that serves to determine the relationship between variables and reduce their number (Harman 1976, Kim 1989, Warne and Larsen 2014, Manly and Alberto 2016). It is based on the assumption that known variables depend on fewer unknown variables and random error. This allows using Factor Analysis to replace correlated measurements with a smaller number of new variables (factors), at the cost of losing a small amount of the information contained in the original data. Another requirement is to represent the factors in terms of the original variables. The factor itself is interpreted as the cause of the joint variability of several original variables. The main difficulty in Factor Analysis is the selection and interpretation of the principal factors.
Feature bagging, or bootstrap aggregation, is a learning method that draws random feature subsets, each containing from n/2 to n − 1 of the n original features, applies the base algorithm to each subset, and then aggregates all results by summation or another method (Breiman 1996). Feature bagging can improve the performance of algorithms, for example, classification accuracy (Bryll et al. 2003). Lazarevic and Kumar (2005) provided a feature bagging algorithm for detecting outliers in data, with examples. Aggarwal and Sathe (2015) proposed a modification of the algorithm that reduces the dependence of the basic algorithms on themselves.
Bagging in combination with base algorithms turns the solution into an ensemble of algorithms, which increases the computational cost compared with a single base algorithm but improves the accuracy and robustness of the results. If all features are independent and important, bagging often degrades the quality of responses, as each algorithm has an insufficiently informative subsample to learn from.
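The feature bagging scheme for outlier scoring can be sketched as follows. This is a simplified illustration and not the exact algorithm of any of the cited papers: the base detector (distance to the column-wise median), the number of rounds and the synthetic data are hypothetical choices, and the scores are combined by plain summation.

```python
import numpy as np

def distance_score(X):
    """Hypothetical base detector: distance of each sample to the column-wise median."""
    return np.linalg.norm(X - np.median(X, axis=0), axis=1)

def feature_bagging_scores(X, n_rounds=20, base_detector=distance_score, seed=0):
    """Sum the base detector scores over random feature subsets of size n/2 .. n-1."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    if n_features < 2:
        raise ValueError("feature bagging needs at least two features")
    rng = np.random.default_rng(seed)
    total = np.zeros(n_samples)
    for _ in range(n_rounds):
        # The upper bound of rng.integers is exclusive, so size <= n_features - 1.
        size = rng.integers(n_features // 2, n_features)
        subset = rng.choice(n_features, size=size, replace=False)
        total += base_detector(X[:, subset])
    return total

# Synthetic data: 100 samples of 6 features, with sample 7 shifted in every feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
X[7] += 8.0
scores = feature_bagging_scores(X)
most_anomalous = int(np.argmax(scores))
```

Any detector that returns one score per sample could be passed as `base_detector`; scores from detectors with different scales would need normalization before summation.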
Neural networks are also used for data processing and dimensionality reduction. Today, one of the most effective methods for the latter purpose is the autoencoder, a type of artificial neural network applied to encode data, usually in unsupervised learning (Bourlard and Kamp 1988, Sakurada and Yairi 2014, Chen et al. 2016a, Chalapathy et al. 2017). Each subsequent layer of the autoencoder up to the middle layer, called the bottleneck, nearly always has fewer neurons than the previous one. Time series can be fed to the network, and the main requirement is that the data be normalized beforehand. An autoencoder aims to learn a representation of the data in another subspace, usually for a dimensionality reduction problem. It learns to reduce the dimensionality of the feature space of the data received at the network input to a specified number of features, and then to decode the compressed data back to a representation that matches the original data as closely as possible. Thus, the original data are supplied as both the input and the target output of the neural network, and at each training iteration (epoch), the error between the original data and the output data is reduced.
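The training loop described above can be sketched for the smallest possible case: one hidden (bottleneck) layer trained by plain gradient descent on the mean squared reconstruction error. This is an illustration only, written in NumPy to stay self-contained; the layer sizes, learning rate, number of epochs and synthetic data are assumptions, and practical autoencoders are usually deeper and built with a neural network framework.

```python
import numpy as np

def train_autoencoder(X, n_hidden=2, epochs=500, lr=0.1, seed=0):
    """Train a one-hidden-layer autoencoder; return its parameters and loss history."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_features))
    b2 = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)      # encoder: bottleneck features
        Y = H @ W2 + b2               # decoder: reconstruction of the input
        err = Y - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagation of the mean squared error.
        dY = 2.0 * err / err.size
        dW2, db2 = H.T @ dY, dY.sum(axis=0)
        dZ = (dY @ W2.T) * (1.0 - H ** 2)
        dW1, db1 = X.T @ dZ, dZ.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), losses

def encode(X, params):
    """Map data to the bottleneck representation."""
    W1, b1, _, _ = params
    return np.tanh(X @ W1 + b1)

# Synthetic data: four signals driven by two latent factors, normalized beforehand.
rng = np.random.default_rng(1)
raw = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(200, 4))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

params, losses = train_autoencoder(X)
bottleneck = encode(X, params)
```

The bottleneck output can then serve as a reduced feature set, and the per-sample reconstruction error as an input to a fault detection criterion.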
In addition to feed-forward networks, there are a large number of modified architectures; some of them are as follows:
• convolutional autoencoders, whose architecture includes a convolutional layer that creates a convolutional kernel for the convolution of input data by one feature. They are used for data noise removal (Grais and Plumbley 2017), clustering (Chen 2015, Ghasedi et al. 2017), fault detection (Chen et al. 2018a) and other purposes;
• Recurrent Neural Network (RNN) based autoencoders and their varieties (Elman 1990, Chung et al. 2016), such as Long Short-Term Memory (Hochreiter and Schmidhuber 1997) and Gated Recurrent Units (Chung et al. 2014);
• Variational Autoencoders (VAE), which learn the probability distributions that model the input data and thereby train a latent variable model (Everett 2013). For more details on VAE architecture and applications, refer to Kingma and Welling (2013), Doersch (2016).
Autoencoders can be used jointly with standard fault detection methods, for example, with statistical detection criteria (Yang et al. 2015, Xiao et al. 2017). The advantages of these neural networks are a high degree of compression of the original data, achieved by finding complex non-linear dependencies, and the possibility of modifying the architecture, for example, in order to remove noise (Vincent et al. 2008). Their drawbacks are the computational complexity of the algorithms and the difficulty of tuning the models. Generally, neural networks, especially deep ones, are regarded as techniques that extract useful features automatically. This is sometimes an advantage over classical machine learning and other approaches, where feature extraction is often a manual and laborious part of the work. Although this property of neural networks improves the quality of the model and of the final results by extracting more complex non-linear features, it can also be considered a disadvantage, because it is not known how a feature is extracted. Data scientists usually cannot reproduce the logic by which a feature is derived from the original data, or the intuition and physics behind it. Interest in this issue has grown recently. We recommend deciding in advance which matters more for the task at hand: the quality of the solution or the transparency of the feature extraction and modeling processes.
Spectral Analysis includes time series processing associated with obtaining a representation of signals in the frequency domain. The main application of Spectral Analysis is to assess the vibration of equipment. The most popular techniques of spectral processing are the Fourier transform, the Laplace transform, the Hilbert transform and the Hilbert-Huang transform. The results of Spectral Analysis are rather easy to interpret, and on their basis it is possible to detect faults, determine the nature of their occurrence and make a diagnosis. Arkadov et al. (2004, 2018, 2020) described the application of Spectral Analysis to NPP diagnostics in detail. As for non-stationary time series, time-frequency analysis is widely used to detect malfunctions in rotary equipment under time-varying operating conditions. Kim et al. (2007) provided a comparative analysis of the windowed Fourier transform, the Wigner-Ville distribution, and the wavelet transform.
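As an illustration of moving a signal into the frequency domain, the following minimal sketch uses the discrete Fourier transform from NumPy to find the dominant frequency of a synthetic vibration-like signal. The function name `dominant_frequency`, the sampling rate and the test signal are assumptions made for this example and are not taken from the cited works.

```python
import numpy as np


def dominant_frequency(signal, sampling_rate_hz):
    """Return the frequency (Hz) of the largest peak in the amplitude spectrum."""
    signal = np.asarray(signal, dtype=float)
    # Remove the mean so that the zero-frequency (DC) bin does not dominate.
    centered = signal - signal.mean()
    amplitudes = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(centered.size, d=1.0 / sampling_rate_hz)
    return freqs[np.argmax(amplitudes)]


# Synthetic signal (assumed): a 50 Hz component, a weaker 120 Hz component and noise.
rng = np.random.default_rng(0)
fs = 1000.0                      # sampling rate in Hz (assumed)
t = np.arange(1000) / fs         # one second of data
x = (np.sin(2 * np.pi * 50 * t)
     + 0.3 * np.sin(2 * np.pi * 120 * t)
     + 0.1 * rng.standard_normal(t.size))

peak_hz = dominant_frequency(x, fs)
```

In a diagnostic setting, the position and amplitude of such peaks would be compared with reference spectra; for non-stationary signals the same calculation would be repeated over short windows, which corresponds to the windowed Fourier transform mentioned above.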
Another tool of fault detection is the generation of diagnostic features that serve as equipment health indicators. Such diagnostic features, which characterize the system condition, are identified by experts based on their experience to give a clear and effective understanding of the state of a technical system and, accordingly, to detect anomalies in operation (Leskin et al. 2011, Costa et al. 2015, Baraldi et al. 2018, Arkadov et al. 2020). In effect, principal components in PCA, bottleneck features in an autoencoder, and the Fourier spectrum of a signal are also diagnostic features, but the main distinction of equipment health indicators is that they are formulated in a purely heuristic way. An expert builds the equipment health indicators by processing and formalizing regularities that are not described by known physical and mathematical models of the equipment.
The advantages of the diagnostic features approach include the possibility of creating a rational solution that accumulates experts' experience, and the ease of health indicator implementation. The disadvantages are the lack of physical or mathematical models that could form the foundation of the method, and its limited scope: as a rule, an indicator points only to malfunctions of the same kind in one unit of equipment.

Time series data augmentation
A lack of time series data makes deep learning algorithms inapplicable in some applications. In such cases, augmentation or data generation is used to add synthetic data so that machine learning algorithms can be trained and operated better. Quite a bit of attention is paid to this field of knowledge; the surveys by Ivana et al. (2020) and Wen et al. (2021) highlight the state of this research field, and the latter work provides a taxonomy for time series data augmentation. Although data augmentation is quite a useful tool for improving the quality of various models, it mainly relates to the training stage and is almost never a part of the equipment diagnostics pipeline. Moreover, time series data augmentation methods are not sufficiently researched for real-world industrial data with noise and with statistical properties that change over time.
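To make the idea of augmentation concrete, the sketch below shows two simple time-domain techniques, noise injection (jittering) and window slicing. Their choice is an assumption made for illustration, since the taxonomy itself is not reproduced here; the function names and parameter values are hypothetical as well.

```python
import numpy as np


def jitter(series, sigma, rng):
    """Return a copy of the series with zero-mean Gaussian noise added to every sample."""
    series = np.asarray(series, dtype=float)
    return series + rng.normal(0.0, sigma, size=series.shape)


def window_slices(series, window, step):
    """Cut one long series into overlapping windows of fixed length."""
    series = np.asarray(series, dtype=float)
    starts = range(0, series.shape[0] - window + 1, step)
    return np.stack([series[s:s + window] for s in starts])


rng = np.random.default_rng(42)
original = np.sin(np.linspace(0, 4 * np.pi, 200))     # assumed example signal

noisy_copy = jitter(original, sigma=0.05, rng=rng)    # one synthetic variant
windows = window_slices(original, window=50, step=25) # several shorter training samples
```

Both operations are applied only to the training set; whether the synthetic samples remain physically plausible for a given unit of equipment has to be checked separately.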

Online application of pre-processing methods
Each pre-processing method has its own distinctive nature in relation to the original data: some are capable of working with one data object while others require the calculation of values based on a learning sample or a buffer. Moreover, real-time pre-processing must match the pre-processing used when the diagnostics model was trained; otherwise, the model may give incorrect results. For such cases, it is worth discussing the mechanisms for applying pre-processing methods:
• Pointwise transformation in learning and operation. This mechanism is used when the applied pre-processing methods require a state vector only at the current time. Examples of such transformations are deleting data exceeding a certain (for example, physically justified) threshold, raising a feature to a polynomial power, multiplying feature values, etc.
• Complete or batch transformation during learning, pointwise transformation during operation. This mechanism is used when the transformation requires the calculation of values, for example, the mean or the variance of a learning sample. The values obtained at the learning stage are saved and applied in real-time operation to each new state vector. Examples of such transformations are One-Class SVM, iForest, MCD, PCA and all linear methods for reducing features to a single scale mentioned in this article.
• Batch transformation. It refers to the transformation of features based on the calculation of characteristics using a sliding window or a batch. Examples are calculating a moving average of a signal over a window, or obtaining auto-characteristics of signals and of all kinds of correlated pairs using a sliding buffer.
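The three mechanisms can be sketched as follows. This is a minimal illustration under assumed names (`clip_to_physical_range`, `FittedScaler`, `MovingAverage`) and toy data, not an implementation taken from any diagnostic system.

```python
from collections import deque

import numpy as np


def clip_to_physical_range(x, lower, upper):
    """Pointwise: needs only the current state vector.

    Values outside the (assumed) physically justified range are marked as missing.
    """
    x = np.asarray(x, dtype=float)
    return np.where((x < lower) | (x > upper), np.nan, x)


class FittedScaler:
    """Batch fit during learning, pointwise transform during operation."""

    def fit(self, train):
        train = np.asarray(train, dtype=float)
        self.mean_ = train.mean(axis=0)   # saved at the learning stage
        self.std_ = train.std(axis=0)
        return self

    def transform(self, x):
        # Applied to each new state vector with the saved statistics.
        return (np.asarray(x, dtype=float) - self.mean_) / self.std_


class MovingAverage:
    """Batch (sliding window): keeps a buffer of recent points in operation too."""

    def __init__(self, window):
        self.buffer = deque(maxlen=window)

    def update(self, x):
        self.buffer.append(np.asarray(x, dtype=float))
        return np.mean(np.stack(list(self.buffer)), axis=0)


train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy learning sample

new_point = clip_to_physical_range([2.0, 500.0], lower=0.0, upper=100.0)
scaled = FittedScaler().fit(train).transform([2.0, 20.0])

ma = MovingAverage(window=2)
for p in train:
    smoothed = ma.update(p)   # average over the last two points
```

The practical difference is in what must be stored between calls: nothing for the pointwise case, fitted statistics for the second mechanism, and a data buffer for the third.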
Let us demonstrate how the methods are applied in real-time mode, assuming that our pre-processing pipeline consists of the following steps:
1. Filling gaps with a moving average;
2. Z-Normalization;
3. Applying PCA;
4. Selecting the first principal component for further comparison with the threshold for anomaly detection.
First, a new point of the multivariate time series is received. If some of the values in the new vector are missing, the average over a window of previous points is calculated and inserted into the gaps. After that, Z-normalization is applied using the mean and standard deviation values defined previously, during the training stage (commonly for the fault-free mode). Then PCA is applied using the transformation matrix calculated on the training set. Finally, the value along the first principal axis is selected for further comparison with the threshold.
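The steps above can be sketched as a small class that is fitted once on fault-free training data and then called for every incoming point. The class name `OnlinePipeline`, the window length, the threshold value, the use of the absolute value of the component, and the synthetic data are all assumptions made for this illustration.

```python
from collections import deque

import numpy as np


class OnlinePipeline:
    def __init__(self, train, window, threshold):
        train = np.asarray(train, dtype=float)
        # Statistics for Z-normalization, fixed at the training stage.
        self.mean_ = train.mean(axis=0)
        self.std_ = train.std(axis=0)
        z = (train - self.mean_) / self.std_
        # PCA via SVD of the standardized training data; rows of vt are principal axes.
        _, _, vt = np.linalg.svd(z, full_matrices=False)
        self.first_axis_ = vt[0]
        self.threshold = threshold
        # Buffer of recent points, seeded with the end of the training set.
        self.buffer = deque(train[-window:], maxlen=window)

    def process(self, point):
        point = np.asarray(point, dtype=float).copy()
        # Step 1: fill missing values with the moving average over the buffer.
        missing = np.isnan(point)
        if missing.any():
            window_mean = np.mean(np.stack(list(self.buffer)), axis=0)
            point[missing] = window_mean[missing]
        self.buffer.append(point)
        # Step 2: Z-normalization with the saved mean and standard deviation.
        z = (point - self.mean_) / self.std_
        # Steps 3 and 4: projection onto the first principal axis.
        score = float(z @ self.first_axis_)
        # The sign of a principal axis is arbitrary, so the magnitude is compared (assumption).
        return score, abs(score) > self.threshold


# Synthetic fault-free data (assumed): three correlated signals.
rng = np.random.default_rng(1)
base = rng.standard_normal(500)
train = np.column_stack([
    base,
    2.0 * base + 0.1 * rng.standard_normal(500),
    -base + 0.1 * rng.standard_normal(500),
])

pipe = OnlinePipeline(train, window=10, threshold=6.0)
normal_score, normal_flag = pipe.process([0.1, np.nan, -0.1])  # one value missing
fault_score, fault_flag = pipe.process([8.0, 16.0, -8.0])      # large deviation
```

Note that filled values are written back into the buffer, so a long run of gaps would gradually flatten the moving average; in practice a limit on consecutive filled points may be needed.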

Conclusion
This overview has described the peculiarities of the data collected at NPPs and its pre-processing in real time. Table 1 summarizes the methods of data pre-processing, carried out before solving the main problem of diagnostics.
The problems encountered in data are not unique to the nuclear industry, but the outstanding aspect of NPPs is the large amount of generated information, the variety of its sources and data types. Pre-processing is necessary to prepare the data for input to the diagnostic algorithms, since many of them either have requirements that rule out the input of data with gaps, outliers, signals with different sampling rates, or produce incorrect results when working with unscaled data. Another reason for using pre-processing methods is the possibility of improving the quality of the diagnostic algorithms and reducing the computational complexity of the problem, for example, by reducing the dimensionality of the initial data or lowering the sampling frequency of signals.
We find it necessary to summarize our opinion on which methods are commonly used, which are not, and why:
• When filling in gaps, the most intuitive way is to use specially assigned values to avoid generating false information about the data. However, not all machine learning methods can process such values properly. That is why the most common techniques fill the gaps with data characteristics computed over moving windows or over the whole signal realization. Machine learning techniques are quite rare and situational for such problems.
• As for the detection of outliers and impossible values, the most straightforward approaches, which detect values that contradict the laws of physics, are the most popular ones because such rules are transparent to engineering personnel. Searching for deviations from statistical characteristics, even with machine learning techniques, is still fighting for attention. Such methods are primarily used in retrospective analysis or in diagnostic systems that provide recommendations for operating personnel, but not in critical safety systems.
• When transforming the data, Z-Normalization and Min-Max scaling are the most common scaling techniques because in the overwhelming majority of cases they show better results. Other methods are used when further analysis requires them for a specific reason. The Box-Cox transformation and other techniques, such as taking the derivative of the data, are situational and used when further research requires normally distributed data or stationary time series.
• An insufficient sampling rate of a signal, or different sampling rates across signals, is a frequent problem for industrial data. When selecting a unified sampling rate, it is vital to achieve a trade-off between the loss of information and computational complexity. At the same time, the choice of a specific rate should be based on the characteristic rate of the analyzed process. When increasing the sampling rate in real-time mode, filling the current range with the last received value is the most common technique. When decreasing it, both extrema and mean/median values are commonly used.
• For feature selection, a thorough analysis combined with the various algorithms mentioned works best. The analysis may also include finding dependencies of the target vector on the features when the problem is supervised. One of the most common ways is to fit a simple model, calculate feature importance for this model, and then select the most important features for fitting a more complex model. Regularisation is also commonly used when applicable. Among dimensionality reduction techniques, PCA is the most popular since it is unsupervised and provides a linear transformation that is easy to understand and transparent for personnel. Although nonlinear techniques, including neural networks, show state-of-the-art results, they lack interpretability of how the transformation is constructed, which makes these approaches unpopular in industrial applications.
• Feature generation in real-world applications is primarily based on the logic and physics of the process, resulting in heuristic health indicators and various meaningful characteristics from spectral analysis.
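The resampling conventions mentioned in the list above can be sketched as follows: holding the last received value when the sampling rate is increased, and taking the mean or an extremum of each block when it is decreased. The function names and the toy signals are assumptions made for this example.

```python
import numpy as np


def hold_last_value(timestamps, values, target_times):
    """Upsample by repeating the most recent received value at each target time."""
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    target_times = np.asarray(target_times, dtype=float)
    # Index of the last sample received at or before each target time.
    idx = np.searchsorted(timestamps, target_times, side="right") - 1
    out = np.full(target_times.shape, np.nan)
    valid = idx >= 0                 # targets before the first sample stay missing
    out[valid] = values[idx[valid]]
    return out


def downsample(values, factor, reducer=np.mean):
    """Reduce every block of `factor` consecutive samples to one value."""
    values = np.asarray(values, dtype=float)
    usable = (values.size // factor) * factor   # drop an incomplete trailing block
    return reducer(values[:usable].reshape(-1, factor), axis=1)


# A slow signal sampled every 10 s is brought to a 5 s grid.
upsampled = hold_last_value([0, 10, 20], [1.0, 2.0, 3.0], np.arange(0, 30, 5))

# A fast signal is reduced by a factor of 2 using the mean and the maximum.
fast = [1.0, 3.0, 2.0, 8.0, 5.0, 5.0]
block_mean = downsample(fast, 2)
block_max = downsample(fast, 2, reducer=np.max)
```

Keeping the block maximum preserves short spikes that the block mean would smooth out, which is one reason both extrema and mean/median values are used in practice.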
The methods described in this work have already proven themselves in industrial applications, including at NPPs. At the same time, these methods continue to develop, and extensions appear that improve their operation or expand their field of application. This overview, together with Katser et al. (2019), gives a sufficiently complete understanding of how the process at an NPP can be monitored from the moment of pre-processing of the collected data to the moment of solving the first diagnostic problem, i.e. detecting equipment malfunction.
Further research can be focused on overviewing the methods used to solve such diagnostic problems at NPPs as arriving at the correct diagnosis, fault localization, and prognosis of the malfunction development.