Research Article
Corresponding author: Victor P. Telnov (telnov@bk.ru). Academic editor: Georgy Tikhomirov
© 2023 Victor P. Telnov, Yury A. Korovin.
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Telnov VP, Korovin YuA (2023) Application of machine learning methods for filling and updating nuclear knowledge bases. Nuclear Energy and Technology 9(2): 115-120. https://doi.org/10.3897/nucet.9.106759
The paper deals with issues of designing and creating knowledge bases in the field of nuclear science and technology. The authors present the results of searching for and testing optimal classification and semantic annotation algorithms applied to the textual network content for the convenience of computer-aided filling and updating of scalable semantic repositories (knowledge bases) in the field of nuclear physics and nuclear power engineering and, in the future, for other subject areas, both in Russian and English. The proposed algorithms will provide a methodological and technological basis for creating problem-oriented knowledge bases as artificial intelligence systems, as well as prerequisites for the development of semantic technologies for acquiring new knowledge on the Internet without direct human participation. Testing of the studied machine learning algorithms is carried out by the cross-validation method using corpora of specialized texts. The novelty of the presented study lies in the application of the Pareto optimality principle for multi-criteria evaluation and ranking of the studied algorithms in the absence of a priori information about the comparative significance of the criteria. The project is implemented in accordance with the Semantic Web standards (RDF, OWL, SPARQL, etc.). There are no technological restrictions for integrating the created knowledge bases with third-party data repositories as well as metasearch, library, reference or information and question-answer systems. The proposed software solutions are based on cloud computing using DBaaS and PaaS service models to ensure the scalability of data warehouses and network services. The created software is in the public domain and can be freely replicated.
Keywords: semantic web, knowledge base, machine learning, classification, semantic annotation, cloud computing
Nuclear science and technology are among the areas with a high intensity of information exchange and knowledge generation. Research carried out at elementary particle accelerators annually produces hundreds of terabytes of new experimental results (CERN Document Server). World nuclear data centers accumulate and systematize information on thousands of nuclear reactions and nuclear constants (Centre for Photonuclear Experiments Data). The IAEA (IAEA Nuclear Knowledge Management) and respective national agencies (Rosatom State Corporation. Knowledge Management System) create and maintain databases and knowledge bases on nuclear technology and radiation safety.
The practical contribution of the authors of this paper to the development of knowledge bases consists in the creation of working prototypes, and then of scalable semantic web portals, which are deployed on cloud platforms and are intended for use in the educational activities of universities.
The relevance of the first project is explained by the fact that it is aimed at the creation and computer-aided filling of semantic repositories (knowledge bases) on nuclear physics and nuclear power engineering. These are areas in which Russia is able to achieve competitive advantages and world leadership. As of 2022, the educational web portals of universities, nuclear data centers, and the nuclear knowledge management systems of the IAEA and Rosatom State Corporation do not make sufficient use of the capabilities of the semantic web and machine learning methods.
This study is aimed at searching for and testing optimal classification and semantic annotation algorithms applied to the textual network content for computer-aided filling and updating of nuclear knowledge graphs, both in Russian and English. The corresponding optimization problem is formulated and solved below in the section on the results of computational experiments. The proposed algorithms will provide a methodological and technological basis for continuously filling and updating problem-oriented knowledge bases as artificial intelligence systems, as well as prerequisites for the development of semantic technologies for acquiring new knowledge on the Internet without direct human participation.
From a practical standpoint, the software implementation of effective classification and semantic annotation algorithms is carried out as part of a scalable semantic web portal hosted on a cloud platform (Fig. 1).
The created online solutions are in the public domain (excluding confidential information) and can be freely replicated. The project is implemented in accordance with the Semantic Web standards (RDF, OWL, SPARQL, etc.) (W3C Semantic Web, W3C RDF Schema 1.1, W3C OWL 2 Web Ontology Language). For this reason, there are no technological restrictions for integrating the created knowledge bases with third-party data repositories as well as metasearch, library, reference or information and question-answer systems.
The scalability of the semantic repositories (knowledge bases) is provided directly by the cloud platform used. The scientific novelty of the approaches used in this project lies in the use of the Pareto optimality principle, which allows for multi-criteria evaluation and ranking of the studied machine learning algorithms in the absence of a priori information about the comparative significance of the criteria.
Text data classification belongs to the machine learning (ML) tasks in the field of natural language processing (NLP). By 2022, at least a dozen machine learning methods had been created that are potentially suitable for solving problems related to text classification and semantic annotation (Geron A (2019)). There are now dozens of software implementations of these methods (Scikit-learn. Machine Learning in Python).
The Naive Bayes classifier is considered to be one of the simplest classification algorithms. Bayes' theorem is invariant with respect to the causes and effects of events: if we know the probability with which a particular cause leads to a certain effect, Bayes' theorem allows us to calculate the probability that this particular cause has led to the observed event. This idea underlies the Bayes classifier, while the principle of maximum likelihood is used to determine the most probable class.
For natural languages, the probability of the next word or phrase appearing in a text depends heavily on the current context. The Bayes classifier ignores this circumstance and represents the document as a set of words whose probabilities are conditionally independent of each other. This approach is often referred to as the bag-of-words model. Despite its strong simplifying assumptions, the Naive Bayes classifier performs well in many real-world problems: it does not require a large amount of training data and, on moderate text corpora, is often not inferior to more sophisticated algorithms.
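For illustration, a minimal bag-of-words Naive Bayes sketch in Python with scikit-learn is shown below; the example documents and class labels are hypothetical placeholders, not the corpora used in this study.

```python
# Minimal sketch of a bag-of-words Naive Bayes text classifier (scikit-learn).
# The documents and class labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "neutron capture cross section measurement",
    "photonuclear reaction yield data evaluation",
    "reactor core thermal hydraulics analysis",
    "VVER fuel assembly burnup calculation",
]
train_labels = ["nuclear data", "nuclear data",
                "reactor engineering", "reactor engineering"]

# CountVectorizer builds the bag-of-words representation (word order is ignored);
# MultinomialNB treats word counts as conditionally independent given the class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["evaluated neutron cross section library"]))
```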
Whereas the Naive Bayes classifier ignores correlations between words, the MaxEnt (maximum entropy) classifier takes these correlations into account. From the logistic regression models consistent with the training data, the one containing the fewest assumptions about the true probability distribution of the text data is selected; in other words, the empirical probability distribution with the maximum information entropy is chosen. This approach is especially productive for text classification problems, where the words in a text are obviously not independent.
The Softmax function, or normalized exponential function, is a generalization of the logistic function to the multivariate case. In multiclass classification problems, the last layer of the neural network is built so that the number of neurons equals the number of target classes; each neuron outputs the probability that the object belongs to its class, and the outputs of all the neurons sum to one.
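For reference, a straightforward NumPy implementation of the Softmax function is sketched below; the input scores are arbitrary illustrative numbers.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential: maps raw class scores to probabilities that sum to one."""
    shifted = scores - np.max(scores)  # shift by the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.66 0.24 0.10]
```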
The MaxEnt classifier usually takes more time to train than the Naive Bayes classifier because of the optimization needed to estimate the model parameters. Once these parameters have been calculated, the method yields very reliable results and is competitive in terms of computing resources and memory consumption.
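A hedged sketch of a MaxEnt-style classifier follows; scikit-learn's multinomial LogisticRegression is used here as a common stand-in for the maximum entropy classifier, and the training texts and labels are hypothetical.

```python
# Sketch of a MaxEnt-style (multinomial logistic regression / Softmax) text classifier.
# Texts and class labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "neutron flux spectrum unfolding", "gamma spectroscopy detector calibration",
    "steam generator tube maintenance", "reactor coolant pump design",
    "radiation protection dose limits", "emergency planning zone assessment",
]
labels = ["physics", "physics", "engineering", "engineering", "safety", "safety"]

maxent = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000),  # probabilities over the three classes
)
maxent.fit(texts, labels)
print(maxent.predict(["detector energy calibration procedure"]))
```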
Support vector machines (SVM) are a family of linear binary classification methods with a simple and clear interpretation: the task is to find, in a multidimensional space, a surface (a hyperplane in the simplest case) that separates the objects into two classes with the largest margin. The SVM classifier is equivalent to a two-layer neural network in which the number of neurons in the hidden layer equals the number of support vectors.
Stochastic gradient descent (SGD) is an iterative method for optimizing an objective function with suitable smoothness properties, widely used in training deep learning models. Here, the gradient of the function being optimized is calculated not as the sum of the gradients over every sample element, but as the gradient over one randomly selected subset of elements. The slower convergence of the algorithm is compensated for by the high speed of iterations on large data sets.
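A minimal sketch of a linear SVM trained with stochastic gradient descent is given below; scikit-learn's SGDClassifier with hinge loss is one standard implementation, and the corpus is again a hypothetical placeholder.

```python
# Sketch of a linear SVM trained by stochastic gradient descent
# (SGDClassifier with hinge loss). Texts and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "fission product yield compilation", "evaluated nuclear data file format",
    "containment pressure transient analysis", "turbine hall equipment layout",
]
labels = ["nuclear data", "nuclear data", "reactor engineering", "reactor engineering"]

svm = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(loss="hinge", max_iter=1000, random_state=0),  # hinge loss = linear SVM
)
svm.fit(texts, labels)
print(svm.predict(["decay heat data library"]))
```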
The following three proven binary classification metrics (Classification Metrics) are used here as quality functionals for the machine learning algorithms: Precision, Recall, and the F1-score.
The first two metrics do not depend on the filling of the classes with objects and are therefore applicable with unbalanced samples. The Precision metric characterizes the algorithm's ability to distinguish between classes, while the Recall metric shows the algorithm's ability to detect a particular class at all. The third metric, the F1-score, is most informative when the values of the first two metrics differ significantly from each other. To assess the quality of the multiclass classification algorithms, so-called macro-averages are used, in which the metric values are averaged over all the classes regardless of the number of objects in those classes.
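The snippet below illustrates, under assumed example predictions, how the macro-averaged metrics can be computed with scikit-learn; per class, Precision = TP/(TP+FP), Recall = TP/(TP+FN), and F1 = 2PR/(P+R).

```python
# Macro-averaged Precision, Recall and F1-score for a hypothetical prediction.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["data", "data", "data", "eng", "eng", "safety"]
y_pred = ["data", "data", "eng",  "eng", "safety", "safety"]

p = precision_score(y_true, y_pred, average="macro", zero_division=0)
r = recall_score(y_true, y_pred, average="macro", zero_division=0)
f = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"Precision={p:.2f}  Recall={r:.2f}  F1={f:.2f}")
```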
Cross-validation is used to improve the reliability of the classification algorithm test results. The initial training set is randomly partitioned into N samples of approximately the same length. Each of the N samples is in turn declared the control sample, while the remaining N – 1 samples are combined into a training sample. The algorithm is tuned on the training sample and then classifies the objects of the control sample. The described procedure is repeated N times, with N varying from 3 to 10.
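A sketch of N-fold cross-validation (here N = 3) using scikit-learn's cross_val_score is shown below; the tiny corpus is a hypothetical stand-in for the knowledge-graph text corpora.

```python
# N-fold cross-validation (N = 3) of a text classification pipeline.
# The tiny corpus below is a hypothetical stand-in for the real text corpora.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "neutron cross section evaluation", "nuclear reaction data library",
    "fission yield measurement report",
    "reactor coolant pump inspection", "steam generator tube plugging",
    "containment spray system design",
]
labels = ["nuclear data"] * 3 + ["reactor engineering"] * 3

model = make_pipeline(TfidfVectorizer(), SGDClassifier(loss="hinge", random_state=0))

# Each of the 3 folds serves once as the control sample; the rest form the training sample.
scores = cross_val_score(model, texts, labels, cv=3, scoring="f1_macro")
print(scores, scores.mean())
```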
To identify the most effective methods for classifying text data for the purpose of computer-aided filling and updating of nuclear knowledge graphs, a series of tests was carried out using corpora of specialized texts on nuclear physics and nuclear power engineering. In total, seven nuclear knowledge graphs (Semantic Educational Portal. Nuclear Knowledge Graphs. Intelligent Search Agents) were used (see Table 1).
Table 1. Metrics for the three text data classification methods calculated using seven nuclear knowledge graphs (P = Precision, R = Recall, F = F1-score; all values in %)
| Nuclear knowledge graphs as training and control samples | Naive Bayes: P | Naive Bayes: R | Naive Bayes: F | MaxEnt (Softmax): P | MaxEnt (Softmax): R | MaxEnt (Softmax): F | SVM with SGD: P | SVM with SGD: R | SVM with SGD: F |
|---|---|---|---|---|---|---|---|---|---|
| World nuclear data centers | 55 | 33 | 42 | 46 | 51 | 48 | 87 | 83 | 85 |
| Events and publications (CERN) | 94 | 72 | 82 | 72 | 71 | 71 | 42 | 43 | 43 |
| IAEA databases and services | 97 | 46 | 62 | 46 | 51 | 48 | 56 | 57 | 56 |
| Nuclear physics at MSU, MEPhI | 96 | 57 | 71 | 78 | 59 | 67 | 87 | 82 | 84 |
| Russian nuclear research centers | 75 | 13 | 21 | 82 | 68 | 74 | 95 | 94 | 95 |
| Magazines in nuclear physics | 86 | 57 | 69 | 83 | 100 | 91 | 17 | 25 | 20 |
| Combined nuclear knowledge graph | 99 | 25 | 39 | 63 | 37 | 46 | 88 | 85 | 86 |
The results of the calculations from Table 1 were analyzed using the Pareto optimality principle. Let W be the set of compared elements (here, the classification methods), each evaluated by m numerical criteria f_1, …, f_m (here, the values of the quality metrics). An element y is said to dominate an element x in the Pareto sense, written y P x, if

$$y \, P \, x \;\Longleftrightarrow\; \forall i \in \{1, \dots, m\}: f_i(y) \ge f_i(x) \ \wedge \ \exists j: f_j(y) > f_j(x) \qquad (1)$$
The set of P-optimal (non-dominated) elements on W is the Pareto set $W_P$:

$$W_P = \{\, x \in W : \nexists\, y \in W \ \text{such that} \ y \, P \, x \,\} \qquad (2)$$
The Pareto relation provides a universal mathematical model for multicriteria, context-independent choice in a Euclidean space. If we denote by d(y, x) the number of the criteria by which the element y is superior to the element x, then the value

$$D(x, W) = \max_{y \in W} d(y, x) \qquad (3)$$
is called the dominance index of the element x on the set W. Roughly speaking, the dominance index is the largest number of criteria by which the element x is outperformed by some other element of the set W. We shall define the function $C_D(W)$ for choosing the best elements as follows:

$$C_D(W) = \{\, x \in W : D(x, W) = D_W \,\}, \qquad D_W = \min_{y \in W} D(y, W) \qquad (4)$$
The value $D_W$ is the dominance index of the entire set W. The elements with the minimum value of the dominance index form the so-called Pareto set. The Pareto set includes the elements that are best in terms of the totality of all the criteria taken into account, without any a priori assumptions about the comparative significance of these criteria. In real choice situations, the Pareto set often contains more than one element.
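A short sketch of the dominance index and the choice function $C_D(W)$ is given below; applying it to the F1-scores from Table 1 reproduces the first column of Table 2.

```python
# Dominance index D(x, W) and choice function C_D(W) for multicriteria ranking.
def dominance_index(criteria, x):
    """D(x, W) = max over y in W of d(y, x), where d(y, x) is the number of
    criteria by which element y is strictly better than element x."""
    return max(
        sum(vy > vx for vy, vx in zip(y_row, criteria[x]))
        for y_row in criteria
    )

def pareto_choice(criteria):
    """C_D(W): the elements whose dominance index equals the minimum D_W over W."""
    indices = [dominance_index(criteria, x) for x in range(len(criteria))]
    d_w = min(indices)
    return [x for x, d in enumerate(indices) if d == d_w], indices

# F1-scores (%) of the three classifiers on the seven knowledge graphs (Table 1).
f1_scores = [
    [42, 82, 62, 71, 21, 69, 39],  # Naive Bayes Classifier
    [48, 71, 48, 67, 74, 91, 46],  # MaxEnt Classifier (Softmax)
    [85, 43, 56, 84, 95, 20, 86],  # SVM Classifier with SGD
]
best, indices = pareto_choice(f1_scores)
print(indices)  # dominance indices by F1-score: [4, 5, 3], as in Table 2
print(best)     # [2] -> the SVM classifier with SGD forms the choice set
```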
Returning to the original task of finding the most effective method for classifying text data, we shall turn to the data in Table 1, treating each classification method as an element of the set W and the values of the quality metrics as the criteria. The resulting dominance indices are given in Table 2.
Table 2. Dominance indices for the three text data classification methods calculated using seven nuclear knowledge graphs
| Text data classification method | Dominance index by F1-score | Dominance index by Precision and Recall | Dominance index by F1-score, Precision and Recall |
|---|---|---|---|
| Naive Bayes Classifier | 4 | 7 | 11 |
| MaxEnt Classifier (Softmax) | 5 | 10 | 15 |
| SVM Classifier with SGD | 3 | 7 | 10 |
As can be seen from the data in Table 2, the SVM classifier with SGD has the smallest dominance index by the combined set of criteria (F1-score, Precision and Recall) and therefore belongs to the Pareto set; the Naive Bayes classifier ranks second and the MaxEnt classifier third.
The software solutions implemented in the project are based on cloud computing using DBaaS and PaaS service models to ensure the scalability of data warehouses and network services. The server scripts of the working software prototype run on the Jelastic cloud platform in a Java and Python runtime environment.
Research groups from Stanford University (Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D (2014)), the Massachusetts Institute of Technology, the University of Bari, the University of Leipzig, and the University of Manchester are focusing on the development of the Semantic Web and the related issues of machine learning and natural language processing. The global IT giants are actively developing knowledge representation models and machine learning technologies, including IBM Watson Studio, Google AI and Machine Learning, Amazon Comprehend NLP, AWS Machine Learning, Yandex DataSphere (Jupyter Notebook), etc. Software tools for research in the field of artificial intelligence and natural language processing are provided by Matlab (Machine Learning with MATLAB & Simulink), Stanford NLP (Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D (2014)), Scikit-learn (Scikit-learn. Machine Learning in Python), etc. In Russia, specialized research is carried out at the Competence Center of the National Technological Initiative at the Moscow Institute of Physics and Technology (MIPT), the Institute of Precision Mechanics and Optics (ITMO), the Faculty of Computational Mathematics and Cybernetics of Moscow State University (MSU), the Ivannikov Institute for System Programming (ISP RAS) (Stupnikov S, Kalinichenko A (2019)), and the Russian divisions of Huawei.
In this study, seven corpora of specialized texts on nuclear physics and nuclear power engineering were used to demonstrate the effectiveness of relatively simple, intuitive machine learning methods for continuously filling and updating nuclear knowledge bases without direct human participation. The SVM and Naive Bayes classifiers ensure the competence of the semantic knowledge bases as artificial intelligence systems. Some results obtained by the authors for other classifiers, such as the k-nearest neighbors (kNN) algorithm and terminological decision trees built from the results of text parsing, remain outside the scope of this article.
The study was supported by the Russian Science Foundation grant No. 22-21-00182, https://rscf.ru/project/22-21-00182/.