Corresponding author: Victor P. Telnov

Academic editor: Georgy Tikhomirov

The paper deals with issues of designing and creating knowledge bases in the field of nuclear science and technology. The authors present the results of searching for and testing optimal classification and semantic annotation algorithms applied to the textual network content for the convenience of computer-aided filling and updating of scalable semantic repositories (knowledge bases) in the field of nuclear physics and nuclear power engineering and, in the future, for other subject areas, both in Russian and English. The proposed algorithms will provide a methodological and technological basis for creating problem-oriented knowledge bases as artificial intelligence systems, as well as prerequisites for the development of semantic technologies for acquiring new knowledge on the Internet without direct human participation. Testing of the studied machine learning algorithms is carried out by the cross-validation method using corpora of specialized texts. The novelty of the presented study lies in the application of the Pareto optimality principle for multi-criteria evaluation and ranking of the studied algorithms in the absence of a priori information about the comparative significance of the criteria. The project is implemented in accordance with the Semantic Web standards (RDF, OWL, SPARQL, etc.). There are no technological restrictions for integrating the created knowledge bases with third-party data repositories as well as metasearch, library, reference or information and question-answer systems. The proposed software solutions are based on cloud computing using DBaaS and PaaS service models to ensure the scalability of data warehouses and network services. The created software is in the public domain and can be freely replicated.

Nuclear science and technology are among the areas with a high intensity of information exchange and knowledge generation. Research carried out at elementary particle accelerators annually produces hundreds of terabytes of new experimental results (CERN Document Server). World nuclear data centers accumulate and systematize information on thousands of nuclear reactions and nuclear constants (Centre for Photonuclear Experiments Data). The IAEA (IAEA Nuclear Knowledge Management) and respective national agencies (Rosatom State Corporation. Knowledge Management System) create and maintain databases and knowledge bases on nuclear technology and radiation safety.

The practical contribution of the authors of this paper to the development of knowledge bases consists in the creation of working prototypes, and then scalable semantic web portals, which are deployed on cloud platforms and are intended for use in the educational activities of universities.

The relevance of the first project is explained by the fact that it is aimed at the creation and computer-aided filling of semantic repositories (knowledge bases) on nuclear physics and nuclear power engineering. These are areas in which Russia is able to achieve competitive advantages and world leadership. As of 2022, the educational web portals of universities, nuclear data centers, and the nuclear knowledge management systems of the IAEA and Rosatom State Corporation do not sufficiently use the capabilities of the semantic web and machine learning methods.

This study is aimed at searching for and testing optimal classification and semantic annotation algorithms applied to the textual network content for computer-aided filling and updating of nuclear knowledge graphs, both in Russian and English. The corresponding optimization problem is formulated and solved below in the section on the results of computational experiments. The proposed algorithms will provide a methodological and technological basis for continuously filling and updating problem-oriented knowledge bases as artificial intelligence systems, as well as prerequisites for the development of semantic technologies for acquiring new knowledge on the Internet without direct human participation.

From a practical standpoint, the software implementation of effective classification and semantic annotation algorithms is carried out as part of a scalable semantic web portal hosted on a cloud platform. The figure below shows the user interface for setting up this process.

Setting the parameters of the semantic annotation/classification process: 1) choice of semantic annotation (classification) technology; 2) network addresses (URL) of documents to be annotated (classified).

The created online solutions are in the public domain (excluding confidential information) and can be freely replicated. The project is implemented in accordance with the Semantic Web standards (RDF, OWL, SPARQL, etc.) (W3C Semantic Web, W3C RDF Schema 1.1, W3C OWL 2 Web Ontology Language). For this reason, there are no technological restrictions for integrating the created knowledge bases with third-party data repositories as well as metasearch, library, reference or information and question-answer systems.

The scalability of semantic repositories (knowledge bases) is ensured directly by the cloud platform used. The scientific novelty of the approaches used in this project is defined by the use of the Pareto optimality principle, which allows for multi-criteria evaluation and ranking of the studied machine learning algorithms in the absence of a priori information about the comparative significance of the criteria.

Classifying text data refers to the tasks of machine learning.

The Naive Bayes classifier is considered to be one of the simplest classification algorithms. Bayes’ theorem is invariant with respect to the causes and effects of events: if we know the probability with which a particular cause leads to a certain effect, Bayes’ theorem allows us to calculate the probability that this particular cause has led to the observed event. This idea underlies the Bayes classifier, while the principle of maximum likelihood is used to determine the most probable class.

For natural languages, the probability of the next word or phrase appearing in the text is highly dependent on the current context. The Bayes classifier ignores this circumstance and represents the document as a set of words whose probabilities are conditionally independent of each other. This approach is sometimes referred to as the bag-of-words model. Despite its strong simplifying assumptions, the Naive Bayes classifier performs well in many real-world problems. It does not require a large amount of training data and, on moderate text corpora, is often not inferior to more sophisticated algorithms.
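As an illustration of the bag-of-words idea (a minimal sketch, not the portal's actual code), a multinomial Naive Bayes classifier with add-one smoothing fits in a few dozen lines; the class labels and toy documents used below are invented:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over the bag-of-words model,
    with Laplace (add-one) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # class -> number of documents
        self.word_counts = defaultdict(Counter)   # class -> word frequencies
        self.vocabulary = set()
        for doc, label in zip(documents, labels):
            for word in doc.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        self.total_docs = len(documents)

    def predict(self, document):
        best_label, best_score = None, float("-inf")
        for label in self.class_counts:
            # log prior plus log likelihoods, assuming words are
            # conditionally independent given the class
            score = math.log(self.class_counts[label] / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocabulary)
            for word in document.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The class with the highest posterior log-probability is returned, which is exactly the maximum-likelihood selection described above.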

Whereas the Naive Bayes classifier ignores correlations between words, the MaxEnt stochastic classifier takes these correlations into account. From the logistic regression models consistent with the training data, the one is selected that contains the fewest assumptions about the true probability distribution of the text data; in other words, the empirical probability distribution with the maximum information entropy is chosen. This approach is especially productive for text classification problems, where the words in a text are obviously not independent.

The Softmax function, or normalized exponential function, is a generalization of the logistic function to the multivariate case. In multiclass classification problems, the last layer of the neural network is built so that the number of its neurons equals the number of target classes. Each neuron outputs the probability that the object belongs to the corresponding class, and the outputs of all the neurons sum to unity.
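A minimal numerical sketch of the Softmax function (the shift by the maximum is a standard numerical-stability device and does not change the result):

```python
import math

def softmax(logits):
    """Normalized exponential: maps raw class scores to a
    probability distribution over classes."""
    m = max(logits)                              # stability shift
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `softmax([2.0, 1.0, 0.1])` yields three probabilities that sum to one, with the largest probability assigned to the largest score.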

The Maxent classifier usually takes more time to train compared to the Naive Bayes classifier due to the optimization that needs to be done to estimate the model parameters. After calculating these parameters, the method yields very reliable results and is competitive in terms of computing resource and memory consumption.

Support vector machines (SVM) constitute the third group of classification algorithms studied in this work. An SVM maps the training samples to points in a feature space and seeks a separating hyperplane that maximizes the width of the gap between the classes; new objects are then classified according to the side of the hyperplane on which they fall.

Stochastic gradient descent (SGD) is an iterative optimization method in which the model parameters are updated from one randomly chosen training sample at a time rather than from the entire training set. This makes each update step cheap and the method well suited to training linear classifiers, such as the SVM, on large text corpora.
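A minimal sketch of SGD applied to the hinge loss of a linear SVM (illustrative only; the learning rate, regularization weight, and epoch count below are arbitrary choices, not the settings used in the study):

```python
import random

def sgd_linear_svm(samples, labels, epochs=200, lr=0.05, lam=0.01, seed=0):
    """Train a linear classifier with hinge loss (the SVM objective)
    by stochastic gradient descent; labels must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:                 # one randomly chosen sample per step
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # subgradient step on lam*||w||^2 + max(0, 1 - margin)
            if margin < 1:
                w = [wj - lr * (2 * lam * wj - y * xj) for wj, xj in zip(w, x)]
                b += lr * y
            else:
                w = [wj * (1 - 2 * lr * lam) for wj in w]
    return w, b
```

On a linearly separable toy sample, the returned weights and bias separate the two classes after a modest number of epochs.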

The following three proven binary classification metrics (Classification Metrics) are used here as quality functionals for the machine learning algorithms.

Precision (classification accuracy). It is calculated as the proportion of the objects assigned by the algorithm to a positive class that actually belong to that class.

Recall (completeness of classification). It is calculated as the proportion of the objects that really belong to a positive class which the algorithm classifies correctly.

F1-score. This aggregated metric is calculated as the harmonic mean of Precision and Recall.

The first two metrics do not depend on the class sizes and therefore remain applicable to unbalanced samples. The Precision metric characterizes the algorithm’s ability to distinguish among classes, and the Recall metric shows the algorithm’s ability to detect a particular class at all. The third metric, F1-score, is the most informative in cases where the values of the first two metrics differ significantly from each other. To assess the quality of multiclass classification algorithms, so-called macro-averages are used: the metric values are averaged over all the classes, regardless of the number of objects in these classes.
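For illustration, the three metrics and their macro-averages can be computed as follows (a sketch; the label values in the usage example are hypothetical):

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Binary Precision, Recall and F1 for one class treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_average(true_labels, predicted_labels):
    """Macro-averaged metrics: plain average over classes,
    ignoring how many objects each class contains."""
    classes = sorted(set(true_labels))
    triples = [precision_recall_f1(true_labels, predicted_labels, c)
               for c in classes]
    n = len(classes)
    return tuple(sum(t[i] for t in triples) / n for i in range(3))
```

For `true = [1, 1, 0, 0, 1]` and `predicted = [1, 0, 0, 0, 1]` with class 1 as positive, Precision is 1.0, Recall is 2/3, and F1 is 0.8.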

Cross-validation is used to improve the reliability of the classification algorithm testing results. The initial training set is randomly divided into several non-overlapping parts (folds); each fold in turn serves as the control sample while the remaining folds are used for training, and the metric values are averaged over all the folds.
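A minimal k-fold splitting routine (an illustrative sketch; the fold count `k` and the random seed are arbitrary parameters, not values from the study):

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Shuffle the sample and yield (train, test) pairs for k-fold
    cross-validation; each item appears in exactly one test fold."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = [items[j] for j in folds[i]]
        train = [items[j] for f in folds if f is not folds[i] for j in f]
        yield train, test
```

Each object occurs in exactly one control sample, so the averaged metric uses every labeled document once for testing.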

To identify the most effective methods for classifying text data for the purpose of computer-aided filling and updating of nuclear knowledge graphs, a series of tests was carried out using corpora of specialized texts on nuclear physics and nuclear power engineering. In total, seven nuclear knowledge graphs (Semantic Educational Portal. Nuclear Knowledge Graphs. Intelligent Search Agents) were used (see Table 1).

Table 1. Metrics (in %) for three text data classification methods calculated using seven nuclear knowledge graphs

| Nuclear knowledge graphs as training and control samples | Naive Bayes: P | Naive Bayes: R | Naive Bayes: F | MaxEnt (Softmax): P | MaxEnt (Softmax): R | MaxEnt (Softmax): F | SVM: P | SVM: R | SVM: F |
|---|---|---|---|---|---|---|---|---|---|
| World nuclear data centers | 55 | 33 | 42 | 46 | 51 | 48 | 87 | 83 | 85 |
| Events and publications (CERN) | 94 | 72 | 82 | 72 | 71 | 71 | 42 | 43 | 43 |
| IAEA databases and services | 97 | 46 | 62 | 46 | 51 | 48 | 56 | 57 | 56 |
| Nuclear physics at … | 96 | 57 | 71 | 78 | 59 | 67 | 87 | 82 | 84 |
| Russian nuclear research centers | 75 | 13 | 21 | 82 | 68 | 74 | 95 | 94 | 95 |
| Journals in nuclear physics | 86 | 57 | 69 | 83 | 100 | 91 | 17 | 25 | 20 |
| Combined nuclear knowledge graph | 99 | 25 | 39 | 63 | 37 | 46 | 88 | 85 | 86 |

Note: P = Precision metric, R = Recall metric, F = F1-score metric.

The results of the calculations from Table 1 are shown graphically in the figure below.

Visual representation of the data from Table 1.

Let the compared alternatives form a set $W$ of points in the Euclidean criteria space $\mathbb{R}^{P}$.

The Pareto relation provides a universal mathematical model for multicriteria, context-independent choice in a Euclidean space. If we denote by $d(y, x)$ the number of the criteria by which the element $y \in W$ is superior to the element $x \in W$, then the value

$$D(x) = \max_{y \in W} d(y, x)$$

is called the dominance index of the element $x$ with respect to the set $W$.

The value $D_{W} = \min_{x \in W} D(x)$ is the dominance index of the entire set $W$. The elements with the minimum value of the dominance index form the so-called Pareto set. The Pareto set includes elements that are best in terms of the totality of all the criteria taken into account, without any a priori assumptions about the comparative significance of these criteria. In conditions of a real choice, the Pareto set often contains more than one element.
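These definitions translate directly into code. In the sketch below (illustrative; the function names are ours, and all criteria are taken as "higher is better"):

```python
def dominance_index(x, points):
    """D(x): the largest number of criteria by which some other
    element of the set is superior to x."""
    return max((sum(1 for yc, xc in zip(y, x) if yc > xc)
                for y in points if y is not x), default=0)

def pareto_set(points):
    """Elements with the minimum dominance index over the whole set."""
    indices = [dominance_index(x, points) for x in points]
    best = min(indices)
    return [x for x, d in zip(points, indices) if d == best]
```

For example, `pareto_set([(3, 1), (2, 2), (1, 3), (1, 1)])` keeps the first three points and drops `(1, 1)`, which is inferior to `(2, 2)` in both criteria.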

Returning to the original task of finding the most effective method for classifying text data, we shall turn to the data in Table 2, where the dominance indices are calculated from the metric values in Table 1.

Table 2. Dominance indices for the three text data classification methods calculated using seven nuclear knowledge graphs

| Text data classification method | Dominance index by F1-score | Dominance index by Precision and Recall | Dominance index by F1-score, Precision and Recall |
|---|---|---|---|
| Naive Bayes Classifier | 4 | 7 | 11 |
| MaxEnt Classifier (Softmax) | 5 | 10 | 15 |
| SVM Classifier | 3 | 7 | 10 |
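As a consistency check, the dominance indices in Table 2 can be recomputed from the metric values in Table 1 by a direct application of the definition D(x) = max over y of d(y, x); the script below is an illustration, not the authors' code:

```python
def dominance_index(x, points):
    """D(x) = max over y of the number of criteria by which y exceeds x
    (all metrics here are 'higher is better')."""
    return max(sum(1 for yc, xc in zip(y, x) if yc > xc)
               for y in points if y is not x)

# Metric values from Table 1, row by row (seven knowledge graphs):
P = {"NB":  [55, 94, 97, 96, 75, 86, 99],
     "ME":  [46, 72, 46, 78, 82, 83, 63],
     "SVM": [87, 42, 56, 87, 95, 17, 88]}
R = {"NB":  [33, 72, 46, 57, 13, 57, 25],
     "ME":  [51, 71, 51, 59, 68, 100, 37],
     "SVM": [83, 43, 57, 82, 94, 25, 85]}
F = {"NB":  [42, 82, 62, 71, 21, 69, 39],
     "ME":  [48, 71, 48, 67, 74, 91, 46],
     "SVM": [85, 43, 56, 84, 95, 20, 86]}

methods = ["NB", "ME", "SVM"]
by_f1  = [tuple(F[m]) for m in methods]                # 7 criteria
by_pr  = [tuple(P[m] + R[m]) for m in methods]         # 14 criteria
by_all = [tuple(F[m] + P[m] + R[m]) for m in methods]  # 21 criteria

for name, pts in [("F1", by_f1), ("P+R", by_pr), ("F1+P+R", by_all)]:
    print(name, {m: dominance_index(x, pts) for m, x in zip(methods, pts)})
# F1     -> NB: 4,  ME: 5,  SVM: 3
# P+R    -> NB: 7,  ME: 10, SVM: 7
# F1+P+R -> NB: 11, ME: 15, SVM: 10
```

The recomputed indices coincide with the columns of Table 2.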

As can be seen from the data in Table 2, the SVM classifier has the smallest dominance index in all three columns (sharing it with the Naive Bayes classifier when only Precision and Recall are considered) and is therefore the preferred method; the MaxEnt classifier is dominated most strongly.

The software solutions implemented in the project are based on cloud computing using DBaaS and PaaS service models to ensure the scalability of data warehouses and network services. The server scripts of the working software prototype run on the Jelastic cloud platform in a Java and Python runtime environment.

The figure below shows the component structure of the implemented solution.

Diagram of the components of a scalable semantic web portal.

Research groups from Stanford University (Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D, 2014), the Massachusetts Institute of Technology, the University of Bari, the University of Leipzig, and the University of Manchester focus on the development of the Semantic Web and the related issues of machine learning and natural language processing. The global IT giants are actively developing knowledge representation models and machine learning technologies, including IBM Watson Studio, Google AI and Machine Learning, and Amazon Comprehend.

In this study, seven corpora of specialized texts on nuclear physics and nuclear power engineering were used to show the effectiveness of relatively simple, intuitive machine learning methods for solving the problem of continuously filling and updating nuclear knowledge bases without direct human participation. The SVMs and the Naive Bayes classifier ensure the competence of the semantized knowledge bases as artificial intelligence systems. Some of the results that the authors obtained while studying other classifiers are reported separately.

The study was supported by the Russian Science Foundation grant No. 22-21-00182.

2nd edn. O’Reilly Media, Inc., Boston.

International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2018), Springer, 17–39.

2nd edn).

Russian text published: Izvestiya vuzov. Yadernaya Energetika (ISSN 0204-3327), 2022, n. 4, pp. 122–133.