Speeding up the ODETTA code for solving particle transport problems*

Mathematical simulation of fast neutron reactors requires high-precision calculations of protection problems based on unstructured meshes. The paper considers and analyzes a parallel version of the ODETTA code (Belousov et al. 2019) with the use of the MPI (Message Passing Interface) library technology (Knyazeva et al. 2006). The code is designed for numerical simulation of neutronic processes in shielding compositions of fast neutron lead cooled reactor plants in normal operating modes, and can be used to calculate the radiation conditions of using structural components and equipment of nuclear power facilities which are assumed to be the sources of and/or exposed to ionizing radiation during their safety justification. The operation of the generated code is compared against the previous version. The MPIbased development of the ODETTA code’s algorithmic part is described. Peculiarities and specific features of the code parallelization are presented, the code modification is given, and respective algorithms are considered. The structure of the ODETTA code based on the MPI is described in brief. The results of using the ODETTA code’s serial and parallel versions in OS Linux (Kostromin 2012) for NRNU MEPhI’s HPC cluster are provided (Savchenko et al. 2020). A comparative analysis is presented for two code implementation options in terms of speed and accuracy of results when using two different clusters and different numbers of nodes for these. Peculiarities of cluster-based calculations are noted.


Introduction
The purpose of the study is to use parallel computations for solving problems of neutron and gamma quanta transport in a multi-group SnPm approximation by the finite element method based on unstructured tetrahedral meshes, including mesh data handling. The study is conducted as part of an investigation by the Nuclear Safety Institute of the Russian Academy of Sciences (IBRAE) at MEPhI's computation center. Extremely complex problems are involved in mathematical simulation of fast neutron reactors. It is exactly the solution of such problems that requires high-precision mass calculations of protection problems based on unstructured meshes. An analog is the ATTILA code (McGhee et al. 2007) which uses the discontinuous finite element method (DFEM) to solve transport equations based on spatial meshes consisting of tetrahedral and hexagonal finite elements with nonplanar faces.

Peculiarities of the ODETTA code
Mathematical simulation of fast neutron reactors requires highly precise calculations of protection problems using unstructured meshes. The ODETTA code in the neutronic calculation implementation module simulates the solution of the transport equation by the finite element method (FEM) based on unstructured tetrahedral meshes in a discrete ordinate method approximation (Sychugova and Seleznev 2014).
With the use of discrete ordinates, the transport equation is written as where the group index number g is omitted, and the index m (m = 1, 2, …, M) is matched by the discrete direction Ω m = (m m , h m , x m ) out of a quadrature set (with equal weights ESn or Chebyshev-Legendre's CLn), the unit sphere surface value being measured in 4π, i.e. Σω m = 1. FEM formulas are developed by approximating the transport equation according to Galerkin using the weighted residuals method. The equation includes the full macroscopic interaction cross-section Σ T and the Q m function, the right-hand part of the transport equation depending on the solution of Ψ m . The right-hand part includes the source of intergroup and intragroup transitions, the source of fissions, and the given internal source. Zero values of the angular flow are given on the boundary G of the considered 3D region for directions inside of the region or the reflection condition. This results in a boundary problem for solving the equation of particle transport in a convex 3D region.
The anisotropic scattering is represented by a series expansion in associated Legendre functions up to the fifth order. The spatial rebalance method is used to accelerate the convergence of internal iterations. The code has been developed in Fortran (standard Fortran 90 and later) using OpenMP parallelizing. The ODETTA code operation steps are as follows: • preparation of mesh data; • preparation of group macroscopic cross-sections; • formation of the radiation source; • neutronic calculation; • calculation of functionals.
The SALOME code is used to handle CAD models and unstructured meshes (SALOME 2020). The CONSYST code with the BNAB-RF library is used to prepare group constants (Manturov 2017). The SIEGFRIED preprocessor developed at IBRAE specifically for the ODETTA code, including such modules as Readconst (the interface between the NISN multigroup constant storage format and the ODETTA code internal format), Source (formation of the radiation source in the original geometry and interpolation in the unstructured mesh nodes), MKE3D (neutronic calculation module), and RECONST (post-processing, calculation of functionals), is used to ensure that the user can handle the ODETTA code in the most comfortable way.
The solution obtained after the ODETTA simulation can be visualized by a 3D graphic plotter, e.g., the VisIt program (About VisIt 2020)).
The spatial region under consideration is decomposed into a finite number of elements with the fixed number of endpoints referred to as nodes. The accuracy of the results depends greatly on the decomposition quality. Tetrahedrons are used as elements for the calculation. They have common nodal points and, taken together, approximate the region shape. Normally, decomposition is started from the region boundaries so that to obtain as accurate approximation of the region shape as possible (Fig. 1).
The considered region and the tetrahedrons can be decomposed by directions. The unit sphere of directions (an angular variable mesh) is decomposed into eight octants (Fig. 2a). Since calculation in each of these can be done independently, the use of parallel calculations will speed up noticeably the program execution. This forms the basis for using the MPI in the neutronic calculation module of the ODETTA code. Decomposition by directions ( Fig. 2b) is used for the calculation in each octant -this is illustrated with an example of the first octant. Such space fragmentation is required to speed up the ODETTA neutronic calculation using OpenMP (OpenMP 2020). Decomposing a unit sphere of directions into octants and the independence of computations in each of these enable calculations in parallel and independently for each direction of the given space using the MPI. Thus, parallelizing through the MPI was used between octants, and that through OpenMP was used inside of octants.

Developing the algorithmic part of the MPI application for the ODETTA code
The growing interest in parallel programming these days is explained by the transition to mass production of multicore architectures (Nemnyugin 2007). The МРI technology is a set of utilities and library functions (for such languages as С/С++ (Pavlovskaya 2003) and Fortran (Ryzhikov 2000, Barteniev 2000, Chapman Stephen 2018, Shterenlikht 2018, Building Programs with GNU 2020)) which make it possible to create and start applications operating in parallel computation units of a different nature.
The ODETTA code uses a Fortran source code like most of the codes currently used in nuclear industry. Many of them cannot use more than one process, so they require conversion to parallel computations to speed up their operation. There are two ways to do this.
The first one is to use a compiler in which case the developer needs to indicate the code region where exactly parallelizing should be used inside of the program text. Such approach is used in the OpenMP system but is possible only in shared memory systems. This parallelizing option was introduced earlier in the ODETTA code.
The second approach suggests that the developer as such specifies the distribution of and communication among processes in the program code. The second ap-proach was used in the new version of the ODETTA code. This section describes the alterations made to the ODETTA source code for its new implementation with the MPI technology.
Decomposing the space into eight octants, the calculation in which can be done independently, makes it possible to use several computation nodes to optimize the code runtime. The ODETTA calculation algorithm is executed in parallel, and, theoretically, the more nodes are used per cluster, the shorter is the problem simulation time. The program's source code used one computation node, and the octant calculation was done in series.
The resultant program was tested using two and four nodes of MEPhI's cluster, though the eight-node cluster was assumed to be the best option. The use of the MPI cuts the calculation time considerably since all computations are done in parallel. Different numbers of nodes and their respective distributions of processes are shown in Fig. 3.  Computations were supposed to be done using eight octants, so a maximum of eight processes, distributed by the nodes requested on the cluster, were used for the code implementation. The parallel computation capability is expected to provide a distinct advantage in terms of the calculation speed.
Directives controlling the compiler operation were engaged in the ODETTA code implementation using the MPI, this making it possible to extend the code capabilities.

Design and implementation
The final ODETTA algorithm using the MPI is represented by a block diagram with standard symbols commonly used in structured programming, including oval (the beginning or end of the considered program unit), rectangle (operations block), hexagon (the algorithm cycle containing the "body" of repetitive operations, one input, and one output (Fig. 4).
The MPI use blocks in the implemented algorithm are highlighted with a grey background and an White Italic text. Changes pertain largely to the use of data in the interaction between parallel processes.

Results
A radiation safety problem, which modeled a fuel assembly of the MOX-1000 fast neutron reactor benchmark model with mixed oxide uranium-plutonium fuel and sodium coolant (NEA/NSC/R(2015)9 2016) was used as the test problem for the ODETTA code. The assembly was withdrawn after staying in the operating reactor and put into a cylindrical steel container (shell) which was then filled with lead. The common calculation simulation elements are shown in Table 1.
It was assumed for simplifying the model calculation that the cylindrical container shell was made of HT-9 steel. SnPm approximation with the parameters n = 12 and m = 3 was used for the problem simulation which means that there is a total of 288 directions for the unit sphere or 36 directions per one octant (CL12 Chebyshev-Legendre quadrature). The number of energy groups for neutrons is 299 and of those for gamma quanta is 127. The number of tetrahedrons in the decomposition mesh is 132883.
The software for the ODETTA code was implemented at MEPhI's high-performance computation center the clusters of which are Linux-based. The remote computer is controlled through the command line and using the SSH protocol through the Putty program (Putty Documentation 2020). Basov and Cherenkov clusters (Savchenko et al. 2020) with different numbers of nodes were used to calculate the radiation safety test problem using the ODETTA code. The results are presented in Tables 2 and 3. The speed up S p = t 1 /t p and the efficiency E p = S p /p for the parallel algorithms (t 1 is the algorithm execution time for one process, t p is the algorithm execution time for a system of p processes) are determined depending on the number of processes.
The radiation safety test problem was calculated using computational nodes numbering one to four, the use of one node being equivalent to the problem simulation without the use of the MPI, and the number of OpenMP processes for each process was equal to 32, that is, the maximum one for one node. Due to the cluster overload,    there was no eight-node computation, which had the maximum efficiency of problem solving in terms of time.
Since the clusters use different systems of processors, it can be noted that the problem runtimes are different with the best one being in the event of the Basov cluster, as the program runtime with four nodes is longer than with two due to the specific data transmission between the cluster nodes. The ODETTA algorithm with the MPI requires exchange of large data arrays between the cluster nodes, so data transmission leads to an additional calculation delay. An additional speed up was however expected with eight cluster nodes thanks to processors which would minimize the costs of the node data exchange thanks to their speed. A comparative analysis for the accuracy of computing full fluxes of particles is presented in Tables 4 and 5 where the calculation with one node, that is, without the use of the MPI, was assumed to be the reference result. The tables also present the maximum estimated relative error d max and the root-mean-square deviation s. The following relations were used to compute the errors where n is the number of the solution points; x i is the calculated solution; x * i is the reference solution; and Δx i = x i -x * i . The root-mean-square deviation from the reference solution is The relative error during the calculations with two and four nodes do not exceed 0.00155% when the Basov cluster is used and 0.00204% when the Cherenkov cluster is used, this indicating a minor deviation from the reference result.
The best option for obtaining the ODETTA cluster-based computation results is the use of one node with two MPI processors and 16 OpenMP processors installed for each process. This is possibly explained by the peculiarities of the cluster node structure which allow the program parallelizing algorithms to achieve the maximum speed up in terms of the calculation rate. The advantages of this computation option are shown in Tables 6 and 7.
The relative error during the calculations with one node using two MPI processors does not exceed 0.00134% when the Basov cluster is used and 0.00196% when the Cherenkov cluster is used, this indicating high precision of the obtained results.

Conclusion
The MPI technology was used to speed up the ODETTA code for solving the neutron and gamma quanta transport problem in a multigroup SnPm approximation by the finite element method based on unstructured tetrahedron meshes. The results of the source code modification were tested using a radiation safety problem as an example.
The program modifications were tested using a radiation safety problem (Bereznev et al. 2017). To check the accuracy of the obtained results, these were compared with the results of the ODETTA serial version which were assumed to be standard. The deviation from the standard was about 0.002% which indicates that the MPI parallelizing algorithm has been implemented correctly.
When analyzing the starts of computations using different numbers of the MPI nodes and 32 maximally possible OpenMP processes, a conclusion can be made that the most efficient distribution was that by two Cherenkov cluster nodes where the speed up was about 1.52. The calculation was slower with four computation nodes used than with two, this being explained by the peculiarities of the algorithm and the cluster architecture, that is, using the Hyper-threading technology (Savchenko et al. 2020) for the Basov and the Cherenkov, which makes it possible to use two computation threads in one physical processor core. The increase in the efficiency varies among applications. The execution of some programs can slow down as was the case in this study. A possible reason is connected with the "system of repetitions" of the Xeon processors which occupies the required computational resources, leading to the idle state of other computation threads.   As the result, the use of one node with two MPI processors and 16 OpenMP processors was the best option for the ODETTA cluster operation. The maximum speed up achieved with the Cherenkov cluster was 1.911 (nearly double). The maximum time gain amounted so to about 48%. In turn, the minimum time gain was about 30% as compared even with the best one-processor calculation.
It should be additionally noted that the ODETTA algorithm with the use of the MPI requires large arrays of data to be exchanged between the cluster nodes, so the transmission of data led to a major calculation delay. The use of several processors has a negative impact on speed due to an extra interprocessor interaction, but the calculation time remains much smaller as compared with the serial version.