## Abstract

Ultrasound computed tomography (USCT) shows great promise in nondestructive evaluation and medical imaging due to its ability to quickly scan and collect data from a region of interest. However, existing approaches are a tradeoff between the accuracy of the prediction and the speed at which the data can be analyzed, and processing the collected data into a meaningful image requires both time and computational resources. We propose to develop convolutional neural networks (CNNs) to accelerate and enhance the inversion results to reveal underlying structures or abnormalities that may be located within the region of interest. For training, the ultrasonic signals were first processed using the full waveform inversion (FWI) technique for only a single iteration; the resulting image and the corresponding true model were used as the input and output, respectively. The proposed machine learning approach is based on implementing two-dimensional CNNs to find an approximate solution to the inverse problem of a partial differential equation-based model reconstruction. To alleviate the time-consuming and computationally intensive data generation process, a high-performance computing-based framework has been developed to generate the training data in parallel. At the inference stage, the acquired signals will be first processed by FWI for a single iteration; then the resulting image will be processed by a pre-trained CNN to instantaneously generate the final output image. The results showed that once trained, the CNNs can quickly generate the predicted wave speed distributions with significantly enhanced speed and accuracy.

## 1 Introduction

Nondestructive testing (NDT) is a critical part of quality control and ensures the safety of products and infrastructure. The challenge of NDT is often in finding a way to examine and analyze internal properties or defects that we are unable to physically see or measure. Medical fields face similar challenges in visualizing the properties of internal bodily tissue. In both cases, ultrasound computed tomography (USCT) can be used to help us visualize and analyze these internal properties. Full waveform inversion (FWI) is a USCT technique based on numerical wave propagation [1] that seeks to find a high-resolution subsurface model [2] of a region of interest. The ability to visualize internal defects or varying material properties without causing damage to the underlying structure has applications in many areas, including NDT [3], structural health monitoring [4], aquifer identification [5], early medical diagnosis [6], and industrial process monitoring [7].

The FWI [8,9] process starts with an initial model predicting the region of interest [10] which estimates the underlying distribution of material properties or potential defects. This model is used in combination with the wave equation to generate a predicted wavefield with ultrasonic or seismic excitation [11]. The wavefield signals are recorded at a series of receiver locations as the measurements or data [12]. These predicted measurements are compared to the observed measurements to calculate the data misfit between them. Based on adjoint theory, the misfit data are used to update the model of the region and this process is repeated until the misfit between the predicted and observed data is minimized to an acceptable level [8,13].

While FWI-based USCT [8,9,14–19] has been shown to have applications for numerous industries and is becoming a more commonly used method to reconstruct material properties or damage distribution, this technique is not without its own challenges. Waveform inversion is a typical non-linear and ill-posed inverse problem and existing physics-driven computational methods for solving waveform inversions suffer from cycle-skipping [12] and local-minima issues [20]. Due to the complexity of solving the related inverse problem in FWI, this is often an intensive task, both in terms of mathematical complexity [21] and computational time [22]. In an effort to reduce the required computational time, the number of iterations of the calculations are often limited, or larger errors are allowed leading to more of an approximation rather than an exact solution, which often compromises the quality of the reconstructed model [23]. Because of these complexities in solving traditional FWI, various methods have been researched to solve the inverse problem of FWI and provide a model of the underlying structure directly from the observed full waveform dataset. Classical solutions to the FWI inverse problem are numerically calculated, and early attempts [24] at solving the inverse problem were iterative and data-driven [25].

Modern advancements in machine learning and image generation have shown great success in solving the inverse problem in various image reconstruction-related areas (e.g., X-ray tomography) [26–28]. Related deep-learning approaches also show promise in alleviating the aforementioned challenges in FWI. In recent years, efforts have been made to incorporate a generative adversarial network (GAN) with seismic FWI [4] and similar data-driven inversion methods such as seismic impedance inversion problems [29]. Zhang and Lin [30] proposed a GAN-based model to estimate a velocity map from raw seismic waveform data. Studies where FWI was applied to elastic and transversely isotropic media using recurrent neural networks (RNNs) also showed significant potential in deep-learning-based FWI research [31,32]. In addition to GAN and RNN, some attempts are being made at incorporating physics information into machine learning loss functions to be used in neural networks. These are termed physics-informed neural networks (PINNs) in ultrasound inversion studies [33,34]. Raissi et al. [33] demonstrated that PINN approximates the solution of a partial differential equation (PDE) by training on control points and constraining the solution’s time and space derivatives to obey the PDE. In RNN-based FWI, on the other hand, the partial derivatives are computed using finite differencing, and automated differentiation is only used on the model parameters. However, the RNN method requires storage for the entire multi-component wavefield [31].

Most of the above deep-learning-based improvements for FWI are still in geophysics. For FWI-based USCT, deep-learning-based enhancement is still in its infancy. Robins et al., in their deep-learning-based USCT study [35], implemented convolutional neural network (CNN) as a preprocessing step to the input data of FWI and were able to recover high-resolution velocity models. However, to achieve such effective results they had to go through 208 iterations of FWI, making the process still computationally expensive. Feigin et al. [36] analyzed deep-learning-based FWI with a single iteration using high frequencies (5–40 MHz). Nevertheless, high frequencies lead to non-negligible attenuation [37], which creates another model parameter to deal with. Several studies have thereby been performed on neural network architecture in FWI problems, and there is ample room for research on theoretical analysis of network architecture on FWI, the effects of different hyperparameters on final outputs, etc. [31]. Another possible method is to treat the problem as an image-to-image translation and implement a CNN to solve the inverse problems in different fields [27,35,36,38,39].

The present work aims to develop a CNN that will solve the inverse problem of FWI in USCT with low computational processing requirements of the trained machine learning model. Due to the complexity of solving the FWI, some limitations are set in the initial testing and training of the CNN for this application. However, applying adjoint tomography-based deep-learning to the FWI problem takes into account the full non-linearity of the wave propagation [40], allowing for more usable information from each iteration. This can improve the reconstruction quality while also requiring less computational cost than classical FWI methods [41]. If CNNs can be trained to accurately predict the inverse problem for FWI, large reductions in computation requirements can be realized. This serves to reduce the entry cost of implementing FWI-based USCT systems and expand its application, use, and adoption in the industry. We propose to leverage modern advancements in machine learning and image generation to solve the inverse problem as an image translation issue. Our previous work [38] shows promise in applying 1D CNNs for single wave speed regions with the same wave speed. In this paper, we propose to build and train a two-dimensional (2D) CNN to identify and learn patterns to solve the inverse problem from the first iteration FWI input cost-effectively without compensating much on the performance, for multiple regions with various wave speeds. The effects of various loss functions in this CNN-based approach for FWI-based USCT problems have also been studied.

The remainder of the paper is organized as follows. First, the dataset generation will be introduced. This includes the creation of ground truth and FWI datasets which will be used to train, verify, and test the capabilities of the machine learning model, followed by the preparation of this data for use in machine learning. Next, the architecture of the machine learning model will be discussed along with the selection of various hyperparameters used for the training series. Finally, the results of the machine learning model are presented and discussed with an analysis of how each of the chosen hyperparameters affected the outcomes of each of the machine learning iterations.

## 2 Methods and Data

Our proposed approach is to develop a two-dimensional CNN to accelerate and enhance the inversion results. To build and test this, we need to implement a two-step approach. The first step is to generate datasets to work with; the second step is to train, validate, and test the machine learning models. In the first step of data creation, a set of randomly generated ground truth models has been created, and then a single iteration FWI is performed on the corresponding ultrasonic measurements. Because the combination of indefinite boundaries and varying shapes and sizes can quickly result in very complex challenges to solve, the dataset is limited to only a few regions of simple rectangular shapes with definite boundaries, but with different wave propagation speeds in rectangular regions. The inversion images generated by the FWI process have been used as inputs to the machine learning model to train said model to predict what the ground truth solution should be.

### 2.1 Implementation of Full Waveform Inversion.

**s**(

**x**,

*t*) can be expressed as

*p*denotes the pressure,

*ρ*is the mass density, and

**is the source excitation force. For waveform tomography, Tarantola [42] introduced the least-squares waveform misfit function. For this study, we also define the misfit function as**

*f***m**is the model space in ultrasound tomography and depends on material parameters such as density, wavespeed, attenuation, etc. At a lower frequency, attenuation is much lower [37]. For our study, therefore, we neglected attenuation and assumed the density to be constant throughout the process. The

*p*

_{obs}and

*p*

_{syn}are the pressure signals of the true data and synthetic data, respectively. The gradient of the misfit function is

*δp*

_{syn}is the perturbation in pressure field

*p*

_{syn}due to the model perturbation

*δ*

**m**[13,43]. The aim of FWI is to reduce the misfit gradient between these two data points so that model parameters can be obtained and a high-resolution, numerical model of the material distribution can be reconstructed as the final model [8]. However, in this study, we stopped the process after a single iteration of inversion. The results of this “single iteration FWI” process and the corresponding true models are then used to train–validate–test the deep-learning model.

### 2.2 Training and Test Dataset Generation.

A spectral finite element solver—called SPECFEM2D [44]—was used to simulate forward and adjoint wave propagation in two-dimensional media. A 12 mm × 12 mm water-immersed domain was designed as the 2D media. The domain consists of 900 spectral elements (30 along the *x*-axis and 30 along the *y*-axis) in the mesh. Each spectral element contained 25 control points. Excitation was generated from 100 source transducers located at one side of the domain (just below top wall in Fig. 1). All transducers were excited simultaneously to generate a plane wave with a center frequency of 1 MHz, and 400 sensor-elements were setup around the domain to receive the signals. Figure 1 illustrates how the plane wave propagated at different time-steps inside the domain. We built the initial model with a water-immersed environment at room temperature with a 1479.7 m/s wave propagation speed.

The ground truth models were created by introducing regions with unknown material properties (i.e., wave speed in this research). To generate a diverse dataset, the number, length, width, location, and wave speed for these regions were randomly selected. A total of 1000 ground truth models were created. Figure 2 shows the longitudinal wavespeed mapping of the initial model, a ground truth model, and an inverted model after a single iteration of FWI of that ground truth model.

We used a python-based framework called SeisFlows for FWI [8,45]. The output of both forward and inverse simulations was the wavespeed information of every control point in the domain saved as separate NumPy arrays.

FWI is a state-of-the-art tool for producing high-accuracy velocity models [46]. However, this method can be very computationally expensive [47]. Initially, the data generation was performed sequentially with one dataset sample being calculated at a time and then later stacked to create a single NumPy output array on a lab desktop. This approach proved to be time-consuming taking more than 15 h to generate 1000 samples. To remove this bottleneck, high-performance computing was used to allow data generation to be performed in parallel. The parallelization process was managed using Simple Linux Utility for a Resource Management (slurm) software package. A batch of 100 models was simulated as a single slurm job using 128 processor cores simultaneously. Each slurm job was completed in approximately 13 min. To generate 1000 models, we needed 10 slurm jobs. Thus, after parallelization, the time to generate a dataset of 1000 models was reduced to approximately 13 min.

### 2.3 Data Preprocessing.

The output of SPECFEM2D and SeisFlows is a one-dimensional array which contains wavespeed information of every control point in the spectral elements. However, since the problem is from a two-dimensional region, we needed to convert the data back into the original width and height dimensions. Additionally, there are many duplicate wave speed data points from overlapping control points of adjacent spectral elements that can be removed, and since the allowable gradient update region was a masked subset region, the domain can be reduced from the original 12 mm × 12 mm region to a 9.4 mm × 9.4 mm region. As in common practice, the wavespeeds were normalized between 0 and 1 using a linear regression normalization technique. Therefore, after normalization, the least wavespeed (the background wavespeed) became 0, and the highest wavespeed in the entire dataset was rescaled to 1. At this point, each individual sample exists as a one-dimensional array with a shape of 1 × 9025. However, each array can be rearranged back into a two-dimensional shape of size 95 × 95. Figure 3 illustrates the process of rearranging the 1D array into the 2D array.

Once converted into the two-dimensional array, we can expand the size of our dataset by augmenting the two-dimensional array by flipping the datasets along its axes (Fig. 4). Due to the symmetry of this surrounding array setup, this augmentation strategy allows us to enhance our original dataset with 3000 additional samples giving us a dataset of 4000 samples, without introducing non-physical artifacts.

Next, all samples are stacked into a single three-dimensional NumPy array of size of 4000 × 95 × 95 pertaining to the now 4000 samples with the 95 × 95 pixel values making up each sample. Once stacked, the samples are zero-padded along the bottom and right side increasing the dimension to 96 × 96 to allow easy pooling and scaling of the datasets in the machine learning processes. The deep-learning model processes each sample as a grey-scale image giving us a channel of one (instead of three for red–green–blue images). Hence, the final shape of our dataset is 4000 × 96 × 96 × 1. The final step in the data preparation process is to randomize and split the dataset into respective training, validation, and testing sets. The training and validation datasets are used to train the CNN and the testing dataset is reserved to test the performance of the trained model. The distribution of the datasets was $80%$ of the samples for training, $15%$ for validation, and $5%$ reserved for testing purposes.

### 2.4 Architectures.

With the original wave speed region and our dataset being a two-dimensional domain, a 2D CNN is well suited for this problem adaptation [48]. We use a CNN because of its ability to detect and classify features in an image [49], to deblur images [50], and for its ability to increase details through super-resolution [51]. Once trained, the CNN model facilitates the solving of the associated inverse problem and produces a high-resolution image of the predicted model in near real time. Operating in a Linux environment and utilizing the Anaconda 4.10 python package along with TensorFlow and Keras, a two-dimensional U-shaped CNN (U-NET) model was developed and trained using the single iteration FWI data as its input. While training, the CNN progresses the learning via mapping the prediction and the given ground truth to improve its predictions. During testing of the CNN, only FWIs from the test dataset are used.

The U-NET CNN Fig. 5 consists of an encoder and decoder block configured in a U shape with skip connections connecting encoder/decoder layers. The encoder block works to extract features from the input image and through the layers learns a representation of the image. The encoder block consists of two sets of convolutional layers utilizing batch normalization and an activation function. The skip connection allows the skipping of deeper U-NET layers as a shortcut to the decoder block. After each convolutional layer, a max pooling layer is implemented to reduce the physical dimensions of the input image by half while doubling the number of feature channels. The encoder and decoder blocks are connected with a bridge consisting of another pair of convolutional layers, batch normalization, and activation function combinations. The decoder block then works in reverse of the encoder block by doubling the input image size through a two-dimensional convolutional transpose and concatenating the feature map from the skip connection, and then another pair of convolutional, batch normalization, and activation blocks. The final convolutional layer then reduces the number of channels back to 1 to match the input image channel depth.

While the architecture of the U-NET is in place, there are still several hyperparameters that can be adjusted to tune the machine learning model. As each of these hyperparameters has its own strengths and weaknesses, it is necessary to test the performance of each hyperparameter and their combination with other hyperparameters. The main parameters considered in our U-NET implementation was the mean squared error (MSE) and mean absolute error (MAE) loss functions as well as the rectified linear unit (ReLU) and leaky rectified linear unit (LeakyReLU) activation functions. While the activation functions are both very similar, they exhibit some key differences in their performance. The ReLU function has the benefit of being fast but can also lead to some neurons being deactivated during training due to a vanishing gradient for negative values. The LeakyRelu activation function does not exhibit the vanishing gradient issue, but its value tends to be less stable [52].

With a total of two loss functions and two activation functions to test, we have a total of four U-NET models to compare. The complete list of parameters and a breakdown of their configurations can be seen in Table 1. All models were trained with 25 epochs and a batch size of 32. Model parameters were updated using the Adam optimizer. All convolutional layers utilized a 3 × 3 kernel size. Each U-NET model configuration utilized a single activation and error function for the entire model [54]. The four models were then trained on a high-performance cluster using a single GPU node with four NVIDIA Tesla V100-SXM2-32GB graphics cards with a total training time of approximately 4 min per model.

## 3 Results and Discussion

Once trained, all four of the 2D CNN models were able to quickly predict the ground truth model from a single iteration FWI image. All configurations of the U-NET were able to converge to greater than 96% accuracy after approximately seven epochs of training (Fig. 6). This means the CNN models were able to correctly predict most of the values and the total required training time is reduced to just over 1 min per model. Additionally, all of the models exhibited similar convergence on the loss values (Fig. 7), which is a measurement of how far from the true value the prediction was. A quantitative review of the model predictions (Fig. 8) shows that some of the more complex regions would cause over or under-prediction of values, particularly in locations where regions overlapped. Despite these challenges, the overall results of the predicted values proved quite favorable.

To quantify the performance of the U-NET models, a total of 200 of the test dataset inversions were predicted by each of the trained models (the results summarized in Table 2). All four of the trained U-NET models proved to perform exceptionally well at predicting the background speed distributions with none of the models exhibiting under-predicting or over-predicting of the background speed. Since the background speed is also the lowest speed in the models, all of the performance measurements of the U-NET configurations in Table 2 are calculated by the percent overshoot or percent undershoot of the maximum wave speed region located in the individual predicted sample.

ReLU | LeakyReLU | |||
---|---|---|---|---|

MSE | MAE | MSE | MAE | |

Max over-prediction | 6.77% | 6.74% | 9.71% | 10.67% |

Max under-prediction | −2.13% | −4.16% | −1.29% | −1.23% |

Min over-prediction | 0.05% | 0.10% | 0.09% | 0.03% |

Min under-prediction | −0.10% | −0.02% | −0.08% | −0.06% |

Average over-prediction | 2.03% | 1.36% | 3.58% | 3.53% |

Average under-prediction | −0.87% | −1.11% | −0.59% | −0.58% |

ReLU | LeakyReLU | |||
---|---|---|---|---|

MSE | MAE | MSE | MAE | |

Max over-prediction | 6.77% | 6.74% | 9.71% | 10.67% |

Max under-prediction | −2.13% | −4.16% | −1.29% | −1.23% |

Min over-prediction | 0.05% | 0.10% | 0.09% | 0.03% |

Min under-prediction | −0.10% | −0.02% | −0.08% | −0.06% |

Average over-prediction | 2.03% | 1.36% | 3.58% | 3.53% |

Average under-prediction | −0.87% | −1.11% | −0.59% | −0.58% |

As can be seen by this summary (and by Fig. 8), both U-NET configurations, when using the LeakyReLU activation function, have a greater tendency to over-predict the maximum wave speed as compared to their counterpart U-NET models using the ReLU activation function. However, the LeakyReLU activation function also tends to be less likely to under-predict the true wave speed. By comparing the different U-NET configurations in Fig. 8, we can also see that the LeakyReLU activation function tends to perform better at displaying the edges of complex region interactions, particularly in overlapping regions, while the standard ReLU activation function tends to perform better at predicting a uniform internal region wave speed.

Similarly, by comparing the MSE versus MAE loss functions, we can see that the MSE function tends to produce more consistent internal region speed predictions, while the MAE function tends to produce cleaner region edges. We also saw that the MAE loss function was slightly less likely to over-predict wave speeds; however, it was also slightly more prone to under-predicting the speeds when compared to the MSE loss function. We also saw that the MAE loss function in combination with the LeakyReLU activation function was more likely to predict false regions or fuzzy edges of regions.

Each of the U-NET models compared was configured to only use one error function (MSE or MAE) and one activation function (ReLU or LeakyReLU). However, as we can see from the results of the independent models, each of the error functions and activation functions do have certain benefits. A proposed path forward is to leverage a custom error function that combines the MSE and the MAE in a shared manner so that the strengths of each can be utilized. Additionally, since each encoder and decoder block in the U-NET uses a pair of convolutional layers, each with an activation layer, the activation functions could also be split between ReLU and LeakyReLU activations in an attempt to further leverage the strengths of each of the individual activation functions; this includes further tuning the LeakyReLU parameters to tailor its performance to the region identification task it excels at while attempting to reduce its tendency to over-predict wave speed values.

## 4 Conclusion

Two-dimensional CNN architectures were explored to instantaneously solve the associated FWI problem in USCT. 2D CNNs were found to be able to quickly provide quantitative results after data acquisition with only a few training epochs required to achieve a high level of accuracy. Various configurations of U-NETs were explored to analyze the strengths and weaknesses of various error and activation functions, whose performance was also qualitatively and quantitatively analyzed. Our studies show that 2D CNNs can be leveraged for solving the inverse problem in FWI-based USCT according to our results, and we suspect that the accuracy and performance of the U-NET can be improved through combinations of error and activation function implementations and will be tested in continued research.

## Acknowledgment

We deeply appreciate the financial support provided by NSF projects #2152764 and #2152765, the computational resources provided by ACCESS (formerly XSEDE) grants #MSS190025/#MDE220005, and the high-performance computing technical support for data generation by Dr. Paul Rodriguez through the XSEDE Extended Collaborative Support Service (ECSS) grant. The authors also appreciate the discussions with Mr. Benoît Guillard, Dr. Youzuo Lin, Dr. Mark Anastasio, and Dr. Umberto Villa.

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

## References

_{2}Capture Process Involving Stirring and CaCO

_{3}Precipitation