Deep Learning for Photoacoustic Tomography from Sparse Data

The development of fast and accurate image reconstruction algorithms is a central aspect of computed tomography. In this paper we investigate this issue for the sparse data problem in photoacoustic tomography (PAT). We develop a direct and highly efficient reconstruction algorithm based on deep learning. In our approach image reconstruction is performed with a deep convolutional neural network (CNN), whose weights are adjusted prior to the actual image reconstruction based on a set of training data. The proposed reconstruction approach can be interpreted as a network that uses the PAT filtered backprojection algorithm for the first layer, followed by the U-net architecture for the remaining layers. Actual image reconstruction with deep learning consists in one evaluation of the trained network. The numerical complexity of evaluating the trained network is smaller than that of iterative reconstruction algorithms, which require repeatedly solving the forward and adjoint problems. At the same time, our numerical results demonstrate that the proposed deep learning approach reconstructs images with a quality comparable to (or even outperforming) state of the art iterative approaches for PAT from sparse data.


Introduction
Deep learning is a rapidly emerging research area that yields significantly improved performance of many pattern recognition and machine learning applications [1,2]. Deep learning algorithms make use of special artificial neural network designs for representing a nonlinear input to output map together with optimization procedures for adjusting the weights of the network during the training phase. Deep learning techniques are currently the state of the art for visual object recognition, natural language understanding or applications in other domains such as drug discovery or biomedical image analysis (see, for example, [3][4][5][6][7][8][9][10] and the references therein).
Despite its success in various scientific disciplines, in image reconstruction deep learning research appeared only very recently (see [11][12][13][14][15][16][17]). In this paper, we develop a deep learning framework for image reconstruction in photoacoustic tomography (PAT). To concentrate on the main ideas we restrict ourselves to the sparse data problem in PAT in a circular measurement geometry. Our approach can be extended to an arbitrary measurement geometry in arbitrary dimension. Clearly, the increased dimensionality comes with an increased computational cost. This is especially the case for the training of the network which, however, is done prior to the actual image reconstruction.

PAT and the sparse sampling problem
PAT is a non-invasive coupled-physics biomedical imaging technology which beneficially combines the high contrast of pure optical imaging with the high spatial resolution of ultrasound imaging [18][19][20][21]. It is based on the generation of acoustic waves by illuminating a semi-transparent biological or medical object with short optical pulses. The induced time dependent acoustic waves are measured outside of the sample with acoustic detectors, and the measured data are used to recover an image of the interior (see Figure 1). High spatial resolution in PAT can be achieved by measuring the acoustic data with high spatial and temporal sampling rate [22,23]. While temporal samples can be easily collected at or above the Nyquist rate, the spatial sampling density is usually limited [24][25][26][27][28][29]. In fact, each spatial measurement requires a separate sensor and high quality detectors are often costly. Moving the detector elements can increase the spatial sampling rate, but is time consuming and also can introduce motion artefacts. Therefore, in actual applications, the number of sensor locations is usually small compared to the desired resolution which yields to a so-called sparse data problem.
Applying standard algorithms to sparse data yields low-quality images containing severe undersampling artefacts. To some extent, these artefacts can be reduced by using iterative image reconstruction algorithms [30][31][32][33][34][35][36] which allow to include prior knowledge such as smoothness, sparsity or total variation (TV) constraints [37][38][39][40][41][42]. These algorithms tend to be time consuming as the forward and adjoint problems have to be solved repeatedly. Further, iterative algorithms have their own limitations. For example, the reconstruction quality strongly depends on the used a-priori model about the objects to be recovered. For example, TV minimization assumes sparsity of the gradient of the image to be reconstructed. Such assumptions are often not strictly satisfied in real world scenarios which again limits the theoretically achievable reconstruction quality.
To overcome the above limitations, in this paper we develop a new reconstruction approach based on deep learning that comes with the following properties: (i) image reconstruction is efficient and non-iterative; (ii) no explicit a-priori model for the class of objects to be reconstructed is required; (iii) it yields a reconstruction quality comparable to (or even outperforming) existing methods for sparse data. Note that instead of providing an explicit a-priori model, the deep learning approach requires a set of training data and the CNN itself adjusts to the provided training data. By training the network on real word data, it thereby automatically creates a model of the considered PAT images in an implicit and data driven manner. While training of the CNN again requires time-consuming iterative minimization, we stress that training is performed independently of the particular investigated objects and prior to the actual image reconstruction. Additionally, if the time resources available for training a new network are limited, then one can use weights learned on one data set as good starting value for training the weights in the new network.

Proposed deep learning approach
Our reconstruction approach for the sparse data problem in PAT uses a deep convolutional neural network (CNN) in combination with any linear reconstruction method as preprocessing step. Essentially, it consists of the following two steps (see Figure 2): (D1) In the first step, a linear PAT image reconstruction algorithm is applied, which yields an approximation of the original object including under-sampling artefacts. (D2) In the second step, a deep CNN is applied to map the intermediate reconstruction from (D2) to an artefact-free final image.
Note that the above two-stage procedure can be viewed as a single deep neuronal network that uses the linear reconstruction algorithm in (D1) for the first layer, and the CNN in (D2) for the remaining layers.
Step (D1) can be implemented by any standard linear reconstruction algorithm including filtered backprojection (FBP) [43][44][45][46][47][48][49], Fourier methods [50][51][52][53][54], or time reversal [55][56][57][58]. In fact, all these methods can be implemented efficiently using at most O(d 3 ) floating point operations (FLOPS) for reconstructing a high-resolution image on an d × d grid. Here d is number spatial discretization points along one dimension of the Figure 2. Illustration of the proposed network for PAT image reconstruction. In the first step, the FBP algorithm (or another standard linear reconstruction method) is applied to the sparse data. In a second step, a deep CNN is applied to the intermediate reconstruction which outputs an almost artefact-free image. This may be interpreted as a deep network with the FBP in the first layer and the CNN in the remaining layers. reconstructed image. The CNN applied in step (D2) depends on weights that are adjusted using a set of training data to achieve artefact removal. The weights in the CNN are adjusted during the so-called training phase which is performed prior to the actual image reconstruction [1]. In our current implementation, we use the U-net architecture originally designed in [59] for image segmentation. Application of the trained network for image reconstruction is fast. One application of the U-net requires O(F 2 Ld 2 ) FLOPS, where F is the number of channels in the first convolution and L describes the depth of the network. Typically, F 2 L will be in the order of d and the number of FLOPS for evaluating the CNN is comparable to the effort of performing an FBP reconstruction. Moreover, evaluation of the CNN can easily be parallelized, which further increases numerical performance. On the other hand, iterative reconstruction algorithms tend to be slower as they require repeated application of the PAT forward operator and its adjoint.
To the best of our knowledge, this is the first paper using deep learning or neural networks for image reconstruction in PAT. Related approaches applying CNNs for different medical imaging technologies including computed tomography (CT) and magnetic resonance imaging (MRI) appeared recently in [11][12][13][14][15][16][17]. The author of [14] shares his opinions on deep learning for image reconstruction. In [13], deep learning is applied to imaging problems where the normal operator is shift invariant; PAT does not belong to this class. A different learning approach for addressing the limited view problem in PAT is proposed in [60]. The above references show that a significant amount of research has been done on deep learning for CT and MRI image reconstruction (based on inverse Radon and inverse Fourier transforms). Opposed to that, PAT requires inverting the wave equation, and our work is the first paper that uses deep learning and CNNs for PAT image reconstruction and inversion of the wave equation.

Outline
The rest of this paper is organized as follows. In Section 2 we review PAT and discuss the sparse sampling problem. In Section 3 we describe the proposed deep learning approach. For that purpose, we discuss neural networks and present CNNs and the U-net actually implemented in our approach. Details on the numerical implementation and numerical results are presented in Section 4. The paper concludes with a short summary and outlook given in Section 5.

Photoacoustic tomography
As illustrated in Figure 1, PAT is based on generating an acoustic wave inside some investigated object using short optical pulses. Let us denote by h : R d → R the initial pressure distribution which provides diagnostic information about the patient and which is the quantity of interest in PAT. Of practical relevance are the cases d = 2,3 (see [20,61,62]). For keeping the presentation simple and focusing on the main ideas we only consider the case of d = 2. Among others, the two-dimensional case arises in PAT with so called integrating line detectors [20,43]. Extensions to three spatial dimensions are possible by using the FBP algorithm for 3D PAT [45] in combination with the 3D U-net designed in [63]. Further, we restrict ourselves to the case of a circular measurement geometry, where the acoustic measurements are made on a circle surrounding the investigated object. In general geometry, one can use the so-called universal backprojection formula [47,49] that is exact for general geometry up to an additive smoothing term [47]. In this case, the CNN can be used to account for the under-sampling issue as well as to account for the additive smooth term. Such investigations, however, are beyond the scope of this paper.

PAT in circular measurement geometry
In two spatial dimensions, the induced pressure in PAT satisfies the following initial value problem for the 2D wave equation where we assume a constant sound-speed that is rescaled to one. In the circular measurement geometry, the initial pressure h is assumed to vanish outside the disc Note that the solution of used forward wave equation (1) is, for positive times, equal to the causal solution of the wave equation with source term δ (t)h(x); see [64]. Both models (either with source term or with initial condition) are frequently used in PAT. The goal of PAT image reconstruction is to recover h from measurements of the acoustic pressure p made on the boundary ∂B R . In a complete data situation, PAT in a circular measurement geometry consist in recovering the function h from data where p denotes the solution of (1) with initial data h and T is the final measurement time.
Here complete data refers to data prior to sampling that are known on the full boundary ∂B R and up to times T ≥ 2R. In such a case, exact and stable PAT image reconstruction is theoretically possible; see [32,65]. Several efficient methods for recovering h from complete data Ph are well investigated (see, for example, [43][44][45][46][47][48][49][50][51][52][53][54][55][56][57][58]). As an example, we mention the FBP formula derived in [44], Note that (3) requires data for all t > 0; see [44,Theorem 1.4] for a related FBP formula that only uses data for t < 2R. For the numerical results in this paper we truncate (3) at t = 2R, in which situation all singularities of the initial pressure are contained in the reconstructed image and the truncation error is small.

Discretization and sparse sampling
In practical applications, the acoustic pressure Ph can only be measured with a finite number of acoustic detectors. The standard sampling scheme for PAT in circular geometry assumes uniformly sampled values with Here p[m, · ] : [0, T] → R is the signal corresponding to the mth detector, and M is the total number of detector locations. Of course, in practice also the signals p[m, · ] have to be represented by discrete samples. However, temporal samples can easily be collected at a high sampling rate compared to the spatial sampling, where each sample requires a separate sensor.
In the case that a sufficiently large number of detectors is used, according to Shannon's sampling theory, implementations of full data methods yield almost artefact-free reconstructions (for a detailed analysis of sampling in PAT see [22]). As the fabrication of an array of detectors is demanding, experiments using integrating line detectors are often carried out using a single line detector, scanned on circular paths using scanning stages [66,67], which is very time consuming. Recently, systems using arrays of 64 parallel line detectors have been demonstrated [25,26]. To keep production costs low and to allow fast imaging the number M will typically be kept much smaller than advised by Shannon's sampling theory and one deals with highly under-sampled data. Therefore, one has to deal with a so-called sparse sampling problem (see Figure 3).
Due to the high frequency information contained in time, there is still hope to recover high-resolution images form spatially under-sampled data. For example, iterative algorithms, using TV minimization yield good reconstruction results from undersampled data (see [24,27,42,68]). However, such algorithms are quite time consuming as they require evaluating the forward and adjoint problem repeatedly (for TV typically at least several hundreds of times). Moreover, the reconstruction quality depends on certain a-priori assumptions on the class of objects to be reconstructed such as sparsity of the gradient. Image reconstruction with a trained CNN is direct and requires a smaller numerical effort compared to iterative methods. Further, it does not require an explicit model for the prior knowledge about the objects to be recovered. Instead, such a model is implicitly learned in a data-driven manner based on the training data by adjusting the weights of the CNN to the provided training data during the training phase.

Deep learning for PAT image reconstruction
Suppose that sparsely sampled data of the form (4), (5) are at our disposal. As illustrated in Figure 2, in our deep learning approach we first apply a linear reconstruction procedure to the sparsely sampled data (p[m, · ]) M m=1 which outputs a discrete image X ∈ R d×d . According to Shannon's sampling theory an aliasing free reconstruction requires M ≥ π d detector positions [22]. However, in practical applications we will have M d, in which case severe undersampling artefacts appear in the reconstructed image. To reduce these artefacts, we apply a CNN to the intermediate reconstruction which outputs an almost artefact-free reconstruction Y ∈ R d×d . How to implement such an approach is described in the following.

Image reconstruction by neural networks
The task of high-resolution image reconstruction can be formulated as supervised machine learning problem. In that context, the aim is finding a restoration function : R d×d → R d×d that maps the input image X ∈ R d×d (containing undersampling artefacts) to the output image Y ∈ R d×d which should be almost artefact-free. For constructing such a function , one assumes that a family of training data T := (X n , Y n ) N n=1 are given. Any training example (X n , Y n ) consist of an input image X n and a corresponding artefact-free output image Y n . The restoration function is constructed in such a way that the training error is minimized, where d : R d×d × R d×d → R measures the error made by the function on the training examples. Particular powerful supervised machine learning methods are based on neural networks (NNs). In such a situation, the restoration function is taken in the form where any factor σ • W is the composition of a linear transformation (or matrix) W ∈ R D +1 ×D and a nonlinearity σ : R → R that is applied component-wise. Here L denotes the number of processing layers, σ are so called activation functions and W := (W 1 , . . . , W L ) is the weight vector. Neural networks can be interpreted to consist of several layers, where the factor σ • W maps the variables in layer to the variables in layer + 1.
The variables in the first layer are the entries of the input vector X and the variables in the last layer are the entries of the output vector Y. Note that in our situation we have an equal number of variables D 1 = D L+1 = d 2 in the input and the output layer. Approximation properties of NNs have been analyzed, for example, in [69,70].
The entries of the weight vector W are called weights and are the variable parameters in the NN. They are adjusted during the training phase prior to the actual image reconstruction process. This is commonly implemented using gradient descent methods to minimize the training set error [1,71] The standard gradient method uses the update rule where ∇E denotes the gradient of the error function in the second component and W (k) the weight vector in the kth iteration. In the context of neural networks, the update term is also known as error backpropagation. If the number of training examples is large, then the gradient method becomes slow. In such a situation, a popular acceleration is the stochastic gradient descent algorithm [1,71]. Here for each iteration a small subset T (k) of the whole training set is chosen randomly at any iteration and the weights are adjusted using the modified update formula for the kth iteration. In the context of image reconstruction similar acceleration strategies are known as ART or Kaczmarz type reconstruction methods [23,72,73]. The number of elements in T (k) is called batch size and η is referred to as the learning rate. To stabilize the iterative process, it is common to add a so-called momentum term β (W (k) − W (k−1) ) with some nonnegative parameter β in the update of the kth iteration.

CNNs and the U-net
In our application, the inputs and outputs are high dimensional vectors. Such large-scale problems require special network designs, where the weight matrices are not taken as arbitrary matrices but take a special form reducing its effective dimensionality. When the input is an image, CNNs use such special network designs that are widely and successfully used in various applications [71,74]. A main property of CNNs is the invariance with respect to certain transformations of the input. In CNNs, the weight matrices are block diagonal, where each block corresponds to a convolution with a filter of small support and the number of blocks corresponds to the number of different filters (or channels) used in each layer. Each block is therefore a sparse band matrix, where the non-zero entries of the band matrices determine the filters of the convolution. CNNs are currently extensively used in image processing and image classification, since they outperform most comparable algorithms [1]. They are also the method of choice for the present paper.
There are various CNN designs that can differ in the number of layers, the form of the activation functions and the particular form of the weight matrices W . In this paper, we use a particular CNN based on the so-called U-net introduced in [59]. It has been originally designed for biomedical image segmentation and recently been used for low dose CT in [12,13]. The U-net is based on the so-called fully convolutional network used in reference [75]. Such network architectures employ multichannel filtering which means that the weight matrix in every layer consists of a family of multiple convolution filters followed by the rectified linear unit (ReLU) as activation function. The rectified linear unit is defined by ReLU(x) := max{x, 0}. As shown in [12], the residual images X − Y often have a simpler structure and are more accessible to the U-net than the original outputs. Therefore, learning the residuals and subtracting them from the inputs after the last layer is more effective than directly training for Y. Such an approach is followed in our implementation. The resulting deep neural network architecture is shown in Figure 4.

PAT using FBP combined with the U-net
We are now ready to present the proposed deep learning approach for PAT image reconstruction from sparse data, that uses the FBP algorithm as linear preprocessing step followed by the U-net for removing undersampling artefacts. Recall that we have given sparsely sampled data (p[m, · ]) M m=1 of the form (4), (5). A discrete high-resolution approximation Y ∈ R d×d with d M of the original object is then reconstructed as follows.
(S1) Apply the FBP algorithm to p which yields an reconstruction X ∈ R d×d containing undersampling artefacts. (S2) Apply the U-net shown with residual connection in Figure 4 to X which yields an image Y ∈ R d×d with significantly reduced undersampling artefacts.
The above two steps can also be combined to a single network with the FBP in the first layer and the U-net for the remaining layers. Note that the first step could also be replaced by another linear reconstruction methods such as time reversal and the second step by a different CNN. Such alternative implementations will be investigated in future studies. In this work, we use the FBP algorithm described in [44] for solving step (S1). It is based on discretizing the inversion formula (3) by replacing the inner and the outer integration by numerical quadrature and uses an interpolation procedure to reduce the numerical complexity. For details on the implementation we refer to [44,76].
A crucial ingredient in the above deep learning method is the adjustment of the actual weights in the U-net, which have to be trained on an appropriate training data set. For that purpose we construct training data T = (X n , Y n ) N n=1 by first creating certain phantoms Y n . We then simulate sparse data by numerically implementing the well-known solution formula for the wave equation and subsequently construct X n by applying the FBP algorithm of [44] to the sparse data. For training the network we apply the stochastic gradient algorithm for minimizing the training set error (6), where we take the error measure d corresponding to the 1 -norm

Numerical realization
In this section, we give details on the numerical implementation of the deep learning approach and present reconstruction results under various scenarios.

Data generation and network training
For all numerical results presented below we use d = 128 for the image size and take R = 1 for the radius of the measurement curve. For the sparse data in (4)  For the training of the network on the ellipse phantoms we generate two different data sets, each consisting of N = 1000 training pairs (X n , Y n ) N n=1 . One set of training data corresponds to pressure data without noise and for the second data set we added random noise to the simulated pressure data. The outputs Y n consist of the sum of indicator functions of ellipses generated randomly as described above that are sampled on the 128 × 128 imaging grid. The number of ellipses in each training example is also taken randomly according to the uniform distribution on {1, . . . , 5}. The input images are generated numerically by first computing the sparse pressure data using the solution formula for the wave equation and then applying the FBP algorithm to obtain X n .
For actual training, we use the stochastic gradient descent algorithm with a batch size of one for minimizing (8). We train for 60 epochs which means we make 60 sweeps through the whole training dataset. We take η = 10 −3 for the learning rate, include a momentum parameter β = 0.99, and use the mean absolute error for the distance measure in (8). The weights in the jth layer are initialized by sampling the uniform distribution on [−H , H ] where H := √ 6/ √ D + D +1 and D is the size of the input in layer . This initializer is due to Glorot and Bengio [77]. We use F = 32 channels for the first convolution and the total number of layers is L = 19.

Numerical results
We first test the network trained above on a test set of 50 pairs (X, Y) that are generated according to the random model for the training data described above. For such random ellipse phantoms, the trained network is in all tested case able to almost completely eliminate the sparse data artefacts in the test images. Figure 5 illustrates such results for one of the test phantoms. Figure 5(a) shows the phantom, Figure 5(b) the result of the FPB algorithm which contains severe undersampling artefacts and Figure 5(c) the result of applying the CNN (right) which is almost artefact-free. The actual relative 2reconstruction error Y CNN − Y 2 / Y 2 of the CNN reconstruction is 0.0087 which is much smaller than the relative error of FBP reconstruction which is 0.1811.
We also compared our trained network to penalized TV minimization [37,42] Here P is a discretization of the PAT forward operator using M detector locations and d spatial discretization points and · TV is the discrete TV. For the presented results, we take λ = 0.002 and used the lagged diffusivity algorithm [78] with 20 outer and 20 inner iterations for numerically minimizing (9). TV minimization exploits the sparsity of the gradient as prior information and therefore is especially well suited for reconstructing sums of indicator functions and can be seen as state of the art approach for reconstructing such type of objects. As can be seen from the results in Figure 5(d), TV minimization in fact gives very accurate results. Nevertheless, the deep learning approach yields comparable results in both cases. In terms of the relative 2 -reconstruction error, the CNN reconstruction even outperforms the TV reconstruction (compare with Table 1). In order to test the stability with respect to noise we also test the network on reconstructions coming from noisy data. For that purpose, we added Gaussian noise with a standard deviation equal to 2% of the maximal value to simulated pressure data. Reconstruction results are shown in Figure 6. There we show reconstruction results with two differently trained networks. For the results shown in Figure 6(b) the CNN has been trained on the exact data, and for the results shown in Figure 6(c) it has been trained on noisy data. The reconstructions using each of the networks are again almost artefact-free. The reconstruction from the same data with TV minimization is shown in Figure 6(d). The relative 2 -reconstruction errors for all reconstructions are given in Table 1. Notes: Compared are FBP, TV, and the proposed CNN reconstruction trained on the class of ellipse phantoms without noise (ELL) and with noise (ELLn), as well as trained on a class containing Shepp-Logan type phantom without noise (SL) and with noise (SLn).

Results for Shepp-Logan type phantom
In order to investigate the limitations of the proposed deep learning approach, we additionally applied the CNN (trained on the ellipse phantom class) to test phantoms where the training data are not appropriate (Shepp-Logan type phantoms). Reconstruction results for such a Shepp-Logan type phantom from exact data are shown in Figure 7, which compares results using FBP (Figure 7(a)), TV minimization (Figure 7(b)) and CNN improved versions using the ellipse phantom class without noise (Figure 7(c)) and with noise ( Figure 7(d)) as training data. Figure 8 shows similar results for noisy measurement data with added Gaussian noise with a standard deviation equal to 2% of the maximal pressure value. As expected, this time the network does not completely remove all artefacts. However, despite the Shepp-Logan type test object has features not appearing in the training data, still many artefacts are removed by the network trained on the ellipse phantom class.
We point out, that the less good performance of CNN in Figures 7(a-d) and 8(a-d) is due to the non-appropriate training data and not due to the type of phantoms or the CNN approach itself. To support this claim, we trained additional CNNs on the union of 1000 randomly generated ellipse phantoms and 1000 randomly generated Shepp-Logan type phantoms. The Shepp-Logan type phantoms have position, angle, shape and intensity of every ellipse chosen uniformly at random under the side constraints that the support of every ellipse lies inside the unit disc. The results of the CNN trained on the new training data are shown in Figures 7(e,f) for exact measurement data and in Figure 8(e,f) for noisy measurement data. For both results we applied a CNN trained using training data without (e) and with noise (f). And indeed, when using appropriate training data including Shepp-Logan type phantoms, the CNN is again comparable to TV minimization. We see    these results quite encouraging; future work will be done to extensively test the framework using a variety of training and test data sets, including real world data. The relative 2 -reconstruction errors for all presented numerical results are summarized in Table 1.

Discussion
The above results demonstrate that deep learning-based methods are a promising tool to improve PAT image reconstruction. The presented results show that appropriately trained CNNs can significantly reduce under sampling artefacts and increase reconstruction quality. To further support this claim, in Table 2 we show the averaged relative 2 reconstruction error for 100 Shepp-Logan type phantoms (similar to the ones in Figure 7). We see that even in the case where we train the network for the different class of ellipse shape phantoms, the error decreases significantly compared to FBP.
Any reconstruction method for solving an ill-posed underdetermined problem, either implicitly or explicitly requires a-priori knowledge about the objects to be reconstructed. In classical variational regularization, a-priori knowledge is incorporated by selecting an element which has minimal (or at least small value) of a regularization functional among all objects consistent with the data. In the case of TV, this means that phantoms with small TV ( 1 norm of gradient) are reconstructed. On the other hand, in deep learning-based reconstruction methods a-priori knowledge is not explicitly prescribed in advance. Instead, the a-priori knowledge in encoded in the given training class and CNNs are trained to automatically learn the structure of desirable outputs. In the above results, the training class consists of piecewise constant phantoms which have small TV. Consequently, TV regularization is expected to perform well. It is therefore not surprising, that in this case TV minimization outperforms CNN-based methods. However, the CNN-based methods are more flexible in the sense that by changing the training set they can be adjusted to very different type of phantoms. For example, one can train the CNN for classes of experimental PAT data, where it may be difficult to find an appropriate convex regularizer.

Computational efforts
Application of the trained CNN for image reconstruction is non-iterative and efficient. In fact, one application of the used CNN requires O(F 2 Ld 2 ) FLOPS, where F is the number of channels for the first convolution and L describes the depth of the network. Moreover, CNNs are easily accessible to parallelization. For high-resolution images, F 2 L will be in the order of d and therefore the effort for one evaluation of the CNN is comparable to effort of one evaluation of the PAT forward operator and its adjoint, which both require O(d 3 ) FLOPS. However, for computing the minimizer of (9) we have to repeatedly evaluate the PAT forward operator and its adjoint. In the examples presented above, for TV minimization we evaluated both operations 400 times, and therefore the deep learning image reconstruction approach is expected to be significantly faster than TV minimization or related iterative image reconstruction approaches.
For training and evaluation of the U-net we use the Keras software (see https://keras.io/), which is a high-level application programming interface written in Python. Keras runs on top of TensorFlow (see https://www.tensorflow.org/), the open-source software library for machine intelligence. These software packages allow an efficient and simple implementation of the modified U-net according to Figure 4. The filtered backprojection and the TV-minimization have been implemented in MATLAB. We perform our computations using an Intel Core i7-6850K CPU and a Nvidia Geforce 1080 Ti GPU. The training time for the CNN using the training set of 1000 ellipse phantoms has been 16 seconds per epoch, yields 16 minutes for the overall training time (using 60 epochs). For the larger mixed training data set (consisting of 1000 ellipse phantoms and 1000 Shepp-Logan type phantoms) one epoch requires 25 seconds. Recovering a single image requires 15 milliseconds for the FBP algorithm and 5 milliseconds for applying the CNN. The reconstruction time for the TV-minimization (with 20 outer and 20 inner iterations) algorithm has been 25 seconds. In summary, the total reconstruction time using the two-stage deep learning approach is 20 milliseconds, which is over 1000 times faster than the time required for the TV minimization algorithm. Of course, the reconstruction times strongly depend on the implementation of TV-minimization algorithm and the implementation of the CNN approach. However, any step in the iterative TV-minimization has to be evaluated in a sequential manner, which is a conceptual limitation of iterative methods. Evaluation of the CNN, on the other hand, is non-iterative and inherently parallel, which allows efficient parallel GPU computations.

Conclusion
In this paper, we developed a deep learning approach for PAT from sparse data. In our approach, we first apply a linear reconstruction algorithm to the sparsely sampled data and subsequently apply a CNN with weights adjusted to a set of training data. Evaluation of the CNN is non-iterative and has a similar numerical effort as the standard FBP algorithm for PAT. The proposed deep learning image reconstruction approach has been shown to offer a reconstruction quality similar to state of the art iterative algorithms and at the same time requires a computational effort similar to direct algorithms such as FBP. The presented numerical results can be seen as a proof of principle, that deep learning is feasible and highly promising for image reconstruction in PAT.
As demonstrated in Section 4 the proposed deep learning framework already offers a reconstruction quality comparable to state of the art iterative algorithms for the sparse data problem in PAT. However, as illustrated by Figures 7 and 8 this requires the PAT image to share similarities with the training data used to adjust the weights of the CNN. In future work, we will therefore investigate and test our approach under various real-world scenarios including realistic phantom classes for training and testing, different measurement geometries, and increased discretization sizes. In particular, we will also train and evaluate the CNNs on real world data.
Note that the results in the present paper assume an ideal impulse response of the acoustic measurement system. For example, this is appropriate for PAT using integrating optical line detectors, which have broad detection bandwidth; see [20]. In the case that piezoelectric sensors are used for acoustic detection, the limited bandwidth is an important issue that must be taken into account in the PAT forward model and the PAT inverse problem [79]. In particular, in this case, the CNN must also be trained to learn a deconvolution process addressing the impulse response function. Such investigations, as well as the application to real data, are beyond the scope of this paper and have been addressed in our recent work [80]. (Note that the presented paper has already been submitted much earlier, initially on 14 April 2017 and to IPSE on 1 July 2017.) It is an interesting line of future research using other CNNs that may outperform the currently implemented U-net. We further work on the theoretical analysis of our proposal providing insight why it works that well, and how to steer the network design for further improving its performance and flexibility.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
The work of S.A. and M.H. has been supported by the Austrian Science Fund (FWF), project P 30747-N32.