DOLPHIn - Dictionary Learning for Phase Retrieval

We propose a new algorithm to learn a dictionary for reconstructing and sparsely encoding signals from measurements without phase. Specifically, we consider the task of estimating a two-dimensional image from squared-magnitude measurements of a complex-valued linear transformation of the original image. Several recent phase retrieval algorithms exploit underlying sparsity of the unknown signal in order to improve recovery performance. In this work, we consider such a sparse signal prior in the context of phase retrieval, when the sparsifying dictionary is not known in advance. Our algorithm jointly reconstructs the unknown signal - possibly corrupted by noise - and learns a dictionary such that each patch of the estimated image can be sparsely represented. Numerical experiments demonstrate that our approach can obtain significantly better reconstructions for phase retrieval problems with noise than methods that cannot exploit such"hidden"sparsity. Moreover, on the theoretical side, we provide a convergence result for our method.

In recent years, new phase retrieval techniques were developed for recovering sparse signals, which are linear combinations of only a few atoms from a known dictionary [8], [14], [13], [15].With a sparsity assumption, these algorithms obtained better recovery performance than traditional nonsparse approaches.The main idea is akin to compressed sensing, where one works with fewer (linear) measurements than signal components [16], [17], [18].An important motivation for developing sparse recovery techniques was that many classes of signals admit a sparse approximation in some basis or overcomplete dictionary [19], [20], [21].While sometimes such dictionaries are known explicitly, better results have been achieved by adapting the dictionary to the data, e.g., for image denoising [20].Numerous algorithms have been developed for this task, see, e.g., [19], [22], [23].In this traditional setting, the signal measurements are linear and a large database of training signals is used to train the dictionary.
In this work, we propose a dictionary learning formulation for simultaneously solving the signal reconstruction and sparse representation problems given nonlinear, phaseless and noisy measurements.To optimize the resulting (nonconvex) objective function, our algorithm-referred to as DOLPHIn (DictiOnary Learning for PHase retrIeval)-alternates between several minimization steps, thus monotonically reducing the value of the objective until a stationary point is found (if step sizes are chosen appropriately).Specifially, we iterate between best fitting the data and sparsely representing the recovered signal.DOLPHIn combines projected gradient descent steps to update the signal, iterative shrinkage to obtain a sparse approximation [24], and block-coordinate descent for the dictionary update [23].
In various experiments on image reconstruction problems, we demonstrate the ability of DOLPHIn to achieve significantly improved results when the oversampling ratio is low and the noise level high, compared to the recent state-of-theart Wirtinger Flow (WF) method [12], which cannot exploit sparsity if the dictionary is unknown.In this two-dimensional setting, we break an image down into small patches and train a dictionary such that each patch can be sparsely represented using this dictionary.The patch size as well as the amount of overlap between patches can be freely chosen, which allows us to control the trade-off between the amount of computation required to reconstruct the signal and the quality of the result.
The paper is organized as follows: In Sections II and III, we introduce the DOLPHIn framework and algorithm.Then, in Section IV, we present numerical experiments and implementation details, along with discussions about (hyper-)parameter selection and variants of DOLPHIn.We conclude the paper in Section V.The appendix provides further details on the mathematical derivation of the DOLPHIn algorithm.A short preliminary version of this work appeared in the conference paper [25].

II. PHASE RETRIEVAL MEETS DICTIONARY LEARNING
In mathematical terms, the phase retrieval problem can be formulated as solving a nonlinear system of equations: where the functions f i : C N → C are linear operators and the scalars y i are nonlinear measurements of the unknown original signal x in X , obtained by removing the phase information.The set X represents constraints corresponding to additional prior information about x.For instance, when dealing with real-valued bounded signals, this may typically be a box constraint X = [0, 1] N .Other common constraints include information about the support set-that is, the set of nonzero coefficients of x.Classical phase retrieval concerns the recovery of x given the (squared) modulus of the signal's Fourier transform.Other commonly considered cases pertain to randomized measurements (f i are random linear functions) or coded diffraction patterns, i.e., concatenations of random signal masks and Fourier transforms (see, e.g., [2], [12]).

A. Prior and Related Work
The most popular methods for classical phase retrieval-Fienup's algorithm [5] and many related approaches [2], [4], [6], [26], [27]-are based on alternating projections onto the sets Y := {x ∈ C N s.t.|f i (x)| = y i ∀ i} (or {x ∈ C N s.t.|f i (x)| 2 = y i ∀ i}) and onto the set X .However, the nonconvexity of Y makes the projection not uniquely defined and possibly hard to compute.The success of such projection-based methods hinges critically on precise prior knowledge (which, in general, will not be available in practice) and on the choice of a projection operator onto Y. Ultimately, convergence to x (up to global phase) is in general not guaranteed and these methods often fail in practice.
Further algorithmic techniques to tackle (1) include two different semidefinite relaxation approaches, PhaseLift [7] and PhaseCut [11].PhaseLift "lifts" (1) into the space of (complex) positive semidefinite rank-1 matrices via the variable transformation X := xx * .Then, the nonlinear constraints |f i (x)| 2 = y i are equivalent to linear constraints with respect to the matrix variable X.By suitably relaxing the immediate but intractable rank-minimization objective, one obtains a convex semidefinite program (SDP).Similarly, PhaseCut introduces a separate variable u for the phase, allowing to eliminate x, and then lifts u to obtain an equivalent problem with a rank-1constraint, which can be dropped to obtain a different SDP relaxation of (1).Despite some guarantees on when these relaxations are tight, i.e., allow for correctly recovering the solution to (1) (again up to a global phase factor), their practical applicability is limited due to the dimension of the SDP that grows quadratically with the problem dimension.
A recent method that works in the original variable space is the so-called Wirtinger Flow algorithm [12].Here, (1) is recast as the optimization problem which is approximately solved by a gradient descent algorithm.Note that in the case of complex variables, the concept of a gradient is not well-defined, but as shown in [12], a strongly related expression termed the "Wirtinger derivative" can be used instead and indeed reduces to the actual gradient in the real case.For the case of i.i.d.Gaussian random measurements, local convergence with high probability can be proven for the method, and a certain spectral initialization provides sufficiently accurate signal estimates for these results to be applicable.Further variants of the Wirtinger Flow (WF) method that have been investigated are the Truncated WF [28], which involves improving search directions by a statistically motivated technique to filter out components that bear "too much" influence, and Thresholded WF [13], which allows for improved reconstruction of sparse signals (i.e., ones with only a few significant components or nonzero elements), in particular when the measurements are corrupted by noise.The concept of sparsity has been successfully employed in the context of signal reconstruction from linear measurements, perhaps most prominently in the field of compressed sensing [16], [17], [18], [29] during the past decade.There, the task is to recover an unkown signal x ∈ C N from M < N linear measurements-that is, finding the desired solution among the infinitely many solutions of an underdetermined system of linear equations.For signals that are (exactly or approximately) sparse with respect to some basis or dictionary, i.e., when x ≈ Dâ for a matrix D and a vector â that has few nonzero entries, such recovery problems have been shown to be solvable in a very broad variety of settings and applications, and with a host of different algorithms.Dictionaries enabling sparse signal representations are sometimes, but not generally, known in advance.The goal of dictionary learning is to improve upon the sparsity achievable with a given (analytical) dictionary, or to find a suitable dictionary in the first place.Given a set of training signals, the task consists of finding a dictionary such that every training signal can be wellapproximated by linear combinations of just a few atoms.Again, many methods have been developed for this purpose (see, e.g., [19], [20], [21], [22], [23]) and demonstrated to work well in different practical applications.
Signal sparsity (or compressability) can also be beneficially exploited in phase retrieval methods, cf.[8], [9], [13], [14], [15].However, to the best of our knowledge, existing methods assume that the signal is sparse itself or sparse with respect to a fixed pre-defined dictionary.This motivates the development of new algorithms and formulations to jointly learn suitable dictionaries and reconstruct input signals from nonlinear measurements.

B. Dictionary Learning for Phase Retrieval
In this paper, we consider the problem of phase retrieval by focusing on image reconstruction applications.Therefore, we will work in a two-dimensional setting directly.However, it should be noted that all expressions and algorithms can also easily be formulated for one-dimensional signals like (1), as detailed in the appendix.We will also consider the case of noisy measurements, and will show that our approach based on dictionary learning is particularly robust to noise, which is an important feature in practice.
Concretely, we wish to recover an image X in [0, 1] N1×N2 from noise-corrupted phaseless nonlinear measurements where F : C N1×N2 → C M1×M2 is a linear operator, N is a real matrix whose entries represent noise, and the complex modulus and squares are taken component-wise.As mentioned earlier, signal sparsity is known to improve the performance of phase retrieval algorithms, but a sparsifying transform is not always known in advance, or a better choice than a predefined selection can sometimes be obtained by adapting the dictionary to the data.In the context of image reconstruction, this motivates learning a dictionary D in R s×n such that each s 1 × s 2 patch xi of X, represented as a vector of size s = s 1 s 2 , can be approximated by xi ≈ Da i with a sparse vector a i in R n .
Here, n is chosen a priori and the number of patches depends on whether the patches are overlapping or not.In general, D is chosen such that n ≥ s.With linear measurements, the paradigm would be similar to the successful image denoising technique of [20], but the problem (3) is significantly more difficult to solve due to the modulus operator.Before detailing our algorithm for solving (3), we introduce the following notation.Because our approach is patch-based (as most dictionary learning formulations), we consider the linear operator E : C N1×N2 → C s×p that extracts the p patches x i (which may overlap or not) from an image X and forms the matrix E(X) = (x1 , . . ., x p ).Similarly, we define the linear operator R : C s×p → C N1×N2 that reverses this process, i.e., builds an image from a matrix containing vectorized patches as its columns.Thus, in particular, we have R(E(X)) = X.When the patches do not overlap, the operator R simply places every small patch at its appropriate location in a larger N 1 × N 2 image.When they overlap, the operator R averages the pixel values from the patches that fall into the same location.Further, let A := (a 1 , . . ., a p ) in R n×p be the matrix containing the patch representation coefficient vectors as columns.Then, our desired sparse-approximation relation "x i ≈ Da i for all i" can be expressed as E(X) ≈ DA.
With this notation in hand, we may now introduce our method, called DOLPHIn (DictiOnary Learning for PHase retrIeval).We consider an optimization problem which can be interpreted as a combination of an optimization-based approach to phase retrieval-minimizing the residual norm with respect to the set of nonlinear equations induced by the phaseless measurements, cf.(2)-and a (patch-based) dictionary learning model similar to that used for image denoising in [20].The model contains three variables: The image, or phase retrieval solution X, the dictionary D and the matrix A containing as columns the coefficient vectors of the representation X ≈ R(DA).The phase retrieval task consists of estimating X and the dictionary learning or sparse coding task consists of estimating D and A; a common objective function provides feedback between the two objectives, with the goal of improving the phase retrieval reconstruction procedure by encouraging the patches of X to admit a sparse approximation.
Formally, the DOLPHIn formulation consists of minimizing min Here, X F denotes the Frobenius matrix-norm, which generalizes the Euclidean norm to matrices.The parameters µ, λ > 0 in the objective (4) provide a way to control the trade-off between the data fidelity term from the phase retrieval problem and the approximation sparsity of the image patches 1 .To that effect, we use the 1 -norm, which is wellknown to have a sparsity-inducing effect [30].In order to avoid scaling ambiguities, we also restrict D to be in the subset D := {D ∈ R s×n : d j 2 ≤ 1 ∀j = 1, . . ., n} of matrices with column 2 -norms at most 1, and assume n < p (otherwise, each patch is trivially representable by a 1-sparse vector a i by including x i / x i 2 as a column of D).The model (4) could also easily be modified to include further side constraints, a different type of nonlinear measurements, or multiple images or measurements, respectively; we omit these extensions for simplicity.

III. ALGORITHMIC FRAMEWORK
Similar to classical dictionary learning [19], [22], [21], [31] and phase retrieval, problem (4) is nonconvex and difficult to solve.Therefore, we adopt an algorithm that provides monotonic decrease of the objective while converging to a stationary point (see Section III-D below).
The algorithmic framework we employ is that of alternating minimization: For each variable A, X and D in turn, we take one step towards solving (4) with respect to this variable alone, keeping the other ones fixed.Each of these subproblems is convex in the remaining unfixed optimization variable, and well-known efficient algorithms can be employed accordingly.We summarize our method in Algorithm 1, where the superscript * denotes the adjoint operator (for a matrix Z, Z * is thus the conjugate transpose), (•) extracts the real part of a complex-valued argument, and denotes the Hadamard (element-wise) product of two matrices.The algorithm also involves the classical soft-thresholding operator S τ (Z) := max{0, |Z| − τ } sign(Z) and the Euclidean projection P X (Z) := max{0, min{1, Z}} onto X := [0, 1] N1×N2 ; here, all operations are meant component-wise.
To avoid training the dictionary on potentially useless early estimates, the algorithm performs two phases-while the iteration counter is smaller than K 1 , the dictionary is not updated.Below, we explain the algorithmic steps in more detail.
Note that DOLPHIn actually produces two distinct reconstructions of the desired image, namely X (the per se "image variable") and R(DA) (the image assembled from the sparsely coded patches) 2 .Our numerical experiments in Section IV show that in many cases, R(DA) is in fact slightly or even significantly better than X with respect to at least one quantitative quality measure and is therefore also considered a possible reconstruction output of Algorithm 1 (at least in the noisy setups we consider in this paper).Nevertheless, X is sometimes more visually appealing and can be used, for instance, to refine parameter settings (if it differs strongly from R(DA)) or to assess the influence of the patchbased "regularization" on the pure non-sparse Wirtinger Flow method corresponding to the formulation where λ and µ are set to zero.

A. Updating the Patch Representation Vectors
Updating A (i.e., considering (4) with D and X fixed at their current values) consists of decreasing the objective which is separable in the patches i = 1 . . ., p. Therefore, we can update all vectors a i independently and/or in parallel.
To do so, we choose to perform one step of the wellknown algorithm ISTA (see, e.g., [24]), which is a gradientbased method that is able to take into account a non-smooth regularizer such as the 1 -norm.Concretely, the following update is performed for each i = 1, . . ., p: This update involves a gradient descent step (the gradient with respect to a i of the smooth term in each summand of ( 5) is D ( ) D ( ) a i ( ) − x i , respectively) followed by softthresholding.Constructing A ( +1) from the a i ( +1) as specified above is equivalent to Step 2 of Algorithm 1.
The step size parameter γ A can be chosen in (0, 1/L A ), where L A is an upper bound on the Lipschitz constant of the gradient; here, 2 would be appropriate, but a less computationally demanding strategy is to use a backtracking scheme to automatically update L A [24].
A technical subtlety is noteworthy in this regard: We can either find one γ A that works for the whole matrix-variable update problem-this is what is stated implicitly in Step 2or we could find different values, say γ a,i , for each column a i , i = 1, . . ., p, of A separately.Our implementation does the latter, since it employs a backtracking strategy for each column update independently.

B. Updating the Image Estimate
With D = D ( ) and A = A ( +1) fixed, updating X consists of decreasing the objective This problem can be seen as a regularized version of the phase retrieval problem (with regularization parameter µ) that encourages the patches of X to be close to the sparse approximation DA obtained during the previous (inner) iterations.
Our approach to decrease the value of the objective ( 7) is by a projected gradient descent step.In fact, for µ = 0, this step reduces to the Wirtinger flow method [12], but with necessary modifications to take into account the constraints on X (realvaluedness and variable bounds [0, 1]).
The gradient of ϕ(X) F with respect to X can be computed as by using the chain rule.For ψ(X) where R is an N 1 × N 2 matrix whose entries r ij equal the number of patches the respective pixel x ij is contained in.Note that if the whole image is divided into a complete set of nonoverlapping patches, R will just be the all-ones matrix; otherwise, the element-wise multiplication with R undoes the averaging of pixel values performed by R when assembling an image from overlapping patches.
Finally, the gradient w.r.t.X of the objective in ( 7) is ∇ϕ(X) + ∇ψ(X) ∈ R N1×N2 , and the update in Step 3 of Algorithm 1 is indeed shown to be a projected gradient descent step.Typically, a backtracking (line search) strategy is used for choosing the step size γ X ; see Theorem 2 in Section III-D for a selection rule that gives theoretical convergence, and also Section IV-B for a heuristic alternative.

C. Updating the Dictionary
To update the dictionary, i.e., to approximately solve (4) w.r.t.D alone, keeping X and A fixed at their current values, we employ one pass of a block-coordinate descent (BCD) algorithm on the columns of the dictionary [23].The objective to decrease may be written as and the update rule given by Steps 4 -13 corresponds3 to one iteration of [21,Algorithm 11] applied to (8).
Algorithm 1 Dictionary learning for phase retrieval (DOLPHIn) choose step size γ A as explained in Section III-A and update 3: choose step size γ X as explained in Section III-D or IV-B and update where R is defined in Section III-B 4: do not update the dictionary: : reset j-th column: e.g., (D ( +1) To see this, note that each column update problem has a closed-form solution: 1 wj q j with w j := i (a i j ) 2 and q j := i a i j x i − k =j a i k d k ; here, we abbreviated a i := a i ( +1) , x i := x i ( +1) .If w j = 0, and thus a i j = 0 for all i = 1, . . ., p, then column d j is not used in any of the current patch representations; in that case, the column's update problem has a constant objective and is therefore solved by any d with d 2 ≤ 1, e.g., a normalized random vector as in Step 12 of Algorithm 1.The computations performed in Steps 8-13 of Algorithm 1 are equivalent to these solutions, expressed differently using the matrices B and C defined there.Note that the operations could be parallelized to speed up computation.

D. Convergence of the Algorithm
As mentioned at the beginning of this section, with appropriate step size choices, DOLPHIn (Algorithm 1) exhibits the property of monotonically decreasing the objective function value (4) at each iteration.In particular, many line-search type step size selection mechanisms aim precisely at reducing the objective; for simplicity, we will simply refer to such subroutines as "suitable backtracking schemes" below.Concrete examples are the ISTA backtracking from [24, Section 3] we can employ in the update of A, or the rule given in Theorem 2 for the projected gradient descent update of X (a different choice is described in Section IV-B); further variants are discussed, e.g., in [32].
The case of termination in Proposition 1 can occur when the backtracking scheme is combined with a maximal number of trial steps, which are often used as a safeguard against numerical stability problems or as a heuristic stopping condition to terminate the algorithm if no (sufficient) improvement can be reached even with tiny step sizes.Note also that the assertion of Proposition 1 trivially holds true if all step sizes are 0; naturally, a true descent of the objective requires a strictly positive step size in at least one update step.In our algorithm, step size positivity can always be guaranteed since all these updates involve objective functions whose gradient is Lipschitz continuous, and backtracking essentially finds step sizes inversely proportional to (local) Lipschitz constants.(Due to non-expansiveness, the projection in the X-update poses no problem either).
Adopting a specific Armijo-type step size selection rule for the X-update allows us to infer a convergence result, stated in Theorem 2 below.To simplify the presentation, let Theorem 2: Let η ∈ (0, 1) and η > 0. Consider the DOLPHIn variant consisting of Algorithm 1 with K 2 = ∞ and the following Armijo rule to be used in Step 3: Determine γ X as the largest number in ≤ ν for all and j (and i), then every accumulation point of the sequence {(A ( ) , X ( ) , D ( ) )} =0,1,2,... of DOLPHIn iterates is a stationary point of problem (4).
Proof: The proof works by expressing Algorithm 1 as a specific instantiation of the coordinate gradient descent (CGD) method from [34] and analyzing the objective descent achievable in each update of A, X and D, respectively.The technical details make the rigorous formal proof somewhat lengthy (it can be found in the appendix of the earlier preprint version [35] of this paper); due to space limitations, we only sketch it here.
The CGD method works by solving subproblems to obtain directions of improvement for blocks of variables at a time-in our case, (the columns of) A, the matrix X, and the columns of D correspond to such blocks-and then taking steps along these directions.More specifically, the directions are generated using a (suitably parameterized) strictly convex quadratic approximation of the objective (built using the gradient).Essentially due to the strict convexity, it is then always possible to make a positive step along such a direction that decreases the (original) objective, unless stationarity already holds.Using a certain Armijo line-search rule designed to find such positive step sizes which achieve a sufficient objective reduction, [34, Theorem 1] ensures (under mild further assumptions, which in our case essentially translate to the stated boundedness requirement of the step sizes) that every accumulation point of the iterate sequence is indeed a stationary point of the addressed (block-separable) problem.
To embed DOLPHIn into the CGD framework, we can interpret the difference between one iterate and the next (w.r.t. the variable "block" under consideration) as the improvement direction, and proceed to show that we can always choose a step size equal to 1 in the Armijo-criterion from [34] (cf.( 9) and (46) therein).For this to work out, we need to impose slightly stricter conditions on other parameters used to define that rule than what is needed in [34]; these conditions are derived directly from known descent properties of the Dand A-updates of our method (essentially, ISTA descent properties as in [24]).That way, the Dand Aupdates automatically satisfy the specific CGD Armijo rule, and the actual backtracking scheme for the X-update given in the present theorem can be shown to assert that our X-update does so as well.(The step sizes used in DOLPHIn could also be reinterpreted in the CGD framework as scaling factors of diagonal Hessian approximations of the combined objective to be used in the direction-finding subproblems.With such simple Hessian approximations, the obtained directions are then indeed equivalent to the iterate-differences resulting from the DOLPHIn update schemes.)The claim then follows directly from [34, Theorem 1(e) (and its extensions discussed in Section 8)].
A more formal explanation for why the step sizes can be chosen positive in each step can be found on page 392 of [34]; the boundedness of approximate Hessians is stated in [34, Assumption 1].Arguably, assuming step sizes are bounded away from zero by a constant may become problematic in theory (imagine an Armijo-generated step size sequence converging to zero), but will not pose a problem in practice where one always faces the limitations of numerical machine precision.(Note also that, in practice, the number of linesearch trials can be effectively reduced by choosing η based on the previous step size [34].) Our implementation uses a different backtracking scheme for the X-update (see Section IV-B) that can be viewed as a cheaper heuristic alternative to the stated Armijo-rule which still ensures monotonic objective descent (and hence is "suitable" in the context of Proposition 1), also enables strictly positive steps, and empirically performs equally well.Finally, we remark that the condition µ ≥ 1 in Theorem 2 can be dropped if the relevant objective parts of problem (4) are not rescaled for the Aand D-updates, respectively.
To conclude the discussion of convergence, we point out that one can obtain a linear rate of convergence for DOLPHIn with the Armijo rule from Theorem 2, by extending the results of [34, Theorem 2 (cf.Section 8)].

IV. NUMERICAL RESULTS
In this section, we discuss various numerical experiments to study the effectiveness of the DOLPHIn algorithm.To that end, we consider several types of linear operators F within our model (4) (namely, different types of Gaussian random operators and coded diffraction models).Details on the experimental setup and our implementation are given in the first two subsections, before presenting the main numerical results in Subsection IV-C.Our experiments demonstrate that with noisy measurements, DOLPHIn gives significantly better image reconstructions than the Wirtinger Flow method [12], one recent state-of-the-art phase retrieval algorithm, thereby showing that introducing sparsity via a (simultaneously) learned dictionary is indeed a promising new approach for signal reconstruction from noisy, phaseless, nonlinear measurements.Furthermore, we discuss sensitivity of DOLPHIn with regard to various (hyper-)parameter choices (Subsections.IV-D, IV-E and IV-F) and a variant in which the 1 -regularization term in the objective is replaced by explicit constraints on the sparsity of the patch-representation coefficient vectors a i (Subsection.IV-G).

A. Experimental Setup
We consider several linear operators F corresponding to different types of measurements that are classical in the phase retrieval literature.We denote by F the (normalized) 2D-Fourier operator (implemented using fast Fourier transforms), and introduce two complex Gaussian matrices G ∈ C M1×N1 , H ∈ C M2×N2 , whose entries are i.i.d.samples from the distribution N (0, I/2) + iN (0, I/2).Then, we experiment with the operators F(X) = GX, F(X) = GXG * , F(X) = GXH * , and the coded diffraction pattern model where Z j := Z {(j−1)N1+1,...,jN1},• (i.e., Z = (Z 1 , . . ., Z m )) and the M j 's are admissible coded diffraction patterns (CDPs), see for instance [12,Section 4.1]; in our experiments we used ternary CDPs, such that each M j is in {0, ±1} N1×N2 .(Later, we will also consider octanary CDPs with ) To reconstruct X, we choose an oversampling setting where M 1 = 4N 1 , M 2 = 4N 2 and/or m = 2, respectively.Moreover, we corrupt our measurements with additive white Gaussian noise N such that SNR(Y, |F( X)| 2 + N) = 10 dB for the Gaussian-type, and 20 dB for CDP measurements, respectively.Note that these settings yield, in particular, a relatively heavy noise level for the Gaussian cases and a relatively low oversampling ratio for the CDPs.

B. Implementation Details
We choose to initialize our algorithm with a simple random image X (0) in X to demonstrate the robustness of our approach with respect to its initialization.Nevertheless, other choices are possible.For instance, one may also initialize X (0) with a power-method scheme similar to that proposed in [12], modified to account for the real-valuedness and boxconstraints.The dictionary is initialized as , where F D corresponds to the two-dimensional discrete cosine transform (see, e.g., [20]).
To update A, we use the ISTA implementation from the SPAMS package 4 [23] with its integrated backtracking line search (for L A ). Regarding the step sizes γ X for the update of X (Step 3 of Algorithm 1), we adopt the following simple strategy, which is similar to that from [24] and may be viewed as a heuristic to the Armijo rule from Theorem 2: Whenever the gradient step leads to a reduction in the objective function value, we accept it.Otherwise, we recompute the step with γ X halved until a reduction is achieved; here, as a safeguard against numerical issues, we implemented a limit of 100 trials (forcing termination in case all were unsuccessful), but this was never reached in any of our computational experiments.Regardless of whether γ X was reduced or not, we reset its value to 1.68γ X for the next round; the initial step size is γ X 0 = 10 4 /f (0) , where f (0) is the objective function of the DOLPHIn model ( 4), evaluated at X (0) , D (0) and least-squares patch representations arg min A E(X (0) ) − D (0) A 2 F .(Note that, while this rule deviates from the theoretical convergence 4 http://spams-devel.gforge.inria.fr/Theorem 2, Propositon 1 and the remarks following it remain applicable.) Finally, we consider nonoverlapping 8 × 8 patches and run DOLPHIn (Algorithm 1) with K 1 = 25 and K 2 = 50; the regularization/penalty parameter values can be read from Table I (there, m Y is the number of elements of Y).We remark that these parameter values were empirically benchmarked to work well for the measurement setups and instances considered here; a discussion about the stability of our approach with respect to these parameter choices is presented below in Section IV-D.Further experiments with a sparsity-constrained DOLPHIn variant and using overlapping patches are discussed in Section IV-G.
Our DOLPHIn code is available online on the first author's webpage 5 .

C. Computational Experiments
We test our method on a collection of typical (grayscale) test images used in the literature, see Figure 1.All experiments were carried out on a Linux 64-bit quad-core machine (2.8 GHz, 8 GB RAM) running Matlab R2016a (single-thread).
We evaluate our approach with the following question in mind: Can we improve upon the quality of reconstruction compared to standard phase retrieval algorithms?Standard methods cannot exploit sparsity if the underlying basis or dictionary is unknown; as we will see, the introduced (patch-) sparsity indeed allows for better recovery results (at least in the oversampling and noise regimes considered here).
To evaluate the achievable sparsity, we look at the average number of nonzeros in the columns of A after running our algorithm.Generally, smaller values indicate an improved suitability of the learned dictionary for sparse patch coding (high values often occur if the regularization parameter λ is too small and the dictionary is learning the noise, which is something we would like to avoid).To assess the quality of the image reconstructions, we consider two standard measures, namely the peak signal-to-noise ratio (PSNR) of a reconstruction as well as its structural similarity index (SSIM) [36].For PSNR, larger values are better, and SSIM-values closer to 1 (always ranging between 0 and 1) indicate better visual quality.
Table I displays the CPU times, PSNR-and SSIM-values and mean patch representation vector sparsity levels obtained for the various measurement types, averaged over the instance groups of the same size.The concrete examples in Figures 2  and 3 show the results from DOLPHIn and plain Wirtinger Flow (WF; the real-valued, [0, 1]-box constrained variant, which corresponds to running Algorithm 1 with µ = 0 and omitting the updates of A and D).In all tests, we let the Wirtinger Flow method run for the same number of iterations (75) and use the same starting points as for the DOLPHIn runs.Note that instead of random X (0) , we could also use a spectral initialization similar to the one proposed for the (unconstrained) Wirtinger Flow algorithm, see [12].Such initialization can improve WF reconstruction (at least in the noiseless case), and may also provide better initial estimates for DOLPHIn.We have experimented with such a  The DOLPHIn method consistently provides better image reconstructions than WF, which clearly shows that our approach successfully introduces sparsity into the phase retrieval problem and exploits it for estimating the solution.As can be seen from Table I, the obtained dictionaries allow for rather sparse representation vectors, with the effect of making better use of the information provided by the measurements, and also denoising the image along the way.The latter fact can be seen in the examples (Fig. 2 and 3, see also Fig. 4) and also inferred from the significantly higher PSNR and SSIM values for the estimates X DOLPHIn and R(DA) (or P X (R(DA)), resp.)obtained from DOLPHIn compared to the reconstruction X WF of the WF algorithm (which cannot make use of hidden sparsity).The gain in reconstruction quality is more visible in the example of Fig. 3 (cf.Fig. 4) than for that in Fig. 2, though both cases assert higher quantitative measures.Furthermore, note that DOLPHIn naturally has higher running times than WF, since it performs more work per iteration (also, different types of measurement operators require different amounts of time to evaluate).Note also that storing A and D instead of an actual image X (such as the WF reconstruction) requires saving only about half as many numbers (including integer index pairs for the nonzero entries in A).As indicated earlier, the reconstruction R(DA) is quite often better than X DOLPHIn w.r.t. at least one of either PSNR or SSIM value.Nonetheless, X DOLPHIn may be visually more appealing than R(DA) even if the latter exhibits a higher quantitative quality measure (as is the case, for instance, in the example of Figures 3 and 4); Furthermore, occasionally X DOLPHIn achieves notably better (quantitative) measures than R(DA); an intuitive explanation may be that if, while the sparse coding of patches served well to eliminate the noise and-by means of the patch-fit objective term-to successfully "push" the X-update steps toward a solution of good quality, that solution eventually becomes "so good", then the fact that R(DA) is designed to be only an approximation (of X) predominates.
On the other hand, X DOLPHIn is sometimes very close to X WF , which indicates a suboptimal setting of the parameters µ and λ that control how much "feedback" the patch-fitting objective term introduces into the Wirtinger-Flow-like Xupdate in the DOLPHIn algorithm.We discuss parameter choices in more detail in the following subsection.

D. Hyperparameter Choices and Sensitivity
The DOLPHIn algorithm requires several parameters to be specified a priori.Most can be referred to as design parameters; the most prominent ones are the size of image patches (s 1 × s 2 ), whether patches should overlap or not (not given a name here), and the number n of dictionary atoms to learn.Furthermore, there are certain algorithmic parameters (in a broad sense) that need to be fixed, e.g., the iteration limits K 1 and K 2 or the initial dictionary D (0) and image estimate X (0) .The arguably most important parameters, however, are the model or regularization parameters µ and λ.For any fixed combination of design and algorithmic parameters in a certain measurement setup (fixed measurement type/model and (assumed) noise level), it is conceivable that one can find some values for µ and λ that work well for most instances, while the converse-choosing, say, iteration limits for fixed µ, λ and other parameters-is clearly not a very practical approach.
As is common for regularization parameters, a good "general-purpose" way to choose µ and λ a priori is unfortunately not known.To obtain the specific choices used in our experiments, we fixed all the other parameters (including noise SNR and oversampling ratios), then (for each measurement model) ran preliminary tests to identify values µ for which good results could be produced with some λ, and finally fixed µ at such values and ran extensive benchmark tests to find λ values that give the best results.
For DOLPHIn, µ offers some control over how much "feedback" from the current sparse approximation of the current image estimate is taken into account in the update step to produce the next image iterate-overly large values hinder the progress made by the Wirtinger-Flow-like part of the X-update, while too small values marginalize the influence of the approximation R(DA), with one consequence being that the automatic denoising feature is essentially lost.Nevertheless, in our experiments we found that DOLPHIn is not strongly sensitive to the specific choice of µ once a certain regime has been identified in which one is able to produce meaningful results (for some choice of λ).Hence, λ may be considered the most important parameter; note that this intuitively makes sense, as λ controls how strongly sparsity of the patch representations is actually promoted, the exploitation of which to obtain improved reconstructions being the driving purpose behind our development of DOLPHIn.
Figure 5 illustrates the sensitivity of DOLPHIn with respect to λ, in terms of reconstruction quality and achievable patchsparsity, for different noise levels, and examplary measurement types and problem sizes.(In this figure, image quality is measured by SSIM values alone; the plots using PSNR instead look very similar and were thus omitted.Note, however, that the parameters λ yielding the best SSIM and PSNR values, respectively, need not be the same.)As shown by (a) and (d), there is a clear correlation between the best reconstruction quality that can be achieved (in noisy settings) and the average sparsity of the patch representation vectors a i .For larger noise, clearly a larger λ is needed to achieve good resultssee (b) and (e)-which shows that a stronger promotion of patch-sparsity is an adequate response to increased noise, as is known for linear sparsity-based denoising as well.Similarly, increasing the number of measurements allows to pick a smaller λ whatever the noise level actually is, as can be seen in (c) and (f), respectively.The dependence of the best λ on the noise level appears to follow an exponential curve (w.r.t. the reciprocal SNR) which is "dampened" by the sampling ratio, i.e., becoming less steep and pronounced the more measurements are available, cf.(b) and (e).Indeed, again referring to the subplots (c) and (f), at a fixed noise level the best λ values seem to decrease exponentially with growing number of measurements.It remains subject of future research to investigate these dependencies in more detail, e.g., to come up with more or less general (functional) rules for choosing λ.

E. Impact of Increased Inner Iteration Counts
It is worth considering whether more inner iterations-i.e., consecutive update steps for the different variable blockslead to further improvements of the results and / or faster convergence.In general, this is an open question for blockcoordinate descent algorithms, so the choices are typically made empirically.Our default choices of a = 1 ISTA iterations for the A-update (Step 2 in Algorithm 1), x = 1 projected gradient descent steps for the X-update (Step 3) and d = 1 iterations of the BCD scheme for the D-update (Steps 4-13) primarily reflect the desire to keep the overall iteration cost low.To assess whether another choice might yield significant improvements, we evaluated the DOLPHIn performance for all combinations of a ∈ {1, 3, 5} and d ∈ {1, 3, 5}, keeping all other parameters equal to the settings from the experiments reported on above.(We also tried these combinations together with an increased iteration count for the X-update, but already for x = 2 or x = 3 the results were far worse than with just 1 projected gradient descent step; the reason can likely be found in the fact that without adapting A and D to a modified X-iterate, the patch-fitting term of the objective tries to keep X close to a then-unsuitable estimate R(DA) based on older X-iterates, which apparently has a quite notable negative influence on the achievable progress in the X-update loop.) The results are summarized in condensed format in Table II, from which we can read off the spread of the best and worst results (among the best ones achievable with either X or P X (R(DA))) for each measurement-instance combination among all combinations (a, d) ∈ {1, 3, 5} 2 .(Full tables for each test run can be found alongside our DOLPHIn code on the first author's webpage.)As the table shows, the results are all quite close; while some settings lead to sparser patch representations, the overall quality of the best reconstructions for the various combinations usually differ only slightly, and no particular combination stands out clearly as being better than all others.In particular, comparing the results with those in Table I, we find that our default settings provide consistently good results; they may be improved upon with some other combination of a and d, but at least with the same total iteration horizon (K 1 + K 2 = 75), the improvements are often only marginal.Since the overall runtime reductions (if any) obtained with other choices for (a, d) are also very small, there does not seem to be a clear advantage to using more than a single iteration for either update problem.

F. Influence of the First DOLPHIn Phase
Our algorithm keeps the dictionary fixed at its initialization for the first K 1 iterations in order to prevent the dictionary from training on relatively useless first reconstruction iterates.Indeed, if all variables including D are updated right from the beginning (i.e., K 1 = 0, K 2 = 75), then we end up with inferior results (keeping all other parameters unchanged): The obtained patch representations are much less sparse, the quality of the image estimate R(DA) decreases drastically, and also the quality of the reconstruction X becomes notably worse.This demonstrates that the dictionary apparently "learns too much noise" when updated from the beginning, and the positive effect of filtering out quite a lot of noise in the first phase by regularizing with sparsely coded patches using a fixed initial dictionary is almost completely lost.To give an example, for the CDP testset on 256 × 256 images, the average values obtained by DOLPHIn when updating also the dictionary from the first iteration onward are: 9.24 seconds runtime (versus 8.56 for default DOLPHIn, cf.Table I On the other hand, one might argue that if the first iterates are relatively worthless, the effort of updating A in the first DOLPHIn phase (i.e., the first K 1 iterations) could be saved as well.However, the influence of the objective terms involving D and A should then be completely removed from the algorithm for the first K 1 iterations; otherwise, the patchfitting term will certainly hinder progress made by the Xupdate because it then amounts to trying to keep X close to the initial estimate R(DA), which obviously needs not bear any resemblance to the sought solution at all.Thus, if both D and A are to be unused in the first phase, then one should temporarily set µ = 0, with the consequence that the first phase reduces to pure projected gradient descent for X with respect to the phase retrieval objective-i.e., essentially, Wirtinger Flow.Therefore, proceeding like this simply amounts to a different initialization of X. Experiments with this DOLPHIn variant (K 1 = 25 initial WF iterations followed by K 2 = 50 full DOLPHIn iterations including A and D; all other parameters again left unchanged) showed that the achievable patch-sparsities remain about the same for the Gaussian measurement types but become much worse for ).The reason for the observed behavior can be found in the inferior performance of WF without exploiting patch-sparsity (cf.also Table I); note that the results also further demonstrate DOLPHIn's robustness w.r.t. the initial point-apparently, the initial point obtained from some WF iterations is not more helpful for DOLPHIn than a random first guess.Finally, it is also worth considering what happens if the dictionary updates are turned off completely, i.e., K 2 = 0.Then, DOLPHIn reduces to patch-sparsity regularized Wirtinger Flow, a WF variant that apparently was not considered previously.Additional experiments with K 1 = 75, K 2 = 0 and D = D (0) = (I, F D ) (other parameters left unchanged) showed that this variant consistently produces higher sparsity (i.e., smaller average number of nonzero entries) of the patch representation coefficient vectors, but that the best reconstruction (X or P X (R(DA))) is always significantly inferior to the best one produced by our default version of DOLPHIn.The first observation is explained by the fact that with D fixed throughout, the patch coding (A-update) only needs to adapt w.r.t.new X-iterates but not a new D; at least if the new X is not too different from the previous one, the former representation coefficient vectors still yield an acceptable approximation, which no longer holds true if the dictionary was also modified.While it should also me mentioned that the patch-sparsity regularized version performs much better than plain WF already, the second observation clearly indicates the additional benefit of working with trained dictionaries, i.e., superiority of DOLPHIn also over the regularized WF variant.
Full tables containing results for all testruns reported on in this subsection are again available online along with our DOLPHIn code on the first author's website.

G. Sparsity-Constrained DOLPHIn Variant
From Table I and Figure 5, (a) and (d), it becomes apparent that a sparsity level of 8 ± 4 accompanies the good reconstructions by DOLPHIn.This suggests use in a DOLPHIn variant we already briefly hinted at: Instead of using 1norm regularization, we could incorporate explicit sparsity constraints on the a i .The corresponding DOLPHIn model then reads where k is the target sparsity level.We no longer need to tune the λ parameter, and can let our previous experimental results guide the choice of k.Note that the only modification to Algorithm 1 concerns the update of A (Step 2), which now requires solving or approximating In our implementation, we do so by running Orthogonal Matching Pursuit (OMP) [37] for each column a i of A separately until either the sparsity bound is reached or x i − Da i 2 ≤ ε, where we set a default of ε := 0.1.(The value of ε is again a parameter that might need thorough benchmarking; obviously, it is related to the amount of noise one wants to filter out by sparsely representing the patcheshigher noise levels will require larger ε values.The 0.1 default worked quite well in our experiments, but could probably be improved by benchmarking as well.)The OMP code we used is also part of the SPAMS package.
The effect of the parameter µ is more pronounced in the sparsity-constrained DOLPHIn than in Algorithm 1; however, it appears its choice is less dependent on the measurement model used.By just a few experimental runs, we found that good results in all our test cases can be achieved using K 1 = K 2 = 25 iterations, where in the first K 1 (with fixed dictionary), we use a value of µ = µ 1 = 0.005m Y along with a sparsity bound k = k 1 = 4, and in the final K 2 iterations (in which the dictionary is updated), µ = µ 2 = 1.68µ 1 = 0.0084m Y and k = k 2 = 8. Results on the same instances considered before (using the same D (0) ) are presented in Table III, with the exception that for the CDP case, we used 2 complex-valued octanary masks (cf.[12]) here; the initial image estimates and measurement noise were again chosen randomly.Note also that for these test, we used complete sets of overlapping patches.This greatly increases p and hence the number of subproblems to be solved in the A-update step, which is the main reason for the increased running times compared to Table I for the standard DOLPHIn method.(It should however also be mentioned that OMP requires up to k iterations per subproblem, while in Algorithm 1 we explicitly restricted the number of ISTA iterations in the A-update to just a single one.) A concrete example is given in Figure 6; here, we consider the color "mandrill" image, for which the reconstruction algorithms (given just two quite heavily noise-corrupted octanary CDP measurements) were run on each of the three RGB channels separately.The sparsity-constrained DOLPHIn reconstructions appear superior to the plain WF solution both visually and in terms of the quality measures PSNR and SSIM.

V. DISCUSSION AND CONCLUSION
In this paper, we introduced a new method, called DOL-PHIn, for dictionary learning from noisy nonlinear measurements without phase information.In the context of image reconstruction, the algorithm fuses a variant of the recent Wirtinger Flow method for phase retrieval with a patch-based dictionary learning model to obtain sparse representations of image patches, and yields monotonic objective decrease or (with appropriate step size selection) convergence to a stationary point for the nonconvex combined DOLPHIn model.
Our experiments demonstrate that dictionary learning for phase retrieval with a patch-based sparsity is a promising direction, especially for cases in which the original Wirtinger Flow approach fails (due to high noise levels and/or relatively low sampling rates).
Several aspects remain open for future research.For instance, regarding the generally difficult task of parameter tuning, additional benchmarking for to-be-identified instance settings of special interest could give further insights on how to choose, e.g., the regularization parameters in relation to varying noise levels.
It may also be worth developing further variants of our algorithm; the successful use of 0 -constraints instead of the 1 -penalty, combining OMP with our framework, is just one example.Perhaps most importantly, future research will be directed towards the "classic" phase retrieval problem in which one is given the (squared) magnitudes of Fourier measurements, see, e.g., [2], [1], [5].Here, the WF method fails, and existing other (projection-based) methods are not always reliable either.The hope is that introducing sparsity via a learned dictionary will also enable improved reconstructions in the Fourier setting.
To evaluate the quality of the learned dictionary, one might also ask how DOLPHIn compares to the straightforward approach to first run (standard) phase retrieval and then learn dictionary and sparse patch representations from the result.Some preliminary experiments (see also those in Section IV-F pertaining to keeping both A and D fixed in the first DOLPHIn phase) indicate that both approaches produce comparable results in the noisefree setting, while our numerical results demonstrate a denoising feature of our algorithm that the simple approach obviously lacks.
Similary, it will be of interest to see if the dictionaries learned by DOLPHIn can be used successfully within sparsityaware methods (e.g., the Thresholded WF proposed in [13], if that were modified to handle local (patch-based) sparsity instead of global priors).In particular, learning dictionaries for patch representations of images from a whole class of images would then be an interesting point to consider.To that end, note that the DOLPHIn model and algorithm can easily be extended to multiple input images whose patches are all to be represented using a single dictionary by summing up the objectives for each separate image, but with the same D throughout.
Another interesting aspect to evaluate is by how much reconstruction quality and achievable sparsity degrade due to the loss of phase (or, more generally, measurement nonlinearity), compared to the linear measurement case.

APPENDIX
In the following, we derive the DOLPHIn algorithm for the one-dimensional setting (1).In particular, the (gradient) formulas for the 2D-case can be obtained by applying the ones given below to the vectorized image x = vec(X) (stacking columns of X on top of each other to form the vector x), the vectorized matrix a = vec(A), and interpreting the matrix F ∈ C M ×N as describing the linear transformation corresponding to F in terms of the vectorized variables.
We now have a patch-extraction matrix P e ∈ R ps×N which gives us P e x = ((x 1 ) , . . ., (x p ) ) (in the vectorized 2Dcase, x i then is the vectorized i-th patch of X, i.e., P e corresponds to E).Similarly, we have a patch-reassembly matrix P a ∈ R N ×ps ; then, the reassembled signal vector will be P a ((a 1 ) D , . . ., (a p ) D ) (so P a corresponds to R).Note that P a = P † e = P e P e −1 P e ; in particular, x = P a P e x, and P e P e is a diagonal matrix for which each diagonal entry is associated to a specific vector component and gives the number of patches this component is contained in.(Thus, if x = vec(X) is a vectorized 2D-image, P e P e x = vec(R X) with R as defined in Section III-B.)Note that for nonoverlapping patches, P e is simply a permutation matrix, and P a = P e (so P a P e = I).Also, applying just P e actually reassembles a signal from patches by simply adding the component's values without averaging.We wish to represent each patch as x i ≈ Da i with sparse coefficient vectors a i ; with a := ((a 1 ) , . . ., (a p ) ) and D := I p ⊗D, this sparse-approximation relation can be written here, y := |Fx| 2 + n, with x the original signal we wish to recover and n a noise vector.
The update formulas for a (separately for a 1 , . . ., a p ) and D remain the same as described before, see Sections III-A and III-C, respectively.However, the update problem for the phase retrieval solution-now derived from (11), with D and a fixed at their current values-becomes decreasing the objective  We approach this by means of a projected gradient descent step; since we consider real-valued x-variables, this essentially amounts to (one iteration of) the Wirtinger Flow method [12], accommodating the [0, 1]-box constraints via projection onto them after the (Wirtinger) gradient step.The step size will be chosen to achieve a reduction of the objective w.r.t.its value at the previous x.

Fig. 6 .
Fig.6.Sparsity-Constrained DOLPHIn example: Image original is the 512 × 512 RGB "mandrill" picture, measurements are noisy CDPs (obtained using two complex-valued octanary masks, noise-SNR 10 dB, per color channel), µ 1 = 0.003 m Y , µ 2 = 0.00504 m Y (other parameters as described in Section IV-G).D (0) = (I, F D ) for R-channel, final dictionary then served as initial dictionary for G-channel, whose final dictionary in turn was initial dictionary for B-channel; X (0) ∈ X random for each channel.(a) final dictionary (excerpt) for B-channel (b) image reconstruction X DOLPHIn , (c) image reconstruction R(DA) from sparsely coded patches, (d) reconstruction X WF after 50 WF iterations.Final PSNR values: 20.53 dB for R(DA), 20.79 dB for X DOLPHIn , 14.47 dB for X WF ; final SSIM values: 0.4780 for R(DA), 0.5242 for X DOLPHIn , 0.2961 for X WF ; average a i 0 is 5.30.(Means over all color channels.)(e)-(h): zoomed-in 100 × 100 pixel parts (magnified) of (e) original image, (f) X DOLPHIn , (g) R(DA) and (h) X WF .

TABLE II TEST
RESULTS FOR DOLPHIN VARIANTS WITH DIFFERENT COMBINATIONS OF INNER ITERATION NUMBERS a AND d FOR THE A-AND D-UPDATES, RESP.REPORTED ARE THE BEST MEAN VALUES ACHIEVABLE (VIA EITHER X DOLPHIN OR P X (R(DA))) WITH ANY OF THE CONSIDERED COMBINATIONS (a, d) ∈ {1, 3, 5} × {1, 3, 5} (FIRST ROWS FOR EACH MEASUREMENT TYPE) AND THE WORST AMONG THE BEST VALUES FOR EACH COMBINATION (SECOND ROWS), ALONG WITH THE RESPECTIVE COMBINATIONS YIELDING THE STATED VALUES.ALL OTHER PARAMETERS ARE IDENTICAL TO THOSE USED IN THE EXPERIMENTS FOR DEFAULT DOLPHIN (a = d = 1), CF.TABLE I. the CDP setup, and that the average PSNR and SSIM values become (often clearly) worse in virtually all cases.In the example of measurements G X of the 512 × 512 test images, the above-described DOLPHIn variant runs 62.34 seconds on average (as less effort is spent in the first phase, this is naturally lower than the 68.62 seconds default DOLPHIn takes), produces slightly lower average patch-sparsity (5.88 vs. 6.30 for default DOLPHIn), but for both X and P X (R(DA)), the PSNR and SSIM values are notably worse (22.32 dB and 20.55 dB vs. 24.42dB and 22.66 dB, and 0.6165 and 0.5921 vs. 0.6547 and 0.6807, resp.