Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning

Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets. The main contribution of this paper is the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. This is one of the first applications of machine learning to phylogenetic studies


Introduction
Phylogenetic networks describe the evolutionary relationships between, for example, genes, genomes, or species.One of the first and most natural approaches to construct phylogenetic networks is to build a network from a set of gene trees.In the absence of incomplete lineage sorting, the constructed network is naturally required to "display", or embed, each of the gene trees.In addition, following the parsimony principle, a network assuming a minimum

16:2
Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning number of reticulate evolutionary events (like hybridization or lateral gene transfer) is often sought.Unfortunately, the associated computational problem, called Hybridization, is NP-hard even for two binary input trees with equal leaf sets [4].
For a long time, research on this topic was mostly restricted to inputs consisting of two trees.Proposed algorithms for multiple trees were either completely impractical or ran in reasonable time only for very small numbers of input trees.This situation changed drastically with the introduction of so-called cherry-picking sequences [10].This theoretical set-up opened the door to solving instances consisting of many input trees, like most practical datasets have.Indeed, a recent paper showed that this technique can be used to solve instances with up to 100 input trees to optimality [19], although it was restricted to binary trees all having the same leaf set and to so-called "tree-child" networks.Moreover, its running time has a (strong) exponential dependence on the number of reticulate events.
In this paper, we show significant progress towards a fully practical method by developing heuristics that are based on cherry picking and machine learning.Admittedly, our method is not yet widely applicable since it is still restricted to binary trees.However, our set-up is made in such a way that it can be extended to general trees, and we plan to do so in future work.Furthermore, although our method can theoretically be applied to trees with different leaf sets, in this paper we only conduct experiments on input trees with equal leaf sets.Still, we see our current method already as a breakthrough as it scales well with number of trees, number of taxa and number of reticulations.In fact, we experimentally show that it can easily handle sets of 100 trees in reasonable time (13 minutes on average for sets consisting of trees with 100 leaves each).We also have a faster version of the heuristic that already finds feasible solutions in 8 seconds for the same instances.As the running time depends at most quadratically on the input size, and linearly on the output number of reticulations, we expect it to be able to solve much larger instances still in a reasonable amount of time.In addition, our method is not restricted to tree-child networks.
The method we present is one of the first applications of machine learning in phylogenetics, and shows its promise.In particular, for the networks generated in our simulation study, it shows that features of interest of the networks can be identified with very high test accuracy (99.8%) purely based on the trees displayed by the networks.It is important to note at this point that no method is able to reconstruct any specific network from displayed trees as networks are, in general, not uniquely determined by the trees they display [12].
We focus on orchard networks (also called cherry picking networks), which are precisely those networks that can be drawn as a tree with additional horizontal arcs [17].Such horizontal arcs can for example correspond to lateral gene transfer events (LGT).Nevertheless, orchard networks are applicable more broadly because also networks in which reticulations represent hybridization or recombination can be orchard networks.In particular, the orchard networks class is much bigger than the class of tree-child networks.

Related work.
Previous practical algorithms for Hybridization include PIRN [21], PIRNs [11] and Hybroscale [1], exact methods that are only applicable to (very) small numbers of trees and/or to trees that can be combined into a network with a (very) small reticulation number.
The theoretical framework of cherry picking was introduced in [7] (for the restricted class of temporal networks) and [10] (for the class of tree-child networks) and was turned into algorithms for reconstructing tree-child [19] and temporal [5] networks.These methods can handle instances containing many trees but do not scale well with the number of reticulations, due to an exponential dependence.Other methods such as PhyloNet [20] and

16:3
PhyloNetworks [16] also construct networks from trees but have different premises and use completely different models.The class of orchard networks, which is based on cherry picking, was introduced in [15] and independently (as cherry-picking networks) in [8], although their practical relevance, as trees with added horizontal edges, was only discovered later [17].
The applicability of machine-learning techniques to Phylogenetics has not yet been fully explored, and to the best of our knowledge existing work is mainly limited to phylogenetic trees inference [2,22] and to testing evolutionary hypotheses [9].
Our contributions.We introduce the CPH class of heuristics to combine a set of binary phylogenetic trees into a single binary phylogenetic network based on cherry picking.Specifically, we address the Hybridization problem, asking to combine a set of trees in a single network with the minimum possible reticulation number.
We define and analyse several heuristics in the CPH class, all of which are guaranteed to produce feasible solutions to Hybridization and all of which can handle instances of practical size.Two of the methods we propose are simple randomised heuristics that showed to be extremely fast and to produce good solutions when run multiple times.
The main contribution of this paper consists in a machine-learning model that potentially captures essential information about the structure of the input set of trees.We trained the model over an extensive set of synthetically generated data, and applied it to guide our algorithms towards better solutions.In proof-of-concept experiments we show that the two machine-learned heuristics we design yield satisfactory results when applied to both synthetically generated and real data.

Preliminaries
A phylogenetic network N = (V, E, X) on a set of taxa X is a directed acyclic graph (V, E) with a single root with in-degree 0 and out-degree 1, and the other nodes with either (i) in-degree 1 and out-degree k > 1 (tree nodes); (ii) in-degree k > 1 and out-degree 1 (reticulations); or (iii) in-degree 1 and out-degree 0 (leaves).The leaves of N are biunivocally labelled by X.A surjective map ℓ : E → R ≥0 may assign a nonnegative branch length to each edge of N .We will denote by [1, n] the set of integers {1, 2, ..., n}.
Throughout this paper, we will only consider binary networks (with k = 2), and we will identify the leaves with their labels.We will also often drop the term "phylogenetic", as all the networks considered in this paper are phylogenetic networks.The reticulation number r(N ) of a network N is v∈V max (0, d − (v) − 1) , where d − (v) is the in-degree of v.A network T with r(T ) = 0 is a phylogenetic tree.It is easy to verify that binary networks with r(N ) reticulations have |X| + r(N ) − 1 tree nodes.

Cherry-picking.
We denote by N a set of networks and by T a set of trees.An ordered pair of leaves (x, y), x ̸ = y, is a cherry in a network if x and y have the same parent; (x, y) is a reticulated cherry if the parent p(x) of x is a reticulation, and p(y) is a tree node and a parent of p(x) (see Figure 1).A pair is reducible if it is either a cherry or a reticulated cherry.
Reducing (or picking) a cherry (x, y) in a network N is the action of deleting x and replacing the two edges (p(p(x)), p(x)) and (p(x), y) with a single edge (p(p(x)), y) (see Figure 1a).If N has branch lengths, the length of the new edge is ℓ(p(p(x)), y) = ℓ(p(p(x)), p(x)) + ℓ(p(x), y).A reticulated cherry (x, y) is reduced (picked) by deleting the edge (p(y), p(x)) and replacing the other edge (z, p(x)) incoming to p(x), and the consecutive edge (p(x), x), with a single Figure 1 Cherry (x, y) is picked in two different networks.In Fig. 1a (x, y) is a cherry, and in Fig. 1b (x, y) is a reticulated cherry.After picking, degree-two nodes are replaced by a single edge.edge (z, x).The length of the new edge is ℓ(z, x) = ℓ(z, p(x)) + ℓ(p(x), x) (if N has branch lengths).Reducing a non-reducible pair has no effect on N .In all cases, the resulting network is denoted by N (x,y) : we say that (x, y) affects N if N ̸ = N (x,y) .
Any sequence S = (x 1 , y 1 ), . . ., (x n , y n ) of ordered leaf pairs, with x i ̸ = y i for all i, is a partial cherry-picking sequence; S is a cherry-picking sequence (CPS) if, for each i < n, y i ∈ {x i+1 , . . ., x n , y n }.Given a network N and a (partial) CPS S, we denote by N S the network obtained by reducing in N each element of S, in order.We say that S fully reduces N if N S consists of the root with a single leaf.N is an orchard network (ON) if there exists a CPS that fully reduces it.If S fully reduces all N ∈ N , we say that S fully reduces N .In particular in this paper we will be interested in CPS that fully reduce a set of trees.
Hybridization.The Hybridization problem can be thought of as the computational problem of combining a set of phylogenetic trees into a network with the smallest possible reticulation number, that is, to find a network that displays each of the input trees in the sense specified by Definition 1. See Figure 2 for an example.The definition describes not only what it means to display a tree but also to display another network, which will be useful later.▶ Definition 1.Let N = (V, E, X) and N ′ = (V ′ , E ′ , X ′ ) be networks on the sets of taxa X and X ′ ⊆ X, respectively.The network N ′ is displayed in N if there is an embedding of N ′ in N : an injective map of the nodes of N ′ to the nodes of N , and of the edges of N ′ to edge-disjoint paths of N , such that the mapping of the edges respects the mapping of the nodes, and the mapping of the nodes respects the labelling of the leaves.
We call exhaustive a tree displayed in N = (V, E, X) with the whole X as leaf set.Note that Definition 1 only involves the topologies of the networks, disregarding possible branch lengths.In the following problem definition, the input trees may or may not have branch lengths, and the output is a network without branch lengths.We allow branch lengths for the input because they will be useful for the machine-learned heuristics of Section 4.

16:5
Figure 3 The ON reconstructed from the sequence S = (x, y), (x, w), (w, y).The pairs are added to the network in reverse order: if the first element of a pair is not yet in the network, it is added as a cherry with the second element (see the pair (x, w)).Otherwise, a reticulation is added above the first element with an incoming edge from a new parent of the second element (see the pair (x, y)).

Hybridization
Input: A set of phylogenetic trees T on a set of taxa X. Output: A network displaying T with minimum possible reticulation number.

Solving the Hybridization Problem via Cherry-Picking Sequences
We will develop heuristics for the Hybridization problem using cherry-picking sequences that fully reduce the input trees, leveraging the following result by Janssen and Murakami.
▶ Theorem 2 ([8], Theorem 3).Let N be a binary orchard network, and N ′ a (not necessarily binary) orchard network on sets of taxa X and X ′ ⊆ X, respectively.If a minimum-length CPS S that fully reduces N also fully reduces N ′ , then N ′ is displayed in N .
For binary ON, the following lemma, which is a special case of [8, Lemma 1], also holds.
▶ Lemma 3. Let N be a binary orchard network, and let (x, y) be a reducible pair of N .Then reducing (x, y) and then adding it back to N (x,y) results in N .
Theorem 2 and Lemma 3 provide the following approach for finding a feasible solution to Hybridization: find a CPS S that fully reduces all the input trees, and then reconstruct the unique binary orchard network N for which S is a minimum-length CPS.N can be reconstructed from S using one of the methods underlying Lemma 3 proposed in the literature, e.g., in [8] (illustrated in Figure 3) or in [19].In doing so, the number of reticulations of the resulting N is r(N ) = |S| − |X| + 1 [18].In the next section we focus on the first part of the heuristic: producing a CPS that fully reduces a given set of phylogenetic trees.

Randomised Heuristics
We define a class of randomised heuristics that construct a CPS by picking one reducible pair of the input set T at a time and by appending this pair to a growing partial sequence, as described in Algorithm 1 (the two subroutines PickNext and CompleteSeq will be described in details below).We call this class CPH (for Cherry-Picking Heuristics).Recall that T S denotes the set of trees T after reducing all trees with a (partial) CPS S.

16:6
Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning Algorithm 1 CPH.

INPUT: A set T of phylogenetic trees OUTPUT:
A CPS reducing T .

5:
Reduce (x, y) in all trees of TS; 6: S ← CompleteSeq(S); 7: return S; The while loop at lines 2-5 produces, in general, a partial CPS S, as shown in Example 4. To make it into a CPS, the subroutine CompleteSeq at line 6 appends at the end of S a sequence S ′ of pairs such that each second element in a pair of S • S ′ is a first element in a later pair (except for the last one), as required by the definition of CPS.These additional pairs do not affect the trees in T , that are already fully reduced by S. Algorithm 2 in Appendix A describes a procedure CompleteSeq that runs in time linear in the length of S.
▶ Example 4. Let T consist of the two 2-leaf trees (x, y) and (w, z).A partial CPS at the end of the while loop in Algorithm 1 could be, e.g., S = (x, y), (w, z).The two trees are both reduced to one leaf, so there are no more reducible pairs, but S is not a CPS.To make it into a CPS it suffices to append either pair (y, z) or pair (z, y): e.g., S • (y, z) = (x, y), (w, z), (y, z) is a CPS, and it still fully reduces the two input trees.
The class of heuristics described in Algorithm 1 is concretised in different heuristics depending on the function PickNext at line 3, which is used to choose a reducible pair at each iteration.To formulate them below we need to introduce the notions of height pair and trivial pair.Let N be a network with branch lengths and let (x, y) be a reducible pair in N .The height pair of (x, y) in N is a pair , where h N x = ℓ(p(x), x) and h N y = ℓ(p(y), y) if (x, y) is a cherry, and h N x = ℓ(p(y), p(x)) + ℓ(p(x), x) and h N y = ℓ(p(y), y) if (x, y) is a reticulated cherry.The height h N (x,y) of (x, y) is the average (h N x + h N y )/2 of h N x and h N y .Let T be a set of trees whose leaf sets are subsets of a set of taxa X.An ordered leaf pair (x, y) is a trivial pair of T if it is reducible in each T ∈ T that contains both x and y, and there is at least one tree in which it is reducible.We define the following three heuristics in the CPH class, resulting from as many possible implementations of PickNext.Rand.Function PickNext picks uniformly at random a reducible pair of T S .LowPair.Function PickNext picks a reducible pair (x, y) with the lowest average of values h T (x,y) over all T ∈ T S in which (x, y) is reducible (ties are broken randomly).TrivialRand.Function PickNext picks a trivial pair if there exists one, and otherwise picks a reducible pair of T S uniformly at random.

▶ Theorem 5. Algorithm 1 computes a CPS that fully reduces T , for any function
PickNext that picks, in each iteration, a reducible pair of T S .

16:7
In Section 5 we experimentally show that TrivialRand produces the best results among the proposed randomised heuristics.Rand is the fastest heuristic, its running time being trivially bounded by O(|T | 2 |X|) (see Lemma 9 in Appendix A).In the next section we introduce a further heuristic step for TrivialRand which improves the quality of the output even more.

Improving Heuristic TrivialRand via Tree Expansion
With respect to a trivial pair (x, y) each tree T in the set is of one of the following types: (i) (x, y) is a reducible pair of T ; or (ii) neither x nor y are leaves of T ; or (iii) y is a leaf of T but x is not; or (iv) x is a leaf of T but y is not.Suppose that at some iteration of TrivialRand, the subroutine PickNext returns the trivial pair (x, y).Then, before reducing (x, y) in all trees, we do the following extra step: for each tree of type (iv), replace leaf x with cherry (x, y).We call this operation the tree expansion of T : see Figure 4c.The effect of this step is that, after reducing (x, y), leaf x disappears from the set of trees, which would have not necessarily been the case before, because of trees of type (iv).The tree expansion followed by the reduction of (x, y) can, alternatively, be seen as relabelling leaf x in any tree of type (iv) by y.To guarantee that a CPS S produced with tree expansion implies a feasible solution for Hybridization, we must show that the network N reconstructed from S displays all the trees in the input set T .We prove that indeed this is the case with the following lemma.▶ Lemma 6.Let S be the CPS outputted by TrivialRand with input T and let S contain j ≥ 0 trivial pairs.Then the network reconstructed from S displays all the trees in T .
Proof.We define the network expansion of T with a trivial pair, denoted by T exp , as follows.Let S i−1 be the partial CPS constructed in the first i − 1 steps of TrivialRand, and let i be the first step in which we pick a trivial pair (x, y).For each T ∈ T that is reduced by S i−1 to a tree T S i−1 of type (iv) for (x, y), let S i−1 T be the subsequence of S i−1 consisting only of the pairs that subsequently affect T .We use the partial CPS S i−1 T • (x, y) to reconstruct a network N exp T with a method underlying Lemma 3, starting from T S i−1 : see Figure 4d.For trees of type (i)-(iii), N exp T = T .The set T exp then consists of networks N exp T for all T ∈ T .Note that, by construction and Lemma 3, all the elements of T exp S i−1 •(x,y) are trees.We can generalise this notion to multiple trivial pairs: we denote by T exp(j) the network expansion of T with the j-th trivial pair, (w, z) say, and suppose it is added to the partial CPS S at the k-th step.Consider a tree T ′ ∈ T exp(j−1) S k−1 of type (iv) for (w, z), and let .The elements of T exp(j) are all networks N exp(j) T . For completeness, we define T exp(0) = T and T exp (1) = T exp .By construction, S fully reduces all the networks in T exp (j) , thus the network N reconstructed from S displays all of them by Theorem 2. We prove that N exp(j) T displays T for all T ∈ T , and thus N displays the original tree set T too, by induction on j.
In the base case we pick j = 0 trivial pairs, so the statement is true by Theorem 2. Now let j > 0. The induction hypothesis is that each network N exp(j−1) T ∈ T exp(j−1) displays the tree T it originates from.Let (w, z) be the j-th trivial pair, added to the sequence at position be a tree of type (iv) for (w, z), and let N exp(j−1) T be the network it originates from.Then there are two possibilities: either z is a leaf of N exp(j−1) T or it is not.In case it is not, then adding (w, z) to N exp(j−1) T does not create any new reticulation, and clearly N exp(j) T keeps displaying T .If z does appear in N exp(j−1) T , then it must have been reduced by a pair (z, v) of S k−1 (otherwise T ′ would not be of type (iv)).Then the network N exp(j) T has an extra reticulation, created with the insertion of (z, v) at some point after (w, z) during the backwards reconstruction.In both cases, by [8, Lemma 10] , and thus by the induction hypothesis T is displayed too.◀

Good Cherries in Theory
By Lemma 3, the binary network N reconstructed from a CPS S is such that S is of minimum length for N .By Theorem 2, if S, in turn, fully reduces T , then N displays all the trees in T .Depending on S, though, N is not necessarily an optimal network (that is, with minimum reticulation number) among the ones displaying T .Let OPT(T ) denote the set of networks that display T with the minimum possible reticulation number.Ideally, we would like to produce a CPS fully reducing T that is also a minimum-length CPS fully reducing some network of OPT(T ).In other words, we aim to find a CPS S = (x 1 , y 1 ), . . ., (x n , y n ) such that, for any i ∈ [1, n], (x i , y i ) is a reducible pair of Ñ Si−1 , where S0 = ∅, Sk = (x 1 , y 1 ), . . ., (x k , y k ) for all k ∈ [1, n], and Ñ ∈ OPT(T ).
Let S = (x 1 , y 1 ), . . ., (x n , y n ) be a CPS fully reducing T and let OPT S k (T ) consist of all networks N ∈ OPT(T ) such that each pair

▶ Lemma 7. A CPS S reducing T reconstructs an optimal network Ñ if and only if each pair
Proof. ( =⇒ ) By Lemma 3, S is a minimum-length CPS for the network Ñ that is reconstructed from it; and a CPS C = (w 1 , z 1 ), . . ., (w n , z n ) reducing a network N is of minimum length precisely if, for all j ∈ [1, n], (w j , z j ) is a reducible pair of N C j−1 (otherwise the pair (w j , z j ) could be removed from C and the new sequence would still reduce N ).( ⇐= ) If all pairs of S affect some optimal network Ñ , then S is a minimum-length CPS for Ñ , thus Ñ is reconstructed from S (and it displays T by Theorem 2).◀ Lemma 7 implies that if some pair (x i , y i ) of S does not reduce any network in OPT S i−1 (T ), then the network reconstructed from S is not optimal: see Example 8.
▶ Example 8. Consider the set T of Figure 2b: S = (y, x), (y, z), (w, x), (x, z) is a CPS that fully reduces T and consists only of pairs successively reducible in the network N of Fig. 2a, thus it reconstructs it by Lemma 3. Now consider (w, x), which is reducible in T but not in N , and pick it as first pair, to obtain e.g. S ′ = (w, x), (y, z), (y, x), (w, x), (x, z).The network N ′ reconstructed from S ′ , depicted in Figure 5, has r(N ′ ) = 2, whereas r(N ) = 1.Suppose we are incrementally constructing a CPS S = (x 1 , y 1 ), . . ., (x n , y n ) for T with some heuristic in the CPH class.If we had an oracle that at each iteration i told us if a reducible pair (x, y) of T S i−1 were a reducible pair in some N ∈ OPT S i−1 (T ), then, by Lemma 7, we could solve Hybridization optimally.Unfortunately no such exact oracle can exist (unless P = N P ).However, in the next section we exploit this idea to design machine-learned heuristics in the CPH framework.

Predicting Good Cherries via Machine Learning
In this section, we present a supervised machine-learning classifier that (imperfectly) simulates the ideal oracle described at the end of Section 3.3.The goal is to predict, based on T , whether a given cherry of T is a reducible pair in a network N displaying T with a close-to-optimal number of reticulations, without knowing N .Based on Lemma 7, we then exploit the output of the classifier to define new functions PickNext, that in turn define new machine-learned heuristics in the class of CPH (Algorithm 1).
Specifically, we train a random forest classifier on data that encapsulates information on the cherries in the tree set.Each reducible pair in T S is represented by one data point.Each data point is a pair (F, c), where F is an array of the features of cherry (x, y) and c is an array containing the probability that the cherry belongs to each of the possible classes described below.Recall that cherries are ordered pairs, so (x, y) and (y, x) give rise to two distinct data points.The classification model learns the association between F and c.
The true class of a cherry (x, y) of T depends on whether, for the (unknown) network N that we aim to reconstruct: (class 1) (x, y) is a cherry of N ; (class 2) (x, y) is a reticulated cherry of N ; (class 3) (x, y) is not reducible in N , but (y, x) is a reticulated cherry; or (class 4) neither (x, y) nor (y, x) are reducible in N .Thus, for the data point of a cherry (x, y), c[i] contains the probability that (x, y) is in class i, and c[1] + c [2] gives the predicted probability that (x, y) is reducible in N .We define the following two heuristics in the CPH framework.
ML.Given a threshold t ∈ [0, 1), function PickNext picks the cherry with the highest predicted probability of being reducible in N , if this probability is at least t; or a random cherry if none of them have a probability of being reducible above the threshold.
TrivialML.Function PickNext picks a random trivial pair, if there exists one; otherwise it uses the same rules as ML.
In both cases, whenever a trivial pair is picked, we do tree expansion, as described in Section 3.2.Note that if t = 0, since the predicted probabilities are never exactly 0, ML is fully deterministic.In Section 5 we study the performance of ML with different thresholds.

WA B I 2 0 2 2 16:10 Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning
Table 1 Features of a cherry (x, y).Features 6-12 can be computed for both branch lengths and unweighted branches.We refer to these two options as distance and topological distance, respectively.

Num
Feature name Description 1 Cherry in tree Ratio of trees that contain cherry (x, y) 2 Trivial Ratio of trees with both leaves x and y that contain cherry (x, y) 3 Leaves in tree Ratio of trees that contain both leaves x and y 4 New cherries Number of new cherries of T after picking cherry (x, y) 5 Before/after Ratio of the number of cherries of T before/after picking cherry (x, y) Features measured by distance (d) and topology (t) Cherry depth Avg over trees with (x, y) of ratios "depth of (x, y) in the tree/depth of tree" 7 d,t Tree depth Avg over trees with (x, y) of ratios "depth of tree/max depth of trees with (x, y)" Leaf distance Avg over trees with x and y of ratios "x-y leaf distance/depth of the tree" 9 d,t LCA distance Avg over trees with x and y of ratios "x-LCA(x, y) distance/y-LCA(x, y) distance" 10 d,t Leaf depth x Avg over trees with x and y of ratios "root-x distance/depth of tree" 11 d,t Leaf depth y Avg over trees with x and y of ratios "root-y distance/depth of tree" 12 d,t Depth x/y Avg over trees with x and y of ratios "root-x distance and root-y distance" To assign a class to each cherry, we define 19 features, summarised in Table 1, that may capture essential information about the structure of the set of trees, and that can be efficiently computed and updated at every iteration of the heuristics.
The depth (resp.topological depth) of a node u in a tree T is the total branch length (resp.the total number of edges) on the root-to-u path; the depth of T is the maximum depth of any leaf of T ; the depth of a cherry (x, y) is the depth of the common parent of x and y.The (topological) leaf distance between x and y is the total branch length of the path from the parent of x to the lowest common ancestor of x and y, denoted by LCA(x, y), plus the total length of the path from the parent of y to LCA(x, y) (resp.the total number of edges on both paths).In particular, the leaf distance between the leaves of a cherry is zero.
Features 1-5 can be initially computed for all cherries of T with two bottom-up traversals of all trees: one for identifying and storing all cherries of T , the second for actually computing the features.Picking a cherry then entails recomputing features 1-2 and 4-5 for up to |T | data points (|T | denotes the number of trees), as this can result in up to one new cherry in each tree; and recomputing feature 3 for up to |X| cherries, where X is the set of taxa of T .
Features 6 d,t to 12 d,t can also be initially computed with a traversal of T , made efficient by preprocessing each tree to allow constant-time LCA queries [6] and by storing the depth (both topological and w.r.t.branch lengths) of each node of each tree.This preprocessing also allows for efficient updates of each feature.We defer a thorough analysis of the time complexity of computing and updating the features to a future full version of this paper.
Obtaining Training Data.The high-level idea to obtain training data is to first generate a phylogenetic network N ; then to extract a subset T of the trees displayed in N ; and finally, to iteratively choose a random reducible pair (x, y) of N , to reduce it in T as well as in N , and to label the remaining cherries of T with one of the four classes defined in Section 4.
We generate phylogenetic networks with branch lengths and up to 9 reticulations using the LGT (lateral gene transfer) network generator of [14] for binary orchard networks.For each such network N , we then generate the set T consisting of all the exhaustive trees displayed in N .Although N is not guaranteed to be an optimal network for T , we expect it to be reasonably close to an optimal one, because we remove redundant reticulations when we generate it and because the trees in T cover all the edges of N .In particular, r(N ) provides an upper bound estimate on the minimum possible number of reticulations of any network displaying T .We will thus use it as a reference value for assessing the quality of our results on synthetic data.

Experiments
The code of all our heuristics, available at https://github.com/estherjulien/HybridML, is written in Python.All experiments ran on an Intel Xeon Gold 6130 CPU @ 2.1 GHz with 96 GB RAM.We conducted experiments on both synthetic and real data, comparing the performance of Rand, TrivialRand, ML and TrivialML, using threshold t = 0.Although our methods can be applied to trees with different leaf sets, we restrict this proof-of-concept study to trees with the same set of leaves, and defer more complete experiments to future extensions.Similar to the training data, we generated two synthetic datasets by first growing a binary orchard network N using [14], and then extracting T as a subset of the exhaustive trees displayed in N .We provide details on each dataset in Section 5.2.We start by analysing the usefulness of tree expansion, the heuristic rule described in Section 3.2.We synthetically generated 112 instances for each tree set size |T | ∈ {5, 10, 20, 50, 100} (560 in total), all consisting of trees with 20 leaves each, and grouped them by |T |; we then ran TrivialRand 200 times (both with and without tree expansion) on each instance, selected the best output for each of them, and finally took the average of these results over each group of instances.The results are in Figure 6, showing that the use of tree expansion brought the output reticulation number down by at least 16% (for small instances) and up to 40% for the larger instances.We consistently chose to use this rule in all the heuristics that detect trivial cherries, namely, TrivialRand, TrivialML, and ML (although ML does not explicitly favour trivial cherries, it does check whether a selected cherry is trivial using feature number 2).

Prediction Model
The random forest is implemented with Python's scikit-learn [13] package using default settings.We evaluated the performance of our trained random forest models on different datasets in a holdout procedure: namely, we removed 10% of the data from each training dataset, trained the models on the remaining 90% and used the holdout 10% for testing.The accuracy was assessed by assigning to each test data point the class with the highest predicted probability, and comparing it with the true class.Before training the models, we balanced each dataset so that each class had the same number of representatives.
Each training dataset differed in terms of the number M of networks used for generating it and the number of leaves of the networks.For each dataset, the number L of leaves of each generated network was uniformly sampled from [2, max L], where max L is the maximum WA B I 2 0 2 2 16:12 Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning number of leaves per network.We constructed the networks using the LGT generator of [14].This generator has three parameters: n for the number of steps, α for the probability of lateral gene transfer events, and β for regulating the size of the biconnected components of the network (called blobs).The combination of these parameters determines the level (maximum number of reticulations per blob), the number of reticulations, and the number of leaves of the output network.In our experiments, α was uniformly sampled from [0.1, 0.5] and β = 1.See [14] for more details.
Each generated network gave rise to a number of data points: the total number of data points per dataset is shown in Table 5 in Appendix C. Each row of Table 5 corresponds to a dataset on which the random forest can be trained, obtaining as many ML models.We tested all the models on all the synthetically generated instances: we show these results in Figure 14 in Appendix D. In Section 5.2 we will report the results obtained for the best performing model for each type of instance.
Among the advantages of using a random forest as prediction model there is the ability of computing feature importance, shown in Table 6 in Appendix C. Some of the most useful features for a cherry (x, y) seem to be "Trivial" (the ratio of the trees containing both leaves x and y in which (x, y) is a cherry) and "Cherry in tree" (the ratio of trees that contain (x, y)).This was not unexpected, as these features are well-suited to identify trivial cherries.
'Leaf distance' (the average, over all trees containing both leaves x, y, of the ratio between the x-to-y distance and the depth of the tree) and "Depth x/y" (the average, over all trees containing both leaves x, y, of the ratio between the x-to-root and the y-to-root distances) are also two important features.The rationale behind these features was to try to identify reticulated cherries.This was also the idea for the feature "Before/after", but this has, surprisingly, a very low importance score.In future extensions of this work we plan to conduct a thorough analysis on whether some of the seemingly least important features can be removed without affecting the quality of the results.

Experimental Results
We assessed the performance of our heuristics on instances of three types: Full Tree Set (FTS), Restricted Tree Set (RTS), and real data.FTS and RTS are synthetically generated.We generated the FTS instances much like we did for the training data: we first grew a network with the LGT generator and then extracted all the exhaustive trees displayed in the network.We generated FTS data for different combinations of the following parameters: L ∈ {20, 50, 100} (number of leaves per tree) and R ∈ {5, 6, 7} (reticulation number of the original network).Note that, for FTS instances, |T | = 2 R .
For generating RTS instances, we also started by growing LGT networks, but we then extracted only a subset of the exhaustive trees from each of them, up to a certain amount |T | ∈ {20, 50, 100}; the other parameter for RTS instances is the number of leaves L ∈ {20, 50, 100}, while the number of reticulations is uniformly sampled in the ranges [5,25], [6,30], [7,35] for |T | = 20, |T | = 50 and |T | = 100, respectively.We generated 112 FTS instances and as many RTS instances for each possible combination of the parameters: by instance group we indicate all the (FTS or RTS) instances generated for one parameter pair.
The real-world dataset we tested our methods on consists of gene trees on homologous gene sets found in bacterial and archaeal genomes.This dataset was originally constructed in [3] and made binary in [19].We extracted a subset of instances (Table 2) from the binary dataset, for every combination of parameters L ∈ {20, 50, 100} and |T | ∈ {10, 20, 50, 100}.For the synthetically generated FTS and RTS datasets, we evaluated the performance of each heuristic in terms of the output number of reticulations, comparing it with the number of reticulations of the network N from which we extracted T .Indeed, although N is not guaranteed to be an optimal network for T , r(N ) clearly provides an estimate (from above) of the optimal value, and thus we used it as a reference value for our experimental evaluation.
For real data we used a different reference value.In the absence of the natural estimate on the optimal number of reticulations provided by the starting network, on real data we evaluated the performance of the heuristics using the best result obtained by running TrivialRand 1000 times as reference value.We also compared our results with the algorithm from [19], denoted by TreeChild, and the algorithm from [1], denoted by Hybroscale, using the same datasets that were used to test the two methods in [19], which consist of rather small instances (|T | ≤ 8) .
Full Tree Set (FTS) instances.We evaluated the output of ML with threshold t = 0 on all the random forest models described in Table 5.The model that gave the best results for datasets of this type is the one trained with parameters max L = 100, M = 500 (see Figure 14a), so we chose to use this model for both ML and TrivialML, and compare them to Rand and TrivialRand.We ran the machine-learned heuristics once for each instance, and then averaged the results within each group.The randomised heuristics Rand and TrivialRand were run min{x(I), 1000} times for each instance I, where x(I) is the number of runs that can be executed in the same time as one run of ML on the same instance.
In Figure 7 we summarise the results for the four heuristics.Solid bars represent the ratio of the average reticulation number to the reference value, for each instance group and for each of the four heuristics.Dashed bars represent the ratio of the average (within each group) of the best of the min{x(I), 1000} runs for each instance I to the reference value.The machine-learned heuristics ML and TrivialML seem to perform very similarly, the best one (on average) being one or the other depending on the instance group.The fully randomised heuristic, Rand, always performed much worse than all the others, and we omitted the results for LowPair because they were at least 44% worse, on average, than the ones for Rand.
While just one run of ML or TrivialML gave better results than the average of multiple runs of TrivialRand (up to 64% on average), picking the best output over all the runs of TrivialRand seems to give the best results for FTS instances, TrivialRand being better than ML by up to 25% on average for instances obtained from networks with 20 leaves and 5 reticulations and yielding results only slightly better than ML and TrivialML in the rest of instance groups (ML being better by 3% on average than the best output of TrivialRand only for the instance group (100,7)).Restricted Tree Set (RTS) instances.The ML-model that gave the best results for RTS datasets is the one trained with parameters max L = 100, M = 10 (Figure 14b in Appendix D).The results in Figure 8 were obtained with the same procedure as for FTS.Comparing Figures 7 and 8 it is clear that, for all the heuristics, it was more difficult to reconstruct N when the input trees were not the totality of the exhaustive trees displayed in N .
In particular, whereas running TrivialRand several times and picking the best output appeared to be the winning strategies for most of the FTS instances (although ML and TrivialML performed almost as good), this is not the case for RTS, where it seems that machine learning often makes up for the lack of information in the input better than the randomised strategies.For RTS instances, the result of ML was better than the best result of TrivialRand by up to 34% on average (for instances of 100 trees obtained from networks with 50 leaves), the difference being more marked for seemingly "difficult" instance groups, for which all the tested heuristics gave results diverging more from the reference value.The best result of TrivialRand was (slightly) better than the result of ML only for the instance group (20,20).
Real Data.We conducted two sets of experiments on real data, using the ML model trained on the dataset with parameters max L = 100, M = 10.For sufficiently small instances, we compared the results of our heuristics with the results of two existing tools for reconstructing networks from binary trees: TreeChild [19] and Hybroscale [1].Hybroscale is an exact method performing an exhaustive search on the networks displaying the input trees, therefore it can only handle reasonably small instances in terms of number of input trees.TreeChild is a fixed-parameter (in the number of reticulations of the output) exact algorithm that reconstructs the best tree-child network, a restricted class of phylogenetic networks, and due to its fast-growing computation time cannot handle large instances either.
We tested ML and TrivialRand against Hybroscale and TreeChild using the same dataset used in [19], in turn taken from [3].The dataset consists of ten instances for each possible combination of the parameters |T | ∈ [2,8] and L ∈ {10, 20, 30, 40, 50, 60, 80, 100, 150}.In Figure 9 we show results only for the instance groups for which Hybroscale or TreeChild could output a solution within 1 hour, consistent with the experiments in [19].As a consequence of Hybroscale and TreeChild being exact methods (TreeChild only for a restricted class of networks), they performed better than both ML and TrivialRand on all instances they could solve, although the best results of TrivialRand are often close (no worse than 14%) and sometimes match the optimal value.Table 4 in the appendix summarises the running time of all the methods on all instance groups.
The main advantage of our heuristics is that they can handle much larger instances.In Figure 10 we show the results obtained for the instance groups of Table 2. Like for FTS and RTS data, Rand and TrivialRand are executed min{x(I), 1000} times for each instance I, x(I) being the time required for executing ML once on instance I.The results are then averaged within each group and divided by the reference value that we obtained by running TrivialRand 1000 times on each instance and taking the best outputs.For this reason, in contrast with the synthetic datasets, for which we had more accurate estimates, the results of ML and TrivialML are even better (up to 10% on average) than the reference value, for some instance groups.The results essentially confirm the ones we obtained for RTS instances: the best outputs of TrivialRand are quite close to the outputs of ML (and TrivialML), being up to 15% worse and up to 9% better on average, depending on the instance group.
Effect of the Threshold on ML.We tested the effectiveness of adding a threshold t > 0 to ML on different types of data (FTS, RTS and real).For these experiments we used a subset of the instance groups, consisting of trees obtained using the parameters R = 5 for The results are shown in Figure 11.For all types of data a low threshold t = 0.1 is beneficial, intuitively indicating that when the probability of a pair being reducible is close to zero it gives no meaningful indication, and thus random choices among these pairs are more suited.For all but one instance group (5 real data instances each consisting of 10 trees with 100 leaves) this is true up to t = 0.3, and for all the synthetic datasets even up to t = 0.5.The seemingly best value for the threshold, though, is different for each type of instances.
The FTS instances seem to benefit from quite high values of t, the best being the highest tested value, t = 0.7.This is consistent with what we observed before: some randomness in the methods seems to be very effective.For the synthetic but less informative RTS instances, the best threshold seems to be around t = 0.3, while very high values (t = 0.7) are counterproductive.This is also coherent with the fact that the randomised heuristics are less effective on this types of instances.For real data, the best option seems to be around t = 0.1.
These experiments should be seen as an indication that the use of an appropriate threshold may improve the performance of the machine-learned heuristics, at the price of running them multiple times.We defer a more thorough analysis to future extensions of this work.

Discussion
Our contributions are twofold: first, we presented the first methods that allow reconstructing a phylogenetic network from a large set of large binary phylogenetic trees.Second, we show the promise of the use of machine learning in this context.Our experimental studies indicate that repeated runs of simple and fast randomised cherry-picking heuristics are quite effective in most of the cases, but machine-learned strategies bring better results in the most difficult instances.Furthermore, preliminary experiments indicate that their performance can even be improved by introducing appropriate thresholds, in fact mediating between random choices and predictions.In addition, our current implementations are in Python and hence not optimised for speed, but faster implementations could make machine-learned heuristics with nonzero thresholds even more effective.Figure 11 The reticulation number when running ML with different thresholds for the three instances classes.Each instance (in total 112) is run 10 times, where the lowest reticulation value of these runs is selected.At each point, a 68% confidence interval is also shown.
For future work we would like to evaluate if there is any relationship between the accuracy of the machine-learned models and the probability of the heuristic being able to reconstruct an optimal network.We will also investigate if the use of other parameters to introduce some randomness in strategies using machine learning can further improve the results.Finally, we plan to test our methods on trees with different leaf sets and to extend them to nonbinary trees.     5.For each training dataset, identified by the parameter pair max L-M , the value shown in the heatmap is the average, within each instance group, of the reticulation number found by ML divided by the reference value.We used a group of 112 instances for each combination of parameters L ∈ {20, 50, 100} and R ∈ {5, 6} (for FTS), and L ∈ {20, 50, 100} and |T | ∈ {20, 50} (for RTS).

( a )
Network N .(b) Input tree set T .

Figure 2
Figure 2 The two trees in Fig. 2b are displayed in the network of Fig. 2a.

( a )Figure 4
Figure 4 Tree expansion of T (a) with the trivial cherry (x, y) of T (y,z) .(b) After picking cherry (y, z), leaf y is missing in T S 1 .(c) Leaf x is replaced by the cherry (x, y).After completion of the heuristic, we have ST = (y, z), (x, y), (y, w), (w, z).(d) The network N exp T reconstructed from ST .Note that the input tree T is displayed in N exp T (solid edges).

N 2 16: 8 Reconstructing
exp(j−1) T ∈ T exp(j−1) be the network it originated from.Let S k−1 T be the subsequence of WA B I 2 0 2 Phylogenetic Networks via Cherry Picking and Machine Learning S k−1 consisting only of the pairs that subsequently affected N exp(j−1) T .Then N exp(j) T is the network reconstructed from S k−1 T • (w, z), starting from T ′ .For trees of T exp(j−1) S k−1 that are of type (i)-(iii) for (w, z), we define N exp(j) T = N exp(j−1) T

Figure 5
Figure 5 The network N ′ of Example 8.

Figure 6
Figure6 TrivialRand with and without tree expansion.The height of the bars is the average reticulation number over each group, obtained by selecting the best of 200 runs for each instance.

Figure 7
Figure 7 Results for FTS instances for each combination of the parameters (L, R) and distribution of the results per instance group (boxplots on the right, for the randomized heuristics only considering the best result over all runs on each instance).

Figure 8
Figure 8 Results for RTS instances for each combination of parameters L and |T | and distribution of the results per instance group (boxplots on the right, for the randomized heuristics only considering the best result over all runs on each instance).

Figure 9
Figure 9 Comparison of ML, TrivialRand, Hybroscale, and TreeChild on real data.The distribution of the results is shown in Appendix B (Figure 12).

WA B I 2 0 2 2 16: 16 Figure 10
Figure 10 Results for the real instance class for each combination of parameters L and |T |.The distribution of the results is shown in Appendix B (Figure 13.)

16 : 19 ▶ 9 .
Lemma The running time of a naive implementation of Rand is O(|T | 2 |X|).Proof.An upper bound for the length of the sequence is (|X| − 1)|T | as each tree can individually be fully reduced using at most |X|−1 pairs.Hence, the while loop of Algorithm 1 is executed at most (|X| − 1)|T | times.Furthermore, choosing a random reducible pair takes O(1) time, and reducing a pair in all trees in T takes O(|T |) time, as it takes constant time within each tree.Combining this with the fact that CompleteSeq takes O(|S|) = O(|X||T |) time, we conclude that Rand takes no more than O(|T | 2 |X|) time.◀ B Runtimes and Omitted Plots

Figure 12 Figure 13
Figure 12Boxplots for the distribution of the results of Figure9.

Figure 14
Figure14 Results for ML with the random forest model trained on each of the datasets given in Table5.For each training dataset, identified by the parameter pair max L-M , the value shown in the heatmap is the average, within each instance group, of the reticulation number found by ML divided by the reference value.We used a group of 112 instances for each combination of parameters L ∈ {20, 50, 100} and R ∈ {5, 6} (for FTS), and L ∈ {20, 50, 100} and |T | ∈ {20, 50} (for RTS).

Table 2
Number of real data instances for each group (combination of parameters L and |T |).

Table 3
Runtimes (in seconds) of ML for instance types FTS, RTS, and real, for all instance groups and for different parameter combinations.The number of leaves per tree are given in column "L".For FTS instances, the size of the tree set is determined by R (reticulation number of the original network).For RFT and real instances, this is simply denoted by |T |.

Table 4
Runtimes of CPH (same for ML and TrivialRand), Hybroscale, and TreeChild in seconds, for combinations of parameters L (leaves per tree) and |T | (number of trees).