Honest Fraction Differential Privacy

Over the last decades, differential privacy (DP) has become a standard notion of privacy. It makes it possible to measure how much sensitive information an adversary could infer from a result (statistical model, prediction, etc.) they obtain. In privacy-preserving federated machine learning, one aims to learn a statistical model from data owned by multiple data owners without revealing their sensitive data. A common strategy is to use secure multi-party computation (SMPC) to avoid revealing intermediate results. However, DP assumes a very strong adversary who knows all information in the dataset except the targeted secret, while most SMPC methods assume a clearly weaker adversary; for example, it is common to assume that the adversary has bounded computational power and can corrupt only a minority of the data owners (honest majority). As a chain is no stronger than its weakest link, in such combinations DP provides overly strong protection at an unnecessarily high cost in terms of utility. We propose honest fraction differential privacy, which is similar to differential privacy but assumes that the adversary can only collude with data owners covering part of the data. This assumption is very similar to the assumptions made by many SMPC strategies. We illustrate this idea by considering the application to the specific task of unregularized linear regression without bias on sufficiently large datasets.

INTRODUCTION
Privacy is essential in fields like healthcare and finance, where it is crucial to handle sensitive data carefully. A common technique is to add noise to the final result of a computation before publishing it, so that a recipient of the result cannot infer sensitive data from it.
Federated machine learning [14] involves training algorithms across decentralized devices or servers holding local data samples. In privacy-preserving federated machine learning, the goal is to learn from datasets distributed across multiple data owners without compromising the confidentiality of the individual datasets [11].
A common strategy in this domain is the use of Secure Multi-Party Computation (SMPC). SMPC [8] aims to compute results based on inputs from multiple parties without revealing any of the input data or intermediate results, even if the central server is not trusted. Among other uses, it can securely aggregate values contributed by the participating data owners or generate DP noise in a verifiable way [18].
Standard DP can be seen as assuming an adversary with extensive knowledge, capable of accessing all but one instance in the dataset. This assumption contrasts with the more restrained adversary model in SMPC, where the adversary is often assumed to have limited computational power and to be able to influence at most a certain fraction of the data owners (or a subset of data owners holding at most a certain fraction of the data), as explored by Lindell [13]. The result is over-protection by DP, leading to a decrease in the utility of the data, a phenomenon widely discussed in the literature [20].
Because requiring information-theoretic differential privacy is unnecessary against an adversary already assumed to have limited computational power, the notion of computational differential privacy was introduced [16]. In the computational DP setting, an adversary may hold encrypted sensitive information as long as decrypting it is beyond its computational capabilities, e.g., if decryption is infeasible in polynomial time.
Similarly, if the SMPC component of an algorithm already assumes that a certain fraction of the parties are honest, it doesn't help much to assume the adversary has access to all but one instance when considering statistical privacy. To address this issue, we introduce a novel concept termed Honest Fraction Differential Privacy (HFDP). HFDP is similar to traditional DP, but reflects privacy under the assumption that an adversary can only collude with a certain fraction of the data owners, in line with many SMPC methods.
We apply HFDP specifically to linear regression models. Linear regression is chosen because of its wide use in various sectors, especially in medical data analysis.
In summary, our contributions are:
• We introduce the novel concept of Honest Fraction Differential Privacy (HFDP), and study its properties.
• We apply the HFDP framework to linear regression and propose a strategy to train $\rho$-HF $(\epsilon, \delta)$-DP linear regression models, with $\rho$ the honest fraction.
The remainder of this paper is structured as follows: first, in Section 2 we review some basic definitions and relevant literature. Next, in Section 3 we present the key ideas of the Honest Fraction Differential Privacy framework, and in Section 4 we apply this to the specific case of simple linear regression. In Section 5 we then provide further discussion, conclusions and directions for future work.

BACKGROUND
Differential Privacy (DP). We will denote by $\mathcal{X}$ the space of all possible instances. For a space $S$, we will denote by $\mathcal{P}(S)$ the space of all probability distributions over $S$, and we will denote by $S^*$ the space of all sets of elements of $S$. We will denote by $I_d \in \mathbb{R}^{d \times d}$ the $d$-dimensional identity matrix.
The foundational concept of Differential Privacy (DP) was established by Dwork [5,6], notably through the study of $(\epsilon, \delta)$-differential privacy.
Definition 1 (Adjacent datasets). We say two datasets $D, D' \in \mathcal{X}^*$ are adjacent if they differ in at most one instance, i.e., there are $D_0 \in \mathcal{X}^*$ and $x, x' \in \mathcal{X}$ such that $D = D_0 \cup \{x\}$ and $D' = D_0 \cup \{x'\}$. We denote by $\mathrm{adj}_{\mathcal{X}}$ the set of all pairs of adjacent datasets in $\mathcal{X}^*$.
Definition 2 (Mechanism). For an input space $\mathcal{X}$ and an output space $\mathcal{Y}$, a mechanism $M : \mathcal{X} \to \mathcal{Y}$ is a randomized algorithm mapping elements of $\mathcal{X}$ to elements of $\mathcal{Y}$. We denote the space of all mechanisms from $\mathcal{X}$ to $\mathcal{Y}$ by $\mathcal{M}(\mathcal{X}, \mathcal{Y})$.
Mechanisms can be (or can be post-processing steps to) a wide variety of algorithms, such as randomized algorithms for learning, optimizing, predicting, inferring or encoding. A mechanism is differentially private if changing a single instance in the input dataset doesn't lead to a distinguishably different output.
Definition 3 ($(\epsilon, \delta)$-Differential Privacy). Let $\epsilon > 0$ and $\delta \ge 0$. A mechanism $M \in \mathcal{M}(\mathcal{X}^*, \mathcal{Y})$ is $(\epsilon, \delta)$-differentially private if, for any two adjacent datasets $D \in \mathcal{X}^*$ and $D' \in \mathcal{X}^*$, and for all $S \subseteq \mathcal{Y}$, the inequality $P(M(D) \in S) \le e^{\epsilon} P(M(D') \in S) + \delta$ holds.
To achieve differential privacy (DP), two of the best-known tools are the Laplace mechanism and the Gaussian mechanism [5,15]. The Gaussian mechanism adds Gaussian-distributed noise to the output of a function. For functions $f$ with range $\mathbb{R}^d$, we denote by $G_{\sigma^2}[f]$ the mechanism mapping $x$ to $f(x) + \eta$ with $\eta \sim \mathcal{N}(0, \sigma^2 I_d)$.
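As an illustration of the Gaussian mechanism described above, the following is a minimal NumPy sketch. The names `gaussian_mechanism` and `column_mean` are hypothetical, and the noise scale `sigma` is taken as given rather than calibrated to any privacy target.

```python
import numpy as np

def gaussian_mechanism(f, sigma, rng=None):
    """Wrap f so that its output gets independent N(0, sigma^2) noise per coordinate."""
    rng = np.random.default_rng() if rng is None else rng

    def mechanism(x):
        out = np.asarray(f(x), dtype=float)
        return out + rng.normal(loc=0.0, scale=sigma, size=out.shape)

    return mechanism

def column_mean(data):
    """Example query (an assumption for illustration): the per-column mean of a dataset."""
    return np.asarray(data, dtype=float).mean(axis=0)

noisy_mean = gaussian_mechanism(column_mean, sigma=0.1, rng=np.random.default_rng(0))
release = noisy_mean([[0.2, 0.4], [0.6, 0.8]])  # exact mean plus Gaussian noise
```

Whether such a release is differentially private depends on how `sigma` relates to the sensitivity of the query, which the sketch deliberately leaves to the caller.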
Definition 4 (L2 Sensitivity). The $L_2$ sensitivity of a function $f : \mathcal{X} \to \mathcal{Y}$, denoted as $\Delta_2 f$, is defined as the maximum Euclidean distance between the outputs of $f$ on any two adjacent datasets, i.e., $\Delta_2 f = \max_{(D, D') \in \mathrm{adj}_{\mathcal{X}}} \| f(D) - f(D') \|_2$.
Theorem 1 (Gaussian Mechanism (Dwork 2014)). For any function $f : \mathcal{X} \to \mathcal{Y}$ and any $\epsilon > 0$ and $\delta > 0$, the Gaussian mechanism $G_{\sigma^2}[f]$ is $(\epsilon, \delta)$-differentially private when $\sigma^2 \ge \sigma^2_{(\epsilon, \delta)}$; in the classical analysis, which requires $\epsilon < 1$, the value $\sigma^2_{(\epsilon, \delta)} = 2 \ln(1.25/\delta) (\Delta_2 f)^2 / \epsilon^2$ suffices.
Federated Learning. Federated Learning (FL) is a decentralized approach to machine learning where the training instances are distributed across multiple data owners. Depending on the extent to which parties trust each other, one can assume several security models. While parties can misbehave in many ways, in this paper we focus on parties interested in learning about the sensitive data of other parties. To prevent parties from learning each other's data while computing statistics, one can use secure aggregation or, more generally, secure multi-party computation. While it is possible to design algorithms that are robust against all but one party colluding to reveal the secrets of the last party, such protocols are often expensive. For example, additive secret sharing is robust against such collusion, but its cost is quadratic in the number of parties. It is therefore more common to assume that a fraction of the parties are honest, i.e., they follow the protocol and don't collude; for example, the protocol of [17] then has a cost $O(n \log n)$ with $n$ the number of parties. A popular security model is the honest majority model [2,11], where one assumes that the majority of parties are honest.
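The quadratic cost of additive secret sharing mentioned above can be seen in a small simulation. This is a sketch under stated assumptions: the modulus, the function names `share` and `secure_sum`, and the in-process simulation of message passing are all hypothetical, inputs are non-negative integers whose sum stays below the modulus, and a real protocol would additionally need secure channels and dropout handling.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical public modulus, larger than any possible sum

def share(value, n_parties):
    """Split an integer into n_parties additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    """Simulate secure aggregation of one integer per party."""
    n = len(private_values)
    # Party i sends its j-th share to party j: n * (n - 1) messages, quadratic in n.
    sent = [share(v, n) for v in private_values]
    # Party j only publishes the sum of the shares it received.
    partial = [sum(sent[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partial) % MODULUS

total = secure_sum([12, 7, 30, 1])  # equals 50
```

Any group of up to n - 1 colluding parties sees only uniformly random shares of the remaining party's value, which is why this scheme tolerates all-but-one collusion at the price of the quadratic message count.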
DP in Linear Regression Models. Exploring the application of Differential Privacy (DP) within linear regression models offers a concrete example of how DP principles can be implemented in specific statistical analyses. Zhang et al. [22] introduce the Functional Mechanism, a method that perturbs the objective function to ensure differential privacy for both linear and logistic regression models. Furthering this line of inquiry, Dandekar et al. [4] examine the application of a DP functional mechanism in regularized linear regression models.
Utility-privacy trade-off. The integration of privacy-preserving techniques in Federated Learning (FL) is essential for safeguarding data across decentralized clients, but it can also impact the utility of the aggregated model. Several research papers have explored this topic, emphasizing the importance of finding a balance between privacy and utility [3,10,23,24]. Zhang et al. [23] delve into the trade-offs between privacy and utility.
Exploiting external randomness. Some research has sought to exploit intrinsic randomness already present in some algorithms to reduce the amount of additional noise needed to achieve a certain privacy level. For example, some cryptographic strategies such as fully homomorphic encryption contain randomization which can be exploited as differential privacy noise [19]. Also, some machine learning algorithms such as Stochastic Gradient Descent allow their sampling randomness to be exploited to improve privacy [9,21]. The work described in the current paper can be combined with such strategies, as the sources of additional uncertainty for the adversary which these methods and our work exploit are complementary and independent.
Other privacy notions. Next to classic differential privacy, several other privacy notions have been proposed in the literature. Some of these are based on alternative ways to measure the divergence between the distributions of outputs of a mechanism when provided with two adjacent datasets as input. For example, Rényi Differential Privacy [15] generalizes the notion of $\epsilon$-differential privacy by considering the Rényi divergence between probability distributions, parameterized by an order parameter $\alpha$. Concentrated Differential Privacy [7] is another variant, focusing on bounding the divergence from the expected privacy loss rather than the worst-case scenario. We describe our work in the context of $(\epsilon, \delta)$-differential privacy for simplicity of explanation in this short paper, but the same idea can also be applied to Rényi DP and Concentrated DP.
An interesting general framework is the Pufferfish framework [12], of which all the above privacy notions are instantiations. Our work too can be seen as an instance of the Pufferfish model. Even though its generality makes the framework less directly operational, [12] shows that a restricted set of properties holds in general for all privacy notions fitting the framework.

HONEST FRACTION DIFFERENTIAL PRIVACY
We now define our new concept of Honest Fraction Differential Privacy. The 'honest fraction' $\rho$ refers to the fraction of the parties in federated learning which are assumed to be honest, i.e., not colluding to infer some information. In practice, results in this paper which are secure under a $\rho$-honest fraction assumption remain valid if there are multiple disjoint groups of colluding parties, as long as no such group of colluding parties exceeds a fraction of $1 - \rho$ of the parties.
Intuitively, the dataset $D$ is partitioned as $D = D_a \cup D_u \cup \{(x_t, y_t)\}$ with $D_a$ containing instances the adversary can observe (e.g., by colluding with the concerned data owners), $D_u$ containing instances which are unknown to the adversary, and $(x_t, y_t)$ the target instance, which the adversary can't see but would like to infer more information about. We will sometimes group $D_a$ and $(x_t, y_t)$, setting $D_- = D_a \cup \{(x_t, y_t)\}$.
Definition 5. Let $P \in \mathcal{P}(\mathcal{X})$ be a probability distribution. We define $\mathrm{Ext}_P : \mathbb{N} \times \mathcal{X}^* \to \mathcal{X}^*$ to be a randomized function which takes as input a number $n_u$ and a dataset $D_- \in \mathcal{X}^*$ and outputs a dataset $D = D_- \cup D_u$ where $D_u$ contains $n_u$ instances sampled i.i.d. from $P$.
Definition 6 ($\rho$-Honest Fraction $(\epsilon, \delta)$-DP). A mechanism $M \in \mathcal{M}(\mathcal{X}^*, \mathcal{Y})$ is $\rho$-HF $(\epsilon, \delta)$-DP if, for any two adjacent datasets $D_-$ and $D'_-$, and for all output sets $S \subseteq \mathcal{Y}$, the inequality
$P(M(\mathrm{Ext}_P(n_u, D_-)) \in S) \le e^{\epsilon} P(M(\mathrm{Ext}_P(n_u, D'_-)) \in S) + \delta$
holds, where $n_u = \rho n - 1$ is the number of unseen instances and $n = |D_-| + n_u$ is the total dataset size.
This definition essentially says that the adversary can see only some fraction $1 - \rho$ of the dataset, denoted by $D_a$, and has no knowledge of the remaining fraction $\rho$ of the dataset (except for the output of the algorithm $M$), consisting of the target instance $(x_t, y_t)$ and a set $D_u$ of $\rho n - 1$ other unseen instances, except that it is drawn from the same distribution. This means that an adversary who wants to infer something about an instance $x$ must take into account not only the noise added by the mechanism, but also the uncertainty arising from their ignorance of $D_u$.
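To make the role of the extension function in Definition 5 concrete, here is a minimal sketch. It assumes instances are rows of a NumPy array and that the population distribution is available as a sampling function; the names `ext` and `sample_uniform` and the uniform population are hypothetical choices for illustration.

```python
import numpy as np

def ext(sample_from_p, n_u, d_minus, rng):
    """Return D = D_minus union D_u, where D_u holds n_u i.i.d. draws from P."""
    d_u = sample_from_p(n_u, rng)
    return np.concatenate([np.asarray(d_minus, dtype=float), d_u], axis=0)

def sample_uniform(n, rng):
    """Hypothetical population P: 2-dimensional instances uniform on [-1, 1]^2."""
    return rng.uniform(-1.0, 1.0, size=(n, 2))

rng = np.random.default_rng(0)
# D_minus: the instances seen by the adversary plus the target instance.
d_minus = np.array([[0.1, 0.2], [0.3, -0.4], [0.9, 0.5]])
d = ext(sample_uniform, n_u=7, d_minus=d_minus, rng=rng)  # 10 instances in total
```

Each call draws a fresh unseen part, so any statistic computed on the extended dataset is random from the adversary's point of view even before a mechanism adds noise.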
In general the HFDP privacy notion is not composable, in the sense that a mechanism $M$ outputting a pair containing the outputs of a $\rho$-HF $(\epsilon_1, \delta_1)$-DP mechanism $M_1$ and a $\rho$-HF $(\epsilon_2, \delta_2)$-DP mechanism $M_2$ is itself not necessarily $\rho$-HF $(\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)$-DP. The reason is that the uncertainty added to the adversary's inference problem by their ignorance of the data $D_u$ is the same for both mechanisms $M_1$ and $M_2$, as the data owners not colluding with the adversary don't use different data for the evaluations of the mechanisms $M_1$ and $M_2$. The noise added by $M_1$ and $M_2$ is therefore correlated, and simple composition doesn't work. Nevertheless, as we will show in Section 4, HFDP can be used to reduce the amount of noise one must add to a statistical model with multiple parameters to achieve privacy.

APPLICATION TO LINEAR REGRESSION
In this section, we explore the application of HFDP within the framework of linear regression. Let $D \in \mathcal{Z}^n$ be a dataset containing $n$ instances $z_i$, $i = 1, \ldots, n$, where $z_i = (x_i, y_i)$. We assume that the adversary knows at most $\lfloor (1 - \rho) n \rfloor$ instances. To keep our formulae simple, in the remainder of this work we assume $\rho n$ is an integer.
Definition 7 (Linear regression). The simple linear regression on $D$ is the model $\mathrm{LR}_D : \mathcal{X} \to \mathcal{Y}$ with $\mathrm{LR}_D(x) = x^\top w^*$, where $w^*$ minimizes the sum of squared errors $f_D(w) = \sum_{i=1}^{n} (y_i - x_i^\top w)^2$, i.e., $w^* = \arg\min_w f_D(w)$.
In other words, linear regression expresses the target value $y$ as a linear function of the input values $x$, minimizing the sum of squared errors on the training set. While we consider this simple regression variant here for simplicity of explanation, our derivation also applies to regularized linear regression. Similarly, it is straightforward to add a bias term to the regression.
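Functional mechanism. In the context of differential privacy, the functional mechanism refers to a class of techniques that provide differential privacy guarantees by adding noise to the objective function that machine learning algorithms try to optimize. To apply the functional mechanism to linear regression, the first step is to write $f_D(w)$ as a polynomial in $w$. In particular,
$f_D(w) = w^\top A w - 2 b^\top w + c$, with $A = \sum_{i=1}^{n} x_i x_i^\top$, $b = \sum_{i=1}^{n} x_i y_i$ and $c = \sum_{i=1}^{n} y_i^2$.
Next, one can add Gaussian noise to the coefficients of the polynomial. Let's denote the noise variables by $\eta^{(A)} \in \mathbb{R}^{d \times d}$, $\eta^{(b)} \in \mathbb{R}^d$ and $\eta^{(c)} \in \mathbb{R}$ respectively. The objective function after noise injection is:
$\tilde{f}_D(w) = w^\top (A + \eta^{(A)}) w - 2 (b + \eta^{(b)})^\top w + (c + \eta^{(c)})$.
Inspired by Zhang et al. [22], we observe that for all $i \ne j$, $\Delta_2 A_{i,i} = \Delta_2 c = 1$ and $\Delta_2 A_{i,j} = \Delta_2 b_i = 2$. The matrix $A$ is symmetric, so denoting the upper triangular part of $A$ by $A_u$, the vector $(A_u, b, c)$ has in total $d(d+1)/2 + d + 1$ components. We can make $(A_u, b, c)$ differentially private by applying the Gaussian mechanism with a variance $\sigma^2_{\mathrm{LR}}$ calibrated to these sensitivities.
Now consider an adversary who observes $D_a$, with $D = D_a \cup D_u \cup \{(x_t, y_t)\}$, where $D_u$ contains $\rho n - 1$ instances the adversary can't see and where $(x_t, y_t)$ is the target instance the adversary would like to infer information about. The adversary knows that $\Theta = \theta_a + \theta_u + g(x_t, y_t) + \eta^{(\Theta)}$ with $\Theta = (A, b, c)$, $g(x, y) = (x x^\top, x y, y^2)$, $\eta^{(\Theta)} = (\eta^{(A)}, \eta^{(b)}, \eta^{(c)})$, $\theta_a = \sum_{(x,y) \in D_a} g(x, y)$ and $\theta_u = \sum_{(x,y) \in D_u} g(x, y)$. As the adversary may know $\Theta$ and $\theta_a$, the noise prohibiting the adversary from inferring $(x_t, y_t)$ is the uncertainty on $\theta_u + \eta^{(\Theta)}$. The variable $\theta_u$ is a sum of functions of $|D_u| = \rho n - 1$ instances; if $n$ is sufficiently large, $\theta_u$ will follow a nearly Gaussian distribution.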
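The functional mechanism for linear regression described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure of [22]: the function names are hypothetical, the noise scale `sigma` is assumed to have been calibrated elsewhere, and the dataset is assumed large enough that the noisy matrix remains invertible.

```python
import numpy as np

def coefficients(X, y):
    """Coefficients (A, b, c) of the polynomial f_D(w) = w^T A w - 2 b^T w + c."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return X.T @ X, X.T @ y, float(y @ y)

def symmetric_noise(d, sigma, rng):
    """Draw noise for the upper triangle of A and mirror it, so A stays symmetric."""
    upper = np.triu(rng.normal(0.0, sigma, size=(d, d)))
    return upper + np.triu(upper, k=1).T

def noisy_regression(X, y, sigma, rng):
    """Minimize the noise-injected objective for a pre-calibrated noise scale sigma."""
    A, b, _c = coefficients(X, y)
    d = A.shape[0]
    A_noisy = A + symmetric_noise(d, sigma, rng)
    b_noisy = b + rng.normal(0.0, sigma, size=d)
    # The constant term c and its noise do not change the minimizer.
    # Setting the gradient 2 A_noisy w - 2 b_noisy to zero gives a linear system.
    return np.linalg.solve(A_noisy, b_noisy)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 3))
y = X @ np.array([0.5, -0.2, 0.1])
w_private = noisy_regression(X, y, sigma=1.0, rng=rng)
```

With `sigma=0.0` the function reduces to ordinary least squares, which makes it easy to compare the utility loss caused by different noise levels.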
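According to Theorem 1, we can achieve privacy by ensuring there is Gaussian noise with variance $\sigma^2_{\mathrm{LR}}$, so the uncertainty on each component of $\theta_u + \eta^{(\Theta)}$ should be at least $\sigma^2_{\mathrm{LR}}$. The adversary may estimate $\theta_u$ as $\hat{\theta}^a_u = \theta_a |D_u| / |D_a|$, knowing that both parts of the dataset are drawn from the same distribution. In general, it may not be easy to assess how accurate such an estimate $\hat{\theta}^a_u$ is. Special cases have been described in the literature, e.g., if the population distribution is Gaussian, then $\hat{\theta}^a_u - E[\theta_u]$ follows a Wishart distribution. In any case, the adversary may have prior background information and hence a better estimate $\hat{\theta}^a_u$. We therefore will assume that $\hat{\theta}^a_u$ is a good approximation to $E[\theta_u]$.
From the point of view of the party building the DP regression model, to estimate the uncertainty the adversary will have on $\hat{\theta}^a_u - \theta_u$, one could compute $\bar{\theta} = \Theta / n$ and
$\Sigma = \frac{1}{n} \sum_{i=1}^{n} (g(x_i, y_i) - \bar{\theta}) (g(x_i, y_i) - \bar{\theta})^\top$.
As $n$ increases, $\Sigma$ becomes a better estimate for the uncertainty of the adversary on $\hat{\theta}^a_u - \theta_u$. In particular, for any $\delta'' > 0$, there is an increasing function $h : \mathbb{N} \to [0, 1]$ with $\lim_{n \to \infty} h(n) = 1$ such that with probability at least $1 - \delta''$ there holds
$\mathrm{Cov}(\hat{\theta}^a_u - \theta_u) \succeq |D_u| \, \Sigma \, h(n)$.
If $n$ is also sufficiently large to make $\theta_u$ sufficiently close to a Gaussian distributed random variable (from the point of view of the adversary, even if they have good knowledge of the population distribution and hence potentially of $\Sigma$), then to achieve $\rho$-HF $(\epsilon, \delta' + \delta'')$-DP it is sufficient to add to $\Theta$ noise $\eta^{(\Theta)} \sim \mathcal{N}(0, \Sigma_\eta)$ such that
$|D_u| \, \Sigma \, h(n) + \Sigma_\eta \succeq \sigma^2_{(\epsilon, \delta')} I$,
where $\sigma^2_{(\epsilon, \delta')}$ denotes the Gaussian-mechanism variance required by Theorem 1 for $(\epsilon, \delta')$-DP. This leads us to our main theorem:
Theorem 2. Let $\rho \in (0, 1)$, $\epsilon > 0$, $\delta > \delta' > 0$. There exists a function $h : \mathbb{N} \to [0, 1]$ with $\lim_{n \to \infty} h(n) = 1$ such that for any $D \in \mathcal{Z}^n$, adding to the vector $\Theta = (A_u, b, c)$ (before optimizing for $w$) Gaussian noise $\eta^{(\Theta)} \sim \mathcal{N}(0, \Sigma_\eta)$ satisfying $|D_u| \, \Sigma \, h(n) + \Sigma_\eta \succeq \sigma^2_{(\epsilon, \delta')} I$ gives a $\rho$-HF $(\epsilon, \delta)$-DP mechanism.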
Here, ℎ() models the extent to which finite values of  may let the adversary's knowledge of the sum over   deviate from a Gaussian distribution.Deriving good bounds for ℎ() is out of the scope of this short paper.
It is important to observe that for the computation of Σ one only needs to evaluate averages over the instances of the dataset .Therefore, in the many federated learning settings where a secure aggregation operation is already available (for example, to compute the coefficients ,  and ), the implementation of our technique should be straightforward.
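The observation that $\Sigma$ only requires averages can be sketched as follows, assuming $\Sigma$ is the empirical covariance of the per-instance statistics $g(x, y)$ restricted to the upper triangle of $x x^\top$. The function names are hypothetical, and the plain Python sums over parties stand in for a secure aggregation protocol.

```python
import numpy as np

def g(x, y):
    """Per-instance statistics: upper triangle of x x^T, then x*y, then y^2."""
    x = np.asarray(x, dtype=float)
    iu = np.triu_indices(len(x))
    return np.concatenate([np.outer(x, x)[iu], x * y, [y * y]])

def local_sums(X, y):
    """What one data owner contributes: its count, sum of g, and sum of g g^T."""
    G = np.array([g(xi, yi) for xi, yi in zip(X, y)])
    return len(y), G.sum(axis=0), G.T @ G

def covariance_from_sums(party_sums):
    """Combine the per-party sums (here added in the clear) into Sigma."""
    n = sum(p[0] for p in party_sums)
    first = sum(p[1] for p in party_sums)
    second = sum(p[2] for p in party_sums)
    theta_bar = first / n
    return second / n - np.outer(theta_bar, theta_bar)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = rng.uniform(-1.0, 1.0, size=30)
parties = [local_sums(X[:10], y[:10]), local_sums(X[10:], y[10:])]
Sigma = covariance_from_sums(parties)  # 6 x 6 matrix for d = 2
```

Because only sums cross party boundaries, the same secure aggregation primitive used for the regression coefficients can in principle be reused for the second-moment sums.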
It is possible that the total uncertainty for the adversary, i.e., on $\theta_u + \eta^{(\Theta)}$, is not isotropic, e.g., if no $\Sigma_\eta$ exists which makes exactly $|D_u| \, \Sigma \, h(n) + \Sigma_\eta = \sigma^2_{(\epsilon, \delta')} I$. Still, in that case the probability distribution on $\theta_u - \hat{\theta}^a_u + \eta^{(\Theta)}$ could be written as a sum of the isotropic distribution $\mathcal{N}(0, \sigma^2 I)$ needed to ensure privacy and some remainder noise in the dimensions where there is high variance in the distribution on $\theta_u$.
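One possible way to pick a noise covariance satisfying the condition $|D_u| \, \Sigma \, h(n) + \Sigma_\eta \succeq \sigma^2 I$ is to fill only the eigen-directions where the data-induced uncertainty falls short. The sketch below is an assumption-laden illustration, not a construction given in this paper: the function name is hypothetical, and the values of $h(n)$ and of the target variance are taken as inputs.

```python
import numpy as np

def minimal_noise_covariance(Sigma, n_u, h_n, sigma2):
    """Return a PSD Sigma_eta with n_u * h_n * Sigma + Sigma_eta >= sigma2 * I.

    Noise is added only along eigen-directions of the data covariance whose
    variance is below the target sigma2 (eigenvalue clipping).
    """
    base = n_u * h_n * np.asarray(Sigma, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(base)
    deficit = np.clip(sigma2 - eigvals, 0.0, None)
    # Rebuild V diag(deficit) V^T from the eigen-decomposition.
    return (eigvecs * deficit) @ eigvecs.T

# Hypothetical numbers: one high-variance and one low-variance direction.
Sigma = np.diag([0.5, 0.001])
Sigma_eta = minimal_noise_covariance(Sigma, n_u=100, h_n=0.9, sigma2=4.0)
```

In this example the first direction already carries variance 45, above the target of 4, so it receives no added noise, while the second direction is topped up from 0.09 to 4. The total uncertainty is then not isotropic, which corresponds to the case discussed above.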

CONCLUSION
In this short paper, we presented a novel idea for guaranteeing privacy while reducing the amount of noise that needs to be added. In particular, we argue that in the context of federated learning, where cryptography-based components already make stronger security assumptions, e.g., that the adversary is computationally limited or can corrupt only part of the data owners, it makes little sense to adopt a statistical privacy notion assuming a strong adversary having access to all but one instance. We analyzed for the specific case of simple linear regression how assuming that a fraction $\rho$ of the data is secured by honest data owners can reduce the noise one needs to add using the functional mechanism. Similar effects can be obtained when applying other privacy mechanisms, e.g., when using the Gaussian mechanism to privatize node information in decision trees or to privatize gradients in DP-SGD [1] to train neural network models.
If all data owners have the same number of instances, our notion of having a fraction $\rho$ of the data with honest data owners implies that a fraction $\rho$ of the data owners should be honest, which nicely aligns with the typical assumptions of cryptographic algorithms. If data owners have different numbers of instances, the number of data owners that may collude depends on their size: more data owners may collude if each holds few instances, and fewer if they hold many.
We want to pursue several lines of further work. First, we want to perform systematic experiments on real-world data to better study the effect of changing the security assumptions on the utility of predictive models. Second, we aim to elaborate the several extensions outlined in the body of this paper, e.g., the option to consider HFDP in information-theoretic or computational form depending on the security assumptions, stating formulas for regression with bias terms, and the generalizations to regularized linear regression and other privacy mechanisms. Third, we aim to provide more general techniques to exploit the more limited knowledge of the adversary, even though the data unknown to the adversary induces correlated noise, which is somewhat more difficult to handle in the composition of privacy costs. Finally, we want to investigate more broadly whether there are other aspects of real-world scenarios we can exploit to better model potential privacy attacks and better tune the amount of noise needed to achieve privacy, further improving the achievable utility given a desired level of privacy.