Journal of Machine Learning Research 25 (2024) 1-62 Submitted 11/23; Revised 4/24; Published 5/24
Optimal Locally Private Nonparametric Classification
with Public Data
Yuheng Ma
School of Statistics
Renmin University of China
100872 Beijing, China
Hanfang Yang hy[email protected]
Center for Applied Statistics, School of Statistics
Renmin University of China
100872 Beijing, China
Editor: Po-Ling Loh
Abstract
In this work, we investigate the problem of public data assisted non-interactive Lo-
cal Differentially Private (LDP) learning with a focus on non-parametric classification.
Under the posterior drift assumption, we for the first time derive the mini-max optimal
convergence rate with LDP constraint. Then, we present a novel approach, the locally
differentially private classification tree, which attains the mini-max optimal convergence
rate. Furthermore, we design a data-driven pruning procedure that avoids parameter tun-
ing and provides a fast converging estimator. Comprehensive experiments conducted on
synthetic and real data sets show the superior performance of our proposed methods. Both
our theoretical and experimental findings demonstrate the effectiveness of public data com-
pared to private data, which leads to practical suggestions for prioritizing non-private data
collection.
Keywords: Decision Tree, Local Differential Privacy, Non-parametric Classification,
Posterior Drift, Public Data
1. Introduction
Local differential privacy (LDP) (Kairouz et al., 2014; Duchi et al., 2018), which is a variant
of differential privacy (DP) (Dwork et al., 2006), has gained considerable attention in recent
years, particularly among industrial developers (Erlingsson et al., 2014; Tang et al., 2017).
Unlike central DP, which relies on a trusted curator who has access to the raw data, LDP
assumes that each sample is possessed by a data holder, and each holder privatizes their
data before it is collected by the curator. Although this setting provides stronger protection,
learning with perturbed data requires more samples (Duchi et al., 2018) compared to the
central setting. Moreover, basic techniques such as principal component analysis (Wang and
Xu, 2020; Bie et al., 2022), data standardization (Bie et al., 2022), and tree partitioning (Wu
et al., 2022) are troublesome or even prohibited. Consequently, LDP introduces challenges
for various machine learning tasks that are otherwise considered straightforward, including
©2024 Yuheng Ma and Hanfang Yang.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1563.html.
density estimation (Duchi et al., 2018), mean estimation (Duchi et al., 2018), Gaussian
estimation (Joseph et al., 2019), and change-point detection (Berrett and Yu, 2021).
Fortunately, in certain scenarios, private estimation performance can be enhanced with
an additional public data set. From an empirical perspective, numerous studies demon-
strated the effectiveness of using public data (Papernot et al., 2017, 2018; Yu et al., 2021b,
2022; Nasr et al., 2023; Gu et al., 2023). In most cases, the public data is out-of-distribution
and collected from other related sources. Yet, though public data from an identical distri-
bution has been extensively studied (Bassily et al., 2018; Alon et al., 2019; Kairouz et al.,
2021; Amid et al., 2022; Ben-David et al., 2023; Lowy et al., 2023; Wang et al., 2023a),
only a few works systematically described the out-of-distribution relationship. Focusing on
unsupervised problems, i.e. Gaussian estimation, Bie et al. (2022) discussed the gain of
using public data with respect to the total variation distance between private and public
data distributions. As for supervised learning, Gu et al. (2023) discussed the similarities
between data sets to guide the selection of appropriate public data. Ma et al. (2023) lever-
aged public data to create informative partitions for LDP regression trees. Neither of them
discussed the relationship between regression functions of private and public data distribu-
tions. Aiming at the gap between theory and practice in supervised private learning with
public data, we pose the following intriguing questions:
1. Theoretically, when is labeled public data beneficial for LDP learning?
2. Under such conditions, how to effectively leverage public data?
Hyper-parameter tuning is an essential yet challenging task in private learning (Papernot
and Steinke, 2021; Mohapatra et al., 2022) and remains an unsolved problem in local differ-
ential privacy. Common strategies, such as cross-validation and information criteria, remain
unavailable due to their requirement for multiple queries to the training data. Tuning by
holding out a validation set is restrictive in terms of the size of the potential hyperparameter
space (Ma et al., 2023). Moreover, leveraging public data necessitates more complex mod-
els, which often results in an increased number of hyperparameters to tune. Consequently,
models with public data may face difficulties in parameter selection, which leads to the
third consideration:
3. Can data-driven hyperparameter tuning approaches be derived?
In this work, we answer the above questions from the perspective of non-parametric
classification under the non-interactive LDP constraint. Though parametric methods enjoy
a better privacy-utility trade-off (Duchi et al., 2018) under certain assumptions, they are
vulnerable to model misspecification. In the LDP setting, where we have limited prior
knowledge of the data, non-parametric methods safeguard worst-case performance. As
for private classification, the local setting remains rarely explored compared to the central
setting. A notable reason is that most gradient-based methods (Song et al., 2013; Abadi
et al., 2016) are prohibited due to the high demand for memory and communication capacity
on the terminal machine (Tramèr et al., 2022). We consider non-interactive LDP, where the
communication between the curator and the data holders is limited to a single round. This
type of method is favored by practitioners (Smith et al., 2017; Zheng et al., 2017; Daniely
and Feldman, 2019) since they avoid multi-round protocols that are prohibitively slow in
practice due to network latency.
Under such background, we summarize our contributions by answering the questions:
For the first question, we propose to adopt the framework of posterior drift, a setting
in transfer learning (Cai and Wei, 2021), for our analysis. Specifically, given the target
distribution P and the source distribution Q, we require $P_X = Q_X$, i.e. they have the same
marginal distribution. Moreover, their Bayes decision rules are identical, i.e.
$(P(Y = 1 \mid X) - 1/2) \cdot (Q(Y = 1 \mid X) - 1/2) \ge 0$. The assumption covers a wide range
of distributions and includes P = Q as a special case. Theoretically, we for the first
time establish a mini-max lower bound for nonparametric LDP learning with public
data.
To answer the second question, we propose the Locally differentially Private Classifi-
cation Tree (LPCT), a novel algorithm that leverages both public and private data.
Specifically, we first create a tree partition on public data and then estimate the re-
gression function through a weighted average of public and private data. Besides
inheriting the merits of tree-based models such as efficiency, interpretability, and extensibility
to multiple data types, LPCT is superior from both theoretical and empirical
perspectives. Theoretically, we show that LPCT attains the mini-max optimal con-
vergence rate. Empirically, we conduct experiments on both synthetic and real data
sets to show the effectiveness of LPCT over state-of-the-art competitors.
To answer the last question, we propose the Pruned Locally differentially Private
Classification Tree (LPCT-prune), a data-driven classifier that is free of parameter
tuning. Specifically, we query the privatized information with a sufficient tree depth
and conduct an efficient pruning procedure. Theoretically, we show that LPCT-prune
achieves the optimal convergence rate in most cases and maintains a reasonable con-
vergence rate otherwise. Empirically, we illustrate that LPCT-prune with a default
parameter performs comparable to LPCT with the best parameters.
This article is organized as follows. In Section 2, we provide a comprehensive literature
review. Then we formally define our problem setting and introduce our methodology in
Section 3. The obtained theoretical results are presented in Section 4. Then we develop
the data-driven estimator in Section 5. Extensive experiments are conducted on synthetic
data and real data in Sections 6 and 7, respectively. The conclusion and discussions are in
Section 8.
2. Related Work
2.1 Private Learning with Public Data
Leveraging public data in private learning is a research topic that is both theoretically and
practically meaningful to closing the gap between private and non-private methods. There
is a long list of works addressing the issue of private learning with public data from a cen-
tral differential privacy perspective. Through unlabeled public data, Papernot et al. (2017,
2018) fed knowledge privately into student models, whose theoretical benefits are estab-
lished by Liu et al. (2021). Empirical investigations have demonstrated the effectiveness of
pretraining on public data and fine-tuning privately on sensitive data (Li et al., 2022; Yu
et al., 2022; Ganesh et al., 2023; Yu et al., 2023). Using the information obtained in public
data, Wang and Zhou (2020); Kairouz et al. (2021); Yu et al. (2021a); Zhou et al. (2021);
Amid et al. (2022); Nasr et al. (2023) investigated preconditioning or adaptively clipping
the private gradients, which reduce the required amount of noise in DP-SGD and accel-
erate its convergence. Theoretical works such as Bassily et al. (2018); Alon et al. (2019);
Bassily et al. (2020) studied sample complexity bounds for PAC learning and query release
based on the VC dimension of the function space. Bie et al. (2022) used public data to
standardize private data and showed that the sample complexity of Gaussian mean estimation
can be improved in terms of the range parameter. More recently, by relating
to sample compression schemes, Ben-David et al. (2023) presented sample complexities for
several problems including Gaussian mixture estimation. Ganesh et al. (2023) explained
the necessity of public data by investigating the loss landscape.
In contrast, less attention has been devoted to the local setting. Su et al. (2023) considered
half-space estimation with auxiliary unlabeled public data. They constructed several
weak LDP classifiers and labeled the public data by majority vote. They established a sample
complexity that is linear in the dimension and polynomial in other terms for both private
and public data. Wang et al. (2023a) employed public data to estimate the leading eigen-
value of the covariance matrix, which will serve as a reference for data holders to clip their
statistics. Given the learned clipping threshold, the sample complexity for GLM with LDP
for a general class of functions is improved to be polynomial in the error, dimension, and
privacy budget, which is shown to be impossible without public data (Smith et al., 2017; Dagan
and Feldman, 2020). Both Wang et al. (2023a) and Su et al. (2023) considered unlabeled
public data, while our work further leverages the information contained in the labels if
labeled public data is available. More recently, Ma et al. (2023) enhanced LDP regression
by using public data to create an informative partition. However, the relationship between
the source and target distributions is vaguely defined. Also, their theoretical results are no
better than using just private data. Lowy et al. (2023) studied mean estimation, DP-SCO,
and DP-ERM with public data. They assume that the public data and the private data
are identically distributed. Also, the theoretical results are established under sequentially-
interactive LDP. Despite the distinctions in the research problem, their conclusions are
analogous to ours: the asymptotically optimal error can be achieved by using either the private
or the public estimation alone, while the weighted-average estimation achieves a better constant
as well as better empirical performance.
2.2 Locally Private Classification
LDP classification with parametric assumptions can be viewed as a special case of LDP-
ERM (Smith et al., 2017; Zheng et al., 2017; Wang et al., 2018; Daniely and Feldman, 2019;
Dagan and Feldman, 2020; Wang et al., 2023a). Most works provide evidence of hardness on
non-interactive LDP-ERM. Non-parametric methods possess a worse privacy-utility trade-
off (Duchi et al., 2018) and are considered to be harder than parametric problems. Zheng
et al. (2017) investigated the kernel ridge regression and established a bound of estimation
error of the order n
1/4
. Their results only apply to losses that are strongly convex and
smooth. The works most related to ours is that of Berrett and Butucea (2019); Berrett
4
Optimal Locally Private Nonparametric Cla ssification with Public Data
et al. (2021). Berrett and Butucea (2019) studied non-parametric classification with the
older smoothness assumption. They proposed a histogram-based method using the Laplace
mechanism and showed that this approach is mini-max optimal under their assumption. The
strong consistency of this approach is established in Berrett et al. (2021) as a by-product.
Our results include their conclusion as a special case and hold in a stronger sense, i.e.
“uniformly with high probability” instead of “in expectation”.
2.3 Private Hyperparameter Tuning
Researchers highlighted the importance and difficulty of parameter tuning under DP con-
straint (Papernot and Steinke, 2021; Wang et al., 2023b) focusing on central DP. Chaudhuri
and Vinterbo (2013) explored this problem under strong stability assumptions. Liu and Tal-
war (2019) presented DP algorithms for selecting the best parameters in several candidates.
Compared to the original algorithm, their method suffers a factor of 3 in the privacy parameter,
meaning that 2/3 of the privacy budget is spent on the selection process. Papernot
and Steinke (2021) improved the result with a much tighter privacy bound in Rényi DP
by running the algorithm a random number of times and returning the best result. Recently, Wang
et al. (2023b) proposed a framework that reduces DP parameter selection to a non-DP
counterpart. This approach leverages existing hyperparameter optimization methods and
outperforms the uniform sampling approaches (Liu and Talwar, 2019; Papernot and Steinke,
2021). Ramaswamy et al. (2020) proposed to tune hyperparameters on public data sets.
Mohapatra et al. (2022) argued that, under a limited privacy budget, DP optimizers that re-
quire less tuning are preferable. As far as we know, there is a lack of research focusing on the
LDP parameter selection. Butucea et al. (2020) investigated adaptive parameter selection
in non-parametric density estimation, and provided a data-driven wavelet estimator.
3. Locally Differentially Private Classification Tree
In this section, we introduce the Locally differentially Private Classification Tree (LPCT).
After explaining necessary notations and problem setting in Section 3.1, we propose our
decision tree partition rule in Section 3.2. Finally, Section 3.3 introduces our privacy mech-
anism and proposes the final algorithm.
3.1 Problem Setting
We introduce necessary notations. For any vector $x$, let $x^i$ denote the $i$-th element of
$x$. Recall that for $1 \le p < \infty$, the $L_p$-norm of $x = (x^1, \ldots, x^d)$ is defined by
$\|x\|_p := (|x^1|^p + \cdots + |x^d|^p)^{1/p}$. Let $x^{i:j} = (x^i, \cdots, x^j)$ be the slicing view of $x$ in the
$i, \cdots, j$ positions. Throughout this paper, we use the notation $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ to
denote that there exist positive constants $c$ and $c'$ such that $a_n \le c\, b_n$ and $a_n \ge c'\, b_n$ for all
$n \in \mathbb{N}$. In addition, we denote $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. Let
$a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. Besides, for any set $A \subset \mathbb{R}^d$, the diameter of $A$ is
defined by $\mathrm{diam}(A) := \sup_{x, x' \in A} \|x - x'\|_2$. Let $f_1 \circ f_2$ represent the composition of
functions $f_1$ and $f_2$. Let $A \times B$ be the Cartesian product of sets where $A \subset \mathcal{X}_1$ and
$B \subset \mathcal{X}_2$. For measures $P$ on $\mathcal{X}_1$ and $Q$ on $\mathcal{X}_2$, define the product measure $P \otimes Q$ on
$\mathcal{X}_1 \times \mathcal{X}_2$ as $P \otimes Q(A \times B) = P(A) \cdot Q(B)$. For an integer $k$, denote the $k$-fold product
measure on $\mathcal{X}_1^k$ as $P^k$. Let the standard Laplace random variable have probability density
function $e^{-|x|}/2$ for $x \in \mathbb{R}$.
It is legitimate to consider binary classification, while an extension of our results to
multi-class classification is straightforward. Suppose there are two unknown probability
measures, the private measure P and the public measure Q, on $\mathcal{X} \times \mathcal{Y} = [0,1]^d \times \{0,1\}$. We
observe $n_P$ i.i.d. private samples $D_P = \{(X_1^P, Y_1^P), \ldots, (X_{n_P}^P, Y_{n_P}^P)\}$ drawn from the target
distribution P, and $n_Q$ i.i.d. public samples $D_Q = \{(X_1^Q, Y_1^Q), \ldots, (X_{n_Q}^Q, Y_{n_Q}^Q)\}$ drawn from
the source distribution Q. The data points from the distributions P and Q are also mutually
independent. Our goal is to classify under the target distribution P. Given the observed
data, we construct a classifier $\widehat{f}: \mathcal{X} \to \mathcal{Y}$ which minimizes the classification risk under the
target distribution P:
$$R_P(\widehat{f}) \triangleq P_{(X,Y)\sim P}\big(Y \ne \widehat{f}(X)\big).$$
Here $P_{(X,Y)\sim P}(\cdot)$ means the probability when $(X, Y)$ is drawn from distribution P. The
Bayes risk, which is the smallest possible risk with respect to P, is given by
$R_P^* := \inf\{R_P(f) \mid f: \mathcal{X} \to \mathcal{Y} \text{ is measurable}\}$. In such binary classification problems, the regression
functions are defined as
$$\eta_P(x) \triangleq P(Y = 1 \mid X = x) \quad \text{and} \quad \eta_Q(x) \triangleq Q(Y = 1 \mid X = x), \qquad (1)$$
which represent the conditional distributions $P_{Y|X}$ and $Q_{Y|X}$. The function that achieves the
Bayes risk with respect to P is called the Bayes function, namely, $f_P^*(x) := 1(\eta_P(x) > 1/2)$.
We consider the following setting. The estimator $\widehat{f}$ is considered as a random function
with respect to both $D_P$ and $D_Q$, while its construction process with respect to $D_P$ is locally
differentially private (LDP). The rigorous definition of LDP is as follows.
Definition 1 (Local Differential Privacy) Given data $\{(X_i^P, Y_i^P)\}_{i=1}^{n_P}$, a privatized
information $Z_i$, which is a random variable on $\mathcal{S}$, is released based on $(X_i^P, Y_i^P)$ and $Z_1, \cdots, Z_{i-1}$.
Let $\sigma(\mathcal{S})$ be the $\sigma$-field on $\mathcal{S}$. $Z_i$ is drawn conditional on $(X_i^P, Y_i^P)$ and $Z_{1:i-1}$ via the
distribution $R_i\big(S \mid X_i^P = x, Y_i^P = y, Z_{1:i-1} = z_{1:i-1}\big)$ for $S \in \sigma(\mathcal{S})$. Then the mechanism
$R = \{R_i\}_{i=1}^{n_P}$ is sequentially-interactive $\varepsilon$-locally differentially private ($\varepsilon$-LDP) if
$$\frac{R_i\big(S \mid X_i^P = x, Y_i^P = y, Z_{1:i-1} = z_{1:i-1}\big)}{R_i\big(S \mid X_i^P = x', Y_i^P = y', Z_{1:i-1} = z_{1:i-1}\big)} \le e^{\varepsilon}$$
for all $1 \le i \le n_P$, $S \in \sigma(\mathcal{S})$, $x, x' \in \mathcal{X}$, $y, y' \in \mathcal{Y}$, and $z_{1:i-1} \in \mathcal{S}^{i-1}$. Moreover, if
$$\frac{R_i\big(S \mid X_i^P = x, Y_i^P = y\big)}{R_i\big(S \mid X_i^P = x', Y_i^P = y'\big)} \le e^{\varepsilon}$$
for all $1 \le i \le n_P$, $S \in \sigma(\mathcal{S})$, $x, x' \in \mathcal{X}$, $y, y' \in \mathcal{Y}$, then $R$ is non-interactive $\varepsilon$-LDP.
This formulation is widely adopted (Duchi et al., 2018; Berrett and Butucea, 2019). In
contrast to central DP where the likelihood ratio is taken concerning some statistics of all
data, LDP requires individuals to guarantee their own privacy by considering the likelihood
ratio of each $(X_i^P, Y_i^P)$. Once the view $z$ is provided, no further processing can reduce
the deniability about taking a value $(x, y)$ since any outcome $z$ is nearly as likely to have
come from some other initial value $(x', y')$. Sequentially-interactive mechanisms fulfill the
requirements for a wide range of methods including gradient-based approaches. However,
they can prove to be prohibitively slow in practice. Conversely, non-interactive mechanisms
are quite efficient but suffer from poorer performance (Smith et al., 2017). Our analysis
does not encompass the more general fully-interactive mechanisms due to their complex
nature (Duchi et al., 2018).
Besides the privacy guarantee of P data, we consider the existence of additional public
data from distribution Q. To depict the relationship between P and Q, we consider
the posterior drift setting in the transfer learning literature (Cai and Wei, 2021). Under the
posterior drift model, let $P_X = Q_X$ be the identical marginal distribution. The main difference
between P and Q lies in the regression functions $\eta_P(x)$ and $\eta_Q(x)$. Specifically, let
$\eta_Q(x) = \phi(\eta_P(x))$ for some strictly increasing link function $\phi(\cdot)$ with $\phi(1/2) = 1/2$. The
condition that $\phi$ is strictly increasing leads to the situation where those $X$ that are more
likely to be labeled $Y = 1$ under P are also more likely to be labeled $Y = 1$ under Q. The
assumption $\phi(1/2) = 1/2$ guarantees that those $X$ that are non-informative under P are
the same under Q, and more importantly, the signs of $\eta_P - 1/2$ and $\eta_Q - 1/2$ are identical.
3.2 Decision Tree Partition
In this section, we formalize the decision tree partition process. A binary tree partition
$\pi$ is a disjoint cover of $\mathcal{X}$ obtained by recursively splitting grids into subgrids. While our
methodology applies to any tree partition, it can be challenging to use general partitions
such as the original CART (Breiman, 1984) for theoretical analysis. Following Cai et al.
(2023); Ma et al. (2023), we propose a new splitting rule called the max-edge partition rule.
This rule is amenable to theoretical analysis and can also achieve satisfactory practical
performance. In the non-interactive setting, the private data set is unavailable during the
partition process and the partition is created solely on the public data. Given the public data
set $\{(X_i^Q, Y_i^Q)\}_{i=1}^{n_Q}$, the partition rule is stated as follows:
• Let $A_1^{(0)} := [0,1]^d$ be the initial rectangular cell and $\pi_0 := \{A_j^{(0)}\}_{j \in \mathcal{I}_0}$ be the initialized
cell partition, where $\mathcal{I}_0 = \{1\}$ stands for the initialized index set. In addition, let $p \in \mathbb{N}$
represent the maximum depth of the tree. This parameter is fixed beforehand by the
user and possibly depends on $n$.

• Suppose we have obtained a partition $\pi_{i-1}$ of $\mathcal{X}$ after $i-1$ steps of the recursion. Let
$\pi_i = \emptyset$. In the $i$-th step, let $A_j^{(i-1)} \in \pi_{i-1}$ be $\times_{\ell=1}^d [a_\ell, b_\ell]$ for $j \in \mathcal{I}_{i-1}$. We choose the
edge to be split among the longest edges. The index set of the longest edges is defined as
$$M_j^{(i-1)} = \left\{ k \,\Big|\, |b_k - a_k| = \max_{\ell = 1,\cdots,d} |b_\ell - a_\ell|,\ k = 1, \cdots, d \right\}.$$
Assume we split along the $\ell$-th dimension for $\ell \in M_j^{(i-1)}$. Then $A_j^{(i-1)}$ is partitioned
into a left sub-cell $A_{j,0}^{(i-1)}(\ell)$ and a right sub-cell $A_{j,1}^{(i-1)}(\ell)$ along the midpoint
of the chosen dimension, where $A_{j,0}^{(i-1)}(\ell) = \big\{x \mid x \in A_j^{(i-1)},\ x^\ell < (a_\ell + b_\ell)/2\big\}$ and
$A_{j,1}^{(i-1)}(\ell) = A_j^{(i-1)} \setminus A_{j,0}^{(i-1)}(\ell)$. The dimension to be split is chosen by
$$\min \operatorname*{arg\,min}_{\ell \in M_j^{(i-1)}} \mathrm{score}\left(A_{j,0}^{(i-1)}(\ell),\, A_{j,1}^{(i-1)}(\ell),\, D_Q\right), \qquad (2)$$
where the score$(\cdot)$ function can be any criterion for decision tree classifiers such as the Gini
index or information gain. The min operator implies that if the arg min function
returns multiple indices, we select the smallest index. Thus, the partition process is
feasible even if there is no public data, where all axes have the same score.

• Once $\ell$ is selected, let $\pi_i = \pi_i \cup \{A_{j,0}^{(i-1)}(\ell), A_{j,1}^{(i-1)}(\ell)\}$.

Algorithm 1: Max-edge Partition
Input: Public data $D_Q$, depth $p$.
Initialization: $\pi_0 = \{[0,1]^d\}$.
for $i = 1$ to $p$ do
    $\pi_i = \emptyset$
    for $j$ in $\mathcal{I}_{i-1}$ do
        Select $\ell$ as in (2).
        $\pi_i = \pi_i \cup \{A_{j,0}^{(i-1)}(\ell), A_{j,1}^{(i-1)}(\ell)\}$.
    end
end
Output: Partition $\pi_p$
The complete process is presented in Algorithm 1 and illustrated in Figure 1. For each grid,
the partition rule selects the midpoint of the longest edges that achieves the largest score
reduction. This procedure continues until either node has no samples or the depth of the
tree reaches its limit. We also define relationships between nodes which are necessary in
Section 5. For node $A_j^{(i)}$, we define its parent node as
$$\mathrm{parent}(A_j^{(i)}) := A_{j'}^{(i-1)} \quad \text{s.t.} \quad A_{j'}^{(i-1)} \in \pi_{i-1},\ A_j^{(i)} \subset A_{j'}^{(i-1)}.$$
For $i' < i$, the ancestor of a node $A_j^{(i)}$ with depth $i'$ is then defined as
$$\mathrm{ancestor}(A_j^{(i)}, i') = \mathrm{parent}^{\,i-i'}(A_j^{(i)}) = \underbrace{\mathrm{parent} \circ \cdots \circ \mathrm{parent}}_{i-i'}(A_j^{(i)}),$$
which is the grid in $\pi_{i'}$ that contains $A_j^{(i)}$.
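To make the partition rule concrete, the following Python sketch implements the max-edge rule of Algorithm 1 under simplifying assumptions of our own (it is not the released implementation): cells are stored as (lower, upper) corner arrays, the score in (2) is taken to be the weighted Gini impurity of the two children on the public sample, and splitting always proceeds to the full depth.

import numpy as np

def gini(y):
    # Gini impurity of a binary label vector; empty cells score 0 by convention.
    if len(y) == 0:
        return 0.0
    p = float(np.mean(y))
    return 2.0 * p * (1.0 - p)

def max_edge_split(lower, upper, XQ, yQ):
    # Candidate dimensions are the longest edges; among them, pick the midpoint split
    # with the smallest weighted child impurity (ties resolved to the smallest index).
    lengths = upper - lower
    candidates = np.flatnonzero(lengths == lengths.max())
    best_dim, best_score = int(candidates[0]), np.inf
    for ell in candidates:
        mid = (lower[ell] + upper[ell]) / 2.0
        left = XQ[:, ell] < mid
        score = (left.sum() * gini(yQ[left]) + (~left).sum() * gini(yQ[~left])) / max(len(yQ), 1)
        if score < best_score:
            best_dim, best_score = int(ell), score
    return best_dim

def build_partition(XQ, yQ, depth):
    # Each cell is (lower corner, upper corner, indices of public samples inside it).
    d = XQ.shape[1]
    cells = [(np.zeros(d), np.ones(d), np.arange(len(XQ)))]
    for _ in range(depth):
        new_cells = []
        for lower, upper, idx in cells:
            ell = max_edge_split(lower, upper, XQ[idx], yQ[idx])
            mid = (lower[ell] + upper[ell]) / 2.0
            left_mask = XQ[idx, ell] < mid
            up_left, low_right = upper.copy(), lower.copy()
            up_left[ell], low_right[ell] = mid, mid
            new_cells.append((lower, up_left, idx[left_mask]))
            new_cells.append((low_right, upper, idx[~left_mask]))
        cells = new_cells
    return [(lo, up) for lo, up, _ in cells]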
3.3 Privacy Mechanism for Any Partition
This section focuses on the privacy mechanism based on general tree partitions. We first
introduce the necessary definitions and then present our private estimation based on the
Laplace mechanism in (8). To avoid confusion, we refer the readers to the summary of
the defined estimators in Appendix A.1.
Figure 1: Partition created by the max-edge rule. The four panels show $\pi_0$, $\pi_1$, $\pi_2$, and $\pi_3$. The areas filled with orange represent the corresponding $A_j^{(i)}$ and the blue lines represent the partition boundaries after each level of partitioning. Red boxes contain an ancestor with depth 2, which is also a parent. Blue boxes contain an ancestor with depth 1.
For index set $\mathcal{I}_p$, let $\pi = \pi_p = \{A_j^{(p)}\}_{j \in \mathcal{I}_p}$ be the tree partition created by Algorithm 1.
A population classification tree estimation is defined as
$$\eta_P^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\int_{A_j^{(p)}} \eta_P(x')\, dP_X(x')}{\int_{A_j^{(p)}} dP_X(x')}. \qquad (3)$$
The label is inferred using the population tree classifier, which is defined as
$f_P^{\pi}(x) = 1\big(\eta_P^{\pi}(x) > 1/2\big)$. Here, we let $0/0 = 0$ by definition.
To get an empirical estimator given the data set $D_P = \{(X_1^P, Y_1^P), \ldots, (X_{n_P}^P, Y_{n_P}^P)\}$, we
estimate the numerator and the denominator of (3) separately. To estimate the denominator,
each sample $(X_i^P, Y_i^P)$ contributes a one-hot vector $U_i^P \in \{0,1\}^{|\mathcal{I}_p|}$ whose $j$-th
element is $U_i^{Pj} = 1\{X_i^P \in A_j^{(p)}\}$. Then an estimation of $\int_{A_j^{(p)}} dP_X(x')$ is
$\frac{1}{n_P}\sum_{i=1}^{n_P} U_i^{Pj}$, which is the number of samples in $A_j^{(p)}$ divided by $n_P$. Analogously, let
$V_i^P = Y_i^P \cdot U_i^P \in \{0,1\}^{|\mathcal{I}_p|}$. An estimation of $\int_{A_j^{(p)}} \eta_P(x')\, dP_X(x')$ is
$\frac{1}{n_P}\sum_{i=1}^{n_P} V_i^{Pj}$, which is the sum of the labels in $A_j^{(p)}$ divided by $n_P$.
Combining the pieces, a classification tree estimation is defined as
$$\hat{\eta}_P^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\sum_{i=1}^{n_P} V_i^{Pj}}{\sum_{i=1}^{n_P} U_i^{Pj}}. \qquad (4)$$
The corresponding classifier is defined as $\hat{f}_P^{\pi}(x) = 1\big(\hat{\eta}_P^{\pi}(x) > 1/2\big)$. In other words, $\hat{\eta}_P^{\pi}(x)$
estimates $\eta_P(x)$ by the average of the responses in the cell. In the non-private setting, each
data holder prepares $U_i^P$ and $V_i^P$ according to the partition $\pi$ and sends them to the curator.
Then the curator aggregates the transmissions following (4).
To protect the privacy of each data holder, we propose to estimate the numerator and denominator
of the population regression tree using a privatized method. Specifically, $U_i^P$ and $V_i^P$
are combined with the Laplace mechanism (Dwork et al., 2006) before being sent to the
curator. For i.i.d. standard Laplace random variables $\zeta_i^j$ and $\xi_i^j$, let
$$\tilde{U}_i^P := U_i^P + \frac{4}{\varepsilon}\zeta_i = U_i^P + \left(\frac{4}{\varepsilon}\zeta_i^1, \cdots, \frac{4}{\varepsilon}\zeta_i^{|\mathcal{I}_p|}\right). \qquad (5)$$
Then a privatized estimation of $\int_{A_j^{(p)}} dP_X(x)$ is $\frac{1}{n_P}\sum_{i=1}^{n_P} \tilde{U}_i^{Pj}$. Analogously, let
$$\tilde{V}_i^P := V_i^P + \frac{4}{\varepsilon}\xi_i = V_i^P + \left(\frac{4}{\varepsilon}\xi_i^1, \cdots, \frac{4}{\varepsilon}\xi_i^{|\mathcal{I}_p|}\right). \qquad (6)$$
An estimation of $\int_{A_j^{(p)}} \eta_P(x)\, dP_X(x)$ is $\frac{1}{n_P}\sum_{i=1}^{n_P} \tilde{V}_i^{Pj}$. Then, using the privatized information
$(\tilde{U}_i^P, \tilde{V}_i^P)$, $i = 1, \cdots, n_P$, we can estimate the regression function as
$$\tilde{\eta}_P^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\sum_{i=1}^{n_P} \tilde{V}_i^{Pj}}{\sum_{i=1}^{n_P} \tilde{U}_i^{Pj}}. \qquad (7)$$
The procedure is also derived by Berrett and Butucea (2019). As an alternative, one can
use randomized response (Warner, 1965) since both U and V are binary vectors. The following
proposition shows that the estimation procedures satisfy LDP.
Proposition 2 Let $\pi = \{A_j\}_{j \in \mathcal{I}}$ be any partition of $\mathcal{X}$ with $\cup_{j \in \mathcal{I}} A_j = \mathcal{X}$ and $A_i \cap A_j = \emptyset$,
$i \ne j$. Then the privacy mechanism defined in (5) and (6) is non-interactive $\varepsilon$-LDP.
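To make the mechanism in (5) and (6) concrete, the following sketch (ours, not the released code) shows what a single data holder would transmit, assuming the curator has broadcast the partition as a list of (lower, upper) boxes; the function name encode_and_privatize is illustrative.

import numpy as np

def encode_and_privatize(x, y, cells, eps, rng):
    # One-hot cell membership U and label-weighted vector V = y * U, cf. Section 3.3.
    U = np.array([float(np.all(x >= lo) and np.all(x < up)) for lo, up in cells])
    V = y * U
    # Laplace mechanism: independent noise of scale 4/eps per coordinate, matching (5)-(6).
    U_tilde = U + rng.laplace(scale=4.0 / eps, size=len(cells))
    V_tilde = V + rng.laplace(scale=4.0 / eps, size=len(cells))
    return U_tilde, V_tilde

# Example usage: rng = np.random.default_rng(0); encode_and_privatize(x, 1, cells, eps=1.0, rng=rng)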
In the case where one also has access to the Q-data in addition to the P-data, the
Q data can be used to help the classification task under the target distribution P and
should be taken into consideration. The utilization of P data must satisfy the local DP
constraint, while the Q data can be used arbitrarily. We combine the existing private
estimation (7) and the non-private estimation (4) by taking a weighted average of both
estimators. Denote the encoded information from $D_P$ and $D_Q$ as $(U_i^P, V_i^P)$ and
$(U_i^Q, V_i^Q)$, respectively. We use the same partition for the P estimator and the Q estimator; namely,
(4) and (7) are constructed via an identical $\pi$. When taking the average, data from the
different distributions should receive different weights since the signal strengths differ
between P and Q. To make the classification at $x \in \mathcal{X}$, the Locally differentially Private
Classification Tree estimation (LPCT) is defined as follows:
$$\tilde{\eta}_{P,Q}^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\sum_{i=1}^{n_P} \tilde{V}_i^{Pj} + \lambda \sum_{i=1}^{n_Q} V_i^{Qj}}{\sum_{i=1}^{n_P} \tilde{U}_i^{Pj} + \lambda \sum_{i=1}^{n_Q} U_i^{Qj}}. \qquad (8)$$
Here, we let $\lambda \ge 0$. The locally private tree classifier is $\tilde{f}_{P,Q}^{\pi}(x) = 1\big(\tilde{\eta}_{P,Q}^{\pi}(x) > 1/2\big)$. In each
leaf node $A_j^{(p)}$, the classifier is assigned a label based on the weighted average of estimations
from P and Q data. The parameter $\lambda$ serves to balance the relative contributions of both P
and Q. A higher value of $\lambda$ indicates that the final predictions are predominantly influenced
by the public data, and vice versa.
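On the curator side, computing (8) then amounts to summing the received noisy vectors and the public-data encodings cell by cell. The sketch below is a minimal illustration under our own naming conventions: U_tilde_P and V_tilde_P are arrays of shape (n_P, number of cells) collected from the data holders, and U_Q, V_Q are the analogous non-private encodings of the public data.

import numpy as np

def lpct_predict(cell_of_x, U_tilde_P, V_tilde_P, U_Q, V_Q, lam):
    # cell_of_x: for each query point, the index of the leaf cell it falls into.
    num = V_tilde_P.sum(axis=0) + lam * V_Q.sum(axis=0)   # per-cell numerator of (8)
    den = U_tilde_P.sum(axis=0) + lam * U_Q.sum(axis=0)   # per-cell denominator of (8)
    with np.errstate(divide="ignore", invalid="ignore"):
        eta = np.where(den != 0, num / den, 0.0)          # 0/0 := 0 convention
    return (eta[cell_of_x] > 0.5).astype(int)             # plug-in LPCT classifier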
The estimator defined in (8) easily adapts to solve the non-private transfer learning
problem in Cai and Wei (2021) if we define
$$\hat{\eta}_{P,Q}^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\sum_{i=1}^{n_P} V_i^{Pj} + \lambda \sum_{i=1}^{n_Q} V_i^{Qj}}{\sum_{i=1}^{n_P} U_i^{Pj} + \lambda \sum_{i=1}^{n_Q} U_i^{Qj}}. \qquad (9)$$
Similar to the nearest neighbor estimator in Cai and Wei (2021), (9) estimates by taking
weighted averages of labels from nearby samples. Our theoretical results in Section 4.2
imply that (9) is also rate optimal for non-private transfer learning. A key
advantage of (9) is that, given the partition, it stores only the prediction
values at each node, which makes it convenient to inject noise. Conversely, the nearest
neighbor estimator (similarly, the kernel estimator) requires a query to the entire training
data set for each single test sample, which prohibits any modification towards a private
estimator. See Kroll (2021) for an example of such limitations.
4. Theoretical Results
In this section, we present the obtained theoretical results. We present the matching lower
and upper bounds on the excess risk in Sections 4.1 and 4.2. Finally, we demonstrate that LPCT
is free from the unknown range parameter in Section 4.3. All technical proofs can be found
in Appendix B.
4.1 Mini-max Convergence Rate
We first give the necessary assumptions on the distributions P and Q. Then we provide the
mini-max rate under the assumptions.
Assumption 1 Let $\alpha \in (0, 1]$. Assume the regression function $\eta_P: \mathcal{X} \to \mathbb{R}$ is $\alpha$-Hölder
continuous, i.e. there exists a constant $c_L > 0$ such that for all $x_1, x_2 \in \mathcal{X}$,
$|\eta_P(x_1) - \eta_P(x_2)| \le c_L \|x_1 - x_2\|^{\alpha}$. Also, assume that the marginal density functions of P and Q are
upper and lower bounded, i.e. $\underline{c} \le dP_X(x) \le \overline{c}$ for some $\overline{c}, \underline{c} > 0$.
Assumption 2 (Margin Assumption) Let $\beta \ge 0$ and $C_\beta > 0$. Assume
$$P_X\left(\left|\eta_P(X) - \frac{1}{2}\right| < t\right) \le C_\beta\, t^{\beta}.$$
Assumption 3 (Relative Signal Exponent) Let the relative signal exponent $\gamma \in (0, \infty)$
and $C_\gamma \in (0, \infty)$. Assume that
$$\left(\eta_Q(x) - \frac{1}{2}\right)\left(\eta_P(x) - \frac{1}{2}\right) \ge 0, \quad \text{and} \quad \left|\eta_Q(x) - \frac{1}{2}\right| \ge C_\gamma \left|\eta_P(x) - \frac{1}{2}\right|^{\gamma}.$$
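For a concrete instance (our illustration, not from the paper), the link $\phi(t) = \frac{1}{2} + \mathrm{sign}\big(t - \frac{1}{2}\big)\,\big|2\big(t - \frac{1}{2}\big)\big|^{\gamma}/2$ is strictly increasing with $\phi(1/2) = 1/2$, and $\eta_Q = \phi(\eta_P)$ satisfies Assumption 3 with $C_\gamma = 2^{\gamma-1}$.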
Assumptions 1 and 2 are standard conditions that have been widely used for non-parametric
classification problems (Audibert and Tsybakov, 2007; Samworth, 2012; Chaudhuri
and Dasgupta, 2014). Assumption 3 depicts the similarity between the conditional
distributions of the public data and the private data. When $\gamma$ is small, the signal strength of Q
is strong compared to P. In the extreme instance of $\gamma = 0$, $\eta_Q(x)$ consistently remains
distant from 1/2 by a constant. In contrast, when $\gamma$ is large, there is not much information
contained in Q. The special case of $\gamma = \infty$ allows $\eta_Q(x)$ to always be 1/2 and is thus non-informative.
We present the following theorem that specifies the mini-max lower bound
with LDP constraints on P data under the above assumptions. The proof is based on first
constructing two function classes and then applying the information inequalities for local
privacy (Duchi et al., 2018) and Assouad's lemma (Tsybakov, 2009).
Theorem 3 Denote the function class of $(P, Q)$ satisfying Assumptions 1, 2, and 3 by $\mathcal{F}$.
Then for any estimator $\widehat{f}$ that is sequentially-interactive $\varepsilon$-LDP, there holds
$$\inf_{\widehat{f}} \sup_{\mathcal{F}} \mathbb{E}_{X,Y}\left[R_P(\widehat{f}) - R_P^*\right] \gtrsim \left( n_P (e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \right)^{-\frac{\alpha(\beta+1)}{2\alpha+2d}}. \qquad (10)$$
If $n_Q = 0$, our results reduce to the special case of LDP non-parametric classification
with only private data. In this case, the mini-max lower bound is of the order
$\big(n_P (e^{\varepsilon}-1)^2\big)^{-\frac{\alpha(\beta+1)}{2\alpha+2d}}$. Previously, Berrett and Butucea (2019) provided an estimator that
reaches the mini-max optimal convergence rate for small $\varepsilon$. If $n_P = 0$, the result recovers the
learning rate $n_Q^{-\frac{\alpha(\beta+1)}{2\gamma\alpha+d}}$ established for transfer learning under posterior drift (Cai and Wei,
2021). In this case, we have only source data from Q and want to generalize the classifier
to the target distribution P.
The mini-max convergence rate (10) is the same as if one fits an estimator using only
P data with sample size $n_P + (e^{\varepsilon}-1)^2 \cdot n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$. The contribution of Q data is substantial
when $n_P \lesssim (e^{\varepsilon}-1)^2 \cdot n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$. Otherwise, the convergence rate is not significantly better
than using P data only. The quantity $(e^{\varepsilon}-1)^2 \cdot n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$, which is referred to as the effective
sample size, represents the theoretical gain of using Q data with sample size $n_Q$ under given
$\varepsilon$, $\alpha$, $d$, and $\gamma$. Notably, the smaller $\gamma$ is, the more effective the Q data becomes, and the
greater gain is acquired by using Q data. Regarding different values of $\gamma$, our work covers
the setting in previous work when a small amount of effective public samples is required
(Bie et al., 2022; Ben-David et al., 2023), and when a large amount of noisy/irrelevant public
samples is required (Ganesh et al., 2023; Yu et al., 2023). Also, the public data is more
effective for small $\varepsilon$, i.e. under stronger privacy constraints. Intuitively, in the presence
of stringent privacy constraints, the utilization of Q data is as effective as incorporating
a substantial amount of P data. Furthermore, even for a constant $\varepsilon$, our effective sample
size significantly exceeds the $n_Q^{\frac{2\alpha+d}{2\gamma\alpha+d}}$ of the non-private scenario (Cai and Wei, 2021). This
suggests that, in comparison to transfer learning in the non-private setting, public data
assumes greater value in the private setting.
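As a hedged numerical illustration of the effective sample size (the concrete numbers are our example, not taken from the paper): with $\alpha = 1$, $d = 2$, $\gamma = 0.5$, $\varepsilon = 1$, and $n_Q = 1000$, the exponent is $\frac{2\alpha+2d}{2\gamma\alpha+d} = 2$, so $(e^{\varepsilon}-1)^2 \cdot n_Q^2 \approx 2.95 \times 10^6$, i.e. a thousand such public samples contribute as much as roughly three million private samples at this privacy level.

import math

def effective_sample_size(n_Q, eps, alpha, gamma, d):
    # (e^eps - 1)^2 * n_Q^{(2*alpha + 2*d) / (2*gamma*alpha + d)}, cf. the discussion of (10).
    expo = (2 * alpha + 2 * d) / (2 * gamma * alpha + d)
    return (math.exp(eps) - 1) ** 2 * n_Q ** expo

print(effective_sample_size(n_Q=1000, eps=1.0, alpha=1.0, gamma=0.5, d=2))  # roughly 2.95e6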
4.2 Convergence Rate of LPCT
In the following, we show that the convergence rate of LPCT reaches the lower bound with
proper parameters. The proof is based on a carefully designed decomposition in Appendix
B.3. Define the quantity
$$\delta := \left( \frac{n_P \varepsilon^2}{\log(n_P + n_Q)} + \left(\frac{n_Q}{\log(n_P + n_Q)}\right)^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \right)^{-\frac{d}{2\alpha+2d}}. \qquad (11)$$
Theorem 4 Let $\pi$ be constructed through the max-edge partition in Section 3.2 with depth
$p$. Let $\tilde{f}_{P,Q}^{\pi}$ be the locally private two-sample tree classifier. Suppose Assumptions 1, 2, and
3 hold. Then, for $n_P^{-\frac{\alpha}{2\alpha+2d}} \lesssim \varepsilon \lesssim n_P^{\frac{d}{4\alpha+2d}}$, if $\lambda \asymp \delta^{\frac{(\gamma\alpha - \alpha)\wedge(2\gamma\alpha - d)}{d}}$ and $2^p \asymp \delta^{-1}$, there holds
$$R_P\big(\tilde{f}_{P,Q}^{\pi}\big) - R_P^* \lesssim \delta^{\frac{\alpha(\beta+1)}{d}} = \left( \frac{n_P \varepsilon^2}{\log(n_P + n_Q)} + \left(\frac{n_Q}{\log(n_P + n_Q)}\right)^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \right)^{-\frac{\alpha(\beta+1)}{2\alpha+2d}} \qquad (12)$$
with probability $1 - 3/(n_P + n_Q)^2$ with respect to $P^{n_P} \otimes Q^{n_Q} \otimes R$, where $R$ is the joint
distribution of the privacy mechanisms in (5) and (6).
Compared to Theorem 3, LPCT with the best parameter choice reaches the mini-max
lower bound up to a logarithmic factor. Both parameters $\lambda$ and $p$ should be assigned
a value of the correct order to achieve the optimal trade-off. The depth $p$ increases when
either $n_P$ or $n_Q$ grows. This is intuitively correct since the depth of a decision tree should
increase with a larger number of samples. The strength ratio $\lambda$, on the other hand, is closely
related to the value of $\gamma$. As explained earlier, a small $\gamma$ guarantees a stronger signal in
$\eta_Q$. Thus, the value of $\lambda$ is decreasing in $\gamma$: when the signal strength of $\eta_Q$ is
large, we rely more on the estimation from Q data by assigning it a larger ratio.
We note that when $\varepsilon$ is large, there is a gap between $(e^{\varepsilon}-1)^2$ in the lower bound and
$\varepsilon^2$ in the upper bound. The problem can be mitigated by employing the randomized response
mechanism (Warner, 1965) instead of the Laplace mechanism, as mentioned in Section
3.3. The theoretical analysis generalizes analogously. Here, we do not address this issue
as we want to focus on public data. The theorem restricts $n_P^{-\frac{\alpha}{2\alpha+2d}} \lesssim \varepsilon \lesssim n_P^{\frac{d}{4\alpha+2d}}$. When
$\varepsilon \lesssim n_P^{-\frac{\alpha}{2\alpha+2d}}$, the estimator is drastically perturbed by the random noise and the convergence
rate is no longer optimal. As $\varepsilon$ gets larger, the upper bound of the excess risk decreases until
$\varepsilon \asymp n_P^{\frac{d}{4\alpha+2d}}$. When $\varepsilon \gtrsim n_P^{\frac{d}{2\alpha+2d}}$, the random noise is negligible and the convergence rate
remains
$$\left( \frac{n_P}{\log(n_P + n_Q)} + \left(\frac{n_Q}{\log(n_P + n_Q)}\right)^{\frac{2\alpha+d}{2\gamma\alpha+d}} \right)^{-\frac{\alpha(\beta+1)}{2\alpha+d}},$$
which recovers the rate of non-private learning as established in Cai and Wei (2021).
which recovers the rate of non-private learning as established in Cai and Wei (2021).
We discuss the order of ratio λ. When γ
d
α
1, we have λ δ
(γ1)α
d
. This rate
is exactly the same as in the conventional transfer learning case (Cai and Wei, 2021) and
is enough to achieve the optimal trade-off. Yet, under privacy constraints, this choice
of λ will fail when γ <
d
α
1. Roughly speaking, if we let λ δ
(γ1)α
d
in this case,
there is a probability such that the noise
4
ε
P
n
P
i=1
ζ
j
i
in (8) will dominate the non-private
density estimation
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
. When the domination happens, the point-wise
estimation is decided by random noises, and thus the overall excess risk fails to converge.
To alleviate this issue, we assign a larger value λ δ
2γαd
d
> δ
(γ1)α
d
to ensure
4
ε
P
n
P
i=1
ζ
j
i
.
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
. As an interpretation, in the private setting, we place greater
reliance on the estimation by public data compared to the non-private setting.
Last, we discuss the average computational complexity of LPCT. The training stage consists of two parts. Similar to the
standard decision tree algorithm (Breiman, 1984), the partition procedure takes $O(p\, n_Q\, d)$
time. The computation of (8) takes $O(p(n_P + n_Q))$ time. From the proof of Theorem 4,
we know that $2^p \asymp \delta^{-1}$, where $\delta^{-1} \lesssim n_P + n_Q$. Thus the training stage complexity is
around $O((n_P + n_Q d)\log(n_P + n_Q))$, which is linear. Since each prediction of the decision
tree takes $O(p)$ time, the test time for each test instance is no more than $O(\log(n_P + n_Q))$.
As for storage complexity, since LPCT only requires storing the tree structure and
the prediction value at each node, the space complexity of LPCT is $O(\delta^{-1})$, which is also
sub-linear. In short, LPCT is an efficient method in time and memory.
4.3 Eliminating the Range Parameter
We discuss how public data can eliminate the range parameter contained in the convergence
rate. When $\mathcal{X}$ is unknown, one must select a predefined range $[-r, r]^d$ for creating partitions.
Only then do the data holders have a reference for encoding their information. In this case,
the convergence rate of Berrett and Butucea (2019) becomes
$\big(r^{2d}/(n_P\varepsilon^2)\big)^{\frac{(1+\beta)\alpha}{2\alpha+2d}} \vee \big(1 - \int_{[-r,r]^d} dP(x)\big)$, which is slower than in the known $\mathcal{X} = [0,1]^d$ case for any $r$. A small $r$ means that
a large part of the informative $\mathcal{X}$ will be ignored (the second term is large), whereas a large
$r$ may create too many redundant cells (the first term is large). See Györfi and Kroll (2022)
for a detailed derivation. Without much prior knowledge, this parameter can be hard to tune.
The removal of the range parameter is first discussed in Bie et al. (2022), where $d+1$ public
data points are required. Our result shows that we do not need to tune this range parameter with
the help of public data.
Theorem 5 Suppose the domain $\mathcal{X} = \times_{j=1}^d [a_j, b_j]$ is unknown. Let $\hat{a}_j = \min_i X_i^j$ and
$\hat{b}_j = \max_i X_i^j$. Define the min-max scaling map $M: \mathbb{R}^d \to \mathbb{R}^d$ where $M(x)^j = (x^j - \hat{a}_j)/(\hat{b}_j - \hat{a}_j)$.
The min-max scaled $D_P$ and $D_Q$ are written as $M(D_P)$ and $M(D_Q)$. Then for $\tilde{f}_{P,Q}^{\pi}$
trained within $[0,1]^d$ on $M(D_P)$ and $M(D_Q)$ as in Theorem 4, we have
$$R_P\Big(\big(\tilde{f}_{P,Q}^{\pi} \cdot 1_{[0,1]^d}\big) \circ M\Big) - R_P^* \lesssim \delta^{\frac{\alpha(\beta+1)}{d}} + \frac{\log n_Q}{n_Q}$$
with probability $1 - 3/(n_P + n_Q)^2 - d/n_Q^2$ with respect to $P^{n_P} \otimes Q^{n_Q} \otimes R$.
Compared to Theorem 4, the upper bound of Theorem 5 increases by $\log n_Q / n_Q$, which
is a minor term in most cases. Moreover, Theorem 5 fails with additional probability $d/n_Q^2$
since we are estimating $[a_j, b_j]$ with $[\hat{a}_j, \hat{b}_j]$. Both changes indicate that sufficient public
data is required to resolve the unknown range issue.
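A minimal sketch of the min-max scaling step in Theorem 5, under our reading that the range is estimated from the covariates of the public data (variable and function names are ours):

import numpy as np

def fit_min_max(XQ):
    # Per-coordinate range estimated from the public covariates.
    return XQ.min(axis=0), XQ.max(axis=0)

def apply_min_max(X, a_hat, b_hat):
    # Map each coordinate to [0, 1]; points outside the estimated range land outside
    # [0,1]^d and are handled by the indicator 1_{[0,1]^d} in Theorem 5.
    return (X - a_hat) / (b_hat - a_hat)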
5. Data-driven Tree Structure Pruning
In this section, we develop a data-driven approach for parameter tuning and investigate its
theoretical properties. In Section 5.1, we derive a selection scheme in the spirit of Lepski’s
method (Lepskii, 1992), a classical method for adaptively choosing the hyperparameters.
In Section 5.2, we make slight modifications to the scheme and illustrate the effectiveness
of the amended approach.
Figure 2: Illustration of the pruning process: (a) initial estimation, (b) during pruning, (c) pruned estimation. The yellow area filled at node $A_j^{(i)}$ means that for $x \in A_j^{(i)}$, we use the average of labels in the yellow area instead of $A_j^{(i)}$ itself. The prediction in $A_1^{(3)}$ is pruned to its depth-1 ancestor; the prediction in $A_3^{(3)}$ is pruned to its depth-2 ancestor; the prediction in $A_2^{(3)}$ is pruned to its depth-3 ancestor, i.e. not pruned.
5.1 A Naive Strategy
We first develop a private analog of Lepski’s method (Lepskii, 1992) for LPCT. There are
two parameters: the depth p and the relative ratio λ, both of which have an optimal value
depending on the unknown α and γ. We do not set the parameters to specific values.
Instead, we first query the privatized information with a sufficient tree depth $p_0$. Then, we
introduce a greedy procedure to locally prune the tree to an appropriate depth and select
$\lambda$ based on local information.
We provide a preliminary result that will guide the derivation of the pruning procedure.
For the max-edge partition $\pi_{p_0}$ with depth $p_0$ and node $A_j^{(p_0)}$, the depth-$k$ ancestor private tree
estimation is defined as
$$\tilde{\eta}_{P,Q}^{\pi_k}(x) = \sum_{j \in \mathcal{I}_{p_0}} 1\{x \in A_j^{(p_0)}\} \cdot \frac{\sum_{j':\, \mathrm{ancestor}(A_{j'}^{(p_0)},k)=\mathrm{ancestor}(A_j^{(p_0)},k)} \left( \sum_{i=1}^{n_P} \tilde{V}_i^{Pj'} + \lambda \sum_{i=1}^{n_Q} V_i^{Qj'} \right)}{\sum_{j':\, \mathrm{ancestor}(A_{j'}^{(p_0)},k)=\mathrm{ancestor}(A_j^{(p_0)},k)} \left( \sum_{i=1}^{n_P} \tilde{U}_i^{Pj'} + \lambda \sum_{i=1}^{n_Q} U_i^{Qj'} \right)} \qquad (13)$$
and consequently the classifier is defined as $\tilde{f}_{P,Q}^{\pi_k}(x) = 1\big(\tilde{\eta}_{P,Q}^{\pi_k}(x) > 1/2\big)$. See Figure 2(a)
for an illustration of estimation at ancestor nodes.
Proposition 6 Let $\pi$ be constructed through the max-edge partition in Section 3.2 with
depth $p_0$. Let $\tilde{\eta}_{P,Q}^{\pi_k}$ be the tree estimator with depth-$k$ ancestor node in (13). Let $\hat{\eta}_{P,Q}^{\pi_k}$ be the
non-private tree estimator with partition $\pi_k$ in (9). Define the quantity
$$r_k^j := \sqrt{\frac{4C_1\left(2^{p_0-k+3}\varepsilon^{-2}n_P \vee \sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}\right) + C_1\lambda^2\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}}{\left(\sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'} + \lambda\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}\right)^2}} \qquad (14)$$
where the summation over $j'$ is with respect to all descendant nodes of the depth-$k$ ancestor,
i.e. $\mathrm{ancestor}(A_{j'}^{(p_0)}, k) = \mathrm{ancestor}(A_j^{(p_0)}, k)$. The constant $C_1$ is specified in the proof. Then,
for all $\lambda$ and $p_0$, if $x \in A_j^{(p_0)}$, there holds
$$\left| \tilde{\eta}_{P,Q}^{\pi_k}(x) - \mathbb{E}_{Y|X}\left[\hat{\eta}_{P,Q}^{\pi_k}(x)\right] \right| \le r_k^j$$
with probability $1 - 3/(n_P + n_Q)^2$ with respect to $P^{n_P}_{Y^P|X^P} \otimes Q^{n_Q}_{Y^Q|X^Q} \otimes R$, where $R$ is the
joint distribution of the privacy mechanisms in (5) and (6). $\mathbb{E}_{Y|X}$ is taken with respect to
$P^{n_P}_{Y|X} \otimes Q^{n_Q}_{Y|X}$.
The above proposition indicates that if $\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2 \ge r_k^j$, then there holds
$$\mathbb{E}_{Y|X}\left[\hat{\eta}_{P,Q}^{\pi_k}(x)\right] - \frac{1}{2} = \mathbb{E}_{Y|X}\left[\hat{\eta}_{P,Q}^{\pi_k}(x)\right] - \tilde{\eta}_{P,Q}^{\pi_k}(x) + \tilde{\eta}_{P,Q}^{\pi_k}(x) - \frac{1}{2} \ge 0$$
with probability $1 - 3/(n_P + n_Q)^2$. A similar conclusion holds when $\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2 < -r_k^j$. In
other words, when we have
$$\frac{\left|\tilde{\eta}_{P,Q}^{\pi_k}(x) - \frac{1}{2}\right|}{r_k^j} \ge 1, \qquad (15)$$
the population version of the ancestor estimation has the same sign as the sample version
estimation. As long as (15) holds, it is enough to use the sample version estimation, and
our goal is to find the best $k$ for $\mathbb{E}_{Y|X}\big[\hat{\eta}_{P,Q}^{\pi_k}(x)\big]$ to approximate $\eta_P(x)$. Under continuity
Assumption 1, the approximation error is bounded by the radius of the largest cell (see
Lemma 15). Consequently, one can show that $R_P\big(1\big(\mathbb{E}_{Y|X}[\hat{\eta}_{P,Q}^{\pi_k}(x)] > 1/2\big)\big)$ is monotonically
decreasing with respect to $k$. Thus, our goal is to find the largest $k$ such that (15) holds.
On the other hand, (15) depends on $\lambda$. We select the best possible $\lambda$ in order to let
(15) hold for each $j$ and $k$, i.e.
$$\lambda_k^j = \operatorname*{arg\,max}_{\lambda} \frac{\left|\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2\right|}{r_k^j} \qquad (16)$$
for $x \in A_j^{(p_0)}$. The optimization problem in (16) has a closed-form solution that can be
computed efficiently and explicitly. The derivation of the closed-form solution is postponed
to Section A.2.
Based on the above analysis, we can perform the following pruning procedure. We first
query with a sufficient depth, i.e. we select $p_0$ large enough, create a partition on $D_Q$, and
receive the information from data holders. Then we prune back by the following procedure:

• For $j \in \mathcal{I}_{p_0}$, we do the following operation: for $k = p_0, \cdots, 2$ and $r_k^j$ defined in (14),
we assign $\lambda_k^j$ as in (16). Then if (15) holds, let $k_j = k$; else, $k \leftarrow k - 1$.

• If $k_j$ is not assigned, we select $k_j = \arg\max_k |\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2| / r_k^j$.

• Assign $\tilde{\eta}_{P,Q}^{\mathrm{prune}}(x) = \tilde{\eta}_{P,Q}^{\pi_{k_j}}(x)$ for all $x \in A_j^{(p_0)}$.

This process can be done efficiently. The exact process is illustrated in Figure 2. We first
calculate the estimations as well as the relative signal strength at all nodes. Then we trace back
from each leaf node to the root and find the node that maximizes the statistic (15). The
prediction value at the leaf node is assigned as the prediction at the ancestor node with the
maximum statistic value. In total, the pruning causes additional time complexity of at most
$O(2^{p_0})$, which is negligible compared to the original complexity as long as $2^{p_0} \lesssim n_P + d\, n_Q$.
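A schematic version of this prune-back loop, ignoring the P/Q switching introduced in Section 5.2 and assuming the ancestor statistics of (13), (14), and (16) have already been computed and stored (all names are ours):

def prune_depths(eta, r, p0):
    # eta[j][k], r[j][k]: depth-k ancestor estimate and radius r_k^j for leaf j.
    chosen = {}
    for j in eta:
        k_j = None
        for k in range(p0, 1, -1):                      # k = p0, ..., 2
            if abs(eta[j][k] - 0.5) / r[j][k] >= 1:     # condition (15) holds: stop here
                k_j = k
                break
        if k_j is None:                                 # fallback: most confident ancestor
            k_j = max(range(2, p0 + 1), key=lambda k: abs(eta[j][k] - 0.5) / r[j][k])
        chosen[j] = k_j                                 # predict with eta[j][k_j] on this leaf
    return chosen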
5.2 Pruned LDP Classification Tree
In the non-private case, i.e. when $\varepsilon = \infty$, $\tilde{\eta}_{P,Q}^{\mathrm{prune}}$ is essentially Lepski's method (Lepskii,
1992; Cai and Wei, 2021). However, in the presence of privacy concerns, a tricky term
$2^{p_0-k+3}\varepsilon^{-2}n_P$ arises in (14) due to the Laplace noises. Specifically, if
$2^{p_0-k+3}\varepsilon^{-2}n_P \ge \sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}$, the first term in the numerator of (14) contains no information other than
the level of noise. As a result, the theoretical analysis of Lepski's method fails to adapt.
To deal with the issue, we simultaneously perform two pruning procedures using information
from solely P or Q data. We first query with a large depth $p_0 = \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2 + n_Q^{\frac{2+2d}{d}})\rfloor$
and half of the budget $\varepsilon/2$. When $2^{p_0-k+3}\varepsilon^{-2}n_P \ge \sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}$, we compare the values
of $|\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2| / r_k^j$ with $\lambda = 0$ and $\lambda = \infty$, associated with P and Q data, respectively.
If either of the statistics is larger than 1, we terminate the pruning. Else, we compare the
values of the statistics. If the statistic associated with Q is larger, we return the estimation
$\hat{f}_Q^{\pi_k}$. In this case, we guarantee that the pruned classifier is at least as good as the estimation
using only Q data. If the statistic associated with P is larger, we proceed with the pruning to
$k - 1$. If we reach $k \le \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$ for any $j$, we terminate the whole algorithm and
return (7) with depth $\lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$, which costs another $\varepsilon/2$. In this case, we ensure
Q data is unimportant and adopt a terminating condition that is close to the optimal value
$\lfloor \frac{d}{2\alpha+2d}\log_2(n_P\varepsilon^2)\rfloor$. The detailed algorithm is presented in Algorithm 2. Moreover, we show
that the adjusted pruning procedure results in a near-optimal excess risk.
Theorem 7 Let $\tilde{f}_{P,Q}^{\mathrm{prune}} = 1\big(\tilde{\eta}_{P,Q}^{\mathrm{prune}} > 1/2\big)$ be the pruned LPCT defined in Algorithm 2.
Suppose Assumptions 1, 2, and 3 hold. Let $\delta$ be defined in (11). Then, for $n_P^{-\frac{\alpha}{2\alpha+2d}} \lesssim \varepsilon \lesssim
n_P^{\frac{d}{4+2d}}$, with probability $1 - 4/(n_P + n_Q)^2$ with respect to $P^{n_P} \otimes Q^{n_Q} \otimes R$, we have

(i) if $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \lesssim \left(\frac{n_Q}{\log(n_P+n_Q)}\right)^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$, then $R_P\big(\tilde{f}_{P,Q}^{\mathrm{prune}}\big) - R_P^* \lesssim \delta^{\frac{\alpha(\beta+1)}{d}}$;

(ii) if $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \gtrsim \left(\frac{n_Q}{\log(n_P+n_Q)}\right)^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$, then $R_P\big(\tilde{f}_{P,Q}^{\mathrm{prune}}\big) - R_P^* \lesssim \delta^{\frac{\alpha(\alpha+d)(\beta+1)}{d(1+d)}}$.
The conclusion of Theorem 7 is twofold. On one hand, when Q data dominates P
data, i.e. $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \lesssim \big(\frac{n_Q}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$, the pruned estimator remains rate-optimal. In
this case, the estimator is appropriately pruned to the correct depth. On the other hand,
when P data dominates Q data, the convergence rate suffers degradation. The degradation
factor $(\alpha + d)/(1 + d)$ is mild, and even vanishes when $\alpha = 1$, i.e. $\eta_P$ is Lipschitz
continuous. In this case, there are two possible regions in which our conclusion is derived.
If $\big(\frac{n_Q}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \lesssim \big(\frac{n_P\varepsilon^2}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2+d}}$, the estimator triggers termination and is computed
with depth $\lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$. In the opposite case, i.e. $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \gtrsim \big(\frac{n_Q}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \gtrsim
\big(\frac{n_P\varepsilon^2}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2+d}}$, pruning fails to identify the optimal depth $\lfloor \frac{d}{2\alpha+2d}\log_2(n_P\varepsilon^2)\rfloor$. Yet, it
also does not trigger termination at $\lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$. In this region, we take a shortcut
to avoid intricate analysis, while the true convergence rate is more complex yet faster than
that stated in Theorem 7. Notably, one drawback of this shortcut is the discontinuity of
the upper bound at the boundary $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \asymp \big(\frac{n_Q}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$.
Algorithm 2: Pruned LPCT
Input: Private data $D_P = \{(X_i^P, Y_i^P)\}_{i=1}^{n_P}$, public data $D_Q = \{(X_i^Q, Y_i^Q)\}_{i=1}^{n_Q}$.
Initialization: $p_0 = \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2 + n_Q^{\frac{2+2d}{d}})\rfloor$, flag = 0.
Create $\pi$ with depth $p_0$ on $D_Q$. Query (5) and (6) with $\varepsilon/2$.
for $A_j^{(p_0)}$ in $\pi$ do
    for $k = p_0, \cdots, 1$ do
        if $2^{p_0-k+3}\varepsilon^{-2}n_P \ge \sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}$ then
            $r_k^{Qj} = \sqrt{\frac{4\log(n_P+n_Q)}{\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}}}$, $v_k^{Qj} = |\hat{\eta}_Q^{\pi_k} - 1/2| / r_k^{Qj}$
            $r_k^{Pj} = \sqrt{\frac{2^{p_0-k+5}\varepsilon^{-2}n_P\log(n_P+n_Q)}{\big(\sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}\big)^2}}$, $v_k^{Pj} = |\tilde{\eta}_P^{\pi_k} - 1/2| / r_k^{Pj}$
            if $v_k^{Qj} \le v_k^{Pj}$ then
                $s_k^j = \tilde{\eta}_P^{\pi_k}$, $r_k^j = r_k^{Pj}$, $v_k^j = v_k^{Pj}$
                if $k \le \lfloor\frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$ then
                    $k_j = k$, flag = 1, break
                end
            else
                $s_k^j = \hat{\eta}_Q^{\pi_k}$, $r_k^j = r_k^{Qj}$, $v_k^j = v_k^{Qj}$
            end
        else
            $r_k^j = \sqrt{\frac{\big(32\sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'} + 4\lambda^2\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}\big)\log(n_P+n_Q)}{\big(\sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'} + \lambda\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}\big)^2}}$
            $v_k^j = \max_{\lambda}|\tilde{\eta}_{P,Q}^{\pi_k} - 1/2| / r_k^j$, $s_k^j = \tilde{\eta}_{P,Q}^{\pi_k}$ with $\lambda = \lambda_k^j$ in (16)
        end
        if $v_k^j \ge 1$ then
            $k_j = k$, break
        end
    end
end
if flag then
    $p = \lfloor\frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$. Create $\pi_p$ on $D_Q$. Query (5) and (6) with $\varepsilon/2$.
    Output: $\tilde{\eta}_{P,Q}^{\mathrm{prune}} = \tilde{\eta}_P^{\pi_p}$.
else
    Output: $\tilde{\eta}_{P,Q}^{\mathrm{prune}} = \sum_j s_{k_j}^j \cdot 1\{A_j^{(p_0)}\}$.
end
Algorithm 2 lacks true adaptiveness, given that its upper bound suffers a significant
loss factor of $(\alpha + d)/(1 + d)$. Conventional adaptive methods typically incur a loss factor
of logarithmic order (Butucea et al., 2020; Cai and Wei, 2021). Following the approach
outlined in Butucea et al. (2020), one potential avenue for improvement involves fitting
multiple trees for $p = \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor, \cdots, \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2 + n_Q^{\frac{2+2d}{d}})\rfloor$ and designing a
test statistic to zero out the suboptimal candidates.
In summary, the proposed scheme offers benefits from two perspectives: (1) It avoids
multiple queries to the data; all data holders send their privatized messages at most twice.
(2) It selects the sensitive parameters $p$ and $\lambda$ in a data-driven manner. In sacrifice, the bound
on the excess risk is reasonably loosened compared to the model with the best parameters, and
one more query to the private data is required.
6. Experiments on Synthetic Data
We validate our theoretical findings by comparing LPCT with several of its variants on
synthetic data in this section. All experiments are conducted on a machine with a 72-core
Intel Xeon 2.60GHz CPU and 128GB memory. Reproducible code is available on GitHub.¹
6.1 Simulation Design
We motivate the design of the simulation settings. While conducting experiments across various
combinations of $\gamma$, $n_P$, and $n_Q$, our primary focus is on two categories of data: (1) both $n_Q$
and $\gamma$ are small, i.e. we have a small amount of high-quality public data; (2) both $n_Q$ and $\gamma$
are large, i.e. we have a large amount of low-quality public data. When $\gamma$ is small but $n_Q$
is large, using solely public data yields sufficient performance, rendering private estimation
less meaningful. Conversely, when $\gamma$ is large but $n_Q$ is small, incorporating public data into
the estimation offers limited assistance. Moreover, we require $n_P$ to be much larger than
$n_Q$ since practical data collection becomes much easier with a privacy guarantee.
We consider the following pair of distributions P and Q. The marginal distributions
$P_X = Q_X$ are both uniform distributions on $\mathcal{X} = [0,1]^2$. The regression function of P is
$$\eta_P(x) = \frac{1}{2} + \mathrm{sign}\left(x^1 - \frac{1}{3}\right)\cdot\mathrm{sign}\left(x^2 - \frac{1}{3}\right)\cdot\frac{(3x^1 - 1)(3x^2 - 1)}{10}\cdot\max\left(0,\ 1 - \frac{2x^2 - 1}{1/10}\right)$$
while the regression function of Q with parameter $\gamma$ is
$$\eta_Q(x) = \frac{1}{2} + \frac{2}{5}\cdot\left(\frac{5}{2}\cdot\left(\eta_P(x) - \frac{1}{2}\right)\right)^{\gamma}.$$
It can be easily verified that the constructed distributions above satisfy Assumptions 1, 2,
and 3. Throughout the following experiments, we take $\gamma \in \{0.5, 5\}$. For better illustration,
their regression functions are plotted in Figure 3.
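A minimal sketch (ours, not the released code) of how samples from such a pair of distributions can be drawn once a regression function is fixed:

import numpy as np

def sample_from(eta, n, d=2, rng=np.random.default_rng(0)):
    # Draw X uniformly on [0,1]^d and Y ~ Bernoulli(eta(X)); eta is eta_P for the
    # private data and eta_Q for the public data in the posterior drift simulation.
    X = rng.uniform(size=(n, d))
    Y = (rng.uniform(size=n) < np.apply_along_axis(eta, 1, X)).astype(int)
    return X, Y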
We conduct experiments with privacy budgets $\varepsilon \in \{0.5, 1, 2, 4, 8, 1000\}$. Here, $\varepsilon = 1000$
can be regarded as the non-private case which marks the performance limitation of our
methods. The other budget values cover all commonly seen magnitudes of privacy budgets.
1. https://github.com/Karlmyh/LPCT.
Figure 3: Contour plots of the simulation distributions: (a) $\eta_P$, (b) $\eta_Q$ with $\gamma = 0.5$, (c) $\eta_Q$ with $\gamma = 5$.
We take classification accuracy as the evaluation metric. In the simulation studies, the
comparison methods and their abbreviations are as follows:

• LPCT is the proposed algorithm with the max-edge partition rule. We select the
parameters among $p \in \{1, \cdots, 8\}$ and $\lambda \in \{0.1, 0.5, 1, 2, 5, 10, 50, 100, 200, 300, 400,
500, 750, 1000, 1250, 1500, 2000\}$. We choose the Gini index as the reduction criterion.

• LPCT-P and LPCT-Q are the estimations associated with P and Q data, respectively.
Their parameter grids are the same as LPCT.

• LPCT-R is the proposed algorithm using the random max-edge partition (Cai et al.,
2023). Specifically, we randomly select an edge among the longest edges to cut at the
midpoint. Its parameter grids are the same as LPCT.

• LPCT-prune is the pruned locally private classification tree in Algorithm 2. We use
constants that yield a tighter bound than (14).
For each model, we report the best result over its parameter grids, with the best result
determined based on the average result of at least 100 replications.
6.2 Simulation Results
In Sections 6.2.1 and 6.2.2, we analyze the influence of underlying parameters and sample
sizes to validate the theoretical findings. In Section 6.2.3, we analyze the parameters of
LPCT to understand their behaviors. We illustrate various ways that public data benefits
the estimation in Section 6.2.4.
6.2.1 Influence of underlying parameters
Privacy-utility trade-off We analyze how the privacy budget ε influences the quality
of prediction in terms of accuracy. We generate 10,000 private samples, with 50 and 1,000
public samples for γ = 0.5 and γ = 5, respectively. The results are presented in Figure 4(a)
and 4(b). In both cases, the performances of all models increase as ε increases. Furthermore,
the performance of LPCT is consistently better than both LPCT-P and LPCT-Q, demon-
strating the effectiveness of our proposed method in combining both data. LPCT-prune
performs reasonably worse than LPCT. In the high privacy regime ($\varepsilon \le 1$), LPCT-P is
almost non-informative and our estimator performs similarly to LPCT-Q. In the low privacy
Figure 4: Relationship between accuracy and the underlying parameters $\varepsilon$ and $\gamma$: (a) privacy-utility trade-off with $\gamma = 0.5$ and $n_Q = 50$, (b) privacy-utility trade-off with $\gamma = 5$ and $n_Q = 1{,}000$, (c) accuracy versus $\gamma$.
In the low privacy regime (ε ≥ 8), the performance of our estimator improves rapidly along with LPCT-P. In the medium regime, neither LPCT-P nor LPCT-Q clearly outperforms the other, and there is a clear performance gain from using LPCT.
Influence of γ We analyze how the relative signal exponent γ influences the performance. We set n_P = 10,000 and n_Q = 50 while varying γ in {0.5, 0.75, 1, 1.25, 1.5, 2}. In Figure 4(c), we can see that the accuracy of LPCT decreases with respect to γ for all ε, which aligns with Theorem 4. Since γ only influences the performance through the Q data, its effect is less apparent when the P data dominates, for instance when ε = 8.
6.2.2 Influence of sample sizes
Accuracy - n_P We conduct experiments to investigate the influence of the private sample size $n_P$. We fix n_Q = 50 and γ = 0.5 while varying n_P from 4,000 to 16,000. The results are presented in Figure 5(a). Additionally, we compare the performance of LPCT-P and LPCT-Q to analyze the contribution of both data sets, as shown in Figure 5(b). Moreover, we compare LPCT and LPCT-prune in Figure 5(c). We refrain from invoking $\tilde{\eta}^{\pi_p}_P$ even if the flag = 1 condition is triggered, since the termination induces sudden changes in the otherwise continuous curves and would render the findings less evident.
Figure 5: Analysis of accuracy with respect to n_P. (a) Accuracy versus n_P; (b) with LPCT-P and LPCT-Q; (c) with LPCT-prune.
Figure 6: Analysis of accuracy with respect to n_Q. (a) With LPCT-prune; (b) with LPCT-P and LPCT-Q.
In Figure 5(a), except for ε = 0.5, the accuracy improves as n_P increases. Moreover, the rate of improvement is more pronounced for higher values of ε. This observation aligns with the convergence rate established in Theorem 4. In Figure 5(b), LPCT always outperforms the methods using a single data set, i.e., LPCT-P and LPCT-Q. The improvement of LPCT over LPCT-Q is more significant for large ε. In Figure 5(c), we verify that LPCT-prune performs analogously to LPCT and that its performance improves as ε and n_P increase.
Accuracy - n_Q We conduct experiments to investigate the influence of the public sample size $n_Q$. We keep n_P = 10,000 and γ = 0.5 while varying n_Q from 10 to 200. The classification accuracy monotonically increases with respect to n_Q for all methods. In Figure 6(a), the performance of LPCT-prune gets closer to that of LPCT as n_Q grows. As explained in Section 5.2, the pruning is appropriately conducted when the Q data dominates, and is disturbed when the P data dominates. In Figure 6(b), it is worth noting that the performance improvement is more significant for small values of n_Q, specifically when n_Q ≤ 50. Moreover, the performance of LPCT-P also improves in this region and remains approximately unchanged for larger n_Q. This can be attributed to the information gained during partitioning, which is further validated in Figure 9.
6.2.3 Analysis of model parameters
Parameter analysis of depth We conduct experiments to investigate the choice of depth. Under (n_P, n_Q) = (10000, 50) and γ = 0.5, we plot the best performance of LPCT and LPCT-prune when fixing p and p_0 to certain values, respectively, in Figures 7(a) and 7(b). To avoid obscuring the empirical findings, we refrain from invoking $\tilde{\eta}^{\pi_p}_P$ even if flag = 1 is triggered, since the experiment is conducted with respect to p_0. In Figure 7(a), the accuracy first increases as p increases until it reaches a certain value. Then the accuracy begins to decrease as p increases. This confirms the trade-off observed in Theorem 4. Moreover, the p that maximizes the accuracy is increasing with respect to ε. This is compatible with the theory, since the optimal choice $p \asymp \log\bigl(n_P\varepsilon^2 + n_Q^{(2\alpha+2d)/(2\gamma\alpha+d)}\bigr)$ is monotonically increasing with respect to ε. As shown in Figure 7(b), the accuracy increases with p_0 at first, whereas the performance remains unchanged or decreases slightly for p_0 larger than the optimum.
Figure 7: Parameter analysis. (a) p of LPCT; (b) p_0 of LPCT-prune; (c) λ of LPCT.
Note that the axes of Figures 7(a) and 7(b) are aligned. LPCT-prune behaves analogously to LPCT in terms of both accuracy and the optimal depth choice.
Parameter analysis of λ We conduct experiments to investigate the choice of λ in terms of accuracy. Under (n_P, n_Q) = (10000, 1000) and γ = 5, we plot in Figure 7(c) the best performance of LPCT when fixing λ to certain values. In Figure 7(c), the relation between accuracy and λ is inverted U-shaped for ε = 2, 4, 8, while it is monotonically increasing for ε = 1 and monotonically decreasing for ε = 1000. The results indicate that a properly chosen λ is necessary, as stated in Theorem 4. Moreover, the λ that maximizes the accuracy is decreasing with respect to ε, which justifies relying more on public data under a stricter privacy constraint.
6.2.4 Benefits of public data
Range parameter We demonstrate the effectiveness of range estimation using public data when the domain is unknown. Consider the case where X = [0, a]^d and a is a random variable uniformly distributed in [1, 5]. We compare the following scenarios: (I) a is known; (II) a is estimated using public data, as shown in Theorem 5; (III) a is unknown. Case (I) is equivalent to X = [0, 1]^d.
Figure 8: Analysis of the range parameter. (a) Illustration of range estimation; (b) ε = 0.5; (c) ε = 8. In 8(a), we have a = 2.9; the black square is the range estimated by public data in case (II), and the grey squares represent the grid of a in case (III).
Figure 9: Illustration of partition with public data. (a) Illustration of the random partition and the max-edge partition; (b) comparison of LPCT and LPCT-R.
In case (II), we min-max scale the samples from the estimated range to [0, 1]^d. In case (III), we scale the samples from [0, a]^d to [0, 1]^d and select the best a among a ∈ {1, 1.5, ..., 5}. In Figure 8(a), we showcase the scaling schemes in (II) and (III). The range estimated from public data is much tighter than the best grid value of a. Given no prior knowledge, the a chosen in practice can perform significantly worse. In Figures 8(b) and 8(c), we observe that the performance with the estimated range matches that of case (I) as n_Q increases, whereas the unknown range significantly degrades performance. The performance gap between the known and estimated ranges is more substantial when n_Q is small. This observation is compatible with the conclusion drawn in Theorem 5, which emphasizes the need for sufficient public data to resolve the unknown range issue. A minimal sketch of range estimation with public data is given below.
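The following sketch, with variable names of our own, illustrates case (II): the range of each coordinate is estimated from the public samples only, and both data sets are then min-max scaled to [0, 1]^d with that estimate. Clipping points that fall outside the estimated range is our choice for this illustration.

    import numpy as np

    def estimate_range(X_pub):
        # Coordinate-wise range estimated from public data only,
        # so no privacy budget is spent on it.
        return X_pub.min(axis=0), X_pub.max(axis=0)

    def minmax_scale(X, lo, hi):
        # Map the estimated box to [0, 1]^d; out-of-range points are clipped.
        return np.clip((X - lo) / np.maximum(hi - lo, 1e-12), 0.0, 1.0)

    rng = np.random.default_rng(0)
    a = rng.uniform(1, 5)                      # unknown domain [0, a]^2
    X_Q = rng.uniform(0, a, size=(200, 2))     # public samples
    X_P = rng.uniform(0, a, size=(10_000, 2))  # private samples

    lo, hi = estimate_range(X_Q)
    X_P_scaled = minmax_scale(X_P, lo, hi)
    X_Q_scaled = minmax_scale(X_Q, lo, hi)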
Partition We demonstrate the effectiveness of partitioning with the information of public data by comparing LPCT with LPCT-R. In Figure 9(a), we show an example with p = 3. A random partition can turn out as in the top panel, where the upper-left and lower-right cells are hard to classify as 0 or 1. With sufficient public data, the max-edge partition is bound to create an informative partition such as the one in the bottom panel. Moreover, Figure 9(b) shows that LPCT outperforms LPCT-R.
7. Experiments on Real Data
7.1 Data Description
We conduct experiments on real-world data sets to show the superiority of LPCT. We utilize 11 real data sets that are available online. Each of these data sets contains sensitive information and necessitates privacy protection. Furthermore, these data sets cover a wide range of sample sizes and feature dimensions commonly encountered in real-world scenarios. When creating D_P and D_Q from the original data set D, some data sets have explicit criteria for splitting, reflecting real-world considerations; in these cases, we have P ≠ Q. For the others, the split is performed randomly and we have P = Q. A summary of key information for these data sets after pre-processing can be found in Table 1.
Table 1: Information of real data sets.

Dataset      P = Q        n_P        n_Q      d    Area
Anonymity      ✓          453         50     14    Social
Diabetes       ✓        2,054        200      8    Medical
Email          ✓        3,977        200  3,000    Text
Employ         ✓        2,225        200      8    Business
Rice           ✓          240      3,510      7    Image
Census         ✗       41,292      3,144     46    Social
Election       ✗        1,752        222      9    Social
Employee       ✗        3,319        504      9    Business
Jobs           ✗       46,218     14,318     11    Social
Landcover      ✗        7,097        131     28    Image
Taxi           ✗    3,297,639    117,367     93    Social
Following the analysis in Section 4.3, we min-max scale each data set to [0, 1]^d according to the public data. Detailed data information and pre-processing procedures are provided in Appendix C.1.
7.2 Competitors
We compare LPCT with several state-of-the-art competitors. For each model, we report the best result over its parameter grids based on an average of 20 replications. The tested methods are
LPCT is described in Section 6. In the real data experiments, we enlarge the grids to p ∈ {1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16} and λ ∈ {0.1, 0.5, 1, 2, 5, 10, 50, 100, 200, 300, 400, 500, 750, 1000, 1250, 1500, 2000} due to the large sample sizes. Since the max-edge rule is easily affected by useless or categorical features (Ma et al., 2023), we boost the performance of LPCT by incorporating the criterion reduction scheme from the original CART (Breiman, 1984) into the tree construction. Since Proposition 2 holds for any partition, the method is still ε-LDP. We choose the Gini index as the reduction criterion.
LPCT-prune is the pruned locally private classification tree in Algorithm 2. We also
adopt the criterion reduction scheme for tree construction.
CT is the original classification tree (Breiman, 1984) with the same grid as LPCT. We define two approaches, CT-W and CT-Q. For CT-W, we fit CT on the whole data, namely the P and Q data together, with no privacy protection; its accuracy serves as an upper bound. For CT-Q, we fit CT on the Q data only; its performance is taken into comparison. We use the implementation from sklearn (Pedregosa et al., 2011) with default parameters while varying max_depth in {1, 2, ..., 16} (see the sketch after this list).
PHIST is the locally differentially private histogram estimator proposed by Berrett and Butucea (2019). We fit PHIST on the private data alone as it does not leverage public data. We choose the number of splits $h_1$ along each axis in {1, 2, 3, 4, 5, 6}.
LPDT is the locally private decision tree originally proposed for regression (Ma et al.,
2023). The approach creates a partition on Q data and utilizes P data for estimation.
We adapt the strategy to classification. We select p in the same range as LPCT.
PGLM is the locally private generalized linear model proposed by Wang et al. (2023a) with the logistic link function. Note that their privacy result is under (ε, δ)-LDP. We let $\delta = 1/n_P^2$ for a fair comparison.
PSGD is the locally private stochastic gradient descent developed in Lowy et al. (2023). The presented algorithm is sequentially interactive, meaning that the privatized information from $X_i$ depends on the information from $X_{i-1}$. We adopt the warm start paradigm, where we first find a minimizer on the public data and then initialize the training process with that minimizer. Following Lowy et al. (2023), we choose the parameters of the privacy mechanism, i.e. PrivUnitG, by the procedure they provided. We let C = 1 and the learning rate η ∈ {10^{-5}, 10^{-4}, ..., 1}. We consider two models: PSGD-L is the linear classifier using the logistic loss; PSGD-N is the non-linear neural network classifier with one hidden layer and ReLU activation function. The number of neurons is selected from {d/2, d, 2d}.
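As a point of reference for the non-private baselines above, the following sketch fits the CT-Q baseline on the public data with sklearn. The tuning loop is a simplification of the protocol described in this section (a single held-out split rather than averaged replications), and the function and variable names are ours.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    def fit_ct_q(X_Q, Y_Q, depths=range(1, 17), seed=0):
        # CT-Q: a plain CART tree trained on the public data only,
        # with max_depth selected on a held-out validation split.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_Q, Y_Q, test_size=0.2, random_state=seed)
        best = None
        for depth in depths:
            clf = DecisionTreeClassifier(max_depth=depth, random_state=seed)
            clf.fit(X_tr, y_tr)
            acc = clf.score(X_val, y_val)
            if best is None or acc > best[0]:
                best = (acc, clf)
        return best[1]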
7.3 Experiment Results
We first illustrate the substantial utility gain brought by public data. We choose Diabetes and Employee, representing the cases of P = Q and P ≠ Q, respectively. On each data set, we randomly select a portion of the private data and 80 public samples to train three models: LPCT, LPDT, and PHIST. PHIST does not leverage any information from the public data, while LPDT benefits from public data only through the partition. LPCT utilizes public data in both partition and estimation. In Figure 10, the accuracy increases with the extent of public data utilization for both identically and non-identically distributed public data. Moreover, the accuracy gain is significant. For instance, in Figure 10(b), 80 public samples and 0.2 × 3319 ≈ 664 private samples achieve the same accuracy as using 0.6 × 3319 ≈ 1,991 private samples alone. The observation yields potential practical clues. For instance, when data-collecting resources are limited, organizations can put up rewards to get a small amount of public data rather than collecting a large amount of private data.
Next, we report the averaged accuracy for ε ∈ {0.5, 2, 8} on each real data set in Table 2. Here, CT-W is not private and serves as a reference.
Figure 10: Illustration of the utility gain brought by public data. (a) P = Q; (b) P ≠ Q.
To ensure significance, we adopt the Wilcoxon signed-rank test (Wilcoxon, 1992) with a significance level of 0.05 to check whether a result is significantly better than the others. For each data set, trials that exceed a memory limit of 100GB or a running time of one hour are terminated.
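As an illustration of the significance check described above, the following sketch compares two methods' per-replication accuracies with scipy. The array names, the synthetic numbers, and the one-sided alternative are our assumptions about how such a comparison would typically be set up.

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    acc_best = rng.normal(0.87, 0.01, size=20)   # accuracies of the best method
    acc_other = rng.normal(0.85, 0.01, size=20)  # accuracies of a competitor

    # Paired, one-sided Wilcoxon signed-rank test: is the best method
    # significantly better than the competitor at level 0.05?
    stat, p_value = wilcoxon(acc_best, acc_other, alternative="greater")
    significantly_better = p_value < 0.05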
From Table 2, we can see that LPCT-based methods generally outperform their competitors, both in terms of average performance (rank sum) and best performance (number of No. 1 results). Compared to LPCT, LPCT-prune performs moderately worse due to its data-driven nature. In practice, however, all other methods require parameter tuning, which consumes part of the privacy budget before the final estimation. Thus, their real performance is expected to drop, while that of LPCT-prune remains. Methods with the CART rule substantially outperform those with the max-edge rule, as the CART rule is robust to useless or categorical features and can fully employ the data information. We note that on some data sets the results of LPCT remain the same for all ε. This is the case where the public data dominates and the estimator is barely affected by the private data. Moreover, we observe that linear classifiers such as PSGD-L and PGLM perform poorly on some data sets, which is attributed to the non-linear underlying nature of these data sets.
We compare the efficiency of the methods by reporting the total running time in Table 3. PSGD-L and PSGD-N are generally slower and even take more than one hour for a single round on the Taxi data. Compared to these sequentially-interactive LDP methods, the non-interactive methods achieve acceptable running times. We argue that the performance gap between LPCT and CT is caused by the Cython acceleration in sklearn. Moreover, PHIST exceeds the memory limit on three data sets. Though histogram-based methods and tree-based methods possess the same storage complexity theoretically (Ma et al., 2023), LPCT avoids this memory blow-up when d is large. The results show that LPCT is an efficient method.
8. Conclusion and Discussions
In this work, we address the under-explored problem of public data assisted LDP learning by investigating non-parametric classification under the posterior drift assumption. We propose the locally private classification tree that achieves the mini-max optimal convergence rate. Moreover, we introduce a data-driven pruning procedure that avoids parameter tuning and produces an estimator with a fast convergence rate. Extensive experiments on synthetic and real data sets validate the superior performance of these methods. Both theoretical and experimental results suggest that public data is significantly more effective than data under privacy protection. This finding leads to practical suggestions. For instance, prioritizing non-private data collection should be considered. When data-collecting resources are limited, organizations can offer rewards to encourage users to share non-private data instead of collecting a large amount of privatized data.
We summarize the ways in which public data contributes to LPCT. The benefits are four-fold: (1) Elimination of the range parameter. Since the domain is unknown in the absence of the raw data, the convergence rate and performance of partition-based estimators are degraded by a range parameter. For LPCT, public data helps to eliminate the range parameter both theoretically and empirically, as demonstrated in Sections 4.3 and 6.2.4, respectively. (2) Implicit information in the partition. Through the partition created on public data, LPCT implicitly feeds information from the public data into the estimation. The performance gain is illustrated empirically in Section 6.2.4.
Table 2: Accuracy (×10^{-1}) over real data sets. The best results are marked with *. The results that are not significantly worse than the best result are bolded. Due to memory limitation, PHIST is corrupted on three data sets. PSGD-L and PSGD-N exceed the running time of 3,600 seconds on one data set and are terminated. These results are marked with -.

(a) ε = 0.5
Datasets    CT-W    LPCT            LPCT-prune      CT-Q    PHIST   LPDT    PGLM    PSGD-L  PSGD-N
                    Max.    CART    Max.    CART
Anonymity   8.346   8.346*  8.346*  8.309   8.309   8.346*  8.346*  7.546   5.899   7.790   8.123
Diabetes    9.964   7.067   7.590*  6.661   7.529   7.301   6.676   6.305   5.499   6.582   6.448
Email       9.261   7.332   7.487*  7.205   7.185   7.283   -       7.112   5.136   6.435   6.292
Employ      9.048   7.403   8.646*  6.684   6.362   8.577   5.704   5.644   4.874   5.742   5.595
Rice        8.992   9.039   9.183*  9.053   9.173   9.117   5.467   6.583   4.886   8.517   8.142
Census      8.685   8.539   8.743*  8.471   8.528   8.285   -       7.195   5.104   8.310   8.405
Election    6.408   5.610   5.846   5.259   5.773   5.958*  5.060   5.089   4.948   5.207   5.074
Employee    7.648   7.397*  7.131   7.381   7.040   7.151   6.660   6.098   5.280   6.493   6.387
Jobs        7.863   5.628   7.814*  5.561   7.793   7.599   5.327   5.441   4.942   5.530   5.732
Landcover   9.404   8.615   8.960*  8.299   8.719   6.670   8.362   8.366   4.955   8.368   8.349
Taxi        9.601   5.647   9.589*  5.637   9.589*  9.490   -       5.438   4.912   -       -
Rank Sum    -       36      18*     54      36      38      83      84      106     70      80
No. 1       -       2       10*     2       5       5       1       0       0       0       0

(b) ε = 2
Datasets    CT-W    LPCT            LPCT-prune      CT-Q    PHIST   LPDT    PGLM    PSGD-L  PSGD-N
                    Max.    CART    Max.    CART
Anonymity   8.346   8.346*  8.346*  8.086   8.168   8.346*  8.346*  8.346*  5.899   7.965   8.039
Diabetes    9.964   7.274   7.590*  6.661   7.412   7.301   6.676   6.676   5.499   6.653   6.674
Email       9.261   7.350   7.485*  7.224   7.459   7.283   -       7.181   5.136   7.136   7.203
Employ      9.048   7.702   9.121*  7.005   6.578   8.577   6.174   7.181   4.874   5.786   5.792
Rice        8.992   9.027   9.183*  9.053   9.183*  9.117   5.467   8.458   4.886   9.050   8.750
Census      8.685   8.537   8.743*  8.460   8.633   8.285   -       8.017   5.104   8.432   8.471
Election    6.408   5.648   5.850   5.327   5.595   5.958*  5.056   5.114   4.948   5.323   5.111
Employee    7.648   7.437*  7.175   7.362   7.175   7.151   6.472   6.951   5.280   6.400   6.383
Jobs        7.863   5.628   7.821*  5.597   7.778   7.599   5.333   5.591   4.942   6.788   5.934
Landcover   9.404   8.715   8.960*  8.623   8.734   6.670   8.367   8.493   4.955   8.553   8.369
Taxi        9.601   5.659   9.590*  5.643   9.580   9.490   -       5.536   4.912   -       -
Rank Sum    -       36      18*     54      36      38      83      84      106     70      80
No. 1       -       3       10*     1       5       4       1       1       0       1       0

(c) ε = 8
Datasets    CT-W    LPCT            LPCT-prune      CT-Q    PHIST   LPDT    PGLM    PSGD-L  PSGD-N
                    Max.    CART    Max.    CART
Anonymity   8.346   8.346*  8.346*  8.221   8.309   8.346*  8.346*  8.346*  5.395   8.338   8.154
Diabetes    9.964   7.503   7.930*  6.648   7.433   7.301   6.676   7.496   5.588   7.244   7.101
Email       9.261   7.355   7.520   7.251   7.491   7.283   -       7.326   5.271   8.446   7.060
Employ      9.048   7.619   9.068*  7.393   7.492   8.577   9.043   7.572   5.237   5.795   5.798
Rice        8.992   9.011   9.183*  9.053   9.137   9.117   8.858   9.017   6.283   9.017   8.875
Census      8.685   8.557   8.743*  8.404   8.671   8.285   -       8.527   5.389   8.530   8.154
Election    6.408   5.894   6.082*  5.321   5.563   5.958   5.015   5.892   5.462   5.765   5.705
Employee    7.648   7.441*  7.173   7.384   7.105   7.151   7.437   7.439   4.930   6.775   6.883
Jobs        7.863   5.650   7.861*  5.584   7.794   7.599   5.992   5.641   6.802   7.712   7.414
Landcover   9.404   8.737   8.963   8.620   8.745   6.670   8.367   8.695   5.551   9.137*  8.948
Taxi        9.601   5.666   9.610*  5.637   9.579   9.490   -       5.620   5.087   -       -
Rank Sum    -       42      18*     66      36      49      90      68      104     54      78
No. 1       -       2       8*      1       4       4       2       2       0       3       0
Table 3: Running time (s) on real data sets.

Datasets    CT-W    LPCT            LPCT-prune      CT-Q    PHIST   LPDT    PGLM    PSGD-L  PSGD-N
                    Max.    CART    Max.    CART
Anonymity   3e-3    1e-2    6e-3    9e-3    1e-2    2e-3    1e-2    2e-3    2e-3    8e-1    1e0
Diabetes    1e-2    1e-1    3e-2    3e-2    3e-2    3e-3    3e-2    6e-3    3e-3    3e0     6e0
Email       3e0     1e0     3e-2    2e-1    1e0     6e-2    -       1e0     4e0     8e0     3e3
Employ      5e-3    9e-2    2e-1    1e0     9e-1    3e-3    3e-2    4e-2    4e-3    5e0     7e0
Rice        1e-2    2e-2    1e-2    1e-2    2e-2    2e-2    5e-3    6e-3    8e-3    2e0     2e0
Census      6e-1    1e0     5e0     6e-1    9e-1    3e-1    -       2e-1    5e-2    2e1     3e1
Election    1e-2    6e-2    3e-2    3e-2    3e-2    3e-3    1e0     3e-2    3e-3    3e0     5e0
Employee    6e-3    8e-2    1e-1    4e-2    1e-1    3e-3    5e-2    5e-2    4e-3    7e0     1e2
Jobs        1e-1    1e0     1e0     6e-1    6e-1    3e-2    8e-1    5e-1    3e-2    1e2     1e2
Landcover   2e-1    1e-1    8e-2    9e-2    1e-1    5e-3    2e-1    9e-2    5e-3    1e1     2e1
Taxi        1e2     2e2     7e1     4e1     4e1     1e0     -       1e2     1e0     -       -
(3) Augmented estimation. In the weighted average estimation, the public data reduces the instability caused by perturbations and produces more accurate estimates. Theoretically, the convergence rate can be significantly improved. Empirically, even a small amount of public data improves the performance significantly (Sections 6 and 7). (4) Data-driven pruning. As shown in Section 5.2, we are able to derive the data-driven approach only with the existence of public data.
We discuss potential future work and improvements. The assumption that $P_X$ and $Q_X$ are identical can be relaxed. Similar to Ma et al. (2023), we only need the density ratio to be bounded. Moreover, the marginal densities need not be lower bounded if we add a parameter to guarantee that sufficient samples are contained in each cell (Ma et al., 2023). In practice, $X_P$ and $X_Q$ can be distinct (Yu et al., 2021b; Gu et al., 2023). This problem is related to domain adaptation (Redko et al., 2020) and can serve as a future work direction. Determining whether a data set satisfies Assumption 3 is of independent interest, as demonstrated in Gu et al. (2023); Yu et al. (2023).
Acknowledgments
We would like to thank the reviewers and action editor for their help and advice, which led
to a significant improvement of the article. The research is supported by the Special Funds
of the National Natural Science Foundation of China (Grant No. 72342010). Yuheng Ma
is supported by the Outstanding Innovative Talents Cultivation Funded Programs 2024 of
Renmin University of China. This research is also supported by Public Computing Cloud,
Renmin University of China.
Appendix A. Methodology
A.1 Estimator Summarization
Function                                   Meaning                                     Defined in
$\eta_P$, $\eta_Q$                         true regression functions of P and Q        (1)
$\eta^{\pi}_P$                             population tree estimation                  (3)
$\hat\eta^{\pi}_P$, $\hat\eta^{\pi}_Q$     non-private tree estimations of P and Q     (4)
$\tilde\eta^{\pi}_P$                       private tree estimation based on P          (7)
$\tilde\eta^{\pi}_{P,Q}$                   private tree estimation                     (8)
$\hat\eta^{\pi}_{P,Q}$                     non-private two-sample tree estimation      (9)
$\tilde\eta^{\pi_k}_{P,Q}$                 ancestor private tree estimation            (13)
$\tilde\eta^{\mathrm{prune}}_{P,Q}$        pruned private tree estimation              Algorithm 2
A.2 Derivation of the Closed-form Solution for Pruning Procedure
Recall that
$$
r^j_k := \sqrt{\frac{4C_1\Bigl(\frac{2^{p_0-k+3}}{\varepsilon^2}n_P \vee \sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\Bigr) + C_1\lambda^2\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i}{\Bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i + \lambda\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i\Bigr)^2}},
$$
where the sums over $j'$ range over the cells $A^{(p_0)}_{j'}$ sharing the level-$k$ ancestor of $A^{(p_0)}_j$. For each $\mathrm{ancestor}(A^{(p_0)}_j, k)$, define the quantity
$$
w^j_k = \frac{\sum_{j':\,\mathrm{ancestor}(A^{(p_0)}_{j'},k)=\mathrm{ancestor}(A^{(p_0)}_{j},k)}\ \sum_{i=1}^{n_P}\tilde U^{Pj'}_i}{\sum_{j':\,\mathrm{ancestor}(A^{(p_0)}_{j'},k)=\mathrm{ancestor}(A^{(p_0)}_{j},k)}\Bigl(\sum_{i=1}^{n_P}\tilde U^{Pj'}_i + \lambda\sum_{i=1}^{n_Q}U^{Qj'}_i\Bigr)}.
$$
Also, let $u_k = 4\bigl(\frac{2^{p_0-k+3}}{\varepsilon^2}n_P \vee \sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)$. When the signs of $\tilde\eta^{\pi_k}_P(x)-\frac12$ and $\hat\eta^{\pi_k}_Q(x)-\frac12$ are identical, the objective in (16) becomes
$$
\sqrt{C_1}\cdot\frac{\bigl|\tilde\eta^{\pi_k}_{P,Q}(x)-\frac12\bigr|}{r^j_k}
= \frac{w^j_k\bigl(\tilde\eta^{\pi_k}_P(x)-\frac12\bigr) + (1-w^j_k)\bigl(\hat\eta^{\pi_k}_Q(x)-\frac12\bigr)}{\sqrt{\dfrac{w^{j\,2}_k u_k}{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2} + \dfrac{(1-w^j_k)^2}{\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i}}}
\le \sqrt{\frac{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2}{u_k}\Bigl(\tilde\eta^{\pi_k}_P(x)-\frac12\Bigr)^2 + \sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i\cdot\Bigl(\hat\eta^{\pi_k}_Q(x)-\frac12\Bigr)^2},
$$
where the inequality follows from the Cauchy-Schwarz inequality. The maximum is attained when
$$
\frac{w^{j\,2}_k u_k}{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2}\cdot\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i\cdot\Bigl(\hat\eta^{\pi_k}_Q(x)-\frac12\Bigr)^2
= \frac{(1-w^j_k)^2}{\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i}\cdot\frac{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2}{u_k}\cdot\Bigl(\tilde\eta^{\pi_k}_P(x)-\frac12\Bigr)^2.
$$
By bringing in $w^j_k$, this leads to the solution
$$
\lambda = \frac{u_k}{\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i}\cdot\frac{\bigl|\hat\eta^{\pi_k}_Q(x)-\frac12\bigr|}{\bigl|\tilde\eta^{\pi_k}_P(x)-\frac12\bigr|}.
$$
When $\tilde\eta^{\pi_k}_P(x)-\frac12$ and $\hat\eta^{\pi_k}_Q(x)-\frac12$ have different signs, the objective in (16) is maximized at an extreme value of $\lambda$, i.e. when $\lambda = 0$ or $\lambda = \infty$. The maximum value is
$$
\sqrt{\max\Bigl(\frac{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2}{u_k}\Bigl(\tilde\eta^{\pi_k}_P(x)-\frac12\Bigr)^2,\ \sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i\cdot\Bigl(\hat\eta^{\pi_k}_Q(x)-\frac12\Bigr)^2\Bigr)\Big/\,C_1}.
$$
Appendix B. Proofs
B.1 Proof of Proposition 2
Proof of Proposition 2  Since all mechanisms $\mathcal{R}_i$ are completely the same, we use $\mathcal{R}$ for notational simplicity. For each conditional distribution $\mathcal{R}\bigl(\tilde U_i, \tilde V_i \mid X_i = x, Y_i = y\bigr)$, we can compute the density ratio as
$$
\sup_{x,x'\in\mathbb{R}^d,\ y,y'\in\{0,1\},\ S\in\sigma(\mathcal{S})}
\frac{\mathcal{R}\bigl((\tilde U,\tilde V)\in S \mid X_i = x, Y_i = y\bigr)}{\mathcal{R}\bigl((\tilde U,\tilde V)\in S \mid X_i = x', Y_i = y'\bigr)}
\le \sup_{x,x'\in\mathbb{R}^d}\frac{d\mathcal{R}(\tilde U\mid X_i = x)}{d\mathcal{R}(\tilde U\mid X_i = x')}\cdot
\sup_{x,x'\in\mathbb{R}^d,\ y,y'\in\{0,1\}}\frac{d\mathcal{R}(\tilde V\mid X_i = x, Y_i = y)}{d\mathcal{R}(\tilde V\mid X_i = x', Y_i = y')}. \tag{17}
$$
The first part is characterized by the Laplace mechanism. Since the conditional density is identical whenever $x$ and $x'$ belong to the same $A^{(p)}_j$, we have
$$
\sup_{x,x'\in\mathbb{R}^d}\frac{d\mathcal{R}(\tilde U\mid X_i = x)}{d\mathcal{R}(\tilde U\mid X_i = x')}
= \sup_{u,u'\in\{0,1\}^{|\mathcal{I}|}}\frac{d\mathcal{R}(\tilde U\mid U_i = u)}{d\mathcal{R}(\tilde U\mid U_i = u')}.
$$
The right-hand side can be computed as
$$
\frac{d\mathcal{R}(\tilde U\mid U_i = u)}{d\mathcal{R}(\tilde U\mid U_i = u')}
= \frac{\prod_{j}\exp\bigl(-\frac{\varepsilon}{4}|\tilde u^j_i - u^j|\bigr)}{\prod_{j}\exp\bigl(-\frac{\varepsilon}{4}|\tilde u^j_i - u'^{\,j}|\bigr)}
\le \prod_{j}\exp\Bigl(\frac{\varepsilon}{4}\bigl|u^j - u'^{\,j}\bigr|\Bigr).
$$
By definition, $u$ and $u'$ are one-hot vectors and differ in at most two entries. As a result,
$$
\sup_{x,x'\in\mathbb{R}^d}\frac{d\mathcal{R}(\tilde U\mid X_i = x)}{d\mathcal{R}(\tilde U\mid X_i = x')} \le \exp\Bigl(\frac{\varepsilon}{2}\Bigr). \tag{18}
$$
For the second part, note that $v$ and $v'$ also differ in at most two entries. Then, by analogous steps, we have
$$
\sup_{x,x'\in\mathbb{R}^d,\ y,y'\in\{0,1\}}\frac{d\mathcal{R}(\tilde V\mid X_i = x, Y_i = y)}{d\mathcal{R}(\tilde V\mid X_i = x', Y_i = y')} \le \exp\Bigl(\frac{\varepsilon}{2}\Bigr). \tag{19}
$$
Bringing (18) and (19) into (17) yields the desired conclusion.
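To make the mechanism analyzed above concrete, here is a minimal Python sketch of one data holder's report: the cell one-hot vector and the label-masked one-hot vector each receive independent Laplace noise with scale 4/ε, matching the ε/4 factor in the density ratios. The function and variable names are our own, and the cell index is assumed to be provided by the partition; this is a sketch, not the full implementation.

    import numpy as np

    def privatize(cell_index, y, num_cells, eps, rng):
        # One-hot membership vector U and label-masked vector V = y * U.
        u = np.zeros(num_cells)
        u[cell_index] = 1.0
        v = y * u
        # Independent Laplace(4 / eps) noise on every coordinate of U and V;
        # by (18) and (19), each half of the report consumes eps / 2,
        # so the pair (U_tilde, V_tilde) is eps-LDP.
        scale = 4.0 / eps
        u_tilde = u + rng.laplace(0.0, scale, size=num_cells)
        v_tilde = v + rng.laplace(0.0, scale, size=num_cells)
        return u_tilde, v_tilde

    rng = np.random.default_rng(0)
    u_tilde, v_tilde = privatize(cell_index=3, y=1, num_cells=8, eps=2.0, rng=rng)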
B.2 Proof of Theorem 3
Proof of Theorem 3  The proof is analogous to Cai and Wei (2021, Theorem 3.2). We first construct two classes of distributions $P_\sigma$ and $Q_\sigma$ for $\sigma\in\{-1,1\}^m$, and then combine the information inequality under the privacy constraint (Duchi et al., 2018) with Assouad's Lemma (Tsybakov, 2009). We first define the two classes of distributions by construction. Define the useful quantities
$$
r = c_r\Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-\frac{1}{2\alpha+2d}},\qquad
w = c_w r^d,\qquad
m = \bigl\lfloor c_m r^{\alpha\beta-d}\bigr\rfloor, \tag{20}
$$
where $c_r$, $c_w$, and $c_m$ are some universal constants. Let
$$
G = \bigl\{(6k_1 r, 6k_2 r, \dots, 6k_d r)\,:\,k_i = 1, 2, \dots, \lfloor(6r)^{-1}\rfloor,\ i = 1, 2, \dots, d\bigr\}
$$
be a grid of $\lfloor(6r)^{-1}\rfloor^d$ points in the unit cube. Denote by $x_1, x_2, \dots, x_{\lfloor(6r)^{-1}\rfloor^d}$ the points in $G$. Since $m \le \lfloor(6r)^{-1}\rfloor^d$, we are interested in the balls $B(x_k, 2r)$, $k = 1, 2, \dots, m$, with radius $2r$ centered at $x_k$. Let $B^c = [0,1]^d\setminus\bigcup_{k=1}^m B(x_k, 2r)$ denote the set of points that are not in any of the $m$ balls. Define the function $g(\cdot)$ on $[0,\infty)$:
$$
g(z) = \begin{cases} 1 & 0\le z<1\\ 2-z & 1\le z<2\\ 0 & z\ge2 \end{cases}
$$
and define
$$
h_P(z) = C_\alpha r^\alpha g^\alpha(z/r),\qquad
h_Q(z) = C_\gamma C_\alpha^\gamma r^{\alpha\gamma} g^{\alpha\gamma}(z/r),
$$
where $C_\alpha$, $C_\gamma$ are some universal constants. Define the hypercube $\mathcal{H}$ of pairs $(P_\sigma, Q_\sigma)$ by
$$
\mathcal{H} = \bigl\{(P_\sigma, Q_\sigma)\,:\,\sigma = (\sigma_1, \sigma_2, \dots, \sigma_m)\in\{-1,1\}^m\bigr\}
$$
with $P_\sigma$ and $Q_\sigma$ defined as follows. For the conditional distribution, we let
$$
\eta_P(x) = \begin{cases} \frac12\bigl(1 + \sigma_k h_P(\|x-x_k\|)\bigr) & \text{if } x\in B(x_k, 2r) \text{ for some } k = 1, 2, \dots, m\\ \frac12 & \text{if } x\in B^c, \end{cases}
$$
$$
\eta_Q(x) = \begin{cases} \frac12\bigl(1 + \sigma_k h_Q(\|x-x_k\|)\bigr) & \text{if } x\in B(x_k, 2r) \text{ for some } k = 1, 2, \dots, m\\ \frac12 & \text{if } x\in B^c. \end{cases}
$$
Let $P_{\sigma,X}$, $Q_{\sigma,X}$, $\sigma\in\{-1,1\}^m$, all have the same marginal distribution on $X$, with density $dP_X(x)$ defined as follows:
$$
dP_X(x) = \begin{cases} w\cdot r^{-d} & \text{if } x\in B(x_k, r) \text{ for some } k = 1, 2, \dots, m\\ c_{B^c}(1 - mw) & \text{if } x\in B^c\\ 0 & \text{otherwise}, \end{cases}
$$
where $c_{B^c}$ is a normalizing constant that makes $dP_X$ well-defined. Similar to Cai and Wei (2021), with a proper choice of all constants, $\mathcal{H}$ is a subset of $\mathcal{F}$ that satisfies Assumptions 1, 2, and 3. Then it suffices to show that
$$
\inf_{\hat f}\sup_{\mathcal{H}}\mathbb{E}_{X,Y}\Bigl[R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\Bigr] \gtrsim \Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-\frac{(\beta+1)\alpha}{2\alpha+2d}}.
$$
Let $TV(\cdot,\cdot)$ and $H(\cdot,\cdot)$ denote the total variation distance and the Hellinger distance between measures, respectively, and let $\mathrm{Ham}(\cdot,\cdot)$ denote the Hamming distance between ordered indices. Let $\sigma$ and $\sigma'$ be two indices in $\{-1,1\}^m$ such that $\mathrm{Ham}(\sigma,\sigma') = 1$, i.e. they differ at only one element, $\sigma_k$ versus $\sigma'_k$. We compute the total variation between the induced $P_\sigma$ and $P_{\sigma'}$. By Scheffe's theorem (Tsybakov, 2009), there holds
$$
TV(P_\sigma, P_{\sigma'}) = \frac12\int_{\mathcal{X}\times\mathcal{Y}}\bigl|dP_\sigma(x,y) - dP_{\sigma'}(x,y)\bigr|.
$$
Since the marginal distributions of both measures are identical, we have
$$
TV(P_\sigma, P_{\sigma'}) = \int_{\mathcal{X}}\bigl|\eta_{P_\sigma}(x) - \eta_{P_{\sigma'}}(x)\bigr|\,dP_{\sigma,X}(x)
= \int_{B(x_k,r)} C_\alpha r^\alpha\cdot w\cdot r^{-d}\,dx = C_\alpha w r^\alpha.
$$
According to Cai and Wei (2021), we have
$$
H^2(Q_\sigma, Q_{\sigma'}) = C_\gamma^2 C_\alpha^{2\gamma} w r^{2\alpha\gamma}. \tag{21}
$$
Let the conditional distribution $\mathcal{R} = \{\mathcal{R}_i\}_{i=1}^{n_P}$ be any transformation on $\mathcal{X}^{n_P}\times\mathcal{Y}^{n_P}$ that is sequentially-interactive $\varepsilon$-locally differentially private as defined in Definition 1. Denote by $\mathcal{R}P^{n_P}_\sigma$ the random variable obtained by mapping $P^{n_P}_\sigma$ compositionally through $\mathcal{R}$. Then, for $\mathrm{Ham}(\sigma,\sigma') = 1$, there holds
$$
H^2\bigl(\mathcal{R}P^{n_P}_\sigma\times Q^{n_Q}_\sigma,\ \mathcal{R}P^{n_P}_{\sigma'}\times Q^{n_Q}_{\sigma'}\bigr)
\le H^2\bigl(\mathcal{R}P^{n_P}_\sigma, \mathcal{R}P^{n_P}_{\sigma'}\bigr) + H^2\bigl(Q^{n_Q}_\sigma, Q^{n_Q}_{\sigma'}\bigr). \tag{22}
$$
For the first term in (22), combining Tsybakov (2009, Lemma 2.4) and Duchi et al. (2018, Corollary 3) yields
$$
H^2\bigl(\mathcal{R}P^{n_P}_\sigma, \mathcal{R}P^{n_P}_{\sigma'}\bigr) \le 4(e^{\varepsilon}-1)^2 n_P\,TV^2(P_\sigma, P_{\sigma'}) = 4C_\alpha^2(e^{\varepsilon}-1)^2 n_P w^2 r^{2\alpha}.
$$
This together with (20) leads to
$$
H^2\bigl(\mathcal{R}P^{n_P}_\sigma, \mathcal{R}P^{n_P}_{\sigma'}\bigr)
\le 4C_\alpha^2 c_w^2(e^{\varepsilon}-1)^2 n_P r^{2\alpha+2d}
= 4C_\alpha^2 c_w^2 c_r^{2\alpha+2d}(e^{\varepsilon}-1)^2 n_P\Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-1}
\le 4C_\alpha^2 c_w^2 c_r^{2\alpha+2d}. \tag{23}
$$
For the second term in (22), (21) implies
$$
H^2\bigl(Q^{n_Q}_\sigma, Q^{n_Q}_{\sigma'}\bigr) \le n_Q H^2(Q_\sigma, Q_{\sigma'}) = C_\gamma^2 C_\alpha^{2\gamma} n_Q w r^{2\alpha\gamma}.
$$
Again, bringing in (20) leads to
$$
H^2\bigl(Q^{n_Q}_\sigma, Q^{n_Q}_{\sigma'}\bigr) \le c_w C_\gamma^2 C_\alpha^{2\gamma} n_Q r^{2\alpha\gamma+d}
= C_\gamma^2 C_\alpha^{2\gamma} c_w n_Q\Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-\frac{2\gamma\alpha+d}{2\alpha+2d}}
\le C_\gamma^2 C_\alpha^{2\gamma} c_w. \tag{24}
$$
Bringing (23) and (24) into (22), we have
$$
H^2\bigl(\mathcal{R}P^{n_P}_\sigma\times Q^{n_Q}_\sigma,\ \mathcal{R}P^{n_P}_{\sigma'}\times Q^{n_Q}_{\sigma'}\bigr) \le \frac14
$$
as long as the constants are small enough such that $\max\bigl(4C_\alpha^2 c_w^2 c_r^{2\alpha+2d},\ C_\gamma^2 C_\alpha^{2\gamma} c_w\bigr) \le 1/8$. Let
$$
\hat\sigma = \mathop{\arg\min}_{\sigma\in\{-1,1\}^m}\ R_{P_\sigma}(\hat f) - R^*_{P_\sigma}.
$$
Applying Tsybakov (2009, Theorem 2.12), there holds
$$
\inf_{\hat\sigma}\max_{\sigma}\mathbb{E}_{X,Y\sim P_\sigma}\,\mathrm{Ham}(\hat\sigma, \sigma) \ge \frac{m}{2}\Bigl(1 - \frac{\sqrt2}{2}\Bigr). \tag{25}
$$
Next, we reduce the mini-max risk to this quantity. Since
$$
R_{P_\sigma}(f_{\hat\sigma}) - R^*_{P_\sigma}
\le \bigl(R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\bigr) + \bigl(R_{P_{\hat\sigma}}(\hat f) - R^*_{P_{\hat\sigma}}\bigr)
\le 2\bigl(R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\bigr),
$$
for any estimator $\hat f$ we can lower bound the excess risk as
$$
\mathbb{E}_{X,Y}\Bigl[R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\Bigr]
\ge \frac12\mathbb{E}_{X,Y}\Bigl[R_{P_\sigma}(f_{\hat\sigma}) - R^*_{P_\sigma}\Bigr]
= \frac12\mathbb{E}_{X,Y}\int_{\mathcal{X}}\Bigl|\eta_\sigma(x) - \frac12\Bigr|\,\mathbf{1}\bigl\{f^*_\sigma(x)\neq f_{\hat\sigma}(x)\bigr\}\,dP_{\sigma,X}(x)
= \frac14\mathbb{E}_{X,Y}\sum_{\sigma_k\neq\hat\sigma_k}\int_{B(x_k,r)} c_\alpha r^\alpha\,dP_{\sigma,X}(x)
= \frac{c_\alpha w r^\alpha}{4}\,\mathbb{E}_{X,Y}\bigl[\mathrm{Ham}(\sigma,\hat\sigma)\bigr].
$$
This together with (25) yields
$$
\inf_{\hat f}\max_{\mathcal{H}}\mathbb{E}_{X,Y}\Bigl[R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\Bigr]
\ge \frac18\Bigl(1 - \frac{\sqrt2}{2}\Bigr)c_\alpha w r^\alpha m
\gtrsim r^{(\beta+1)\alpha}
= \Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-\frac{(\beta+1)\alpha}{2\alpha+2d}}.
$$
B.3 Proof of Theorem 4
To prove Theorem 4, we rely on the following decomposition:
$$
\tilde\eta^{\pi}_{P,Q}(x) - \frac12
= \underbrace{\tilde\eta^{\pi}_{P,Q}(x) - \hat\eta^{\pi}_{P,Q}(x)}_{\text{Privatized Error}}
+ \underbrace{\hat\eta^{\pi}_{P,Q}(x) - \eta^{\pi}_{P,Q}(x)}_{\text{Sample Error}}
+ \underbrace{\eta^{\pi}_{P,Q}(x) - \frac12}_{\text{Signal Strength}} \tag{26}
$$
where we define
$$
\hat\eta^{\pi}_{P,Q}(x) = \sum_{j\in\mathcal{I}_p}\mathbf{1}\{x\in A^{(p)}_j\}\,
\frac{\sum_{i=1}^{n_P}V^{Pj}_i + \lambda\sum_{i=1}^{n_Q}V^{Qj}_i}{\sum_{i=1}^{n_P}U^{Pj}_i + \lambda\sum_{i=1}^{n_Q}U^{Qj}_i}, \tag{27}
$$
$$
\eta^{\pi}_{P,Q}(x) = \sum_{j\in\mathcal{I}_p}\mathbf{1}\{x\in A^{(p)}_j\}\,
\frac{n_P\int_{A^{(p)}_j}\eta_P(x')\,dP_X(x') + \lambda n_Q\int_{A^{(p)}_j}\eta_Q(x')\,dQ_X(x')}{n_P\int_{A^{(p)}_j}dP_X(x') + \lambda n_Q\int_{A^{(p)}_j}dQ_X(x')}. \tag{28}
$$
Correspondingly, we let
$$
\hat f^{\pi}_{P,Q}(x) = \mathbf{1}\Bigl\{\hat\eta^{\pi}_{P,Q}(x) > \frac12\Bigr\}
\qquad\text{and}\qquad
f^{\pi}_{P,Q}(x) = \mathbf{1}\Bigl\{\eta^{\pi}_{P,Q}(x) > \frac12\Bigr\}.
$$
Loosely speaking, the first error term quantifies the degradation brought by adding noise to the estimator, which we call the privatized error. The second term corresponds to the expected estimation error brought by the randomness of the data, which we call the sample error. The last term is called the signal strength, which quantifies the uncertainty of the population regression function. In general, we want the absolute value of the third term to be large (i.e., the population regression function is sure about the label), while the first and the second terms should be small (i.e., the noise is small). In the proof of Theorem 4, we specify a region $\Omega_n$ containing points with strong signal strength. We show that the first two terms are dominated by the signal strength on $\Omega_n$. The following three lemmas provide bounds for each of the three terms.
Lemma 8 (Bounding of Privatized Error) Let π be constructed through the max-edge partition in Section 3.2 with depth p. Let $\tilde\eta^{\pi}_{P,Q}$ be the locally private two-sample tree classifier in (8), and let $\hat\eta^{\pi}_{P,Q}$ be defined in (27). Then for $n_P^{-\alpha/(2\alpha+2d)}\lesssim\varepsilon\lesssim n_P^{d/(4\alpha+2d)}$, under the condition of Theorem 4, there holds
$$
\bigl\|\tilde\eta^{\pi}_{P,Q} - \hat\eta^{\pi}_{P,Q}\bigr\|_{\infty} \lesssim \frac{2^p\sqrt{n_P\log(n_P+n_Q)}}{\varepsilon(n_P+\lambda n_Q)}
$$
with probability $P^{n_P}\otimes Q^{n_Q}\otimes\mathcal{R}$ at least $1 - 3/(n_P+n_Q)^2$.
Lemma 9 (Bounding of Sample Error) Let π be constructed through the max-edge partition in Section 3.2 with depth p. Let $\hat\eta^{\pi}_{P,Q}$ and $\eta^{\pi}_{P,Q}$ be defined in (27) and (28). Then for $n_P^{-\alpha/(2\alpha+2d)}\lesssim\varepsilon\lesssim n_P^{\alpha/(4\alpha+2d)}$, under the condition of Theorem 4, there holds
$$
\bigl\|\hat\eta^{\pi}_{P,Q} - \eta^{\pi}_{P,Q}\bigr\|_{\infty} \lesssim \sqrt{\frac{2^p(n_P+\lambda^2 n_Q)\log(n_P+n_Q)}{(n_P+\lambda n_Q)^2}}
$$
with probability $P^{n_P}\otimes Q^{n_Q}\otimes\mathcal{R}$ at least $1 - 3/(n_P+n_Q)^2$.
Lemma 10 (Bounding of Signal Strength) Let π be constructed through the max-edge partition in Section 3.2 with depth p. Define the region
$$
\Omega_n := \bigl\{x\in\mathcal{X}\,:\,|\eta_P(x) - 1/2| \ge c\cdot 2^{-p\alpha/d}\bigr\} \tag{29}
$$
for some constant $c$ large enough. Let $\eta^{\pi}_{P,Q}$ be defined in (28). Then under the condition of Theorem 4, there holds
$$
\Bigl(\eta^{\pi}_{P,Q}(x) - \frac12\Bigr)\cdot\Bigl(\eta_P(x) - \frac12\Bigr) > 0,
\qquad\text{and}\qquad
\Bigl|\eta^{\pi}_{P,Q}(x) - \frac12\Bigr| \gtrsim \frac{n_P\,2^{-p\alpha/d} + \lambda n_Q\,2^{-p\gamma\alpha/d}}{n_P+\lambda n_Q}
$$
for all $x\in\Omega_n$.
We also need the following technical lemma.
Lemma 11 Suppose n
α
2α+2d
P
. ε . n
α
4α+2d
P
. If we let
λ δ
(γ1)α
d
+(
(γ+1)α
d
1)0
n
P
ε
2
log(n
P
+ n
Q
)
+
n
Q
log(n
P
+ n
Q
)
2α+2d
2γα+d
!
(γαα)(2γαd)
2α+2d
,
then the following statements hold:
ε
2
(n
P
+ λ
2
n
Q
)
log(n
P
+ n
Q
)
& δ
2α+2d
d
,
n
P
+ λ
2
n
Q
log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
. δ,
p
n
P
ε
2
log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
. δ.
All proofs of the above lemmas are provided in Section B.4.
Proof of Theorem 4 Remind the definition in (29). Theorem 4 takes 2
p
δ
1
and
λ δ
(γ1)α/d
. We prove the theorem by separately considering the region
n
:=
x X
|η
P
(x) 1/2| c
· δ
α/d
for constant c
such that
n
n
. In this case, suppose we have 2
p
= C
p
δ. Then it
suffices to take c
= 2C
α/d
p
· c
.
For x
n
, we show that
˜
f
π
P,Q
(x) = f
P
(x). Without loss of generality, let f
P
(x) = 1.
By Lemma 8, 9, and 10 we have
˜
f
π
P,Q
(x)
1
2
η
π
P,Q
(x)
1
2
kη
π
P,Q
bη
π
P,Q
k
kbη
π
P,Q
˜η
π
P,Q
k
&
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
2
p
·
n
P
ε(n
P
+ λn
Q
)
s
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
(30)
with probability 1 3/(n
P
+ n
Q
)
2
with respect to P
n
P
Q
n
Q
R. We first compare the
signal strength with the privatized error. From the conditions in Theorem 4, we have the
fact that λ 2
(γ1)pα/d
.
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
·
2
p
·
n
P
ε(n
P
+ λn
Q
)
1
2
pα/d
· 2
p
· ε ·
n
P
+ λ
2
n
Q
p
n
P
log(n
P
+ n
Q
)
.
By bringing in the value of δ and applying Lemma 11 (the first conclusion) , there holds
2
pα/d
· 2
p
· ε ·
n
P
+ λ
2
n
Q
p
n
P
log(n
P
+ n
Q
)
&C
α+d
d
p
· δ
α+d
d
· n
1
2
P
· ε
1
·
q
log(n
P
+ n
Q
)
&C
α+d
d
p
·
n
P
ε
2
log(n
P
+ n
Q
)
1
2
· n
1
2
P
· ε
1
·
q
log(n
P
+ n
Q
)
=C
α+d
d
p
.
Thus, for sufficiently large C
p
, we have
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
2
2
p
·
n
P
ε(n
P
+ λn
Q
)
. (31)
Analogously, we compare the signal strength with the sample error, which is
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
·
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
1
2
=2
pα/d
· δ
1
2
·
q
n
P
+ λ
2
n
Q
· log
1
2
(n
P
+ n
Q
).
By bringing in the value of δ and applying Lemma 11 (the first conclusion) , there holds
2
pα/d
· δ
1
2
·
q
n
P
+ λ
2
n
Q
· log
1
2
(n
P
+ n
Q
) & δ
1
2
· ε
1
& 1.
Thus, for sufficiently large n
P
and n
Q
, we have
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
2
s
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
. (32)
Combined with (31) and (32), (30) yields ˜η
π
P,Q
(x)
1
2
0, which indicates that
˜
f
π
P,Q
(x) =
f
P
(x) for all x
n
.
For
c
n
, by Assumption 2, we have
Z
c
n
η
P
(x)
1
2
1
n
˜
f
π
P,Q
(x) 6= f
P
(x)
o
dP
X
(x) c
δ
α/d
Z
c
n
dP
X
(x) c
1+β
C
β
δ
(β+1)α/d
.
Together, there holds
R
P
(
˜
f
π
P,Q
) R
P
=
Z
X
η
P
(x)
1
2
1
n
˜
f
π
P,Q
(x) 6= f
P
(x)
o
dP
X
(x)
=
Z
c
n
η
P
(x)
1
2
dP
X
(x) . δ
(β+1)α/d
.
B.4 Proofs of Lemmas in Section B.3
To prove the lemmas in Section B.3, we first present several technical results. Their proofs can be found in Section B.5.
Lemma 12 Suppose $\xi_i$, $i = 1,\dots,n$, are independent sub-exponential random variables with parameters $(a,b)$ (Wainwright, 2019, Definition 2.9). Then there holds
$$
\mathbb{P}\Biggl[\Bigl|\frac1n\sum_{i=1}^n\xi_i - \mathbb{E}\Bigl(\frac1n\sum_{i=1}^n\xi_i\Bigr)\Bigr| \ge t\Biggr] \le 2e^{-\frac{nt^2}{2a^2}}
$$
for any $t>0$. Moreover, a standard Laplace random variable is sub-exponential with parameters $(\sqrt2, 1)$. A union bound argument yields that
$$
\mathbb{P}\Biggl[\Bigl|\frac1n\sum_{i=1}^n\xi^j_i - \mathbb{E}\Bigl(\frac1n\sum_{i=1}^n\xi^j_i\Bigr)\Bigr| \ge t\ \text{for some } j = 1,\dots,2^p\Biggr] \le 2e^{p\log 2 - \frac{nt^2}{2a^2}}.
$$
Lemma 13 Let π be a partition generated from Algorithm 1 with depth p. Suppose Assump-
tion 1 holds. Suppose D
P
= {(X
P
1
, Y
P
1
), ··· , (X
P
n
P
, Y
P
n
P
)} and D
Q
= {(X
Q
1
, Y
Q
1
), ··· , (X
Q
n
Q
, Y
Q
n
Q
)}
are drawn i.i.d. from P and Q, respectively. Let 0 < λ . (n
P
+ n
Q
)/ log(n
P
+ n
Q
). Then
for all A
j
(p)
in π
p
, there holds
1
n
P
+ λn
Q
n
P
X
i=1
1{X
P
i
A
j
(p)
} + λ
n
Q
X
i=1
1{X
Q
i
A
j
(p)
}
!
Z
A
j
(p)
dP
X
(x
0
)
.
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
with probability P
n
P
Q
n
Q
at least 1 1/(n
P
+ n
Q
)
2
.
Lemma 14 Let π be a partition generated from Algorithm 1 with depth p. Suppose Assump-
tion 1 holds. Suppose D
P
= {(X
P
1
, Y
P
1
), ··· , (X
P
n
P
, Y
P
n
P
)} and D
Q
= {(X
Q
1
, Y
Q
1
), ··· , (X
Q
n
Q
, Y
Q
n
Q
)}
are drawn i.i.d. from P and Q, respectively. Let 0 < λ . (n
P
+ n
Q
)/ log(n
P
+ n
Q
). Then
for all A
j
(p)
in π
p
, there holds
1
n
P
+ λn
Q
n
P
X
i=1
Y
P
i
1{X
P
i
A
j
(p)
} + λ
n
Q
X
i=1
Y
Q
i
1{X
Q
i
A
j
(p)
}
!
n
P
Z
A
j
(p)
η
P
(x
0
)dP
X
(x
0
) + λ n
Q
Z
A
j
(p)
η
Q
(x
0
)dQ
X
(x
0
)
!
.
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
with probability P
n
P
Q
n
Q
at least 1 1/(n
P
+ n
Q
)
2
.
Lemma 15 Let $\pi_i := \bigl\{A^{(i)}_j,\ j\in\{1,\dots,2^i\}\bigr\}$ be the partition generated by Algorithm 1 with depth i. Then for any $A^{(i)}_j$, there holds
$$
2^{-1}\sqrt{d}\cdot 2^{-i/d} \le \operatorname{diam}\bigl(A^{(i)}_j\bigr) \le 2\sqrt{d}\cdot 2^{-i/d}.
$$
Proof of Lemma 8 We intend to bound
k˜η
π
P,Q
bη
π
P,Q
k
= max
j∈I
P
n
P
i=1
˜
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
˜
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
For each j I and any x A
j
(p)
, the error can be decomposed as
P
n
P
i=1
˜
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
˜
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
P
n
P
i=1
˜
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
+
P
n
P
i=1
˜
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
n
P
X
i=1
˜
V
Pj
i
+ λ
n
Q
X
i=1
V
Qj
i
4
ε
P
n
P
i=1
ζ
j
i
P
n
P
i=1
˜
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
| {z }
(I)
+
4
ε
P
n
P
i=1
ξ
j
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
| {z }
(II)
.
For the first term in (I), we can apply Lemma 12 and 14 to get
n
P
X
i=1
˜
V
Pj
i
+ λ
n
Q
X
i=1
V
Qj
i
n
P
X
i=1
V
Pj
i
+ λ
n
Q
X
i=1
V
Qj
i
+
4
ε
n
P
X
i=1
ξ
j
i
.(n
P
+ λn
Q
)
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
+ 2
p
+
p
n
P
ε
2
log(n
P
+ n
Q
)
n
P
+ λn
Q
!
with probability at least 1 2/(n
P
+ n
Q
)
2
, where we used the fact that
Z
A
j
(p)
dP
X
(x) 2
p
,
Z
A
j
(p)
η
P
(x)dP
X
(x) 2
p
,
Z
A
j
(p)
η
Q
(x)dQ
X
(x) 2
p
Using Lemma 11 (the second and the third conclusions), we know that the second term is
the dominating term, indicating that
n
P
X
i=1
˜
V
Pj
i
+ λ
n
Q
X
i=1
V
Qj
i
. (n
P
+ λn
Q
) · 2
p
.
Analogously, by Lemma 13, there holds
n
P
X
i=1
U
Pj
i
+ λ
n
Q
X
i=1
U
Qj
i
&(n
P
+ λn
Q
)
2
p
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
!
(n
P
+ λn
Q
) · 2
p
(33)
as well as
n
P
X
i=1
˜
U
Pj
i
+ λ
n
Q
X
i=1
U
Qj
i
n
P
X
i=1
U
Pj
i
+ λ
n
Q
X
i=1
U
Qj
i
4
ε
n
P
X
i=1
ζ
j
i
&(n
P
+ λn
Q
)
2
p
p
n
P
ε
2
log(n
P
+ n
Q
)
n
P
+ λn
Q
!
(n
P
+ λn
Q
) · 2
p
.
The last step is by Lemma 11 (the second and the third conclusions), knowing that 2
p
is
the dominating term. Combining these three pieces, there holds
(I) .
q
n
P
ε
2
log(n
P
+ n
Q
) ·
(n
P
+ λn
Q
) · 2
p
1
2
p
·
p
n
P
log(n
P
+ n
Q
)
ε(n
P
+ λn
Q
)
. (34)
For (II), (33) and Lemma 12 together yields
(II) .
2
p
·
p
n
P
log(n
P
+ n
Q
)
ε(n
P
+ λn
Q
)
. (35)
(34) and (35) together lead to the desired conclusion.
Proof of Lemma 9 Let η
P,Q
(x) =
n
P
η
P
(x) + λn
Q
η
Q
(x)
/(n
P
+ λn
Q
). We intend to
bound
kbη
π
P,Q
η
π
P,Q
k
= max
j∈I
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
(n
P
+ λn
Q
)
R
A
j
(p)
dP
X
(x
0
)
For each j I and any x A
j
(p)
, the error can be decomposed as
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
(n
P
+ λn
Q
)
R
A
j
(p)
dP
X
(x
0
)
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
·
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
dP
X
(x
0
)
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
R
A
j
(p)
dP
X
(x
0
)
| {z }
(I)
+
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
|
{z }
(II)
We bound two terms separately. For the first term in (I), Assumption 1 yield
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
) . 2
p
For the second term, (33) and Lemma 13 yields
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
dP
X
(x
0
)
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
R
A
j
(p)
dP
X
(x
0
)
.(n
P
+ λn
Q
) ·
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
·
(n
P
+ λn
Q
) · 2
2p
1
2
p
·
s
2
p
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
with probability 1 1/(n
P
+ n
Q
)
2
. Together, these two pieces yield
(I) .
s
2
p
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
. (36)
Similarly for (II), we have
(II) .
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
· 2
p
s
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
. (37)
(36) and (37) together yield the desired conclusion.
Proof of Lemma 10 Without loss of generality, consider f
P
(x) = 1 for any x
n
. By
Assumption 3, we have
η
π
P,Q
(x)
n
P
R
A
j
(p)
η
P
(x
0
)dP
X
(x
0
) + λn
Q
R
A
j
(p)
1
2
+ C
γ
η
P
(x
0
)
1
2
γ
dQ
X
(x
0
)
n
P
R
A
j
(p)
dP
X
(x
0
) + λn
Q
R
A
j
(p)
dQ
X
(x
0
)
(38)
If Assumption 1 holds, for any x
0
A
j
(p)
, we have
η
P
(x
0
)
1
2
η
P
(x)
1
2
|η
P
(x) η
P
(x
0
)|
η
P
(x)
1
2
c
L
kx x
0
k
α
η
P
(x)
1
2
2c
L
d2
pα/d
(c
2c
L
d)2
pα/d
where the second last and the last inequalities come from Lemma 15 and the definition of
n
, respectively. For sufficiently large c
, we have η
P
(x
0
) 1/2 + 2
pα/d
for all x
0
A
j
(p)
,
and thus (38) yields
η
π
P,Q
(x)
1
2
+
n
P
R
A
j
(p)
2
pα/d
dP
X
(x
0
) + λn
Q
R
A
j
(p)
C
γ
2
α/d
dQ
X
(x
0
)
n
P
R
A
j
(p)
dP
X
(x
0
) + λn
Q
R
A
j
(p)
dQ
X
(x
0
)
1
2
+
c
2
· (1 C
γ
)
c
2
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
which yields the desired result. Note that the factor c/c comes from the lower bound of
ratio
R
A
j
(p)
dQ
X
(x
0
)/
R
A
j
(p)
dP
X
(x
0
).
Proof of Lemma 11 We prove the statement by discussing four cases depending on the
value of n
P
, n
Q
, and γ, α, d.
Case (i):
n
P
ε
2
log(n
P
+n
Q
)
n
Q
log(n
P
+n
Q
)
2α+2d
2γα+d
and
(γ+1)α
d
1. We prove the three conclu-
sions by turn. The first conclusion follows from
n
P
+ λ
2
n
Q
δ
(2α+2d)/d
n
P
· δ
2α+2d
d
&n
P
·
n
P
ε
2
log(n
P
+ n
Q
)
2α+2d
2α+2d
&
log(n
P
+ n
Q
)
ε
2
.
For the second conclusion, let n
Q
= n
a
P
for some constant a. Note that we have
2γα+d
2α+2d
a > 0. By replacing all terms with equivalent order of n
P
, it suffices to prove
n
P
+ n
a
2(γ1)α
2α+2d
P
· ε
2(γ1)α
α+d
. n
4α+3d
2α+2d
P
· ε
d
α+d
+ n
1+a
(γ1)α+d
2α+2d
P
· ε
(γ1)α+d
α+d
+ n
2a
2(γ1)α+d
2α+2d
P
· ε
2(γ1)α+d
α+d
The left-hand side is dominated respectively by n
P
. n
4α+3d
2α+2d
P
· ε
d
α+d
and
n
a
2(γ1)α
2α+2d
P
· ε
2(γ1)α
α+d
. n
a
2(γ1)α
2α+2d
+
1
2
+
γα
2α+2d
P
· ε
(γ1)α+d
α+d
= n
1+a
(γ1)α+d
2α+2d
P
· ε
(γ1)α+d
α+d
.
Here, the inequality is drawn by
n
1
2
P
· ε
dα
α+d
n
1
2
(1
dα
α+d
)
P
, n
γα
2α+2d
P
· ε
γα
α+d
& 1.
Note that we omit the log factor since the polynomial part of the left-hand side is strictly
dominated by the right-hand side, yielding that the conclusion holds with sufficiently large
n
P
and n
Q
. For the last statement, the same procedure yields that it suffices to show
n
1
2
P
.
n
2α+d
2α+2d
P
+ n
a
(γ1)α+d
2α+2d
P
· ε
α
α+d
,
which holds since n
α
2α+2d
P
· ε
α
α+d
& n
d
α+d
·
α
2α+2d
P
.
Case (ii):
n
P
ε
2
log(n
P
+n
Q
)
n
Q
log(n
P
+n
Q
)
2α+2d
2γα+d
and
(γ+1)α
d
< 1. Compared to case (i), the
value is λ is multiplied by a factor δ
(
(γ+1)α
d
1)0
> 1. Thus, the first conclusion holds since
the value of the left-hand side increases. The third conclusion holds since the value of the
left-hand side decreases. For the second conclusion, by replacing all terms with equivalent
order of n
P
, it suffices to prove
n
P
+ n
a
4γα2d
2α+2d
P
· ε
4γα2d
α+d
.n
4α+3d
2α+2d
P
· ε
d
α+d
+ n
1+a
2γα
2α+2d
P
· ε
2γα
α+d
+ n
2a
4γαd
2α+2d
P
· ε
4γαd
α+d
.
For the left-hand side, the first term is dominated as n
P
. n
4α+3d
2α+2d
P
·ε
d
α+d
. Since
2γα+2d
2α+2d
< 1,
we have n
P
·
n
P
· ε
2
2γα2d
2α+2d
& n
P
. Thus, the second term of left-hand side is controlled by
n
a
4γα2d
2α+2d
P
· ε
4γα2d
α+d
. n
1+a
2γα
2α+2d
P
· ε
2γα
α+d
.
Case (iii):
n
P
ε
2
log(n
P
+n
Q
)
<
n
Q
log(n
P
+n
Q
)
2α+2d
2γα+d
and
(γ+1)α
d
1. We now have a >
2γα+d
2α+2d
.
There holds
n
P
+ λ
2
n
Q
δ
(2α+2d)/d
n
Q
· δ
4γα2d+2α+2d
d
n
Q
·
n
Q
log(n
P
+ n
Q
)
2α+2d
2γα+d
!
2γα
2α+2d
=
n
Q
log(n
P
+ n
Q
)
d
2γα+d
· ε
2
·
log(n
P
+ n
Q
)
ε
2
&
log(n
P
+ n
Q
)
ε
2
for sufficiently large n
P
+ n
Q
, which implies the first conclusion. For the second conclusion,
we want
n
P
+ n
a
2α+d
2γα+d
P
. n
a
2γα
2γα+d
P
+ n
1+a
(γ+1)α+d
2γα+d
P
+ n
a
2(γ+1)α+d
2γα+d
P
.
The left-hand side is dominated by the term n
a
2α+d
2γα+d
P
. Thus, the conclusion holds since there
holds a
2α+d
2γα+d
< a
2(γ+1)α+d
2γα+d
for all a, γ, α, and d. For the last statement, the same procedure
yields that it suffices to show
n
1
2
P
.
n
1a
d
2γα+d
P
+ n
aa
2γα
2γα+d
P
· ε.
Thus, the conclusion holds since the following conditions hold:
1
2
1
d
2α + 2d
< 1 a
d
2γα + d
, and n
α
2α+2d
P
· ε & 1.
Case (iv):
n
P
ε
2
log(n
P
+n
Q
)
<
n
Q
log(n
P
+n
Q
)
2α+2d
2γα+d
and
(γ+1)α
d
< 1. Compared to case (iii), the
value is λ is multiplied by a factor δ
(
(γ+1)α
d
1)0
> 1. Thus, the first conclusion holds since
the value of the left-hand side increases. The third conclusion holds since the value of the
left-hand side decreases. For the second conclusion, by replacing all terms with equivalent
order of n
P
, it suffices to prove
n
P
+ n
a
2γα+3d
2γα+d
P
. n
a
2γα
2γα+d
P
+ n
1+aa
2γα
2γα+d
P
+ n
2aa
4γαd
2γα+d
P
.
We have n
P
. n
1+
ad
2γα+d
P
= n
1+aa
2γα
2γα+d
P
and n
a
2γα+3d
2γα+d
P
. n
3ad
2γα+d
P
= n
2aa
4γαd
2γα+d
P
.
B.5 Proofs of Results in Section B.4
Proof of Lemma 12 The conclusion follows from (2.18) in Wainwright (2019).
In the subsequent proofs, we define the empirical measure $\mathrm{D} := \frac1n\sum_{i=1}^n\delta_{(X_i,Y_i)}$ given samples $D = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$, where $\delta$ is the Dirac measure. To conduct our analysis, we need to introduce the following fundamental description of the covering number, which enables an approximation of an infinite set by finite subsets.
Definition 16 (Covering Number) Let $(\mathcal{X}, d)$ be a metric space and $A\subset\mathcal{X}$. For $\varepsilon>0$, the $\varepsilon$-covering number of $A$ is denoted as
$$
\mathcal{N}(A, d, \varepsilon) := \min\Bigl\{n\ge1\,:\,\exists\,x_1,\dots,x_n\in\mathcal{X}\ \text{such that}\ A\subset\bigcup_{i=1}^n B(x_i,\varepsilon)\Bigr\},
$$
where $B(x,\varepsilon) := \{x'\in\mathcal{X}\,:\,d(x,x')\le\varepsilon\}$.
To prove Lemmas 13 and 14, we need the following fundamental lemma concerning the VC dimension of the random partitions in Section 3.2. To this end, let $p\in\mathbb{N}$ be fixed, let $\pi_p$ be a partition of $\mathcal{X}$ with the number of splits $p$, and let $\pi_{(p)}$ denote the collection of all such partitions $\pi_p$.
Lemma 17 Let $\tilde{\mathcal{A}}$ be the collection of all cells $\times_{i=1}^d[a_i,b_i]$ in $\mathbb{R}^d$. For all $0<\varepsilon<1$, there exists a universal constant $C$ such that
$$
\mathcal{N}\bigl(\mathbf{1}_{\tilde{\mathcal{A}}},\ \|\cdot\|_{L_1(Q)},\ \varepsilon\bigr) \le C(2d+1)(4e)^{2d+1}(1/\varepsilon)^{2d}.
$$
Proof of Lemma 17 The result follows directly from Theorem 9.2 in Kosorok (2008).
Before we proceed, we state the well-known Bernstein inequality, which will be used frequently in the proofs.
Lemma 18 (Bernstein's inequality) Let $B>0$ and $\sigma>0$ be real numbers, and $n\ge1$ be an integer. Furthermore, let $\xi_1,\dots,\xi_n$ be independent random variables satisfying $\mathbb{E}_P\xi_i = 0$, $\|\xi_i\|_\infty\le B$, and $\mathbb{E}_P\sum_{i=1}^n\xi_i^2\le\sigma^2$ for all $i = 1,\dots,n$. Then for all $\tau>0$, we have
$$
\mathbb{P}\Biggl(\frac1n\sum_{i=1}^n\xi_i \ge \sqrt{\frac{2\sigma^2\tau}{n^2}} + \frac{2B\tau}{3n}\Biggr) \le e^{-\tau}.
$$
Proof of Lemma 13 Since P and Q share the same marginal distribution, it is equivalent
to prove
1
n
P
+ n
Q
n
P
X
i=1
1{X
i
A
j
(p)
} + λ
n
Q
X
i=1
1{X
i
A
j
(p)
}
!
n
P
+ λn
Q
n
P
+ n
Q
Z
A
j
(p)
dP
X
(x
0
)
.
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ n
Q
)
2
holds with probability P
n
P
+n
Q
at least 1 1/(n
P
+ n
Q
)
2
for X
i
P
X
. Let
˜
A be the
collection of all cells ×
d
i=1
[a
i
, b
i
] in R
d
. Applying Lemma 17 with (D
X
+ P
X
)/2, there exists
an ε-net {
˜
A
k
}
K
k=1
˜
A with
K C(2d + 1)(4e)
2d+1
(1)
2d
(39)
such that for any j I
p
, there exist some k {1, . . . , K} such that
k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
((D
X
+P
X
)/2)
ε,
Since
k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
((D
X
+P
X
)/2)
=1/2 · k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(D
X
)
+ 1/2 · k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(P
X
)
,
we get
k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(D
X
)
2ε, k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(P
X
)
2ε. (40)
Consequently, by the definition of the covering number and the triangle inequality, for any
j I
p
, there holds
1
n
P
+ n
Q
n
P
X
i=1
1{x A
j
(p)
}(X
i
) + λ
n
Q
X
i=1
1{x A
j
(p)
}(X
i
)
!
Z
˜
A
j
(p)
dP
X
(x
0
)
1
n
P
+ n
Q
n
P
X
i=1
1{x
˜
A
k
}(X
i
) + λ
n
Q
X
i=1
1{x
˜
A
k
}(X
i
)
!
Z
˜
A
k
dP
X
(x
0
)
+ k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(D
X
)
+ k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(P
X
)
1
n
P
+ n
Q
n
P
X
i=1
1{x
˜
A
k
}(X
i
) + λ
n
Q
X
i=1
1{x
˜
A
k
}(X
i
)
!
Z
˜
A
k
dP
X
(x
0
)
+ 4ε.
Therefore, we get
sup
j∈I
p
1
n
P
+ n
Q
n
P
X
i=1
1{x A
j
(p)
}(X
i
) + λ
n
Q
X
i=1
1{x A
j
(p)
}(X
i
)
!
Z
˜
A
j
(p)
dP
X
(x
0
)
(41)
sup
1kK
1
n
P
+ n
Q
n
P
X
i=1
1{x
˜
A
k
}(X
i
) + λ
n
Q
X
i=1
1{x
˜
A
k
}(X
i
)
!
Z
˜
A
k
dP
X
(x
0
)
+ 4ε.
(42)
For any fixed 1 k K, let the random variable ξ
i
be defined by ξ
i
:= 1{X
i
˜
A
k
}
R
˜
A
k
dP
X
(x
0
) for i = 1, ··· , n
P
and ξ
i
:= λ(1{X
i
˜
A
k
}
R
˜
A
k
dP
X
(x
0
)) for i = n
P
+
1, ··· , n
P
+ n
Q
. Then we have E
P
X
ξ
i
= 0, kξk
1 λ, and E
P
X
P
n
P
+n
Q
i=1
ξ
2
i
(n
P
+
λ
2
n
Q
)
R
˜
A
k
dP
X
(x
0
). According to Assumption 1, there holds E
P
X
P
n
P
+n
Q
i=1
ξ
2
i
(n
P
+λ
2
n
Q
)·
c · 2
p
. Applying Bernstein’s inequality in Lemma 18, we obtain
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
n
P
+ λn
Q
n
P
+ n
Q
Z
˜
A
k
dP
X
(x
0
)
s
c 2
1p
τ(n
P
+ λ
2
n
Q
)
(n
P
+ n
Q
)
2
+
2τ · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 12e
τ
. Then the union bound together with the covering
number estimate (39) implies that
sup
1kK
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
n
P
+ λn
Q
n
P
+ n
Q
Z
˜
A
k
dP
X
(x
0
)
s
c · 2
1p
(n
P
+ λ
2
n
Q
)(τ + log(2K))
(n
P
+ n
Q
)
2
+
2(τ + log(2K)) · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 1 e
τ
. Let τ = 2 log(n
P
+ n
Q
) and ε = 1/(n
P
+ n
Q
).
Then for any n
P
+ n
Q
> N
1
:= (2C) (2d + 1) (4e), we have τ + log(2K) = 2 log n +
log(2C) + log(2d + 1) + (2d + 1) log(4e) + 2d log n (4d + 5) log(n
P
+ n
Q
). Therefore, we
have
sup
1kK
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
n
P
+ λn
Q
n
P
+ n
Q
Z
˜
A
k
dP
X
(x
0
)
s
c · 2
1p
(n
P
+ λ
2
n
Q
)(4d + 5) log(n
P
+ n
Q
)
(n
P
+ n
Q
)
2
+
2(4d + 5) · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 1 1/(n
P
+ n
Q
)
2
. This together with (41) yields that
sup
j∈I
p
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{x A
j
(p)
}
n
P
+ λn
Q
n
P
+ n
Q
Z
˜
A
j
(p)
dP
X
(x
0
)
s
c · 2
1p
(4d + 5)(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
(n
P
+ n
Q
)
2
+
2(4d + 5) · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
+
4
n
P
+ n
Q
.
We now compare to see which term is dominating. Apply Lemma 11 (the first conclusion),
we have
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(1 λ
2
) · log
2
(n
P
+ n
Q
)
&
δ
2α+d
d
ε
2
· (1 λ
2
)
.
When γ
d
α
1 and γ
d
2α
,
δ
2α+d
d
ε
2
· (1 λ
2
)
= ε
2
· δ
2α+d
d
(γ1)α
d
> ε
2
· δ
2α+3d
2d
ε
2
· (n
P
· ε
2
)
2α+3d
4α+4d
=(n
P
· ε
4α+2d
2α+3d
)
2α+3d
4α+4d
& 1.
When γ
d
α
1 and γ >
d
2α
,
δ
2α+d
d
ε
2
· (1 λ
2
)
= ε
2
· δ
2α+d
d
ε
2
· (n
P
· ε
2
)
2α+d
2α+2d
= (n
P
· ε
2d
2α+d
)
2α+d
2α+2d
& 1.
When γ >
d
α
1 and γ 1, the derivation is the same as the last situation. When γ >
d
α
1
and γ < 1,
δ
2α+d
d
ε
2
· (1 λ
2
)
= ε
2
· δ
2α+d
d
2γαd
d
ε
2
· (n
P
· ε
2
)
2α
d
= (n
P
· ε
4α2d
2α
)
2α
d
& 1.
These derivations imply that the first term is dominating the upper bound, which implies
the desired result.
Proof of Lemma 14 For notation simplicity, let (X
i
, Y
i
) P for i = 1, ··· , n
P
and
(X
i
, Y
i
) Q for i = n
P
+ 1, ··· , n
P
+ n
Q
. Let
˜
Y
i
= Y
i
for i = 1, ··· , n
P
and
˜
Y
i
= λ · Y
i
for i = n
P
+ 1, ··· , n
P
+ n
Q
. Let η
P,Q
(x) =
n
P
η
P
(x) + λn
Q
η
Q
(x)
/(n
P
+ λn
Q
). Let
˜
A
be the collection of all cells ×
d
i=1
[a
i
, b
i
] in R
d
. Then there exists an ε-net {
˜
A
k
}
K
k=1
˜
A
with K bounded by (39) such that for any j I
p
, (40) holds for some k {1, . . . , K}.
Consequently, by the definition of the covering number and the triangle inequality, for any
j I
p
, there holds
n
P
+n
Q
X
i=1
1{X
i
A
j
(p)
}
˜
Y
i
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
˜
Y
i
Z
˜
A
k
η
P,Q
(x
0
)dP
X
(x
0
)
+
Z
R
d
1{x
0
A
j
(p)
} 1{x
0
˜
A
k
}
η
P,Q
(x
0
)
dP
X
(x
0
) +
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
} 1{X
i
A
j
(p)
}
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
˜
Y
i
Z
˜
A
k
η
P,Q
(x
0
)dP
X
(x
0
)
+ max
1in
·k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(D
X
)
+ k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(P
X
)
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
˜
Y
i
Z
˜
A
k
η
P,Q
(x
0
)dP
X
(x
0
)
+ 4ε. (43)
where the last inequality follow from the condition Y = {0, 1}.
For any fixed 1 k K, let the random variable
˜
ξ
i
be defined by
˜
ξ
i
:= 1{X
i
˜
A
k
}
˜
Y
i
R
˜
A
k
η
P
(x
0
) dP
X
(x
0
) for i = 1, ··· , n
P
and
˜
ξ
i
:= 1{X
i
˜
A
k
}
˜
Y
i
λ
R
˜
A
k
η
Q
(x
0
) dP
X
(x
0
)
for i = n
P
+ 1, ··· , n
P
+ n
Q
. Then we have E
˜
ξ
i
= 0, kξk
1 λ, and E
P
n
P
+n
Q
i=1
˜
ξ
2
i
(n
P
+ λ
2
n
Q
)
R
˜
A
k
dP
X
(x
0
). According to Assumption 1, there holds E
P
n
P
+n
Q
i=1
˜
ξ
2
i
c ·(n
P
+
λ
2
n
Q
) · 2
p
. Applying Bernstein’s inequality in Lemma 18, we obtain
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
A
j
(p)
}
˜
Y
i
n
P
+ λn
Q
n
P
+ n
Q
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
s
c · 2
1p
τ(n
P
+ λ
2
n
Q
)
(n
P
+ n
Q
)
2
+
2τ · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 1 2e
τ
. Similar to the proof of Lemma 13, one can show
that for any n N
1
, there holds
sup
1kK
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
A
j
(p)
}
˜
Y
i
n
P
+ λn
Q
n
P
+ n
Q
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
s
c · 2
1p
τ(n
P
+ λ
2
n
Q
)
(n
P
+ n
Q
)
2
+
2τ · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 1 1/(n
P
+ n
Q
)
2
. This together with (43) yields that
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
A
j
(p)
}
˜
Y
i
n
P
+ λn
Q
n
P
+ n
Q
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
s
c · 2
1p
(4d + 5)(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
(n
P
+ n
Q
)
2
+
2(4d + 5) · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
+
4
n
P
+ n
Q
.
This implies the desired result by analogous derivations as in the last part of the proof of
Lemma 13.
Proof of Lemma 15 According to the max-edge partition rule, when the depth $i$ of the tree is a multiple of the dimension $d$, each cell of the tree partition is a high-dimensional cube with side length $2^{-i/d}$. On the other hand, when the depth $i$ of the tree is not a multiple of the dimension $d$, we consider the max-edge tree partitions with depths $\lfloor i/d \rfloor$ and $\lceil i/d \rceil$, whose corresponding cube side lengths are $2^{-\lfloor i/d \rfloor}$ and $2^{-\lceil i/d \rceil}$, respectively. Note that in the splitting procedure of the max-edge partition, the side length of each sub-rectangle decreases monotonically as the depth $i$ increases, so the side length of a random tree partition cell lies between $2^{-\lceil i/d \rceil}$ and $2^{-\lfloor i/d \rfloor}$. This implies that
$$
\sqrt{d} \cdot 2^{-\lceil i/d \rceil} \le \mathrm{diam}\bigl(A_j^{(i)}\bigr) \le \sqrt{d} \cdot 2^{-\lfloor i/d \rfloor}.
$$
Since $i/d - 1 \le \lfloor i/d \rfloor \le \lceil i/d \rceil \le i/d + 1$, we immediately get
$$
2^{-1}\sqrt{d} \cdot 2^{-i/d} \le \mathrm{diam}\bigl(A_j^{(i)}\bigr) \le 2\sqrt{d} \cdot 2^{-i/d}.
$$
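As a quick numerical check (illustrative only), take $d = 3$ and $i = 7$, so $\lfloor i/d \rfloor = 2$ and $\lceil i/d \rceil = 3$. The first display gives
$$
\sqrt{3} \cdot 2^{-3} \approx 0.217 \le \mathrm{diam}\bigl(A_j^{(7)}\bigr) \le \sqrt{3} \cdot 2^{-2} \approx 0.433,
$$
which indeed lies between $2^{-1}\sqrt{3} \cdot 2^{-7/3} \approx 0.172$ and $2\sqrt{3} \cdot 2^{-7/3} \approx 0.687$, as asserted by the second display.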
B.6 Proof of Theorem 5
Proof of Theorem 5 We let $\tilde{f}^{\pi,M}_{P,Q} := (\tilde{f}^{\pi}_{P,Q} \cdot \mathbf{1}_{[0,1]^d}) \circ M$ and $X \sim P_X$. The excess risk can be decomposed as
$$
\begin{aligned}
\mathcal{R}_P\bigl(\tilde{f}^{\pi,M}_{P,Q}\bigr) - \mathcal{R}_P^* &= \left( \int_{[0,1]^d} L\bigl(x, y, \tilde{f}^{\pi,M}_{P,Q}(x)\bigr)\, dP(x,y) - \int_{[0,1]^d} L\bigl(x, y, f_P^*(x)\bigr)\, dP(x,y) \right) \\
&\quad + \left( \int_{([0,1]^d)^c} L\bigl(x, y, \tilde{f}^{\pi,M}_{P,Q}(x)\bigr)\, dP(x,y) - \int_{([0,1]^d)^c} L\bigl(x, y, f_P^*(x)\bigr)\, dP(x,y) \right) \\
&\lesssim \delta^{\frac{\alpha(\beta+1)}{d}} + P\Bigl(M(X) \notin \times_{j=1}^{d} [0,1]\Bigr),
\end{aligned}
$$
where δ is defined in Theorem 4. Note that
$$
P\Bigl(M(X) \notin \times_{j=1}^{d} [0,1]\Bigr) = P\Bigl(X \notin \times_{j=1}^{d} [\widehat{a}_j, \widehat{b}_j]\Bigr) \le \sum_{j=1}^{d} P\bigl(X_j \notin [\widehat{a}_j, \widehat{b}_j]\bigr).
$$
Denote the CDF of $X_j$ by $F_j$. Since
$$
P\left( F_j(\widehat{a}_j) - F_j(a_j) > \frac{2 \log n_Q}{n_Q} \right) = P\left( F_j(\widehat{a}_j) > \frac{2 \log n_Q}{n_Q} \right) = \left( 1 - \frac{2 \log n_Q}{n_Q} \right)^{n_Q} \le \frac{1}{n_Q^2}
$$
and a similar result holds for $\widehat{b}_j$, we have $P\bigl(X_j \notin [\widehat{a}_j, \widehat{b}_j]\bigr) \lesssim \frac{\log n_Q}{n_Q}$ with probability $1 - 1/n_Q^2$. Thus, there holds $P\bigl(X \notin \times_{j=1}^{d} [\widehat{a}_j, \widehat{b}_j]\bigr) \lesssim \frac{\log n_Q}{n_Q}$ with probability $1 - d/n_Q^2$.
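The middle equality in the display above is a standard order-statistic computation; spelled out (a sketch, assuming as the display suggests that $\widehat{a}_j$ is the minimum of the $j$-th coordinates $X_{1,j}, \ldots, X_{n_Q,j}$ of the public samples, with $X_{i,j}$ our notation for this sketch and $F_j$ continuous):
$$
P\Bigl( F_j(\widehat{a}_j) > \tfrac{2\log n_Q}{n_Q} \Bigr) = \prod_{i=1}^{n_Q} P\Bigl( F_j(X_{i,j}) > \tfrac{2\log n_Q}{n_Q} \Bigr) = \Bigl( 1 - \tfrac{2\log n_Q}{n_Q} \Bigr)^{n_Q} \le \exp(-2\log n_Q) = \frac{1}{n_Q^2},
$$
where the last inequality uses $1 - t \le e^{-t}$.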
B.7 Proof of Proposition 6
Proof of Proposition 6 We prove the conclusion for $k = p_0$ first. When $k = p_0$ and $x \in A_j^{(k)}$, the term $\tilde{\eta}^{\pi_k}_{P,Q}(x) - \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{P,Q}(x)\bigr]$ can be written as
$$
\frac{1}{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i + \lambda \sum_{i=1}^{n_Q} U^{Qj}_i} \cdot \Biggl( \overbrace{\sum_{i=1}^{n_P} \Bigl( V^{Pj}_i - \mathbb{E}_{Y^P|X^P}\bigl[V^{Pj}_i\bigr] \Bigr)}^{(\mathrm{I})} + \overbrace{\lambda \sum_{i=1}^{n_Q} \Bigl( V^{Qj}_i - \mathbb{E}_{Y^Q|X^Q}\bigl[V^{Qj}_i\bigr] \Bigr)}^{(\mathrm{II})} + \underbrace{\frac{4}{\varepsilon} \sum_{i=1}^{n_P} \xi^{j}_i - \frac{\sum_{i=1}^{n_P} \mathbb{E}_{Y^P|X^P}\bigl[V^{Pj}_i\bigr] + \lambda \sum_{i=1}^{n_Q} \mathbb{E}_{Y^Q|X^Q}\bigl[V^{Qj}_i\bigr]}{\sum_{i=1}^{n_P} U^{Pj}_i + \lambda \sum_{i=1}^{n_Q} U^{Qj}_i} \cdot \frac{4}{\varepsilon} \sum_{i=1}^{n_P} \zeta^{j}_i}_{(\mathrm{III})} \Biggr).
$$
We bound the three parts separately. Since $\mathbb{E}_{Y^P|X^P}[V^{Pj}_i] \le U^{Pj}_i$, there holds
$$
\frac{\sum_{i=1}^{n_P} \mathbb{E}_{Y^P|X^P}\bigl[V^{Pj}_i\bigr] + \lambda \sum_{i=1}^{n_Q} \mathbb{E}_{Y^Q|X^Q}\bigl[V^{Qj}_i\bigr]}{\sum_{i=1}^{n_P} U^{Pj}_i + \lambda \sum_{i=1}^{n_Q} U^{Qj}_i} \le 1.
$$
Applying Lemma 12, (III) is bounded by
$$
|(\mathrm{III})| \le 2\sqrt{6\varepsilon^{-2} n_P \log(n_P + n_Q)} \qquad (44)
$$
with probability $\mathrm{R}$ at least $1 - 1/(n_P + n_Q)^2$ for all $j$. The proofs for (I) and (II) are analogous to the proof of Lemma 14, which utilizes Bernstein's inequality with the VC dimension of possible squares on $\mathbb{R}^d$. (I) is an independent sum of Bernoulli random variables with at most $\sum_{i=1}^{n_P} U^{Pj}_i$ non-zero elements. Thus, by Bernstein's inequality (Lemma 18), we have
$$
|(\mathrm{I})| \le \sqrt{\frac{\sum_{i=1}^{n_P} U^{Pj}_i \cdot \tau}{2}}
$$
with probability $P^{n_P}_{Y^P|X^P}$ at least $1 - e^{-\tau}$. The order of $p_0$ yields that $p_0 \lesssim \log(n_P \varepsilon^2 + n_Q)$. Since there are at most $2^{p_0+1}$ cells in the union of all $\pi_k$, there holds
$$
|(\mathrm{I})| \le \sqrt{\frac{\sum_{i=1}^{n_P} U^{Pj}_i \cdot \bigl(\tau + (p_0 + 1)\log 2\bigr)}{2}} \le \sqrt{\sum_{i=1}^{n_P} U^{Pj}_i \bigl(2\log(n_P + n_Q) + \log \varepsilon\bigr)} \le \sqrt{3\sum_{i=1}^{n_P} U^{Pj}_i \log(n_P + n_Q)} \qquad (45)
$$
with probability $P^{n_P}_{Y^P|X^P}$ at least $1 - 1/(n_P + n_Q)^2$. By Lemma 12, for all $j$, with probability $\mathrm{R}$ at least $1 - 1/(n_P + n_Q)^2$, we have
$$
\Bigl| \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i - \sum_{i=1}^{n_P} U^{Pj}_i \Bigr| \le 2\sqrt{3\varepsilon^{-2} n_P \log(n_P + n_Q)}.
$$
If $\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge 8\varepsilon^{-2} n_P$, there holds
$$
\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge 8\varepsilon^{-2} n_P \ge 2\sqrt{3\varepsilon^{-2} n_P \log(n_P + n_Q)} + 6\varepsilon^{-2} n_P,
$$
as well as
$$
\frac{4}{3}\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge \sum_{i=1}^{n_P} U^{Pj}_i - 2\sqrt{3\varepsilon^{-2} n_P \log(n_P + n_Q)} + \frac{1}{3}\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge \sum_{i=1}^{n_P} U^{Pj}_i - \sqrt{3\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)} + \frac{1}{3}\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge \sum_{i=1}^{n_P} U^{Pj}_i
$$
for sufficiently large $n_P$. In this case, we have
$$
|(\mathrm{I})| + |(\mathrm{III})| \le \sqrt{3\sum_{i=1}^{n_P} U^{Pj}_i \log(n_P + n_Q)} + 2\sqrt{6\varepsilon^{-2} n_P \log(n_P + n_Q)} \le 2\sqrt{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)} + 2\sqrt{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)} = 4\sqrt{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)}. \qquad (46)
$$
If $\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i < 8\varepsilon^{-2} n_P$, there holds
$$
\sum_{i=1}^{n_P} U^{Pj}_i \le \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i + 2\sqrt{3\varepsilon^{-2} n_P \log(n_P + n_Q)} \le 8\varepsilon^{-2} n_P + \frac{8}{3}\varepsilon^{-2} n_P
$$
for sufficiently large $n_P$. In this case, we get
$$
|(\mathrm{I})| + |(\mathrm{III})| \le \sqrt{3\sum_{i=1}^{n_P} U^{Pj}_i \log(n_P + n_Q)} + 2\sqrt{6\varepsilon^{-2} n_P \log(n_P + n_Q)} \le 2\sqrt{8\varepsilon^{-2} n_P \log(n_P + n_Q)} + 2\sqrt{8\varepsilon^{-2} n_P \log(n_P + n_Q)} = 4\sqrt{8\varepsilon^{-2} n_P \log(n_P + n_Q)}. \qquad (47)
$$
Combining (46) and (47), we get
$$
|(\mathrm{I})| + |(\mathrm{III})| \le 4\sqrt{\Bigl( 8\varepsilon^{-2} n_P \vee \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \Bigr) \log(n_P + n_Q)}. \qquad (48)
$$
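To see how the two cases merge into this single bound, note that (46) and (47) give, respectively,
$$
|(\mathrm{I})| + |(\mathrm{III})| \le \begin{cases} 4\sqrt{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)}, & \text{if } \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge 8\varepsilon^{-2} n_P, \\ 4\sqrt{8\varepsilon^{-2} n_P \log(n_P + n_Q)}, & \text{if } \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i < 8\varepsilon^{-2} n_P, \end{cases}
$$
and in either case the right-hand side is at most $4\sqrt{(8\varepsilon^{-2} n_P \vee \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i)\log(n_P + n_Q)}$, which is exactly (48).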
Similarly to (45), we have
$$
|(\mathrm{II})| \le \lambda\sqrt{\frac{\sum_{i=1}^{n_Q} U^{Qj}_i \cdot \bigl(\tau + (p_0 + 1)\log 2\bigr)}{2}} \le \lambda\sqrt{\sum_{i=1}^{n_Q} U^{Qj}_i \bigl(2\log(n_P + n_Q) + \log \varepsilon\bigr)} \le \lambda\sqrt{3\sum_{i=1}^{n_Q} U^{Qj}_i \log(n_P + n_Q)} \qquad (49)
$$
with probability $Q^{n_Q}_{Y^Q|X^Q}$ at least $1 - 1/(n_P + n_Q)^2$. Combining (48) and (49), we get
$$
|(\mathrm{I}) + (\mathrm{II}) + (\mathrm{III})| \le \Biggl( 4\sqrt{8\varepsilon^{-2} n_P \vee \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i} + \lambda\sqrt{3\sum_{i=1}^{n_Q} U^{Qj}_i} \Biggr) \cdot \sqrt{\log(n_P + n_Q)} \le \sqrt{4C_1 \Bigl( 8\varepsilon^{-2} n_P \vee \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \Bigr) + C_1 \lambda^2 \sum_{i=1}^{n_Q} U^{Qj}_i}
$$
with probability $P^{n_P}_{Y^P|X^P} \otimes Q^{n_Q}_{Y^Q|X^Q} \otimes \mathrm{R}$ at least $1 - 3/(n_P + n_Q)^2$. Here, we define the constant $C_1$ for notational simplicity:
$$
C_1 = 8\log(n_P + n_Q).
$$
We now prove the conclusion for $k < p_0$ with a slight modification of the above argument. For the term (III), summing over all nodes with the same ancestor yields the sum of $2^{p_0 - k} \cdot n_P$ Laplace random variables, for which we have
$$
|(\mathrm{I}) + (\mathrm{II}) + (\mathrm{III})| \le \sqrt{4C_1 \Bigl( 2^{p_0 - k + 3}\varepsilon^{-2} n_P \vee \sum_{j'} \sum_{i=1}^{n_P} \tilde{U}^{Pj'}_i \Bigr) + C_1 \lambda^2 \sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}.
$$
The proof is completed by plugging this bound into the decomposition at the beginning.
B.8 Proof of Theorem 7
In this proof, the depths $p_0$, $p$, and $k$ are all integers. Without loss of generality, we simply write $p_0$, $p$, and $k$ as equal to the corresponding orders, which only causes differences in the constants.
Proof of Theorem 7 In the following proof, we give several versions of upper bounds $b$ on $\bigl|\tilde{\eta}^{\mathrm{prune}}_{P,Q}(x) - \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\mathrm{prune}}_{P,Q}(x)\bigr]\bigr|$ for each fixed $x \in \Omega_n$. Then, we consider the region $\Omega_n$, which is
$$
\Omega_n := \Bigl\{ x \in \mathcal{X} : \bigl|\eta_P(x) - \tfrac{1}{2}\bigr| \ge c^* \cdot b \Bigr\}.
$$
For some $c^*$, by techniques analogous to the proof of Theorem 4, we can show that every $x$ in $\Omega_n$ is assigned the same label as by the Bayes classifier. For each fixed $x$, let $A_{j_k}$ denote the partition cell in $\pi_k$ that contains $x$. We will discuss the stopping time of the pruning process and the resulting point-wise error. We prove the conclusion in two cases respectively.
Case (i): $\frac{n_P \varepsilon^2}{\log(n_P + n_Q)} \lesssim \Bigl( \frac{n_Q}{\log(n_P + n_Q)} \Bigr)^{\frac{2\alpha + 2d}{2\gamma\alpha + d}}$. In this case, $\delta_1 \lesssim n_Q^{-\frac{d}{2\gamma\alpha + d}}$ and $p_0 = \bigl\lfloor \frac{d}{2 + 2d} \log_2\bigl( n_P \varepsilon^2 + n_Q^{\frac{2+2d}{d}} \bigr) \bigr\rfloor$. Without loss of generality, we let $\eta_P(x) - \frac{1}{2} \ge 0$. Following the analysis in Section A.2, the final statistics will satisfy
$$
\Bigl| \tilde{\eta}^{\pi_k}_{P,Q}(x) - \tfrac{1}{2} \Bigr| \ge r_{j_k} = \sqrt{ \Bigl( \sum_{j'} \sum_{i=1}^{n_P} \tilde{U}^{Pj'}_i \Bigr)^2 u_k \Bigl( \tilde{\eta}^{\pi_k}_{P}(x) - \tfrac{1}{2} \Bigr)^2 + \sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i \Bigl( \widehat{\eta}^{\pi_k}_{Q}(x) - \tfrac{1}{2} \Bigr)^2 } \ge \sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \widehat{\eta}^{\pi_k}_{Q}(x) - \tfrac{1}{2} \Bigr|.
$$
By Proposition 6, there exists a constant $C_p$ such that
$$
\Bigl| \widehat{\eta}^{\pi_k}_{Q}(x) - \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] \Bigr| \le C_p \sqrt{\frac{\log(n_P + n_Q)}{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}} \qquad (50)
$$
with probability $P^{n_P}_{Y^P|X^P} \otimes Q^{n_Q}_{Y^Q|X^Q} \otimes \mathrm{R}$ at least $1 - 3/(n_P + n_Q)^2$. Consider the region
$$
\Omega_n := \Bigl\{ x \in \mathcal{X} : \bigl|\eta_P(x) - \tfrac{1}{2}\bigr| \ge C^* \cdot \delta^{\alpha/d} \Bigr\}.
$$
Let $2^{-k} = \delta$. For each $x \in \Omega_n$, by Lemma 10, there exist constants $C_{p1}, C_{p2}$ such that
$$
\min_{x \in A_{j_k}} \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| \ge C_p^{-1} \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| \ge C_{p1}\bigl( C^* - C_{p2} \bigr) \delta^{\gamma\alpha/d}. \qquad (51)
$$
Also, by Lemma 13, there exists a constant $C_{p3}$ such that
$$
\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i \ge C_{p3}\, \delta \cdot n_Q \qquad (52)
$$
with probability $P^{n_P}_{X^P} \otimes Q^{n_Q}_{X^Q}$ at least $1 - 1/(n_P + n_Q)^2$. (51) and (52) together lead to
$$
\sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| \ge C_{p1} C_{p3}^{1/2} \bigl( C^* - C_{p2} \bigr) \sqrt{n_Q} \cdot \delta^{\frac{2\gamma\alpha + d}{2d}} \ge 2^{-\frac{2\gamma\alpha + d}{2d}} C_{p1} C_{p3}^{1/2} \bigl( C^* - C_{p2} \bigr) \sqrt{\log(n_P + n_Q)}. \qquad (53)
$$
For a sufficiently large constant $C^*$, (53) yields that the choice of $2^{-k} = \delta$ leads to
$$
\sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| \ge 3C_p \sqrt{\log(n_P + n_Q)}. \qquad (54)
$$
Combining (50) and (54), there holds
$$
\sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \widehat{\eta}^{\pi_k}_{Q}(x) - \tfrac{1}{2} \Bigr| \ge \sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| - \sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \widehat{\eta}^{\pi_k}_{Q}(x) - \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] \Bigr| > C_p \sqrt{\log(n_P + n_Q)}
$$
for the choice of $2^{-k} = \delta_1$. Thus, for each $x \in \Omega_n$ with $x \in A_j^{(p_0)}$, the terminating condition will be triggered no later than at $2^{-k} = \delta_1$, indicating that $2^{-k_j} \ge \delta_1$ with probability $P^{n_P} \otimes Q^{n_Q} \otimes \mathrm{R}$ at least $1 - 4/(n_P + n_Q)^2$. Thus, in this case, we have
$$
\mathbb{E}_{Y|X}\Bigl[\widehat{\eta}^{\pi_{k_j}}_{P,Q}(x)\Bigr] - \tfrac{1}{2} = \mathbb{E}_{Y|X}\Bigl[\widehat{\eta}^{\pi_{k_j}}_{P,Q}(x)\Bigr] - \tilde{\eta}^{\pi_{k_j}}_{P,Q}(x) + \tilde{\eta}^{\pi_{k_j}}_{P,Q}(x) - \tfrac{1}{2} \ge -r_{j_k} + r_{j_k} \ge 0, \qquad (55)
$$
which implies that the signs of $\mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_{k_j}}_{P,Q}(x)\bigr] - \frac{1}{2}$ and $\tilde{\eta}^{\pi_{k_j}}_{P,Q}(x) - \frac{1}{2}$ are identical. By steps analogous to the proof of Proposition 10, there holds $\mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_{k_j}}_{P,Q}(x)\bigr] - \frac{1}{2} > 0$ for $x \in \Omega_n$. Thus we show that, for each $x \in \Omega_n$ with $x \in A_j^{(p_0)}$, the label is assigned correctly, in agreement with the Bayes classifier. Therefore, by Assumption 2, we have
$$
\mathcal{R}_P\bigl(\tilde{f}^{\mathrm{prune}}_{P,Q}\bigr) - \mathcal{R}_P^* = \int_{\mathcal{X}} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, \mathbf{1}\Bigl\{ \tilde{f}^{\pi}_{P,Q}(x) \ne f_P^*(x) \Bigr\}\, dP_X(x) \le \int_{\Omega_n^c} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, dP_X(x) \lesssim \delta^{(\beta+1)\alpha/d}.
$$
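The last step can be unpacked as follows (a short sketch; it reads Assumption 2 as the margin condition with exponent $\beta$ for $P$, as used here): on $\Omega_n^c$ the integrand is at most $C^* \delta^{\alpha/d}$, so
$$
\int_{\Omega_n^c} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, dP_X(x) \le C^* \delta^{\alpha/d} \cdot P_X\Bigl( \bigl| \eta_P(X) - \tfrac{1}{2} \bigr| < C^* \delta^{\alpha/d} \Bigr) \lesssim \delta^{\alpha/d} \cdot \bigl( \delta^{\alpha/d} \bigr)^{\beta} = \delta^{(\beta+1)\alpha/d}.
$$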
Case (ii): $\frac{n_P \varepsilon^2}{\log(n_P + n_Q)} \gtrsim \Bigl( \frac{n_Q}{\log(n_P + n_Q)} \Bigr)^{\frac{2\alpha + 2d}{2\gamma\alpha + d}}$. We first note that, with $n_P^{\frac{\alpha}{2\alpha + 2d}} \lesssim \varepsilon \lesssim n_P^{\frac{d}{4 + 2d}}$, there holds $2^{p_0 - k + 3}\varepsilon^{-2} n_P \ge \sum_{j'} \sum_{i=1}^{n_P} \tilde{U}^{Pj'}_i$ with high probability, i.e., Algorithm 2 will evaluate the condition in the first if block as true. Since $\varepsilon \lesssim n_P^{\frac{d}{4 + 2d}}$, there holds $2^{p_0} \ge (n_P \varepsilon^2)^{\frac{d}{2 + 2d}} \gtrsim \varepsilon^2$. By Lemma 13, when $\lambda = 0$, we have $\frac{1}{n_P} \sum_{j'} \sum_{i=1}^{n_P} U^{Pj'}_i \lesssim 2^{-k}$. Together, we have
$$
2^{p_0 - k + 3}\varepsilon^{-2} n_P \gtrsim \sum_{j'} \sum_{i=1}^{n_P} U^{Pj'}_i. \qquad (56)
$$
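Concretely, (56) is a one-line check from the two facts just stated:
$$
\sum_{j'} \sum_{i=1}^{n_P} U^{Pj'}_i \lesssim 2^{-k} n_P = 2^{p_0 - k + 3}\varepsilon^{-2} n_P \cdot \frac{\varepsilon^2}{2^{p_0 + 3}} \lesssim 2^{p_0 - k + 3}\varepsilon^{-2} n_P,
$$
where the last step uses $2^{p_0} \gtrsim \varepsilon^2$.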
For the Laplace noise, there holds
$$
2^{p_0 - k}\varepsilon^{-2} n_P \gtrsim \sqrt{2^{p_0 - k}\varepsilon^{-2} n_P \log(n_P + n_Q)} \gtrsim \frac{4}{\varepsilon} \Bigl| \sum_{j'} \sum_{i=1}^{n_P} \zeta^{Pj'}_i \Bigr| \qquad (57)
$$
due to Lemma 12. Adding (56) and (57) shows that $2^{p_0 - k + 3}\varepsilon^{-2} n_P \ge \sum_{j'} \sum_{i=1}^{n_P} \tilde{U}^{Pj'}_i$ holds with probability $1 - 2/(n_P + n_Q)^2$. Under this case, if $k_j > \bigl\lfloor \frac{d}{2 + 2d} \log_2(n_P \varepsilon^2) \bigr\rfloor$ for all $j \in \mathcal{I}_{p_0}$, we end up with $\tilde{\eta}^{\mathrm{prune}}_{P,Q} = \sum_j s_{j_k} \cdot \mathbf{1}\{A_j^{(p_0)}\}$. We define
$$
\Omega_n := \Bigl\{ x \in \mathcal{X} : \bigl| \eta_P(x) - \tfrac{1}{2} \bigr| \ge C^* \cdot \delta^{\alpha(\alpha + d)/d(1 + d)} \Bigr\}
$$
for some $C^*$ large enough. Assume $\eta_P(x) > 1/2$. If $x_0$ belongs to the same cell $A_{k_j}$ as $x$, then
$$
\eta_P(x_0) - \tfrac{1}{2} \ge \eta_P(x) - \tfrac{1}{2} - C_0 \cdot 2^{-k_j \alpha/d} \ge \eta_P(x) - \tfrac{1}{2} - C_0\, \delta^{\alpha(\alpha + d)/d(1 + d)} \ge 0
$$
for some $C_0$. Analogous to (55), all $x \in \Omega_n$ are correctly classified, and thus
$$
\mathcal{R}_P\bigl(\tilde{f}^{\mathrm{prune}}_{P,Q}\bigr) - \mathcal{R}_P^* = \int_{\mathcal{X}} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, \mathbf{1}\Bigl\{ \tilde{f}^{\pi}_{P,Q}(x) \ne f_P^*(x) \Bigr\}\, dP_X(x) \le \int_{\Omega_n^c} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, dP_X(x) \lesssim \delta^{(\beta+1)\alpha(\alpha + d)/d(1 + d)}.
$$
If there exists some $j \in \mathcal{I}_{p_0}$ such that the termination is triggered, i.e., we end up with $\tilde{f}^{\pi_p}_{P}$, the proof becomes analogous to that of Theorem 4. Specifically, the trade-off (30) will have a larger approximation error $\delta^{\alpha(\alpha + d)/d(1 + d)}$ and smaller sample error as well as privacy error. Thus, the final rate is dominated by the approximation error, which becomes $\delta^{(\beta+1)\alpha(\alpha + d)/d(1 + d)}$. In Case (ii), all of the above events jointly happen with probability $P^{n_P} \otimes Q^{n_Q} \otimes \mathrm{R}$ at least $1 - 4/(n_P + n_Q)^2$.
Appendix C. Experiment Details
C.1 Real Datasets Pre-processing
We present additional information on the data sets, including data sources and pre-processing details. The data sets are sourced from the UCI Machine Learning Repository (Dua and Graff, 2017) and Kaggle.
Anonymity: The Anonymity data set (piAI, 2019) contains poll results about Internet privacy from a large number of people, including whether the interviewee believes United States law provides reasonable privacy protection for Internet users and how conservative the interviewee is about Internet privacy. We use the other variables to predict whether the interviewee has ever tried to mask his/her identity when using the Internet. We randomly select 50 samples as the public data.
Diabetes: This data set from Pore (2023) contains a diverse range of health-related
attributes, meticulously collected to aid in the development of predictive models for identi-
fying individuals at risk of diabetes. We randomly select 200 samples as the public data.
Email: The data set is used for spam or not-spam classification (Biswas, 2019). For each
row, the count of each word (column) in that email (row) is stored in the respective cells.
55
Ma and Yang
Thus, information regarding all emails is stored in a compact data frame. We randomly
select 200 samples as the public data.
Employ: The data set (Hamoutn, 2022) was collected from different agencies in the Philippines and consists of mock interview performances of students. The final task is to decide whether the student got employed. We randomly select 200 samples as the public data.
Rice: The Rice data set (Cammeo and Osmancik, 2019) contains morphological features of two varieties of rice grains, which are used to classify the variety. We randomly select 300 samples as the public data.
Census: This data set is the well-known census data (Becker and Kohavi, 1996) and
is used to predict whether income exceeds 50,000 per year. We select the samples whose
native country is the U.S. to be public data and others to be private data.
Election: This data set from (Winner and Thacker, 2022) contains county-level demo-
graphics and whether or not Bill Clinton won each county during the 1992 U.S. presidential
election. The goal of this data set is to predict if Clinton won a county. We use samples
whose state is TX as public data and the rest as private data.
Employee: The Employee data (Elmetwally, 2023) contains information about employees
in a company, including their educational backgrounds, work history, and employment-
related factors. We use these variables to predict whether the employee leaves or not. We
select the sample to be public if the employee joined the company before 2012.
Jobs: The data (Tankha, 2023) contains a comprehensive collection of information re-
garding job applicants and their respective employability scores to predict whether the
applicant has been hired. We select the samples whose native country is the U.S. to be
public data and others to be private data.
Landcover: The data is from Johnson and Iizuka (2016), which contains Landsat time-
series satellite imagery information on given pixels and their corresponding land cover class
labels (farm, forest, water, etc.) obtained from multiple sources. We classify whether the
image comes from a farm or a forest. Within this data set, there are two kinds of label
sources: labels obtained from OpenStreetMap and labels that are accurately labeled by
experts. We take the latter as public data and the former as private data.
Taxi: The Taxi data set was obtained from the Differential Privacy Temporal Map
Challenge (DrivenData, 2021), which aims to develop algorithms that preserve data util-
ity while guaranteeing individual privacy protection. The data set contains quantitative
and categorical information about taxi trips in Chicago, including time, distance, location,
payment, and service provider. These features include the identification number of taxis
(taxi id), time of each trip (seconds), the distance of each trip (miles), index of the zone
where the trip starts (pca), index of the zone where the trip ends (dca), service provider
(company), the method used to pay for the trip (payment type), amount of tips (tips) and
fares (fare). We use the other variable to predict whether the passenger is paying by credit
card or by cash. We select data from one company as public data and others as private
data.
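The data sets above are split into public and private parts in one of two ways: random subsampling of a fixed number of public samples, or splitting on an attribute (native country, state, company, joining year, or label source). The following pandas sketch illustrates both patterns; the file and column names (census.csv, native-country, diabetes.csv) are placeholders for illustration and not the exact identifiers used in our code.

import pandas as pd

def random_split(df: pd.DataFrame, n_public: int, seed: int = 42):
    # Randomly hold out n_public rows as the public set (e.g., Diabetes, Rice).
    public = df.sample(n=n_public, random_state=seed)
    private = df.drop(public.index)
    return public, private

def attribute_split(df: pd.DataFrame, column: str, public_values):
    # Rows whose `column` value lies in `public_values` form the public set
    # (e.g., Census: native country is the U.S.; Taxi: a single service provider).
    mask = df[column].isin(public_values)
    return df[mask], df[~mask]

if __name__ == "__main__":
    census = pd.read_csv("census.csv")  # hypothetical file name
    census_public, census_private = attribute_split(
        census, column="native-country", public_values=["United-States"])

    diabetes = pd.read_csv("diabetes.csv")  # hypothetical file name
    diabetes_public, diabetes_private = random_split(diabetes, n_public=200)

    print(len(census_public), len(census_private))
    print(len(diabetes_public), len(diabetes_private))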
References
Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal
Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016
ACM SIGSAC conference on computer and communications security, 2016.
Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public
data. Advances in neural information processing systems, 32, 2019.
Ehsan Amid, Arun Ganesh, Rajiv Mathews, Swaroop Ramaswamy, Shuang Song, Thomas
Steinke, Vinith M Suriyakumar, Om Thakkar, and Abhradeep Thakurta. Public data-
assisted mirror descent for private model training. In International Conference on Ma-
chine Learning. PMLR, 2022.
Jean-Yves Audibert and Alexandre B Tsybakov. Fast learning rates for plug-in classifiers.
Annals of statistics, 35(2):608–633, 2007.
Raef Bassily, Om Thakkar, and Abhradeep Guha Thakurta. Model-agnostic private learn-
ing. Advances in Neural Information Processing Systems, 31, 2018.
Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, and Steven
Wu. Private query release assisted by public data. In International Conference on Ma-
chine Learning, pages 695–703. PMLR, 2020.
Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. URL
https://doi.org/10.24432/C5XW20.
Shai Ben-David, Alex Bie, Clement Louis Canonne, Gautam Kamath, and Vikrant Singhal.
Private distribution learning with public data: The view from sample compression. In
Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Thomas Berrett and Cristina Butucea. Classification under local differential privacy. arXiv
preprint arXiv:1912.04629, 2019.
Thomas B Berrett, László Györfi, and Harro Walk. Strongly universally consistent nonpara-
metric regression and classification with privatised data. Electronic Journal of Statistics,
15:2430–2453, 2021.
Tom Berrett and Yi Yu. Locally private online change point detection. Advances in Neural
Information Processing Systems, 34:3425–3437, 2021.
Alex Bie, Gautam Kamath, and Vikrant Singhal. Private estimation with public data. In
Advances in Neural Information Processing Systems, 2022.
Balaka Biswas. Email spam classification dataset csv, 2019. URL https://www.kaggle.
com/datasets/balaka18/email-spam-classification-dataset-csv/data.
L Breiman. Classification and regression trees. The Wadsworth & Brooks/Cole, 1984.
Cristina Butucea, Amandine Dubois, Martin Kroll, and Adrien Saumard. Local differential
privacy: Elbow effect in optimal density estimation and adaptation over besov ellipsoids.
Bernoulli, 26(3):1727–1764, 2020.
T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax
rate and adaptive classifier. The Annals of Statistics, 49(1), 2021.
Yuchao Cai, Yuheng Ma, Yiwei Dong, and Hanfang Yang. Extrapolated random tree for
regression. In Proceedings of the 40th International Conference on Machine Learning,
pages 3442–3468. PMLR, 2023.
Cammeo and Osmancik. Rice. UCI Machine Learning Repository, 2019. URL https:
//doi.org/10.24432/C5MW4Z.
Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor
classification. Advances in Neural Information Processing Systems, 27, 2014.
Kamalika Chaudhuri and Staal A Vinterbo. A stability-based validation procedure for dif-
ferentially private machine learning. Advances in Neural Information Processing Systems,
26, 2013.
Yuval Dagan and Vitaly Feldman. Interaction is necessary for distributed learning with
privacy or communication constraints. In Proceedings of the 52nd Annual ACM SIGACT
Symposium on Theory of Computing, pages 450–462, 2020.
Amit Daniely and Vitaly Feldman. Locally private learning without interaction requires
separation. Advances in neural information processing systems, 32, 2019.
DrivenData. Differential privacy temporal map challenge: Sprint 3 (pre-
screened arena), 2021. URL https://www.drivendata.org/competitions/77/
deid2-sprint-3-prescreened/page/332/.
Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://
archive.ics.uci.edu/ml.
John Duchi, Michael Jordan, and Martin Wainwright. Minimax optimal procedures for
locally private estimation. Journal of the American Statistical Association, 113(521):
182–201, 2018.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to
sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284.
Springer, 2006.
Tawfik Elmetwally. Employee dataset, 2023. URL https://www.kaggle.com/datasets/
tawfikelmetwally/employee-dataset/data.
Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable
privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference
on computer and communications security, pages 1054–1067, 2014.
Arun Ganesh, Mahdi Haghifam, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar,
Abhradeep Guha Thakurta, and Lun Wang. Why is public pretraining necessary for
private model training? In International Conference on Machine Learning, pages 10611–
10627. PMLR, 2023.
Xin Gu, Gautam Kamath, and Zhiwei Steven Wu. Choosing public datasets for private
machine learning via gradient subspace distance. arXiv preprint arXiv:2303.01256, 2023.
László Györfi and Martin Kroll. On rate optimal private regression under local differential
privacy. arXiv preprint arXiv:2206.00114, 2022.
Anas Hamoutn. Students’ employability dataset - philippines, 2022. URL https://www.
kaggle.com/datasets/anashamoutni/students-employability-dataset/data.
Brian A Johnson and Kotaro Iizuka. Integrating openstreetmap crowdsourced data and
landsat time-series imagery for rapid land use/land cover (lulc) mapping: Case study of
the laguna de bay area of the philippines. Applied Geography, 67:140–149, 2016.
Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Steven Z Wu. Locally private
gaussian estimation. Advances in Neural Information Processing Systems, 32, 2019.
Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local dif-
ferential privacy. Advances in neural information processing systems, 27, 2014.
Peter Kairouz, Monica Ribero Diaz, Keith Rush, and Abhradeep Thakurta. (nearly) di-
mension independent private erm with adagrad rates via publicly estimated subspaces.
In Conference on Learning Theory, pages 2717–2746. PMLR, 2021.
Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference.
Springer Series in Statistics. Springer, New York, 2008.
Martin Kroll. On density estimation at a fixed point under local differential privacy. Elec-
tronic Journal of Statistics, 15:1783–1813, 2021.
OV Lepskii. Asymptotically minimax adaptive estimation. i: Upper bounds. optimally
adaptive estimates. Theory of Probability & Its Applications, 36(4):682–697, 1992.
Xuechen Li, Florian Tramèr, Percy Liang, and Tatsunori Hashimoto. Large language models
can be strong differentially private learners. In The Tenth International Conference on
Learning Representations, 2022.
Chong Liu, Yuqing Zhu, Kamalika Chaudhuri, and Yu-Xiang Wang. Revisiting model-
agnostic private learning: Faster rates and active learning. The Journal of Machine
Learning Research, 22(1):11936–11979, 2021.
Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings
of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.
Andrew Lowy, Zeman Li, Tianjian Huang, and Meisam Razaviyayn. Optimal differentially
private learning with public data. arXiv preprint arXiv:2306.15056, 2023.
Yuheng Ma, Han Zhang, Yuchao Cai, and Hanfang Yang. Decision tree for locally pri-
vate estimation with public data. In Thirty-seventh Conference on Neural Information
Processing Systems, 2023.
Shubhankar Mohapatra, Sajin Sasy, Xi He, Gautam Kamath, and Om Thakkar. The role
of adaptive optimizers for honest private hyperparameter selection. In Proceedings of the
aaai conference on artificial intelligence, volume 36, pages 7806–7813, 2022.
Milad Nasr, Saeed Mahloujifar, Xinyu Tang, Prateek Mittal, and Amir Houmansadr. Ef-
fectively using public data in privacy preserving machine learning. In Proceedings of the
40th International Conference on Machine Learning, volume 202. PMLR, 2023.
Nicolas Papernot and Thomas Steinke. Hyperparameter tuning with renyi differential pri-
vacy. In International Conference on Learning Representations, 2021.
Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar.
Semi-supervised knowledge transfer for deep learning from private training data. In
International Conference on Learning Representations, 2017.
Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and
Ulfar Erlingsson. Scalable private learning with pate. In International Conference on
Learning Representations, 2018.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-
del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
piAI. Internet privacy poll, 2019. URL https://www.kaggle.com/datasets/econdata/
internet-privacy-poll.
Nandita Pore. Healthcare diabetes dataset, 2023. URL https://www.kaggle.com/
datasets/nanditapore/healthcare-diabetes/data.
Swaroop Ramaswamy, Om Thakkar, Rajiv Mathews, Galen Andrew, H Brendan McMahan,
and Françoise Beaufays. Training production language models without memorizing user
data. arXiv preprint arXiv:2009.10031, 2020.
Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A
survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv
preprint arXiv:2004.11829, 2020.
Richard J Samworth. Optimal weighted nearest neighbour classifiers. The Annals of Statis-
tics, pages 2733–2763, 2012.
Adam Smith, Abhradeep Thakurta, and Jalaj Upadhyay. Is interaction necessary for dis-
tributed private learning? In 2017 IEEE Symposium on Security and Privacy (SP), pages
58–77. IEEE, 2017.
Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with
differentially private updates. In 2013 IEEE global conference on signal and information
processing, pages 245–248. IEEE, 2013.
Jinyan Su, Jinhui Xu, and Di Wang. On pac learning halfspaces in non-interactive local
privacy model with public unlabeled data. In Asian Conference on Machine Learning,
pages 927–941. PMLR, 2023.
Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Pri-
vacy loss in apple’s implementation of differential privacy on macos 10.12. arXiv preprint
arXiv:1709.02753, 2017.
Ayush Tankha. Employability classification of over 70,000 job appli-
cants, 2023. URL https://www.kaggle.com/datasets/ayushtankha/
70k-job-applicants-data-human-resource/data.
Florian Tramèr, Gautam Kamath, and Nicholas Carlini. Considerations for differentially
private learning with large-scale public pretraining. arXiv preprint arXiv:2212.06470,
2022.
Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009. ISBN 978-0-387-79051-0. doi: 10.1007/b13794. URL https://doi.org/10.1007/b13794. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.
Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48.
Cambridge University Press, 2019.
Di Wang and Jinhui Xu. Principal component analysis in the local differential privacy
model. Theoretical computer science, 809:296–312, 2020.
Di Wang, Marco Gaboardi, and Jinhui Xu. Empirical risk minimization in non-interactive
local differential privacy revisited. Advances in Neural Information Processing Systems,
31, 2018.
Di Wang, Lijie Hu, Huanyu Zhang, Marco Gaboardi, and Jinhui Xu. Generalized linear
models in non-interactive local differential privacy with public data. Journal of Machine
Learning Research, 24(132):1–57, 2023a.
Hua Wang, Sheng Gao, Huanyu Zhang, Weijie J Su, and Milan Shen. Dp-hypo: An adaptive
private framework for hyperparameter optimization. In Thirty-seventh Conference on
Neural Information Processing Systems, 2023b.
Jun Wang and Zhi-Hua Zhou. Differentially private learning with small public data. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020.
Stanley L Warner. Randomized response: A survey technique for eliminating evasive answer
bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
Frank Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in statistics,
pages 196–202. Springer, 1992.
Larry Winner and Jared Thacker. 1992 u.s presidential election, 2022. URL https://www.
kaggle.com/datasets/jwt0024/1992-us-presidential-election.
Xiaotong Wu, Lianyong Qi, Jiaquan Gao, Genlin Ji, and Xiaolong Xu. An ensemble of
random decision trees with local differential privacy in edge computing. Neurocomputing,
485:181–195, 2022.
Da Yu, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. Do not let privacy overbill utility:
Gradient embedding perturbation for private learning. In International Conference on
Learning Representations, 2021a.
Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. Large scale private learning
via low-rank reparametrization. In International Conference on Machine Learning, pages
12208–12218. PMLR, 2021b.
Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath,
Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin,
and Huishuai Zhang. Differentially private fine-tuning of language models. In The Tenth
International Conference on Learning Representations, 2022.
Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zinan Lin, Saurabh Naik, Tomasz Lukasz
Religa, Jian Yin, and Huishuai Zhang. Selective pre-training for private fine-tuning.
arXiv preprint arXiv:2305.13865, 2023.
Kai Zheng, Wenlong Mou, and Liwei Wang. Collect at once, use effectively: Making non-
interactive locally private learning possible. In International Conference on Machine
Learning, pages 4130–4139. PMLR, 2017.
Yingxue Zhou, Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Pri-
vate sgd with gradient subspace identification. In International Conference on Learning
Representations, 2021.