Journal of Machine Learning Research 25 (2024) 1-62 Submitted 11/23; Revised 4/24; Published 5/24
Optimal Locally Private Nonparametric Classification
with Public Data
Yuheng Ma
School of Statistics
Renmin University of China
100872 Beijing, China
Hanfang Yang hy[email protected]
Center for Applied Statistics, School of Statistics
Renmin University of China
100872 Beijing, China
Editor: Po-Ling Loh
Abstract
In this work, we investigate the problem of public data assisted non-interactive Lo-
cal Differentially Private (LDP) learning with a focus on non-parametric classification.
Under the posterior drift assumption, we for the first time derive the mini-max optimal
convergence rate with LDP constraint. Then, we present a novel approach, the locally
differentially private classification tree, which attains the mini-max optimal convergence
rate. Furthermore, we design a data-driven pruning procedure that avoids parameter tun-
ing and provides a fast converging estimator. Comprehensive experiments conducted on
synthetic and real data sets show the superior performance of our proposed methods. Both
our theoretical and experimental findings demonstrate the effectiveness of public data com-
pared to private data, which leads to practical suggestions for prioritizing non-private data
collection.
Keywords: Decision Tree, Local Differential Privacy, Non-parametric Classification,
Posterior Drift, Public Data
1. Introduction
Local differential privacy (LDP) (Kairouz et al., 2014; Duchi et al., 2018), which is a variant
of differential privacy (DP) (Dwork et al., 2006), has gained considerable attention in recent
years, particularly among industrial developers (Erlingsson et al., 2014; Tang et al., 2017).
Unlike central DP, which relies on a trusted curator who has access to the raw data, LDP
assumes that each sample is possessed by a data holder, and each holder privatizes their
data before it is collected by the curator. Although this setting provides stronger protection,
learning with perturbed data requires more samples (Duchi et al., 2018) compared to the
central setting. Moreover, basic techniques such as principal component analysis (Wang and
Xu, 2020; Bie et al., 2022), data standardization (Bie et al., 2022), and tree partitioning (Wu
et al., 2022) are troublesome or even prohibited. Consequently, LDP introduces challenges
for various machine learning tasks that are otherwise considered straightforward, including
©2024 Yuheng Ma and Hanfang Yang.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1563.html.
density estimation (Duchi et al., 2018), mean estimation (Duchi et al., 2018), Gaussian
estimation (Joseph et al., 2019), and change-point detection (Berrett and Yu, 2021).
Fortunately, in certain scenarios, private estimation performance can be enhanced with
an additional public data set. From an empirical perspective, numerous studies demon-
strated the effectiveness of using public data (Papernot et al., 2017, 2018; Yu et al., 2021b,
2022; Nasr et al., 2023; Gu et al., 2023). In most cases, the public data is out-of-distribution
and collected from other related sources. Yet, though public data from an identical distri-
bution has been extensively studied (Bassily et al., 2018; Alon et al., 2019; Kairouz et al.,
2021; Amid et al., 2022; Ben-David et al., 2023; Lowy et al., 2023; Wang et al., 2023a),
only a few works systematically described the out-of-distribution relationship. Focusing on
unsupervised problems, i.e. Gaussian estimation, Bie et al. (2022) discussed the gain of
using public data with respect to the total variation distance between private and public
data distributions. As for supervised learning, Gu et al. (2023) discussed the similarities
between data sets to guide the selection of appropriate public data. Ma et al. (2023) lever-
aged public data to create informative partitions for LDP regression trees. Neither of them
discussed the relationship between regression functions of private and public data distribu-
tions. Aiming at the gap between theory and practice in supervised private learning with
public data, we pose the following intriguing questions:
1. Theoretically, when is labeled public data beneficial for LDP learning?
2. Under such conditions, how to effectively leverage public data?
Hyper-parameter tuning is an essential yet challenging task in private learning (Papernot
and Steinke, 2021; Mohapatra et al., 2022) and remains an unsolved problem in local differ-
ential privacy. Common strategies, such as cross-validation and information criteria, remain
unavailable due to their requirement for multiple queries to the training data. Tuning by
holding out a validation set is restrictive in terms of the size of the potential hyperparameter
space (Ma et al., 2023). Moreover, leveraging public data necessitates more complex mod-
els, which often results in an increased number of hyperparameters to tune. Consequently,
models with public data may face difficulties in parameter selection, which leads to the
third consideration:
3. Can data-driven hyperparameter tuning approaches be derived?
In this work, we answer the above questions from the perspective of non-parametric
classification under the non-interactive LDP constraint. Though parametric methods enjoy
a better privacy-utility trade-off (Duchi et al., 2018) under certain assumptions, they are
vulnerable to model misspecification. In the LDP setting, where we have limited prior
knowledge of the data, non-parametric methods safeguard worst-case performance. As
for private classification, the local setting remains rarely explored compared to the central
setting. A notable reason is that most gradient-based methods (Song et al., 2013; Abadi
et al., 2016) are prohibited due to the high demand for memory and communication capacity
on the terminal machine (Tramèr et al., 2022). We consider non-interactive LDP, where the
communication between the curator and the data holders is limited to a single round. This
type of method is favored by practitioners (Smith et al., 2017; Zheng et al., 2017; Daniely
and Feldman, 2019) since they avoid multi-round protocols that are prohibitively slow in
practice due to network latency.
Under such background, we summarize our contributions by answering the questions:
For the first question, we propose to adopt the framework of posterior drift, a setting
in transfer learning (Cai and Wei, 2021), for our analysis. Specifically, given the target
distribution P and the source distribution Q, we require $P_X = Q_X$, i.e. they have the same
marginal distribution. Moreover, their Bayes decision rules are identical, i.e.
$(P(Y = 1 \mid X) - 1/2) \cdot (Q(Y = 1 \mid X) - 1/2) \ge 0$. The assumption covers a wide range
of distributions and includes P = Q as a special case. Theoretically, we for the first
time establish a mini-max lower bound for nonparametric LDP learning with public
data.
To answer the second question, we propose the Locally differentially Private Classifi-
cation Tree (LPCT), a novel algorithm that leverages both public and private data.
Specifically, we first create a tree partition on public data and then estimate the re-
gression function through a weighted average of public and private data. Besides
inheriting the merits of tree-based models such as efficiency, interpretability, and extensibility
to multiple data types, LPCT is superior from both theoretical and empirical
perspectives. Theoretically, we show that LPCT attains the mini-max optimal con-
vergence rate. Empirically, we conduct experiments on both synthetic and real data
sets to show the effectiveness of LPCT over state-of-the-art competitors.
To answer the last question, we propose the Pruned Locally differentially Private
Classification Tree (LPCT-prune), a data-driven classifier that is free of parameter
tuning. Specifically, we query the privatized information with a sufficient tree depth
and conduct an efficient pruning procedure. Theoretically, we show that LPCT-prune
achieves the optimal convergence rate in most cases and maintains a reasonable con-
vergence rate otherwise. Empirically, we illustrate that LPCT-prune with a default
parameter performs comparable to LPCT with the best parameters.
This article is organized as follows. In Section 2, we provide a comprehensive literature
review. Then we formally define our problem setting and introduce our methodology in
Section 3. The obtained theoretical results are presented in Section 4. Then we develop
the data-driven estimator in Section 5. Extensive experiments are conducted on synthetic
data and real data in Sections 6 and 7, respectively. The conclusion and discussions are in
Section 8.
2. Related Work
2.1 Private Learning with Public Data
Leveraging public data in private learning is a research topic that is both theoretically and
practically meaningful to closing the gap between private and non-private methods. There
is a long list of works addressing the issue of private learning with public data from a cen-
tral differential privacy perspective. Through unlabeled public data, Papernot et al. (2017,
2018) fed knowledge privately into student models, whose theoretical benefits are estab-
lished by Liu et al. (2021). Empirical investigations have demonstrated the effectiveness of
pretraining on public data and fine-tuning privately on sensitive data (Li et al., 2022; Yu
et al., 2022; Ganesh et al., 2023; Yu et al., 2023). Using the information obtained in public
data, Wang and Zhou (2020); Kairouz et al. (2021); Yu et al. (2021a); Zhou et al. (2021);
Amid et al. (2022); Nasr et al. (2023) investigated preconditioning or adaptively clipping
the private gradients, which reduce the required amount of noise in DP-SGD and accel-
erate its convergence. Theoretical works such as Bassily et al. (2018); Alon et al. (2019);
Bassily et al. (2020) studied sample complexity bounds for PAC learning and query release
based on the VC dimension of the function space. Bie et al. (2022) used public data to
standardize private data and showed that the sample complexity of Gaussian mean estimation
can be improved in terms of the range parameter. More recently, by relating
to sample compression schemes, Ben-David et al. (2023) presented sample complexities for
several problems including Gaussian mixture estimation. Ganesh et al. (2023) explained
the necessity of public data by investigating the loss landscape.
In contrast, less attention has been devoted to the local setting. Su et al. (2023) considered
half-space estimation with auxiliary unlabeled public data. They constructed several
weak LDP classifiers and labeled the public data by majority vote. They established a sample
complexity that is linear in the dimension and polynomial in other terms for both private
and public data. Wang et al. (2023a) employed public data to estimate the leading eigen-
value of the covariance matrix, which will serve as a reference for data holders to clip their
statistics. Given the learned clipping threshold, the sample complexity for GLM with LDP
for a general class of functions is improved to be polynomial in the error, dimension, and
privacy budget, which is shown to be impossible without public data (Smith et al., 2017; Dagan
and Feldman, 2020). Both Wang et al. (2023a) and Su et al. (2023) considered unlabeled
public data, while our work further leverages the information contained in the labels if
labeled public data is available. More recently, Ma et al. (2023) enhanced LDP regression
by using public data to create an informative partition. However, the relationship between
the source and target distributions is vaguely defined. Also, their theoretical results are no
better than using just private data. Lowy et al. (2023) studied mean estimation, DP-SCO,
and DP-ERM with public data. They assume that the public data and the private data
are identically distributed. Also, the theoretical results are established under sequentially-
interactive LDP. Despite the distinctions in the research problem, their conclusions are
analogous to ours: the asymptotically optimal error can be achieved by using either the private
or the public estimation alone, while the weighted-average estimation achieves a better constant
as well as better empirical performance.
2.2 Locally Private Classification
LDP classification with parametric assumptions can be viewed as a special case of LDP-
ERM (Smith et al., 2017; Zheng et al., 2017; Wang et al., 2018; Daniely and Feldman, 2019;
Dagan and Feldman, 2020; Wang et al., 2023a). Most works provide evidence of hardness on
non-interactive LDP-ERM. Non-parametric methods possess a worse privacy-utility trade-
off (Duchi et al., 2018) and are considered to be harder than parametric problems. Zheng
et al. (2017) investigated the kernel ridge regression and established a bound of estimation
error of the order n
1/4
. Their results only apply to losses that are strongly convex and
smooth. The works most related to ours is that of Berrett and Butucea (2019); Berrett
4
Optimal Locally Private Nonparametric Cla ssification with Public Data
et al. (2021). Berrett and Butucea (2019) studied non-parametric classification with the
older smoothness assumption. They proposed a histogram-based method using the Laplace
mechanism and showed that this approach is mini-max optimal under their assumption. The
strong consistency of this approach is established in Berrett et al. (2021) as a by-product.
Our results include their conclusion as a special case and hold in a stronger sense, i.e.
“uniformly with high probability” instead of “in expectation”.
2.3 Private Hyperparameter Tuning
Researchers highlighted the importance and difficulty of parameter tuning under DP con-
straint (Papernot and Steinke, 2021; Wang et al., 2023b) focusing on central DP. Chaudhuri
and Vinterbo (2013) explored this problem under strong stability assumptions. Liu and Tal-
war (2019) presented DP algorithms for selecting the best parameters in several candidates.
Compared to the original algorithm, their method suffers a factor of 3 in the privacy parameter,
meaning that 2/3 of the privacy budget is spent on the selection process. Papernot
and Steinke (2021) improved the result with a much tighter privacy bound in Rényi DP
by running the algorithm a random number of times and returning the best result. Recently, Wang
et al. (2023b) proposed a framework that reduces DP parameter selection to a non-DP
counterpart. This approach leverages existing hyperparameter optimization methods and
outperforms the uniform sampling approaches (Liu and Talwar, 2019; Papernot and Steinke,
2021). Ramaswamy et al. (2020) proposed to tune hyperparameters on public data sets.
Mohapatra et al. (2022) argued that, under a limited privacy budget, DP optimizers that re-
quire less tuning are preferable. As far as we know, there is a lack of research focusing on the
LDP parameter selection. Butucea et al. (2020) investigated adaptive parameter selection
in non-parametric density estimation, and provided a data-driven wavelet estimator.
3. Locally Differentially Private Classification Tree
In this section, we introduce the Locally differentially Private Classification Tree (LPCT).
After explaining necessary notations and problem setting in Section 3.1, we propose our
decision tree partition rule in Section 3.2. Finally, Section 3.3 introduces our privacy mech-
anism and proposes the final algorithm.
3.1 Problem Setting
We introduce necessary notations. For any vector $x$, let $x^i$ denote the $i$-th element of
$x$. Recall that for $1 \le p < \infty$, the $L_p$-norm of $x = (x^1, \ldots, x^d)$ is defined by
$\|x\|_p := (|x^1|^p + \cdots + |x^d|^p)^{1/p}$. Let $x^{i:j} = (x^i, \cdots, x^j)$ be the slicing view of $x$ in the
$i, \cdots, j$ positions. Throughout this paper, we use the notation $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ to
denote that there exist positive constants $c$ and $c'$ such that $a_n \le c\, b_n$ and $a_n \ge c'\, b_n$ for all
$n \in \mathbb{N}$. In addition, we denote $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. Let
$a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. Besides, for any set $A \subset \mathbb{R}^d$, the diameter of $A$ is
defined by $\mathrm{diam}(A) := \sup_{x, x' \in A} \|x - x'\|_2$. Let $f_1 \circ f_2$ represent the composition of
functions $f_1$ and $f_2$. Let $A \times B$ be the Cartesian product of sets where $A \subset \mathcal{X}_1$ and
$B \subset \mathcal{X}_2$. For measures $P$ on $\mathcal{X}_1$ and $Q$ on $\mathcal{X}_2$, define the product measure $P \otimes Q$ on
$\mathcal{X}_1 \times \mathcal{X}_2$ as $P \otimes Q(A \times B) = P(A) \cdot Q(B)$. For an integer $k$, denote the $k$-fold product
measure on $\mathcal{X}_1^k$ as $P^k$. Let the standard Laplace random variable have probability density
function $e^{-|x|}/2$ for $x \in \mathbb{R}$.
It is legitimate to consider binary classification, while an extension of our results to
multi-class classification is straightforward. Suppose there are two unknown probability
measures, the private measure P and the public measure Q, on $\mathcal{X} \times \mathcal{Y} = [0,1]^d \times \{0,1\}$. We
observe $n_P$ i.i.d. private samples $D_P = \{(X_1^P, Y_1^P), \ldots, (X_{n_P}^P, Y_{n_P}^P)\}$ drawn from the target
distribution P, and $n_Q$ i.i.d. public samples $D_Q = \{(X_1^Q, Y_1^Q), \ldots, (X_{n_Q}^Q, Y_{n_Q}^Q)\}$ drawn from
the source distribution Q. The data points from the distributions P and Q are also mutually
independent. Our goal is to classify under the target distribution P. Given the observed
data, we construct a classifier $\widehat{f}: \mathcal{X} \to \mathcal{Y}$ which minimizes the classification risk under the
target distribution P:
$$R_P(\widehat{f}) \triangleq P_{(X,Y)\sim P}\big(Y \ne \widehat{f}(X)\big).$$
Here $P_{(X,Y)\sim P}(\cdot)$ means the probability when $(X, Y)$ is drawn from distribution P. The
Bayes risk, which is the smallest possible risk with respect to P, is given by
$R_P^* := \inf\{R_P(f) \mid f: \mathcal{X} \to \mathcal{Y} \text{ is measurable}\}$. In such binary classification problems, the regression
functions are defined as
$$\eta_P(x) \triangleq P(Y = 1 \mid X = x) \quad \text{and} \quad \eta_Q(x) \triangleq Q(Y = 1 \mid X = x), \qquad (1)$$
which represent the conditional distributions $P_{Y|X}$ and $Q_{Y|X}$. The function that achieves the
Bayes risk with respect to P is called the Bayes function, namely, $f_P^*(x) := 1(\eta_P(x) > 1/2)$.
We consider the following setting. The estimator $\widehat{f}$ is considered as a random function
with respect to both $D_P$ and $D_Q$, while its construction process with respect to $D_P$ is locally
differentially private (LDP). The rigorous definition of LDP is as follows.
Definition 1 (Local Differential Privacy) Given data $\{(X_i^P, Y_i^P)\}_{i=1}^{n_P}$, a privatized
information $Z_i$, which is a random variable on $\mathcal{S}$, is released based on $(X_i^P, Y_i^P)$ and $Z_1, \cdots, Z_{i-1}$.
Let $\sigma(\mathcal{S})$ be the $\sigma$-field on $\mathcal{S}$. $Z_i$ is drawn conditional on $(X_i^P, Y_i^P)$ and $Z_{1:i-1}$ via the
distribution $R_i\big(S \mid X_i^P = x, Y_i^P = y, Z_{1:i-1} = z_{1:i-1}\big)$ for $S \in \sigma(\mathcal{S})$. Then the mechanism
$R = \{R_i\}_{i=1}^{n_P}$ is sequentially-interactive $\varepsilon$-locally differentially private ($\varepsilon$-LDP) if
$$\frac{R_i\big(S \mid X_i^P = x, Y_i^P = y, Z_{1:i-1} = z_{1:i-1}\big)}{R_i\big(S \mid X_i^P = x', Y_i^P = y', Z_{1:i-1} = z_{1:i-1}\big)} \le e^{\varepsilon}$$
for all $1 \le i \le n_P$, $S \in \sigma(\mathcal{S})$, $x, x' \in \mathcal{X}$, $y, y' \in \mathcal{Y}$, and $z_{1:i-1} \in \mathcal{S}^{i-1}$. Moreover, if
$$\frac{R_i\big(S \mid X_i^P = x, Y_i^P = y\big)}{R_i\big(S \mid X_i^P = x', Y_i^P = y'\big)} \le e^{\varepsilon}$$
for all $1 \le i \le n_P$, $S \in \sigma(\mathcal{S})$, $x, x' \in \mathcal{X}$, $y, y' \in \mathcal{Y}$, then $R$ is non-interactive $\varepsilon$-LDP.
This formulation is widely adopted (Duchi et al., 2018; Berrett and Butucea, 2019). In
contrast to central DP where the likelihood ratio is taken concerning some statistics of all
data, LDP requires individuals to guarantee their own privacy by considering the likelihood
ratio of each $(X_i^P, Y_i^P)$. Once the view $z$ is provided, no further processing can reduce
the deniability about taking a value $(x, y)$ since any outcome $z$ is nearly as likely to have
come from some other initial value $(x', y')$. Sequentially-interactive mechanisms fulfill the
requirements for a wide range of methods including gradient-based approaches. However,
they can prove to be prohibitively slow in practice. Conversely, non-interactive mechanisms
are quite efficient but suffer from poorer performance (Smith et al., 2017). Our analysis
does not encompass the more general fully-interactive mechanisms due to their complex
nature (Duchi et al., 2018).
Besides the privacy guarantee of P data, we consider the existence of additional public
data from distribution Q. To depict the relationship between P and Q, we consider
the posterior drift setting in the transfer learning literature (Cai and Wei, 2021). Under the
posterior drift model, let $P_X = Q_X$ be the identical marginal distribution. The main difference
between P and Q lies in the regression functions $\eta_P(x)$ and $\eta_Q(x)$. Specifically, let
$\eta_Q(x) = \phi(\eta_P(x))$ for some strictly increasing link function $\phi(\cdot)$ with $\phi(1/2) = 1/2$. The
condition that $\phi$ is strictly increasing leads to the situation where those $X$ that are more
likely to be labeled $Y = 1$ under P are also more likely to be labeled $Y = 1$ under Q. The
assumption $\phi(1/2) = 1/2$ guarantees that those $X$ that are non-informative under P are
the same under Q, and more importantly, the signs of $\eta_P - 1/2$ and $\eta_Q - 1/2$ are identical.
3.2 Decision Tree Partition
In this section, we formalize the decision tree partition process. A binary tree partition
$\pi$ is a disjoint cover of $\mathcal{X}$ obtained by recursively splitting grids into subgrids. While our
methodology applies to any tree partition, it can be challenging to use general partitions
such as the original CART (Breiman, 1984) for theoretical analysis. Following Cai et al.
(2023); Ma et al. (2023), we propose a new splitting rule called the max-edge partition rule.
This rule is amenable to theoretical analysis and can also achieve satisfactory practical
performance. In the non-interactive setting, the private data set is unavailable during the
partition process and the partition is created solely on the public data. Given the public data
set $\{(X_i^Q, Y_i^Q)\}_{i=1}^{n_Q}$, the partition rule is stated as follows:
• Let $A_1^{(0)} := [0,1]^d$ be the initial rectangular cell and $\pi_0 := \{A_j^{(0)}\}_{j \in \mathcal{I}_0}$ be the initialized
cell partition, where $\mathcal{I}_0 = \{1\}$ stands for the initialized index set. In addition, let $p \in \mathbb{N}$
represent the maximum depth of the tree. This parameter is fixed beforehand by the
user and possibly depends on $n$.

• Suppose we have obtained a partition $\pi_{i-1}$ of $\mathcal{X}$ after $i-1$ steps of the recursion. Let
$\pi_i = \emptyset$. In the $i$-th step, let $A_j^{(i-1)} \in \pi_{i-1}$ be $\times_{\ell=1}^d [a_\ell, b_\ell]$ for $j \in \mathcal{I}_{i-1}$. We choose the
edge to be split among the longest edges. The index set of the longest edges is defined as
$$M_j^{(i-1)} = \left\{ k \,\Big|\, |b_k - a_k| = \max_{\ell = 1,\cdots,d} |b_\ell - a_\ell|,\ k = 1, \cdots, d \right\}.$$
Assume we split along the $\ell$-th dimension for $\ell \in M_j^{(i-1)}$. Then $A_j^{(i-1)}$ is partitioned
into a left sub-cell $A_{j,0}^{(i-1)}(\ell)$ and a right sub-cell $A_{j,1}^{(i-1)}(\ell)$ along the midpoint
of the chosen dimension, where $A_{j,0}^{(i-1)}(\ell) = \big\{x \mid x \in A_j^{(i-1)},\ x^\ell < (a_\ell + b_\ell)/2\big\}$ and
$A_{j,1}^{(i-1)}(\ell) = A_j^{(i-1)} \setminus A_{j,0}^{(i-1)}(\ell)$. The dimension to be split is chosen by
$$\min \operatorname*{arg\,min}_{\ell \in M_j^{(i-1)}} \mathrm{score}\left(A_{j,0}^{(i-1)}(\ell),\, A_{j,1}^{(i-1)}(\ell),\, D_Q\right), \qquad (2)$$
where the score$(\cdot)$ function can be any criterion for decision tree classifiers such as the Gini
index or information gain. The min operator implies that if the arg min function
returns multiple indices, we select the smallest index. Thus, the partition process is
feasible even if there is no public data, where all axes have the same score.

• Once $\ell$ is selected, let $\pi_i = \pi_i \cup \{A_{j,0}^{(i-1)}(\ell), A_{j,1}^{(i-1)}(\ell)\}$.

Algorithm 1: Max-edge Partition
Input: Public data $D_Q$, depth $p$.
Initialization: $\pi_0 = \{[0,1]^d\}$.
for $i = 1$ to $p$ do
    $\pi_i = \emptyset$
    for $j$ in $\mathcal{I}_{i-1}$ do
        Select $\ell$ as in (2).
        $\pi_i = \pi_i \cup \{A_{j,0}^{(i-1)}(\ell), A_{j,1}^{(i-1)}(\ell)\}$.
    end
end
Output: Partition $\pi_p$
The complete process is presented in Algorithm 1 and illustrated in Figure 1. For each grid,
the partition rule selects the midpoint of the longest edges that achieves the largest score
reduction. This procedure continues until either node has no samples or the depth of the
tree reaches its limit. We also define relationships between nodes which are necessary in
Section 5. For node $A_j^{(i)}$, we define its parent node as
$$\mathrm{parent}(A_j^{(i)}) := A_{j'}^{(i-1)} \quad \text{s.t.} \quad A_{j'}^{(i-1)} \in \pi_{i-1},\ A_j^{(i)} \subset A_{j'}^{(i-1)}.$$
For $i' < i$, the ancestor of a node $A_j^{(i)}$ with depth $i'$ is then defined as
$$\mathrm{ancestor}(A_j^{(i)}, i') = \mathrm{parent}^{\,i-i'}(A_j^{(i)}) = \underbrace{\mathrm{parent} \circ \cdots \circ \mathrm{parent}}_{i-i'}(A_j^{(i)}),$$
which is the grid in $\pi_{i'}$ that contains $A_j^{(i)}$.
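To make the partition rule concrete, the following Python sketch implements the max-edge rule of Algorithm 1 under simplifying assumptions of our own (it is not the released implementation): cells are stored as (lower, upper) corner arrays, the score in (2) is taken to be the weighted Gini impurity of the two children on the public sample, and splitting always proceeds to the full depth.

import numpy as np

def gini(y):
    # Gini impurity of a binary label vector; empty cells score 0 by convention.
    if len(y) == 0:
        return 0.0
    p = float(np.mean(y))
    return 2.0 * p * (1.0 - p)

def max_edge_split(lower, upper, XQ, yQ):
    # Candidate dimensions are the longest edges; among them, pick the midpoint split
    # with the smallest weighted child impurity (ties resolved to the smallest index).
    lengths = upper - lower
    candidates = np.flatnonzero(lengths == lengths.max())
    best_dim, best_score = int(candidates[0]), np.inf
    for ell in candidates:
        mid = (lower[ell] + upper[ell]) / 2.0
        left = XQ[:, ell] < mid
        score = (left.sum() * gini(yQ[left]) + (~left).sum() * gini(yQ[~left])) / max(len(yQ), 1)
        if score < best_score:
            best_dim, best_score = int(ell), score
    return best_dim

def build_partition(XQ, yQ, depth):
    # Each cell is (lower corner, upper corner, indices of public samples inside it).
    d = XQ.shape[1]
    cells = [(np.zeros(d), np.ones(d), np.arange(len(XQ)))]
    for _ in range(depth):
        new_cells = []
        for lower, upper, idx in cells:
            ell = max_edge_split(lower, upper, XQ[idx], yQ[idx])
            mid = (lower[ell] + upper[ell]) / 2.0
            left_mask = XQ[idx, ell] < mid
            up_left, low_right = upper.copy(), lower.copy()
            up_left[ell], low_right[ell] = mid, mid
            new_cells.append((lower, up_left, idx[left_mask]))
            new_cells.append((low_right, upper, idx[~left_mask]))
        cells = new_cells
    return [(lo, up) for lo, up, _ in cells]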
3.3 Privacy Mechanism for Any Partition
This section focuses on the privacy mechanism based on general tree partitions. We first
introduce the necessary definitions and then present our private estimation based on the
Laplace mechanism in (8). To avoid confusion, we refer the readers to the summary of
the defined estimators in Appendix A.1.
Figure 1: Partition created by the max-edge rule. The four panels show $\pi_0$, $\pi_1$, $\pi_2$, and $\pi_3$. The areas filled with orange represent the corresponding $A_j^{(i)}$ and the blue lines represent the partition boundaries after each level of partitioning. Red boxes contain an ancestor with depth 2, which is also a parent. Blue boxes contain an ancestor with depth 1.
For index set $\mathcal{I}_p$, let $\pi = \pi_p = \{A_j^{(p)}\}_{j \in \mathcal{I}_p}$ be the tree partition created by Algorithm 1.
A population classification tree estimation is defined as
$$\eta_P^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\int_{A_j^{(p)}} \eta_P(x')\, dP_X(x')}{\int_{A_j^{(p)}} dP_X(x')}. \qquad (3)$$
The label is inferred using the population tree classifier, which is defined as
$f_P^{\pi}(x) = 1\big(\eta_P^{\pi}(x) > 1/2\big)$. Here, we let $0/0 = 0$ by definition.
To get an empirical estimator given the data set $D_P = \{(X_1^P, Y_1^P), \ldots, (X_{n_P}^P, Y_{n_P}^P)\}$, we
estimate the numerator and the denominator of (3) separately. To estimate the denominator,
each sample $(X_i^P, Y_i^P)$ contributes a one-hot vector $U_i^P \in \{0,1\}^{|\mathcal{I}_p|}$ whose $j$-th
element is $U_i^{Pj} = 1\{X_i^P \in A_j^{(p)}\}$. Then an estimation of $\int_{A_j^{(p)}} dP_X(x')$ is
$\frac{1}{n_P}\sum_{i=1}^{n_P} U_i^{Pj}$, which is the number of samples in $A_j^{(p)}$ divided by $n_P$. Analogously, let
$V_i^P = Y_i^P \cdot U_i^P \in \{0,1\}^{|\mathcal{I}_p|}$. An estimation of $\int_{A_j^{(p)}} \eta_P(x')\, dP_X(x')$ is
$\frac{1}{n_P}\sum_{i=1}^{n_P} V_i^{Pj}$, which is the sum of the labels in $A_j^{(p)}$ divided by $n_P$.
Combining the pieces, a classification tree estimation is defined as
$$\hat{\eta}_P^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\sum_{i=1}^{n_P} V_i^{Pj}}{\sum_{i=1}^{n_P} U_i^{Pj}}. \qquad (4)$$
The corresponding classifier is defined as $\hat{f}_P^{\pi}(x) = 1\big(\hat{\eta}_P^{\pi}(x) > 1/2\big)$. In other words, $\hat{\eta}_P^{\pi}(x)$
estimates $\eta_P(x)$ by the average of the responses in the cell. In the non-private setting, each
data holder prepares $U_i^P$ and $V_i^P$ according to the partition $\pi$ and sends them to the curator.
Then the curator aggregates the transmissions following (4).
To protect the privacy of each data holder, we propose to estimate the numerator and denominator
of the population regression tree using a privatized method. Specifically, $U_i^P$ and $V_i^P$
are combined with the Laplace mechanism (Dwork et al., 2006) before being sent to the
curator. For i.i.d. standard Laplace random variables $\zeta_i^j$ and $\xi_i^j$, let
$$\tilde{U}_i^P := U_i^P + \frac{4}{\varepsilon}\zeta_i = U_i^P + \left(\frac{4}{\varepsilon}\zeta_i^1, \cdots, \frac{4}{\varepsilon}\zeta_i^{|\mathcal{I}_p|}\right). \qquad (5)$$
Then a privatized estimation of $\int_{A_j^{(p)}} dP_X(x)$ is $\frac{1}{n_P}\sum_{i=1}^{n_P} \tilde{U}_i^{Pj}$. Analogously, let
$$\tilde{V}_i^P := V_i^P + \frac{4}{\varepsilon}\xi_i = V_i^P + \left(\frac{4}{\varepsilon}\xi_i^1, \cdots, \frac{4}{\varepsilon}\xi_i^{|\mathcal{I}_p|}\right). \qquad (6)$$
An estimation of $\int_{A_j^{(p)}} \eta_P(x)\, dP_X(x)$ is $\frac{1}{n_P}\sum_{i=1}^{n_P} \tilde{V}_i^{Pj}$. Then, using the privatized information
$(\tilde{U}_i^P, \tilde{V}_i^P)$, $i = 1, \cdots, n_P$, we can estimate the regression function as
$$\tilde{\eta}_P^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\sum_{i=1}^{n_P} \tilde{V}_i^{Pj}}{\sum_{i=1}^{n_P} \tilde{U}_i^{Pj}}. \qquad (7)$$
The procedure is also derived by Berrett and Butucea (2019). As an alternative, one can
use randomized response (Warner, 1965) since both U and V are binary vectors. The following
proposition shows that the estimation procedures satisfy LDP.
Proposition 2 Let $\pi = \{A_j\}_{j \in \mathcal{I}}$ be any partition of $\mathcal{X}$ with $\cup_{j \in \mathcal{I}} A_j = \mathcal{X}$ and $A_i \cap A_j = \emptyset$,
$i \ne j$. Then the privacy mechanism defined in (5) and (6) is non-interactive $\varepsilon$-LDP.
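To make the mechanism in (5) and (6) concrete, the following sketch (ours, not the released code) shows what a single data holder would transmit, assuming the curator has broadcast the partition as a list of (lower, upper) boxes; the function name encode_and_privatize is illustrative.

import numpy as np

def encode_and_privatize(x, y, cells, eps, rng):
    # One-hot cell membership U and label-weighted vector V = y * U, cf. Section 3.3.
    U = np.array([float(np.all(x >= lo) and np.all(x < up)) for lo, up in cells])
    V = y * U
    # Laplace mechanism: independent noise of scale 4/eps per coordinate, matching (5)-(6).
    U_tilde = U + rng.laplace(scale=4.0 / eps, size=len(cells))
    V_tilde = V + rng.laplace(scale=4.0 / eps, size=len(cells))
    return U_tilde, V_tilde

# Example usage: rng = np.random.default_rng(0); encode_and_privatize(x, 1, cells, eps=1.0, rng=rng)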
In the case where one also has access to the Q-data in addition to the P-data, the
Q data can be used to help the classification task under the target distribution P and
should be taken into consideration. The utilization of P data must satisfy the local DP
constraint, while the Q data can be used arbitrarily. We combine the existing private
estimation (7) and the non-private estimation (4) by taking a weighted average of both
estimators. Denote the encoded information from $D_P$ and $D_Q$ as $(U_i^P, V_i^P)$ and
$(U_i^Q, V_i^Q)$, respectively. We use the same partition for the P estimator and the Q estimator; namely,
(4) and (7) are constructed via an identical $\pi$. When taking the average, data from the
different distributions should receive different weights since the signal strengths differ
between P and Q. To make the classification at $x \in \mathcal{X}$, the Locally differentially Private
Classification Tree estimation (LPCT) is defined as follows:
$$\tilde{\eta}_{P,Q}^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\sum_{i=1}^{n_P} \tilde{V}_i^{Pj} + \lambda \sum_{i=1}^{n_Q} V_i^{Qj}}{\sum_{i=1}^{n_P} \tilde{U}_i^{Pj} + \lambda \sum_{i=1}^{n_Q} U_i^{Qj}}. \qquad (8)$$
Here, we let $\lambda \ge 0$. The locally private tree classifier is $\tilde{f}_{P,Q}^{\pi}(x) = 1\big(\tilde{\eta}_{P,Q}^{\pi}(x) > 1/2\big)$. In each
leaf node $A_j^{(p)}$, the classifier is assigned a label based on the weighted average of estimations
from P and Q data. The parameter $\lambda$ serves to balance the relative contributions of both P
and Q. A higher value of $\lambda$ indicates that the final predictions are predominantly influenced
by the public data, and vice versa.
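On the curator side, computing (8) then amounts to summing the received noisy vectors and the public-data encodings cell by cell. The sketch below is a minimal illustration under our own naming conventions: U_tilde_P and V_tilde_P are arrays of shape (n_P, number of cells) collected from the data holders, and U_Q, V_Q are the analogous non-private encodings of the public data.

import numpy as np

def lpct_predict(cell_of_x, U_tilde_P, V_tilde_P, U_Q, V_Q, lam):
    # cell_of_x: for each query point, the index of the leaf cell it falls into.
    num = V_tilde_P.sum(axis=0) + lam * V_Q.sum(axis=0)   # per-cell numerator of (8)
    den = U_tilde_P.sum(axis=0) + lam * U_Q.sum(axis=0)   # per-cell denominator of (8)
    with np.errstate(divide="ignore", invalid="ignore"):
        eta = np.where(den != 0, num / den, 0.0)          # 0/0 := 0 convention
    return (eta[cell_of_x] > 0.5).astype(int)             # plug-in LPCT classifier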
The estimator defined in (8) easily adapts to solve the non-private transfer learning
problem in Cai and Wei (2021) if we define
$$\hat{\eta}_{P,Q}^{\pi}(x) = \sum_{j \in \mathcal{I}_p} 1\{x \in A_j^{(p)}\} \cdot \frac{\sum_{i=1}^{n_P} V_i^{Pj} + \lambda \sum_{i=1}^{n_Q} V_i^{Qj}}{\sum_{i=1}^{n_P} U_i^{Pj} + \lambda \sum_{i=1}^{n_Q} U_i^{Qj}}. \qquad (9)$$
Similar to the nearest neighbor estimator in Cai and Wei (2021), (9) estimates by taking
weighted averages of labels from nearby samples. Our theoretical results in Section 4.2
imply that (9) is also rate optimal for non-private transfer learning. A key
advantage of (9) is that, given the partition, it stores only the prediction
values at each node, which makes it convenient to inject noise. Conversely, the nearest
neighbor estimator (similarly, the kernel estimator) requires a query to the entire training
data set for each single test sample, which prohibits any modification towards a private
estimator. See Kroll (2021) for an example of such limitations.
4. Theoretical Results
In this section, we present the obtained theoretical results. We present the matching lower
and upper bounds on the excess risk in Sections 4.1 and 4.2. Finally, we demonstrate that LPCT
is free from the unknown range parameter in Section 4.3. All technical proofs can be found
in Appendix B.
4.1 Mini-max Convergence Rate
We first give the necessary assumptions on the distributions P and Q. Then we provide the
mini-max rate under the assumptions.
Assumption 1 Let $\alpha \in (0, 1]$. Assume the regression function $\eta_P: \mathcal{X} \to \mathbb{R}$ is $\alpha$-Hölder
continuous, i.e. there exists a constant $c_L > 0$ such that for all $x_1, x_2 \in \mathcal{X}$,
$|\eta_P(x_1) - \eta_P(x_2)| \le c_L \|x_1 - x_2\|^{\alpha}$. Also, assume that the marginal density functions of P and Q are
upper and lower bounded, i.e. $\underline{c} \le dP_X(x) \le \overline{c}$ for some $\overline{c}, \underline{c} > 0$.
Assumption 2 (Margin Assumption) Let $\beta \ge 0$ and $C_\beta > 0$. Assume
$$P_X\left(\left|\eta_P(X) - \frac{1}{2}\right| < t\right) \le C_\beta\, t^{\beta}.$$
Assumption 3 (Relative Signal Exponent) Let the relative signal exponent $\gamma \in (0, \infty)$
and $C_\gamma \in (0, \infty)$. Assume that
$$\left(\eta_Q(x) - \frac{1}{2}\right)\left(\eta_P(x) - \frac{1}{2}\right) \ge 0, \quad \text{and} \quad \left|\eta_Q(x) - \frac{1}{2}\right| \ge C_\gamma \left|\eta_P(x) - \frac{1}{2}\right|^{\gamma}.$$
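For a concrete instance (our illustration, not from the paper), the link $\phi(t) = \frac{1}{2} + \mathrm{sign}\big(t - \frac{1}{2}\big)\,\big|2\big(t - \frac{1}{2}\big)\big|^{\gamma}/2$ is strictly increasing with $\phi(1/2) = 1/2$, and $\eta_Q = \phi(\eta_P)$ satisfies Assumption 3 with $C_\gamma = 2^{\gamma-1}$.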
Assumptions 1 and 2 are standard conditions that have been widely used for non-parametric
classification problems (Audibert and Tsybakov, 2007; Samworth, 2012; Chaudhuri
and Dasgupta, 2014). Assumption 3 depicts the similarity between the conditional
distributions of the public data and the private data. When $\gamma$ is small, the signal strength of Q
is strong compared to P. In the extreme instance of $\gamma = 0$, $\eta_Q(x)$ consistently remains
distant from 1/2 by a constant. In contrast, when $\gamma$ is large, there is not much information
contained in Q. The special case of $\gamma = \infty$ allows $\eta_Q(x)$ to always be 1/2 and is thus non-informative.
We present the following theorem that specifies the mini-max lower bound
with LDP constraints on P data under the above assumptions. The proof is based on first
constructing two function classes and then applying the information inequalities for local
privacy (Duchi et al., 2018) and Assouad's lemma (Tsybakov, 2009).
Theorem 3 Denote the function class of $(P, Q)$ satisfying Assumptions 1, 2, and 3 by $\mathcal{F}$.
Then for any estimator $\widehat{f}$ that is sequentially-interactive $\varepsilon$-LDP, there holds
$$\inf_{\widehat{f}} \sup_{\mathcal{F}} \mathbb{E}_{X,Y}\left[R_P(\widehat{f}) - R_P^*\right] \gtrsim \left( n_P (e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \right)^{-\frac{\alpha(\beta+1)}{2\alpha+2d}}. \qquad (10)$$
If $n_Q = 0$, our results reduce to the special case of LDP non-parametric classification
with only private data. In this case, the mini-max lower bound is of the order
$\big(n_P (e^{\varepsilon}-1)^2\big)^{-\frac{\alpha(\beta+1)}{2\alpha+2d}}$. Previously, Berrett and Butucea (2019) provided an estimator that
reaches the mini-max optimal convergence rate for small $\varepsilon$. If $n_P = 0$, the result recovers the
learning rate $n_Q^{-\frac{\alpha(\beta+1)}{2\gamma\alpha+d}}$ established for transfer learning under posterior drift (Cai and Wei,
2021). In this case, we have only source data from Q and want to generalize the classifier
to the target distribution P.
The mini-max convergence rate (10) is the same as if one fits an estimator using only
P data with sample size $n_P + (e^{\varepsilon}-1)^2 \cdot n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$. The contribution of Q data is substantial
when $n_P \lesssim (e^{\varepsilon}-1)^2 \cdot n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$. Otherwise, the convergence rate is not significantly better
than using P data only. The quantity $(e^{\varepsilon}-1)^2 \cdot n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$, which is referred to as the effective
sample size, represents the theoretical gain of using Q data with sample size $n_Q$ under given
$\varepsilon$, $\alpha$, $d$, and $\gamma$. Notably, the smaller $\gamma$ is, the more effective the Q data becomes, and the
greater gain is acquired by using Q data. Regarding different values of $\gamma$, our work covers
the setting in previous work when a small amount of effective public samples is required
(Bie et al., 2022; Ben-David et al., 2023), and when a large amount of noisy/irrelevant public
samples is required (Ganesh et al., 2023; Yu et al., 2023). Also, the public data is more
effective for small $\varepsilon$, i.e. under stronger privacy constraints. Intuitively, in the presence
of stringent privacy constraints, the utilization of Q data is as effective as incorporating
a substantial amount of P data. Furthermore, even for a constant $\varepsilon$, our effective sample
size significantly exceeds the $n_Q^{\frac{2\alpha+d}{2\gamma\alpha+d}}$ of the non-private scenario (Cai and Wei, 2021). This
suggests that, in comparison to transfer learning in the non-private setting, public data
assumes greater value in the private setting.
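As a hedged numerical illustration of the effective sample size (the concrete numbers are our example, not taken from the paper): with $\alpha = 1$, $d = 2$, $\gamma = 0.5$, $\varepsilon = 1$, and $n_Q = 1000$, the exponent is $\frac{2\alpha+2d}{2\gamma\alpha+d} = 2$, so $(e^{\varepsilon}-1)^2 \cdot n_Q^2 \approx 2.95 \times 10^6$, i.e. a thousand such public samples contribute as much as roughly three million private samples at this privacy level.

import math

def effective_sample_size(n_Q, eps, alpha, gamma, d):
    # (e^eps - 1)^2 * n_Q^{(2*alpha + 2*d) / (2*gamma*alpha + d)}, cf. the discussion of (10).
    expo = (2 * alpha + 2 * d) / (2 * gamma * alpha + d)
    return (math.exp(eps) - 1) ** 2 * n_Q ** expo

print(effective_sample_size(n_Q=1000, eps=1.0, alpha=1.0, gamma=0.5, d=2))  # roughly 2.95e6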
4.2 Convergence Rate of LPCT
In the following, we show that the convergence rate of LPCT reaches the lower bound with
proper parameters. The proof is based on a carefully designed decomposition in Appendix
B.3. Define the quantity
$$\delta := \left( \frac{n_P \varepsilon^2}{\log(n_P + n_Q)} + \left(\frac{n_Q}{\log(n_P + n_Q)}\right)^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \right)^{-\frac{d}{2\alpha+2d}}. \qquad (11)$$
Theorem 4 Let $\pi$ be constructed through the max-edge partition in Section 3.2 with depth
$p$. Let $\tilde{f}_{P,Q}^{\pi}$ be the locally private two-sample tree classifier. Suppose Assumptions 1, 2, and
3 hold. Then, for $n_P^{-\frac{\alpha}{2\alpha+2d}} \lesssim \varepsilon \lesssim n_P^{\frac{d}{4\alpha+2d}}$, if $\lambda \asymp \delta^{\frac{(\gamma\alpha - \alpha)\wedge(2\gamma\alpha - d)}{d}}$ and $2^p \asymp \delta^{-1}$, there holds
$$R_P\big(\tilde{f}_{P,Q}^{\pi}\big) - R_P^* \lesssim \delta^{\frac{\alpha(\beta+1)}{d}} = \left( \frac{n_P \varepsilon^2}{\log(n_P + n_Q)} + \left(\frac{n_Q}{\log(n_P + n_Q)}\right)^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \right)^{-\frac{\alpha(\beta+1)}{2\alpha+2d}} \qquad (12)$$
with probability $1 - 3/(n_P + n_Q)^2$ with respect to $P^{n_P} \otimes Q^{n_Q} \otimes R$, where $R$ is the joint
distribution of the privacy mechanisms in (5) and (6).
Compared to Theorem 3, LPCT with the best parameter choice reaches the mini-max
lower bound up to a logarithmic factor. Both parameters $\lambda$ and $p$ should be assigned
a value of the correct order to achieve the optimal trade-off. The depth $p$ increases when
either $n_P$ or $n_Q$ grows. This is intuitively correct since the depth of a decision tree should
increase with a larger number of samples. The strength ratio $\lambda$, on the other hand, is closely
related to the value of $\gamma$. As explained earlier, a small $\gamma$ guarantees a stronger signal in
$\eta_Q$. Thus, the value of $\lambda$ is decreasing in $\gamma$: when the signal strength of $\eta_Q$ is
large, we rely more on the estimation from Q data by assigning it a larger ratio.
We note that when $\varepsilon$ is large, there is a gap between $(e^{\varepsilon}-1)^2$ in the lower bound and
$\varepsilon^2$ in the upper bound. The problem can be mitigated by employing the randomized response
mechanism (Warner, 1965) instead of the Laplace mechanism, as mentioned in Section
3.3. The theoretical analysis generalizes analogously. Here, we do not address this issue
as we want to focus on public data. The theorem restricts $n_P^{-\frac{\alpha}{2\alpha+2d}} \lesssim \varepsilon \lesssim n_P^{\frac{d}{4\alpha+2d}}$. When
$\varepsilon \lesssim n_P^{-\frac{\alpha}{2\alpha+2d}}$, the estimator is drastically perturbed by the random noise and the convergence
rate is no longer optimal. As $\varepsilon$ gets larger, the upper bound of the excess risk decreases until
$\varepsilon \asymp n_P^{\frac{d}{4\alpha+2d}}$. When $\varepsilon \gtrsim n_P^{\frac{d}{2\alpha+2d}}$, the random noise is negligible and the convergence rate
remains
$$\left( \frac{n_P}{\log(n_P + n_Q)} + \left(\frac{n_Q}{\log(n_P + n_Q)}\right)^{\frac{2\alpha+d}{2\gamma\alpha+d}} \right)^{-\frac{\alpha(\beta+1)}{2\alpha+d}},$$
which recovers the rate of non-private learning as established in Cai and Wei (2021).
which recovers the rate of non-private learning as established in Cai and Wei (2021).
We discuss the order of ratio λ. When γ
d
α
1, we have λ δ
(γ1)α
d
. This rate
is exactly the same as in the conventional transfer learning case (Cai and Wei, 2021) and
is enough to achieve the optimal trade-off. Yet, under privacy constraints, this choice
of λ will fail when γ <
d
α
1. Roughly speaking, if we let λ δ
(γ1)α
d
in this case,
there is a probability such that the noise
4
ε
P
n
P
i=1
ζ
j
i
in (8) will dominate the non-private
density estimation
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
. When the domination happens, the point-wise
estimation is decided by random noises, and thus the overall excess risk fails to converge.
To alleviate this issue, we assign a larger value λ δ
2γαd
d
> δ
(γ1)α
d
to ensure
4
ε
P
n
P
i=1
ζ
j
i
.
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
. As an interpretation, in the private setting, we place greater
reliance on the estimation by public data compared to the non-private setting.
Last, we discuss the average computational complexity of LPCT. The training stage consists of two parts. Similar to the
standard decision tree algorithm (Breiman, 1984), the partition procedure takes $O(p\, n_Q\, d)$
time. The computation of (8) takes $O(p(n_P + n_Q))$ time. From the proof of Theorem 4,
we know that $2^p \asymp \delta^{-1}$, where $\delta^{-1} \lesssim n_P + n_Q$. Thus the training stage complexity is
around $O((n_P + n_Q d)\log(n_P + n_Q))$, which is linear. Since each prediction of the decision
tree takes $O(p)$ time, the test time for each test instance is no more than $O(\log(n_P + n_Q))$.
As for storage complexity, since LPCT only requires storing the tree structure and
the prediction value at each node, the space complexity of LPCT is $O(\delta^{-1})$, which is also
sub-linear. In short, LPCT is an efficient method in time and memory.
4.3 Eliminating the Range Parameter
We discuss how public data can eliminate the range parameter contained in the convergence
rate. When $\mathcal{X}$ is unknown, one must select a predefined range $[-r, r]^d$ for creating partitions.
Only then do the data holders have a reference for encoding their information. In this case,
the convergence rate of Berrett and Butucea (2019) becomes
$\big(r^{2d}/(n_P\varepsilon^2)\big)^{\frac{(1+\beta)\alpha}{2\alpha+2d}} \vee \big(1 - \int_{[-r,r]^d} dP(x)\big)$, which is slower than in the known $\mathcal{X} = [0,1]^d$ case for any $r$. A small $r$ means that
a large part of the informative $\mathcal{X}$ will be ignored (the second term is large), whereas a large
$r$ may create too many redundant cells (the first term is large). See Györfi and Kroll (2022)
for a detailed derivation. Without much prior knowledge, this parameter can be hard to tune.
The removal of the range parameter is first discussed in Bie et al. (2022), where $d+1$ public
data points are required. Our result shows that we do not need to tune this range parameter with
the help of public data.
Theorem 5 Suppose the domain $\mathcal{X} = \times_{j=1}^d [a_j, b_j]$ is unknown. Let $\hat{a}_j = \min_i X_i^j$ and
$\hat{b}_j = \max_i X_i^j$. Define the min-max scaling map $M: \mathbb{R}^d \to \mathbb{R}^d$ where $M(x)^j = (x^j - \hat{a}_j)/(\hat{b}_j - \hat{a}_j)$.
The min-max scaled $D_P$ and $D_Q$ are written as $M(D_P)$ and $M(D_Q)$. Then for $\tilde{f}_{P,Q}^{\pi}$
trained within $[0,1]^d$ on $M(D_P)$ and $M(D_Q)$ as in Theorem 4, we have
$$R_P\Big(\big(\tilde{f}_{P,Q}^{\pi} \cdot 1_{[0,1]^d}\big) \circ M\Big) - R_P^* \lesssim \delta^{\frac{\alpha(\beta+1)}{d}} + \frac{\log n_Q}{n_Q}$$
with probability $1 - 3/(n_P + n_Q)^2 - d/n_Q^2$ with respect to $P^{n_P} \otimes Q^{n_Q} \otimes R$.
Compared to Theorem 4, the upper bound of Theorem 5 increases by $\log n_Q / n_Q$, which
is a minor term in most cases. Moreover, Theorem 5 fails with additional probability $d/n_Q^2$
since we are estimating $[a_j, b_j]$ with $[\hat{a}_j, \hat{b}_j]$. Both changes indicate that sufficient public
data is required to resolve the unknown range issue.
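A minimal sketch of the min-max scaling step in Theorem 5, under our reading that the range is estimated from the covariates of the public data (variable and function names are ours):

import numpy as np

def fit_min_max(XQ):
    # Per-coordinate range estimated from the public covariates.
    return XQ.min(axis=0), XQ.max(axis=0)

def apply_min_max(X, a_hat, b_hat):
    # Map each coordinate to [0, 1]; points outside the estimated range land outside
    # [0,1]^d and are handled by the indicator 1_{[0,1]^d} in Theorem 5.
    return (X - a_hat) / (b_hat - a_hat)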
5. Data-driven Tree Structure Pruning
In this section, we develop a data-driven approach for parameter tuning and investigate its
theoretical properties. In Section 5.1, we derive a selection scheme in the spirit of Lepski’s
method (Lepskii, 1992), a classical method for adaptively choosing the hyperparameters.
In Section 5.2, we make slight modifications to the scheme and illustrate the effectiveness
of the amended approach.
Figure 2: Illustration of the pruning process: (a) initial estimation, (b) during pruning, (c) pruned estimation. The yellow area filled at node $A_j^{(i)}$ means that for $x \in A_j^{(i)}$, we use the average of labels in the yellow area instead of $A_j^{(i)}$ itself. The prediction in $A_1^{(3)}$ is pruned to its depth-1 ancestor; the prediction in $A_3^{(3)}$ is pruned to its depth-2 ancestor; the prediction in $A_2^{(3)}$ is pruned to its depth-3 ancestor, i.e. not pruned.
5.1 A Naive Strategy
We first develop a private analog of Lepski’s method (Lepskii, 1992) for LPCT. There are
two parameters: the depth p and the relative ratio λ, both of which have an optimal value
depending on the unknown α and γ. We do not set the parameters to specific values.
Instead, we first query the privatized information with a sufficient tree depth $p_0$. Then, we
introduce a greedy procedure to locally prune the tree to an appropriate depth and select
$\lambda$ based on local information.
We provide a preliminary result that will guide the derivation of the pruning procedure.
For the max-edge partition $\pi_{p_0}$ with depth $p_0$ and node $A_j^{(p_0)}$, the depth-$k$ ancestor private tree
estimation is defined as
$$\tilde{\eta}_{P,Q}^{\pi_k}(x) = \sum_{j \in \mathcal{I}_{p_0}} 1\{x \in A_j^{(p_0)}\} \cdot \frac{\sum_{j':\, \mathrm{ancestor}(A_{j'}^{(p_0)},k)=\mathrm{ancestor}(A_j^{(p_0)},k)} \left( \sum_{i=1}^{n_P} \tilde{V}_i^{Pj'} + \lambda \sum_{i=1}^{n_Q} V_i^{Qj'} \right)}{\sum_{j':\, \mathrm{ancestor}(A_{j'}^{(p_0)},k)=\mathrm{ancestor}(A_j^{(p_0)},k)} \left( \sum_{i=1}^{n_P} \tilde{U}_i^{Pj'} + \lambda \sum_{i=1}^{n_Q} U_i^{Qj'} \right)} \qquad (13)$$
and consequently the classifier is defined as $\tilde{f}_{P,Q}^{\pi_k}(x) = 1\big(\tilde{\eta}_{P,Q}^{\pi_k}(x) > 1/2\big)$. See Figure 2(a)
for an illustration of estimation at ancestor nodes.
Proposition 6 Let $\pi$ be constructed through the max-edge partition in Section 3.2 with
depth $p_0$. Let $\tilde{\eta}_{P,Q}^{\pi_k}$ be the tree estimator with depth-$k$ ancestor node in (13). Let $\hat{\eta}_{P,Q}^{\pi_k}$ be the
non-private tree estimator with partition $\pi_k$ in (9). Define the quantity
$$r_k^j := \sqrt{\frac{4C_1\left(2^{p_0-k+3}\varepsilon^{-2}n_P \vee \sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}\right) + C_1\lambda^2\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}}{\left(\sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'} + \lambda\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}\right)^2}} \qquad (14)$$
where the summation over $j'$ is with respect to all descendant nodes of the depth-$k$ ancestor,
i.e. $\mathrm{ancestor}(A_{j'}^{(p_0)}, k) = \mathrm{ancestor}(A_j^{(p_0)}, k)$. The constant $C_1$ is specified in the proof. Then,
for all $\lambda$ and $p_0$, if $x \in A_j^{(p_0)}$, there holds
$$\left| \tilde{\eta}_{P,Q}^{\pi_k}(x) - \mathbb{E}_{Y|X}\left[\hat{\eta}_{P,Q}^{\pi_k}(x)\right] \right| \le r_k^j$$
with probability $1 - 3/(n_P + n_Q)^2$ with respect to $P^{n_P}_{Y^P|X^P} \otimes Q^{n_Q}_{Y^Q|X^Q} \otimes R$, where $R$ is the
joint distribution of the privacy mechanisms in (5) and (6). $\mathbb{E}_{Y|X}$ is taken with respect to
$P^{n_P}_{Y|X} \otimes Q^{n_Q}_{Y|X}$.
The above proposition indicates that if $\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2 \ge r_k^j$, then there holds
$$\mathbb{E}_{Y|X}\left[\hat{\eta}_{P,Q}^{\pi_k}(x)\right] - \frac{1}{2} = \mathbb{E}_{Y|X}\left[\hat{\eta}_{P,Q}^{\pi_k}(x)\right] - \tilde{\eta}_{P,Q}^{\pi_k}(x) + \tilde{\eta}_{P,Q}^{\pi_k}(x) - \frac{1}{2} \ge 0$$
with probability $1 - 3/(n_P + n_Q)^2$. A similar conclusion holds when $\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2 < -r_k^j$. In
other words, when we have
$$\frac{\left|\tilde{\eta}_{P,Q}^{\pi_k}(x) - \frac{1}{2}\right|}{r_k^j} \ge 1, \qquad (15)$$
the population version of the ancestor estimation has the same sign as the sample version
estimation. As long as (15) holds, it is enough to use the sample version estimation, and
our goal is to find the best $k$ for $\mathbb{E}_{Y|X}\big[\hat{\eta}_{P,Q}^{\pi_k}(x)\big]$ to approximate $\eta_P(x)$. Under continuity
Assumption 1, the approximation error is bounded by the radius of the largest cell (see
Lemma 15). Consequently, one can show that $R_P\big(1\big(\mathbb{E}_{Y|X}[\hat{\eta}_{P,Q}^{\pi_k}(x)] > 1/2\big)\big)$ is monotonically
decreasing with respect to $k$. Thus, our goal is to find the largest $k$ such that (15) holds.
On the other hand, (15) depends on $\lambda$. We select the best possible $\lambda$ in order to let
(15) hold for each $j$ and $k$, i.e.
$$\lambda_k^j = \operatorname*{arg\,max}_{\lambda} \frac{\left|\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2\right|}{r_k^j} \qquad (16)$$
for $x \in A_j^{(p_0)}$. The optimization problem in (16) has a closed-form solution that can be
computed efficiently and explicitly. The derivation of the closed-form solution is postponed
to Section A.2.
Based on the above analysis, we can perform the following pruning procedure. We first
query with a sufficient depth, i.e. we select $p_0$ large enough, create a partition on $D_Q$, and
receive the information from data holders. Then we prune back by the following procedure:

• For $j \in \mathcal{I}_{p_0}$, we do the following operation: for $k = p_0, \cdots, 2$ and $r_k^j$ defined in (14),
we assign $\lambda_k^j$ as in (16). Then if (15) holds, let $k_j = k$; else, $k \leftarrow k - 1$.

• If $k_j$ is not assigned, we select $k_j = \arg\max_k |\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2| / r_k^j$.

• Assign $\tilde{\eta}_{P,Q}^{\mathrm{prune}}(x) = \tilde{\eta}_{P,Q}^{\pi_{k_j}}(x)$ for all $x \in A_j^{(p_0)}$.

This process can be done efficiently. The exact process is illustrated in Figure 2. We first
calculate the estimations as well as the relative signal strength at all nodes. Then we trace back
from each leaf node to the root and find the node that maximizes the statistic (15). The
prediction value at the leaf node is assigned as the prediction at the ancestor node with the
maximum statistic value. In total, the pruning causes additional time complexity of at most
$O(2^{p_0})$, which is negligible compared to the original complexity as long as $2^{p_0} \lesssim n_P + d\, n_Q$.
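A schematic version of this prune-back loop, ignoring the P/Q switching introduced in Section 5.2 and assuming the ancestor statistics of (13), (14), and (16) have already been computed and stored (all names are ours):

def prune_depths(eta, r, p0):
    # eta[j][k], r[j][k]: depth-k ancestor estimate and radius r_k^j for leaf j.
    chosen = {}
    for j in eta:
        k_j = None
        for k in range(p0, 1, -1):                      # k = p0, ..., 2
            if abs(eta[j][k] - 0.5) / r[j][k] >= 1:     # condition (15) holds: stop here
                k_j = k
                break
        if k_j is None:                                 # fallback: most confident ancestor
            k_j = max(range(2, p0 + 1), key=lambda k: abs(eta[j][k] - 0.5) / r[j][k])
        chosen[j] = k_j                                 # predict with eta[j][k_j] on this leaf
    return chosen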
5.2 Pruned LDP Classification Tree
In the non-private case, i.e. when $\varepsilon = \infty$, $\tilde{\eta}_{P,Q}^{\mathrm{prune}}$ is essentially Lepski's method (Lepskii,
1992; Cai and Wei, 2021). However, in the presence of privacy concerns, a tricky term
$2^{p_0-k+3}\varepsilon^{-2}n_P$ arises in (14) due to the Laplace noises. Specifically, if
$2^{p_0-k+3}\varepsilon^{-2}n_P \ge \sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}$, the first term in the numerator of (14) contains no information other than
the level of noise. As a result, the theoretical analysis of Lepski's method fails to adapt.
To deal with the issue, we simultaneously perform two pruning procedures using information
from solely P or Q data. We first query with a large depth $p_0 = \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2 + n_Q^{\frac{2+2d}{d}})\rfloor$
and half of the budget $\varepsilon/2$. When $2^{p_0-k+3}\varepsilon^{-2}n_P \ge \sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}$, we compare the values
of $|\tilde{\eta}_{P,Q}^{\pi_k}(x) - 1/2| / r_k^j$ with $\lambda = 0$ and $\lambda = \infty$, associated with P and Q data, respectively.
If either of the statistics is larger than 1, we terminate the pruning. Else, we compare the
values of the statistics. If the statistic associated with Q is larger, we return the estimation
$\hat{f}_Q^{\pi_k}$. In this case, we guarantee that the pruned classifier is at least as good as the estimation
using only Q data. If the statistic associated with P is larger, we proceed with the pruning to
$k - 1$. If we reach $k \le \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$ for any $j$, we terminate the whole algorithm and
return (7) with depth $\lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$, which costs another $\varepsilon/2$. In this case, we ensure
Q data is unimportant and adopt a terminating condition that is close to the optimal value
$\lfloor \frac{d}{2\alpha+2d}\log_2(n_P\varepsilon^2)\rfloor$. The detailed algorithm is presented in Algorithm 2. Moreover, we show
that the adjusted pruning procedure results in a near-optimal excess risk.
Theorem 7 Let $\tilde{f}_{P,Q}^{\mathrm{prune}} = 1\big(\tilde{\eta}_{P,Q}^{\mathrm{prune}} > 1/2\big)$ be the pruned LPCT defined in Algorithm 2.
Suppose Assumptions 1, 2, and 3 hold. Let $\delta$ be defined in (11). Then, for $n_P^{-\frac{\alpha}{2\alpha+2d}} \lesssim \varepsilon \lesssim
n_P^{\frac{d}{4+2d}}$, with probability $1 - 4/(n_P + n_Q)^2$ with respect to $P^{n_P} \otimes Q^{n_Q} \otimes R$, we have

(i) if $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \lesssim \left(\frac{n_Q}{\log(n_P+n_Q)}\right)^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$, then $R_P\big(\tilde{f}_{P,Q}^{\mathrm{prune}}\big) - R_P^* \lesssim \delta^{\frac{\alpha(\beta+1)}{d}}$;

(ii) if $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \gtrsim \left(\frac{n_Q}{\log(n_P+n_Q)}\right)^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$, then $R_P\big(\tilde{f}_{P,Q}^{\mathrm{prune}}\big) - R_P^* \lesssim \delta^{\frac{\alpha(\alpha+d)(\beta+1)}{d(1+d)}}$.
The conclusion of Theorem 7 is twofold. On one hand, when Q data dominates P
data, i.e. $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \lesssim \big(\frac{n_Q}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$, the pruned estimator remains rate-optimal. In
this case, the estimator is appropriately pruned to the correct depth. On the other hand,
when P data dominates Q data, the convergence rate suffers degradation. The degradation
factor $(\alpha + d)/(1 + d)$ is mild, and even vanishes when $\alpha = 1$, i.e. $\eta_P$ is Lipschitz
continuous. In this case, there are two possible regions in which our conclusion is derived.
If $\big(\frac{n_Q}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \lesssim \big(\frac{n_P\varepsilon^2}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2+d}}$, the estimator triggers termination and is computed
with depth $\lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$. In the opposite case, i.e. $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \gtrsim \big(\frac{n_Q}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2\gamma\alpha+d}} \gtrsim
\big(\frac{n_P\varepsilon^2}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2+d}}$, pruning fails to identify the optimal depth $\lfloor \frac{d}{2\alpha+2d}\log_2(n_P\varepsilon^2)\rfloor$. Yet, it
also does not trigger termination at $\lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$. In this region, we take a shortcut
to avoid intricate analysis, while the true convergence rate is more complex yet faster than
that stated in Theorem 7. Notably, one drawback of this shortcut is the discontinuity of
the upper bound at the boundary $\frac{n_P\varepsilon^2}{\log(n_P+n_Q)} \asymp \big(\frac{n_Q}{\log(n_P+n_Q)}\big)^{\frac{2\alpha+2d}{2\gamma\alpha+d}}$.
Algorithm 2: Pruned LPCT
Input: Private data $D_P = \{(X_i^P, Y_i^P)\}_{i=1}^{n_P}$, public data $D_Q = \{(X_i^Q, Y_i^Q)\}_{i=1}^{n_Q}$.
Initialization: $p_0 = \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2 + n_Q^{\frac{2+2d}{d}})\rfloor$, flag = 0.
Create $\pi$ with depth $p_0$ on $D_Q$. Query (5) and (6) with $\varepsilon/2$.
for $A_j^{(p_0)}$ in $\pi$ do
    for $k = p_0, \cdots, 1$ do
        if $2^{p_0-k+3}\varepsilon^{-2}n_P \ge \sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}$ then
            $r_k^{Qj} = \sqrt{\frac{4\log(n_P+n_Q)}{\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}}}$, $v_k^{Qj} = |\hat{\eta}_Q^{\pi_k} - 1/2| / r_k^{Qj}$
            $r_k^{Pj} = \sqrt{\frac{2^{p_0-k+5}\varepsilon^{-2}n_P\log(n_P+n_Q)}{\big(\sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'}\big)^2}}$, $v_k^{Pj} = |\tilde{\eta}_P^{\pi_k} - 1/2| / r_k^{Pj}$
            if $v_k^{Qj} \le v_k^{Pj}$ then
                $s_k^j = \tilde{\eta}_P^{\pi_k}$, $r_k^j = r_k^{Pj}$, $v_k^j = v_k^{Pj}$
                if $k \le \lfloor\frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$ then
                    $k_j = k$, flag = 1, break
                end
            else
                $s_k^j = \hat{\eta}_Q^{\pi_k}$, $r_k^j = r_k^{Qj}$, $v_k^j = v_k^{Qj}$
            end
        else
            $r_k^j = \sqrt{\frac{\big(32\sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'} + 4\lambda^2\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}\big)\log(n_P+n_Q)}{\big(\sum_{j'}\sum_{i=1}^{n_P}\tilde{U}_i^{Pj'} + \lambda\sum_{j'}\sum_{i=1}^{n_Q}U_i^{Qj'}\big)^2}}$
            $v_k^j = \max_{\lambda}|\tilde{\eta}_{P,Q}^{\pi_k} - 1/2| / r_k^j$, $s_k^j = \tilde{\eta}_{P,Q}^{\pi_k}$ with $\lambda = \lambda_k^j$ in (16)
        end
        if $v_k^j \ge 1$ then
            $k_j = k$, break
        end
    end
end
if flag then
    $p = \lfloor\frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor$. Create $\pi_p$ on $D_Q$. Query (5) and (6) with $\varepsilon/2$.
    Output: $\tilde{\eta}_{P,Q}^{\mathrm{prune}} = \tilde{\eta}_P^{\pi_p}$.
else
    Output: $\tilde{\eta}_{P,Q}^{\mathrm{prune}} = \sum_j s_{k_j}^j \cdot 1\{A_j^{(p_0)}\}$.
end
Algorithm 2 lacks true adaptiveness, given that its upper bound suffers a significant
loss factor of $(\alpha + d)/(1 + d)$. Conventional adaptive methods typically incur a loss factor
of logarithmic order (Butucea et al., 2020; Cai and Wei, 2021). Following the approach
outlined in Butucea et al. (2020), one potential avenue for improvement involves fitting
multiple trees for $p = \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2)\rfloor, \cdots, \lfloor \frac{d}{2+2d}\log_2(n_P\varepsilon^2 + n_Q^{\frac{2+2d}{d}})\rfloor$ and designing a
test statistic to zero out the suboptimal candidates.
In summary, the proposed scheme offers benefits from two perspectives: (1) It avoids
multiple queries to the data; all data holders send their privatized messages at most twice.
(2) It selects the sensitive parameters $p$ and $\lambda$ in a data-driven manner. In sacrifice, the bound
on the excess risk is reasonably loosened compared to the model with the best parameters, and
one more query to the private data is required.
6. Experiments on Synthetic Data
We validate our theoretical findings by comparing LPCT with several of its variants on
synthetic data in this section. All experiments are conducted on a machine with a 72-core
Intel Xeon 2.60GHz CPU and 128GB memory. Reproducible code is available on GitHub.¹
6.1 Simulation Design
We motivate the design of the simulation settings. While conducting experiments across various
combinations of $\gamma$, $n_P$, and $n_Q$, our primary focus is on two categories of data: (1) both $n_Q$
and $\gamma$ are small, i.e. we have a small amount of high-quality public data; (2) both $n_Q$ and $\gamma$
are large, i.e. we have a large amount of low-quality public data. When $\gamma$ is small but $n_Q$
is large, using solely public data yields sufficient performance, rendering private estimation
less meaningful. Conversely, when $\gamma$ is large but $n_Q$ is small, incorporating public data into
the estimation offers limited assistance. Moreover, we require $n_P$ to be much larger than
$n_Q$ since practical data collection becomes much easier with a privacy guarantee.
We consider the following pair of distributions P and Q. The marginal distributions
$P_X = Q_X$ are both uniform distributions on $\mathcal{X} = [0,1]^2$. The regression function of P is
$$\eta_P(x) = \frac{1}{2} + \mathrm{sign}\left(x^1 - \frac{1}{3}\right)\cdot\mathrm{sign}\left(x^2 - \frac{1}{3}\right)\cdot\frac{(3x^1 - 1)(3x^2 - 1)}{10}\cdot\max\left(0,\ 1 - \frac{2x^2 - 1}{1/10}\right)$$
while the regression function of Q with parameter $\gamma$ is
$$\eta_Q(x) = \frac{1}{2} + \frac{2}{5}\cdot\left(\frac{5}{2}\cdot\left(\eta_P(x) - \frac{1}{2}\right)\right)^{\gamma}.$$
It can be easily verified that the constructed distributions above satisfy Assumptions 1, 2,
and 3. Throughout the following experiments, we take $\gamma \in \{0.5, 5\}$. For better illustration,
their regression functions are plotted in Figure 3.
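A minimal sketch (ours, not the released code) of how samples from such a pair of distributions can be drawn once a regression function is fixed:

import numpy as np

def sample_from(eta, n, d=2, rng=np.random.default_rng(0)):
    # Draw X uniformly on [0,1]^d and Y ~ Bernoulli(eta(X)); eta is eta_P for the
    # private data and eta_Q for the public data in the posterior drift simulation.
    X = rng.uniform(size=(n, d))
    Y = (rng.uniform(size=n) < np.apply_along_axis(eta, 1, X)).astype(int)
    return X, Y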
We conduct experiments with privacy budgets $\varepsilon \in \{0.5, 1, 2, 4, 8, 1000\}$. Here, $\varepsilon = 1000$
can be regarded as the non-private case which marks the performance limitation of our
methods. The other budget values cover all commonly seen magnitudes of privacy budgets.
1. https://github.com/Karlmyh/LPCT.
Figure 3: Contour plots of the simulation distributions: (a) $\eta_P$, (b) $\eta_Q$ with $\gamma = 0.5$, (c) $\eta_Q$ with $\gamma = 5$.
We take classification accuracy as the evaluation metric. In the simulation studies, the
comparison methods and their abbreviations are as follows:

• LPCT is the proposed algorithm with the max-edge partition rule. We select the
parameters among $p \in \{1, \cdots, 8\}$ and $\lambda \in \{0.1, 0.5, 1, 2, 5, 10, 50, 100, 200, 300, 400,
500, 750, 1000, 1250, 1500, 2000\}$. We choose the Gini index as the reduction criterion.

• LPCT-P and LPCT-Q are the estimations associated with P and Q data, respectively.
Their parameter grids are the same as LPCT.

• LPCT-R is the proposed algorithm using the random max-edge partition (Cai et al.,
2023). Specifically, we randomly select an edge among the longest edges to cut at the
midpoint. Its parameter grids are the same as LPCT.

• LPCT-prune is the pruned locally private classification tree in Algorithm 2. We use
constants that yield a tighter bound than (14).
For each model, we report the best result over its parameter grids, with the best result
determined based on the average result of at least 100 replications.
6.2 Simulation Results
In Sections 6.2.1 and 6.2.2, we analyze the influence of underlying parameters and sample
sizes to validate the theoretical findings. In Section 6.2.3, we analyze the parameters of
LPCT to understand their behaviors. We illustrate various ways that public data benefits
the estimation in Section 6.2.4.
6.2.1 Influence of underlying parameters
Privacy-utility trade-off We analyze how the privacy budget ε influences the quality
of prediction in terms of accuracy. We generate 10,000 private samples, with 50 and 1,000
public samples for γ = 0.5 and γ = 5, respectively. The results are presented in Figure 4(a)
and 4(b). In both cases, the performances of all models increase as ε increases. Furthermore,
the performance of LPCT is consistently better than both LPCT-P and LPCT-Q, demon-
strating the effectiveness of our proposed method in combining both data. LPCT-prune
performs reasonably worse than LPCT. In the high privacy regime ($\varepsilon \le 1$), LPCT-P is
almost non-informative and our estimator performs similarly to LPCT-Q. In the low privacy
Figure 4: Relationship between accuracy and the underlying parameters $\varepsilon$ and $\gamma$: (a) privacy-utility trade-off with $\gamma = 0.5$ and $n_Q = 50$, (b) privacy-utility trade-off with $\gamma = 5$ and $n_Q = 1{,}000$, (c) accuracy versus $\gamma$.
In the low privacy regime (ε ≥ 8), the performance of our estimator improves rapidly along with LPCT-P. In the medium regime, neither LPCT-P nor LPCT-Q clearly outperforms the other, and there is a clear performance gain from using LPCT.
Influence of γ We analyze how the relative signal exponent γ influences the performance. We set n_P = 10,000 and n_Q = 50 while varying γ in {0.5, 0.75, 1, 1.25, 1.5, 2}. In Figure 4(c), we can see that the accuracy of LPCT decreases with respect to γ for all ε, which aligns with Theorem 4. Since γ only influences the performance through the Q data, its effect is less apparent when the P data dominates, for instance when ε = 8.
6.2.2 Influence of sample sizes
Accuracy - n_P We conduct experiments to investigate the influence of the private sample size $n_P$. We fix n_Q = 50 and γ = 0.5 while varying n_P from 4,000 to 16,000. The results are presented in Figure 5(a). Additionally, we compare the performance of LPCT-P and LPCT-Q to analyze the contribution of both data sets, as shown in Figure 5(b). Moreover, we compare LPCT and LPCT-prune in Figure 5(c). We refrain from invoking $\tilde{\eta}^{\pi_p}_P$ even if the flag = 1 condition is triggered, since the termination induces sudden changes in the otherwise continuous curves and would render the findings less evident.
Figure 5: Analysis of accuracy with respect to n_P. (a) Accuracy versus n_P; (b) with LPCT-P and LPCT-Q; (c) with LPCT-prune.
Figure 6: Analysis of accuracy with respect to n_Q. (a) With LPCT-prune; (b) with LPCT-P and LPCT-Q.
In Figure 5(a), except for ε = 0.5, the accuracy improves as n_P increases. Moreover, the rate of improvement is more pronounced for higher values of ε. This observation aligns with the convergence rate established in Theorem 4. In Figure 5(b), LPCT always outperforms the methods using a single data set, i.e., LPCT-P and LPCT-Q. The improvement of LPCT over LPCT-Q is more significant for large ε. In Figure 5(c), we verify that LPCT-prune performs analogously to LPCT and that its performance improves as ε and n_P increase.
Accuracy - n_Q We conduct experiments to investigate the influence of the public sample size $n_Q$. We keep n_P = 10,000 and γ = 0.5 while varying n_Q from 10 to 200. The classification accuracy monotonically increases with respect to n_Q for all methods. In Figure 6(a), the performance of LPCT-prune gets closer to that of LPCT as n_Q grows. As explained in Section 5.2, the pruning is appropriately conducted when the Q data dominates, and is disturbed when the P data dominates. In Figure 6(b), it is worth noting that the performance improvement is more significant for small values of n_Q, specifically when n_Q ≤ 50. Moreover, the performance of LPCT-P also improves in this region and remains approximately unchanged for larger n_Q. This can be attributed to the information gained during partitioning, which is further validated in Figure 9.
6.2.3 Analysis of model parameters
Parameter analysis of depth We conduct experiments to investigate the choice of depth. Under (n_P, n_Q) = (10000, 50) and γ = 0.5, we plot the best performance of LPCT and LPCT-prune when fixing p and p_0 to certain values, respectively, in Figures 7(a) and 7(b). To avoid obscuring the empirical findings, we refrain from invoking $\tilde{\eta}^{\pi_p}_P$ even if flag = 1 is triggered, since the experiment is conducted with respect to p_0. In Figure 7(a), the accuracy first increases as p increases until it reaches a certain value. Then the accuracy begins to decrease as p increases. This confirms the trade-off observed in Theorem 4. Moreover, the p that maximizes the accuracy is increasing with respect to ε. This is compatible with the theory, since the optimal choice $p \asymp \log\bigl(n_P\varepsilon^2 + n_Q^{(2\alpha+2d)/(2\gamma\alpha+d)}\bigr)$ is monotonically increasing with respect to ε. As shown in Figure 7(b), the accuracy increases with p_0 at first, whereas the performance remains unchanged or decreases slightly for p_0 larger than the optimum.
Figure 7: Parameter analysis. (a) p of LPCT; (b) p_0 of LPCT-prune; (c) λ of LPCT.
Note that the axes of Figures 7(a) and 7(b) are aligned. LPCT-prune behaves analogously to LPCT in terms of both accuracy and the optimal depth choice.
Parameter analysis of λ We conduct experiments to investigate the choice of λ in terms of accuracy. Under (n_P, n_Q) = (10000, 1000) and γ = 5, we plot in Figure 7(c) the best performance of LPCT when fixing λ to certain values. In Figure 7(c), the relation between accuracy and λ is inverted U-shaped for ε = 2, 4, 8, while it is monotonically increasing for ε = 1 and monotonically decreasing for ε = 1000. The results indicate that a properly chosen λ is necessary, as stated in Theorem 4. Moreover, the λ that maximizes the accuracy is decreasing with respect to ε, which justifies relying more on public data under a stricter privacy constraint.
6.2.4 Benefits of public data
Range parameter We demonstrate the effectiveness of range estimation using public data when the domain is unknown. Consider the case where X = [0, a]^d and a is a random variable uniformly distributed in [1, 5]. We compare the following scenarios: (I) a is known; (II) a is estimated using public data, as shown in Theorem 5; (III) a is unknown. Case (I) is equivalent to X = [0, 1]^d.
Figure 8: Analysis of the range parameter. (a) Illustration of range estimation; (b) ε = 0.5; (c) ε = 8. In 8(a), we have a = 2.9; the black square is the range estimated by public data in case (II), and the grey squares represent the grid of a in case (III).
Figure 9: Illustration of partition with public data. (a) Illustration of the random partition and the max-edge partition; (b) comparison of LPCT and LPCT-R.
In case (II), we min-max scale the samples from the estimated range to [0, 1]^d. In case (III), we scale the samples from [0, a]^d to [0, 1]^d and select the best a among a ∈ {1, 1.5, ..., 5}. In Figure 8(a), we showcase the scaling schemes in (II) and (III). The range estimated from public data is much tighter than the best grid value of a. Given no prior knowledge, the a chosen in practice can perform significantly worse. In Figures 8(b) and 8(c), we observe that the performance with the estimated range matches that of case (I) as n_Q increases, whereas the unknown range significantly degrades performance. The performance gap between the known and estimated ranges is more substantial when n_Q is small. This observation is compatible with the conclusion drawn in Theorem 5, which emphasizes the need for sufficient public data to resolve the unknown range issue. A minimal sketch of range estimation with public data is given below.
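The following sketch, with variable names of our own, illustrates case (II): the range of each coordinate is estimated from the public samples only, and both data sets are then min-max scaled to [0, 1]^d with that estimate. Clipping points that fall outside the estimated range is our choice for this illustration.

    import numpy as np

    def estimate_range(X_pub):
        # Coordinate-wise range estimated from public data only,
        # so no privacy budget is spent on it.
        return X_pub.min(axis=0), X_pub.max(axis=0)

    def minmax_scale(X, lo, hi):
        # Map the estimated box to [0, 1]^d; out-of-range points are clipped.
        return np.clip((X - lo) / np.maximum(hi - lo, 1e-12), 0.0, 1.0)

    rng = np.random.default_rng(0)
    a = rng.uniform(1, 5)                      # unknown domain [0, a]^2
    X_Q = rng.uniform(0, a, size=(200, 2))     # public samples
    X_P = rng.uniform(0, a, size=(10_000, 2))  # private samples

    lo, hi = estimate_range(X_Q)
    X_P_scaled = minmax_scale(X_P, lo, hi)
    X_Q_scaled = minmax_scale(X_Q, lo, hi)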
Partition We demonstrate the effectiveness of partitioning with the information of public data by comparing LPCT with LPCT-R. In Figure 9(a), we show an example with p = 3. A random partition can turn out as in the top panel, where the upper-left and lower-right cells are hard to classify as 0 or 1. With sufficient public data, the max-edge partition is bound to create an informative partition such as the one in the bottom panel. Moreover, Figure 9(b) shows that LPCT outperforms LPCT-R.
7. Experiments on Real Data
7.1 Data Description
We conduct experiments on real-world data sets to show the superiority of LPCT. We utilize 11 real data sets that are available online. Each of these data sets contains sensitive information and necessitates privacy protection. Furthermore, these data sets cover a wide range of sample sizes and feature dimensions commonly encountered in real-world scenarios. When creating D_P and D_Q from the original data set D, some data sets have explicit criteria for splitting, reflecting real-world considerations; in these cases, we have P ≠ Q. For the others, the split is performed randomly and we have P = Q. A summary of key information for these data sets after pre-processing can be found in Table 1.
Table 1: Information of real data sets.

Dataset      P = Q        n_P        n_Q      d    Area
Anonymity      ✓          453         50     14    Social
Diabetes       ✓        2,054        200      8    Medical
Email          ✓        3,977        200  3,000    Text
Employ         ✓        2,225        200      8    Business
Rice           ✓          240      3,510      7    Image
Census         ✗       41,292      3,144     46    Social
Election       ✗        1,752        222      9    Social
Employee       ✗        3,319        504      9    Business
Jobs           ✗       46,218     14,318     11    Social
Landcover      ✗        7,097        131     28    Image
Taxi           ✗    3,297,639    117,367     93    Social
Following the analysis in Section 4.3, we min-max scale each data set to [0, 1]^d according to the public data. Detailed data information and pre-processing procedures are provided in Appendix C.1.
7.2 Competitors
We compare LPCT with several state-of-the-art competitors. For each model, we report the best result over its parameter grids based on an average of 20 replications. The tested methods are
LPCT is described in Section 6. In the real data experiments, we enlarge the grids to p ∈ {1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16} and λ ∈ {0.1, 0.5, 1, 2, 5, 10, 50, 100, 200, 300, 400, 500, 750, 1000, 1250, 1500, 2000} due to the large sample sizes. Since the max-edge rule is easily affected by useless or categorical features (Ma et al., 2023), we boost the performance of LPCT by incorporating the criterion reduction scheme from the original CART (Breiman, 1984) into the tree construction. Since Proposition 2 holds for any partition, the method is still ε-LDP. We choose the Gini index as the reduction criterion.
LPCT-prune is the pruned locally private classification tree in Algorithm 2. We also
adopt the criterion reduction scheme for tree construction.
CT is the original classification tree (Breiman, 1984) with the same grid as LPCT. We define two approaches, CT-W and CT-Q. For CT-W, we fit CT on the whole data, namely the P and Q data together, with no privacy protection; its accuracy serves as an upper bound. For CT-Q, we fit CT on the Q data only; its performance is taken into comparison. We use the implementation from sklearn (Pedregosa et al., 2011) with default parameters while varying max_depth in {1, 2, ..., 16} (see the sketch after this list).
PHIST is the locally differentially private histogram estimator proposed by Berrett and Butucea (2019). We fit PHIST on the private data alone as it does not leverage public data. We choose the number of splits $h_1$ along each axis in {1, 2, 3, 4, 5, 6}.
LPDT is the locally private decision tree originally proposed for regression (Ma et al.,
2023). The approach creates a partition on Q data and utilizes P data for estimation.
We adapt the strategy to classification. We select p in the same range as LPCT.
PGLM is the locally private generalized linear model proposed by Wang et al. (2023a) with the logistic link function. Note that their privacy result is under (ε, δ)-LDP. We let $\delta = 1/n_P^2$ for a fair comparison.
PSGD is the locally private stochastic gradient descent developed in Lowy et al. (2023). The presented algorithm is sequentially interactive, meaning that the privatized information from $X_i$ depends on the information from $X_{i-1}$. We adopt the warm start paradigm, where we first find a minimizer on the public data and then initialize the training process with that minimizer. Following Lowy et al. (2023), we choose the parameters of the privacy mechanism, i.e. PrivUnitG, by the procedure they provided. We let C = 1 and the learning rate η ∈ {10^{-5}, 10^{-4}, ..., 1}. We consider two models: PSGD-L is the linear classifier using the logistic loss; PSGD-N is the non-linear neural network classifier with one hidden layer and ReLU activation function. The number of neurons is selected from {d/2, d, 2d}.
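As a point of reference for the non-private baselines above, the following sketch fits the CT-Q baseline on the public data with sklearn. The tuning loop is a simplification of the protocol described in this section (a single held-out split rather than averaged replications), and the function and variable names are ours.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    def fit_ct_q(X_Q, Y_Q, depths=range(1, 17), seed=0):
        # CT-Q: a plain CART tree trained on the public data only,
        # with max_depth selected on a held-out validation split.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_Q, Y_Q, test_size=0.2, random_state=seed)
        best = None
        for depth in depths:
            clf = DecisionTreeClassifier(max_depth=depth, random_state=seed)
            clf.fit(X_tr, y_tr)
            acc = clf.score(X_val, y_val)
            if best is None or acc > best[0]:
                best = (acc, clf)
        return best[1]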
7.3 Experiment Results
We first illustrate the substantial utility gain brought by public data. We choose Diabetes and Employee, representing the cases of P = Q and P ≠ Q, respectively. On each data set, we randomly select a portion of the private data and 80 public samples to train three models: LPCT, LPDT, and PHIST. PHIST does not leverage any information from the public data, while LPDT benefits from public data only through the partition. LPCT utilizes public data in both partition and estimation. In Figure 10, the accuracy increases with the extent of public data utilization for both identically and non-identically distributed public data. Moreover, the accuracy gain is significant. For instance, in Figure 10(b), 80 public samples and 0.2 × 3319 ≈ 664 private samples achieve the same accuracy as using 0.6 × 3319 ≈ 1,991 private samples alone. The observation yields potential practical clues. For instance, when data-collecting resources are limited, organizations can put up rewards to get a small amount of public data rather than collecting a large amount of private data.
Next, we report the averaged accuracy for ε ∈ {0.5, 2, 8} on each real data set in Table 2. Here, CT-W is not private and serves as a reference.
Figure 10: Illustration of the utility gain brought by public data. (a) P = Q; (b) P ≠ Q.
To ensure significance, we adopt the Wilcoxon signed-rank test (Wilcoxon, 1992) with a significance level of 0.05 to check whether a result is significantly better than the others. For each data set, trials that exceed a memory limit of 100GB or a running time of one hour are terminated.
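As an illustration of the significance check described above, the following sketch compares two methods' per-replication accuracies with scipy. The array names, the synthetic numbers, and the one-sided alternative are our assumptions about how such a comparison would typically be set up.

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    acc_best = rng.normal(0.87, 0.01, size=20)   # accuracies of the best method
    acc_other = rng.normal(0.85, 0.01, size=20)  # accuracies of a competitor

    # Paired, one-sided Wilcoxon signed-rank test: is the best method
    # significantly better than the competitor at level 0.05?
    stat, p_value = wilcoxon(acc_best, acc_other, alternative="greater")
    significantly_better = p_value < 0.05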
From Table 2, we can see that LPCT-based methods generally outperform their competitors, both in terms of average performance (rank sum) and best performance (number of No. 1 results). Compared to LPCT, LPCT-prune performs moderately worse due to its data-driven nature. In practice, however, all other methods require parameter tuning, which consumes part of the privacy budget before the final estimation. Thus, their real performance is expected to drop, while that of LPCT-prune remains. Methods with the CART rule substantially outperform those with the max-edge rule, as the CART rule is robust to useless or categorical features and can fully employ the data information. We note that on some data sets the results of LPCT remain the same for all ε. This is the case where the public data dominates and the estimator is barely affected by the private data. Moreover, we observe that linear classifiers such as PSGD-L and PGLM perform poorly on some data sets, which is attributed to the non-linear underlying nature of these data sets.
We compare the efficiency of the methods by reporting the total running time in Table 3. PSGD-L and PSGD-N are generally slower and even take more than one hour for a single round on the Taxi data. Compared to these sequentially-interactive LDP methods, the non-interactive methods achieve acceptable running times. We argue that the performance gap between LPCT and CT is caused by the Cython acceleration in sklearn. Moreover, PHIST exceeds the memory limit on three data sets. Though histogram-based methods and tree-based methods possess the same storage complexity theoretically (Ma et al., 2023), LPCT avoids this memory blow-up when d is large. The results show that LPCT is an efficient method.
8. Conclusion and Discussions
In this work, we address the under-explored problem of public data assisted LDP learning by investigating non-parametric classification under the posterior drift assumption. We propose the locally private classification tree that achieves the mini-max optimal convergence rate. Moreover, we introduce a data-driven pruning procedure that avoids parameter tuning and produces an estimator with a fast convergence rate. Extensive experiments on synthetic and real data sets validate the superior performance of these methods. Both theoretical and experimental results suggest that public data is significantly more effective than data under privacy protection. This finding leads to practical suggestions. For instance, prioritizing non-private data collection should be considered. When data-collecting resources are limited, organizations can offer rewards to encourage users to share non-private data instead of collecting a large amount of privatized data.
We summarize the ways in which public data contributes to LPCT. The benefits are four-fold: (1) Elimination of the range parameter. Since the domain is unknown in the absence of the raw data, the convergence rate and performance of partition-based estimators are degraded by a range parameter. For LPCT, public data helps to eliminate the range parameter both theoretically and empirically, as demonstrated in Sections 4.3 and 6.2.4, respectively. (2) Implicit information in the partition. Through the partition created on public data, LPCT implicitly feeds information from the public data into the estimation. The performance gain is illustrated empirically in Section 6.2.4.
Table 2: Accuracy (×10^{-1}) over real data sets. The best results are marked with *. The results that are not significantly worse than the best result are bolded. Due to memory limitation, PHIST is corrupted on three data sets. PSGD-L and PSGD-N exceed the running time of 3,600 seconds on one data set and are terminated. These results are marked with -.

(a) ε = 0.5
Datasets    CT-W    LPCT            LPCT-prune      CT-Q    PHIST   LPDT    PGLM    PSGD-L  PSGD-N
                    Max.    CART    Max.    CART
Anonymity   8.346   8.346*  8.346*  8.309   8.309   8.346*  8.346*  7.546   5.899   7.790   8.123
Diabetes    9.964   7.067   7.590*  6.661   7.529   7.301   6.676   6.305   5.499   6.582   6.448
Email       9.261   7.332   7.487*  7.205   7.185   7.283   -       7.112   5.136   6.435   6.292
Employ      9.048   7.403   8.646*  6.684   6.362   8.577   5.704   5.644   4.874   5.742   5.595
Rice        8.992   9.039   9.183*  9.053   9.173   9.117   5.467   6.583   4.886   8.517   8.142
Census      8.685   8.539   8.743*  8.471   8.528   8.285   -       7.195   5.104   8.310   8.405
Election    6.408   5.610   5.846   5.259   5.773   5.958*  5.060   5.089   4.948   5.207   5.074
Employee    7.648   7.397*  7.131   7.381   7.040   7.151   6.660   6.098   5.280   6.493   6.387
Jobs        7.863   5.628   7.814*  5.561   7.793   7.599   5.327   5.441   4.942   5.530   5.732
Landcover   9.404   8.615   8.960*  8.299   8.719   6.670   8.362   8.366   4.955   8.368   8.349
Taxi        9.601   5.647   9.589*  5.637   9.589*  9.490   -       5.438   4.912   -       -
Rank Sum    -       36      18*     54      36      38      83      84      106     70      80
No. 1       -       2       10*     2       5       5       1       0       0       0       0

(b) ε = 2
Datasets    CT-W    LPCT            LPCT-prune      CT-Q    PHIST   LPDT    PGLM    PSGD-L  PSGD-N
                    Max.    CART    Max.    CART
Anonymity   8.346   8.346*  8.346*  8.086   8.168   8.346*  8.346*  8.346*  5.899   7.965   8.039
Diabetes    9.964   7.274   7.590*  6.661   7.412   7.301   6.676   6.676   5.499   6.653   6.674
Email       9.261   7.350   7.485*  7.224   7.459   7.283   -       7.181   5.136   7.136   7.203
Employ      9.048   7.702   9.121*  7.005   6.578   8.577   6.174   7.181   4.874   5.786   5.792
Rice        8.992   9.027   9.183*  9.053   9.183*  9.117   5.467   8.458   4.886   9.050   8.750
Census      8.685   8.537   8.743*  8.460   8.633   8.285   -       8.017   5.104   8.432   8.471
Election    6.408   5.648   5.850   5.327   5.595   5.958*  5.056   5.114   4.948   5.323   5.111
Employee    7.648   7.437*  7.175   7.362   7.175   7.151   6.472   6.951   5.280   6.400   6.383
Jobs        7.863   5.628   7.821*  5.597   7.778   7.599   5.333   5.591   4.942   6.788   5.934
Landcover   9.404   8.715   8.960*  8.623   8.734   6.670   8.367   8.493   4.955   8.553   8.369
Taxi        9.601   5.659   9.590*  5.643   9.580   9.490   -       5.536   4.912   -       -
Rank Sum    -       36      18*     54      36      38      83      84      106     70      80
No. 1       -       3       10*     1       5       4       1       1       0       1       0

(c) ε = 8
Datasets    CT-W    LPCT            LPCT-prune      CT-Q    PHIST   LPDT    PGLM    PSGD-L  PSGD-N
                    Max.    CART    Max.    CART
Anonymity   8.346   8.346*  8.346*  8.221   8.309   8.346*  8.346*  8.346*  5.395   8.338   8.154
Diabetes    9.964   7.503   7.930*  6.648   7.433   7.301   6.676   7.496   5.588   7.244   7.101
Email       9.261   7.355   7.520   7.251   7.491   7.283   -       7.326   5.271   8.446   7.060
Employ      9.048   7.619   9.068*  7.393   7.492   8.577   9.043   7.572   5.237   5.795   5.798
Rice        8.992   9.011   9.183*  9.053   9.137   9.117   8.858   9.017   6.283   9.017   8.875
Census      8.685   8.557   8.743*  8.404   8.671   8.285   -       8.527   5.389   8.530   8.154
Election    6.408   5.894   6.082*  5.321   5.563   5.958   5.015   5.892   5.462   5.765   5.705
Employee    7.648   7.441*  7.173   7.384   7.105   7.151   7.437   7.439   4.930   6.775   6.883
Jobs        7.863   5.650   7.861*  5.584   7.794   7.599   5.992   5.641   6.802   7.712   7.414
Landcover   9.404   8.737   8.963   8.620   8.745   6.670   8.367   8.695   5.551   9.137*  8.948
Taxi        9.601   5.666   9.610*  5.637   9.579   9.490   -       5.620   5.087   -       -
Rank Sum    -       42      18*     66      36      49      90      68      104     54      78
No. 1       -       2       8*      1       4       4       2       2       0       3       0
Table 3: Running time (s) on real data sets.

Datasets    CT-W    LPCT            LPCT-prune      CT-Q    PHIST   LPDT    PGLM    PSGD-L  PSGD-N
                    Max.    CART    Max.    CART
Anonymity   3e-3    1e-2    6e-3    9e-3    1e-2    2e-3    1e-2    2e-3    2e-3    8e-1    1e0
Diabetes    1e-2    1e-1    3e-2    3e-2    3e-2    3e-3    3e-2    6e-3    3e-3    3e0     6e0
Email       3e0     1e0     3e-2    2e-1    1e0     6e-2    -       1e0     4e0     8e0     3e3
Employ      5e-3    9e-2    2e-1    1e0     9e-1    3e-3    3e-2    4e-2    4e-3    5e0     7e0
Rice        1e-2    2e-2    1e-2    1e-2    2e-2    2e-2    5e-3    6e-3    8e-3    2e0     2e0
Census      6e-1    1e0     5e0     6e-1    9e-1    3e-1    -       2e-1    5e-2    2e1     3e1
Election    1e-2    6e-2    3e-2    3e-2    3e-2    3e-3    1e0     3e-2    3e-3    3e0     5e0
Employee    6e-3    8e-2    1e-1    4e-2    1e-1    3e-3    5e-2    5e-2    4e-3    7e0     1e2
Jobs        1e-1    1e0     1e0     6e-1    6e-1    3e-2    8e-1    5e-1    3e-2    1e2     1e2
Landcover   2e-1    1e-1    8e-2    9e-2    1e-1    5e-3    2e-1    9e-2    5e-3    1e1     2e1
Taxi        1e2     2e2     7e1     4e1     4e1     1e0     -       1e2     1e0     -       -
(3) Augmented estimation. In the weighted average estimation, the public data reduces the instability caused by perturbations and produces more accurate estimates. Theoretically, the convergence rate can be significantly improved. Empirically, even a small amount of public data improves the performance significantly (Sections 6 and 7). (4) Data-driven pruning. As shown in Section 5.2, we are able to derive the data-driven approach only with the existence of public data.
We discuss potential future work and improvements. The assumption that $P_X$ and $Q_X$ are identical can be relaxed. Similar to Ma et al. (2023), we only need the density ratio to be bounded. Moreover, the marginal densities need not be lower bounded if we add a parameter to guarantee that sufficient samples are contained in each cell (Ma et al., 2023). In practice, $X_P$ and $X_Q$ can be distinct (Yu et al., 2021b; Gu et al., 2023). This problem is related to domain adaptation (Redko et al., 2020) and can serve as a future work direction. Determining whether a data set satisfies Assumption 3 is of independent interest, as demonstrated in Gu et al. (2023); Yu et al. (2023).
Acknowledgments
We would like to thank the reviewers and action editor for their help and advice, which led
to a significant improvement of the article. The research is supported by the Special Funds
of the National Natural Science Foundation of China (Grant No. 72342010). Yuheng Ma
is supported by the Outstanding Innovative Talents Cultivation Funded Programs 2024 of
Renmin University of China. This research is also supported by Public Computing Cloud,
Renmin University of China.
Appendix A. Methodology
A.1 Estimator Summarization
Function                                   Meaning                                     Defined in
$\eta_P$, $\eta_Q$                         true regression functions of P and Q        (1)
$\eta^{\pi}_P$                             population tree estimation                  (3)
$\hat\eta^{\pi}_P$, $\hat\eta^{\pi}_Q$     non-private tree estimations of P and Q     (4)
$\tilde\eta^{\pi}_P$                       private tree estimation based on P          (7)
$\tilde\eta^{\pi}_{P,Q}$                   private tree estimation                     (8)
$\hat\eta^{\pi}_{P,Q}$                     non-private two-sample tree estimation      (9)
$\tilde\eta^{\pi_k}_{P,Q}$                 ancestor private tree estimation            (13)
$\tilde\eta^{\mathrm{prune}}_{P,Q}$        pruned private tree estimation              Algorithm 2
A.2 Derivation of the Closed-form Solution for Pruning Procedure
Recall that
$$
r^j_k := \sqrt{\frac{4C_1\Bigl(\frac{2^{p_0-k+3}}{\varepsilon^2}n_P \vee \sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\Bigr) + C_1\lambda^2\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i}{\Bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i + \lambda\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i\Bigr)^2}},
$$
where the sums over $j'$ range over the cells $A^{(p_0)}_{j'}$ sharing the level-$k$ ancestor of $A^{(p_0)}_j$. For each $\mathrm{ancestor}(A^{(p_0)}_j, k)$, define the quantity
$$
w^j_k = \frac{\sum_{j':\,\mathrm{ancestor}(A^{(p_0)}_{j'},k)=\mathrm{ancestor}(A^{(p_0)}_{j},k)}\ \sum_{i=1}^{n_P}\tilde U^{Pj'}_i}{\sum_{j':\,\mathrm{ancestor}(A^{(p_0)}_{j'},k)=\mathrm{ancestor}(A^{(p_0)}_{j},k)}\Bigl(\sum_{i=1}^{n_P}\tilde U^{Pj'}_i + \lambda\sum_{i=1}^{n_Q}U^{Qj'}_i\Bigr)}.
$$
Also, let $u_k = 4\bigl(\frac{2^{p_0-k+3}}{\varepsilon^2}n_P \vee \sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)$. When the signs of $\tilde\eta^{\pi_k}_P(x)-\frac12$ and $\hat\eta^{\pi_k}_Q(x)-\frac12$ are identical, the objective in (16) becomes
$$
\sqrt{C_1}\cdot\frac{\bigl|\tilde\eta^{\pi_k}_{P,Q}(x)-\frac12\bigr|}{r^j_k}
= \frac{w^j_k\bigl(\tilde\eta^{\pi_k}_P(x)-\frac12\bigr) + (1-w^j_k)\bigl(\hat\eta^{\pi_k}_Q(x)-\frac12\bigr)}{\sqrt{\dfrac{w^{j\,2}_k u_k}{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2} + \dfrac{(1-w^j_k)^2}{\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i}}}
\le \sqrt{\frac{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2}{u_k}\Bigl(\tilde\eta^{\pi_k}_P(x)-\frac12\Bigr)^2 + \sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i\cdot\Bigl(\hat\eta^{\pi_k}_Q(x)-\frac12\Bigr)^2},
$$
where the inequality follows from the Cauchy-Schwarz inequality. The maximum is attained when
$$
\frac{w^{j\,2}_k u_k}{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2}\cdot\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i\cdot\Bigl(\hat\eta^{\pi_k}_Q(x)-\frac12\Bigr)^2
= \frac{(1-w^j_k)^2}{\sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i}\cdot\frac{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2}{u_k}\cdot\Bigl(\tilde\eta^{\pi_k}_P(x)-\frac12\Bigr)^2.
$$
By bringing in $w^j_k$, this leads to the solution
$$
\lambda = \frac{u_k}{\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i}\cdot\frac{\bigl|\hat\eta^{\pi_k}_Q(x)-\frac12\bigr|}{\bigl|\tilde\eta^{\pi_k}_P(x)-\frac12\bigr|}.
$$
When $\tilde\eta^{\pi_k}_P(x)-\frac12$ and $\hat\eta^{\pi_k}_Q(x)-\frac12$ have different signs, the objective in (16) is maximized at an extreme value of $\lambda$, i.e. when $\lambda = 0$ or $\lambda = \infty$. The maximum value is
$$
\sqrt{\max\Bigl(\frac{\bigl(\sum_{j'}\sum_{i=1}^{n_P}\tilde U^{Pj'}_i\bigr)^2}{u_k}\Bigl(\tilde\eta^{\pi_k}_P(x)-\frac12\Bigr)^2,\ \sum_{j'}\sum_{i=1}^{n_Q}U^{Qj'}_i\cdot\Bigl(\hat\eta^{\pi_k}_Q(x)-\frac12\Bigr)^2\Bigr)\Big/\,C_1}.
$$
Appendix B. Proofs
B.1 Proof of Proposition 2
Proof of Proposition 2  Since all mechanisms $\mathcal{R}_i$ are completely the same, we use $\mathcal{R}$ for notational simplicity. For each conditional distribution $\mathcal{R}\bigl(\tilde U_i, \tilde V_i \mid X_i = x, Y_i = y\bigr)$, we can compute the density ratio as
$$
\sup_{x,x'\in\mathbb{R}^d,\ y,y'\in\{0,1\},\ S\in\sigma(\mathcal{S})}
\frac{\mathcal{R}\bigl((\tilde U,\tilde V)\in S \mid X_i = x, Y_i = y\bigr)}{\mathcal{R}\bigl((\tilde U,\tilde V)\in S \mid X_i = x', Y_i = y'\bigr)}
\le \sup_{x,x'\in\mathbb{R}^d}\frac{d\mathcal{R}(\tilde U\mid X_i = x)}{d\mathcal{R}(\tilde U\mid X_i = x')}\cdot
\sup_{x,x'\in\mathbb{R}^d,\ y,y'\in\{0,1\}}\frac{d\mathcal{R}(\tilde V\mid X_i = x, Y_i = y)}{d\mathcal{R}(\tilde V\mid X_i = x', Y_i = y')}. \tag{17}
$$
The first part is characterized by the Laplace mechanism. Since the conditional density is identical whenever $x$ and $x'$ belong to the same $A^{(p)}_j$, we have
$$
\sup_{x,x'\in\mathbb{R}^d}\frac{d\mathcal{R}(\tilde U\mid X_i = x)}{d\mathcal{R}(\tilde U\mid X_i = x')}
= \sup_{u,u'\in\{0,1\}^{|\mathcal{I}|}}\frac{d\mathcal{R}(\tilde U\mid U_i = u)}{d\mathcal{R}(\tilde U\mid U_i = u')}.
$$
The right-hand side can be computed as
$$
\frac{d\mathcal{R}(\tilde U\mid U_i = u)}{d\mathcal{R}(\tilde U\mid U_i = u')}
= \frac{\prod_{j}\exp\bigl(-\frac{\varepsilon}{4}|\tilde u^j_i - u^j|\bigr)}{\prod_{j}\exp\bigl(-\frac{\varepsilon}{4}|\tilde u^j_i - u'^{\,j}|\bigr)}
\le \prod_{j}\exp\Bigl(\frac{\varepsilon}{4}\bigl|u^j - u'^{\,j}\bigr|\Bigr).
$$
By definition, $u$ and $u'$ are one-hot vectors and differ in at most two entries. As a result,
$$
\sup_{x,x'\in\mathbb{R}^d}\frac{d\mathcal{R}(\tilde U\mid X_i = x)}{d\mathcal{R}(\tilde U\mid X_i = x')} \le \exp\Bigl(\frac{\varepsilon}{2}\Bigr). \tag{18}
$$
For the second part, note that $v$ and $v'$ also differ in at most two entries. Then, by analogous steps, we have
$$
\sup_{x,x'\in\mathbb{R}^d,\ y,y'\in\{0,1\}}\frac{d\mathcal{R}(\tilde V\mid X_i = x, Y_i = y)}{d\mathcal{R}(\tilde V\mid X_i = x', Y_i = y')} \le \exp\Bigl(\frac{\varepsilon}{2}\Bigr). \tag{19}
$$
Bringing (18) and (19) into (17) yields the desired conclusion.
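To make the mechanism analyzed above concrete, here is a minimal Python sketch of one data holder's report: the cell one-hot vector and the label-masked one-hot vector each receive independent Laplace noise with scale 4/ε, matching the ε/4 factor in the density ratios. The function and variable names are our own, and the cell index is assumed to be provided by the partition; this is a sketch, not the full implementation.

    import numpy as np

    def privatize(cell_index, y, num_cells, eps, rng):
        # One-hot membership vector U and label-masked vector V = y * U.
        u = np.zeros(num_cells)
        u[cell_index] = 1.0
        v = y * u
        # Independent Laplace(4 / eps) noise on every coordinate of U and V;
        # by (18) and (19), each half of the report consumes eps / 2,
        # so the pair (U_tilde, V_tilde) is eps-LDP.
        scale = 4.0 / eps
        u_tilde = u + rng.laplace(0.0, scale, size=num_cells)
        v_tilde = v + rng.laplace(0.0, scale, size=num_cells)
        return u_tilde, v_tilde

    rng = np.random.default_rng(0)
    u_tilde, v_tilde = privatize(cell_index=3, y=1, num_cells=8, eps=2.0, rng=rng)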
B.2 Proof of Theorem 3
Proof of Theorem 3  The proof is analogous to Cai and Wei (2021, Theorem 3.2). We first construct two classes of distributions $P_\sigma$ and $Q_\sigma$ for $\sigma\in\{-1,1\}^m$, and then combine the information inequality under the privacy constraint (Duchi et al., 2018) with Assouad's Lemma (Tsybakov, 2009). We first define the two classes of distributions by construction. Define the useful quantities
$$
r = c_r\Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-\frac{1}{2\alpha+2d}},\qquad
w = c_w r^d,\qquad
m = \bigl\lfloor c_m r^{\alpha\beta-d}\bigr\rfloor, \tag{20}
$$
where $c_r$, $c_w$, and $c_m$ are some universal constants. Let
$$
G = \bigl\{(6k_1 r, 6k_2 r, \dots, 6k_d r)\,:\,k_i = 1, 2, \dots, \lfloor(6r)^{-1}\rfloor,\ i = 1, 2, \dots, d\bigr\}
$$
be a grid of $\lfloor(6r)^{-1}\rfloor^d$ points in the unit cube. Denote by $x_1, x_2, \dots, x_{\lfloor(6r)^{-1}\rfloor^d}$ the points in $G$. Since $m \le \lfloor(6r)^{-1}\rfloor^d$, we are interested in the balls $B(x_k, 2r)$, $k = 1, 2, \dots, m$, with radius $2r$ centered at $x_k$. Let $B^c = [0,1]^d\setminus\bigcup_{k=1}^m B(x_k, 2r)$ denote the set of points that are not in any of the $m$ balls. Define the function $g(\cdot)$ on $[0,\infty)$:
$$
g(z) = \begin{cases} 1 & 0\le z<1\\ 2-z & 1\le z<2\\ 0 & z\ge2 \end{cases}
$$
and define
$$
h_P(z) = C_\alpha r^\alpha g^\alpha(z/r),\qquad
h_Q(z) = C_\gamma C_\alpha^\gamma r^{\alpha\gamma} g^{\alpha\gamma}(z/r),
$$
where $C_\alpha$, $C_\gamma$ are some universal constants. Define the hypercube $\mathcal{H}$ of pairs $(P_\sigma, Q_\sigma)$ by
$$
\mathcal{H} = \bigl\{(P_\sigma, Q_\sigma)\,:\,\sigma = (\sigma_1, \sigma_2, \dots, \sigma_m)\in\{-1,1\}^m\bigr\}
$$
with $P_\sigma$ and $Q_\sigma$ defined as follows. For the conditional distribution, we let
$$
\eta_P(x) = \begin{cases} \frac12\bigl(1 + \sigma_k h_P(\|x-x_k\|)\bigr) & \text{if } x\in B(x_k, 2r) \text{ for some } k = 1, 2, \dots, m\\ \frac12 & \text{if } x\in B^c, \end{cases}
$$
$$
\eta_Q(x) = \begin{cases} \frac12\bigl(1 + \sigma_k h_Q(\|x-x_k\|)\bigr) & \text{if } x\in B(x_k, 2r) \text{ for some } k = 1, 2, \dots, m\\ \frac12 & \text{if } x\in B^c. \end{cases}
$$
Let $P_{\sigma,X}$, $Q_{\sigma,X}$, $\sigma\in\{-1,1\}^m$, all have the same marginal distribution on $X$, with density $dP_X(x)$ defined as follows:
$$
dP_X(x) = \begin{cases} w\cdot r^{-d} & \text{if } x\in B(x_k, r) \text{ for some } k = 1, 2, \dots, m\\ c_{B^c}(1 - mw) & \text{if } x\in B^c\\ 0 & \text{otherwise}, \end{cases}
$$
where $c_{B^c}$ is a normalizing constant that makes $dP_X$ well-defined. Similar to Cai and Wei (2021), with a proper choice of all constants, $\mathcal{H}$ is a subset of $\mathcal{F}$ that satisfies Assumptions 1, 2, and 3. Then it suffices to show that
$$
\inf_{\hat f}\sup_{\mathcal{H}}\mathbb{E}_{X,Y}\Bigl[R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\Bigr] \gtrsim \Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-\frac{(\beta+1)\alpha}{2\alpha+2d}}.
$$
Let $TV(\cdot,\cdot)$ and $H(\cdot,\cdot)$ denote the total variation distance and the Hellinger distance between measures, respectively, and let $\mathrm{Ham}(\cdot,\cdot)$ denote the Hamming distance between ordered indices. Let $\sigma$ and $\sigma'$ be two indices in $\{-1,1\}^m$ such that $\mathrm{Ham}(\sigma,\sigma') = 1$, i.e. they differ at only one element, $\sigma_k$ versus $\sigma'_k$. We compute the total variation between the induced $P_\sigma$ and $P_{\sigma'}$. By Scheffe's theorem (Tsybakov, 2009), there holds
$$
TV(P_\sigma, P_{\sigma'}) = \frac12\int_{\mathcal{X}\times\mathcal{Y}}\bigl|dP_\sigma(x,y) - dP_{\sigma'}(x,y)\bigr|.
$$
Since the marginal distributions of both measures are identical, we have
$$
TV(P_\sigma, P_{\sigma'}) = \int_{\mathcal{X}}\bigl|\eta_{P_\sigma}(x) - \eta_{P_{\sigma'}}(x)\bigr|\,dP_{\sigma,X}(x)
= \int_{B(x_k,r)} C_\alpha r^\alpha\cdot w\cdot r^{-d}\,dx = C_\alpha w r^\alpha.
$$
According to Cai and Wei (2021), we have
$$
H^2(Q_\sigma, Q_{\sigma'}) = C_\gamma^2 C_\alpha^{2\gamma} w r^{2\alpha\gamma}. \tag{21}
$$
Let the conditional distribution $\mathcal{R} = \{\mathcal{R}_i\}_{i=1}^{n_P}$ be any transformation on $\mathcal{X}^{n_P}\times\mathcal{Y}^{n_P}$ that is sequentially-interactive $\varepsilon$-locally differentially private as defined in Definition 1. Denote by $\mathcal{R}P^{n_P}_\sigma$ the random variable obtained by mapping $P^{n_P}_\sigma$ compositionally through $\mathcal{R}$. Then, for $\mathrm{Ham}(\sigma,\sigma') = 1$, there holds
$$
H^2\bigl(\mathcal{R}P^{n_P}_\sigma\times Q^{n_Q}_\sigma,\ \mathcal{R}P^{n_P}_{\sigma'}\times Q^{n_Q}_{\sigma'}\bigr)
\le H^2\bigl(\mathcal{R}P^{n_P}_\sigma, \mathcal{R}P^{n_P}_{\sigma'}\bigr) + H^2\bigl(Q^{n_Q}_\sigma, Q^{n_Q}_{\sigma'}\bigr). \tag{22}
$$
For the first term in (22), combining Tsybakov (2009, Lemma 2.4) and Duchi et al. (2018, Corollary 3) yields
$$
H^2\bigl(\mathcal{R}P^{n_P}_\sigma, \mathcal{R}P^{n_P}_{\sigma'}\bigr) \le 4(e^{\varepsilon}-1)^2 n_P\,TV^2(P_\sigma, P_{\sigma'}) = 4C_\alpha^2(e^{\varepsilon}-1)^2 n_P w^2 r^{2\alpha}.
$$
This together with (20) leads to
$$
H^2\bigl(\mathcal{R}P^{n_P}_\sigma, \mathcal{R}P^{n_P}_{\sigma'}\bigr)
\le 4C_\alpha^2 c_w^2(e^{\varepsilon}-1)^2 n_P r^{2\alpha+2d}
= 4C_\alpha^2 c_w^2 c_r^{2\alpha+2d}(e^{\varepsilon}-1)^2 n_P\Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-1}
\le 4C_\alpha^2 c_w^2 c_r^{2\alpha+2d}. \tag{23}
$$
For the second term in (22), (21) implies
$$
H^2\bigl(Q^{n_Q}_\sigma, Q^{n_Q}_{\sigma'}\bigr) \le n_Q H^2(Q_\sigma, Q_{\sigma'}) = C_\gamma^2 C_\alpha^{2\gamma} n_Q w r^{2\alpha\gamma}.
$$
Again, bringing in (20) leads to
$$
H^2\bigl(Q^{n_Q}_\sigma, Q^{n_Q}_{\sigma'}\bigr) \le c_w C_\gamma^2 C_\alpha^{2\gamma} n_Q r^{2\alpha\gamma+d}
= C_\gamma^2 C_\alpha^{2\gamma} c_w n_Q\Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-\frac{2\gamma\alpha+d}{2\alpha+2d}}
\le C_\gamma^2 C_\alpha^{2\gamma} c_w. \tag{24}
$$
Bringing (23) and (24) into (22), we have
$$
H^2\bigl(\mathcal{R}P^{n_P}_\sigma\times Q^{n_Q}_\sigma,\ \mathcal{R}P^{n_P}_{\sigma'}\times Q^{n_Q}_{\sigma'}\bigr) \le \frac14
$$
as long as the constants are small enough such that $\max\bigl(4C_\alpha^2 c_w^2 c_r^{2\alpha+2d},\ C_\gamma^2 C_\alpha^{2\gamma} c_w\bigr) \le 1/8$. Let
$$
\hat\sigma = \mathop{\arg\min}_{\sigma\in\{-1,1\}^m}\ R_{P_\sigma}(\hat f) - R^*_{P_\sigma}.
$$
Applying Tsybakov (2009, Theorem 2.12), there holds
$$
\inf_{\hat\sigma}\max_{\sigma}\mathbb{E}_{X,Y\sim P_\sigma}\,\mathrm{Ham}(\hat\sigma, \sigma) \ge \frac{m}{2}\Bigl(1 - \frac{\sqrt2}{2}\Bigr). \tag{25}
$$
Next, we reduce the mini-max risk to this quantity. Since
$$
R_{P_\sigma}(f_{\hat\sigma}) - R^*_{P_\sigma}
\le \bigl(R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\bigr) + \bigl(R_{P_{\hat\sigma}}(\hat f) - R^*_{P_{\hat\sigma}}\bigr)
\le 2\bigl(R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\bigr),
$$
for any estimator $\hat f$ we can lower bound the excess risk as
$$
\mathbb{E}_{X,Y}\Bigl[R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\Bigr]
\ge \frac12\mathbb{E}_{X,Y}\Bigl[R_{P_\sigma}(f_{\hat\sigma}) - R^*_{P_\sigma}\Bigr]
= \frac12\mathbb{E}_{X,Y}\int_{\mathcal{X}}\Bigl|\eta_\sigma(x) - \frac12\Bigr|\,\mathbf{1}\bigl\{f^*_\sigma(x)\neq f_{\hat\sigma}(x)\bigr\}\,dP_{\sigma,X}(x)
= \frac14\mathbb{E}_{X,Y}\sum_{\sigma_k\neq\hat\sigma_k}\int_{B(x_k,r)} c_\alpha r^\alpha\,dP_{\sigma,X}(x)
= \frac{c_\alpha w r^\alpha}{4}\,\mathbb{E}_{X,Y}\bigl[\mathrm{Ham}(\sigma,\hat\sigma)\bigr].
$$
This together with (25) yields
$$
\inf_{\hat f}\max_{\mathcal{H}}\mathbb{E}_{X,Y}\Bigl[R_{P_\sigma}(\hat f) - R^*_{P_\sigma}\Bigr]
\ge \frac18\Bigl(1 - \frac{\sqrt2}{2}\Bigr)c_\alpha w r^\alpha m
\gtrsim r^{(\beta+1)\alpha}
= \Bigl(n_P(e^{\varepsilon}-1)^2 + n_Q^{\frac{2\alpha+2d}{2\gamma\alpha+d}}\Bigr)^{-\frac{(\beta+1)\alpha}{2\alpha+2d}}.
$$
B.3 Proof of Theorem 4
To prove Theorem 4, we rely on the following decomposition:
$$
\tilde\eta^{\pi}_{P,Q}(x) - \frac12
= \underbrace{\tilde\eta^{\pi}_{P,Q}(x) - \hat\eta^{\pi}_{P,Q}(x)}_{\text{Privatized Error}}
+ \underbrace{\hat\eta^{\pi}_{P,Q}(x) - \eta^{\pi}_{P,Q}(x)}_{\text{Sample Error}}
+ \underbrace{\eta^{\pi}_{P,Q}(x) - \frac12}_{\text{Signal Strength}} \tag{26}
$$
where we define
$$
\hat\eta^{\pi}_{P,Q}(x) = \sum_{j\in\mathcal{I}_p}\mathbf{1}\{x\in A^{(p)}_j\}\,
\frac{\sum_{i=1}^{n_P}V^{Pj}_i + \lambda\sum_{i=1}^{n_Q}V^{Qj}_i}{\sum_{i=1}^{n_P}U^{Pj}_i + \lambda\sum_{i=1}^{n_Q}U^{Qj}_i}, \tag{27}
$$
$$
\eta^{\pi}_{P,Q}(x) = \sum_{j\in\mathcal{I}_p}\mathbf{1}\{x\in A^{(p)}_j\}\,
\frac{n_P\int_{A^{(p)}_j}\eta_P(x')\,dP_X(x') + \lambda n_Q\int_{A^{(p)}_j}\eta_Q(x')\,dQ_X(x')}{n_P\int_{A^{(p)}_j}dP_X(x') + \lambda n_Q\int_{A^{(p)}_j}dQ_X(x')}. \tag{28}
$$
Correspondingly, we let
$$
\hat f^{\pi}_{P,Q}(x) = \mathbf{1}\Bigl\{\hat\eta^{\pi}_{P,Q}(x) > \frac12\Bigr\}
\qquad\text{and}\qquad
f^{\pi}_{P,Q}(x) = \mathbf{1}\Bigl\{\eta^{\pi}_{P,Q}(x) > \frac12\Bigr\}.
$$
Loosely speaking, the first error term quantifies the degradation brought by adding noise to the estimator, which we call the privatized error. The second term corresponds to the expected estimation error brought by the randomness of the data, which we call the sample error. The last term is called the signal strength, which quantifies the uncertainty of the population regression function. In general, we want the absolute value of the third term to be large (i.e., the population regression function is sure about the label), while the first and the second terms should be small (i.e., the noise is small). In the proof of Theorem 4, we specify a region $\Omega_n$ containing points with strong signal strength. We show that the first two terms are dominated by the signal strength on $\Omega_n$. The following three lemmas provide bounds for each of the three terms.
Lemma 8 (Bounding of Privatized Error) Let π be constructed through the max-edge partition in Section 3.2 with depth p. Let $\tilde\eta^{\pi}_{P,Q}$ be the locally private two-sample tree classifier in (8), and let $\hat\eta^{\pi}_{P,Q}$ be defined in (27). Then for $n_P^{-\alpha/(2\alpha+2d)}\lesssim\varepsilon\lesssim n_P^{d/(4\alpha+2d)}$, under the condition of Theorem 4, there holds
$$
\bigl\|\tilde\eta^{\pi}_{P,Q} - \hat\eta^{\pi}_{P,Q}\bigr\|_{\infty} \lesssim \frac{2^p\sqrt{n_P\log(n_P+n_Q)}}{\varepsilon(n_P+\lambda n_Q)}
$$
with probability $P^{n_P}\otimes Q^{n_Q}\otimes\mathcal{R}$ at least $1 - 3/(n_P+n_Q)^2$.
Lemma 9 (Bounding of Sample Error) Let π be constructed through the max-edge partition in Section 3.2 with depth p. Let $\hat\eta^{\pi}_{P,Q}$ and $\eta^{\pi}_{P,Q}$ be defined in (27) and (28). Then for $n_P^{-\alpha/(2\alpha+2d)}\lesssim\varepsilon\lesssim n_P^{\alpha/(4\alpha+2d)}$, under the condition of Theorem 4, there holds
$$
\bigl\|\hat\eta^{\pi}_{P,Q} - \eta^{\pi}_{P,Q}\bigr\|_{\infty} \lesssim \sqrt{\frac{2^p(n_P+\lambda^2 n_Q)\log(n_P+n_Q)}{(n_P+\lambda n_Q)^2}}
$$
with probability $P^{n_P}\otimes Q^{n_Q}\otimes\mathcal{R}$ at least $1 - 3/(n_P+n_Q)^2$.
Lemma 10 (Bounding of Signal Strength) Let π be constructed through the max-edge partition in Section 3.2 with depth p. Define the region
$$
\Omega_n := \bigl\{x\in\mathcal{X}\,:\,|\eta_P(x) - 1/2| \ge c\cdot 2^{-p\alpha/d}\bigr\} \tag{29}
$$
for some constant $c$ large enough. Let $\eta^{\pi}_{P,Q}$ be defined in (28). Then under the condition of Theorem 4, there holds
$$
\Bigl(\eta^{\pi}_{P,Q}(x) - \frac12\Bigr)\cdot\Bigl(\eta_P(x) - \frac12\Bigr) > 0,
\qquad\text{and}\qquad
\Bigl|\eta^{\pi}_{P,Q}(x) - \frac12\Bigr| \gtrsim \frac{n_P\,2^{-p\alpha/d} + \lambda n_Q\,2^{-p\gamma\alpha/d}}{n_P+\lambda n_Q}
$$
for all $x\in\Omega_n$.
We also need the following technical lemma.
Lemma 11 Suppose n
α
2α+2d
P
. ε . n
α
4α+2d
P
. If we let
λ δ
(γ1)α
d
+(
(γ+1)α
d
1)0
n
P
ε
2
log(n
P
+ n
Q
)
+
n
Q
log(n
P
+ n
Q
)
2α+2d
2γα+d
!
(γαα)(2γαd)
2α+2d
,
then the following statements hold:
ε
2
(n
P
+ λ
2
n
Q
)
log(n
P
+ n
Q
)
& δ
2α+2d
d
,
n
P
+ λ
2
n
Q
log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
. δ,
p
n
P
ε
2
log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
. δ.
All proofs of the above lemmas are provided in Section B.4.
Proof of Theorem 4 Remind the definition in (29). Theorem 4 takes 2
p
δ
1
and
λ δ
(γ1)α/d
. We prove the theorem by separately considering the region
n
:=
x X
|η
P
(x) 1/2| c
· δ
α/d
for constant c
such that
n
n
. In this case, suppose we have 2
p
= C
p
δ. Then it
suffices to take c
= 2C
α/d
p
· c
.
For x
n
, we show that
˜
f
π
P,Q
(x) = f
P
(x). Without loss of generality, let f
P
(x) = 1.
By Lemma 8, 9, and 10 we have
˜
f
π
P,Q
(x)
1
2
η
π
P,Q
(x)
1
2
kη
π
P,Q
bη
π
P,Q
k
kbη
π
P,Q
˜η
π
P,Q
k
&
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
2
p
·
n
P
ε(n
P
+ λn
Q
)
s
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
(30)
with probability 1 3/(n
P
+ n
Q
)
2
with respect to P
n
P
Q
n
Q
R. We first compare the
signal strength with the privatized error. From the conditions in Theorem 4, we have the
fact that λ 2
(γ1)pα/d
.
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
·
2
p
·
n
P
ε(n
P
+ λn
Q
)
1
2
pα/d
· 2
p
· ε ·
n
P
+ λ
2
n
Q
p
n
P
log(n
P
+ n
Q
)
.
By bringing in the value of δ and applying Lemma 11 (the first conclusion) , there holds
2
pα/d
· 2
p
· ε ·
n
P
+ λ
2
n
Q
p
n
P
log(n
P
+ n
Q
)
&C
α+d
d
p
· δ
α+d
d
· n
1
2
P
· ε
1
·
q
log(n
P
+ n
Q
)
&C
α+d
d
p
·
n
P
ε
2
log(n
P
+ n
Q
)
1
2
· n
1
2
P
· ε
1
·
q
log(n
P
+ n
Q
)
=C
α+d
d
p
.
Thus, for sufficiently large C
p
, we have
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
2
2
p
·
n
P
ε(n
P
+ λn
Q
)
. (31)
Analogously, we compare the signal strength with the sample error, which is
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
·
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
1
2
=2
pα/d
· δ
1
2
·
q
n
P
+ λ
2
n
Q
· log
1
2
(n
P
+ n
Q
).
By bringing in the value of δ and applying Lemma 11 (the first conclusion) , there holds
2
pα/d
· δ
1
2
·
q
n
P
+ λ
2
n
Q
· log
1
2
(n
P
+ n
Q
) & δ
1
2
· ε
1
& 1.
Thus, for sufficiently large n
P
and n
Q
, we have
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
2
s
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
. (32)
Combined with (31) and (32), (30) yields ˜η
π
P,Q
(x)
1
2
0, which indicates that
˜
f
π
P,Q
(x) =
f
P
(x) for all x
n
.
For
c
n
, by Assumption 2, we have
Z
c
n
η
P
(x)
1
2
1
n
˜
f
π
P,Q
(x) 6= f
P
(x)
o
dP
X
(x) c
δ
α/d
Z
c
n
dP
X
(x) c
1+β
C
β
δ
(β+1)α/d
.
Together, there holds
R
P
(
˜
f
π
P,Q
) R
P
=
Z
X
η
P
(x)
1
2
1
n
˜
f
π
P,Q
(x) 6= f
P
(x)
o
dP
X
(x)
=
Z
c
n
η
P
(x)
1
2
dP
X
(x) . δ
(β+1)α/d
.
B.4 Proofs of Lemmas in Section B.3
To prove the lemmas in Section B.3, we first present several technical results. Their proofs can be found in Section B.5.
Lemma 12 Suppose $\xi_i$, $i = 1,\dots,n$, are independent sub-exponential random variables with parameters $(a,b)$ (Wainwright, 2019, Definition 2.9). Then there holds
$$
\mathbb{P}\Biggl[\Bigl|\frac1n\sum_{i=1}^n\xi_i - \mathbb{E}\Bigl(\frac1n\sum_{i=1}^n\xi_i\Bigr)\Bigr| \ge t\Biggr] \le 2e^{-\frac{nt^2}{2a^2}}
$$
for any $t>0$. Moreover, a standard Laplace random variable is sub-exponential with parameters $(\sqrt2, 1)$. A union bound argument yields that
$$
\mathbb{P}\Biggl[\Bigl|\frac1n\sum_{i=1}^n\xi^j_i - \mathbb{E}\Bigl(\frac1n\sum_{i=1}^n\xi^j_i\Bigr)\Bigr| \ge t\ \text{for some } j = 1,\dots,2^p\Biggr] \le 2e^{p\log 2 - \frac{nt^2}{2a^2}}.
$$
Lemma 13 Let π be a partition generated from Algorithm 1 with depth p. Suppose Assump-
tion 1 holds. Suppose D
P
= {(X
P
1
, Y
P
1
), ··· , (X
P
n
P
, Y
P
n
P
)} and D
Q
= {(X
Q
1
, Y
Q
1
), ··· , (X
Q
n
Q
, Y
Q
n
Q
)}
are drawn i.i.d. from P and Q, respectively. Let 0 < λ . (n
P
+ n
Q
)/ log(n
P
+ n
Q
). Then
for all A
j
(p)
in π
p
, there holds
1
n
P
+ λn
Q
n
P
X
i=1
1{X
P
i
A
j
(p)
} + λ
n
Q
X
i=1
1{X
Q
i
A
j
(p)
}
!
Z
A
j
(p)
dP
X
(x
0
)
.
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
with probability P
n
P
Q
n
Q
at least 1 1/(n
P
+ n
Q
)
2
.
Lemma 14 Let π be a partition generated from Algorithm 1 with depth p. Suppose Assump-
tion 1 holds. Suppose D
P
= {(X
P
1
, Y
P
1
), ··· , (X
P
n
P
, Y
P
n
P
)} and D
Q
= {(X
Q
1
, Y
Q
1
), ··· , (X
Q
n
Q
, Y
Q
n
Q
)}
are drawn i.i.d. from P and Q, respectively. Let 0 < λ . (n
P
+ n
Q
)/ log(n
P
+ n
Q
). Then
for all A
j
(p)
in π
p
, there holds
1
n
P
+ λn
Q
n
P
X
i=1
Y
P
i
1{X
P
i
A
j
(p)
} + λ
n
Q
X
i=1
Y
Q
i
1{X
Q
i
A
j
(p)
}
!
n
P
Z
A
j
(p)
η
P
(x
0
)dP
X
(x
0
) + λ n
Q
Z
A
j
(p)
η
Q
(x
0
)dQ
X
(x
0
)
!
.
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
with probability P
n
P
Q
n
Q
at least 1 1/(n
P
+ n
Q
)
2
.
Lemma 15 Let $\pi_i := \bigl\{A^{(i)}_j,\ j\in\{1,\dots,2^i\}\bigr\}$ be the partition generated by Algorithm 1 with depth i. Then for any $A^{(i)}_j$, there holds
$$
2^{-1}\sqrt{d}\cdot 2^{-i/d} \le \operatorname{diam}\bigl(A^{(i)}_j\bigr) \le 2\sqrt{d}\cdot 2^{-i/d}.
$$
Proof of Lemma 8 We intend to bound
k˜η
π
P,Q
bη
π
P,Q
k
= max
j∈I
P
n
P
i=1
˜
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
˜
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
For each j I and any x A
j
(p)
, the error can be decomposed as
P
n
P
i=1
˜
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
˜
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
P
n
P
i=1
˜
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
+
P
n
P
i=1
˜
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
n
P
X
i=1
˜
V
Pj
i
+ λ
n
Q
X
i=1
V
Qj
i
4
ε
P
n
P
i=1
ζ
j
i
P
n
P
i=1
˜
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
| {z }
(I)
+
4
ε
P
n
P
i=1
ξ
j
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
| {z }
(II)
.
For the first term in (I), we can apply Lemma 12 and 14 to get
n
P
X
i=1
˜
V
Pj
i
+ λ
n
Q
X
i=1
V
Qj
i
n
P
X
i=1
V
Pj
i
+ λ
n
Q
X
i=1
V
Qj
i
+
4
ε
n
P
X
i=1
ξ
j
i
.(n
P
+ λn
Q
)
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
+ 2
p
+
p
n
P
ε
2
log(n
P
+ n
Q
)
n
P
+ λn
Q
!
with probability at least 1 2/(n
P
+ n
Q
)
2
, where we used the fact that
Z
A
j
(p)
dP
X
(x) 2
p
,
Z
A
j
(p)
η
P
(x)dP
X
(x) 2
p
,
Z
A
j
(p)
η
Q
(x)dQ
X
(x) 2
p
Using Lemma 11 (the second and the third conclusions), we know that the second term is
the dominating term, indicating that
n
P
X
i=1
˜
V
Pj
i
+ λ
n
Q
X
i=1
V
Qj
i
. (n
P
+ λn
Q
) · 2
p
.
Analogously, by Lemma 13, there holds
n
P
X
i=1
U
Pj
i
+ λ
n
Q
X
i=1
U
Qj
i
&(n
P
+ λn
Q
)
2
p
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
!
(n
P
+ λn
Q
) · 2
p
(33)
as well as
n
P
X
i=1
˜
U
Pj
i
+ λ
n
Q
X
i=1
U
Qj
i
n
P
X
i=1
U
Pj
i
+ λ
n
Q
X
i=1
U
Qj
i
4
ε
n
P
X
i=1
ζ
j
i
&(n
P
+ λn
Q
)
2
p
p
n
P
ε
2
log(n
P
+ n
Q
)
n
P
+ λn
Q
!
(n
P
+ λn
Q
) · 2
p
.
The last step is by Lemma 11 (the second and the third conclusions), knowing that 2
p
is
the dominating term. Combining these three pieces, there holds
(I) .
q
n
P
ε
2
log(n
P
+ n
Q
) ·
(n
P
+ λn
Q
) · 2
p
1
2
p
·
p
n
P
log(n
P
+ n
Q
)
ε(n
P
+ λn
Q
)
. (34)
For (II), (33) and Lemma 12 together yields
(II) .
2
p
·
p
n
P
log(n
P
+ n
Q
)
ε(n
P
+ λn
Q
)
. (35)
(34) and (35) together lead to the desired conclusion.
Proof of Lemma 9 Let η
P,Q
(x) =
n
P
η
P
(x) + λn
Q
η
Q
(x)
/(n
P
+ λn
Q
). We intend to
bound
kbη
π
P,Q
η
π
P,Q
k
= max
j∈I
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
(n
P
+ λn
Q
)
R
A
j
(p)
dP
X
(x
0
)
For each j I and any x A
j
(p)
, the error can be decomposed as
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
(n
P
+ λn
Q
)
R
A
j
(p)
dP
X
(x
0
)
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
·
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
dP
X
(x
0
)
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
R
A
j
(p)
dP
X
(x
0
)
| {z }
(I)
+
P
n
P
i=1
V
Pj
i
+ λ
P
n
Q
i=1
V
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
|
{z }
(II)
We bound two terms separately. For the first term in (I), Assumption 1 yield
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
) . 2
p
For the second term, (33) and Lemma 13 yields
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
(n
P
+ λn
Q
)
R
A
j
(p)
dP
X
(x
0
)
P
n
P
i=1
U
Pj
i
+ λ
P
n
Q
i=1
U
Qj
i
R
A
j
(p)
dP
X
(x
0
)
.(n
P
+ λn
Q
) ·
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
·
(n
P
+ λn
Q
) · 2
2p
1
2
p
·
s
2
p
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
with probability 1 1/(n
P
+ n
Q
)
2
. Together, these two pieces yield
(I) .
s
2
p
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
. (36)
Similarly for (II), we have
(II) .
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ λn
Q
)
2
· 2
p
s
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(n
P
+ λn
Q
)
2
. (37)
(36) and (37) together yield the desired conclusion.
Proof of Lemma 10 Without loss of generality, consider f
P
(x) = 1 for any x
n
. By
Assumption 3, we have
η
π
P,Q
(x)
n
P
R
A
j
(p)
η
P
(x
0
)dP
X
(x
0
) + λn
Q
R
A
j
(p)
1
2
+ C
γ
η
P
(x
0
)
1
2
γ
dQ
X
(x
0
)
n
P
R
A
j
(p)
dP
X
(x
0
) + λn
Q
R
A
j
(p)
dQ
X
(x
0
)
(38)
If Assumption 1 holds, for any x
0
A
j
(p)
, we have
η
P
(x
0
)
1
2
η
P
(x)
1
2
|η
P
(x) η
P
(x
0
)|
η
P
(x)
1
2
c
L
kx x
0
k
α
η
P
(x)
1
2
2c
L
d2
pα/d
(c
2c
L
d)2
pα/d
where the second last and the last inequalities come from Lemma 15 and the definition of
n
, respectively. For sufficiently large c
, we have η
P
(x
0
) 1/2 + 2
pα/d
for all x
0
A
j
(p)
,
and thus (38) yields
η
π
P,Q
(x)
1
2
+
n
P
R
A
j
(p)
2
pα/d
dP
X
(x
0
) + λn
Q
R
A
j
(p)
C
γ
2
α/d
dQ
X
(x
0
)
n
P
R
A
j
(p)
dP
X
(x
0
) + λn
Q
R
A
j
(p)
dQ
X
(x
0
)
1
2
+
c
2
· (1 C
γ
)
c
2
n
P
2
pα/d
+ λn
Q
2
α/d
n
P
+ λn
Q
which yields the desired result. Note that the factor c/c comes from the lower bound of
ratio
R
A
j
(p)
dQ
X
(x
0
)/
R
A
j
(p)
dP
X
(x
0
).
Proof of Lemma 11 We prove the statement by discussing four cases depending on the
value of n
P
, n
Q
, and γ, α, d.
Case (i):
n
P
ε
2
log(n
P
+n
Q
)
n
Q
log(n
P
+n
Q
)
2α+2d
2γα+d
and
(γ+1)α
d
1. We prove the three conclu-
sions by turn. The first conclusion follows from
n
P
+ λ
2
n
Q
δ
(2α+2d)/d
n
P
· δ
2α+2d
d
&n
P
·
n
P
ε
2
log(n
P
+ n
Q
)
2α+2d
2α+2d
&
log(n
P
+ n
Q
)
ε
2
.
For the second conclusion, let n
Q
= n
a
P
for some constant a. Note that we have
2γα+d
2α+2d
a > 0. By replacing all terms with equivalent order of n
P
, it suffices to prove
n
P
+ n
a
2(γ1)α
2α+2d
P
· ε
2(γ1)α
α+d
. n
4α+3d
2α+2d
P
· ε
d
α+d
+ n
1+a
(γ1)α+d
2α+2d
P
· ε
(γ1)α+d
α+d
+ n
2a
2(γ1)α+d
2α+2d
P
· ε
2(γ1)α+d
α+d
The left-hand side is dominated respectively by n
P
. n
4α+3d
2α+2d
P
· ε
d
α+d
and
n
a
2(γ1)α
2α+2d
P
· ε
2(γ1)α
α+d
. n
a
2(γ1)α
2α+2d
+
1
2
+
γα
2α+2d
P
· ε
(γ1)α+d
α+d
= n
1+a
(γ1)α+d
2α+2d
P
· ε
(γ1)α+d
α+d
.
Here, the inequality is drawn by
n
1
2
P
· ε
dα
α+d
n
1
2
(1
dα
α+d
)
P
, n
γα
2α+2d
P
· ε
γα
α+d
& 1.
Note that we omit the log factor since the polynomial part of the left-hand side is strictly
dominated by the right-hand side, yielding that the conclusion holds with sufficiently large
n
P
and n
Q
. For the last statement, the same procedure yields that it suffices to show
n
1
2
P
.
n
2α+d
2α+2d
P
+ n
a
(γ1)α+d
2α+2d
P
· ε
α
α+d
,
which holds since n
α
2α+2d
P
· ε
α
α+d
& n
d
α+d
·
α
2α+2d
P
.
Case (ii):
n
P
ε
2
log(n
P
+n
Q
)
n
Q
log(n
P
+n
Q
)
2α+2d
2γα+d
and
(γ+1)α
d
< 1. Compared to case (i), the
value is λ is multiplied by a factor δ
(
(γ+1)α
d
1)0
> 1. Thus, the first conclusion holds since
the value of the left-hand side increases. The third conclusion holds since the value of the
left-hand side decreases. For the second conclusion, by replacing all terms with equivalent
order of n
P
, it suffices to prove
n
P
+ n
a
4γα2d
2α+2d
P
· ε
4γα2d
α+d
.n
4α+3d
2α+2d
P
· ε
d
α+d
+ n
1+a
2γα
2α+2d
P
· ε
2γα
α+d
+ n
2a
4γαd
2α+2d
P
· ε
4γαd
α+d
.
For the left-hand side, the first term is dominated as n
P
. n
4α+3d
2α+2d
P
·ε
d
α+d
. Since
2γα+2d
2α+2d
< 1,
we have n
P
·
n
P
· ε
2
2γα2d
2α+2d
& n
P
. Thus, the second term of left-hand side is controlled by
n
a
4γα2d
2α+2d
P
· ε
4γα2d
α+d
. n
1+a
2γα
2α+2d
P
· ε
2γα
α+d
.
Case (iii):
n
P
ε
2
log(n
P
+n
Q
)
<
n
Q
log(n
P
+n
Q
)
2α+2d
2γα+d
and
(γ+1)α
d
1. We now have a >
2γα+d
2α+2d
.
There holds
n
P
+ λ
2
n
Q
δ
(2α+2d)/d
n
Q
· δ
4γα2d+2α+2d
d
n
Q
·
n
Q
log(n
P
+ n
Q
)
2α+2d
2γα+d
!
2γα
2α+2d
=
n
Q
log(n
P
+ n
Q
)
d
2γα+d
· ε
2
·
log(n
P
+ n
Q
)
ε
2
&
log(n
P
+ n
Q
)
ε
2
for sufficiently large n
P
+ n
Q
, which implies the first conclusion. For the second conclusion,
we want
n
P
+ n
a
2α+d
2γα+d
P
. n
a
2γα
2γα+d
P
+ n
1+a
(γ+1)α+d
2γα+d
P
+ n
a
2(γ+1)α+d
2γα+d
P
.
The left-hand side is dominated by the term n
a
2α+d
2γα+d
P
. Thus, the conclusion holds since there
holds a
2α+d
2γα+d
< a
2(γ+1)α+d
2γα+d
for all a, γ, α, and d. For the last statement, the same procedure
yields that it suffices to show
n
1
2
P
.
n
1a
d
2γα+d
P
+ n
aa
2γα
2γα+d
P
· ε.
Thus, the conclusion holds since the following conditions hold:
1
2
1
d
2α + 2d
< 1 a
d
2γα + d
, and n
α
2α+2d
P
· ε & 1.
Case (iv):
n
P
ε
2
log(n
P
+n
Q
)
<
n
Q
log(n
P
+n
Q
)
2α+2d
2γα+d
and
(γ+1)α
d
< 1. Compared to case (iii), the
value is λ is multiplied by a factor δ
(
(γ+1)α
d
1)0
> 1. Thus, the first conclusion holds since
the value of the left-hand side increases. The third conclusion holds since the value of the
left-hand side decreases. For the second conclusion, by replacing all terms with equivalent
order of n
P
, it suffices to prove
n
P
+ n
a
2γα+3d
2γα+d
P
. n
a
2γα
2γα+d
P
+ n
1+aa
2γα
2γα+d
P
+ n
2aa
4γαd
2γα+d
P
.
We have n
P
. n
1+
ad
2γα+d
P
= n
1+aa
2γα
2γα+d
P
and n
a
2γα+3d
2γα+d
P
. n
3ad
2γα+d
P
= n
2aa
4γαd
2γα+d
P
.
B.5 Proofs of Results in Section B.4
Proof of Lemma 12 The conclusion follows from (2.18) in Wainwright (2019).
In the subsequent proofs, we define the empirical measure $\mathrm{D} := \frac1n\sum_{i=1}^n\delta_{(X_i,Y_i)}$ given samples $D = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$, where $\delta$ is the Dirac measure. To conduct our analysis, we need to introduce the following fundamental description of the covering number, which enables an approximation of an infinite set by finite subsets.
Definition 16 (Covering Number) Let $(\mathcal{X}, d)$ be a metric space and $A\subset\mathcal{X}$. For $\varepsilon>0$, the $\varepsilon$-covering number of $A$ is denoted as
$$
\mathcal{N}(A, d, \varepsilon) := \min\Bigl\{n\ge1\,:\,\exists\,x_1,\dots,x_n\in\mathcal{X}\ \text{such that}\ A\subset\bigcup_{i=1}^n B(x_i,\varepsilon)\Bigr\},
$$
where $B(x,\varepsilon) := \{x'\in\mathcal{X}\,:\,d(x,x')\le\varepsilon\}$.
To prove Lemmas 13 and 14, we need the following fundamental lemma concerning the VC dimension of the random partitions in Section 3.2. To this end, let $p\in\mathbb{N}$ be fixed, let $\pi_p$ be a partition of $\mathcal{X}$ with the number of splits $p$, and let $\pi_{(p)}$ denote the collection of all such partitions $\pi_p$.
Lemma 17 Let $\tilde{\mathcal{A}}$ be the collection of all cells $\times_{i=1}^d[a_i,b_i]$ in $\mathbb{R}^d$. For all $0<\varepsilon<1$, there exists a universal constant $C$ such that
$$
\mathcal{N}\bigl(\mathbf{1}_{\tilde{\mathcal{A}}},\ \|\cdot\|_{L_1(Q)},\ \varepsilon\bigr) \le C(2d+1)(4e)^{2d+1}(1/\varepsilon)^{2d}.
$$
Proof of Lemma 17 The result follows directly from Theorem 9.2 in Kosorok (2008).
Before we proceed, we state the well-known Bernstein inequality, which will be used frequently in the proofs.
Lemma 18 (Bernstein's inequality) Let $B>0$ and $\sigma>0$ be real numbers, and $n\ge1$ be an integer. Furthermore, let $\xi_1,\dots,\xi_n$ be independent random variables satisfying $\mathbb{E}_P\xi_i = 0$, $\|\xi_i\|_\infty\le B$, and $\mathbb{E}_P\sum_{i=1}^n\xi_i^2\le\sigma^2$ for all $i = 1,\dots,n$. Then for all $\tau>0$, we have
$$
\mathbb{P}\Biggl(\frac1n\sum_{i=1}^n\xi_i \ge \sqrt{\frac{2\sigma^2\tau}{n^2}} + \frac{2B\tau}{3n}\Biggr) \le e^{-\tau}.
$$
Proof of Lemma 13 Since P and Q share the same marginal distribution, it is equivalent
to prove
1
n
P
+ n
Q
n
P
X
i=1
1{X
i
A
j
(p)
} + λ
n
Q
X
i=1
1{X
i
A
j
(p)
}
!
n
P
+ λn
Q
n
P
+ n
Q
Z
A
j
(p)
dP
X
(x
0
)
.
s
(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
2
p
(n
P
+ n
Q
)
2
holds with probability P
n
P
+n
Q
at least 1 1/(n
P
+ n
Q
)
2
for X
i
P
X
. Let
˜
A be the
collection of all cells ×
d
i=1
[a
i
, b
i
] in R
d
. Applying Lemma 17 with (D
X
+ P
X
)/2, there exists
an ε-net {
˜
A
k
}
K
k=1
˜
A with
K C(2d + 1)(4e)
2d+1
(1)
2d
(39)
such that for any j I
p
, there exist some k {1, . . . , K} such that
k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
((D
X
+P
X
)/2)
ε,
Since
k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
((D
X
+P
X
)/2)
=1/2 · k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(D
X
)
+ 1/2 · k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(P
X
)
,
we get
k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(D
X
)
2ε, k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(P
X
)
2ε. (40)
Consequently, by the definition of the covering number and the triangle inequality, for any
j I
p
, there holds
1
n
P
+ n
Q
n
P
X
i=1
1{x A
j
(p)
}(X
i
) + λ
n
Q
X
i=1
1{x A
j
(p)
}(X
i
)
!
Z
˜
A
j
(p)
dP
X
(x
0
)
1
n
P
+ n
Q
n
P
X
i=1
1{x
˜
A
k
}(X
i
) + λ
n
Q
X
i=1
1{x
˜
A
k
}(X
i
)
!
Z
˜
A
k
dP
X
(x
0
)
+ k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(D
X
)
+ k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(P
X
)
1
n
P
+ n
Q
n
P
X
i=1
1{x
˜
A
k
}(X
i
) + λ
n
Q
X
i=1
1{x
˜
A
k
}(X
i
)
!
Z
˜
A
k
dP
X
(x
0
)
+ 4ε.
Therefore, we get
sup
j∈I
p
1
n
P
+ n
Q
n
P
X
i=1
1{x A
j
(p)
}(X
i
) + λ
n
Q
X
i=1
1{x A
j
(p)
}(X
i
)
!
Z
˜
A
j
(p)
dP
X
(x
0
)
(41)
sup
1kK
1
n
P
+ n
Q
n
P
X
i=1
1{x
˜
A
k
}(X
i
) + λ
n
Q
X
i=1
1{x
˜
A
k
}(X
i
)
!
Z
˜
A
k
dP
X
(x
0
)
+ 4ε.
(42)
For any fixed 1 k K, let the random variable ξ
i
be defined by ξ
i
:= 1{X
i
˜
A
k
}
R
˜
A
k
dP
X
(x
0
) for i = 1, ··· , n
P
and ξ
i
:= λ(1{X
i
˜
A
k
}
R
˜
A
k
dP
X
(x
0
)) for i = n
P
+
1, ··· , n
P
+ n
Q
. Then we have E
P
X
ξ
i
= 0, kξk
1 λ, and E
P
X
P
n
P
+n
Q
i=1
ξ
2
i
(n
P
+
λ
2
n
Q
)
R
˜
A
k
dP
X
(x
0
). According to Assumption 1, there holds E
P
X
P
n
P
+n
Q
i=1
ξ
2
i
(n
P
+λ
2
n
Q
)·
c · 2
p
. Applying Bernstein’s inequality in Lemma 18, we obtain
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
n
P
+ λn
Q
n
P
+ n
Q
Z
˜
A
k
dP
X
(x
0
)
s
c 2
1p
τ(n
P
+ λ
2
n
Q
)
(n
P
+ n
Q
)
2
+
2τ · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 12e
τ
. Then the union bound together with the covering
number estimate (39) implies that
sup
1kK
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
n
P
+ λn
Q
n
P
+ n
Q
Z
˜
A
k
dP
X
(x
0
)
s
c · 2
1p
(n
P
+ λ
2
n
Q
)(τ + log(2K))
(n
P
+ n
Q
)
2
+
2(τ + log(2K)) · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 1 e
τ
. Let τ = 2 log(n
P
+ n
Q
) and ε = 1/(n
P
+ n
Q
).
Then for any n
P
+ n
Q
> N
1
:= (2C) (2d + 1) (4e), we have τ + log(2K) = 2 log n +
log(2C) + log(2d + 1) + (2d + 1) log(4e) + 2d log n (4d + 5) log(n
P
+ n
Q
). Therefore, we
have
sup
1kK
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
n
P
+ λn
Q
n
P
+ n
Q
Z
˜
A
k
dP
X
(x
0
)
s
c · 2
1p
(n
P
+ λ
2
n
Q
)(4d + 5) log(n
P
+ n
Q
)
(n
P
+ n
Q
)
2
+
2(4d + 5) · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 1 1/(n
P
+ n
Q
)
2
. This together with (41) yields that
sup
j∈I
p
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{x A
j
(p)
}
n
P
+ λn
Q
n
P
+ n
Q
Z
˜
A
j
(p)
dP
X
(x
0
)
s
c · 2
1p
(4d + 5)(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
(n
P
+ n
Q
)
2
+
2(4d + 5) · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
+
4
n
P
+ n
Q
.
We now compare to see which term is dominating. Apply Lemma 11 (the first conclusion),
we have
2
p
(n
P
+ λ
2
n
Q
) · log(n
P
+ n
Q
)
(1 λ
2
) · log
2
(n
P
+ n
Q
)
&
δ
2α+d
d
ε
2
· (1 λ
2
)
.
When γ
d
α
1 and γ
d
2α
,
δ
2α+d
d
ε
2
· (1 λ
2
)
= ε
2
· δ
2α+d
d
(γ1)α
d
> ε
2
· δ
2α+3d
2d
ε
2
· (n
P
· ε
2
)
2α+3d
4α+4d
=(n
P
· ε
4α+2d
2α+3d
)
2α+3d
4α+4d
& 1.
When γ
d
α
1 and γ >
d
2α
,
δ
2α+d
d
ε
2
· (1 λ
2
)
= ε
2
· δ
2α+d
d
ε
2
· (n
P
· ε
2
)
2α+d
2α+2d
= (n
P
· ε
2d
2α+d
)
2α+d
2α+2d
& 1.
When γ >
d
α
1 and γ 1, the derivation is the same as the last situation. When γ >
d
α
1
and γ < 1,
δ
2α+d
d
ε
2
· (1 λ
2
)
= ε
2
· δ
2α+d
d
2γαd
d
ε
2
· (n
P
· ε
2
)
2α
d
= (n
P
· ε
4α2d
2α
)
2α
d
& 1.
These derivations imply that the first term is dominating the upper bound, which implies
the desired result.
Proof of Lemma 14 For notation simplicity, let (X
i
, Y
i
) P for i = 1, ··· , n
P
and
(X
i
, Y
i
) Q for i = n
P
+ 1, ··· , n
P
+ n
Q
. Let
˜
Y
i
= Y
i
for i = 1, ··· , n
P
and
˜
Y
i
= λ · Y
i
for i = n
P
+ 1, ··· , n
P
+ n
Q
. Let η
P,Q
(x) =
n
P
η
P
(x) + λn
Q
η
Q
(x)
/(n
P
+ λn
Q
). Let
˜
A
be the collection of all cells ×
d
i=1
[a
i
, b
i
] in R
d
. Then there exists an ε-net {
˜
A
k
}
K
k=1
˜
A
with K bounded by (39) such that for any j I
p
, (40) holds for some k {1, . . . , K}.
Consequently, by the definition of the covering number and the triangle inequality, for any
j I
p
, there holds
n
P
+n
Q
X
i=1
1{X
i
A
j
(p)
}
˜
Y
i
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
˜
Y
i
Z
˜
A
k
η
P,Q
(x
0
)dP
X
(x
0
)
+
Z
R
d
1{x
0
A
j
(p)
} 1{x
0
˜
A
k
}
η
P,Q
(x
0
)
dP
X
(x
0
) +
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
} 1{X
i
A
j
(p)
}
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
˜
Y
i
Z
˜
A
k
η
P,Q
(x
0
)dP
X
(x
0
)
+ max
1in
·k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(D
X
)
+ k1{x A
j
(p)
} 1{x
˜
A
k
}k
L
1
(P
X
)
n
P
+n
Q
X
i=1
1{X
i
˜
A
k
}
˜
Y
i
Z
˜
A
k
η
P,Q
(x
0
)dP
X
(x
0
)
+ 4ε. (43)
where the last inequality follow from the condition Y = {0, 1}.
For any fixed 1 k K, let the random variable
˜
ξ
i
be defined by
˜
ξ
i
:= 1{X
i
˜
A
k
}
˜
Y
i
R
˜
A
k
η
P
(x
0
) dP
X
(x
0
) for i = 1, ··· , n
P
and
˜
ξ
i
:= 1{X
i
˜
A
k
}
˜
Y
i
λ
R
˜
A
k
η
Q
(x
0
) dP
X
(x
0
)
for i = n
P
+ 1, ··· , n
P
+ n
Q
. Then we have E
˜
ξ
i
= 0, kξk
1 λ, and E
P
n
P
+n
Q
i=1
˜
ξ
2
i
(n
P
+ λ
2
n
Q
)
R
˜
A
k
dP
X
(x
0
). According to Assumption 1, there holds E
P
n
P
+n
Q
i=1
˜
ξ
2
i
c ·(n
P
+
λ
2
n
Q
) · 2
p
. Applying Bernstein’s inequality in Lemma 18, we obtain
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
A
j
(p)
}
˜
Y
i
n
P
+ λn
Q
n
P
+ n
Q
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
s
c · 2
1p
τ(n
P
+ λ
2
n
Q
)
(n
P
+ n
Q
)
2
+
2τ · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 1 2e
τ
. Similar to the proof of Lemma 13, one can show
that for any n N
1
, there holds
sup
1kK
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
A
j
(p)
}
˜
Y
i
n
P
+ λn
Q
n
P
+ n
Q
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
s
c · 2
1p
τ(n
P
+ λ
2
n
Q
)
(n
P
+ n
Q
)
2
+
2τ · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
with probability P
n
P
+n
Q
at least 1 1/(n
P
+ n
Q
)
2
. This together with (43) yields that
1
n
P
+ n
Q
n
P
+n
Q
X
i=1
1{X
i
A
j
(p)
}
˜
Y
i
n
P
+ λn
Q
n
P
+ n
Q
Z
A
j
(p)
η
P,Q
(x
0
)dP
X
(x
0
)
s
c · 2
1p
(4d + 5)(n
P
+ λ
2
n
Q
) log(n
P
+ n
Q
)
(n
P
+ n
Q
)
2
+
2(4d + 5) · 1 λ · log(n
P
+ n
Q
)
3(n
P
+ n
Q
)
+
4
n
P
+ n
Q
.
This implies the desired result by analogous derivations as in the last part of the proof of
Lemma 13.
Proof of Lemma 15 According to the max-edge partition rule, when the depth $i$ of the tree is a multiple of the dimension $d$, each cell of the tree partition is a high-dimensional cube with side length $2^{-i/d}$. On the other hand, when the depth $i$ of the tree is not a multiple of the dimension $d$, we consider the max-edge tree partitions with depths $\lfloor i/d \rfloor$ and $\lceil i/d \rceil$, whose corresponding cube side lengths are $2^{-\lfloor i/d \rfloor}$ and $2^{-\lceil i/d \rceil}$, respectively. Note that in the splitting procedure of the max-edge partition, the side length of each sub-rectangle decreases monotonically as the depth $i$ increases, so the side length of a random tree partition cell lies between $2^{-\lceil i/d \rceil}$ and $2^{-\lfloor i/d \rfloor}$. This implies that
$$
\sqrt{d} \cdot 2^{-\lceil i/d \rceil} \le \mathrm{diam}\bigl(A_j^{(i)}\bigr) \le \sqrt{d} \cdot 2^{-\lfloor i/d \rfloor}.
$$
Since $i/d - 1 \le \lfloor i/d \rfloor \le \lceil i/d \rceil \le i/d + 1$, we immediately get
$$
2^{-1}\sqrt{d} \cdot 2^{-i/d} \le \mathrm{diam}\bigl(A_j^{(i)}\bigr) \le 2\sqrt{d} \cdot 2^{-i/d}.
$$
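As a quick numerical check (illustrative only), take $d = 3$ and $i = 7$, so $\lfloor i/d \rfloor = 2$ and $\lceil i/d \rceil = 3$. The first display gives
$$
\sqrt{3} \cdot 2^{-3} \approx 0.217 \le \mathrm{diam}\bigl(A_j^{(7)}\bigr) \le \sqrt{3} \cdot 2^{-2} \approx 0.433,
$$
which indeed lies between $2^{-1}\sqrt{3} \cdot 2^{-7/3} \approx 0.172$ and $2\sqrt{3} \cdot 2^{-7/3} \approx 0.687$, as asserted by the second display.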
B.6 Proof of Theorem 5
Proof of Theorem 5 We let $\tilde{f}^{\pi,M}_{P,Q} := (\tilde{f}^{\pi}_{P,Q} \cdot \mathbf{1}_{[0,1]^d}) \circ M$ and $X \sim P_X$. The excess risk can be decomposed as
$$
\begin{aligned}
\mathcal{R}_P\bigl(\tilde{f}^{\pi,M}_{P,Q}\bigr) - \mathcal{R}_P^* &= \left( \int_{[0,1]^d} L\bigl(x, y, \tilde{f}^{\pi,M}_{P,Q}(x)\bigr)\, dP(x,y) - \int_{[0,1]^d} L\bigl(x, y, f_P^*(x)\bigr)\, dP(x,y) \right) \\
&\quad + \left( \int_{([0,1]^d)^c} L\bigl(x, y, \tilde{f}^{\pi,M}_{P,Q}(x)\bigr)\, dP(x,y) - \int_{([0,1]^d)^c} L\bigl(x, y, f_P^*(x)\bigr)\, dP(x,y) \right) \\
&\lesssim \delta^{\frac{\alpha(\beta+1)}{d}} + P\Bigl(M(X) \notin \times_{j=1}^{d} [0,1]\Bigr),
\end{aligned}
$$
where δ is defined in Theorem 4. Note that
$$
P\Bigl(M(X) \notin \times_{j=1}^{d} [0,1]\Bigr) = P\Bigl(X \notin \times_{j=1}^{d} [\widehat{a}_j, \widehat{b}_j]\Bigr) \le \sum_{j=1}^{d} P\bigl(X_j \notin [\widehat{a}_j, \widehat{b}_j]\bigr).
$$
Denote the CDF of $X_j$ by $F_j$. Since
$$
P\left( F_j(\widehat{a}_j) - F_j(a_j) > \frac{2 \log n_Q}{n_Q} \right) = P\left( F_j(\widehat{a}_j) > \frac{2 \log n_Q}{n_Q} \right) = \left( 1 - \frac{2 \log n_Q}{n_Q} \right)^{n_Q} \le \frac{1}{n_Q^2}
$$
and a similar result holds for $\widehat{b}_j$, we have $P\bigl(X_j \notin [\widehat{a}_j, \widehat{b}_j]\bigr) \lesssim \frac{\log n_Q}{n_Q}$ with probability $1 - 1/n_Q^2$. Thus, there holds $P\bigl(X \notin \times_{j=1}^{d} [\widehat{a}_j, \widehat{b}_j]\bigr) \lesssim \frac{\log n_Q}{n_Q}$ with probability $1 - d/n_Q^2$.
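The middle equality in the display above is a standard order-statistic computation; spelled out (a sketch, assuming as the display suggests that $\widehat{a}_j$ is the minimum of the $j$-th coordinates $X_{1,j}, \ldots, X_{n_Q,j}$ of the public samples, with $X_{i,j}$ our notation for this sketch and $F_j$ continuous):
$$
P\Bigl( F_j(\widehat{a}_j) > \tfrac{2\log n_Q}{n_Q} \Bigr) = \prod_{i=1}^{n_Q} P\Bigl( F_j(X_{i,j}) > \tfrac{2\log n_Q}{n_Q} \Bigr) = \Bigl( 1 - \tfrac{2\log n_Q}{n_Q} \Bigr)^{n_Q} \le \exp(-2\log n_Q) = \frac{1}{n_Q^2},
$$
where the last inequality uses $1 - t \le e^{-t}$.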
B.7 Proof of Proposition 6
Proof of Proposition 6 We prove the conclusion for $k = p_0$ first. When $k = p_0$ and $x \in A_j^{(k)}$, the term $\tilde{\eta}^{\pi_k}_{P,Q}(x) - \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{P,Q}(x)\bigr]$ can be written as
$$
\frac{1}{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i + \lambda \sum_{i=1}^{n_Q} U^{Qj}_i} \cdot \Biggl( \overbrace{\sum_{i=1}^{n_P} \Bigl( V^{Pj}_i - \mathbb{E}_{Y^P|X^P}\bigl[V^{Pj}_i\bigr] \Bigr)}^{(\mathrm{I})} + \overbrace{\lambda \sum_{i=1}^{n_Q} \Bigl( V^{Qj}_i - \mathbb{E}_{Y^Q|X^Q}\bigl[V^{Qj}_i\bigr] \Bigr)}^{(\mathrm{II})} + \underbrace{\frac{4}{\varepsilon} \sum_{i=1}^{n_P} \xi^{j}_i - \frac{\sum_{i=1}^{n_P} \mathbb{E}_{Y^P|X^P}\bigl[V^{Pj}_i\bigr] + \lambda \sum_{i=1}^{n_Q} \mathbb{E}_{Y^Q|X^Q}\bigl[V^{Qj}_i\bigr]}{\sum_{i=1}^{n_P} U^{Pj}_i + \lambda \sum_{i=1}^{n_Q} U^{Qj}_i} \cdot \frac{4}{\varepsilon} \sum_{i=1}^{n_P} \zeta^{j}_i}_{(\mathrm{III})} \Biggr).
$$
We bound the three parts separately. Since $\mathbb{E}_{Y^P|X^P}[V^{Pj}_i] \le U^{Pj}_i$, there holds
$$
\frac{\sum_{i=1}^{n_P} \mathbb{E}_{Y^P|X^P}\bigl[V^{Pj}_i\bigr] + \lambda \sum_{i=1}^{n_Q} \mathbb{E}_{Y^Q|X^Q}\bigl[V^{Qj}_i\bigr]}{\sum_{i=1}^{n_P} U^{Pj}_i + \lambda \sum_{i=1}^{n_Q} U^{Qj}_i} \le 1.
$$
Applying Lemma 12, (III) is bounded by
$$
|(\mathrm{III})| \le 2\sqrt{6\varepsilon^{-2} n_P \log(n_P + n_Q)} \qquad (44)
$$
with probability $\mathrm{R}$ at least $1 - 1/(n_P + n_Q)^2$ for all $j$. The proofs for (I) and (II) are analogous to the proof of Lemma 14, which utilizes Bernstein's inequality with the VC dimension of possible squares on $\mathbb{R}^d$. (I) is an independent sum of Bernoulli random variables with at most $\sum_{i=1}^{n_P} U^{Pj}_i$ non-zero elements. Thus, by Bernstein's inequality (Lemma 18), we have
$$
|(\mathrm{I})| \le \sqrt{\frac{\sum_{i=1}^{n_P} U^{Pj}_i \cdot \tau}{2}}
$$
with probability $P^{n_P}_{Y^P|X^P}$ at least $1 - e^{-\tau}$. The order of $p_0$ yields that $p_0 \lesssim \log(n_P \varepsilon^2 + n_Q)$. Since there are at most $2^{p_0+1}$ cells in the union of all $\pi_k$, there holds
$$
|(\mathrm{I})| \le \sqrt{\frac{\sum_{i=1}^{n_P} U^{Pj}_i \cdot \bigl(\tau + (p_0 + 1)\log 2\bigr)}{2}} \le \sqrt{\sum_{i=1}^{n_P} U^{Pj}_i \bigl(2\log(n_P + n_Q) + \log \varepsilon\bigr)} \le \sqrt{3\sum_{i=1}^{n_P} U^{Pj}_i \log(n_P + n_Q)} \qquad (45)
$$
with probability $P^{n_P}_{Y^P|X^P}$ at least $1 - 1/(n_P + n_Q)^2$. By Lemma 12, for all $j$, with probability $\mathrm{R}$ at least $1 - 1/(n_P + n_Q)^2$, we have
$$
\Bigl| \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i - \sum_{i=1}^{n_P} U^{Pj}_i \Bigr| \le 2\sqrt{3\varepsilon^{-2} n_P \log(n_P + n_Q)}.
$$
If $\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge 8\varepsilon^{-2} n_P$, there holds
$$
\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge 8\varepsilon^{-2} n_P \ge 2\sqrt{3\varepsilon^{-2} n_P \log(n_P + n_Q)} + 6\varepsilon^{-2} n_P,
$$
as well as
$$
\frac{4}{3}\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge \sum_{i=1}^{n_P} U^{Pj}_i - 2\sqrt{3\varepsilon^{-2} n_P \log(n_P + n_Q)} + \frac{1}{3}\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge \sum_{i=1}^{n_P} U^{Pj}_i - \sqrt{3\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)} + \frac{1}{3}\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge \sum_{i=1}^{n_P} U^{Pj}_i
$$
for sufficiently large $n_P$. In this case, we have
$$
|(\mathrm{I})| + |(\mathrm{III})| \le \sqrt{3\sum_{i=1}^{n_P} U^{Pj}_i \log(n_P + n_Q)} + 2\sqrt{6\varepsilon^{-2} n_P \log(n_P + n_Q)} \le 2\sqrt{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)} + 2\sqrt{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)} = 4\sqrt{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)}. \qquad (46)
$$
If $\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i < 8\varepsilon^{-2} n_P$, there holds
$$
\sum_{i=1}^{n_P} U^{Pj}_i \le \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i + 2\sqrt{3\varepsilon^{-2} n_P \log(n_P + n_Q)} \le 8\varepsilon^{-2} n_P + \frac{8}{3}\varepsilon^{-2} n_P
$$
for sufficiently large $n_P$. In this case, we get
$$
|(\mathrm{I})| + |(\mathrm{III})| \le \sqrt{3\sum_{i=1}^{n_P} U^{Pj}_i \log(n_P + n_Q)} + 2\sqrt{6\varepsilon^{-2} n_P \log(n_P + n_Q)} \le 2\sqrt{8\varepsilon^{-2} n_P \log(n_P + n_Q)} + 2\sqrt{8\varepsilon^{-2} n_P \log(n_P + n_Q)} = 4\sqrt{8\varepsilon^{-2} n_P \log(n_P + n_Q)}. \qquad (47)
$$
Combining (46) and (47), we get
$$
|(\mathrm{I})| + |(\mathrm{III})| \le 4\sqrt{\Bigl( 8\varepsilon^{-2} n_P \vee \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \Bigr) \log(n_P + n_Q)}. \qquad (48)
$$
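To see how the two cases merge into this single bound, note that (46) and (47) give, respectively,
$$
|(\mathrm{I})| + |(\mathrm{III})| \le \begin{cases} 4\sqrt{\sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \log(n_P + n_Q)}, & \text{if } \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \ge 8\varepsilon^{-2} n_P, \\ 4\sqrt{8\varepsilon^{-2} n_P \log(n_P + n_Q)}, & \text{if } \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i < 8\varepsilon^{-2} n_P, \end{cases}
$$
and in either case the right-hand side is at most $4\sqrt{(8\varepsilon^{-2} n_P \vee \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i)\log(n_P + n_Q)}$, which is exactly (48).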
Similarly to (45), we have
$$
|(\mathrm{II})| \le \lambda\sqrt{\frac{\sum_{i=1}^{n_Q} U^{Qj}_i \cdot \bigl(\tau + (p_0 + 1)\log 2\bigr)}{2}} \le \lambda\sqrt{\sum_{i=1}^{n_Q} U^{Qj}_i \bigl(2\log(n_P + n_Q) + \log \varepsilon\bigr)} \le \lambda\sqrt{3\sum_{i=1}^{n_Q} U^{Qj}_i \log(n_P + n_Q)} \qquad (49)
$$
with probability $Q^{n_Q}_{Y^Q|X^Q}$ at least $1 - 1/(n_P + n_Q)^2$. Combining (48) and (49), we get
$$
|(\mathrm{I}) + (\mathrm{II}) + (\mathrm{III})| \le \Biggl( 4\sqrt{8\varepsilon^{-2} n_P \vee \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i} + \lambda\sqrt{3\sum_{i=1}^{n_Q} U^{Qj}_i} \Biggr) \cdot \sqrt{\log(n_P + n_Q)} \le \sqrt{4C_1 \Bigl( 8\varepsilon^{-2} n_P \vee \sum_{i=1}^{n_P} \tilde{U}^{Pj}_i \Bigr) + C_1 \lambda^2 \sum_{i=1}^{n_Q} U^{Qj}_i}
$$
with probability $P^{n_P}_{Y^P|X^P} \otimes Q^{n_Q}_{Y^Q|X^Q} \otimes \mathrm{R}$ at least $1 - 3/(n_P + n_Q)^2$. Here, we define the constant $C_1$ for notational simplicity:
$$
C_1 = 8\log(n_P + n_Q).
$$
We now prove the conclusion for $k < p_0$ with a slight modification of the above argument. For the term (III), summing over all nodes with the same ancestor yields the sum of $2^{p_0 - k} \cdot n_P$ Laplace random variables, for which we have
$$
|(\mathrm{I}) + (\mathrm{II}) + (\mathrm{III})| \le \sqrt{4C_1 \Bigl( 2^{p_0 - k + 3}\varepsilon^{-2} n_P \vee \sum_{j'} \sum_{i=1}^{n_P} \tilde{U}^{Pj'}_i \Bigr) + C_1 \lambda^2 \sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}.
$$
The proof is completed by plugging this bound into the decomposition at the beginning.
B.8 Proof of Theorem 7
In this proof, the depths $p_0$, $p$, and $k$ are all integers. Without loss of generality, we simply write $p_0$, $p$, and $k$ as equal to the corresponding orders, which only causes differences in the constants.
Proof of Theorem 7 In the following proof, we give several versions of upper bounds $b$ on $\bigl|\tilde{\eta}^{\mathrm{prune}}_{P,Q}(x) - \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\mathrm{prune}}_{P,Q}(x)\bigr]\bigr|$ for each fixed $x \in \Omega_n$. Then, we consider the region $\Omega_n$, which is
$$
\Omega_n := \Bigl\{ x \in \mathcal{X} : \bigl|\eta_P(x) - \tfrac{1}{2}\bigr| \ge c^* \cdot b \Bigr\}.
$$
For some $c^*$, by techniques analogous to the proof of Theorem 4, we can show that every $x$ in $\Omega_n$ is assigned the same label as by the Bayes classifier. For each fixed $x$, let $A_{j_k}$ denote the partition cell in $\pi_k$ that contains $x$. We will discuss the stopping time of the pruning process and the resulting point-wise error. We prove the conclusion in two cases respectively.
Case (i): $\frac{n_P \varepsilon^2}{\log(n_P + n_Q)} \lesssim \Bigl( \frac{n_Q}{\log(n_P + n_Q)} \Bigr)^{\frac{2\alpha + 2d}{2\gamma\alpha + d}}$. In this case, $\delta_1 \lesssim n_Q^{-\frac{d}{2\gamma\alpha + d}}$ and $p_0 = \bigl\lfloor \frac{d}{2 + 2d} \log_2\bigl( n_P \varepsilon^2 + n_Q^{\frac{2+2d}{d}} \bigr) \bigr\rfloor$. Without loss of generality, we let $\eta_P(x) - \frac{1}{2} \ge 0$. Following the analysis in Section A.2, the final statistics will satisfy
$$
\Bigl| \tilde{\eta}^{\pi_k}_{P,Q}(x) - \tfrac{1}{2} \Bigr| \ge r_{j_k} = \sqrt{ \Bigl( \sum_{j'} \sum_{i=1}^{n_P} \tilde{U}^{Pj'}_i \Bigr)^2 u_k \Bigl( \tilde{\eta}^{\pi_k}_{P}(x) - \tfrac{1}{2} \Bigr)^2 + \sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i \Bigl( \widehat{\eta}^{\pi_k}_{Q}(x) - \tfrac{1}{2} \Bigr)^2 } \ge \sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \widehat{\eta}^{\pi_k}_{Q}(x) - \tfrac{1}{2} \Bigr|.
$$
By Proposition 6, there exists a constant $C_p$ such that
$$
\Bigl| \widehat{\eta}^{\pi_k}_{Q}(x) - \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] \Bigr| \le C_p \sqrt{\frac{\log(n_P + n_Q)}{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}} \qquad (50)
$$
with probability $P^{n_P}_{Y^P|X^P} \otimes Q^{n_Q}_{Y^Q|X^Q} \otimes \mathrm{R}$ at least $1 - 3/(n_P + n_Q)^2$. Consider the region
$$
\Omega_n := \Bigl\{ x \in \mathcal{X} : \bigl|\eta_P(x) - \tfrac{1}{2}\bigr| \ge C^* \cdot \delta^{\alpha/d} \Bigr\}.
$$
Let $2^{-k} = \delta$. For each $x \in \Omega_n$, by Lemma 10, there exist constants $C_{p1}, C_{p2}$ such that
$$
\min_{x \in A_{j_k}} \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| \ge C_p^{-1} \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| \ge C_{p1}\bigl( C^* - C_{p2} \bigr) \delta^{\gamma\alpha/d}. \qquad (51)
$$
Also, by Lemma 13, there exists a constant $C_{p3}$ such that
$$
\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i \ge C_{p3}\, \delta \cdot n_Q \qquad (52)
$$
with probability $P^{n_P}_{X^P} \otimes Q^{n_Q}_{X^Q}$ at least $1 - 1/(n_P + n_Q)^2$. (51) and (52) together lead to
$$
\sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| \ge C_{p1} C_{p3}^{1/2} \bigl( C^* - C_{p2} \bigr) \sqrt{n_Q} \cdot \delta^{\frac{2\gamma\alpha + d}{2d}} \ge 2^{-\frac{2\gamma\alpha + d}{2d}} C_{p1} C_{p3}^{1/2} \bigl( C^* - C_{p2} \bigr) \sqrt{\log(n_P + n_Q)}. \qquad (53)
$$
For a sufficiently large constant $C^*$, (53) yields that the choice of $2^{-k} = \delta$ leads to
$$
\sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| \ge 3C_p \sqrt{\log(n_P + n_Q)}. \qquad (54)
$$
Combining (50) and (54), there holds
$$
\sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \widehat{\eta}^{\pi_k}_{Q}(x) - \tfrac{1}{2} \Bigr| \ge \sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] - \tfrac{1}{2} \Bigr| - \sqrt{\sum_{j'} \sum_{i=1}^{n_Q} U^{Qj'}_i}\; \Bigl| \widehat{\eta}^{\pi_k}_{Q}(x) - \mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_k}_{Q}(x)\bigr] \Bigr| > C_p \sqrt{\log(n_P + n_Q)}
$$
for the choice of $2^{-k} = \delta_1$. Thus, for each $x \in \Omega_n$ with $x \in A_j^{(p_0)}$, the terminating condition will be triggered no later than at $2^{-k} = \delta_1$, indicating that $2^{-k_j} \ge \delta_1$ with probability $P^{n_P} \otimes Q^{n_Q} \otimes \mathrm{R}$ at least $1 - 4/(n_P + n_Q)^2$. Thus, in this case, we have
$$
\mathbb{E}_{Y|X}\Bigl[\widehat{\eta}^{\pi_{k_j}}_{P,Q}(x)\Bigr] - \tfrac{1}{2} = \mathbb{E}_{Y|X}\Bigl[\widehat{\eta}^{\pi_{k_j}}_{P,Q}(x)\Bigr] - \tilde{\eta}^{\pi_{k_j}}_{P,Q}(x) + \tilde{\eta}^{\pi_{k_j}}_{P,Q}(x) - \tfrac{1}{2} \ge -r_{j_k} + r_{j_k} \ge 0, \qquad (55)
$$
which implies that the signs of $\mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_{k_j}}_{P,Q}(x)\bigr] - \frac{1}{2}$ and $\tilde{\eta}^{\pi_{k_j}}_{P,Q}(x) - \frac{1}{2}$ are identical. By steps analogous to the proof of Proposition 10, there holds $\mathbb{E}_{Y|X}\bigl[\widehat{\eta}^{\pi_{k_j}}_{P,Q}(x)\bigr] - \frac{1}{2} > 0$ for $x \in \Omega_n$. Thus we show that, for each $x \in \Omega_n$ with $x \in A_j^{(p_0)}$, the label is assigned correctly, in agreement with the Bayes classifier. Therefore, by Assumption 2, we have
$$
\mathcal{R}_P\bigl(\tilde{f}^{\mathrm{prune}}_{P,Q}\bigr) - \mathcal{R}_P^* = \int_{\mathcal{X}} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, \mathbf{1}\Bigl\{ \tilde{f}^{\pi}_{P,Q}(x) \ne f_P^*(x) \Bigr\}\, dP_X(x) \le \int_{\Omega_n^c} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, dP_X(x) \lesssim \delta^{(\beta+1)\alpha/d}.
$$
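The last step can be unpacked as follows (a short sketch; it reads Assumption 2 as the margin condition with exponent $\beta$ for $P$, as used here): on $\Omega_n^c$ the integrand is at most $C^* \delta^{\alpha/d}$, so
$$
\int_{\Omega_n^c} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, dP_X(x) \le C^* \delta^{\alpha/d} \cdot P_X\Bigl( \bigl| \eta_P(X) - \tfrac{1}{2} \bigr| < C^* \delta^{\alpha/d} \Bigr) \lesssim \delta^{\alpha/d} \cdot \bigl( \delta^{\alpha/d} \bigr)^{\beta} = \delta^{(\beta+1)\alpha/d}.
$$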
Case (ii): $\frac{n_P \varepsilon^2}{\log(n_P + n_Q)} \gtrsim \Bigl( \frac{n_Q}{\log(n_P + n_Q)} \Bigr)^{\frac{2\alpha + 2d}{2\gamma\alpha + d}}$. We first note that, with $n_P^{\frac{\alpha}{2\alpha + 2d}} \lesssim \varepsilon \lesssim n_P^{\frac{d}{4 + 2d}}$, there holds $2^{p_0 - k + 3}\varepsilon^{-2} n_P \ge \sum_{j'} \sum_{i=1}^{n_P} \tilde{U}^{Pj'}_i$ with high probability, i.e., Algorithm 2 will evaluate the condition in the first if block as true. Since $\varepsilon \lesssim n_P^{\frac{d}{4 + 2d}}$, there holds $2^{p_0} \ge (n_P \varepsilon^2)^{\frac{d}{2 + 2d}} \gtrsim \varepsilon^2$. By Lemma 13, when $\lambda = 0$, we have $\frac{1}{n_P} \sum_{j'} \sum_{i=1}^{n_P} U^{Pj'}_i \lesssim 2^{-k}$. Together, we have
$$
2^{p_0 - k + 3}\varepsilon^{-2} n_P \gtrsim \sum_{j'} \sum_{i=1}^{n_P} U^{Pj'}_i. \qquad (56)
$$
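Concretely, (56) is a one-line check from the two facts just stated:
$$
\sum_{j'} \sum_{i=1}^{n_P} U^{Pj'}_i \lesssim 2^{-k} n_P = 2^{p_0 - k + 3}\varepsilon^{-2} n_P \cdot \frac{\varepsilon^2}{2^{p_0 + 3}} \lesssim 2^{p_0 - k + 3}\varepsilon^{-2} n_P,
$$
where the last step uses $2^{p_0} \gtrsim \varepsilon^2$.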
For the Laplace noise, there holds
$$
2^{p_0 - k}\varepsilon^{-2} n_P \gtrsim \sqrt{2^{p_0 - k}\varepsilon^{-2} n_P \log(n_P + n_Q)} \gtrsim \frac{4}{\varepsilon} \Bigl| \sum_{j'} \sum_{i=1}^{n_P} \zeta^{Pj'}_i \Bigr| \qquad (57)
$$
due to Lemma 12. Adding (56) and (57) shows that $2^{p_0 - k + 3}\varepsilon^{-2} n_P \ge \sum_{j'} \sum_{i=1}^{n_P} \tilde{U}^{Pj'}_i$ holds with probability $1 - 2/(n_P + n_Q)^2$. Under this case, if $k_j > \bigl\lfloor \frac{d}{2 + 2d} \log_2(n_P \varepsilon^2) \bigr\rfloor$ for all $j \in \mathcal{I}_{p_0}$, we end up with $\tilde{\eta}^{\mathrm{prune}}_{P,Q} = \sum_j s_{j_k} \cdot \mathbf{1}\{A_j^{(p_0)}\}$. We define
$$
\Omega_n := \Bigl\{ x \in \mathcal{X} : \bigl| \eta_P(x) - \tfrac{1}{2} \bigr| \ge C^* \cdot \delta^{\alpha(\alpha + d)/d(1 + d)} \Bigr\}
$$
for some $C^*$ large enough. Assume $\eta_P(x) > 1/2$. If $x_0$ belongs to the same cell $A_{k_j}$ as $x$, then
$$
\eta_P(x_0) - \tfrac{1}{2} \ge \eta_P(x) - \tfrac{1}{2} - C_0 \cdot 2^{-k_j \alpha/d} \ge \eta_P(x) - \tfrac{1}{2} - C_0\, \delta^{\alpha(\alpha + d)/d(1 + d)} \ge 0
$$
for some $C_0$. Analogous to (55), all $x \in \Omega_n$ are correctly classified, and thus
$$
\mathcal{R}_P\bigl(\tilde{f}^{\mathrm{prune}}_{P,Q}\bigr) - \mathcal{R}_P^* = \int_{\mathcal{X}} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, \mathbf{1}\Bigl\{ \tilde{f}^{\pi}_{P,Q}(x) \ne f_P^*(x) \Bigr\}\, dP_X(x) \le \int_{\Omega_n^c} \Bigl| \eta_P(x) - \tfrac{1}{2} \Bigr|\, dP_X(x) \lesssim \delta^{(\beta+1)\alpha(\alpha + d)/d(1 + d)}.
$$
If there exists some $j \in \mathcal{I}_{p_0}$ such that the termination is triggered, i.e., we end up with $\tilde{f}^{\pi_p}_{P}$, the proof becomes analogous to that of Theorem 4. Specifically, the trade-off (30) will have a larger approximation error $\delta^{\alpha(\alpha + d)/d(1 + d)}$ and smaller sample error as well as privacy error. Thus, the final rate is dominated by the approximation error, which becomes $\delta^{(\beta+1)\alpha(\alpha + d)/d(1 + d)}$. In Case (ii), all of the above events jointly happen with probability $P^{n_P} \otimes Q^{n_Q} \otimes \mathrm{R}$ at least $1 - 4/(n_P + n_Q)^2$.
Appendix C. Experiment Details
C.1 Real Datasets Pre-processing
We present additional information on the data sets, including data sources and pre-processing details. The data sets are sourced from the UCI Machine Learning Repository (Dua and Graff, 2017) and Kaggle.
Anonymity: The Anonymity data set (piAI, 2019) contains poll results about Internet privacy from a large number of people, including whether the interviewee believes United States law provides reasonable privacy protection for Internet users and how conservative the interviewee is about Internet privacy. We use the other variables to predict whether the interviewee has ever tried to mask his/her identity when using the Internet. We randomly select 50 samples as the public data.
Diabetes: This data set from Pore (2023) contains a diverse range of health-related
attributes, meticulously collected to aid in the development of predictive models for identi-
fying individuals at risk of diabetes. We randomly select 200 samples as the public data.
Email: The data set is used for spam or not-spam classification (Biswas, 2019). For each
row, the count of each word (column) in that email (row) is stored in the respective cells.
55
Ma and Yang
Thus, information regarding all emails is stored in a compact data frame. We randomly
select 200 samples as the public data.
Employ: The data set (Hamoutn, 2022) was collected from different agencies in the Philippines and consists of mock interview performances of students. The final task is to decide whether the student got employed. We randomly select 200 samples as the public data.
Rice: The Rice data set (Cammeo and Osmancik, 2019) contains morphological features of two varieties of rice grains, which are used to classify the variety. We randomly select 300 samples as the public data.
Census: This data set is the well-known census data (Becker and Kohavi, 1996) and
is used to predict whether income exceeds 50,000 per year. We select the samples whose
native country is the U.S. to be public data and others to be private data.
Election: This data set from (Winner and Thacker, 2022) contains county-level demo-
graphics and whether or not Bill Clinton won each county during the 1992 U.S. presidential
election. The goal of this data set is to predict if Clinton won a county. We use samples
whose state is TX as public data and the rest as private data.
Employee: The Employee data (Elmetwally, 2023) contains information about employees
in a company, including their educational backgrounds, work history, and employment-
related factors. We use these variables to predict whether the employee leaves or not. We
select the sample to be public if the employee joined the company before 2012.
Jobs: The data (Tankha, 2023) contains a comprehensive collection of information re-
garding job applicants and their respective employability scores to predict whether the
applicant has been hired. We select the samples whose native country is the U.S. to be
public data and others to be private data.
Landcover: The data is from Johnson and Iizuka (2016), which contains Landsat time-
series satellite imagery information on given pixels and their corresponding land cover class
labels (farm, forest, water, etc.) obtained from multiple sources. We classify whether the
image comes from a farm or a forest. Within this data set, there are two kinds of label
sources: labels obtained from OpenStreetMap and labels that are accurately labeled by
experts. We take the latter as public data and the former as private data.
Taxi: The Taxi data set was obtained from the Differential Privacy Temporal Map
Challenge (DrivenData, 2021), which aims to develop algorithms that preserve data util-
ity while guaranteeing individual privacy protection. The data set contains quantitative
and categorical information about taxi trips in Chicago, including time, distance, location,
payment, and service provider. These features include the identification number of taxis
(taxi id), time of each trip (seconds), the distance of each trip (miles), index of the zone
where the trip starts (pca), index of the zone where the trip ends (dca), service provider
(company), the method used to pay for the trip (payment type), amount of tips (tips) and
fares (fare). We use the other variable to predict whether the passenger is paying by credit
card or by cash. We select data from one company as public data and others as private
data.
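The data sets above are split into public and private parts in one of two ways: random subsampling of a fixed number of public samples, or splitting on an attribute (native country, state, company, joining year, or label source). The following pandas sketch illustrates both patterns; the file and column names (census.csv, native-country, diabetes.csv) are placeholders for illustration and not the exact identifiers used in our code.

import pandas as pd

def random_split(df: pd.DataFrame, n_public: int, seed: int = 42):
    # Randomly hold out n_public rows as the public set (e.g., Diabetes, Rice).
    public = df.sample(n=n_public, random_state=seed)
    private = df.drop(public.index)
    return public, private

def attribute_split(df: pd.DataFrame, column: str, public_values):
    # Rows whose `column` value lies in `public_values` form the public set
    # (e.g., Census: native country is the U.S.; Taxi: a single service provider).
    mask = df[column].isin(public_values)
    return df[mask], df[~mask]

if __name__ == "__main__":
    census = pd.read_csv("census.csv")  # hypothetical file name
    census_public, census_private = attribute_split(
        census, column="native-country", public_values=["United-States"])

    diabetes = pd.read_csv("diabetes.csv")  # hypothetical file name
    diabetes_public, diabetes_private = random_split(diabetes, n_public=200)

    print(len(census_public), len(census_private))
    print(len(diabetes_public), len(diabetes_private))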
References
Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal
Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016
ACM SIGSAC conference on computer and communications security, 2016.
Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public
data. Advances in neural information processing systems, 32, 2019.
Ehsan Amid, Arun Ganesh, Rajiv Mathews, Swaroop Ramaswamy, Shuang Song, Thomas
Steinke, Vinith M Suriyakumar, Om Thakkar, and Abhradeep Thakurta. Public data-
assisted mirror descent for private model training. In International Conference on Ma-
chine Learning. PMLR, 2022.
Jean-Yves Audibert and Alexandre B Tsybakov. Fast learning rates for plug-in classifiers.
Annals of statistics, 35(2):608–633, 2007.
Raef Bassily, Om Thakkar, and Abhradeep Guha Thakurta. Model-agnostic private learn-
ing. Advances in Neural Information Processing Systems, 31, 2018.
Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, and Steven
Wu. Private query release assisted by public data. In International Conference on Ma-
chine Learning, pages 695–703. PMLR, 2020.
Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. URL
https://doi.org/10.24432/C5XW20.
Shai Ben-David, Alex Bie, Clement Louis Canonne, Gautam Kamath, and Vikrant Singhal.
Private distribution learning with public data: The view from sample compression. In
Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Thomas Berrett and Cristina Butucea. Classification under local differential privacy. arXiv
preprint arXiv:1912.04629, 2019.
Thomas B Berrett, László Györfi, and Harro Walk. Strongly universally consistent nonpara-
metric regression and classification with privatised data. Electronic Journal of Statistics,
15:2430–2453, 2021.
Tom Berrett and Yi Yu. Locally private online change point detection. Advances in Neural
Information Processing Systems, 34:3425–3437, 2021.
Alex Bie, Gautam Kamath, and Vikrant Singhal. Private estimation with public data. In
Advances in Neural Information Processing Systems, 2022.
Balaka Biswas. Email spam classification dataset csv, 2019. URL https://www.kaggle.
com/datasets/balaka18/email-spam-classification-dataset-csv/data.
L Breiman. Classification and regression trees. The Wadsworth & Brooks/Cole, 1984.
Cristina Butucea, Amandine Dubois, Martin Kroll, and Adrien Saumard. Local differential
privacy: Elbow effect in optimal density estimation and adaptation over besov ellipsoids.
Bernoulli, 26(3):1727–1764, 2020.
T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax
rate and adaptive classifier. The Annals of Statistics, 49(1), 2021.
Yuchao Cai, Yuheng Ma, Yiwei Dong, and Hanfang Yang. Extrapolated random tree for
regression. In Proceedings of the 40th International Conference on Machine Learning,
pages 3442–3468. PMLR, 2023.
Cammeo and Osmancik. Rice. UCI Machine Learning Repository, 2019. URL https:
//doi.org/10.24432/C5MW4Z.
Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor
classification. Advances in Neural Information Processing Systems, 27, 2014.
Kamalika Chaudhuri and Staal A Vinterbo. A stability-based validation procedure for dif-
ferentially private machine learning. Advances in Neural Information Processing Systems,
26, 2013.
Yuval Dagan and Vitaly Feldman. Interaction is necessary for distributed learning with
privacy or communication constraints. In Proceedings of the 52nd Annual ACM SIGACT
Symposium on Theory of Computing, pages 450–462, 2020.
Amit Daniely and Vitaly Feldman. Locally private learning without interaction requires
separation. Advances in neural information processing systems, 32, 2019.
DrivenData. Differential privacy temporal map challenge: Sprint 3 (pre-
screened arena), 2021. URL https://www.drivendata.org/competitions/77/
deid2-sprint-3-prescreened/page/332/.
Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://
archive.ics.uci.edu/ml.
John Duchi, Michael Jordan, and Martin Wainwright. Minimax optimal procedures for
locally private estimation. Journal of the American Statistical Association, 113(521):
182–201, 2018.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to
sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284.
Springer, 2006.
Tawfik Elmetwally. Employee dataset, 2023. URL https://www.kaggle.com/datasets/
tawfikelmetwally/employee-dataset/data.
Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable
privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference
on computer and communications security, pages 1054–1067, 2014.
Arun Ganesh, Mahdi Haghifam, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar,
Abhradeep Guha Thakurta, and Lun Wang. Why is public pretraining necessary for
private model training? In International Conference on Machine Learning, pages 10611–
10627. PMLR, 2023.
Xin Gu, Gautam Kamath, and Zhiwei Steven Wu. Choosing public datasets for private
machine learning via gradient subspace distance. arXiv preprint arXiv:2303.01256, 2023.
László Györfi and Martin Kroll. On rate optimal private regression under local differential
privacy. arXiv preprint arXiv:2206.00114, 2022.
Anas Hamoutn. Students’ employability dataset - philippines, 2022. URL https://www.
kaggle.com/datasets/anashamoutni/students-employability-dataset/data.
Brian A Johnson and Kotaro Iizuka. Integrating openstreetmap crowdsourced data and
landsat time-series imagery for rapid land use/land cover (lulc) mapping: Case study of
the laguna de bay area of the philippines. Applied Geography, 67:140–149, 2016.
Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Steven Z Wu. Locally private
gaussian estimation. Advances in Neural Information Processing Systems, 32, 2019.
Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local dif-
ferential privacy. Advances in neural information processing systems, 27, 2014.
Peter Kairouz, Monica Ribero Diaz, Keith Rush, and Abhradeep Thakurta. (nearly) di-
mension independent private erm with adagrad rates via publicly estimated subspaces.
In Conference on Learning Theory, pages 2717–2746. PMLR, 2021.
Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference.
Springer Series in Statistics. Springer, New York, 2008.
Martin Kroll. On density estimation at a fixed point under local differential privacy. Elec-
tronic Journal of Statistics, 15:1783–1813, 2021.
OV Lepskii. Asymptotically minimax adaptive estimation. i: Upper bounds. optimally
adaptive estimates. Theory of Probability & Its Applications, 36(4):682–697, 1992.
Xuechen Li, Florian Tramèr, Percy Liang, and Tatsunori Hashimoto. Large language models
can be strong differentially private learners. In The Tenth International Conference on
Learning Representations, 2022.
Chong Liu, Yuqing Zhu, Kamalika Chaudhuri, and Yu-Xiang Wang. Revisiting model-
agnostic private learning: Faster rates and active learning. The Journal of Machine
Learning Research, 22(1):11936–11979, 2021.
Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings
of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.
Andrew Lowy, Zeman Li, Tianjian Huang, and Meisam Razaviyayn. Optimal differentially
private learning with public data. arXiv preprint arXiv:2306.15056, 2023.
Yuheng Ma, Han Zhang, Yuchao Cai, and Hanfang Yang. Decision tree for locally pri-
vate estimation with public data. In Thirty-seventh Conference on Neural Information
Processing Systems, 2023.
Shubhankar Mohapatra, Sajin Sasy, Xi He, Gautam Kamath, and Om Thakkar. The role
of adaptive optimizers for honest private hyperparameter selection. In Proceedings of the
aaai conference on artificial intelligence, volume 36, pages 7806–7813, 2022.
Milad Nasr, Saeed Mahloujifar, Xinyu Tang, Prateek Mittal, and Amir Houmansadr. Ef-
fectively using public data in privacy preserving machine learning. In Proceedings of the
40th International Conference on Machine Learning, volume 202. PMLR, 2023.
Nicolas Papernot and Thomas Steinke. Hyperparameter tuning with renyi differential pri-
vacy. In International Conference on Learning Representations, 2021.
Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar.
Semi-supervised knowledge transfer for deep learning from private training data. In
International Conference on Learning Representations, 2017.
Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and
Ulfar Erlingsson. Scalable private learning with pate. In International Conference on
Learning Representations, 2018.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-
del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
piAI. Internet privacy poll, 2019. URL https://www.kaggle.com/datasets/econdata/
internet-privacy-poll.
Nandita Pore. Healthcare diabetes dataset, 2023. URL https://www.kaggle.com/
datasets/nanditapore/healthcare-diabetes/data.
Swaroop Ramaswamy, Om Thakkar, Rajiv Mathews, Galen Andrew, H Brendan McMahan,
and Françoise Beaufays. Training production language models without memorizing user
data. arXiv preprint arXiv:2009.10031, 2020.
Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A
survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv
preprint arXiv:2004.11829, 2020.
Richard J Samworth. Optimal weighted nearest neighbour classifiers. The Annals of Statis-
tics, pages 2733–2763, 2012.
Adam Smith, Abhradeep Thakurta, and Jalaj Upadhyay. Is interaction necessary for dis-
tributed private learning? In 2017 IEEE Symposium on Security and Privacy (SP), pages
58–77. IEEE, 2017.
Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with
differentially private updates. In 2013 IEEE global conference on signal and information
processing, pages 245–248. IEEE, 2013.
Jinyan Su, Jinhui Xu, and Di Wang. On pac learning halfspaces in non-interactive local
privacy model with public unlabeled data. In Asian Conference on Machine Learning,
pages 927–941. PMLR, 2023.
Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Pri-
vacy loss in apple’s implementation of differential privacy on macos 10.12. arXiv preprint
arXiv:1709.02753, 2017.
Ayush Tankha. Employability classification of over 70,000 job appli-
cants, 2023. URL https://www.kaggle.com/datasets/ayushtankha/
70k-job-applicants-data-human-resource/data.
Florian Tramèr, Gautam Kamath, and Nicholas Carlini. Considerations for differentially
private learning with large-scale public pretraining. arXiv preprint arXiv:2212.06470,
2022.
Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009. ISBN 978-0-387-79051-0. doi: 10.1007/b13794. URL https://doi.org/10.1007/b13794. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.
Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48.
Cambridge University Press, 2019.
Di Wang and Jinhui Xu. Principal component analysis in the local differential privacy
model. Theoretical computer science, 809:296–312, 2020.
Di Wang, Marco Gaboardi, and Jinhui Xu. Empirical risk minimization in non-interactive
local differential privacy revisited. Advances in Neural Information Processing Systems,
31, 2018.
Di Wang, Lijie Hu, Huanyu Zhang, Marco Gaboardi, and Jinhui Xu. Generalized linear
models in non-interactive local differential privacy with public data. Journal of Machine
Learning Research, 24(132):1–57, 2023a.
Hua Wang, Sheng Gao, Huanyu Zhang, Weijie J Su, and Milan Shen. Dp-hypo: An adaptive
private framework for hyperparameter optimization. In Thirty-seventh Conference on
Neural Information Processing Systems, 2023b.
Jun Wang and Zhi-Hua Zhou. Differentially private learning with small public data. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020.
Stanley L Warner. Randomized response: A survey technique for eliminating evasive answer
bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
Frank Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in statistics,
pages 196–202. Springer, 1992.
Larry Winner and Jared Thacker. 1992 u.s presidential election, 2022. URL https://www.
kaggle.com/datasets/jwt0024/1992-us-presidential-election.
Xiaotong Wu, Lianyong Qi, Jiaquan Gao, Genlin Ji, and Xiaolong Xu. An ensemble of
random decision trees with local differential privacy in edge computing. Neurocomputing,
485:181–195, 2022.
Da Yu, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. Do not let privacy overbill utility:
Gradient embedding perturbation for private learning. In International Conference on
Learning Representations, 2021a.
Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. Large scale private learning
via low-rank reparametrization. In International Conference on Machine Learning, pages
12208–12218. PMLR, 2021b.
Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath,
Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin,
and Huishuai Zhang. Differentially private fine-tuning of language models. In The Tenth
International Conference on Learning Representations, 2022.
Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zinan Lin, Saurabh Naik, Tomasz Lukasz
Religa, Jian Yin, and Huishuai Zhang. Selective pre-training for private fine-tuning.
arXiv preprint arXiv:2305.13865, 2023.
Kai Zheng, Wenlong Mou, and Liwei Wang. Collect at once, use effectively: Making non-
interactive locally private learning possible. In International Conference on Machine
Learning, pages 4130–4139. PMLR, 2017.
Yingxue Zhou, Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Pri-
vate sgd with gradient subspace identification. In International Conference on Learning
Representations, 2021.