Optimal Locally Private Nonparametric Classification with Public Data
Table 1: Information of real data sets.

Dataset     P = Q    n_P          n_Q        d      Area
Anonymity     ✓            453        50     14     Social
Diabetes      ✓          2,054       200      8     Medical
Email         ✓          3,977       200  3,000     Text
Employ        ✓          2,225       200      8     Business
Rice          ✓            240     3,510      7     Image
Census        ×         41,292     3,144     46     Social
Election      ×          1,752       222      9     Social
Employee      ×          3,319       504      9     Business
Jobs          ×         46,218    14,318     11     Social
Landcover     ×          7,097       131     28     Image
Taxi          ×      3,297,639   117,367     93     Social
As in Section 4.3, we min-max scale each dataset to [0, 1]^d according to the public data. Detailed data information and pre-processing procedures are provided in Appendix C.1.
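The scaling step above can be sketched as follows; the function and variable names are ours, and clipping private points that fall outside the public range to [0, 1] is an assumption, not necessarily the paper's exact procedure:

```python
import numpy as np

def fit_minmax(X_public):
    """Compute per-feature min/max from the public data only."""
    lo = X_public.min(axis=0)
    hi = X_public.max(axis=0)
    return lo, hi

def apply_minmax(X, lo, hi):
    """Scale features to [0, 1] using the public statistics; clip
    values that fall outside the public range (an assumption)."""
    scale = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return np.clip((X - lo) / scale, 0.0, 1.0)

# toy example with hypothetical public/private samples
X_pub = np.array([[0.0, 10.0], [4.0, 30.0]])
X_priv = np.array([[2.0, 20.0], [6.0, 5.0]])
lo, hi = fit_minmax(X_pub)
X_priv_scaled = apply_minmax(X_priv, lo, hi)
```

Fitting the scaler on the public data alone matters for privacy: the private users' raw minima and maxima are never released.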
7.2 Competitors
We compare LPCT with several state-of-the-art competitors. For each model, we report
the best result over its parameter grids based on an average of 20 replications. The tested
methods are
• LPCT is described in Section 6. In the real data experiments, we enlarge the grids
to p ∈ {1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16} and λ ∈ {0.1, 0.5, 1, 2, 5, 10, 50, 100, 200,
300, 400, 500, 750, 1000, 1250, 1500, 2000} due to the large sample size. Since the max-
edge rule is easily affected by useless or categorical features (Ma et al., 2023),
we boost the performance of LPCT by incorporating the criterion reduction scheme
from the original CART (Breiman, 1984) into the tree construction. Since Proposition
2 holds for any partition, the method remains ε-LDP. We choose the Gini index as the
reduction criterion.
• LPCT-prune is the pruned locally private classification tree in Algorithm 2. We also
adopt the criterion reduction scheme for tree construction.
• CT is the original classification tree (Breiman, 1984) with the same grid as LPCT.
We consider two variants, CT-W and CT-Q. For CT-W, we fit CT on the whole
data, namely the P and Q data together, with no privacy protection; its accuracy
serves as an upper bound. For CT-Q, we fit CT on the Q data only and include its
performance in the comparison. We use the implementation from sklearn (Pedregosa
et al., 2011) with default parameters while varying the maximum depth in {1, 2, ··· , 16}.
• PHIST is the locally differentially private histogram estimator proposed by Berrett
and Butucea (2019). We fit PHIST solely on the private data, as it does not leverage
public data. We choose the number of splits h^{-1} along each axis in {1, 2, 3, 4, 5, 6}.
• LPDT is the locally private decision tree originally proposed for regression (Ma et al.,
2023). The approach creates a partition on Q data and utilizes P data for estimation.
We adapt the strategy to classification. We select p in the same range as LPCT.
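The non-private CT-W/CT-Q baselines can be sketched with sklearn as below; the toy data, the use of cross-validation for depth selection, and all variable names are illustrative rather than the paper's exact replication protocol:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))            # stand-in for the scaled features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # stand-in labels

# sweep the max-depth grid {1, ..., 16} with otherwise default parameters
best_depth, best_score = None, -np.inf
for depth in range(1, 17):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_depth, best_score = depth, score
```

For CT-W one would fit on the pooled P and Q samples; for CT-Q, on the Q samples alone.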
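The PHIST baseline is, in spirit, a histogram whose per-user reports are privatized before aggregation. A minimal sketch of the frequency-estimation step, assuming data scaled to [0, 1]^d (function and variable names are ours; the exact mechanism and label handling in Berrett and Butucea (2019) differ):

```python
import numpy as np

def phist_counts(X, h_inv, eps, rng):
    """Locally private histogram on [0, 1]^d: each user one-hot encodes
    the bin containing their point and adds Laplace(2/eps) noise to
    every coordinate before release (a simplified sketch)."""
    n, d = X.shape
    # per-axis bin index in {0, ..., h_inv - 1}
    bins = np.minimum((X * h_inv).astype(int), h_inv - 1)
    flat = np.ravel_multi_index(bins.T, (h_inv,) * d)  # flatten to one bin id
    counts = np.zeros(h_inv ** d)
    for idx in flat:
        onehot = np.zeros(h_inv ** d)
        onehot[idx] = 1.0
        # each user releases a noisy one-hot vector; the server sums them
        counts += onehot + rng.laplace(scale=2.0 / eps, size=h_inv ** d)
    return counts

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
noisy = phist_counts(X, h_inv=4, eps=2.0, rng=rng)
```

Since the noise is added on each user's device before aggregation, the released vectors, not just the final histogram, satisfy the local privacy constraint.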