Published as a conference paper at ICLR 2024
THE IMPORTANCE OF FEATURE PREPROCESSING FOR
DIFFERENTIALLY PRIVATE LINEAR OPTIMIZATION
Ziteng Sun, Ananda Theertha Suresh, Aditya Krishna Menon
Google Research, New York
{zitengsun,theertha,adityakmenon}@google.com
ABSTRACT
Training machine learning models with differential privacy (DP) has received in-
creasing interest in recent years. One of the most popular algorithms for training
differentially private models is differentially private stochastic gradient descent
(DPSGD) and its variants, where at each step gradients are clipped and combined
with some noise. Given the increasing usage of DPSGD, we ask the question: is
DPSGD alone sufficient to find a good minimizer for every dataset under privacy
constraints? Towards answering this question, we show that even for the simple
case of linear classification, unlike non-private optimization, (private) feature pre-
processing is vital for differentially private optimization. In detail, we first show
theoretically that there exists an example where without feature preprocessing,
DPSGD incurs an optimality gap proportional to the maximum Euclidean norm
of features over all samples. We then propose an algorithm called DPSGD-F,
which combines DPSGD with feature preprocessing and prove that for classifica-
tion tasks, it incurs an optimality gap proportional to the diameter of the features
max
x,x
0
D
kx x
0
k
2
. We finally demonstrate the practicality of our algorithm on
image classification benchmarks.
1 INTRODUCTION
Differential privacy (DP) (Dwork et al., 2014) has emerged as one of the standards of privacy in
machine learning and statistics. In machine learning, differentially private methods have been used
to train models for language modelling (McMahan et al., 2018), image classification (De et al.,
2022), generative diffusion (Ghalebikesabi et al., 2023), and private fine-tuning Yu et al. (2021); Li
et al. (2021). We refer readers to Ponomareva et al. (2023) for a detailed survey of techniques and
current state of the art methods in private optimization. Following the influential work of Abadi
et al. (2016), differentially private stochastic gradient descent (DPSGD) has emerged as one of the
most popular algorithms for training private machine learning models and achieves state of the art
results in several datasets (De et al., 2022; Ghalebikesabi et al., 2023).
There has also been a recent line of work that focuses on analyzing the theoretical performance of
DPSGD and its variants, with a particular focus on convex models (Bassily et al., 2019; Feldman
et al., 2020; Bassily et al., 2020; 2021b;a; Song et al., 2021b; Arora et al., 2022). It has been
shown that DPSGD and its variants can achieve min-max optimal rates for the task of DP empirical
risk minimization and DP stochastic convex optimization under various geometries. Moreover, for
convex generalized linear models, DPSGD has also been shown to achieve dimension-independent
convergence rate (Song et al., 2021b; Arora et al., 2022). This observation can be extended to general
convex models under certain assumptions (Li et al., 2022).
Despite the practical success and appealing theoretical properties of DPSGD, recent empirical results
have shown that they may not learn good intermediate features in deep learning image classification
tasks (Tramer & Boneh, 2021). In fact, it has been observed in Abadi et al. (2016) that performing
a private PCA on the features before performing DPSGD can improve the performance of private
training, which highlights that learned features may not be good. This raises a fundamental question:
Can private feature preprocessing provably improve DPSGD?
There is no clear intuitive answer to this question. On the one hand, feature preprocessing accelerates
the convergence of gradient methods in optimization (LeCun et al., 2002), which can help private
1
Published as a conference paper at ICLR 2024
optimization since the number of steps is constrained by the privacy requirement. On the other hand,
private feature preprocessing will use a portion of the privacy budget, thus decreasing the privacy
budget for the DPSGD phase.
As a first step towards answering this question, we show that for the simple task of linear private
classification, unlike non-private convex optimization, feature preprocessing is necessary for private
optimization to achieve near instance-optimal results. Our findings are as follows.
1. We provide an example where DPSGD with any clipping norm, batch size, and learning
rate incurs an error proportional to the maximum Euclidean
1
norm of feature vectors (Sec-
tion 4).
2. We propose DPSGD-F, a new algorithm that combines both DPSGD and feature prepro-
cessing, and show that the leading term of the error degrades proportional to the diameter
of the dataset, which can be significantly smaller than the maximum norm (Section 5).
We also complement our result with a near-matching information-theoretic lower bound
(Section 6).
3. We empirically validate our findings on a few standard datasets and show that DPSGD-F
outperforms DPSGD on a subset of the datasets. For the task of privately finetuning the
last layer of a pretrained model (using ImageNet1K) on CIFAR-100 under ε = 1, our result
improves the previous accuracy from 70.6% (De et al., 2022) to 71.6% (Section 7).
The rest of the paper is organized as follows. In Section 2, we outline the problem setting and in
Section 3 discuss prior work and our contribution. In Section 4 we provide a counter-example for
DPSGD and in Section 5, we describe our new algorithm and its performance. In Section 6, we
provide an information theoretic lower bound. Finally, in Section 7, we demonstrate the practicality
of the proposed algorithm on image clssification tasks.
2 PRELIMINARIES
Classification. Let X denote the feature space and Y = {−1, 1} be the set of labels. Unless
otherwise specified, we set X to be a ball of radius R in d-dimensional space denoted by B
d
2
(R).
Let h be a classifier that takes parameter θ and feature x and computes a score h(θ, x) R. The
performance of the classifier is measured by a loss function ` : R × Y R
+
. We assume ` is a
margin loss (Bartlett et al., 2006) of the form
`(h(θ, x), y) = φ(y · h(θ, x)), (1)
where φ : R R is typically a convex, non-increasing function with a bounded value φ(0) <
at 0. Canonical examples of φ include the hinge loss φ(z) = [1 z]
+
, and the logistic loss φ(z) =
log(1 + e
z
).
Let D = {(x
i
, y
i
)}
i[n]
be a training set of size n. Typically, the goal is to learn a classifier by
minimizing the average loss over all samples given by
L(θ, D) ,
1
n
X
i[n]
`(h(θ, x
i
), y
i
).
We are interested in linear classification models that are described as follows. Let θ = (w, b)
R
d+1
, where w R
d
and b R, and h((w, b), x
i
) = w ·x
i
+b. We further assume that `(h(θ, x), y)
is convex in θ, and that y, `(·, y) is convex and G-Lipschitz in the first argument. The model can be
viewed as a generalized linear model (GLM), with additional restrictions on the loss function. We
denote the minimizer of the empirical loss by
θ
D
arg min L(θ, D).
We will drop the subscript when the dataset is clear from context. The following assumption on the
minimizer will be useful in our performance analysis. It assumes that the minimizer is not a trivial
solution which does not separate any two data points in the dataset. This is a reasonable assumption
as long as ` is a good proxy loss for classification.
Assumption 1 (Nontrivial minimizer). The dataset D and loss function ` satisfies that at the mini-
mizer θ
, there exists x, x
0
D such that
h(θ
, x) · h(θ
, x
0
) 0.
1
When unspecified, we use norm to refer to the Euclidean norm by default.
2
Published as a conference paper at ICLR 2024
Algorithm 1 Differentially private SGD (Abadi et al., 2016)
Input: Input: Dataset D = {(x
1
, y
1
), (x
2
, y
2
), . . . , (x
n
, y
n
)} of n points, privacy parameter ε, δ,
clipping norm C, step size η, number of iterations T , batch size B max{n
p
ε/4T , 1},
1: Set σ
2
=
8T C
2
log(1)
n
2
ε
2
.
2: Choose an inital point θ
0
.
3: for t = 0, 1, . . . , T 1 do
4: Sample a batch B
t
of B data points with replacement.
5: For all i B
t
, compute the gradient `(h(θ, x
i
), y
i
).
6: Compute a noisy clipped mean of the gradients by
ˆg
t
=
1
|B
t
|
X
iB
t
Clip(`(h(θ
t
, x
i
), y
i
)), C) + N(0, σ
2
I
d
)
(2)
7: Update the parameter by w
t+1
= w
t
ηˆg
t
.
8: end for
9: Return
¯
T =
1
T
P
T
t=1
w
t
.
Differential privacy (DP) (Dwork et al., 2014). DP requires the optimization algorithm to output
similar outputs for similar training datasets. More precisely, differential privacy is defined below.
Definition 1 (Differential privacy). Let D be a collection of datasets. An algorithm A : D R is
(ε, δ)-DP if for any datasets D and D
0
that differ by only one data point, denoted as |DD
0
| = 1,
and any potential outcome O R, the algorithm satisfies
Pr (A(D) O) e
ε
Pr (A(D
0
) O) + δ.
Our goal is to find a θ
prv
that is (, δ)-differentially private and minimizes the optimality gap w.r.t.
the empirical risk, which is also referred to as the privacy error,
E[L(θ
prv
, D)] min
θ
L(θ, D),
where the expectation is over the randomness of the private algorithm.
Notations. We use Clip(x, C) to denote the clipping operation with norm C, defined by
Clip(x, C):= min
1,
C
kxk
2
· x.
For a vector x of dimension d, we use (x, 1) to refer to the (d +1)-dimensional vector obtained from
augmenting 1 to x. We use diam(D):= max
x,x
0
D
kx x
0
k
2
to denote the diameter of features
in the dataset D. Given a set of features x
1
, x
2
, . . . , x
n
from a dataset D, let U(D) denote the
eigenbasis of
P
i
x
i
x
>
i
. Let M(D) be the projection operator to this eigenspace U (D), given by
M(D) = U (D)(U(D))
>
. Then, M(D) defines a seminorm
2
given by
kvk
M(D)
= kvM(D)v
>
k
2
.
3 RELATED WORK AND OUR CONTRIBUTION
Differentially private stochastic gradient descent (DPSGD). We start by describing the mini-
batch variant of the DPSGD algorithm in Algorithm 1. The DPSGD algorithm is a modification
of the popular SGD algorithm for training learning models. In each round, the individual sample
gradients are clipped and noised to bound the per-round privacy loss (the influence of one data
point). The overall privacy guarantee combines tools from privacy amplification via subsampling
and strong composition (see Abadi et al. (2016)).
DPSGD has shown to achieve dimension-independent convergence result for optimizing generalized
linear models. The upper bound below is implied by Song et al. (2020, Theorem 4.1).
2
Given a vector space V over real numbers, a seminorm on V is a nonnegative-valued function Q such that
for all x R and v V , Q(x · v) = |x| · Q(v) and for all u, v V , Q(u + v) Q(u) + Q(v).
3
Published as a conference paper at ICLR 2024
Lemma 1 (Song et al. (2020)). There exists an (ε, δ)-DP instance of DPSGD, whose output satisfies,
E[L(θ
prv
, D)] L(θ
D
, D) 2Gkθ
k
M
R
p
rank(M) log(1)
!
,
where G is the Lipschitz constant of the loss function, R = max
i
kx
i
k
2
and M = M(D
0
) with
D
0
= {(x, 1) | x D}.
Note that in the analysis in Song et al. (2020), the norm of the gradients is bounded by G
R
2
+ 1,
and the clip norm is chosen such that the gradients are not clipped. The above result is
shown to be minimax optimal in terms of the dependence on the stated parameters. Recall that
diam(D):= max
x,x
0
D
kx x
0
k
2
denotes the diameter of feautures in the dataset D. So a natural
question is to ask is whether the dependence on R can be improved to diam(D), which can be useful
in cases when dataset if more favorable e.g., when diam(D) R. If yes, can the improvement be
achieved by DPSGD?
We answer the first question by proposing an algorithm that combines feature preprocessing and
DPSGD, and improves the dependence on R to diam(D) in the leading term of the optimality gap,
stated below.
Theorem 1. There exists an (ε, δ)-differentially private algorithm DPSGD-F, which is a combina-
tion of private feature preprocessing and DPSGD, such that when n =
d log(1)log R log(d)
ε
and ε = O(1), the output θ
prv
satisfies
E[L(θ
prv
, D)] L(θ
, D)
= O
Gkθ
k
M
diam(D) +
R
n
2
p
rank(M) log(1) + log n
!
+
φ(0) log(n)
!
,
where M = M (D
0
) with D
0
= {(x, 1) | x D}. As discussed before Lemma 3,
R
n
2
can be reduced
to any inverse polynomial function of n by increasing the requirement on n by a constant factor.
We will describe the algorithm and disucss the proof in Section 5. The next theorem states that the
first term in the above result is tight.
Theorem 2. Let A be any (ε, δ)-DP optimization algorithm with ε c and δ cε/n for some
constant c > 0. There exists a dataset D = {(x
i
, y
i
), i [n]} and a loss function `(θ · x, y) that is
convex and G-Lipschitz loss functions for all y, which is of form Eq. (1), and
E [L(A(D), D)] L(θ
D
, D) = Ω
G · diam(D) · min
(
1,
kθ
k
M
p
rank(M)
)!
,
where M = M(D
0
) with D
0
= {(x, 1) | x D}.
Moreover, we show that DPSGD must incur an error proportional to R by providing a counter-
example in Section 4.
4 A COUNTER-EXAMPLE FOR DPSGD
We consider the simple case of binary classification with hinge loss. Let µ > 0. Let e
1
and e
2
denote two orthogonal basis vectors. Suppose dataset D
0
is partitioned into two equal parts D
0
1
and
D
0
1
, each with size n/2. D
0
1
contains n/2 samples with x = µe
1
+ e
2
and y = 1 and D
0
1
contains
n/2 samples with x = µe
1
e
2
and y = 1. A visualization of the dataset is provided in Fig. 1.
Similarly, let the dataset D
00
is partitioned into two equal parts D
00
1
and D
00
1
, each with size n/2.
D
00
1
contains n/2 samples with x = µe
1
+ e
2
and y = 1 and D
00
1
contains n/2 samples with
x = µe
1
e
2
and y = 1. We assume there is no bias term for simplicity, i.e., b = 0
3
. For any
w = (w(1), w(2)), let
`(w · x, y) = max{1 y(w · x), 0}.
3
The bias term can be added by assuming there is an additional entry with a 1 in the feature vector. The
proof will go through similarly as the optimal parameter has b = 0.
4
Published as a conference paper at ICLR 2024
Figure 1: Feature vectors.
Figure 2: Gradient vectors.
Further, the average empirical loss on D
0
will be
L(w, D
0
) =
1
2
max{1 (µw(1) + w(2)), 0}+
1
2
max{1 + (µw(1) w(2)), 0}.
Observe that w
0
= (0, 1) has zero empirical loss for for dataset D
0
and w
00
= (0, 1) has zero
empirical loss for for dataset D
00
. The next theorem states that DPSGD with any clipping norm,
batch size, number of steps, learning rate and any initialization will not obtain nontrivial error on
both D
0
and D
00
when µ > .
Theorem 3. Let A be a (ε, δ)-private DPSGD algorithm with any number of steps T , learning rate
η, clipping norm C, and (possibly randomized) initialization w
0
. We have when n < µ/ε,
max
D∈{D
0
,D
00
}
E [L(A(D), D)] min
w
L(w, D)
= Ω(1).
The crux of the proof involves showing that if the initialization is “uninformative” for a dataset,
then DPSGD does not obtain nontrivial error when µ > . Here by “uninformative”, we mean
w
0
= (w
0
(1), w
0
(2)) satisfy that w
0
(2) 0 for dataset D
0
and w
0
(2) 0 for dataset D
00
. Note that
at least one of the above two conditions must hold. Below we give a high-level idea of the proof.
Since D
0
1
and D
0
1
(and similarly D
00
1
and D
00
1
) have the same x(1), the performance of the final
optimizer will depend mostly on the second parameter w(2). The following lemma can be proved.
Lemma 2. For dataset D
0
, let w be any iterate of the DPSGD algorithm, then
E [L(w, D
0
)] min
w
L(w, D
0
) Pr (w(2) 0).
Similarly, for dataset D
00
,
E [L(w, D
00
)] min
w
L(w, D
00
) Pr (w(2) 0).
We will focus on the case when w
0
(2) 0 and consider dataset D
0
and loss L(w, D
0
)
4
. It would
be enough to show that for every iteration t, the parameter w
t
has a constant probability of being
negative. The gradient of each individual loss functions satisfies that for (x, y) D
0
, we have
w
`(w · x, y) =
{µw(1) + w(2) < 1}· (µ, 1) if (x, y) D
0
1
,
{−µw(1) + w(2) < 1}· (µ, 1) if (x, y) D
0
1
.
Hence the norm of the gradient vectors can be as large as
p
µ
2
+ 1. By clipping the gradient
to norm C, the ‘signal’ component of the gradient on the second coordinate will be decreased to
min{1, C/
p
µ
2
+ 1}. However, in each iteration, DPSGD will add a Gaussian noise with standard
deviation proportional to C/nε, which can be large compared to min{1, C/
p
µ
2
+ 1} when µ is
large. When n < µ/ε, the total noise will dominate the signal, making the probability of w
t
(2) < 0
nontrivial. We provide the complete proof in Appendix A.
DPSGD with better mean estimation algorithm. One attempt to solve the above issue is to
replace Eq. (2) by recently proposed private mean estimation algorithms, which can improve the
estimation error to scale with the diameter of the gradients (e.g., in Karwa & Vadhan (2017); Huang
et al. (2021)). However, it is unclear whether this will work. As shown in Fig. 2, the diameter of the
gradients can be proportional to the maximum norm of feature vectors instead of the diameter.
4
When w
0
(2) 0, we will consider D
00
and loss L(w, D
00
) instead.
5
Published as a conference paper at ICLR 2024
Algorithm 2 DPSGD with feature preprocessing (DPSGD-F).
Input: Input: Dataset D = {(x
1
, y
1
), (x
2
, y
2
), . . . , (x
n
, y
n
)} of n points, ε, δ. Private mean es-
timation algorithm PrivMean (Lemma 3). Private quantile esimation algorithm PrivQuantile
(Lemma 4), DPSGD for GLM (Lemma 1).
1: Private mean estimation. Compute ˆµ, the differentailly private mean of features in D with
Lemma 3 and privacy budget (ε/3, δ/3).
2: Private quantile estimation. Let S = {kx ˆµk
2
, x D} be the set of distances from the
feature vectors to the computed mean. Find a private quantile τ of set S using Lemma 4 with
privacy budget (ε/3, δ/3).
3: Translate and augment the dataset with bias. Preprocess the dataset by translation and ap-
pending a constant feature τ. Let D
0
= {(x
0
1
, y
1
), . . . , (x
0
n
, y
n
)} where
x
0
i
= (x
i
ˆµ, τ).
4: Feature clipping. Let D
00
= {(x
00
1
, y
1
), (x
00
2
, y
2
), . . . , (x
00
n
, y
n
)}, where
x
00
i
= Clip
x
0
i
,
2τ
.
5: DPSGD with preprocessed features. Compute an approximate minimizer θ
00
prv
R
d+1
of
L(θ, D
00
) using the DPSGD algorithm for GLM in Lemma 1 with (ε/3, δ/3).
6: Return θ
prv
= (w
prv
, b
prv
), where w
prv
= θ
00
prv
[1 : d] and b
prv
= θ
00
prv
[d + 1]τ ˆµ.
In the next section, we take advantage of the fact that the data points are close in the fea-
ture space, and design an algorithm that first privately preprocesses the features and then per-
form DPSGD to achieve better bounds. In particular, the bound in Theorem 1 states that when
max{
p
log(µ) log(1)/ε,
3
p
µ/ε} n < µ/ε, the new algorithm achieves a o(1) error, which is
strictly better than DPSGD.
5 A FEATURE PREPROCESSING AUGMENTED DPSGD ALGORITHM
We propose a new algorithm called DPSGD-F (Algorithm 2) that combines feature preprocessing
and DPSGD. We show that the new algorithm can achieve the performance described in Theorem 1.
We overview each step of the algorithm below and provide theoretical guarantees for each step. The
detailed proofs are listed in Appendix B.
Private mean estimation of features. We first compute a differentially private mean of features
ˆµ using the instance-optimal mean estimation algorithm from Huang et al. (2021), who showed
that the mean can be estimated efficiently with error proportional to the diameter of the dataset. The
following result on private mean estimation follows by combining Theorem 3.3 and Remark 3.2 from
Huang et al. (2021). Note that the result of Huang et al. (2021) is a high-probability bound, which
does not explicitly depend on R (or r(D) in their notation). To obtain an in-expectation bound, we
choose a failure probability of 1/n
2
in the proof and this results in the R/n
2
term in Lemma 3. This
term can be reduced to any inverse polynomial function of n by increasing the requirement on n
in the assumption by a constant factor. We choose 1/n
2
here mainly for presentation purpose. See
Appendix B.1 for the detailed analysis.
Lemma 3 (Appendix B.1). Let ε log(3) and n c ·
d log(1) log R log(d)
ε
for a sufficiently
large constant c, then there exists an (ε/3, δ/3)-DP algorithm whose output ˆµ B
d
2
(R) satisfies
E[kµ ˆµk
2
] diam(D) +
2R
n
2
.
The above lemma implies that if n is large enough, ˆµ is a good approximation for µ and hence
x
i
D, E[kx
i
ˆµk
2
] 2diam(D) +
2R
n
2
. (3)
Existing algorithms for differentially private ERM require the knowledge of diam(D), which is
unknown. Hence, we compute an estimate of diam(D) by private quantile estimation.
6
Published as a conference paper at ICLR 2024
Private quantile estimation. Changing one sample in a dataset can change the diameter by R.
Hence, the sensitivity of diam(D) is R and it is difficult to estimate it privately with good accuracy.
Instead we estimate the
˜
O(1/()) quantile of kx
i
ˆµk
2
using Dick et al. (2023, Theorem 2).
Lemma 4 (Appendix B.2). There exists a (ε/3, δ/3) algorithm that finds a threshold τ such that
E[τ|ˆµ] 2 max
i
kx
i
ˆµk
2
+
2R
n
2
.
Let ε = O(1) and Z
τ
be the number of points such that kx
i
ˆµk is larger than τ, then E[Z
τ
| ˆµ]
125
ε
log n.
Translated and augment dataset with bias. Another issue with the above mentioned approach is
the scale of the bias term. If θ = (w, b), then the gradient with respect to w scales linearly in x,
but is a fixed constant for b. This causes issues in the theoretical analysis. Hence, we construct an
augmented dataset D
0
where x
0
i
= (x
i
ˆµ, τ). Note that x
0
i
R
d+1
. We also remove the bias term
when optimizing over D
0
since we have already included a bias term in the augmented features.
This also guarantees that the norm of the gradient is bounded by Θ(τ ) with high probability. We
show that we are not losing any information in translation and augmentation and hence the minimum
value of empirical risk should remain the same. We formalize it in the next lemma.
Lemma 5. For any θ = (w, b), let θ
0
= (w, (b + w · ˆµ) ), we have
L(θ, D) = L(θ
0
, D
0
).
In particular, (w
, b
) is a minimizer of D if and only if (w
, (b
+ ˆµ ·w
)) is a minimizer of D
0
.
Proof. To prove the first result, observe that for any point (x, y) D,
`(h((w, b), x), y)) = φ(y · (w · x + b))
= φ(y · (w · (x ˆµ) + b + w · ˆµ))
= φ(y
0
· (w · x
0
+ (b + w · ˆµ) · τ ))
= `(h((w, (b + w · ˆµ) ), x
0
), y
0
).
The second result follows by taking the minima over all (w, b) in the first result.
Recall that results on privacy error involve the projector on the design matrix and the norm of the
minimizer. Hence, we next provide connections between the rank of M (D) and M(D
0
) and the
norms of the minimizers of D and D
0
. The proof mainly follows from the fact that the translating
and augmenting operations are performing the same linear operation on each data point, which won’t
change the minimizer up to scaling and the bias term.
Lemma 6 ((Appendix B.3)). Let M (D) and M(D
0
) be the projectors design matrices of non-
augmented dataset D and augmented dataset D
0
respectively, then
rank(M(D
0
)) rank(M(D)) + 5.
Furthermore, let θ
and θ
0∗
be the empirical risk minimizers of D and D
0
respectively. If Assump-
tion 1 holds then
E[kθ
0∗
k
2
· τ|ˆµ] 2kθ
k
2
max
x
i
D
kˆµ x
i
k
2
+
2kθ
k
2
R
n
2
.
Feature clipping. As mentioned above, due to errors in previous steps, the `
2
norm of the features
may not be bounded with a small probability. Hence, we clip the augmented features from dataset
D
0
to obtain a dataset D
00
. Similar to the Lemma 6, we relate the rank of M(D
0
) and M(D
00
) next.
The proof mainly follows from the fact that clipping will only change the scale of each feature vector
but not their direction.
Lemma 7 (Appendix B.4). Let M (D
0
) and M(D
00
) be the projectors to the design matrices of
datasets D
0
and D
00
respectively, then
rank(M(D
00
)) = rank(M(D
0
)).
7
Published as a conference paper at ICLR 2024
DPSGD with preprocessed features. We next solve ERM using DPSGD algorithm on D
00
. Since
norm of the features are bounded by
2τ, the gradients will be bounded by
2. Furthermore,
since the dataset is augmented, we don’t include the bias term and treat weights as a vector in R
d+1
.
The guarantee of DPSGD on these features follows from previous work Song et al. (2020, Theorem
4.1). Similar to Song et al. (2020), the clipping norm is chosen to be
2 so that no gradients are
clipped in all steps.
Reverting the solution to the original space. The output of the DPSGD algorithm is in R
d+1
, and
furthermore only works on the translated and augmented dataset D
00
. Hence, we translate it back to
the original space, by appropriately obtaining w
prv
and rescaling the bias.
The computation complexity of the algorithm is the same as DPSGD since the preprocessing algo-
rithms can all be implemented in time
˜
O(nd), which is less than that of the DPSGD phase. The
proof of privacy guarantee in Theorem 1 follows directly from the composition theorem (Dwork
et al., 2014, Theorem 3.16). The utility guarantee follows from the above stated results and we
provide the complete proof of in Appendix B.5.
6 LOWER BOUND
In this section, we prove Theorem 2, which shows that the performance of our algorithm is tight
up to logarithmic factors. The proof follows almost immediately from Theorem 3.3 in Song et al.
(2021a), with a slight modification to the label and loss function to make sure that it is a valid
classification problem of form Eq. (1). We first state the result in Song et al. (2021a) (a restated
version of Theorem 3.3) below.
Theorem 4 (Song et al. (2021a)). Let A be any (ε, δ)-DP algorithm with ε c and δ cε/n for
some constant c > 0. Let 1 d
0
d be an integer. Let ` be defined by `(θ ·x, y) = |θ ·x y|. Then
there exists a data set D = {(x
i
, y
i
), i [n]} such that the following holds. For all i [n], x
i
{(0, . . . , 0), (1, 0, . . . , 0), (0, 1, 0, . . . , 0), (0, . . . , 0, 1)} and y
i
{0, 1}. Let θ
= arg min
θR
d
L(θ; D)
(breaking ties towards lower kθk
2
). We have kθ
k
2
[0, 1]
d
, and
E
h
L
Proj
[0,1]
d
(A(D)); D
i
L (θ
; D)
min
1,
d
0
εn

,
where Proj
[0,1]
d
(A(D)) denotes the projection of A(D) to [0, 1]
d
.
5
Proof of Theorem 2: Take the dataset D = {(x
i
, y
i
), i [n]} from Theorem 4, we can construct
a dataset D
0
as following. i, let x
0
i
= x
i
, y
0
i
= 2y
i
1. Set `
0
(θ ·x, y) = max{1 y(2θ ·x 1), 0},
which is of the form Eq. (1). Then, for all θ [0, 1]
d
, we have θ · x [0, 1] and
`
0
(θ · x, y) =
2 2θ · x = |2θ · x 1 y| if y = +1,
2θ · x = |2θ · x 1 y| if y = 1.
Hence i [n], `
0
(θ · x
0
i
, y
0
i
) = |2θ · x
0
i
1 y
0
i
| = 2|θ · x
i
y
i
|. Moreover, it can be seen that
x [0, 1]
d
, y {+1, 1}
`
0
(Proj
[0,1]
d
(θ) · x, y) `
0
(θ · x, y). (4)
Hence the minimizer of L
0
(θ; D
0
) lies in [0, 1]
d
. The above facts imply that θ
is also the minimizer
of L
0
(θ; D
0
), and by Theorem 4,
E
h
L
0
Proj
[0,1]
d
(A(D
0
)), D
0
i
L
0
(θ
; D
0
)
min
1,
d
0
εn

.
Together with Eq. (4), we have
E [L
0
(A(D
0
), D
0
)] L
0
(θ
; D
0
)
min
1,
d
0
εn

.
5
Theorem 3.3 in Song et al. (2021a) is stated without the projection operator while in fact the projection
operator is used throughout the proof of Theorem B.1 and Theorem 3.3. Hence the stated result holds with no
change in the proof.
8
Published as a conference paper at ICLR 2024
Note that for all θ R
d
, i [n], k
θ
`
0
(h(θ, x
0
i
), y
0
i
)k
2
2kx
i
k
2
2. And the dataset satisfies
that diam(D)
2, rank(
P
n
i=1
x
i
x
>
i
) d
0
. Moreover, kθ
k
M
kθ
k
2
d
0
. Hence we have
E [L
0
(A(D
0
), D
0
)] L
0
(θ
; D
0
)
diam(D
0
) min
(
1,
kθ
k
M
p
rank(M)
)!
.
The proof can be generalized to a general Lipschitz constant G and diam(D) by rescaling.
7 EXPERIMENTS
We empirically demonstrate our findings by evaluating DPSGD and our proposed feature-
normalized DPSGD (DPSGD-F
6
) for the task of training a linear classifier on three popular image
classification datasets: (1) MNIST (Lecun et al., 1998); (2) Fashion-MNIST (Xiao et al., 2017); (3)
CIFAR-100 (Krizhevsky et al., 2009) with pretrained features. For MNIST and Fashion-MNIST, we
directly train a linear classifier with the pixel-format images as inputs. For CIFAR-100, we use the
features obtained from the representations in the penultimate layer in a WRN model pretrained on
ImageNet (De et al., 2022). In this case, training a linear classifier using these features is equivalent
to fine-tuning the last layer of the WRN model.
Table 1: Accuracy comparison on different datasets with different values of ε and δ = 10
5
.
CIFAR-100 is trained and tested using features pretrained on ImageNet (De et al., 2022). Each
number is an average over 10 independent runs. The number in the parenthesis represents the stan-
dard deviation of the accuracy under the optimal parameter settings.
Dataset
Accuracy (ε = 1) Accuracy (ε = 2) Non-private acc.
DPSGD DPSGD-F DPSGD DPSGD-F SGD
MNIST 87.4 (0.1) 92.0 (0.1) 89.6 (1.2) 92.3 (0.1) 92.9 (0.1)
FMNIST 77.2 (0.4) 84.0 (0.1) 78.7 (0.4) 84.5 (0.2) 85.0 (0.1)
CIFAR-100 (pretrained) 70.6 (0.3)
7
71.6 (0.2) 73.8 (0.2) 74.4 (0.2) 78.5 (0.1)
Similar to De et al. (2022), for each combination of dataset, privacy parameter, and algorithm, we
report the best accuracy obtained from a grid search over batch size, steps size, and number of
epochs
8
. For DPSGD-F, we also did a grid search over the allocation of privacy budget between the
preprocessing phase and the DPSGD phase. We leave choosing the parameters automatically as a
future direction. Detailed implementation and parameter settings are listed in Appendix C.
Our results are listed in Table 1. Our proposed algorithm consistently improves upon DPSGD for
all three datasets under ε = 1 and ε = 2, which empirically demonstrates our theoretical findings.
8 DISCUSSION
In this work, we ask the fundamental question of whether DPSGD alone is sufficient for obtaining
a good minimizer for linear classification. We partially answer this question by showing that the
DPSGD algorithm incurs a privacy error proportional to the maximum norm of the features over all
samples, where as DPSGD with a feature preprocessing step incurs a privacy error proportional to
the diameter of the features, which can be significantly small compared to the maximum norm of
features in several scenarios. Our preliminary experiments show that feature preprocessing helps in
practice on standard datasets. Investigating whether these results can be extended to bigger datasets
and beyond linear models remain interesting open questions.
6
We perform a modified version of Algorithm 2, which better aligns with existing practical implementations
of DPSGD. See Algorithm 3 for details. Note that the resulting algorithm is still a valid (, δ)-differentially
private.
7
The reported accuracy using DPSGD in De et al. (2022) is 70.3%. We note that the improvement to
70.6% is due to the fact that we are using the privacy accoutant based on Privacy Loss Distributions (Meiser &
Mohammadi, 2018; Sommer et al., 2019; Doroshenko et al., 2022), which leads to a smaller noise multiplier
under the same privacy budget.
8
The reported accuracy doesn’t take into account the privacy loss in the hyperparamter tuning phase.
9
Published as a conference paper at ICLR 2024
REFERENCES
Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and
Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security, pp. 308–318, 2016.
Raman Arora, Raef Bassily, Crist
´
obal Guzm
´
an, Michael Menart, and Enayat Ullah. Differentially
private generalized linear models revisited. Advances in Neural Information Processing Systems,
35:22505–22517, 2022.
Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy:
Analytical calibration and optimal denoising. In Jennifer Dy and Andreas Krause (eds.), Pro-
ceedings of the 35th International Conference on Machine Learning, volume 80 of Proceed-
ings of Machine Learning Research, pp. 394–403. PMLR, 10–15 Jul 2018. URL https:
//proceedings.mlr.press/v80/balle18a.html.
Peter L. Bartlett, Michael I. Jordan, and Jon D. Mcauliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101(473):138–156, 2006. ISSN 01621459. URL
http://www.jstor.org/stable/30047445.
Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic
convex optimization with optimal rates. In Advances in Neural Information Processing Systems,
pp. 11279–11288, 2019.
Raef Bassily, Vitaly Feldman, Crist
´
obal Guzm
´
an, and Kunal Talwar. Stability of stochastic gradient
descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33:
4381–4391, 2020.
Raef Bassily, Crist
´
obal Guzm
´
an, and Michael Menart. Differentially private stochastic optimization:
New results in convex and non-convex settings. Advances in Neural Information Processing
Systems, 34:9317–9329, 2021a.
Raef Bassily, Crist
´
obal Guzm
´
an, and Anupama Nandi. Non-euclidean differentially private stochas-
tic convex optimization. In Conference on Learning Theory, pp. 474–499. PMLR, 2021b.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal
Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao
Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http:
//github.com/google/jax.
Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlock-
ing high-accuracy differentially private image classification through scale. arXiv preprint
arXiv:2204.13650, 2022.
Travis Dick, Alex Kulesza, Ziteng Sun, and Ananda Theertha Suresh. Subset-based instance opti-
mality in private estimation. arXiv preprint arXiv:2303.01262, 2023.
Vadym Doroshenko, Badih Ghazi, Pritish Kamath, Ravi Kumar, and Pasin Manurangsi. Connect
the dots: Tighter discrete approximations of privacy loss distributions. Proceedings on Privacy
Enhancing Technologies, 4:552–570, 2022.
Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations
and Trends
R
in Theoretical Computer Science, 9(3–4):211–407, 2014.
Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal
rates in linear time, 2020.
Sahra Ghalebikesabi, Leonard Berrada, Sven Gowal, Ira Ktena, Robert Stanforth, Jamie Hayes,
Soham De, Samuel L Smith, Olivia Wiles, and Borja Balle. Differentially private diffusion models
generate useful synthetic images. arXiv preprint arXiv:2302.13861, 2023.
Google. Tensorflow privacy, 2018. URL https://github.com/google/
differential-privacy/.
10
Published as a conference paper at ICLR 2024
Ziyue Huang, Yuting Liang, and Ke Yi. Instance-optimal mean estimation under differential privacy.
Advances in Neural Information Processing Systems, 34:25993–26004, 2021.
Peter Kairouz, Ziyu Liu, and Thomas Steinke. The distributed discrete gaussian mechanism for
federated learning with secure aggregation. In International Conference on Machine Learning,
pp. 5201–5212. PMLR, 2021.
Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals, 2017.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recog-
nition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
Yann LeCun, L
´
eon Bottou, Genevieve B Orr, and Klaus-Robert M
¨
uller. Efficient backprop. In
Neural networks: Tricks of the trade, pp. 9–50. Springer, 2002.
Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be
strong differentially private learners. In International Conference on Learning Representations,
2021.
Xuechen Li, Daogao Liu, Tatsunori B Hashimoto, Huseyin A. Inan, Janardhan Kulka-
rni, Yin-Tat Lee, and Abhradeep Guha Thakurta. When does differentially pri-
vate learning not suffer in high dimensions? In S. Koyejo, S. Mohamed,
A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Infor-
mation Processing Systems, volume 35, pp. 28616–28630. Curran Associates, Inc.,
2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/
file/b75ce884441c983f7357a312ffa02a3c-Paper-Conference.pdf.
H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private
recurrent language models. In International Conference on Learning Representations, 2018.
Sebastian Meiser and Esfandiar Mohammadi. Tight on budget? tight bounds for r-fold approximate
differential privacy. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’18, pp. 247–264, New York, NY, USA, 2018. Association for
Computing Machinery. ISBN 9781450356930. doi: 10.1145/3243734.3243765. URL https:
//doi.org/10.1145/3243734.3243765.
Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H Brendan
McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Thakurta. How to dp-fy ml: A
practical guide to machine learning with differential privacy. arXiv preprint arXiv:2303.00654,
2023.
David Sommer, Sebastian Meiser, and Esfandiar Mohammadi. Privacy loss classes: The central
limit theorem in differential privacy. Proceedings on Privacy Enhancing Technologies, 2019:
245–269, 04 2019. doi: 10.2478/popets-2019-0029.
Shuang Song, Om Thakkar, and Abhradeep Thakurta. Characterizing private clipped gradient de-
scent on convex generalized linear problems. arXiv preprint arXiv:2006.06783, 2020.
Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimen-
sionality in unconstrained private glms. In Arindam Banerjee and Kenji Fukumizu (eds.), Pro-
ceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume
130 of Proceedings of Machine Learning Research, pp. 2638–2646. PMLR, 13–15 Apr 2021a.
URL https://proceedings.mlr.press/v130/song21a.html.
Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimen-
sionality in unconstrained private glms. In International Conference on Artificial Intelligence and
Statistics, pp. 2638–2646. PMLR, 2021b.
Florian Tramer and Dan Boneh. Differentially private learning needs better features (or much
more data). In International Conference on Learning Representations, 2021. URL https:
//openreview.net/forum?id=YTWGvpFOQD-.
11
Published as a conference paper at ICLR 2024
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmark-
ing machine learning algorithms. CoRR, abs/1708.07747, 2017. URL http://arxiv.org/
abs/1708.07747.
Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan
Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, et al. Differentially private fine-tuning
of language models. arXiv preprint arXiv:2110.06500, 2021.
A PROOF OF THEOREM 3
In the proof, we assume w
0
(2) 0 and consider D = D
0
. The proof will follow similarly when
w
0
(2) 0 and D = D
00
. We start by proving Lemma 2.
Proof of Lemma 2: Since max{1 x, 0} is a convex function in x, we have
L(w, D) =
1
2
max{1 (µw(1) + w(2)), 0}+
1
2
max{1 + (µw(1) w(2)), 0}
max{
1 (µw(1) + w(2)) + 1 + (µw(1) w(2))
2
, 0}
= max{1 w(2), 0}.
This implies that
E [L(w, D)] min
w
L(w, D) E[max{1 w(2), 0}] Pr (w(2) 0).
Note that the proof also applies to the final output A(D) =
1
T
P
T
t=1
w
t
. And hence to prove
Theorem 3, it would be enough to show that.
Pr
1
T
T
X
t=1
w
t
(2) 0
!
= Ω(1).
In the later analysis, we will focus on the update process of w
t
(2). We will prove the following
lemma:
Lemma 8. Let X be distributed as N
C
µ
2
+1
, σ
2
with σ
2
= Θ
C
2
n
2
ε
2
, we have
Pr
1
T
T
X
t=1
w
t
(2) 0
!
Pr (X 0).
Note that Lemma 8 would immediately imply the theorem by Chernoffs inequality. We then turn
to proving Lemma 8 below.
For (x, y) D, we have
w
`(w · x, y) =
{µw(1) + w(2) < 1}· (µ, 1) if (x, y) D
1
,
{−µw(1) + w(2) < 1}· (µ, 1) if (x, y) D
1
.
Let B
t
be the batch of users in iteration t with |B
t
| = B, the averaged clipped gradient at each
iteration will be
ˆg
t
=
1
B
X
(x,y)B
t
Clip(
w
`(w
t
· x, y), C)
=
1
B
min
(
1,
C
p
µ
2
+ 1
)
·
X
(x,y)B
t
D
1
{µw
t
(1) + w
t
(2) < 1} · (µ, 1)
+
1
B
min
(
1,
C
p
µ
2
+ 1
)
·
X
(x,y)B
t
D
1
{−µw(1) + w(2) < 1}· (µ, 1)
.
12
Published as a conference paper at ICLR 2024
Hence
ˆg
t
(2)
=
1
B
min
(
1,
C
p
µ
2
+ 1
)
(|B
t
D
1
| {µw
t
(1) + w
t
(2) < 1} + |B
t
D
1
| {−µw(1) + w(2) < 1})
1
B
min
(
1,
C
p
µ
2
+ 1
)
(|B
t
D
1
| + |B
t
D
1
|)
= min
(
1,
C
p
µ
2
+ 1
)
.
Standard privacy analysis in DP-SGD (the setting of parameters in Algorithm 1 used in Bassily
et al. (2019)) sets B max{n
p
ε/4T , 1} and adds a Gaussian noise N(0, σ
2
2
) with σ
2
=
8T C
2
log(1)
n
2
ε
2
9
to the averaged clipped gradient. Let N
t
denote the noise added at iteraction t, we
have
w
t+1
(2) w
t
(2) + η
min
(
1,
C
p
µ
2
+ 1
)
+ N
t
(2)
!
.
Hence we have
w
t
(2) w
0
(2) + η
t1
X
i=0
min
(
1,
C
p
µ
2
+ 1
)
+ N
i
(2)
!
η
t min
(
1,
C
p
µ
2
+ 1
)
+
t1
X
i=0
N
i
(2)
!
This implies:
1
T
T
X
t=1
w
t
(2) η
T + 1
2
min
(
1,
C
p
µ
2
+ 1
)
+
1
T
T 1
X
t=0
(T t)N
i
(2)
!
.
Note that the right hand side is a Gaussian distributed random variable with mean
η
T +1
2
min
1,
C
µ
2
+1
and variance η
2
4(T +1)(2T +1)C
2
log(1)
3n
2
ε
2
. Hence
Pr
1
T
T
X
t=1
w
t
(2) 0
!
Pr
N
η
T + 1
2
min
(
1,
C
p
µ
2
+ 1
)
, η
2
4(T + 1)(2T + 1)C
2
log(1)
3n
2
ε
2
!
0
!
Pr
N
min
(
1,
C
p
µ
2
+ 1
)
,
16(2T + 1)C
2
log(1)
3(T + 1)n
2
ε
2
!
0
!
This completes the proof of Lemma 8.
B MISSING PROOFS IN SECTION 5
B.1 PROOF OF LEMMA 3
Let ε log(3) and let ρ =
ε
2
49 log(3)
. Let α =
R
n
2
and γ(ζ) =
q
1
ρ
· log(2d
3/2
n
2
) log
(d log(2d
3/2
n
2
))
ζ
. If n c · γ(ζ)
d (for a sufficiently large constant c), then
9
A different privacy accounting method might give slightly improved result. But overall we will have
σ
2
= Θ
8T C
2
log(1)
n
2
ε
2
, and our claim in Lemma 8 won’t change up to constant factors.
13
Published as a conference paper at ICLR 2024
by Theorem 3 and Remark 1 from Huang et al. (2021), there exists a ρ-zCDP algorithm that outputs
ˆµ such that
kµ ˆµk
2
= O
s
d
ρ
+ γ(ζ)
!
·
diam(D)
p
log(nd/ζ)
n
+
R
n
2
,
with probability at least 1 ζ. Let f (ζ) = O
q
d
ρ
+ γ(ζ)
·
diam(D)
log(nd/ζ)
n
+
R
n
2
. Hence,
E[kµ ˆµk
2
] = E[kµ ˆµk
2
1
kµˆµk
2
f(ζ)
] + E[kµ ˆµk
2
1
kµˆµk
2
>f(ζ)
]
E[f(ζ)] + E[2R1
kµˆµk
2
>f(ζ)
]
f(ζ) + 2R Pr(kµ ˆµk
2
> f(ζ))
f(ζ) + 2,
where the first inequality follows by observing that both µ and ˆµ have norm at most R. Setting
ζ =
1
n
2
yields,
E[kµ ˆµk
2
] O
s
d
ρ
+ γ(1/n
2
)
!
·
diam(D)
p
log(n
5
d)
n
+
2R
n
2
O
s
d
ρ
+
log(dn)
ρ
!
·
diam(D)
p
log(n
5
d)
n
+
2R
n
2
O
d + log(dn)
·
diam(D)
p
log(n
5
d)
p
log(1)
n
+
2R
n
2
diam(D) +
2R
n
2
,
where the last inequality follows based on condition on n. Note that n c · γ(1/n
2
)
d based on
the assumption in the lemma. By (Kairouz et al., 2021, Lemma 4), this algorithm is also (/3, δ/3)-
differentially private and hence the lemma.
B.2 PROOF OF LEMMA 4
Throughout the proof, we assume ˆµ is fixed and prove the statement for any ˆµ. Applying (Dick
et al., 2023, Theorem 2) with a = 0, b = R, privacy budget /3, error probability ζ, α = R/n
2
,
r = n
100
log n, yields that the output ˆτ satisfies the following: with probability at least 1 ζ,
there exists a τ
0
such that
|ˆτ τ
0
|
R
n
2
,
and there are at most
100
log n +
6
log
R
ζ·R
/
n
2
=
100
log n +
6
log
n
2
ζ
points with kx
i
ˆµk
2
more
than τ
0
and at least
100
log n
6
log
n
2
ζ
points with kx
i
ˆµk
2
less than τ
0
. Let τ = ˆτ +
R
n
2
, then τ
satisfies the following: there are most
100
log n +
6
log
n
2
ζ
points with kx
i
ˆµk
2
more than τ and
at least
100
log n
6
log
n
2
ζ
points with kx
i
ˆµk
2
greater than τ . Let S be the set of points with
norm larger than τ. Let s
0
be a value we set later.
E[|S|] = E[|S|1
|S|≤s
0
] + E[|S|1
|S|≥s
0
]
E[s
0
1
|S|≤s
0
] + E[n1
|S|≥s
0
]
s
0
+ n Pr(|S| s
0
).
Setting ζ =
1
n
2
and s
0
=
100
log n +
24
log n yields that
E[|S|]
1
n
+
124
log n
125
log n.
14
Published as a conference paper at ICLR 2024
To prove the other result, let s
1
be a value which we set later. Observe that
E[τ] E

ˆτ +
R
n
2

E[ˆτ] +
R
n
2
E[ˆτ1
|S|≤s
1
] + 2E[ˆτ 1
|S|≥s
1
] +
R
n
2
RE[1
|S|≤s
1
] + E[max
i
kx
i
ˆµk
2
1
|S|≥s
1
] +
R
n
2
R Pr(1
|S|≤s
1
) + max
i
kx
i
ˆµk
2
+
R
n
2
.
Setting s
1
=
100
log n
24
log n yields that
E[τ] max
i
kx
i
ˆµk
2
+
2R
n
2
.
B.3 PROOF OF LEMMA 6
For the augmented dataset D
0
,
X
i
x
0
i
(x
0
i
)
>
=
P
i
(x
i
ˆµ)(x
i
ˆµ)
>
τ
P
i
(x
i
ˆµ)
τ
P
i
(x
i
ˆµ)
2
=
P
i
x
i
x
>
i
+ ˆµˆµ
>
ˆµ
>
nˆµµ
>
(µ ˆµ)
(µ ˆµ)
2
where µ =
1
n
P
i
x
i
. Hence we have rank(M(D
0
)) rank(M(D)) +rank(µˆµ
>
) + rank(ˆµµ
>
) +
rank(ˆµˆµ
>
) + 2, where we use the fact that adding one column/row will at most increase the rank of
a matrix by one.
To prove the second result, note that by triangle inequality,
kθ
0∗
k
2
τ kw
0
k
2
τ + |b
0
|τ
kw
k
2
τ + |b
0
|τ,
where the second inequality follows from Lemma 5. We now bound the second term in the right
hand side of the above equation. Observe that by Assumption 1, there exists x
i
and x
j
such that
w
x
i
+ b
τ 0 and w
x
j
+ b
τ 0,
and hence
w
x
i
b
τ w
x
j
.
Hence,
w
· (ˆµ x
i
) b
τ + w
· ˆµ w
· (ˆµ x
j
),
Hence by Lemma 5,
|b
0
|τ = |b
+ w
· ˆµ|
max
x
i
D
|w
· (ˆµ x
i
)|
kw
k
2
max
x
i
D
k(ˆµ x
i
)k
2
(5)
Combining the above two equations yield,
kθ
0∗
k
2
τ kw
k
2
τ + kw
k
2
max
x
i
D
k(ˆµ x
i
)k
2
kθ
k
2
τ + kθ
k
2
max
x
i
D
k(ˆµ x
i
)k
2
Hence by Lemma 4,
E[kθ
0∗
k
2
τ|ˆµ] kθ
k
2
E[τ|ˆµ] + kθ
k
2
max
x
i
D
k(ˆµ x
i
)k
2
2kθ
k
2
max
x
i
D
k(ˆµ x
i
)k
2
+
2kθ
k
2
R
n
2
.
15
Published as a conference paper at ICLR 2024
B.4 PROOF OF LEMMA 7
If v span(U(D
0
)), then v
>
P
i
x
0
i
x
0>
i
v > 0. Hence
P
i
kx
0
i
vk
2
2
> 0, which implies
P
i
kx
00
i
vk
2
2
>
0 and hence v
>
P
i
x
00
i
x
00>
i
v > 0 and hence v span(U (D
00
)). Similar analysis from U(D
00
) to
U(D
0
) yields that span(U(D
0
)) = span(U(D
00
)) and hence rank(M(D
0
)) = rank(M(D
00
)).
B.5 PROOF OF THEOREM 1
The proof of privacy guarantee follows directly from the composition theorem (Dwork et al., 2014,
Theorem 3.16). Note that for simplicity, we used composition theorem here. However in experi-
ments, we use the recent technique of PLD accountant (Meiser & Mohammadi, 2018; Sommer et al.,
2019; Doroshenko et al., 2022) implemented in Google (2018) to compute privary paramters and
δ.
We now focus on the utility guarantee.
Let θ
, θ
0∗
, θ
00∗
be the empirical risk minimizers of D, D
0
, D
00
respectively. Let
M(D), M(D
0
), M (D
00
) be the design matrices of datasets D, D
0
, D
00
respectively. Due to
feature-clipping, the gradient for any (x
00
, y
00
) D
00
and θ, the gradient of `(h(θ, x
00
), y
00
) is upper
bounded by
Gkx
00
k
2
2.
Hence, by (Song et al., 2021a, Theorem 3.1)
10
and Lemma 6,
E[L(θ
00
prv
, D
00
)|D
00
] L(θ
0∗
, D
00
) 2
2kθ
0∗
k
M(D
00
)
p
rank(M(D
00
)) log(3)
n
!
10
2kθ
0∗
k
2
p
rank(M) log(3)
n
!
. (6)
By Lemma 6 and Equation 3,
E[τkθ
0∗
k
2
] = E[E[τkθ
0∗
k
2
|ˆµ]
2kθ
k
2
E[max
x
i
D
k(ˆµ x
i
)k
2
] +
wkθ
k
2
R
n
2
4kθ
k
2
diam(D) +
6kθ
k
2
R
n
2
.
Substituting the above equation in equation 6 yields and taking expectation over D
00
yields
E[L(θ
00
prv
, D
00
)] E[L(θ
0∗
, D
00
)] 60
2Gkθ
k
2
diam(D) +
R
n
2
p
rank(M) log(3)
n
!
.
10
Note that(Song et al., 2021a, Theorem 3.1) states the bound in terms of empirical risk minimizer θ
00∗
, but
the bound holds for any θ.
16
Published as a conference paper at ICLR 2024
Let S be the set of samples which differ between D
0
and D
00
. By Lemma 6,
E[L(θ
0∗
, D
00
)] E[L(θ
0∗
, D
0
)] + E
"
1
n
X
iS
`(h(θ
0∗
, x
00
i
), y
i
) `(h(θ
0∗
, x
0
i
), y
i
)
#
E[L(θ
0∗
, D
0
)] + E
"
1
n
X
iS
|`(h(θ
0∗
, x
00
i
), y
i
) `(h(θ
0∗
, x
0
i
), y
i
)|
#
E[L(θ
0∗
, D
0
)] + E
"
1
n
X
iS
G|h(θ
0∗
, x
00
i
) h(θ
0∗
, x
0
i
)|
#
E[L(θ
0∗
, D
0
)] + E
"
1
n
X
iS
G
|w
0
· x
0
i
| + |b
0
|τ
#
(a)
E[L(θ
0∗
, D
0
)] + E
"
1
n
X
iS
2G max
j
|w
0
· (x
j
ˆµ)|
#
E[L(θ
0∗
, D
0
)] + E
"
1
n
X
iS
2Gkw
0
k
2
· (diam(D) + kˆµ µk
2
)
#
(b)
= E[L(θ
0∗
, D
0
)] +
2G
n
E
"
X
iS
kw
k
2
· (diam(D) + kˆµ µk
2
)
#
E[L(θ
0∗
, D
0
)] +
2Gkw
k
2
n
E [E[|S| | ˆµ] · (diam(D) + kˆµ µk
2
)]
(c)
E[L(θ
0∗
, D
0
)] +
250G log nkw
k
2
n
E [(diam(D) + kˆµ µk
2
)]
(d)
E[L(θ
0∗
, D
0
)] +
250G log nkw
k
2
n
2diam(D) +
2R
n
2
E[L(θ
0∗
, D
0
)] +
500G log nkθ
k
2
n
·
2diam(D) +
2R
n
2
.
where (a) follows from Eq. (5), (b) follows from Lemma 5, (c) follows from Lemma 4, and (d)
follows from Lemma 3. We now bound in the other direction.
E[L(θ
00
prv
, D
00
)] = E[L(θ
00
prv
, D
0
)] +
1
n
E[
X
iS
`(h(θ
00
prv
, x
00
i
), y
i
) `(h(θ
00
prv
, x
0
i
), y
i
)].
Notice that if y
i
θ
00
prv
· x
00
i
0, then y
i
θ
00
prv
· x
0
i
y
i
θ
00
prv
· x
00
i
0. Hence,
E[L(θ
00
prv
, D
00
)] E[L(θ
00
prv
, D
0
)] +
1
n
E
X
iS:y
i
θ
00
prv
·x
00
i
>0
`(h(θ
00
prv
, x
00
i
), y
i
) `(h(θ
00
prv
, x
0
i
), y
i
)
E[L(θ
00
prv
, D
0
)]
1
n
E
X
iS:y
i
w
00
prv
·x
00
i
>0
`(h(θ
00
prv
, x
0
i
), y
i
)
E[L(θ
00
prv
, D
0
)]
1
n
E[|S|φ(0)]
E[L(θ
00
prv
, D
0
)]
125 log n
n
φ(0),
where the last inequality follows from Lemma 4. Hence we have
E[L(θ
00
prv
, D
0
)] L(θ
0
, D
0
)
500G log nkθ
k
2
n
·
2diam(D) +
2R
n
2
+
125 log n
n
φ(0).
By Lemma 5, we have E[L(θ
00
prv
, D
0
)] L(θ
0
, D
0
) = E[L(θ
prv
, D)] L(θ
, D). This yields that
E[L(θ
prv
, D)] L(θ
, D) cGkθ
k
2
diam(D) +
R
n
2
p
rank(M) log(3) + log n
n
!
+
(0)
n
log n,
17
Published as a conference paper at ICLR 2024
Algorithm 3 Modefied version of DPSGD with feature preprocessing (DPSGD-F
).
Input: Input: Dataset D = {(x
1
, y
1
), (x
2
, y
2
), . . . , (x
n
, y
n
)} of n points; overall privacy budget
ε, δ; privacy budget for the preprocessing step ε
F
(0, ε); feature norm bound C
F
; gradient
clipping norm C
G
; step size η, number of iterations T ; batch size B.
1: Private mean estimation. Compute the noise multiplier σ
F
for the Gaussian mechanism with
ε
F
and δ using analytical callibration for Gaussian mechanism Balle & Wang (2018), and com-
pute
ˆµ =
1
n
n
X
i=1
x
i
+ N(0, σ
F
C
f
I
d
).
2: Translate the features. Preprocess the dataset and obtain D
0
= {(x
0
1
, y
1
), . . . , (x
0
n
, y
n
)},
where
x
0
i
= x
i
ˆµ.
3: DPSGD with preprocessed features. Compute the privacy budget ε
0
for the DPSGD phase
using the PLD accountant in Google (2018) by setting (ε, δ) as the overall privacy budget for
the composition of both the mean estimation phase and DPSGD phase. Get an an approximate
minimizer θ
0
prv
of L(θ, D
0
) using the DPSGD algorithm with privacy budget (ε
0
, δ).
4: Return θ
prv
= (w
prv
, b
prv
), where w
prv
= θ
0
prv
[1 : d] and b
prv
= θ
0
prv
[d + 1] ˆµ · w
prv
.
for a sufficient large constant c. The theorem follows by observing that for the minimum norm
solution θ
, kθ
k
2
= kθ
k
M
.
C ADDITIONAL EXPERIMENT DETAILS.
In this section, we discuss the detailed implementation and parameter settings used in our experi-
ments. We implement all algorithms and experiments using the open-source JAX (Bradbury et al.,
2018) library. For privacy accounting, we use the PLD accountant implemented in Tensorflow Pri-
vacy (Google, 2018). We first describe the implementation details of our experiments.
Feature normalization. For all three datasets, we consider their feature-normalized version,
where we normalize the feature vectors to a fixed norm of C
F
by the transformation below
x
0
i
= x
i
·
C
F
kx
i
k
2
.
We treat C
F
as a fixed constant, and hence this normalization step doesn’t result in any privacy loss
about the dataset.
Modifications to DPSGD-F. The version of DPSGD-F listed in Algorithm 2 is mainly presented
for ease of strict theoretical analysis since existing analysis on DPSGD mainly assumse that the
gradients are bounded and no clipping is needed during the DPSGD phase (Bassily et al., 2019;
Feldman et al., 2020; Bassily et al., 2020; 2021b;a; Song et al., 2021b; Arora et al., 2022). However,
state-of-the-art results from DPSGD usually uses the clipping version of DPSGD where no gradient
norm bound is assumed (De et al., 2022; Ghalebikesabi et al., 2023).
In our experiments, instead of directly implementing Algorithm 2, we perform the following modi-
fications in the experiments. The implemented version of the algortihm is described in Algorithm 3.
1. For the feature preprocessing step, when computing the mean, instead of using the adaptive
algorithm in Lemma 3, the algorithm directly applies Gaussian mechanism to the empirical
mean since each feature is normalized to a fixed norm. We can use Gaussian mechanism
for mean estimation since now the `
2
sensitivity is bounded.
2. We skip the private quantile estimation and feature clipping step. Instead, we only shift the
features by the privately estimated mean and treat it as the input to the DPSGD phase. We
then perform `
2
-norm clipping on the gradients as in DPSGD.
18
Published as a conference paper at ICLR 2024
The privacy guarantee of Algorithm 3 directly follows from the use of PLD accountant (Meiser &
Mohammadi, 2018; Sommer et al., 2019).
Lemma 9. The output of Algorithm 3 is (, δ)-differentially private.
Each each combination of (ALGORITHM, ε, DATASET), we fix the clipping norm C
G
to be 1 as
suggested in De et al. (2022), and perform a grid search over C
F
, BATCH SIZE, LEARNING RATE,
and NUMBER OF EPOCHS from the list below and report the best-achieved accuracy, similar to De
et al. (2022).
1. C
F
: 1, 10, 100, 1000.
2. BATCH SIZE: 256, 512, 1024, 2048, 4096, 8192, 16384.
3. LEARNING RATE: 0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16.
4. NUMBER OF EPOCHS: 20, 40, 80, 160, 320.
For DPSGD-F, we further add a search dimension over the privacy budget ε
F
for the feature pre-
processing step. And the list of candidates is chosen to be [0.02, 0.05, 0.1, 0.15, 0.2]. In fact, for all
experiments, the best accuracy is achieved at either ε
F
= 0.02 or ε
F
= 0.05. This suggests that
the feature preprocessing step actually doesn’t use much of the privacy budget to improve the final
accuracy.
Remark on different concentration measures of the datasets. For the three considered datasets,
below we compute a few distance measures that characterize how the dataset D is concentrated,
listed in Table 2. It appears that compared to diam(D),
1
n
P
i
kx
i
µk
2
is a more indicative measure
on the performance improvement from DPSGD-F compared to DPSGD. The smaller
1
n
P
i
kx
i
µk
2
, the more improvement DPSGD-F has over DPSGD. This might be due to the fact that we are
considering the gradient clipping version of DPSGD (Algorithm 3) instead of the feature clipping
version (Algorithm 2) considered in the theoretical analysis. Better understanding of this phenomena
is an interesting future direction.
Table 2: Different concentration measures of the datasets. Features in datasets are normalized to be
of unit norm. µ denotes the mean of the dataset.
Dataset diam(D) max
i
kx
i
µk
2
1
n
P
i
kx
i
µk
2
CIFAR-100 (pretrained) 1.56 1.10 0.9
MNIST 1.41 1.03 0.77
FMNIST 1.41 1.10 0.62
19