Published as a conference paper at ICLR 2024

THE IMPORTANCE OF FEATURE PREPROCESSING FOR DIFFERENTIALLY PRIVATE LINEAR OPTIMIZATION

Ziteng Sun, Ananda Theertha Suresh, Aditya Krishna Menon

Google Research, New York

{zitengsun,theertha,adityakmenon}@google.com

ABSTRACT

Training machine learning models with differential privacy (DP) has received in-

creasing interest in recent years. One of the most popular algorithms for training

differentially private models is differentially private stochastic gradient descent

(DPSGD) and its variants, where at each step gradients are clipped and combined

with some noise. Given the increasing usage of DPSGD, we ask the question: is

DPSGD alone sufﬁcient to ﬁnd a good minimizer for every dataset under privacy

constraints? Towards answering this question, we show that even for the simple

case of linear classiﬁcation, unlike non-private optimization, (private) feature pre-

processing is vital for differentially private optimization. In detail, we ﬁrst show

theoretically that there exists an example where without feature preprocessing,

DPSGD incurs an optimality gap proportional to the maximum Euclidean norm

of features over all samples. We then propose an algorithm called DPSGD-F,

which combines DPSGD with feature preprocessing and prove that for classification

tasks, it incurs an optimality gap proportional to the diameter of the features

$\max_{x, x' \in D} \|x - x'\|_2$. We finally demonstrate the practicality of our algorithm on

image classiﬁcation benchmarks.

1 INTRODUCTION

Differential privacy (DP) (Dwork et al., 2014) has emerged as one of the standards of privacy in

machine learning and statistics. In machine learning, differentially private methods have been used

to train models for language modelling (McMahan et al., 2018), image classiﬁcation (De et al.,

2022), generative diffusion (Ghalebikesabi et al., 2023), and private fine-tuning (Yu et al., 2021; Li

et al., 2021). We refer readers to Ponomareva et al. (2023) for a detailed survey of techniques and

current state of the art methods in private optimization. Following the inﬂuential work of Abadi

et al. (2016), differentially private stochastic gradient descent (DPSGD) has emerged as one of the

most popular algorithms for training private machine learning models and achieves state of the art

results in several datasets (De et al., 2022; Ghalebikesabi et al., 2023).

There has also been a recent line of work that focuses on analyzing the theoretical performance of

DPSGD and its variants, with a particular focus on convex models (Bassily et al., 2019; Feldman

et al., 2020; Bassily et al., 2020; 2021b;a; Song et al., 2021b; Arora et al., 2022). It has been

shown that DPSGD and its variants can achieve min-max optimal rates for the task of DP empirical

risk minimization and DP stochastic convex optimization under various geometries. Moreover, for

convex generalized linear models, DPSGD has also been shown to achieve dimension-independent

convergence rate (Song et al., 2021b; Arora et al., 2022). This observation can be extended to general

convex models under certain assumptions (Li et al., 2022).

Despite the practical success and appealing theoretical properties of DPSGD, recent empirical results

have shown that it may not learn good intermediate features in deep learning image classification

tasks (Tramer & Boneh, 2021). In fact, it has been observed in Abadi et al. (2016) that performing

a private PCA on the features before performing DPSGD can improve the performance of private

training, which highlights that learned features may not be good. This raises a fundamental question:

Can private feature preprocessing provably improve DPSGD?

There is no clear intuitive answer to this question. On the one hand, feature preprocessing accelerates

the convergence of gradient methods in optimization (LeCun et al., 2002), which can help private

optimization since the number of steps is constrained by the privacy requirement. On the other hand,

private feature preprocessing will use a portion of the privacy budget, thus decreasing the privacy

budget for the DPSGD phase.

As a ﬁrst step towards answering this question, we show that for the simple task of linear private

classiﬁcation, unlike non-private convex optimization, feature preprocessing is necessary for private

optimization to achieve near instance-optimal results. Our ﬁndings are as follows.

1. We provide an example where DPSGD with any clipping norm, batch size, and learning

rate incurs an error proportional to the maximum Euclidean norm of feature vectors (Section 4).

2. We propose DPSGD-F, a new algorithm that combines both DPSGD and feature prepro-

cessing, and show that the leading term of the error degrades proportional to the diameter

of the dataset, which can be signiﬁcantly smaller than the maximum norm (Section 5).

We also complement our result with a near-matching information-theoretic lower bound

(Section 6).

3. We empirically validate our ﬁndings on a few standard datasets and show that DPSGD-F

outperforms DPSGD on a subset of the datasets. For the task of privately ﬁnetuning the

last layer of a pretrained model (using ImageNet1K) on CIFAR-100 under ε = 1, our result

improves the previous accuracy from 70.6% (De et al., 2022) to 71.6% (Section 7).

The rest of the paper is organized as follows. In Section 2, we outline the problem setting and in

Section 3 discuss prior work and our contribution. In Section 4 we provide a counter-example for

DPSGD and in Section 5, we describe our new algorithm and its performance. In Section 6, we

provide an information theoretic lower bound. Finally, in Section 7, we demonstrate the practicality

of the proposed algorithm on image classification tasks.

2 PRELIMINARIES

Classification. Let $\mathcal{X}$ denote the feature space and $\mathcal{Y} = \{-1, 1\}$ be the set of labels. Unless otherwise specified, we set $\mathcal{X}$ to be a ball of radius $R$ in $d$-dimensional space denoted by $B_2^d(R)$. Let $h$ be a classifier that takes parameter $\theta$ and feature $x$ and computes a score $h(\theta, x) \in \mathbb{R}$. The performance of the classifier is measured by a loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$. We assume $\ell$ is a margin loss (Bartlett et al., 2006) of the form

$$\ell(h(\theta, x), y) = \phi(y \cdot h(\theta, x)), \tag{1}$$

where $\phi : \mathbb{R} \to \mathbb{R}$ is typically a convex, non-increasing function with a bounded value $\phi(0) < \infty$ at 0. Canonical examples of $\phi$ include the hinge loss $\phi(z) = [1 - z]_+$ and the logistic loss $\phi(z) = \log(1 + e^{-z})$.

Let $D = \{(x_i, y_i)\}_{i \in [n]}$ be a training set of size $n$. Typically, the goal is to learn a classifier by minimizing the average loss over all samples given by

$$L(\theta, D) \triangleq \frac{1}{n} \sum_{i \in [n]} \ell(h(\theta, x_i), y_i).$$

We are interested in linear classification models that are described as follows. Let $\theta = (w, b) \in \mathbb{R}^{d+1}$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, and $h((w, b), x_i) = w \cdot x_i + b$. We further assume that $\ell(h(\theta, x), y)$ is convex in $\theta$, and that $\forall y$, $\ell(\cdot, y)$ is convex and $G$-Lipschitz in the first argument. The model can be viewed as a generalized linear model (GLM), with additional restrictions on the loss function. We denote the minimizer of the empirical loss by

$$\theta^*_D \in \arg\min_{\theta} L(\theta, D).$$

We will drop the subscript when the dataset is clear from context. The following assumption on the

minimizer will be useful in our performance analysis. It assumes that the minimizer is not a trivial

solution which does not separate any two data points in the dataset. This is a reasonable assumption

as long as ` is a good proxy loss for classiﬁcation.

Assumption 1 (Nontrivial minimizer). The dataset $D$ and loss function $\ell$ satisfy that at the minimizer $\theta^*$, there exist $x, x' \in D$ such that

$$h(\theta^*, x) \cdot h(\theta^*, x') \le 0.$$
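The definitions above can be made concrete with a small sketch. It evaluates the hinge and logistic margin losses, the linear score, the average loss $L(\theta, D)$, and the condition of Assumption 1 on a toy dataset. The function names (`hinge`, `logistic`, `score`, `empirical_loss`, `is_nontrivial`) are illustrative choices, not names from the paper.

```python
import math

def hinge(z):
    """Hinge loss phi(z) = [1 - z]_+."""
    return max(1.0 - z, 0.0)

def logistic(z):
    """Logistic loss phi(z) = log(1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def score(w, b, x):
    """Linear score h((w, b), x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def empirical_loss(w, b, data, phi=hinge):
    """Average margin loss (1/n) * sum_i phi(y_i * h(theta, x_i))."""
    return sum(phi(y * score(w, b, x)) for x, y in data) / len(data)

def is_nontrivial(w, b, data):
    """Assumption 1: some pair x, x' satisfies h(theta, x) * h(theta, x') <= 0."""
    scores = [score(w, b, x) for x, _ in data]
    # A pair with non-positive product exists exactly when min * max <= 0.
    return min(scores) * max(scores) <= 0
```

For instance, with the two points `((2.0, 1.0), 1)` and `((2.0, -1.0), -1)`, the parameter `w = (0, 1)`, `b = 0` separates the data and satisfies the assumption, whereas `w = (0, 0)`, `b = 1` assigns the same score to every point and does not.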

When unspeciﬁed, we use norm to refer to the Euclidean norm by default.

Algorithm 1 Differentially private SGD (Abadi et al., 2016)

Input: Dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ of $n$ points, privacy parameters $\varepsilon, \delta$, clipping norm $C$, step size $\eta$, number of iterations $T$, batch size $B \ge \max\{n\sqrt{\varepsilon/4T}, 1\}$.

1: Set σ

8T C

log(1/δ)

2: Choose an initial point $\theta_0$.

3: for $t = 0, 1, \ldots, T - 1$ do

4: Sample a batch $B_t$ of $B$ data points with replacement.

5: For all $i \in B_t$, compute the gradient $\nabla \ell(h(\theta_t, x_i), y_i)$.

6: Compute a noisy clipped mean of the gradients by

ˆg

i∈B



Clip(∇`(h(θ

, x

), y

)), C) + N(0, σ

)



(2)

7: Update the parameter by $w_{t+1} = w_t - \eta \hat{g}_t$.

8: end for

9: Return $\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$.
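The loop in Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: Gaussian noise is added to the mean of the clipped gradients (one common convention), the noise scale `sigma` is taken as an input rather than derived from $(\varepsilon, \delta)$, the bias is assumed to be handled by appending a constant feature, and the names `clip`, `hinge_grad`, and `dpsgd` are hypothetical.

```python
import math
import random

def clip(g, c):
    """Clip(g, C) = min{1, C / ||g||_2} * g."""
    norm = math.sqrt(sum(v * v for v in g))
    scale = 1.0 if norm <= c else c / norm
    return [scale * v for v in g]

def hinge_grad(theta, x, y):
    """A subgradient of max{1 - y * (theta . x), 0} with respect to theta."""
    margin = y * sum(t * v for t, v in zip(theta, x))
    return [-y * v for v in x] if margin < 1 else [0.0] * len(x)

def dpsgd(data, grad, dim, clip_norm, sigma, lr, steps, batch_size, rng):
    """Noisy clipped mini-batch SGD; returns the average of the iterates.

    Assumption: sigma must be calibrated by the caller to the target
    (epsilon, delta); this sketch performs no privacy accounting.
    """
    theta = [0.0] * dim
    avg = [0.0] * dim
    for _ in range(steps):
        # Sample a batch with replacement.
        batch = [rng.choice(data) for _ in range(batch_size)]
        mean = [0.0] * dim
        for x, y in batch:
            for j, v in enumerate(clip(grad(theta, x, y), clip_norm)):
                mean[j] += v / batch_size
        # Add isotropic Gaussian noise to the clipped mean.
        noisy = [m + rng.gauss(0.0, sigma) for m in mean]
        theta = [t - lr * g for t, g in zip(theta, noisy)]
        avg = [a + t / steps for a, t in zip(avg, theta)]
    return avg
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which is convenient when comparing clipping norms or noise scales.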

Differential privacy (DP) (Dwork et al., 2014). DP requires the optimization algorithm to output

similar outputs for similar training datasets. More precisely, differential privacy is deﬁned below.

Definition 1 (Differential privacy). Let $\mathcal{D}$ be a collection of datasets. An algorithm $\mathcal{A} : \mathcal{D} \to \mathcal{R}$ is $(\varepsilon, \delta)$-DP if for any datasets $D$ and $D'$ that differ by only one data point, denoted as $|D \Delta D'| = 1$, and any set of potential outcomes $O \subseteq \mathcal{R}$, the algorithm satisfies

$$\Pr(\mathcal{A}(D) \in O) \le e^{\varepsilon} \Pr(\mathcal{A}(D') \in O) + \delta.$$

Our goal is to find a $\theta_{\mathrm{prv}}$ that is $(\varepsilon, \delta)$-differentially private and minimizes the optimality gap w.r.t. the empirical risk, which is also referred to as the privacy error,

$$\mathbb{E}[L(\theta_{\mathrm{prv}}, D)] - \min_{\theta} L(\theta, D),$$

where the expectation is over the randomness of the private algorithm.

Notations. We use $\mathrm{Clip}(x, C)$ to denote the clipping operation with norm $C$, defined by

$$\mathrm{Clip}(x, C) := \min\left\{1, \frac{C}{\|x\|_2}\right\} \cdot x.$$

For a vector $x$ of dimension $d$, we use $(x, 1)$ to refer to the $(d+1)$-dimensional vector obtained from augmenting 1 to $x$. We use $\mathrm{diam}(D) := \max_{x, x' \in D} \|x - x'\|_2$ to denote the diameter of features in the dataset $D$. Given a set of features $x_1, x_2, \ldots, x_n$ from a dataset $D$, let $U(D)$ denote the eigenbasis of $\sum_i x_i x_i^\top$. Let $M(D)$ be the projection operator to this eigenspace $U(D)$, given by $M(D) = U(D)(U(D))^\top$. Then, $M(D)$ defines a seminorm given by

kvk

M(D)

= kvM(D)v
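As a quick illustration of why the diameter can be much smaller than the maximum norm, the sketch below computes both for a list of feature vectors, along with the augmentation $(x, 1)$. The helper names `augment`, `diam`, and `max_norm` are illustrative and do not come from the paper.

```python
import math
from itertools import combinations

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))

def augment(x):
    """Return the (d+1)-dimensional vector (x, 1)."""
    return tuple(x) + (1.0,)

def diam(features):
    """diam(D) = max over pairs x, x' of ||x - x'||_2."""
    return max(
        (norm([a - b for a, b in zip(x, xp)]) for x, xp in combinations(features, 2)),
        default=0.0,
    )

def max_norm(features):
    """Maximum Euclidean norm over all feature vectors."""
    return max(norm(x) for x in features)
```

For two points `(100.0, 1.0)` and `(100.0, -1.0)`, the diameter is 2 while the maximum norm exceeds 100, so a bound that scales with the diameter is far tighter on such data.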

3 RELATED WORK AND OUR CONTRIBUTION

Differentially private stochastic gradient descent (DPSGD). We start by describing the mini-

batch variant of the DPSGD algorithm in Algorithm 1. The DPSGD algorithm is a modiﬁcation

of the popular SGD algorithm for training learning models. In each round, the individual sample

gradients are clipped and noised to bound the per-round privacy loss (the inﬂuence of one data

point). The overall privacy guarantee combines tools from privacy ampliﬁcation via subsampling

and strong composition (see Abadi et al. (2016)).

DPSGD has shown to achieve dimension-independent convergence result for optimizing generalized

linear models. The upper bound below is implied by Song et al. (2020, Theorem 4.1).

Given a vector space V over real numbers, a seminorm on V is a nonnegative-valued function Q such that

for all x ∈ R and v ∈ V , Q(x · v) = |x| · Q(v) and for all u, v ∈ V , Q(u + v) ≤ Q(u) + Q(v).

Lemma 1 (Song et al. (2020)). There exists an (ε, δ)-DP instance of DPSGD, whose output satisﬁes,

E[L(θ

prv

, D)] − L(θ

∗

, D) ≤ 2Gkθ

∗

rank(M) log(1/δ)

nε

where G is the Lipschitz constant of the loss function, R = max

and M = M(D

) with

= {(x, 1) | x ∈ D}.

Note that in the analysis in Song et al. (2020), the norm of the gradients is bounded by $G\sqrt{R^2 + 1}$, and the clip norm is chosen such that the gradients are not clipped. The above result is shown to be minimax optimal in terms of the dependence on the stated parameters. Recall that $\mathrm{diam}(D) := \max_{x, x' \in D} \|x - x'\|_2$ denotes the diameter of features in the dataset $D$. So a natural question to ask is whether the dependence on $R$ can be improved to $\mathrm{diam}(D)$, which can be useful in cases when the dataset is more favorable, e.g., when $\mathrm{diam}(D) \ll R$. If yes, can the improvement be achieved by DPSGD?

We answer the ﬁrst question by proposing an algorithm that combines feature preprocessing and

DPSGD, and improves the dependence on R to diam(D) in the leading term of the optimality gap,

stated below.

Theorem 1. There exists an (ε, δ)-differentially private algorithm DPSGD-F, which is a combina-

tion of private feature preprocessing and DPSGD, such that when n = Ω



√

d log(1/δ)log R log(d)



and ε = O(1), the output θ

prv

satisﬁes

E[L(θ

prv

, D)] − L(θ

∗

, D)

= O

Gkθ

∗



diam(D) +



rank(M) log(1/δ) + log n

nε

φ(0) log(n)

nε

where M = M (D

) with D

= {(x, 1) | x ∈ D}. As discussed before Lemma 3,

can be reduced

to any inverse polynomial function of n by increasing the requirement on n by a constant factor.

We will describe the algorithm and discuss the proof in Section 5. The next theorem states that the

ﬁrst term in the above result is tight.

Theorem 2. Let $\mathcal{A}$ be any $(\varepsilon, \delta)$-DP optimization algorithm with $\varepsilon \le c$ and $\delta \le c\varepsilon/n$ for some constant $c > 0$. There exists a dataset $D = \{(x_i, y_i), i \in [n]\}$ and a loss function $\ell(\theta \cdot x, y)$ that is convex and $G$-Lipschitz for all $y$ and is of the form Eq. (1), such that

E [L(A(D), D)] − L(θ

∗

, D) = Ω

G · diam(D) · min

(

kθ

∗

rank(M)

nε

where M = M(D

) with D

= {(x, 1) | x ∈ D}.

Moreover, we show that DPSGD must incur an error proportional to R by providing a counter-

example in Section 4.

4 A COUNTER-EXAMPLE FOR DPSGD

We consider the simple case of binary classification with hinge loss. Let $\mu > 0$. Let $e_1$ and $e_2$ denote two orthogonal basis vectors. Suppose dataset $D$ is partitioned into two equal parts $D_1$ and $D_{-1}$, each with size $n/2$. $D_1$ contains $n/2$ samples with $x = \mu e_1 + e_2$ and $y = 1$, and $D_{-1}$ contains $n/2$ samples with $x = \mu e_1 - e_2$ and $y = -1$. A visualization of the dataset is provided in Fig. 1. Similarly, let the dataset $D'$ be partitioned into two equal parts $D'_1$ and $D'_{-1}$, each with size $n/2$. $D'_1$ contains $n/2$ samples with $x = \mu e_1 + e_2$ and $y = -1$, and $D'_{-1}$ contains $n/2$ samples with $x = \mu e_1 - e_2$ and $y = 1$. We assume there is no bias term for simplicity, i.e., $b = 0$. For any $w = (w(1), w(2))$, let

$$\ell(w \cdot x, y) = \max\{1 - y(w \cdot x), 0\}.$$

The bias term can be added by assuming there is an additional entry with a 1 in the feature vector. The

proof will go through similarly as the optimal parameter has b = 0.

Figure 1: Feature vectors.

Figure 2: Gradient vectors.

Further, the average empirical loss on $D$ will be

$$L(w, D) = \frac{1}{2} \max\{1 - (\mu w(1) + w(2)), 0\} + \frac{1}{2} \max\{1 + (\mu w(1) - w(2)), 0\}.$$

Observe that $w^* = (0, 1)$ has zero empirical loss for dataset $D$ and $w'^* = (0, -1)$ has zero empirical loss for dataset $D'$. The next theorem states that DPSGD with any clipping norm, batch size, number of steps, learning rate and any initialization will not obtain nontrivial error on both $D$ and $D'$ when $\mu > n\varepsilon$.
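The construction is easy to check numerically. The sketch below builds the two datasets with $e_1 = (1, 0)$ and $e_2 = (0, 1)$ and evaluates the average hinge loss without a bias term; `make_datasets` and `hinge_loss` are illustrative names, and an even `n` is assumed.

```python
def make_datasets(mu, n):
    """Return (D, D') from the counter-example, assuming n is even."""
    half = n // 2
    # D: x = mu*e1 + e2 has label +1, x = mu*e1 - e2 has label -1.
    d = [((mu, 1.0), 1)] * half + [((mu, -1.0), -1)] * half
    # D': same features with the labels flipped.
    d_prime = [((mu, 1.0), -1)] * half + [((mu, -1.0), 1)] * half
    return d, d_prime

def hinge_loss(w, data):
    """Average hinge loss max{1 - y * (w . x), 0} with no bias term."""
    total = 0.0
    for x, y in data:
        total += max(1.0 - y * (w[0] * x[0] + w[1] * x[1]), 0.0)
    return total / len(data)
```

With any `mu`, the parameter `(0, 1)` attains zero loss on `D` and loss 2 on `D'`, and `(0, -1)` does the opposite, so the sign of the second coordinate alone decides which dataset is fit.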

Theorem 3. Let $\mathcal{A}$ be an $(\varepsilon, \delta)$-private DPSGD algorithm with any number of steps $T$, learning rate $\eta$, clipping norm $C$, and (possibly randomized) initialization $w_0$. We have when $n < \mu/\varepsilon$,

$$\max_{D \in \{D, D'\}} \left( \mathbb{E}[L(\mathcal{A}(D), D)] - \min_{w} L(w, D) \right) = \Omega(1).$$

The crux of the proof involves showing that if the initialization is "uninformative" for a dataset, then DPSGD does not obtain nontrivial error when $\mu > n\varepsilon$. Here by "uninformative", we mean $w_0 = (w_0(1), w_0(2))$ satisfies $w_0(2) \le 0$ for dataset $D$ and $w_0(2) \ge 0$ for dataset $D'$. Note that at least one of the above two conditions must hold. Below we give a high-level idea of the proof. Since $D_1$ and $D_{-1}$ (and similarly $D'_1$ and $D'_{-1}$) have the same $x(1)$, the performance of the final optimizer will depend mostly on the second parameter $w(2)$. The following lemma can be proved.

Lemma 2. For dataset $D$, let $w$ be any iterate of the DPSGD algorithm, then

$$\mathbb{E}[L(w, D)] - \min_{w} L(w, D) \ge \Pr(w(2) \le 0).$$

Similarly, for dataset $D'$,

$$\mathbb{E}[L(w, D')] - \min_{w} L(w, D') \ge \Pr(w(2) \ge 0).$$

We will focus on the case when $w_0(2) \le 0$ and consider dataset $D$ and loss $L(w, D)$. It would be enough to show that for every iteration $t$, the parameter $w_t(2)$ has a constant probability of being negative. The gradient of each individual loss function satisfies that for $(x, y) \in D$, we have

$$\nabla_w \ell(w \cdot x, y) = \begin{cases} \mathbb{1}\{\mu w(1) + w(2) < 1\} \cdot (-\mu, -1) & \text{if } (x, y) \in D_1, \\ \mathbb{1}\{-\mu w(1) + w(2) < 1\} \cdot (\mu, -1) & \text{if } (x, y) \in D_{-1}. \end{cases}$$

Hence the norm of the gradient vectors can be as large as $\sqrt{\mu^2 + 1}$. By clipping the gradient to norm $C$, the 'signal' component of the gradient on the second coordinate will be decreased to $\min\{1, C/\sqrt{\mu^2 + 1}\}$. However, in each iteration, DPSGD will add a Gaussian noise with standard deviation proportional to $C/n\varepsilon$, which can be large compared to $\min\{1, C/\sqrt{\mu^2 + 1}\}$ when $\mu$ is large. When $n < \mu/\varepsilon$, the total noise will dominate the signal, making the probability of $w_t(2) < 0$ nontrivial. We provide the complete proof in Appendix A.
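The signal-versus-noise comparison in this argument can be tabulated directly. The sketch below is only a back-of-the-envelope check: it takes the noise scale to be exactly $C/(n\varepsilon)$, dropping the constants and logarithmic factors hidden by "proportional to", and the function names are illustrative.

```python
import math

def clipped_signal(clip_norm, mu):
    """Second-coordinate signal after clipping: min{1, C / sqrt(mu^2 + 1)}."""
    return min(1.0, clip_norm / math.sqrt(mu * mu + 1.0))

def noise_scale(clip_norm, n, eps):
    """Per-step noise scale, taken here as C / (n * eps) with constants dropped."""
    return clip_norm / (n * eps)

def noise_to_signal(clip_norm, mu, n, eps):
    """Ratio of noise scale to clipped signal for a given clipping norm."""
    return noise_scale(clip_norm, n, eps) / clipped_signal(clip_norm, mu)
```

Whatever clipping norm is chosen, the ratio is at least $\sqrt{\mu^2 + 1}/(n\varepsilon)$: a small $C$ shrinks signal and noise together, while a large $C$ leaves the signal capped at 1 and only inflates the noise. This is why no choice of $C$ helps once $n < \mu/\varepsilon$.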

DPSGD with better mean estimation algorithm. One attempt to solve the above issue is to

replace Eq. (2) by recently proposed private mean estimation algorithms, which can improve the

estimation error to scale with the diameter of the gradients (e.g., in Karwa & Vadhan (2017); Huang

et al. (2021)). However, it is unclear whether this will work. As shown in Fig. 2, the diameter of the

gradients can be proportional to the maximum norm of feature vectors instead of the diameter.

When $w_0(2) \ge 0$, we will consider $D'$ and loss $L(w, D')$ instead.

Algorithm 2 DPSGD with feature preprocessing (DPSGD-F).

Input: Dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ of $n$ points, $\varepsilon, \delta$. Private mean estimation algorithm PrivMean (Lemma 3), private quantile estimation algorithm PrivQuantile (Lemma 4), DPSGD for GLM (Lemma 1).

1: Private mean estimation. Compute $\hat{\mu}$, the differentially private mean of features in $D$ with Lemma 3 and privacy budget $(\varepsilon/3, \delta/3)$.

2: Private quantile estimation. Let $S = \{\|x - \hat{\mu}\|_2, x \in D\}$ be the set of distances from the feature vectors to the computed mean. Find a private quantile $\tau$ of set $S$ using Lemma 4 with privacy budget $(\varepsilon/3, \delta/3)$.

3: Translate and augment the dataset with bias. Preprocess the dataset by translation and appending a constant feature $\tau$. Let $D' = \{(x'_1, y_1), \ldots, (x'_n, y_n)\}$ where $x'_i = (x_i - \hat{\mu}, \tau)$.

4: Feature clipping. Let $D'' = \{(x''_1, y_1), (x''_2, y_2), \ldots, (x''_n, y_n)\}$, where $x''_i = \mathrm{Clip}(x'_i, \sqrt{2}\tau)$.

5: DPSGD with preprocessed features. Compute an approximate minimizer $\theta'_{\mathrm{prv}} \in \mathbb{R}^{d+1}$ of $L(\theta, D'')$ using the DPSGD algorithm for GLM in Lemma 1 with $(\varepsilon/3, \delta/3)$.

6: Return $\theta_{\mathrm{prv}} = (w_{\mathrm{prv}}, b_{\mathrm{prv}})$, where $w_{\mathrm{prv}} = \theta'_{\mathrm{prv}}[1 : d]$ and $b_{\mathrm{prv}} = \theta'_{\mathrm{prv}}[d + 1] \cdot \tau - w_{\mathrm{prv}} \cdot \hat{\mu}$.

In the next section, we take advantage of the fact that the data points are close in the feature space, and design an algorithm that first privately preprocesses the features and then performs DPSGD to achieve better bounds. In particular, the bound in Theorem 1 states that when

max{

log(µ) log(1/δ)/ε,

µ/ε}  n < µ/ε, the new algorithm achieves a o(1) error, which is

strictly better than DPSGD.

5 A FEATURE PREPROCESSING AUGMENTED DPSGD ALGORITHM

We propose a new algorithm called DPSGD-F (Algorithm 2) that combines feature preprocessing

and DPSGD. We show that the new algorithm can achieve the performance described in Theorem 1.

We overview each step of the algorithm below and provide theoretical guarantees for each step. The

detailed proofs are listed in Appendix B.

Private mean estimation of features. We ﬁrst compute a differentially private mean of features

ˆµ using the instance-optimal mean estimation algorithm from Huang et al. (2021), who showed

that the mean can be estimated efﬁciently with error proportional to the diameter of the dataset. The

following result on private mean estimation follows by combining Theorem 3.3 and Remark 3.2 from

Huang et al. (2021). Note that the result of Huang et al. (2021) is a high-probability bound, which

does not explicitly depend on R (or r(D) in their notation). To obtain an in-expectation bound, we

choose a failure probability of 1/n

in the proof and this results in the R/n

term in Lemma 3. This

term can be reduced to any inverse polynomial function of n by increasing the requirement on n

in the assumption by a constant factor. We choose 1/n

here mainly for presentation purpose. See

Appendix B.1 for the detailed analysis.

Lemma 3 (Appendix B.1). Let ε ≤ log(3/δ) and n ≥ c ·



√

d log(1/δ) log R log(d)



for a sufﬁciently

large constant c, then there exists an (ε/3, δ/3)-DP algorithm whose output ˆµ ∈ B

(R) satisﬁes

E[kµ − ˆµk

] ≤ diam(D) +

The above lemma implies that if n is large enough, ˆµ is a good approximation for µ and hence

∀x

∈ D, E[kx

− ˆµk

] ≤ 2diam(D) +

. (3)

Existing algorithms for differentially private ERM require the knowledge of diam(D), which is

unknown. Hence, we compute an estimate of diam(D) by private quantile estimation.

Private quantile estimation. Changing one sample in a dataset can change the diameter by R.

Hence, the sensitivity of diam(D) is R and it is difﬁcult to estimate it privately with good accuracy.

Instead we estimate the

O(1/(nε)) quantile of kx

− ˆµk

using Dick et al. (2023, Theorem 2).

Lemma 4 (Appendix B.2). There exists a (ε/3, δ/3) algorithm that ﬁnds a threshold τ such that

E[τ|ˆµ] ≤ 2 max

− ˆµk

Let ε = O(1) and Z

be the number of points such that kx

− ˆµk is larger than τ, then E[Z

| ˆµ] ≤

125

log n.

Translate and augment dataset with bias. Another issue with the above mentioned approach is the scale of the bias term. If $\theta = (w, b)$, then the gradient with respect to $w$ scales linearly in $x$, but is a fixed constant for $b$. This causes issues in the theoretical analysis. Hence, we construct an augmented dataset $D'$ where $x'_i = (x_i - \hat{\mu}, \tau)$. Note that $x'_i \in \mathbb{R}^{d+1}$. We also remove the bias term when optimizing over $D'$ since we have already included a bias term in the augmented features. This also guarantees that the norm of the gradient is bounded by $\Theta(\tau)$ with high probability. We show that we are not losing any information in translation and augmentation and hence the minimum value of empirical risk should remain the same. We formalize it in the next lemma.

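Lemma 5. For any $\theta = (w, b)$, let $\theta' = (w, (b + w \cdot \hat{\mu})/\tau)$, we have

$$L(\theta, D) = L(\theta', D').$$

In particular, $(w^*, b^*)$ is a minimizer of $D$ if and only if $(w^*, (b^* + \hat{\mu} \cdot w^*)/\tau)$ is a minimizer of $D'$.

Proof. To prove the first result, observe that for any point $(x, y) \in D$ with augmented counterpart $(x', y') \in D'$,

$$\begin{aligned} \ell(h((w, b), x), y) &= \phi(y \cdot (w \cdot x + b)) \\ &= \phi(y \cdot (w \cdot (x - \hat{\mu}) + b + w \cdot \hat{\mu})) \\ &= \phi(y' \cdot (w \cdot (x - \hat{\mu}) + (b + w \cdot \hat{\mu})/\tau \cdot \tau)) \\ &= \ell(h((w, (b + w \cdot \hat{\mu})/\tau), x'), y'). \end{aligned}$$

The second result follows by taking the minima over all $(w, b)$ in the first result.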
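The identity in Lemma 5 can be verified numerically on a toy dataset. The sketch below uses the hinge loss as the margin loss and treats `mu_hat` and `tau` as given inputs (in the algorithm they come from the private mean and quantile steps); all function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge(z):
    return max(1.0 - z, 0.0)

def loss_original(w, b, data):
    """L(theta, D) with theta = (w, b) and score w . x + b."""
    return sum(hinge(y * (dot(w, x) + b)) for x, y in data) / len(data)

def augment_dataset(data, mu_hat, tau):
    """Build D': each x becomes (x - mu_hat, tau); labels are unchanged."""
    return [([a - m for a, m in zip(x, mu_hat)] + [tau], y) for x, y in data]

def augment_parameter(w, b, mu_hat, tau):
    """Map theta = (w, b) to theta' = (w, (b + w . mu_hat) / tau)."""
    return list(w) + [(b + dot(w, mu_hat)) / tau]

def loss_augmented(theta, data_aug):
    """L(theta', D') with a bias-free score theta' . x'."""
    return sum(hinge(y * dot(theta, x)) for x, y in data_aug) / len(data_aug)
```

For any `w`, `b`, `mu_hat`, and positive `tau`, `loss_original(w, b, data)` and `loss_augmented(augment_parameter(w, b, mu_hat, tau), augment_dataset(data, mu_hat, tau))` agree up to floating-point error, since each per-sample score is unchanged by the mapping.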

Recall that results on privacy error involve the projector on the design matrix and the norm of the

minimizer. Hence, we next provide connections between the rank of M (D) and M(D

) and the

norms of the minimizers of D and D

. The proof mainly follows from the fact that the translating

and augmenting operations are performing the same linear operation on each data point, which won’t

change the minimizer up to scaling and the bias term.

Lemma 6 (Appendix B.3). Let $M(D)$ and $M(D')$ be the projectors to the design matrices of the non-augmented dataset $D$ and the augmented dataset $D'$ respectively, then

$$\mathrm{rank}(M(D')) \le \mathrm{rank}(M(D)) + 5.$$

Furthermore, let θ

∗

and θ

0∗

be the empirical risk minimizers of D and D

respectively. If Assump-

tion 1 holds then

E[kθ

0∗

· τ|ˆµ] ≤ 2kθ

∗

max

∈D

kˆµ − x

2kθ

∗

Feature clipping. As mentioned above, due to errors in previous steps, the $\ell_2$ norm of the features may not be bounded with a small probability. Hence, we clip the augmented features from dataset $D'$ to obtain a dataset $D''$. Similar to Lemma 6, we relate the rank of $M(D')$ and $M(D'')$ next. The proof mainly follows from the fact that clipping will only change the scale of each feature vector but not their direction.

Lemma 7 (Appendix B.4). Let $M(D')$ and $M(D'')$ be the projectors to the design matrices of datasets $D'$ and $D''$ respectively, then

$$\mathrm{rank}(M(D'')) = \mathrm{rank}(M(D')).$$

DPSGD with preprocessed features. We next solve ERM using the DPSGD algorithm on $D''$. Since the norm of the features is bounded by $\sqrt{2}\tau$, the gradients will be bounded by $\sqrt{2}G\tau$. Furthermore, since the dataset is augmented, we don't include the bias term and treat weights as a vector in $\mathbb{R}^{d+1}$. The guarantee of DPSGD on these features follows from previous work (Song et al., 2020, Theorem 4.1). Similar to Song et al. (2020), the clipping norm is chosen to be $\sqrt{2}G\tau$ so that no gradients are clipped in all steps.

Reverting the solution to the original space. The output of the DPSGD algorithm is in $\mathbb{R}^{d+1}$, and furthermore only works on the translated and augmented dataset $D'$. Hence, we translate it back to the original space, by appropriately obtaining $w_{\mathrm{prv}}$ and rescaling the bias.
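The non-private parts of the pipeline (translate, augment, clip, and revert) can be sketched as follows. This is an illustration under stated assumptions: `mu_hat` and `tau` are supplied by the caller, whereas Algorithm 2 obtains them from the private mean and quantile steps, and the bias recovery follows the mapping of Lemma 5. The names `preprocess` and `revert` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def preprocess(features, mu_hat, tau):
    """Translate by mu_hat, append tau, then clip to norm sqrt(2) * tau."""
    bound = math.sqrt(2.0) * tau
    out = []
    for x in features:
        z = [a - m for a, m in zip(x, mu_hat)] + [tau]
        norm = math.sqrt(dot(z, z))
        scale = 1.0 if norm <= bound else bound / norm
        out.append([scale * v for v in z])
    return out

def revert(theta_aug, mu_hat, tau):
    """Map a bias-free solution on augmented features back to (w, b)."""
    w = list(theta_aug[:-1])
    # Inverse of theta' = (w, (b + w . mu_hat) / tau) from Lemma 5.
    b = theta_aug[-1] * tau - dot(w, mu_hat)
    return w, b
```

For any point within distance `tau` of `mu_hat`, the augmented feature has norm at most `sqrt(2) * tau` and is left unclipped, so the score `w . x + b` of the reverted parameter equals the score of `theta_aug` on the preprocessed feature. Points farther away are rescaled, which is where the clipping step can change the loss.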

The computation complexity of the algorithm is the same as DPSGD since the preprocessing algo-

rithms can all be implemented in time

O(nd), which is less than that of the DPSGD phase. The

proof of privacy guarantee in Theorem 1 follows directly from the composition theorem (Dwork

et al., 2014, Theorem 3.16). The utility guarantee follows from the above stated results and we

provide the complete proof in Appendix B.5.

6 LOWER BOUND

In this section, we prove Theorem 2, which shows that the performance of our algorithm is tight

up to logarithmic factors. The proof follows almost immediately from Theorem 3.3 in Song et al.

(2021a), with a slight modiﬁcation to the label and loss function to make sure that it is a valid

classiﬁcation problem of form Eq. (1). We ﬁrst state the result in Song et al. (2021a) (a restated

version of Theorem 3.3) below.

Theorem 4 (Song et al. (2021a)). Let A be any (ε, δ)-DP algorithm with ε ≤ c and δ ≤ cε/n for

some constant c > 0. Let 1 ≤ d

≤ d be an integer. Let ` be deﬁned by `(θ ·x, y) = |θ ·x −y|. Then

there exists a data set D = {(x

, y

), i ∈ [n]} such that the following holds. For all i ∈ [n], x

∈

{(0, . . . , 0), (1, 0, . . . , 0), (0, 1, 0, . . . , 0), (0, . . . , 0, 1)} and y

∈ {0, 1}. Let θ

∗

= arg min

θ∈R

L(θ; D)

(breaking ties towards lower kθk

). We have kθ

∗

∈ [0, 1]

, and



Proj

[0,1]

(A(D)); D

i

− L (θ

∗

; D) ≥ Ω



min



εn



where Proj

[0,1]

(A(D)) denotes the projection of A(D) to [0, 1]

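Proof of Theorem 2: Take the dataset $D = \{(x_i, y_i), i \in [n]\}$ from Theorem 4, we can construct a dataset $D'$ as follows. $\forall i$, let $x'_i = x_i$, $y'_i = 2y_i - 1$. Set $\ell'(\theta \cdot x, y) = \max\{1 - y(2\theta \cdot x - 1), 0\}$, which is of the form Eq. (1). Then, for all $\theta \in [0, 1]^d$, we have $\theta \cdot x \in [0, 1]$ and

$$\ell'(\theta \cdot x, y) = \begin{cases} 2 - 2\theta \cdot x = |2\theta \cdot x - 1 - y| & \text{if } y = +1, \\ 2\theta \cdot x = |2\theta \cdot x - 1 - y| & \text{if } y = -1. \end{cases}$$

Hence $\forall i \in [n]$, $\ell'(\theta \cdot x'_i, y'_i) = |2\theta \cdot x_i - 1 - y'_i| = 2|\theta \cdot x_i - y_i|$. Moreover, it can be seen that $\forall x \in [0, 1]^d$, $y \in \{+1, -1\}$,

$$\ell'(\mathrm{Proj}_{[0,1]^d}(\theta) \cdot x, y) \le \ell'(\theta \cdot x, y). \tag{4}$$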
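Both facts used in this step can be checked with a few lines of Python. The sketch below works with the scalar score $s = \theta \cdot x$; the projection check is restricted to the case where $x$ is a coordinate vector, as in the dataset of Theorem 4, so that projecting $\theta$ amounts to clamping $s$ to $[0, 1]$. Function names are illustrative.

```python
def abs_loss(s, y01):
    """Loss from Theorem 4: |s - y| with label y in {0, 1}."""
    return abs(s - y01)

def hinge_variant(s, y_pm):
    """Loss l'(s, y) = max{1 - y * (2s - 1), 0} with label y in {-1, +1}."""
    return max(1.0 - y_pm * (2.0 * s - 1.0), 0.0)

def clamp01(s):
    """Projection of a scalar onto [0, 1]."""
    return min(max(s, 0.0), 1.0)
```

On scores in $[0, 1]$ the new loss equals twice the absolute loss under the relabeling $y' = 2y - 1$, and clamping a score never increases the new loss, which is what lets the lower bound for the projected output transfer to the unprojected one.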

Hence the minimizer of L

(θ; D

) lies in [0, 1]

. The above facts imply that θ

∗

is also the minimizer

of L

(θ; D

), and by Theorem 4,



Proj

[0,1]

(A(D

)), D

i

− L

(θ

∗

; D

) ≥ Ω



min



εn



Together with Eq. (4), we have

E [L

(A(D

), D

)] − L

(θ

∗

; D

) ≥ Ω



min



εn



Theorem 3.3 in Song et al. (2021a) is stated without the projection operator while in fact the projection

operator is used throughout the proof of Theorem B.1 and Theorem 3.3. Hence the stated result holds with no

change in the proof.

Note that for all θ ∈ R

, ∀i ∈ [n], k∂

(h(θ, x

), y

≤ 2kx

≤ 2. And the dataset satisﬁes

that diam(D) ≤

√

2, rank(

i=1

) ≤ d

. Moreover, kθ

∗

≤ kθ

∗

≤

√

. Hence we have

E [L

(A(D

), D

)] − L

(θ

∗

; D

) ≥ Ω

diam(D

) min

(

kθ

∗

rank(M)

nε

The proof can be generalized to a general Lipschitz constant G and diam(D) by rescaling.

7 EXPERIMENTS

We empirically demonstrate our findings by evaluating DPSGD and our proposed feature-normalized DPSGD (DPSGD-F) for the task of training a linear classifier on three popular image

classiﬁcation datasets: (1) MNIST (Lecun et al., 1998); (2) Fashion-MNIST (Xiao et al., 2017); (3)

CIFAR-100 (Krizhevsky et al., 2009) with pretrained features. For MNIST and Fashion-MNIST, we

directly train a linear classiﬁer with the pixel-format images as inputs. For CIFAR-100, we use the

features obtained from the representations in the penultimate layer in a WRN model pretrained on

ImageNet (De et al., 2022). In this case, training a linear classiﬁer using these features is equivalent

to ﬁne-tuning the last layer of the WRN model.

Table 1: Accuracy comparison on different datasets with different values of ε and δ = $10^{-5}$. † CIFAR-100 is trained and tested using features pretrained on ImageNet (De et al., 2022). Each number is an average over 10 independent runs. The number in the parenthesis represents the standard deviation of the accuracy under the optimal parameter settings.

Dataset                    Accuracy (ε = 1)          Accuracy (ε = 2)          Non-private acc.
                           DPSGD       DPSGD-F       DPSGD       DPSGD-F       SGD
MNIST                      87.4 (0.1)  92.0 (0.1)    89.6 (1.2)  92.3 (0.1)    92.9 (0.1)
FMNIST                     77.2 (0.4)  84.0 (0.1)    78.7 (0.4)  84.5 (0.2)    85.0 (0.1)
CIFAR-100 (pretrained)†    70.6 (0.3)  71.6 (0.2)    73.8 (0.2)  74.4 (0.2)    78.5 (0.1)

Similar to De et al. (2022), for each combination of dataset, privacy parameter, and algorithm, we report the best accuracy obtained from a grid search over batch size, step size, and number of epochs. For DPSGD-F, we also did a grid search over the allocation of privacy budget between the preprocessing phase and the DPSGD phase. We leave choosing the parameters automatically as a future direction. Detailed implementation and parameter settings are listed in Appendix C.

Our results are listed in Table 1. Our proposed algorithm consistently improves upon DPSGD for

all three datasets under ε = 1 and ε = 2, which empirically demonstrates our theoretical ﬁndings.

8 DISCUSSION

In this work, we ask the fundamental question of whether DPSGD alone is sufficient for obtaining a good minimizer for linear classification. We partially answer this question by showing that the DPSGD algorithm incurs a privacy error proportional to the maximum norm of the features over all samples, whereas DPSGD with a feature preprocessing step incurs a privacy error proportional to the diameter of the features, which can be significantly smaller than the maximum norm of features in several scenarios. Our preliminary experiments show that feature preprocessing helps in practice on standard datasets. Whether these results can be extended to bigger datasets and beyond linear models remains an interesting open question.

We perform a modified version of Algorithm 2, which better aligns with existing practical implementations of DPSGD. See Algorithm 3 for details. Note that the resulting algorithm is still a valid (ε, δ)-differentially private algorithm.

The reported accuracy using DPSGD in De et al. (2022) is 70.3%. We note that the improvement to 70.6% is due to the fact that we are using the privacy accountant based on Privacy Loss Distributions (Meiser & Mohammadi, 2018; Sommer et al., 2019; Doroshenko et al., 2022), which leads to a smaller noise multiplier under the same privacy budget.

The reported accuracy doesn't take into account the privacy loss in the hyperparameter tuning phase.

REFERENCES

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and

Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC

Conference on Computer and Communications Security, pp. 308–318, 2016.

Raman Arora, Raef Bassily, Cristóbal Guzmán, Michael Menart, and Enayat Ullah. Differentially private generalized linear models revisited. Advances in Neural Information Processing Systems, 35:22505–22517, 2022.

Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 394–403. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/balle18a.html.

Peter L. Bartlett, Michael I. Jordan, and Jon D. Mcauliffe. Convexity, classiﬁcation, and risk bounds.

Journal of the American Statistical Association, 101(473):138–156, 2006. ISSN 01621459. URL

http://www.jstor.org/stable/30047445.

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic

convex optimization with optimal rates. In Advances in Neural Information Processing Systems,

pp. 11279–11288, 2019.

Raef Bassily, Vitaly Feldman, Crist

obal Guzm

an, and Kunal Talwar. Stability of stochastic gradient

descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33:

4381–4391, 2020.

Raef Bassily, Crist

obal Guzm

an, and Michael Menart. Differentially private stochastic optimization:

New results in convex and non-convex settings. Advances in Neural Information Processing

Systems, 34:9317–9329, 2021a.

Raef Bassily, Crist

obal Guzm

an, and Anupama Nandi. Non-euclidean differentially private stochas-

tic convex optimization. In Conference on Learning Theory, pp. 474–499. PMLR, 2021b.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal

Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao

Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http:

//github.com/google/jax.

Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlock-

ing high-accuracy differentially private image classiﬁcation through scale. arXiv preprint

arXiv:2204.13650, 2022.

Travis Dick, Alex Kulesza, Ziteng Sun, and Ananda Theertha Suresh. Subset-based instance opti-

mality in private estimation. arXiv preprint arXiv:2303.01262, 2023.

Vadym Doroshenko, Badih Ghazi, Pritish Kamath, Ravi Kumar, and Pasin Manurangsi. Connect

the dots: Tighter discrete approximations of privacy loss distributions. Proceedings on Privacy

Enhancing Technologies, 4:552–570, 2022.

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations

and Trends

 in Theoretical Computer Science, 9(3–4):211–407, 2014.

Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal

rates in linear time, 2020.

Sahra Ghalebikesabi, Leonard Berrada, Sven Gowal, Ira Ktena, Robert Stanforth, Jamie Hayes,

Soham De, Samuel L Smith, Olivia Wiles, and Borja Balle. Differentially private diffusion models

generate useful synthetic images. arXiv preprint arXiv:2302.13861, 2023.

Google. Tensorﬂow privacy, 2018. URL https://github.com/google/

differential-privacy/.

Published as a conference paper at ICLR 2024

Ziyue Huang, Yuting Liang, and Ke Yi. Instance-optimal mean estimation under differential privacy.

Advances in Neural Information Processing Systems, 34:25993–26004, 2021.

Peter Kairouz, Ziyu Liu, and Thomas Steinke. The distributed discrete gaussian mechanism for

federated learning with secure aggregation. In International Conference on Machine Learning,

pp. 5201–5212. PMLR, 2021.

Vishesh Karwa and Salil Vadhan. Finite sample differentially private conﬁdence intervals, 2017.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.

2009.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recog-

nition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.

Yann LeCun, L

eon Bottou, Genevieve B Orr, and Klaus-Robert M

uller. Efﬁcient backprop. In

Neural networks: Tricks of the trade, pp. 9–50. Springer, 2002.

Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be

strong differentially private learners. In International Conference on Learning Representations,

2021.

Xuechen Li, Daogao Liu, Tatsunori B Hashimoto, Huseyin A. Inan, Janardhan Kulka-

rni, Yin-Tat Lee, and Abhradeep Guha Thakurta. When does differentially pri-

vate learning not suffer in high dimensions? In S. Koyejo, S. Mohamed,

A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Infor-

mation Processing Systems, volume 35, pp. 28616–28630. Curran Associates, Inc.,

2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/

file/b75ce884441c983f7357a312ffa02a3c-Paper-Conference.pdf.

H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private

recurrent language models. In International Conference on Learning Representations, 2018.

Sebastian Meiser and Esfandiar Mohammadi. Tight on budget? tight bounds for r-fold approximate

differential privacy. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and

Communications Security, CCS ’18, pp. 247–264, New York, NY, USA, 2018. Association for

Computing Machinery. ISBN 9781450356930. doi: 10.1145/3243734.3243765. URL https:

//doi.org/10.1145/3243734.3243765.

Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H Brendan

McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Thakurta. How to dp-fy ml: A

practical guide to machine learning with differential privacy. arXiv preprint arXiv:2303.00654,

2023.

David Sommer, Sebastian Meiser, and Esfandiar Mohammadi. Privacy loss classes: The central

limit theorem in differential privacy. Proceedings on Privacy Enhancing Technologies, 2019:

245–269, 04 2019. doi: 10.2478/popets-2019-0029.

Shuang Song, Om Thakkar, and Abhradeep Thakurta. Characterizing private clipped gradient de-

scent on convex generalized linear problems. arXiv preprint arXiv:2006.06783, 2020.

Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimen-

sionality in unconstrained private glms. In Arindam Banerjee and Kenji Fukumizu (eds.), Pro-

ceedings of The 24th International Conference on Artiﬁcial Intelligence and Statistics, volume

130 of Proceedings of Machine Learning Research, pp. 2638–2646. PMLR, 13–15 Apr 2021a.

URL https://proceedings.mlr.press/v130/song21a.html.

Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimen-

sionality in unconstrained private glms. In International Conference on Artiﬁcial Intelligence and

Statistics, pp. 2638–2646. PMLR, 2021b.

Florian Tramer and Dan Boneh. Differentially private learning needs better features (or much

more data). In International Conference on Learning Representations, 2021. URL https:

//openreview.net/forum?id=YTWGvpFOQD-.

Published as a conference paper at ICLR 2024

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmark-

ing machine learning algorithms. CoRR, abs/1708.07747, 2017. URL http://arxiv.org/

abs/1708.07747.

Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan

Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, et al. Differentially private ﬁne-tuning

of language models. arXiv preprint arXiv:2110.06500, 2021.

A PROOF OF THEOREM 3

In the proof, we assume $w_0(2) \le 0$ and consider $D = D'$. The proof will follow similarly when $w_0(2) \ge 0$ and $D = D''$. We start by proving Lemma 2.

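Proof of Lemma 2: Since $\max\{1 - x, 0\}$ is a convex function in $x$, we have

\[
\begin{aligned}
L(w, D) &= \tfrac{1}{2}\max\{1 - (\mu w(1) + w(2)), 0\} + \tfrac{1}{2}\max\{1 + (\mu w(1) - w(2)), 0\} \\
&\ge \max\Big\{\frac{1 - (\mu w(1) + w(2)) + 1 + (\mu w(1) - w(2))}{2}, 0\Big\} \\
&= \max\{1 - w(2), 0\}.
\end{aligned}
\]

This implies that

\[
\mathbb{E}[L(w, D)] - \min_{w} L(w, D) \ge \mathbb{E}[\max\{1 - w(2), 0\}] \ge \Pr(w(2) \le 0).
\]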
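The convexity step in the proof of Lemma 2 can be checked numerically. The sketch below is only an illustration: it takes the two hinge terms exactly as they appear in the expression for L(w, D), assumes they are weighted equally, and verifies the lower bound max{1 − w(2), 0} on a small grid. The function names and the grid are hypothetical choices, not part of the paper.

```python
def hinge(margin):
    """Hinge loss max{1 - margin, 0}."""
    return max(1.0 - margin, 0.0)


def average_loss(w, mu):
    """L(w, D) with two equally weighted hinge terms (assumed weighting)."""
    w1, w2 = w
    return 0.5 * hinge(mu * w1 + w2) + 0.5 * hinge(-mu * w1 + w2)


def lemma2_lower_bound(w):
    """The convexity lower bound max{1 - w(2), 0}."""
    return max(1.0 - w[1], 0.0)


def check_grid(mu, values):
    """Return True if L(w, D) >= max{1 - w(2), 0} at every grid point."""
    return all(
        average_loss((w1, w2), mu) >= lemma2_lower_bound((w1, w2)) - 1e-12
        for w1 in values
        for w2 in values
    )


grid = [x / 4.0 for x in range(-12, 13)]  # -3.0, -2.75, ..., 3.0
print(check_grid(mu=10.0, values=grid))
print(average_loss((0.0, 1.0), mu=10.0))  # loss at w = (0, 1)
```

The point w = (0, 1) attains zero loss, which is why the excess risk in the second display reduces to the expectation of max{1 − w(2), 0}.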

Note that the proof also applies to the final output $A(D) = \frac{1}{T}\sum_{t=1}^{T} w_t$. Hence, to prove Theorem 3, it is enough to show that

\[
\Pr\Big(\frac{1}{T}\sum_{t=1}^{T} w_t(2) \le 0\Big) = \Omega(1).
\]

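In the later analysis, we will focus on the update process of $w_t(2)$. We will prove the following lemma:

Lemma 8. Let $X$ be distributed as $\mathcal{N}\Big(\min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big), \sigma^2\Big)$ with $\sigma^2 = \Theta\Big(\frac{C^2 \log(1/\delta)}{n^2 \varepsilon^2}\Big)$. Then

\[
\Pr\Big(\frac{1}{T}\sum_{t=1}^{T} w_t(2) \le 0\Big) \ge \Pr(X \le 0).
\]

Note that Lemma 8 would immediately imply the theorem by Chernoff's inequality. We then turn to proving Lemma 8 below.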

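For $(x, y) \in D$, we have

\[
\nabla_w \ell(w \cdot x, y) =
\begin{cases}
\mathbb{1}\{\mu w(1) + w(2) < 1\} \cdot (-\mu, -1) & \text{if } (x, y) \in D_1, \\
\mathbb{1}\{-\mu w(1) + w(2) < 1\} \cdot (\mu, -1) & \text{if } (x, y) \in D_{-1}.
\end{cases}
\]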

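Let $B_t$ be the batch of users in iteration $t$ with $|B_t| = B$. The averaged clipped gradient at each iteration will be

\[
\begin{aligned}
\hat{g}_t &= \frac{1}{B} \sum_{(x,y) \in B_t} \mathrm{Clip}(\nabla_w \ell(w_t \cdot x, y), C) \\
&= \frac{1}{B} \min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big) \sum_{(x,y) \in B_t \cap D_1} \mathbb{1}\{\mu w_t(1) + w_t(2) < 1\} \cdot (-\mu, -1) \\
&\quad + \frac{1}{B} \min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big) \sum_{(x,y) \in B_t \cap D_{-1}} \mathbb{1}\{-\mu w_t(1) + w_t(2) < 1\} \cdot (\mu, -1).
\end{aligned}
\]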


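Hence

\[
\begin{aligned}
\hat{g}_t(2) &= -\frac{1}{B} \min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big) \Big(|B_t \cap D_1| \, \mathbb{1}\{\mu w_t(1) + w_t(2) < 1\} + |B_t \cap D_{-1}| \, \mathbb{1}\{-\mu w_t(1) + w_t(2) < 1\}\Big) \\
&\ge -\frac{1}{B} \min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big) \big(|B_t \cap D_1| + |B_t \cap D_{-1}|\big) \\
&= -\min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big).
\end{aligned}
\]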
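To see how clipping shrinks the second coordinate of the gradient when µ is large, the following sketch applies standard norm clipping to the vector (−µ, −1) and compares the result with the closed form −min(C/√(µ² + 1), 1). The helper names are hypothetical, and the clipping rule used here (rescale to norm at most C) is the usual DPSGD convention, assumed to match Clip(·, C) in the text.

```python
import math


def clip(vec, c):
    """Scale vec so that its Euclidean norm is at most c."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(c / norm, 1.0) if norm > 0 else 1.0
    return tuple(scale * v for v in vec)


def clipped_second_coordinate(mu, c):
    """Second coordinate of Clip((-mu, -1), c)."""
    return clip((-mu, -1.0), c)[1]


for mu in (1.0, 10.0, 100.0):
    closed_form = -min(1.0 / math.sqrt(mu * mu + 1.0), 1.0)
    print(mu, clipped_second_coordinate(mu, c=1.0), closed_form)
```

With C fixed, the magnitude of the useful second coordinate decays roughly like C/µ, so the signal that moves w(2) towards 1 becomes small relative to the added noise.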

Standard privacy analysis in DP-SGD (the setting of parameters in Algorithm 1 used in Bassily et al. (2019)) sets $B \ge \max\{n\sqrt{\varepsilon/4T}, 1\}$ and adds Gaussian noise $\mathcal{N}(0, \sigma^2)$ with $\sigma^2 = \frac{8TC^2 \log(1/\delta)}{n^2 \varepsilon^2}$ to the averaged clipped gradient. Let $N_t$ denote the noise added at iteration $t$. We have

\[
w_{t+1}(2) \le w_t(2) + \eta \Big(\min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big) + N_t(2)\Big).
\]

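Hence we have

\[
\begin{aligned}
w_t(2) &\le w_0(2) + \eta \sum_{i=0}^{t-1} \Big(\min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big) + N_i(2)\Big) \\
&\le \eta \Big(t \min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big) + \sum_{i=0}^{t-1} N_i(2)\Big),
\end{aligned}
\]

where the second inequality uses the assumption $w_0(2) \le 0$.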

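This implies:

\[
\frac{1}{T}\sum_{t=1}^{T} w_t(2) \le \eta \Big(\frac{T + 1}{2} \min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big) + \frac{1}{T}\sum_{t=0}^{T-1} (T - t) N_t(2)\Big).
\]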

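Note that the right hand side is a Gaussian distributed random variable with mean $\eta \frac{T+1}{2} \min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big)$ and variance $\eta^2 \frac{4(T+1)(2T+1)C^2 \log(1/\delta)}{3 n^2 \varepsilon^2}$. Hence

\[
\begin{aligned}
\Pr\Big(\frac{1}{T}\sum_{t=1}^{T} w_t(2) \le 0\Big)
&\ge \Pr\Big(\mathcal{N}\Big(\eta \frac{T + 1}{2} \min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big), \eta^2 \frac{4(T + 1)(2T + 1)C^2 \log(1/\delta)}{3 n^2 \varepsilon^2}\Big) \le 0\Big) \\
&= \Pr\Big(\mathcal{N}\Big(\min\Big(\frac{C}{\sqrt{\mu^2 + 1}}, 1\Big), \frac{16(2T + 1)C^2 \log(1/\delta)}{3(T + 1) n^2 \varepsilon^2}\Big) \le 0\Big),
\end{aligned}
\]

where the last step rescales the Gaussian by the positive factor $\eta \frac{T+1}{2}$, which does not change the probability of being non-positive.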

This completes the proof of Lemma 8.
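As a rough numerical illustration of Lemma 8, the sketch below evaluates Pr(X ≤ 0) for a Gaussian with mean min(C/√(µ² + 1), 1) and variance 16(2T + 1)C² log(1/δ) / (3(T + 1)n²ε²), the form derived in the proof. All parameter values are hypothetical and chosen only to contrast a small µ with a large µ.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def prob_nonpositive(mu, c, n, eps, delta, t):
    """Pr(X <= 0) for X ~ N(m, s2) with m and s2 as in the proof of Lemma 8."""
    m = min(c / math.sqrt(mu * mu + 1.0), 1.0)
    s2 = 16.0 * (2 * t + 1) * c * c * math.log(1.0 / delta) / (3.0 * (t + 1) * (n * eps) ** 2)
    return normal_cdf(-m / math.sqrt(s2))


# Hypothetical parameter values, for illustration only.
for mu in (1.0, 1e3):
    print(mu, prob_nonpositive(mu, c=1.0, n=1000, eps=1.0, delta=1e-5, t=100))
```

When µ is comparable to nε/√log(1/δ), the mean is of the same order as the standard deviation and the probability stays bounded away from zero, which is the regime the lower bound targets.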

B MISSING PROOFS IN SECTION 5

B.1 PROOF OF LEMMA 3

Let ε ≤ log(3/δ) and let ρ =

49 log(3/δ)

. Let α =

and γ(ζ) =

· log(2d

3/2

) log

(d log(2d

3/2

))

. If n ≥ c · γ(ζ)

√

d (for a sufﬁciently large constant c), then

A different privacy accounting method might give a slightly improved result. But overall we will have $\sigma^2 = \Theta\Big(\frac{8TC^2 \log(1/\delta)}{n^2 \varepsilon^2}\Big)$, and our claim in Lemma 8 won't change up to constant factors.


by Theorem 3 and Remark 1 from Huang et al. (2021), there exists a ρ-zCDP algorithm that outputs

ˆµ such that

kµ − ˆµk

= O

+ γ(ζ)

diam(D)

log(nd/ζ)

with probability at least 1 − ζ. Let f (ζ) = O



+ γ(ζ)



diam(D)

√

log(nd/ζ)

. Hence,

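\[
\begin{aligned}
\mathbb{E}[\|\mu - \hat{\mu}\|_2] &= \mathbb{E}\big[\|\mu - \hat{\mu}\|_2 \, \mathbb{1}_{\|\mu - \hat{\mu}\|_2 \le f(\zeta)}\big] + \mathbb{E}\big[\|\mu - \hat{\mu}\|_2 \, \mathbb{1}_{\|\mu - \hat{\mu}\|_2 > f(\zeta)}\big] \\
&\le \mathbb{E}[f(\zeta)] + \mathbb{E}\big[2R \, \mathbb{1}_{\|\mu - \hat{\mu}\|_2 > f(\zeta)}\big] \\
&\le f(\zeta) + 2R \Pr(\|\mu - \hat{\mu}\|_2 > f(\zeta)) \\
&\le f(\zeta) + 2R\zeta,
\end{aligned}
\]

where the first inequality follows by observing that both $\mu$ and $\hat{\mu}$ have norm at most $R$. Setting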

ζ =

yields,

E[kµ − ˆµk

] ≤ O

+ γ(1/n

)

diam(D)

log(n

≤ O

log(dn)

√

diam(D)

log(n

≤ O



√

d + log(dn)



diam(D)

log(n

log(1/δ)

n

≤ diam(D) +

where the last inequality follows based on condition on n. Note that n ≥ c · γ(1/n

)

√

d based on

the assumption in the lemma. By (Kairouz et al., 2021, Lemma 4), this algorithm is also (ε/3, δ/3)-differentially private and hence the lemma.

B.2 PROOF OF LEMMA 4

Throughout the proof, we assume $\hat{\mu}$ is fixed and prove the statement for any $\hat{\mu}$. Applying (Dick et al., 2023, Theorem 2) with a = 0, b = R, privacy budget ε/3, error probability ζ, α = R/n

r = n −

100



log n, yields that the output ˆτ satisﬁes the following: with probability at least 1 − ζ,

there exists a τ

such that

|ˆτ − τ

| ≤

and there are at most

100



log n +



log

ζ·R

100



log n +



log

points with kx

− ˆµk

than τ

and at least

100



log n −



log

points with kx

− ˆµk

less than τ

. Let τ = ˆτ +

, then τ

satisfies the following: there are at most

100



log n +



log

points with kx

− ˆµk

more than τ and

at least

100



log n −



log

points with kx

− ˆµk

greater than τ . Let S be the set of points with

norm larger than τ. Let s

be a value we set later.

E[|S|] = E[|S|1

|S|≤s

] + E[|S|1

|S|≥s

]

≤ E[s

|S|≤s

] + E[n1

|S|≥s

]

≤ s

+ n Pr(|S| ≥ s

Setting ζ =

and s

100



log n +



log n yields that

E[|S|] ≤

124



log n ≤

125



log n.


To prove the other result, let s

be a value which we set later. Observe that

E[τ] ≤ E



ˆτ +



≤ E[ˆτ] +

≤ E[ˆτ1

|S|≤s

] + 2E[ˆτ 1

|S|≥s

] +

≤ RE[1

|S|≤s

] + E[max

− ˆµk

|S|≥s

] +

≤ R Pr(1

|S|≤s

) + max

− ˆµk

Setting s

100



log n −



log n yields that

E[τ] ≤ max

− ˆµk

B.3 PROOF OF LEMMA 6

For the augmented dataset D

)



− ˆµ)(x

− ˆµ)

− ˆµ) nτ





+ ˆµˆµ

− nµˆµ

− nˆµµ

nτ(µ − ˆµ)

nτ(µ − ˆµ) nτ



where µ =

. Hence we have rank(M(D

)) ≤ rank(M(D)) +rank(µˆµ

) + rank(ˆµµ

) +

rank(ˆµˆµ

) + 2, where we use the fact that adding one column/row will at most increase the rank of

a matrix by one.

To prove the second result, note that by triangle inequality,

kθ

0∗

τ ≤ kw

∗

τ + |b

∗

|τ

≤ kw

∗

τ + |b

∗

|τ,

where the second inequality follows from Lemma 5. We now bound the second term in the right

hand side of the above equation. Observe that by Assumption 1, there exists x

and x

such that

∗

+ b

∗

τ ≥ 0 and w

∗

+ b

∗

τ ≤ 0,

and hence

−w

∗

≤ b

∗

τ ≤ −w

∗

Hence,

∗

· (ˆµ − x

) ≤ b

∗

τ + w

∗

· ˆµ ≤ w

∗

· (ˆµ − x

Hence by Lemma 5,

∗

|τ = |b

∗

+ w

∗

· ˆµ|

≤ max

∈D

∗

· (ˆµ − x

≤ kw

∗

max

∈D

k(ˆµ − x

(5)

Combining the above two equations yields,

kθ

0∗

τ ≤ kw

∗

τ + kw

∗

max

∈D

k(ˆµ − x

≤ kθ

∗

τ + kθ

∗

max

∈D

k(ˆµ − x

Hence by Lemma 4,

E[kθ

0∗

τ|ˆµ] ≤ kθ

∗

E[τ|ˆµ] + kθ

∗

max

∈D

k(ˆµ − x

≤ 2kθ

∗

max

∈D

k(ˆµ − x

2kθ

∗


B.4 PROOF OF LEMMA 7

If v ∈ span(U(D

)), then v

v > 0. Hence

> 0, which implies

0 and hence v

00>

v > 0 and hence v ∈ span(U (D

)). Similar analysis from U(D

) to

U(D

) yields that span(U(D

)) = span(U(D

)) and hence rank(M(D

)) = rank(M(D

)).

B.5 PROOF OF THEOREM 1

The proof of the privacy guarantee follows directly from the composition theorem (Dwork et al., 2014, Theorem 3.16). Note that for simplicity, we used the composition theorem here. However, in experiments, we use the recent technique of the PLD accountant (Meiser & Mohammadi, 2018; Sommer et al., 2019; Doroshenko et al., 2022) implemented in Google (2018) to compute the privacy parameters ε and δ.
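The composition argument can be illustrated with a few lines of bookkeeping. This is a minimal sketch of basic composition only, under the assumption that the overall budget is split evenly across three phases, as the (ε/3, δ/3) budgets in the proofs of Lemmas 3 and 4 suggest. It is not the PLD accountant used in the experiments, and the function names are hypothetical.

```python
def basic_composition(budgets):
    """Basic composition: the (eps, delta) parameters of the parts add up."""
    total_eps = sum(eps for eps, _ in budgets)
    total_delta = sum(delta for _, delta in budgets)
    return total_eps, total_delta


def split_evenly(eps, delta, parts):
    """Give each of `parts` sub-mechanisms an equal share of (eps, delta)."""
    return [(eps / parts, delta / parts)] * parts


# Hypothetical overall budget split across three phases.
shares = split_evenly(eps=1.0, delta=1e-5, parts=3)
print(basic_composition(shares))
```

A tighter accountant can give each phase less noise for the same overall (ε, δ), which is why the experiments do not rely on this simple summation.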

We now focus on the utility guarantee.

Let $\theta^*, \theta'^*, \theta''^*$ be the empirical risk minimizers of $D, D', D''$ respectively. Let $M(D), M(D'), M(D'')$ be the design matrices of datasets $D, D', D''$ respectively. Due to feature clipping, for any $(x''_i, y_i) \in D''$ and $\theta$, the gradient of $\ell(h(\theta, x''_i), y_i)$ is upper bounded by

\[
G\|x''_i\|_2 \le \sqrt{2} G \tau.
\]

Hence, by (Song et al., 2021a, Theorem 3.1)

and Lemma 6,

E[L(θ

prv

, D

)|D

] − L(θ

0∗

, D

) ≤ 2

√

2Gτkθ

0∗

M(D

)

rank(M(D

)) log(3/δ)

n

≤ 10

√

2Gτkθ

0∗

rank(M) log(3/δ)

n

. (6)

By Lemma 6 and Equation 3,

E[τkθ

0∗

] = E[E[τkθ

0∗

|ˆµ]

≤ 2kθ

∗

E[max

∈D

k(ˆµ − x

] +

wkθ

∗

≤ 4kθ

∗

diam(D) +

6kθ

∗

Substituting the above equation in Equation 6 and taking expectation over D

yields

E[L(θ

prv

, D

)] − E[L(θ

0∗

, D

)] ≤ 60

√

2Gkθ

∗



diam(D) +



rank(M) log(3/δ)

n

Note that (Song et al., 2021a, Theorem 3.1) states the bound in terms of the empirical risk minimizer $\theta''^*$, but the bound holds for any $\theta$.


Let S be the set of samples which differ between D

and D

. By Lemma 6,

E[L(θ

0∗

, D

)] ≤ E[L(θ

0∗

, D

)] + E

i∈S

`(h(θ

0∗

, x

), y

) − `(h(θ

0∗

, x

), y

)

≤ E[L(θ

0∗

, D

)] + E

i∈S

|`(h(θ

0∗

, x

), y

) − `(h(θ

0∗

, x

), y

≤ E[L(θ

0∗

, D

)] + E

i∈S

G|h(θ

0∗

, x

) − h(θ

0∗

, x

≤ E[L(θ

0∗

, D

)] + E

i∈S



∗

· x

| + |b

∗

|τ



(a)

≤ E[L(θ

0∗

, D

)] + E

i∈S

2G max

∗

· (x

− ˆµ)|

≤ E[L(θ

0∗

, D

)] + E

i∈S

2Gkw

∗

· (diam(D) + kˆµ − µk

)

(b)

= E[L(θ

0∗

, D

)] +

i∈S

∗

· (diam(D) + kˆµ − µk

)

≤ E[L(θ

0∗

, D

)] +

2Gkw

∗

E [E[|S| | ˆµ] · (diam(D) + kˆµ − µk

)]

(c)

≤ E[L(θ

0∗

, D

)] +

250G log nkw

∗

n

E [(diam(D) + kˆµ − µk

)]

(d)

≤ E[L(θ

0∗

, D

)] +

250G log nkw

∗

n



2diam(D) +



≤ E[L(θ

0∗

, D

)] +

500G log nkθ

∗

n



2diam(D) +



where (a) follows from Eq. (5), (b) follows from Lemma 5, (c) follows from Lemma 4, and (d)

follows from Lemma 3. We now bound in the other direction.

E[L(θ

prv

, D

)] = E[L(θ

prv

, D

)] +

i∈S

`(h(θ

prv

, x

), y

) − `(h(θ

prv

, x

), y

)].

Notice that if y

prv

· x

≤ 0, then y

prv

· x

≤ y

prv

· x

≤ 0. Hence,

E[L(θ

prv

, D

)] ≥ E[L(θ

prv

, D

)] +





i∈S:y

prv

·x

`(h(θ

prv

, x

), y

) − `(h(θ

prv

, x

), y

)





≥ E[L(θ

prv

, D

)] −





i∈S:y

prv

·x

`(h(θ

prv

, x

), y

)





≥ E[L(θ

prv

, D

)] −

E[|S|φ(0)]

≥ E[L(θ

prv

, D

)] −

125 log n

n

φ(0),

where the last inequality follows from Lemma 4. Hence we have

E[L(θ

prv

, D

)] − L(θ

∗

, D

) ≤

500G log nkθ

∗

n



2diam(D) +



125 log n

n

φ(0).

By Lemma 5, we have E[L(θ

prv

, D

)] − L(θ

∗

, D

) = E[L(θ

prv

, D)] − L(θ

∗

, D). This yields that

E[L(θ

prv

, D)] − L(θ

∗

, D) ≤ cGkθ

∗



diam(D) +



rank(M) log(3/δ) + log n

n

cφ(0)

n

log n,


Algorithm 3 Modified version of DPSGD with feature preprocessing (DPSGD-F$^*$)

Input: Dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ of $n$ points; overall privacy budget $\varepsilon, \delta$; privacy budget for the preprocessing step $\varepsilon_F \in (0, \varepsilon)$; feature norm bound $C_F$; gradient clipping norm $C_G$; step size $\eta$, number of iterations $T$; batch size $B$.

1: Private mean estimation. Compute the noise multiplier $\sigma_F$ for the Gaussian mechanism with $\varepsilon_F$ and $\delta$ using analytical calibration for the Gaussian mechanism (Balle & Wang, 2018), and compute

\[
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i + \mathcal{N}(0, \sigma_F^2 I_d).
\]

2: Translate the features. Preprocess the dataset and obtain $D' = \{(x'_1, y_1), \ldots, (x'_n, y_n)\}$, where $x'_i = x_i - \hat{\mu}$.

3: DPSGD with preprocessed features. Compute the privacy budget $\varepsilon'$ for the DPSGD phase using the PLD accountant in Google (2018) by setting $(\varepsilon, \delta)$ as the overall privacy budget for the composition of both the mean estimation phase and the DPSGD phase. Get an approximate minimizer $\theta'_{prv}$ of $L(\theta, D')$ using the DPSGD algorithm with privacy budget $(\varepsilon', \delta)$.

4: Return $\theta_{prv} = (w_{prv}, b_{prv})$, where $w_{prv} = \theta'_{prv}[1 : d]$ and $b_{prv} = \theta'_{prv}[d + 1] - \hat{\mu} \cdot w_{prv}$.

for a sufficiently large constant c. The theorem follows by observing that for the minimum norm

solution θ

∗

, kθ

∗

= kθ

∗

C ADDITIONAL EXPERIMENT DETAILS.

In this section, we discuss the detailed implementation and parameter settings used in our experi-

ments. We implement all algorithms and experiments using the open-source JAX (Bradbury et al.,

2018) library. For privacy accounting, we use the PLD accountant implemented in Tensorﬂow Pri-

vacy (Google, 2018). We ﬁrst describe the implementation details of our experiments.

Feature normalization. For all three datasets, we consider their feature-normalized version, where we normalize the feature vectors to a fixed norm of $C_F$ by the transformation below

\[
x'_i = x_i \cdot \frac{C_F}{\|x_i\|_2}.
\]

We treat $C_F$ as a fixed constant, and hence this normalization step doesn't result in any privacy loss about the dataset.
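A minimal sketch of this normalization step is given below. It rescales each feature vector to a fixed Euclidean norm; the function name, the handling of all-zero vectors, and the toy rows are assumptions made for illustration and are not taken from the paper's implementation.

```python
import math


def normalize_features(features, c):
    """Rescale every feature vector to have Euclidean norm exactly c."""
    normalized = []
    for x in features:
        norm = math.sqrt(sum(v * v for v in x))
        if norm == 0.0:
            raise ValueError("cannot normalize an all-zero feature vector")
        normalized.append([c * v / norm for v in x])
    return normalized


rows = [[3.0, 4.0], [0.0, 2.0], [-1.0, 1.0]]
for x in normalize_features(rows, c=10.0):
    print(x, math.sqrt(sum(v * v for v in x)))
```

Because the target norm is a data-independent constant, the rescaling is applied per example and needs no privacy budget.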

Modifications to DPSGD-F. The version of DPSGD-F listed in Algorithm 2 is mainly presented for ease of strict theoretical analysis, since existing analyses of DPSGD mainly assume that the gradients are bounded and no clipping is needed during the DPSGD phase (Bassily et al., 2019; Feldman et al., 2020; Bassily et al., 2020; 2021b;a; Song et al., 2021b; Arora et al., 2022). However, state-of-the-art results from DPSGD usually use the clipping version of DPSGD where no gradient norm bound is assumed (De et al., 2022; Ghalebikesabi et al., 2023).

In our experiments, instead of directly implementing Algorithm 2, we make the following modifications. The implemented version of the algorithm is described in Algorithm 3.

1. For the feature preprocessing step, when computing the mean, instead of using the adaptive algorithm in Lemma 3, the algorithm directly applies the Gaussian mechanism to the empirical mean since each feature is normalized to a fixed norm. We can use the Gaussian mechanism for mean estimation since now the $\ell_2$ sensitivity is bounded.

2. We skip the private quantile estimation and feature clipping step. Instead, we only shift the features by the privately estimated mean and treat it as the input to the DPSGD phase. We then perform $\ell_2$-norm clipping on the gradients as in DPSGD.
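The preprocessing and post-processing around the DPSGD phase can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `noise_std` stands in for a noise level that would come from a calibrated Gaussian mechanism, the DPSGD phase itself is replaced by a fixed stand-in vector, and all function names are hypothetical. The bias correction follows the last step of Algorithm 3, which subtracts the inner product of the mean estimate and the weights from the bias learned on shifted features.

```python
import random


def noisy_mean(features, noise_std, rng):
    """Empirical mean plus independent Gaussian noise in each coordinate."""
    n, d = len(features), len(features[0])
    mean = [sum(x[j] for x in features) / n for j in range(d)]
    return [m + rng.gauss(0.0, noise_std) for m in mean]


def shift_features(features, mu_hat):
    """Translate every feature vector by the (noisy) mean estimate."""
    return [[v - m for v, m in zip(x, mu_hat)] for x in features]


def recover_original_model(theta_shifted, mu_hat):
    """Map (w, b') trained on shifted features to (w, b) for raw features."""
    w, b_shifted = theta_shifted[:-1], theta_shifted[-1]
    b = b_shifted - sum(m * wj for m, wj in zip(mu_hat, w))
    return w, b


def score(w, b, x):
    """Linear score w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


rng = random.Random(0)
data = [[10.0, 1.0], [10.0, -1.0], [11.0, 0.5]]
mu_hat = noisy_mean(data, noise_std=0.1, rng=rng)
shifted = shift_features(data, mu_hat)
theta_shifted = [0.3, -0.7, 0.2]  # stand-in for the DPSGD output
w, b = recover_original_model(theta_shifted, mu_hat)
print(score(w, b, data[0]), score(theta_shifted[:-1], theta_shifted[-1], shifted[0]))
```

The two printed scores agree up to floating-point error, which shows that the returned model on raw features makes the same predictions as the model trained on shifted features.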


The privacy guarantee of Algorithm 3 directly follows from the use of PLD accountant (Meiser &

Mohammadi, 2018; Sommer et al., 2019).

Lemma 9. The output of Algorithm 3 is (ε, δ)-differentially private.

For each combination of (ALGORITHM, ε, DATASET), we fix the gradient clipping norm $C_G$ to be 1 as suggested in De et al. (2022), and perform a grid search over $C_F$, BATCH SIZE, LEARNING RATE, and NUMBER OF EPOCHS from the list below and report the best-achieved accuracy, similar to De et al. (2022).

1. $C_F$: 1, 10, 100, 1000.

2. BATCH SIZE: 256, 512, 1024, 2048, 4096, 8192, 16384.

3. LEARNING RATE: 0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16.

4. NUMBER OF EPOCHS: 20, 40, 80, 160, 320.
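The search space above can be enumerated directly. The sketch below copies the four candidate lists and iterates over their Cartesian product. The first list is assumed to be the feature norm bound, and `evaluate` is a hypothetical callback that would train and score one configuration.

```python
import itertools

# Candidate values copied from the list above.
FEATURE_NORMS = [1, 10, 100, 1000]
BATCH_SIZES = [256, 512, 1024, 2048, 4096, 8192, 16384]
LEARNING_RATES = [0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16]
NUM_EPOCHS = [20, 40, 80, 160, 320]


def grid():
    """Yield one dict per hyperparameter combination."""
    for c, b, lr, ep in itertools.product(
        FEATURE_NORMS, BATCH_SIZES, LEARNING_RATES, NUM_EPOCHS
    ):
        yield {"feature_norm": c, "batch_size": b, "learning_rate": lr, "epochs": ep}


def best_config(evaluate):
    """Return the configuration with the highest score under `evaluate`."""
    return max(grid(), key=evaluate)


print(sum(1 for _ in grid()))  # 4 * 7 * 10 * 5 combinations
```

Each configuration corresponds to a separate private training run, which is why the reported accuracy does not account for the privacy cost of tuning.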

For DPSGD-F, we further add a search dimension over the privacy budget $\varepsilon_F$ for the feature preprocessing step, with candidates [0.02, 0.05, 0.1, 0.15, 0.2]. In fact, for all experiments, the best accuracy is achieved at either $\varepsilon_F = 0.02$ or $\varepsilon_F = 0.05$. This suggests that the feature preprocessing step doesn't use much of the privacy budget to improve the final accuracy.

Remark on different concentration measures of the datasets. For the three considered datasets, below we compute a few distance measures that characterize how the dataset D is concentrated, listed in Table 2. It appears that compared to diam(D), $\frac{1}{n}\sum_i \|x_i - \mu\|_2$ is a more indicative measure of the performance improvement from DPSGD-F compared to DPSGD. The smaller $\frac{1}{n}\sum_i \|x_i - \mu\|_2$, the more improvement DPSGD-F has over DPSGD. This might be due to the fact that we are considering the gradient clipping version of DPSGD (Algorithm 3) instead of the feature clipping version (Algorithm 2) considered in the theoretical analysis. Better understanding of this phenomenon is an interesting future direction.

Table 2: Different concentration measures of the datasets. Features in datasets are normalized to be of unit norm. µ denotes the mean of the dataset.

Dataset                     diam(D)   $\max_i \|x_i - \mu\|_2$   $\frac{1}{n}\sum_i \|x_i - \mu\|_2$
CIFAR-100 (pretrained) †    1.56      1.10                       0.9
MNIST                       1.41      1.03                       0.77
FMNIST                      1.41      1.10                       0.62
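Concentration measures of this kind can be computed with a short script. The sketch below returns the diameter, the maximum distance to the mean, and the average distance to the mean; treating the third measure as the average distance is an assumption, and the toy points are not taken from any of the datasets above.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def concentration_measures(features):
    """Return (diameter, max distance to mean, average distance to mean)."""
    n, d = len(features), len(features[0])
    mean = [sum(x[j] for x in features) / n for j in range(d)]
    diameter = max(dist(a, b) for a in features for b in features)
    to_mean = [dist(x, mean) for x in features]
    return diameter, max(to_mean), sum(to_mean) / n


# Toy unit-norm features, for illustration only.
toy = [[1.0, 0.0], [0.0, 1.0], [math.sqrt(0.5), math.sqrt(0.5)]]
print(concentration_measures(toy))
```

The pairwise loop is quadratic in the number of points, so for full datasets the diameter would typically be computed on batches or approximated.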