Differentially Private Bayesian Optimization
Matt J. Kusner MKUSNER@WUSTL.EDU
Jacob R. Gardner GARDNER.JAKE@WUSTL.EDU
Roman Garnett GARNETT@WUSTL.EDU
Kilian Q. Weinberger KILIAN@WUSTL.EDU
Washington University in St. Louis, 1 Brookings Dr., St. Louis, MO 63130
Abstract
Bayesian optimization is a powerful tool for fine-
tuning the hyper-parameters of a wide variety of
machine learning models. The success of ma-
chine learning has led practitioners in diverse
real-world settings to learn classifiers for prac-
tical problems. As machine learning becomes
commonplace, Bayesian optimization becomes
an attractive method for practitioners to automate
the process of classifier hyper-parameter tuning.
A key observation is that the data used for tun-
ing models in these settings is often sensitive.
Certain data such as genetic predisposition, per-
sonal email statistics, and car accident history, if
not properly private, may be at risk of being in-
ferred from Bayesian optimization outputs. To
address this, we introduce methods for releas-
ing the best hyper-parameters and classifier ac-
curacy privately. Leveraging the strong theoreti-
cal guarantees of differential privacy and known
Bayesian optimization convergence bounds, we
prove that under a GP assumption these private
quantities are often near-optimal. Finally, even if
this assumption is not satisfied, we can use dif-
ferent smoothness guarantees to protect privacy.
1. Introduction
Machine learning is increasingly used in application areas
with sensitive data. For example, hospitals use machine
learning to predict if a patient is likely to be readmitted
soon (Yu et al., 2013), webmail providers classify spam
emails from non-spam (Weinberger et al., 2009), and in-
surance providers forecast the extent of bodily injury in car
crashes (Chong et al., 2005).
In these scenarios data cannot be shared legally, but compa-
nies and hospitals may want to share hyper-parameters and
validation accuracies through publications or other means.
However, data-holders must be careful, as even a small
amount of information can compromise privacy.
Which hyper-parameter setting yields the highest accuracy
can reveal sensitive information about individuals in the
validation or training data set, reminiscent of reconstruc-
tion attacks described by Dwork & Roth (2013) and Dinur
& Nissim (2003). For example, imagine updated hyper-
parameters are released right after a prominent public fig-
ure is admitted to a hospital. If a hyper-parameter is known
to correlate strongly with a particular disease the patient is
suspected to have, an attacker could make a direct correla-
tion between the hyper-parameter value and the individual.
To prevent these sorts of attacks, we develop a set of al-
gorithms that automatically fine-tune the hyper-parameters
of a machine learning algorithm while provably preserving
differential privacy (Dwork et al., 2006b). Our approach
leverages recent results on Bayesian optimization (Snoek
et al., 2012; Hutter et al., 2011; Bergstra & Bengio,
2012; Gardner et al., 2014), training a Gaussian process
(GP) (Rasmussen & Williams, 2006) to accurately predict
and maximize the validation gain of hyper-parameter set-
tings. We show that the GP model in Bayesian optimization
allows us to release noisy final hyper-parameter settings to
protect against aforementioned privacy attacks, while only
sacrificing a tiny, bounded amount of validation gain.
Our privacy guarantees hold for releasing the best hyper-
parameters and best validation gain. Specifically our con-
tributions are as follows: 1. We derive, to the best of our knowledge, the first framework for Bayesian optimization with differential privacy guarantees, with and without observation noise. 2. We show that even if our validation gain is not drawn from a Gaussian process, we can guarantee differential privacy under different smoothness assumptions.
We begin with background on Bayesian optimization and
differential privacy we will use to prove our guarantees.
2. Background
In general, our aim will be to protect the privacy of a validation dataset of sensitive records $V \subseteq X$ (where $X$ is the collection of all possible records) when the results of Bayesian optimization depend on $V$.

Bayesian optimization. Our goal is to maximize an unknown function $f_V : \Lambda \to \mathbb{R}$ that depends on some validation dataset $V \subseteq X$:
$$\max_{\lambda \in \Lambda} f_V(\lambda). \quad (1)$$
It is important to point out that all of our results hold for the general setting of eq. (1), but throughout the paper we use the vocabulary of a common application: machine learning hyper-parameter tuning. In this case $f_V(\lambda)$ is the gain of a learning algorithm evaluated on validation dataset $V$ after training with hyper-parameters $\lambda \in \Lambda \subseteq \mathbb{R}^d$.
As evaluating $f_V$ is expensive (e.g., each evaluation requires training a learning algorithm), Bayesian optimization gives a procedure for selecting a small number of locations at which to sample $f_V$: $[\lambda_1, \ldots, \lambda_T] = \boldsymbol{\lambda}_T \in \mathbb{R}^{d \times T}$. Specifically, given a current sample $\lambda_t$, we observe a validation gain $v_t$ such that $v_t = f_V(\lambda_t) + \alpha_t$, where $\alpha_t \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with possibly non-zero variance $\sigma^2$. Then, given $v_t$ and previously observed values $v_1, \ldots, v_{t-1}$, Bayesian optimization updates its belief of $f_V$ and samples a new hyper-parameter $\lambda_{t+1}$. Each step of the optimization proceeds in this way.
To decide which hyper-parameter to sample next, Bayesian optimization places a prior distribution over $f_V$ and updates it after every (possibly noisy) function observation. One popular prior distribution over functions is the Gaussian process $\mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$ (Rasmussen & Williams, 2006), parameterized by a mean function $\mu(\cdot)$ (we set $\mu = 0$, w.l.o.g.) and a kernel covariance function $k(\cdot, \cdot)$. Functions drawn from a Gaussian process have the property that any finite set of values of the function are normally distributed. Additionally, given samples $\boldsymbol{\lambda}_T = [\lambda_1, \ldots, \lambda_T]$ and observations $\mathbf{v}_T = [v_1, \ldots, v_T]$, the GP posterior mean and variance have a closed form:
$$\mu_T(\lambda) = k(\lambda, \boldsymbol{\lambda}_T)(K_T + \sigma^2 I)^{-1}\mathbf{v}_T$$
$$k_T(\lambda, \lambda') = k(\lambda, \lambda') - k(\lambda, \boldsymbol{\lambda}_T)(K_T + \sigma^2 I)^{-1}k(\boldsymbol{\lambda}_T, \lambda')$$
$$\sigma_T^2(\lambda) = k_T(\lambda, \lambda), \quad (2)$$
where $k(\lambda, \boldsymbol{\lambda}_T) \in \mathbb{R}^{1 \times T}$ is evaluated element-wise on each of the $T$ columns of $\boldsymbol{\lambda}_T$, and $\lambda \in \Lambda$ is any hyper-parameter. As well, $K_T = k(\boldsymbol{\lambda}_T, \boldsymbol{\lambda}_T) \in \mathbb{R}^{T \times T}$ is evaluated on all hyper-parameter pairs from $\boldsymbol{\lambda}_T$. As more samples are observed, the mean function $\mu_T(\lambda)$ approaches $f_V(\lambda)$.
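To make the posterior update in eq. (2) concrete, here is a minimal NumPy sketch that computes the posterior mean and variance over a set of candidate hyper-parameters; the normalized squared-exponential kernel and its length-scale are illustrative assumptions rather than choices fixed by the paper.

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    """Normalized squared-exponential kernel: k(a, b) = exp(-||a - b||^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def gp_posterior(lam_T, v_T, lam_star, sigma2=0.01, ell=1.0):
    """Posterior mean mu_T and variance sigma_T^2 of eq. (2) at the rows of lam_star,
    given sampled hyper-parameters lam_T (T x d) and noisy gains v_T (length T)."""
    K_T = se_kernel(lam_T, lam_T, ell)            # K_T = k(lambda_T, lambda_T)
    k_star = se_kernel(lam_star, lam_T, ell)      # k(lambda, lambda_T), one row per candidate
    A = K_T + sigma2 * np.eye(len(lam_T))         # (K_T + sigma^2 I)
    mu = k_star @ np.linalg.solve(A, v_T)         # posterior mean
    W = np.linalg.solve(A, k_star.T)
    var = 1.0 - np.sum(k_star * W.T, axis=1)      # k(lambda, lambda) = 1 for this kernel
    return mu, var
```

Solving linear systems rather than forming $(K_T + \sigma^2 I)^{-1}$ explicitly is only a numerical convenience; the quantities returned are exactly those of eq. (2).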
One well-known method to select hyper-parameters $\lambda$ maximizes the upper-confidence bound (UCB) of the posterior GP model of $f_V$ (Auer et al., 2002; Srinivas et al., 2010):
$$\lambda_{t+1} \triangleq \arg\max_{\lambda \in \Lambda} \; \mu_t(\lambda) + \sqrt{\beta_{t+1}}\,\sigma_t(\lambda), \quad (3)$$
where $\beta_{t+1}$ is a parameter that trades off the exploitation of maximizing $\mu_t(\lambda)$ and the exploration of maximizing $\sigma_t(\lambda)$. Srinivas et al. (2010) proved that given certain assumptions on $f_V$ and fixed, non-zero observation noise ($\sigma^2 > 0$), selecting hyper-parameters $\lambda$ to maximize eq. (3) is a no-regret Bayesian optimization procedure: $\lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^T f_V(\lambda^*) - f_V(\lambda_t) = 0$, where $\lambda^*$ is the maximizer of eq. (1). For the no-noise setting, de Freitas et al. (2012) give a UCB-based no-regret algorithm.
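Over a finite candidate set, the UCB rule in eq. (3) reduces to a one-line selection. The sketch below reuses the gp_posterior helper above and treats the $\beta$ schedule as an input; it is an illustration, not the paper's implementation.

```python
import numpy as np

def ucb_select(lam_T, v_T, candidates, beta_next, sigma2=0.01):
    """Return lambda_{t+1} = argmax_l mu_t(l) + sqrt(beta_{t+1}) * sigma_t(l), as in eq. (3)."""
    mu, var = gp_posterior(lam_T, v_T, candidates, sigma2)
    ucb = mu + np.sqrt(beta_next) * np.sqrt(np.maximum(var, 0.0))
    return candidates[np.argmax(ucb)]
```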
Contributions. Alongside maximizing $f_V$, we would like to guarantee that if $f_V$ depends on (sensitive) validation data, we can release information about $f_V$ so that the data $V$ remains private. Specifically, we may wish to release (a) our best guess $\hat{\lambda} \triangleq \arg\max_{t \leq T} f_V(\lambda_t)$ of the true (unknown) maximizer $\lambda^*$ and (b) our best guess $f_V(\hat{\lambda})$ of the true (also unknown) maximum objective $f_V(\lambda^*)$. The primary question this work aims to answer is: how can we release private versions of $\hat{\lambda}$ and $f_V(\hat{\lambda})$ that are close to their true values, or better, the values $\lambda^*$ and $f_V(\lambda^*)$? We give two answers to this question. The first makes a Gaussian process assumption on $f_V$, which we describe immediately below. The second, described in Section 5, uses Lipschitz and convexity assumptions to guarantee privacy in the event the GP assumption does not hold.
Setting. For our first answer to this question, let us define a Gaussian process over hyper-parameters $\lambda, \lambda' \in \Lambda$ and datasets $V, V' \subseteq X$ as follows: $\mathcal{GP}\big(0, k_1(V, V')\,k_2(\lambda, \lambda')\big)$. A prior of this form is known as a multi-task Gaussian process (Bonilla et al., 2008). Many choices for $k_1$ and $k_2$ are possible. The function $k_1(V, V')$ defines a set kernel (e.g., a function of the number of records that differ between $V$ and $V'$). For $k_2$, we focus on either the squared exponential kernel, $k_2(\lambda, \lambda') = \exp\big(-\|\lambda - \lambda'\|_2^2/(2\ell^2)\big)$, or Matérn kernels (e.g., for $\nu = 5/2$, $k_2(\lambda, \lambda') = \big(1 + \sqrt{5}r/\ell + (5r^2)/(3\ell^2)\big)\exp(-\sqrt{5}r/\ell)$, with $r = \|\lambda - \lambda'\|_2$), for a fixed $\ell$, as they have known bounds on the maximum information gain (Srinivas et al., 2010). Note that as defined, the kernel $k_2$ is normalized (i.e., $k_2(\lambda, \lambda) = 1$).
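For reference, the two choices of $k_2$ above can be written out directly; this sketch assumes a single shared length-scale $\ell$ and is only meant to spell out the formulas.

```python
import numpy as np

def k2_sq_exp(lam, lam_p, ell=1.0):
    """Squared-exponential kernel; normalized so k2(lam, lam) = 1."""
    r2 = np.sum((np.asarray(lam) - np.asarray(lam_p)) ** 2)
    return np.exp(-r2 / (2.0 * ell ** 2))

def k2_matern52(lam, lam_p, ell=1.0):
    """Matern kernel with nu = 5/2; also normalized so k2(lam, lam) = 1."""
    r = np.sqrt(np.sum((np.asarray(lam) - np.asarray(lam_p)) ** 2))
    s = np.sqrt(5.0) * r / ell
    return (1.0 + s + 5.0 * r ** 2 / (3.0 * ell ** 2)) * np.exp(-s)
```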
Assumption 1. We have a problem of type (1), where all possible dataset functions $[f_1, \ldots, f_{2^{|X|}}]$ are GP distributed: $\mathcal{GP}\big(0, k_1(V, V')\,k_2(\lambda, \lambda')\big)$ for known kernels $k_1, k_2$, for all $V, V' \subseteq X$ and $\lambda, \lambda' \in \Lambda$, where $|\Lambda| < \infty$.
Similar Gaussian process assumptions have been made in
previous work (Srinivas et al., 2010). For a result in the no-
noise observation setting, we will make use of the assump-
tions of de Freitas et al. (2012) for our privacy guarantees,
as described in Section 4.
2.1. Differential Privacy
One of the most widely accepted frameworks for private
data release is differential privacy (Dwork et al., 2006b),
which has been shown to be robust to a variety of pri-
vacy attacks (Sweeney, 1997; Dinur & Nissim, 2003; Ganta
et al., 2008; Narayanan & Shmatikov, 2008). Given an algorithm $A$ that outputs a value $\lambda$ when run on dataset $V$, the goal of differential privacy is to 'hide' the effect of a small change in $V$ on the output of $A$. Equivalently, an attacker should not be able to tell if a single private record was changed in $V$ just by looking at the output of $A$. If two datasets $V, V'$ differ in the value of a single individual, we will refer to them as neighboring datasets. Note that any non-trivial algorithm (i.e., an algorithm $A$ that outputs different values on $V$ and $V'$ for some pair $V, V' \subseteq X$) must include some amount of randomness to guarantee that such a change in $V$ is 'unobservable' in the output $\lambda$ of $A$ (Dwork & Roth, 2013). The level of privacy we wish to guarantee determines the amount of randomness we need to add to $\lambda$ (better privacy requires increased randomness). Formally, the definition of differential privacy is stated below.
Definition 1. A randomized algorithm A is (, δ)-
differentially private for , δ 0 if for all λ Range(A)
and for all neighboring datasets V, V
0
(i.e., such that V and
V
0
differ in the value of one record) we have that
Pr
A(V) = λ
e
Pr
A(V
0
) = λ
+ δ. (4)
The parameters , δ guarantee how private Ais; the smaller,
the more private. The maximum privacy is = δ = 0 in
which case eq. (4) holds with equality. This can be seen by
the fact that V and V
0
can be swapped in the definition, and
thus the inequality holds in both directions. If δ = 0, we
say the algorithm is simply -differentially private. For a
survey on differential privacy we refer the interested reader
to Dwork & Roth (2013).
There are two popular methods for making an algorithm $\epsilon$-differentially private: (a) the Laplace mechanism (Dwork et al., 2006b), in which we add random noise to $\lambda$, and (b) the exponential mechanism (McSherry & Talwar, 2007), which draws a random output $\tilde{\lambda}$ such that $\tilde{\lambda} \approx \lambda$. For each mechanism we must define an intermediate quantity called the global sensitivity, describing how much $A$ changes when $V$ changes.

Definition 2. (Laplace mechanism) The global sensitivity of an algorithm $A$ over all neighboring datasets $V, V'$ (i.e., $V, V'$ differ by the value of one record) is
$$\Delta_A \triangleq \max_{V, V' \subseteq X} \|A(V) - A(V')\|_1.$$
(Exponential mechanism) The global sensitivity of a function $q : X \times \Lambda \to \mathbb{R}$ over all neighboring datasets $V, V'$ is
$$\Delta_q \triangleq \max_{V, V' \subseteq X,\; \lambda \in \Lambda} \|q(V, \lambda) - q(V', \lambda)\|_1.$$
The Laplace mechanism hides the output of $A$ by perturbing its output with some amount of random noise.

Definition 3. Given a dataset $V$ and an algorithm $A$, the Laplace mechanism returns $A(V) + \omega$, where $\omega$ is a noise variable drawn from $\mathrm{Lap}(\Delta_A/\epsilon)$, the Laplace distribution with scale parameter $\Delta_A/\epsilon$ (and location parameter 0).
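As a concrete illustration of Definition 3 (not part of the paper's algorithms), the sketch below adds Laplace noise to a simple statistic; the example statistic, a mean of values in $[0, 1]$, and its sensitivity $1/m$ are assumptions made only for this example.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Release value + Lap(sensitivity / epsilon), per Definition 3."""
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release the mean of m values in [0, 1].
# Changing one record moves the mean by at most 1/m, so Delta_A = 1/m.
V = np.array([0.2, 0.9, 0.4, 0.7])
noisy_mean = laplace_mechanism(V.mean(), sensitivity=1.0 / len(V), epsilon=0.5)
```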
The exponential mechanism draws a slightly different $\tilde{\lambda}$ that is 'close' to $\lambda$, the output of $A$.

Definition 4. Given a dataset $V$ and an algorithm $A(V) = \arg\max_{\lambda \in \Lambda} q(V, \lambda)$, the exponential mechanism returns $\tilde{\lambda}$, where $\tilde{\lambda}$ is drawn from the distribution $\frac{1}{Z}\exp\big(\epsilon\,q(V, \lambda)/(2\Delta_q)\big)$, and $Z$ is a normalizing constant.
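Definition 4 can likewise be sketched for a finite $\Lambda$; here the vector of quality scores $q(V, \lambda)$ and the sensitivity $\Delta_q$ are inputs, and the max-shift is only for numerical stability.

```python
import numpy as np

def exponential_mechanism(q_values, sensitivity, epsilon, rng=None):
    """Sample an index i of Lambda with Pr[i] proportional to
    exp(epsilon * q_values[i] / (2 * sensitivity)), per Definition 4."""
    rng = np.random.default_rng() if rng is None else rng
    scores = epsilon * np.asarray(q_values, dtype=float) / (2.0 * sensitivity)
    scores -= scores.max()                     # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return rng.choice(len(probs), p=probs)
```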
Given $\Lambda$, a possible set of hyper-parameters, we derive methods for privately releasing the best hyper-parameters and the best function values $f_V$, approximately solving eq. (1). We first address the setting with observation noise ($\sigma^2 > 0$) in eq. (2) and then describe small modifications for the no-noise setting. For each setting we use the UCB sampling technique in eq. (3) to derive our private results.
3. With observation noise
In general cases of Bayesian optimization, observation noise occurs in a variety of real-world modeling settings such as sensor measurements (Krause et al., 2008). In hyper-parameter tuning, noise in the validation gain may result from noisy validation or training features.
Algorithm 1 Private Bayesian Opt. (noisy observations)
1: Input: $V$; $\Lambda \subseteq \mathbb{R}^d$; $T$; $(\epsilon, \delta)$; $\sigma^2_{V,0}$; $\gamma_T$
2: $\mu_{V,0} = 0$
3: for $t = 1 \ldots T$ do
4: $\quad \beta_t = 2\log\big(|\Lambda|t^2\pi^2/(3\delta)\big)$
5: $\quad \lambda_t \triangleq \arg\max_{\lambda \in \Lambda} \mu_{V,t-1}(\lambda) + \sqrt{\beta_t}\,\sigma_{V,t-1}(\lambda)$
6: $\quad$ Observe validation gain $v_{V,t}$, given $\lambda_t$
7: $\quad$ Update $\mu_{V,t}$ and $\sigma^2_{V,t}$ according to (2)
8: end for
9: $c = 2\sqrt{(1 - k(V, V'))\log(3|\Lambda|/\delta)}$
10: $q = \sigma\sqrt{8\log(3/\delta)}$
11: $C_1 = 8/\log(1 + \sigma^{-2})$
12: Draw $\tilde{\lambda} \in \Lambda$ w.p. $\Pr[\lambda] \propto \exp\Big(\frac{\epsilon\,\mu_{V,T}(\lambda)}{2(2\sqrt{\beta_{T+1}} + c)}\Big)$
13: $v = \max_{t \leq T} v_{V,t}$
14: Draw $\theta \sim \mathrm{Lap}\Big[\Big(\sqrt{C_1\beta_T\gamma_T/T} + c + q\Big)/\epsilon\Big]$
15: $\tilde{v} = v + \theta$
16: Return: $\tilde{\lambda}$, $\tilde{v}$
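As an illustration only, the sketch below strings the steps of Algorithm 1 together for a finite candidate set $\Lambda$, reusing the gp_posterior and exponential_mechanism helpers from the earlier sketches. The kernel, the dataset-kernel value $k(V, V')$, and the information-gain bound $\gamma_T$ are inputs the modeler must supply; none of the defaults here come from the paper.

```python
import numpy as np

def private_bo_noisy(f_noisy, Lambda, T, eps, delta, sigma2, k_VVp, gamma_T, rng=None):
    """Sketch of Algorithm 1: UCB sampling for T rounds, then private release
    of (lambda_tilde, v_tilde). f_noisy(lam) returns f_V(lam) + Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(Lambda)
    lam_hist, v_hist = [], []
    for t in range(1, T + 1):
        beta_t = 2 * np.log(n * t**2 * np.pi**2 / (3 * delta))            # line 4
        if not lam_hist:                                                   # prior: mean 0, variance 1
            mu, var = np.zeros(n), np.ones(n)
        else:
            mu, var = gp_posterior(np.array(lam_hist), np.array(v_hist), Lambda, sigma2)
        ucb = mu + np.sqrt(beta_t) * np.sqrt(np.maximum(var, 0.0))         # line 5
        lam_t = Lambda[np.argmax(ucb)]
        lam_hist.append(lam_t)
        v_hist.append(f_noisy(lam_t))                                      # lines 6-7
    beta_T1 = 2 * np.log(n * (T + 1)**2 * np.pi**2 / (3 * delta))
    c = 2 * np.sqrt((1 - k_VVp) * np.log(3 * n / delta))                   # line 9
    q = np.sqrt(sigma2) * np.sqrt(8 * np.log(3 / delta))                   # line 10
    C1 = 8 / np.log(1 + 1 / sigma2)                                        # line 11
    mu_T, _ = gp_posterior(np.array(lam_hist), np.array(v_hist), Lambda, sigma2)
    idx = exponential_mechanism(mu_T, 2 * np.sqrt(beta_T1) + c, eps, rng)  # line 12
    # after the loop, beta_t holds beta_T; line 14 uses it in the Laplace scale
    scale = (np.sqrt(C1 * beta_t * gamma_T / T) + c + q) / eps
    v_tilde = max(v_hist) + rng.laplace(scale=scale)                       # lines 13-15
    return Lambda[idx], v_tilde
```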
In the sections that follow, although the quantities $f, \mu, \sigma, v$ all depend on the validation dataset $V$, for notational simplicity we will occasionally omit the subscript $V$. Similarly, for $V'$ we will often write $f', \mu', \sigma'^2, v'$.
3.1. Private near-maximum hyper-parameters
In this section we guarantee that releasing $\tilde{\lambda}$ in Algorithm 1 is private (Theorem 1) and has high utility (Theorem 2). Our proof strategy is as follows: we will first bound the global sensitivity of $\mu_T(\lambda)$ with probability at least $1 - \delta$. Then we will show that releasing $\tilde{\lambda}$ via the exponential mechanism is $(\epsilon, \delta)$-differentially private. Finally, we show that $\mu_T(\tilde{\lambda})$ is close to $\max_{\lambda \in \Lambda}\mu_T(\lambda)$.
Global sensitivity. As a first step we bound the global sensitivity of $\mu_T(\lambda)$ as follows:

Theorem 1. Given Assumption 1, for any two neighboring datasets $V, V'$ and for all $\lambda \in \Lambda$, with probability at least $1 - \delta$ there is an upper bound on the global sensitivity (in the exponential mechanism sense) of $\mu_T$:
$$|\mu'_T(\lambda) - \mu_T(\lambda)| \leq 2\sqrt{\beta_{T+1}} + \sigma_1\sqrt{2\log(3|\Lambda|/\delta)},$$
for $\sigma_1 = \sqrt{2(1 - k_1(V, V'))}$ and $\beta_t = 2\log\big(|\Lambda|t^2\pi^2/(3\delta)\big)$.
Proof. Note that, by applying the triangle inequality twice, for all $\lambda \in \Lambda$,
$$|\mu'_T(\lambda) - \mu_T(\lambda)| \leq |\mu'_T(\lambda) - f'(\lambda)| + |f'(\lambda) - \mu_T(\lambda)|$$
$$\leq |\mu'_T(\lambda) - f'(\lambda)| + |f'(\lambda) - f(\lambda)| + |f(\lambda) - \mu_T(\lambda)|.$$
We can now bound each one of the terms in the summation on the right hand side (RHS) with probability at least $1 - \frac{\delta}{3}$. According to Srinivas et al. (2010), Lemma 5.1, we obtain $|\mu'_T(\lambda) - f'(\lambda)| \leq \sqrt{\beta_{T+1}}\,\sigma'_T(\lambda)$. The same can be applied to $|f(\lambda) - \mu_T(\lambda)|$. As $\sigma'_T(\lambda) \leq 1$, because $k(\lambda, \lambda) = 1$, we can upper bound both terms together by $2\sqrt{\beta_{T+1}}$. In order to bound the remaining (middle) term on the RHS, recall that for a random variable $Z \sim \mathcal{N}(0, 1)$ we have $\Pr[|Z| > \gamma] \leq e^{-\gamma^2/2}$. For variables $Z_1, \ldots, Z_n \sim \mathcal{N}(0, 1)$ we have, by the union bound, that $\Pr[\forall i,\ |Z_i| \leq \gamma] \geq 1 - ne^{-\gamma^2/2} \geq 1 - \frac{\delta}{3}$. If we set $Z = \frac{|f(\lambda) - f'(\lambda)|}{\sigma_1}$ (note that under Assumption 1, $f(\lambda) - f'(\lambda)$ is Gaussian with variance $2(1 - k_1(V, V')) = \sigma_1^2$) and $n = |\Lambda|$, we obtain $\gamma = \sqrt{2\log(3|\Lambda|/\delta)}$, which completes the proof.
We remark that all of the quantities in Theorem 1 are either given or selected by the modeler (e.g., $\delta$, $T$). Given this upper bound we can apply the exponential mechanism to release $\tilde{\lambda}$ privately, as per Definition 1:

Corollary 1. Let $A(V)$ denote Algorithm 1 applied on dataset $V$. Given Assumption 1, $\tilde{\lambda}$ is $(\epsilon, \delta)$-differentially private, i.e., $\Pr[A(V) = \tilde{\lambda}] \leq e^{\epsilon}\Pr[A(V') = \tilde{\lambda}] + \delta$, for any pair of neighboring datasets $V, V'$.
We leave the proof of Corollary 1 to the supplementary material. As described in Srinivas et al. (2010), lines 3-8 (UCB) of Algorithm 1 are a no-regret Bayesian optimization procedure: $\lim_{T \to \infty}\sum_{t=1}^T (f(\lambda^*) - f(\lambda_t))/T = 0$. Further, with high probability (as given by the Gaussian tail probability bounds), for selected hyper-parameters $\lambda_i \in \{\lambda_1, \ldots, \lambda_T\}$ we have that $\mu_T(\lambda_i) \approx f(\lambda_i)$. Therefore, it is reasonable to select $\lambda = \arg\max_{\lambda \in \Lambda}\mu_T(\lambda)$ to optimize (1). McSherry & Talwar (2007) show that the exponential mechanism (line 12) selects a noisy maximum as follows.
Theorem 2. (McSherry & Talwar, 2007) The exponential mechanism selects $\tilde{\lambda}$ whose value $\mu_T(\tilde{\lambda})$ is close to the maximum $\max_{\lambda \in \Lambda}\mu_T(\lambda)$:
$$\max_{\lambda \in \Lambda}\mu_T(\lambda) - \mu_T(\tilde{\lambda}) \leq \frac{2\Delta}{\epsilon}\big(\log|\Lambda| + a\big), \quad (5)$$
w.p. $\geq 1 - (\delta + e^{-a})$, where $\Delta = 2\sqrt{\beta_{T+1}} + c$ (for $\beta_{T+1}$ and $c$ defined as in Algorithm 1).

Notice that as $T$ increases, so does $\Delta$ in the above inequality, as the distribution in line 12 becomes more uniform. However, because $\Delta$ increases much more slowly than $T$ (i.e., $\Delta$ is asymptotically $O(\sqrt{\log((T+1)^2)})$), $T$ should be selected as large as possible to achieve small average regret.
3.2. Private near-maximum validation gain
In this section we demonstrate that releasing the validation gain $\tilde{v}$ in Algorithm 1 is private (Theorem 3) and that the noise we add to ensure privacy is bounded with high probability (Theorem 4). As in the previous section, our approach will be to first derive the global sensitivity of the maximum $v$ found by Algorithm 1. Then we show releasing $\tilde{v}$ is $(\epsilon, \delta)$-differentially private via the Laplace mechanism. Surprisingly, we can also show that $\tilde{v}$ is close to $f(\lambda^*)$.
Global sensitivity. We bound the global sensitivity of the maximum $v$ found with Bayesian optimization and UCB:

Theorem 3. Given Assumption 1, and neighboring $V, V'$, we have the following global sensitivity bound (in the Laplace mechanism sense) for the maximum $v$, w.p. $\geq 1 - \delta$:
$$\Big|\max_{t \leq T} v'_t - \max_{t \leq T} v_t\Big| \leq \sqrt{\frac{C_1\beta_T\gamma_T}{T}} + c + q$$
(for $c$, $q$, $\beta_T$ in Algorithm 1), where the maximum GP information gain $\gamma_T$ is bounded above for the squared exponential and Matérn kernels (Srinivas et al., 2010).
Proof. For notational simplicity let us denote the regret term as $\Omega \triangleq \sqrt{C_1 T\beta_T\gamma_T}$. Then from Theorem 1 in Srinivas et al. (2010) we have that
$$\frac{\Omega}{T} \geq f(\lambda^*) - \frac{1}{T}\sum_{t=1}^T f(\lambda_t) \geq f(\lambda^*) - \max_{t \leq T} f(\lambda_t). \quad (6)$$
This implies $f(\lambda^*) \leq \max_{t \leq T} f(\lambda_t) + \frac{\Omega}{T}$ with probability at least $1 - \frac{\delta}{3}$ (with appropriate choice of $\beta_T$).
Recall that in the proof of Theorem 1 we showed that $|f(\lambda) - f'(\lambda)| \leq c$ with probability at least $1 - \frac{\delta}{3}$ (for $c$ given in Algorithm 1). This, along with the above expression, implies the following two sets of inequalities with probability greater than $1 - \frac{2\delta}{3}$:
$$f'(\lambda^*) - c \leq f(\lambda^*) < \max_{t \leq T} f(\lambda_t) + \frac{\Omega}{T};$$
$$f(\lambda^*) - c \leq f'(\lambda^*) < \max_{t \leq T} f'(\lambda_t) + \frac{\Omega}{T}.$$
These, in turn, imply the two sets of inequalities:
$$\max_{t \leq T} f'(\lambda_t) \leq f'(\lambda^*) < \max_{t \leq T} f(\lambda_t) + \frac{\Omega}{T} + c;$$
$$\max_{t \leq T} f(\lambda_t) \leq f(\lambda^*) < \max_{t \leq T} f'(\lambda_t) + \frac{\Omega}{T} + c.$$
This implies $\big|\max_{t \leq T} f'(\lambda_t) - \max_{t \leq T} f(\lambda_t)\big| \leq \frac{\Omega}{T} + c$. That is, the global sensitivity of $\max_{t \leq T} f(\lambda_t)$ is bounded.
Given the sensitivity of the maximum $f$, we can readily derive the sensitivity of the maximum $v$. First note that we can use the triangle inequality to derive
$$\Big|\max_{t \leq T} v'_t - \max_{t \leq T} v_t\Big| \leq \Big|\max_{t \leq T} v_t - \max_{t \leq T} f(\lambda_t)\Big| + \Big|\max_{t \leq T} v'_t - \max_{t \leq T} f'(\lambda_t)\Big| + \Big|\max_{t \leq T} f'(\lambda_t) - \max_{t \leq T} f(\lambda_t)\Big|.$$
We can immediately bound the final term on the right hand side. For the first term, let $t_{v,\max} = \arg\max_{t \leq T} v_t$ be the index of the best observed validation gain. Let $t_{f,\max} = \arg\max_{t \leq T} f(\lambda_t)$ be the index of the best true validation gain. The first term is upper bounded by $|\alpha| \triangleq \max\{|\alpha_{t_{v,\max}}|, |\alpha_{t_{f,\max}}|\}$ (this can be easily seen by enumerating the cases: (a) $t_{v,\max} = t_{f,\max}$ and (b) $t_{v,\max} \neq t_{f,\max}$). Similarly, $|\alpha'|$ upper bounds the second term (defined in the same way). Let $\hat{\alpha} = \max\{\alpha, \alpha'\}$, so we have
$$\Big|\max_{t \leq T} v'_t - \max_{t \leq T} v_t\Big| \leq \frac{\Omega}{T} + c + |2\hat{\alpha}|.$$
Although $|\hat{\alpha}|$ can be arbitrarily large, recall that for $Z \sim \mathcal{N}(0, 1)$ we have $\Pr[|Z| \leq \gamma] \geq 1 - e^{-\gamma^2/2} \geq 1 - \frac{\delta}{3}$. Therefore, if we set $Z = \frac{\hat{\alpha}}{\sigma}$ we have $\gamma = \sqrt{2\log(3/\delta)}$. This implies that $|2\hat{\alpha}| \leq \sigma\sqrt{8\log(3/\delta)} = q$ with probability at least $1 - \frac{\delta}{3}$. Therefore, if Theorem 1 from Srinivas et al. (2010) and the bound on $|f(\lambda) - f'(\lambda)|$ hold together with probability at least $1 - \frac{2\delta}{3}$ as described above, the theorem follows directly.
As in Theorem 1, each quantity in the above bound is given in Algorithm 1 ($\beta$, $c$, $q$), given in previous results (Srinivas et al., 2010) ($\gamma_T$, $C_1$), or specified by the modeler ($T$, $\delta$). Now that we have a bound on the sensitivity of the maximum $v$, we will use the Laplace mechanism to prove our privacy guarantee (proof in supplementary material):

Corollary 2. Let $A(V)$ denote Algorithm 1 run on dataset $V$. Given Assumption 1, releasing $\tilde{v}$ is $(\epsilon, \delta)$-differentially private, i.e., $\Pr[A(V) = \tilde{v}] \leq e^{\epsilon}\Pr[A(V') = \tilde{v}] + \delta$.
Further, as the Laplace distribution has exponential tails, the noise we add to obtain $\tilde{v}$ is small and shrinks as $T$ increases. In fact, we can show that this noisy validation error is close to the optimal noise-free validation error.

Theorem 4. Given Assumption 1 we have
$$|\tilde{v} - f(\lambda^*)| \leq \sigma\sqrt{8\log(1/\delta)} + \frac{\Omega}{T} + a\Big(\frac{\Omega}{\epsilon T} + \frac{c}{\epsilon} + \frac{q}{\epsilon}\Big),$$
with probability at least $1 - (\delta + e^{-a})$, for $\Omega = \sqrt{C_1 T\beta_T\gamma_T}$.
Proof. Let $Z$ be a Laplace random variable with scale parameter $b$ and location parameter 0; $Z \sim \mathrm{Lap}(b)$. Then $\Pr[|Z| \leq ab] = 1 - e^{-a}$. Thus, in Algorithm 1, $|\tilde{v} - \max_{t \leq T} v_t| \leq ab$ for $b = \frac{\Omega}{\epsilon T} + \frac{c}{\epsilon} + \frac{q}{\epsilon}$ with probability at least $1 - e^{-a}$. Let $t_{f,\max} = \arg\max_{t \leq T} f(\lambda_t)$. Further observe
$$ab \geq \max_{t \leq T} v_t - \tilde{v} \geq \Big(\max_{t \leq T} f(\lambda_t) - \alpha_{t_{f,\max}}\Big) - \tilde{v} \geq f(\lambda^*) - \frac{\Omega}{T} - \alpha_{t_{f,\max}} - \tilde{v}, \quad (7)$$
where the second and third inequalities follow from the proof of Theorem 3 (using the regret bound of Srinivas et al. (2010), Theorem 1). Note that the third inequality holds with probability greater than $1 - \frac{\delta}{2}$ (given $\beta_t$ in Algorithm 1). The final inequality implies $f(\lambda^*) - \tilde{v} \leq \alpha_{t_{f,\max}} + \frac{\Omega}{T} + ab$. Let $t_{v,\max} = \arg\max_{t \leq T} v_t$. Also note
$$ab \geq \tilde{v} - \max_{t \leq T} v_t \geq \tilde{v} - \Big(\max_{t \leq T} f(\lambda_t) + \alpha_{t_{v,\max}}\Big) \geq \tilde{v} - f(\lambda^*) - \frac{\Omega}{T} - \alpha_{t_{v,\max}}. \quad (8)$$
This implies that $f(\lambda^*) - \tilde{v} \geq -\alpha_{t_{v,\max}} - \frac{\Omega}{T} - ab$. Thus we have that $|\tilde{v} - f(\lambda^*)| \leq \alpha_{t_{f,\max}} + \alpha_{t_{v,\max}} + \frac{\Omega}{T} + ab$. Finally, because $\alpha_{t_{f,\max}}$ and $\alpha_{t_{v,\max}}$ could be arbitrarily large, we give a high probability upper bound on these quantities. Let $\hat{\alpha} \triangleq \max\{\alpha_{t_{f,\max}}, \alpha_{t_{v,\max}}\}$. Again recall that for $Z \sim \mathcal{N}(0, 1)$ we have the tail probability bound $\Pr[Z \leq \gamma] \geq 1 - (1/2)e^{-\gamma^2/2} \geq 1 - \frac{\delta}{2}$. Therefore, if we set $Z = \frac{\hat{\alpha}}{\sigma}$ we arrive at $\gamma = \sqrt{2\log(1/\delta)}$. This entails that $2\hat{\alpha} \leq \sigma\sqrt{8\log(1/\delta)}$. Therefore, our proposed bound in Theorem 4 holds.
The right side of Theorem 4 becomes smaller as $T$ increases, so $T$ should be set as large as possible. We note that, because releasing either $\tilde{\lambda}$ or $\tilde{v}$ is $(\epsilon, \delta)$-differentially private by Corollaries 1 and 2, releasing both private quantities in Algorithm 1 guarantees $(2\epsilon, 2\delta)$-differential privacy for validation dataset $V$. This is due to the composition properties of $(\epsilon, \delta)$-differential privacy (Dwork et al., 2006a) (in fact, stronger composition results can be demonstrated (Dwork & Roth, 2013)).
4. Without observation noise
In hyper-parameter tuning it may be reasonable to assume that we can observe function evaluations exactly: $v_{V,t} = f_V(\lambda_t)$. First note that we can use the same algorithm to report the maximum $\lambda$ in the no-noise setting. Indeed, Theorems 1 and 2 still hold. However, we cannot readily report a private maximum $f$, as the information gain $\gamma_T$ in Theorems 3 and 4 approaches infinity as $\sigma^2 \to 0$. Therefore, we extend results from the previous section to the exact observation case via the regret bounds of de Freitas et al. (2012). In this setting, the authors show that they can achieve exponentially-decreasing regret for each sample: $f(\lambda^*) - f(\lambda_t) < O(e^{-t/\log t})$. Therefore our strategy will be to select the last sample and add enough noise to cover a change in a validation record. Algorithm 2 demonstrates this strategy for privatizing the maximum $f$.
Algorithm 2 Private Bayesian Opt. (noise-free obs.)
1: Input: $V$; $\Lambda \subseteq \mathbb{R}^d$; $T$; $(\epsilon, \delta)$; $A$, $\tau$; assumptions on $f_V$ in de Freitas et al. (2012)
2: Observe noise-free samples: $f_V(\lambda_1), \ldots, f_V(\lambda_T)$ using the method of de Freitas et al. (2012)
3: $c = 2\sqrt{(1 - k(V, V'))\log(2|\Lambda|/\delta)}$
4: Draw $\theta \sim \mathrm{Lap}\Big[\Big(A e^{-\frac{T\tau}{(\log T)^{d/4}}} + c\Big)/\epsilon\Big]$
5: Return: $\tilde{f} = f_V(\lambda_T) + \theta$
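A compact sketch of the release step of Algorithm 2. The loop that produces the noise-free evaluations is assumed to be the no-noise UCB procedure of de Freitas et al. (2012) and is not reproduced here; the constants A and tau, the dataset-kernel value k(V, V'), and the size of Λ are supplied by the modeler.

```python
import numpy as np

def private_release_noise_free(f_vals, k_VVp, n_Lambda, d, eps, delta, A, tau, rng=None):
    """Sketch of lines 3-5 of Algorithm 2: Laplace-perturb the last noise-free
    evaluation f_V(lambda_T). f_vals holds the T observed values (assumes T >= 2)."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(f_vals)
    c = 2 * np.sqrt((1 - k_VVp) * np.log(2 * n_Lambda / delta))     # line 3
    regret = A * np.exp(-T * tau / np.log(T) ** (d / 4))            # rate of de Freitas et al. (2012)
    theta = rng.laplace(scale=(regret + c) / eps)                   # line 4
    return f_vals[-1] + theta                                       # line 5
```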
4.1. Private near-maximum validation gain
We demonstrate that releasing $\tilde{f}$ in Algorithm 2 is private (Corollary 3) and that only a small amount of noise is added to make $\tilde{f}$ private (Theorem 6). To do so, we derive the global sensitivity of $f_V(\lambda_T)$ in Algorithm 2 independent of the maximum information gain $\gamma_T$ via de Freitas et al. (2012). Then we prove releasing $\tilde{f}$ is $(\epsilon, \delta)$-differentially private and that $\tilde{f}$ is close to the optimal $f(\lambda^*)$.
Global sensitivity. The following theorem gives a bound on the global sensitivity of the maximum $f$.

Theorem 5. Given Assumption 1 and the assumptions in Theorem 2 of de Freitas et al. (2012), for neighboring datasets $V, V'$ we have the following global sensitivity bound (in the Laplace mechanism sense),
$$|f'(\lambda_T) - f(\lambda_T)| \leq A e^{-\frac{T\tau}{(\log T)^{d/4}}} + c,$$
w.p. at least $1 - \delta$, for $c = 2\sqrt{(1 - k(V, V'))\log(2|\Lambda|/\delta)}$, given constants $A$ and $\tau$ in de Freitas et al. (2012).

We leave the proof to the supplementary material.
Given this sensitivity, we may apply the Laplace mechanism to release $\tilde{f}$.

Corollary 3. Let $A(V)$ denote Algorithm 2 run on dataset $V$. Given Assumption 1 and that $f$ satisfies the assumptions of de Freitas et al. (2012), $\tilde{f}$ is $(\epsilon, \delta)$-differentially private with respect to any neighboring dataset $V'$, i.e.,
$$\Pr[A(V) = \tilde{f}] \leq e^{\epsilon}\Pr[A(V') = \tilde{f}] + \delta.$$
Even though we must add noise to the maximum $f$, we show that $\tilde{f}$ is still close to the optimal $f(\lambda^*)$.

Theorem 6. Given the assumptions of Corollary 3, we have the following utility guarantee for Algorithm 2:
$$|\tilde{f} - f(\lambda^*)| \leq \Omega + a\Big(\frac{\Omega}{\epsilon} + \frac{c}{\epsilon}\Big),$$
w.p. at least $1 - (\delta + e^{-a})$, for $\Omega = A e^{-\frac{T\tau}{(\log T)^{d/4}}}$.
We prove Corollary 3 and Theorem 6 in the supplementary material. As in the noisy setting, selecting $T$ as large as possible is the best strategy, as this reduces the right side of Theorem 6 (less Laplacian noise is required). We have thus shown that in noisy and noise-free settings it is possible to release private, high-quality hyper-parameter settings $\tilde{\lambda}$ as well as private, near-optimal function evaluations $\tilde{v}$, $\tilde{f}$.
5. Without the GP assumption
Even if our true validation score $f$ is not drawn from a Gaussian process (Assumption 1), we can still guarantee differential privacy for releasing its value after Bayesian optimization: $f_{BO} = \max_{t \leq T} f(\lambda_t)$. In this section we describe a different functional assumption on $f$ that also yields differentially private Bayesian optimization for the case of machine learning hyper-parameter tuning.
Assume we have a (nonsensitive) training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^n$ which, given a hyper-parameter $\lambda$, produces a model $w_\lambda$ from the following optimization:
$$w_\lambda = \arg\min_{w}\; O_\lambda(w), \quad O_\lambda(w) \triangleq \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n \ell(w, x_i, y_i). \quad (9)$$
The function $\ell$ is a training loss function (e.g., logistic loss, hinge loss). Given a (sensitive) validation set $V = \{(x_i, y_i)\}_{i=1}^m \subseteq X$ we would like to use Bayesian optimization to maximize a validation score $f_V$.
Assumption 2. Our true validation score $f_V$ is
$$f_V(w_\lambda) = \frac{1}{m}\sum_{i=1}^m g(w_\lambda, x_i, y_i),$$
where $g(\cdot)$ is a validation loss function that is $L$-Lipschitz in $w_\lambda$ (e.g., ramp loss, normalized sigmoid (Huang et al., 2014)). Additionally, the training model $w_\lambda$ is the minimizer of eq. (9) for a training loss $\ell(\cdot)$ that is 1-Lipschitz in $w_\lambda$ and convex (e.g., logistic loss, hinge loss).
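As one concrete instance of eq. (9) and Assumption 2 (our own example, not one fixed by the paper), the sketch below trains with the hinge loss, which is convex and 1-Lipschitz in $w$ when feature vectors have norm at most one, and validates with the ramp loss, which is bounded and Lipschitz; the unit-norm feature assumption is ours.

```python
import numpy as np

def train_w(lam, X, y, steps=500, lr=0.1):
    """Approximately minimize O_lam(w) = lam/2 ||w||^2 + (1/n) sum_i hinge(w; x_i, y_i)
    by subgradient descent (eq. (9) with the hinge loss)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                                  # records with nonzero hinge subgradient
        grad = lam * w - (y[active, None] * X[active]).sum(0) / n
        w -= lr * grad
    return w

def validation_score(w, Xv, yv):
    """f_V(w_lam): one minus the mean ramp loss on the validation set (higher is better)."""
    margins = yv * (Xv @ w)
    ramp = np.clip(1.0 - margins, 0.0, 1.0)                   # ramp loss, bounded in [0, 1]
    return 1.0 - ramp.mean()
```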
Algorithm 3 describes a procedure for privately releasing the best validation accuracy $f_{BO}$ given Assumption 2. Different from the previous algorithms, we may run Bayesian optimization in Algorithm 3 with any acquisition function (e.g., expected improvement (Mockus et al., 1978), UCB) and privacy is still guaranteed.
Algorithm 3 Private Bayesian Opt. (Lipschitz and convex)
1: Input: $\mathcal{T}$ size $n$; $V$ size $m$; $\Lambda$; $\lambda_{\min}$; $\lambda_{\max}$; $\epsilon$; $T$; $L$; $d$
2: Run Bayesian optimization for $T$ timesteps, observing: $f_V(w_{\lambda_1}), \ldots, f_V(w_{\lambda_T})$ for $\{\lambda_1, \ldots, \lambda_T\} = \Lambda_{V,T} \subseteq \Lambda$
3: $f_{BO} = \max_{t \leq T} f_V(w_{\lambda_t})$
4: $g^* = \max_{(x,y) \in X,\, w \in \mathbb{R}^p} g(w, x, y)$
5: Draw $\theta \sim \mathrm{Lap}\Big[\frac{1}{\epsilon}\Big(\min\big\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\big\} + \frac{(\lambda_{\max} - \lambda_{\min})L}{\lambda_{\max}\lambda_{\min}}\Big)\Big]$
6: Return: $\tilde{f}_L = f_{BO} + \theta$
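A sketch of the release step of Algorithm 3, with the Bayesian optimization loop abstracted into the list of observed validation scores; g_star, L, lambda_min, and lambda_max are modeler-supplied inputs exactly as in the algorithm.

```python
import numpy as np

def private_release_lipschitz(f_vals, m, g_star, L, lam_min, lam_max, eps, rng=None):
    """Sketch of lines 3-6 of Algorithm 3: Laplace-perturb the best observed
    validation score f_BO using the total global sensitivity bound of Theorem 7."""
    rng = np.random.default_rng() if rng is None else rng
    f_bo = max(f_vals)                                                   # line 3
    sens = (min(g_star / m, L / (m * lam_min))
            + (lam_max - lam_min) * L / (lam_max * lam_min))             # Theorem 7 / Definition 5
    theta = rng.laplace(scale=sens / eps)                                # line 5
    return f_bo + theta                                                  # line 6
```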
Similar to Algorithms 1 and 2, we use the Laplace mechanism to mask the possible change in validation accuracy when a single record is changed in $V$. Different from the related work of Chaudhuri & Vinterbo (2013), changing $V$ to a neighboring dataset $V'$ may also lead Bayesian optimization to search different hyper-parameters, $\Lambda_{V,T}$ vs. $\Lambda_{V',T}$. Therefore, we must bound the total global sensitivity of $f$ with respect to $V$ and $\lambda$:

Definition 5. The total global sensitivity of $f$ over all neighboring datasets $V, V'$ is
$$\Delta_f \triangleq \max_{V, V' \subseteq X,\; \lambda, \lambda' \in \Lambda} |f_V(w_\lambda) - f_{V'}(w_{\lambda'})|.$$
In the following theorem we demonstrate that we can bound the change in $f$ for arbitrary $\lambda < \lambda'$ (w.l.o.g.).

Theorem 7. Given Assumption 2, for neighboring $V, V'$ and arbitrary $\lambda < \lambda'$ we have that
$$|f_V(w_\lambda) - f_{V'}(w_{\lambda'})| \leq \frac{(\lambda' - \lambda)L}{\lambda'\lambda} + \min\Big\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\Big\},$$
where $L$ is the Lipschitz constant of $f$, $m$ is the size of $V$, and $g^* = \max_{(x,y) \in X,\, w \in \mathbb{R}^p} g(w, x, y)$.
Proof. Applying the triangle inequality yields
$$|f_V(w_\lambda) - f_{V'}(w_{\lambda'})| \leq |f_V(w_\lambda) - f_V(w_{\lambda'})| + |f_V(w_{\lambda'}) - f_{V'}(w_{\lambda'})|.$$
The second term is bounded by Chaudhuri & Vinterbo (2013) in the proof of their Theorem 4. The only difference is that, as we are not adding random noise to $w_{\lambda'}$, we have that $|f_V(w_{\lambda'}) - f_{V'}(w_{\lambda'})| \leq \min\{g^*/m,\; L/(m\lambda_{\min})\}$.
To bound the first term, let $O_\lambda(w)$ be the value of the objective in eq. (9) for a particular $\lambda$. Note that $O_\lambda(w)$ and $O_{\lambda'}(w)$ are $\lambda$- and $\lambda'$-strongly convex. Define
$$h(w) = O_{\lambda'}(w) - O_\lambda(w) = \frac{\lambda' - \lambda}{2}\|w\|_2^2. \quad (10)$$
Further, define the minimizers $w_\lambda = \arg\min_w O_\lambda(w)$ and $w_{\lambda'} = \arg\min_w\,[O_\lambda(w) + h(w)]$. This implies that
$$\nabla O_\lambda(w_\lambda) = \nabla O_\lambda(w_{\lambda'}) + \nabla h(w_{\lambda'}) = 0. \quad (11)$$
Given that $O_\lambda$ is $\lambda$-strongly convex (Shalev-Shwartz, 2007), and by the Cauchy-Schwartz inequality,
$$\lambda\|w_\lambda - w_{\lambda'}\|_2^2 \leq \big[\nabla O_\lambda(w_\lambda) - \nabla O_\lambda(w_{\lambda'})\big]^\top\big[w_\lambda - w_{\lambda'}\big] = \nabla h(w_{\lambda'})^\top\big[w_\lambda - w_{\lambda'}\big] \leq \|\nabla h(w_{\lambda'})\|_2\,\|w_\lambda - w_{\lambda'}\|_2.$$
Rearranging,
$$\frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 = \frac{\lambda' - \lambda}{2\lambda}\big\|\nabla\|w_{\lambda'}\|_2^2\big\|_2 \geq \|w_\lambda - w_{\lambda'}\|_2. \quad (12)$$
Now, as $w_{\lambda'}$ is the minimizer of $O_{\lambda'}$, we have
$$\nabla\|w_{\lambda'}\|_2^2 = -\frac{2}{\lambda'}\Big(\frac{1}{n}\sum_{i=1}^n \nabla\ell(w_{\lambda'}, x_i, y_i)\Big).$$
Substituting this into eq. (12), and noting that we can pull the positive constant term $(\lambda' - \lambda)/2$ out of the norm and drop the negative sign in the norm, gives us
$$\frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 = \frac{\lambda' - \lambda}{\lambda\lambda'}\Big\|\frac{1}{n}\sum_{i=1}^n \nabla\ell(w_{\lambda'}, x_i, y_i)\Big\|_2 \leq \frac{\lambda' - \lambda}{\lambda\lambda'}.$$
The last inequality follows from the fact that the loss $\ell$ is 1-Lipschitz by Assumption 2 and the triangle inequality.
Thus, along with eq. (12), we have
$$\|w_\lambda - w_{\lambda'}\|_2 \leq \frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 \leq \frac{\lambda' - \lambda}{\lambda\lambda'}.$$
Finally, as $f$ is $L$-Lipschitz in $w$,
$$|f_V(w_\lambda) - f_V(w_{\lambda'})| \leq L\|w_\lambda - w_{\lambda'}\|_2 \leq L\,\frac{\lambda' - \lambda}{\lambda\lambda'}.$$
Combining the result of Chaudhuri & Vinterbo (2013) with the above expression completes the proof.
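The model-stability step of the proof, $\|w_\lambda - w_{\lambda'}\|_2 \leq (\lambda' - \lambda)/(\lambda\lambda')$, can be checked numerically with the hinge-loss trainer sketched after Assumption 2; the random data below is purely illustrative, and the check is approximate because the sketch only approximately minimizes eq. (9).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm features, so the hinge loss is 1-Lipschitz
y = np.sign(rng.normal(size=200))

lam, lam_p = 0.5, 1.0
w1, w2 = train_w(lam, X, y), train_w(lam_p, X, y)
lhs = np.linalg.norm(w1 - w2)
rhs = (lam_p - lam) / (lam_p * lam)
print(f"||w_lam - w_lam'||_2 = {lhs:.4f}  vs. bound (lam' - lam)/(lam lam') = {rhs:.4f}")
```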
Given a finite set of hyper-parameters $\Lambda$, in order to bound the total global sensitivity of $f$ note that, by Theorem 7,
$$\Delta_f \leq \frac{(\lambda_{\max} - \lambda_{\min})L}{\lambda_{\max}\lambda_{\min}} + \min\Big\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\Big\},$$
as $(\lambda' - \lambda)/(\lambda'\lambda)$ is strictly increasing in $\lambda'$ and strictly decreasing in $\lambda$. Given the total global sensitivity of $f$ we can use the Laplace mechanism to 'hide' the validation set.
Corollary 4. Let $A(V)$ denote Algorithm 3 applied on dataset $V$. Given Assumption 2, $\tilde{f}_L$ is $\epsilon$-differentially private, i.e., $\Pr[A(V) = \tilde{f}_L] \leq e^{\epsilon}\Pr[A(V') = \tilde{f}_L]$.

We leave the proof to the supplementary material. Further, by the exponential tails of the Laplace mechanism we have the following utility guarantee.
Theorem 8. Given the assumptions of Theorem 7, we have the following utility guarantee for $\tilde{f}_L$ w.r.t. $f_{BO}$:
$$|\tilde{f}_L - f_{BO}| \leq \frac{a}{\epsilon}\Big[\frac{1}{m}\min\Big\{g^*, \frac{L}{\lambda_{\min}}\Big\} + \frac{(\lambda_{\max} - \lambda_{\min})L}{\lambda_{\max}\lambda_{\min}}\Big]$$
with probability at least $1 - e^{-a}$.
Proof. This follows exactly from the tail bound on Laplace
random variables, given in the proof of Theorem 6.
One should set $T$ as large as possible, as Bayesian optimization can only find better hyper-parameter settings when $T$ is increased, and the noise added is independent of $T$.
6. Related work
There has been much work towards differentially pri-
vate convex optimization (Chaudhuri et al., 2011; Kifer
et al., 2012; Duchi et al., 2013; Song et al., 2013; Jain &
Thakurta, 2014; Bassily et al., 2014). The work of Bassily et al. (2014) established upper and lower bounds for the excess empirical risk of $\epsilon$- and $(\epsilon, \delta)$-differentially private algorithms for many settings, including convex and strongly convex risk functions that may or may not be smooth.
dimensional regression, where the dimensions outnum-
ber the number of instances (Kifer et al., 2012; Smith &
Thakurta, 2013a). In such cases the Hessian becomes sin-
gular and so the loss is nonconvex. However, it is possible
to use the restricted strong convexity of the loss in the re-
gression case to guarantee privacy.
Differential privacy has been shown to be achievable in
online and interactive kernel learning settings (Jain et al.,
2012; Smith & Thakurta, 2013b; Jain & Thakurta, 2013;
Mishra & Thakurta, 2014). In general, non-private online
algorithms are closest in spirit to the methods of Bayesian
optimization. However, all of the previous work in dif-
ferentially private online learning represents a dataset as
a sequence of bandit arm pulls (the equivalent notion in
Bayesian optimization is function evaluations f(λ
t
)). In-
stead, we consider functions in which changing a single
dataset entry possibly affects all future function evalua-
tions. Closest to our work is that of Chaudhuri & Vin-
terbo (2013), who show that given a fixed set of hyper-
parameters which are always evaluated for any validation
set, they can return a private version of the index of the
best hyper-parameter, as well as a private model trained
with that hyper-parameter. In one sense, their setting is
more difficult as they guarantee privacy for the training and
validation sets (we assume the training set is fixed). On the
other hand, our selection strategy is more general in that,
if the validation set changes, Bayesian optimization could
search completely different hyper-parameters.
Bayesian optimization, largely due to its principled han-
dling of the exploration/exploitation trade-off of global,
black-box function optimization, is quickly becoming
the global optimization paradigm of choice. Alongside
promising empirical results there is a wealth of recent work
on convergence guarantees for Bayesian optimization, sim-
ilar to those used in this work (Srinivas et al., 2010; de Fre-
itas et al., 2012). Vazquez & Bect (2010) and Bull (2011)
give regret bounds for optimization with the expected im-
provement acquisition function. BayesGap (Hoffman et al.,
2014) gives a convergence guarantee for Bayesian opti-
mization with budget constraints. Bayesian optimization
has also been extended to multi-task optimization (Bar-
denet et al., 2013; Swersky et al., 2013), the parallel ex-
periment setting (Azimi et al., 2012; Snoek et al., 2012),
and to constrained optimization (Gardner et al., 2014).
7. Conclusion
We have introduced methods for privately releasing the
best hyper-parameters and validation accuracies in the case
of exact and noisy observations. Our work makes use
of the differential privacy framework, which has become
commonplace in private machine learning (Dwork & Roth,
2013). We believe we are the first to demonstrate differen-
tially private quantities in the setting of global optimization
of expensive (possibly nonconvex) functions, through the
lens of Bayesian optimization.
One key future direction is to design techniques to release
each sampled hyper-parameter and validation accuracy pri-
vately (during the run of Bayesian optimization). This
requires analyzing how the maximum upper-confidence
bound changes as the validation dataset changes. Another
interesting direction is extending our guarantees in Sections
3 and 4 to other acquisition functions.
For the case of machine learning hyper-parameter tuning
our results are designed to guarantee privacy of the valida-
tion set only (it is equivalent to guarantee that the training
set is never allowed to change). To simultaneously protect
the privacy of the training set it may be possible to use tech-
niques similar to the training stability results of Chaudhuri
& Vinterbo (2013). Training stability could be guaranteed,
for example, by assuming an additional training set kernel
that bounds the effect of altering the training set on f. We
leave developing these guarantees for future work.
Acknowledgments
KQW, MJK, and JRG are supported by NSF grants IIA-
1355406, IIS-1149882, EFRI-1137211. This work was
also supported by Yahoo! Labs, where MJK was a re-
search intern working on this project. The authors thank
Abhradeep Thakurta and Avishek Saha for helpful guid-
ance, discussions, and suggestions.
References
Auer, Peter, Cesa-Bianchi, Nicolo, and Fischer, Paul.
Finite-time analysis of the multiarmed bandit problem.
Machine learning, 47(2-3):235–256, 2002.
Azimi, Javad, Jalali, Ali, and Fern, Xiaoli Z. Hybrid batch
bayesian optimization. In ICML 2012, pp. 1215–1222.
ACM, 2012.
Bardenet, Rémi, Brendel, Mátyás, Kégl, Balázs, and Sebag, Michèle. Collaborative hyperparameter tuning. In ICML, 2013.
Bassily, Raef, Smith, Adam, and Thakurta, Abhradeep.
Private empirical risk minimization, revisited. arXiv
preprint arXiv:1405.7085, 2014.
Bergstra, James and Bengio, Yoshua. Random search
for hyper-parameter optimization. JMLR, 13:281–305,
2012.
Bonilla, Edwin, Chai, Kian Ming, and Williams, Christo-
pher. Multi-task gaussian process prediction. In NIPS,
2008.
Bull, Adam D. Convergence rates of efficient global opti-
mization algorithms. JMLR, 12:2879–2904, 2011.
Chaudhuri, Kamalika and Vinterbo, Staal A. A stability-
based validation procedure for differentially private ma-
chine learning. In Advances in Neural Information Pro-
cessing Systems, pp. 2652–2660, 2013.
Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate,
Anand D. Differentially private empirical risk minimiza-
tion. JMLR, 12:1069–1109, 2011.
Chong, Miao M, Abraham, Ajith, and Paprzycki,
Marcin. Traffic accident analysis using machine learning
paradigms. Informatica (Slovenia), 29(1):89–98, 2005.
Cortes, Corinna and Vapnik, Vladimir. Support-vector net-
works. Machine learning, 20(3):273–297, 1995.
de Freitas, Nando, Smola, Alex, and Zoghi, Masrour. Ex-
ponential regret bounds for gaussian process bandits
with deterministic observations. In ICML, 2012.
Dinur, Irit and Nissim, Kobbi. Revealing information while
preserving privacy. In Proceedings of the SIGMOD-
SIGACT-SIGART symposium on principles of database
systems, pp. 202–210. ACM, 2003.
Duchi, John C, Jordan, Michael I, and Wainwright, Mar-
tin J. Local privacy and statistical minimax rates. In
FOCS, pp. 429–438. IEEE, 2013.
Dwork, Cynthia and Roth, Aaron. The algorithmic founda-
tions of differential privacy. Theoretical Computer Sci-
ence, 9(3-4):211–407, 2013.
Dwork, Cynthia, Kenthapadi, Krishnaram, McSherry,
Frank, Mironov, Ilya, and Naor, Moni. Our data, our-
selves: Privacy via distributed noise generation. In Ad-
vances in Cryptology-EUROCRYPT 2006, pp. 486–503.
Springer, 2006a.
Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and
Smith, Adam. Calibrating noise to sensitivity in private
data analysis. In Theory of Cryptography, pp. 265–284.
Springer, 2006b.
Ganta, Srivatsava Ranjit, Kasiviswanathan, Shiva Prasad,
and Smith, Adam. Composition attacks and auxiliary in-
formation in data privacy. In KDD, pp. 265–273. ACM,
2008.
Gardner, Jacob, Kusner, Matt, Xu, Zhixiang, Weinberger,
Kilian, and Cunningham, John. Bayesian optimization
with inequality constraints. In ICML, pp. 937–945, 2014.
Hoffman, Matthew, Shahriari, Bobak, and de Freitas,
Nando. On correlation and budget constraints in model-
based bandit optimization with application to automatic
machine learning. In AISTATS, pp. 365–374, 2014.
Huang, Xiaolin, Shi, Lei, and Suykens, Johan AK. Ramp
loss linear programming support vector machine. The
Journal of Machine Learning Research, 15(1):2185–
2211, 2014.
Hutter, Frank, Hoos, H. Holger, and Leyton-Brown, Kevin.
Sequential model-based optimization for general algo-
rithm configuration. In Learning and Intelligent Opti-
mization, pp. 507–523. Springer, 2011.
Jain, Prateek and Thakurta, Abhradeep. Differentially pri-
vate learning with kernels. In ICML, pp. 118–126, 2013.
Jain, Prateek and Thakurta, Abhradeep Guha. (near) di-
mension independent risk bounds for differentially pri-
vate learning. In ICML, pp. 476–484, 2014.
Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep.
Differentially private online learning. COLT, 2012.
Kifer, Daniel, Smith, Adam, and Thakurta, Abhradeep.
Private convex empirical risk minimization and high-
dimensional regression. JMLR, 1:41, 2012.
Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-
optimal sensor placements in gaussian processes: The-
ory, efficient algorithms and empirical studies. JMLR, 9:
235–284, 2008.
McSherry, Frank and Talwar, Kunal. Mechanism design via
differential privacy. In FOCS, pp. 94–103. IEEE, 2007.
Mishra, Nikita and Thakurta, Abhradeep. Private stochastic
multi-arm bandits: From theory to practice. In ICML
Workshop on Learning, Security, and Privacy, 2014.
Mockus, Jonas, Tiesis, Vytautas, and Zilinskas, Antanas.
The application of bayesian methods for seeking the ex-
tremum. Towards Global Optimization, 2(117-129):2,
1978.
Narayanan, Arvind and Shmatikov, Vitaly. Robust de-
anonymization of large sparse datasets. In IEEE Sympo-
sium on Security and Privacy, pp. 111–125. IEEE, 2008.
Rasmussen, Carl Edward and Williams, Christopher K. I.
Gaussian processes for machine learning. 2006.
Schölkopf, Bernhard and Smola, Alexander J. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
Shalev-Shwartz, Shai. Online learning: Theory, algo-
rithms, and applications. 2007.
Smith, Adam and Thakurta, Abhradeep Guha. Differen-
tially private feature selection via stability arguments,
and the robustness of the lasso. In COLT, pp. 819–850,
2013a.
Smith, Adam and Thakurta, Abhradeep Guha. (nearly)
optimal algorithms for private online learning in full-
information and bandit settings. In NIPS, pp. 2733–
2741, 2013b.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P.
Practical bayesian optimization of machine learning al-
gorithms. In NIPS, pp. 2951–2959, 2012.
Song, Shuang, Chaudhuri, Kamalika, and Sarwate,
Anand D. Stochastic gradient descent with differentially
private updates. In IEEE Global Conference on Signal
and Information Processing, 2013.
Srinivas, Niranjan, Krause, Andreas, Kakade, Sham M, and
Seeger, Matthias. Gaussian process optimization in the
bandit setting: No regret and experimental design. In
ICML, 2010.
Sweeney, Latanya. Weaving technology and policy to-
gether to maintain confidentiality. The Journal of Law,
Medicine & Ethics, 25(2-3):98–110, 1997.
Swersky, Kevin, Snoek, Jasper, and Adams, Ryan P. Multi-
task bayesian optimization. In NIPS, pp. 2004–2012,
2013.
Vazquez, Emmanuel and Bect, Julien. Convergence prop-
erties of the expected improvement algorithm with fixed
mean and covariance functions. Journal of Statistical
Planning and Inference, 140(11):3088–3095, 2010.
Weinberger, Kilian, Dasgupta, Anirban, Langford, John,
Smola, Alex, and Attenberg, Josh. Feature hashing for
large scale multitask learning. In ICML, pp. 1113–1120.
ACM, 2009.
Yu, Shipeng, Esbroeck, Alexander van, Farooq, Faisal,
Fung, Glenn, Anand, Vikram, and Krishnapuram, Bal-
aji. Predicting readmission risk with institution specific
prediction models. In IEEE International Conference
on Healthcare Informatics (ICHI), pp. 415–420. IEEE,
2013.