the sample complexity depends on the extrinsic dimension d rather than the intrinsic dimension k.
However, we believe this is a consequence of the fact that we make minimal assumptions on the
data; our results imply that, absent additional limitations on the data set, the sample complexity
of differentially private PCA must grow linearly with the data dimension.
There are several interesting open questions suggested by this work. One set of issues is com-
putational. Differential privacy is a mathematical definition, but algorithms must be implemented
using finite precision machines. Privacy and computation interact in many places, including pseu-
dorandomness, numerical stability, optimization, and the MCMC procedure we use to implement
PPCA; investigating the impact of approximate sampling is an avenue for future work. A second set
of issues is theoretical: while the privacy guarantees of PPCA hold for all k, our theoretical
analysis of sample complexity applies only to k = 1, where distances and angles between vectors
are directly related. An interesting direction is to develop theoretical bounds for general k; the
challenges here are identifying the right notion of approximation for PCA and extending the theory using packings of
Grassmann or Stiefel manifolds. Finally, in this work we assume k is given to the algorithm, but in
many applications k is chosen after looking at the data. Under differential privacy, the selection of
k itself must be done in a differentially private manner.
1.1 Related Work
Differential privacy was first proposed by Dwork et al. (2006b). There has been an extensive liter-
ature following this work in the computer science theory, machine learning, and databases commu-
nities. A survey of some of the theoretical work can be found in Dwork and Smith
(2009). Differential privacy has been shown to have strong semantic guarantees (Dwork et al.,
2006b; Kasiviswanathan and Smith, 2008) and is resistant to many attacks (Ganta et al., 2008) that
succeed against alternative definitions of privacy. In particular, so-called syntactic definitions of
privacy (Sweeney, 2002; Machanavajjhala et al., 2006; Li et al., 2010) may be susceptible to attacks
based on side-information about individuals in the database.
There are several general approaches to constructing differentially private approximations to
some desired algorithm or computation. Input perturbation (Blum et al., 2005) adds noise to the
data prior to performing the desired computation, whereas output perturbation (Dwork et al., 2006b)
adds noise to the output of the desired computation. The exponential mechanism (McSherry and
Talwar, 2007) can be used to perform differentially private selection based on a score function that
measures the quality of different outputs. Objective perturbation (Chaudhuri et al., 2011) adds noise
to the objective function of algorithms that solve convex optimization problems. These approaches and
related ideas such as Nissim et al. (2007) and Dwork and Lei (2009) have been used to approximate
a variety of statistical, machine learning, and data mining tasks under differential privacy (Barak
et al., 2007; Wasserman and Zhou, 2010; Smith, 2011; McSherry and Mironov, 2009; Williams and
McSherry, 2010; Chaudhuri et al., 2011; Rubinstein et al., 2012; Nissim et al., 2007; Blum et al.,
2008; McSherry and Talwar, 2007; Friedman and Schuster, 2010; Hardt and Roth, 2012).
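To make the exponential mechanism concrete, the following is a minimal sketch of differentially
private selection over a finite candidate set; sampling from continuous candidate sets, as PPCA
requires, needs more care (hence the MCMC procedure mentioned above). The function name
exponential_mechanism and the toy score below are illustrative, not an implementation from any
of the cited works.

import numpy as np

def exponential_mechanism(candidates, score, sensitivity, epsilon, rng=None):
    """Sample candidates[i] with probability proportional to
    exp(epsilon * score(candidates[i]) / (2 * sensitivity)).

    This satisfies epsilon-differential privacy provided `sensitivity`
    upper-bounds the change in score(c), for every candidate c, between
    any two neighboring data sets.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.array([epsilon * score(c) / (2.0 * sensitivity) for c in candidates])
    # Shift by the max before exponentiating for numerical stability;
    # the sampling distribution is unchanged.
    weights = np.exp(logits - logits.max())
    return candidates[rng.choice(len(candidates), p=weights / weights.sum())]

# Toy usage: privately choose one direction from a coarse net of unit vectors,
# scored by the variance it captures.  With rows of norm at most 1, the score
# v'(X'X/n)v changes by at most 1/n when a single row changes, so the
# sensitivity is 1/n.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # clip row norms to <= 1
A = X.T @ X / len(X)
net = [np.array([np.cos(t), np.sin(t)]) for t in np.linspace(0.0, np.pi, 64)]
v = exponential_mechanism(net, lambda u: u @ A @ u, sensitivity=1.0 / len(X), epsilon=1.0)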
This paper deals with the problem of differentially private approximations to PCA. Prior to our
work, the only proposed method for PCA was the Sub-Linear Queries (SULQ) method of Blum
et al. (2005). This approach adds noise to the second moment matrix of the data before calculating
the singular value decomposition.
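The following is a minimal sketch of such an input-perturbation approach, assuming rows of
bounded norm. The function name sulq_style_pca and the noise_scale parameter are illustrative
stand-ins; the exact noise calibration of Blum et al. (2005) depends on the privacy parameters
and is not reproduced here.

import numpy as np

def sulq_style_pca(X, k, noise_scale, rng=None):
    """Input-perturbation PCA in the style of SULQ: add symmetric Gaussian
    noise to the second moment matrix, then return the top-k eigenvectors
    of the noisy matrix.

    `noise_scale` stands in for the calibration in Blum et al. (2005);
    it is left abstract here.  Rows of X are assumed to have norm <= 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    A = X.T @ X / n                     # second moment matrix
    E = rng.normal(scale=noise_scale, size=(d, d))
    E = (E + E.T) / 2.0                 # symmetrize the noise
    eigvals, eigvecs = np.linalg.eigh(A + E)
    order = np.argsort(eigvals)[::-1]   # eigenvalues in descending order
    return eigvecs[:, order[:k]]        # d-by-k orthonormal basis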
By contrast, our algorithm, PPCA, uses the exponential mechanism (McSherry and Talwar, 2007)
to choose a k-dimensional subspace, biased toward those that capture more of the “energy” of the
matrix. Subsequent to our work, Kapralov and Talwar (2013)