This post is about the definition of Differential Privacy (DP). DP ensures privacy through randomness, so before giving the definition of DP we first need to know what a randomized algorithm is.
In this post, the domain is represented as $\mathbb{N}^{|X|}$, where $|X|$ is the number of elements in the data universe $X$; the range is represented as $\mathbb{R}^{k}$, where $k$ is the dimension of the output.
Range($A$) denotes the set of possible outputs of the mechanism $A$.
Randomized algorithms
In general, a randomized algorithm with domain $A$ and (discrete) range $B$ will be associated with a mapping from $A$ to the probability simplex over $B$, denoted $\Delta(B)$.
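Spelled out (this follows the standard textbook formulation of Dwork and Roth), the probability simplex and the behavior of a randomized algorithm $M$ are:

$$\Delta(B)=\Big\{x\in\mathbb{R}^{|B|}: x_i\ge 0 \text{ for all } i,\ \sum_{i=1}^{|B|} x_i = 1\Big\}$$

On input $a\in A$, the algorithm $M$ outputs $M(a)=b$ with probability $(M(a))_b$ for each $b\in B$.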
Differential Privacy
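For reference, the formal guarantee that the remarks and discussion below rely on is the standard $\epsilon$-DP definition: a randomized mechanism $M$ with domain $\mathbb{N}^{|X|}$ is $\epsilon$-differentially private if, for every pair of adjacent databases $D, D'$ (differing in at most one record) and every set of outputs $S\subseteq\text{Range}(M)$,

$$\Pr[M(D)\in S]\le e^{\epsilon}\,\Pr[M(D')\in S].$$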
Remarks:
- The setting assumes the existence of a trusted and trustworthy curator who holds the data of individuals in a database.
- For participants, differential privacy promises to protect individuals from any additional harm that they might face due to their data being in the private database $x$ that they would not have faced had their data not been part of $x$.
- For adversaries, differential privacy asserts that for all pairs of adjacent databases $x, y$ and all outputs $o$, an adversary cannot distinguish which is the true database on the basis of observing $o$, which implies that the adversary's prior and posterior views about an individual should not be too different.
- DP merely ensures that one's participation in a survey will not in itself be disclosed, nor will participation lead to disclosure of any specifics that one has contributed to the survey.
Understanding the DP definition
What does DP mean by saying it can limit the adversary's knowledge gained by querying the database?
According to the definition, $Pr(o|D)\le e^{\epsilon}Pr(o|D')$, so by observing the output $o$ the adversary cannot reliably infer whether the database is $D$ or $D'$, which means it cannot tell whether someone has been added to or removed from the database. Indeed, the smaller $\epsilon$ is, the closer the likelihood ratio of $D$ to $D'$ is to 1. Therefore, when $\epsilon$ is small, the adversary cannot distinguish the true database.
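As a toy illustration (not from the post itself), consider binary randomized response, the mechanism also studied in [2016-Wang]: it reports the true bit with probability $e^{\epsilon}/(1+e^{\epsilon})$, so the likelihood ratio between the two possible inputs is exactly $e^{\epsilon}$, and for small $\epsilon$ the two hypotheses are nearly indistinguishable.

```python
import math

def rr_output_prob(true_bit: int, reported_bit: int, eps: float) -> float:
    """Probability that binary randomized response reports `reported_bit`
    when the true value is `true_bit` (truth kept w.p. e^eps / (1 + e^eps))."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return p_truth if reported_bit == true_bit else 1.0 - p_truth

for eps in [0.1, 0.5, 1.0, 2.0]:
    # Adjacent "databases": the individual's bit is 1 vs 0; the observed output is o = 1.
    ratio = rr_output_prob(1, 1, eps) / rr_output_prob(0, 1, eps)
    print(f"eps = {eps:<4}  Pr(o|D)/Pr(o|D') = {ratio:.3f}  (bound e^eps = {math.exp(eps):.3f})")
```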
Properties
- Post-processing: differential privacy is immune to post-processing, i.e., applying any data-independent function to the output of a DP mechanism cannot weaken its privacy guarantee.
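For instance (a minimal sketch, using the standard Laplace mechanism for a counting query, a mechanism not otherwise discussed in this post): rounding or clamping the noisy output is data-independent post-processing, so it keeps the same $\epsilon$-DP guarantee without consuming extra budget.

```python
import numpy as np

rng = np.random.default_rng()

def laplace_count(true_count: int, eps: float) -> float:
    """eps-DP noisy count: a counting query has sensitivity 1, so add Laplace(1/eps) noise."""
    return true_count + rng.laplace(scale=1.0 / eps)

def post_process(noisy: float) -> int:
    """Data-independent post-processing (round, clamp to non-negative).
    By the post-processing property this is still eps-DP, with no extra privacy cost."""
    return max(0, round(noisy))

noisy = laplace_count(true_count=42, eps=0.5)
print(noisy, "->", post_process(noisy))
```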
Vector DP
The following details are from [2014-To].
Group DP
- the strength of the privacy guarantee drops linearly with the size of the group
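Stated precisely (the standard group-privacy bound): if $M$ is $\epsilon$-differentially private, then for any two databases $x, y$ differing in at most $k$ entries and any $S\subseteq\text{Range}(M)$,

$$\Pr[M(x)\in S]\le e^{k\epsilon}\,\Pr[M(y)\in S],$$

so a group of $k$ individuals is protected only at level $k\epsilon$.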
Event VS User
Differential privacy promises that the behavior of an algorithm will be roughly unchanged even if a single entry in the database is modified. But what constitutes a single entry in the database? Consider, for example, a database that takes the form of a graph. Such a database might encode a social network: each individual $i \in [n]$ is represented by a vertex in the graph, and friendships between individuals are represented by edges. We could consider differential privacy at a level of granularity corresponding to individuals: that is, we could require that differentially private algorithms be insensitive to the addition or removal of any vertex from the graph.
We could, on the other hand, consider differential privacy at a level of granularity corresponding to edges, and ask our algorithms to be insensitive only to the addition or removal of single, or small numbers of, edges from the graph.
As another example, a differentially private movie recommendation system can be designed to protect the data in the training set at the "event" level of single movies, hiding the viewing/rating of any single movie but not, say, hiding an individual's enthusiasm for cowboy westerns or gore, or at the "user" level of an individual's entire viewing and rating history.
Non-interactive VS Interactive
Non-interactive
Interactive
Geo-indistinguishability
Enjoying $l$-privacy within $r$ means that, for any two locations $x_1$ and $x_2$ such that $d_2(x_1,x_2)\le r$, the probabilities of generating the same perturbed location $z$ should be similar, namely $\frac{P(z|x_1)}{P(z|x_2)}\le e^{\epsilon d_2(x_1,x_2)}$ with $l=\epsilon r$. In short, $\epsilon$-geo-indistinguishability provides $\epsilon d_2$-privacy (a sketch of a mechanism achieving this follows the list below).
- It guarantees that obfuscated locations are statistically indistinguishable from other locations within a radius around the user's real location.
- The parameter $\epsilon$ is no longer interpreted as the privacy budget; instead it is determined as $\epsilon=\frac{l}{r}$, where $l$ is the privacy level and $r$ is the privacy radius. This ensures that, when Alice is in location $x$ and releases a perturbed location $z$, her location is statistically indistinguishable from all the other locations $x'$ within a radius of $r$ around her, i.e., $\frac{P(z|x)}{P(z|x')}\le e^{\epsilon d(x,x')}\le e^{l}$.
- Note also that standard differential privacy simply corresponds to $\epsilon d_h(x, x')$-privacy, where $d_h$ is the Hamming distance between databases $x, x'$, i.e., the number of individuals in which they differ. Standard differential privacy is too strong for this setting because it aims at completely protecting the value of a single individual, whereas in geo-indistinguishability the user's secret is a single location.
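A common way to achieve $\epsilon d_2$-privacy is the planar Laplace mechanism of Andrés et al.; the sketch below is an illustration (not taken from the cited papers) that treats the location as a point in the Euclidean plane and ignores latitude/longitude projection and truncation issues.

```python
import numpy as np

rng = np.random.default_rng()

def planar_laplace(x: np.ndarray, eps: float) -> np.ndarray:
    """Perturb a 2-D location x with density proportional to exp(-eps * ||z - x||),
    so that P(z|x1)/P(z|x2) <= exp(eps * d2(x1, x2)) for any pair of locations."""
    theta = rng.uniform(0.0, 2.0 * np.pi)      # direction: uniform angle
    r = rng.gamma(shape=2.0, scale=1.0 / eps)  # distance: radial marginal of the planar Laplace
    return x + r * np.array([np.cos(theta), np.sin(theta)])

# e.g. privacy level l = 1 within radius r = 10  =>  eps = l / r = 0.1
print(planar_laplace(np.array([0.0, 0.0]), eps=0.1))
```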
Privacy loss
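The quantity usually meant by this term (following Dwork and Roth) is the privacy loss incurred by observing an output $o$:

$$\mathcal{L}^{(o)}_{M(x)\,\|\,M(y)}=\ln\frac{\Pr[M(x)=o]}{\Pr[M(y)=o]},$$

and pure $\epsilon$-DP bounds this loss by $\epsilon$ in absolute value for all adjacent $x, y$ and all outputs $o$.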
DP Utility Analysis
The following details are from [2018-Yang].
Remarks
Purpose
- differential privacy aims to ensure that the output of the algorithm does not significantly depend on any particular individual's data, so that an adversary cannot confidently infer whether a particular individual is present in a database, even with access to every other entry in the database and unbounded computational power. [2016-Wang]
Adversary Models
[2017-Mousumi]
[2016-Chen]
References
[2014-To] A Framework for Protecting Worker Location Privacy in Spatial Crowdsourcing
[2018-Yang] Density-Based Location Preservation for Mobile Crowdsensing With Differential Privacy
[2016-Wang] Using Randomized Response for Differential Privacy Preserving Data Collection
[2017-Mousumi] Computing Aggregates Over Numeric Data with Personalized Local Differential Privacy
[2016-Chen] Private Spatial Data Aggregation in the Local Setting