DP Application - Random Response

[2016-Atsushi] gives a comprehensive survey about random response. Randomized response tends to be used in data collection scenario.Randomized response is purely a client-based privacy solution. It does not rely upon a trusted third-party server and puts control over data back to clients. The basic idea is answer truthfully with probability $p$, and answer randomly by picking a answer from the rest choices with probability $1-p$. And the rest choices include the true answer.

Random Response and its Variant

Definition[2016-Wang]

Screen Shot 2018-05-29 at 10.18.29 PM

Warner Version

Random response was first proposed by [1965-Warner]. The aim is to estimate the proportion $\pi(A)$ of people who have some attribute A.

Each user has reports her true answer $t\in \{1,-1\}​$ with probability p, and a random answer with probability 1 − p. The latter has the same probability to be −1 and +1;

Framework
  1. There are red cards and non-red cards in a box, where the ratio of red cards among all the cards is $q$ with $0 < q < 1$ and $q \ne \frac{1}{2}$. And user draws a card.
  2. If red, answer truthfully to the question “I am a member of A”
  3. Otherwise, answer truthfully to the question “I am a member of A”
Utility

Let $\hat{T}$ be the proportion to which the respondents reply “True.” It is easy to see that the expectation of $\hat{T}$ is:

Screen Shot 2018-05-29 at 12.40.16 AM

Privacy Analysis

According to the definition of Differential Privacy, the following ratio should be bounded,

$P(A|A)=q​$

$P(A| not \ A)=(1-q)$

So combining the above two, we have:

The randomized response satisfied $\epsilon$-Differential Privacy, where

myplot

Kuk Version

Kuk’s proposed another kind of randomized response mechanism to estimate the proportion $\pi_A$, the same as Warner’s mechanism.

Framework
  1. There are two boxes, $BOX_1$ and $BOX_2$. There are red and non-red cards in each box and the ratio of red cards in each box is $q_1,q_2$ respectively, where $0<q_1,q_2<1, q_1\ne{q_2}$. User takes one card from each box.
  2. If this user is a member of A, then replies “red card” or “non-red card” in accordance
    with the card taken from $BOX_1$
  3. Otherwise, he does the same as above except that he takes a card from $BOX_2$.
Privacy Analysis

Screen Shot 2018-05-29 at 1.08.31 AM

Screen Shot 2018-05-29 at 1.08.51 AM

Negative Survey Mechanism

Framework

Screen Shot 2018-05-29 at 1.10.11 AM

Privacy Analysis

Screen Shot 2018-05-29 at 1.11.13 AM

t-times Negative Survey

Privacy Analysis

Screen Shot 2018-05-29 at 1.12.13 AM-7574379

Screen Shot 2018-05-29 at 1.12.36 AM

Application

[2011-Quercia]

Framework

[2011-Quercia] uses random response to obfuscate locations.

Let $k$ be the number of locations on a map. Then, the main part of SpotMe works as follows:

  1. the mobile phone chooses the location k uniformly at random with probability $p$
  2. it chooses the true location with probability $1-p$
Privacy Analysis

According to the definition of Differential Privacy, the following ratio should be bounded,

where $L_{true}$ is user’s actual location and $L^{‘}_{true}$ is the user’s location after making some changes while $L_{noise}$ is user’s obfuscated location.

$P(L_{true}|L_{true})=(1-p)+p*\frac{1}{k}$

$P(L_{true}|L^{‘}_{true})=p*\frac{1}{k}$

so after combining above two formulations, we have privacy bugdet:

[2016-Wang]

Randomized Response vs. Laplace Mechanism

Randomized Response

Without loss of generality, we assume the randomized response still favors the true value, i.e., $p_{00}, p_{11} > 0.5$. Intuitively, under the same privacy standard, the mechanism with larger diagonal elements in the corresponding design matrix tends to achieve better utility.

  • Screen Shot 2018-06-01 at 5.09.06 PM
  • What is the proportion of $X=1$? It aims to o learn the population distribution based on the collected randomized dataset. We use $\pi_1$to denote the true proportion of value 1 to be estimated in the original population. The observed proportion of value 1 in the collected dataset is denoted as $\lambda_1$. We denote the unbiased estimator for $\pi_1$ respectively as $\hat{\pi}_1$.
  • Screen Shot 2018-06-01 at 5.13.14 PM

    Laplace Mechanism

  • Screen Shot 2018-06-01 at 5.22.11 PM

  • Screen Shot 2018-06-01 at 5.23.19 PM-7891843
  • Screen Shot 2018-06-01 at 5.24.26 PM

[2006-Huseyin]

One-Group Schemes vs Multi-Group Schemes.

in my understanding, one-group means there arw n users and m items. for a user, all the items rating are in a group. For example, user_i’s true rateing for all items is (0101),with probability theta to upload (0101) and (1-theta) probability to upload (1010).

And in m-group, for example m=2, then user_i’s rating is (01)and(11). with probability theta1 to upload (01) and (1-theta1) to upload (10); theta2 to upload(11) and (1-theta2) to upload (00)

Although we achieve decent accuracy in this scheme, the privacy level is very low. We improve privacy level by introducing multi-group schemes, while with increasing M, accuracy decreases because we add more randomness. The accuracy is from how much info is preserved after randomness. The more info, the more accurate. P(true=x|randon resp = x) means preserved info. m-group info preserved: $P^m$

Screen Shot 2018-06-02 at 4.16.46 PM

Screen Shot 2018-06-02 at 4.16.55 PM

Screen Shot 2018-06-02 at 4.17.21 PM

Private Mechanism With Full Privacy

Screen Shot 2018-06-02 at 4.20.01 PM

[2014-Sun]

The problem setting is how to find frequent itemset with privacy. Say we have the item set $I={I_1,I_2,..,I_n}$. Before each user sends out their transaction which is $n$-dimention vector and ithe entry being 1 means this user has this item or else, he does the the randomized selection for each item. That is, keep true answer with probability $p_i$ for item i and perturb the answer with $1-p$. And p controls the privacy level.

For k-itemset, we want to find out its support, which comes from the estimator based on the perturbed data. Say we want to calculate the support of $2$-itemset. And the item set is $\{A,B,C,D\}$. And the candidate domain is $\{AB,AC,AD,BC,BD,CD\}$. The aim is to get the true count of $AB, \ c(AB)$, which is $11$ and may comes from $\{00,01,10,11\}$, with the probability matrix $P=\matrix{p_{00},p_{01}\\p_{10},p_{11}}$. so based on the observed count of $[A=0,B=0],[A=0,B=1],[A=1,B=0],[A=1,B=1]$ we can get estimator count matrix by $C_{observed}=PC_{estimator}\to C_{estimator}=P^{-1}C_{observed}$. And the support for $AB$ is $C_{estimator}[-1]$.

Basic RAPPOR

[Building a RAPPOR with the Unknown: Privacy-Preserving Learning of Associations and Data Dictionaries]

Screen Shot 2018-06-17 at 11.08.18 AM

Reference

[2011-Quercia] SpotME if you can: randomized responses for location obfuscation on mobile phones

[1965-Warner] Randomized response: a survey technique for eliminating evasive answer bias

[2016-Wang] Using Randomized Response for Differential Privacy Preserving Data Collection

[2006-Huseyin] Achieving Private Recommendations Using Randomized Response Techniques

[2014-Sun] Personalized Privacy-Preserving Frequent Itemset Mining Using Randomized Response

[]