ML-Kernel Function

Posted on 2018-07-04 | In Machine Learning

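The core idea of a kernel function is this: once we fix a measure of how correlation depends on distance over the whole space, we can use that measure to infer the (likely) distribution of the data at other points.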
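As a rough illustration of that idea, here is a minimal sketch that uses one particular kernel, the RBF (squared-exponential) kernel, to estimate a value at a new point from nearby known points. The kernel choice, the `length_scale` parameter, and the kernel-weighted average are assumptions made for the example and are not taken from the linked discussion.

```python
import math

def rbf_kernel(x, y, length_scale=1.0):
    """Squared-exponential (RBF) kernel: similarity decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

def kernel_estimate(x_new, xs, ys, length_scale=1.0):
    """Kernel-weighted average of the known values (Nadaraya-Watson style)."""
    weights = [rbf_kernel(x_new, x, length_scale) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [(0.0,), (1.0,), (2.0,)]   # made-up observed points
ys = [0.0, 1.0, 4.0]            # made-up observed values
estimate = kernel_estimate((1.1,), xs, ys)
print(estimate)
```

Points close to the query get weights near 1 and distant points get weights near 0, so the estimate is dominated by the nearest observations.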

What does "kernel" mean in machine learning? - Zhihu https://www.zhihu.com/question/30371867

DP-Lectures by Dwork

Posted on 2018-07-02 | In Differential Privacy

Lecture1

Lecture2

Screen Shot 2018-07-02 at 6.14.21 PM

$P(|x|\ge tb)=e^{-t}$
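The formula above has the form of the tail of a Laplace distribution with scale $b$. Assuming that $x \sim \mathrm{Lap}(b)$ is what the slide intends, the following sketch checks the identity by simulation. The sampler and the values of `b`, `t`, and `n` are arbitrary choices for the example.

```python
import math
import random

def sample_laplace(b, rng):
    """Draw from Laplace(0, b): an exponential magnitude with mean b and a random sign."""
    magnitude = rng.expovariate(1.0 / b)
    return magnitude if rng.random() < 0.5 else -magnitude

rng = random.Random(0)
b, t, n = 2.0, 1.5, 200_000
hits = sum(abs(sample_laplace(b, rng)) >= t * b for _ in range(n))
empirical = hits / n
print(empirical, math.exp(-t))  # the two numbers should be close
```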

Screen Shot 2018-07-02 at 6.25.02 PM

Screen Shot 2018-07-02 at 6.24.38 PM


Screen Shot 2018-07-02 at 6.34.57 PM

Sensitivity of the utility, $\Delta u$: how much, in the worst case, one person’s data can affect the utility.

Screen Shot 2018-07-02 at 6.40.25 PM


Screen Shot 2018-07-02 at 6.44.44 PM (2)

The utility $u(x,y)$ above is defined from the query error between the two databases: if the query error is small, the utility should be high, which is why there is a negative sign in front of the max.
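To make the role of the utility concrete, here is a minimal sketch of the standard exponential mechanism, which picks an output $y$ with probability proportional to $\exp(\epsilon\, u(x,y) / (2\Delta u))$. The candidate names, the error values, `epsilon = 5.0`, and the sensitivity of `1.0` are hypothetical and only serve the illustration.

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng):
    """Pick a candidate with probability proportional to exp(eps * u / (2 * sensitivity))."""
    scores = [utility(c) for c in candidates]
    best = max(scores)  # subtracting the max keeps exp() stable; the ratios are unchanged
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical query errors of three candidate outputs; utility is the negated worst error.
query_errors = {"y1": [0.1, 0.3], "y2": [0.8, 0.9], "y3": [0.2, 0.2]}

def utility(y):
    return -max(query_errors[y])

rng = random.Random(0)
picks = [exponential_mechanism(list(query_errors), utility, 5.0, 1.0, rng)
         for _ in range(2000)]
print({y: picks.count(y) for y in query_errors})
```

Candidates with a smaller worst-case error have a higher utility and are selected more often, but every candidate keeps a nonzero probability.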

Open question: what is the sensitivity of this utility?

Screen Shot 2018-07-02 at 6.54.26 PM (2)

Screen Shot 2018-07-02 at 6.55.13 PM

Screen Shot 2018-07-02 at 7.02.17 PM

Screen Shot 2018-07-02 at 7.02.47 PM

Screen Shot 2018-07-02 at 7.03.48 PM

Screen Shot 2018-07-02 at 7.05.11 PM

Screen Shot 2018-07-03 at 9.34.45 AM

In the video, she says the mechanism should be separated from the database?

Screen Shot 2018-07-03 at 9.38.58 AM

Uncoordinated responses: you ask one question, and I add some noise to the true answer and return it to you; you ask another question, and I do the same thing. Each response is independent of everything I did in the past.

Screen Shot 2018-07-03 at 9.43.24 AM

Stateless mechanism: it does not remember what it did before. Answers to subsequent queries do not depend on previous queries.
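A minimal sketch of what a stateless mechanism with uncoordinated responses could look like, assuming Laplace noise with a given scale. The class name, the `scale` parameter, and the toy database are hypothetical and are not taken from the lecture.

```python
import random

class StatelessLaplaceMechanism:
    """Answers each query with fresh, independent Laplace noise; keeps no query history."""

    def __init__(self, database, scale, seed=None):
        self.database = database
        self.scale = scale  # Laplace scale b, assumed to be given
        self._rng = random.Random(seed)

    def _laplace(self):
        magnitude = self._rng.expovariate(1.0 / self.scale)
        return magnitude if self._rng.random() < 0.5 else -magnitude

    def answer(self, query):
        # Nothing about earlier queries is stored or consulted here.
        return query(self.database) + self._laplace()

def count_ones(rows):
    return sum(rows)

db = [1, 0, 1, 1, 0, 1]
mech = StatelessLaplaceMechanism(db, scale=1.0, seed=0)
first, second = mech.answer(count_ones), mech.answer(count_ones)
print(first, second)  # the same question gets two independently perturbed answers
```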

Screen Shot 2018-07-03 at 9.50.47 AM

density response?????

Screen Shot 2018-07-03 at 9.52.36 AM

Screen Shot 2018-07-03 at 10.03.33 AM

Screen Shot 2018-07-03 at 9.56.54 AM

Screen Shot 2018-07-03 at 9.58.36 AM

Screen Shot 2018-07-03 at 10.00.51 AM

Lecture3

Screen Shot 2018-07-10 at 5.38.25 PM

Screen Shot 2018-07-10 at 5.40.29 PM

Screen Shot 2018-07-10 at 5.43.16 PM

Screen Shot 2018-07-10 at 5.44.13 PM

$D(q)$ is a probability, so it is bounded by 1.

Screen Shot 2018-07-10 at 5.48.18 PM

Screen Shot 2018-07-10 at 6.06.02 PM

Screen Shot 2018-07-10 at 6.07.31 PM

Screen Shot 2018-07-10 at 6.10.34 PM

There are two databases, $x$ and $x'$: $x$ has properties $p_2, p_3, p_6$ and $p_8$, while $x'$ has properties 1 through 4 but not 5 through 8. We have a set of query results $Y_i$ and release them with some noise: positive noise is added to $Y_1,Y_2,Y_3,Y_4$ and negative noise to $Y_5,Y_6,Y_7,Y_8$.

Screen Shot 2018-07-10 at 6.17.02 PM

Screen Shot 2018-07-10 at 6.18.22 PM

Screen Shot 2018-07-10 at 6.23.33 PM

Screen Shot 2018-07-10 at 6.27.49 PM

DP-DEEP LEARNING

Posted on 2018-06-28 | In Differential Privacy

Deep learning is the process of learning nonlinear features and functions from complex data. Deep learning has been shown to outperform traditional techniques for speech recognition, image recognition, and face detection. Deep learning aims to extract complex features from high-dimensional data and use them to build a model that relates inputs to outputs (e.g., classes). Deep learning architectures are usually constructed as multi-layer networks so that more abstract features are computed as nonlinear functions of lower-level features.

Privacy in deep learning consists of three aspects: privacy of the data used for learning a model or as input to an existing model, privacy of the model, and privacy of the model’s output.

Python

Posted on 2018-06-26 | In Python

last

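Miscellaneous

Ignoring warnings

import warnings
warnings.filterwarnings("ignore")

Garbled Chinese comments

# -*- coding: utf-8 -*-
Put this on the first (or second) line of the Python script. It is only needed in Python 2; Python 3 reads source files as UTF-8 by default.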

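import numpy as np

a = np.array([1, 2, 3])      # shape (n,) with n = 3
b = np.array([2, 3, 4, 5])   # shape (m,) with m = 4
# c = a - b                  # ValueError when n != m: the shapes cannot be broadcast
c = a.reshape(-1, 1) - b     # broadcasts to a two-dimensional array of shape (n, m)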

Number

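Data types

Python supports several numeric types:

int (signed integers): usually just called integers; positive or negative whole numbers with no decimal point. Python 3 integers have unlimited size. Python 2 has two integer types, int and long.

Python 3 no longer has a "long" type.
float (floating-point real values): also called floats; they represent real numbers and are written with a decimal point separating the integer and fractional parts. Floats can also be written in scientific notation, with e or E indicating a power of 10 (2.5e2 = 2.5 x 10^2 = 250).
complex (complex numbers): written as a + bJ, where a and b are floats and J (or j) represents the square root of -1 (an imaginary number). The real part is a and the imaginary part is b. Complex numbers are not used much in Python programming.

Numeric type conversion

int(x) converts x to a plain integer.
long(x) converts x to a long integer (Python 2 only).
float(x) converts x to a floating-point number.
complex(x) converts x to a complex number with real part x and imaginary part zero.
complex(x, y) converts x and y to a complex number with real part x and imaginary part y; x and y are numeric expressions.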
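The conversions listed above can be tried out as follows (Python 3, so `long` is left out). The variable names are only for illustration.

```python
as_int = int(3.9)          # truncates toward zero
as_float = float(7)
from_one = complex(2)      # imaginary part defaults to zero
from_two = complex(2, -3)
print(as_int, as_float, from_one, from_two)  # 3 7.0 (2+0j) (2-3j)
```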

Numeric functions

Screen Shot 2018-06-27 at 11.09.15 AM

Random functions

Screen Shot 2018-06-27 at 11.10.03 AM

random.choice()

The syntax of the choice() method is:

random.choice(seq)  # seq is a list, tuple, or string; returns one randomly chosen element

import random
print(random.choice(range(100)))
# 31
print(random.choice('HelleWorld'))
# r
print(random.choice([1, 2, 3, 5, 9]))
# 3

random.randrange()

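Syntax of the randrange() method:

random.randrange([start,] stop[, step])  # returns a randomly selected item from the given range

start: the start of the range; it is included in the range. Defaults to 0.
stop: the end of the range; it is not included in the range.
step: the increment. Defaults to 1.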
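A short sketch of how the three parameters interact. The seed is there only to make the run repeatable, and the range values are arbitrary.

```python
import random

rng = random.Random(42)  # seeded only so the sketch is repeatable
evens = [rng.randrange(0, 10, 2) for _ in range(1000)]  # start=0, stop=10, step=2
print(sorted(set(evens)))  # only values from 0, 2, 4, 6, 8 can appear; 10 is excluded
```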

random.random()

Returns a random float in the range [0, 1).

random.shuffle()

random.shuffle(lst)  # shuffles the list in place and returns None

import random

nums = [20, 16, 10, 5]
random.shuffle(nums)
print(nums)
# e.g. [16, 20, 10, 5]
random.shuffle(nums)
print(nums)
# e.g. [10, 16, 20, 5]

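random.uniform()

random.uniform(x, y)  # returns a uniformly distributed random float r with x <= r <= y; the endpoint y may or may not be included, depending on floating-point rounding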
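A small sketch of `random.uniform` in use. The bounds, the sample size, and the seed are arbitrary.

```python
import random

rng = random.Random(7)  # seeded only so the sketch is repeatable
samples = [rng.uniform(2.0, 5.0) for _ in range(10_000)]
mean = sum(samples) / len(samples)
print(min(samples), max(samples), mean)  # all values lie in [2, 5]; the mean is near 3.5
```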

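Trigonometric functions

Screen Shot 2018-06-27 at 12.14.54 PM

Constants

Screen Shot 2018-06-27 at 12.16.18 PM

Strings

Accessing strings

To access a substring, use square brackets with an index or a slice.

Screen Shot 2018-06-27 at 12.18.32 PM

Updating strings

Strings are immutable and cannot be modified in place, but a modified copy can be built with slicing and concatenation.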

s = 'hello x world'
s = s[:6]+"wq "+s[8:]
print(s)
# hello wq world

Special string operators

Screen Shot 2018-06-27 at 12.22.40 PM

String formatting with %

print("My name is %s and weight is %d kg!" % ('Zara', 21))
# My name is Zara and weight is 21 kg!
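For comparison, the same string can be built with `str.format` and with an f-string (Python 3.6+). This is only a side-by-side sketch; the variable names are made up for the example.

```python
name, weight = "Zara", 21

with_percent = "My name is %s and weight is %d kg!" % (name, weight)
with_format = "My name is {} and weight is {} kg!".format(name, weight)
with_fstring = f"My name is {name} and weight is {weight} kg!"

print(with_fstring)  # My name is Zara and weight is 21 kg!
```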

Screen Shot 2018-06-27 at 12.24.44 PM

Screen Shot 2018-06-27 at 12.25.15 PM

Built-in string methods

Screen Shot 2018-06-27 at 12.26.32 PM

Screen Shot 2018-06-27 at 12.26.45 PM

Screen Shot 2018-06-27 at 12.27.04 PM

Screen Shot 2018-06-27 at 12.27.16 PM

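Lists

Items in a list are separated by commas. An important point about lists is that the items need not be of the same type.

list1 = ['physics', 'chemistry', 1997, 2000]
list2 = [1, 2, 3, 4, 5]
list3 = ["a", "b", "c", "d"]

As with string indexing, list indices start at 0, and lists can be sliced, concatenated, and so on.

Accessing list values

To access values in a list, use square brackets with an index or a slice. For example: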

list1 = ['physics', 'chemistry', 1997, 2000]
list2 = [1, 2, 3, 4, 5, 6, 7 ]
print ("list1[0]: ", list1[0])
print ("list2[1:5]: ", list2[1:5])

Running the code above produces the following result:

list1[0]:  physics
list2[1:5]:  [2, 3, 4, 5]

Updating list values

You can update one or more list elements by assigning to an index or slice on the left-hand side of the assignment operator, and you can add an element with the append() method. For example:

items = ['physics', 'chemistry', 1997, 2000]
print("Value available at index 2 : ", items[2])
items[2] = 2001
print("New value available at index 2 : ", items[2])

Running the code above produces the following result:

Value available at index 2 :  1997
New value available at index 2 :  2001

Deleting list values

del items[index]  # deletes the value at position index

items = ['physics', 'chemistry', 1997, 2000]
print(items)
del items[2]
print("After deleting value at index 2 : ", items)
# ['physics', 'chemistry', 1997, 2000]
# After deleting value at index 2 :  ['physics', 'chemistry', 2000]

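Basic list operations

Screen Shot 2018-06-28 at 12.16.11 PM

Built-in list functions

Function	Description
cmp(list1, list2)	Compares the elements of two lists; returns 0, 1, or -1 (Python 2 only)
len(list)	Length of the list
max(list)	Returns the largest value in the list
min(list)	Returns the smallest value in the list
list(seq)	Converts a sequence to a list
append(ele)	Appends an element to the list
count(ele)	Counts how many times an element occurs in the list
extend(list)	Extends the list with the elements of another list
index(ele)	Returns the lowest index at which ele occurs
insert(index, ele)	Inserts ele at position index; index is required
pop([index=-1])	Removes and returns the element at index; by default the last one
remove(ele)	Removes the first occurrence of ele
reverse()	Reverses the list in place
sort()	Sorts the list in place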
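A few of the methods in the table that are not demonstrated elsewhere on this page, in one short sketch. The list contents are made up.

```python
langs = ["C++", "Java", "Python"]

langs.insert(1, "Go")            # ['C++', 'Go', 'Java', 'Python']
position = langs.index("Java")   # lowest index of 'Java'
last = langs.pop()               # removes and returns the last element
langs.remove("Go")               # removes the first occurrence of 'Go'
langs.reverse()                  # reverses in place

nums = [3, 1, 2]
nums.sort()                      # sorts in place

print(langs, position, last, nums, len(nums), max(nums), min(nums))
```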

append()

list1 = ['C++', 'Java', 'Python']
list1.append('C#')
print ("updated list : ", list1)
# updated list :  ['C++', 'Java', 'Python', 'C#']

extend()

list1 = ['physics', 'chemistry', 'maths']
list2=list(range(5)) 
list1.extend( list2)
print ('Extended List :',list1)
# Extended List : ['physics', 'chemistry', 'maths', 0, 1, 2, 3, 4]

count()

aList = [123, 'xyz', 'zara', 'abc', 123]
print ("Count for 123 : ", aList.count(123))
print ("Count for zara : ", aList.count('zara'))
# Count for 123 :  2
# Count for zara :  1

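Tuples

A tuple is an immutable sequence of Python objects. Tuples are like lists; the differences are that tuples cannot be changed while lists can, and that tuples use parentheses while lists use square brackets. A tuple is created by writing comma-separated values.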

tup1 = ('physics', 'chemistry', 1997, 2000)
tup2 = (1, 2, 3, 4, 5 )
tup3 = "a", "b", "c", "d"
print(tup1,tup2,tup3)
# ('physics', 'chemistry', 1997, 2000) (1, 2, 3, 4, 5) ('a', 'b', 'c', 'd')

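To write a tuple containing a single value, you must include a comma, even though there is only one value:

tup1 = (50,)

Accessing elements

To access values in a tuple, use square brackets with an index or a slice.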

tup1 = ('physics', 'chemistry', 1997, 2000)
tup2 = (1, 2, 3, 4, 5 )
tup3 = "a", "b", "c", "d"
print(tup1[1]) # chemistry
print(tup2[:2]) # (1, 2)
print(tup3[2:]) # ('c', 'd')

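Updating elements

Tuples are immutable, which means we cannot update or change the values of tuple elements. As the following example shows, we can build a new tuple from parts of existing tuples.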

tup1 = ('physics', 'chemistry', 1997, 2000)

#TypeError: 'tuple' object does not support item assignment
#tup1[0] = 100
t_add = (100,)
tup1 = t_add + tup1[1:]
print(tup1)

Deleting elements

Removing individual elements from a tuple is not possible. To explicitly delete an entire tuple, use the del statement.

tup = ('physics', 'chemistry', 1997, 2000)
del tup

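Basic operations

Screen Shot 2018-07-01 at 4.07.25 PM

Screen Shot 2018-07-01 at 4.08.13 PM

Built-in functions

Screen Shot 2018-07-01 at 4.09.04 PM

Python dictionaries

Each key is separated from its value by a colon (:), items are separated by commas, and the whole thing is enclosed in curly braces. An empty dictionary with no items is written as just two curly braces: {}.

Keys are unique within a dictionary, while values may repeat. Dictionary values can be of any type, but keys must be of an immutable type, such as strings, numbers, or tuples.

Accessing values in a dictionary

To access dictionary elements, use square brackets with the key to obtain the corresponding value.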

Screen Shot 2018-07-01 at 4.11.11 PM
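A minimal sketch of the square-bracket access described above, including what happens when the key is missing. The dictionary contents are made up for the example.

```python
person = {"Name": "Zara", "Age": 7, "Class": "First"}

print(person["Name"])  # Zara
print(person["Age"])   # 7

try:
    person["Height"]   # a missing key raises KeyError
except KeyError as err:
    missing = err.args[0]
    print("missing key:", missing)
```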

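Updating a dictionary

A dictionary can be updated by adding a new entry (a key-value pair), modifying an existing entry, or deleting an existing entry.

Screen Shot 2018-07-01 at 4.12.07 PM

Deleting from a dictionary

You can delete individual dictionary elements or clear the entire contents of a dictionary. You can also delete the whole dictionary in a single operation.

To explicitly delete the whole dictionary, use the del statement.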

Screen Shot 2018-07-01 at 4.13.11 PM
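The update and delete operations described above, sketched on a made-up dictionary. The `snapshot` copy exists only so the intermediate state can still be inspected after the dictionary is cleared.

```python
person = {"Name": "Zara", "Age": 7}

person["Age"] = 8          # modify an existing entry
person["School"] = "DPS"   # add a new key-value pair
del person["Name"]         # delete a single entry
snapshot = dict(person)    # copy taken before clearing

person.clear()             # remove all entries; the dict object still exists
cleared_size = len(person)
del person                 # delete the dictionary itself; the name is now unbound
print(snapshot, cleared_size)
```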

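Dictionary keys

Multiple entries per key are not allowed, which means duplicate keys are not allowed. When a duplicate key is encountered during assignment, the last assignment wins. For example:

d = {'Name': 'Zara', 'Age': 7, 'Name': 'Manni'}
print("d['Name']: ", d['Name'])
# d['Name']:  Manni

Keys must be immutable. This means you can use strings, numbers, or tuples as dictionary keys, but something like ['key'] is not allowed.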

Built-in functions

Screen Shot 2018-07-01 at 4.18.40 PM

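fromkeys()

Creates a new dictionary using the values of seq as keys.

dict.fromkeys(seq[, value])

Parameters
- seq: the list of values to use as the dictionary keys.
- value: optional; if provided, every key is set to this value.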
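A short sketch of `fromkeys` with and without the optional value. The key names are made up.

```python
keys = ("name", "age", "city")

empty = dict.fromkeys(keys)      # every value is None
zeros = dict.fromkeys(keys, 0)   # every value is 0
print(empty, zeros)
```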
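get()

Returns the value for the given key. If the key is not present, returns the default value, which is None unless specified.

dict.get(key, default=None)

Parameters
- key: the key to look up in the dictionary.
- default: the value to return if the key does not exist.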
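A short sketch of `get` for a present key, a missing key, and a missing key with an explicit default. The data is made up.

```python
person = {"Name": "Zara", "Age": 7}

name = person.get("Name")              # key present
height = person.get("Height")          # key absent, default is None
city = person.get("City", "unknown")   # key absent, explicit default
print(name, height, city)  # Zara None unknown
```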
update()

Adds the key-value pairs of dictionary dict2 to dictionary dict.
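A short sketch of `update` on made-up data. Keys that already exist are overwritten, and the method returns None because it modifies the dictionary in place.

```python
person = {"Name": "Zara", "Age": 7}
extra = {"Age": 8, "City": "Paris"}

result = person.update(extra)  # modifies person in place and returns None
print(person, result)
```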

DP-Smart Metering

Posted on 2018-06-26 | In Differential Privacy

The several features of smart metering:

Experiment-Metric

Posted on 2018-06-25 | In Experiment

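Box plot

A box plot (boxplot), also called a box-and-whisker plot, can be used for outlier detection. It uses the minimum, first quartile, median, third quartile, and maximum of a data set to reflect the center and spread of the distribution, and it gives a rough picture of whether the data are symmetric. Drawing the box plots of several data sets on the same axes clearly shows the differences between their distributions, which provides clues for finding problems and improving processes.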

[Figure: box plot]

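Quartiles

Quartiles are obtained by sorting all the values in a group in ascending order and dividing them into four equal parts; the values at the three cut points are the quartiles.

The first quartile (Q1), also called the "lower quartile", is the value at the 25% position after sorting the sample in ascending order.
The second quartile (Q2), also called the "median", is the value at the 50% position after sorting the sample in ascending order.
The third quartile (Q3), also called the "upper quartile", is the value at the 75% position after sorting the sample in ascending order.
The difference between the third and first quartiles is called the interquartile range (IQR).

Computing quartiles

Determine the positions of Q1, Q2, and Q3 (n is the total number of values):
- Position of Q1 = (n+1)/4
- Position of Q2 = (n+1)/2
- Position of Q3 = 3(n+1)/4
When the number of values is odd, the quartiles are easy to determine. For example, the values "5, 47, 48, 15, 42, 41, 7, 39, 45, 40, 35" have 11 items; sorted in ascending order they are "5, 7, 15, 35, 39, 40, 41, 42, 45, 47, 48", and the results are:
- Position of Q1 = (11+1)/4 = 3; the value at that position is 15.
- Position of Q2 = (11+1)/2 = 6; the value at that position is 40.
- Position of Q3 = 3(11+1)/4 = 9; the value at that position is 45.
When the number of values is even, the quartiles take slightly more work. For example, the values "8, 17, 38, 39, 42, 44" have 6 items, and the positions are:
- Position of Q1 = (6+1)/4 = 1.75
- Position of Q2 = (6+1)/2 = 3.5
- Position of Q3 = 3(6+1)/4 = 5.25
Here the data are treated as continuous, and each quartile is determined jointly by the two values on either side of its position. For example, Q2 is at position 3.5, so it is determined by the 3rd value, 38, and the 4th value, 39: 38 + (39-38) × (the fractional part of 3.5), that is, 38 + 1 × 0.5 = 38.5. This is simply the average of 38 and 39.

Likewise, Q1 and Q3 are computed as follows:
- Q1 = 8 + (17-8) × 0.75 = 14.75
- Q3 = 42 + (44-42) × 0.25 = 42.5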
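The position rule and the interpolation step above can be sketched as a small function. The function name and its handling of very short inputs are choices made for this example; note that statistics libraries often use other quartile conventions, so their results may differ slightly.

```python
def quartiles(values):
    """Quartiles by the (n + 1) position rule with linear interpolation."""
    data = sorted(values)
    n = len(data)

    def value_at(position):
        # position is 1-based and may be fractional, e.g. 3.5
        lower = int(position)        # 1-based index of the value just below
        fraction = position - lower
        if lower < 1:
            return data[0]
        if lower >= n:
            return data[-1]
        return data[lower - 1] + (data[lower] - data[lower - 1]) * fraction

    return tuple(value_at(k * (n + 1) / 4) for k in (1, 2, 3))

odd_case = quartiles([5, 47, 48, 15, 42, 41, 7, 39, 45, 40, 35])
even_case = quartiles([8, 17, 38, 39, 42, 44])
print(odd_case)   # (15.0, 40.0, 45.0)
print(even_case)  # (14.75, 38.5, 42.5)
```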

MAE vs MSE

ref1

MAE(Mean Absolute Error)

Screen Shot 2018-07-16 at 9.10.11 PM

MSE(Mean Squared Error)

Screen Shot 2018-07-16 at 9.11.04 PM

Comparison

They are both used as evaluation metrics in crowd counting.
Roughly speaking, MAE indicates the accuracy of the estimates, and MSE indicates their robustness.

This is because, for MSE, the errors are squared before they are averaged, so MSE gives relatively high weight to large errors. This means MSE (or its square root, RMSE) is more useful when large errors are particularly undesirable.
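A small sketch of that difference, using the usual definitions of MAE and MSE on made-up numbers. Both prediction sets have the same MAE, but the one with a single large error has a much larger MSE.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

truth = [10, 10, 10, 10]
even_errors = [12, 8, 12, 8]       # four errors of size 2
one_big_error = [10, 10, 10, 18]   # a single error of size 8

print(mae(truth, even_errors), mse(truth, even_errors))      # 2.0 4.0
print(mae(truth, one_big_error), mse(truth, one_big_error))  # 2.0 16.0
```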

Confusion Matrix

A confusion matrix summarizes the performance of a classification algorithm; here we consider the binary case, where the model has two possible outputs. For example, when predicting whether a patient has cancer, there are only two outputs: Yes or No.

A confusion matrix gives us a better idea of what our classification model is predicting right and what types of errors it is making.

Below is what a confusion matrix looks like:

[Figure: confusion matrix]

True Positive: You predicted positive and you are right.

True Negative: You predicted negative and you are right.

False Positive (Type 1 Error): You predicted positive and you are wrong.

False Negative (Type 2 Error): You predicted negative and you are wrong.
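The four cells defined above can be counted directly from labels and predictions. This is a minimal sketch on made-up data; the function name and the convention that 1 is the positive class are assumptions of the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # (3, 3, 1, 1)
```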


Recall

Among the data with label 1, what is the percentage of them predicted as 1?

$\operatorname{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$

Precision

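Among the data predicted as 1, what percentage is correct?

$\text{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$

F-measure

$F\text{-measure}=\frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall}+\text{Precision}}$

Accuracy is commonly used for classification: compare two vectors, the true labels and the predicted labels, add 1 for each correct prediction, and divide the final sum by the vector length to get the accuracy.

The formula for precision is $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$: it is the proportion of correctly retrieved items (TP) among all items actually retrieved (TP+FP). (Of everything that was picked, right or wrong, the fraction that was picked correctly.)

The formula for recall is $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$: it is the proportion of correctly retrieved items (TP) among all items that should have been retrieved (TP+FN). (Of all the targets, the fraction that was found.)

Suppose a class has 80 boys and 20 girls, 100 students in total, and the goal is to find all the girls. Someone selects 50 people, of whom 20 are girls, and has also mistakenly selected 30 boys as girls. As the evaluator, you need to evaluate this work.

It is easy to see that 70 people (20 girls + 50 boys) were classified correctly out of 100, so the accuracy is 70% (70 / 100).
Precision asks what proportion of the selected people are correct (that is, girls), so the precision is 40% (20 girls / (20 girls + 30 boys misclassified as girls)).
Recall asks what proportion of all the girls in the class were selected, so the recall is 100% (20 girls / (20 girls + 0 girls misclassified as boys)).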
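The class example can be checked numerically. In this sketch, girls are the positive class, so TP = 20 (girls selected), FP = 30 (boys selected), FN = 0 (girls missed), and TN = 50 (boys not selected); the F-measure line is added using the formula given earlier.

```python
tp, fp, fn, tn = 20, 30, 0, 50  # counts from the class example

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * recall * precision / (recall + precision)

print(accuracy, precision, recall, round(f_measure, 4))  # 0.7 0.4 1.0 0.5714
```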

AUC ROC Curve

What is ROC?

The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two things (e.g., whether a patient has a disease or not). Better models can accurately distinguish between the two, whereas a poor model will have difficulty doing so.

Let’s assume we have a model which predicts whether the patient has a particular disease or not. The model predicts probabilities for each patient (in Python we use the predict_proba function). Using these probabilities, we plot the distribution as shown below:

[Figure: distributions of predicted probabilities for patients with and without the disease]

Here, the red distribution represents all the patients who do not have the disease and the green distribution represents all the patients who have the disease.

Now we need to pick a cut-off, i.e., a threshold value, above which we predict everyone as positive (they have the disease) and below which we predict negative (they do not have the disease). We will set the threshold at “0.5” as shown below:

[Figure: the two distributions with the threshold at 0.5]

All the positive values above the threshold will be “True Positives” and the negative values above the threshold will be “False Positives”, as they are incorrectly predicted as positives.

All the negative values below the threshold will be “True Negatives” and the positive values below the threshold will be “False Negatives”, as they are incorrectly predicted as negatives.

[Figure: true/false positives and negatives on either side of the threshold]

Here, we have got a basic idea of the model predicting correct and incorrect values with respect to the threshold set. Before we move on, let’s go through two important terms: Sensitivity and Specificity.

What are Sensitivity and Specificity?

In simple terms, the proportion of patients correctly identified as having the disease (i.e. True Positives) out of the total number of patients who actually have the disease is called Sensitivity or Recall.

$\text{Sensitivity}=\frac{TP}{TP+FN}$

Similarly, the proportion of patients correctly identified as not having the disease (i.e. True Negatives) out of the total number of patients who do not have the disease is called Specificity.

$\text{Specificity}=\frac{TN}{TN+FP}$

Trade-off between Sensitivity and Specificity

When we decrease the threshold, we get more positive values thus increasing the sensitivity. Meanwhile, this will decrease the specificity.

Similarly, when we increase the threshold, we get more negative values thus increasing the specificity and decreasing sensitivity.

As Sensitivity ⬇️ Specificity ⬆️

As Specificity ⬇️ Sensitivity ⬆️
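The trade-off can be seen by computing both quantities at several thresholds. This is a minimal sketch on made-up scores, assuming a sample is predicted positive when its score is at or above the threshold.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Predict positive when score >= threshold; return (sensitivity, specificity)."""
    tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, scores))
    fn = sum(t == 1 and s < threshold for t, s in zip(y_true, scores))
    tn = sum(t == 0 and s < threshold for t, s in zip(y_true, scores))
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
    return tp / (tp + fn), tn / (tn + fp)

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9]  # made-up predicted probabilities

for threshold in (0.3, 0.5, 0.7):
    print(threshold, sensitivity_specificity(y_true, scores, threshold))
# 0.3 (1.0, 0.25)
# 0.5 (0.75, 0.75)
# 0.7 (0.5, 1.0)
```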

[Figure: sensitivity plotted against specificity]

But this is not how we graph the ROC curve. To plot the ROC curve, instead of Specificity we use (1 - Specificity), and the graph will look something like this:

[Figure: sensitivity plotted against (1 - specificity)]

So now, when the sensitivity increases, (1 - specificity) will also increase. This curve is known as the ROC curve.

[Figure: ROC curve]

Area Under the Curve

The AUC is the area under the ROC curve. This score gives us a good idea of how well the model performs.

Let’s take a few examples

[Figure: class distributions and ROC curves for several example models]

As we see, the first model does quite a good job of distinguishing the positive and the negative values. Therefore, the AUC score is 0.9, as the area under the ROC curve is large.

Whereas, if we look at the last model, the predictions overlap completely and we get an AUC score of 0.5. This means that the model is performing poorly and its predictions are almost random.

Why do we use (1 - Specificity)?

Let’s derive what exactly (1 - Specificity) is:

$\begin{array}{c}{\text { Specificity }=\frac{T N}{T N+F P}} \\ {1-\text {Specificity}=1-\frac{T N}{T N+F P}} \\ {1-\text {Specificity}=\frac{T N+F P-T N}{T N+F P}} \\ {1-\text {Specificity}=\frac{F P}{T N+F P}}\end{array}$

As we see above, Specificity gives us the True Negative Rate and (1 - Specificity) gives us the False Positive Rate.

So the sensitivity can be called the “True Positive Rate” and (1 - Specificity) can be called the “False Positive Rate”.

So now we are just looking at the positives. As we increase the threshold, we decrease the TPR as well as the FPR, and when we decrease the threshold, we increase the TPR and the FPR.

Thus, AUC ROC indicates how well the probabilities from the positive classes are separated from the negative classes.
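One way to see this separation interpretation: the AUC equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The sketch below computes it that way on made-up scores; it is an illustration of the idea, not the routine any particular library uses.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
well_separated = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
overlapping = [0.1, 0.6, 0.3, 0.8, 0.2, 0.7, 0.4, 0.9]

print(auc_by_ranking(y_true, well_separated))  # 1.0
print(auc_by_ranking(y_true, overlapping))     # 0.625
```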

ref1

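AUC is a measure of how effective a model is. For example, when we use grid search to brute-force the best hyperparameter combination, we can use AUC to compare the models produced by different combinations and select the best one. This is typically done by setting scoring='roc_auc'.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# X, Y: feature matrix and binary labels, assumed to be defined elsewhere
rfc = RandomForestClassifier()
param_grid = {'n_estimators': [70, 100, 180], 'criterion': ['gini', 'entropy'], 'verbose': [0, 4, 10],
              'warm_start': [False, True],  # booleans, not the strings 'False'/'True'
              'random_state': [42, 72, 100, 200]}
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc', cv=5)
CV_rfc.fit(X, Y)
print('BEST PARAMETERS:\n', CV_rfc.best_params_)
print('BEST SCORE:\n', CV_rfc.best_score_)
# Output of the original run (which passed warm_start as strings):
#>>>BEST PARAMETERS:{'criterion': 'entropy', 'n_estimators': 180, 'random_state': 72, 'verbose': 0, 'warm_start': 'False'}
#>>>BEST SCORE:0.8326181651784338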

DP-Recommendation

Posted on 2018-06-20 | In Differential Privacy

[2016-Xue Zhu]

It applies differential privacy within the recommendation process rather than to the data. An advantage of this choice is that it does not generate cumulative error.

Reference

[2016-Xue Zhu] Differential Privacy for Collaborative Filtering Recommender Algorithm

DM-Collaborative Filtering

Posted on 2018-06-19 | In Data Mining

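Reference

Collaborative filtering recommendation algorithms mainly serve to predict and recommend. The algorithm mines users' historical behavior data to discover their preferences, groups users by preference, and recommends items that suit similar tastes. Collaborative filtering algorithms fall into two categories: user-based collaborative filtering and item-based collaborative filtering. Put simply: birds of a feather flock together.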
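A minimal sketch of the user-based variant, assuming cosine similarity over co-rated items and a similarity-weighted average as the prediction rule. The ratings, the user names, and both function names are hypothetical; real systems typically add mean-centering and neighbor selection.

```python
import math

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    pairs = [(cosine(ratings[user], r), r[item])
             for other, r in ratings.items() if other != user and item in r]
    total = sum(sim for sim, _ in pairs)
    return sum(sim * rating for sim, rating in pairs) / total if total else None

ratings = {  # hypothetical user -> {item: rating} data
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "carol": {"A": 1, "B": 5, "C": 2},
}
predicted = predict(ratings, "alice", "C")
print(predicted)  # closer to bob's rating of C, since bob's tastes match alice's
```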

code

Python - NumPy

Posted on 2018-06-15 | In Python

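NumPy is a Python package. The name stands for "Numerical Python". It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.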

DP - Related Paper

Posted on 2018-06-10 | In Differential Privacy

survey

($\epsilon,\delta$)-differential privacy

[ ] Our Data, Ourselves: Privacy Via Distributed Noise Generation

$\delta$-probability privacy

[ ] Privacy: Theory meets practice on the map

Sensitivity

Composition

[ ] Boosting and Differential Privacy
[ ] Interactive Privacy via the Median Mechanism
[ ] [Privacy odometers and filters: Pay-as-you-go composition]
[ ] [Differential privacy and robust statistics]

Mechanism

Location DP

Distributed Lap

DP in SGD

[ ] [Private empirical risk minimization: Efficient algorithms and tight error bounds]
[ ] [Stochastic gradient descent with differentially private updates]

DP in Machine Learning

[ ] [Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis]
[ ] Related work in [2015-Reza Shokri] Privacy-Preserving Deep Learning

Error Bound in DP

[ ] [Concentrated differential privacy: Simplifications, extensions, and lower bounds.]
[ ] [Rényi differential privacy]