ML - K-Nearest Neighbors

Posted on 2018-06-10 | In Machine Learning

kNN (k-Nearest Neighbors) is a classification algorithm. It is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
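
As an illustration of the idea above, here is a minimal sketch in plain Python. It assumes Euclidean distance as the similarity measure and a simple majority vote among the k closest stored cases; the function name `knn_classify` and the toy data are hypothetical and only serve the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    # Rank the stored cases by Euclidean distance to the query point.
    neighbors = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    # Majority vote among the labels of the k nearest cases.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]
prediction = knn_classify(train, (1.1, 1.0), k=3)
```

Nothing is fitted ahead of time: all the work happens at query time, which is why kNN is often described as a lazy learner.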

TWICE

Posted on 2018-06-08

Variety shows

What is Love?

[Ask in a box]

Heart Shake

Likey

Signal

[Knowing Bros]

Knock Knock

TT

Cheer Up

[Same Bed, Different Dreams]

Like Ooh-Ahh

Latex

Posted on 2018-06-08 | In Latex

Breaking an equation across lines with the equals signs aligned

\begin{equation}
\begin{aligned}
Y&=2*(x+3)\\
&=2*x+6
\end{aligned}
\end{equation}

$\begin{equation} \begin{aligned} Y&=2*(x+3)\\ &=2*x+6 \end{aligned} \end{equation}$

System of simultaneous equations

t^*=
\left\{
\begin{aligned}
x&=eq1 & \text{if x=1}\\
y&=eq2 &\text{if x=2}
\end{aligned}
\right.

$t^*= \left\{ \begin{aligned} x&=eq1 & \text{if x=1}\\ y&=eq2 &\text{if x=2} \end{aligned} \right.$

Python - Pandas

Posted on 2018-06-08 | In Python

This post draws on reference 1

Importing pandas

Python-SciPy

Posted on 2018-06-07 | In Python

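SciPy builds on NumPy and provides a large collection of scientific algorithms; its submodules correspond to different applications. This post draws on references 1 and 2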

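File I/O (scipy.io): data input and output
Special functions (scipy.special): special functions are transcendental functions; a commonly used one is the gamma function, scipy.special.gamma()
Linear algebra (scipy.linalg)
Fast Fourier transforms (scipy.fftpack)
Optimization and fitting (scipy.optimize): provides useful algorithms for function minimization (scalar or multidimensional), curve fitting, and root finding.
Statistics and random numbers (scipy.stats)
Numerical integration (scipy.integrate)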

The standard way to import the modules is:
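
The post excerpt ends before showing the imports, so the following is only a sketch of the commonly used convention, assuming NumPy and SciPy are installed: import NumPy as `np` and pull in the SciPy submodules you need by name. The two calls (`special.gamma` and `optimize.minimize_scalar`) are there just to exercise submodules named in the list above.

```python
import numpy as np
from scipy import special, optimize

# scipy.special: the gamma function, evaluated elementwise on an array.
# Gamma(n) = (n - 1)! for positive integers, so Gamma(5) = 24.
gamma_values = special.gamma(np.array([1.0, 2.0, 5.0]))

# scipy.optimize: minimum of a scalar function, here (x - 2)^2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
```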

Python - Sklearn

Posted on 2018-06-07 | In Python

This post draws on reference 1

ML - Frequent Itemset Mining

Posted on 2018-06-05 | In Machine Learning

An association rule is a pattern stating that when one event occurs, another event occurs with a certain probability. Association rule mining first finds all itemsets whose support count is greater than the minimum support, then uses these large itemsets to generate the desired rules, namely those whose confidence is greater than the minimum confidence. For frequent itemset mining, we use the Apriori algorithm.
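
The two phases described above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the function names `apriori` and `rules` and the toy transactions are hypothetical, and both thresholds are treated as inclusive (`>=`), which is a common convention and an assumption here.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level-1 candidates: every single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules meeting min_confidence."""
    out = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # confidence(lhs -> rhs) = support(itemset) / support(lhs)
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    out.append((lhs, itemset - lhs, confidence))
    return out

# Hypothetical market-basket data.
transactions = [{"bread", "milk"},
                {"bread", "diaper", "beer", "eggs"},
                {"milk", "diaper", "beer", "cola"},
                {"bread", "milk", "diaper", "beer"},
                {"bread", "milk", "diaper", "cola"}]
frequent = apriori(transactions, min_support=3)
strong_rules = rules(frequent, min_confidence=0.8)
```

The lookup `frequent[lhs]` relies on the Apriori property: every subset of a frequent itemset is itself frequent, so its support count has already been stored.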

The following details are from The Apriori Algorithm … How The Apriori Algorithm Works.

Algorithm - Bloom Filter

Posted on 2018-06-04 | In Algorithm

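The text draws on "Massive Data Processing Algorithms: Bloom Filter".

A Bloom filter (BF) is a highly space-efficient randomized data structure: a fast probabilistic method for testing whether an element belongs to a set. A Bloom filter can produce false positives, but it never produces false negatives. That is, if the filter reports that an element is not in the set, it is definitely not in the set; if it reports that the element is in the set, there is some probability that the answer is wrong. Bloom filters are therefore unsuitable for applications that require zero errors.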
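
To make the behavior concrete, here is a minimal Bloom filter sketch in Python. The class name, the default sizes, and the choice of deriving the hash functions by salting SHA-256 with an index are all assumptions made for the example; real implementations usually pack the bits and pick the sizes from a target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, for clarity rather than compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["apple", "banana", "cherry"]:
    bf.add(word)
added_are_found = all(w in bf for w in ["apple", "banana", "cherry"])
```

Every added element is always reported as present because its bits are never cleared; an element that was never added can still be reported as present if other elements happened to set all of its positions.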

ML - Expectation Maximization Algorithm

Posted on 2018-06-03 | In Machine Learning

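The material below draws on "A First Look at the EM (Expectation Maximization) Algorithm"

Maximum likelihood estimation (MLE) is one method of parameter estimation. The basic idea is that the samples are known to follow some probability distribution whose parameters are unknown, so the parameters are estimated from randomly drawn samples. The basic steps are:

Write down the likelihood function: the probability of the sample set, i.e., the product of the probabilities of the individual samples
Take the logarithm of the likelihood function: this turns the product into a sum
Differentiate: the parameters that maximize the log-likelihood are the ones sought, so set the derivative to zero
Solve the resulting equation: the parameters obtained are the estimate.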
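
The steps above can be traced on a small case. The sketch below assumes a Bernoulli (coin-flip) model with hypothetical data; for this model, setting the derivative of the log-likelihood k log p + (n - k) log(1 - p) to zero gives the closed form p = k / n, which the code compares against a brute-force search over a grid.

```python
import math

def log_likelihood(p, flips):
    # The log turns the product of per-sample probabilities into a sum.
    return sum(math.log(p if x == 1 else 1.0 - p) for x in flips)

# Hypothetical sample: 7 heads (1) and 3 tails (0).
flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Closed-form solution of d/dp log L(p) = 0 for the Bernoulli model.
p_hat = sum(flips) / len(flips)

# Numerical check: the grid point with the highest log-likelihood.
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=lambda p: log_likelihood(p, flips))
```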

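The expectation maximization (EM) algorithm is a method for finding maximum likelihood estimates of the parameters of a probabilistic model from incomplete data or data with missing values (that is, data with hidden variables). Each iteration consists of two steps: the expectation (E) step and the maximization (M) step.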
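
A small sketch of the two steps, using the familiar two-coin setting as an assumed example: each trial is a run of tosses from one of two coins with unknown head probabilities, and which coin was used is the hidden variable. The function name, the data, the starting values, and the assumption that both coins are equally likely a priori are all illustrative choices.

```python
def em_two_coins(trials, theta_a, theta_b, iterations=20):
    """trials: list of (heads, tails) counts; the coin used per trial is hidden."""
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h, t in trials:
            # E-step: posterior probability that this trial came from coin A,
            # assuming both coins are equally likely beforehand.
            like_a = theta_a ** h * (1.0 - theta_a) ** t
            like_b = theta_b ** h * (1.0 - theta_b) ** t
            w_a = like_a / (like_a + like_b)
            w_b = 1.0 - w_a
            # Accumulate the expected head and tail counts for each coin.
            heads_a += w_a * h
            tails_a += w_a * t
            heads_b += w_b * h
            tails_b += w_b * t
        # M-step: re-estimate each coin's head probability from expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b

# Hypothetical data: five trials of ten tosses each, as (heads, tails).
trials = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
theta_a, theta_b = em_two_coins(trials, theta_a=0.6, theta_b=0.5)
```

The E-step never decides which coin produced a trial; it splits each trial's counts between the coins in proportion to their posterior weights, and the M-step is then an ordinary maximum likelihood estimate on those weighted counts.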

DP Application - Local Privacy

Posted on 2018-06-02 | In Differential Privacy

Q?

the communication cost of $n$ users for Bassily and Smith is $\log_2(n)$
the communication cost of $n$ users for k-RR is $\log_2 d$
Open questions on [Optimizing Locally Differentially Private Protocols]: what is "pure"? The THE protocol? BLH? The threshold $T_s$ in the experiments, and true/false positives?

T!

To optimize an LDP protocol, one tends to optimize the encoding step,
using domain knowledge to eliminate impossible candidates and so narrow down the size of the encoded input. [Locally Differentially Private Heavy Hitter Identification]

Local differential privacy (LDP) techniques collect randomized answers from each user, with guarantees of plausible deniability; meanwhile, the aggregator can still build accurate models and predictors by analyzing large amounts of such randomized data.

Unlike other models of differential privacy, which publish randomized aggregates but still collect the exact sensitive data, LDP avoids collecting exact personal information in the first place, thus providing a stronger assurance to the users and to the aggregator.

The well-established Laplace mechanism and exponential mechanism are no longer suitable in the local setting, where a user may have only a single element to release and Laplace noise may be too large relative to it. Moreover, unlike the Laplace mechanism, whose noisy output can be used directly, LDP requires an estimation step on the randomized outputs to obtain the estimator. (2016-Zhan)


Development Track

Local differential privacy starts from randomized response. At first, binary categorical data was collected from clients with the W-RR algorithm. Then multi-valued categorical data was handled with solutions such as k-RR and k-RAPPOR. After that, Random Matrix Projection was proposed for dealing with extremely large categorical domains in practical settings.
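
As a sketch of the starting point, binary randomized response can be written in a few lines. The version below assumes the common parameterization in which each user reports the true bit with probability e^ε / (1 + e^ε) and the flipped bit otherwise; the function names and the simulated population are hypothetical. It also shows the estimation step the aggregator needs, since the raw fraction of reported ones is biased.

```python
import math
import random

def randomize(bit, epsilon, rng):
    # Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_fraction(reports, epsilon):
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = p * f + (1 - p) * (1 - f); solve for the true fraction f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(0)
# Hypothetical population: 30% of users hold the sensitive bit 1.
true_bits = [1] * 30000 + [0] * 70000
reports = [randomize(b, 1.0, rng) for b in true_bits]
estimate = estimate_fraction(reports, 1.0)
```

The ratio between the probabilities of reporting the truth and reporting the flip is exactly e^ε, which is what gives each individual report its plausible deniability.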

Because the methods above are limited to categorical data, [Duchi et al.] proposed a mechanism for numeric data; the detailed algorithm is given in [Collecting and Analyzing Data from Smart Device Users with Local Differential Privacy], and the Harmony algorithm was developed on the basis of Duchi et al.'s method.

Most of the methods above assume that each client holds only one value, so a mechanism for the set-valued setting was proposed in [Heavy Hitter Estimation over Set-Valued Data with Local Differential Privacy].

In the real world, data collection on the client side happens frequently, for example daily or every few hours. Methods for collecting data at regular intervals have therefore been proposed, such as RAPPOR and [Collecting Telemetry Data Privately].

Recently, mechanisms for computing the joint distribution of data attributes collected under LDP have been proposed in [Building a RAPPOR with the Unknown: Privacy-Preserving Learning of Associations and Data Dictionaries] and [LoPub: High-Dimensional Crowdsourced Data Publication With Local Differential Privacy].