
Chapter 2

Probabilistic and Statistical Models for Outlier Detection

1 Introduction



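  • probabilistic model: the likelihood fit of a data point to a generative model
  • proximity-based model: the distance to the k nearest neighbors, the distance to the centroid, or the local density value
  • linear model: the residual distance of a data point from a lower-dimensional representation of the data
  • temporal model: the deviation of a data point from its forecasted value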
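Probabilistic models: the parameters of the model are learned so that the probability of the observed data being generated by the model is maximized. This model is, therefore, a generative model for the data, and the probability of a particular data point being generated can be estimated from this model.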
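The idea of a likelihood fit can be sketched in a few lines. The example below is a minimal illustration, not a method prescribed by these notes: it assumes a one-dimensional Gaussian as the generative model, fits it by maximum likelihood, and scores each point by its log-likelihood. The data values and the helper names `fit_gaussian` and `log_likelihood` are hypothetical.

```python
import math
import statistics


def fit_gaussian(values):
    """Maximum-likelihood estimates for a 1-D Gaussian: sample mean and biased variance."""
    mu = statistics.fmean(values)
    var = statistics.pvariance(values, mu)  # divides by n, which is the MLE
    return mu, var


def log_likelihood(x, mu, var):
    """Log-density of x under N(mu, var); a lower value means a poorer fit."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


# Hypothetical data: eight values near 10 and one value far away.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 15.0]

mu, var = fit_gaussian(data)
scores = [log_likelihood(x, mu, var) for x in data]

# The point with the lowest log-likelihood is the strongest outlier candidate.
most_anomalous = data[scores.index(min(scores))]
print(most_anomalous)
```

Note that the suspicious point is included when the parameters are estimated, so it pulls the mean and variance toward itself. This sketch ignores that effect.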

2 Statistical Methods for Extreme-Value Analysis


2.1 Probabilistic Tail Inequalities

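Inequalities that make fewer assumptions give weaker bounds but apply more broadly; inequalities that make more assumptions give stronger bounds but apply more narrowly. For example, the Markov and Chebychev inequalities are weak inequalities, but they apply to very large classes of random variables. On the other hand, the Chernoff bound and the Hoeffding inequality are both stronger inequalities, but they apply only to restricted classes of random variables.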

Markov Inequality
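
For reference, the standard statement of the Markov inequality is given below. The symbols X (a random variable that takes only non-negative values) and α (any positive constant) are notation chosen here. The bound is informative only when α > E[X], since otherwise the right-hand side is at least 1.

```latex
% X: non-negative random variable, \alpha > 0
P(X > \alpha) \le \frac{E[X]}{\alpha}
```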


Chebychev Inequality
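
For reference, the standard statement of the Chebychev inequality is given below, with X an arbitrary random variable with finite mean and variance, and α any positive constant (notation chosen here). It follows from applying the Markov inequality to the non-negative variable (X − E[X])².

```latex
% X: any random variable with finite mean and variance, \alpha > 0
P\left(|X - E[X]| > \alpha\right) \le \frac{\mathrm{Var}[X]}{\alpha^{2}}
```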


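Both inequalities give only rough bounds (they do not provide tight enough bounds) because they assume very little about the data: the Markov inequality requires only that the random variable be non-negative, and the Chebychev inequality drops even that requirement, needing only a finite mean and variance. As a result, neither inequality is very useful in practice.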
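How loose these bounds can be is easy to see on a distribution whose tail is known exactly. The sketch below assumes an exponential distribution with rate 1 (so E[X] = 1 and Var[X] = 1) and an illustrative threshold of 5; both choices are assumptions made for this example only.

```python
import math

# Assumed example distribution: Exponential(rate=1), so E[X] = 1 and Var[X] = 1.
mean, variance = 1.0, 1.0
threshold = 5.0

# Exact tail probability P(X > 5) for Exponential(1) is exp(-5).
exact_tail = math.exp(-threshold)

# Markov: P(X > a) <= E[X] / a.
markov_bound = mean / threshold

# Chebychev: the event {X > 5} is contained in {|X - 1| > 4},
# so P(X > 5) <= Var[X] / 4^2.
chebychev_bound = variance / (threshold - mean) ** 2

print(f"exact     = {exact_tail:.4f}")
print(f"Chebychev = {chebychev_bound:.4f}")
print(f"Markov    = {markov_bound:.4f}")
```

Here the Markov bound is 0.2 and the Chebychev bound is 0.0625, while the true tail probability is below 0.01. Both bounds hold, but neither is close to the true value.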

Many variables, however, are defined in aggregate form and can be expressed as sums of bounded random variables.

Lower-Tail Chernoff Bound
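
For reference, a standard form of the lower-tail Chernoff bound is given below. The notation is chosen here: X is the sum of N independent Bernoulli variables X_i, where X_i takes the value 1 with probability p_i, and δ is any constant in (0, 1).

```latex
% X = \sum_{i=1}^{N} X_i, with X_i independent Bernoulli and P(X_i = 1) = p_i
P\left(X < (1 - \delta)\, E[X]\right) < e^{-E[X]\,\delta^{2}/2},
\qquad \delta \in (0, 1)
```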

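The Chernoff bound assumes that the independent variables are Bernoulli. The Hoeffding inequality, introduced next, does not make this assumption and is more general: it requires only that each variable be bounded.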
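
For reference, the standard statement of the Hoeffding inequality is given below. The notation is chosen here: X is the sum of N independent random variables X_i, each bounded in the range [l_i, u_i], and θ is any positive constant.

```latex
% X = \sum_{i=1}^{N} X_i, with X_i independent and l_i \le X_i \le u_i
P\left(X - E[X] > \theta\right)
  \le \exp\left(-\frac{2\theta^{2}}{\sum_{i=1}^{N} (u_i - l_i)^{2}}\right),
\qquad
P\left(E[X] - X > \theta\right)
  \le \exp\left(-\frac{2\theta^{2}}{\sum_{i=1}^{N} (u_i - l_i)^{2}}\right)
```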

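It can be seen that the Hoeffding inequality is a more general form of the Chernoff bound. The table below summarizes the settings in which each inequality applies and how strong it is.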
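The relationship between the Chernoff bound and the Hoeffding inequality can also be checked numerically. The sketch below assumes a sum of N = 100 independent Bernoulli variables with p = 0.1 and δ = 0.5 (illustrative values chosen here). It compares the exact lower-tail probability with the lower-tail Chernoff bound exp(−E[X]·δ²/2) and the Hoeffding bound exp(−2θ²/N), where θ = δ·E[X] and every variable lies in [0, 1].

```python
import math

# Assumed setting: X is the sum of n independent Bernoulli(p) variables.
n, p = 100, 0.1
delta = 0.5

expected = n * p                    # E[X] = 10
cutoff = (1 - delta) * expected     # lower-tail threshold: 5

# Exact P(X < cutoff) from the binomial distribution.
exact_tail = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k)
    for k in range(n + 1)
    if k < cutoff
)

# Lower-tail Chernoff bound: exp(-E[X] * delta^2 / 2).
chernoff_bound = math.exp(-expected * delta**2 / 2)

# Hoeffding bound with theta = delta * E[X] and each variable in [0, 1],
# so the sum of squared ranges is n.
theta = delta * expected
hoeffding_bound = math.exp(-2 * theta**2 / n)

print(f"exact     = {exact_tail:.4f}")
print(f"Chernoff  = {chernoff_bound:.4f}")
print(f"Hoeffding = {hoeffding_bound:.4f}")
```

Both bounds hold in this setting. The Chernoff bound is the tighter of the two here because it uses E[X], which is small when p is small, whereas the Hoeffding bound uses only the range of each variable. Being more general does not make the Hoeffding bound tighter in every case.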


2.2 Statistical-Tail Confidence Tests


