Distance-Based Outlier Detection - Nested Loop Method
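Formally: $\frac{\|\{o' \mid dist(o, o') \le r\}\|}{\|D\|} \le \pi$, where $o$ is an object from the dataset $D$ and $\{o' \mid dist(o, o') \le r\}$ is the neighborhood of radius $r$ around that object.
$\pi$ is a fraction threshold that has to be determined.
We would call such an outlier a $DB(r, \pi)$-outlier.
Another way of formulating this is to look at the $k$-nearest neighbor $o_k$ of $o$, with $k = \lceil \pi \|D\| \rceil$. $o$ is then an outlier if the distance to $o_k$ is bigger than $r$.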
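The nearest-neighbor formulation can be sketched in a few lines. This is only an illustration under some assumptions: the data is one-dimensional and numeric (so the distance is `abs(a - b)`), $k$ is taken as $\lceil \pi \cdot n \rceil$, and the function name `is_db_outlier_knn` is made up for this example.

```python
import math

def is_db_outlier_knn(data, i, pi, r):
    """Return True if data[i] lies farther than r from its k-th nearest neighbor."""
    if not 0 < pi <= 1:
        raise ValueError("pi must be in (0, 1]")
    n = len(data)
    k = math.ceil(pi * n)  # assumed: k = ceil(pi * |D|)
    # Distances from data[i] to every other object, smallest first.
    dists = sorted(abs(data[i] - x) for j, x in enumerate(data) if j != i)
    if k > len(dists):
        # There is no k-th neighbor at all, so the object cannot have enough neighbors.
        return True
    return dists[k - 1] > r

data = [1.0, 1.1, 1.2, 1.3, 9.0]
print([x for i, x in enumerate(data) if is_db_outlier_knn(data, i, pi=0.5, r=0.5)])
```

Sorting all distances costs more than necessary; it is used here only to keep the sketch short.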
Runtime: $O(n^2)$ in the worst case, but CPU time is often close to linear in the data set size, because the inner loop usually terminates quickly when the dataset has few outliers. It might still be costly for large datasets that don't fit into RAM.
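To see the early termination in practice, one can count how many distance computations the nested loop actually performs. The following is a rough sketch, assuming one-dimensional numeric data; the helper name `count_distance_checks` and the toy dataset are invented for this illustration.

```python
def count_distance_checks(data, pi, r):
    """Run the nested loop and return (outliers, number of distance computations)."""
    n = len(data)
    checks = 0
    outliers = []
    for i, oi in enumerate(data):
        c = 0
        for j, oj in enumerate(data):
            if i == j:
                continue
            checks += 1
            if abs(oi - oj) <= r:
                c += 1
                if c >= pi * n:
                    break  # enough neighbors found, oi is not an outlier
        else:
            # The inner loop never broke, so oi has too few neighbors.
            outliers.append(oi)
    return outliers, checks

# 99 identical objects plus one far-away object.
data = [0.0] * 99 + [100.0]
outliers, checks = count_distance_checks(data, pi=0.25, r=1.0)
print(outliers, checks, len(data) * (len(data) - 1))
```

For this toy dataset the number of distance computations stays well below the worst case of $n(n-1)$, because only the single outlier forces a full pass over the data.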
Also, objects are checked one by one rather than group by group, which slows the algorithm down.
Thus a better approach is a grid-based method like the Distance-Based Outlier Detection - Grid-Based Method.
Simple Python Implementation
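def nested_loop_remove_outliers(data, pi, r):
    """Return the objects of data that are not DB(r, pi)-outliers (1-D numeric data)."""
    normal_data = []
    n = len(data)
    for i, oi in enumerate(data):
        c = 0  # number of other objects within distance r of oi
        for j, oj in enumerate(data):
            # Compare positions, not values, so duplicate values count as neighbors.
            if i != j and abs(oi - oj) <= r:
                c += 1
                if c >= pi * n:
                    # Enough neighbors: oi is not an outlier, stop early.
                    normal_data.append(oi)
                    break
    return normal_data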
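The same idea carries over to multi-dimensional points if `abs(oi - oj)` is replaced by a distance function. Below is a hedged sketch that reports the outliers instead of removing them; the name `nested_loop_outliers`, the `dist` parameter, and the sample points are assumptions made for this example, and `math.dist` (Euclidean distance) requires Python 3.8 or newer.

```python
import math

def nested_loop_outliers(points, pi, r, dist=math.dist):
    """Return the points with fewer than pi * n other points within distance r."""
    n = len(points)
    outliers = []
    for i, p in enumerate(points):
        c = 0
        for j, q in enumerate(points):
            if i != j and dist(p, q) <= r:
                c += 1
                if c >= pi * n:
                    break  # p has enough neighbors
        else:
            # The inner loop ran to completion: p never reached pi * n neighbors.
            outliers.append(p)
    return outliers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(nested_loop_outliers(points, pi=0.5, r=0.5))  # prints [(5.0, 5.0)]
```

Passing the distance as a parameter keeps the loop itself unchanged when a different metric is needed.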