Distance-Based Outlier Detection - Nested Loop Method

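An object $o$ is a distance-based outlier if at most a fraction $π$ of the objects in the dataset lie within distance $r$ of it.

Formally: $\frac{\| \{ o' \mid d(o, o') \leq r \} \|}{\| D \|} \leq π$, where $o$ is an object from the dataset $D$, $d$ is a distance function, and $r$ is the radius of the neighborhood around that object.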

$π$ is a threshold that has to be determined.

Such an object is called a DB($r$, $π$)-outlier.

Equivalently, look at the $k$-th nearest neighbor $o_{k}$ of $o$, with $k = ⌈πn⌉$ and $n = \| D \|$. Then $o$ is an outlier if its distance to $o_{k}$ is greater than $r$.
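The $k$-nearest-neighbor formulation can be sketched as follows. This is only an illustration under some assumptions: the data are scalar values, the distance is the absolute difference, $π > 0$, and an object is not counted as its own neighbor. The function name `knn_outliers` is made up for this note.

```python
import math

def knn_outliers(data, pi, r):
    """Return the values whose k-th nearest neighbor lies farther away than r.

    Assumes scalar data, absolute difference as distance, and pi > 0.
    """
    n = len(data)
    k = math.ceil(pi * n)  # k = ceil(pi * n), as in the text
    outliers = []
    for i, o in enumerate(data):
        # Distances from o to every other object, smallest first.
        dists = sorted(abs(o - other) for j, other in enumerate(data) if j != i)
        # With fewer than k other objects, o cannot have k neighbors within r.
        if k > len(dists) or dists[k - 1] > r:
            outliers.append(o)
    return outliers

print(knn_outliers([1.0, 1.1, 1.2, 1.3, 1.4, 9.0], pi=0.5, r=0.5))  # [9.0]
```

Sorting all distances costs more than the early-terminating nested loop, so this form is mainly useful for understanding the definition, not as a faster algorithm.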

The worst-case runtime is $O(n^{2})$. In practice, however, the CPU time is often close to linear in the dataset size: when there are few outliers, the inner loop terminates early for most objects. The method can still be costly for large datasets that don't fit into RAM.

In addition, objects are checked one by one rather than group by group, which slows the algorithm down.

Thus a better approach is a grid-based method, see Distance-Based Outlier Detection - Grid-Based Method.
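To make the early-termination effect visible, the following sketch counts how many distance checks the nested loop actually performs. The helper name `count_distance_checks` and the example data are assumptions for illustration, not measurements from a real dataset.

```python
def count_distance_checks(data, pi, r):
    """Run the nested loop and return (outliers, number of distance checks)."""
    n = len(data)
    checks = 0
    outliers = []
    for i, oi in enumerate(data):
        neighbors = 0
        is_outlier = True
        for j, oj in enumerate(data):
            if i == j:
                continue  # an object is not its own neighbor
            checks += 1
            if abs(oi - oj) <= r:
                neighbors += 1
                if neighbors >= pi * n:
                    # Enough neighbors found: stop scanning for this object.
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(oi)
    return outliers, checks

# Hypothetical data: a tight cluster in [0, 1) plus one far-away value.
data = [i / 1000 for i in range(999)] + [100.0]
outliers, checks = count_distance_checks(data, pi=0.01, r=1.0)
print(outliers)                 # [100.0]
print(checks, len(data) ** 2)   # checks is far below n^2
```

Only the single outlier forces a full scan over the data. Every clustered value stops after about $πn$ checks, so the total work grows roughly like $πn^{2}$ for the normal objects plus $n$ per outlier.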

Simple Python Implementation

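def nested_loop_remove_outliers(data, pi, r):
    """Return the values in data that are not DB(r, pi)-outliers."""
    normal_data = []
    n = len(data)
    for i, oi in enumerate(data):
        c = 0  # number of other objects within distance r of oi
        for j, oj in enumerate(data):
            # Compare indices, not values, so duplicate values count as neighbors.
            if i != j and abs(oi - oj) <= r:
                c += 1
                if c >= pi * n:
                    # Enough neighbors: oi is not an outlier, stop early.
                    normal_data.append(oi)
                    break
    return normal_data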
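The implementation above uses `abs(oi - oj)`, so it only works for scalar values. A possible variant for points with several coordinates is sketched below. It assumes Euclidean distance via `math.dist` (Python 3.8+) and returns the outliers instead of the normal objects. The name `nested_loop_outliers` and the `dist` parameter are hypothetical.

```python
import math

def nested_loop_outliers(points, pi, r, dist=math.dist):
    """Return the points that are DB(r, pi)-outliers under the given distance."""
    n = len(points)
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i != j and dist(p, q) <= r:
                neighbors += 1
                if neighbors >= pi * n:
                    break  # enough neighbors: p is not an outlier
        else:
            # The inner loop never hit break, so p has too few neighbors.
            outliers.append(p)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(nested_loop_outliers(points, pi=0.5, r=1.5))  # [(10, 10)]
```

Passing a different `dist` callable (for example a Manhattan distance) changes the neighborhood shape without touching the loop.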
