Correlation Analysis

Numerical data

Measures: correlation, covariance, sample covariance, variance. Visual detection: scatter plot or scatter plot matrix.
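The measures listed above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names and the `hours`/`score` data are hypothetical, and the sample versions (denominator n - 1) are assumed. With pandas, `Series.cov`, `Series.corr` and `pandas.plotting.scatter_matrix` cover the same ground.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def sample_variance(xs):
    # The variance is the covariance of a variable with itself
    return sample_covariance(xs, xs)


def pearson_correlation(xs, ys):
    # Covariance normalised by both standard deviations, so the result lies in [-1, 1]
    return sample_covariance(xs, ys) / sqrt(sample_variance(xs) * sample_variance(ys))


# Hypothetical example data: score is exactly twice hours
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [2.0, 4.0, 6.0, 8.0, 10.0]

cov = sample_covariance(hours, score)
corr = pearson_correlation(hours, score)
```

Because `score` is a perfect linear function of `hours`, the correlation comes out at 1 (up to floating-point rounding), while the covariance depends on the units of both variables.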

Nominal data

X = set of all distinct combinations (a_i, b_j) of attribute values
Y = collection of all observed tuples (one per record, duplicates included)

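Actual quantity: $c_{ij} = |\{(a, b) \in Y : a = a_i,\ b = b_j\}|$

Expected quantity in case of independence: $e_{ij} = \frac{\left(\sum_{k} c_{ik}\right)\left(\sum_{l} c_{lj}\right)}{|Y|}$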

To determine whether the two attributes are correlated, apply the chi-squared test to the actual and expected counts in the contingency table.
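As a sketch of the test statistic, write c_ij for the actual count and e_ij for the expected count of the combination (a_i, b_j), with n distinct values of the first attribute and m of the second. The symbol names are chosen here for illustration.

```latex
\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(c_{ij} - e_{ij})^2}{e_{ij}},
\qquad \text{degrees of freedom} = (n - 1)(m - 1)
```

For a hypothetical 2 × 2 table with actual counts 20, 30, 30, 20, every expected count is 25, so χ² = 4 · (5² / 25) = 4.0. With 1 degree of freedom the critical value at the 5% level is 3.841, so independence would be rejected at that level.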

Python Implementation

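import numpy as np
import pandas as pd


def correlation(AS: pd.Series, BS: pd.Series):
    # Returns the actual-count table, the expected-count table and the chi-squared value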
    A = AS.unique()
    B = BS.unique()
    A = np.flip(A)
    B = np.flip(B)
 
    n = len(A)
    m = len(B)
 
    # Generate all possible combinations of values in A and B
    X = [(i, j) for i in A for j in B]
    Y = [(i, j) for i, j in zip(AS,BS)]
 
    nY = len(Y)
 
    def actual(i, j):
        count = 0
        for ab in Y:
            if ab[0] == i and ab[1] == j:
                count = count + 1
        return count
 
    def expected(i, j):
        s1 = sum([actual(i, k) for k in B])
        s2 = sum([actual(l, j) for l in A])
        return (s1 * s2) / nY
 
    # Calculate Expected and Actual matrices (rows follow B, columns follow A)
    e_table = np.zeros((m,n))
    c_table = np.zeros((m,n))
    for i, b in enumerate(B):
        for j, a in enumerate(A):
            c = actual(a, b)
            e = expected(a, b)
            e_table[i, j] = e
            c_table[i, j] = c
 
    # Calculate Chi Squared Value (the tables have m rows for B and n columns for A)
    chi = 0
    for i in range(m):
        for j in range(n):
            e = e_table[i, j]
            c = c_table[i, j]
            chi += (c - e)**2 / e
 
    return c_table, e_table, chi
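The function above rescans all tuples for every cell of the table. A possible alternative, sketched here with only the standard library, counts each combination once with `collections.Counter`. The name `chi_squared_from_pairs` and the `gender`/`choice` data are hypothetical, and the result is returned as dictionaries keyed by (a, b) rather than as NumPy tables.

```python
from collections import Counter


def chi_squared_from_pairs(a_values, b_values):
    """Return (counts, expected, chi) for two equally long sequences of labels."""
    if len(a_values) != len(b_values) or len(a_values) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = len(a_values)
    pair_counts = Counter(zip(a_values, b_values))  # actual count per (a, b)
    a_counts = Counter(a_values)                    # marginal counts, first attribute
    b_counts = Counter(b_values)                    # marginal counts, second attribute

    counts, expected, chi = {}, {}, 0.0
    for a in a_counts:
        for b in b_counts:
            c = pair_counts.get((a, b), 0)
            # Expected count under independence: product of the marginals / total
            e = a_counts[a] * b_counts[b] / total
            counts[(a, b)] = c
            expected[(a, b)] = e
            chi += (c - e) ** 2 / e
    return counts, expected, chi


# Hypothetical example data: 100 records with two nominal attributes
gender = ["m"] * 50 + ["f"] * 50
choice = ["x"] * 20 + ["y"] * 30 + ["x"] * 30 + ["y"] * 20

counts, expected, chi = chi_squared_from_pairs(gender, choice)
```

Every value that appears in the data has a positive marginal count, so the expected count is never zero and the division is safe.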