Information Gain

An Attribute Selection Method which selects the attribute with the highest information gain:

$$Gain(A) = Info(D) - Info_A(D)$$

where $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$ is the Expected Information, also called Entropy ($p_i$ is the proportion of class $i$ in $D$).

So for a Binary Classification with class proportions $p$ and $1 - p$ we have

$$Info(D) = -p \log_2(p) - (1 - p) \log_2(1 - p)$$

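A quick check of the binary case (the function name is illustrative, not from the source):

```python
import math

def binary_information(p: float) -> float:
    """Entropy of a binary class distribution with proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure partition carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_information(0.5))  # → 1.0  (50/50 split: maximal uncertainty)
print(binary_information(1.0))  # → 0.0  (pure partition)
```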
Calculate the Expected Information of each data partition $D_j$ produced by splitting on attribute $A$, weighted by partition size:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)$$

import pandas as pd

def information_partitioned(dataset: pd.DataFrame, target_attribute: str, partition_attribute: str) -> float:

    # Relative size |D_j| / |D| of each partition induced by partition_attribute
    weights = dataset[partition_attribute].value_counts() / dataset.shape[0]
    # Weighted sum of the Expected Information of each partition
    return sum(
        weight * information(
            dataset[dataset[partition_attribute] == value],
            target_attribute,
        )
        for value, weight in weights.items()
    )

→ See Expected Information for information()
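For a self-contained read, a minimal sketch of what `information()` computes — assuming the standard entropy over the target attribute's class proportions (the authoritative definition lives in the Expected Information note):

```python
import numpy as np
import pandas as pd

def information(dataset: pd.DataFrame, target_attribute: str) -> float:
    """Expected Information (entropy) of the target attribute's class distribution."""
    probabilities = dataset[target_attribute].value_counts(normalize=True)
    return float(-(probabilities * np.log2(probabilities)).sum())
```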

def information_gain(dataset: pd.DataFrame, target_attribute: str, partition_attribute: str) -> float:

    # Gain(A) = Info(D) - Info_A(D)
    return information(dataset, target_attribute) - information_partitioned(
        dataset, target_attribute, partition_attribute
    )

When growing the tree, select the attribute with the highest gain first.
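A sketch of that selection step on a toy dataset (illustrative data; the helpers are restated in compact form so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

def information(dataset: pd.DataFrame, target_attribute: str) -> float:
    probabilities = dataset[target_attribute].value_counts(normalize=True)
    return float(-(probabilities * np.log2(probabilities)).sum())

def information_gain(dataset: pd.DataFrame, target_attribute: str, partition_attribute: str) -> float:
    weights = dataset[partition_attribute].value_counts(normalize=True)
    info_partitioned = sum(
        weight * information(dataset[dataset[partition_attribute] == value], target_attribute)
        for value, weight in weights.items()
    )
    return information(dataset, target_attribute) - info_partitioned

# Toy data: "outlook" separates the classes perfectly, "windy" not at all.
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
    "windy": [True, False, True, False, True, False],
    "play": ["no", "no", "yes", "yes", "yes", "yes"],
})
gains = {a: information_gain(data, "play", a) for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(best)  # → outlook
```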

Disadvantages

  • Biased towards multi-valued attributes: an attribute with many distinct values (e.g. a unique ID) produces many small, pure partitions and therefore a high gain, even when the split generalizes poorly.
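A sketch of this bias, assuming a hypothetical unique-ID column: every ID value yields a single-row, pure partition, so $Info_{ID}(D) = 0$ and the gain equals the full entropy of the target — the maximum possible — even though the split predicts nothing.

```python
import numpy as np
import pandas as pd

def information(dataset: pd.DataFrame, target_attribute: str) -> float:
    probabilities = dataset[target_attribute].value_counts(normalize=True)
    return float(-(probabilities * np.log2(probabilities)).sum())

def information_gain(dataset: pd.DataFrame, target_attribute: str, partition_attribute: str) -> float:
    weights = dataset[partition_attribute].value_counts(normalize=True)
    return information(dataset, target_attribute) - sum(
        weight * information(dataset[dataset[partition_attribute] == value], target_attribute)
        for value, weight in weights.items()
    )

data = pd.DataFrame({
    "row_id": [1, 2, 3, 4],              # unique per row (hypothetical ID attribute)
    "windy": [True, False, True, False],
    "play": ["no", "yes", "yes", "no"],
})
# row_id wins with the maximum possible gain despite being useless for prediction.
print(information_gain(data, "play", "row_id"))  # → 1.0
print(information_gain(data, "play", "windy"))   # → 0.0
```

Gain Ratio style corrections address exactly this by penalizing splits with many partitions.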