Log-Likelihood

Because products of many small probabilities are numerically unstable in computers (they quickly underflow), we use the Log-Likelihood instead. Taking the logarithm turns the product over all datapoints into a sum, so each datapoint contributes one logarithm term.
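In one plausible notation (all symbols here are assumptions: n datapoints with feature vectors x_i, labels y_i, and parameter vector β), this reads

\[
\ell(\beta) \;=\; \log \prod_{i=1}^{n} p(y_i \mid x_i;\, \beta) \;=\; \sum_{i=1}^{n} \log p(y_i \mid x_i;\, \beta)
\]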

We rewrite the Log-Likelihood by inserting the Logistic Function for each datapoint's probability and applying the logarithm inside the sum.
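A sketch of this step, assuming binary labels y_i ∈ {0, 1} and writing σ for the Logistic Function σ(z) = 1 / (1 + e^{-z}):

\[
\ell(\beta) \;=\; \sum_{i=1}^{n} \log\!\left[ \sigma(x_i^T \beta)^{\,y_i} \left(1 - \sigma(x_i^T \beta)\right)^{1 - y_i} \right]
\]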

Then we apply the logarithm rule for powers, which moves the exponents in front of the logarithms, and use the definition of the Logistic Function.
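One plausible intermediate result, pulling the exponents down and replacing 1 − σ(z) with σ(−z) from the definition of the Logistic Function:

\[
\ell(\beta) \;=\; \sum_{i=1}^{n} \left[ y_i \log \sigma(x_i^T \beta) \;+\; (1 - y_i) \log \sigma(-x_i^T \beta) \right]
\]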

Finally, we use the definition of the Logistic Function once more to arrive at a much simpler form.
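Using log σ(z) = z − log(1 + e^{z}) and log σ(−z) = −log(1 + e^{z}), a sketch of this simpler form is

\[
\ell(\beta) \;=\; \sum_{i=1}^{n} \left[ y_i\, x_i^T \beta \;-\; \log\!\left(1 + e^{x_i^T \beta}\right) \right]
\]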

We can use this much simpler form of the equation to calculate the Gradient of the Log-Likelihood:
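With X stacking the x_i as rows, y the label vector, and p the vector of predicted probabilities σ(x_i^T β) (all notation assumed), the Gradient would be

\[
\nabla_\beta\, \ell(\beta) \;=\; \sum_{i=1}^{n} \left( y_i - \sigma(x_i^T \beta) \right) x_i \;=\; X^T (y - p)
\]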

We can calculate the Hessian Matrix in the same way.
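Under the same assumptions, with W the diagonal matrix whose entries are σ(x_i^T β)(1 − σ(x_i^T β)):

\[
H(\beta) \;=\; -\sum_{i=1}^{n} \sigma(x_i^T \beta)\left(1 - \sigma(x_i^T \beta)\right) x_i x_i^T \;=\; -X^T W X
\]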

We can now use the Newton Method to iteratively calculate the best parameters:
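A sketch of the Newton update in this notation, with p and W evaluated at the current parameters β^{(t)}:

\[
\beta^{(t+1)} \;=\; \beta^{(t)} - H\!\left(\beta^{(t)}\right)^{-1} \nabla_\beta\, \ell\!\left(\beta^{(t)}\right) \;=\; \beta^{(t)} + \left(X^T W X\right)^{-1} X^T (y - p)
\]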

For the Gaussian Distribution

We then insert the Gaussian Distribution of the target values like this.
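Assuming the mean of each target is the linear prediction x_i^T β and the variance is a fixed σ² (here σ² denotes the Gaussian variance, not the Logistic Function), the density would be

\[
p(y_i \mid x_i;\, \beta) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{\left(y_i - x_i^T \beta\right)^2}{2\sigma^2} \right)
\]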

The Log-Likelihood can then be simplified to a negative constant times the Residual Sum of Squares (also called the Loss), plus another constant term at the end.
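Plugging this density into the Log-Likelihood, one plausible form of the simplification is

\[
\ell(\beta) \;=\; -\frac{1}{2\sigma^2} \underbrace{\sum_{i=1}^{n} \left(y_i - x_i^T \beta\right)^2}_{\mathrm{RSS}(\beta)} \;-\; \frac{n}{2} \log\!\left(2\pi\sigma^2\right)
\]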

Because of the minus sign in front, we have to minimize the Residual Sum of Squares in order to maximize the Log-Likelihood.
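In symbols, under the assumptions above:

\[
\hat{\beta} \;=\; \arg\max_{\beta}\, \ell(\beta) \;=\; \arg\min_{\beta}\, \mathrm{RSS}(\beta)
\]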

The Loss is convex, so it has a unique minimum that can be calculated in closed form.
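The standard closed-form solution (the normal equations), assuming X^T X is invertible:

\[
\hat{\beta} \;=\; \left(X^T X\right)^{-1} X^T y
\]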