Log-Likelihood

Because products of many small probabilities are numerically unstable in computers (they quickly underflow), we use the Log-Likelihood instead. Taking the logarithm turns the product over all datapoints into a sum, so each datapoint contributes one logarithm term.
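In one plausible notation (all symbols here are assumptions: n datapoints with feature vectors x_i, labels y_i, and parameter vector β), this reads

\[
\ell(\beta) \;=\; \log \prod_{i=1}^{n} p(y_i \mid x_i;\, \beta) \;=\; \sum_{i=1}^{n} \log p(y_i \mid x_i;\, \beta)
\]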

We rewrite the Log-Likelihood by inserting the Logistic Function for each datapoint's probability and applying the logarithm inside the sum.
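A sketch of this step, assuming binary labels y_i ∈ {0, 1} and writing σ for the Logistic Function σ(z) = 1 / (1 + e^{-z}):

\[
\ell(\beta) \;=\; \sum_{i=1}^{n} \log\!\left[ \sigma(x_i^T \beta)^{\,y_i} \left(1 - \sigma(x_i^T \beta)\right)^{1 - y_i} \right]
\]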

Then we apply the logarithm rule for powers, which moves the exponents in front of the logarithms, and use the definition of the Logistic Function.
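One plausible intermediate result, pulling the exponents down and replacing 1 − σ(z) with σ(−z) from the definition of the Logistic Function:

\[
\ell(\beta) \;=\; \sum_{i=1}^{n} \left[ y_i \log \sigma(x_i^T \beta) \;+\; (1 - y_i) \log \sigma(-x_i^T \beta) \right]
\]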

Finally, we use the definition of the Logistic Function once more to arrive at a much simpler form.
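Using log σ(z) = z − log(1 + e^{z}) and log σ(−z) = −log(1 + e^{z}), a sketch of this simpler form is

\[
\ell(\beta) \;=\; \sum_{i=1}^{n} \left[ y_i\, x_i^T \beta \;-\; \log\!\left(1 + e^{x_i^T \beta}\right) \right]
\]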

We can use this much simpler form of the equation to calculate the Gradient of the Log-Likelihood:
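With X stacking the x_i as rows, y the label vector, and p the vector of predicted probabilities σ(x_i^T β) (all notation assumed), the Gradient would be

\[
\nabla_\beta\, \ell(\beta) \;=\; \sum_{i=1}^{n} \left( y_i - \sigma(x_i^T \beta) \right) x_i \;=\; X^T (y - p)
\]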

We can calculate the Hessian Matrix in the same way.
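Under the same assumptions, with W the diagonal matrix whose entries are σ(x_i^T β)(1 − σ(x_i^T β)):

\[
H(\beta) \;=\; -\sum_{i=1}^{n} \sigma(x_i^T \beta)\left(1 - \sigma(x_i^T \beta)\right) x_i x_i^T \;=\; -X^T W X
\]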

We can now use the Newton Method to iteratively calculate the best parameters:
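A sketch of the Newton update in this notation, with p and W evaluated at the current parameters β^{(t)}:

\[
\beta^{(t+1)} \;=\; \beta^{(t)} - H\!\left(\beta^{(t)}\right)^{-1} \nabla_\beta\, \ell\!\left(\beta^{(t)}\right) \;=\; \beta^{(t)} + \left(X^T W X\right)^{-1} X^T (y - p)
\]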

For the Gaussian Distribution

We then insert the Gaussian Distribution of the target values like this.
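Assuming the mean of each target is the linear prediction x_i^T β and the variance is a fixed σ² (here σ² denotes the Gaussian variance, not the Logistic Function), the density would be

\[
p(y_i \mid x_i;\, \beta) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{\left(y_i - x_i^T \beta\right)^2}{2\sigma^2} \right)
\]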

The Log-Likelihood can then be simplified to a negative constant times the Residual Sum of Squares (also called the Loss), plus another constant term at the end.
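Plugging this density into the Log-Likelihood, one plausible form of the simplification is

\[
\ell(\beta) \;=\; -\frac{1}{2\sigma^2} \underbrace{\sum_{i=1}^{n} \left(y_i - x_i^T \beta\right)^2}_{\mathrm{RSS}(\beta)} \;-\; \frac{n}{2} \log\!\left(2\pi\sigma^2\right)
\]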

Because of the minus sign in front, we have to minimize the Residual Sum of Squares in order to maximize the Log-Likelihood.
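In symbols, under the assumptions above:

\[
\hat{\beta} \;=\; \arg\max_{\beta}\, \ell(\beta) \;=\; \arg\min_{\beta}\, \mathrm{RSS}(\beta)
\]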

The Loss is convex, so it has a unique minimum that can be calculated in closed form.
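The standard closed-form solution (the normal equations), assuming X^T X is invertible:

\[
\hat{\beta} \;=\; \left(X^T X\right)^{-1} X^T y
\]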