Maximum likelihood estimation

by Marco Taboga, PhD

Maximum likelihood estimation (MLE) is an estimation method that allows us to use a sample to estimate the parameters of the probability distribution that generated the sample.

This lecture provides an introduction to the theory of maximum likelihood, focusing on its mathematical aspects, in particular on:

  • its asymptotic properties;

  • the assumptions that are needed to prove the properties.

At the end of the lecture, we provide links to pages that contain examples and that treat practically relevant aspects of the theory, such as numerical optimization and hypothesis testing.

Table of contents

  1. The sample and its likelihood

  2. Maximum likelihood estimator

  3. Asymptotic properties

    1. Assumptions

    2. Information inequality

    3. Consistency

    4. Score vector

    5. Information matrix

    6. Asymptotic normality

    7. Different assumptions

  4. Numerical optimization

  5. Examples

  6. More details

    1. Estimation of the asymptotic covariance matrix

    2. Hypothesis testing

  7. References

The sample and its likelihood

The main elements of a maximum likelihood estimation problem are the following:

  • a sample $\xi$, that we use to make statements about the probability distribution that generated the sample;

  • the sample $\xi$ is regarded as the realization of a random vector $\Xi$, whose distribution is unknown and needs to be estimated;

  • there is a set $\Theta$ of real vectors (called the parameter space) whose elements (called parameters) are put into correspondence with the possible distributions of $\Xi$; in particular:

    • if $\Xi$ is a discrete random vector, we assume that its joint probability mass function $p_\Xi(\xi)$ belongs to a set of joint probability mass functions $\{p_\Xi(\xi\,;\theta):\theta\in\Theta\}$ indexed by the parameter $\theta$; when the joint probability mass function is considered as a function of $\theta$ for fixed $\xi$, it is called likelihood (or likelihood function) and it is denoted by $L(\theta\,;\xi)=p_\Xi(\xi\,;\theta)$

    • if $\Xi$ is a continuous random vector, we assume that its joint probability density function $f_\Xi(\xi)$ belongs to a set of joint probability density functions $\{f_\Xi(\xi\,;\theta):\theta\in\Theta\}$ indexed by the parameter $\theta$; when the joint probability density function is considered as a function of $\theta$ for fixed $\xi$, it is called likelihood and it is denoted by $L(\theta\,;\xi)=f_\Xi(\xi\,;\theta)$

  • we need to estimate the true parameter $\theta_0$, which is associated with the unknown distribution that actually generated the sample (we rule out the possibility that several different parameters are put into correspondence with the true distribution). A worked example of these elements follows the list.
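
As a concrete illustration of the elements just listed (an example added here, not part of the original lecture; the exponential model is assumed purely for illustration), suppose the sample $\xi=(x_1,\ldots,x_n)$ collects $n$ independent draws from an exponential distribution with unknown parameter $\theta$. The parameter space is $\Theta=(0,\infty)$ and the likelihood, viewed as a function of $\theta$ with the observed values held fixed, is

$$L(\theta\,;\xi)=\prod_{j=1}^{n}\theta e^{-\theta x_j}=\theta^{\,n}\exp\Bigl(-\theta\sum_{j=1}^{n}x_j\Bigr),\qquad\theta\in\Theta=(0,\infty).$$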

Maximum likelihood estimator

A maximum likelihood estimator $\hat{\theta}$ of $\theta_0$ is obtained as a solution of a maximization problem: $\hat{\theta}=\arg\max_{\theta\in\Theta}L(\theta\,;\xi)$. In other words, $\hat{\theta}$ is the parameter that maximizes the likelihood of the sample $\xi$. $\hat{\theta}$ is called the maximum likelihood estimator of $\theta_0$.

In what follows, the symbol $\hat{\theta}$ will be used to denote both a maximum likelihood estimator (a random variable) and a maximum likelihood estimate (a realization of a random variable): the meaning will be clear from the context.

The same estimator $\hat{\theta}$ is obtained as a solution of $\hat{\theta}=\arg\max_{\theta\in\Theta}\ln L(\theta\,;\xi)$, i.e., by maximizing the natural logarithm of the likelihood function. Solving this problem is equivalent to solving the original one, because the logarithm is a strictly increasing function. The logarithm of the likelihood is called log-likelihood and it is denoted by $l(\theta\,;\xi)=\ln L(\theta\,;\xi)$.
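
To make the equivalence concrete, here is a minimal numerical sketch (added for illustration, not part of the original lecture; it assumes the exponential model of the example above, simulated data, and SciPy's bounded scalar minimizer). Maximizing the log-likelihood numerically recovers the same estimate as the closed-form maximizer $1/\bar{x}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=500)   # simulated sample; true theta = 2.5

def negative_log_likelihood(theta):
    # l(theta; xi) = n*ln(theta) - theta*sum(x); we minimize its negative
    return -(x.size * np.log(theta) - theta * x.sum())

res = minimize_scalar(negative_log_likelihood, bounds=(1e-6, 100), method="bounded")
print("numerical MLE:", res.x)
print("closed form  :", 1 / x.mean())   # analytic maximizer of the same log-likelihood
```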

Asymptotic properties

To derive the (asymptotic) properties of maximum likelihood estimators, one needs to specify a set of assumptions about the sample $\xi$ and the parameter space $\Theta$.

The next section presents a set of assumptions that allows us to easily derive the asymptotic properties of the maximum likelihood estimator. Some of the assumptions are quite restrictive, while others are very generic. Therefore, the subsequent sections discuss how the most restrictive assumptions can be weakened and how the most generic ones can be made more specific.

Note: the presentation in this section does not aim at being one hundred per cent rigorous. Its aim is rather to introduce the reader to the main steps that are necessary to derive the asymptotic properties of maximum likelihood estimators. Therefore, some technical details are either skipped or de-emphasized. After getting a grasp of the main issues related to the asymptotic properties of MLE, the interested reader can refer to other sources (e.g., Newey and McFadden - 1994, Ruud - 2000) for a fully rigorous presentation of MLE theory.

Assumptions

Let $\{X_n\}$ be a sequence of $K\times 1$ random vectors. Denote by $\xi_n$ the sample comprising the first $n$ realizations of the sequence, $\xi_n=(x_1,\ldots,x_n)$, which is a realization of the random vector $\Xi_n=(X_1,\ldots,X_n)$.

We assume that:

  1. IID. $\{X_n\}$ is an IID sequence.

  2. Continuous variables. A generic term $X_j$ of the sequence $\{X_n\}$ is a continuous random vector, whose joint probability density function $f_X(x_j\,;\theta_0)$ belongs to a set of joint probability density functions $\{f_X(x_j\,;\theta):\theta\in\Theta\}$ indexed by a $p\times 1$ parameter $\theta$ (where we have dropped the subscript $j$ to highlight the fact that the terms of the sequence are identically distributed).

  3. Identification. If $\theta\neq\theta_0$, then the ratio $f_X(X_j\,;\theta)/f_X(X_j\,;\theta_0)$ is not almost surely constant. This also implies that the parametric family is identifiable: there does not exist another parameter $\theta\neq\theta_0$ such that $f_X(x\,;\theta)$ is the true probability density function of $X_j$.

  4. Integrable log-likelihood. The log-likelihood is integrable: $E\left[\left|\ln f_X(X_j\,;\theta)\right|\right]<\infty$ for all $\theta\in\Theta$.

  5. Maximum. The density functions $f_X(x_j\,;\theta)$ and the parameter space $\Theta$ are such that there always exists a unique solution $\hat{\theta}_n$ of the maximization problem $\hat{\theta}_n=\arg\max_{\theta\in\Theta}L(\theta\,;\xi_n)=\arg\max_{\theta\in\Theta}\prod_{j=1}^{n}f_X(x_j\,;\theta)$, where the rightmost equality is a consequence of independence (see the IID assumption above). Of course, this is the same as $\hat{\theta}_n=\arg\max_{\theta\in\Theta}l(\theta\,;\xi_n)=\arg\max_{\theta\in\Theta}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)$, where $l(\theta\,;\xi_n)$ is the log-likelihood and the terms $\ln f_X(x_j\,;\theta)$ are the contributions of the individual observations to the log-likelihood. It is also the same as $\hat{\theta}_n=\arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)$.

  6. Exchangeability of limit. The density functions $f_X(x_j\,;\theta)$ and the parameter space $\Theta$ are such that $\mathrm{plim}_{n\to\infty}\,\arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)=\arg\max_{\theta\in\Theta}\,\mathrm{plim}_{n\to\infty}\frac{1}{n}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)$, where $\mathrm{plim}$ denotes a limit in probability. Roughly speaking, the probability limit can be brought inside the $\arg\max$ operator.

  7. Differentiability. The log-likelihood $l(\theta\,;\xi_n)$ is two times continuously differentiable with respect to $\theta$ in a neighborhood of $\theta_0$.

  8. Other technical conditions. The derivatives of the log-likelihood $l(\theta\,;\xi_n)$ are well-behaved, so that it is possible to exchange integration and differentiation, to compute their first and second moments, and to ensure that probability limits involving their entries are also well-behaved.

Information inequality

Given the assumptions made above, we can derive an important fact about the expected value of the log-likelihood: for any $\theta\neq\theta_0$, $E\left[\ln f_X(X_j\,;\theta)\right]<E\left[\ln f_X(X_j\,;\theta_0)\right]$.

Proof

First of all, $E\left[\ln f_X(X_j\,;\theta)\right]-E\left[\ln f_X(X_j\,;\theta_0)\right]=E\left[\ln f_X(X_j\,;\theta)-\ln f_X(X_j\,;\theta_0)\right]=E\left[\ln\frac{f_X(X_j\,;\theta)}{f_X(X_j\,;\theta_0)}\right]$. Therefore, the inequality $E\left[\ln f_X(X_j\,;\theta)\right]<E\left[\ln f_X(X_j\,;\theta_0)\right]$ is satisfied if and only if $E\left[\ln f_X(X_j\,;\theta)\right]-E\left[\ln f_X(X_j\,;\theta_0)\right]<0$, which can also be written as $E\left[\ln\frac{f_X(X_j\,;\theta)}{f_X(X_j\,;\theta_0)}\right]<0$ (note that everything we have done so far is legitimate because we have assumed that the log-likelihoods are integrable). Thus, proving our claim is equivalent to demonstrating that this last inequality holds. In order to do this, we need to use Jensen's inequality. Since the logarithm is a strictly concave function and, by our assumptions, the ratio $\frac{f_X(X_j\,;\theta)}{f_X(X_j\,;\theta_0)}$ is not almost surely constant, by Jensen's inequality we have $E\left[\ln\frac{f_X(X_j\,;\theta)}{f_X(X_j\,;\theta_0)}\right]<\ln E\left[\frac{f_X(X_j\,;\theta)}{f_X(X_j\,;\theta_0)}\right]$. But $E\left[\frac{f_X(X_j\,;\theta)}{f_X(X_j\,;\theta_0)}\right]=\int\frac{f_X(x\,;\theta)}{f_X(x\,;\theta_0)}\,f_X(x\,;\theta_0)\,dx=\int f_X(x\,;\theta)\,dx=1$. Therefore, $E\left[\ln\frac{f_X(X_j\,;\theta)}{f_X(X_j\,;\theta_0)}\right]<\ln 1=0$, which is exactly what we needed to prove.

This inequality, called information inequality by many authors, is essential for proving the consistency of the maximum likelihood estimator.
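
A quick Monte Carlo illustration of the inequality (added here, not part of the original lecture; the exponential model and the grid of $\theta$ values are assumptions of the example). The sample analogue of $E\left[\ln f_X(X_j\,;\theta)\right]$ is largest at $\theta=\theta_0$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0
x = rng.exponential(scale=1 / theta0, size=200_000)   # draws from the true density f(.; theta0)

def mean_log_density(theta):
    # Monte Carlo estimate of E[ln f_X(X; theta)] under the true parameter theta0
    return np.mean(np.log(theta) - theta * x)

for theta in (0.5, 1.0, 2.0, 4.0):
    print(theta, round(mean_log_density(theta), 4))
# the largest value is attained at theta = 2.0 = theta0, as the information inequality predicts
```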

Consistency

Given the assumptions above, the maximum likelihood estimator $\hat{\theta}_n$ is a consistent estimator of the true parameter $\theta_0$: $\mathrm{plim}_{n\to\infty}\,\hat{\theta}_n=\theta_0$, where $\mathrm{plim}$ denotes a limit in probability.

Proof

We have assumed that the density functions $f_X(x_j\,;\theta)$ and the parameter space $\Theta$ are such that $\mathrm{plim}_{n\to\infty}\,\hat{\theta}_n=\mathrm{plim}_{n\to\infty}\,\arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)=\arg\max_{\theta\in\Theta}\,\mathrm{plim}_{n\to\infty}\frac{1}{n}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)$. But $\mathrm{plim}_{n\to\infty}\frac{1}{n}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)=E\left[\ln f_X(X_j\,;\theta)\right]$. The last equality is true, because, by Kolmogorov's Strong Law of Large Numbers (we have an IID sequence with finite mean), the sample average $\frac{1}{n}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)$ converges almost surely to $E\left[\ln f_X(X_j\,;\theta)\right]$ and, therefore, it converges also in probability (convergence almost surely implies convergence in probability). Thus, putting things together, we obtain $\mathrm{plim}_{n\to\infty}\,\hat{\theta}_n=\arg\max_{\theta\in\Theta}E\left[\ln f_X(X_j\,;\theta)\right]$. In the proof of the information inequality (see above), we have seen that $E\left[\ln f_X(X_j\,;\theta)\right]<E\left[\ln f_X(X_j\,;\theta_0)\right]$ for any $\theta\neq\theta_0$, which, obviously, implies $\theta_0=\arg\max_{\theta\in\Theta}E\left[\ln f_X(X_j\,;\theta)\right]$. Thus, $\mathrm{plim}_{n\to\infty}\,\hat{\theta}_n=\theta_0$.
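
A small simulation illustrating consistency (added here, not part of the original proof; the exponential model is assumed, for which $\hat{\theta}_n=1/\bar{x}$). The estimate approaches $\theta_0$ as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 2.0
for n in (10, 100, 1_000, 10_000, 100_000):
    x = rng.exponential(scale=1 / theta0, size=n)
    theta_hat = 1 / x.mean()                      # MLE of the exponential parameter
    print(f"n = {n:>6d}   theta_hat = {theta_hat:.4f}")
# theta_hat drifts toward theta0 = 2 as n grows, illustrating consistency
```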

Score vector

Denote by $\nabla_\theta\,l(\theta_0\,;\xi_n)$ the gradient of the log-likelihood, that is, the vector of first derivatives of the log-likelihood, evaluated at the point $\theta=\theta_0$. This vector is often called the score vector.

Given the assumptions above, the score has zero expected value: $E\left[\nabla_\theta\,l(\theta_0\,;\xi_n)\right]=0$.

Proof

First of all, note that $\int f_X(x\,;\theta)\,dx=1$ because probability density functions integrate to $1$. Now, taking the first derivative of both sides with respect to any component $\theta_k$ of $\theta$ and bringing the derivative inside the integral: $\int\frac{\partial f_X(x\,;\theta)}{\partial\theta_k}\,dx=0$. Now, multiply and divide the integrand function by $f_X(x\,;\theta)$: $\int\frac{1}{f_X(x\,;\theta)}\frac{\partial f_X(x\,;\theta)}{\partial\theta_k}\,f_X(x\,;\theta)\,dx=0$. Since $\frac{\partial\ln f_X(x\,;\theta)}{\partial\theta_k}=\frac{1}{f_X(x\,;\theta)}\frac{\partial f_X(x\,;\theta)}{\partial\theta_k}$, we can write $\int\frac{\partial\ln f_X(x\,;\theta)}{\partial\theta_k}\,f_X(x\,;\theta)\,dx=0$ or, using the definition of expected value: $E\left[\frac{\partial\ln f_X(X_j\,;\theta)}{\partial\theta_k}\right]=0$, which can be written in vector form using the gradient notation as $E\left[\nabla_\theta\ln f_X(X_j\,;\theta)\right]=0$ (here the expectation is computed with the density $f_X(x\,;\theta)$ itself; evaluated at $\theta=\theta_0$, it becomes an expectation under the true distribution). This result can be used to derive the expected value of the score as follows: $E\left[\nabla_\theta\,l(\theta_0\,;\xi_n)\right]=E\left[\sum_{j=1}^{n}\nabla_\theta\ln f_X(X_j\,;\theta_0)\right]=\sum_{j=1}^{n}E\left[\nabla_\theta\ln f_X(X_j\,;\theta_0)\right]=0$.
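
A numerical check of the zero-mean property (an illustrative sketch, not part of the original lecture; the exponential model is assumed, for which the per-observation score at $\theta_0$ is $1/\theta_0-x$).

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 2.0
x = rng.exponential(scale=1 / theta0, size=1_000_000)

score = 1 / theta0 - x   # d/d_theta of ln f(x; theta) = ln(theta) - theta*x, evaluated at theta0
print("sample mean of the score:", score.mean())   # close to 0, as the result above predicts
```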

Information matrix

Given the assumptions above, the covariance matrix of the score (called information matrix or Fisher information matrix) is $\mathrm{Var}\left[\nabla_\theta\,l(\theta_0\,;\xi_n)\right]=-E\left[\nabla^2_{\theta\theta}\,l(\theta_0\,;\xi_n)\right]$, where $\nabla^2_{\theta\theta}\,l(\theta_0\,;\xi_n)$ is the Hessian of the log-likelihood, that is, the matrix of second derivatives of the log-likelihood, evaluated at the point $\theta=\theta_0$.

Proof

From the previous proof, we know that $\int\frac{\partial\ln f_X(x\,;\theta)}{\partial\theta_k}\,f_X(x\,;\theta)\,dx=0$. Now, taking the first derivative of both sides with respect to any component $\theta_l$ of $\theta$ (again bringing the derivative inside the integral, and using $\frac{\partial f_X(x\,;\theta)}{\partial\theta_l}=\frac{\partial\ln f_X(x\,;\theta)}{\partial\theta_l}\,f_X(x\,;\theta)$), we obtain $\int\frac{\partial^2\ln f_X(x\,;\theta)}{\partial\theta_l\,\partial\theta_k}\,f_X(x\,;\theta)\,dx+\int\frac{\partial\ln f_X(x\,;\theta)}{\partial\theta_k}\frac{\partial\ln f_X(x\,;\theta)}{\partial\theta_l}\,f_X(x\,;\theta)\,dx=0$. Rearranging, we get $E\left[\frac{\partial\ln f_X(X_j\,;\theta)}{\partial\theta_k}\frac{\partial\ln f_X(X_j\,;\theta)}{\partial\theta_l}\right]=-E\left[\frac{\partial^2\ln f_X(X_j\,;\theta)}{\partial\theta_l\,\partial\theta_k}\right]$. Since this is true for any $\theta_k$ and any $\theta_l$, we can express it in matrix form as $E\left[\nabla_\theta\ln f_X(X_j\,;\theta)\,\nabla_\theta\ln f_X(X_j\,;\theta)^{\top}\right]=-E\left[\nabla^2_{\theta\theta}\ln f_X(X_j\,;\theta)\right]$, where the left-hand side is the covariance matrix of the gradient (which has zero mean). This result is equivalent to the result we need to prove because, by the IID assumption, $\mathrm{Var}\left[\nabla_\theta\,l(\theta_0\,;\xi_n)\right]=\sum_{j=1}^{n}\mathrm{Var}\left[\nabla_\theta\ln f_X(X_j\,;\theta_0)\right]=-\sum_{j=1}^{n}E\left[\nabla^2_{\theta\theta}\ln f_X(X_j\,;\theta_0)\right]=-E\left[\nabla^2_{\theta\theta}\,l(\theta_0\,;\xi_n)\right]$.

The latter equality is often called information equality.
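
The information equality can be checked numerically under the same assumed exponential model (an illustrative sketch, not part of the original lecture; for this model the per-observation score at $\theta_0$ is $1/\theta_0-x$ and the second derivative of the log-density is the constant $-1/\theta_0^2$).

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = 2.0
x = rng.exponential(scale=1 / theta0, size=1_000_000)

score = 1 / theta0 - x               # first derivative of ln f(x; theta) at theta0
second_derivative = -1 / theta0**2   # second derivative (constant in x for this model)

print("Var[score]           :", score.var())
print("-E[second derivative]:", -second_derivative)
# both values are close to 1 / theta0**2 = 0.25, as the information equality predicts
```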

Asymptotic normality

The maximum likelihood estimator is asymptotically normal: $\sqrt{n}\left(\hat{\theta}_n-\theta_0\right)\xrightarrow{d}N\left(0,\,V\right)$, where $V=\left(E\left[\nabla_\theta\ln f_X(X_j\,;\theta_0)\,\nabla_\theta\ln f_X(X_j\,;\theta_0)^{\top}\right]\right)^{-1}$ is the inverse of the Fisher information of a single observation. In other words, the distribution of the maximum likelihood estimator $\hat{\theta}_n$ can be approximated by a multivariate normal distribution with mean $\theta_0$ and covariance matrix $\frac{1}{n}V$.

Proof

Denote by $\nabla_\theta\,l(\theta\,;\xi_n)=\sum_{j=1}^{n}\nabla_\theta\ln f_X(x_j\,;\theta)$ the gradient of the log-likelihood, i.e., the vector of first derivatives of the log-likelihood. Denote by $\nabla^2_{\theta\theta}\,l(\theta\,;\xi_n)=\sum_{j=1}^{n}\nabla^2_{\theta\theta}\ln f_X(x_j\,;\theta)$ the Hessian of the log-likelihood, i.e., the matrix of second derivatives of the log-likelihood. Since the maximum likelihood estimator $\hat{\theta}_n$ maximizes the log-likelihood, it satisfies the first order condition $\nabla_\theta\,l(\hat{\theta}_n\,;\xi_n)=0$. Furthermore, by the Mean Value Theorem, we have $\nabla_\theta\,l(\hat{\theta}_n\,;\xi_n)=\nabla_\theta\,l(\theta_0\,;\xi_n)+\nabla^2_{\theta\theta}\,l(\bar{\theta}\,;\xi_n)\left(\hat{\theta}_n-\theta_0\right)$, where, for each row $k$ of the Hessian, the intermediate points $\bar{\theta}_k$ satisfy $\left\|\bar{\theta}_k-\theta_0\right\|\leq\left\|\hat{\theta}_n-\theta_0\right\|$ and the notation $\nabla^2_{\theta\theta}\,l(\bar{\theta}\,;\xi_n)$ indicates that each row of the Hessian is evaluated at a different point (row $k$ is evaluated at the point $\bar{\theta}_k$). Substituting the first order condition in the mean value equation, we obtain $0=\nabla_\theta\,l(\theta_0\,;\xi_n)+\nabla^2_{\theta\theta}\,l(\bar{\theta}\,;\xi_n)\left(\hat{\theta}_n-\theta_0\right)$, which, by solving for $\hat{\theta}_n-\theta_0$, becomes $\hat{\theta}_n-\theta_0=-\left[\nabla^2_{\theta\theta}\,l(\bar{\theta}\,;\xi_n)\right]^{-1}\nabla_\theta\,l(\theta_0\,;\xi_n)$, which can be rewritten as $\sqrt{n}\left(\hat{\theta}_n-\theta_0\right)=\left[-\frac{1}{n}\nabla^2_{\theta\theta}\,l(\bar{\theta}\,;\xi_n)\right]^{-1}\left[\frac{1}{\sqrt{n}}\nabla_\theta\,l(\theta_0\,;\xi_n)\right]$. We will show that the term in the first pair of square brackets converges in probability to a constant, invertible matrix and that the term in the second pair of square brackets converges in distribution to a normal distribution. The consequence will be that their product also converges in distribution to a normal distribution (by Slutsky's theorem).

As far as the first term is concerned, note that the intermediate points $\bar{\theta}_k$ converge in probability to $\theta_0$: since $\left\|\bar{\theta}_k-\theta_0\right\|\leq\left\|\hat{\theta}_n-\theta_0\right\|$ and $\hat{\theta}_n$ is consistent, $\mathrm{plim}_{n\to\infty}\,\bar{\theta}_k=\theta_0$. Therefore, skipping some technical details, we get $\mathrm{plim}_{n\to\infty}\left[-\frac{1}{n}\nabla^2_{\theta\theta}\,l(\bar{\theta}\,;\xi_n)\right]=-E\left[\nabla^2_{\theta\theta}\ln f_X(X_j\,;\theta_0)\right]$ by the Law of Large Numbers. As far as the second term is concerned, we get $\frac{1}{\sqrt{n}}\nabla_\theta\,l(\theta_0\,;\xi_n)=\sqrt{n}\left[\frac{1}{n}\sum_{j=1}^{n}\nabla_\theta\ln f_X(x_j\,;\theta_0)\right]\xrightarrow{d}N\left(0,\,E\left[\nabla_\theta\ln f_X(X_j\,;\theta_0)\,\nabla_\theta\ln f_X(X_j\,;\theta_0)^{\top}\right]\right)$ by the Central Limit Theorem, because the terms of the sum are IID with zero mean and the indicated covariance matrix. By putting things together and using the Continuous Mapping Theorem and Slutsky's theorem (see also the exercises in the lecture on Slutsky's theorem), we obtain $\sqrt{n}\left(\hat{\theta}_n-\theta_0\right)\xrightarrow{d}N\left(0,\,A^{-1}BA^{-1}\right)$, where $A=-E\left[\nabla^2_{\theta\theta}\ln f_X(X_j\,;\theta_0)\right]$ and $B=E\left[\nabla_\theta\ln f_X(X_j\,;\theta_0)\,\nabla_\theta\ln f_X(X_j\,;\theta_0)^{\top}\right]$.

By the information equality (see its proof), $A=B$, so the sandwich $A^{-1}BA^{-1}$ collapses and the asymptotic covariance matrix is equal to the inverse of the negative of the expected value of the Hessian matrix: $V=\left(-E\left[\nabla^2_{\theta\theta}\ln f_X(X_j\,;\theta_0)\right]\right)^{-1}=\left(E\left[\nabla_\theta\ln f_X(X_j\,;\theta_0)\,\nabla_\theta\ln f_X(X_j\,;\theta_0)^{\top}\right]\right)^{-1}$.
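
A simulation sketch of asymptotic normality (added for illustration, not part of the original proof; the exponential model is assumed, for which the Fisher information of one observation is $1/\theta_0^2$ and hence $V=\theta_0^2$). Across many replications, the empirical variance of $\sqrt{n}(\hat{\theta}_n-\theta_0)$ is close to $V$.

```python
import numpy as np

rng = np.random.default_rng(5)
theta0, n, replications = 2.0, 1_000, 20_000

samples = rng.exponential(scale=1 / theta0, size=(replications, n))
theta_hat = 1 / samples.mean(axis=1)               # MLE in each replication
z = np.sqrt(n) * (theta_hat - theta0)

print("empirical variance of sqrt(n)*(theta_hat - theta0):", z.var())
print("asymptotic variance V = theta0**2                 :", theta0**2)
```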

Different assumptions

As previously mentioned, some of the assumptions made above are quite restrictive, while others are very generic. We now discuss how the former can be weakened and how the latter can be made more specific.

Assumption 1 (IID). It is possible to relax the assumption that $\{X_n\}$ is IID and allow for some dependence among the terms of the sequence (see, e.g., Bierens - 2004 for a discussion). In case dependence is present, the formula for the asymptotic covariance matrix of the MLE given above is no longer valid and needs to be replaced by a formula that takes serial correlation into account.

Assumption 2 (continuous variables). It is possible to prove consistency and asymptotic normality also when the terms of the sequence $\{X_n\}$ are extracted from a discrete distribution, or from a distribution that is neither discrete nor continuous (see, e.g., Newey and McFadden - 1994).

Assumption 3 (identification). Typically, different identification conditions are needed when the IID assumption is relaxed (e.g., Bierens - 2004).

Assumption 5 (maximum). To ensure the existence of a maximum, requirements are typically imposed both on the parameter space and on the log-likelihood function. For example, it can be required that the parameter space be compact (closed and bounded) and the log-likelihood function be continuous. Also, the parameter space can be required to be convex and the log-likelihood function strictly concave (e.g.: Newey and McFadden - 1994).

Assumption 6 (exchangeability of limit). To ensure the exchangeability of the limit and the $\arg\max$ operator, the following uniform convergence condition is often imposed: $\mathrm{plim}_{n\to\infty}\,\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{j=1}^{n}\ln f_X(x_j\,;\theta)-E\left[\ln f_X(X_j\,;\theta)\right]\right|=0$.

Assumption 8 (other technical conditions). See, for example, Newey and McFadden (1994) for a discussion of these technical conditions.

Numerical optimization

In some cases, the maximum likelihood problem has an analytical solution. That is, it is possible to write the maximum likelihood estimator $\hat{\theta}$ explicitly as a function of the data.

However, in many cases there is no explicit solution. In these cases, numerical optimization algorithms are used to maximize the log-likelihood. The lecture entitled Maximum likelihood - Algorithm discusses these algorithms.
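
The sketch below illustrates the idea (it is not the code discussed in that lecture; it assumes a normal model with unknown mean and standard deviation, simulated data, and SciPy's BFGS minimizer applied to the negative log-likelihood, with the standard deviation parametrized on the log scale to keep the problem unconstrained).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.normal(loc=1.5, scale=0.7, size=2_000)    # simulated data; true (mu, sigma) = (1.5, 0.7)

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # negative of sum_j ln f(x_j; mu, sigma) for the normal density
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((x - mu) / sigma) ** 2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print("mu_hat   :", mu_hat)      # close to the sample mean
print("sigma_hat:", sigma_hat)   # close to the (biased) sample standard deviation
```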

Examples

The following lectures provide detailed examples of how to derive analytically the maximum likelihood (ML) estimators and their asymptotic variance:

  • ML estimation of the parameter of the Poisson distribution

  • ML estimation of the parameter of the exponential distribution

  • ML estimation of the parameters of the normal distribution

  • ML estimation of the parameters of the multivariate normal distribution

  • ML estimation of the parameters of a normal linear regression model

The following lectures provide examples of how to perform maximum likelihood estimation numerically:

  • ML estimation of the degrees of freedom of a standard t distribution (MATLAB example)

  • ML estimation of the coefficients of a logistic classification model

  • ML estimation of the coefficients of a probit classification model

  • ML estimation of the parameters of a Gaussian mixture

More details

The following sections contain more details about the theory of maximum likelihood estimation.

Estimation of the asymptotic covariance matrix

Methods to estimate the asymptotic covariance matrix of maximum likelihood estimators, including OPG, Hessian and Sandwich estimators, are discussed in the lecture entitled Maximum likelihood - Covariance matrix estimation.
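
As a rough sketch of the idea (not the formal treatment in the linked lecture; the one-parameter exponential model and all variable names are assumptions of this example), the OPG and Hessian estimators of the asymptotic variance can be built from per-observation first and second derivatives of the log-density evaluated at $\hat{\theta}$.

```python
import numpy as np

rng = np.random.default_rng(7)
theta0 = 2.0
x = rng.exponential(scale=1 / theta0, size=5_000)
theta_hat = 1 / x.mean()                          # MLE

scores = 1 / theta_hat - x                        # per-observation first derivatives at theta_hat
hessians = np.full_like(x, -1 / theta_hat**2)     # per-observation second derivatives at theta_hat

V_opg = 1 / np.mean(scores**2)                    # outer-product-of-gradients estimator of V
V_hessian = 1 / (-np.mean(hessians))              # Hessian-based estimator of V

print("OPG estimate of V    :", V_opg)
print("Hessian estimate of V:", V_hessian)
print("true V = theta0**2   :", theta0**2)
```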

Hypothesis testing

Tests of hypotheses on parameters estimated by maximum likelihood are discussed in the lecture entitled Maximum likelihood - Hypothesis testing, as well as in the lectures on the three classical tests (a minimal Wald-test sketch follows the list):

  1. Wald test;

  2. score test;

  3. likelihood ratio test.
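
As a minimal sketch of the first of these tests (illustrative only, not the linked lecture's treatment; the exponential model, the null value and the simulated sample are assumptions of this example), the Wald statistic compares $\hat{\theta}$ with the hypothesized value using the estimated asymptotic variance.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
theta_star, theta0, n = 2.0, 2.2, 5_000            # null value and (different) true value
x = rng.exponential(scale=1 / theta0, size=n)

theta_hat = 1 / x.mean()                           # MLE
V_hat = theta_hat**2                               # estimated asymptotic variance (inverse Fisher information)

wald = n * (theta_hat - theta_star) ** 2 / V_hat   # Wald statistic, chi-square(1) under the null
p_value = chi2.sf(wald, df=1)
print("Wald statistic:", wald)
print("p-value       :", p_value)                  # small, so H0: theta = 2.0 is rejected
```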

References

Bierens, H. J. (2004) Introduction to the mathematical and statistical foundations of econometrics, Cambridge University Press.

Newey, W. K. and D. McFadden (1994) "Chapter 36: Large sample estimation and hypothesis testing", in Handbook of Econometrics, Elsevier.

Ruud, P. A. (2000) An introduction to classical econometric theory, Oxford University Press.

How to cite

Please cite as:

Taboga, Marco (2021). "Maximum likelihood estimation", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/maximum-likelihood.
