Metrics on Spaces of Probabilities

Lecture note, STATS 621: Probability Theory, University of Michigan, 2019.
Author: Koen D
\[
\le \int_{\{p>q\}} (p-q)\, d\mu \;=\; P(A_0) - Q(A_0) \;=\; |P(A_0) - Q(A_0)|\,.
\]
Similarly, we deduce that for any set $A$:
\[
Q(A) - P(A) \;\le\; Q(B_0) - P(B_0) \;=\; |Q(B_0) - P(B_0)|\,.
\]
It follows that:
\[
\sup_{A \in \mathcal{A}} |P(A) - Q(A)| \;\le\; |P(A_0) - Q(A_0)| \;=\; |Q(B_0) - P(B_0)|\,.
\]
But
\[
\frac{1}{2}\int |p-q|\, d\mu
= \frac{1}{2}\int_{\{p>q\}} (p-q)\, d\mu + \frac{1}{2}\int_{\{q>p\}} (q-p)\, d\mu
= \frac{|P(A_0)-Q(A_0)| + |P(B_0)-Q(B_0)|}{2}
= |P(A_0)-Q(A_0)| = |P(B_0)-Q(B_0)|\,.
\]
Since $A_0 \in \mathcal{A}$, the supremum is also at least $|P(A_0)-Q(A_0)|$, and we conclude that $d_{TV}(P,Q) = \sup_{A \in \mathcal{A}} |P(A)-Q(A)| = \frac{1}{2}\int |p-q|\, d\mu$.

Exercise (a): Show that $H^2(P,Q) = 1 - \rho(P,Q)$, where the affinity $\rho$ is defined as $\rho(P,Q) = \int \sqrt{p\,q}\; d\mu$.

Exercise (b): Show that $d_{TV}(P,Q) = 1 - \int p \wedge q \; d\mu$.

Exercise (c): Show that the following inequalities hold:
\[
H^2(P,Q) \;\le\; d_{TV}(P,Q) \;\le\; H(P,Q)\,\bigl(1 + \rho(P,Q)\bigr)^{1/2} \;\le\; \sqrt{2}\, H(P,Q)\,.
\]

We now introduce an important discrepancy measure that crops up frequently in statistics. This is the Kullback-Leibler distance, though the term "distance" in this context needs to be interpreted with a grain of salt, since the Kullback-Leibler divergence does not satisfy the properties of a distance. For probability measures $P, Q$ dominated by a probability measure $\mu$, we define the Kullback-Leibler divergence as:
\[
K(P,Q) = \int \log\frac{p}{q}\; p \, d\mu\,.
\]
Once again this definition is independent of $\mu$ and the densities $p$ and $q$, since $p/q$ gives the density of the absolutely continuous part of $P$ w.r.t. $Q$ and $p\, d\mu$ is simply $dP$. In any case, we will focus on the representation in terms of densities, since this is of more relevance in statistics.

We first discuss the connection of the Kullback-Leibler discrepancy to the Hellinger distance. We have:
\[
K(P,Q) \;\ge\; 2\, H^2(P,Q)\,.
\]
It is easy to check that
\[
2\, H^2(P,Q) = 2\left[ 1 - \int \sqrt{\frac{q}{p}}\; p\, d\mu \right].
\]
Thus, we need to show that:
\[
\int \left( -\log\frac{q}{p} \right) p\, d\mu \;\ge\; 2 \int \left( 1 - \sqrt{\frac{q}{p}} \right) p\, d\mu\,.
\]
This is equivalent to showing that:
\[
\int \left( -\log \sqrt{\frac{q}{p}} \right) p\, d\mu \;\ge\; \int \left( 1 - \sqrt{\frac{q}{p}} \right) p\, d\mu\,.
\]
But this follows immediately from the fact that for any positive $x$, $\log x \le x - 1$, whence $-\log x \ge 1 - x$. We thus have:
\[
d_{TV}^2(P,Q) \;\le\; 2\, H^2(P,Q) \;\le\; K(P,Q)\,.
\]
It follows that if the Kullback-Leibler divergence between a sequence of probabilities $\{P_n\}$ and a fixed probability $P$ goes to 0, then convergence of $P_n$ to $P$ must happen both in the Hellinger and the TV sense (the latter two modes of convergence being equivalent).
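These relations are easy to sanity-check numerically. Below is a minimal Python sketch (my illustration, not part of the original notes) for two hypothetical discrete densities $p$ and $q$ on a common three-point support, with $\mu$ the counting measure; it checks the identities of exercises (a) and (b) and the chain $d_{TV}^2 \le 2H^2 \le K$ derived above.

```python
import numpy as np

# Hypothetical discrete densities w.r.t. counting measure on a common
# three-point support (values chosen purely for illustration).
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

d_tv = 0.5 * np.sum(np.abs(p - q))      # d_TV(P,Q) = (1/2) * sum |p - q|
rho = np.sum(np.sqrt(p * q))            # affinity rho(P,Q) = sum sqrt(p*q)
h_sq = 1.0 - rho                        # exercise (a): H^2(P,Q) = 1 - rho(P,Q)
kl = np.sum(p * np.log(p / q))          # K(P,Q); finite here since q > 0 wherever p > 0

# Exercise (b): d_TV(P,Q) = 1 - sum min(p, q)
assert np.isclose(d_tv, 1.0 - np.sum(np.minimum(p, q)))

# The chain derived above: d_TV^2 <= 2 H^2 <= K
assert d_tv**2 <= 2.0 * h_sq <= kl

print(f"d_TV = {d_tv:.4f}, H^2 = {h_sq:.4f}, K = {kl:.4f}")
```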
3 Connections to Maximum Likelihood Estimation

The Kullback-Leibler divergence is intimately connected with maximum likelihood estimation, as we demonstrate below. Consider a random variable/vector whose distribution comes from one of a class of densities $\{p(x,\theta) : \theta \in \Theta\}$. Here $\Theta$ is the parameter space (think of this as a subset of a metric space with metric $\tau$). For the following discussion, we do not need to assume that $\Theta$ is finite-dimensional. Let $\theta_0$ denote the data-generating parameter. The MLE is defined as that value of $\theta$ that maximizes the log-likelihood function based on i.i.d. observations $X_1, X_2, \ldots, X_n$ from the underlying distribution; i.e.
\[
\hat{\theta}_n = \mathrm{argmax}_{\theta \in \Theta} \sum_{i=1}^n l(X_i, \theta)\,,
\]
where $l(x,\theta) = \log p(x,\theta)$. Now
\[
K(p_{\theta_0}, p_\theta) = E_{\theta_0}\left[ \log \frac{p(X,\theta_0)}{p(X,\theta)} \right] \ge 0\,,
\]
with equality if and only if $\theta = \theta_0$ (this requires the tacit assumption of identifiability), since, by Jensen's inequality,
\[
E_{\theta_0}\left[ \log \frac{p(X,\theta_0)}{p(X,\theta)} \right]
= E_{\theta_0}\left[ -\log \frac{p(X,\theta)}{p(X,\theta_0)} \right]
\;\ge\; -\log E_{\theta_0}\left[ \frac{p(X,\theta)}{p(X,\theta_0)} \right] = 0\,.
\]
Equality happens if and only if the ratio $p(x,\theta)/p(x,\theta_0)$ is $P_{\theta_0}$-a.s. constant; if $P_\theta$ and $P_{\theta_0}$ are mutually absolutely continuous, that constant must equal 1, so equality of the densities holds a.s. $P_{\theta_0}$. This would imply that the two distributions are identical. Now, define $B(\theta) = E_{\theta_0}\bigl(\log p(X,\theta)\bigr)$. It is then clear that $\theta_0$ is the unique maximizer of $B(\theta)$.

Based on the sample, however, there is no way to compute the (theoretical) expectation that defines $B(\theta)$. A surrogate is the expectation of $l(X,\theta) \equiv \log p(X,\theta)$ based on the empirical measure $\mathbb{P}_n$ that assigns mass $1/n$ to each $X_i$. In other words, we stipulate $\hat{\theta}_n$ as an estimate of $\theta$, where:
\[
\hat{\theta}_n = \mathrm{argmax}_{\theta \in \Theta}\, \mathbb{P}_n\bigl(l(X,\theta)\bigr)
\equiv \mathrm{argmax}_{\theta \in \Theta}\, n^{-1}\sum_{i=1}^n l(X_i,\theta)
= \mathrm{argmax}_{\theta \in \Theta}\, \sum_{i=1}^n l(X_i,\theta)\,.
\]
Since $\mathbb{P}_n\bigl(l(X,\theta)\bigr) \to E_{\theta_0}\, l(X,\theta)$ a.s., one might expect $\hat{\theta}_n$, the empirical maximizer, to converge to $\theta_0$, the theoretical maximizer. This is the intuition behind ML estimation.

Define $g_\theta(X) = \log\{p(X,\theta_0)/p(X,\theta)\}$. Then $E_{\theta_0}\, g_\theta(X) = K(p_{\theta_0}, p_\theta)$. From the definition of $\hat{\theta}_n$ it follows that:
\[
0 \;\ge\; \frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}_n}(X_i)
= \frac{1}{n}\sum_{i=1}^n \Bigl( g_{\hat{\theta}_n}(X_i) - K(p_{\theta_0}, p_{\hat{\theta}_n}) \Bigr) + K(p_{\theta_0}, p_{\hat{\theta}_n})\,,
\]
which implies that:
\[
0 \;\le\; K(p_{\theta_0}, p_{\hat{\theta}_n}) \;\le\; \left| \frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}_n}(X_i) - K(p_{\theta_0}, p_{\hat{\theta}_n}) \right|\,. \tag{3.1}
\]
By the strong law of large numbers, it is the case that for each fixed $\theta$,
\[
\frac{1}{n}\sum_{i=1}^n g_\theta(X_i) - K(p_{\theta_0}, p_\theta) \;\to\; 0 \quad P_{\theta_0}\text{-a.s.}
\]
This, however, does not imply that
\[
\frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}_n}(X_i) - K(p_{\theta_0}, p_{\hat{\theta}_n}) \;\to\; 0 \quad P_{\theta_0}\text{-a.s.},
\]
since $\hat{\theta}_n$ is a random argument. But suppose that we could ensure:
\[
\sup_{\theta \in \Theta} \left| \frac{1}{n}\sum_{i=1}^n g_\theta(X_i) - K(p_{\theta_0}, p_\theta) \right| \;\to\; 0 \quad P_{\theta_0}\text{-a.s.}
\]
This is called a Glivenko-Cantelli condition: it ensures that the strong law of large numbers holds uniformly over a class of functions (which in this case is indexed by $\theta$). Then, by (3.1) we can certainly conclude that $K(p_{\theta_0}, p_{\hat{\theta}_n})$ converges a.s. to 0. By the inequality relating the Hellinger distance to the Kullback-Leibler divergence, we conclude that $H^2(p_{\theta_0}, p_{\hat{\theta}_n})$ converges a.s. to 0. This is called Hellinger consistency.

However, in many applications we are really concerned with the consistency of $\hat{\theta}_n$ to $\theta_0$ in the natural metric on $\Theta$ (which we denoted by $\tau$). The following proposition, adapted from van de Geer (1993, "Hellinger-consistency of certain nonparametric maximum likelihood estimators", Annals of Statistics, 21, pages 14-44), shows that consistency in the natural metric can be deduced from Hellinger consistency under some additional hypotheses.

Proposition: Say that $\theta_0$ is identifiable for the metric $\tau$ on $\Theta$ if, for all $\theta \in \Theta$, $H(p_\theta, p_{\theta_0}) = 0$ implies that $\tau(\theta, \theta_0) = 0$. Suppose that (a) $(\Theta, \tau)$ is a compact metric space, (b) $\theta \mapsto p(x,\theta)$ is $\mu$-almost everywhere continuous (here, $\mu$ is the underlying dominating measure) in the $\tau$ metric, and (c) $\theta_0$ is identifiable for $\tau$. Then $H(p_{\theta_n}, p_{\theta_0}) \to 0$ implies that $\tau(\theta_n, \theta_0) \to 0$.

Hence, under the conditions of the above proposition, a.s. Hellinger consistency would imply a.s. consistency of $\hat{\theta}_n$ for $\theta_0$ in the $\tau$-metric.

Proof: Suppose that $\tau(\theta_n, \theta_0) \not\to 0$. Since $\{\theta_n\}$ lies in a compact set and does not converge to $\theta_0$, there exists a subsequence $\{n'\}$ such that $\theta_{n'} \to \theta_\star \ne \theta_0$. Note that $H(p_{\theta_{n'}}, p_{\theta_0}) \to 0$. Now, by the triangle inequality:
\[
H(p_{\theta_0}, p_{\theta_\star}) \;\le\; H(p_{\theta_{n'}}, p_{\theta_0}) + H(p_{\theta_{n'}}, p_{\theta_\star})\,.
\]
The first term on the right side of the above display converges to 0; as for the second term, this also goes to 0, since by Scheffé's theorem, the a.e. convergence of $p(x, \theta_{n'})$ to $p(x, \theta_\star)$ (which follows from the continuity assumption (b)) guarantees that the densities $\{p_{\theta_{n'}}\}$ converge to $p_{\theta_\star}$ in total variation norm, and consequently in the Hellinger metric. Conclude that $H(p_{\theta_0}, p_{\theta_\star}) = 0$; by identifiability, $\tau(\theta_0, \theta_\star) = 0$. This shows that $\theta_{n'}$ converges to $\theta_0$, which provides the desired contradiction. $\Box$
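To see the mechanism of (3.1) in action, here is a small Python sketch (my illustration, not part of the notes). It assumes the location family $p(x,\theta) = N(\theta, 1)$ over a compact grid $\Theta \subset [-2, 2]$ containing $\theta_0$, for which $g_\theta(x) = \frac{1}{2}\bigl[(x-\theta)^2 - (x-\theta_0)^2\bigr]$ and $K(p_{\theta_0}, p_\theta) = (\theta - \theta_0)^2/2$ in closed form; the maximum deviation over the grid plays the role of the Glivenko-Cantelli quantity and bounds $K(p_{\theta_0}, p_{\hat{\theta}_n})$ as in (3.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setting: X_i ~ N(theta0, 1), Theta = a compact grid in [-2, 2]
# that contains theta0, so the argument leading to (3.1) applies exactly.
theta0 = 0.5
thetas = np.linspace(-2.0, 2.0, 401)
n = 5000
x = rng.normal(theta0, 1.0, size=n)

# Empirical means of g_theta(x) = ((x - theta)^2 - (x - theta0)^2) / 2, one per grid point
emp = 0.5 * ((x[:, None] - thetas) ** 2 - (x[:, None] - theta0) ** 2).mean(axis=0)

# Closed-form K(p_theta0, p_theta) = (theta - theta0)^2 / 2 for this Gaussian family
K = 0.5 * (thetas - theta0) ** 2

# Glivenko-Cantelli-type quantity: sup over the grid of |empirical mean - K|
sup_dev = np.max(np.abs(emp - K))

# The MLE maximizes sum_i log p(X_i, theta), i.e. minimizes the empirical mean of g_theta
theta_hat = thetas[np.argmin(emp)]
K_hat = 0.5 * (theta_hat - theta0) ** 2

# Inequality (3.1): 0 <= K(p_theta0, p_thetahat) <= sup_theta |(1/n) sum g_theta(X_i) - K|
assert 0.0 <= K_hat <= sup_dev
print(f"theta_hat = {theta_hat:.3f}, K(p0, p_hat) = {K_hat:.5f}, sup deviation = {sup_dev:.5f}")
```

Rerunning with larger n shrinks the sup deviation, and with it the bound on $K(p_{\theta_0}, p_{\hat{\theta}_n})$, which is exactly the route to Hellinger consistency described above.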
Exercise: Consider the model $\{\mathrm{Ber}(\theta) : 0 < a \le \theta \le b < 1\}$. Consider the M.L.E. of $\theta$ based on i.i.d. observations $\{X_i\}_{i=1}^n$ from the Bernoulli distribution, with the true parameter $\theta_0$ lying in $(a,b)$. Use the ideas developed above to show that the M.L.E. converges to the truth, almost surely, in the Euclidean metric. This is admittedly akin to pointing to your nose by wrapping your arm around your head, but it nevertheless illustrates how these techniques work. For "real applications" of these ideas one has to work with high-dimensional models, where the niceties of standard parametric inference fail.
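As a quick numerical companion to the exercise (a sanity check under assumed values of $a$, $b$, and $\theta_0$, not a proof), the Python sketch below computes the constrained MLE for the Bernoulli model, which is the sample mean clipped to $[a,b]$, and reports both the Euclidean error and the squared Hellinger distance $H^2(p_{\theta_0}, p_{\hat{\theta}_n}) = 1 - \bigl(\sqrt{\theta_0\hat{\theta}_n} + \sqrt{(1-\theta_0)(1-\hat{\theta}_n)}\bigr)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed (illustrative) values: compact parameter set [a, b] and a truth theta0 in (a, b).
a, b, theta0 = 0.1, 0.9, 0.35

for n in (10, 100, 1_000, 10_000, 100_000):
    x = rng.binomial(1, theta0, size=n)
    # The Bernoulli log-likelihood is concave in theta, so the MLE over the
    # compact set [a, b] is the sample mean clipped to [a, b].
    theta_hat = float(np.clip(x.mean(), a, b))
    # Squared Hellinger distance between Ber(theta0) and Ber(theta_hat)
    h_sq = 1.0 - (np.sqrt(theta0 * theta_hat)
                  + np.sqrt((1.0 - theta0) * (1.0 - theta_hat)))
    print(f"n={n:7d}  theta_hat={theta_hat:.4f}  "
          f"|theta_hat - theta0|={abs(theta_hat - theta0):.4f}  H^2={h_sq:.6f}")
```

Both the Euclidean error and the squared Hellinger distance shrink as $n$ grows, consistent with the conclusion the exercise asks you to prove.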