Questions and Answers #9 Confidence Intervals

University:
Stanford University
Course:
MATH 20 | Calculus
Academic year:

2021
Author:

MathPaladin

Question #1
Statistics - Confidence Intervals for $P(X > 1)$, where $X_1, \dots, X_n \sim U(0, 2p)$

Answer: Details of Comment: From what you say, I assume you have lower and upper bounds for a two-sided 95% CI for $p$. Say that these bounds are $L$ and $U$. Then $P(L < p < U) = 0.95$.

For $2p \ge 1$ we have $P(X > 1) = 1 - \frac{1}{2p}$. Manipulate the inequality $L < p < U$ (with $L > 0$):

$$L < p < U \iff \frac{1}{2U} < \frac{1}{2p} < \frac{1}{2L} \iff 1 - \frac{1}{2L} < 1 - \frac{1}{2p} < 1 - \frac{1}{2U},$$

to get the 95% CI $\left(1 - \frac{1}{2L},\ 1 - \frac{1}{2U}\right)$ for $P(X > 1) = 1 - \frac{1}{2p}$.

All of this depends on what you have told me. I have not checked the correctness of that. I'm just filling in details of my comment as you requested. This is not fundamentally different from (or more difficult than) manipulating the probability statement

$$P\left(-1.96 < Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$$

to get the CI $\bar{X} \pm \frac{1.96\,\sigma}{\sqrt{n}}$ for $\mu$ when sampling from normal data with $\sigma$ known.

Question #2
Statistics: finding the confidence level of the interval estimate (medical expenses)

A confidence interval for the true mean of the annual medical expenses of a middle-class American family is given as (738, 777). This interval is based on interviews with 110 families, and a standard deviation of 120 dollars is assumed. Suppose all annual medical expenses of middle-class American families follow an approximately normal distribution.

(a) What is the sample mean of annual medical expenses?
Sample mean = (777 + 738)/2 = 757.5
(b) What is the confidence level of the interval estimate (as a decimal)?
CI = sample mean + $z_{\alpha/2} \cdot$ (standard deviation / $\sqrt{n}$)
Isolate for $z_{\alpha/2}$: $777 = 757.5 + z_{\alpha/2} \cdot (120/\sqrt{110})$
Rearrange: $(777 \cdot \sqrt{110})/(757.5 \cdot 120) = z_{\alpha/2}$, so $0.089650657 = z_{\alpha/2}$

What are the correct solutions and answers?

Answer: Foremost, you are told that the standard deviation of the expenses is assumed to be $\sigma = 120$. This is your first hint that the confidence interval was derived using a normal distribution, not a Student's t distribution, because the latter uses the sample standard deviation $s$ that is calculated from the data.
Thus, you have correctly inferred that the interval has the form

$$[738, 777] = [L, U] = \left[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right].$$

It follows that $\bar{x} = \frac{L + U}{2} = \frac{738 + 777}{2} = 757.5$, as you again correctly compute. It also follows that $z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = \frac{U - L}{2}$, which is a quantity usually given the name "margin of error." Hence the critical value is

$$z_{\alpha/2} = \frac{\sqrt{n}}{\sigma} \times \frac{U - L}{2} = \frac{\sqrt{110}}{120} \times 19.5 \approx 1.70431,$$

and if you recall, this means $\Pr[Z > z_{\alpha/2}] = \Pr[Z > 1.70431] = \alpha/2$, where $Z$ is a standard normal random variable. So you would use either a calculator or a normal distribution table to find that $\alpha/2 \approx 0.0441612$, which means that the provided confidence interval corresponds to a significance level of about $\alpha = 0.088$, or $100(1 - \alpha)\% = 91.17\%$ confidence.

Question #3
Statistics: range rule of thumb and confidence interval

Suppose that the minimum and maximum ages for typical textbooks currently used in college courses are 0 and 8 years. Use the range rule of thumb to estimate the standard deviation.
Standard deviation = I have gotten (max - min)/4 = (8 - 0)/4 = 2
Find the size of the sample required to estimate the mean age of textbooks currently used in college courses. Assume that you want 98% confidence that the sample mean is within 0.4 year of the population mean.
Required sample size = I have no clue how to get the required sample size

Answer: We can apply the Central Limit Theorem if $n$ is large enough. The rule of thumb is $n > 30$; we will see whether that is the case.

The width of the confidence interval is

$$\bar{X} + z_{1-\alpha/2}\frac{s}{\sqrt{n}} - \left(\bar{X} - z_{1-\alpha/2}\frac{s}{\sqrt{n}}\right) = 2\, z_{1-\alpha/2}\frac{s}{\sqrt{n}},$$

where $\alpha$ is the significance level. Here it is $1 - 0.98 = 0.02$, so $z_{1-\alpha/2} = z_{0.99} \approx 2.33$. Being within 0.4 year of the population mean means that the margin of error, which is half of this width, equals 0.4:

$$z_{0.99}\frac{s}{\sqrt{n}} = 0.4 \quad\Rightarrow\quad 2.33 \cdot \frac{2}{\sqrt{n}} = 0.4 \quad\Rightarrow\quad \sqrt{n} = \frac{2.33 \cdot 2}{0.4} = 11.65,$$

so $n \approx 135.7$, which rounds up to a required sample size of 136. This is well above 30, so the normal approximation is reasonable.

Question #4
The bad debt ratio for a financial institution is defined to be the dollar value of loans defaulted divided by the total dollar value of all loans made.
A random sample of 6 Ontario banks is selected, and the bad debt ratios (in percentages) for these banks are: 7, 4, 6, 7, 5, 8. Compute and interpret a 95% confidence interval for the mean bad debt ratio. What needs to be true in order for this interval and interpretation to be valid?

Answer: The variance has presumably been calculated as

$$\frac{\left(7 - \frac{37}{6}\right)^2 + \left(4 - \frac{37}{6}\right)^2 + \left(6 - \frac{37}{6}\right)^2 + \left(7 - \frac{37}{6}\right)^2 + \left(5 - \frac{37}{6}\right)^2 + \left(8 - \frac{37}{6}\right)^2}{6 - 1} = \frac{13}{6} \approx 2.17,$$

with division by $6 - 1$ rather than 6 to give an unbiased estimate of the population variance. With such a small sample, even if you assume a normal distribution for the population, you should probably be using a Student's t-distribution for the confidence interval.

Question #5
The fill of a cereal box machine is required to be 18 ounces, with the variance already established at 0.24. The past 150 boxes revealed an average of 17.96 ounces. A testing error of $\alpha = 2\%$ is considered acceptable. If the box overfills, profits are lost; if the box underfills, the consumer is cheated.
(a) Can we conclude that the machine is set at 18 ounces?
(b) Find a 98% confidence interval for $\mu$.
(c) Would the result change if the sample variance were 0.24 from the data rather than knowing $\sigma^2 = 0.24$?

My Answer:
(a) I calculated the positive square root of the variance (0.24) to get the standard deviation of the population, $\sigma = 0.49$. I found the acceptance region to be between -2.33 and 2.33, so, for a value of $z = -0.9998$, I concluded that the machine is set at 18 ounces.
(b) $17.87 < \mu < 18.05$
(c) I understood the question as: "will there be a change in the result of part (b) if 0.24 changed from being a population variance to a sample variance". My answer is that when only the standard deviation of a sample is known and $n > 30$, the same formula will be used to calculate the 98% confidence interval, so there will be no change in the result of part (b).

Answer: Parts (a) and (b) are correct.
(Although for part (a) one might quibble that we are not accepting that the machine pours an average of 18 ounces, but rather failing to reject that it does. In any event this quibble is for whoever wrote the problem, not you.) Part (c) is right in the sense that for a sample size as large as 150, using a normal distribution rather than a t distribution doesn't change the answer significantly. There is a theoretical difference though.

Question #6
The times that a cashier spends processing each person's transaction are independent and identically distributed random variables with a mean of $\mu$ and a variance of $\sigma^2$. Thus, if $X_i$ is the processing time for each transaction, $E(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$. Let $Y$ be the total processing time for 100 orders: $Y = X_1 + X_2 + \cdots + X_{100}$.
a) What is the approximate probability distribution of $Y$, the total processing time of 100 orders?
b) Suppose that for $Z \sim N(0, 1)$, a standard normal random variable, $P(a < Z < b) = 100(1 - \alpha)\%$. Using your distribution from part (a), show that an approximate $100(1 - \alpha)\%$ confidence interval for the unknown population mean $\mu$ is:

$$\frac{Y - 10b\sigma}{100} < \mu < \frac{Y - 10a\sigma}{100}$$

I have done part (a) and ended up with $Y \sim N(100\mu, \sigma^2)$ using the central limit theorem, but I am unsure of where to start in part (b).

Answer: The goal of the exercise is to illustrate that you may derive a confidence interval that might not be the standard symmetric one (the usual conventional one that uses $\bar{x}$ as a pivot).

Since the $X_i$ are independent, $\operatorname{Var}(Y) = 100\sigma^2$, so the approximate distribution in part (a) is $N(100\mu, 100\sigma^2)$ and $Z = \frac{Y - 100\mu}{10\sigma}$ is approximately standard normal. So starting with some $a$ and $b$ such that $P(a < Z < b) = 1 - \alpha$, we can write

$$\begin{aligned}
1 - \alpha &= P(a < Z < b) \\
&= P\left(a < \frac{Y - 100\mu}{10\sigma} < b\right) \\
&= P(10a\sigma < Y - 100\mu < 10b\sigma) \\
&= P(-Y + 10a\sigma < -100\mu < -Y + 10b\sigma) \\
&= P\left(\frac{Y - 10a\sigma}{100} > \mu > \frac{Y - 10b\sigma}{100}\right) \\
&= P\left(\frac{Y - 10b\sigma}{100} < \mu < \frac{Y - 10a\sigma}{100}\right).
\end{aligned}$$

As stated at the beginning, this is a general confidence interval, not necessarily symmetric. If one takes $a = \Phi^{-1}\left(\frac{\alpha}{2}\right)$ and $b = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$, one gets the standard/conventional symmetric confidence interval.
But the point is, if the only requirement is that the interval has probability $1 - \alpha$, without requiring symmetry, then there are infinitely many such intervals possible.

Question #7
To calculate the credible region for the odds ratio $\frac{\theta}{1 - \theta}$. This is part of a bigger problem where I have already calculated a 90% credible region for the random variable $\theta$:
$P(\theta \le 0.30) = 0.05$
$P(\theta \le 0.52) = 0.95$
$P(0.30 \le \theta \le 0.52) = 0.9$.

Answer: You should be careful in that your manipulation only works if $\theta < 1$. You say that $\theta$ comes from a beta distribution though, so you're good. You have a small algebra mistake: it should be

$$P\left(\frac{\theta}{1 - \theta} \le a\right) = P\left(\theta \le \frac{a}{1 + a}\right) = 0.05,$$

which implies $\frac{a}{1 + a} = 0.30$, that is, $a = 0.30/0.70 \approx 0.43$, and similarly

$$P\left(\frac{\theta}{1 - \theta} \le b\right) = P\left(\theta \le \frac{b}{1 + b}\right) = 0.95,$$

which implies $\frac{b}{1 + b} = 0.52$, that is, $b = 0.52/0.48 \approx 1.08$.

Question #8
Trouble Understanding the Formal Definition of a Confidence Interval

A $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $C_n = (a, b)$ where $a = a(X_1, \dots, X_n)$ and $b = b(X_1, \dots, X_n)$ are functions of the data such that

$$P_\theta(\theta \in C_n) \ge 1 - \alpha, \quad \text{for all } \theta \in \Theta.$$

If $\theta$ is a vector then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

Question: While I understand conceptually what a confidence interval is (i.e., a 95% CI means that 95% of experiments will trap the parameter in the interval), I don't understand how this formality is capturing this concept. In particular, I don't understand what is meant by the notion of $P_\theta(\theta \in C_n)$. What is the sample space which $P$ is drawing from? What is the set $\theta \in C_n$? It seems here $\theta$ is being treated both as a fixed value (from the notation $P_\theta$) and as a random variable (by the notation $\theta \in C_n$).

Answer: $C_n$ is an interval with random endpoints, denoted $a$ and $b$. Both the endpoints are functions of your sample $X_1, X_2, \dots, X_n$, and the joint distribution of the $X$'s is parametrized by $\theta$, hence the subscript on $P_\theta$.
The parameter $\theta$ that governs this joint distribution is nonrandom, and generally unknown (and the mission of the CI is to capture this unknown parameter). The set $\{\theta \in C_n\}$ is shorthand for

$$\{\omega : a(X_1(\omega), X_2(\omega), \dots, X_n(\omega)) < \theta < b(X_1(\omega), X_2(\omega), \dots, X_n(\omega))\} \qquad (1)$$

Viewed this way, the event (1) is more a statement about the random endpoints $a$ and $b$ than about the parameter $\theta$: it's asking whether the random left endpoint is less than the number $\theta$ and the random right endpoint is greater than the number $\theta$. In the frequentist treatment of confidence intervals, $\theta$ is not a random variable; in other treatments it is possible to regard $\theta$ as the observed value of a random variable, but that's not what this definition appears to be about.

Question #9
Two-sided confidence intervals and tests

From a sample of 1751 army hospitals, estimate the mean expenses for a full time equivalent employee for all US army hospitals using a 90% confidence interval, given $\bar{x} = 6563$ and $s = 2484$.

Answer: Since the population variance is not known, go for a two-tailed t-test:

$H_0: \mu = 7000$
$H_1: \mu \ne 7000$

$$t_{\text{statistic}} = \frac{6563 - 7000}{2484/\sqrt{1751}} = -7.361$$

Reject $H_0$ if $|t_{\text{statistic}}| > t_{\text{critical}}$. At $\alpha = 0.1$, $t_{\text{critical}} = 1.645$ (with 1750 degrees of freedom the t critical value is essentially the normal one). Since $|-7.361| > 1.645$, reject $H_0$ and conclude that the mean expenses of a hospital employee is not equal to that of the public. The corresponding 90% confidence interval is $6563 \pm 1.645 \times \frac{2484}{\sqrt{1751}} \approx 6563 \pm 97.7$, that is, approximately (6465, 6661), which does not contain 7000.

Question #10
Use $2\sum_{i=1}^{n} Y_i/\beta$, which is a pivotal quantity, to derive a 95% confidence interval for $\beta$.

Answer: I am surprised that you say $\chi^2_{0.975} < \chi^2_{0.025}$, but if it is true then

$$P\left(\chi^2_{.025} \le \frac{2\sum_{i=1}^{n} Y_i}{\beta} \le \chi^2_{.975}\right) = 0,$$

which is not what you want.

There will be other 95% confidence intervals. Examples would include the one-sided cases

$$\left[\frac{2\sum_{i=1}^{n} Y_i}{\chi^2_{.95}},\ \infty\right) \quad \text{or} \quad \left(-\infty,\ \frac{2\sum_{i=1}^{n} Y_i}{\chi^2_{.05}}\right].$$

Question #11
We are given a 95% confidence interval with some range, in this particular case (4.0; 11.0), and want to find the 90% and 99% confidence intervals for the same parameter, assuming a normal distribution, without any additional information.
Answer: Based on your formula $\bar{x} \pm t \times s_{\bar{x}}$, it seems like you are dealing with a t confidence interval. If so, you will need the sample size $n$ to accurately get the 90% or 99% confidence interval.

But, just to clarify the thought process, suppose the given 95% confidence interval (4, 11) is a z confidence interval. If so, you can get the other confidence intervals based on the following logic (some of which also applies in the t confidence interval scenario).

First, the mid-point of that interval is the sample mean $\bar{x}$ by design. So $\bar{x} = \frac{4 + 11}{2} = 7.5$.

Second, the half-length of the interval is $1.96 \times \frac{\sigma}{\sqrt{n}}$. So $\frac{\sigma}{\sqrt{n}} = \frac{11 - 7.5}{1.96} = \frac{3.5}{1.96}$.

Then, the 90% confidence interval is $\bar{x} \pm 1.645 \times \frac{\sigma}{\sqrt{n}} \Rightarrow 7.5 \pm 1.645 \times \frac{3.5}{1.96} \approx 7.5 \pm 2.94$. The 99% interval follows the same way with 2.576 in place of 1.645: $7.5 \pm 2.576 \times \frac{3.5}{1.96} \approx 7.5 \pm 4.60$.

Question #12
Weaknesses of Wald confidence interval for binomial distribution

A survey samples 1000 people, among which 500 say they will vote for A, 400 for B and 100 for C. Calculate a confidence interval for the proportion of people that will vote for A. What are the major weaknesses of the confidence interval you just calculated?

Let $A_n \sim \operatorname{Binom}(n, p)$, with sample proportion $\bar{A}_n = A_n/n$. By the CLT,

$$\sqrt{n}\,(\bar{A}_n - p) \xrightarrow{d} N(0, p(1 - p)), \quad \text{hence} \quad \frac{\sqrt{n}\,(\bar{A}_n - p)}{\sqrt{p(1 - p)}} \xrightarrow{d} N(0, 1).$$

Since $p$ is unknown, we approximate it with $\bar{A}_n$. It follows that, at $1 - \alpha$ confidence,

$$p \in \bar{A}_n \pm z_{1-\alpha/2}\sqrt{\frac{\bar{A}_n(1 - \bar{A}_n)}{n}}.$$

Since this confidence interval relies on the CLT, it gives poor results when $n$ is small or $p$ is very close to 0 or 1. But this is not the case here, as $n = 1000$ and $\bar{A}_n = 0.5$, so I fail to see what weaknesses the question is talking about.

Answer: On theoretical grounds there are two reasons for suspicion that a (say 95%) Wald CI may not actually have 95% coverage probability. First, it uses the normal approximation to the binomial. Second, it estimates the standard error $\sqrt{p(1 - p)/n}$ by (in your notation) $\sqrt{\bar{A}_n(1 - \bar{A}_n)/n}$.
Immediately obvious undesirable consequences are that the Wald CI degenerates to a point at 0 or 1 if the observed proportion of Successes is 0 or 1, respectively. Perhaps more seriously, the actual coverage probability of the Wald interval differs greatly from one value of $p$ to another, and that coverage probability is often below the intended $100(1 - \alpha)$ percent. An easily accessible article discussing this point is Brown, Cai, and DasGupta (2001) in Statistical Science.

The graphs shown in the second link are for $n = 25$. You have used $n = 1000$ for your example. The Wald interval is based on asymptotic theory and so works better for a thousand observations than for a few dozen, but in general the weaknesses remain.

Question #13
What am I doing wrong in finding the confidence interval?

In a certain bio-engineering experiment, a successful outcome was achieved 60 times out of 125 attempts. Construct a 95% confidence interval for the probability, $p$, of success in a single trial.

My answer: We know that the confidence interval is $(p - ks,\ p + ks)$, where $s$ is the standard error. I found the $k$ value using Matlab code: k=norminv(0.975)=1.9600
Also: $p = 60/125 = 0.48$
Standard Error: $s = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.48 \cdot 0.52}{125}} = 0.04469$
We get the Confidence Interval: (0.392, 0.568)
But my answer is wrong. Is there anything I'm missing out here?

Answer: If $X = 60$ successes in $n = 125$ binomial trials, maybe your text is using the confidence interval that 'appends 2 successes and 2 failures' to the data before computing the CI. Such CIs have been shown to have more accurate coverage probabilities than the traditional ones. Then $p^+ = (60 + 2)/(125 + 4) = .4806$ and $n^+ = 129$, so that the CI is (0.394, 0.567). If that is not the 'correct answer' you're looking for, please do tell us what it is.
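
The two intervals in Question #13 can be checked numerically. The following is a minimal Python sketch, not the text's own method: the function names `wald_interval` and `plus_four_interval` are illustrative choices, and it assumes the 'appends 2 successes and 2 failures' interval simply reuses the traditional formula on the adjusted counts. It uses only the standard library, with `statistics.NormalDist().inv_cdf` standing in for Matlab's `norminv`.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, trials, confidence=0.95):
    """Traditional interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - half_width, p_hat + half_width


def plus_four_interval(successes, trials, confidence=0.95):
    """Same formula after appending 2 successes and 2 failures (assumed form)."""
    return wald_interval(successes + 2, trials + 4, confidence)


wald = wald_interval(60, 125)
plus_four = plus_four_interval(60, 125)

print(tuple(round(x, 3) for x in wald))       # (0.392, 0.568)
print(tuple(round(x, 3) for x in plus_four))  # (0.394, 0.567)
```

With 60 successes in 125 trials the two intervals differ only in the third decimal place, which is consistent with the remark in Question #12 that the adjustment matters most for small samples or proportions near 0 or 1.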