### Comparison of Confidence Intervals and Hypothesis Tests with Different Sample Sizes

##### Introduction

Here, I analyze a data set from an instrument I use routinely at work. The data come from an analytical technique called Karl Fischer titration by methanol extraction/dilution. It is commonly used to measure the amount of water a liquid sample contains to help enforce the federal Resource Conservation and Recovery Act (RCRA). This information is then used to decide whether the sample is considered aqueous and which other analyses should be performed on it. Generally, a sample with greater than 50% (by weight) water content is considered aqueous and can be tested for pH (corrosivity) and metals (toxicity). A sample with less than 50% (by weight) water content is generally considered non-aqueous, or organic, and would be tested for flashpoint (ignitability) and be available for volatile or semi-volatile analyses. Misclassifying a sample as aqueous or non-aqueous can seriously damage instruments and waste money, time, and sample.

Approximately one year of retrospective data (86 data points) was obtained from my laboratory in order to apply statistical analysis methods to the 100% water standard data, herein referred to as the H2O population. With this real data set, I was interested in computing the general statistics of the population as well as the confidence interval and hypothesis test for sample means of size n = 3. I also investigated how the confidence interval and hypothesis test change with increasing sample size. This paper describes, illustrates, and compares the results. All data and code can be found in Appendix A.

##### Statistical Analysis Results & Comparison

Initially, the histogram of the original H2O population looked normal, and the boxplot exhibited symmetry with four outliers. The Q-Q plot fits a normal distribution fairly well towards the center but not at the extreme ends. There was little difference between the mean, median, and 0.10 trimmed mean (99.22, 99.42, and 99.32, respectively), indicating that the distribution is fairly symmetric. The variance and standard deviation of 5.364 and 2.312, respectively, do not seem out of the ordinary and are probably inflated by the observed outliers. The theoretical water standard value of 100 fits well within the computed range of 92.40 to 107.35.

Interestingly, the Shapiro-Wilk normality test of the H2O population gives a p-value of 0.01412, which is less than 0.05. Therefore, we reject the hypothesis that the data came from a normal distribution. A nonparametric bootstrap is a good way to obtain the sampling distribution of a single mean because it does not require assuming anything about the distribution of the statistic. The nonparametric bootstrap also provides a simple random sample (SRS) by randomly sampling (with replacement) from the non-normal H2O population.

In order to compute a confidence interval (CI) for the mean, the new sampling distribution should be “standardized” around zero and not depend on any unknown values, similar to the CIs based on the standard normal distribution or the Student’s t distribution. This standardized pivot, (xbar - μ)/(s/sqrt(n)), is then used in the nonparametric bootstrap to produce 25,000 standardized sample means for each sample size, n = 3, n = 4, and n = 5.

In general, the summary statistics of the sampling distributions decreased as the sample size increased (Table 1). The mean, median, and 0.10 trimmed mean appear to decrease naturally with increasing sample size, but there was a noticeably large decrease in the variance and standard deviation between n = 3 and n = 4.
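
The bootstrap of the standardized pivot described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code from Appendix A: the function name `bootstrap_pivot`, the seeds, and the simulated stand-in for the 86 water-standard results are all assumptions made for the example.

```python
import math
import random
import statistics


def bootstrap_pivot(population, n, mu, reps=25_000, seed=0):
    """Return `reps` values of (xbar - mu) / (s / sqrt(n)) from resamples of size n."""
    rng = random.Random(seed)
    pivots = []
    while len(pivots) < reps:
        sample = rng.choices(population, k=n)  # sampling with replacement
        s = statistics.stdev(sample)
        if s == 0:
            # All resampled values identical: the pivot is undefined, so redraw.
            continue
        xbar = statistics.fmean(sample)
        pivots.append((xbar - mu) / (s / math.sqrt(n)))
    return pivots


# Hypothetical stand-in for the 86 water-standard results (NOT the real data).
rng = random.Random(42)
h2o = [round(rng.gauss(99.2, 2.3), 2) for _ in range(86)]
mu = statistics.fmean(h2o)  # the population mean plays the role of mu in the pivot

for n in (3, 4, 5):
    piv = bootstrap_pivot(h2o, n, mu, reps=5_000, seed=n)
    print(
        f"n={n}: median={statistics.median(piv):.3f}, "
        f"sd={statistics.stdev(piv):.3f}"
    )
```

The demonstration uses 5,000 replicates per sample size to keep it quick; the analysis in this paper used 25,000.
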
The histograms, boxplots, and normal Q-Q plots look very similar across the three sample sizes. The Shapiro-Wilk normality test was used again, this time on each of the three sampling distributions. The same conclusion was reached: each of the n = 3, n = 4, and n = 5 sampling distributions had a p-value less than 0.05, so normality was rejected for all three.

Another candidate distribution is the skew t distribution. It is an extension of the Student’s t distribution with four parameters: beta (β), which relates to the center; omega (Ω), which relates to the spread; alpha (α), which relates to skewness or shape; and df, the degrees of freedom, which relates to the heaviness of the tails. When alpha is zero, the skew t distribution becomes a Student’s t distribution. When the degrees of freedom are large, it approaches a skew normal distribution. When alpha is zero and the degrees of freedom are large, it approaches a normal distribution.

With this in mind, I fitted the skew t to each sampling distribution to assess the fit and to gain some information from the two additional parameters. As Table 2 shows, the skew t parameters Ω, α, and df steadily increase with sample size. So as the sample size increases, the skewness becomes slightly more pronounced and the tails become slightly lighter. The histograms below show that the variability decreases with increasing sample size. The Q-Q and P-P plots on the next page show that the sampling distributions fit better with increasing sample size.

Confidence intervals for a single mean are usually computed with the quantiles of a Student’s t distribution at the requested confidence level when the data come from an SRS and the sampling distribution is normal.
In this case, that computation does not apply since none of the sampling distributions are normally distributed. Therefore, the interval needs to be computed with the quantiles, at the requested confidence level, of the bootstrapped sampling distribution for the sample size of interest. Having already established the SRS and a significance level of α = 0.05 (a 95% confidence level), the specific quantiles were obtained for the three sampling distributions, n = 3, n = 4, and n = 5, and were used with a new set of sample observations of the corresponding size to calculate the confidence intervals. The density plots, shown at the bottom of the previous page, and Table 3 below indicate that both the quantiles and the confidence intervals become narrower as the sample size increases. For comparison purposes, the corresponding quantiles of the Student’s t distribution and the normal distribution were added.

The theoretical value of the water standard is 100% (by weight). Using this information, I conducted a hypothesis test for each sample size. The null hypothesis (H0) is that μ, the population mean, is equal to 100, and the alternative hypothesis (Ha) is that μ does not equal 100, giving a two-sided test. The H0 value of 100 was substituted into the pivot for the parameter μ, and a new sampling distribution for each sample size was obtained by reapplying the nonparametric bootstrap method.

The summary statistics of the new sampling distributions are shown in Table 4. The variance and standard deviation decreased with increasing sample size, again with a large drop between n = 3 and n = 4. The median and trimmed mean also decreased with increasing sample size. The histograms, boxplots, and normal Q-Q plots of the three new sampling distributions look comparable. The Shapiro-Wilk normality test was applied to each of the three new sampling distributions; each p-value was less than 0.05, so normality was again rejected.
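
Returning to the confidence intervals computed earlier in this section, the sketch below shows one way bootstrap quantiles can be turned into an interval by inverting the pivot: if q_lo and q_hi are the 2.5% and 97.5% quantiles of the bootstrapped pivot, the interval is (xbar - q_hi·s/sqrt(n), xbar - q_lo·s/sqrt(n)). The helper names, the simulated data, and the simple order-statistic quantile rule are assumptions for the example and may differ from the code in Appendix A.

```python
import math
import random
import statistics


def bootstrap_pivot(population, n, mu, reps=25_000, seed=0):
    """Return `reps` values of (xbar - mu) / (s / sqrt(n)) from resamples of size n."""
    rng = random.Random(seed)
    pivots = []
    while len(pivots) < reps:
        sample = rng.choices(population, k=n)  # sampling with replacement
        s = statistics.stdev(sample)
        if s == 0:  # pivot undefined for a constant resample, so redraw
            continue
        pivots.append((statistics.fmean(sample) - mu) / (s / math.sqrt(n)))
    return pivots


def empirical_quantile(values, p):
    """Order-statistic quantile: smallest value with at least a fraction p at or below it."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(p * len(ordered)) - 1))
    return ordered[idx]


def bootstrap_t_interval(sample, pivots, alpha=0.05):
    """Invert the pivot: q_lo <= (xbar - mu) / se <= q_hi, solved for mu."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    q_lo = empirical_quantile(pivots, alpha / 2)
    q_hi = empirical_quantile(pivots, 1 - alpha / 2)
    # The upper quantile sets the lower limit, and vice versa.
    return xbar - q_hi * se, xbar - q_lo * se


# Hypothetical stand-in for the 86 water-standard results (NOT the real data).
rng = random.Random(42)
h2o = [round(rng.gauss(99.2, 2.3), 2) for _ in range(86)]
mu = statistics.fmean(h2o)

for n in (3, 4, 5):
    pivots = bootstrap_pivot(h2o, n, mu, reps=5_000, seed=n)
    new_obs = [rng.gauss(99.2, 2.3) for _ in range(n)]  # hypothetical new observations
    lower, upper = bootstrap_t_interval(new_obs, pivots)
    print(f"n={n}: 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the bootstrapped pivot need not be symmetric, the two limits are generally not equidistant from the sample mean, unlike the usual Student’s t interval.
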
The skew t distribution was fitted again to each of the new sampling distributions, with the results shown in Table 5. The skew t parameters Ω and df again steadily increase, while β and α decrease, with increasing sample size. The skewness becomes slightly more pronounced and the tails become slightly lighter as the sample size increases. The histograms show the variability decreasing with increasing sample size, and the Q-Q and P-P plots show a better fit with increasing sample size.

A test statistic was calculated for each sample size using the new set of sample observations of the corresponding size, listed in Table 6. Since the three new sampling distributions are non-normal, the quantiles of each bootstrapped sampling distribution at α = 0.05 were used to obtain the critical values (Table 6). The test statistic values appear to increase, and the interval between the critical values appears to narrow, with increasing sample size. For all three sample sizes, the test statistic (TS) does not exceed its corresponding critical values (CV); therefore, we fail to reject the null hypothesis for each sample size.

The p-values were also computed for the two-sided hypothesis test and can be found in Table 6 for each of the three new sampling distributions. The p-values decreased with increasing sample size. The p-value/α decision confirmed the TS/CV decision: we fail to reject the null hypothesis because the p-values were greater than α = 0.05 for all three sample sizes, n = 3, n = 4, and n = 5. Interestingly, as the sample size increases to n = 5, both the TS/CV decision and the p-value/α decision approach the boundary at which failing to reject H0 would change to rejecting H0.
Since both decisions match, we can say that for each sample size, n = 3, n = 4, and n = 5, with α = 0.05, we have insufficient evidence that the mean water standard value differs from the theoretical value of 100 (% by weight).
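
The test procedure described in this section (test statistic, bootstrap critical values, and two-sided p-value) can be sketched as follows. This is an illustration under assumptions, not the Appendix A code: the function names and simulated data are hypothetical, and the p-value rule used here (twice the smaller tail proportion of the bootstrapped null distribution, capped at 1) is one common convention that may differ from the one used for Table 6.

```python
import math
import random
import statistics


def bootstrap_pivot(population, n, mu, reps=25_000, seed=0):
    """Return `reps` values of (xbar - mu) / (s / sqrt(n)) from resamples of size n."""
    rng = random.Random(seed)
    pivots = []
    while len(pivots) < reps:
        sample = rng.choices(population, k=n)  # sampling with replacement
        s = statistics.stdev(sample)
        if s == 0:  # pivot undefined for a constant resample, so redraw
            continue
        pivots.append((statistics.fmean(sample) - mu) / (s / math.sqrt(n)))
    return pivots


def empirical_quantile(values, p):
    """Order-statistic quantile: smallest value with at least a fraction p at or below it."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(p * len(ordered)) - 1))
    return ordered[idx]


def two_sided_test(sample, null_pivots, mu0=100.0, alpha=0.05):
    """Compare the observed pivot with bootstrap critical values and tail proportions."""
    n = len(sample)
    # Assumes the sample is not constant; otherwise s = 0 and the statistic is undefined.
    ts = (statistics.fmean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    cv_lo = empirical_quantile(null_pivots, alpha / 2)
    cv_hi = empirical_quantile(null_pivots, 1 - alpha / 2)
    below = sum(p <= ts for p in null_pivots) / len(null_pivots)
    above = sum(p >= ts for p in null_pivots) / len(null_pivots)
    p_value = min(1.0, 2 * min(below, above))  # assumed two-sided convention
    reject = ts < cv_lo or ts > cv_hi
    return ts, (cv_lo, cv_hi), p_value, reject


# Hypothetical stand-in for the 86 water-standard results (NOT the real data).
rng = random.Random(42)
h2o = [round(rng.gauss(99.2, 2.3), 2) for _ in range(86)]

for n in (3, 4, 5):
    # H0 value of 100 substituted for mu in the pivot, as described in the text.
    null_pivots = bootstrap_pivot(h2o, n, mu=100.0, reps=5_000, seed=n)
    new_obs = [rng.gauss(99.2, 2.3) for _ in range(n)]  # hypothetical new observations
    ts, (cv_lo, cv_hi), p_value, reject = two_sided_test(new_obs, null_pivots)
    decision = "reject H0" if reject else "fail to reject H0"
    print(f"n={n}: TS={ts:.3f}, CV=({cv_lo:.3f}, {cv_hi:.3f}), p={p_value:.3f}, {decision}")
```

Returning both the critical values and the p-value makes it easy to check that the two decision rules agree, as they did for all three sample sizes in this paper.
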