John Ramey Statistics and Machine Learning

Chapter 2 Solutions - Statistical Methods in Bioinformatics

As I have mentioned previously, I have begun reading Statistical Methods in Bioinformatics by Ewens and Grant and working selected problems for each chapter. In this post, I will give my solution to two problems. The first problem is pretty straightforward.

Problem 2.20

Suppose that a parent of genetic type Mm has three children. Then the parent transmits the M gene to each child with probability 1/2, and the genes that are transmitted to each of the three children are independent. Let $Y_1 = 1$ if children 1 and 2 had the same gene transmitted, and $Y_1 = 0$ otherwise. Similarly, let $Y_2 = 1$ if children 1 and 3 had the same gene transmitted, $Y_2 = 0$ otherwise, and let $Y_3 = 1$ if children 2 and 3 had the same gene transmitted, $Y_3 = 0$ otherwise.

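The question first asks us to show that the three random variables $Y_1$, $Y_2$, and $Y_3$ are pairwise independent but not independent. The pairwise independence follows from the assumption in the problem statement that the genes transmitted to the three children are independent. For example, $Y_1 = Y_2 = 1$ exactly when all three children receive the same gene, so $P(Y_1 = 1, Y_2 = 1) = 2 (1/2)^3 = 1/4 = P(Y_1 = 1) P(Y_2 = 1)$. Now, to show that the three random variables are not independent, denote by $p_i$ the probability that $Y_i = 1$, $i = 1, 2, 3$, so that each $p_i = 1/2$. If we had independence, then the following statement would be true:

$$P(Y_1 = 1, Y_2 = 1, Y_3 = 0) = p_1 p_2 (1 - p_3).$$

However, notice that the event on the left-hand side can never happen because if $Y_1 = 1$ and $Y_2 = 1$, then $Y_3$ must be 1. Hence, the left-hand side must equal 0, while the right-hand side equals 1/8. Therefore, the three random variables are not independent.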
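The argument above can be checked by brute force, since there are only eight equally likely transmission patterns. The following is a minimal sketch; the names `indicators` and `prob` and the labels $Y_1, Y_2, Y_3$ for the three indicator variables are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Each child independently receives M or m with probability 1/2,
# so all 8 transmission patterns are equally likely.
outcomes = list(product("Mm", repeat=3))
weight = Fraction(1, len(outcomes))

def indicators(genes):
    """Return (Y1, Y2, Y3) for one transmission pattern."""
    c1, c2, c3 = genes
    return (int(c1 == c2), int(c1 == c3), int(c2 == c3))

def prob(event):
    """Exact probability that event(Y1, Y2, Y3) holds."""
    return sum(weight for g in outcomes if event(*indicators(g)))

# Pairwise independence: P(Yi = a, Yj = b) = P(Yi = a) P(Yj = b) for all a, b.
pairwise_ok = all(
    prob(lambda *y: y[i] == a and y[j] == b)
    == prob(lambda *y: y[i] == a) * prob(lambda *y: y[j] == b)
    for i, j in [(0, 1), (0, 2), (1, 2)]
    for a in (0, 1)
    for b in (0, 1)
)

# Mutual independence fails for the event {Y1 = 1, Y2 = 1, Y3 = 0}.
lhs = prob(lambda y1, y2, y3: (y1, y2, y3) == (1, 1, 0))
rhs = (
    prob(lambda y1, y2, y3: y1 == 1)
    * prob(lambda y1, y2, y3: y2 == 1)
    * prob(lambda y1, y2, y3: y3 == 0)
)

print(pairwise_ok, lhs, rhs)  # True 0 1/8
```

Using `Fraction` keeps every probability exact, so the equality checks are not affected by floating-point rounding.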

The question also asks us to discuss why the variance of $Y_1 + Y_2 + Y_3$ is equal to the sum of the individual variances. Often, this is the case only if the random variables are independent. But the variance of a sum is the sum of the variances plus twice the sum of the pairwise covariances, and because the random variables here are pairwise independent, every covariance is 0. Thus, the equality must hold.
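The variance claim can be confirmed the same way, by computing exact moments over the eight equally likely transmission patterns. This is a small sketch; `expect` and `var` are helper names introduced here, and $Y_1, Y_2, Y_3$ are the three same-gene indicators.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("Mm", repeat=3))  # 8 equally likely patterns

def indicators(genes):
    """Return (Y1, Y2, Y3) for one transmission pattern."""
    c1, c2, c3 = genes
    return (int(c1 == c2), int(c1 == c3), int(c2 == c3))

def expect(f):
    """Exact expectation of f(Y1, Y2, Y3) under the uniform distribution."""
    return sum(Fraction(f(*indicators(g))) for g in outcomes) / len(outcomes)

def var(f):
    """Exact variance of f(Y1, Y2, Y3)."""
    return expect(lambda *y: f(*y) ** 2) - expect(f) ** 2

var_of_sum = var(lambda y1, y2, y3: y1 + y2 + y3)
sum_of_vars = sum(var(lambda *y, k=k: y[k]) for k in range(3))

# Covariance of Y1 and Y2; the other two pairs behave the same by symmetry.
cov_12 = expect(lambda y1, y2, y3: y1 * y2) - (
    expect(lambda y1, y2, y3: y1) * expect(lambda y1, y2, y3: y2)
)

print(var_of_sum, sum_of_vars, cov_12)  # 3/4 3/4 0
```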

Problems 2.23 - 2.27

While I worked the above problem because of its emphasis on genetics, the following set of problems is much more fun mathematically because of its use of approximations.

For $i = 1, \ldots, n$, let $X_i$ be the $i$th lifetime of certain cellular proteins until degradation. We assume that $X_1, \ldots, X_n$ are iid random variables, each of which is exponentially distributed with rate parameter $\lambda$. Furthermore, let $n = 2m + 1$ be an odd integer.

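This set of questions is concerned with the mean and variance of the sample median, $X_{(m+1)}$, where $X_{(i)}$ denotes the $i$th order statistic. First, note that the mean and variance of the minimum value $X_{(1)}$ are $1/(n\lambda)$ and $1/(n\lambda)^2$, respectively, because the minimum of $n$ independent exponential lifetimes is itself exponential with rate $n\lambda$. From the memoryless property of the exponential distribution, the additional time until the next protein degrades is independent of the time that has already elapsed. However, there are now $n - 1$ proteins remaining. Thus, the mean and variance of $X_{(2)}$ are $1/(n\lambda) + 1/((n-1)\lambda)$ and $1/(n\lambda)^2 + 1/((n-1)\lambda)^2$, respectively. Continuing in this manner and writing $n = 2m + 1$, we have

$$E[X_{(m+1)}] = \frac{1}{(2m+1)\lambda} + \frac{1}{2m\lambda} + \cdots + \frac{1}{(m+1)\lambda} = \frac{1}{\lambda} \sum_{j=m+1}^{2m+1} \frac{1}{j},$$

$$\operatorname{Var}[X_{(m+1)}] = \frac{1}{(2m+1)^2\lambda^2} + \frac{1}{(2m)^2\lambda^2} + \cdots + \frac{1}{(m+1)^2\lambda^2} = \frac{1}{\lambda^2} \sum_{j=m+1}^{2m+1} \frac{1}{j^2}.$$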
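As a sanity check on the memoryless argument, the sample median of $n = 2m + 1$ exponential lifetimes can be simulated and its moments compared with the sums $\frac{1}{\lambda}\sum_{j=m+1}^{2m+1} 1/j$ and $\frac{1}{\lambda^2}\sum_{j=m+1}^{2m+1} 1/j^2$. The sketch below uses only the standard library; the function names, the choice $m = 5$, $\lambda = 2$, the seed, and the number of replications are arbitrary choices made for illustration.

```python
import random
import statistics

def exact_median_moments(m, lam):
    """Mean and variance of X_(m+1) for n = 2m + 1 iid Exp(lam) lifetimes."""
    js = range(m + 1, 2 * m + 2)  # j = m+1, ..., 2m+1
    mean = sum(1 / (j * lam) for j in js)
    var = sum(1 / (j * lam) ** 2 for j in js)
    return mean, var

def simulated_median_moments(m, lam, reps, seed=0):
    """Monte Carlo estimate of the mean and variance of the sample median."""
    rng = random.Random(seed)
    n = 2 * m + 1
    medians = [
        statistics.median(rng.expovariate(lam) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.fmean(medians), statistics.variance(medians)

m, lam = 5, 2.0
print("exact:    ", exact_median_moments(m, lam))
print("simulated:", simulated_median_moments(m, lam, reps=20000))
```

With enough replications the simulated mean and variance should land close to the exact sums, up to Monte Carlo error.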


Approximation of $E[X_{(m+1)}]$

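Now, we wish to approximate the mean with a much simpler formula. First, from (B.7) in Appendix B, we have

$$\sum_{j=1}^{n} \frac{1}{j} \approx \log n + \gamma,$$

where $\gamma$ is Euler’s constant. Then, we can write the expected sample median as

$$E[X_{(m+1)}] = \frac{1}{\lambda} \left( \sum_{j=1}^{2m+1} \frac{1}{j} - \sum_{j=1}^{m} \frac{1}{j} \right) \approx \frac{1}{\lambda} \left[ \log(2m+1) - \log m \right] = \frac{1}{\lambda} \log\left(2 + \frac{1}{m}\right).$$

Hence, as $m \to \infty$, this approximation goes to $\log 2 / \lambda$, which is the median of an exponentially distributed random variable. Specifically, the median is the solution to $F_X(x) = 1/2$, where $F_X$ denotes the cumulative distribution function of the random variable $X$. For the exponential distribution, $F_X(x) = 1 - e^{-\lambda x}$, so the solution is $x = \log 2 / \lambda$.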
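To see how quickly the approximation settles down, the exact mean $\frac{1}{\lambda}\sum_{j=m+1}^{2m+1} 1/j$ can be compared numerically with $\frac{1}{\lambda}\log(2 + 1/m)$ and with the limit $\log 2/\lambda$. This sketch assumes that (B.7) is the usual harmonic-sum approximation $\sum_{j=1}^{n} 1/j \approx \log n + \gamma$; the function names and the values of $m$ are illustrative.

```python
import math

def exact_mean(m, lam=1.0):
    """Exact E[X_(m+1)] for n = 2m + 1 iid Exp(lam) lifetimes."""
    return sum(1 / (j * lam) for j in range(m + 1, 2 * m + 2))

def approx_mean(m, lam=1.0):
    """Approximation (1/lam) * log(2 + 1/m), assuming H_n ~ log n + gamma."""
    return math.log(2 + 1 / m) / lam

limit = math.log(2)  # population median when lam = 1
for m in (1, 5, 50, 500):
    print(m, exact_mean(m), approx_mean(m), limit)
```

Both the exact mean and the approximation approach $\log 2/\lambda$ as $m$ grows, and the gap between them shrinks along the way.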

Improved Approximation of $E[X_{(m+1)}]$

It turns out that we can improve this approximation with the following two results:

Following the derivation of our above approximation, we have that
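A numerical comparison gives a feel for how much a refinement of this kind can help. The sketch below assumes, for illustration only, that the refinement is the half-shifted form $\sum_{j=1}^{n} 1/j \approx \log(n + 1/2) + \gamma$, which follows from replacing each $1/j$ by the integral of $1/x$ over $[j - 1/2, j + 1/2]$; this form is an assumption and is not quoted from the book. Under it, the expected median becomes approximately $\frac{1}{\lambda}\log\frac{2m + 3/2}{m + 1/2}$. The function names are hypothetical, and $\lambda = 1$ throughout.

```python
import math

def exact_sum(m):
    """Exact sum of 1/j for j = m+1, ..., 2m+1 (the mean with lam = 1)."""
    return sum(1 / j for j in range(m + 1, 2 * m + 2))

def basic(m):
    """log(2m + 1) - log(m), from H_n ~ log n + gamma."""
    return math.log((2 * m + 1) / m)

def shifted(m):
    """ASSUMPTION: log(2m + 3/2) - log(m + 1/2), from H_n ~ log(n + 1/2) + gamma."""
    return math.log((2 * m + 1.5) / (m + 0.5))

for m in (1, 5, 50):
    e = exact_sum(m)
    print(m, abs(basic(m) - e), abs(shifted(m) - e))
```

For each value of $m$ tried here, the shifted form has the smaller absolute error.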

Approximation of $\operatorname{Var}[X_{(m+1)}]$

We can also approximate $\operatorname{Var}[X_{(m+1)}]$ using the approximation

With and , we have
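The quality of a variance approximation can be checked against the exact sum $\frac{1}{\lambda^2}\sum_{j=m+1}^{2m+1} 1/j^2$. The sketch below assumes, for illustration, the half-shifted form $\frac{1}{a^2} + \frac{1}{(a+1)^2} + \cdots + \frac{1}{b^2} \approx \frac{1}{a - 1/2} - \frac{1}{b + 1/2}$, which comes from replacing each $1/j^2$ by the integral of $1/x^2$ over $[j - 1/2, j + 1/2]$, applied with $a = m + 1$ and $b = 2m + 1$. Both the formula and the function names are assumptions made here, not quotations from the book.

```python
def exact_sq_sum(a, b):
    """Exact sum of 1/j^2 for j = a, ..., b."""
    return sum(1 / j**2 for j in range(a, b + 1))

def shifted_sq_sum(a, b):
    """ASSUMPTION: midpoint-rule approximation 1/(a - 1/2) - 1/(b + 1/2)."""
    return 1 / (a - 0.5) - 1 / (b + 0.5)

def median_variance(m, lam, sq_sum):
    """Var[X_(m+1)] for n = 2m + 1 iid Exp(lam) lifetimes, using the given sum."""
    return sq_sum(m + 1, 2 * m + 1) / lam**2

for m in (1, 5, 50):
    exact = median_variance(m, 1.0, exact_sq_sum)
    approx = median_variance(m, 1.0, shifted_sq_sum)
    print(m, exact, approx, abs(exact - approx))
```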