Stat 435: Homework Assignments 1-8

Homework 1

1. We will perform k-nearest neighbors in this problem, in a setting with 2 classes, 25 observations per class, and p = 2 features. We will call one class the "red" class and the other class the "blue" class. The observations in the red class are drawn i.i.d. from a N_p(μ_r, I) distribution, and the observations in the blue class are drawn i.i.d. from a N_p(μ_b, I) distribution, where μ_r = (0, 0)ᵀ is the mean in the red class, and μ_b = (1.5, 1.5)ᵀ is the mean in the blue class.

(a) Generate a training set consisting of 25 observations from the red class and 25 observations from the blue class. (You will want to use the R function rnorm.) Plot the training set. Make sure that the axes are properly labeled, and that the observations are colored according to their class label.

(b) Now generate a test set consisting of 25 observations from the red class and 25 observations from the blue class. On a single plot, display both the training and test sets, using one symbol to indicate training observations (e.g. circles) and another symbol to indicate test observations (e.g. squares). Make sure that the axes are properly labeled, that the symbols for training and test observations are explained in a legend, and that the observations are colored according to their class label.

(c) Using the knn function in the class library, fit a k-nearest neighbors model on the training set, for a range of values of k from 1 to 20. Make a plot that displays the value of 1/k on the x-axis and classification error (both training error and test error) on the y-axis. Make sure all axes and curves are properly labeled. Explain your results.

(d) For the value of k that resulted in the smallest test error in part (c), make a plot displaying the test observations as well as their true and predicted class labels. Make sure that all axes and points are clearly labeled.

(e) In this example, what is the Bayes error rate? Justify your answer.
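The assignment is to be done in R with the class library's knn function; purely as an illustration of the simulation in parts (a)-(c), here is a minimal NumPy sketch of the same setup (the helper names make_set and knn_predict are ours, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_set(n_per_class=25):
    # 25 red observations ~ N_2((0,0), I) and 25 blue ~ N_2((1.5,1.5), I);
    # labels: 0 = red, 1 = blue.
    red = rng.normal(loc=0.0, size=(n_per_class, 2))
    blue = rng.normal(loc=1.5, size=(n_per_class, 2))
    return np.vstack([red, blue]), np.repeat([0, 1], n_per_class)

def knn_predict(X_train, y_train, X_query, k):
    # Majority vote among the k nearest training points (Euclidean distance).
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argpartition(d, k - 1, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

X_train, y_train = make_set()
X_test, y_test = make_set()
test_err = {k: (knn_predict(X_train, y_train, X_test, k) != y_test).mean()
            for k in range(1, 21, 2)}  # odd k avoids voting ties
```

Plotting test_err against 1/k (together with the training errors) gives the curve asked for in part (c).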
2. We will once again perform k-nearest neighbors in a setting with p = 2 features. But this time, we'll generate the data differently: let X₁ ∼ Unif[0, 1] and X₂ ∼ Unif[0, 1], i.e. the observations for each feature are i.i.d. from a uniform distribution. An observation belongs to class "red" if (X₁ − 0.5)² + (X₂ − 0.5)² > 0.15 and X₁ > 0.5; to class "green" if (X₁ − 0.5)² + (X₂ − 0.5)² > 0.15 and X₁ ≤ 0.5; and to class "blue" otherwise.

(a) Generate a training set of n = 200 observations. (You will want to use the R function runif.) Plot the training set. Make sure that the axes are properly labeled, and that the observations are colored according to their class label.

(b) Now generate a test set consisting of another 200 observations. On a single plot, display both the training and test sets, using one symbol to indicate training observations (e.g. circles) and another symbol to indicate test observations (e.g. squares). Make sure that the axes are properly labeled, that the symbols for training and test observations are explained in a legend, and that the observations are colored according to their class label.

(c) Using the knn function in the class library, fit a k-nearest neighbors model on the training set, for a range of values of k from 1 to 50. Make a plot that displays the value of 1/k on the x-axis and classification error (both training error and test error) on the y-axis. Make sure all axes and curves are properly labeled. Explain your results.

(d) For the value of k that resulted in the smallest test error in part (c), make a plot displaying the test observations as well as their true and predicted class labels. Make sure that all axes and points are clearly labeled.

(e) In this example, what is the Bayes error rate? Justify your answer, and explain how it relates to your findings in (c) and (d).
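The class-assignment rule in problem 2 is easy to misread from the inequalities alone; as a sketch, here it is as a plain Python function (the function name is ours):

```python
def true_class(x1, x2):
    # Points outside the disk of squared radius 0.15 around (0.5, 0.5)
    # are red (right half, x1 > 0.5) or green (left half, x1 <= 0.5);
    # the disk itself is blue.
    if (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2 > 0.15:
        return "red" if x1 > 0.5 else "green"
    return "blue"
```

Uniform draws for (X₁, X₂) plus this rule generate the labeled data needed for parts (a) and (b).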
3. For each scenario, determine whether it is a regression or a classification problem, determine whether the goal is inference or prediction, and state the values of n (sample size) and p (number of predictors).

(a) I want to predict each student's final exam score based on his or her homework scores. There are 50 students enrolled in the course, and each student has completed 8 homeworks.

(b) I want to understand the factors that contribute to whether or not a student passes this course. The factors that I consider are (i) whether or not the student has previous programming experience; (ii) whether or not the student has previously studied linear algebra; (iii) whether or not the student has taken a previous stats/probability course; (iv) whether or not the student attends office hours; (v) the student's overall GPA; (vi) the student's year (e.g. freshman, sophomore, junior, senior, or grad student). I have data for all 50 students enrolled in the course.

4. In each setting, would you generally expect a flexible or an inflexible statistical machine learning method to perform better? Justify your answer.

(a) Sample size n is very small, and number of predictors p is very large.
(b) Sample size n is very large, and number of predictors p is very small.
(c) Relationship between predictors and response is highly non-linear.
(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.

5. This question has to do with the bias-variance decomposition.

(a) Make a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods to more flexible approaches. The x-axis should represent the amount of flexibility in the model, and the y-axis should represent the values of each curve. There should be five curves. Make sure to label each one.

(b) Explain why each of the five curves has the shape displayed in (a).
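For problem 5, it may help to recall the decomposition that most of the curves come from: at a test point x₀, the expected test MSE splits into the variance of the fit, its squared bias, and the irreducible error:

```latex
E\big[(y_0 - \hat f(x_0))^2\big]
  = \mathrm{Var}\big(\hat f(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2
  + \mathrm{Var}(\varepsilon)
```

The training error curve is the one quantity not covered by this identity, which is why it can behave differently from the others as flexibility grows.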
6. This exercise involves the Boston housing data set, which is part of the MASS library in R.

(a) How many rows are in this data set? How many columns? What do the rows and columns represent?
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
(e) How many of the suburbs in this data set bound the Charles River?
(f) What are the mean and standard deviation of the pupil-teacher ratio among the towns in this data set?
(g) Which suburb of Boston has the highest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
(h) In this data set, how many of the suburbs average more than six rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

Homework 2

1. Suppose we have a quantitative response Y and a single feature X ∈ ℝ. Let RSS₁ denote the residual sum of squares that results from fitting the model Y = β₀ + β₁X + ε using least squares. Let RSS₁₂ denote the residual sum of squares that results from fitting the model Y = β₀ + β₁X + β₂X² + ε using least squares.

(a) Prove that RSS₁₂ ≤ RSS₁.
(b) Prove that the R² of the model containing just the feature X is no greater than the R² of the model containing both X and X².

2. Describe the null hypotheses to which the p-values in Table 3.4 of the textbook correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
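The claim in problem 1(a) can be sanity-checked numerically before proving it: the quadratic model nests the linear one, so its RSS can never be larger. A NumPy sketch on simulated data (the data and helper name are ours, not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)  # any data set will do

def rss(design):
    # Least squares fit of y on the given design matrix; return the RSS.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

rss1 = rss(np.column_stack([np.ones(50), x]))           # Y ~ 1 + X
rss12 = rss(np.column_stack([np.ones(50), x, x ** 2]))  # Y ~ 1 + X + X^2
```

The proof itself should argue this for every data set, not just a simulated one: any coefficient vector available to the smaller model is also available to the larger one (with β₂ = 0).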
3. Consider a linear model with just one feature, Y = β₀ + β₁X + ε. Suppose we have n observations from this model, (x₁, y₁), ..., (xₙ, yₙ). The least squares estimator is given in (3.4) of the textbook. Furthermore, we saw in class that if we construct an n × 2 matrix X̃ whose first column is a vector of 1's and whose second column is a vector with elements x₁, ..., xₙ, and if we let y denote the vector with elements y₁, ..., yₙ, then the least squares estimator takes the form

(β̂₀, β̂₁)ᵀ = (X̃ᵀX̃)⁻¹ X̃ᵀ y.   (1)

Prove that (1) agrees with equation (3.4) of the textbook, i.e. β̂₀ and β̂₁ in (1) equal β̂₀ and β̂₁ in (3.4).

4. This question involves the use of multiple linear regression on the Auto data set, which is available as part of the ISLR library.

(a) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. Provide an interpretation for the coefficient associated with the variable year. Make sure that you treat the qualitative variable origin appropriately.

(b) Try out some models to predict mpg using functions of the variable horsepower. Comment on the best model you obtain. Make a plot with horsepower on the x-axis and mpg on the y-axis that displays both the observations and the fitted function (i.e. f̂(horsepower)).

(c) Now fit a model to predict mpg using horsepower, origin, and an interaction between horsepower and origin. Make sure to treat the qualitative variable origin appropriately. Comment on your results. Provide a careful interpretation of each regression coefficient.
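The identity in problem 3 can be checked numerically before proving it: the matrix formula (1) and the scalar formulas in (3.4) of the textbook (slope from centered cross-products, intercept from the sample means) give the same coefficients. A NumPy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 0.5 + 1.7 * x + rng.normal(size=30)

# Matrix form (1): (b0, b1)' = (X'X)^{-1} X'y with X = [1, x].
X = np.column_stack([np.ones(30), x])
b0_mat, b1_mat = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar form (3.4): slope = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),
# intercept = ybar - slope * xbar.
xc, yc = x - x.mean(), y - y.mean()
b1 = (xc @ yc) / (xc @ xc)
b0 = y.mean() - b1 * x.mean()
```

The proof amounts to inverting the 2 × 2 matrix X̃ᵀX̃ by hand and simplifying.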
5. Consider fitting a model to predict credit card balance using income and student, where student is a qualitative variable that takes on one of three values: student ∈ {graduate, undergraduate, not student}.

(a) Encode the student variable using two dummy variables, one of which equals 1 if student = graduate (and 0 otherwise), and one of which equals 1 if student = undergraduate (and 0 otherwise). Write out an expression for a linear model to predict balance using income and student, using this coding of the dummy variables. Interpret the coefficients in this linear model.

(b) Now encode the student variable using two dummy variables, one of which equals 1 if student = not student (and 0 otherwise), and one of which equals 1 if student = graduate (and 0 otherwise). Write out an expression for a linear model to predict balance using income and student, using this coding of the dummy variables. Interpret the coefficients in this linear model.

(c) Using the coding in (a), write out an expression for a linear model to predict balance using income, student, and an interaction between income and student. Interpret the coefficients in this model.

(d) Using the coding in (b), write out an expression for a linear model to predict balance using income, student, and an interaction between income and student. Interpret the coefficients in this model.

(e) Using simulated data for balance, income, and student, show that the fitted values (predictions) from the models in (a)-(d) do not depend on the coding of the dummy variables (i.e. the models in (a) and (b) yield the same fitted values, as do the models in (c) and (d)).

6. Extra credit. Consider a linear model with just one feature, Y = β₀ + β₁X + ε, with E(ε) = 0 and Var(ε) = σ². Suppose we have n observations from this model, (x₁, y₁), ..., (xₙ, yₙ). We assume that x₁, ..., xₙ are fixed, so the only randomness in the model comes from ε₁, ..., εₙ.
Use (3.4) in the textbook — or, if you prefer, the matrix algebra formulation in (1) of this homework assignment — in order to derive the expressions for Var(β̂₀) and Var(β̂₁) given in (3.8) of the textbook.

Homework 3

1. A random variable X has an Exponential(λ) distribution if its probability density function is of the form

f(x) = λe^(−λx) if x > 0, and f(x) = 0 if x ≤ 0,

where λ > 0 is a parameter. Furthermore, the mean of an Exponential(λ) random variable is 1/λ. Now, consider a classification problem with K = 2 classes and a single feature X ∈ ℝ. If an observation is in class 1 (i.e. Y = 1), then X ∼ Exponential(λ₁); and if an observation is in class 2 (i.e. Y = 2), then X ∼ Exponential(λ₂). Let π₁ denote the probability that an observation is in class 1, and let π₂ = 1 − π₁.

(a) Derive an expression for Pr(Y = 1 | X = x). Your answer should be in terms of x, λ₁, λ₂, π₁, π₂.

(b) Write a simple expression for the Bayes classifier decision boundary, i.e. an expression for the set of x such that Pr(Y = 1 | X = x) = Pr(Y = 2 | X = x).

(c) For part (c) only, suppose λ₁ = 2, λ₂ = 7, π₁ = 0.5. Make a plot of feature space. Clearly label:
i. the region of feature space corresponding to the Bayes classifier decision boundary,
ii. the region of feature space for which the Bayes classifier will assign an observation to class 1,
iii. the region of feature space for which the Bayes classifier will assign an observation to class 2.

(d) Now suppose that we observe n independent training observations, (x₁, y₁), ..., (xₙ, yₙ). Provide simple estimators for λ₁, λ₂, π₁, π₂ in terms of the training observations.

(e) Given a test observation X = x₀, provide an estimate of Pr(Y = 1 | X = x₀). Your answer should be written only in terms of the n training observations (x₁, y₁), ..., (xₙ, yₙ) and the test observation x₀, and not in terms of any unknown parameters.
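Parts (a)-(c) come down to Bayes' rule with Exponential class-conditional densities. A small Python sketch of that computation with the part (c) parameters, useful for checking a derivation numerically (the function name is ours):

```python
import math

def posterior_class1(x, lam1=2.0, lam2=7.0, pi1=0.5):
    # Bayes' rule for x > 0: Pr(Y = 1 | X = x) is proportional to
    # pi1 * lam1 * exp(-lam1 * x), and class 2 analogously.
    p1 = pi1 * lam1 * math.exp(-lam1 * x)
    p2 = (1 - pi1) * lam2 * math.exp(-lam2 * x)
    return p1 / (p1 + p2)

# The posteriors cross where the weighted densities are equal; with these
# parameters that happens at x = log(3.5) / 5.
boundary = math.log((0.5 * 7.0) / (0.5 * 2.0)) / (7.0 - 2.0)
```

Because λ₁ < λ₂, the class-2 density is larger near zero and the class-1 density dominates for large x, so the posterior for class 1 increases in x.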
2. We collect some data for students in a statistics class, with predictors X₁ = number of lectures attended, X₂ = average number of hours studied per week, and response Y = receive an A. We fit a logistic regression model, and get coefficient estimates β̂₀, β̂₁, β̂₂.

(a) Write out an expression for the probability that a student gets an A, as a function of the number of lectures she attended and the average number of hours she studied per week. Your answer should be written in terms of X₁, X₂, β̂₀, β̂₁, β̂₂.

(b) Write out an expression for the minimum number of hours a student should study per week in order to have at least an 80% chance of getting an A. Your answer should be written in terms of X₁, X₂, β̂₀, β̂₁, β̂₂.

(c) Based on a student's values of X₁ and X₂, her predicted probability of getting an A in this course is 60%. If she increases her studying by one hour per week, then what will be her predicted probability of getting an A in this course?

3. When the number of features p is large, there tends to be a deterioration in the performance of K-nearest neighbors (KNN) and other approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality. We will now investigate this curse.

(a) Suppose that we have a set of observations, each with measurements on p = 1 feature, X. We assume that X is uniformly distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation's response using only observations that are within 10% of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?
(b) Now suppose that we have a set of observations, each with measurements on p = 2 features, X₁ and X₂. We assume that (X₁, X₂) are uniformly distributed on [0, 1] × [0, 1]. We wish to predict a test observation's response using only observations that are within 10% of the range of X₁ and within 10% of the range of X₂ closest to that test observation. For instance, in order to predict the response for a test observation with X₁ = 0.6 and X₂ = 0.35, we will use observations in the range [0.55, 0.65] for X₁ and in the range [0.3, 0.4] for X₂. On average, what fraction of the available observations will we use to make the prediction?

(c) Now suppose that we have a set of observations on p = 100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

(d) Using your answers to parts (a)-(c), argue that a drawback of KNN when p is large is that there are very few training observations "near" any given test observation.

(e) Now suppose that we wish to make a prediction for a test observation by creating a p-dimensional hypercube centered around the test observation that contains, on average, 10% of the training observations. For p = 1, 2, and 100, what is the length of each side of the hypercube? Comment on your answer. Note: a hypercube is a generalization of a cube to an arbitrary number of dimensions. When p = 1, a hypercube is simply a line segment; when p = 2, it is a square.

4. Pick a data set of your choice. It can be chosen from the ISLR package (but not one of the data sets explored in the Chapter 4 lab, please!), or it can be another data set that you choose. Choose a binary qualitative variable in your data set to be the response, Y.
(By binary qualitative variable, I mean a qualitative variable with K = 2 classes.) If your data set doesn't have any binary qualitative variables, then you can create one (e.g. by dichotomizing a continuous variable: create a new variable that equals 1 or 0 depending on whether the continuous variable takes on values above or below its median). I suggest selecting a data set with n ≫ p.

(a) Describe the data. What are the values of n and p? What are you trying to predict, i.e. what is the meaning of Y? What is the meaning of the features?

(b) Split the data into a training set and a test set. Perform LDA on the training set in order to predict Y using the features. What is the training error of the model obtained? What is the test error?

(c) Perform QDA on the training set in order to predict Y using the features. What is the training error of the model obtained? What is the test error?

(d) Perform logistic regression on the training set in order to predict Y using the features. What is the training error of the model obtained? What is the test error?

(e) Perform KNN on the training set in order to predict Y using the features. What is the training error of the model obtained? What is the test error?

(f) Comment on your results.

Homework 4

1. Consider the validation set approach, with a 50/50 split into training and validation sets:

(a) Suppose you perform the validation set approach twice, each time with a different random seed. What's the probability that an observation, chosen at random, is in both of those training sets?

(b) If you perform the validation set approach repeatedly, will you get the same result each time? Explain your answer.

2. Consider K-fold cross-validation:

(a) Consider the observations in the 1st fold's training set and the observations in the 2nd fold's training set. What's the probability that an observation, chosen at random, is in both of those training sets?

(b) If you perform K-fold CV repeatedly, will you get the same result each time?
Explain your answer.

3. Now consider leave-one-out cross-validation:

(a) Consider the observations in the 1st fold's training set and the observations in the 2nd fold's training set. What's the probability that an observation, chosen at random, is in both of those training sets?

(b) If you perform leave-one-out cross-validation repeatedly, will you get the same result each time? Explain your answer.

4. Consider a very simple model, Y = β + ε, where Y is a scalar response variable, β ∈ ℝ is an unknown parameter, and ε is a noise term with E(ε) = 0 and Var(ε) = σ². Our goal is to estimate β. Assume that we have n observations with uncorrelated errors.

(a) Suppose that we perform least squares regression using all n observations. Prove that the least squares estimator, β̂, equals (1/n) Σᵢ₌₁ⁿ Yᵢ.

(b) Suppose that we perform least squares using all n observations. Prove that the least squares estimator, β̂, has variance σ²/n.

(c) Consider the least squares estimator of β fit using just n/2 observations. What is the variance of this estimator?

(d) Consider the least squares estimator of β fit using n(K − 1)/K observations, for some K > 2. What is the variance of this estimator?

(e) Consider the least squares estimator of β fit using n − 1 observations. What is the variance of this estimator?

(f) Derive an expression for E(β̂), where β̂ is the least squares estimator fit using all n observations.

(g) Using your results from the earlier sections of this question, argue that the validation set approach tends to over-estimate the expected test error.

(h) Using your results from the earlier sections of this question, argue that leave-one-out cross-validation does not substantially over-estimate the expected test error, provided that n is large.
(i) Using your results from the earlier sections of this question, argue that K-fold CV provides an over-estimate of the expected test error that is somewhere between the big over-estimate resulting from the validation set approach and the very mild over-estimate resulting from leave-one-out CV.

5. As in the previous problem, assume Y = β + ε, where Y is a scalar response variable, β ∈ ℝ is an unknown parameter, and ε is a noise term with E(ε) = 0 and Var(ε) = σ². Our goal is to estimate β. Assume that we have n observations with uncorrelated errors.

(a) Suppose that we perform K-fold cross-validation. What is the correlation between β̂⁽¹⁾, the least squares estimator of β that we obtain from the 1st fold, and β̂⁽²⁾, the least squares estimator of β that we obtain from the 2nd fold?

(b) Suppose that we perform the validation set approach twice, each time using a different random seed. Assume further that exactly 0.25n observations overlap between the two training sets. What is the correlation between β̂⁽¹⁾, the least squares estimator of β that we obtain the first time that we perform the validation set approach, and β̂⁽²⁾, the least squares estimator of β that we obtain the second time that we perform the validation set approach?

(c) Now suppose that we perform leave-one-out cross-validation. What is the correlation between β̂⁽¹⁾, the least squares estimator of β that we obtain from the 1st fold, and β̂⁽²⁾, the least squares estimator of β that we obtain from the 2nd fold?

Remark 1: Problem 5 indicates that the β̂'s that you estimate using LOOCV are very correlated with each other.

Remark 2: You might remember from an earlier stats class that if X₁, ..., Xₙ are uncorrelated with variance σ² and mean μ, then the variance of (1/n) Σᵢ₌₁ⁿ Xᵢ equals σ²/n. But if Cor(Xᵢ, Xₖ) > 0 for i ≠ k, then the variance of (1/n) Σᵢ₌₁ⁿ Xᵢ is quite a bit higher.
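Remark 2 is easy to verify by simulation: build equicorrelated Xᵢ from a shared component and compare the variance of their mean with the uncorrelated benchmark σ²/n. A NumPy sketch (the construction and the constants n, reps, rho are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps, rho = 20, 20000, 0.5  # unit variance: sigma^2 = 1

# X_i = sqrt(rho) * Z + sqrt(1 - rho) * W_i, with Z and W_i independent
# standard normals, gives Var(X_i) = 1 and Cor(X_i, X_k) = rho for i != k.
shared = np.sqrt(rho) * rng.normal(size=(reps, 1))
indiv = np.sqrt(1 - rho) * rng.normal(size=(reps, n))
means = (shared + indiv).mean(axis=1)

var_uncorrelated = 1 / n              # sigma^2 / n = 0.05
var_correlated = rho + (1 - rho) / n  # = 0.525 here; shrinks to rho, not 0
empirical = means.var()
```

Note that as n grows, the variance of the mean approaches ρ rather than 0, which is the mechanism behind the high variance of the LOOCV estimate in Remark 3.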
Remark 3: Together, problems 4 and 5 might give you some intuition for the following: LOOCV results in an approximately unbiased estimator of expected test error (if n is large), but this estimator has high variance. In contrast, K-fold CV results in an estimator of expected test error that has higher bias, but lower variance.

Homework 5

1. In this exercise, you will generate simulated data, and will use this data to perform best subset selection.

(a) Use the rnorm() function to generate a predictor X of length n = 100, and a noise vector ε of length n = 100.

(b) Generate a response vector Y of length n = 100 according to the model Y = 3 − 2X + X² + ε.

(c) Use the regsubsets() function to perform best subset selection, considering X, X², ..., X⁷ as candidate predictors. Make a plot like Figure 6.2 in the textbook. What is the overall best model according to Cp, BIC, and adjusted R²? Report the coefficients of the best model obtained. Comment on your results.

(d) Repeat (c) using forward stepwise selection instead of best subset selection.

(e) Repeat (c) using backward stepwise selection instead of best subset selection.

Hint: You may need to use the data.frame() function to create a single data set containing both X and Y.

2. In class, we discussed the fact that if you choose a model using stepwise selection on a data set, and then fit the selected model using least squares on the same data set, then the resulting p-values output by R are highly misleading. We'll now see this through simulation.

(a) Use the rnorm() function to generate vectors X₁, X₂, ..., X₁₀₀ and ε, each of length n = 1000. (Hint: use the matrix() function to create a 1000 × 100 data matrix.)

(b) Generate data according to Y = β₀ + β₁X₁ + ... + β₁₀₀X₁₀₀ + ε, where β₁ = ... = β₁₀₀ = 0.

(c) Fit a least squares regression model to predict Y using X₁, ..., Xₚ. Make a histogram of the p-values associated with the null hypotheses H₀ⱼ: βⱼ = 0 for j = 1, ..., 100.
Hint: You can easily access these p-values using the command (summary(lm(y~X)))$coef[,4].

(d) Recall that under H₀ⱼ: βⱼ = 0, we expect the p-values to have a Unif[0, 1] distribution. In light of this fact, comment on your results in (c). Do any of the features appear to be significantly associated with the response?

(e) Perform forward stepwise selection in order to identify M₂, the best two-variable model. (For this problem, there is no need to calculate the best model Mₖ for k ≠ 2.) Then fit a least squares regression model to the data, using just the features in M₂. Comment on the p-values obtained for the coefficients.

(f) Now generate another 1000 observations by repeating the procedure in (a) and (b). Using the new observations, fit a least squares linear model to predict Y using just the features in M₂ calculated in (e). (Do not perform forward stepwise selection again using the new observations! Instead, take the M₂ obtained earlier in this problem.) Comment on the p-values for the coefficients. How do they compare to the p-values in (e)?

(g) Are the features in M₂ significantly associated with the response? Justify your answer.

THE BOTTOM LINE: If you showed a friend the p-values obtained in (e), without explaining that you obtained M₂ by performing forward stepwise selection on this same data, then he or she might incorrectly conclude that the features in M₂ are highly associated with the response.

3. Let's consider doing least squares and ridge regression under a very simple setting, in which p = 1 and Σᵢ₌₁ⁿ yᵢ = Σᵢ₌₁ⁿ xᵢ = 0. We consider regression without an intercept. (It's usually a bad idea to do regression without an intercept, but if our feature and response each have mean zero, then it is okay to do this!)

(a) The least squares solution is the value of β ∈ ℝ that minimizes Σᵢ₌₁ⁿ (yᵢ − βxᵢ)². Write out an analytical (closed-form) expression for this least squares solution. Your answer should be a function of x₁, ..., xₙ and y₁, ...
, yₙ. Hint: calculus!

(b) For a given value of λ, the ridge regression solution minimizes Σᵢ₌₁ⁿ (yᵢ − βxᵢ)² + λβ². Write out an analytical (closed-form) expression for the ridge regression solution, in terms of x₁, ..., xₙ and y₁, ..., yₙ and λ.

(c) Suppose that the true data-generating model is Y = 3X + ε, where ε has mean zero, and X is fixed (non-random). What is the expectation of the least squares estimator from (a)? Is it biased or unbiased?

(d) Suppose again that the true data-generating model is Y = 3X + ε, where ε has mean zero, and X is fixed (non-random). What is the expectation of the ridge regression estimator from (b)? Is it biased or unbiased? Explain how the bias changes as a function of λ.

(e) Suppose that the true data-generating model is Y = 3X + ε, where ε has mean zero and variance σ², X is fixed (non-random), and Cov(εᵢ, εᵢ′) = 0 for all i ≠ i′. What is the variance of the least squares estimator from (a)?

(f) Suppose that the true data-generating model is Y = 3X + ε, where ε has mean zero and variance σ², X is fixed (non-random), and Cov(εᵢ, εᵢ′) = 0 for all i ≠ i′. What is the variance of the ridge estimator from (b)? How does the variance change as a function of λ?

(g) In light of your answers to parts (d) and (f), argue that λ in ridge regression allows us to control model complexity by trading off bias for variance.

Hint: For this problem, you might want to brush up on some basic properties of means and variances! For instance, if Cov(Z, W) = 0, then Var(Z + W) = Var(Z) + Var(W). And if a is a constant, then Var(aW) = a²Var(W), and Var(a + W) = Var(W).

4. Suppose that you collect data to predict Y (height in inches) using X (weight in pounds). You fit a least squares model to the data, and you get Ŷ = 3.1 + 0.57X.

(a) Suppose you decide that you want to measure weight in ounces instead of pounds. Write out the least squares model for predicting Y using X̃ (weight in ounces).
(You should calculate the coefficient estimates explicitly.) Hint: there are 16 ounces in a pound!

(b) Consider fitting a least squares model to predict Y using X and X̃. Let β denote the coefficient for X in the least squares model, and let β̃ denote the coefficient for X̃. Argue that any equation of the form Ŷ = 3.1 + βX + β̃X̃, where β + 16β̃ = 0.57, is a valid least squares model.

(c) Suppose that you use ridge regression to predict Y using X, using some value of λ, and obtain the fitted model Ŷ = 3.1 + 0.4X. Now consider fitting a ridge regression model to predict Y using X̃, again using that same value of λ. Will the coefficient of X̃ be equal to 0.4/16, greater than 0.4/16, or less than 0.4/16? Explain your answer.

(d) For the same value of λ considered in (c), suppose you perform ridge regression to predict Y using X, and separately you perform ridge regression to predict Y using X̃. Which fitted model will have the smaller residual sum of squares (on the training set)? Explain your answer.

(e) Finally, suppose you use ridge regression to predict Y using X and X̃, using some value of λ (not necessarily the same value of λ used in (d)), and obtain the fitted model Ŷ = 3.17 + 0.03X + 0.03X̃. Is the following claim true or false? Explain your answer. Claim: any equation of the form Ŷ = 3.17 + βX + β̃X̃, where β + 16β̃ = 0.03 + 16 × 0.03 = 0.51, is a valid ridge regression solution for that value of λ.

(f) Argue that your answers to the previous sub-problems support the following claim: least squares is scale-invariant, but ridge regression is not.

5. Suppose we wish to fit a linear regression model using least squares. Let M_k^BSS, M_k^FWD, and M_k^BWD denote the best k-feature models in the best subset, forward stepwise, and backward stepwise selection procedures, respectively. (For notational details, see Algorithms 6.1, 6.2, and 6.3 of the textbook.) Recall that the training set residual sum of squares (or RSS for short) is defined as Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)².
For each claim, fill in the blank with one of the following: "less than", "less than or equal to", "greater than", "greater than or equal to", "equal to". Say "not enough information to tell" if it is not possible to complete the sentence as given. Explain each of your answers.

(a) Claim: The RSS of M_1^FWD is ___ the RSS of M_1^BWD.
(b) Claim: The RSS of M_0^FWD is ___ the RSS of M_0^BWD.
(c) Claim: The RSS of M_1^FWD is ___ the RSS of M_1^BSS.
(d) Claim: The RSS of M_2^FWD is ___ the RSS of M_1^BSS.
(e) Claim: The RSS of M_1^BWD is ___ the RSS of M_1^BSS.
(f) Claim: The RSS of M_p^BWD is ___ the RSS of M_p^BSS.
(g) Claim: The RSS of M_{p−1}^BWD is ___ the RSS of M_{p−1}^BSS.
(h) Claim: The RSS of M_4^BWD is ___ the RSS of M_4^BSS.
(i) Claim: The RSS of M_4^BWD is ___ the RSS of M_4^FWD.
(j) Claim: The RSS of M_4^BWD is ___ the RSS of M_3^BWD.

6. This problem is extra credit! Let y denote an n-vector of response values, and let X denote an n × p design matrix. We can write the ridge regression problem as

minimize over β ∈ ℝᵖ: ‖y − Xβ‖² + λ‖β‖²,

where we are omitting the intercept for convenience. Derive an analytical (closed-form) expression for the ridge regression estimator. Your answer should be a function of X, y, and λ.

Homework 6

1. For this problem, you will analyze a data set of your choice, not taken from the ISLR package. I suggest choosing a data set that has p ≈ n or even p > n, since you will apply methods from Chapter 6 to this data.

(a) Describe the data in words. Where did you get it from, and what is the data about? You will perform supervised learning on this data, so you must identify a response, Y, and features, X₁, ..., Xₚ. What are the values of n and p? Describe the response and the features (e.g. what are they measuring; are they quantitative or qualitative?). Plot some summary statistics of the data.

(b) Split the data into a training set and a test set. What are the values of n and p on the training set?
(c) Fit a linear model using least squares on the training set, and report the test error obtained.

(d) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

(e) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

(f) Fit a principal components regression model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

(g) Fit a partial least squares model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

(h) Comment on the results obtained. How accurate is the best model you obtained, in terms of test error? Is there much difference among the test errors resulting from these approaches? Which model do you prefer?

2. Define the basis functions b_1(X) = I(−1 < X ≤ 1) − (2X − 1)I(1 < X ≤ 3) and b_2(X) = (X + 1)I(3 < X ≤ 5) − I(5 < X ≤ 6). We fit the linear regression model Y = β_0 + β_1 b_1(X) + β_2 b_2(X) + ε, and obtain coefficient estimates β̂_0 = 2, β̂_1 = −1, β̂_2 = 2. Sketch the estimated curve between X = −3 and X = 8. Note the intercepts, slopes, and other relevant information.

1. For this problem, you will analyze a data set of your choice, not taken from the ISLR package. Choose a data set that has n ≫ p, since you will apply methods from Chapter 7 to this data. You will also need to have p > 1. Throughout this problem, make sure to label your axes appropriately, and to include legends when needed.

(a) Describe the data in words. Where did you get it from, and what is the data about? You will perform supervised learning on this data, so you must identify a response, Y, and features, X_1, ..., X_p. What are the values of n and p? Describe the response and the features (e.g.
what are they measuring; are they quantitative or qualitative?).

(b) Fit a generalized additive model, Y = f_1(X_1) + ... + f_p(X_p) + ε. Use cross-validation to choose the level of complexity. For j = 1, ..., p, make a scatterplot of X_j against Y, and plot f̂_j(X_j). Comment on your results and on the choices you made in fitting this model.

(c) Now fit a linear model, Y = β_0 + β_1 X_1 + ... + β_p X_p + ε. For j = 1, ..., p, display the linear fit (X_j β̂_j) on top of a scatterplot of X_j against Y.

(d) Estimate the test error of the generalized additive model and the test error of the linear model. Comment on your results. Which approach gives a better fit to the data?

2. In this problem, we'll play around with regression splines.

(a) Generate data as follows:

set.seed(7)
x


[SOLVED] Comp 416 projects #1 to 3 solution

Introduction and Motivation: This project is about the application layer of the network protocol stack. It involves application layer software development, client/server protocols, application layer protocol principles, socket programming, and multithreading. Through this project, you are going to develop a weather reporting network application (WeatNet) by interacting with the application programming interface (API) of OpenWeatherMap (openweathermap.org). The project will require you to work with the following APIs for information extraction from the OWM web server:

1. Current weather forecast
2. Daily forecast for 7 days
3. Basic weather maps
4. Minute forecast for 1 hour
5. Historical weather for 5 days

These APIs provide you access to the weather data, which will subsequently be accessed by the clients.

Project Overview: In this project, you are asked to develop a weather reporting network application based on the client/server model. The WeatNet server provides two types of TCP connections to interact with the clients: one connection for exchanging the protocol commands, and one for data transfers. Fig. 1 shows the connections for a sample WeatNet server and client interaction. As shown in this figure, the WeatNet server also takes the responsibility of interacting with an OpenWeatherMap web server using the OpenWeatherMap API.

Figure 1. The OWM Client/Server connections.

Implementation Details: The WeatNet reporting application has three main components:
● Interaction of OpenWeatherMap (OWM) and the server
● Server side of the application
● Client side of the application

It is pertinent to note that the free subscription for OWM allows up to 60 API calls per minute and a total of 1 million calls per month. The developed application and its testing, including the demonstration, must keep these limits in mind.

Phases: This project has two phases: an authentication phase and a querying phase.
During the authentication phase, the client provides its username and answers a series of secret questions to prove its identity. The protocol (the message flow, message types, and their formats) is provided in the "Authentication" section. The groups need only implement the provided protocol. During the querying phase, the authenticated clients communicate with the server to retrieve weather information conforming to the specifications explained in the following sections. Unlike the authentication phase, we do not provide a protocol here but expect you to design your own. This will be a part of your report. You may use the authentication protocol as a starting point and build from there.

City List: The WeatNet reporting will be done for the following cities:

1. {'id': 745044, 'name': 'Istanbul', 'state': '', 'country': 'TR', 'coord': {'lon': 28.949659, 'lat': 41.01384}}
2. {'id': 740264, 'name': 'Samsun', 'state': '', 'country': 'TR', 'coord': {'lon': 36.330002, 'lat': 41.286671}}
3. {'id': 315201, 'name': 'Eskişehir', 'state': '', 'country': 'TR', 'coord': {'lon': 31.16667, 'lat': 39.666672}}
4. {'id': 323784, 'name': 'Ankara', 'state': '', 'country': 'TR', 'coord': {'lon': 32.833328, 'lat': 39.916672}}
5. {'id': 304919, 'name': 'Malatya', 'state': '', 'country': 'TR', 'coord': {'lon': 38.0, 'lat': 38.5}}
6. {'id': 750268, 'name': 'Bursa', 'state': '', 'country': 'TR', 'coord': {'lon': 29.08333, 'lat': 40.166672}}
7. {'id': 311044, 'name': 'İzmir', 'state': '', 'country': 'TR', 'coord': {'lon': 27.092291, 'lat': 38.462189}}

You can find the entire list of cities supported by OWM here. Please note that OWM may use multiple coordinates within the same city. The developed application must ensure that, even if names are duplicated, the coordinates are those of the cities provided in the list. Here, the city IDs will be useful when using the OWM API.
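To make the API interaction concrete, here is a minimal sketch of building a current-weather request URL from one of the city IDs above. The endpoint path follows OWM's public REST API; the API key is a placeholder you would obtain from your own OWM account, and the class and method names are ours, not part of any handout.

```java
public class OwmUrl {
    // Builds a current-weather request URL for a given OWM city ID.
    // The "appid" query parameter carries your personal API key.
    public static String currentWeather(int cityId, String apiKey) {
        return "https://api.openweathermap.org/data/2.5/weather"
                + "?id=" + cityId + "&appid=" + apiKey;
    }

    public static void main(String[] args) {
        // 745044 is Istanbul's city ID from the list above; the key is a placeholder.
        System.out.println(currentWeather(745044, "YOUR_API_KEY"));
    }
}
```

Fetching this URL (e.g. with HttpURLConnection) returns a JSON document that the server can save to a file before forwarding it to clients; the other metric endpoints follow the same key-and-ID pattern.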
WeatNet Server Side Overview: The server for the WeatNet application should:
1) Establish a connection with the OpenWeatherMap (OWM) web server using the API and download the specified weather metrics for all the cities specified for this project.
2) Authenticate any client before initiating any data exchange.
3) Allow multiple clients to connect, with the same functionality, using multithreading.
4) Add a timestamp to each file before sending it to any client, while providing hashes of the files for error-detection purposes.

The metric categories for which the OWM API will be used are:
1. Current weather forecast
2. Weather triggers
3. Basic weather maps
4. Minute forecast for 1 hour
5. Historical weather for 5 days

All these metrics except the weather maps are to be downloaded in JSON format as separate files, whereas the basic maps will be downloaded as images (.jpg, .png, etc.).

API Interactions: For the interactions to take place, the client will pass the name/ID of the city and/or the weather triggers. All this information shall be conveyed between the client and the server on the Command socket (the socket allocated to the client after acceptance and validation). Based on this input, the server will parse the client input and use the API to extract the relevant information from the web server. This information will then be passed to the client in the form of a JSON/image file over the Data socket. The server will be responsible for generating a hash value of each file before transmitting it to the client, and this hash will also be sent to the client over the Command socket for file verification.

The process at the server side is:
1) Create a welcoming socket.
2) Accept incoming client connections at the welcoming socket, simultaneously if needed using multithreading.
3) Authenticate any incoming client connection request after acceptance. Create an additional Data socket with the client if authenticated. Terminate the connection if authentication fails.
(The demonstration should present both scenarios, in which clients are validated as well as rejected.)
4) Decipher the requests coming in from the clients and download the required files from the web server. (The protocol for forwarding requests has to be developed by the groups themselves and explained in their reports. A similar format is provided in the Authentication section.)
5) Generate the hash values of the files requested by the client.
6) Send the hash value of the required file over the Command socket.
7) Send the requisite file over the Data socket.
8) Terminate the connection if a "file received" acknowledgment message is received from the client and no other files are requested within the timeout duration.

The exact implementation of the Data socket is left to the choice of the groups. It can be a single socket shared by the clients through multithreading, or a dedicated socket for each client. The initiation of the Data socket and the exact nature of how its parameters are passed to the client are also to be decided by the groups. The parameters of the Data socket are to be conveyed over the Command socket if required.

WeatNet Client Side Overview: This weather application envisions multiple clients interacting with a single server. For this application, the clients should:
1) Confirm their authenticity with the server.
2) Be able to submit requests to the server.
3) Be able to receive data in the form of JSON/image files from the server over the Data socket.
4) Verify each file based on its hash value.
5) Display the JSON data in tabular form, or display the image, on the client side.

Client-Server Interaction: The client side must take care of the parameters to be passed when requesting any metric. The process at the client side will follow the given steps:
1) Initiate a connection with the server over the Command socket.
2) Authenticate based on the server requirements.
3) Receive the parameters for the Data socket and connect to it after authentication.
4) Pass the requests to the server.
5) Receive the hash value of the file on the Command socket.
6) Receive the files over the Data socket.
7) Confirm that the hash value corresponds to the file.
8) Request a retransmit from the server if there is a mismatch between the hash value and the file, or a failure to receive the file. (A relevant string should be displayed in the terminal; a scenario for this step may be specifically designed for demonstration purposes.)
9) Display the files in the appropriate manner.
10) Terminate the connection once an appropriate file is received and no other request is forwarded within the timeout duration. (Appropriate tests for demonstration purposes should be developed.)

Authentication Phase: In our implementation, the server does not share the weather information with everyone. The client needs to be authenticated in order to be able to query the server, which will be done by a series of "challenges" (i.e., secret questions) instead of simple password-based authentication. After the authentication is done, the client will receive a "token" from the server. The token will act as a "proof" of the authenticity of the client. From that point on, the client will need to append this token to its requests, and once the server receives a request, it responds only if the token is valid. Please note that the authenticated clients are authorized to perform all possible weather queries, so we do not distinguish between authentication and authorization for the sake of simplicity.

Fig. 2. The message flows for the authentication phase.

First, the client will send its username to request to be authenticated (i.e., to acquire a token). Then, the server will authenticate the client by requesting the answers to a series of secret questions that are included in Auth_Challenge messages. After receiving a question, the client will prompt the user, the user will enter the answer through standard input, and the client will send the answer by including it in an Auth_Request message.
Some example questions and answers:
– "What is your favorite color?" – "red"
– "What is the first name of your favorite author?" – "kadri"
– "In which city were you born?" – "istanbul"
– "What is the last name of your best friend?" – "zorlu"
– "What is your goal in this course?" – "to get an A"

The server decides how many questions the client should answer, and the questions will be chosen from a pool of possible questions. The correct answers to these questions for the particular user will be known by the server. If all the questions are answered correctly by the client, the server will send back an Auth_Success, including a unique token for the client. If the client answers a question incorrectly, the server will immediately respond with an Auth_Fail with the failure reason "Incorrect answer". Please note that the server may also send back an Auth_Fail when the first Auth_Request includes a nonexistent username. In this case, the reason for failure should be "User does not exist". Fig. 2 illustrates the intended message flow, where q_i denotes the i-th question and a_i denotes the client's answer to q_i.

a. Message format
In protocol design, (1) the message types, and (2) the format and meaning of the values in each type of message must be clearly specified. The client and the server will handle the received messages according to their types. Similarly, they will construct the messages adhering to the protocol.

Message types: During the authentication phase, the client is able to send only Auth_Request messages, and the server is able to send Auth_Challenge, Auth_Fail and Auth_Success messages.

Message type     Value   Payload
Auth_Request     0       Username/Answer (String)
Auth_Challenge   1       Question (String)
Auth_Fail        2       Reason of failure (String)
Auth_Success     3       Token (String)

Deconstructing the TCP data: The TCP data received from the socket must be deconstructed correctly.
The first six bytes are designated as the application header, where the first byte represents the "phase" (either the authentication (0) or querying (1) phase) and the second byte represents the "type" of the message. If the "phase" byte is set to 0, the messages will be handled by the authentication module of your implementation. Otherwise, your implementation should hand over the handling of the request to the weather querying module. The remaining four bytes of the header are designated as an integer (4 bytes) denoting the length of the payload in bytes. Your application should use this value to read the correct number of bytes from the TCP stream as the payload, since we do not know the length of the payload beforehand. Fig. 3 shows the deconstruction of an authentication message. Here are some useful links:
https://docs.oracle.com/javase/7/docs/api/java/io/DataInputStream.html
https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html
https://docs.oracle.com/javase/7/docs/api/java/io/FilterOutputStream.html

Fig. 3. Deconstructing the TCP data.

b. Timeout mechanism
You also need to handle the case where the client is unresponsive to a question. After sending a challenge, the server should wait only a predetermined amount of time (e.g. 10 seconds) before sending an Auth_Fail to the client with the appropriate reason message and closing the connection.

c. Implementation details
Storing users, questions, and answers: For simplicity, you may want to keep all the users, questions, and answers in an easily parsable text file. Fig. 4 illustrates an example of such a file, where we have two users – "ali" and "veli" – with different questions and answers.

Fig. 4. An example of a text file storing users, questions and answers.

The token: The token should be unique for each session. Ideally, tokens should be constructed on the fly. You can construct a token by performing a hash on the concatenation of the username and a random number.
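The six-byte header framing and the token construction described in this section (hash the username plus a random number, then truncate) can be sketched with the DataInputStream/DataOutputStream classes linked above. This is a minimal illustration, not the handout's code: the class and method names are our own, and SHA-256 with a 6-character hex prefix is just one reasonable choice.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;

public class WeatNetCodec {
    // Writes one message: 1-byte phase, 1-byte type, 4-byte payload length, payload.
    public static void writeMessage(DataOutputStream out, byte phase, byte type,
                                    String payload) throws IOException {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        out.writeByte(phase);
        out.writeByte(type);
        out.writeInt(body.length);  // 4-byte big-endian payload length
        out.write(body);
        out.flush();
    }

    // Reads one message; returns {phase, type, payload} as an Object[] for brevity.
    public static Object[] readMessage(DataInputStream in) throws IOException {
        byte phase = in.readByte();
        byte type = in.readByte();
        int len = in.readInt();     // tells us exactly how many payload bytes follow
        byte[] body = new byte[len];
        in.readFully(body);         // blocks until the whole payload has arrived
        return new Object[]{phase, type, new String(body, StandardCharsets.UTF_8)};
    }

    // Token: first 6 hex characters of SHA-256(username + random number).
    public static String makeToken(String username) throws Exception {
        long nonce = new SecureRandom().nextLong();
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest((username + nonce).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.substring(0, 6);
    }
}
```

Because the length field is written with writeInt and read back with readInt, both sides agree on the payload boundary regardless of how TCP fragments the stream.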
Then, the token will be the first n (e.g., 6) characters of the output. After that, you should save the token along with the corresponding client IP + port and username, so that the server can verify the token in the future during the querying phase.

Architecture: While architecting your application, it may be beneficial to separate it into modules/layers so that different group members can work independently on a single module. For example, the authentication modules of the server and client would communicate with each other (through the Command socket) to agree on a token in the authentication phase. In the querying phase, the only responsibility of the authentication module would be as follows:
– At the client: Append the token to the message received from the weather module before sending the message through TCP to the server.
– At the server: Verify the token appended to the message received from TCP before supplying it to the weather module.
The weather module could then simply focus on communicating with the OWM API. Fig. 5 shows an example of how the modules in your project may interact. The top section represents the authentication phase, while the bottom section represents an authenticated client sending a request to the server.

Fig. 5. The interaction of the different modules of WeatNet.

Execution Scenario: Because classes are being conducted online, the client and server are expected to reside on a single machine for simplicity of both execution and demonstration. However, any group preferring a connection to a remote device is free to do so, albeit with the necessary consideration given to the overall execution and demonstration scenarios. The project envisions the groups implementing at least five unique clients and one server.

Project deliverables: You should submit your source code and project report (in a single .rar or .zip file) via Blackboard.

Report: The report should start with a very brief description of the project.
This should be followed by an explanation of the design philosophy, especially the post-authentication server-client protocol, and then by an overview of the programming of the client and server sides, including the initial authentication, the connection with the API, the file transfer mechanism, and the file verification. Instead of attaching complete code, it is strongly recommended to attach code snippets and images of the results of their execution. The report should explicitly describe the test routines developed to evaluate the full range of features required for WeatNet.

● Source Code: A .zip or .rar file that contains your implementation as a single Eclipse or IntelliJ IDEA project. If you aim to implement your project in an IDE other than the mentioned ones, you should first consult with the TA and get confirmation.
● The report is an important part of your project presentation and should be submitted as both a .pdf and a Word file. Your report should show the step-by-step OWM configuration and connection with your code, as well as your server-client communication initiation. The report acts as proof of work for you to assert your contributions to this project. Everyone who reads your report should be able to reproduce the parts we asked you to document without hard effort or any other external resource. If you need to put code in your report, segment it as finely as possible (i.e. just the parts you need to explain) and clarify each segment before heading to the next one. For code, you should take screenshots instead of copy/pasting the code directly. Strictly avoid replicating the code as a whole in the report, or leaving code unexplained.
You are expected to provide detailed explanations of all of the above clearly in your report.

Demonstration: You are required to demonstrate the execution of the WeatNet application for the defined requirements. Your demo sessions will be announced by the TAs. Attending the demo session is required for your project to be graded. All group members are expected to be available during the demo session; the on-time attendance of all group members at the demo session is considered a grading criterion. During your demonstration of the authentication phase, you will be asked to demonstrate two clients being authenticated at the same time. You can assume that all the clients will present different usernames. You need to make sure that all the users have different correct answers for each of the possible questions. The clients and the servers will be running on the same machine, so different clients should be running on different ports. In this vein, the server will differentiate the clients not only by their IP address, but also by their port number. For the demonstration, the server should send three different questions before authenticating the user. First, you will need to show an unsuccessful authentication, then a successful authentication. During the demonstration, the group will be asked to display all the operations of the application, starting from initiating client connections and authentication to passing requests for the listed metrics. It is strongly recommended that appropriate test routines be developed to present an effective demonstration and display the full range of features defined for the WeatNet application. The groups have the creative freedom to present any additional features they have built into their application in any way they deem feasible. However, please note that you will be given 10-15 minutes for your demonstration.
Following is a detailed but not exhaustive list of the test routines the groups should develop to aid them in testing and demonstration:
1. Client creation. The application should be designed with client scalability in mind, even though the testing will be done with at most 5 clients.
2. Single client connection with the server.
3. Multiple clients simultaneously connecting with the server.
4. Authentication procedure for a single, randomly chosen client.
5. The process from the initiation of a request by the client side to the final verification and display of the received file at the client side.
6. The process in case of a mismatch between the received file and the received hash.

Suggested task distribution: We recommend you work in a group of 3 students, and suggest the following task distribution accordingly:
● Student 1: Client-side programming and authentication.
● Student 2: Server-side programming and multithreading.
● Student 3: OpenWeatherMap API programming and the file transfer mechanism.
● All members of the group perform integration, tests, and the report.
The groups should be clear on the task distribution, and the relevant questions will be directed towards the student responsible for each task. Good luck!

This project is about the transport layer of the network protocol stack. The focus is on the SSL, TCP and UDP protocols. For this purpose, you are asked to modify the provided SSL client/server code as specified below, experiment with TCP and UDP features, and implement a Stop-and-Wait ARQ protocol. You are asked to use the Wireshark network protocol analyzer tool to answer transport-layer-related questions. Wireshark is the world's foremost network protocol analyzer, and is the de facto standard across many industries and educational institutions. It can be downloaded freely at https://www.wireshark.org/download.html. Wireshark allows users to trace network activity by capturing all the packets that hit your network interface.
It tags the information of each layer by parsing the given byte stream according to the corresponding protocol. You should read this project document carefully before starting your tasks.

Part 1 – SSL Implementation and Experiments: Figure 1 illustrates an overview of the SSL protocol. Recall that you are provided SSL client/server code that performs echo on top of an SSL socket. The corresponding SSL practical content codes and slides are available through the course web site.

Figure 1. SSL Protocol Overview

As presented in the figure, the certificate is sent by the server to the user at the start of the session. The user adds this certificate to the local key store and uses it for authentication. The code to add the certificate to the local key store is provided, but the part where the server sends the certificate to the user is missing. In this part, you are asked to modify the provided code as follows:
▪ Set up a TCP connection on which the certificate will be transferred to the client. The TCP connection can listen on any port. The server should ask for client verification before the certificate is transferred to the client. You can keep a file of already known users on the server side and use them for login.
▪ Use the certificate to connect to the server through SSL. You should keep the certificate in the right directory.
▪ The SSL connection at the server side should listen on the port numbered by your KUSIS-ID + the DD from your date of birth (DDMMYY) (look at the first question). You may handle the case where the number becomes larger than the available ports by any mathematical manipulation, and explain your approach in the report.
▪ After the SSL connection is established, your client should receive your KUSIS username (e.g. abcdef18) + KUSIS ID character by character, in separate messages, in a non-persistent manner. Then, the KUSIS username + KUSIS ID should be printed at the client side.
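As one way to prototype the character-by-character, non-persistent exchange before wiring in SSL, the sketch below opens a fresh plain TCP connection per character; for the actual task you would swap in SSLServerSocketFactory/SSLSocketFactory. All names here are illustrative, not from the provided code.

```java
import java.io.*;
import java.net.*;

public class CharByChar {
    // Sends each character of id over its own short-lived connection
    // (non-persistent), then reassembles the string on the client side.
    public static String exchange(String id) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // port 0: any free port
            Thread serverThread = new Thread(() -> {
                try {
                    for (char c : id.toCharArray()) {
                        try (Socket s = server.accept();
                             DataOutputStream out =
                                     new DataOutputStream(s.getOutputStream())) {
                            out.writeChar(c); // one character per connection
                        }
                    }
                } catch (IOException ignored) { }
            });
            serverThread.start();
            StringBuilder received = new StringBuilder();
            for (int i = 0; i < id.length(); i++) {
                try (Socket client = new Socket("localhost", server.getLocalPort());
                     DataInputStream in = new DataInputStream(client.getInputStream())) {
                    received.append(in.readChar()); // read the single character
                }
            }
            serverThread.join();
            return received.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(exchange("abcdef18")); // prints the reassembled ID
    }
}
```

Captured in Wireshark, each character then appears in its own connection with a full handshake and teardown, which is exactly what question 3 asks you to count.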
Important Notes:
– Your modified SSL code must be submitted along with your project report. In your project report, you should explain your answers and provide your Wireshark outputs for each question in order to get credit.
– On Windows, you might not be able to capture the Loopback interface, i.e. the traffic inside your operating system. When you run your server and client software in a single operating system, you need to capture the Loopback interface in order to capture incoming and outgoing packets. To work around this problem, you can run one of the applications (client or server) on a separate machine.

After running your code, answer the following questions:
1. Locate the SSL server IP address and port number, and the client IP address and port number, through which these agents are communicating, by using Wireshark.
2. Locate the TCP segments containing data. What is written in the data field? Compare it with the data you exchanged between the client and server. Why do you think this is the case?
3. How many TCP segments are transmitted in total while your KUSIS username + KUSIS ID is exchanged one by one over non-persistent connections?
4. What difference did you see between the payloads of SSL and TCP? Can you locate the login name and password entered by the user? Can you locate the email information?

Part 2 – TCP Experiments: Before beginning the exploration of TCP, you need to use Wireshark to obtain a packet trace of the TCP transfer of a file from your computer to a remote server. You need to run Wireshark before starting this process to obtain the trace of the TCP segments sent and received from your computer. You are asked to do so by accessing a Web page that will allow you to enter the name of a file stored on your computer (which contains the ASCII text of Alice in Wonderland), and then transfer the file to a Web server using the HTTP POST method.
We are using the POST method rather than the GET method because we would like to transfer a large amount of data from your computer to another computer. Perform the following:
▪ Run Wireshark and start capturing the traffic.
▪ Start up your web browser. Go to https://gaia.cs.umass.edu/wireshark-labs/alice.txt and retrieve an ASCII copy of Alice in Wonderland. Store this file somewhere on your computer.
▪ Next, go to https://gaia.cs.umass.edu/wireshark-labs/TCP-wireshark-file1.html
▪ You should see a screen that looks like:
▪ Use the Browse button in this form to enter the name of the file (full path name) on your computer containing Alice in Wonderland (or do so manually). Don't yet press the "Upload alice.txt file" button.
▪ Now start up Wireshark and begin packet capture (Capture->Start), and then press OK on the Wireshark Packet Capture Options screen (we will not need to select any options here).
▪ Returning to your browser, press the "Upload alice.txt file" button to upload the file to the gaia.cs.umass.edu server. Once the file has been uploaded, a short congratulations message will be displayed in your browser window.
▪ Stop Wireshark packet capture.

Answer the following questions for the TCP segments:
5. Obtain the Flow Graph of the TCP communication. What is the significance of the various IP addresses shown in the Flow Graph? Using the flow graph, identify the three-way handshake and terminating handshake messages for the TCP connection. Provide screenshots for each explanation.
6. What are the sequence numbers (as they appear in the Wireshark program) of the segments used for the 3-way handshake protocol that initiates the first TCP connection? What are the port numbers used on the client and server sides?
7. What is the sequence number of the SYNACK segment sent by gaia.cs.umass.edu to the client computer in reply to the SYN?
8. What is the value of the Acknowledgement field in the SYNACK segment?
How did gaia.cs.umass.edu determine that value? What is it in the segment that identifies the segment as a SYNACK segment?
9. What is the sequence number of the TCP segment containing the HTTP POST command? Note that in order to find the POST command, you'll need to dig into the packet content field at the bottom of the Wireshark window, looking for a segment with "POST" within its DATA field.
10. Consider the TCP segment containing the HTTP POST as the first segment in the TCP connection. What are the sequence numbers of the first six segments in the TCP connection? At what time was each segment sent? When was the ACK for each segment received? Given the difference between when each TCP segment was sent and when its acknowledgement was received, what is the RTT value for each of the six segments? What is the EstimatedRTT value (see Section 3.5.3 in the textbook) after the receipt of each ACK? Assume that the value of the EstimatedRTT is equal to the measured RTT for the first segment, and is then computed using the EstimatedRTT equation (Section 3.5.3 in the textbook) for all subsequent segments.
Note: Wireshark has a nice feature that allows you to plot the RTT for each of the TCP segments sent. Select a TCP segment in the "listing of captured packets" window that is being sent from the client to the gaia.cs.umass.edu server. Then select: Statistics->TCP Stream Graph->Round Trip Time Graph.

Part 3 – UDP Experiments: In this part, you are assigned a unique URL to work with. The list is provided in the file Project2_URL_List.pdf, and you must use the URL assigned to you. You should provide the appropriate screenshots and work on the correct domain in order to get credit. The nslookup command works as an IP address resolver: when you provide a domain name as an argument, it will return the IP address of that domain. Now, take the steps provided below and answer the questions accordingly.
▪ Start your Wireshark software and start capturing packets from the appropriate interface.
▪ Use the nslookup command to resolve the IP address of the URL that is assigned to you.
▪ Stop packet capturing in Wireshark.
▪ Apply an appropriate display filter.

11. What display filter did you apply in order to see the appropriate packets?
12. Which application layer and transport layer protocols does nslookup work on? What is the reason that this transport layer protocol is chosen?
13. Can you determine whether the local DNS server you connected to works in an iterative or a recursive manner? Whether you can or cannot, please provide a detailed explanation. Please also briefly explain the advantages and disadvantages of the iterative and recursive approaches over each other.
14. What are the header lengths of the application layer protocol and the transport layer protocol that nslookup works on?
15. How many checksums does a UDP segment have in the checksum field? Why?

Part 4.a – Stop-and-Wait ARQ Protocol: Recall the Stop-and-Wait ARQ protocols you have seen in the lectures. For this part, you are to implement a Stop-and-Wait ARQ protocol at the transport layer. We have already given you a base code in Java. You must implement your methods over this code. After your implementation is complete, you can run the main method of Main.java under the main package to see whether your code passes the tests. If you are having problems, you may set Main.DEBUG to true to see more information on what is happening behind the scenes. You should see the following output when your implementation is completed successfully:

Fig 1. The output you should be getting.

You only need to implement two methods residing in Transport.java under the transport package:
● void sendWithARQ(Packet[] packets): Used by the sender. Gets an ordered list of message packets and sends them one by one to the receiver, using Stop-and-Wait ARQ.
● Packet[] receiveWithARQ(): Used by the receiver.
Receives the message packets sent by the sender using Stop-and-Wait ARQ and returns them as an array.

While implementing your methods, you will need to use the methods and classes that are already provided to you. First, take a look at Packet.java under the network package. A packet can be either a message packet or an acknowledgement packet. Here are the important fields that you should be familiar with in your implementation: ack, lastPacket, sequenceNumber, characters, timedOut.

The methods that you should use are in Transport.java under the transport package:
● Packet receivePacket(int timeout)
Receives a packet from the other process. If the timeout parameter is given as > 0, the process waits timeout milliseconds for a packet before failing. Upon timeout failure, this method returns an empty packet with its timedOut flag set to true. If the timeout parameter is given as ≤ 0, this method blocks indefinitely.
● void sendMsgPacket(int sequenceNumber, boolean lastPacket, Packet packet)
Sends a single message packet to the other process. The sequenceNumber parameter is either 0 or 1, and it denotes the sequence number of the message packet you are sending. The lastPacket parameter should be set to true only when you are sending the last message packet to the receiver. The packet parameter is the actual message packet you are trying to send.
● void sendAckPacket(int sequenceNumber, boolean lastAck)
Sends an acknowledgement packet to the other process. The sequenceNumber parameter is either 0 or 1, denoting the sequence number of the acknowledgement packet. The lastAck parameter should be set to true only when this acknowledgement is being sent in response to the last message packet received from the sender.

In conclusion, your implementation should do the following:
● At the sender (i.e., the sendWithARQ method):
1. The sender must send each message packet using sendMsgPacket, setting the arguments correctly.
2. After a packet is sent, the sender must wait for an acknowledgement from the receiver using receivePacket with a reasonable timeout of your choice.
3. If the acknowledgement is not received within this duration, the sender must retransmit the packet.
4. If the acknowledgement was received within this duration and the sequence number is as expected, the sender must proceed to send the next packet.
● At the receiver (i.e., the receiveWithARQ method):
1. The receiver must wait for packets using receivePacket.
2. Once a packet is received, the receiver should check the sequence number (using Packet.sequenceNumber) and discard or save the packet depending on its sequence number.
3. The receiver should also check the Packet.lastPacket field of the received message packet.
▪ If the received packet is not the last message packet: the receiver must send a single acknowledgement packet with the correct sequence number using sendAckPacket, and then wait for the next message packet.
▪ If the received packet is the last message packet: the receiver must send its last acknowledgement using sendAckPacket and break out of the receive loop.

Part 4.b – Analyzing Your Implementation

After your implementation is complete, please answer the following questions in your report.

16. Start listening on your Loopback interface with Wireshark and run your code. How many retransmissions have occurred from the sender to the receiver? Explain.
17. Attach a screenshot of the first packet sent by the sender and the last packet received by the receiver. How long did it take to transfer all the packets? Consider the timeout period that you have chosen and the number of retransmissions. Does this result make sense? Why or why not? Note: you will not see the "lost packets" in Wireshark.

Project Deliverables:

Important Note: You are expected to submit a project report, in PDF format, that documents and explains all the steps you have performed in order to achieve the assigned tasks of the project.
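The alternating-bit logic described in Part 4.a can be exercised with a toy, single-threaded simulation. This sketch is illustrative only: the LossyChannel class and its drop-every-third-send pattern are invented here, and the real assignment instead uses the provided Transport and Packet classes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

final class StopAndWaitDemo {
    /** Hypothetical data channel that deterministically loses every third transmission. */
    static final class LossyChannel {
        private int sends = 0;
        private final ArrayDeque<int[]> wire = new ArrayDeque<>(); // {seq, payload}
        void send(int seq, int payload) {
            if (++sends % 3 != 0) wire.add(new int[]{seq, payload}); // else: packet lost
        }
        int[] receive() { return wire.poll(); } // null models a timeout
    }

    /** Sends data with alternating 0/1 sequence numbers; returns what arrived, in order. */
    static List<Integer> transfer(int[] data) {
        LossyChannel ch = new LossyChannel();
        List<Integer> delivered = new ArrayList<>();
        int expected = 0; // receiver's next expected sequence number
        for (int i = 0; i < data.length; ) {
            ch.send(i % 2, data[i]);      // sender: transmit packet i
            int[] pkt = ch.receive();     // receiver side of the same step
            if (pkt == null) continue;    // "timeout": resend the same packet
            if (pkt[0] == expected) {     // in-order packet: deliver and flip the bit
                delivered.add(pkt[1]);
                expected ^= 1;
            }
            i++;                          // ACK assumed received (ACKs are never lost here)
        }
        return delivered;
    }

    public static void main(String[] args) {
        // All five payloads arrive despite two losses (sends 3 and 6 are dropped).
        System.out.println(transfer(new int[]{10, 20, 30, 40, 50}));
    }
}
```

The key Stop-and-Wait property shows up in the `continue` branch: on a timeout, the sender retransmits the same packet with the same sequence number and only advances after an acknowledgement.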
A full grade report is one that clearly details and illustrates the execution of the project. Anyone who follows your report should be able to reproduce your performed tasks without effort. Use screenshots to illustrate the steps and provide clear and precise textual descriptions as well. All reports will be analyzed for plagiarism; please be aware of the KU Statement on Academic Honesty.

The name of your project .zip file must be -.zip. You should turn in a single .zip file including:
▪ Source codes: containing the source codes of the client and server, and also your completed version of the part 4 code.
▪ Project.pdf file (your report; it should include the answers and the corresponding Wireshark screenshots).
▪ Saved capture files from Wireshark.

Figures in your report should be scaled to be visible and clear enough. All figures should have captions, should be numbered according to their order of appearance in the report, and should be referenced and described clearly in your text. All pages should be numbered and have headers matching your file-naming criteria. If you employ any (online) resources in this project, you must reference them in your report. There is no page limit for your report, and no specific requirements on the design. Good Luck!

This project is about the network layer of the Internet protocol stack. The objectives are to examine network layer data, the principles behind network layer services, and routing (path selection). Through this project, you will practice with Wireshark as well as a simplified network routing simulator. The first part of the project requires you to analyze traffic data at the network layer through Wireshark: ICMP traffic resulting from the application of a) the ping and b) the traceroute commands will be monitored. The second part of the project involves working with the provided routing simulator and implementing routing strategies to analyze their performance.
Part I: ICMP Analysis

ICMP is a companion protocol to IP that helps IP perform its functions by handling various error and test cases. The Internet Control Message Protocol (ICMP) is a supporting protocol in the Internet protocol suite. It is used by network devices, including routers, to send error messages and operational information indicating, for example, that a requested service is not available or that a host or router could not be reached. ICMP differs from transport protocols such as TCP and UDP in that it is not typically used to exchange data between systems, nor is it regularly employed by end-user network applications (with the exception of some diagnostic tools like ping and traceroute).

1.a: Ping Analysis

For this part of the project, you will use the ping command to analyze the working of ICMP. ping uses the ICMP protocol's mandatory ECHO_REQUEST datagram to elicit an ICMP ECHO_RESPONSE from a host or gateway. ECHO_REQUEST datagrams ("pings") have an IP and ICMP header, followed by a struct timeval and then an arbitrary number of "pad" bytes used to fill out the packet. ping works with both IPv4 and IPv6; using only one of them can be enforced explicitly by specifying -4 or -6.

● Run the Wireshark packet capture.
● Ping the hostname URL assigned to you.
● You are required to send exactly 5 ping messages at 5 different times in a day (or over the duration of the project) and share a screenshot of the command prompt with a summary of the ping completion and the associated statistics.
● After the ping completes, stop the capture and answer the following questions. Remember to attach screenshots with each answer highlighting the relevant area.

1. What are the three layers in the ICMP packet?
2. What is TTL and what is its significance? Which layer does it reside in, and is it constant (format- and number-of-bits-wise) across IPv4 and IPv6 ping commands?
3. Why does an ICMP packet not have source and destination port numbers?
4.
What is the length of the data field of the ICMP Type 8 (echo request) part? Elaborate on the structure of the data field, citing any common and any changing parts across the various messages. If there are changing parts in the data field, what do you think is the reason for that?
5. Find the minimum TTL below which the ping messages do not reach your particular URL destination.
6. How do the Identifier and Sequence Number fields compare for successive echo request packets?

1.b: Traceroute Analysis

In this part, you will use traceroute (which may need to be installed) to perform the same set of actions as in Part 1.a. Traceroute is implemented in different ways in Unix/Linux/macOS and in Windows. In Unix/Linux, the source sends a series of UDP packets to the target destination using an unlikely destination port number; in Windows, the source sends a series of ICMP packets to the target destination. For both operating systems, the program sends the first packet with TTL=1, the second packet with TTL=2, and so on. Recall that a router will decrement a packet's TTL value as the packet passes through it. When a packet arrives at a router with TTL=1, the router sends an ICMP error packet back to the source. In the following, we'll use the native Windows tracert program. A shareware version of a much nicer Windows traceroute program is pingplotter (www.pingplotter.com).

The source and destination IP addresses in an IP packet denote the endpoints of an Internet path, not the IP routers on the network path the packet travels from the source to the destination. Traceroute is a utility for discovering this path. It works by eliciting ICMP TTL Exceeded responses from the router 1 hop away from the source towards the destination, then 2 hops away, then 3 hops, and so forth, until the destination is reached. The responses will identify the IP address of each router.
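The hop-discovery loop just described can be sketched as a small simulation, in Java; the path and the addresses below are invented for illustration, not a real trace:

```java
import java.util.ArrayList;
import java.util.List;

final class TracerouteSketch {
    // path[i] is the router (or final host) sitting i+1 hops from the source.
    static List<String> discover(String[] path, String destination) {
        List<String> responders = new ArrayList<>();
        for (int ttl = 1; ttl <= path.length; ttl++) {
            // A probe sent with this TTL expires exactly at hop `ttl`, so that
            // node answers (TTL Exceeded, or an Echo Reply at the destination).
            String responder = path[ttl - 1];
            responders.add(responder);
            if (responder.equals(destination)) break; // path fully discovered
        }
        return responders;
    }

    public static void main(String[] args) {
        // Hypothetical 3-hop path toward an example destination.
        String[] path = {"10.0.0.1", "192.0.2.7", "198.51.100.3"};
        System.out.println(discover(path, "198.51.100.3"));
    }
}
```

Probes with TTL 1, 2, 3, ... reveal the hops in order, which is exactly the per-TTL structure you should recognize in the capture.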
Since traceroute takes advantage of common router implementations, there is no guarantee that it will work for all routers along the path, and it is usual to see " * " responses when it fails for some portions of the path.

● Start up the Wireshark packet sniffer, and begin Wireshark packet capture.
● Use traceroute with the same URL. (Note that on a Windows machine, the command is "tracert" and not "traceroute".)
● On Linux, force the traceroute command to send ICMP packets instead of UDP packets. You may look for this information using 'man traceroute' and choosing the appropriate flag.
● When the traceroute program terminates, stop packet capture in Wireshark.

At the end of the experiment, your Command Prompt window should show that, for each TTL value, the source program sends three probe packets. Traceroute displays the RTTs for each of the probe packets, as well as the IP address (and possibly the name) of the router that returned the ICMP TTL-exceeded message.

7. How long is the ICMP header of a TTL Exceeded packet? Select different parts of the header in Wireshark to see how they correspond to the bytes in the packet.
8. How does your computer (the source) learn the IP address of a router along the path from a TTL exceeded packet?
9. How many times is each router along the path probed by traceroute?
10. Within the tracert measurements, is there a link whose delay is significantly longer than others? The echo request packets sent by traceroute are probing successively more distant routers along the path. You can look at these packets and see how they differ when they elicit responses from different routers.

Part II: Routing Implementation

In this part, you are asked to implement greedy routing algorithms at the network control plane and answer related questions. You are provided with a simulator and two topologies. You will need to implement four different algorithms and observe their behavior under the given topologies.
The topologies given to you can be visualized as follows:

Fig 1. Topology 1 visualized.
Fig 2. Topology 2 visualized.

The numbers on the nodes denote the addresses of the nodes (i.e., routers), and the numbers on the edges denote the costs of the links. Please note that the graphs are undirected; for example, node 2 has a link to node 3 with a cost of 1, and vice versa. In our simulation, node 1 tries to send a packet to node 4.

Simulator Explanation

The simulator given to you can be executed by running the Main.java file. Its outputs will be helpful to you in your implementation and in answering the questions. The expected simulator outputs are given in the "expected_output.txt" file. The algorithms that you need to implement reside in the "algorithms" package. You only need to implement the selectNeighbors method for each algorithm. Each node stores an instance of an Algorithm and invokes its selectNeighbors method when there is a new packet to forward. The output of this method determines the neighbors to which the packet will be forwarded. Briefly, selectNeighbors takes the following parameters:
● Origin: the address of the origin of the packet.
● Destination: the address of the destination of the packet.
● PreviousHop: the address of the previous hop (i.e., the node that the packet was sent from).
● Neighbors: a list of NeighborInfo instances describing the neighbors that the node has.

Your implementation should select a subset of Neighbors and return them. The node that invoked the algorithm will route the packet to the list of neighbors that you have returned.

NaiveFloodingAlgorithm: Go to the NaiveFloodingAlgorithm class residing under the "algorithms" package. This algorithm is already given to you as an example. A node using this algorithm simply routes a packet to all of its neighbors (except the previous hop). Run the simulator and observe the behavior of this algorithm. Then answer the following questions in your report:

11.
See that this algorithm succeeds in topology 1. What does the total communication cost represent, and why is it different from the path cost?
12. See that this algorithm fails in topology 2, and the simulator notes that the protocol does not converge. Why is this the case?

FloodingAlgorithm: FloodingAlgorithm is a simple improvement over NaiveFloodingAlgorithm: each node in the topology only routes once. You can maintain a state and return the neighbors only the first time selectNeighbors is called. Go to the FloodingAlgorithm class and implement it. Then answer the following question in your report:
13. Consider the path taken by the packet in topology 2 with this algorithm. Is this what you expected? Why or why not?

NaiveMinimumCostAlgorithm: NaiveMinimumCostAlgorithm always chooses the link with the smallest cost. This algorithm returns only a single neighbor (i.e., a list with only one neighbor), as opposed to the flooding algorithms. While finding the appropriate neighbor, make sure that you do not consider the previous hop. Go to NaiveMinimumCostAlgorithm and implement it. Then answer the following question in your report:
14. If you implemented the algorithm as specified, it should succeed in topology 1 but fail in topology 2. Why does it fail in topology 2?

MinimumCostAlgorithm: MinimumCostAlgorithm is an improvement over NaiveMinimumCostAlgorithm. More specifically, we now only consider the edges that were not previously used. This algorithm should maintain an exclusion set, i.e., a set of neighbors that should be excluded from routing to prevent cycles. When choosing the link with the minimum cost, we take the minimum over the neighbors that are not in the exclusion set. If all of the neighbors are already in the exclusion set, we simply choose a random neighbor.
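The route-only-once idea behind FloodingAlgorithm, and the difference between total communication cost and path cost, can be sketched on a small graph. The 4-node topology below is invented for illustration and is NOT the project's Fig. 1 or Fig. 2:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class FloodOnceSketch {
    // Flood from src, letting each node forward only once (FloodingAlgorithm's rule).
    // Returns the total communication cost: the summed cost of every link traversed
    // by any copy of the packet.
    static int totalCommunicationCost(Map<Integer, Map<Integer, Integer>> adj,
                                      int src, int dst) {
        Set<Integer> hasRouted = new HashSet<>();
        Deque<int[]> inFlight = new ArrayDeque<>(); // {currentNode, previousHop}
        inFlight.add(new int[]{src, -1});
        int total = 0;
        while (!inFlight.isEmpty()) {
            int[] p = inFlight.poll();
            int at = p[0], from = p[1];
            if (at == dst || !hasRouted.add(at)) continue; // arrived, or already routed once
            for (Map.Entry<Integer, Integer> e : adj.get(at).entrySet()) {
                if (e.getKey() == from) continue;          // never send back to previous hop
                total += e.getValue();                     // every copy sent costs its link
                inFlight.add(new int[]{e.getKey(), at});
            }
        }
        return total; // terminates even on a cyclic topology, unlike naive flooding
    }

    // Hypothetical square topology: 1-2 (1), 2-3 (1), 3-4 (1), 1-4 (5).
    static Map<Integer, Map<Integer, Integer>> demoTopology() {
        return Map.of(
                1, Map.of(2, 1, 4, 5),
                2, Map.of(1, 1, 3, 1),
                3, Map.of(2, 1, 4, 1),
                4, Map.of(1, 5, 3, 1));
    }

    public static void main(String[] args) {
        // Total cost is 8 (copies cross links 1-2, 1-4, 2-3, 3-4), while the cheapest
        // path 1-2-3-4 costs only 3: flooding pays for every duplicate copy it sends.
        System.out.println(totalCommunicationCost(demoTopology(), 1, 4));
    }
}
```

The `hasRouted` set is the one-line difference between this and naive flooding: without it, copies would circulate around the 1-2-3-4 cycle forever.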
Please note that at the end of each call to selectNeighbors, two nodes should be added to the exclusion set: (1) the node that the packet was received from, and (2) the node that the packet is being forwarded to. Go to MinimumCostAlgorithm and implement it.
15. List the nodes in the exclusion set of each node at the end of the simulation of topology 2.

Demonstration: In the demo session, you are required to demonstrate the working of your Part II implementation, including the executions of the routing algorithms. You are also expected to answer questions on the concepts of the network layer. The dates and the schedule of the demonstrations will be announced later.

Project Deliverables:

Important Note: You are expected to submit a project report, in PDF format, that documents and explains all the steps you have performed in order to achieve the assigned tasks of the project. A full grade report is one that clearly details and illustrates the execution of the project. Anyone who follows your report should be able to reproduce your performed tasks without effort. Use screenshots to illustrate the steps and provide clear and precise textual descriptions as well. All reports will be analyzed for plagiarism; please be aware of the KU Statement on Academic Honesty.

The name of your project .zip file must be -.zip. You should turn in a single .zip file including:
▪ Source codes: containing the source codes of your completed version of Part II.
▪ -_P3.pdf titled Project Report.
o For Part I, your report should include the answers to the questions and the corresponding Wireshark screenshots.
o For Part II, a brief explanation of the implementation of the requirements, supported by code snippets, plus answers to the questions from this part.
▪ Saved capture files from Wireshark.

Figures in your report should be scaled to be visible and clear enough.
All figures should have captions, should be numbered according to their order of appearance in the report, and should be referenced and described clearly in your text. All pages should be numbered, and have headers the same as your file naming criteria. If you employ any (online) resources in this project, you must reference them in your report. There is no page limit for your report. Good Luck!


[SOLVED] Comp301 projects 1 to 4 solution

Problem Definition: To represent a certain set of quantities in a particular way, we are defining a new data type. In PS1, you created two different representations (unary and bignum) of natural numbers. Another example is representing all the integers (negative and non-negative) as diff-trees, where a diff-tree is a list defined by the grammar in the book's Exercise 2.3 (page 34):

Diff-tree ::= (one) | (diff Diff-tree Diff-tree)

These examples show how data abstraction can be managed with different interfaces and implementations. In this project, you will create a new data type to represent a quantity of your choice. You can select any quantity to represent, such as Natural Numbers, Rational Numbers, or Cities on a Map.

Part A. Similar to how natural numbers are represented in the Unary and BigNum representations, you will define two new types to represent your selected quantity. In this part, give the grammar definitions of your data types.

Part B. Implement these representations in Scheme. For each of these representations, implement the following procedures:
• create: gets an input and creates the new data type.
• is-empty?: returns #t if the representation has no value, otherwise returns #f.
• successor: gets a representation and a small building block of your data type, and adds the building block to the represented value.

Part C. Please explain what Constructors, Observers, Extractors and Predicates are. For each procedure described in Part B, indicate whether it is a Constructor, Observer, Extractor or Predicate.

Part D. Create new test cases for the following procedures of both representations.
• create: write one case.
• is-empty?: write two test cases, one returning true and one returning false.
• successor: write one case.

In this project, you will work in groups of two or three.
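The diff-tree grammar above denotes integers by differences: (one) denotes 1, and (diff t1 t2) denotes the value of t1 minus the value of t2. A sketch of the idea, in Java for illustration only (the course work itself is in Scheme, and the successor helper below is our own construction, not part of the exercise):

```java
interface DiffTree {
    record One() implements DiffTree {}
    record Diff(DiffTree left, DiffTree right) implements DiffTree {}

    // (one) denotes 1; (diff l r) denotes value(l) - value(r).
    static int valueOf(DiffTree t) {
        if (t instanceof Diff d) return valueOf(d.left()) - valueOf(d.right());
        return 1;
    }

    // successor(t) adds 1 by subtracting -1, where -1 = (diff (diff (one) (one)) (one)).
    static DiffTree successor(DiffTree t) {
        DiffTree one = new One();
        DiffTree minusOne = new Diff(new Diff(one, one), one); // (1 - 1) - 1 = -1
        return new Diff(t, minusOne);
    }
}

class DiffTreeDemo {
    public static void main(String[] args) {
        DiffTree two = DiffTree.successor(new DiffTree.One());
        System.out.println(DiffTree.valueOf(two)); // prints 2
    }
}
```

Note that every integer has infinitely many diff-tree representations (e.g., 0 is (diff (one) (one)) but also (diff (diff (one) (one)) (diff (one) (one)))), which is exactly the kind of interface-vs-representation gap the project asks you to explore.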
To create your group, use the Google Sheet file in the following link: Link to Google Sheets for Choosing Group Members. Note: You need to self-enroll to your Project 2 group on Blackboard (please only enroll to the same group number as your group in the Sheets); please make sure that you are enrolled to Project 2 – Group #YourGroup. This project contains a bonus component specified at the end, and there are two code boilerplates provided to you: use Project2MYLET for the project and Project2BONUS for the bonus.

Submit a report containing your answers to the written questions in PDF format and Racket files for the coding questions to Blackboard as a zip. Include a brief explanation of your team's workload breakdown in the PDF file. If you attempt to solve the bonus question, make sure that your zip includes both the Project2MYLET and Project2BONUS folders separately. Name your submission files as p2_member1IDno_member1username_member2IDno_member2username.zip, for example p2_0011221_galtintas17_0011222_mkarakas16.zip. Please use the Project 2 Discussion Forum on Blackboard for all your questions. The deadline for this project is Nov 15, 2020 – 23:59 (GMT+3: Istanbul Time). Read your task requirements carefully. Good luck!

Table 1. Grade Breakdown for Project 2
Question   Grade Possible
Part A     15
Part B     10
Part C     5
Part D     60
Part E     10
Total      100
Bonus      2 pts

Problem Definition: To evaluate programs, you need to understand the expressions of the language. It is the same for computers; therefore, you saw in the lecture how you can invent a language and define it for the computer to understand and evaluate. In this project, you will define a language named MYLET that is similar to the simple LET language covered in class. The syntax for the MYLET language is given below.

Program    ::= Expression                                  a-program (exp1)
Expression ::= Number                                      const-exp (num)
Expression ::= String                                      str-exp (str)
Expression ::= op(Expression, Expression, Number)          op-exp (exp1, exp2, num)
Expression ::= zero? (Expression)                          zero?-exp (exp1)
Expression ::= if Expression then Expression
               {elif Expression then Expression}*
               else Expression                             if-exp (exp1 exp2 conds exps exp3)
Expression ::= Identifier                                  var-exp (var)
Expression ::= let Identifier = Expression in Expression   let-exp (var exp1 body)

Figure 1. Syntax for the MYLET language

Part A. This part will prepare you for the following parts of the project. (15 pts)
(1) Write the 5 components of the language (hint: review the Lecture 10 slides).
(2) For each component, specify where, or in which Racket file (if it applies), we define and handle them.

Part B. In this part, you will create an initial environment for programs to run. (10 pts)
(1) Create an initial environment that contains 3 different variables (x, y, and z).
(2) Using the environment abbreviation shown in the lectures, write how the environment changes at each variable addition.

Part C. Specify the expressed and denoted values for the MYLET language. (5 pts)

Part D. This is the main part of the project, where you implement the MYLET language given in Figure 1 by adding the missing expressions.
(1) Add str-exp to the language. Strings are defined as any text starting and ending with ', e.g. 'comp301', 'program'; strings are stored with the ' symbols. (15 pts) Hint: String is an expression that is similar to Number; understanding the addition and implementation of Number may be helpful to complete this step.
(2) Add op-exp to the language. (15 pts) op-exp is similar to the diff-exp of the LET language; however, in the LET language, the only possible operation was subtraction. op-exp enables you to perform 4 arithmetic operations via its third input (Number). When the third input is:
• 1: perform addition (exp1 + exp2)
• 2: perform multiplication (exp1 * exp2)
• 3: perform division (exp1 / exp2)
• any other number: perform subtraction (exp1 – exp2)
(3) Add if-exp to the language. Unlike the if-exp of the LET language, you can add multiple conditions to be checked through the elif-then extension.
Starting from the condition of if, conditions will be checked until a true condition is found, and expression corresponding to the true condition will be evaluated as a result. If none of the if/elif conditions are correct, the expression in the else statement will be evaluated. (15 pts) (4) Add a custom expression to the language. The expression can be simple, but you need to clearly explain what it does and how it works. You also need to provide the syntax of the expression. (15 pts) Note that the implementation of the other expressions, that are same with the LET language, are already given in the .rkt file provided. We deleted the former implementations of if and diff-exp. Part E. Create the following test cases. (10 pts) (1) custom expression: Write test cases that controls if the expression works according to your explanation of the expression. Note: We provided several test cases for you to try your implementation. Uncomment corresponding test cases and run tests.rkt to test your implementation. Bonus. Here is an alternative datatype ropes that allows manipulation of sequence of characters instead of the most commonly used strings. You can try to implement ropes instead of strings as a bonus challenge. Note: The bonus question is worth 2 points in your overall final grade and no partial credits will be awarded. To get full credit, please implement this problem using the second code boilerplate (Project2BONUS) provided and write at least 6 test cases (two for each: fetch ith character, concatenate, substring) in a clear way to your tests.rkt for us to run. Please make sure that your test cases are clear and tests.rkt doesn’t give any errors, otherwise you won’t be able to receive any credits for this question. Add your code for the bonus problem to your submission as specified in the instructions. 
Hint: Define your rope datatype similar to the way you did in the project, clearly define your grammar, and feel free to use any helper procedures.

In this project, you will work in groups of two or three. To create your group, use the Google Sheet file in the following link: Link to Google Sheets for Choosing Group Members. Note: You need to self-enroll to your Project 3 group on Blackboard (please only enroll to the same group number as your group in the Sheets); please make sure that you are enrolled to Project 3 – Group #YourGroup. This project contains a boilerplate provided to you; use Project3DataStructures for the project. Submit your Racket files for the coding questions to Blackboard as a zip. Include a brief explanation of your approach to the problems and your team's workload breakdown in a PDF file. Name your submission files as p3_member1IDno_member1username_member2IDno_member2username.zip, for example p3_0011221_mozcelik17_0011222_hsasmaz16.zip.

Important Notice: If your submitted code is not working properly, i.e. throws an error or fails in all test cases, your submission will be graded as 0 directly. Please comment out the parts that throw errors, and indicate explicitly in your report both which parts work and which do not.

Testing: You are provided some test cases under tests.scm. Please check them to understand how your implementation should work. You can run all tests by running top.scm. We will test your program with additional cases, but your submission should pass all provided test cases.

Please use the Project 3 Discussion Forum on Blackboard for all your questions. The deadline for this project is Dec 19, 2020 – 23:59 (GMT+3: Istanbul Time). Read your task requirements carefully. Good luck!

Table 1. Grade Breakdown for Project 3
Question   Grade Possible
Part A     20
Part B     35
Part C     35
Report     10
Total      100

Project Definition: In this project, you will add the most common data structures, such as array, stack and queue, to EREF. Please read each part carefully, and pay attention to the Assumptions and Constraints section.

Part A. In this part, you will add arrays to EREF. Introduce new operators newarray, update-array, and read-array with the following definitions: (20 pts)

newarray: Int * Int -> ArrVal
update-array: ArrVal * Int * ExpVal -> Unspecified
read-array: ArrVal * Int -> ExpVal

This leads us to define the value types of EREF as:

ArrVal = (Ref(ExpVal))*
ExpVal = Int + Bool + Proc + ArrVal + Ref(ExpVal)
DenVal = ExpVal

The array operators are defined as follows:
newarray(length, value) initializes an array of size length with the value value.
update-array(arr, index, value) updates the value of the array arr at index index with the value value.
read-array(arr, index) returns the element of the array arr at index index.

Part B. In this part, you will implement Stack using the arrays that you implemented in Part A. (35 pts)

Stack is a data structure that serves as a collection of elements, where the elements are accessed in a LIFO (Last In, First Out) manner. In other words, when an element is added to the stack, it is added on top of all elements, and when an element is popped from the stack, the topmost element is extracted. You will implement the following Stack operators:

newstack() returns an empty stack.
stack-push(stk, val) adds the element val to the stack stk.
stack-pop(stk) removes the topmost element of the stack stk and returns its value.
stack-size(stk) returns the number of elements in the stack stk.
stack-top(stk) returns the value of the topmost element in the stack stk without removal.
empty-stack?(stk) returns true if there is no element inside the stack stk and false otherwise.
print-stack(stk) prints the elements in the stack stk.

Part C. In this part, you will implement Queue using the arrays that you implemented in Part A. (35 pts)

Queue is a data structure that serves as a collection of elements, where the elements are accessed in a FIFO (First In, First Out) manner. A good example of a queue is any queue of consumers for a resource, where the consumer that came first is served first. When an element is popped from the queue, the first element that was pushed is extracted. You will implement the following Queue operators:

newqueue() returns an empty queue.
queue-push(queue, val) adds the element val to the queue queue.
queue-pop(queue) removes the first element of the queue queue and returns its value.
queue-size(queue) returns the number of elements in the queue queue.
queue-top(queue) returns the value of the first element in the queue queue without removal.
empty-queue?(queue) returns true if there is no element inside the queue queue and false otherwise.
print-queue(queue) prints the elements in the queue queue.

Report. Your report should include the following: (10 pts)
(1) The workload distribution of group members.
(2) The parts that work properly, and those that do not.
(3) Your approach to the implementations: how does your stack/queue work?
Include your report in PDF format in your submission folder.

Assumptions and Constraints. Read the following assumptions and constraints carefully. You need not consider edge cases excluded by the assumptions.
(1) Stack and Queue do not have to be newly defined data types; you can utilize the array implementation from Part A.
(2) For stack and queue, you may assume that values are integers.
(3) For stack and queue, values will be in the range [1, 10000].
(4) The number of push operations will not exceed 1000 for a single stack/queue.
(5) It is guaranteed that the correct type of parameters will be passed to the operators.
For example, in stack-pop(stk), stk will always be a stack.
(6) If the stack/queue is empty, the pop operation must return -1.
(7) You CANNOT define global variables to keep track of the size or top element of a stack/queue. The reason is that we may create multiple stacks/queues, and each of them may have a different size and top element.

Sample Programs. Here are some sample programs for you.

let x = newstack() in
begin
  stack-push(x, 20); stack-push(x, 30); stack-push(x, 40);
  stack-pop(x);
  print-stack(x)
end
;;; 20 30

let x = newstack() in
begin
  stack-push(x, 20); stack-push(x, 30); stack-push(x, 40);
  stack-pop(x);
  empty-stack?(x)
end
;;; (bool-val #f)

let x = newqueue() in
begin
  queue-push(x, 20); queue-push(x, 30); queue-push(x, 40);
  queue-pop(x);
  print-queue(x)
end
;;; 30 40

let x = newqueue() in
begin
  queue-push(x, 20); queue-push(x, 30); queue-push(x, 40);
  queue-pop(x);
  queue-size(x)
end
;;; (num-val 2)

In this project, you will work in groups of two or three. To create your group, use the Google Sheet file in the following link: Link to Google Sheets for Choosing Group Members. Note: You need to self-enroll to your Project 4 group on Blackboard (please only enroll to the same group number as your group in the Sheets); please make sure that you are enrolled to Project 4 – Group #YourGroup. This project contains 2 main parts about 2 different topics, namely Parameter Passing and Continuation Passing Style. Submit a report containing your answers to the written questions in PDF format and Racket files for the coding questions to Blackboard as a zip. Include a brief explanation of your team's workload breakdown in the PDF file. Name your submission files as p4_member1IDno_member1username_member2IDno_member2username.zip, for example p4_0011111_baristopal20_0022222_etezcan19.zip.

Important Notice: If your submitted code is not working properly, i.e. throws an error or fails in all test cases, your submission will be graded as 0 directly.
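The sample programs above pin down the expected behavior of the operators. The same semantics can be sketched with array-backed structures, in Java for illustration only (the project itself extends EREF in Scheme; the 1000-slot capacity below follows the assignment's push bound, and the method names are ours, not the project's Scheme operators):

```java
final class ArrayStack {
    private final int[] arr = new int[1000]; // push count is bounded by 1000
    private int size = 0;                    // per-instance state, not global (constraint 7)
    void push(int v) { arr[size++] = v; }
    int pop() { return size == 0 ? -1 : arr[--size]; } // empty pop returns -1 (constraint 6)
    int top() { return arr[size - 1]; }
    int size() { return size; }
}

final class ArrayQueue {
    private final int[] arr = new int[1000];
    private int head = 0, tail = 0;          // FIFO: pop from head, push at tail
    void push(int v) { arr[tail++] = v; }
    int pop() { return head == tail ? -1 : arr[head++]; } // empty pop returns -1
    int size() { return tail - head; }
}

class StackQueueDemo {
    public static void main(String[] args) {
        ArrayStack s = new ArrayStack();
        s.push(20); s.push(30); s.push(40);
        s.pop();                                 // removes 40 (LIFO)
        System.out.println(s.top() + " " + s.size()); // 30 2, matching ";;; 20 30"

        ArrayQueue q = new ArrayQueue();
        q.push(20); q.push(30); q.push(40);
        q.pop();                                 // removes 20 (FIFO)
        System.out.println(q.size());            // 2, matching ";;; (num-val 2)"
    }
}
```

The only structural difference between the two is where pop reads from: the stack pops at the write end, the queue pops at the opposite end.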
Please comment out parts that cause errors, and indicate explicitly in your report which parts work and which do not. Please use the Project 4 Discussion Forum on Blackboard for all your questions. The deadline for this project is January 8, 2021 – 23:59 (GMT+3: Istanbul Time). Read your task requirements carefully. Good luck!

Table 1. Grade Breakdown for Project 4
1. Parameter Passing: Task 1 – 10 points; Task 2 – 15 points; Task 3 – 25 points
2. Continuation Passing Style: Task 4 – 8 points; Task 5 – 42 points

1. Parameter Passing

Task 1: Why may the pairs below sometimes give different results for the same expression?
• Call-by-value and call-by-reference
• Call-by-need and call-by-name
What are the advantages and disadvantages of each?

Task 2: To use the call-by-need parameter passing variation, some specific changes and additions have to be made to the IREF implementation. Two of these are given below:

; Change
(var-exp (var)
  (let ((ref1 (apply-env env var)))
    (let ((w (deref ref1)))
      (if (expval? w)
          w
          (let ((val1 (value-of-thunk w)))
            (begin
              (setref! ref1 val1)
              val1))))))

; Addition
(define value-of-thunk
  (lambda (th)
    (cases thunk th
      (a-thunk (exp1 saved-env)
        (value-of exp1 saved-env)))))

Explain why these code pieces are needed. Analyze in detail how this code works, line by line, and state in which file(s) of the IREF implementation it should be added.

Task 3: Write an expression that gives different results in: (1) Call-by-reference and call-by-need (2) Call-by-reference and call-by-name (3) Call-by-value and call-by-need (4) Call-by-value and call-by-name. In total, 4 expressions should be written (one for each case). As a reference, in the Parameter Passing directory of the Project Assignment zip, code for all 4 of these parameter passing variations is already provided. Please do not change any files except tests.scm! In each of their tests.scm files, a place is reserved for you to add your expression.
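The essence of the Task 2 var-exp change is memoization: a reference cell may hold either a value or an unevaluated thunk, and the first dereference forces the thunk and writes the result back (setref!) so it is never evaluated again. A toy model of that behaviour, sketched in Python rather than the IREF language (the Ref class, deref_by_need, and the calls counter are illustrative names, not part of the project):

```python
# Toy model of call-by-need: a reference cell holds either a value or an
# unevaluated thunk; the first lookup forces the thunk and caches the result.

class Ref:
    def __init__(self, thunk):
        self.contents = thunk           # starts out holding a thunk (a zero-arg function)
        self.is_value = False

calls = []                              # records each actual evaluation of the thunk

def deref_by_need(ref):
    if not ref.is_value:                # w is a thunk: force it once (like value-of-thunk)...
        ref.contents = ref.contents()
        ref.is_value = True             # ...and write the result back (like setref!)
    return ref.contents

def expensive():
    calls.append(1)                     # side effect so we can see how many times it ran
    return 42
```

Dereferencing the same Ref twice yields 42 both times but runs expensive() only once; under call-by-name (no setref!) it would run on every lookup, which is exactly why the two variations can give different results.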
Please keep in mind that you should add the same expression in both of the parameter passing variations. In other words, if you wrote an expression that gives different outcomes in, for instance, call-by-value and call-by-need, please add this expression in both of their tests.scm files.

Notes:
• If your code gives any error, then you will directly receive 0 points from this task.
• For simplicity, assign-exp and begin-exp have also been added to the call-by-value code.
• In the call-by-value code, some expressions and structures such as mutable pairs are not defined. Keep these differences in mind while trying to write your expression.

2. Continuation Passing Style

Task 4: Using Scheme [1], implement a function fibonacci that takes a parameter n and returns the nth Fibonacci number, with Continuation Passing Style. The Fibonacci sequence goes like: F = [1, 1, 2, 3, 5, 8, 13, . . .] where F[1] = 1, F[2] = 1, and F[n] = F[n − 1] + F[n − 2]. [2]

Task 5: You are given a LETREC implementation that has CPS with data-structural representations for continuations. Extend this language to include list and map. [3]

Important: Your implementations must use CPS. Furthermore, in your CPS implementations your value-of calls should be tail calls only. In particular, you must see the “End of Computation” message appear only once when you run your program. See page 144 of the EOPL book for more detail.
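To illustrate the shape Task 4 is asking for, here is CPS fibonacci sketched in Python (your submission must be plain Scheme, but the structure carries over directly: the continuation is an extra parameter, and every result is passed to it rather than returned into an arithmetic context):

```python
# Fibonacci in continuation-passing style: fib_k never returns a value
# directly into an arithmetic context; every result is handed to k.

def fib_k(n, k):
    if n <= 2:
        return k(1)                                   # F[1] = F[2] = 1
    # compute F[n-1]; in its continuation compute F[n-2]; then add and pass on
    return fib_k(n - 1,
                 lambda a: fib_k(n - 2,
                                 lambda b: k(a + b)))

def fibonacci(n):
    return fib_k(n, lambda v: v)                      # initial (identity) continuation
```

fibonacci(7) evaluates to 13, matching F = [1, 1, 2, 3, 5, 8, 13, . . .]. Note how every recursive call to fib_k is in tail position, the same property you must preserve for value-of/k in Task 5.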
Here is an example of the diff expression continuation, with good and bad CPS usage:

; good usage, value-of is in a tail call
(diff1-cont (exp2 saved-env saved-cont)
  (value-of/k exp2 saved-env (diff2-cont val saved-cont)))
(diff2-cont (val1 saved-cont)
  (let ((num1 (expval->num val1))
        (num2 (expval->num val)))
    (apply-cont saved-cont (num-val (- num1 num2)))))

; bad usage, value-of is not in a tail call
(diff1-cont (exp2 saved-env saved-cont)
  (apply-cont saved-cont
    (num-val (- (expval->num val)
                (expval->num (value-of/k exp2 saved-env (end-cont)))))))

We have provided test cases for you in tests.scm, and a few hints can also be found within the code as comments. In particular, we have marked where you should write your code in each file as:

;;;;;;;;;;;;;;;;;;;;;;; TASK 5 ;;;;;;;;;;;;;;;;;;;;;;;;
; some comments
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Do not change anything in the tests.scm file! If you would like to run your own code, write it in the console under top.scm.

List Implementation. Your list implementation will be similar to how we construct arrays from mutable pairs, or how a Scheme list is constructed as pairs. In fact, this part of the task is very similar to exercise 5.6 of the EOPL book, at page 153. You will add two new values to the language:
• pair value
• emptylist value

[1] You do not need to extend a language or anything; just write plain Scheme code.
[2] In some mathematical contexts the sequence starts with 0 instead of 1, but this way is a bit easier to implement.
[3] Hint: You will need to make changes in interp.scm, data-structures.scm, and lang.scm.

A list expression looks like: list(exp1, exp2, …, expN) The list is composed of pairs. Here is an example:

> (run “list(1,2)”)
End of computation.
(pair-val (num-val 1) (pair-val (num-val 2) (emptylist-val)))

The basic list operations you will implement are:
• car(expression) returns the left part of the pair value.
• cdr(expression) returns the right part of the pair value.
• null?(expression) returns true if the expression is an emptylist value. • emptylist actually creates an empty list, with the value emptylist. You will also need to implement 2 extractors and a predicate: • expval->car extracts the car of the expressed pair value. • expval->cdr extracts the cdr of the expressed pair value. • expval-null? returns true if the expressed value is an emptylist value. There are several examples in tests.scm, but here is one that covers most of these operations. > (run “let x = 3 in let arr = list(x, -(x,1)) in let y = if null?(arr) then 0 else car(cdr(arr)) in y”) End of computation. (num-val 2) Note that in this example, just cdr(arr) does not yield 2, but rather we have to do car(cdr(arr)). This is because in fact the first cdr yields a: (pair-val (num-val 2) (emptylist-val)) Map Implementation. The map expression looks like: map(expression, expression) Here, the first expression will be treated like a proc expression with one parameter, and the second expression will be treated like a list expression. As an example, here is subtracting 5 from each element of the list: > (run “map(proc (v) -(v,5), list(5, 10, 2))”) End of computation. (pair-val (num-val 0) (pair-val (num-val 5) (pair-val (num-val -3) (emptylist-val)))) When you run top.scm the tests will run automatically. If everything works fine, you will see “no bugs found” message at the bottom of the console. Even if no bugs are found, if for some test you see more than one “End of Computation.” message, then there is something wrong with how you implemented CPS.
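The pair-val / emptylist-val representation and the behaviour of map can be modelled outside the interpreter as well. Below is a small Python sketch of the same value shapes (the tag tuples, make_list, and map_list are hypothetical helper names for illustration; your actual implementation lives in the LETREC interpreter and must be in CPS):

```python
# Model of the Task 5 list values: a list is nested pairs ending in the empty
# list, exactly like (pair-val v (pair-val ... (emptylist-val))).

EMPTY = ("emptylist-val",)

def pair(a, d):
    return ("pair-val", a, d)

def car(p):
    return p[1]                        # left part of the pair

def cdr(p):
    return p[2]                        # right part of the pair

def is_null(v):
    return v == EMPTY                  # like null?/expval-null?

def make_list(*vals):
    out = EMPTY
    for v in reversed(vals):           # list(1,2) => pair(1, pair(2, empty))
        out = pair(v, out)
    return out

def map_list(f, lst):
    if is_null(lst):
        return EMPTY
    return pair(f(car(lst)), map_list(f, cdr(lst)))
```

For instance, map_list(lambda v: v - 5, make_list(5, 10, 2)) yields the pairs 0, 5, -3 in order, mirroring the sample run of map(proc (v) -(v,5), list(5, 10, 2)); and, as in the handout's example, the first cdr of list(1, 2) is itself a pair whose car is 2.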


[SOLVED] Ece472 assignments 1 to 4 solution

tldr: Perform linear regression of a noisy sinewave using a set of gaussian basis functions with learned location and scale parameters. Model parameters are learned with stochastic gradient descent. Use of automatic differentiation is required. Hint: note your limits!

Problem Statement. Consider a set of scalars {x1, x2, . . . , xN} drawn from U(0, 1) and a corresponding set {y1, y2, . . . , yN} where:

yi = sin(2πxi) + εi   (1)

and εi is drawn from N(0, σnoise). Given the following functional form:

ŷi = Σ (j = 1 to M) wj φj(xi | μj, σj) + b   (2)

with:

φ(x | μ, σ) = exp(−(x − μ)² / σ²)   (3)

find estimates b̂, {μ̂j}, {σ̂j}, and {ŵj} that minimize the loss function:

J(y, ŷ) = ½ (y − ŷ)²   (4)

for all (xi, yi) pairs. Estimates for the parameters must be found using stochastic gradient descent. A framework that supports automatic differentiation must be used. Set N = 50, σnoise = 0.1. Select M as appropriate. Produce two plots. First, show the data points, a noiseless sinewave, and the manifold produced by the regression model. Second, show each of the M basis functions. Plots must be of suitable visual quality.

[Figure 1: Example plots (“Fit 1”, “Bases for Fit 1”, “Fit 2”, “Bases for Fit 2”) for models with equally spaced sigmoid and gaussian basis functions.]

tldr: Perform binary classification on the spirals dataset using a multi-layer perceptron. You must generate the data yourself.

Problem Statement. Consider a set of examples with two classes and distributions as in Figure 1. Given the vector x ∈ R², infer its target class t ∈ {0, 1}. As a model, use a multi-layer perceptron f which returns an estimate for the conditional density p(t = 1 | x):

f : R² → [0, 1]   (1)

parametrized by some set of values θ. All of the examples in the training set should be classified correctly (i.e.
p(t = 1 | x) > 0.5 if and only if t = 1). Impose an L2 penalty on the set of parameters. Produce one plot. Show the examples and the boundary corresponding to p(t = 1 | x) = 0.5. The plot must be of suitable visual quality. It may be difficult to find an appropriate functional form for f; write a few sentences discussing your various attempts.

[Figure 1: Sample spiral data (“Spirals”).]

tldr: Classify mnist digits with an (optionally convolutional) neural network. Get at least 95.5% accuracy on the test set.

Problem Statement. Consider the mnist dataset consisting of 50,000 training images and 10,000 test images. Each instance is a 28 × 28 pixel handwritten digit, zero through nine. Train an (optionally convolutional) neural network for classification using the training set that achieves at least 95.5% accuracy on the test set. Do not explicitly tune hyperparameters based on test set performance; use a validation set taken from the training set, as discussed in class. Use dropout and an L2 penalty for regularization. Note: if you write a sufficiently general program, the next assignment will be very easy. Do not use the built-in mnist data class from tensorflow.

Extra challenge (optional). In addition to the above, the student with the fewest number of parameters for a network that gets at least 80% accuracy on the test set will receive a prize. There will be an extra prize if anyone can achieve 80% on the test set with a single-digit number of parameters. For this extra challenge you can make your network have any crazy kind of topology you’d like; it just needs to be optimized by a gradient-based algorithm.

tldr: Classify cifar. Achieve performance similar to the state of the art. Classify cifar. Achieve a top-5 accuracy of 80%.

Problem Statement. Consider the cifar datasets, which contain 32 × 32 pixel color images.
Train a classifier for each of these with performance similar to the state of the art for cifar. It is your task to figure out what the state of the art is. Feel free to adapt any techniques from papers you read. I encourage you to experiment with normalization techniques and optimization algorithms in this assignment. Write a paragraph or two summarizing your experiments. Hopefully you’ll be able to reuse your mnist program.
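The gaussian-basis regression from the first assignment above can be prototyped in plain Python. The sketch below deliberately simplifies: the centres μj and scales σj are fixed rather than learned (the assignment requires learning them via automatic differentiation), and the choices M = 8, σ = 0.15, learning rate 0.1, and 100 epochs are arbitrary values for the sketch, not recommendations:

```python
import math
import random

random.seed(0)

# Synthetic data per the handout: y = sin(2*pi*x) + eps, x ~ U(0, 1)
N, sigma_noise = 50, 0.1
xs = [random.random() for _ in range(N)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, sigma_noise) for x in xs]

# Gaussian basis phi(x | mu, sigma) = exp(-(x - mu)^2 / sigma^2).
# Centres/scales are FIXED here for brevity; the assignment requires
# learning mu and sigma too, via autodiff.
M = 8
mus = [j / (M - 1) for j in range(M)]   # equally spaced over [0, 1]
sig = 0.15

def phi(x, m):
    return math.exp(-(x - m) ** 2 / sig ** 2)

w = [0.0] * M
b = 0.0
lr = 0.1

def predict(x):
    return sum(w[j] * phi(x, mus[j]) for j in range(M)) + b

def mean_loss():                        # J = 0.5*(y - yhat)^2, averaged over the data
    return sum(0.5 * (y - predict(x)) ** 2 for x, y in zip(xs, ys)) / N

before = mean_loss()
for epoch in range(100):                # stochastic gradient descent on (w, b)
    for x, y in zip(xs, ys):
        g = predict(x) - y              # dJ/dyhat
        b -= lr * g                     # dyhat/db = 1
        for j in range(M):
            w[j] -= lr * g * phi(x, mus[j])   # dyhat/dwj = phi_j(x)
after = mean_loss()
```

Because the model is linear in (w, b) once the bases are fixed, this SGD loop reliably drives the loss toward the noise floor; the full assignment additionally backpropagates through μ and σ, which is where the autodiff framework earns its keep.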


[SOLVED] Com s/se 319 : hw 1 to 5 solutions

1. APPLICATION SECTION: Server Client/Thread

Create a chat application. You will need to create both the server and the client code. The server and clients should run on localhost. Below are some features that you should incorporate. Note: we may have under-specified what you need to do. If so, make up your own rules for those situations. In other cases, follow the requirements carefully.

1.1 Connect to Server (15 points)
a) When you start a client, it should come up with a prompt; you should make it entirely text based. (5 points)
> Enter your Name: (Type in your name, then press Enter)
b) After the user enters a name, the client should be connected to your server. (10 points)

1.2 Send text message to server (25 points)
a) Send the user name and message to the server. (5 points)
b) The server then broadcasts the client’s text message to every other currently connected client (i.e. not to the sending client). (10 points)
c) Messages should be printed in each client’s console and in the server’s console. (10 points)

2. What to Submit: Submit via Canvas a compressed file (.zip) (rename it with your LAST NAME) containing the following:
1. Zip your Eclipse project and submit on Canvas along with a README file and the report. Make sure to include all the files that are needed in order to run your program(s). [All APPLICATION SECTION points = 5+10+5+10+10 = 40 points]
2. A README file explaining how to compile and run the program. [5 points]
3. A report (.docx or .pdf) describing your solution approach and screenshots of every required output. [5 points]

1. Warm Up: Try Some Examples (HTML & Javascript)
a. First, open Canvas, go to Assignments, and then download the HW02.zip file into your workspace (U:workspace or something like that!). Then, unzip.
b. Play with each of the given examples (in the examples directory). Open them using a text editor of your choice and modify parts of the html or js files to learn how the different instructions work.
If you want to use Eclipse instead of notepad or vim or emacs etc., create a new static web project, create a new html file, and open it with a browser. Note: w3schools.com is a good site to learn about web technologies. Note that the assignment assumes you have understood these examples. Note: Please always use relative paths in your homeworks and Portfolios.

2. Form Validation
2.1 Create a form in HTML and validate entries of the form using javascript. [25 points]
2.1.1 Create two files: validation1.html and validation1.js.
2.1.2 The TITLE of the validation1.html page should be “Validation Form”.
2.1.3 Create a HTML form in validation1.html:
a) Containing the fields as in the table below.
b) In addition, it should also have a “Continue” button.
c) Make it look reasonably good. Validation rules will be explained in the next step.

FIELD LABEL (Field Type): Validation rule
First name (TextField): *Required. Must contain only alphabetic or numeric characters.
Last name (TextField): *Required. Must contain only alphabetic or numeric characters.
Gender (Dropdown: male, female): *Required.
State (Dropdown: California, Florida, New York, Texas, Hawaii, Washington, Colorado, Virginia, Iowa, Arizona): *Required. Select from the given list.
*Required field = Cannot be Empty.

d) Read https://www.w3schools.com/js/js_validation.asp.
e) Now, write Javascript code in validation1.js so that when the user clicks the “Continue” button it does the following:
● It validates the entries (as per the table above), and for each entry displays the correct image if the validation was successful; otherwise it displays the wrong image. (These images are included in the lab’s zip file as correct.png and wrong.png.)
● Once the validation is successful, it goes to the next page (validation2.html).
f) Remember to include validation1.js in the head element of validation1.html.

2.2 Create a form in HTML and validate entries of the form using javascript. [20 points]
2.2.1 Create two files: validation2.html and validation2.js.
2 2.2.2 The TITLE of the validation2.html page should be “Contact information”. 2.2.3 Create a HTML form in validation2.html: a) Containing the fields as in the table below. b) In addition, it should also have a “Submit” button. c) Make it look reasonably good. Validation rules will be explained in next step. FIELD LABEL Field Type Validation rule RESULT Email TextField *Required. Must be in the form [email protected] x should be alphanumeric (e.g. no special symbols). / Phone TextField *Required. Must be in the form xxx-xxx-xxxx or xxxxxxxxxx. x should be numeric / Address TextField *Required Must be in the form of city & state. example: Ames,IA / *Required field = Cannot be Empty. d) Read https://www.w3schools.com/tags/att_input_pattern.asp. Also, do look at validation example in ExamplesJS folder. e) Write Javascript code in validation2.js to validate the form as per the rules in the above table when the user clicks Submit button f) Your code should display if the validation was successful, or if there was an error, display . g) Remember to include validation2.js in the head section of validation2.html. 3 2. What to Submit: Make sure your solutions work on Chrome (which is what TAs will use to grade the assignment). Submit via Canvas a compressed file (.zip) (rename it with your LAST NAME) containing the following: 1. All the files (.html and .js) which are needed in order to run your program(s). [45 points] 2. A report (.docx or .pdf) describing your solution approach and screenshots of every required output. [5 points].This assignment is focused on UI and event driven programming and Event Handling Task 1: UI and Event Driven Programming: (30 points) Objectives: Learn to use Javascript objects, functions, and closures to implement UI and event driven programming. Warm-up: NOTE 1: One suggestion (to help you play with javascript) is to use online Javascript code tool like https://codepen.io/pen/ or https://jsbin.com. 
They are very useful for trying javascript examples as you can change the html or javascript directly on the website, and you can immediately see the results of your changes. NOTE 2: You will need to also learn how to use the available tools for JS debugging. Firefox has tools->WebDeveloper->Debugger, Chrome has Tools->Developer Tools (ctrl-shift-I). NOTE 3: Play with each of the given examples (in the examples directory). Open them using a text editor of your choice and modify parts of the html or js files to learn how the different instructions work. Task: A complete example of another program (Matching game) is provided in folder SampleProgram. Please take a look at that one first. A starting template is provided in folder ExerciseHelp. Your assignment is to use this template to create a simple decimal calculator programs using objects, functions, and closures. This calculator should look approximately like the below picture. You can look at a normal calculator to figure out the functionality of M+, M-, MR and MC. For decimal calculator, 1. “ + ”, “ – ”, “ * ” , “ / ” should be used respectively for addition, subtraction, multiplication and division. (2×4 = 8 points) 2. “ . ” should be used for operation with decimals. (3 points) 3. Negative number operations e.g., “(-2)-3 = -5” (3 points) 4. Assume that the calculator does not need to calculate complex operations such as 5 + 5 * 5. Instead, expect users to press “=” operator after a basic operation. So, press 5 + 5 followed by =. At this point it should show 10. Then, press “*” and then 5 followed by “=”. At this point show 50. When an operator button is pressed, the operator button’s font becomes red. In other words, assume that we are expecting user to enter only “operand1 operator operand2 = “. However, we can use the results of the previous operation as the first operand for the next operation. Check list: [ ] Your javascript file should be named “calculator.js”. [ ] Use relative path in all of your files. 
[ ] Name your Objects based on their purpose. Do the same with your JavaScript functions. [ ] Show UI Display for decimal calculator correctly. (3 points) [ ] MR (shows memory value on screen) (2 points) [ ] MC (clears memory value) (2 points) [ ] M+ (Whatever is on screen gets added to memory) (2 points) [ ] M- (Whatever is on screen gets subtracted from memory) (2 points) [ ] C (clears screen value, clear the last operation, press “=” will not repeat the last operation) (2points) [ ] = (shows results of an operation) and highlight the last button (any digit/ operator) clicked (3 points) [ ] Make sure that your variables are not global (so that if someone includes some other js files with same names for variables, then your code still works ok). Task 2: Event Handling (15 points) Write a Javascript and HTML code (named snake.html and snake.js) to implement the functionality shown in ‘Problem2Output.mp4’ included in the zip file. Note: 1. The line you create can go over any previous paths. [4 points] 2. The line will bend left when left button is clicked. [4 points] 3. The line will bend right when right button is clicked. [4 points] 4. The line should stop if it touches any boundary. [3 points] Hints: 1. Use HTML5 Canvas (see https://www.w3schools.com/graphics/canvas_intro.asp) 2. Make sure to use a timer (see example below) to update the canvas (so that the snake keeps moving). A Timer has two main functionalities that can be used in the project. a. The setInterval(function, delay) schedules the “code” after every “delay” microseconds. b. The clearInterval removes the timer Here is an example of timer code. This will countdown from 100 until you press stop! What to Submit: Make sure your solutions work on Chrome as TAs will use it to grade the assignment. Submit via Canvas a compressed file (.zip) containing the following: ● lab.html, calculator.js, for Task 1 and snake.html and snake.js for Task 2. 
[Task 1 + Task 2 = 30 + 15 = 45 Points]
● README file explaining how to compile and run your program & a Report (.docx or .pdf) describing your solution approach and screenshots of every required output. [5 points]

This assignment is focused on node.js.

Task 1: (45 points)
Objectives: Learn node.js programming.
Warm-up: NOTE 1: Play with the given example. Open it using a text editor of your choice and modify it to learn how the different instructions work.
Task: *It will be a console-based application. Your assignment is to create a simple binary calculator program. This calculator should look approximately like the given warm-up exercise. For the binary calculator:
1. Note that for some operations on the binary calculator, it may be more convenient to convert the binary numbers to integers and then do the operation. (This is a suggestion; you can implement your own logic.)
2. You can assume that only positive binary numbers are represented and used. For example, positive 9 is represented as 1001.
3. Binary operator “+” represents plus (5 points)
4. Binary operator “*” represents multiply (5 points)
5. Binary operator “/” represents division (5 points)
6. Binary operator “%” represents mod or remainder (i.e. divide the first value by the second and take what is remaining; only works on positive numbers) (5 points)
7. Unary operator “<<” represents shift left (only works on positive numbers) (5 points)
8. Unary operator “>>” represents shift right (only works on positive numbers) e.g. (101 >> gives 10) (5 points)
9. Binary operator “&” represents AND (only works on positive numbers) e.g. (101 & 1011 gives 0001) (5 points)
10. Binary operator “|” represents OR (only works on positive numbers) e.g. (101 | 1010 gives 1111) (5 points)
11. Unary operator “~” represents NOT (i.e. invert each bit of the binary value; only works on positive numbers) e.g. (101 ~ gives 10) (5 points)

What to Submit: Submit via Canvas a compressed file (.zip) containing the following:
● code(s) for Task 1.
[Task 1= 45 Points] ● README file explaining how to compile and run your program & a Report (.docx or .pdf) describing your solution approach and screenshots of every required output. [5 points].Task: Implement a Turn Based human vs human tic-tac-toe game with suitable GUI. Typically Tic-tac-toe (also known as noughts and crosses or Xs and Os) is a paper-and-pencil game for two players, X and O, who take turns marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game. The given example of the game is won by the first player, X which has been illustrated in the below figure 1 : (More about Tic-tac-toe:https://en.wikipedia.org/wiki/Tic-tac-toe) Figure 1: Tic-tac-toe Game You have to implement this task using Java code and JavaFX GUI components. Check list: 1. Use the provided images (included in the zip file) for marking X and O. [5 points] 2. Show which player’s turn while playing the game. [5 points] 3. Click on the blank cell to mark X or O (unmarked cell should be checked and marked cell can not be marked again). [10 points] 4. When one player wins, stop the game and show ”Congratulations, X win the game” or ”Congratulations, O win the game” in your designed GUI. [10 points] 5. When all cells are filled in and no one wins, stop the game and show ”Draw”. [10 points] 6. When the game is over, show the option to restart a new game. [5 points] 1 What to Submit: Submit via Canvas a compressed file (.zip) [rename it with your LAST NAME] containing the following: ● All of your source code (e.g., .java files). [Task 1= 45 Points] ● README file explaining how to compile and run your program & a Report (.docx or .pdf) describing your solution approach and screenshots of every required output. [5 points]. _________________________________
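The win/draw rule at the heart of the tic-tac-toe task is independent of the GUI, so it is worth isolating before wiring it into JavaFX. A language-neutral sketch (in Python; your submission must be Java with JavaFX components, and the function and constant names here are only illustrative):

```python
# Win/draw logic for a 3x3 tic-tac-toe board. Cells hold "X", "O", or None.
# The real assignment wires this rule into a JavaFX GUI.

LINES = [
    [(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 1), (1, 2)], [(2, 0), (2, 1), (2, 2)],  # rows
    [(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)], [(0, 2), (1, 2), (2, 2)],  # columns
    [(0, 0), (1, 1), (2, 2)], [(0, 2), (1, 1), (2, 0)],                            # diagonals
]

def winner(board):
    """Return 'X' or 'O' if a line is complete, 'Draw' if the board is full,
    or None while the game is still in progress."""
    for line in LINES:
        a, b, c = (board[r][col] for r, col in line)
        if a is not None and a == b == c:
            return a
    if all(cell is not None for row in board for cell in row):
        return "Draw"
    return None
```

Checking this after every click directly drives checklist items 4 and 5: a returned mark triggers the "Congratulations, X win the game" / "O win the game" message, "Draw" triggers the draw message, and None means play continues.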


[SOLVED] Ece365 programming assignment 1 to 4 solutions

First, you are going to create a hash table class. Then you are going to write a program that uses your hash table class to read in a “dictionary” and spell check a “document”. For the purposes of this assignment, a valid word is defined as any sequence of valid characters, and the valid characters are letters (capital and lowercase), digits (0 – 9), dashes (-), and apostrophes (‘). Every other character is considered a word separator. A dictionary is defined as a list of recognized words. The dictionary is guaranteed to contain exactly one word per line, with no leading or trailing spaces, followed by a single, Unix-style newline character ( ). Some of the words in the dictionary might not be valid (i.e., they may contain invalid characters). When loading the dictionary, invalid words, and also words that are too long (see below), can optionally be ignored. The dictionary does not specify the meanings of words; it just lists them. The document to spell check may be any valid text file. Each line in the document will end with a single, Unix-style newline character. When spell checking the document, your program should indicate every unrecognized word, including the line number on which it occurs. Words should only be allowed to grow up to 20 characters. If a word in the document is too long, you should indicate the line number on which this occurs along with the first 20 characters of the word. The first line in the document is line 1. Words in the document that include digits (perhaps in addition to other valid characters) are technically valid but should not be spell checked (i.e., your program should ignore them). In the document, as previously stated, every character that is not a valid word character is a word separator; e.g., the string “abc@def” represents two valid words, “abc” and “def”. Therefore, there cannot be invalid words in the document. 
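The word rules above (letters, digits, dashes, and apostrophes are word characters; everything else separates words; words containing digits are valid but skipped) are easy to get subtly wrong, so here is a compact model of the tokenization, sketched in Python even though the assignment must be written in C++ (the regex and the helper name words_to_check are illustrative, not prescribed by the handout):

```python
import re

# Valid word characters per the handout: letters, digits, dashes, apostrophes.
WORD = re.compile(r"[A-Za-z0-9'-]+")

def words_to_check(line):
    """Split a line into valid words, lowercase them, and drop words that
    contain digits (such words are valid but are not spell checked)."""
    out = []
    for w in WORD.findall(line.lower()):
        if not any(c.isdigit() for c in w):
            out.append(w)
    return out
```

For example, "abc@def" splits at the separator "@" into two words, abc and def, exactly as the handout describes, while a word like "2nd" is recognized but excluded from spell checking. The 20-character length limit and line-number reporting would sit on top of this in the C++ version.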
Your program should be case insensitive, and all capital letters in both the dictionary and the document should be converted to lowercase immediately upon seeing them. Your program must be written in C++. In order to implement this task efficiently, you will use a hash table. You must implement a hash table class using separate files, including a header file and a source code file. Not every member function of the class will be necessary for this assignment, but you will reuse this class for your next two assignments. Since our textbook provides code for the separate chaining and quadratic probing collision resolution strategies, I am requiring that you use either linear probing or double hashing. (Linear probing is a bit simpler to implement, and you will not receive any extra credit if you choose double hashing.) You are welcome to look at the book’s code for the other two strategies, but keep in mind that the instructions I am specifying for your hash table class make this different than the book’s implementation in several ways. For example, the book uses templates for its hash table class, but you will not. Also, your hash table class will allow the programmer to associate additional data with each entry, while the book’s implementation does not. More details about the requirements for your hash table class will be discussed later in this handout and in class. To process the dictionary, simply insert every word in the dictionary into the hash table. To spell check the document, locate every valid word in the document (keeping track of line numbers), and lookup (i.e., search for) each word in the hash table to see if it is recognized. You should assume that an average dictionary contains about 50,000 words, but that some might be as large as 1,000,000 words. This is my way of telling you that you should implement a rehash member function! A sample dictionary, a bit on the small side (approximately 25,000 words), will be posted on the course home page. 
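To make the required shape concrete, here is a sketch of a linear-probing table with rehashing, in Python rather than the required C++ (method names loosely follow the provided hash.h; the half-full rehash threshold and the 2n+1 growth rule are illustrative choices, and a real implementation should grow to a prime via something like getPrime, as the header's comments describe):

```python
# Shape of the required hash table: open addressing with LINEAR PROBING and a
# rehash that roughly doubles capacity once the table is half full.
# The real assignment implements this in C++ (hash.h / hash.cpp).

class HashTable:
    def __init__(self, size=11):
        self.capacity = size
        self.slots = [None] * size             # each slot: None or (key, pointer)
        self.filled = 0

    def _probe(self, key):
        """Return the index of key's slot, or of the first empty slot."""
        i = hash(key) % self.capacity
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.capacity        # linear probing: step to the next slot
        return i

    def insert(self, key, pv=None):
        if self.slots[self._probe(key)] is not None:
            return 1                           # key already exists in the table
        if 2 * (self.filled + 1) > self.capacity:
            self._rehash()                     # keep load factor at or below 1/2
        self.slots[self._probe(key)] = (key, pv)
        self.filled += 1
        return 0

    def contains(self, key):
        return self.slots[self._probe(key)] is not None

    def _rehash(self):
        old = self.slots
        self.capacity = 2 * self.capacity + 1  # a prime should be chosen in real code
        self.slots = [None] * self.capacity
        self.filled = 0
        for item in old:
            if item is not None:
                self.insert(*item)             # re-probe every entry in the new table
```

Loading a dictionary is then just one insert per word, and spell checking is one contains per document word; the rehash is what keeps both fast when the dictionary grows toward 1,000,000 words. (Lazy deletion for remove, needed in later assignments, is omitted here.)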
Your program should prompt the user for the name of the dictionary file, the name of the document file to be spell-checked, and the name of the file where output should be written. Your program should indicate how long, in seconds, it takes to read the dictionary and how long it takes to spell check the text file, measured in terms of CPU time. (These times should be displayed to standard output, not to the output file.) Your program must compile and run correctly using the g++ compiler on either Cygwin or Ubuntu. Your hash table implementation must include a header file called “hash.h” and a source code file called “hash.cpp”. The spell-checking code, and the rest of the main program, should be included in a separate file. You should also create a “Makefile” that I can use to compile your program. I will provide a sample “Makefile” that I used for my version of the program (also shown on the final page of this handout). I will also provide my version of “hash.h” (also shown later in this handout). You may reuse these two files directly if you wish. These files will also be posted on the course home page and discussed in class. A sample document, a sample run of the program using that document, and a sample output file appear on pages 3 and 4 of this handout. The document file used for this sample run, as well as the dictionary used, will be available from the course home page. Your output file should adhere exactly to the format shown, and all the messages should be worded exactly the same way, with the same spacing. I will use “diff” to compare your output to mine, and you will lose points for any differences. (Of course, I will test your programs on multiple test cases involving different documents and dictionaries with various sizes.) Note that the output displayed to standard output does not have to match my formatting, as long as the content is the same. Pages 5 and 6 of this handout show my “hash.h” file. 
Your hash table class must implement the same public member functions as mine. Note that the “getPointer”, “setPointer”, and “remove” member functions will not be used for this assignment; however, they will be used for future assignments! It is OK if you do not implement these member functions now. The specifics of this header file, and also a “Makefile” (shown on page 7 of this handout), will be discussed in more detail in class. Both files will also be made available to you from the course home page. When your assignment is complete, e-mail me ([email protected]) your program, including your source code files, your header file(s), and your “Makefile” (even if you used the provided files). In addition to correctness, your grade may also depend on the efficiency and elegance of your code and adherence to proper C++ style. Your program is due before midnight on the night of Wednesday, September 26. Below are the lyrics to “Supercalifragilisticexpialidocious” from “Mary Poppins”. This represents the contents of the document “lyrics.txt” used in the sample run shown on the next page. This file will also be posted on the course home page. Um-deedledeedledeedle um-deedledayUm-deedledeedledeedle um-deedledayUm-deedledeedledeedle um-deedledeedleUm-deedledeedledeedle um-um um-um um-um For example… SupercalifragilisticexpialidociousEven though the sound of it is something quite atrociousIf you say it loud enough you’ll always sound precociousSupercalifragilisticexpialidocious Um-deedledeedledeedle um-deedledayUm-deedledeedledeedle um-deedledayUm-deedledeedledeedle um-deedleday Super-superSupercaliSuper Supercalifragi So when the cat has got your tongue there’s no need for dismayJust summon up this word and then you’ve got a lot to sayBut better use it carefully or it can change your life For example… Yes? 
One day I said it to me girl and now me girl’s me wife

Supercalifragilisticexpialidocious
Even though the sound of it is something quite atrocious
If you say it loud enough you’ll always sound precocious
Supercalifragilisticexpialidocious

Supercalifragilisticexpialidocious
Even though the sound of it is something quite atrocious
If you say it loud enough you’ll always sound precocious
Supercalifragilisticexpialidocious

Supercalifragilisticexpialidocious
Even though the sound of it is something quite atrocious
If you say it loud enough you’ll always sound precocious
Supercalifragilistic
Supercalifragilistic
Supercalifragilisticexpialidocious

Below is a sample run using the sample dictionary provided on the course home page and a text file that contains the lyrics to "Supercalifragilisticexpialidocious" from "Mary Poppins".

Enter name of dictionary: DICT/wordlist_small
Total time (in seconds) to load dictionary: 0.031
Enter name of input file: FILES/lyrics.txt
Enter name of output file: out_lyrics_small.txt
Total time (in seconds) to check document: 0

The output file should look exactly like this:

Long word at line 1, starts: um-deedledeedledeedl
Unknown word at line 1: um-deedleday
Long word at line 2, starts: um-deedledeedledeedl
Unknown word at line 2: um-deedleday
Long word at line 3, starts: um-deedledeedledeedl
Unknown word at line 3: um-deedledeedle
Long word at line 4, starts: um-deedledeedledeedl
Unknown word at line 4: um-um
Unknown word at line 4: um-um
Unknown word at line 4: um-um
Long word at line 8, starts: supercalifragilistic
Long word at line 11, starts: supercalifragilistic
Long word at line 13, starts: um-deedledeedledeedl
Unknown word at line 13: um-deedleday
Long word at line 14, starts: um-deedledeedledeedl
Unknown word at line 14: um-deedleday
Long word at line 15, starts: um-deedledeedledeedl
Unknown word at line 15: um-deedleday
Unknown word at line 17: super-super
Unknown word at line 18: supercali
Unknown word at line 19: supercalifragi
Unknown word at line 21: has
Unknown word at line 21: there’s
Unknown word at line 21: dismay
Unknown word at line 23: better
Unknown word at line 23: carefully
Unknown word at line 27: yes
Unknown word at line 29: girl’s
Long word at line 31, starts: supercalifragilistic
Long word at line 34, starts: supercalifragilistic
Long word at line 36, starts: supercalifragilistic
Long word at line 39, starts: supercalifragilistic
Long word at line 41, starts: supercalifragilistic
Unknown word at line 44: supercalifragilistic
Unknown word at line 45: supercalifragilistic
Long word at line 46, starts: supercalifragilistic

Below and on the next page, I am providing you with the header file ("hash.h") for my hash table implementation. This file will also be posted on the course home page.

#ifndef _HASH_H
#define _HASH_H

#include <string>
#include <vector>

class hashTable {

 public:

  // The constructor initializes the hash table.
  // Uses getPrime to choose a prime number at least as large as
  // the specified size for the initial size of the hash table.
  hashTable(int size = 0);

  // Insert the specified key into the hash table.
  // If an optional pointer is provided,
  // associate that pointer with the key.
  // Returns 0 on success,
  // 1 if key already exists in hash table,
  // 2 if rehash fails.
  int insert(const std::string &key, void *pv = NULL);

  // Check if the specified key is in the hash table.
  // If so, return true; otherwise, return false.
  bool contains(const std::string &key);

  // Get the pointer associated with the specified key.
  // If the key does not exist in the hash table, return NULL.
  // If an optional pointer to a bool is provided,
  // set the bool to true if the key is in the hash table,
  // and set the bool to false otherwise.
  void *getPointer(const std::string &key, bool *b = NULL);

  // Set the pointer associated with the specified key.
  // Returns 0 on success,
  // 1 if the key does not exist in the hash table.
  int setPointer(const std::string &key, void *pv);

  // Delete the item with the specified key.
  // Returns true on success,
  // false if the specified key is not in the hash table.
  bool remove(const std::string &key);

 private:

  // Each item in the hash table contains:
  // key - a string used as a key.
  // isOccupied - if false, this entry is empty,
  //              and the other fields are meaningless.
  // isDeleted - if true, this item has been lazily deleted.
  // pv - a pointer related to the key;
  //      NULL if no pointer was provided to insert.
  class hashItem {
   public:
    std::string key;
    bool isOccupied;
    bool isDeleted;
    void *pv;
  };

  int capacity; // The current capacity of the hash table.
  int filled; // Number of occupied items in the table.

  std::vector<hashItem> data; // The actual entries are here.

  // The hash function.
  int hash(const std::string &key);

  // Search for an item with the specified key.
  // Return the position if found, -1 otherwise.
  int findPos(const std::string &key);

  // The rehash function; makes the hash table bigger.
  // Returns true on success, false if memory allocation fails.
  bool rehash();

  // Return a prime number at least as large as size.
  // Uses a precomputed sequence of selected prime numbers.
  static unsigned int getPrime(int size);
};

#endif //_HASH_H

This page shows the "Makefile" that I used for my program. This will also be posted on the course home page.

spell.exe: spellcheck.o hash.o
	g++ -o spell.exe spellcheck.o hash.o

spellcheck.o: spellcheck.cpp hash.h
	g++ -c spellcheck.cpp

hash.o: hash.cpp hash.h
	g++ -c hash.cpp

debug:
	g++ -g -o spellDebug.exe spellcheck.cpp hash.cpp

clean:
	rm -f *.exe *.o *.stackdump *~

backup:
	test -d backups || mkdir backups
	cp *.cpp backups
	cp *.h backups

You are going to create a class called "heap" that provides programmers with the functionality of a priority queue using a binary heap implementation. Each item inserted into the binary heap will specify a unique string id, an integer key, and optionally any pointer. The implementation of the class should use pointers to void in order to handle pointers to any type of data.
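A tiny illustration of the void-pointer technique described here (the names are ours, not part of the assignment's interface): the container stores an untyped pointer, and the caller casts it back to the type it knows it stored.

```cpp
#include <string>

// A container field of type void* can hold a pointer to anything.
// The container never inspects the pointed-to data; only the caller,
// who knows the real type, casts it back.
struct Slot { void *pv; };

std::string *storedName(Slot &s) {
    return static_cast<std::string *>(s.pv);  // caller supplies the type
}
```

This is exactly the pattern behind the pv parameters in hash.h and in the heap interface described below.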
When a heap is declared, a capacity will be passed to its constructor representing the maximum number of items that may be in the heap at one time; the heap will never be allowed to grow past its initial capacity (although it is not difficult to implement a resize operation).

I have written a program that uses my own implementation of the class. I will provide you with that program (useHeap.cpp) and with a Makefile. You should not change my source code file! You are allowed to add C++11 flags to the Makefile if you need to. Both files will be discussed in class and can be obtained from the course home page.

Your heap will also make use of the hash table class that you created for the previous programming assignment. This assignment asks you to fill in the missing heap.cpp and heap.h files, and to correct or add to your hash.cpp file if necessary, so that everything works. This implies that your heap class must include at least the following: a constructor that accepts an integer representing the capacity of the binary heap; a public member function, insert, used to insert a new item into the heap; a public member function, deleteMin, that removes the item with the lowest key from the heap; a public member function, setKey, providing both increaseKey and decreaseKey functionality; and a public member function, remove, that allows the programmer to delete an item with a specified id from the heap. In class we will discuss the parameters of these member functions and their return values.

In addition, your class should contain private data members and private member functions that allow you to elegantly and efficiently implement the required public member functions. I will discuss my own implementation in class, and it is described on the next page. In class, we will also look at a sample run of the program and discuss the provided code.
This program only passes string ids and integer keys to the insert member function of the heap class, but again, the insert member function should also optionally accept any pointer that can be stored and associated with the id. In the future, you will be using the class you write for this assignment in order to implement an algorithm involving graph data structures, and this functionality will be necessary. Also note that the integer keys will not necessarily be positive integers.

All operations should be implemented using average-case logarithmic time (or better) algorithms. In order to achieve setKey and remove in average-case logarithmic time, your program needs to be able to map an id to a node quickly. Since each id can be any arbitrary string, a hash table is useful for this purpose. Searching a heap to find an item with a particular id would require linear time, but a hash table in which each hash entry includes a pointer to the associated node in the heap allows you to find the item in constant average time. Apart from the calls to the hash table member functions, which are worst-case linear time but average-case constant time operations, all heap operations should use worst-case logarithmic time algorithms, and the insert operation should use an average-case constant time algorithm.

My heap class contains four private data members. Two are simple integers representing the capacity and the current size of the heap. The third is a vector of node objects containing the actual data of the heap; each node contains a string id, an integer key, and a pointer to void that can point to anything. (I have made "node" a private nested class within the heap class.) The fourth private data member is a pointer to a hash table (the actual hash table is allocated in the heap's constructor).
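The id-to-position bookkeeping described above can be sketched as follows. This is a simplified illustration, not the assignment's interface: std::unordered_map stands in for the course hashTable, and the capacity limit and void* payload are omitted. The key point is that every swap during percolation also updates the map, which is what keeps setKey (and remove) logarithmic.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class MiniHeap {
public:
    int insert(const std::string &id, int key) {
        if (pos.count(id)) return 2;              // duplicate id
        data.push_back({id, key});
        pos[id] = (int)data.size() - 1;
        percolateUp((int)data.size() - 1);
        return 0;
    }
    int deleteMin(std::string *pId = nullptr, int *pKey = nullptr) {
        if (data.empty()) return 1;
        if (pId) *pId = data[0].id;
        if (pKey) *pKey = data[0].key;
        pos.erase(data[0].id);                    // drop the mapping
        data[0] = data.back();
        data.pop_back();
        if (!data.empty()) { pos[data[0].id] = 0; percolateDown(0); }
        return 0;
    }
    int setKey(const std::string &id, int key) {
        auto it = pos.find(id);
        if (it == pos.end()) return 1;            // unknown id
        int i = it->second, old = data[i].key;
        data[i].key = key;
        if (key < old) percolateUp(i); else percolateDown(i);
        return 0;
    }

private:
    struct Node { std::string id; int key; };
    std::vector<Node> data;                       // implicit binary tree
    std::unordered_map<std::string, int> pos;     // id -> index in data

    void swapNodes(int a, int b) {
        std::swap(data[a], data[b]);
        pos[data[a].id] = a;                      // keep the map current
        pos[data[b].id] = b;
    }
    void percolateUp(int i) {
        while (i > 0 && data[i].key < data[(i - 1) / 2].key) {
            swapNodes(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void percolateDown(int i) {
        int n = (int)data.size();
        for (;;) {
            int c = 2 * i + 1;
            if (c >= n) return;
            if (c + 1 < n && data[c + 1].key < data[c].key) ++c;
            if (data[c].key >= data[i].key) return;
            swapNodes(i, c);
            i = c;
        }
    }
};
```

In the real assignment, the hash entry's void* would point at the heap node instead of holding an index in a standard-library map.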
Since the constructor is provided with the maximum size of the heap, you may allocate the hash table to be large enough such that there is a small likelihood of a rehash, but that is up to you. (Note that since items get removed from the heap, but only lazily deleted from the hash table, it is still possible that a rehash of the hash table will be necessary.)

Your heap.h file should contain the declaration of your class along with the declarations of its public and private data members and member functions. The heap.cpp file should contain the implementation of the class. I don't think you should need to implement any functions other than the member functions of the class itself in this file (I did not). As usual, you will be graded not only on the correctness of your program, but also on the appropriateness of the decisions that you make, the elegance (and perhaps the formatting) of your code, and on the appropriate use of C++ concepts and routines.

The following page shows the declarations of the constructor and the public member functions of my heap class along with my comments describing their functionality, parameters, and return values. I am not showing the declarations of my private data members or private member functions here, but this will be discussed further in class.

E-mail me ([email protected]) your program, including all source code files, header files, and your Makefile (including any provided files that you use without making changes). Your program must compile and run using either Ubuntu or Cygwin. The program is due before midnight on the night of Wednesday, October 24.

//
// heap - The constructor allocates space for the nodes of the heap
// and the mapping (hash table) based on the specified capacity
//
heap(int capacity);

//
// insert - Inserts a new node into the binary heap
//
// Inserts a node with the specified id string, key,
// and optionally a pointer. The key is used to
// determine the final position of the new node.
//
// Returns:
//   0 on success
//   1 if the heap is already filled to capacity
//   2 if a node with the given id already exists (but the heap
//     is not filled to capacity)
//
int insert(const std::string &id, int key, void *pv = NULL);

//
// setKey - set the key of the specified node to the specified value
//
// I have decided that the class should provide this member function
// instead of two separate increaseKey and decreaseKey functions.
//
// Returns:
//   0 on success
//   1 if a node with the given id does not exist
//
int setKey(const std::string &id, int key);

//
// deleteMin - return the data associated with the smallest key
//             and delete that node from the binary heap
//
// If pId is supplied (i.e., it is not NULL), write to that address
// the id of the node being deleted. If pKey is supplied, write to
// that address the key of the node being deleted. If ppData is
// supplied, write to that address the associated void pointer.
//
// Returns:
//   0 on success
//   1 if the heap is empty
//
int deleteMin(std::string *pId = NULL, int *pKey = NULL, void *ppData = NULL);

//
// remove - delete the node with the specified id from the binary heap
//
// If pKey is supplied, write to that address the key of the node
// being deleted. If ppData is supplied, write to that address the
// associated void pointer.
//
// Returns:
//   0 on success
//   1 if a node with the given id does not exist
//
int remove(const std::string &id, int *pKey = NULL, void *ppData = NULL);

You are going to implement Dijkstra's algorithm to solve the single-source shortest-path problem. The program will determine the shortest path in a specified graph from a specified starting vertex to each other vertex in the graph. In order to do this efficiently, your program should use the binary heap class that you created for the previous assignment. Your program should start by asking the user to enter the name of a file specifying the graph.
Every row in the input file represents an edge in the graph. Each row consists of two string ids representing the source vertex and destination vertex of the edge (in that order), followed by an integer representing the cost (a.k.a. distance or weight) of the edge. The rows will contain no leading or trailing whitespace, single spaces will separate fields, and all rows will end with a single Unix-style newline character. All vertex ids will consist only of lowercase and capital letters and digits. All edge costs will be positive integers less than one million. A vertex exists if it is the source or the destination of any edge. The source vertex of an edge will never be the same as the destination vertex, but it is possible that multiple edges might connect the same vertices. Your program may assume that the file, if it can be opened, is valid. You are not required to include error checks for invalid file formats; you may if you wish, but I will not check for this.

Once the program is finished reading in the graph, the user should be prompted to enter the id of a starting vertex. The user should be re-prompted until they enter a valid id (i.e., a string id representing a vertex that exists in the graph). The program should then apply Dijkstra's algorithm to determine the shortest path to each node from the specified starting vertex. The implementation should rely on the binary heap class that you created for the previous assignment. (The heap class, of course, relies on the hash class you created for the first assignment, and you will also likely rely on the hash class for a couple of other purposes as well.)

When the algorithm has finished determining the shortest path to each node, your program should output the CPU time, in seconds, that was spent executing the algorithm. The program should then ask the user for the name of an output file.
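Since the format guarantees single-space separators and no stray whitespace, stream extraction is enough to split an edge row. A minimal sketch (the Edge struct and function name are our own, not mandated by the handout):

```cpp
#include <sstream>
#include <string>

// One edge row of the form "source dest cost".
struct Edge { std::string src, dst; int cost; };

// Returns false if the row does not contain two ids and an integer.
bool parseEdge(const std::string &line, Edge &e) {
    std::istringstream in(line);
    return static_cast<bool>(in >> e.src >> e.dst >> e.cost);
}
```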
The output file should contain one row for every vertex that exists in the graph, with vertices listed in the same order in which they first appear in the input file. Each row in the output file should contain a vertex id followed by a colon, a single space, and then the shortest distance from the specified starting vertex to the given vertex. All of these distances are guaranteed to be less than one billion. After the distance, the row should contain one space, a left bracket, the path from the starting vertex to the current vertex, a right bracket, and finally a single Unix-style newline character. Vertices in the path should be separated by a comma followed by a single space. There should not be any space or comma before the first vertex in the path (the specified starting vertex) or after the last vertex in the path.

If there is no path from the specified starting vertex to some existing vertex in the graph, the corresponding output row should contain the vertex id followed by a colon, a single space, and then the text "NO PATH" followed by a single Unix-style newline character. You must follow these instructions exactly.

In class, we stepped through Dijkstra's algorithm for a graph that came from Figure 9.20 in the textbook. The file representing this graph might look like this:

v1 v2 2
v1 v4 1
v2 v4 3
v2 v5 10
v3 v1 4
v3 v6 5
v4 v3 2
v4 v5 2
v4 v6 8
v4 v7 4
v5 v7 6
v7 v6 1

Any permutation of the rows representing edges in this file would designate the same graph (but the order of rows in the output file might be different). Assume a file called graph.txt exists, containing the data shown above.
Then a sample run of your program might look like this:

Enter name of graph file: graph.txt
Enter a valid vertex id for the starting vertex: v1
Total time (in seconds) to apply Dijkstra's algorithm: 0.000
Enter name of output file: out.txt

The prompts to the user may vary, but the file out.txt should look exactly like this:

v1: 0 [v1]
v2: 2 [v1, v2]
v4: 1 [v1, v4]
v5: 3 [v1, v4, v5]
v3: 3 [v1, v4, v3]
v6: 6 [v1, v4, v7, v6]
v7: 5 [v1, v4, v7]

If the user specifies the same graph file but enters v5 as the id of the starting vertex, then the output file should look exactly like this:

v1: NO PATH
v2: NO PATH
v4: NO PATH
v5: 0 [v5]
v3: NO PATH
v6: 7 [v5, v7, v6]
v7: 6 [v5, v7]

As already stated, you should rely on your heap implementation (which in turn relies on your hash table implementation) to implement Dijkstra's algorithm efficiently. If you have implemented these classes correctly and completely, you should not need to modify any of your heap or hash files for this assignment.

You should also create a graph class that is designed with Dijkstra's algorithm in mind, so the implementation of the algorithm can be handled by a member function of this class. I suggest including a private nested class to store nodes in the graph. The graph can also contain a linked list of pointers to nodes. Whenever a new node is encountered, you can allocate memory for the node and add a pointer to the new node to the end of the linked list. (Alternatively, you may decide to use a linked list of nodes directly, instead of a linked list of pointers. Either way, you may use the provided C++ list class for this purpose.) One field of each node must store an adjacency list for the node. This can also use the provided linked list class. Each node in an adjacency list represents an edge, and each edge must at least specify the destination vertex and the cost of the edge. (You do not necessarily have to specify the source node in each edge, since it is the same for every node in an adjacency list.)
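Once the graph is in memory, the core of the algorithm is repeated delete-min plus edge relaxation. A minimal sketch, with std::priority_queue standing in for the course heap class and integer vertex indices standing in for the string ids (both simplifications for brevity; the real assignment uses deleteMin/setKey and tracks predecessors to print the paths):

```cpp
#include <climits>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// adj[v] lists (destination, cost) pairs. Returns shortest distances
// from start, with -1 standing in for "NO PATH".
std::vector<long long> dijkstra(
        const std::vector<std::vector<std::pair<int, int>>> &adj, int start) {
    const long long INF = LLONG_MAX;
    std::vector<long long> dist(adj.size(), INF);
    // (distance, vertex) pairs; std::greater<> makes this a min-queue
    std::priority_queue<std::pair<long long, int>,
                        std::vector<std::pair<long long, int>>,
                        std::greater<>> pq;
    dist[start] = 0;
    pq.push({0, start});
    while (!pq.empty()) {
        auto [d, v] = pq.top();
        pq.pop();
        if (d > dist[v]) continue;            // stale queue entry; skip
        for (auto [to, cost] : adj[v])
            if (d + cost < dist[to]) {        // relax the edge
                dist[to] = d + cost;
                pq.push({dist[to], to});
            }
    }
    for (auto &d : dist)
        if (d == INF) d = -1;                 // unreachable vertex
    return dist;
}
```

With the course heap class, the stale-entry check is replaced by setKey on vertices already in the heap, and the void* payload of each heap node points directly at the graph node, as discussed below.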
As you are reading the edges of the graph from the input file, you will need some way to efficiently determine whether or not you have encountered each vertex id already. If not, you need to create a new vertex node. If the source vertex of the edge has been previously encountered, you have to locate the corresponding node efficiently in order to update its adjacency list. I suggest using your own hash table implementation for these tasks. Whenever a new vertex is encountered, add an entry with the new vertex id to the hash table, and use the void pointer to point to the new node. To locate a node corresponding to a source vertex, use the getPointer member function of the hash table class. I also suggest using your hash table class to determine whether or not a starting vertex entered by the user is valid.

When you are implementing Dijkstra's algorithm, I suggest using the void pointer of each heap node to point to the node corresponding to each vertex. You can then use the optional parameter of deleteMin to obtain this pointer and access the node immediately. Although you could also locate the node by just obtaining the vertex id from deleteMin and then using your hash table to obtain the pointer to the node, this is a bit less efficient, and I consider it less elegant, so I may take off a few points for this solution.

After you have completed the assignment, e-mail me ([email protected]) all of your code, including a Makefile. I should be able to run "make" and then test your executable. The program is due before midnight on the night of Monday, November 19.

This problem came from the 1998 regional ACM Programming Contest. As I described in class, it was the only question my Columbia team did not complete in the time limit. We had a solution which would have worked given unlimited time, but we did not realize it was an exponential-time solution for contrived input.
I am letting you know that the solution (probably) requires dynamic programming to be implemented efficiently. If you want to see how the problem was stated at the competition, check out this link: https://www.acmgnyr.org/year1998/prob_g.html

The problem defines a "merge" of two strings as a third string containing all the characters from each of the original two strings mixed together. The two sets of characters can be interspersed, but the characters from each individual string cannot be permuted. For example, one possible merge of "hello" and "world" would be "wohrellold". However, the string "wohrelldol" is not a valid merge. Although this string contains all the correct characters, and "hello" and "world" are both subsequences, there is no way to select two subsequences with distinct characters to form both of the original two strings.

You are asked to write a program that accepts three strings at a time; we'll call them A, B, and C. All strings will consist of only lowercase letters. You can assume that A and B will contain at most 1000 letters, and C will contain at most 2000 letters. Your program should determine whether or not C is a valid merge of A and B. If so, the program should output C with the characters from A converted to uppercase. If more than one merge is possible, the letters of A should be made to occur as early as possible. If no merge is possible, the output should read "*** NOT A MERGE ***".

For this assignment, your program should prompt the user for the names of an input file and an output file. The input file will consist of multiple sets of three strings, one string per line (i.e., the number of rows in the file will be a multiple of three). Your program should read three strings at a time, and it should determine whether or not the third string is a merge of the first two. The output for each set of strings should be written to the output file as specified in the previous paragraph.
Your program should continue to process sets of three strings until it reaches the end of the input file. Every line in the input file will be, and every line in the output file should be, followed by a single Unix-style newline character.

My major hint to you is that dynamic programming is (probably) necessary to write this program in a way such that it will run correctly for certain inputs in reasonable time. Simple algorithms will either get some cases wrong or require exponential time to run. Either top-down dynamic programming or bottom-up dynamic programming is appropriate (although you might run into stack-size problems with top-down dynamic programming on some systems). Note that you should be able to declare a global matrix (i.e., a two-dimensional array) that is big enough to handle all instances of this problem. Do not try to make the matrix a local variable; you might overflow the stack. There is no need to allocate the matrix dynamically.

A sample run of the program might look like this:

Enter name of input file: input.txt
Enter name of output file: output.txt

If the input file looks like this:

chocolate
chips
cchocholaiptes
chocolate
chips
bananasplit
abac
bad
ababacd
hello
world
wohrelldol
ab
ba
abab
zzzzzzzzzzzzzzzzzzzzab
zzzzzzzzzzzzzzzzzzzzac
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzacab
zzzzzzzzzzzzzzzzzzzzabc
zzzzzzzzzzzzzzzzzzzzacb
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzabcbac

Then the output file should look exactly like this:

CcHOChOLAipTEs
*** NOT A MERGE ***
ABAbaCd
*** NOT A MERGE ***
AbaB
ZZZZZZZZZZZZZZZZZZZZzzzzzzzzzzzzzzzzzzzzacAB
*** NOT A MERGE ***

The first three examples (i.e., the first nine rows of the input and the first three rows of the output) were the examples provided at the actual competition. I contrived the other four examples to show cases that make the problem more difficult. Of course, I will test your programs with several additional difficult (and in some cases, much longer) test cases. Submit your program to me via e-mail ([email protected]).
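One possible DP formulation (a sketch, not the only correct approach, and not necessarily the handout's intended one): let ok[i][j] be true when the suffix C[i+j..] can be built by merging A[i..] and B[j..]. After filling the table bottom-up, a greedy front-to-back walk that takes a letter from A whenever the remaining suffix stays feasible makes A's letters occur as early as possible. This sketch uses a heap-allocated table; for the full 1000-by-1000 limits, the handout's advice to use a global array applies.

```cpp
#include <cctype>
#include <string>
#include <vector>

std::string mergeStrings(const std::string &A, const std::string &B,
                         const std::string &C) {
    size_t la = A.size(), lb = B.size();
    if (C.size() != la + lb) return "*** NOT A MERGE ***";

    // ok[i][j]: C[i+j..] is a merge of A[i..] and B[j..].
    std::vector<std::vector<char>> ok(la + 1, std::vector<char>(lb + 1, 0));
    ok[la][lb] = 1;
    for (size_t i = la + 1; i-- > 0;)
        for (size_t j = lb + 1; j-- > 0;) {
            if (i < la && A[i] == C[i + j] && ok[i + 1][j]) ok[i][j] = 1;
            if (j < lb && B[j] == C[i + j] && ok[i][j + 1]) ok[i][j] = 1;
        }
    if (!ok[0][0]) return "*** NOT A MERGE ***";

    // Greedy reconstruction: prefer A whenever the suffix stays feasible,
    // which places A's letters as early as possible.
    std::string out;
    size_t i = 0, j = 0;
    while (i + j < C.size()) {
        if (i < la && A[i] == C[i + j] && ok[i + 1][j]) {
            out += (char)std::toupper((unsigned char)A[i]);  // from A
            ++i;
        } else {
            out += B[j];                                     // from B
            ++j;
        }
    }
    return out;
}
```

Run against the seven sample sets above, this reproduces the expected output file line by line.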
I encourage you to send early presubmissions; I will reply with responses similar to those given at the contest (it may take a day or longer). The program is due before midnight on the night of Wednesday, December 5.


[SOLVED] Ece 232e project 1 to 5 solutions

One can use the igraph library¹ to generate different networks and measure various properties of a given network. The library has R and Python implementations. You may choose either language that you prefer. However, for this project, using R is strongly recommended, as some functions might not be implemented for the Python version of the package.

Submission: Upload a zip file containing your report and codes to CCLE. One submission from any member of a group is sufficient.

¹ https://igraph.sourceforge.net/

1 Generating Random Networks

1. Create random networks using the Erdös-Rényi (ER) model

(a) Create undirected random networks with n = 1000 nodes, and the probability p for drawing an edge between two arbitrary vertices equal to 0.003, 0.004, 0.01, 0.05, and 0.1. Plot the degree distributions. What distribution is observed? Explain why. Also, report the mean and variance of the degree distributions and compare them to the theoretical values.

(b) For each p and n = 1000, answer the following questions: Are all random realizations of the ER network connected? Numerically estimate the probability that a generated network is connected. For one instance of the networks with that p, find the giant connected component (GCC) if the network is not connected. What is the diameter of the GCC?

(c) It turns out that the normalized GCC size (i.e., the size of the GCC as a fraction of the total network size) is a highly nonlinear function of p, with interesting properties occurring for values where p = O(ln n / n). For n = 1000, sweep over values of p in this region and create 100 random networks for each p. Then scatter plot the normalized GCC sizes vs. p. Empirically estimate the value of p where a giant connected component starts to emerge (define your criterion of "emergence"). Does it match the theoretical values mentioned or derived in lectures?

(d) i. Define the average degree of nodes c = n × p = 0.5. Sweep over the number of nodes, n, ranging from 100 to 10000. Plot the expected size of the GCC of ER networks with n nodes and edge-formation probability p = c/n, as a function of n. What trend is observed?
ii. Repeat the same for c = 1.
iii. Repeat the same for values of c = 1.1, 1.2, 1.3, and show the results for these three values in a single plot.

2. Create networks using the preferential attachment model

(a) Create an undirected network with n = 1000 nodes, with the preferential attachment model, where each new node attaches to m = 1 old nodes. Is such a network always connected?
(b) Use the fast greedy method to find the community structure. Measure modularity.
(c) Try to generate a larger network with 10000 nodes using the same model. Compute modularity. How does it compare to the smaller network's modularity?
(d) Plot the degree distribution on a log-log scale for both n = 1000 and n = 10000, then estimate the slope of the plot.
(e) You can randomly pick a node i, and then randomly pick a neighbor j of that node. Plot the degree distribution of the nodes j that are picked with this process, on a log-log scale. How does this differ from the node degree distribution?
(f) Estimate the expected degree of a node that is added at time step i for 1 ≤ i ≤ 1000. Show the relationship between the age of nodes and their expected degree through an appropriate plot.
(g) Repeat the previous parts for m = 2 and m = 5. Why was the modularity for m = 1 high?
(h) Again, generate a preferential attachment network with n = 1000, m = 1. Take its degree sequence and create a new network with the same degree sequence, through a stub-matching procedure. Plot both networks, mark communities on their plots, and measure their modularity. Compare the two procedures for creating random power-law networks.

3. Create a modified preferential attachment model that penalizes the age of a node

(a) Each time a new vertex is added, it creates m links to old vertices, and the probability that an old vertex is cited depends on its degree (preferential attachment) and age.
In particular, the probability that a newly added vertex connects to an old vertex i is proportional to:

P[i] ∼ (c · ki^α + a)(d · li^β + b),

where ki is the degree of vertex i in the current time step, and li is the age of vertex i. Produce such an undirected network with 1000 nodes and parameters m = 1, α = 1, β = −1, and a = c = d = 1, b = 0. Plot the degree distribution. What is the power-law exponent?

(b) Use the fast greedy method to find the community structure. What is the modularity?

2 Random Walk on Networks

1. Random walk on Erdös-Rényi networks

(a) Create an undirected random network with 1000 nodes, and the probability p for drawing an edge between any pair of nodes equal to 0.01.
(b) Let a random walker start from a randomly selected node (no teleportation). We use t to denote the number of steps that the walker has taken. Measure the average distance (defined as the shortest path length) ⟨s(t)⟩ of the walker from his starting point at step t. Also, measure the variance σ²(t) = ⟨(s(t) − ⟨s(t)⟩)²⟩ of this distance. Plot ⟨s(t)⟩ vs. t and σ²(t) vs. t. Here, the average ⟨·⟩ is over random choices of the starting nodes.
(c) Measure the degree distribution of the nodes reached at the end of the random walk. How does it compare to the degree distribution of the graph?
(d) Repeat (b) for undirected random networks with 100 and 10000 nodes. Compare the results and explain qualitatively. Does the diameter of the network play a role?

2. Random walk on networks with fat-tailed degree distribution

(a) Generate an undirected preferential attachment network with 1000 nodes, where each new node attaches to m = 1 old nodes.
(b) Let a random walker start from a randomly selected node. Measure and plot ⟨s(t)⟩ vs. t and σ²(t) vs. t.
(c) Measure the degree distribution of the nodes reached at the end of the random walk on this network. How does it compare with the degree distribution of the graph?
(d) Repeat (b) for preferential attachment networks with 100 and 10000 nodes, and m = 1. Compare the results and explain qualitatively. Does the diameter of the network play a role?

3. PageRank

The PageRank algorithm, as used by the Google search engine, exploits the linkage structure of the web to compute global "importance" scores that can be used to influence the ranking of search results. Here, we use random walks to simulate PageRank.

(a) Create a directed random network with 1000 nodes, using the preferential attachment model, where m = 4. Note that in this directed model, the out-degree of every node is m, while the in-degrees follow a power-law distribution. Measure the probability that the walker visits each node. Is this probability related to the degree of the nodes?
(b) In all previous questions, we didn't have any teleportation. Now, we use a teleportation probability of α = 0.15. By performing random walks on the network created in 3(a), measure the probability that the walker visits each node. Is this probability related to the degree of the node?

4. Personalized PageRank

While the use of PageRank has proven very effective, the web's rapid growth in size and diversity drives an increasing demand for greater flexibility in ranking. Ideally, each user should be able to define their own notion of importance for each individual query.

(a) Suppose you have your own notion of importance. Your interest in a node is proportional to the node's PageRank, because you totally rely upon Google to decide which website to visit (assume that these nodes represent websites). Again, use a random walk on the network generated in part 3 to simulate this personalized PageRank. Here the teleportation probability to each node is proportional to its PageRank (as opposed to the regular PageRank, where at teleportation, the chance of visiting each node is the same and equal to 1/N). Again, let the teleportation probability be equal to α = 0.15. Compare the results with 3(b).
(b) Find two nodes in the network with median PageRanks. Repeat part (a) if teleportations land only on those two nodes (with probabilities 1/2, 1/2). How are the PageRank values affected?
(c) More or less, this is what happens in the real world, in that a user browsing the web only teleports to a set of trusted web pages. However, this goes against the assumption of normal PageRank, where we assume that people's interest in all nodes is the same. Can you take into account the effect of this self-reinforcement and adjust the PageRank equation?

Final Remarks
The following functions from the igraph library are useful for this project:
• degree, degree.distribution, diameter, vcount, ecount
• random.graph.game, barabasi.game, aging.prefatt.game, degree.sequence.game
• page_rank
For part 2 of the project, you can start off with the Jupyter notebook provided to you.

In this project, we will study the various properties of social networks. In the first part of the project, we will study an undirected social network (Facebook). In the second part of the project, we will study a directed social network (Google+).

1 Facebook network
In this project, we will be using the dataset given below:
https://snap.stanford.edu/data/egonets-Facebook.html
The Facebook network can be created from the edgelist file (facebook_combined.txt).

1.1 Structural properties of the Facebook network
Having created the Facebook network, we will study some of the structural properties of the network. To be specific, we will study
• Connectivity
• Degree distribution
Question 1: Is the Facebook network connected? If not, find the giant connected component (GCC) of the network and report the size of the GCC.
Question 2: Find the diameter of the network. If the network is not connected, then find the diameter of the GCC.
Question 3: Plot the degree distribution of the Facebook network and report the average degree.
Question 4: Plot the degree distribution of question 3 in a log-log scale.
Try to fit a line to the plot and estimate the slope of the line.

1.2 Personalized network
A personalized network of a user vi is defined as the subgraph induced by vi and its neighbors. In this part, we will study some of the structural properties of the personalized network of the user whose graph node ID is 1 (node ID in the edgelist is 0). From this point onwards, whenever we refer to a node ID we mean the graph node ID, which is 1 + the node ID in the edgelist.
Question 5: Create a personalized network of the user whose ID is 1. How many nodes and edges does this personalized network have?
Question 6: What is the diameter of the personalized network? Please state a trivial upper and lower bound for the diameter of the personalized network.
Question 7: In the context of the personalized network, what does it mean for the diameter of the personalized network to be equal to the upper bound you derived in question 6? What does it mean for the diameter of the personalized network to be equal to the lower bound you derived in question 6?

1.3 Core node's personalized network
A core node is defined as a node that has more than 200 neighbors. For visualization purposes, we have displayed the personalized network of a core node below.
An example of a personal network. The core node is shown in black.
In this part, we will study various properties of the personalized networks of the core nodes.
Question 8: How many core nodes are there in the Facebook network? What is the average degree of the core nodes?

1.3.1 Community structure of core node's personalized network
In this part, we study the community structure of the core node's personalized network.
To be specific, we will study the community structure of the personalized networks of the following core nodes:
• Node ID 1
• Node ID 108
• Node ID 349
• Node ID 484
• Node ID 1087
Question 9: For each of the above core nodes' personalized networks, find the community structure using the Fast-Greedy, Edge-Betweenness, and Infomap community detection algorithms. Compare the modularity scores of the algorithms. For visualization purposes, display the community structure of the core nodes' personalized networks using colors. Nodes belonging to the same community should have the same color and nodes belonging to different communities should have different colors. In this question, you should have 15 plots in total.

1.3.2 Community structure with the core node removed
In this part, we will explore the effect on the community structure of a core node's personalized network when the core node itself is removed from the personalized network.
Question 10: For each of the core nodes' personalized networks (use the same core nodes as question 9), remove the core node from the personalized network and find the community structure of the modified personalized network. Use the same community detection algorithms as question 9. Compare the modularity score of the community structure of the modified personalized network with the modularity score of the community structure of the personalized network of question 9. For visualization purposes, display the community structure of the modified personalized network using colors. In this question, you should have 15 plots in total.

1.3.3 Characteristics of nodes in the personalized network
In this part, we will explore characteristics of nodes in the personalized network using two measures. These two measures are stated and defined below:
• Embeddedness of a node is defined as the number of mutual friends a node shares with the core node.
• Dispersion of a node is defined as the sum of distances between every pair of the mutual friends the node shares with the core node. The distances should be calculated in a modified graph where the node (whose dispersion is being computed) and the core node are removed.
For further details on the above characteristics, you can read the paper below:
https://arxiv.org/abs/1310.6753
Question 11: Write an expression relating the embeddedness of a node to its degree.
Question 12: For each of the core nodes' personalized networks (use the same core nodes as question 9), plot the distribution of embeddedness and dispersion. In this question, you will have 10 plots.
Question 13: For each of the core nodes' personalized networks, plot the community structure of the personalized network using colors and highlight the node with maximum dispersion. Also, highlight the edges incident to this node. To detect the community structure, use the Fast-Greedy algorithm. In this question, you will have 5 plots.
Question 14: Repeat question 13, but now highlight the node with maximum embeddedness and the node with maximum dispersion/embeddedness. Also, highlight the edges incident to these nodes.
Question 15: Use the plots from questions 13 and 14 to explain the characteristics of a node revealed by each of these measures.

1.4 Friend recommendation in personalized networks
In many social networks, it is desirable to predict future links between pairs of nodes in the network. In the context of this Facebook network, it is equivalent to recommending friends to users. In this part of the project, we will explore some neighborhood-based measures for friend recommendation. The network that we will be using for this part is the personalized network of the node with ID 415.

1.4.1 Neighborhood-based measures
In this project, we will be exploring three different neighborhood-based measures.
Before we define these measures, let's introduce some notation:
• Si is the neighbor set of node i in the network
• Sj is the neighbor set of node j in the network
Then, with the above notation, we define the three measures below:
• Common neighbors measure between node i and node j is defined as CommonNeighbors(i, j) = |Si ∩ Sj|
• Jaccard measure between node i and node j is defined as Jaccard(i, j) = |Si ∩ Sj| / |Si ∪ Sj|
• Adamic-Adar measure between node i and node j is defined as AdamicAdar(i, j) = Σ_{k ∈ Si ∩ Sj} 1 / log(|Sk|)

1.4.2 Friend recommendation using neighborhood-based measures
We can use the neighborhood-based measures defined in the previous section to recommend new friends to users in the network. Suppose we want to recommend t new friends to some user i in the network using the Jaccard measure. We follow the steps listed below:
1. For each node in the network that is not a neighbor of i, compute the Jaccard measure between node i and that node: compute Jaccard(i, j) ∀j ∈ Si^C
2. Then pick the t nodes that have the highest Jaccard measure with node i and recommend these nodes as friends to node i

1.4.3 Creating the list of users
Having defined the friend recommendation procedure, we can now apply it to the personalized network of node ID 415. Before we apply the algorithm, we need to create the list of users who we want to recommend new friends to. We create this list by picking all nodes with degree 24. We will denote this list as Nr.
Question 16: What is |Nr|?

1.4.4 Average accuracy of friend recommendation algorithm
In this part, we will apply the 3 different types of friend recommendation algorithms to recommend friends to the users in the list Nr. We will define an average accuracy measure to compare the performances of the friend recommendation algorithms. Suppose we want to compute the average accuracy of a friend recommendation algorithm. This task is completed in two steps:
1. Compute the average accuracy for each user in the list Nr
2. Compute the average accuracy of the algorithm by averaging across the accuracies of the users in the list Nr
Let's describe the procedure for accomplishing step 1 of the task. Suppose we want to compute the average accuracy for user i in the list Nr. We can compute it by iterating over the following steps 10 times and then taking the average:
1. Remove each edge of node i at random with probability 0.25. In this context, it is equivalent to deleting some friends of node i. Let's denote the list of friends deleted as Ri
2. Use one of the three neighborhood-based measures to recommend |Ri| new friends to user i. Let's denote the list of friends recommended as Pi
3. The accuracy for user i for this iteration is given by |Pi ∩ Ri| / |Ri|
Iterating over the above steps 10 times and then taking the average gives us the average accuracy of user i. In this manner, we compute the average accuracy for each user in the list Nr. Once we have computed them, we can take the mean of the average accuracies of the users in the list Nr. The mean value will be the average accuracy of the friend recommendation algorithm.
Question 17: Compute the average accuracy of the friend recommendation algorithm that uses:
• Common Neighbors measure
• Jaccard measure
• Adamic-Adar measure
Based on the average accuracy values, which friend recommendation algorithm is the best?

2 Google+ network
In this part, we will explore the structure of the Google+ network. The dataset for creating the network can be found in the link below:
https://snap.stanford.edu/data/egonets-Gplus.html
Create directed personal networks for users who have more than 2 circles. The data required to create such personal networks can be found in the file named gplus.tar.gz.
Question 18: How many personal networks are there?
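Stepping back to the Facebook part for a moment: the three neighborhood-based measures of section 1.4.1 and the recommendation step of section 1.4.2 can be sketched in pure Python on adjacency sets. This is only an illustration on a tiny hypothetical graph; in the project itself the sets Si would come from the personalized network of node 415.

```python
import math

def common_neighbors(S, i, j):
    # |Si ∩ Sj|
    return len(S[i] & S[j])

def jaccard(S, i, j):
    # |Si ∩ Sj| / |Si ∪ Sj|
    union = S[i] | S[j]
    return len(S[i] & S[j]) / len(union) if union else 0.0

def adamic_adar(S, i, j):
    # sum over common neighbors k of 1 / log(|Sk|); skip degree-1 nodes (log 1 = 0)
    return sum(1.0 / math.log(len(S[k])) for k in S[i] & S[j] if len(S[k]) > 1)

def recommend(S, i, t, measure):
    # score every non-neighbor j of i and pick the t highest-scoring nodes
    candidates = [j for j in S if j != i and j not in S[i]]
    candidates.sort(key=lambda j: measure(S, i, j), reverse=True)
    return candidates[:t]

# Toy adjacency sets (hypothetical graph, not the Facebook data):
S = {
    1: {2, 3, 4},
    2: {1, 3, 5},
    3: {1, 2, 4, 5},
    4: {1, 3},
    5: {2, 3},
}
print(recommend(S, 4, 1, jaccard))  # node 4's best candidate under Jaccard
```

Swapping `jaccard` for `common_neighbors` or `adamic_adar` in the `recommend` call gives the other two recommenders used in Question 17.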
Question 19: For the 3 personal networks (node IDs given below), plot the in-degree and out-degree distributions of these personal networks. Do the personal networks have similar in- and out-degree distributions? In this question, you should have 6 plots.
• 109327480479767108490
• 115625564993990145546
• 101373961279443806744

2.1 Community structure of personal networks
In this part of the project, we will explore the community structure of the personal networks that we created and explore the connections between communities and user circles.
Question 20: For the 3 personal networks picked in question 19, extract the community structure of each personal network using the Walktrap community detection algorithm. Report the modularity scores and plot the communities using colors. Are the modularity scores similar? In this question, you should have 3 plots.
Having found the communities, now we will explore the relationship between circles and communities. In order to explore the relationship, we define two measures:
• Homogeneity
• Completeness
Before we state the expressions for homogeneity and completeness, let's introduce some notation:
• C is the set of circles, C = {C1, C2, C3, · · · }
• K is the set of communities, K = {K1, K2, K3, · · · }
• ai is the number of people in circle Ci
• bi is the number of people in community Ki with circle information
• N is the total number of people with circle information
• Cji is the number of people belonging to community j and circle i
Then, with the above notation, we have the following expressions for the entropy
H(C) = − Σ_{i=1}^{|C|} (ai/N) log(ai/N)   (1)
H(K) = − Σ_{i=1}^{|K|} (bi/N) log(bi/N)   (2)
and conditional entropy
H(C|K) = − Σ_{j=1}^{|K|} Σ_{i=1}^{|C|} (Cji/N) log(Cji/bj)   (3)
H(K|C) = − Σ_{i=1}^{|C|} Σ_{j=1}^{|K|} (Cji/N) log(Cji/ai)   (4)
Now we can state the expression for homogeneity, h, as
h = 1 − H(C|K)/H(C)   (5)
and the expression for completeness, c, as
c = 1 − H(K|C)/H(K)   (6)
Question 21: Based on the expressions for h and c, explain the meaning of homogeneity and completeness in words.
Question 22: Compute the h and c values for the community structures of the 3 personal networks (same nodes as question 19). Interpret the values and provide a detailed explanation.

3 Submission
Please submit a zip file containing your codes and report to CCLE. The zip file should be named "Project2_UID1_…_UIDn.zip", where UIDx are the student ID numbers of the team members. If you have any questions, you can post them on Piazza.

1 Introduction
Reinforcement Learning (RL) is the task of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to those actions by presenting rewards and new states. In the first part of the project, we will learn the optimal policy of an agent navigating a 2-D environment. We will implement the Value Iteration algorithm to learn the optimal policy. Inverse Reinforcement Learning (IRL) is the task of extracting an expert's reward function by observing the optimal policy of the expert. In the second part of the project, we will explore the application of IRL in the context of apprenticeship learning.

2 Reinforcement learning (RL)
The two main objects in Reinforcement Learning are:
• Agent
• Environment
In this project, we will learn the optimal policy of a single agent navigating a 2-D environment.

2.1 Environment
In this project, we assume that the environment of the agent is modeled by a Markov Decision Process (MDP). In an MDP, agents occupy a state of the environment and perform actions to change the state they are in. After taking an action, they are given some representation of the new state and some reward value associated with the new state.
An MDP formally is a tuple (S, A, P^a_{ss'}, R^a_{ss'}, γ) where:
• S is a set of states
• A is a set of actions
• P^a_{ss'} is a set of transition probabilities, where P^a_{ss'} is the probability of transitioning from state s ∈ S to state s' ∈ S after taking action a ∈ A
– P^a_{ss'} = P(st+1 = s' | st = s, at = a)
• Given any current state and action, s and a, together with any next state, s', the expected value of the next reward is R^a_{ss'}
– R^a_{ss'} = E(rt+1 | st = s, at = a, st+1 = s')
• γ ∈ [0, 1) is the discount factor, and it is used to compute the present value of future reward
– If γ is close to 1, then the future rewards are discounted less
– If γ is close to 0, then the future rewards are discounted more
In the next few subsections, we will discuss the parameters that will be used to generate the environment for the project.

2.1.1 State space
In this project, we consider the state space to be a 2-D square grid with 100 states. The 2-D square grid along with the numbering of the states is shown in figure 1.
Figure 1: 2-D square grid with state numbering

2.1.2 Action set
In this project, we consider the action set (A) to contain the 4 following actions:
• Move Right
• Move Left
• Move Up
• Move Down
The 4 types of actions are displayed in figure 2.
Figure 2: 4 types of action
From the above figure, we can see that the agent can take 4 actions from the state marked with a dot.

2.1.3 Transition probabilities
In this project, we define the transition probabilities in the following manner:
1. If states s' and s are not neighboring states in the 2-D grid, then P(st+1 = s' | st = s, at = a) = 0. States s' and s are neighbors in the 2-D grid if you can move to s' from s by taking an action a from the action set A. We will consider a state s to be a neighbor of itself. For example, from figure 1 we can observe that states 1 and 11 are neighbors (we can transition from 1 to 11 by moving right) but states 1 and 12 are not neighbors.
2.
Each action corresponds to a movement in the intended direction with probability 1 − w, but has a probability w of moving in a random direction instead, due to wind. To illustrate this, let's consider the states shown in figure 3.
Figure 3: Inner grid states (non-boundary states)
The transition probabilities for the non-boundary states shown in figure 3 are given below:
P(st+1 = 43 | st = 44, at = ↑) = 1 − w + w/4
P(st+1 = 34 | st = 44, at = ↑) = w/4
P(st+1 = 54 | st = 44, at = ↑) = w/4
P(st+1 = 45 | st = 44, at = ↑) = w/4
From the above calculation it can be observed that if the agent is at a non-boundary state, then it has 4 neighbors excluding itself and the probability w is uniformly distributed over the 4 neighbors. Also, if the agent is at a non-boundary state, then it transitions to a new state after taking an action (P(st+1 = 44 | st = 44, at = ↑) = 0).
3. If the agent is at one of the four corner states (0, 9, 90, 99), the agent stays at the current state if it takes an action to move off the grid or is blown off the grid by wind. The actions can be divided into two categories:
• Action to move off the grid
• Action to stay in the grid
To illustrate this, let's consider the states shown in figure 4.
Figure 4: Corner states
The transition probabilities for taking an action to move off the grid are given below:
P(st+1 = 10 | st = 0, at = ↑) = w/4
P(st+1 = 1 | st = 0, at = ↑) = w/4
P(st+1 = 0 | st = 0, at = ↑) = 1 − w + w/4 + w/4
The transition probabilities for taking an action to stay in the grid are given below:
P(st+1 = 10 | st = 0, at = →) = 1 − w + w/4
P(st+1 = 1 | st = 0, at = →) = w/4
P(st+1 = 0 | st = 0, at = →) = w/4 + w/4
At a corner state, you can be blown off the grid in two directions. As a result, we have P(st+1 = 0 | st = 0, at = →) = w/4 + w/4, since we can be blown off the grid in two directions and in both cases we stay at the current state.
4.
If the agent is at one of the edge states, the agent stays at the current state if it takes an action to move off the grid or is blown off the grid by wind. The actions can be divided into two categories:
• Action to move off the grid
• Action to stay in the grid
To illustrate this, let's consider the states shown in figure 5.
Figure 5: Edge states
The transition probabilities for taking an action to move off the grid are given below:
P(st+1 = 0 | st = 1, at = ←) = w/4
P(st+1 = 11 | st = 1, at = ←) = w/4
P(st+1 = 2 | st = 1, at = ←) = w/4
P(st+1 = 1 | st = 1, at = ←) = 1 − w + w/4
The transition probabilities for taking an action to stay in the grid are given below:
P(st+1 = 0 | st = 1, at = ↑) = 1 − w + w/4
P(st+1 = 11 | st = 1, at = ↑) = w/4
P(st+1 = 2 | st = 1, at = ↑) = w/4
P(st+1 = 1 | st = 1, at = ↑) = w/4
At an edge state, you can be blown off the grid in one direction. As a result, we have P(st+1 = 1 | st = 1, at = ↑) = w/4, since we can be blown off the grid in one direction and in that case we stay at the current state. The main difference between a corner state and an edge state is that a corner state has 2 neighbors and an edge state has 3 neighbors.

2.1.4 Reward function
To simplify the project, we will assume that the reward function is independent of the current state (s) and the action that you take at the current state (a). To be specific, the reward function only depends on the state that you transition to (s'). With this simplification, we have R^a_{ss'} = R(s').
In this project, we will learn the optimal policy of an agent for two different reward functions:
• Reward function 1
• Reward function 2
The two different reward functions are displayed in figures 6 and 7, respectively.
Figure 6: Reward function 1
Figure 7: Reward function 2
Question 1: (10 points) For visualization purposes, generate heat maps of Reward function 1 and Reward function 2. For the heat maps, make sure you display the coloring scale.
You will have 2 plots for this question. For solving question 1, you might find the following function useful:
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pcolor.html

3 Optimal policy learning using RL algorithms
In this part of the project, we will use a reinforcement learning (RL) algorithm to find the optimal policy. The main steps in an RL algorithm are:
• Find the optimal state-value or action-value
• Use the optimal state-value or action-value to determine the deterministic optimal policy
There are a couple of RL algorithms, but we will use the Value Iteration algorithm since it was discussed in detail in the lecture. We will skip the derivation of the algorithm here because it was covered in the lecture (for the derivation details please refer to the lecture slides on Reinforcement Learning). We will just reproduce the algorithm below for ease of implementation:

1: procedure Value Iteration(P^a_{ss'}, R^a_{ss'}, S, A, γ):
2: for all s ∈ S do            ▷ Initialization
3:     V(s) ← 0
4: end for
5: Δ ← ∞
6: while Δ > ε do             ▷ Estimation
7:     Δ ← 0
8:     for all s ∈ S do
9:         v ← V(s)
10:        V(s) ← max_{a∈A} Σ_{s'∈S} P^a_{ss'}[R^a_{ss'} + γV(s')]
11:        Δ ← max(Δ, |v − V(s)|)
12:    end for
13: end while
14: for all s ∈ S do           ▷ Computation
15:    π(s) ← argmax_{a∈A} Σ_{s'∈S} P^a_{ss'}[R^a_{ss'} + γV(s')]
16: end for
17: end procedure return π

Question 2: (40 points) Create the environment of the agent using the information provided in section 2. To be specific, create the MDP by setting up the state space, action set, transition probabilities, discount factor, and reward function. For creating the environment, use the following set of parameters:
• Number of states = 100 (state space is a 10 by 10 square grid as displayed in figure 1)
• Number of actions = 4 (set of possible actions is displayed in figure 2)
• w = 0.1
• Discount factor γ = 0.8
• Reward function 1
After you have created the environment, write an optimal state-value function that takes as input the environment of the agent and outputs the optimal value of each state in the grid. For the optimal state-value function, you have to implement the Initialization (lines 2-4) and Estimation (lines 5-13) steps of the Value Iteration algorithm. For the estimation step, use ε = 0.01. For visualization purposes, you should generate a figure similar to figure 1 but with the number of each state replaced by the optimal value of that state. In this question, you should have 1 plot.
Question 3: (5 points) Generate a heat map of the optimal state values across the 2-D grid. For generating the heat map, you can use the same function provided in the hint earlier (see the hint after question 1).
Question 4: (15 points) Explain the distribution of the optimal state values across the 2-D grid. (Hint: Use the figure generated in question 3 to explain.)
Question 5: (30 points) Implement the computation step of the Value Iteration algorithm (lines 14-17) to compute the optimal policy of the agent navigating the 2-D state space. For visualization purposes, you should generate a figure similar to figure 1 but with the number of each state replaced by the optimal action at that state. The optimal actions should be displayed using arrows. Does the optimal policy of the agent match your intuition? Please provide a brief explanation. Is it possible for the agent to compute the optimal action to take at each state by observing the optimal values of its neighboring states? In this question, you should have 1 plot.
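The Value Iteration pseudocode reproduced in section 3 can be sketched in plain Python. This is a minimal sketch on a generic MDP stored as dictionaries (P[s][a] is a list of (s', probability) pairs and R[s'] is the transition-independent reward of section 2.1.4); building the 100-state grid MDP itself is left to the project. The tiny 2-state MDP at the bottom is hypothetical, used only to exercise the function.

```python
def value_iteration(P, R, gamma=0.8, eps=0.01):
    """P[s][a] -> list of (s_next, prob); R[s_next] -> reward for entering s_next."""
    V = {s: 0.0 for s in P}                      # Initialization (lines 2-4)
    delta = float("inf")
    while delta > eps:                           # Estimation (lines 5-13)
        delta = 0.0
        for s in P:
            v = V[s]
            V[s] = max(sum(p * (R[sn] + gamma * V[sn]) for sn, p in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(v - V[s]))
    policy = {s: max(P[s], key=lambda a: sum(p * (R[sn] + gamma * V[sn])
                                             for sn, p in P[s][a]))
              for s in P}                        # Computation (lines 14-17)
    return V, policy

# Hypothetical 2-state MDP: from each state the agent can 'stay' or 'go'.
P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 1.0)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
R = {0: 0.0, 1: 1.0}                             # only state 1 is rewarding
V, policy = value_iteration(P, R)
print(policy)  # state 0 heads to state 1; state 1 stays
```

Note the in-place update of V inside the sweep, matching line 10 of the pseudocode; the policy is then read off with the same one-step lookahead as line 15.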
Question 6: (10 points) Modify the environment of the agent by replacing Reward function 1 with Reward function 2. Use the optimal state-value function implemented in question 2 to compute the optimal value of each state in the grid. For visualization purposes, you should generate a figure similar to figure 1 but with the number of each state replaced by the optimal value of that state. In this question, you should have 1 plot.
Question 7: (10 points) Generate a heat map of the optimal state values (found in question 6) across the 2-D grid. For generating the heat map, you can use the same function provided in the hint earlier.
Question 8: (20 points) Explain the distribution of the optimal state values across the 2-D grid. (Hint: Use the figure generated in question 7 to explain.)
Question 9: (20 points) Implement the computation step of the Value Iteration algorithm (lines 14-17) to compute the optimal policy of the agent navigating the 2-D state space. For visualization purposes, you should generate a figure similar to figure 1 but with the number of each state replaced by the optimal action at that state. The optimal actions should be displayed using arrows. Does the optimal policy of the agent match your intuition? Please provide a brief explanation. In this question, you should have 1 plot.

4 Inverse Reinforcement learning (IRL)
Inverse Reinforcement Learning (IRL) is the task of learning an expert's reward function by observing the optimal behavior of the expert. The motivation for IRL comes from apprenticeship learning. In apprenticeship learning, the goal of the agent is to learn a policy by observing the behavior of an expert. This task can be accomplished in two ways:
1. Learn the policy directly from the expert's behavior
2. Learn the expert's reward function and use it to generate the optimal policy
The second way is preferred because the reward function provides a much more parsimonious description of behavior.
The reward function, rather than the policy, is the most succinct, robust, and transferable definition of the task. Therefore, extracting the reward function of an expert would help design more robust agents. In this part of the project, we will use an IRL algorithm to extract the reward function. We will use the optimal policy computed in the previous section as the expert behavior and use the algorithm to extract the reward function of the expert. Then, we will use the extracted reward function to compute the optimal policy of the agent. We will compare the optimal policy of the agent to the optimal policy of the expert and use a similarity metric between the two to measure the performance of the IRL algorithm.

4.1 IRL algorithm
For finite state spaces, there are a couple of IRL algorithms for extracting the reward function:
• Linear Programming (LP) formulation
• Maximum Entropy formulation
Since we covered the LP formulation in the lecture and it is the simplest IRL algorithm, we will use the LP formulation in this project. We will skip the derivation of the algorithm here (for details on the derivation please refer to the lecture slides). The LP formulation of IRL is given by equation 1:

maximize_{R, ti, ui}  Σ_{i=1}^{|S|} (ti − λui)
subject to  [(P_{a1}(i) − P_a(i))(I − γP_{a1})^{-1} R] ≥ ti,  ∀a ∈ A \ a1, ∀i
            (P_{a1} − P_a)(I − γP_{a1})^{-1} R ⪰ 0,  ∀a ∈ A \ a1
            −u ⪯ R ⪯ u
            |Ri| ≤ Rmax,  i = 1, 2, · · · , |S|        (1)

In the LP given by equation 1, R is the reward vector (R(i) = R(si)), Pa is the transition probability matrix, λ is the adjustable penalty coefficient, and the ti's and ui's are the extra optimization variables (please note that u(i) = ui). Use the maximum absolute value of the ground truth reward as Rmax. For ease of implementation, we can recast the LP in equation 1 into an equivalent form given by equation 2 using block matrices:

maximize_x  c^T x
subject to  Dx ⪯ 0        (2)

Question 10: (10 points) Express c, x, and D in terms of R, Pa, P_{a1}, ti, u, and Rmax.

4.2 Performance measure
In this project, we use a very simple measure to evaluate the performance of the IRL algorithm. Before we state the performance measure, let's introduce some notation:
• OA(s): Optimal action of the agent at state s
• OE(s): Optimal action of the expert at state s
• m(s) = 1 if OA(s) = OE(s), and 0 otherwise
Then, with the above notation, accuracy is given by equation 3:
Accuracy = (Σ_{s∈S} m(s)) / |S|        (3)
Since we are using the optimal policy found in the previous section as the expert behavior, we will use the optimal policy found in the previous section to fill in the OE(s) values. Please note that these values will be different depending on whether we used Reward function 1 or Reward function 2 to create the environment. To compute OA(s), we will solve the linear program given by equation 2 to extract the reward function of the expert. For solving the linear program you can use the LP solver in Python (from cvxopt import solvers and then use solvers.lp). Then, we will use the extracted reward function to compute the optimal policy of the agent using the Value Iteration algorithm you implemented in the previous section. The optimal policy of the agent found in this manner will be used to fill in the OA(s) values. Please note that these values will depend on the adjustable penalty coefficient λ. We will tune λ to maximize the accuracy.
Question 11: (30 points) Sweep λ from 0 to 5 to get 500 evenly spaced values of λ. For each value of λ, compute OA(s) by following the process described above. For this problem, use the optimal policy of the agent found in question 5 to fill in the OE(s) values. Then use equation 3 to compute the accuracy of the IRL algorithm for this value of λ. You need to repeat the above process for all 500 values of λ to get 500 data points. Plot λ (x-axis) against Accuracy (y-axis). In this question, you should have 1 plot.
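The accuracy of equation 3 is simple enough to sketch directly. The snippet below assumes the two policies are stored as dictionaries mapping state to action (the 4-state policies shown are hypothetical, not outputs of the project's MDP); the LP solve and the λ sweep are left to the project.

```python
def accuracy(agent_policy, expert_policy):
    """Fraction of states where OA(s) = OE(s), i.e. equation 3."""
    matches = sum(1 for s in expert_policy if agent_policy[s] == expert_policy[s])
    return matches / len(expert_policy)

# Hypothetical policies on a 4-state toy grid:
OE = {0: "right", 1: "up", 2: "up", 3: "left"}
OA = {0: "right", 1: "up", 2: "down", 3: "left"}
print(accuracy(OA, OE))  # 3 of the 4 states agree
```

In the λ sweep of Question 11, this function would be called once per λ value, with `OA` recomputed from the reward extracted at that λ.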
Question 12: (5 points) Use the plot in question 11 to find the value of λ for which the accuracy is maximum. For future reference we will denote this value as λmax^(1). Please report λmax^(1).
Question 13: (15 points) For λmax^(1), generate heat maps of the ground truth reward and the extracted reward. Please note that the ground truth reward is Reward function 1 and the extracted reward is computed by solving the linear program given by equation 2 with the parameter λ set to λmax^(1). In this question, you should have 2 plots.
Question 14: (10 points) Use the extracted reward function computed in question 13 to compute the optimal values of the states in the 2-D grid. For computing the optimal values you need to use the optimal state-value function that you wrote in question 2. For visualization purposes, generate a heat map of the optimal state values across the 2-D grid (similar to the figure generated in question 3). In this question, you should have 1 plot.
Question 15: (10 points) Compare the heat maps of question 3 and question 14 and provide a brief explanation of their similarities and differences.
Question 16: (10 points) Use the extracted reward function found in question 13 to compute the optimal policy of the agent. For computing the optimal policy of the agent you need to use the function that you wrote in question 5. For visualization purposes, you should generate a figure similar to figure 1 but with the number of each state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.
Question 17: (10 points) Compare the figures of question 5 and question 16 and provide a brief explanation of their similarities and differences.
Question 18: (30 points) Sweep λ from 0 to 5 to get 500 evenly spaced values of λ. For each value of λ, compute OA(s) by following the process described above.
For this problem, use the optimal policy of the agent found in question 9 to fill in the OE(s) values. Then use equation 3 to compute the accuracy of the IRL algorithm for this value of λ. You need to repeat the above process for all 500 values of λ to get 500 data points. Plot λ (x-axis) against Accuracy (y-axis). In this question, you should have 1 plot.
Question 19: (5 points) Use the plot in question 18 to find the value of λ for which the accuracy is maximum. For future reference we will denote this value as λmax^(2). Please report λmax^(2).
Question 20: (15 points) For λmax^(2), generate heat maps of the ground truth reward and the extracted reward. Please note that the ground truth reward is Reward function 2 and the extracted reward is computed by solving the linear program given by equation 2 with the parameter λ set to λmax^(2). In this question, you should have 2 plots.
Question 21: (10 points) Use the extracted reward function computed in question 20 to compute the optimal values of the states in the 2-D grid. For computing the optimal values you need to use the optimal state-value function that you wrote in question 2. For visualization purposes, generate a heat map of the optimal state values across the 2-D grid (similar to the figure generated in question 7). In this question, you should have 1 plot.
Question 22: (10 points) Compare the heat maps of question 7 and question 21 and provide a brief explanation of their similarities and differences.
Question 23: (10 points) Use the extracted reward function found in question 20 to compute the optimal policy of the agent. For computing the optimal policy of the agent you need to use the function that you wrote in question 9. For visualization purposes, you should generate a figure similar to figure 1 but with the number of each state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.
Question 24: (10 points) Compare the figures of Question 9 and Question 23 and provide a brief explanation of their similarities and differences.

Question 25: (50 points) From the figure in question 23, you should observe that the optimal policy of the agent has two major discrepancies. Please identify and provide the causes for these two discrepancies. One of the discrepancies can be fixed easily by a slight modification to the value iteration algorithm. Perform this modification and re-run the modified value iteration algorithm to compute the optimal policy of the agent. Also, recompute the maximum accuracy after this modification. Is there a change in maximum accuracy? The second discrepancy is harder to fix and is a limitation of the simple IRL algorithm. If you can provide a solution to the second discrepancy then we will give you a bonus of 50 points.

5 Submission

Please submit a zip file containing your code and report to CCLE. The zip file should be named "Project2 UID1 … UIDn.zip", where UIDx are the student ID numbers of the team members.

In this project, we will study various properties of the Internet Movie Database (IMDb). In the first part of the project, we will explore the properties of a directed actor/actress network. In the second part of the project, we will explore the properties of an undirected movie network.

1 Actor/Actress network

In this part of the project, we will create the network using the data from the following text files:

• actor_movies.txt
• actress_movies.txt

The text files can be downloaded from the following link: https://ucla.box.com/s/z45q3g5zrpay8b8gtbql6ojaecb7kj2u

In order to create the network in a consistent manner, you will need to do some data preprocessing. The preprocessing consists of 2 parts:

1. Merging the two text files into one and then removing every actor/actress who has acted in fewer than 10 movies
2. Cleaning the merged text file

The cleaning part is necessary to avoid inconsistency in the network creation.
If you analyze the merged text file, you will observe that the same movie might be counted multiple times due to the role of the actor/actress in that movie. For example, we might have

• Movie X (voice)
• Movie X (as uncredited)

If you don't clean the merged text file, then Movie X (voice) and Movie X (as uncredited) will be considered different movies. Therefore, you will need to perform some cleaning operations to remove inconsistencies of various types.

Question 1: Perform the preprocessing on the two text files and report the total number of actors and actresses and the total number of unique movies that these actors and actresses have acted in.

1.1 Directed actor/actress network creation

We will use the processed text file to create the directed actor/actress network. The nodes of the network are the actors/actresses and there are weighted edges between the nodes in the network. The weights of the edges are given by equation 1

wi→j = |Si ∩ Sj| / |Si|   (1)

where Si is the set of movies in which actor/actress vi has acted and Sj is the set of movies in which actor/actress vj has acted.

Question 2: Create a weighted directed actor/actress network using the processed text file and equation 1. Plot the in-degree distribution of the actor/actress network. Briefly comment on the in-degree distribution.
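The two preprocessing pieces above — title cleaning and the equation-1 weights — can be sketched as follows. The regex covers only the example annotations shown in the handout and is an assumption; real data will need more cleaning rules:

```python
import re

def clean_title(raw):
    """Strip trailing role annotations such as '(voice)' or '(as uncredited)'.
    This regex is illustrative only, not a complete cleaning solution."""
    return re.sub(r"\s*\((?:voice|uncredited|as [^)]*)\)\s*$", "", raw).strip()

def edge_weights(movie_sets):
    """Directed weights w_{i->j} = |S_i intersect S_j| / |S_i| (equation 1).
    movie_sets: dict mapping actor/actress name -> set of cleaned titles."""
    weights = {}
    for i, Si in movie_sets.items():
        for j, Sj in movie_sets.items():
            if i != j and Si & Sj:
                weights[(i, j)] = len(Si & Sj) / len(Si)
    return weights

w = edge_weights({"a": {"X", "Y"}, "b": {"X"}})
# Note the asymmetry: w[("a", "b")] = 1/2 but w[("b", "a")] = 1,
# which is why the network is genuinely directed.
```

The same intersection divided by different denominators is what makes the graph directed rather than symmetric.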
Also for each pair, report the (input actor, output actor) edge weight. Do all the actor pairings make sense?

1.3 Actor rankings

In this section, we will extract the top 10 actors/actresses from the network.

Question 4: Use Google's PageRank algorithm to find the top 10 actors/actresses in the network. Report the top 10 actors/actresses and also the number of movies and the in-degree of each actor/actress in the top 10 list. Does the top 10 list have any actor/actress listed in the previous section? If it does not have any of the actors/actresses listed in the previous section, please provide an explanation for this phenomenon.

Question 5: Report the PageRank scores of the actors/actresses listed in the previous section. Also report the number of movies each of these actors/actresses has acted in and their in-degree.

2 Movie network

In this part, we will create an undirected movie network and then explore various structural properties of the network.

2.1 Undirected movie network creation

We will use the processed text files from the previous section to create the movie network. The nodes of the network are the movies and there are weighted edges between the nodes in the network. To reduce the size of the network, we will only consider movies that have at least 5 actors/actresses in them. The weights of the edges are given by equation 2

wi→j = |Ai ∩ Aj| / |Ai ∪ Aj|   (2)

where Ai is the set of actors in movie vi and Aj is the set of actors in movie vj. Since wi→j = wj→i, we have an undirected network.

Question 6: Create a weighted undirected movie network using equation 2. Plot the degree distribution of the movie network. Briefly comment on the degree distribution.

2.2 Communities in the movie network

In this part, we will extract the communities in the movie network and explore their relationship with the movie genre. For this part you will need to load the movie_genre.txt file.
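Questions 4 and 5 ask for PageRank scores. In practice you would run a library implementation (e.g. igraph's `page_rank`) on the full actor network; as a dependency-free sketch of what that computes, here is weighted PageRank by power iteration:

```python
def pagerank(out_edges, damping=0.85, iters=100):
    """Weighted PageRank by power iteration (illustrative sketch).
    out_edges: dict node -> dict of (successor -> edge weight)."""
    nodes = set(out_edges) | {v for nbrs in out_edges.values() for v in nbrs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        dangling = 0.0
        for u in nodes:
            nbrs = out_edges.get(u)
            if not nbrs:
                dangling += rank[u]      # no out-links: spread uniformly
                continue
            total = sum(nbrs.values())
            for v, wt in nbrs.items():
                new[v] += damping * rank[u] * wt / total
        for v in nodes:
            new[v] += damping * dangling / n
        rank = new
    return rank
```

Sorting the returned dict by score and taking the first 10 entries gives the top-10 list requested in question 4.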
Question 7: Use the Fast Greedy community detection algorithm to find the communities in the movie network. Pick 10 communities and for each community plot the distribution of the genres of the movies in the community.

Question 8(a): In each community, determine the most dominant genre based simply on frequency counts. Which genres tend to be the most frequent dominant ones across communities and why?

Question 8(b): In each community, for the i-th genre assign a score of ln(c(i)) · p(i)/q(i), where: c(i) is the number of movies belonging to genre i in the community; p(i) is the fraction of genre i movies in the community; and q(i) is the fraction of genre i movies in the entire data set. Now determine the most dominant genre in each community based on the modified scores. What are your findings and how do they differ from the results in 8(a)?

Question 8(c): Find a community of movies that has size between 10 and 20. Determine all the actors who acted in these movies and plot the corresponding bipartite graph (i.e. restricted to these particular movies and actors). Determine the three most important actors and explain how they help form the community. Is there a correlation between these actors and the dominant genres you found for this community in 8(a) and 8(b)?

2.3 Neighborhood analysis of movies

In this part of the project, you will need to load the movie_rating.txt file and we will explore the neighborhood of the following 3 movies:

• Batman v Superman: Dawn of Justice (2016); Rating: 6.6
• Mission: Impossible – Rogue Nation (2015); Rating: 7.4
• Minions (2015); Rating: 6.4

Question 9: For each of the movies listed above, extract its neighbors and plot the distribution of the available ratings of the movies in the neighborhood. Is the average rating of the movies in the neighborhood similar to the rating of the movie whose neighbors have been extracted? In this question, you should have 3 plots.
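The question 8(b) score can be computed directly from genre counts. Note that ln(c(i)) = 0 when a genre appears only once in a community, so singletons never dominate. The toy data below is invented purely for illustration:

```python
import math
from collections import Counter

def dominant_genre(community_genres, all_genres):
    """Score genre i as ln(c(i)) * p(i) / q(i) (question 8(b)) and return
    (best genre, all scores). Inputs: lists of genre labels, one per movie."""
    c = Counter(community_genres)
    q = Counter(all_genres)
    scores = {}
    for g, cnt in c.items():
        p = cnt / len(community_genres)      # fraction inside the community
        qg = q[g] / len(all_genres)          # fraction in the whole data set
        scores[g] = math.log(cnt) * p / qg   # ln(1) = 0: singletons score 0
    return max(scores, key=scores.get), scores

# "Drama" wins on raw counts (8(a)), but the globally rarer,
# community-concentrated "Sci-Fi" wins on the 8(b) score.
comm = ["Drama"] * 5 + ["Sci-Fi"] * 4
full = ["Drama"] * 80 + ["Sci-Fi"] * 10 + ["Comedy"] * 10
best, _ = dominant_genre(comm, full)
```

Dividing by q(i) is what discounts genres that are frequent everywhere, which is the qualitative difference 8(b) asks you to observe.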
Question 10: Repeat question 9, but now restrict the neighborhood to consist of movies from the same community. Is there a better match between the average rating of the movies in the restricted neighborhood and the rating of the movie whose neighbors have been extracted? In this question, you should have 3 plots.

Question 11: For each of the movies listed above, extract its top 5 neighbors and also report the community membership of the top 5 neighbors. In this question, the sorting is done based on the edge weights.

2.4 Predicting ratings of movies

In this part of the project, we will explore various rating prediction techniques to predict the ratings of the following 3 movies:

• Batman v Superman: Dawn of Justice (2016)
• Mission: Impossible – Rogue Nation (2015)
• Minions (2015)

Question 12: Train a regression model to predict the ratings of movies: for the training set you can pick any subset of movies with available ratings as the target variables; you have to specify the exact feature set that you use to train the regression model and report the root mean squared error (RMSE). Now use this trained model to predict the ratings of the 3 movies listed above (which obviously should not be included in your training data).

We will now predict the ratings of the movies using a different approach. To be specific, we will use a bipartite graph approach for rating prediction. In a bipartite graph G = (V, E), we have a partition of the vertex set such that

V1 ∪ V2 = V,  V1 ∩ V2 = ∅

and eij = (vi, vj) where vi ∈ V1 and vj ∈ V2. In a bipartite graph, vertices belonging to the same partitioned set are non-adjacent. In this project, we will create a bipartite graph in the following manner:

• V1 represents the set of actors/actresses
• V2 represents the set of movies
• There is an edge eij between a node in V1 and a node in V2 if actor i has acted in movie j

Question 13: Create a bipartite graph following the procedure described above.
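A sketch of the bipartite construction, plus one possible actor-weight metric for question 13 (the weight choice here — mean rating of an actor's other rated movies — is our assumption for illustration, not the required answer; the question asks you to choose and justify your own):

```python
def build_bipartite(actor_movies):
    """V1 = actors, V2 = movies, edge (a, m) iff actor a acted in movie m.
    actor_movies: dict actor -> iterable of movie titles."""
    edges = [(a, m) for a, ms in actor_movies.items() for m in ms]
    return sorted(actor_movies), sorted({m for _, m in edges}), edges

def predict_rating(movie, actor_movies, ratings):
    """Hypothetical metric: weight each cast member by the mean rating of
    their other rated movies, then average the cast's weights."""
    cast = [a for a, ms in actor_movies.items() if movie in ms]
    weights = []
    for a in cast:
        rated = [ratings[m] for m in actor_movies[a]
                 if m in ratings and m != movie]
        if rated:
            weights.append(sum(rated) / len(rated))
    return sum(weights) / len(weights) if weights else None
```

Because the two vertex sets never share members, every edge crosses the partition, which is exactly the bipartite property stated above.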
Determine and justify a metric for assigning a weight to each actor. Then predict the ratings of the 3 movies using the weights of the actors in the bipartite graph. Report the RMSE. Is this rating mechanism better than the one in question 12? Justify your answer.

Introduction

In this project we will explore graph theory theorems and algorithms by applying them to real data. In the first part of the project, we consider a particular graph which models correlations between stock price time series. In the second part, we analyse traffic data on a dataset provided by Uber.

1 Stock Market

In this part of the project, we study data from the stock market. The data is available on this Dropbox Link. The goal of this part is to study correlation structures among fluctuation patterns of stock prices using tools from graph theory. The intuition is that investors will have similar strategies of investment for stocks that are affected by the same economic factors. For example, the stocks belonging to the transportation sector may have different absolute prices, but if, say, fuel prices change or are expected to change significantly in the near future, then you would expect the investors to buy or sell all such stocks similarly and maximize their returns. Towards that goal, we construct different graphs based on similarities among the time series of returns on different stocks at different time scales (a day vs a week). Then we study properties of such graphs.

The data is obtained from the Yahoo Finance website for 3 years. You're provided with a number of csv tables, each containing several fields: Date, Open, High, Low, Close, Volume, and Adj Close price. The files are named according to the Ticker Symbol of each stock. You may find the market sector for each company in Name_sector.csv.

1.1 Return correlation

In this part of the project, we will compute the correlation among log-normalized stock-return time series data.
Before giving the expression for correlation, we introduce the following notation:

• pi(t) is the closing price of stock i on the t-th day
• qi(t) is the return of stock i over the period [t − 1, t]:

qi(t) = (pi(t) − pi(t − 1)) / pi(t − 1)

• ri(t) is the log-normalized return of stock i over the period [t − 1, t]:

ri(t) = log(1 + qi(t))

Then, with the above notation, we define the correlation between the log-normalized stock-return time series data of stocks i and j as

ρij = (⟨ri(t)rj(t)⟩ − ⟨ri(t)⟩⟨rj(t)⟩) / √((⟨ri(t)²⟩ − ⟨ri(t)⟩²)(⟨rj(t)²⟩ − ⟨rj(t)⟩²))

where ⟨·⟩ is a temporal average over the investigated time regime (for our data set it is over 3 years).

Question 1: Provide an upper and lower bound on ρij. Also, provide a justification for using the log-normalized return ri(t) instead of the regular return qi(t).

1.2 Constructing correlation graphs

In this part, we construct a correlation graph using the correlation coefficients computed in the previous section. The correlation graph has the stocks as its nodes and the edge weights are given by the following expression:

wij = √(2(1 − ρij))

Compute the edge weights using the above expression and construct the correlation graph.

Question 2: Plot the degree distribution of the correlation graph and a histogram showing the un-normalized distribution of edge weights.

1.3 Minimum spanning tree (MST)

In this part of the project, we will extract the MST of the correlation graph and interpret it.

Question 3: Extract the MST of the correlation graph. Each stock can be categorized into a sector, which can be found in the Name_sector.csv file. Plot the MST and color-code the nodes based on sectors. Do you see any pattern in the MST? The structures that you find in the MST are called Vine clusters. Provide a detailed explanation of the pattern you observe.

1.4 Sector clustering in MSTs

In this part, we want to predict the market sector of an unknown stock. We will explore two methods for performing the task.
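The section 1.1–1.2 formulas map directly onto numpy: `np.corrcoef` computes exactly the ⟨·⟩-based Pearson correlation, and a small clamp guards against floating-point values of ρ fractionally above 1 before the square root:

```python
import numpy as np

def correlation_weights(prices):
    """prices: (T, N) array of daily closing prices, one column per stock.
    Returns (rho, w) with w_ij = sqrt(2 * (1 - rho_ij))."""
    q = np.diff(prices, axis=0) / prices[:-1]   # q_i(t), the daily returns
    r = np.log1p(q)                             # r_i(t) = log(1 + q_i(t))
    rho = np.corrcoef(r, rowvar=False)          # Pearson = the <.> formula
    # clamp tiny floating-point excursions above rho = 1 before the sqrt
    w = np.sqrt(np.maximum(2.0 * (1.0 - rho), 0.0))
    return rho, w
```

Since ρij ∈ [−1, 1], the weights fall in [0, 2]: perfectly correlated stocks are at distance 0 and perfectly anti-correlated ones at distance 2, which is what makes w usable as an MST edge length.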
In order to evaluate the performance of the methods, we define the following metric:

α = (1/|V|) Σ_{vi∈V} P(vi ∈ Si)

where Si is the sector of node i. Define

P(vi ∈ Si) = |Qi| / |Ni|

where Qi is the set of neighbors of node i that belong to the same sector as node i and Ni is the set of neighbors of node i. Compare α with the case where

P(vi ∈ Si) = |Si| / |V|

Question 4: Report the value of α for the above two cases and provide an interpretation for the difference.

1.5 Correlation graphs for weekly data

In the previous parts, we constructed the correlation graph based on daily data. In this part of the project, we will construct a correlation graph based on weekly data. To create the graph, sample the stock data weekly on Mondays and then calculate ρij using the sampled data. If there is a holiday on a Monday, we ignore that week. Create the correlation graph based on weekly data.

Question 5: Extract the MST from the correlation graph based on weekly data. Compare the pattern of this MST with the pattern of the MST found in question 3.

2 Let's Help Santa!

Companies like Google and Uber have a vast amount of statistics about transportation dynamics. Santa has decided to use network theory to facilitate his gift delivery for next Christmas. When we learned about his decision, we designed this part of the project to help him. We will send him your results for this part!

2.1 Download the Data

Go to the "Uber Movement" website and download the data of Monthly Aggregate (all days), 2017 Quarter 4, for the San Francisco area. The dataset contains pairwise traveling time statistics between most pairs of points in the San Francisco area. Points on the map are represented by unique IDs. To understand the correspondence between map IDs and areas, download the Geo Boundaries file from the same website. This file contains latitudes and longitudes of the corners of the polygons circumscribing each area.
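The first version of the α metric above can be sketched in a few lines; the second (baseline) case just replaces |Qi|/|Ni| with the constant |Si|/|V| per node:

```python
def alpha(neighbors, sector):
    """alpha = (1/|V|) * sum over nodes of P(v_i in S_i) = |Q_i| / |N_i|.
    neighbors: dict node -> set of MST neighbors; sector: dict node -> label."""
    total = 0.0
    for v, nbrs in neighbors.items():
        if nbrs:
            same = sum(1 for u in nbrs if sector[u] == sector[v])
            total += same / len(nbrs)
    return total / len(neighbors)
```

A high α means MST neighbors tend to share a sector, i.e. the vine clusters in question 3 line up with the sector labels.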
In addition, it contains one street address inside each area, referred to as DISPLAY_NAME. To be specific, if an area is represented by a polygon with 5 corners, then you have a 5×2 matrix of latitudes and longitudes, each row of which represents the latitude and longitude of one corner.

2.2 Build Your Graph

Read the dataset at hand, and build a graph in which nodes correspond to locations and undirected weighted edges correspond to the mean traveling times between each pair of locations (December only). Add the following attributes to the vertices:

1. Display name: the street address
2. Location: mean of the coordinates of the polygon's corners (a 2-D vector)

(If you download the dataset correctly, it should be named san_francisco-censustracts-2017-4-All-MonthlyAggregate.csv, and the Geo Boundaries file should be named SAN_FRANCISCO_CENSUSTRACTS.JSON.)

The graph will contain some isolated nodes (extra nodes existing in the Geo Boundaries JSON file) and a few small connected components. Remove such nodes and just keep the giant connected component of the graph. In addition, merge duplicate edges by averaging their weights. We will refer to this cleaned graph as G afterwards.

Question 6: Report the number of nodes and edges in G.

2.3 Traveling Salesman Problem

Question 7: Build a minimum spanning tree (MST) of graph G. Report the street addresses of the two endpoints of a few edges. Are the results intuitive?

Question 8: Determine what percentage of triangles in the graph (sets of 3 points on the map) satisfy the triangle inequality. You do not need to inspect all triangles; you can just estimate by random sampling of 1000 triangles.

Now, we want to find an approximate solution for the traveling salesman problem (TSP) on G. Apply the 2-approximation algorithm described in class. Inspect the sequence of street addresses visited on the map and see if the results are intuitive.
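One way to sketch the cleaning step described above — averaging duplicate/reversed edges, then keeping only the giant connected component — without committing to a particular graph library:

```python
from collections import defaultdict, deque

def clean_graph(edge_records):
    """edge_records: iterable of (u, v, mean_travel_time) rows.
    Merges duplicate/reversed edges by averaging their weights and keeps
    only edges inside the giant connected component."""
    samples = defaultdict(list)
    for u, v, t in edge_records:
        samples[(min(u, v), max(u, v))].append(t)   # undirected key
    edges = {k: sum(ts) / len(ts) for k, ts in samples.items()}

    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, giant = set(), set()
    for s in adj:                                   # BFS per component
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in comp:
                    comp.add(y)
                    queue.append(y)
        seen |= comp
        if len(comp) > len(giant):
            giant = comp
    return {k: t for k, t in edges.items() if k[0] in giant}
```

The number of keys in the returned dict and the number of distinct nodes they mention answer question 6 directly.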
Question 9: Find the empirical performance of the approximate algorithm:

ρ = (Approximate TSP Cost) / (Optimal TSP Cost)

Question 10: Plot the trajectory that Santa has to travel!

(Notes: duplicate edges may exist when the dataset provides you with the statistic of a road in both directions; we remove duplicate edges for the sake of simplicity. The 2-approximation algorithm can be found in Papadimitriou and Steiglitz, "Combinatorial Optimization: Algorithms and Complexity", Chapter 17, page 414.)

3 Analysing the Traffic Flow

Next December, there is going to be a sports event between Stanford University and the University of California, Santa Cruz (UCSC). A large number of students are enthusiastic about the event, which is going to be held at UCSC. Stanford fans want to drive from their campus to the rival's. We would like to analyse the maximum traffic that can flow from Stanford to UCSC.

3.1 Estimate the Roads

We want to estimate the map of roads without using actual road datasets. Educate yourself about the Delaunay triangulation algorithm and then apply it to the node coordinates (you can use scipy.spatial.Delaunay in Python).

Question 11: Plot the road mesh that you obtain and explain the result. Create a subgraph G∆ induced by the edges produced by the triangulation.

3.2 Calculate Road Traffic Flows

Question 12: Using simple math, calculate the traffic flow for each road in terms of cars/hour.

Hint: Consider the following assumptions:

• Each degree of latitude and longitude ≈ 69 miles
• Car length ≈ 5 m ≈ 0.003 mile
• Cars maintain a safety distance of 2 seconds to the next car
• Each road has 2 lanes in each direction

Assuming no traffic jam, consider the calculated traffic flow as the max capacity of each road.

3.3 Calculate the Max Flow

Consider the following addresses:

• Source address: 100 Campus Drive, Stanford
• Destination address: 700 Meder Street, Santa Cruz

Question 13: Calculate the maximum number of cars that can commute per hour from Stanford to UCSC.
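The question-8 sampling estimate can be sketched as follows (the treatment of sampled triples that are not actual triangles in G is a simplification; you may prefer to resample instead of skipping):

```python
import random

def triangle_inequality_rate(nodes, weight, samples=1000, seed=0):
    """Estimate the fraction of randomly sampled triangles whose mean
    travel times satisfy the triangle inequality in all three ways.
    weight: dict of symmetric (u, v) -> travel time entries."""
    rng = random.Random(seed)
    t = lambda a, b: weight.get((a, b), weight.get((b, a)))
    ok = 0
    for _ in range(samples):
        u, v, x = rng.sample(nodes, 3)
        d1, d2, d3 = t(u, v), t(v, x), t(u, x)
        if None in (d1, d2, d3):
            continue        # sampled triple is not a triangle in G
        if d1 + d2 >= d3 and d2 + d3 >= d1 and d1 + d3 >= d2:
            ok += 1
    return ok / samples
```

Travel times need not be metric (congestion can make a detour faster), which is why this rate is interesting before running the MST-based 2-approximation, whose guarantee assumes the triangle inequality.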
Also calculate the number of edge-disjoint paths between the two spots. Does the number of edge-disjoint paths match what you see on your road map?

3.4 Defoliate Your Graph

In G∆, there are a number of unreal roads that could be removed. For instance, there are many fake bridges crossing the bay. Apply a threshold on the travel time of the roads in G∆ to remove the fake edges. Trim the fake edges and call the resulting graph G̃∆.

Question 14: Plot G̃∆ on real map coordinates. Are real bridges preserved?

Hint: You can consider the following coordinates:

• Golden Gate Bridge: [[-122.475, 37.806], [-122.479, 37.83]]
• Richmond–San Rafael Bridge: [[-122.501, 37.956], [-122.387, 37.93]]
• San Mateo Bridge: [[-122.273, 37.563], [-122.122, 37.627]]
• Dumbarton Bridge: [[-122.142, 37.486], [-122.067, 37.54]]
• San Francisco–Oakland Bay Bridge: [[-122.388, 37.788], [-122.302, 37.825]]

Question 15: Now repeat question 8 for G̃∆ and report the results. Do you see any significant changes?
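One common reading of the question-12 hints (this interpretation is an assumption, not something the handout states explicitly): speed = road length / mean travel time, the spacing between successive front bumpers is 2 s × speed plus one car length, and flow = lanes × speed / spacing.

```python
def road_capacity(length_miles, travel_time_s, lanes=2):
    """Max flow of one road in cars/hour under the hinted assumptions:
    2-second safety gap, 0.003-mile car length, 2 lanes per direction."""
    speed = length_miles / travel_time_s          # miles per second
    spacing = 2.0 * speed + 0.003                 # miles between front bumpers
    return lanes * speed / spacing * 3600.0       # cars per hour
```

As speed grows the car-length term becomes negligible and the flow approaches lanes × 3600 s / 2 s = 3600 cars/hour. For question 13, these values serve as edge capacities in a standard max-flow routine (e.g. `networkx.maximum_flow`), and by the max-flow/min-cut relationship the edge-disjoint path count comes from the same machinery with unit capacities.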


[SOLVED] CPSC 427 problem sets 1 to 8 solutions

1 Assignment Goals

1. Learn how to prepare and build C++ code on the Zoo.
2. Learn how to customize and use tools.cpp and tools.hpp in your own code.
3. Learn good practices for labeling every code file with the appropriate date, authorship, and acknowledgments of any code fragments taken from elsewhere.
4. Learn conventions used in this course regarding the form of the main program, use of banner() and bye() from the tools library, compiler switches, include guards, and so forth.
5. Learn simple uses of C++ strings.
6. Learn how to do simple C++ I/O with a validity check.
7. Learn how to use C library time functions from within C++.
8. Learn how to report a fatal error by calling the Fatal() function in tools.
9. Learn how to test and submit your code on the Zoo.

2 Problem

You are to write a program called aboutme that, when run, prompts the user to enter their first name, last name, and year of birth. The program then prints out the first name, last name, and age, which we take to be the difference between the current year and the year of birth. Here's what a sample run for me might look like:

—————————————————————
Michael J. Fischer CPSC 427/527 Sun Sep 2 2018 21:24:11
—————————————————————
Please enter your first name: Michael
Please enter your last name: Fischer
Please enter the year of your birth: 1942
Michael Fischer becomes 76 years old in 2018.
—————————————————————
Normal termination.

The lines before the first prompt are printed by banner(). The lines after the age line are printed by bye().

3 Programming Notes

1. Copy the tools files from /c/cs427/code/tools/ into your working directory. Customize them by editing in your own name in place of the generic "Ima Goetting Closeau". Be sure to submit your tools files along with your code and other required files. Read Chapter 1 of the "Exploring C++" textbook for more information about tools, but be warned that my version of tools differs somewhat from what is in the textbook.

2.
You should create a file main.cpp with a main function exactly as follows:

int main() {
    banner();
    run();
    bye();
}

Your own code will go into main.cpp. Needed declarations should be placed before the function main(). You will need to define the function run(), which should be placed after the function main().

3. You must use the standard C++ iostreams facility for your I/O. To use it, one must #include the header file iostream. tools.hpp already does this for you, so all you need in main.cpp is the statement #include "tools.hpp". The standard input stream is cin; the standard output stream is cout. The input operator is >>; the output operator is <<, used with the standard streams cout and cin.

9. (1 pt) Program uses good() to check for input errors and takes appropriate action in case of error.
10. (2 pts) Program correctly computes the current year using the time() and localtime() functions.
11. (1 pt) A well-formed Makefile or makefile is submitted that specifies compiler options -O1 -g -Wall -std=c++17.
12. (1 pt) Running make results in an executable file aboutme and generates no errors or warnings.
13. (4 pts) A file named aboutme.out contains the output from at least two test runs of the program, one with correct input and one where a non-number was entered instead of a valid year. The inputs typed for each test run should also be included so the test output can be replicated.
14. (2 pts) All required files are submitted on Canvas as described in lecture 2.

20 Total points.

Figure 1: Grading rubric.

This short assignment is designed to deepen your understanding of C++ I/O and of character representations.

1 Assignment Goals

1. Learn how to use command line arguments.
2. Learn how to open a file and read its contents.
3. Learn how characters are represented by bytes in the computer.
4. Learn the difference between a character and its ASCII code.
5. Learn how to obtain the ASCII code of a character stored in a variable of type char.
6.
Learn how to print the character whose ASCII code is stored in a variable of type int.
7. Learn how to print an int as a decimal number.
8. Learn how to print an int as a hex number.
9. Learn how to test if a char is printable.
10. Learn how to use the output manipulators dec, hex, setw(), and setfill() to control the printed form of numbers.
11. Learn precisely what in>>val does to the istream in when val has type int.
12. Learn how to use in.get(ch) to read a single character from in.
13. Learn precisely what out<<x does.

If a number is successfully read into x, then x should be printed in decimal on a line by itself. If the attempt to read x fails, then the next character should be read from the stream using in.get(ch), where ch has type char, and a one-line "Skipping..." message should be printed. Depending on the character read, the message might look like either of the following:

Skipping char: 116 0x74 't'
Skipping char: 0 0x00

In each case, the ASCII code of ch is printed first in decimal, right-justified in a 3-character field without zero-fill, and then again in hex, prefixed by "0x", followed by a right-justified 0-filled hex number in a 2-character field. If ch is printable as defined by isprint(), then it should also be printed as a character, enclosed in single quotes as shown. For example, if file data.in contains the text:

Score was 35to21.

the output should be:

—————————————————————
Ima Goetting Closeau CPSC 427/527 Tue Oct 4 2016 11:18:06
—————————————————————
Skipping char: 83 0x53 'S'
Skipping char: 99 0x63 'c'
Skipping char: 111 0x6f 'o'
Skipping char: 114 0x72 'r'
Skipping char: 101 0x65 'e'
Skipping char: 119 0x77 'w'
Skipping char: 97 0x61 'a'
Skipping char: 115 0x73 's'
35
Skipping char: 116 0x74 't'
Skipping char: 111 0x6f 'o'
21
Skipping char: 46 0x2e '.'
Loop exit
—————————————————————
Normal termination.

Be sure you understand why there is no "Skipping" line for the spaces following "Score" and "was".
What happened to those characters?

3 Programming Notes

This program is very short and may be put entirely in the run() function in main.cpp. You must read x using the stream extraction operator >>. You may not use stringstream or getline() or other methods to read the line as a string or to read the individual digits that comprise a decimal number. You must let the stream do your decimal-to-binary conversion. Do not call atoi() or strtol() or any other means of manually converting a string to an int.

To obtain the ASCII code of a character stored in a char variable ch, cast ch to an int. Similarly, to print a character whose ASCII code is stored in an int variable x, cast x to a char before printing. (See https://www.cplusplus.com/reference/cctype/isprint/ for the definition of isprint().)

Grading Rubric

Your assignment will be graded according to the scale given in Figure 1 (see below).

1. (1 pt) All relevant standards from PS1 are followed regarding submission, identification of authorship on all files, and so forth.
2. (1 pt) A well-formed Makefile or makefile is submitted that specifies compiler options -O1 -g -Wall -std=c++17.
3. (1 pt) Running make successfully compiles and links the project and results in an executable file readint.
4. (1 pt) Your program gives a usage comment and terminates if the wrong number of command line arguments is given. It gives a descriptive error comment if the specified input file does not open.
5. (4 pts) All instructions given in sections 2 and 3 are carefully followed.
6. (4 pts) Your program correctly extracts all of the integers in the file.
7. (4 pts) Your program prints a correct "Skipping..." message following each failed attempt to read an integer.
8. (2 pts) The "Skipping..." message exactly follows the examples and instructions, including spacing and when to print leading 0's and when not to.
9. (2 pts) Your program correctly handles end-of-file, regardless of whether the EOF is immediately preceded by whitespace, a digit, or another character.

20 Total points.
Figure 1: Grading rubric.

1 Assignment Goals

1. Produce an application with more than one class, appropriately split into multiple files.
2. Learn how to use a constructor to produce a semantically valid non-trivial data structure.
3. Learn how to use classes and objects to model a physical structure.
4. Learn how to write a driver program to exercise and test a class.
5. Learn how to code within a prescribed and restricted subset of the language.

2 Think-A-Dot

2.1 Some history

Think-a-Dot is a mathematical toy introduced by E.S.R. Inc. in the 1960's. https://www.jaapsch.net/puzzles/images/thinkadot.jpg It is covered by U.S. patent 3,388,483, issued June 18, 1968 to Joseph A. Weisbecker. (See Fig. 1.)

Figure 1: U.S. patent 3,388,483, issued June 18, 1968 to Joseph A. Weisbecker.
Figure 2: Looking inside a slightly-modified box.

Handout #5—September 26, 2018

Ask some questions:

1. What is the structure of the machine? (See Figure 3.)
2. Starting from the all-yellow pattern, can one drop in marbles so as to make it all blue?
3. If so, can one get back to the all-yellow pattern?
4. How many of the 2^8 = 256 possible patterns can one reach from the initial state (all-yellow)?
5. Given that pattern s2 is reachable from pattern s1, how many marbles are needed? (Call this the directed distance from s1 to s2.)
6. Is the distance from s1 to s2 always the same as the distance from s2 to s1?
7. What is the largest distance between any pair of states for which the distance is defined?

Figure 3: Structure of the machine.

Binary Counter: Eight balls through hole C will cause gates 7–5–3 to behave like a binary counter and cycle through all eight possibilities. Gates above and to the left (1, 2, 4, 6) are not affected.

How can you get to a particular pattern? Starting from all yellow, how can one reach this goal? Here's one solution: (the handout shows the solution as a sequence of board diagrams here).

2.2 Further references

Google returns many hits for the search term "think-a-dot".
1. Some of the early pre-E.S.R. Think-a-Dot history.
2. A realistic Think-a-Dot simulator that you can play with, written in Scratch. This shows the original unmodified dot pattern that appears when the device is tipped to the right.
3. A Think-a-Dot-inspired electronic game from 2002.
4. Some of the mathematical theory behind Think-a-Dot (from Wikipedia, Think-a-Dot):
(a) Schwartz, Benjamin L. (1967), "Mathematical theory of Think-a-Dot", Mathematics Magazine, 40 (4): 187–193, doi:10.2307/2688674, MR 1571696.
(b) Beidler, John A. (1973), "Think-a-Dot revisited", Mathematics Magazine, 46: 128–136, doi:10.2307/2687967, MR 0379077.
(c) Gemignani, Michael (1979), "Think-a-Dot: a useful generalization", Mathematics Magazine, 52 (2): 110–112, doi:10.2307/2689850, MR 1572295.

3 Problem

You are to model a Think-a-Dot device and its behavior through a collection of C++ classes. You are also to write a command tad that allows a user to interact with your simulated Think-a-Dot device. User inputs are single-letter commands. All command letters are case insensitive, so 'Q' and 'q' for example have the same effect. The commands are:

• 'A', 'B', 'C' simulate the action of the machine when a ball is dropped in hole 'A', 'B', or 'C', respectively.
• 'L', 'R' cause the gates to be reset to all point the same way – all to the left or all to the right, respectively.
• The flip-flops should be colored as shown for the modified box in Figure 2.
• 'P' prints the state of the machine using three lines of text, e.g.,
R L R
L L
L L L
• 'H' prints a brief version of these instructions.
• 'Q' exits the program.

Your program will prompt the user to enter a command letter, check it for validity, and print the hole at which the ball exits the machine (hole 'P' or 'Q' as shown in Fig. 3).

4 Programming Notes

You will define and implement three classes: ThinkADot, FlipFlop, and Game. Class ThinkADot models the Think-A-Dot device. FlipFlop models a single flip-flop within the Think-a-Dot.
Game controls the user interaction with the Think-A-Dot. It prompts the user for command letters (read from cin) and prints results (to cout). It interacts with the Think-A-Dot to determine how the device responds to the various operations that can be performed on it.

Class Game should have a public function play() that starts the game. play() first creates a ThinkADot object whose flip-flops are colored as shown in Figure 2. It then enters the interactive loop that prompts the user for a command letter and performs the corresponding action.

Class FlipFlop models a single flip-flop. The state of a flip-flop is either “left” or “right” and should be represented by an enum type. (See the 08-brackets Token class for an example.) There should be a print() function that just prints a single letter ‘L’ or ‘R’ according to the current state of the flip-flop. There should also be a function flip() that flips the state from “left” to “right” or vice versa and returns the side of the flip-flop (“left” or “right”) that the ball is on when leaving the flip-flop. Thus, if the flip-flop is in the “left”-leaning state initially, the ball will pass to the right, and the new state will be “right”. For this class, it is okay to have public functions getState() and setState() to be used by member functions of class ThinkADot. A superior design would nest the entire FlipFlop class inside of the ThinkADot class, but for this assignment, FlipFlop should be a separate class at the same level as the others. (We will get to nested classes later in the course.)

Class ThinkADot models the device. It has a private array (not a vector) of eight FlipFlop objects that store the current state of each of the eight flip-flops. Its constructor should initialize all of the flip-flops to the “left” position, the same as the ‘L’ command.
ThinkADot also has public functions reset(), play(), and print() that carry out the actions ‘A’, ‘B’, ‘C’, ‘L’, and ‘R’ (with appropriate parameters) that can be performed on the device. The flip-flop states must be accessible only from these required member functions. In particular, there should be no getter or setter functions for the flip-flops.

The file main.cpp will have the same form as in PS1. The global function run() should have only two lines – one to instantiate Game and the other to call the Game object’s play() function.

4.1 Computing the next state

The tricky part of this assignment is how to update the state when a ball is dropped through one of the three holes ‘A’, ‘B’, or ‘C’. Referring to Figure 2, you can see seven channels through which the ball can pass. If we number them from 0 through 6, starting at the left, then we see that a ball dropped in hole ‘A’ enters channel 1. After passing through the first flip-flop, it moves either to channel 0 or to channel 2, depending on the state of the flip-flop. If it goes to channel 0, it drops straight through to the bottom and comes out on the left side. If it goes to channel 2, it encounters the first flip-flop in the second row. After passing through it, the ball enters channel 1 or 3, etc.

Your code should trace the path of a ball through the machine as described above, flipping each of the flip-flops encountered on the way and recording the last channel the ball was in. Clearly that will be channel 1, 3, 5, or 7. If it’s 1 or 3, the ball exits the machine through hole ‘P’. Otherwise, it exits through hole ‘Q’. Note that it would be ambiguous where the ball exits if it were coming from channel 4, but that is not possible.

4.2 No-no’s

There are many ways to implement a Think-a-Dot. For this assignment, you must do it as described above. Here are a few no-no’s, not because they’re necessarily wrong but because I want you to learn the particular techniques described above.

1. Don’t use new or delete.
2.
Don’t use any Standard Library container classes such as vector.
3. Don’t use a table lookup to find the next state.
4. Don’t use nested classes.
5. Don’t use language features that have not been presented in lecture or in any of the class examples. Don’t use prohibited features such as non-const global variables or goto’s.

If you think you need to violate any of these restrictions, please ask me for help.

5 Grading Rubric

Your assignment will be graded according to the scale given in Figure 4 (see below).

# Pts. Item
1. 1 All relevant standards from PS1 are followed regarding submission, identification of authorship on all files, and so forth.
2. 1 A well-formed Makefile or makefile is submitted that specifies compiler options -O1 -g -Wall -std=c++17.
3. 1 Running make successfully compiles and links the project and results in an executable file tad.
4. 1 tad smoothly interacts with the user. Clean, easily understood user prompts and help messages are given.
5. 2 Bad user inputs are handled gracefully and do not result in fatal errors.
6. 6 All of the functionality in section 3 is correctly implemented. In particular, each of the eight command letters works properly in both upper and lower case and carries out its assigned action correctly.
7. 3 The structure of the program matches the specification and restrictions given in section 4. No dynamic storage is used.
8. 1 Each function definition is preceded by a comment that describes clearly what it does.
9. 4 The program shows good style. All functions are clean and concise. Inline initializations, inline functions, and const are used where appropriate. Variable names are appropriate to the context. Programs are consistently indented according to the course indenting style. Each class has a separate .hpp file and, if needed, a separate .cpp file.
20 Total points.
Figure 4: Grading rubric.

1 Consensus Problem

In this and following assignments, we will be developing a simulator for a distributed consensus algorithm. Consensus is at the heart of maintaining consistency in distributed databases as well as in cryptocurrencies and blockchain algorithms.

We consider the consensus problem in a simple setting. The players are trying to reach agreement on a course of action. Each player has a current preference called her choice, which is the current value stored in her choice register. We assume a simple binary choice, so the choice value is either 0 or 1. The players communicate with each other, and from time to time a player may change her choice. The goal is for the players to arrive at a stage where all players are making the same choice. In this case, we say the players have reached consensus, and we call the common choice the consensus value. We also require that the consensus value be stable, meaning that once consensus has been reached, nobody can subsequently ever change her choice.

We assume the agents communicate using the random-pair communication model. In this model, a communication round consists of a randomly chosen player (called the sender) sending a message to another randomly chosen player (called the receiver). Sender and receiver must be distinct. For the algorithms considered here, the sender’s message is always her current choice value. The receiver, depending on the message received, may change her current choice and her internal state.

A population of players solves the consensus problem if the following is true:

1. For all possible initial choices of the players, if the players start in their designated initial states, the computation eventually reaches consensus with probability 1.
2. Once consensus has been reached, no player can subsequently ever change her choice.

Note that if all players start with the same choice, then that choice is the consensus value, and no player ever changes her choice.
There are many possible algorithms for reaching consensus. Here are a couple of very simple ones that we will be exploring.

1.1 Fickle

Whenever a fickle player receives a message, she changes her choice, if necessary, to agree with the sender’s choice. That is, she sets her choice register to the value contained in the message. It is easy to see that there is some sequence of message transmissions that causes the system to reach consensus. Once consensus has been reached, no player will change her choice since every subsequent message contains the consensus value. It is also easy to believe that it might take a very large number of random communication rounds to reach consensus.

Problem Set 4

1.2 Follow the Crowd

A follow-the-crowd player has a one-bit state register in which she saves the last message received. She changes her choice only when she gets two messages in a row that both disagree with her current choice. Thus, she waits until she gets a sense of the crowd before deciding to follow. We assume that each player starts with her state register set equal to her choice.

In greater detail, when a follow-the-crowd player receives a message m, she compares it with her current state. If it differs, she replaces the current state with m. If it is the same, she replaces the current choice with m.

It is believable that this might converge to a consensus value faster than fickle since it is less likely for a player holding the majority choice to change to the minority value.

2 Assignment Goals

1. To learn how to organize a simulation of a large system.
2. To learn about a simple model of asynchronous distributed computing.
3. To learn how to generate uniformly distributed random numbers from a finite interval.
4. To experience a computationally-intensive application where efficiency matters.
3 Problem

In this assignment, you will implement a simulation of a large number of agents attempting to reach consensus using the fickle algorithm under the random-pair communication model. The follow-the-crowd algorithm will be used in a later assignment.

You are required to implement two classes and a main program.

• class Agent models an agent running the fickle algorithm. The public interface must support these functions:
– Agent(int ch) constructs an agent with choice ch.
– void update(int m) performs the update to the agent as specified by algorithm fickle upon receipt of the message m.
– int choice() const returns the agent’s current choice.
• class Simulator simulates a collection of n agents trying to reach consensus using the random communication model described in section 1. Its public interface consists of the following:
– Simulator( int numAgents, int numOne, unsigned int seed ) constructs a simulator for numAgents agents. The first numOne of these have initial choice 1; the remainder have initial choice 0. seed is used to initialize the random number generator random().
– int run( int& rounds ) runs the simulation for as many rounds as it takes to reach consensus. The number of communication rounds used is stored in the output parameter rounds. The consensus value is returned.

Handout #6—October 22, 2018

To carry out the simulation requires the ability to select a random pair of distinct agents j and k to serve as sender and receiver in a communication round. Further details are given in section 4. To know when consensus is reached requires the simulator to keep track of the number of agents having each of the two possible choice values. Since the only way an agent might change its choice is through update(), you should just update the counts of agents having a given choice after each communication round. Do not poll every agent after every round.
• main.cpp implements a command

> consensus numAgents numOne [seed]

that takes two required arguments, numAgents and numOne, and one optional argument, seed. These three arguments should be converted to numbers and passed to the Simulator constructor. If seed is omitted, the result of time(0) should be used instead. The run() function in main.cpp should instantiate a Simulator with the given parameters and then run it. When Simulator::run() returns, you should print a single line to cout consisting of five whitespace-separated numbers: the number of agents, the number of agents initially choosing one, the actual seed used, the number of communication rounds required to reach consensus, and the final consensus value.

You should test your code using various combinations of parameters. Increasing the population size numAgents will cause a big increase in run time, as will having numOne be close to numAgents/2. You may terminate your experiments once the run time grows to more than a few seconds. This may come rather quickly with fickle, but that is for you to find out. As usual, you should submit test input files and the corresponding outputs produced by your program.

4 Program Notes

I will furnish some test cases on the Zoo in /c/cs427/code/ps4/, and you should also test your program with some parameter combinations of your own. However, you might only be able to duplicate my output if you run your code on the Zoo and your program uses the random number generator in the same way, namely, at each round, first select the sender and then select the receiver. To select a random sender from among n agents, you can use the function RandomUniform(n), which returns a uniformly-distributed random integer in the range [0 . . . n − 1].
int RandomUniform( int n ) {
    long int usefulMax = RAND_MAX - (RAND_MAX + 1L) % n;  // 1L avoids int overflow
    long int r;
    do {
        r = random();
    } while ( r > usefulMax );
    return r % n;
}

The purpose of this code is to make all numbers in the given range equally likely.1

To make sure you use the same number of calls on the random number generator as I do when choosing the sender and receiver, you should first choose the sender j from among the n agents. Now there are only n − 1 eligible receivers, so you should choose a number k in the range [0 . . . n − 2] and adjust k to avoid j by incrementing k if k ≥ j.

The submission guidelines are the same as in previous assignments. Submit all files needed to compile your project along with a Makefile. Include a notes.txt file, a file of sample inputs, and a file of the corresponding outputs.

5 Grading Rubric

Your assignment will be graded according to the scale given in Figure 1 (see below).

# Pts. Item
1. 4 All relevant standards from previous problem sets are followed regarding submission, identification of authorship on all files, and so forth. A well-formed Makefile or makefile is submitted that specifies compiler options -O1 -g -Wall -std=c++17. Running make successfully compiles and links the project and results in an executable file consensus. Each function definition is preceded by a comment that describes clearly what it does.
2. 2 Required sample input and output files are submitted.
3. 4 The program shows good style. All functions are clean and concise. Inline initializations, inline functions, and const are used where appropriate. Variable names are appropriate to the context. Programs are consistently indented according to the course indenting style. Each class has a separate .hpp file and, if needed, a separate .cpp file.
4. 2 Everything is private in all classes except for the specified public interface and any needed special functions (constructors, destructor, move and copy constructors and assignments).
5.
8 All of the functionality in section 3 is correctly implemented.
20 Total points.

Figure 1: Grading rubric.

1Note that it is not sufficient to just take random()%n, since if n does not divide RAND_MAX + 1, some numbers will have a greater probability of being chosen than others. For example, if random() were to produce numbers in the range [0 . . . 9] and we wanted numbers in the range [0 . . . 3], then reducing each of the numbers in the range [0 . . . 9] mod 4 gives the sequence 0, 1, 2, 3, 0, 1, 2, 3, 0, 1. We see that 0 and 1 each occur three times, whereas 2 and 3 each occur only twice. Thus, 0 and 1 are each generated with probability 0.3, and 2 and 3 are each generated with probability only 0.2. To be uniformly distributed, all probabilities should be 0.25. In this example, where we pretend RAND_MAX == 9 and n = 4, my code computes usefulMax to be 9 − 10%4 = 7. Then whenever random() returns 8 or 9, the program loops and tries again.

1 Assignment Objective

This problem set continues the development begun in Problem Set 4 of a simulator for a population of simple agents attempting to reach consensus on a choice value. The PS4 assignment handout describes two different such agent algorithms: fickle and follow the crowd. In PS4, you implemented a simulator for a collection of fickle agents. In this assignment, you will add a number of new features to the PS4 consensus program. Just as in real life, refactoring code to handle new requirements is easy in places and harder in others.

2 Assignment Goals

• Learn to use polymorphism.
• Experience the effect of refactoring a big class (Simulator) into two related but smaller classes.
• Learn how to explore a rich parameter space, and gain insight into the behavior of random processes.

3 Problem

Here is an overview of the new features and required changes:

1. Agent will become a pure abstract class. Recall that this means all functions are virtual, and all are abstract except for the virtual destructor.
The public Agent functions are the same as before: update() and choice().
2. Fickle is a new class publicly derived from Agent. It is pretty much the same as the Agent class from PS4. Crowd is another new class publicly derived from Agent. It implements the follow-the-crowd algorithm described in the PS4 assignment handout.
3. The Simulator of PS4 did two distinct jobs: (a) it set up the population of agents to be simulated, and (b) it ran the simulation. In this assignment, you will separate these two tasks. A new class Population will create and manage the agents. The revised Simulator will take a Population reference as a parameter and run the simulation, as before, until consensus is reached.
4. Population will maintain the aggregated array of Agent that was previously in Simulator. However, since Agent is now the base class for two different derived classes, the array elements will be Agent pointers. Each agent will be created using either new Fickle( val ) or new Crowd( val ), depending on which kind of agent is desired. Here val is the initial choice value for that agent. Population will retain custody of all of these agents, so its destructor must take care to delete them all.
5. The method for constructing agents is different from PS4. Each agent is randomly assigned to one of the two concrete agent types, Fickle or Crowd. The initial value val is randomly chosen from the set {0, 1}. Both of these random choices are biased according to new command line parameters. probFickle is a real number in the semi-open interval [0, 1) and specifies the probability that an agent is chosen to be of type Fickle rather than of type Crowd. probOne similarly is a real number in the range [0, 1) and specifies the probability that an agent’s initial choice is 1 rather than 0.
6. Population has several public functions in addition to constructors, destructor, and print:
(a) int size() const returns the number of agents.
(b) void sendMessage(int sender, int receiver) simulates a single communication step from sender to receiver.
(c) bool consensusReached() returns true iff consensus has been reached.
(d) int consensusValue() returns the consensus value if consensus has been reached; otherwise it returns -1.
7. The Simulator constructor now takes only a single parameter of type Population&. Its run function has signature void run(). To obtain the results of the simulation, the caller can call two new public functions: numRounds() and consensusValue(). Since the simulator is doing the simulation, it knows how many rounds it has used. On the other hand, only Population knows the consensus value, so this is a case where delegation should be used.
8. main.cpp changes considerably. It takes different command line arguments and it prints different output than PS4.
(a) The new command line arguments are

numAgents probFickle probOne [seed]

where seed is optional as before. numAgents is again the total number of agents. probFickle and probOne are the probabilities discussed in item 5 above.
(b) All output should go to cout. banner() and bye() should be used as usual. The output from a run has three parts: the initial parameters, the statistics of the population after the random generation process, and the results of the simulation. See sample.out for an example of the new output format.
9. The resulting executable file should be called consensus2 to distinguish it from the PS4 command name.

An important part of this assignment is to test your program on reasonable test cases, to submit the test case inputs and corresponding outputs, and to report on what you observe. For example, you should run your code on extreme cases such as 0 agents, 1 agent, probabilities of 0.0 and 1.0, and so forth. Try to gain some insight into the extent to which follow-the-crowd agents do better than fickle agents.
For example, what do you observe with a modest size population (say 1000) with different values for probFickle, say, 0.0, 0.01, 0.5, 0.99, 1.0?

Handout #7—October 31, 2018

4 Programming Notes

1. random() is now used in two different parts of the code.
(a) Simulator::run() uses uRandom() as before in order to choose first a sender and then a receiver for a communication step. uRandom() of course is based on random().
(b) The Population constructor needs random values when constructing each agent. First it uses randomness to choose an initial choice value for the agent. Then it uses it to choose the agent type. In both cases, it chooses a double in the semi-open interval [0, 1) and compares that number with the desired probability in order to make its decision. For example, to generate the choice value, test if the random number is less than the desired probability of choosing 1. If it is, then choose 1; else choose 0.
2. To choose a random real in [0, 1), you can use my code

double Population::dRandom() {
    return random()/(RAND_MAX+1.0);   // result is double in [0,1)
}

By default, the type of “1.0” is double, so the coercion rules force the addition and then the division to both be performed using double arithmetic. If you change “1.0” to “1”, it won’t compile without warnings, and it won’t work correctly.
3. In order to duplicate my output, you will need to use the random number generator in exactly the same way as my program does. In particular, you will need to choose the sender before the receiver, and you will need to choose the initial choice value for an agent before choosing the agent type. Of course, you will also need to start with the same seed.
4. The submission guidelines are the same as in previous assignments. Submit all files needed to compile your project along with a Makefile. Include a notes.txt file, a file of sample inputs, and a file of the corresponding outputs.
5.
Note that the grading rubric for this assignment puts more emphasis on good design, good style, and good choice of test data than the previous assignments.

Problem Set 5

5 Grading Rubric

Your assignment will be graded according to the scale given in Figure 1 (see below).

# Pts. Item
1. 4 All relevant standards from previous problem sets are followed regarding submission, identification of authorship on all files, and so forth. A well-formed Makefile or makefile is submitted that specifies compiler options -O1 -g -Wall -std=c++17. Running make successfully compiles and links the project and results in an executable file consensus2. Each function definition is preceded by a comment that describes clearly what it does.
2. 3 Sample input and output files are submitted that show good coverage of the parameter space, e.g., small inputs, large inputs, edge cases for the probabilities (e.g., 0.0 and 1.0) as well as reasonable intermediate cases.
3. 5 The program shows good style. All functions are clean and concise. Inline initializations, inline functions, and const are used where appropriate. Variable names are appropriate to the context. Programs are consistently indented according to the course indenting style. Each class has a separate .hpp file and, if needed, a separate .cpp file. However, it is acceptable to group the three polymorphic agent classes together in the same .hpp and .cpp files.
4. 2 Everything is private in all classes except for the specified public interface, any needed special functions (constructors, destructor, move and copy constructors and assignments), and functions needed for debugging.
5. 6 All of the functionality in section 3 is correctly implemented.
20 Total points.

Figure 1: Grading rubric.

1 Introduction

This problem set continues the development begun in Problem Set 4 of a simulator for a population of simple agents attempting to reach consensus on a choice value.
The long-range goal of this and following assignment(s) is to simulate the blockchain consensus algorithm used in the Bitcoin cryptocurrency, sometimes called Nakamoto consensus, to see how fast consensus is reached under various assumptions about the speed and reliability of the underlying network and the honesty of the agents participating in the protocol.

2 Blockchain Background

While not strictly needed for this assignment, a general understanding of blockchains and Nakamoto consensus will help motivate it.

A blockchain is a sequence of records or blocks that are cryptographically protected to preserve integrity and prevent various kinds of tampering. A blockchain can be extended only by someone who knows the solution to a difficult cryptographic puzzle that is derived from the current blockchain. An agent, called a miner, who wants to extend the blockchain first has to solve the puzzle for that chain. In general, many miners are working in parallel to solve the puzzle. Anyone who succeeds is able to create a new block of transactions and append it to the end of the chain. The result is a new, longer chain that is verifiably valid.

Once a new chain has been produced, the successful miner sends it around the network to other miners. When a longer (valid) chain is received from another miner, the recipient discards the old shorter chain and begins trying to solve the puzzle for the longer chain. Consensus on the new chain is reached when all of the miners have received the new chain and discarded their old, shorter chains.

It is possible that two miners will solve the puzzle at nearly the same time and will each propose longer but different extensions of the current chain. These new chains will propagate around the network. Because they are of equal length, neither will annihilate the other, so consensus cannot be reached until a yet longer chain is produced by one of the miners.
As this new chain propagates through the network, miners will discard their old chains and adopt the new one. Intuitively, consensus will eventually be reached if there is a unique longest chain in circulation and no new chains are created before consensus is reached. The purpose of the puzzles (also called proof of work) is to slow down the process of creating new chains, which in turn will decrease the likelihood of new chains interfering with the consensus process.

In practice, one does not require consensus on what the current blockchain is. Rather, one is interested in when consensus is reached on a particular block. For example, suppose a miner extends a length-10 blockchain c to create a new blockchain whose last (11th) block is b. Intuitively, b is committed if b is the 11th block of every longer blockchain still in circulation. Of course, there is always the possibility that some miner still working on the old chain c will succeed in creating a new length-11 chain with a different 11th block. However, the chance of that block not being annihilated as it attempts to infect other miners is vanishingly small under suitable circumstances.

Problem Set 6

3 Problem

This assignment focuses on a particular space-efficient representation of blockchains. At an abstract level, a blockchain is just a sequence of blocks. A new blockchain is created by appending a new block to the end of an existing blockchain. The initial blockchain consists of a single genesis block. All subsequent blockchains begin with the genesis block. We ignore the cryptography issues of real blockchains and assume that the only properties of interest in a blockchain are its length and the list of blocks that comprise the chain. Each block has an associated unique identifier, so there is never the possibility of two identical blocks being created in different parts of the system.

Figure 1 shows a situation with three active blockchains beginning with blocks Bk2, Bk3, and Bk4, respectively.
Four agents ChA, . . . , ChD have three different current choices for the blockchain. ChA prefers the chain beginning with Bk2, ChB and ChC both prefer the chain at Bk4, and ChD prefers the chain at Bk3.

Figure 1: Three active blockchains.

Our blockchain representation has two goals:

1. No block is ever copied.
2. A block is automatically deleted as soon as it becomes inaccessible.

For goal 1, we delete the copy constructor and copy assignment (to prevent accidental copying), and we use pointers to represent the chain structure as a kind of linked list. For goal 2, we replace the arrows of Figure 1 with a slight modification of the SPtr class presented in demo 21a-SmartPointer-v2. Figure 2 shows the smart pointers as white rectangles inside both the blockchain headers (possessed by the agents) and inside the blocks themselves. The dashed white boxes represent the count dynamic extensions in the SPtr class. Recall that this is a count of the number of SPtr objects having the same target pointer.

Handout #8—November 26, 2018

Figure 2: Three active blockchains.

Your job is to implement two new classes, Blockchain and Block, to represent blockchains. In addition to the SPtr data member, each Block will also have two const fields: a unique identifier and its level in the blockchain tree, where the genesis block is considered to be at level 0. (The genesis block in Figures 1 and 2 is Bk1.)

Class Blockchain contains a single private data member of type SPtr, which implements the smart pointer to the most recent block in the chain. It should have several public functions:

1. Blockchain extend() returns a new blockchain created by extending the current blockchain. The new chain should be stack-allocated and returned by value.
2. print() prints the blocks that comprise a blockchain in order of increasing level.
For example, the output from printing the blockchain ChC in Figure 2 might look like [0,1] [1,2] [2,4]. Here, the first number of each pair is the block’s level in the tree and the second number is its UID.
3. A dereferencing operator that behaves the way the corresponding operator behaves on ordinary C pointers.

You might also want to define a function to return the C pointer target that the SPtr object manages.

6 Grading Rubric

Your assignment will be graded according to the scale given in Figure 3 (see below).

# Pts. Item
1. 4 All relevant standards from previous problem sets are followed regarding submission, identification of authorship on all files, and so forth. A well-formed Makefile or makefile is submitted that specifies compiler options -O1 -g -Wall -std=c++17. Running make successfully compiles and links the project and results in an executable file blockchain. Each function definition is preceded by a comment that describes clearly what it does.
2. 4 Sample input and output files are submitted that show the program behaves as expected. In particular, you should create a test file that grows a blockchain structure like the one in Figure 1 and then proceeds to replace some of the blockchains with others. An inaccessible block should be deleted automatically when the last smart pointer to it is deallocated. It may be useful to leave the SPtr debugging printout in place that shows when a block is deleted.
3. 4 The program shows good style. All functions are clean and concise. Inline initializations, inline functions, and const are used where appropriate. Variable names are appropriate to the context. Programs are consistently indented according to the course indenting style. Each class has a separate .hpp file and, if needed, a separate .cpp file.
4. 4 All of the functionality in section 3 is correctly implemented.
5. 4 Valgrind gives clean output with all storage blocks freed except for the instantiation of Serial.
20 Total points.
Figure 3: Grading rubric.

1 Introduction

This problem set continues the development begun in Problem Set 4 of a simulator for a population of simple agents attempting to reach consensus on a choice value. The PS4 assignment handout describes two different such agent algorithms: fickle and follow the crowd. In PS4, you implemented a simulator for a collection of fickle agents. In the PS5 assignment, you refactored the PS4 solution to add a number of new features. In particular, you made Agent into a pure abstract class, you split off a new Population class from the Simulator class, and you changed the method used for building the initial population from the command line parameters. In the PS6 assignment, you implemented new classes Block and Blockchain, and you imported and perhaps modified demo classes SPtr and Serial. The goal of this assignment is to simulate the blockchain consensus algorithm used in the Bitcoin cryptocurrency, sometimes called Nakamoto consensus, to see how consensus gradually evolves.

2 Teaching Objectives

• Gain more experience in refactoring code to adapt it to new requirements.
• Learn how to factor out common code in a polymorphic class hierarchy.
• Find additional uses of delegation in order to keep functions close to the data members that they need.
• Learn how objects of class Blockchain can be passed around freely in the simulator without concern about storage management issues.
• Take a step closer to simulating a realistic blockchain consensus algorithm.

3 Problem

Integrate and extend your PS5 and PS6 solutions to result in a simulator for agents that are attempting to agree on a blockchain rather than on a single bit. In particular, you will need to do the following:

1. Find everywhere in your code from PS5 involved with reaching consensus on a 0-1 value of type int. Change the value type instead to Blockchain.
In particular, the abstract virtual function choice() in Agent should now return a value of type Blockchain rather than of type int.

Problem Set 7

2. Add a new abstract virtual function extend() to class Agent that causes the agent to extend its current blockchain and to make the extended blockchain its new choice. This will apply to all agent classes, including Fickle and Crowd.[1]

3. Add another agent class, Nakamoto, to model the Nakamoto algorithm's rule of ignoring a new blockchain received from another agent unless the new chain is longer than the current one, in which case the longer blockchain replaces the shorter one.

4. Add a new class AgentBase that is publicly derived from Agent and from which the actual agents Fickle, Crowd, and Nakamoto will be publicly derived. The purpose of AgentBase is to give a place for the data members and functions that are common to all agents, such as the agent's current choice and the public functions choice() and extend(). The reason for not putting these things directly in Agent is that Agent is a pure abstract class, so it cannot be instantiated and therefore also cannot have data members.

5. Change Simulator to randomly decide at each step whether to perform an update() operation or an extend() operation. It does this by calling dRandom() and comparing the result to a new command line argument probExtend. In case it decides to simulate an extend, it chooses an agent at random to perform the extend. Similarly, if it decides to simulate the sending of a message, it works as in PS4 and PS5 by choosing a random sender and a distinct random receiver for the simulated sending of a message.

6. Instead of running until consensus is reached, the new simulator will run for maxRounds rounds, where maxRounds is a new command line argument that is passed to Simulator::run() as a parameter.
Thus, there is no longer any need for the functions and data members that were formerly involved in trying to determine whether or not consensus had been reached, and if so, what the consensus value was. Rather, at the end of the simulation, we'll simply print out a list of agents with each agent's current choice.

7. The code in class Population that creates the population should now make a 3-way choice between Fickle, Crowd, and Nakamoto. New command line parameters provide the desired probabilities for each kind of agent. All agents, regardless of type, now start with copies of the same initial genesis Blockchain for their initial choice. The genesis Blockchain contains a smart pointer that points to the genesis Block. The genesis Block is unique in that its SPtr data member has both target and count set to nullptr.

8. Population should also have functions extend(int receiver) and sendMessage(int sender, int receiver) to translate between the simulator's use of integers to identify agents and the agents themselves, which actually carry out those operations. The agents in turn delegate some of the work to Blockchain functions that were defined in PS6.

[1] This is intended to model what happens when a Bitcoin miner successfully solves the current proof-of-work puzzle and is therefore allowed to add a set of transactions to the current blockchain.

Handout #9—December 3, 2018

9. Modify main.cpp to accept the following command line arguments:

numAgents maxRounds probNak probFickle probExtend [seed]

where the arguments have the following meanings:

numAgents: The total number of agents (as before).
maxRounds: The total number of simulation rounds to perform.
probNak: The probability of selecting a Nakamoto agent when building the population.
probFickle: The probability of selecting a Fickle agent when building the population.
probExtend: The probability that the simulator chooses to simulate an extend() operation rather than a sendMessage() operation.
[seed]: Optional seed for the random number generator (as before).

The probability of selecting a Crowd agent is 1.0 − probNak − probFickle. It is an error if the result does not lie in the closed interval [0.0, 1.0].

10. Modify Population to have two print functions, one which prints the statistics as in PS5 (but I'm now calling it printStats()), and one that prints out each agent's choice of blockchain, one per line (which is what I'm now calling print()). Naturally, print() delegates the printing of an agent's current choice to Blockchain::print().

A sample call on your simulator is given on the Zoo in /c/cs427/code/ps7/sample.sh, along with the output from a run on my machine. To give a better idea of what the simulator is doing, I have added print statements to the simulator to show what operation is being performed at each step of the simulation. I've also added a print statement to Population::extend() to print each new blockchain when it is first produced.

4 Programming Notes

1. There should be no public data members and no use of friend classes or functions.

2. Dynamic memory (allocated by new) should only be used in the following places:
(a) Within the furnished classes SPtr and Serial;
(b) For creating objects of type Block;
(c) For creating objects of type Agent and for creating the array of agents.
All Blockchain objects should be stack allocated.

3. The only delete statements outside of class SPtr should be to delete objects created in case 2c above. You should not explicitly delete any blocks. They should be deleted automatically by the SPtr objects that manage them.

5 Grading Rubric

Your assignment will be graded according to the scale given in Figure 1 (see below).

1. (3 pts.) All relevant standards from previous problem sets are followed regarding submission, identification of authorship on all files, and so forth. A well-formed Makefile or makefile is submitted that specifies compiler options -O1 -g -Wall -std=c++17.
Running make successfully compiles and links the project and results in an executable file blockchain. Each function definition is preceded by a comment that describes clearly what it does.

2. (2 pts.) Sample input and output files are submitted that show good coverage of the parameter space, e.g., small inputs, large inputs, edge cases for the probabilities (e.g., 0.0 and 1.0) as well as reasonable intermediate cases. This is in addition to the furnished sample file.

3. (3 pts.) The program shows good style. All functions are clean and concise. Inline initializations, inline functions, and const are used where appropriate. Variable names are appropriate to the context. Programs are consistently indented according to the course indenting style. Each class has a separate .hpp file and, if needed, a separate .cpp file. However, it is acceptable to group the three polymorphic agent classes together in the same .hpp and .cpp files.

4. (2 pts.) The restrictions in section 4 are all obeyed.

5. (10 pts.) All of the functionality in section 3 is correctly implemented.

Total: 20 points.

Figure 1: Grading rubric.

1 Introduction

This 10-point assignment is required for graduate students and anyone else registered under CPSC 527. It is optional for students registered under CPSC 427. For those students, points earned on this assignment will offset points lost on previous homework assignments, but they will not apply against points lost on exams, nor will they permit anyone to earn more than 100% of the possible homework points. The due date for this assignment is the same as that for homework assignment 7. However, they are separate assignments and should be submitted separately.

2 Problem

The goal of this assignment is to modify your solution to homework assignment 7 to gather and print additional information about the progress of the simulation. An inventory of blockchains is a précis of the current set of choices of all of the agents in the population.
A sample inventory from a real simulation run with 10 agents is:

Inventory: 5 active blockchain(s):
1 copies of [0,1] [1,46] [2,123] [3,255]
1 copies of [0,1] [1,46] [2,123] [3,271]
6 copies of [0,1] [1,46] [2,123]
1 copies of [0,1] [1,46] [2,160]
1 copies of [0,1] [1,46] [2,236]

We see that all of the agents agree on the level-0 and level-1 blocks of each chain, and eight agents agree on the level-2 block [2,123]. However, two different chains have already been forked from it, so two agents have distinct level-3 blocks, and it is unclear which will eventually win out. Thus, an inventory is a compressed version of the current choices of the population that allows one to relatively quickly find those blocks for which consensus has already been achieved. However, finding consensus blocks is not a part of this assignment. All that is required is to create the initial inventory, to maintain it step by step during the simulation, and to print it when required (in the format shown above).

3 Programming Notes

1. You should create a class Inventory whose sole data member is a std::map.

2. To maintain the inventory, you should write functions add() and sub() that add (insert) and subtract (remove) blockchains from the inventory, respectively. After each simulation step, if the agent's new and old choices differ, the new one should be added and the old one subtracted from the inventory.

3. There should be no occurrences of new in PS8 other than those already present in PS7. In particular, the map should be composed in Inventory and not be a dynamic extension. Keeping the use of dynamic storage confined to classes that can manage it, such as SPtr and std::map, simplifies your code and makes it less likely to have hidden errors. Remember the motto: "No new's is good news!"

4. You should define Blockchain::operator
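The automatic-deletion behavior these handouts build on comes entirely from the shared count inside SPtr. A minimal sketch of that idea follows; it is an illustration only, not the furnished demo 21a class, and the member names (target, count, alive) are guesses:

```cpp
#include <cassert>
#include <utility>

struct Block;

// Hypothetical sketch of the SPtr idea: every SPtr aimed at the same Block
// shares one dynamically allocated count of how many owners exist.
class SPtr {
public:
    SPtr() = default;
    explicit SPtr(Block* t) : target(t), count(new int(1)) {}
    SPtr(const SPtr& other) : target(other.target), count(other.count) {
        if (count) ++*count;               // one more owner of the same block
    }
    SPtr& operator=(SPtr other) {          // copy-and-swap assignment
        std::swap(target, other.target);
        std::swap(count, other.count);
        return *this;                      // old target released when other dies
    }
    ~SPtr() { release(); }
    Block* operator->() const { return target; }
private:
    void release();                        // defined below, after Block is complete
    Block* target = nullptr;
    int* count = nullptr;
};

// A Block carries its const UID and level, plus a smart pointer to its
// predecessor in the chain (the genesis block's prev is empty).
struct Block {
    static inline int alive = 0;           // live-block counter, for testing only
    Block(int u, int l, SPtr p) : uid(u), level(l), prev(p) { ++alive; }
    ~Block() { --alive; }
    const int uid;
    const int level;
    SPtr prev;
};

void SPtr::release() {
    if (count && --*count == 0) {          // last owner gone:
        delete target;                     // frees the block, which releases
        delete count;                      // its prev and can cascade down
    }
}
```

When the last SPtr aimed at a block is destroyed, the count reaches zero and the block is freed; destroying the block in turn releases its prev pointer, so deletion cascades down the chain until it reaches a block that some other chain still shares.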


[SOLVED] Csci596 assignment 1 to 7 solutions

0.1. Asymptotic complexity analysis

Read a lecture note (https://cacs.usc.edu/education/cs596/AsymptoticAnalysis.pdf) and "Appendix A.2—Order Analysis of Functions" of Introduction to Parallel Computing by Grama et al. (page 581 of the PDF file, to which the link is found at the course homepage, https://cacs.usc.edu/education/cs596.html, under the "Textbooks" heading).

0.2. Theoretical peak performance of a computer

Read a lecture note (https://cacs.usc.edu/education/cs596/PeakFlops.pdf) to learn how to compute the theoretical peak floating-point performance of a computer.

Now, here is the actual assignment: submit your answers to the following two questions.

1. Measuring Computational Complexity

Use the data file, MDtime.out, in the assignment 1 package. In the two-column file, the left column is the number of atoms, N, simulated by the md.c program, whereas the right column is the corresponding running time, T, of the program in seconds. Make a log-log plot of T vs. N. Perform a linear fit of log T vs. log N, i.e., log T = a log N + b, where a and b are fitting parameters. Note that the coefficient a signifies the power with which the runtime scales as a function of problem size N: T ∝ N^a. For detail, see slide 31 in https://cacs.usc.edu/education/cs596/01MDVG.pdf. Submit (i) the plot and (ii) your fitted value of a.

For this and subsequent assignments, you need to use a scientific plotting software like Grace, Origin, Kaleidagraph, Gnuplot, Matlab, Mathematica, or Excel. Please make sure that you are familiar with one such software. For this assignment, you also need to use the least-square fitting feature of your plotting software. In case you cannot find such a feature, you can do it yourself following the lecture note on "least square fit of a line" (https://cacs.usc.edu/education/cs596/LeastSquareFit.pdf).

2.
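The "least square fit of a line" used in question 1 has a simple closed form for the slope and intercept. A minimal sketch (the function name is hypothetical and the test data below is synthetic, not MDtime.out):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Ordinary least squares for y = a*x + b over points (x[i], y[i]).
// For the scaling exercise, x = log N and y = log T, so the fitted slope a
// is the empirical complexity exponent in T ~ N^a.
struct Fit { double a, b; };

Fit leastSquares(const std::vector<double>& x, const std::vector<double>& y) {
    const double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
    const double b = (sy - a * sx) / n;                          // intercept
    return {a, b};
}
```

Feeding it log N and log T pairs from the data file would recover the exponent a directly, which is exactly what the plotting software's fit feature computes.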
Theoretical Flop/s Performance

Suppose that your computer has only one octa-core processor (a processor equipped with 8 processing cores) operating at a clock speed of 2.3 GHz (i.e., the clock ticks 2.3×10^9 times per second), in which each core can perform 1 multiplication and 1 addition per clock cycle using a fused multiply-add (FMA) circuit. Assume that each multiply or add operation is performed on vector registers, each holding 4 double-precision (i.e., 4×64 = 256 bits) operands. What is the theoretical peak performance of your computer in terms of double-precision Gflop/s (gigaflop/s or 10^9 flop/s)? Submit the computed number.

(Optional) A program named lmd_sqrt_flop.c is provided in the gzipped tar archive, cs596-as01-tar.gz, on Blackboard, along with instructions to compile and run the program in the README file.* This is a linked-list-cell molecular dynamics program, in which the sqrt() function is implemented using a polynomial, for counting the number of floating-point operations (see the lecture note on "Arithmetic implementation of sqrt() and floating-point performance", https://cacs.usc.edu/education/cs596/Sqrt.pdf). Compile and run the lmd_sqrt_flop.c program on a computer of your choice, and report the flop/s performance you get. Better, answer: how many % of the theoretical peak flop/s performance of the computer did you achieve?

* To extract the files from the archive, type tar xvfz cs596-as01-tar.gz.

Goal: Implement Your Own Global Summation with Message Passing Interface

In this assignment, you will write your own global summation program (equivalent to MPI_Allreduce) using MPI_Send and MPI_Recv. Your program should run with P = 2^l processes (or MPI ranks), where l = 0, 1, ... Each process contributes a partial value, and at the end, all the processes will have the globally-summed value of these partial contributions.
Your program will use a communication structure called a butterfly, which is structured as a series of pairwise exchanges (see Fig. 1, where messages are denoted by arrows). This structure allows a global reduction among P processes to be performed in log2 P steps:

a000 + a001 + a010 + a011 + a100 + a101 + a110 + a111
= ((a000 + a001) + (a010 + a011)) + ((a100 + a101) + (a110 + a111))

At each level l, a process exchanges messages with a partner whose rank differs only at the l-th bit position in the binary representation (Fig. 1).

Fig. 1: Butterfly network used in hypercube algorithms.

HYPERCUBE TEMPLATE

We can use the following template to perform a global reduction using any associative operator OP (such as multiplication or maximum), i.e., (a OP b) OP c = a OP (b OP c) [1-3]:

procedure hypercube(myid, input, logP, output)
begin
  mydone := input;
  for l := 0 to logP-1 do
  begin
    partner := myid XOR 2^l;
    send mydone to partner;
    receive hisdone from partner;
    mydone := mydone OP hisdone
  end;
  output := mydone
end

USE OF BITWISE LOGICAL XOR

Note that 0 XOR 0 = 1 XOR 1 = 0 and 0 XOR 1 = 1 XOR 0 = 1, so that XOR with 1 flips a bit while XOR with 0 leaves it unchanged:

a XOR 1 = ā
a XOR 0 = a

where ā is the complement of a (ā = 1|0 for a = 0|1). In particular, myid XOR 2^l reverses the l-th bit of the rank of this process, myid:

abcdefg XOR 0000100 = abcdēfg

Note that the XOR operator is ^ (caret symbol) in the C programming language.

ASSIGNMENT

Complete the following program by implementing the function global_sum, using the MPI_Send and MPI_Recv functions and the hypercube template shown above. Submit the source code as well as the printout from a test run on 4 processors and that on 8 processors.
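Before writing the MPI version, the butterfly pattern can be sanity-checked serially by simulating all P buffers inside one process; each "rank" combines its value with that of the partner whose ID differs in one bit. A sketch (hypothetical function name; a plain stand-in for the message exchanges, not the required MPI_Send/MPI_Recv solution):

```cpp
#include <cassert>
#include <vector>

// Serial simulation of the hypercube (butterfly) all-reduce with OP = +.
// vals[i] plays the role of process i's buffer; P must be a power of 2.
std::vector<double> butterflyAllReduce(std::vector<double> vals) {
    const std::size_t P = vals.size();
    for (std::size_t l = 1; l < P; l <<= 1) {      // level masks 2^0, 2^1, ...
        std::vector<double> next(P);
        for (std::size_t id = 0; id < P; ++id)
            next[id] = vals[id] + vals[id ^ l];    // exchange with bit-l partner
        vals = next;                               // all exchanges at this level
    }                                              // happen "simultaneously"
    return vals;    // every simulated process now holds the global sum
}
```

After log2 P levels every entry holds the same total, which is exactly the property the MPI program must exhibit: each rank ends with the globally-summed value, with no root process involved.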
#include "mpi.h"
#include <stdio.h>

int nprocs; /* Number of processes */
int myid;   /* My rank */

double global_sum(double partial) {
  /* Implement your own global summation here */
}

int main(int argc, char *argv[]) {
  double partial, sum, avg;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  partial = (double) myid;
  printf("Node %d has %le\n", myid, partial);
  sum = global_sum(partial);
  if (myid == 0) {
    avg = sum/nprocs;
    printf("Global average = %le\n", avg);
  }
  MPI_Finalize();
  return 0;
}

References

1. Slides 20-24 in https://cacs.usc.edu/education/cs596/MPI-VG.pdf.
2. https://en.wikipedia.org/wiki/Hypercube_(communication_pattern).
3. I. Foster, Designing and Building Parallel Programs (Addison-Wesley, 1995), Chap. 11—Hypercube algorithms, https://www.mcs.anl.gov/~itf/dbpp/text/node123.html.

The purpose of this assignment is to acquire hands-on experience on the scalability analysis of a parallel program — one of the key skills you learn in this class. We use a simple application that utilizes the function you have written for assignment 2, where the purpose was to: (i) convince ourselves that MPI_Send() and MPI_Recv() are sufficient to build any parallel program, using global reduction as a concrete example; and (ii) perform a unit software test of the global_sum() function used in this assignment.

Part I: Programming

Write a message passing interface (MPI) program, global_pi.c, to compute the value of π based on the lecture note on "Parallel Computation of Pi" and using the global_sum() function you have implemented and unit-tested in assignment 2. Please also utilize the serial program pi.c (which computes the value of π) in the assignment 3 package.

(Assignment)

1. Submit the source code of global_pi.c.

(Note)

• Insert the MPI_Wtime() function (which takes no argument and returns the wall-clock time in seconds as a double) to measure the running time of the program.
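The quadrature underlying pi.c and global_pi.c is commonly the midpoint rule applied to the integral of 4/(1+x^2) over [0,1], whose exact value is π; assuming that is the rule used here, a serial sketch of the kernel looks like this (in global_pi.c each rank would sum only its share of the bins and then combine the partial sums with global_sum()):

```cpp
#include <cassert>
#include <cmath>

// Midpoint-rule quadrature of f(x) = 4/(1+x^2) on [0,1], which integrates
// to pi. nbin is the total number of quadrature bins (NBIN in the handout).
double computePi(long nbin) {
    const double step = 1.0 / static_cast<double>(nbin);
    double sum = 0.0;
    for (long i = 0; i < nbin; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * step;  // bin midpoint
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * step;
}
```

The parallel version simply splits the i loop across ranks; since every bin contributes independently, the only communication needed is the final global sum, which is why this application makes a clean unit test for global_sum().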
Part II: Scalability

In this assignment, we measure the scalability of global_pi.c.

(Assignment)

2. (Fixed problem-size scaling) Run your global_pi.c with a fixed number of quadrature points, NBIN = 10^9, while varying the number of compute nodes over 1, 2, and 4, with 1 processor per node (i.e., the number of processors P = 1, 2, and 4). Plot the fixed problem-size parallel efficiency as a function of P. Submit the plot.

3. (Isogranular scaling) In this scalability test, we consider a constant number of quadrature points, NBIN/P = 10^9, per processor for P = 1, 2, and 4. To do this, we slightly modify global_pi.c by defining

#define NPERP 1000000000 /* Number of quadrature points per processor */
long long NBIN;

and determining the total number of quadrature points as

NBIN = (long long)NPERP*nprocs;

Run the resulting program, global_pi_iso.c, and plot the isogranular parallel efficiency as a function of P. Submit the plot.

(Note)

• Please perform the entire scaling tests in a single batch job to minimize measurement fluctuations, using the Slurm script, global_pi.sl, in the assignment 3 package.

The purpose of this assignment is to gain hands-on experience in practical use of message passing interface (MPI) in real-world applications, thereby consolidating your understanding of asynchronous message passing and communicators. In addition, you will get familiar with the message-passing scheme used in common spatial-decomposition applications, using the parallel molecular dynamics (MD) program, pmd.c, as an example.

(Part I—Asynchronous Messages)

Modify pmd.c such that, for each message exchange, it first calls MPI_Irecv, then MPI_Send, and finally MPI_Wait. The asynchronous messages make the deadlock-avoidance scheme unnecessary, and thus there is no need to use different orders of send and receive calls for even- and odd-parity processes. In addition to just MPI_Send, insert other computations that do not depend on the received messages between MPI_Irecv and MPI_Wait.
• Submit the modified source code, with your modifications clearly marked.
• Run both the original pmd.c and the modified program on 16 cores (requesting 4 nodes with 4 cores per node in your Slurm script), and compare the execution time for InitUcell = {3,3,3}, StepLimit = 1000, and StepAvg = 1001 in pmd.in (keep all the other parameter values as they are as downloaded from the course home page) and vproc = {2,2,4} (i.e., nproc = 16) in pmd.h. Which program runs faster? Repeat the comparison three times and report the average runtime of both programs. Submit the timing data.

(Part II—Communicators)

Following the lecture note on "In situ analysis of molecular dynamics simulation data using communicators," modify pmd.c such that as many processes as are used for the MD simulation are spawned to calculate the probability density function (PDF) for the atomic velocity.

• Submit the modified source code (name it pmd_split.c), with your modifications clearly marked.
• Run the modified program on 16 cores (requesting 2 nodes with 8 cores per node in your Slurm script), with which 8 cores perform the MD simulation and the other 8 cores calculate the PDF. In pmd.h, choose vproc[3] = {2,2,2} and nproc = 8. Also, specify InitUcell = {5,5,5}, StepLimit = 30, and StepAvg = 10 in pmd.in. Submit the plot of calculated PDFs at time steps 10, 20, and 30.

1. Write a hybrid MPI+OpenMP parallel molecular dynamics (MD) program (name it hmd.c), starting from the MPI parallel MD program, pmd.c, following the lecture note on "hybrid MPI+OpenMP parallel MD". Submit the source code of hmd.c, with your modifications from pmd.c clearly marked.

2. (Verification) Run your hmd.c on two 4-core nodes (a total of 8 cores) with 2 MPI processes, each with 4 OpenMP threads, using the following input parameters: InitUcell = {24,24,12}, Density = 0.8, InitTemp = 1.0, DeltaT = 0.005, StepLimit = 100, StepAvg = 10.
Use the following numbers of MPI processes and OpenMP threads in the header file: vproc = {1,1,2}, nproc = 2, vthrd = {2,2,1}, nthrd = 4. Note the global number of atoms is: 4 atoms/unit cell × (24×24×12 unit cells) × 2 MPI processes = 55,296. Submit the standard output from the run. Make sure that the total energy is the same as that calculated by pmd.c using the same input parameters (shown below), at least to ~5-6 digits.

al = 4.103942e+01 4.103942e+01 2.051971e+01
lc = 16 16 8
rc = 2.564964e+00 2.564964e+00 2.564964e+00
nglob = 55296
0.050000 0.877345 -5.137153 -3.821136
0.100000 0.462056 -4.513097 -3.820013
0.150000 0.510836 -4.587287 -3.821033
0.200000 0.527457 -4.611958 -3.820772
0.250000 0.518668 -4.598798 -3.820796
0.300000 0.529023 -4.614343 -3.820808
0.350000 0.532890 -4.620133 -3.820798
0.400000 0.536070 -4.624899 -3.820794
0.450000 0.539725 -4.630387 -3.820799
0.500000 0.538481 -4.628514 -3.820792
CPU & COMT = 3.836388e+00 2.632065e-02

3. (Scalability) Run your hmd.c on an 8-core node with one MPI process and the number of threads varying from 1, 2, 4, to 8, with input parameters: InitUcell = {24,24,24}, Density = 0.8, InitTemp = 1.0, DeltaT = 0.005, StepLimit = 100, StepAvg = 101. Plot the strong-scaling parallel efficiency as a function of the number of threads and submit the plot.

(Potential Final Project) Optimize the performance of the hybrid MPI+OpenMP MD code. For example, we could enclose the entire MD loop in a parallel clause in the main function to avoid the excessive fork-join overhead. We could also use a lock variable for synchronization.

Part I: Pair-Distribution Computation with CUDA

In this part, you will write a CUDA program (name it pdf.cu) to compute a histogram nhist of atomic pair distances in a molecular dynamics simulation:

for all histogram bins i
    nhist[i] = 0
for all atomic pairs (i,j)
    ++nhist[floor(r_ij / Δr)]

Here, r_ij is the distance between atomic pair (i, j) and Δr is the histogram bin size.
The maximum atomic-pair distance with the periodic boundary condition is the diagonal of half the simulation box,

R_max = sqrt( Σ_{α=x,y,z} (al[α]/2)² ),

and with Nhbin bins for the histogram, the bin size is Δr = R_max/Nhbin. Here, al[α] is the simulation box size in the α-th direction. With the minimal-image convention, however, the maximum distance for which the histogram is meaningful is half the simulation box length, min_{α∈{x,y,z}}(al[α]/2). After computing the pair-distance histogram nhist, the pair distribution function (PDF) at distance r_i = (i+1/2)Δr is defined as

g(r_i) = nhist(i) / (2π r_i² Δr ρ N),

where ρ is the number density of atoms and N is the total number of atoms.

(Assignment)

1. Modify the sequential PDF computation program pdf0.c to a CUDA program, following the lecture note on "Pair distribution computation on GPU". Submit your code.

2. Run your program by reading the atomic configuration pos.d (both pdf0.c and pos.d are available at the class homepage). Plot the resulting pair distribution function, using Nhbin = 2000. Submit your plot.

Part II: MPI+OpenMP+CUDA Computation of π

In this part, you will write a triple-decker MPI+OpenMP+CUDA program (name it pi3.cu) to compute the value of π, by modifying the double-decker MPI+CUDA program, hypi_setdevice.cu, described in the lecture note on "Hybrid MPI+OpenMP+CUDA Programming". Your implementation should utilize two CPU cores and two GPU devices on each compute node. This is achieved by launching one MPI rank per node, where each rank spawns two OpenMP threads that run on different CPU cores and use different GPU devices, as shown in the left figure on the next page. You can employ spatial decomposition in the MPI+OpenMP layer as follows (for the CUDA layer, leave the interleaved assignment of quadrature points to CUDA threads in hypi_setdevice.cu as it is); see the right figure on the next page.
#define NUM_DEVICE 2  // # of GPU devices = # of OpenMP threads
...
// In main()
MPI_Comm_rank(MPI_COMM_WORLD,&myid);  // My MPI rank
MPI_Comm_size(MPI_COMM_WORLD,&nproc); // # of MPI processes
omp_set_num_threads(NUM_DEVICE);      // One OpenMP thread per GPU device
nbin = NBIN/(nproc*NUM_DEVICE);       // # of bins per OpenMP thread
step = 1.0/(float)(nbin*nproc*NUM_DEVICE);
...
#pragma omp parallel private(list the variables that need private copies)
{
  ...
  mpid = omp_get_thread_num();
  offset = (NUM_DEVICE*myid+mpid)*step*nbin; // Quadrature-point offset
  cudaSetDevice(mpid%2);
  ...
}

Make sure to list all variables that need private copies in the private clause of the omp parallel directive. The above OpenMP multithreading will introduce a race condition for the variable pi. This can be circumvented by data privatization, i.e., by defining float pid[NUM_DEVICE] and using the array elements as dedicated accumulators for the OpenMP threads (or GPU devices). To report which of the two GPUs have been used for the run, insert the following lines within the OpenMP parallel block:

cudaGetDevice(&dev_used);
printf("myid = %d; mpid = %d: device used = %d; partial pi = %f\n",
       myid, mpid, dev_used, pid[mpid]);

where int dev_used is the ID of the GPU device (0 or 1) that was used, myid is the MPI rank, mpid is the OpenMP thread ID, and pid[mpid] is the partial sum per OpenMP thread.

(Assignment)

1. Submit your MPI+OpenMP+CUDA code.

2. Run your code on 2 nodes, requesting 2 cores and 2 GPUs per node. Submit your output, which should look like the following:

myid = 0; mpid = 0: device used = 0; partial pi = 0.979926
myid = 0; mpid = 1: device used = 1; partial pi = 0.874671
myid = 1; mpid = 0: device used = 0; partial pi = 0.719409
myid = 1; mpid = 1: device used = 1; partial pi = 0.567582
PI = 3.141588

The purpose of this assignment is to gain hands-on experience in new open standards for programming heterogeneous computers accelerated by graphics processing units (GPUs) and other accelerators.
Specifically, you will practice the directive-based OpenMP target offload and the unified data-parallel programming language, DPC++.

Prerequisite

We will practice both OpenMP target and DPC++ on the Intel developer's cloud (DevCloud). To do so, create your DevCloud account by registering at https://devcloud.intel.com/oneapi.

Part I: OpenMP Target Offload Computation of π

In this part, you will write a GPU offload program (name it omp_teams_pi.c) to compute the value of π using the omp target, teams, and distribute constructs.

(Assignment)

1. Modify the simple OpenMP target program omp_target_pi.c to its teams-distribute counterpart omp_teams_pi.c, following the lecture note on "OpenMP Target Offload for Heterogeneous Architectures". Submit your code.

2. Compile and run your program on a GPU-accelerated computing node on DevCloud. Submit your output, which should look like the following:

u49162@login-2:~$ cc -o omp_teams_pi omp_teams_pi.c -fopenmp
u49162@login-2:~$ qsub -I -l nodes=1:gpu:ppn=2
qsub: waiting for job 714173.v-qsvr-1.aidevcloud to start
qsub: job 714173.v-qsvr-1.aidevcloud ready
u49162@s001-n181:~$ ./omp_teams_pi
PI = 3.141593

Part II: DPC++ Computation of π

In this part, you will experience the compilation and running processes for a DPC++ program (pi.cpp) to compute the value of π. While programming is not required for this part, since C++ is not a prerequisite to this class, please use this opportunity to learn the essence of C++ and DPC++ programming by going through the code and understanding why it works, following the lecture note on "Data Parallel C++ (DPC++) for Heterogeneous Architectures".

(Assignment)

1. Compile and run pi.cpp on a GPU-accelerated computing node on DevCloud.
Submit your output, which should look like the following:

u49162@login-2:~$ dpcpp -o pi pi.cpp
u49162@login-2:~$ qsub -I -l nodes=1:gpu:ppn=2
qsub: waiting for job 714154.v-qsvr-1.aidevcloud to start
qsub: job 714154.v-qsvr-1.aidevcloud ready
u49162@s001-n160:~$ ./pi
Running on: Intel(R) Gen9 HD Graphics NEO
Pi = 3.14159


[SOLVED] Cm146 problem set 1 to 5 solutions

1 Maximum Likelihood Estimation [15 pts]

Suppose we observe the values of n independent random variables X1, . . . , Xn drawn from the same Bernoulli distribution with parameter θ.[1] In other words, for each Xi, we know that P(Xi = 1) = θ and P(Xi = 0) = 1 − θ. Our goal is to estimate the value of θ from these observed values of X1 through Xn.

For any hypothetical value θ̂, we can compute the probability of observing the outcome X1, . . . , Xn if the true parameter value θ were equal to θ̂. This probability of the observed data is often called the likelihood, and the function L(θ) that maps each θ to the corresponding likelihood is called the likelihood function. A natural way to estimate the unknown parameter θ is to choose the θ that maximizes the likelihood function. Formally,

θ̂_MLE = arg max_θ L(θ).

(a) Write a formula for the likelihood function, L(θ) = P(X1, . . . , Xn; θ). Your function should depend on the random variables X1, . . . , Xn and the hypothetical parameter θ. Does the likelihood function depend on the order in which the random variables are observed?

(b) Since the log function is increasing, the θ that maximizes the log likelihood ℓ(θ) = log(L(θ)) is the same as the θ that maximizes the likelihood. Find ℓ(θ) and its first and second derivatives, and use these to find a closed-form formula for the MLE.

(c) Suppose that n = 10 and the data set contains six 1s and four 0s. Write a short program likelihood.py that plots the likelihood function of this data for each value of θ̂ in {0, 0.01, 0.02, . . . , 1.0} (use np.linspace(...) to generate this spacing). For the plot, the x-axis should be θ and the y-axis L(θ). Scale your y-axis so that you can see some variation in its value. Include the plot in your writeup (there is no need to submit your code). Estimate θ̂_MLE by marking on the x-axis the value of θ̂ that maximizes the likelihood. Does the answer agree with the closed-form answer?
(d) Create three more likelihood plots: one where n = 5 and the data set contains three 1s and two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where n = 10 and there are five 1s and five 0s. Include these plots in your writeup, and describe how the likelihood functions and maximum likelihood estimates compare for the different data sets.

¹This is a common assumption for sampling data. So we will denote this assumption as iid, short for Independent and Identically Distributed, meaning that each random variable has the same distribution and is drawn independent of all the other random variables.

2 Splitting Heuristic for Decision Trees [14 pts]

Recall that the ID3 algorithm iteratively grows a decision tree from the root downwards. On each iteration, the algorithm replaces one leaf node with an internal node that splits the data based on one decision attribute (or feature). In particular, the ID3 algorithm chooses the split that reduces the entropy the most, but there are other choices. For example, since our goal in the end is to have the lowest error, why not instead choose the split that reduces error the most? In this problem, we will explore one reason why reducing entropy is a better criterion.

Consider the following simple setting. Let us suppose each example is described by n boolean features: X = ⟨X1, . . . , Xn⟩, where Xi ∈ {0, 1}, and where n ≥ 4. Furthermore, the target function to be learned is f : X → Y, where Y = X1 ∨ X2 ∨ X3. That is, Y = 1 if X1 = 1 or X2 = 1 or X3 = 1, and Y = 0 otherwise. Suppose that your training data contains all of the 2^n possible examples, each labeled by f. For example, when n = 4, the data set would be

X1 X2 X3 X4 | Y
0  0  0  0  | 0
1  0  0  0  | 1
0  1  0  0  | 1
1  1  0  0  | 1
0  0  1  0  | 1
1  0  1  0  | 1
0  1  1  0  | 1
1  1  1  0  | 1
0  0  0  1  | 0
1  0  0  1  | 1
0  1  0  1  | 1
1  1  0  1  | 1
0  0  1  1  | 1
1  0  1  1  | 1
0  1  1  1  | 1
1  1  1  1  | 1

(a) How many mistakes does the best 1-leaf decision tree make over the 2^n training examples?
(The 1-leaf decision tree does not split the data even once. Make sure you answer for the general case when n ≥ 4.)

(b) Is there a split that reduces the number of mistakes by at least one? (That is, is there a decision tree with 1 internal node with fewer mistakes than your answer to part (a)?) Why or why not?

(c) What is the entropy of the output label Y for the 1-leaf decision tree (no splits at all)?

(d) Is there a split that reduces the entropy of the output Y by a non-zero amount? If so, what is it, and what is the resulting conditional entropy of Y given this split?

3 Entropy and Information [2 pts]

The entropy of a Bernoulli (Boolean 0/1) random variable X with p(X = 1) = q is given by B(q) = −q log q − (1 − q) log(1 − q). Suppose that a set S of examples contains p positive examples and n negative examples. The entropy of S is defined as H(S) = B(p/(p + n)).

(a) Based on an attribute Xj, we split our examples into k disjoint subsets Sk, with pk positive and nk negative examples in each. If the ratio pk/(pk + nk) is the same for all k, show that the information gain of this attribute is 0.

4 Programming exercise: Applying decision trees [24 pts]

Submission instructions
• Only provide answers and plots. Do not submit code.

Introduction²

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
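Returning briefly to the splitting-heuristic problem (Section 2 above): parts (a) and (c) can be sanity-checked by brute force for the n = 4 case. A sketch, assuming nothing beyond the problem statement (variable names are illustrative):

```python
from itertools import product
from math import log2

# Enumerate all 2^n examples for n = 4, labeled with Y = X1 v X2 v X3.
n = 4
examples = [(x, int(x[0] or x[1] or x[2])) for x in product([0, 1], repeat=n)]

# The best 1-leaf tree predicts the majority label; count its mistakes.
ones = sum(y for _, y in examples)
zeros = len(examples) - ones
mistakes = min(ones, zeros)  # the 2^(n-3) examples with Y = 0

# Entropy of Y at the root: B(q) = -q log q - (1 - q) log(1 - q), q = P(Y = 1).
q = ones / len(examples)
entropy = -q * log2(q) - (1 - q) * log2(1 - q)
print(mistakes, round(entropy, 4))  # → 2 0.5436
```

This matches the general-case counting the problem asks for (2^(n−3) mistakes at the root for n = 4 gives 2), and gives a concrete value of B(q) to compare against part (d).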
In this problem, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Starter Files

code and data
• code: titanic.py
• data: titanic_train.csv

documentation
• Decision Tree Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
• Cross-Validation: https://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
• Metrics: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

Download the code and data sets from the course website. For more information on the data set, see the Kaggle description: https://www.kaggle.com/c/titanic/data. (The provided data sets are modified versions of the data available from Kaggle.³)

²This assignment is adapted from the Kaggle Titanic competition, available at https://www.kaggle.com/c/titanic. Some parts of the problem are copied verbatim from Kaggle.
³Passengers with missing values for any feature have been removed. Also, the categorical feature Sex has been mapped to {'female': 0, 'male': 1} and Embarked to {'C': 0, 'Q': 1, 'S': 2}. If you are interested more in this process of data munging, Kaggle has an excellent tutorial available at https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii.

Note that any portions of the code that you must modify have been indicated with TODO. Do not change any code outside of these blocks.

4.1 Visualization [4 pts]

One of the first things to do before trying any formal machine learning technique is to dive into the data. This can include looking for funny values in the data, looking for outliers, looking at the range of feature values, what features seem important, etc.

(a) Run the code (titanic.py) to make histograms for each feature, separating the examples by class (e.g. survival).
This should produce seven plots, one for each feature, and each plot should have two overlapping histograms, with the color of the histogram indicating the class. For each feature, what trends do you observe in the data?

4.2 Evaluation [20 pts]

Now, let us use scikit-learn to train a DecisionTreeClassifier on the data. Using the predictive capabilities of the scikit-learn package is very simple. In fact, it can be carried out in three simple steps: initializing the model, fitting it to the training data, and predicting new values.⁴

(b) Before trying out any classifier, it is often useful to establish a baseline. We have implemented one simple baseline classifier, MajorityVoteClassifier, that always predicts the majority class from the training set. Read through the MajorityVoteClassifier and its usage and make sure you understand how it works.

Your goal is to implement and evaluate another baseline classifier, RandomClassifier, that predicts a target class according to the distribution of classes in the training data set. For example, if 60% of the examples in the training set have Survived = 0 and 40% have Survived = 1, then, when applied to a test set, RandomClassifier should randomly predict 60% of the examples as Survived = 0 and 40% as Survived = 1. Implement the missing portions of RandomClassifier according to the provided specifications. Then train your RandomClassifier on the entire training data set, and evaluate its training error. If you implemented everything correctly, you should have an error of 0.485.

(c) Now that we have a baseline, train and evaluate a DecisionTreeClassifier (using the class from scikit-learn and referring to the documentation as needed). Make sure you initialize your classifier with the appropriate parameters; in particular, use the 'entropy' criterion discussed in class. What is the training error of this classifier?
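The idea behind RandomClassifier in part (b) can be sketched as follows. This is an illustrative stand-in using only the standard library, not the assignment's actual skeleton class or interface:

```python
import random
from collections import Counter

class RandomClassifierSketch:
    """Predicts classes at random according to the training-class distribution.

    Illustrative stand-in; the real RandomClassifier must follow the
    specifications in the titanic.py skeleton.
    """
    def fit(self, y):
        counts = Counter(y)
        self.classes_ = sorted(counts)
        self.probs_ = [counts[c] / len(y) for c in self.classes_]
        return self

    def predict(self, n, seed=0):
        # Sample n labels with replacement, weighted by the class frequencies.
        rng = random.Random(seed)
        return rng.choices(self.classes_, weights=self.probs_, k=n)

# 60% of training labels are 0, 40% are 1, as in the example above.
clf = RandomClassifierSketch().fit([0] * 60 + [1] * 40)
preds = clf.predict(1000, seed=42)
```

On a large test set the predicted label frequencies are close to the 60/40 training proportions, which is exactly why its expected training error on the Titanic data lands near the quoted 0.485 rather than at the majority-vote error.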
(d) So far, we have looked only at training error, but as we learned in class, training error is a poor metric for evaluating classifiers. Let us use cross-validation instead.

⁴Note that almost all of the model techniques in scikit-learn share a few common named functions, once they are initialized. You can always find out more about them in the documentation for each model. These are some-model-name.fit(…), some-model-name.predict(…), and some-model-name.score(…).

Implement the missing portions of error(…) according to the provided specifications. You may find it helpful to use train_test_split(…) from scikit-learn. To ensure that we always get the same splits across different runs (and thus can compare the classifier results), set the random_state parameter to be the trial number.

Next, use your error(…) function to evaluate the training error and (cross-validation) test error of each of your three models. To do this, generate a random 80/20 split of the training data, train each model on the 80% fraction, evaluate the error on either the 80% or the 20% fraction, and repeat this 100 times to get an average result. What are the average training and test error of each of your classifiers on the Titanic data set?

(e) One problem with decision trees is that they can overfit to training data, yielding complex classifiers that do not generalize well to new data. Let us see whether this is the case for the Titanic data. One way to prevent decision trees from overfitting is to limit their depth. Repeat your cross-validation experiments but for increasing depth limits, specifically, 1, 2, . . . , 20. Then plot the average training error and test error against the depth limit. (Also plot the average test error for your baseline classifiers. As the baseline classifiers are independent of the depth limit, their plots should be flat lines.) Include this plot in your writeup, making sure to label all axes and include a legend for your classifiers.
What is the best depth limit to use for this data? Do you see overfitting? Justify your answers using the plot.

(f) Another useful tool for evaluating classifiers is learning curves, which show how classifier performance (e.g. error) relates to experience (e.g. amount of training data). Run another experiment using a decision tree with the best depth limit you found above. This time, vary the amount of training data by starting with splits of 0.05 (5% of the data used for training) and working up to splits of size 0.95 (95% of the data used for training) in increments of 0.05. Then plot the decision tree training and test error against the amount of training data. (Also plot the average test error for your baseline classifiers.) Include this plot in your writeup, and provide a 1-2 sentence description of your observations.

1 Perceptron [2 pts]

Design (specify θ for) a two-input perceptron (with an additional bias or offset term) that computes the following boolean functions. Assume T = 1 and F = −1. If a valid perceptron exists, show that it is not unique by designing another valid perceptron (with a different hyperplane, not simply through normalization). If no perceptron exists, state why.

(a) OR
(b) XOR

2 Logistic Regression [10 pts]

Consider the objective function that we minimize in logistic regression:

J(θ) = − Σ_{n=1}^{N} [yn log hθ(xn) + (1 − yn) log(1 − hθ(xn))]

(a) Find the partial derivatives ∂J/∂θj.

(b) Find the partial second derivatives ∂²J/∂θj∂θk and show that the Hessian (the matrix H of second derivatives with elements Hjk = ∂²J/∂θj∂θk) can be written as H = Σ_{n=1}^{N} hθ(xn)(1 − hθ(xn)) xn xnᵀ.

(c) Show that J is a convex function and therefore has no local minima other than the global one. Hint: A function J is convex if its Hessian is positive semi-definite (PSD), written H ⪰ 0. A matrix is PSD if and only if zᵀHz = Σ_{j,k} zj zk Hjk ≥ 0 for all real vectors z.
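For part (a) of the perceptron problem, one valid (and, as the problem notes, not unique) choice is θ = (θ0, θ1, θ2) = (1, 1, 1). A quick check under the stated convention T = 1, F = −1; the helper names are illustrative:

```python
# Perceptron with bias: h(x) = sign(theta0 + theta1*x1 + theta2*x2).
def perceptron(theta, x1, x2):
    s = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if s > 0 else -1

def boolean_or(x1, x2):
    return 1 if (x1 == 1 or x2 == 1) else -1

# theta = (1, 1, 1) computes OR on all four inputs (T = 1, F = -1):
# only (-1, -1) gives 1 - 1 - 1 = -1 < 0, i.e. output F.
theta = (1, 1, 1)
inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
assert all(perceptron(theta, a, b) == boolean_or(a, b) for a, b in inputs)
```

A second valid choice such as (1.5, 1, 1) defines a genuinely different hyperplane, which is one way to show non-uniqueness; no θ works for XOR, since its positive and negative examples are not linearly separable.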
3 Locally Weighted Linear Regression [14 pts]

Consider a linear regression problem in which we want to "weight" different training instances differently because some of the instances are more important than others. Specifically, suppose we want to minimize

J(θ0, θ1) = Σ_{n=1}^{N} wn (θ0 + θ1 xn,1 − yn)².

Here wn > 0. In class, we worked out what happens for the case where all the weights (the wn's) are the same. In this problem, we will generalize some of those ideas to the weighted setting.

(a) Calculate the gradient by computing the partial derivatives of J with respect to each of the parameters (θ0, θ1).

(b) Set each partial derivative to 0 and solve for θ0 and θ1 to obtain values of (θ0, θ1) that minimize J.

(c) Show that J(θ) can also be written J(θ) = (Xθ − y)ᵀ W (Xθ − y) for an appropriate diagonal matrix W, and where the n-th row of X is (1, xn,1), y = (y1, y2, . . . , yN)ᵀ, and θ = (θ0, θ1)ᵀ. State clearly what W is.

4 Implementation: Polynomial Regression [20 pts]

In this exercise, you will work through linear and polynomial regression. Our data consists of inputs xn ∈ R and outputs yn ∈ R, n ∈ {1, . . . , N}, which are related through a target function y = f(x). Your goal is to learn a linear predictor hθ(x) that best approximates f(x). But this time, rather than using scikit-learn, we will further open the "black-box", and you will implement the regression model!

code and data
• code: regression.py
• data: regression_train.csv, regression_test.csv

This is likely the first time that many of you are working with numpy and matrix operations within a programming environment. For the uninitiated, you may find it useful to work through a numpy tutorial first.¹ Here are some things to keep in mind as you complete this problem:

• If you are seeing many errors at runtime, inspect your matrix operations to make sure that you are adding and multiplying matrices of compatible dimensions.
Printing the dimensions of variables with the X.shape command will help you debug.

• When working with numpy arrays, remember that numpy interprets the * operator as elementwise multiplication. This is a common source of size incompatibility errors. If you want matrix multiplication, you need to use the dot function in Python. For example, A*B does element-wise multiplication while dot(A,B) does a matrix multiply.

• Be careful when handling numpy vectors (rank-1 arrays): the vector shapes 1 × N, N × 1, and N are all different things. For these dimensions, we follow the conventions of scikit-learn's LinearRegression class². Most importantly, unless otherwise indicated (in the code documentation), both column and row vectors are rank-1 arrays of shape N, not rank-2 arrays of shape N × 1 or shape 1 × N.

Visualization [1 pts]

As we learned last week, it is often useful to understand the data through visualizations. For this data set, you can use a scatter plot to visualize the data since it has only two properties to plot (x and y).

(a) Visualize the training and test data using the plot_data(…) function. What do you observe? For example, can you make an educated guess on the effectiveness of linear regression in predicting the data?

¹Try out SciPy's tutorial (https://wiki.scipy.org/Tentative_NumPy_Tutorial), or use your favorite search engine to find an alternative. Those familiar with Matlab may find the "Numpy for Matlab Users" documentation (https://wiki.scipy.org/NumPy_for_Matlab_Users) more helpful.
²https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Linear Regression [12 pts]

Recall that linear regression attempts to minimize the objective function

J(θ) = Σ_{n=1}^{N} (hθ(xn) − yn)².

In this problem, we will use the matrix-vector form, where y = (y1, y2, . . . , yN)ᵀ, the n-th row of X is xnᵀ, θ = (θ0, θ1, θ2, . . . , θD)ᵀ, and each instance xn = (1, xn,1, . . . , xn,D)ᵀ.
In this instance, the number of input features D = 1. Rather than working with this fully generalized, multivariate case, let us start by considering a simple linear regression model:

hθ(x) = θᵀx = θ0 + θ1 x1

regression.py contains the skeleton code for the class PolynomialRegression. Objects of this class can be instantiated as model = PolynomialRegression(m), where m is the degree of the polynomial feature vector: the feature vector for instance n is (1, xn,1, xn,1², . . . , xn,1^m)ᵀ. Setting m = 1 instantiates an object where the feature vector for instance n is (1, xn,1)ᵀ.

(b) Note that to take into account the intercept term (θ0), we can add an additional "feature" to each instance and set it to one, e.g. xi,0 = 1. This is equivalent to adding an additional first column to X and setting it to all ones. Modify PolynomialRegression.generate_polynomial_features(…) to create the matrix X for a simple linear model.

(c) Before tackling the harder problem of training the regression model, complete PolynomialRegression.predict(…) to predict y from X and θ.

(d) One way to solve linear regression is through gradient descent (GD). Recall that the parameters of our model are the θj values. These are the values we will adjust to minimize J(θ). In gradient descent, each iteration performs the update

θj ← θj − 2α Σ_{n=1}^{N} (hθ(xn) − yn) xn,j

(simultaneously update θj for all j). With each step of gradient descent, we expect our updated parameters θj to come closer to the parameters that will achieve the lowest value of J(θ).

• As we perform gradient descent, it is helpful to monitor the convergence by computing the cost, i.e., the value of the objective function J. Complete PolynomialRegression.cost(…) to calculate J(θ). If you have implemented everything correctly, then the following code snippet should return 40.234.
train_data = load_data('regression_train.csv')
model = PolynomialRegression()
model.coef_ = np.zeros(2)
model.cost(train_data.X, train_data.y)

• Next, implement the gradient descent step in PolynomialRegression.fit_GD(…). The loop structure has been written for you, and you only need to supply the updates to θ and the new predictions ŷ = hθ(x) within each iteration. We will use the following specifications for the gradient descent algorithm:
– We run the algorithm for 10,000 iterations.
– We terminate the algorithm earlier if the value of the objective function is unchanged across consecutive iterations.
– We will use a fixed step size.

• So far, you have used a default learning rate (or step size) of η = 0.01. Try different values η = 10⁻⁴, 10⁻³, 10⁻², 0.0407, and make a table of the coefficients, the number of iterations until convergence (this number will be 10,000 if the algorithm did not converge in a smaller number of iterations), and the final value of the objective function. How do the coefficients compare? How quickly does each algorithm converge?

(e) In class, we learned that the closed-form solution to linear regression is

θ = (XᵀX)⁻¹ Xᵀ y.

Using this formula, you will get an exact solution in one calculation: there is no "loop until convergence" like in gradient descent.

• Implement the closed-form solution PolynomialRegression.fit(…).

• What is the closed-form solution? How do the coefficients and the cost compare to those obtained by GD? How quickly does the algorithm run compared to GD?

(f) Finally, set a learning rate η for GD that is a function of k (the number of iterations) (use ηk = 1/(1 + k)) and converges to the same solution yielded by the closed-form optimization (minus possible rounding errors). Update PolynomialRegression.fit_GD(…) with your proposed learning rate. How long does it take the algorithm to converge with your proposed learning rate?
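Parts (d) and (e) can be illustrated side by side on a toy data set. The data, step size, and variable names below are illustrative (not the course data), and the closed-form normal equations θ = (XᵀX)⁻¹Xᵀy are written out by hand for the two-parameter model rather than with numpy:

```python
# Toy data generated from y = 1 + 2x (illustrative, not regression_train.csv).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1 + 2 * x for x in xs]

def predict(theta, x):
    return theta[0] + theta[1] * x  # h_theta(x), with x_{n,0} = 1

# Gradient descent: theta_j <- theta_j - 2*alpha * sum_n (h(x_n) - y_n) x_{n,j}
theta, alpha = [0.0, 0.0], 0.05
for _ in range(10000):
    errs = [predict(theta, x) - y for x, y in zip(xs, ys)]
    theta = [theta[0] - 2 * alpha * sum(errs),
             theta[1] - 2 * alpha * sum(e * x for e, x in zip(errs, xs))]

# Closed form theta = (X^T X)^{-1} X^T y, expanded via the 2x2 inverse.
n = len(xs)
sx, sxx = sum(xs), sum(x * x for x in xs)
sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
det = n * sxx - sx * sx
closed = [(sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det]
# Both recover theta ≈ (1, 2): GD iterates toward the same minimizer
# that the closed form produces in a single calculation.
```

With this step size GD converges to the closed-form coefficients; with a step size that is too large the iterates diverge, which is the behavior the η-comparison table in part (d) is meant to expose.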
Polynomial Regression [7 pts]

Now let us consider the more complicated case of polynomial regression, where our hypothesis is

hθ(x) = θᵀφ(x) = θ0 + θ1 x + θ2 x² + . . . + θm x^m.

(g) Recall that polynomial regression can be considered as an extension of linear regression in which we replace our input matrix X with Φ, whose n-th row is φ(xn)ᵀ, where φ(x) is a function such that φj(x) = x^j for j = 0, . . . , m. Update PolynomialRegression.generate_polynomial_features(…) to create an m + 1 dimensional feature vector for each instance.

(h) Given N training instances, it is always possible to obtain a "perfect fit" (a fit in which all the data points are exactly predicted) by setting the degree of the regression to N − 1. Of course, we would expect such a fit to generalize poorly. In the remainder of this problem, you will investigate the problem of overfitting as a function of the degree of the polynomial, m. To measure overfitting, we will use the Root-Mean-Square (RMS) error, defined as

E_RMS = √(J(θ)/N),

where N is the number of instances.³ Why do you think we might prefer RMSE as a metric over J(θ)? Implement PolynomialRegression.rms_error(…).

(i) For m = 0, . . . , 10, use the closed-form solver to determine the best-fit polynomial regression model on the training data, and with this model, calculate the RMSE on both the training data and the test data. Generate a plot depicting how RMSE varies with model complexity (polynomial degree) – you should generate a single plot with both training and test error, and include this plot in your writeup. Which degree polynomial would you say best fits the data? Was there evidence of under/overfitting the data? Use your plot to justify your answer.

³Note that the RMSE as defined is a biased estimator.
To obtain an unbiased estimator, we would have to divide by n − k, where k is the number of parameters fitted (including the constant), so here, k = m + 1.

1 Kernels [8 pts]

(a) For any two documents x and z, define k(x, z) to equal the number of unique words that occur in both x and z (i.e., the size of the intersection of the sets of words in the two documents). Is this function a kernel? Give justification for your answer.

(b) One way to construct kernels is to build them from simpler ones. We have seen various "construction rules", including the following. Assuming k1(x, z) and k2(x, z) are kernels, then so are

• (scaling) f(x) k1(x, z) f(z) for any function f(x) ∈ R
• (sum) k(x, z) = k1(x, z) + k2(x, z)
• (product) k(x, z) = k1(x, z) k2(x, z)

Using the above rules and the fact that k(x, z) = x · z is (clearly) a kernel, show that the following is also a kernel:

(1 + (x/||x||) · (z/||z||))³

(c) Given vectors x and z in R², define the kernel kβ(x, z) = (1 + βx · z)³ for any value β > 0. Find the corresponding feature map φβ(·).¹ What are the similarities/differences from the kernel k(x, z) = (1 + x · z)³, and what role does the parameter β play?

2 SVM [8 pts]

Suppose we are looking for a maximum-margin linear classifier through the origin, i.e. b = 0 (also hard margin, i.e., no slack variables). In other words, we minimize (1/2)||θ||² subject to yn θᵀxn ≥ 1, n = 1, . . . , N.

Parts of this assignment are adapted from course material by Tommi Jaakkola (MIT), Andrew Ng (Stanford), and Jenna Wiens (UMich).
¹You may use any external program to expand the cubic.

(a) Given a single training vector x = (a, e)ᵀ with label y = −1, what is the θ* that satisfies the above constrained minimization?

(b) Suppose we have two training examples, x1 = (1, 1)ᵀ and x2 = (1, 0)ᵀ with labels y1 = 1 and y2 = −1. What is θ* in this case, and what is the margin γ?

(c) Suppose we now allow the offset parameter b to be non-zero.
How would the classifier and the margin change in the previous question? What are (θ*, b*) and γ? Compare your solutions with and without offset.

3 Twitter analysis using SVMs [26 pts]

In this project, you will be working with Twitter data. Specifically, we have supplied you with a number of tweets that are reviews/reactions to movies², e.g., "@nickjfrost just saw The Boat That Rocked/Pirate Radio and I thought it was brilliant! You and the rest of the cast were fantastic! < 3". You will learn to automatically classify such tweets as either positive or negative reviews. To do this, you will employ Support Vector Machines (SVMs), a popular choice for a large number of classification problems.

Download the code and data sets from the course website. It contains the following data files:

• tweets.txt contains 630 tweets about movies. Each line in the file contains exactly one tweet, so there are 630 lines in total.
• labels.txt contains the corresponding labels. If a tweet praises or recommends a movie, it is classified as a positive review and labeled +1; otherwise it is classified as a negative review and labeled −1. These labels are ordered, i.e. the label for the i-th tweet in tweets.txt corresponds to the i-th number in labels.txt.
• held_out_tweets.txt contains 70 tweets for which we have withheld the labels.

Skim through the tweets to get a sense of the data. The python file twitter.py contains skeleton code for the project. Skim through the code to understand its structure.

3.1 Feature Extraction [2 pts]

We will use a bag-of-words model to convert each tweet into a feature vector. A bag-of-words model treats a text file as a collection of words, disregarding word order. The first step in building a bag-of-words model involves building a "dictionary". A dictionary contains all of the unique words in the text file. For this project, we will be including punctuations in the dictionary too. For example, a text file containing "John likes movies.
Mary likes movies2!!" will have a dictionary {'John': 0, 'Mary': 1, 'likes': 2, 'movies': 3, 'movies2': 4, '.': 5, '!': 6}. Note that the (key, value) pairs are (word, index), where the index keeps track of the number of unique words (size of the dictionary).

²Please note that these data were selected at random and thus the content of these tweets do not reflect the views of the course staff. 🙂

Given a dictionary containing d unique words, we can transform the n variable-length tweets into n feature vectors of length d by setting the i-th element of the j-th feature vector to 1 if the i-th dictionary word is in the j-th tweet, and 0 otherwise.

(a) We have implemented extract_words(…) that processes an input string to return a list of unique words. This method takes a simplistic approach to the problem, treating any string of characters (that does not include a space) as a "word" and also extracting and including all unique punctuations. Implement extract_dictionary(…) that uses extract_words(…) to read all unique words contained in a file into a dictionary (as in the example above). Process the tweets in the order they appear in the file to create this dictionary of d unique words/punctuations.

(b) Next, implement extract_feature_vectors(…) that produces the bag-of-words representation of a file based on the extracted dictionary. That is, for each tweet i, construct a feature vector of length d, where the j-th entry in the feature vector is 1 if the j-th word in the dictionary is present in tweet i, or 0 otherwise. For n tweets, save the feature vectors in a feature matrix, where the rows correspond to tweets (examples) and the columns correspond to words (features). Maintain the order of the tweets as they appear in the file.

(c) In main(…), we have provided code to read the tweets and labels into a (630, d) feature matrix and (630,) label array. Split the feature matrix and corresponding labels into your training and test sets.
The first 560 tweets will be used for training and the last 70 tweets will be used for testing. **All subsequent operations will be performed on these data.**

3.2 Hyperparameter Selection for a Linear-Kernel SVM [10 pts]

Next, we will learn a classifier to separate the training data into positive and negative tweets. For the classifier, we will use SVMs with two different kernels: linear and radial basis function (RBF). We will use the sklearn.svm.SVC class and explicitly set only three of the initialization parameters: kernel, gamma, and C. As usual, we will use SVC.fit(X,y) to train our SVM, but in lieu of using SVC.predict(X) to make predictions, we will use SVC.decision_function(X), which returns the (signed) distance of the samples to the separating hyperplane.

SVMs have hyperparameters that must be set by the user. For both linear and RBF-kernel SVMs, we will select the hyperparameters using 5-fold cross-validation (CV). Using 5-fold CV, we will select the hyperparameters that lead to the 'best' mean performance across all 5 folds.

(a) The result of a hyperparameter selection often depends upon the choice of performance measure. Here, we will consider the following performance measures: accuracy, F1-score, AUROC, precision, sensitivity, and specificity. Implement performance(…). All measures, except sensitivity and specificity, are implemented in the sklearn.metrics library. You can use sklearn.metrics.confusion_matrix(…) to calculate the other two.

(b) Next, implement cv_performance(…) to return the mean k-fold CV performance for the performance metric passed into the function. Here, you will make use of SVC.fit(X,y) and SVC.decision_function(X), as well as your performance(…) function.

You may have noticed that the proportions of the two classes (positive and negative) are not equal in the training data. When dividing the data into folds for CV, you should try to keep the class proportions roughly the same across folds.
In your write-up, briefly describe why it might be beneficial to maintain class proportions across folds. Then, in main(…), use sklearn.cross_validation.StratifiedKFold(…) to split the data for 5-fold CV, making sure to stratify using only the training labels.

(c) Now, implement select_param_linear(…) to choose a setting for C for a linear SVM based on the training data and the specified metric. Your function should call cv_performance(…), passing in instances of SVC(kernel='linear', C=c) with different values for C, e.g., C = 10⁻³, 10⁻², . . . , 10².

(d) Finally, using the training data from Section 3.1 and the functions implemented here, find the best setting for C for each performance measure mentioned above. Report your findings in tabular format (up to the fourth decimal place):

C      | accuracy | F1-score | AUROC | precision | sensitivity | specificity
10⁻³   |          |          |       |           |             |
10⁻²   |          |          |       |           |             |
10⁻¹   |          |          |       |           |             |
10⁰    |          |          |       |           |             |
10¹    |          |          |       |           |             |
10²    |          |          |       |           |             |
best C |          |          |       |           |             |

Your select_param_linear(…) function returns the 'best' C given a range of values. How does the 5-fold CV performance vary with C and the performance metric?

3.3 Hyperparameter Selection for an RBF-kernel SVM [8 pts]

Similar to the hyperparameter selection for a linear-kernel SVM, you will perform hyperparameter selection for an RBF-kernel SVM.

(a) Describe the role of the additional hyperparameter γ for an RBF-kernel SVM. How does γ affect generalization error?

(b) Implement select_param_rbf(…) to choose a setting for C and γ via a grid search. Your function should call cv_performance(…), passing in instances of SVC(kernel='rbf', C=c, gamma=gamma) with different values for C and gamma. Explain what kind of grid you used and why.

(c) Finally, using the training data from Section 3.1 and the function implemented here, find the best setting for C and γ for each performance measure mentioned above. Report your findings in tabular format. This time, because we have a two-dimensional grid search, report only the best score for each metric, along with the accompanying C and γ setting.
metric      | score | C | γ
accuracy    |       |   |
F1-score    |       |   |
AUROC       |       |   |
precision   |       |   |
sensitivity |       |   |
specificity |       |   |

How does the CV performance vary with the hyperparameters of the RBF-kernel SVM?

3.4 Test Set Performance [6 pts]

In this section, you will apply the two classifiers learned in the previous sections to the test data from Section 3.1. Once you have predicted labels for the test data, you will measure performance.

(a) Based on the results you obtained in Section 3.2 and Section 3.3, choose a hyperparameter setting for the linear-kernel SVM and a hyperparameter setting for the RBF-kernel SVM. Explain your choice. Then, in main(…), using the training data extracted in Section 3.1 and SVC.fit(…), train a linear- and an RBF-kernel SVM with your chosen settings.

(b) Implement performance_test(…) which returns the value of a performance measure, given the test data and a trained classifier.

(c) For each performance metric, use performance_test(…) and the two trained linear- and RBF-kernel SVM classifiers to measure performance on the test data. Report the results. Be sure to include the name of the performance metric employed, and the performance on the test data. How does the test performance of your two classifiers compare?

Introduction

Machine learning techniques have been applied to a variety of image interpretation problems. In this project, you will investigate facial recognition, which can be treated as a clustering problem ("separate these pictures of Joe and Mary"). For this project, we will use a small part of a huge database of faces of famous people (the Labeled Faces in the Wild [LFW] people dataset¹). The images have already been cropped out of the original image, and scaled and rotated so that the eyes and mouth are roughly in alignment; additionally, we will use a version that is scaled down to a manageable size of 50 by 37 pixels (for a total of 1850 "raw" features). Our dataset has a total of 1867 images of 19 different people.
You will apply dimensionality reduction using principal component analysis (PCA) and explore clustering methods such as k-means and k-medoids to the problem of facial recognition on this dataset.

Download the starter files from the course website. It contains the following source files:

• util.py – Utility methods for manipulating data, including through PCA.
• cluster.py – Code for the Point, Cluster, and ClusterSet classes, on which you will build the clustering algorithms.
• faces.py – Main code for the project.

Please note that you do not necessarily have to follow the skeleton code perfectly. We encourage you to include your own additional methods and functions. However, you are not allowed to use any scikit-learn classes or functions other than those already imported in the skeleton code.

1 PCA and Image Reconstruction [4 pts]

Before attempting automated facial recognition, you will investigate a general problem with images. That is, images are typically represented as thousands (in this project) to millions (more generally) of pixel values, and a high-dimensional vector of pixels must be reduced to a reasonably low-dimensional vector of features.

(a) As always, the first thing to do with any new dataset is to look at it. Use get_lfw_data(…) to get the LFW dataset with labels, and plot a couple of the input images using show_image(…). Then compute the mean of all the images, and plot it. (Remember to include all requested images in your writeup.) Comment briefly on this "average" face.

(b) Perform PCA on the data using util.PCA(…). This function returns a matrix U whose columns are the principal components, and a vector mu which is the mean of the data. If you want to look at a principal component (referred to in this setting as an eigenface), run show_image(vec_to_image(v)), where v is a column of the principal component matrix. (This function will scale vector v appropriately for image display.)
Show the top twelve eigenfaces:

plot_gallery([vec_to_image(U[:,i]) for i in xrange(12)])

Comment briefly on your observations. Why do you think these are selected as the top eigenfaces?

[1] https://vis-www.cs.umass.edu/lfw/

(c) Explore the effect of using more or fewer dimensions to represent images. Do this by:

• Finding the principal components of the data
• Selecting a number l of components to use
• Reconstructing the images using only the first l principal components
• Visually comparing the images to the originals

To perform PCA, use apply_PCA_from_Eig(…) to project the original data into the lower-dimensional space, and then use reconstruct_from_PCA(…) to reconstruct high-dimensional images out of lower-dimensional ones. Then, using plot_gallery(…), submit a gallery of the first 12 images in the dataset, reconstructed with l components, for l = 1, 10, 50, 100, 500, 1288. Comment briefly on the effectiveness of differing values of l with respect to facial recognition. We will revisit PCA in the last section of this project.

2 K-Means and K-Medoids [16 pts]

Next, we will explore clustering algorithms in detail by applying them to a toy dataset. In particular, we will investigate k-means and k-medoids (a slight variation on k-means).

(a) In k-means, we attempt to find k cluster centers µ_j ∈ R^d, j ∈ {1, ..., k}, and n cluster assignments c^(i) ∈ {1, ..., k}, i ∈ {1, ..., n}, such that the total distance between each data point and the nearest cluster center is minimized. In other words, we attempt to find µ_1, ..., µ_k and c^(1), ..., c^(n) that minimize

J(c, µ) = Σ_{i=1}^{n} ||x^(i) − µ_{c^(i)}||²

To do so, we iterate between assigning x^(i) to the nearest cluster center c^(i) and updating each cluster center µ_j to the average of all points assigned to the jth cluster. Instead of holding the number of clusters k fixed, one can think of minimizing the objective function over µ, c, and k. Show that this is a bad idea.
Specifically, what is the minimum possible value of J(c, µ, k)? What values of c, µ, and k result in this value?

(b) To implement our clustering algorithms, we will use Python classes to help us define three abstract data types: Point, Cluster, and ClusterSet (available in cluster.py). Read through the documentation for these classes. (You will be using these classes later, so make sure you know what functionality each class provides!) Some of the class methods are already implemented, and other methods are described in comments. Implement all of the methods marked TODO in the Cluster and ClusterSet classes.

(c) Next, implement random_init(…) and kMeans(…) based on the provided specifications.

(d) Now test the performance of k-means on a toy dataset. Use generate_points_2d(…) to generate three clusters each containing 20 points. (You can modify generate_points_2d(…) to test different inputs while debugging your code, but be sure to return to the initial implementation before creating any plots for submission.) You can plot the clusters for each iteration using the plot_clusters(…) function. In your writeup, include plots for the k-means cluster assignments and corresponding cluster "centers" for each iteration when using random initialization.

(e) Implement kMedoids(…) based on the provided specification. Hint: Since k-means and k-medoids are so similar, you may find it useful to refactor your code to use a helper function kAverages(points, k, average, init='random', plot=True), where average is a method that determines how to calculate the average of points in a cluster (so it can take on values ClusterSet.centroids or ClusterSet.medoids). [2] As before, include plots for k-medoids clustering for each iteration when using random initialization.

(f) Finally, we will explore the effect of initialization. Implement cheat_init(…). Now compare clustering by initializing using cheat_init(…). Include plots for k-means and k-medoids for each iteration.
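The k-means alternation described above — assign each point to its nearest center, then move each center to the mean of its assigned points — can be sketched as follows. This is a minimal NumPy sketch for intuition only; the assignment itself requires the Point/Cluster/ClusterSet classes and the kMeans(…) signature from the skeleton code, and kmeans_sketch is a name invented here.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm. X is (n, d); returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: assignments and centers are mutually consistent
        centers = new_centers
    return centers, assign
```

Note that this minimizes J(c, µ) only locally: different random initializations can converge to different fixed points, which is exactly why parts (d)–(f) ask you to compare random and "cheat" initializations.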
3 Clustering Faces [12 pts]

Finally (!), we will apply clustering algorithms to the image data. To keep things simple, we will only consider data from four individuals. Make a new image dataset by selecting 40 images each from classes 4, 6, 13, and 16, then translate these images to (labeled) points: [3]

X1, y1 = util.limit_pics(X, y, [4, 6, 13, 16], 40)
points = build_face_image_points(X1, y1)

(a) Apply k-means and k-medoids to this new dataset with k = 4 and initializing the centroids randomly. Evaluate the performance of each clustering algorithm by computing the average cluster purity with ClusterSet.score(…). As the performance of the algorithms can vary widely depending upon the initialization, run both clustering methods 10 times and report the average, minimum, and maximum performance:

             average   min   max
k-means
k-medoids

How do the clustering methods compare in terms of clustering performance and runtime?

[2] In Python, if you have a function stored to the variable func, you can apply it to parameters arg by calling func(arg). This works even if func is a class method and arg is an object that is an instance of the class.

[3] There is a bug in fetch lfw version 0.18.1 where the results of the loaded images are not always in the same order. This is not a problem for the previous parts but can affect the subset selected in this part. Thus, you may see varying results. Results that show the correct qualitative behavior will get full credit.

Now construct another dataset by selecting 40 images each from two individuals, 4 and 13.

(b) Explore the effect of lower-dimensional representations on clustering performance. To do this, compute the principal components for the entire image dataset, then project the newly generated dataset into a lower dimension (varying the number of principal components), and compute the scores of each clustering algorithm.
So that we are only changing one thing at a time, use init='cheat' to generate the same initial set of clusters for k-means and k-medoids. (For each value of l, the number of principal components, you will have to generate a new list of points using build_face_image_points(…).) Let l = 1, 3, 5, ..., 41. The number of clusters K = 2. Then, on a single plot, plot the clustering score versus the number of components for each clustering algorithm (be sure to label the algorithms). Discuss the results in a few sentences.

Some pairs of people are more similar to one another and some more different.

(c) Experiment with the data to find a pair that clustering can discriminate very well and another pair that it finds very difficult (assume you have 40 images for each individual). Describe your methodology (you may choose any of the clustering algorithms you implemented). Report the two pairs in your writeup (display the pairs of images using plot_representative_images), and comment briefly on the results.

1 AdaBoost [5 pts]

In the lecture on ensemble methods, we said that in iteration t, AdaBoost picks (h_t, β_t) that minimizes the objective:

(h*_t(x), β*_t) = argmin_{(h_t(x), β_t)} Σ_n w_t(n) e^{−y_n β_t h_t(x_n)}
               = argmin_{(h_t(x), β_t)} (e^{β_t} − e^{−β_t}) Σ_n w_t(n) I[y_n ≠ h_t(x_n)] + e^{−β_t} Σ_n w_t(n)

We define the weighted misclassification error at time t, ε_t, to be ε_t = Σ_n w_t(n) I[y_n ≠ h_t(x_n)]. Also, the weights are normalized so that Σ_n w_t(n) = 1.

(a) Take the derivative of the above objective function with respect to β_t and set it to zero to solve for β_t and obtain the update for β_t.

(b) Suppose the training set is linearly separable, and we use a hard-margin linear support vector machine (no slack) as a base classifier. In the first boosting iteration, what would the resulting β_1 be?

2 K-means for single dimensional data [5 pts]

In this problem, we will work through K-means for single dimensional data.
(a) Consider the case where K = 3 and we have 4 data points x_1 = 1, x_2 = 2, x_3 = 5, x_4 = 7. What is the optimal clustering for this data? What is the corresponding value of the objective?

(Parts of this assignment are adapted from course material by Jenna Wiens (UMich) and Tommi Jaakola (MIT).)

(b) One might be tempted to think that Lloyd's algorithm is guaranteed to converge to the global minimum when d = 1. Show that there exists a suboptimal cluster assignment (i.e., initialization) for the data in the above part that Lloyd's algorithm will not be able to improve (to get full credit, you need to show the assignment, show why it is suboptimal, and explain why it will not be improved).

3 Gaussian Mixture Models [8 pts]

We would like to cluster data {x_1, ..., x_N}, x_n ∈ R^d, using a Gaussian Mixture Model (GMM) with K mixture components. To do this, we need to estimate the parameters θ of the GMM, i.e., we need to set the values θ = {ω_k, µ_k, Σ_k}, k = 1, ..., K, where ω_k is the mixture weight associated with mixture component k, and µ_k and Σ_k denote the mean and the covariance matrix of the Gaussian distribution associated with mixture component k.

If we knew which cluster each sample x_n belongs to (i.e., we had complete data), we showed in the lecture on Clustering that the log likelihood l is what we have below, and we can compute the maximum likelihood estimate (MLE) of all the parameters:

l(θ) = Σ_n log p(x_n, z_n) = Σ_k Σ_n γ_nk log ω_k + Σ_k ( Σ_n γ_nk log N(x_n | µ_k, Σ_k) )   (1)

Since we do not have complete data, we use the EM algorithm. The EM algorithm works by iterating between setting each γ_nk to the posterior probability p(z_n = k | x_n) (step 1 on slide 26 of the lecture on Clustering) and then using γ_nk to find the value of θ that maximizes l (step 2 on slide 26). We will now derive updates for one of the parameters, i.e., µ_j (the mean parameter associated with mixture component j).

(a) To maximize l, compute ∇_{µ_j} l(θ): the gradient of l(θ) with respect to µ_j.
(b) Set the gradient to zero and solve for µ_j to show that

µ_j = (1 / Σ_n γ_nj) Σ_n γ_nj x_n

(c) Suppose that we are fitting a GMM to data using K = 2 components. We have N = 5 samples in our training data, with x_n, n ∈ {1, ..., N}, equal to {5, 15, 25, 30, 40}. We use the EM algorithm to find the maximum likelihood estimates for the model parameters, which are the mixing weights for the two components, ω_1 and ω_2, and the means for the two components, µ_1 and µ_2. The standard deviations for the two components are fixed at 1. Suppose that at the end of step 1 of iteration 5 in the EM algorithm, the soft assignments γ_nk for the five data items are as shown in Table 1:

γ_n1   γ_n2
0.2    0.8
0.2    0.8
0.8    0.2
0.9    0.1
0.9    0.1

Table 1: The entry in row n and column k corresponds to γ_nk.

What are the updated values for the parameters ω_1, ω_2, µ_1, and µ_2 at the end of step 2 of the EM algorithm?
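As a quick numerical check for part (c), the step-2 (M-step) updates can be computed directly from the responsibilities in Table 1: ω_k is the average responsibility for component k, and µ_k is the responsibility-weighted mean from part (b). This sketch is only arithmetic, not a substitute for showing the derivation:

```python
import numpy as np

x = np.array([5.0, 15.0, 25.0, 30.0, 40.0])  # the N = 5 training samples
gamma = np.array([[0.2, 0.8],                # soft assignments from Table 1
                  [0.2, 0.8],
                  [0.8, 0.2],
                  [0.9, 0.1],
                  [0.9, 0.1]])

# omega_k = (1/N) * sum_n gamma_nk
omega = gamma.mean(axis=0)                                 # [0.6, 0.4]
# mu_k = sum_n gamma_nk * x_n / sum_n gamma_nk
mu = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)  # [29.0, 14.0]
```

For instance, µ_1 = (0.2·5 + 0.2·15 + 0.8·25 + 0.9·30 + 0.9·40) / 3.0 = 87/3 = 29.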


[SOLVED] Csci 335 assignment 1 to 5 solutions

Create and test a class called Chain. A chain is just a series of items, e.g. [2 7 -1 43] is a chain containing four integers. The purpose of this assignment is to have you create a vector-like class from scratch, however, so you may not use vector, list or other classes from the STL here. Please note that chains are similar in many respects to vectors of items. Unless you get permission from me, don't include any libraries except iostream and cstdlib.

Pay special attention to Weiss's "big five": the destructor, copy constructor, copy assignment operator, move constructor and move assignment operator. Include cout statements at the beginning of the constructors and assignment operators in order to see when these functions are being called.

When your class is complete, the following code should work, with results as commented. Insert this piece of code as is in your testing function.

Chain a, b, c;  // Three empty chains are created
Chain d{10};    // A chain containing just one element: 10
cin >> b;       // User types [8 4 2 1]
c = a;          // Copy assignment
cout << c;


[SOLVED] Cmsi 281/371 assignment 1 to 4 solutions

The assignment will focus on the Chaikin and Bézier curve algorithms. Given a headshot photo of an individual (e.g. the headshot of Ed Sheeran), generate the cartoon version of the photo by sketching it using Chaikin or Bézier curves. Skeleton code has been provided to guide you along the way. The places that you will be required to implement have been marked with a TODO.

I have provided you with a simple Vertex class that allows you to specify the x and y values of a point. You will utilize this class for modeling the control points of your sketch. **Note: the C++ vector class is the equivalent to a list in most other languages. You may use the push_back(Object o) function of the vector class to hold your set of points.

You will complete the following functions for the assignment:
1) generate_points: a function that takes in a set of control points for your Chaikin or Bézier curve algorithm and returns the new set of control points. parameters: vector<Vertex>; returns: vector<Vertex>
2) draw_curve: calls generate_points to generate the control points using the Chaikin or Bézier curve algorithm and forms a curve by connecting the points with lines. parameters: vector<Vertex>, int; returns: none

The parameter n_iter refers to the number of iterations to run the Chaikin or Bézier algorithm. Recall that each time the algorithm is run, you will obtain a set of new points.

Submission: You will submit the following to Bright Space:
1) "assignment1.cpp"
2) Your sketch in JPEG, JPG, or PNG: results.{jpeg, jpg, png}
3) The photo your sketch was based on in JPEG, JPG or PNG: photo.{jpeg, jpg, png}

Grading: I will be compiling the assignment using the following command:

gcc -o assignment1 assignment1.cpp -std=c++14 -lGL -lGLU -lglut

Your code must compile for me to assign points! Your assignment will be graded on:
1) 80% the correctness of your implementation of Bézier's algorithm
2) 20% effort placed recreating the subject via your sketch, e.g.
a simple happy face does not do Ed Sheeran justice.

Late Policy: For each day the assignment is late, 50% of its worth will be deducted, e.g. 100% on time, 50% 1 day late, 25% 2 days late, etc.

Given a set of data points that describe a cube centered at the origin in 3-dimensional space, our goal is to rotate the cube about an axis. This assignment will introduce Vertex Arrays in OpenGL for modeling sets of points in 3D space. The skeleton code is provided to guide the assignment.

I have provided you with a simple degree-to-radian function for converting a given degree theta to radians as input to your rotation matrix. In addition, you are also given a vector2array function that converts a vector of GLfloat to an array of GLfloat to be rendered. **Note: the C++ vector class is the equivalent to a list in most other languages.

You will complete the following functions for the assignment, marked with TODO:
1) to_homogenous_coord: converts a vector of cartesian coordinates (x, y, z) to homogeneous coordinates (x, y, z, 1)
2) to_cartesian_coord: converts a vector of homogeneous coordinates (x, y, z, 1) to cartesian coordinates (x, y, z)
3) rotation_matrix_x: outputs the rotation matrix about the x-axis
4) rotation_matrix_y: outputs the rotation matrix about the y-axis
5) rotation_matrix_z: outputs the rotation matrix about the z-axis
6) mat_mult: performs matrix multiplication between two matrices

The camera has been set up for you to point towards the origin. I have provided points (a vector) which contains the set of points defining the cube in 3D space. Your goal is to apply some rotation to the points in 3D space. I have also defined an array of GLfloat called colors. These colors are mapped to the planes so that you will be able to distinguish each plane of the cube. You will notice that there is a global variable theta. theta defines the degree of rotation, which you will need to convert to radians using the provided deg2rad function.
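The homogeneous-coordinate and rotation-matrix functions listed above can be sketched as follows (in Python/NumPy for brevity; the assignment itself expects the equivalent C++ implementations, and function names follow the handout):

```python
import numpy as np

def deg2rad(theta_deg):
    """Degrees to radians, as described in the handout."""
    return theta_deg * np.pi / 180.0

def to_homogeneous(points):
    """(n, 3) cartesian -> (n, 4) homogeneous, appending w = 1."""
    return np.hstack([points, np.ones((len(points), 1))])

def to_cartesian(points_h):
    """(n, 4) homogeneous -> (n, 3) cartesian, dividing by w."""
    return points_h[:, :3] / points_h[:, 3:4]

def rotation_matrix_z(theta):
    """4x4 rotation about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0, 0],
                     [s,  c, 0, 0],
                     [0,  0, 1, 0],
                     [0,  0, 0, 1]])

# Rotate the point (1, 0, 0) by 90 degrees about z; it should land at (0, 1, 0).
p = np.array([[1.0, 0.0, 0.0]])
rotated = to_cartesian(to_homogeneous(p) @ rotation_matrix_z(deg2rad(90)).T)
```

The x- and y-axis rotation matrices follow the same pattern with the cosine/sine block moved to the other two coordinate rows.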
Submission: You will submit the following to Bright Space:
1) "assignment2.cpp"
2) A recording (avi, mov) of the program running (it should show a cube rotating about the axis or axes of your choice, e.g. rotation about the x and y axes)

Grading: I will be compiling the assignment using the following command:

gcc -o assignment2 assignment2.cpp -lGL -lGLU -lglut

Your code must compile for me to assign points! Your assignment will be graded on:
1) the correctness of your implementation of the above functions

Late Policy: For each day the assignment is late, 50% of its worth will be deducted, e.g. 100% on time, 50% 1 day late, 25% 2 days late, etc.

Given a real world scene, your goal is to replicate it using hierarchical modeling of the objects. You will begin by building prisms from planes using 3D translations and rotations. These prisms can then be used to form parts of objects. The skeleton code is provided in a separate assignment3.cpp file.

I have provided you with a function init_plane which initializes a square plane of unit lengths centered at the origin (0, 0, 0). You will utilize this initial plane to form a cube, which can then be transformed and used to form objects. I have also provided you with the deg2rad, vector2array, to_homogeneous_coord, to_cartesian_coord, and init_color functions.

● deg2rad — converts degrees to radians
● vector2array — converts a vector into an array
● to_homogeneous_coord — converts Cartesian coordinates to homogeneous coordinates
● to_cartesian_coord — converts homogeneous coordinates to Cartesian coordinates
● init_color — creates a color map for the scene

As vector2array dynamically allocates space and copies the elements of the vector into the array, you need to remember to deallocate the arrays created via vector2array after you have rendered your scene, to prevent memory leaks. Since arrays are static, it is easier to work with the vector class before producing the final array of vertices that will be used for rendering.
You will need to first implement the build_cube function, which creates a unit cube. Then apply transformations to the cube to create objects in the scene (init_scene). Once the scene is built, you will apply rotation to the scene such that the entire scene spins while the camera stays still. You may set the parameters of your camera in init_camera.

For the mat_mult function, please implement it such that it multiplies a transformation matrix A by the entire points matrix B, rather than applying transformation A point by point to B. The math header file that I have included should be sufficient for the operations needed for this project.

You will complete the following functions for the assignment:
1) translation_matrix(float dx, float dy, float dz)
2) scaling_matrix(float sx, float sy, float sz)
3) rotation_matrix_x(float theta)
4) rotation_matrix_y(float theta)
5) rotation_matrix_z(float theta)
6) mat_mult(vector A, vector B)
7) build_cube()
8) init_camera()
9) init_scene()
10) display_func()

Note: functions 1-7 serve as helper functions to generate the objects in the scene (i.e. build_cube will create a cube which you can then modify to form parts of an object). I would suggest using vector2array as the final step, since going from an array to a vector requires you to keep track of the number of elements in the array (which could be hard after generating thousands of points).

Submission: You will submit the following to Bright Space:
1) assignment3.cpp
2) Either 6 different viewpoints taken from different angles of your scene, submitted as JPEG, JPG or PNG with the names view{1,2,3,4,5,6}.{jpeg, jpg, png}, or a short video panning over regions of your scene
3) A photo in the form of JPEG, JPG or PNG (scene.{jpeg, jpg, png}) taken of the scene on which you are basing your model

Grading: I will be compiling the assignment using the following command:

g++ -o assignment3 assignment3.cpp -lglut -lGLU -lGL

Your code must compile for me to assign points!
Your assignment will be graded on:
1) the correctness of your implementation of the hierarchical modeling procedure
2) the correctness of your implementation of your camera and scene modeling
3) effort placed recreating the real world scene

Late Policy: For each day the assignment is late, 50% of its worth will be deducted, e.g. 100% on time, 50% 1 day late, 25% 2 days late, etc.

Given a scene, our goal is to generate objects using hierarchical modeling (based on assignment 3) and add shading (colors) to these objects by defining a light source (a 3-dimensional vector), computing the surface normals, and generating the observed colors using Gouraud shading. Note that we define the illumination (e.g. the observed colors) as follows:

Note: you may reuse the objects you defined in assignment 3, but you also MUST make sure that you have sufficient objects in the scene to make it look interesting (at least a few colors).

Once you have defined an object, I would suggest keeping the operations that generate the object as a standalone function (e.g. build_chair(…) for generating a chair) such that you may call these functions to quickly prototype a scene.

We will be working with an ObjectModel class which contains 4 vector objects, holding information about the points defining each plane, their respective base colors, their normals (each point on the plane should have the same normal), and the actual observed colors. I have provided you with an overloaded function init_base_color, which you can use to define the base colors for each plane (to color a cube, for example, you will need to call this 6 times).

There are 3 portions to this assignment: 1) generating surface normals, 2) applying Gouraud shading, and 3) rendering objects with colors based on camera and light source positions. The next section of the specification will detail the road map for these portions.
Surface Normals

In order to know how light will interact with an object, we must know the object's surface normal, which can be obtained using the cross product of two vectors on the object surface (which is a plane, where normals are well-defined). You will implement the cross_product method and use it to implement the generate_normals method.

Gouraud Shading

We will implement our shading equation, which models how an object will be illuminated. You will implement the dot_product method in order to measure the strength of the light rays reflecting off an object. We will use the dot_product to implement the apply_shading method. Note: for this assignment we are implementing Gouraud shading, which includes the interpolation of the colors between each point defining the surface. Luckily the interpolation is already taken care of by OpenGL, so all we will need to do is define the observed colors for each point via the shading equation.

Rendering Objects with Colors

In the previous assignment, we simply let the colors be defined as a static base color value (randomly generated). We now know how light behaves and how we should model its interaction with objects. Hence, we will need to populate each member of ObjectModel. The scene will be constructed using the defined points. The set of base colors of the scene will be stored in base_colors. The apply_shading method will be used to generate the actual observed colors using the surface normals, light source, and camera. The observed colors are stored as the color member of ObjectModel.
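The two geometric kernels described above — a cross product for the surface normal and a dot product for the diffuse term — can be sketched like this (a Python sketch for intuition; the assignment implements cross_product, dot_product, generate_normals, and apply_shading in C++ on the ObjectModel class, and the simple Lambertian max(0, N·L) term here stands in for whatever full illumination equation the original handout gives):

```python
import numpy as np

def cross_product(u, v):
    """Cross product of two 3-vectors; the result is perpendicular to both."""
    return np.array([u[1]*v[2] - u[2]*v[1],
                     u[2]*v[0] - u[0]*v[2],
                     u[0]*v[1] - u[1]*v[0]])

def plane_normal(p0, p1, p2):
    """Unit normal of the plane through three points (two edge vectors crossed)."""
    n = cross_product(p1 - p0, p2 - p0)
    return n / np.linalg.norm(n)

def diffuse_intensity(normal, light_dir):
    """Lambertian term: max(0, N . L), with the light direction normalized."""
    l = light_dir / np.linalg.norm(light_dir)
    return max(0.0, float(np.dot(normal, l)))

# Example: a face lying in the z = 0 plane, lit from directly above.
p0, p1, p2 = np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([0., 1., 0.])
n = plane_normal(p0, p1, p2)                              # (0, 0, 1)
intensity = diffuse_intensity(n, np.array([0., 0., 1.]))  # full brightness
```

Scaling each plane's base color by this intensity (per vertex) is what apply_shading does; OpenGL then interpolates the resulting vertex colors across the face, which is the Gouraud part.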
Submission: You will submit the following to Bright Space:
1) assignment4.cpp
2) Either 6 different viewpoints taken from different angles of your scene, submitted as JPEG, JPG or PNG with the names view{1,2,3,4,5,6}.{jpeg, jpg, png}, or a short video panning over regions of your scene
3) A photo in the form of JPEG, JPG or PNG (scene.{jpeg, jpg, png}) taken of the scene on which you are basing your model

Grading: I will be compiling the assignment using the following command:

g++ -o assignment4 assignment4.cpp -lglut -lGLU -lGL

Your code must compile for me to assign points! Your assignment will be graded on:
1) the correctness of your implementation of the surface normals
2) the correctness of your implementation of the illumination equation
3) effort placed recreating the real world scene

Late Policy: For each day the assignment is late, 50% of its worth will be deducted, e.g. 100% on time, 50% 1 day late, 25% 2 days late, etc.
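The matrix convention requested in the hierarchical-modeling assignment — mat_mult applies a transformation matrix to the entire points matrix at once, rather than point by point — can be sketched as follows (Python/NumPy for brevity; the C++ versions use flat vectors of GLfloat):

```python
import numpy as np

def translation_matrix(dx, dy, dz):
    """4x4 homogeneous translation."""
    m = np.eye(4)
    m[:3, 3] = [dx, dy, dz]
    return m

def scaling_matrix(sx, sy, sz):
    """4x4 homogeneous scale about the origin."""
    return np.diag([sx, sy, sz, 1.0])

def mat_mult(a, b):
    """Apply transform a to ALL columns of the 4xN points matrix b at once."""
    return a @ b

# A unit-cube corner at (1, 1, 1), as one homogeneous column:
points = np.array([[1.0], [1.0], [1.0], [1.0]])
# Scale by 2 about the origin, then translate up by 3 (note the order:
# the rightmost matrix is applied to the points first).
m = mat_mult(translation_matrix(0, 0, 3), scaling_matrix(2, 2, 2))
out = mat_mult(m, points)   # corner moves to (2, 2, 5)
```

Composing the per-object transforms into a single matrix before multiplying the points matrix is exactly what makes hierarchical modeling cheap: each child object reuses its parent's composed transform.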


[SOLVED] BSTA011 - BUSINESS STATISTICS

BSTA011 - BUSINESS STATISTICS GROUP ASSIGNMENT

This assignment is designed to assist you to achieve the following learning outcomes:
a. Apply appropriate quantitative analytical techniques to qualify, support, select and evaluate data as information for use in business decision-making.
b. Interpret and communicate results of quantitative analyses for business decision-making.
c. Use a computer-based data analysis package (i.e. Excel) to critically analyse data.

Assignment value: 20%

Group of 3-5 students. The group members must be from the same tutorial. Your tutor will put you in groups in tutorials. In the event that you cannot find any group, let your Lecturer/tutor know asap. You are NOT allowed to complete this assignment by yourself or in groups of less than 3 members.

Submission: soft copy, due 11:59 PM Sunday, 07/12/2025, via CANVAS.

Your team has been accepted as interns at Landcom. Landcom manages strategic and complex residential projects. Your first job is to conduct an analysis based on the recent sale prices in three suburbs of New South Wales for the years 2021, 2022 and 2023, from All Homes or real estate. Your team needs to perform a comprehensive statistical analysis of the suburbs, which your tutors will suggest.

TASK 1: LOCATE AND SELECT DATA

Q1. Collect and compute the appropriate descriptive statistics of the "sold house price", "sold house land size", and "sold house number of rooms" for the years 2021, 2022 and 2023 for the suburb selected by your tutor. The descriptive statistics measures include central tendency (mean), variability (standard deviation), mode, quartiles, range, and interquartile range. Show infographics (e.g., pie chart, bar chart, etc.) of the 2021, 2022 and 2023 data for the following variables:
(a) Sold house price
(b) Sold house land size
(c) Sold house number of rooms

The sample size should be at least 30 for each year (2021, 2022 and 2023) for each suburb.
So, for one suburb, at least 90 houses in total should be recorded over the three-year period.

TASK 2: DATA DESCRIPTION AND ANALYSIS

Q2. Based on the descriptive statistics from Q1, briefly comment on the central tendency and variability of the three suburbs for 2021, 2022 and 2023. Combine data from all group members in an Excel spreadsheet and use this collated sample to answer the following questions.

Q3. Choose one suburb and perform the following task on the 2021 and 2022 data: The historical data indicates that high house prices (more than the average price; you should have the average house price of each suburb from question 1) are more likely to be associated with land size, compared to low house prices (below the average house price). What is the probability of a high house price given that the house land size is extended (more than the average land size for the suburb)? What is the probability of a low house price given that the land size is non-extended (land size below average)? Analyse your collated sample and examine whether this is indeed the case. Show the steps in your analysis (including justification for the choice of techniques used and all calculations) and report your findings clearly, using a probability matrix.

Table: Probability matrix for 2021

                         High house price   Low house price   Total
Extended land size
Non-extended land size
Total
Grand total

Table: Probability matrix for 2022

                         High house price   Low house price   Total
Extended land size
Non-extended land size
Total
Grand total

Q4. (a) Choose one suburb and perform the following task on the 2021, 2022 and 2023 data. It is a common perception that the land size and the number of rooms available influence the house price. Investigate the following relationships using multiple linear regression analysis: (i) explore the relationship between land size and the house price; (ii) explore the relationship between the available number of rooms and the house price.
Use the multiple linear regression model and interpret the results: the p-values of the independent variables, multiple R, adjusted R-squared, the physical meaning of the coefficients, and the significance of the F-statistic.

(b) Using the suburb selected for part (a), conduct a regression analysis of the house prices against external economic factors (cash rate target, inflation rate, and unemployment rate) for the years 2021, 2022, and 2023, utilizing data from the Reserve Bank of Australia (RBA). Apply multiple linear regression models to examine the relationships between these variables. Interpret the statistical measures derived from the regression models, including the multiple R, adjusted R-squared, and the significance of the F-statistic. Evaluate the importance of the independent variables by interpreting their p-values. Develop two distinct regression models, one for task (a) and one for task (b).

Q5. Choose one suburb and perform the following task on the 2022 data: Analyse the frequencies of two variables (house price level and land size) with multiple categories to determine whether the two variables are independent. Conduct a chi-square hypothesis test at the 0.05 level to determine whether house price level and land size are independent. Use the following table for the chi-square test:

                         High house price   Low house price   Total
Extended land size
Non-extended land size
Grand total

Q6. What is the average house price of each selected suburb for 2023? (Use the house price average from question 1 and construct a 95% confidence interval for the average house price for each selected suburb of New South Wales for the year 2023.) Note: the population standard deviation of house prices in New South Wales is $20,000.

Q7. A recent study has claimed that the average house price in New South Wales is $1,187,200. Use your collected data to test this claim for each selected suburb for the year 2023 (note: use the sample statistics from question 1).
Note: The population standard deviation of house prices in New South Wales is $20,000. Is there any evidence to suggest that the average house price has changed, at a 5% significance level? Report your findings with clear conclusions and all supporting calculations.

Q8. Develop a rating for the three suburbs assigned to you, based on the crime statistics provided in the Sydney Suburban Review. Complete the following tasks:

(i) Define a rating scale for the suburbs based on crime rates. For example:
• A: Suburbs with the lowest crime rates.
• B: Suburbs with moderate crime rates.
• C: Suburbs with the highest crime rates.
To establish the ratings, calculate the average crime incidents for NSW from the crime incident data for the 408 suburbs provided in the Sydney Suburban Review. Suburbs with crime rates above the average should be categorised as C, those around the average as B, and those below the average as A.

(ii) Create visual representations, such as bar charts or maps, to display the crime ratings for each suburb.

(iii) Investigate the influence of crime rates on house prices. Provide a well-supported argument based on relevant evidence and research findings.
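For Q6 and Q7 above, the population standard deviation is given ($20,000), so both the interval and the test are z-based. A minimal Python sketch of the arithmetic follows; the sample mean and sample size are hypothetical placeholders — substitute your own statistics from Question 1.

```python
import math

# Known population standard deviation from the assignment brief.
sigma = 20_000.0

# Hypothetical 2023 sample statistics for one suburb (replace with the
# mean and sample size from your Question 1 summary).
x_bar, n = 1_195_000.0, 30

# Q6: 95% confidence interval, x_bar +/- z * sigma / sqrt(n), z = 1.96.
z_crit = 1.96
margin = z_crit * sigma / math.sqrt(n)
ci = (x_bar - margin, x_bar + margin)

# Q7: two-tailed z-test of H0: mu = 1,187,200 at the 5% level.
mu_0 = 1_187_200.0
z_stat = (x_bar - mu_0) / (sigma / math.sqrt(n))
reject_h0 = abs(z_stat) > z_crit

print(ci)
print(round(z_stat, 3), reject_h0)
```

With these placeholder numbers the test statistic is about 2.136, which exceeds 1.96, so the claim would be rejected; your own data may of course lead to the opposite conclusion.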

$25.00 View

[SOLVED] Microelectronics Circuit Analysis and Design

1) Design a lithium battery pack discharge overheating protection circuit which senses the temperature of a 7.4V lithium battery pack (which consists of two 3.7V batteries in series) and disconnects it from the load it's powering if the pack gets too hot, then reconnects it again when it cools. You should use a thermistor to sense temperature, and may use any commercially available op-amps, comparators, FETs, BJTs, Zener diodes (your choice of voltage) and other types of diodes, resistors, and any capacitors or inductors you need. The thermistor has a resistance of greater than 10 kΩ when the battery is cool, and this reduces to below 8 kΩ when it is too hot; your circuit should disconnect when the resistance decreases to below 8 kΩ and reconnect when the resistance increases again to above 10 kΩ. Your circuit should operate from the voltages available from the two series-connected batteries in the pack (you have access to the terminal between the two series-connected batteries). For highest credit on this problem, give manufacturer part numbers for all semiconductors (such as those given in the Analog Devices parts kit).

2) Design a ten-bit D/A converter which generates an output voltage in the range of 0 to 1.023 V (i.e. a 1 mV LSb). You may use op-amps, comparators, FETs, BJTs, Zener diodes (your choice of voltage) and other types of diodes, resistors, and any capacitors or inductors you need. Assume that you have whatever power supply voltages you need and that the digital inputs are 0 V for logic 0 and +5 V for logic 1.
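For problem 1, the comparator's two trip points follow from simple divider arithmetic. The sketch below assumes a specific topology that the brief does not mandate: the thermistor as the lower leg of a divider fed from the 3.7 V mid-pack tap through a hypothetical 10 kΩ fixed resistor. Any values here are illustrative, not a prescribed design.

```python
# Divider trip-point sketch for the over-temperature comparator.
# Assumptions (not specified in the brief): NTC thermistor is the lower
# leg of a divider from the 3.7 V mid-pack terminal through a fixed
# 10 kOhm resistor; the comparator monitors the divider voltage.
v_supply = 3.7       # volts, mid-pack tap
r_fixed = 10_000.0   # ohms, hypothetical fixed divider resistor

def divider_v(r_therm):
    """Voltage across the thermistor (NTC: resistance falls as it heats)."""
    return v_supply * r_therm / (r_fixed + r_therm)

v_disconnect = divider_v(8_000.0)    # trip when R drops below 8 kOhm
v_reconnect  = divider_v(10_000.0)   # re-arm when R rises above 10 kOhm

# The comparator's hysteresis band must span these two voltages.
hysteresis = v_reconnect - v_disconnect
print(round(v_disconnect, 3), round(v_reconnect, 3), round(hysteresis, 3))
```

The roughly 0.2 V gap between the two thresholds is what the positive-feedback (hysteresis) network around the comparator has to provide, so the load switches cleanly rather than chattering near the trip temperature.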


[SOLVED] IB Mathematics Internal Assessment (SPSS)

IB Mathematics Internal Assessment
From Pendulums to Potential Fields: A Comparative Analysis of Critical Points in Physical Systems Across Dimensions
Word Count: 3927

Table of Contents
1. Introduction
2. Differentiation of One-Variable Functions
  2.1. Examples of 1D Curves and 2D Surfaces in Physics
  2.2. Defining Critical Points in 1D Curves
    2.2.1. Maxima and Minima in Physical Systems
    2.2.2. Examples of Critical Points in Mechanical Energy
    2.2.3. Inflection Points and Transition States
    2.2.4. Examples of Concavity Change in Motion
  2.3. Calculation and Classification of Critical Points
    2.3.1. Calculating Maxima and Minima in Potential Energy
    2.3.2. First Derivative Test for Physical Equilibrium
    2.3.3. Example: Simple Pendulum Analysis
    2.3.4. Calculating Inflection Points in Kinematics
    2.3.5. Finding All Critical Points in Oscillatory Systems
    2.3.6. The Second Derivative Test for Stability
3. Transitioning to Analyzing Critical Points in 2D Surfaces
  3.1. Visualization of 2D Physical Surfaces
  3.2. Defining Critical Points in Potential Energy Landscapes
  3.3. Finding Critical Points in 2D Physical Systems
    3.3.1. Example: Particle in Magnetic Field
  3.4. Classifying the Nature of Critical Points
    3.4.1. The Hessian Matrix in Physical Context
    3.4.2. The Determinant and Stability Criterion
    3.4.3. Example: Saddle Points in Electrostatic Potentials
    3.4.4. Experimenting with Different Physical Potentials
    3.4.5. Example: Gravitational Potential Analysis
4. Conclusion
  4.1. Comparing the Dimensions in Physical Contexts
  4.2. Generalization to Higher-Dimensional Physical Systems
Bibliography

1 Introduction

The study of critical points in calculus finds profound applications in understanding physical phenomena, from the simple oscillation of a pendulum to the complex energy landscapes of molecular systems.
My fascination with this topic began during physics laboratory sessions, where I observed how mathematical concepts directly translate to physical behavior. This investigation explores how critical point analysis extends from one-dimensional mechanical systems to two-dimensional potential fields, addressing the research question: How does the mathematical framework for critical point analysis evolve when transitioning from one-dimensional to two-dimensional physical systems, and what new physical insights emerge from this dimensional expansion? Through concrete physical examples and mathematical rigor, this paper demonstrates the powerful connection between abstract calculus and tangible physical reality.

2 Differentiation of One-Variable Functions

2.1 Examples of 1D Curves and 2D Surfaces in Physics

Figure 1: Potential energy curve of a simple pendulum, V(θ) = mgl(1 − cos θ)

2.2 Defining Critical Points in 1D Curves

2.2.1 Maxima and Minima in Physical Systems

In physical contexts, critical points represent equilibrium positions:
• Minima: stable equilibrium (the system returns after a small displacement)
• Maxima: unstable equilibrium (the system moves away after a small displacement)

2.2.2 Examples of Critical Points in Mechanical Energy

For a mass-spring system with potential energy V(x) = (1/2)kx²:

V'(x) = kx = 0  ⇒  x = 0                       (1)
V''(x) = k > 0  ⇒  stable equilibrium           (2)

2.2.3 Inflection Points and Transition States

In physical systems, inflection points often represent transition states between different regimes of behavior.
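The mass-spring result above can be checked numerically. This short Python sketch (not part of the original IA) approximates V' and V'' by central differences and confirms that x = 0 is an equilibrium with positive curvature, i.e. a stable minimum; the spring constant k = 2 is an arbitrary choice.

```python
# Numerical check of equations (1)-(2): V(x) = 0.5*k*x**2 has one
# critical point at x = 0 with V''(0) = k > 0 (stable equilibrium).
k = 2.0
V = lambda x: 0.5 * k * x * x
h = 1e-5  # finite-difference step

def dV(x):
    """Central-difference first derivative; the net force is -dV."""
    return (V(x + h) - V(x - h)) / (2 * h)

def d2V(x):
    """Central-difference second derivative (curvature of the potential)."""
    return (V(x + h) - 2 * V(x) + V(x - h)) / (h * h)

x_eq = 0.0
print(abs(dV(x_eq)) < 1e-8)   # True: V'(0) = 0, an equilibrium
print(d2V(x_eq) > 0)          # True: V''(0) = k > 0, stable minimum
```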
2.3 Calculation and Classification of Critical Points

2.3.1 Calculating Maxima and Minima in Potential Energy

Consider a particle in an anharmonic oscillator with potential: (3)

2.3.2 First Derivative Test for Physical Equilibrium

The first derivative test identifies where the net force vanishes: (4)

2.3.3 Example: Simple Pendulum Analysis

As detailed in the theoretical framework, the simple pendulum demonstrates a clear physical interpretation of mathematical critical points.

2.3.4 Calculating Inflection Points in Kinematics

In motion analysis, inflection points in displacement-time graphs indicate acceleration changes.

3 Transitioning to Analyzing Critical Points in 2D Surfaces

3.1 Visualization of 2D Physical Surfaces

Figure 2: Two-dimensional potential energy surface showing multiple critical points

3.2 Defining Critical Points in Potential Energy Landscapes

In two dimensions, critical points satisfy:

∂V/∂x = 0  and  ∂V/∂y = 0                      (5)

3.3 Finding Critical Points in 2D Physical Systems

3.3.1 Example: Particle in Magnetic Field

Consider a charged particle in a magnetic field with potential: (6)

Finding critical points: (7) (8)

3.4 Classifying the Nature of Critical Points

3.4.1 The Hessian Matrix in Physical Context

The Hessian matrix encodes the curvature information crucial for stability analysis:

H = | Vxx  Vxy |                                (9)
    | Vyx  Vyy |

3.4.2 The Determinant and Stability Criterion

The Hessian determinant D = Vxx·Vyy − Vxy² classifies critical points:
• D > 0, Vxx > 0: stable equilibrium (minimum)
• D > 0, Vxx < 0: unstable equilibrium (maximum)
• D < 0: saddle point (mixed stability)

3.4.3 Example: Saddle Points in Electrostatic Potentials

Consider the electrostatic potential: (10)

Critical point at (0, 0) with Hessian: (11)

This saddle point represents an unstable equilibrium where the potential decreases in the y-direction but increases in the x-direction.
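The determinant criterion can be exercised directly. The sketch below uses V(x, y) = x² − y² as a stand-in saddle potential (the paper's equation (10) is not reproduced in this extract, so this specific form is an assumption consistent with "increases in x, decreases in y"): at the origin Vxx = 2, Vyy = −2, Vxy = 0, so D = −4 < 0.

```python
# Hessian-determinant classification of a 2D critical point, following
# the D > 0 / D < 0 rules in the section above.
def hessian_classify(vxx, vyy, vxy):
    """Classify a critical point from its second partial derivatives."""
    d = vxx * vyy - vxy ** 2   # Hessian determinant
    if d < 0:
        return "saddle point"
    if d > 0:
        return "minimum (stable)" if vxx > 0 else "maximum (unstable)"
    return "inconclusive"      # D = 0: the test gives no answer

# For the assumed saddle V = x^2 - y^2 at the origin:
print(hessian_classify(2.0, -2.0, 0.0))   # saddle point
# For a bowl such as V = x^2 + y^2 at the origin:
print(hessian_classify(2.0, 2.0, 0.0))    # minimum (stable)
```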
3.4.4 Experimenting with Different Physical Potentials

3.4.5 Example: Gravitational Potential Analysis

For a mass in a gravitational field with an additional quadrupole moment: (12)

This complex potential demonstrates multiple critical points with different stability characteristics.

4 Conclusion

4.1 Comparing the Dimensions in Physical Contexts

The transition from one-dimensional to two-dimensional analysis reveals fundamental insights:

Table 1: Comparison of Critical Point Analysis in Physical Systems

Aspect               | 1D Systems          | 2D Systems
Equilibrium types    | Minima, maxima      | Minima, maxima, saddle points
Stability analysis   | Single direction    | Multiple directions
Physical examples    | Pendulum, spring    | Molecular conformations, field potentials
Mathematical tools   | Second derivative   | Hessian matrix
Complexity           | Simple              | Rich, anisotropic behavior

4.2 Generalization to Higher-Dimensional Physical Systems

The principles established extend naturally to higher dimensions:
• Three-dimensional potential fields in electromagnetism
• Multi-dimensional configuration spaces in statistical mechanics
• High-dimensional energy landscapes in machine learning

The emergence of saddle points in two dimensions represents a crucial conceptual advancement, enabling understanding of transition states and anisotropic stability that are ubiquitous in real physical systems.

Bibliography
1. Goldstein, H., Poole, C., & Safko, J. (2002). Classical Mechanics (3rd ed.). Addison-Wesley.
2. Marion, J. B., & Thornton, S. T. (2004). Classical Dynamics of Particles and Systems (5th ed.). Brooks/Cole.
3. Stewart, J. (2015). Calculus: Early Transcendentals (8th ed.). Cengage Learning.
4. Feynman, R. P., Leighton, R. B., & Sands, M. (2005). The Feynman Lectures on Physics. Addison-Wesley.
5. Kibble, T. W. B., & Berkshire, F. H. (2004). Classical Mechanics (5th ed.). Imperial College Press.


[SOLVED] Assignment 5

Assignment #5

Assignment Overview

Black Friday is almost here! You are managing the local electronics store, and you know that Black Friday is always a crazy day in your store. You want to ensure that the shopping experience is a safe one for everyone. To that end, you've decided to control the flow of customers into and within your store. Customers will be grouped at one of two entryways, and then will be escorted to the area of the store they wish to shop in to maintain some semblance of order.

● Escort Team 0 can accommodate up to 100 shoppers.
● Escort Team 1 can accommodate up to 50 shoppers.

There are 500 shoppers lined up at the main entrance (Zone 0), and they all want to get to one of the other four zones.

● 50 shoppers want to get to Zone 1 (Appliances).
● 100 shoppers want to get to Zone 2 (TVs).
● 250 shoppers want to get to Zone 3 (Smartphones).
● 100 shoppers want to get to Zone 4 (Video Games).

The 500 shoppers waiting at the main entrance (Zone 0) will need to be escorted to the other four zones by one of the two escort teams. However, only one escort team can be in any zone at one time, to avoid confusion. For this assignment, write a C program which will simulate this activity.

Purpose

● Learn how to use multi-threading and mutual exclusion to safely update shared values.
● Get experience with either
  o pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock() and pthread_mutex_destroy() system functions; OR
  o sem_init(), sem_wait(), sem_post() and sem_destroy() system functions.
● Gain more experience with the C programming language from an OS perspective.

Instructions

Attached to this assignment is a tarball with the following files in it. None of these files should be modified: Makefile
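The assignment itself must be written in C with pthread mutexes or POSIX semaphores, but the core requirement — protecting a shared read-modify-write from concurrent escort threads — can be sketched language-neutrally. Here is the pattern in Python with threading.Lock standing in for pthread_mutex_lock/unlock; the trip counts are illustrative only, chosen so the two hypothetical teams deliver all 500 shoppers.

```python
import threading

# Conceptual model of the mutual-exclusion requirement: two escort
# threads update a shared count of delivered shoppers, and a lock keeps
# each read-modify-write atomic. (Not the required C solution.)
delivered = 0
lock = threading.Lock()

def escort(batch_size, trips):
    global delivered
    for _ in range(trips):
        with lock:               # pthread_mutex_lock / _unlock analogue
            delivered += batch_size

# Team 0 moves up to 100 shoppers per trip, Team 1 up to 50 (per brief).
t0 = threading.Thread(target=escort, args=(100, 3))
t1 = threading.Thread(target=escort, args=(50, 4))
t0.start(); t1.start()
t0.join(); t1.join()

print(delivered)   # 3*100 + 4*50 = 500, all shoppers escorted
```

The "only one escort team per zone" rule maps naturally onto one mutex (or binary semaphore) per zone in the C version, acquired before a team enters and released when it leaves.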


[SOLVED] Media analysis project

Media analysis project

Topic: How do Chinese and American media create opposing narratives in the TikTok security controversy?

Media analysis project (25%)

The objective of this assignment is to apply the knowledge gained during the semester to analyze the role of the media in a conflict of your choice. You will conduct a media analysis project and present your findings in a written report. This report will be between 2500 and 3000 words (excluding bibliography). Based on the work done for the background paper, you will:

● Decide what type(s) of media to focus on (e.g. newspapers, social media, cable news, radio; local, national, international; fringe/mainstream, etc.).
● Settle on 1 to 3 questions to ask about the role of the media in relation to the given conflict.
● Develop a method to gather media material and analyze it. There will be ample opportunities to discuss methods throughout the course as we encounter various methods over the weeks. You'll be able to draw on the knowledge gained from the replication study. I can also direct you towards method readings relevant to your individual projects.

The report should include a presentation of, and rationale for, the key questions; a section about the data gathering and analysis method; a presentation of findings; and, finally, a discussion of the findings. Your research paper should be 2500-3000 words long (excluding the bibliography). You should include a fully formatted bibliography using the APA style (see here for a quick guide). There is no limit on how many references you can include and no strict minimum either, but an indicative number is somewhere around 10 references.

Key criteria for grading include:

● Coherence of the argument
● Relevance and robustness of the methodological approach
● Clarity of the writing
● Submitted on time
● Formatting as per guidelines


[SOLVED] PM600 Research Project

Assessment Task Information Key details: Assessment title: Written assignment (individual): Research Questions & Literature Review Module Name: Research Project Module Code: PM600 Assessment will be set on: Cycle 2, Week 1 Feedback opportunities: Before deadline: Peer feedback in class, tutor feedback in class 1:1 After deadline: Written feedback available on Turnitin two weeks after submission. Assessment is due on: 14/12/25 Assessment weighting: 30% Assessment Instructions What do you need to do for this assessment? In your previous assessment you produced a research project proposal on which you have received feedback from your tutor. Use this feedback to improve your project design by improving or expanding two sections of the proposal: 1. The Research Questions should be refined to ensure feasibility and clarify the focus of your research. 2. The sources in the Annotated Bibliography should be compared and contrasted with other sources in your Literature Review. These should be grouped into themes containing an evaluation of the sources used and lead to the identification of a research gap, which you will try to fill with the study you will carry out in the next stage of the project. The Literature Review should give an overview of the research done previously, including details on findings and the methodology used in the sources reviewed. This could be resubmitted to a potential research supervisor. The aim of these two sections is to provide appropriate focus of your research project and literature-based evidence which can help persuade your research supervisor about the importance and need for this research project. Guidance: For this assessment you should make use of the following formative activities that you have already completed.  
These activities have been designed to support this assessment:

· Week 7: Research Proposal assessment feedback
· Week 9: Peer Review Literature Review Worksheet
· Week 10: Literature Review 1:1s with class tutor

Please note: Both parts of this assessment are individual tasks, which means that you are expected to complete them by yourself.

Structure:

Project Title: Give your project a working title (15-20 words).

Introduction: State the layout of your literature review and key points to come (maximum 150 words; optional).

Section 1: Literature Review (~1200-1500 words)
· In this section, you are to review the available academic literature on your chosen topic. You should compare and contrast existing research findings in order to identify gaps which your research aims to fill, and critically evaluate the quality of existing literature on your chosen topic.
· You can organise this section into further sub-sections, such as an introduction and sub-sections dedicated to specific research themes. Themes could be different areas of your topic that you researched or different areas of literature you find in your research. Sub-section headings should reflect the relevant content of that section.

Section 2: Reflection on Research Questions (~200-300 words)
· This section should include your Research Questions (maximum three).
· You should provide a justification of these questions and relate them to gaps in the existing literature on your chosen topic.
· The research questions should be a result of the research done in the Literature Review.
· This assignment should not answer your research questions.
· You can also discuss your hypotheses for your project, highlighting the kinds of results you expect to find based on your reading.

Reference List (excluded from the word count)
In this section you should include a list of all sources you have used to complete this assignment.
Theory and/or task resources required for the assessment: This is a secondary research task, requiring you to draw on sources to evaluate your research design and the topic you wish to research. Note: this does not mean your final research project must follow a secondary research approach. This applies solely to the Literature Review & Research Questions assessment. You should draw on knowledge and information provided in your lessons. A range of sources relevant to your topic area is required for this assignment: you may wish to use less formal sources for background research on your topic, e.g. newspapers and reputable websites, in addition to a diverse range of academic sources available to you through your college, including journals, textbooks and chapters in edited volumes. Databases are a rich source for datasets. Any non-academic sources you use should be treated with caution. A minimum of 10 sources is required for this assignment.

Referencing style: All sources cited within the literature review should appear in the final reference list. The preferred referencing style is APA, although your teacher may offer an alternative style of Harvard referencing that can be used.

Expected word count: You are expected to write approximately 1,500-1,800 words to complete this assignment, following the structure outlined above. References and the project title are excluded from the word count.

Learning Outcomes Assessed:
· Conduct and produce a critically evaluative academic literature review appropriate to the proposed research question(s), following accepted conventions.
· Critically reflect on academic skills & performance, responding appropriately to feedback to improve aspects of academic work.

Submission Requirements: You must include the following paragraph on your title page: "I confirm that this assignment is my own work.
Where I have referred to academic sources, I have provided in-text citations and included the sources in the final reference list."

You must type your assessment in an academically suitable font (e.g. Arial), font size 11, with 1.5 spacing. Section and sub-section headings can be font size 14/16. You must submit the assessment electronically via the VLE module page. Please ensure you submit it via the Turnitin VLE plug-in.

When you submit a copy of your Proposal to Turnitin, you must include a title page with the following information:
✓ Module Tutor Name
✓ Student Name

When you submit your proposal on Turnitin, the submission title should include: your student ID number_module code and group_tutor initials, e.g. 2999999_PM600F_JB

NB: If you have technical problems submitting, you should do the following:
1. Contact College Services using this form before the deadline: https://kicpathways.formstack.com/forms/contact_gic
2. Under 'What is your enquiry about?', choose 'assignment hand in'.
3. In the 'How can we help you?' box, write what the assignment is (Assessment 1: Research Proposal), the module (PM600), group (e.g. JAN 25 - ENG Group A), tutor's name and the date it was due.
4. Attach your assignment and screenshot(s) of the error message.
Academic Integrity & Misconduct Information: Please use this link to access more information on academic integrity and misconduct: https://pathways.kaplaninternational.com/course/view.php?id=1940

Additional submission information - check you have done the following:

Formatting: consistent font, spacing, page numbers, formatting and subheadings
Citations: correct format and location throughout the report
Referencing: Harvard referencing system used correctly in the reference list
Summarising: summarising the results of research
Paraphrasing: paraphrasing the contents of research findings
Spell check: spell check the report
Proof-reading: proof-reading completed
Grammar: the report has been checked for grammatical accuracy

How will this assessment be marked? The following criteria will be used to evaluate your performance in this assessment:

Coverage and content (25%)
- How well you cover available sources and identify key content that is relevant to your research area.
- How suitable and persuasive your research objectives and research questions are.

Critical Appraisal (25%)
- How well you evaluate and comment on the literature in your research area. This includes identifying logical connections between sources and your own research project.

Organisation of ideas (25%)
- How well you organise your ideas to address your key research themes and assessment objectives.
- How well you structure your ideas using linking expressions, cohesive devices and narrative synthesis, as well as section headings and order.

Academic Expression (10%)
- How well you present your ideas in academic English and use relevant terminology for your research area.

Academic Integrity (15%)
- How well you follow academic conventions relating to appropriate register, paraphrasing and referencing.
- How genuine, accurate and precise the data/facts presented are.

You will receive a (%) grade for each criterion.
The overall assessment grade will be averaged from the five criteria, and the overall mark will be a percentage. You must achieve a minimum of 40% to pass this assessment.

How will you get feedback? Your tutor will grade the assessment and provide feedback for each criterion. Students will usually receive assessment feedback on Turnitin two weeks after the assessment submission deadline, unless affected by holidays, term breaks or other circumstances.
