A bit about the data:
A company that manufactures riding mowers wants to identify the best sales prospects for an intensive sales campaign. In particular, the manufacturer is interested in classifying households as prospective owners or nonowners on the basis of income (in $1000s) and lot size (in 1000 ft²). The marketing expert looked at a random sample of 24 households, given in the file RidingMowers.csv.
First, we read the data into R and look at the first 6 rows to get a sense of the structure of the information.
Question 1: How many quantitative (numeric) variables are there? Provide the minimum, maximum, median, mean, and standard deviation for each quantitative variable. [2 pts] Answer:
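A minimal sketch of this step, assuming the file sits in the working directory. The column names `Income`, `Lot_Size`, and `Ownership` and the simulated fallback data are assumptions for illustration only; check them against the actual file.

```r
# Read the data if the file is available; otherwise build a small simulated
# stand-in so the sketch still runs (the simulated values are NOT the real data).
if (file.exists("RidingMowers.csv")) {
  mowers <- read.csv("RidingMowers.csv", stringsAsFactors = TRUE)
} else {
  set.seed(1)
  mowers <- data.frame(
    Income    = c(rnorm(12, 80, 15), rnorm(12, 58, 15)),
    Lot_Size  = c(rnorm(12, 20, 2), rnorm(12, 17.5, 2)),
    Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
  )
}

head(mowers)     # first 6 rows
str(mowers)      # variable types (numeric vs. factor)
summary(mowers)  # min, median, mean, max per numeric column; counts per factor level
```

`summary()` does not report standard deviations; `sapply(mowers[sapply(mowers, is.numeric)], sd)` is one way to get them.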
Question 2: How many qualitative variables are there and what are the levels of each of those variables? What percentage of households in the study were owners of a riding mower? [2 pts] Answer:
Question 3: Create a data visualization of a scatterplot of lot size (x-axis) versus income (y-axis), color-coded by the outcome variable, Ownership (paste your visualization at the end of this worksheet). (a) Describe the potential relationship(s) of ownership to lot size and income. (b) From the data visualization, which class seems to have the higher average income, owners or nonowners? [2 pts for visualization, 1 pt for (a), & 1 pt for (b)] Answer:
Using all the data, fit a logistic model of ownership on the two explanatory variables (Lot Size and Income, no interaction term).
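One possible way to fit this model in R is sketched below. The data frame is a simulated stand-in with assumed column names (`Income`, `Lot_Size`, `Ownership`); replace it with the data read from RidingMowers.csv.

```r
# Simulated stand-in for the real data (assumed column names, NOT the real values).
set.seed(1)
mowers <- data.frame(
  Income    = c(rnorm(12, 80, 15), rnorm(12, 58, 15)),
  Lot_Size  = c(rnorm(12, 20, 2), rnorm(12, 17.5, 2)),
  Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
)

# glm() with family = binomial models the probability of the SECOND factor level.
# Levels are sorted alphabetically here, so the model is for P(Ownership = "Owner").
levels(mowers$Ownership)

fit <- glm(Ownership ~ Income + Lot_Size, data = mowers, family = binomial)
summary(fit)  # coefficients, standard errors, null and residual deviance
```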
Question 4: Use the output from the logistic model to perform the Likelihood Ratio Test, comparing the model with Lot Size and Income to the null (no explanatory variable) model. State the hypotheses, test statistic, p-value, and the conclusion for this test. [1 pt for hypotheses, 1 pt for test statistic, 1 pt for p-value, & 1 pt for conclusion] Answer:
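As a reminder of the general form of this test (the symbols below are notation introduced here, not taken from the worksheet), the statistic is the drop in deviance from the null model to the fitted model:

```latex
H_0:\ \beta_{\text{Income}} = \beta_{\text{Lot Size}} = 0
\qquad \text{vs.} \qquad
H_A:\ \text{at least one } \beta_j \neq 0

G \;=\; D_{\text{null}} - D_{\text{residual}}
  \;=\; -2\,\bigl[\ell_{\text{null}} - \ell_{\text{model}}\bigr]
  \;\overset{H_0}{\sim}\; \chi^2_{2},
\qquad
p\text{-value} \;=\; P\bigl(\chi^2_{2} \ge G\bigr)
```

The 2 degrees of freedom come from the two explanatory variables added to the null model. In R, both deviances appear at the bottom of the `summary()` output of a fitted `glm`.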
Question 5: Perform the Wald test for each individual coefficient. State the general hypotheses for this test, then complete the table below and draw a conclusion about each coefficient in the model. [2 pts for table, 1 pt for hypotheses, & 1 pt per conclusion]

             Estimate    Std Error   Z value   Pr(>|z|)
Intercept    -25.9382    11.4871     -2.258    0.0239
Income       ______      0.0543      ______    0.0412
Lot Size     0.9638      ______      2.038     ______

Answer:
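The relationship between the table columns can be written out as follows; the Intercept row, which is fully given, serves as a check. The notation is introduced here for illustration.

```latex
z_j \;=\; \frac{\hat\beta_j}{\operatorname{SE}(\hat\beta_j)},
\qquad
p_j \;=\; 2\,\bigl(1 - \Phi(|z_j|)\bigr)

\text{Intercept:}\quad
z \;=\; \frac{-25.9382}{11.4871} \;\approx\; -2.258,
\qquad
p \;=\; 2\,\bigl(1 - \Phi(2.258)\bigr) \;\approx\; 0.0239
```

Any two of estimate, standard error, and z value determine the third, and the z value determines the p-value.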
Question 6: For each explanatory variable, calculate the odds ratio and provide an interpretation of each. [1 pt for the odds ratios & 1 pt per interpretation] OR(Income)=
Question 7: (a) Using a threshold cutoff of 50% to decide between owner and nonowner, what is the percentage of households classified correctly among nonowners? (b) To increase the percentage of correctly classified nonowners, should the cutoff probability be increased or decreased? Why? If you concluded an increase or decrease would help, what would be a potential new cutoff? [(a) 1 pt (b) 2 pts] Answer:
Question 8: Using a threshold of 50% for predicting outcomes, create a confusion table for the predicted values versus the true classes. Based on this table, calculate the accuracy, the precision/positive predictive value, sensitivity, and the specificity of this logistic model. [4 pts] Accuracy =
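For reference, the odds ratio for a coefficient in a logistic model is its exponential. The notation below is introduced here; $c$ is a hypothetical number of units chosen by the analyst.

```latex
\mathrm{OR}_j \;=\; e^{\hat\beta_j},
\qquad
\frac{\text{odds}(x_j + c)}{\text{odds}(x_j)} \;=\; e^{\,c\,\hat\beta_j}
\quad \text{(other variables held fixed)}
```

Since income is measured in $1000s and lot size in 1000 ft², a one-unit increase corresponds to an extra $1000 of income or an extra 1000 ft² of lot, respectively.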
Specificity =
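The four measures in Question 8 follow from the cells of the confusion table. The sketch below assumes "owner" is treated as the positive class, which the worksheet does not state explicitly; TP, FP, TN, and FN denote true/false positives and negatives.

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\text{Precision (PPV)} = \frac{TP}{TP + FP}

\text{Sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = \frac{TN}{TN + FP}
```

Specificity is the percentage of nonowners classified correctly, which is the quantity asked about in Question 7(a).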
Additionally, we would like to use either linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA) as a different way to classify households. To decide which method is more appropriate, check the assumptions needed for LDA and QDA.
Question 9: Based on your exploration of LDA and QDA assumptions, is LDA or QDA more appropriate? Explain why (e.g. what was the assumption and what was your conclusion)? [2 pts] Answer:
Question 10: Using the method you selected in the previous question, calculate the model. (Assume multivariate normality of the variables is true for this data.) Create an appropriate data visualization of a scatterplot of lot size versus income with the predicted classification (paste this at the end of this worksheet). Do we see a general trend in how this classification method separated the groups? [3 pts] Answer:
Question 11: Create a confusion table for the predicted classes versus the true classes. Based on this table, calculate the accuracy and the precision/positive predictive value of this discriminant model. [2 pts] Accuracy =
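A rough sketch of how the discriminant step might look in R, using `lda()` from the MASS package (swap in `qda()` if the group covariance matrices look clearly different). The data frame is a simulated stand-in with assumed column names, and the covariance comparison is only an informal check, not a formal test.

```r
library(MASS)  # provides lda() and qda()

# Simulated stand-in for the real data (assumed column names, NOT the real values).
set.seed(1)
mowers <- data.frame(
  Income    = c(rnorm(12, 80, 15), rnorm(12, 58, 15)),
  Lot_Size  = c(rnorm(12, 20, 2), rnorm(12, 17.5, 2)),
  Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
)

# Informal look at the equal-covariance assumption: one covariance matrix per group.
by(mowers[, c("Income", "Lot_Size")], mowers$Ownership, cov)

# Fit the discriminant model and get the predicted class for each household.
lda_fit  <- lda(Ownership ~ Income + Lot_Size, data = mowers)
lda_pred <- predict(lda_fit)$class

# Confusion table: predicted versus true classes.
table(Predicted = lda_pred, Actual = mowers$Ownership)

# Scatterplot colored by predicted class; crosses mark misclassified households.
plot(mowers$Lot_Size, mowers$Income,
     col = ifelse(lda_pred == "Owner", "blue", "red"),
     pch = ifelse(lda_pred == mowers$Ownership, 19, 4),
     xlab = "Lot size (1000 ft^2)", ylab = "Income ($1000s)",
     main = "Predicted class from discriminant analysis")
```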
Question 12: We have now performed 2 different classification methods: a logistic model and a discriminant analysis. We want to determine which classification method was better. (a) Create a data visualization of the ROC curves of both methods on the same graphic (paste this visualization in the Appendix). (b) Calculate the AUC (area under the curve) for each method. (c) Based on our ROC curves and the AUCs for each method, which method is better for this classification problem and why? [2 pts for (a), 2 pts for (b), & 2 pts for (c)] Answer:
(b) AUC for Logistic =
AUC for Discriminant method =
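The ROC comparison can be sketched in base R without extra packages, as below. The helper functions `roc_points()` and `auc_rank()` are hypothetical names written for this illustration, the data frame is a simulated stand-in with assumed column names, and LDA is used only as an example of the discriminant method. The AUC is computed from the rank-sum (Mann-Whitney) identity.

```r
library(MASS)

# Simulated stand-in for the real data (assumed column names, NOT the real values).
set.seed(1)
mowers <- data.frame(
  Income    = c(rnorm(12, 80, 15), rnorm(12, 58, 15)),
  Lot_Size  = c(rnorm(12, 20, 2), rnorm(12, 17.5, 2)),
  Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
)
is_owner <- mowers$Ownership == "Owner"

# Predicted probabilities of "Owner" from each method.
glm_fit <- glm(Ownership ~ Income + Lot_Size, data = mowers, family = binomial)
lda_fit <- lda(Ownership ~ Income + Lot_Size, data = mowers)
p_glm <- fitted(glm_fit)
p_lda <- predict(lda_fit)$posterior[, "Owner"]

# ROC points: false and true positive rates at every observed threshold.
roc_points <- function(score, is_pos) {
  th <- sort(unique(score), decreasing = TRUE)
  data.frame(
    fpr = c(0, sapply(th, function(t) mean(score[!is_pos] >= t))),
    tpr = c(0, sapply(th, function(t) mean(score[is_pos] >= t)))
  )
}

# AUC via the rank-sum identity (midranks handle ties).
auc_rank <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

roc_glm <- roc_points(p_glm, is_owner)
roc_lda <- roc_points(p_lda, is_owner)
plot(roc_glm$fpr, roc_glm$tpr, type = "s", col = "blue",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curves: logistic vs. discriminant")
lines(roc_lda$fpr, roc_lda$tpr, type = "s", col = "red", lty = 2)
abline(0, 1, col = "grey")
legend("bottomright", legend = c("Logistic", "LDA"),
       col = c("blue", "red"), lty = c(1, 2))

c(logistic = auc_rank(p_glm, is_owner), lda = auc_rank(p_lda, is_owner))
```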
Let’s assume that the Ownership variable is not known, and we would like to cluster the observations into owner and nonowner. Since there are only two categories, we decided to fit 2 clusters. As both variables are in terms of 1000s, this analysis section should be done on the raw (not scaled) data.
Question 13: Run the kMeans clustering algorithm on Lot Size and Income with 2 clusters (run with 20 random starts). Create an appropriate detailed data visualization based on the clusters found from the kMeans method; since the true Ownership category is known, make sure to mark the correct and incorrect clustered observations in some fashion (paste this visualization in the Appendix at the end of the worksheet). Do 2 clusters seem most appropriate? How many of the observations were incorrectly classified? [3 pts] Answer:
Question 14: We decide to test several different numbers of clusters to determine the best value for k; do this with clusters from 1 to 10 and with a minimum of 10 different initial random starts. Create an exploratory data graphic with k versus the “total within ss” (paste the graphic into the Appendix). What is the number of clusters you determine best, and why? [4 pts] Answer:
Question 15: Based on the “k” you determined best, run the kMeans algorithm with 10 initial starts, then create a data visualization with this chosen number of clusters. Again, make sure to mark the true Ownership in some fashion (paste the visualization into the Appendix). Describe your results. [3 pts] Answer:
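A possible sketch of the 2-cluster run with base R's `kmeans()`. The data frame is a simulated stand-in with assumed column names. Because cluster labels are arbitrary, each cluster is matched to its majority Ownership class before flagging mismatches; that matching rule is one reasonable choice, not the only one.

```r
# Simulated stand-in for the real data (assumed column names, NOT the real values).
set.seed(1)
mowers <- data.frame(
  Income    = c(rnorm(12, 80, 15), rnorm(12, 58, 15)),
  Lot_Size  = c(rnorm(12, 20, 2), rnorm(12, 17.5, 2)),
  Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
)

set.seed(123)  # makes the random starts reproducible
km <- kmeans(mowers[, c("Lot_Size", "Income")], centers = 2, nstart = 20)

# Cross-tabulate clusters against the known classes.
tab <- table(Cluster = km$cluster, Actual = mowers$Ownership)
tab

# Label each cluster by its majority class, then flag households that disagree.
majority <- colnames(tab)[apply(tab, 1, which.max)]
correct  <- majority[km$cluster] == as.character(mowers$Ownership)
sum(!correct)  # number of households on the "wrong" side

# Color = cluster; filled circle = agrees with true class, cross = disagrees.
plot(mowers$Lot_Size, mowers$Income,
     col = km$cluster, pch = ifelse(correct, 19, 4),
     xlab = "Lot size (1000 ft^2)", ylab = "Income ($1000s)",
     main = "k-means clusters (k = 2) vs. true Ownership")
```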
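The "elbow" graphic can be produced along the following lines. This is a sketch on a simulated stand-in data frame with assumed column names; `tot.withinss` is the component of the `kmeans()` result that holds the total within-cluster sum of squares.

```r
# Simulated stand-in for the real data (assumed column names, NOT the real values).
set.seed(1)
mowers <- data.frame(
  Income    = c(rnorm(12, 80, 15), rnorm(12, 58, 15)),
  Lot_Size  = c(rnorm(12, 20, 2), rnorm(12, 17.5, 2)),
  Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
)
X <- mowers[, c("Lot_Size", "Income")]

# Total within-cluster sum of squares for k = 1, ..., 10 (10 random starts each).
set.seed(123)
ks  <- 1:10
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS",
     main = "Elbow plot for k-means")
```

The curve always decreases as k grows, so the usual reading is to look for the k after which the drop levels off.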
We also decide to use Hierarchical clustering to cluster the observations into owner and nonowner groups.
Question 16: Does the data need to be scaled? Why or why not? [2 pts] Answer:
Question 17: We choose to use Euclidean squared distance with complete and average linkage methods within the Hierarchical clustering. For each method, determine where you would “cut” the tree and explain why; specify the number of clusters that you are recommending for each method. [4 pts] Answer:
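A sketch of the two hierarchical fits using base R's `dist()`, `hclust()`, and `cutree()`. The data frame is a simulated stand-in with assumed column names; the data are left unscaled here purely for brevity (whether to scale is Question 16, and `scale(X)` can be substituted), and `k = 2` in `cutree()` is a placeholder for whatever cut you recommend.

```r
# Simulated stand-in for the real data (assumed column names, NOT the real values).
set.seed(1)
mowers <- data.frame(
  Income    = c(rnorm(12, 80, 15), rnorm(12, 58, 15)),
  Lot_Size  = c(rnorm(12, 20, 2), rnorm(12, 17.5, 2)),
  Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
)
X <- mowers[, c("Lot_Size", "Income")]

# Squared Euclidean distances between all pairs of households.
d2 <- dist(X, method = "euclidean")^2

hc_complete <- hclust(d2, method = "complete")
hc_average  <- hclust(d2, method = "average")

# Dendrograms: look for a large vertical gap between merges to choose the cut.
plot(hc_complete, main = "Complete linkage", xlab = "Household", sub = "")
plot(hc_average,  main = "Average linkage",  xlab = "Household", sub = "")

# Cut each tree into a chosen number of clusters (k = 2 is only a placeholder).
grp_complete <- cutree(hc_complete, k = 2)
grp_average  <- cutree(hc_average,  k = 2)
table(Cluster = grp_complete, Actual = mowers$Ownership)
table(Cluster = grp_average,  Actual = mowers$Ownership)
```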
Question 18: Based on your recommendations for k, create a colored dendrogram of the clustered groups as a data visualization for each method (make sure there are appropriate labels and titles) (paste this into the Appendix). [4 pts] Answer:
Question 19: Create a data visualization of the scatter plot for each of the two hierarchical algorithms (include appropriate labels, titles, etc.) (paste this into the Appendix). (a) Do there appear to be any interesting or significant findings from these two visualizations? (b) Since we know the Ownership, which method (complete or average linkage) is better? [5 pts] Answer:
[5 pts for R script]
All data visualizations that we ask for need to be included here, with a label on each graph stating which question it is for.