For Spring 2020 you may pick one of the three analyses described below. Whatever score you obtain on your chosen analysis will count for all three planned analyses. Because this policy adds variability to your grade, we will allow for two submissions:
Your analysis will be graded after the first submission, if received before the deadline above. After feedback, you may re-submit before the second submission deadline. Your grade will be whichever score is higher, which will likely be the second submission. You are not required to make two submissions.
Because May 14 is nearly the last day of finals, no late submissions can be accepted.
For this analysis, do the following:
Please review the IMRAD Cheat Sheet developed by the Carnegie Mellon University Global Communication Center.
Even though it is the first thing to appear in the report, the abstract should be the last thing that you write. Generally the abstract should serve as a summary of the entire report. Reading only the abstract, the reader should have a good idea about what to expect from the rest of the document. Abstracts can vary greatly in length, but a good heuristic is to use one sentence for each of the main IMRAD sections.
The introduction should discuss the “why” of your analysis and briefly the “what” of your data. Essentially, you need to motivate the analysis you are about to do. Why does this analysis need to be done? What is its goal? The introduction should also provide enough background on the subject area for a reader to understand your analysis. Do not assume your reader knows anything about the subject area that your data comes from. If the reader does not understand your data, there is no way the reader will understand your motivation.
Since data is provided for you, but not a scenario, you can create any reasonable scenario that you would like.
You do not need to provide a complete data dictionary in the introduction, but you should include one in the appendix. Often the data would be introduced in the Methods section, but here the data is very closely linked to the motivation of the analysis. It at least needs to be introduced in the introduction.
Consider including some exploratory data analysis here, and providing some of it to the reader in the report if you feel it helps present and introduce the data.
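For example, a first look might be as simple as the following sketch (shown here with the credit card subset described later; `dat` is a placeholder name, and the same checks apply to any of the three datasets):

```r
# load packages
library("tidyverse")

# read one of the provided datasets
dat = read_csv("https://stat432.org/data/cc-sub.csv")

# quick structural overview: size, column types, missingness
dim(dat)
glimpse(dat)
colSums(is.na(dat))

# distribution of the response
count(dat, Class)
```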
The methods section should discuss what you did. The methods that you are using are those learned in class. This section should contain the bulk of your “work,” and most of the R code used to generate the results. Your R code is not expected to be perfect, idiomatic R, but it is expected to be understood by a reader without too much effort. The majority of your code should be suppressed from the final report, but consider displaying code that helps illustrate the analysis you performed, for example, the training of models.
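One way to manage this, as a sketch: keep the global `echo = FALSE` default from the setup chunk in the template below, and override `echo` only for the few chunks worth displaying. (The chunk name `fit-model` here is only an illustration.)

```r
# global default in the setup chunk hides all code in the rendered report
knitr::opts_chunk$set(echo = FALSE)

# to display a single chunk, override echo in that chunk's header, e.g.
# ```{r, fit-model, echo = TRUE}
# (model training code here)
# ```
```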
Consider adding subsections in this section. One potential set of subsections could be data and models. The data subsection would describe your data in detail. What are the rows? What are the columns? Which columns will be used as the response and the features? What is the source of the data? How will it be used in performing your analysis? What, if any, preprocessing have you done to it? The models subsection would describe the modeling methods that you will consider, as well as strategies for comparison and evaluation.
Your goal is not to use as many methods as possible. Your task is to use appropriate methods to accomplish the stated goal of the analysis.
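For example, a minimal train-test split in base R might look like the following sketch (the 80/20 proportion and the data frame name `dat` are assumptions, not requirements):

```r
# hold out 20% of the rows as a test set
set.seed(42)
trn_idx = sample(nrow(dat), size = round(0.8 * nrow(dat)))
dat_trn = dat[trn_idx, ]
dat_tst = dat[-trn_idx, ]
```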
The results section should contain numerical or graphical summaries of your results. What are the results of applying your chosen methods? Consider reporting a “final” or “best” model you have chosen. There is not necessarily one, singular correct model, but certainly some methods and models are better than others in certain situations. The results section is about reporting your results. In addition to tables or graphics, state the results in plain English.
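As a hypothetical illustration (the exact `predict()` arguments depend on which model class you fit, and `mod` is a placeholder for whatever model you chose), reporting a test metric together with a confusion matrix might look like:

```r
# class predictions on held-out data from some fitted model `mod`
pred = predict(mod, dat_tst, type = "class")

# overall test accuracy
mean(pred == dat_tst$Class)

# confusion matrix, to see which kinds of errors are being made
table(predicted = pred, actual = dat_tst$Class)
```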
The discussion section should contain discussion of your results. That is, the discussion section is used for commenting on your results. This should also frame your results in the context of the data. What do your results mean? Why would someone care about these results? Results are often just numbers; here you need to explain what they tell you about the analysis you are performing. The results section tells the reader what the results are. The discussion section tells the reader why those results matter.
The appendix section should contain any additional code, tables, and graphics that are not explicitly referenced in the narrative of the report. The appendix should contain a data dictionary.
Consider using the following code as a template for your R Markdown document.
---
title: "Title of Analysis Goes Here"
author: "Your Name (netid@illinois.edu)"
date: "Insert Date Here"
output:
  html_document:
    theme: default
    toc: yes
---
```{r, setup, include = FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.align = 'center')
```
```{r, load-packages, include = FALSE}
# load packages
```
***
# Abstract
> Abstract text goes here.
***
# Introduction
***
# Methods
## Data
## Modeling
***
# Results
***
# Discussion
***
# Appendix
***
Submit a `.zip` file to Compass that contains:

- a `.Rmd` file that is your IMRAD report.
- a `.html` file that is the result of knitting your `.Rmd` file.

Optionally, your `.zip` file may contain two additional directories:

- `/data` which contains the data if it is not loaded directly from the web. May also contain any external data used.
- `/img` which contains any external images not generated using R.

The `.zip` file should contain no other files. No submitted files should contain any spaces in any of the filenames. There are no formal filename requirements, but something like `analysis-netid.Rmd` would be appropriate. Consider using an RStudio Project with the file structure described above.

Submit your `.zip` file to the correct assignment on Compass2g. You are granted an unlimited number of submissions. Your “last” submission before each of the deadlines will be graded.
The analysis will be graded out of 20 points. Each of the following criteria will be worth two points. A score of 0, 1, or 2 is possible within each criterion:

- **0**: Criterion is largely ignored or incorrect.
- **1**: Criterion is only partially satisfied.
- **2**: Criterion is met.

The 20 points for grading (0-1-2):

- `[eye-test]` Final rendered document passes the “eye test.” Most code should be suppressed in the report but remain available in the `.Rmd` file. Only make code visible in the final report if it is a short, concise, easy to understand way to communicate what you have done.
- `[code-style]` R code and R Markdown follow suggested STAT 432 style guidelines, in particular the tidyverse style guide.
- `[imrad-style]` Document is well written and follows the IMRAD template.
- `[data-exp]` Data is well explained.
- `[pred-just]` There is a clear (context driven) justification for making predictions about the response variable.
- `[feat-just]` There is a clear (context driven) justification for using the features considered.
- `[train-test]` Train and test data are used for appropriate tasks.
- `[causation]` No causal claims are made.
- `[result-scrutiny]` Any “chosen” model is scrutinized beyond a single numeric metric.
- `[issues]` Potential shortcomings of the analysis are made clear to the reader.
The instructor and graders reserve the right to apply additional deductions for submissions that are extremely poor, containing so little content that it cannot be evaluated based on the above criteria.
Use the given data to detect credit card fraud.

- Data (subset): https://stat432.org/data/cc-sub.csv
- Data (full): https://stat432.org/data/creditcard.csv.gz
  - Use `readr::read_csv("https://stat432.org/data/creditcard.csv.gz")` to read the full data directly.
- `credit.html`
- `credit.Rmd`

Please refer to the source documentation for information on data collection and a data dictionary. The response was altered:

- `0` is now labeled `genuine`
- `1` is now labeled `fraud`

Because of the size of this dataset, a subset has been provided. You may use either the subset or the full dataset. The following code is available to show how the data subset was created, but please use the `.csv` linked above. If you choose to use the full data, you will need to run the line below that alters the response from `0` and `1` to `genuine` and `fraud`, unless you prefer `0` and `1`.
```r
# load packages
library("tidyverse")

# extract file obtained from Kaggle
# https://www.kaggle.com/mlg-ulb/creditcardfraud
unzip("creditcardfraud.zip")

# create remote readable compressed file
system("gzip creditcard.csv")

# read from gz file
cc = read_csv("creditcard.csv.gz")

# verify data
nrow(cc) == 284807

# make response a factor with names instead of numbers
cc$Class = factor(ifelse(cc$Class == 0, "genuine", "fraud"))

# create data subset
set.seed(42)
sub_idx = sample(nrow(cc), size = 50000)
cc_sub = cc[sub_idx, ]

# write subset to disk
write_csv(cc_sub, "cc-sub.csv")
```
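If you use the provided subset, a minimal sketch for reading it directly (note that `read_csv()` reads `Class` as character, so coercing it to a factor is assumed here):

```r
# read the provided subset and coerce the response to a factor
cc_sub = readr::read_csv("https://stat432.org/data/cc-sub.csv")
cc_sub$Class = factor(cc_sub$Class)
levels(cc_sub$Class)  # "fraud" "genuine" (alphabetical by default)
```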
Use the given data to detect the presence of heart disease.

- Data: https://stat432.org/data/heart-disease.csv
- `heart.html`
- `heart.Rmd`

The response variable `num` has five levels. The documentation does not appear to be extremely clear about this, so we will assume that the levels mean the following:

- `v0`: 0 vessels with greater than 50% diameter narrowing. (No presence of heart disease.)
- `v1`: 1 vessel with greater than 50% diameter narrowing. (Some presence of heart disease.)
- `v2`: 2 vessels with greater than 50% diameter narrowing. (Some presence of heart disease.)
- `v3`: 3 vessels with greater than 50% diameter narrowing. (Some presence of heart disease.)
- `v4`: 4 vessels with greater than 50% diameter narrowing. (Some presence of heart disease.)

In other words, the response variable `num` is the number of vessels with greater than 50% diameter narrowing.
The following code is available to show how the data was created. (Note that some pre-processing has been performed.) You may use either the provided `.csv` or the source data from the `ucidata` package.
```r
# load packages
library("tidyverse")

# install.packages("devtools")
# devtools::install_github("coatless/ucidata")

# load data for each location from the ucidata package
hd_ch = as_tibble(ucidata::heart_disease_ch)
hd_cl = as_tibble(ucidata::heart_disease_cl)
hd_hu = as_tibble(ucidata::heart_disease_hu)
hd_va = as_tibble(ucidata::heart_disease_va)

# add location variable for each dataset
hd_ch$location = "ch"
hd_cl$location = "cl"
hd_hu$location = "hu"
hd_va$location = "va"

# determine number of NA values in each column of "combined" dataset
# bind_rows(hd_ch, hd_cl, hd_hu, hd_va) %>%
#   mutate_all(is.na) %>%
#   summarise_all(sum)

# combine the four locations into one dataset
# remove columns with large proportion of NA values (reasonable in practice)
# remove remainder of rows with NA values (not the best idea in practice)
hd = bind_rows(hd_ch, hd_cl, hd_hu, hd_va) %>%
  select(-slope, -ca, -thal) %>%
  na.omit()

# coerce location variable to factor
# may need to do this again after reading data
hd$location = factor(hd$location)

# re-define response variable (names will play better with caret)
hd$num = factor(case_when(
  hd$num == 0 ~ "v0",
  hd$num == 1 ~ "v1",
  hd$num == 2 ~ "v2",
  hd$num == 3 ~ "v3",
  hd$num == 4 ~ "v4"
))

# write to disk
# write_csv(hd, "data/heart-disease.csv")
```
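As the comments above note, factor coercion does not survive the round trip through `write_csv()` and `read_csv()`, so a sketch for reading the provided file might look like:

```r
# read the provided data and re-coerce the factor variables
hd = readr::read_csv("https://stat432.org/data/heart-disease.csv")
hd$location = factor(hd$location)
hd$num = factor(hd$num)
```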
Use the given data to predict the quality of wine.

- Data is available in the `ucidata` package. (See below.)

```r
# load packages
library("tidyverse")

# install.packages("devtools")
# devtools::install_github("coatless/ucidata")

# view data documentation in R
?ucidata::wine

# "load" data
wine = as_tibble(ucidata::wine)
```
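Before modeling, a quick structural check is worthwhile; this is just a sketch, and `?ucidata::wine` documents the variables:

```r
# load packages (for glimpse)
library("tidyverse")

# basic checks on size, types, and variable ranges
dim(wine)
glimpse(wine)
summary(wine)
```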
How long should the report be?