In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The aim is to predict the manner in which they did the exercise.
In the following, I describe the steps taken to train a predictive model.
First, the .csv files containing the training and testing data are read into R. Missing values are coded as NA.
if (!file.exists("pml-training.csv")) {
  url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  f <- file.path("./pml-training.csv")
  download.file(url, f)
}
if (!file.exists("pml-testing.csv")) {
  urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  f <- file.path("./pml-testing.csv")
  download.file(urlTest, f)
}
rawdf <- read.csv("./pml-training.csv", na.strings = c("NA", ""))
testdata <- read.csv("./pml-testing.csv", na.strings = c("NA", ""))
In the next step, I check the proportion of missing values (NAs) in each column.
propNA <- colMeans(is.na(rawdf))
table(propNA)
## propNA
##                 0 0.979308938946081
##                60               100
There are 100 columns in which almost all values (97.93%) are missing. If a column contains a large number of NAs, it will not be of great use for training the model. Hence, these columns will be removed. Only the columns without any NAs will be kept.
idx <- !propNA  # TRUE for columns without any NAs
rawReduced <- rawdf[idx]
testReduced <- testdata[idx]
There are further unnecessary columns that can be removed. The column X contains the row numbers, and the column user_name contains the name of the user. Of course, these variables cannot be predictors for the type of exercise.
Furthermore, the three columns containing time stamps (raw_timestamp_part_1, raw_timestamp_part_2, and cvtd_timestamp) will not be used.
The columns new_window and num_window are not related to the sensor data. They will be removed too.
idx <- grep("^X$|user_name|timestamp|window", names(rawdf))
rawdataReduced2 <- rawReduced[-idx]
testDataReduced2 <- testReduced[-idx]
Now, the dataset contains one outcome column (classe) and 52 feature columns. The function createDataPartition from the caret package is used to split the data into a training set and a cross-validation set. Here, 70% of the data goes into the training set.
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Warning: package 'ggplot2' was built under R version 3.3.2
intrain <- createDataPartition(rawdataReduced2$classe, p = 0.7, list = FALSE)
The index intrain is used to split the data.
training <- rawdataReduced2[intrain,]
# The number of rows in training dataset
nrow(training)
## [1] 13737
crossval <- rawdataReduced2[-intrain,]
# The number of rows in cross validation dataset
nrow(crossval)
## [1] 5885
I used the random-forest technique to generate a predictive model. In total, 10 models were trained. I varied the parameters passed to trainControl and specified models with both bootstrapping (method = "boot") and cross-validation (method = "cv").
Training all models took more than a day. Afterwards, I tested their performance on the cross-validation dataset. It turned out that all models performed well (accuracy above 99%), although their training times differed considerably.
Because of the similar performance, I present the model with the shortest training time.
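The exact resampling configurations of the other nine models are not listed here; as an illustration, alternatives of the kind described above could be specified like this (the fold and resample counts are assumptions, not the ones actually used):

```r
# Hypothetical resampling schemes for comparison; the counts below
# are illustrative assumptions, not the configurations from the report.
ctrlBoot <- trainControl(method = "boot", number = 25)  # bootstrapping, 25 resamples
ctrlCv5  <- trainControl(method = "cv",   number = 5)   # 5-fold cross-validation
ctrlCv10 <- trainControl(method = "cv",   number = 10)  # 10-fold cross-validation
```

Fewer folds or resamples generally shorten training time, which is why the 2-fold setup below trained fastest.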
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.2
trControl <- trainControl(method = "cv", number = 2)
modFit <- train(classe ~ ., data = training, method = "rf", prox = TRUE, trControl = trControl)
First, the final model is used to predict the outcome in the cross-validation dataset.
pred <- predict(modFit, newdata = crossval)
Second, the function confusionMatrix is used to calculate the accuracy of the prediction.
coMa <- confusionMatrix(pred, reference = crossval$classe)
acc <- coMa$overall["Accuracy"]
acc
## Accuracy
## 0.9938828
The accuracy of the prediction is 99.39%. Hence, the estimated out-of-sample error is 0.61%.
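This follows directly from the accuracy computed above; as a quick check:

```r
# Estimated out-of-sample error = 1 - accuracy on the held-out set
oose <- as.numeric(1 - acc)  # approximately 0.0061, i.e. 0.61%
oose
```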
The five most important variables in the model and their relative importance values are:
vi <- varImp(modFit)$importance
vi[head(order(unlist(vi), decreasing = TRUE), 5L), , drop = FALSE]
## Overall
## roll_belt 100.00000
## pitch_forearm 60.82071
## yaw_belt 50.03213
## magnet_dumbbell_z 45.23761
## magnet_dumbbell_y 44.86509
Finally, let's apply the model to the testDataReduced2 dataset.
pred <- predict(modFit, newdata = testDataReduced2)
pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E