In this project, I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal is to predict the manner in which they did the exercise.
In the following, I describe the steps taken to train a predictive model.
First, the .csv file containing the training data is read into R. Here, unavailable values are set to NA.
# Download the data files if they are not already present
if (!file.exists("pml-training.csv")) {
    url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
    f <- file.path("./pml-training.csv")
    download.file(url, f)
}
if (!file.exists("pml-testing.csv")) {
    urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
    f <- file.path("./pml-testing.csv")
    download.file(urlTest, f)
}
# Read the data; treat both "NA" and empty strings as missing values
rawdf <- read.csv("./pml-training.csv", na.strings = c("NA", ""))
testdata <- read.csv("./pml-testing.csv", na.strings = c("NA", ""))
In the next step, I check the proportion of missing values (NAs) in the columns.
propNA <- colMeans(is.na(rawdf))
table(propNA)
## propNA
##                 0 0.979308938946081 
##                60               100 
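As a quick sanity check, one could count the columns on each side of this split. This is a minimal sketch; the expected counts are simply the ones shown in the table above.
sum(propNA == 0)    # columns without any missing values: 60
sum(propNA > 0.95)  # columns that are almost entirely NA: 100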
There are 100 columns in which almost all values (97.93%) are missing. If a column contains a large number of NAs, it will not be of great use for training the model. Hence, these columns will be removed. Only the columns without any NAs will be kept.
# Keep only the columns without any missing values
idx <- !propNA
rawReduced <- rawdf[idx]
testReduced <- testdata[idx]
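To confirm that no missing values survived the reduction, one could run a simple check (a minimal sketch; stopifnot aborts if the condition fails):
# Verify that the reduced training data contains no missing values
stopifnot(!anyNA(rawReduced))
ncol(rawReduced)  # 60 columns remain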
There are further unnecessary columns that can be removed. The column X contains the row numbers, and the column user_name contains the name of the user. Of course, these variables cannot be predictors for the type of exercise. Furthermore, the three columns containing time stamps (raw_timestamp_part_1, raw_timestamp_part_2, and cvtd_timestamp) will not be used. The columns new_window and num_window are not related to the sensor data, so they will be removed too.
# Drop identifier, timestamp, and window columns
idx <- grep("^X$|user_name|timestamp|window", names(rawReduced))
rawdataReduced2 <- rawReduced[-idx]
testDataReduced2 <- testReduced[-idx]
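At this stage it is worth confirming the dimensions (a minimal check; the expected counts follow from the 60 complete columns minus the 7 columns dropped above):
dim(rawdataReduced2)   # expected: 19622 rows, 53 columns
dim(testDataReduced2)  # expected: 20 rows, 53 columns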
Now, the dataset contains one outcome column (classe) and 52 feature columns. The function createDataPartition of the caret package is used to split the data into a training set and a cross-validation set. Here, 70% of the data goes into the training set.
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Warning: package 'ggplot2' was built under R version 3.3.2
intrain <- createDataPartition(rawdataReduced2$classe, p = 0.7, list = FALSE)
The index intrain is used to split the data.
training <- rawdataReduced2[intrain, ]
# The number of rows in the training dataset
nrow(training)
## [1] 13737
crossval <- rawdataReduced2[-intrain, ]
# The number of rows in the cross-validation dataset
nrow(crossval)
## [1] 5885
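The split can be verified against the requested proportion (a one-line check using the row counts above):
# Fraction of rows in the training set; should be close to 0.7
nrow(training) / nrow(rawdataReduced2)  # 13737 / 19622, approximately 0.7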
I used the random-forest technique to generate a predictive model. In total, 10 models were trained. I experimented with the parameters passed to trControl and specified different models with bootstrapping (method = "boot") and cross-validation (method = "cv"). It took more than one day to train all models. Afterwards, I tested their performance on the cross-validation dataset. All models performed well (accuracy above 99%), although their training times differed considerably. Due to the similar performance, I will present the model with the shortest training time.
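To compare configurations by training time, one could wrap each call to train in system.time. The following is a minimal sketch, not the exact code used; the number settings are illustrative assumptions, and the fitted models are discarded here for brevity.
# Illustrative timing comparison of two resampling configurations
ctrlBoot <- trainControl(method = "boot", number = 5)  # assumed setting
ctrlCV   <- trainControl(method = "cv", number = 2)
timeBoot <- system.time(train(classe ~ ., data = training, method = "rf", trControl = ctrlBoot))
timeCV   <- system.time(train(classe ~ ., data = training, method = "rf", trControl = ctrlCV))
timeBoot["elapsed"]
timeCV["elapsed"]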
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.2
# 2-fold cross-validation keeps the training time short
trControl <- trainControl(method = "cv", number = 2)
modFit <- train(classe ~ ., data = training, method = "rf", prox = TRUE, trControl = trControl)
First, the final model is used to predict the outcome in the cross-validation dataset.
pred <- predict(modFit, newdata = crossval)
Second, the function confusionMatrix is used to calculate the accuracy of the prediction.
coMa <- confusionMatrix(pred, reference = crossval$classe)
acc <- coMa$overall["Accuracy"]
acc
## Accuracy
## 0.9938828
The accuracy of the prediction is 99.39%. Hence, the estimated out-of-sample error is 0.61%.
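This error estimate follows directly from the accuracy computed above:
# Estimated out-of-sample error = 1 - accuracy
unname(1 - acc)  # approximately 0.0061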
The five most important variables in the model and their relative importance values are:
vi <- varImp(modFit)$importance
vi[head(order(unlist(vi), decreasing = TRUE), 5L), , drop = FALSE]
## Overall
## roll_belt 100.00000
## pitch_forearm 60.82071
## yaw_belt 50.03213
## magnet_dumbbell_z 45.23761
## magnet_dumbbell_y 44.86509
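The importance values can also be inspected graphically; a minimal sketch using caret's plot method for varImp objects:
# Plot the five most important predictors
plot(varImp(modFit), top = 5)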
Now let's apply the model to the testDataReduced2 dataset.
pred <- predict(modFit, newdata = testDataReduced2)
pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E