In this project, I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal is to predict the manner in which they did the exercise.
In the following, I describe the steps taken to train a predictive model.
First, the .csv file containing the training data is read into R. Here, unavailable values are set to NA.
# Download the data files if they are not already present
if (!file.exists("pml-training.csv")) {
    url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
    f <- file.path("./pml-training.csv")
    download.file(url, f)
}
if (!file.exists("pml-testing.csv")) {
    urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
    f <- file.path("./pml-testing.csv")
    download.file(urlTest, f)
}
# Read the data; treat both "NA" and empty strings as missing values
rawdf <- read.csv("./pml-training.csv", na.strings = c("NA", ""))
testdata <- read.csv("./pml-testing.csv", na.strings = c("NA", ""))
In the next step, I check the proportion of missing values (NAs) in the columns.
propNA <- colMeans(is.na(rawdf))
table(propNA)
## propNA
##                 0 0.979308938946081 
##                60               100 
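As a quick sanity check, one could count the columns on each side of this split. This is a minimal sketch; the expected counts are simply the ones shown in the table above.
sum(propNA == 0)    # columns without any missing values: 60
sum(propNA > 0.95)  # columns that are almost entirely NA: 100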
There are 100 columns in which almost all values (97.93%) are missing. If a column contains a large number of NAs, it will not be of great use for training the model. Hence, these columns will be removed. Only the columns without any NAs will be kept.
# Keep only the columns without any missing values
idx <- !propNA
rawReduced <- rawdf[idx]
testReduced <- testdata[idx]
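To confirm that no missing values survived the reduction, one could run a simple check (a minimal sketch; stopifnot aborts if the condition fails):
# Verify that the reduced training data contains no missing values
stopifnot(!anyNA(rawReduced))
ncol(rawReduced)  # 60 columns remain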
There are further unnecessary columns that can be removed. The column X contains the row numbers, and the column user_name contains the name of the user. Of course, these variables cannot be predictors for the type of exercise. Furthermore, the three columns containing time stamps (raw_timestamp_part_1, raw_timestamp_part_2, and cvtd_timestamp) will not be used. The columns new_window and num_window are not related to the sensor data, so they will be removed too.
# Drop identifier, timestamp, and window columns
idx <- grep("^X$|user_name|timestamp|window", names(rawReduced))
rawdataReduced2 <- rawReduced[-idx]
testDataReduced2 <- testReduced[-idx]
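At this stage it is worth confirming the dimensions (a minimal check; the expected counts follow from the 60 complete columns minus the 7 columns dropped above):
dim(rawdataReduced2)   # expected: 19622 rows, 53 columns
dim(testDataReduced2)  # expected: 20 rows, 53 columns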
Now, the dataset contains one outcome column (classe) and 52 feature columns. The function createDataPartition of the caret package is used to split the data into a training set and a cross-validation set. Here, 70% of the data goes into the training set.
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Warning: package 'ggplot2' was built under R version 3.3.2
intrain <- createDataPartition(rawdataReduced2$classe, p = 0.7, list = FALSE)
The index intrain is used to split the data.
training <- rawdataReduced2[intrain, ]
# The number of rows in the training dataset
nrow(training)
## [1] 13737
crossval <- rawdataReduced2[-intrain, ]
# The number of rows in the cross-validation dataset
nrow(crossval)
## [1] 5885
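The split can be verified against the requested proportion (a one-line check using the row counts above):
# Fraction of rows in the training set; should be close to 0.7
nrow(training) / nrow(rawdataReduced2)  # 13737 / 19622, approximately 0.7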
I used the random-forest technique to generate a predictive model. In total, 10 models were trained. I experimented with the parameters passed to trControl and specified different models with bootstrapping (method = "boot") and cross-validation (method = "cv"). It took more than one day to train all models. Afterwards, I tested their performance on the cross-validation dataset. All models performed well (accuracy above 99%), although their training times differed considerably. Due to the similar performance, I will present the model with the shortest training time.
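To compare configurations by training time, one could wrap each call to train in system.time. The following is a minimal sketch, not the exact code used; the number settings are illustrative assumptions, and the fitted models are discarded here for brevity.
# Illustrative timing comparison of two resampling configurations
ctrlBoot <- trainControl(method = "boot", number = 5)  # assumed setting
ctrlCV   <- trainControl(method = "cv", number = 2)
timeBoot <- system.time(train(classe ~ ., data = training, method = "rf", trControl = ctrlBoot))
timeCV   <- system.time(train(classe ~ ., data = training, method = "rf", trControl = ctrlCV))
timeBoot["elapsed"]
timeCV["elapsed"]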
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.2
# 2-fold cross-validation keeps the training time short
trControl <- trainControl(method = "cv", number = 2)
modFit <- train(classe ~ ., data = training, method = "rf", prox = TRUE, trControl = trControl)
First, the final model is used to predict the outcome in the cross-validation dataset.
pred <- predict(modFit, newdata = crossval)
Second, the function confusionMatrix is used to calculate the accuracy of the prediction.
coMa <- confusionMatrix(pred, reference = crossval$classe)
acc <- coMa$overall["Accuracy"]
acc
## Accuracy
## 0.9938828
The accuracy of the prediction is 99.39%. Hence, the estimated out-of-sample error is 0.61%.
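This error estimate follows directly from the accuracy computed above:
# Estimated out-of-sample error = 1 - accuracy
unname(1 - acc)  # approximately 0.0061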
The five most important variables in the model and their relative importance values are:
vi <- varImp(modFit)$importance
vi[head(order(unlist(vi), decreasing = TRUE), 5L), , drop = FALSE]
## Overall
## roll_belt 100.00000
## pitch_forearm 60.82071
## yaw_belt 50.03213
## magnet_dumbbell_z 45.23761
## magnet_dumbbell_y 44.86509
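The importance values can also be inspected graphically; a minimal sketch using caret's plot method for varImp objects:
# Plot the five most important predictors
plot(varImp(modFit), top = 5)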
Now let's apply the model to the testDataReduced2 dataset.
pred <- predict(modFit, newdata = testDataReduced2)
pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E