Introduction

Here we focus on building a model to predict how well a particular activity (weight lifting) is being performed, rather than which activity is being performed. Using activity trackers such as Fitbits and Apple Watches, people are producing a lot of data on their physical activities, whether to improve their health or simply out of interest in the technology. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (the Weight Lifting Exercise dataset).

Analysis

Cleaning the data

There are a lot of columns with missing values in them, so we first remove these columns from the dataset and then convert all the remaining integer columns to numeric. The analysis below also uses the caret, MASS, and ggplot2 packages, so we load them here.

library(caret)    # createDataPartition, train, confusionMatrix
library(MASS)     # lda
library(ggplot2)  # plots

tr <- read.csv("training.csv")
# flag columns containing NA values
narm <- sapply(tr, function(x) any(is.na(x)))
# flag factor columns containing empty-string levels
narmfactor <- sapply(tr, function(x) any(levels(x) == ""))
# drop both kinds of incomplete columns
tr <- tr[, -which(narm | narmfactor)]
# coerce the integer sensor columns (8 to 59) to numeric
for (i in 8:59) {
  tr[, i] <- as.numeric(tr[, i])
}
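
As a quick sanity check that the cleaning worked (a minimal sketch; the expected row count assumes the standard WLE training file):

sum(is.na(tr))  # should be 0 after dropping the incomplete columns
dim(tr)         # expect 19622 rows and 60 columns (59 predictors + classe)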

Building the model

During the data collection process the sensors were placed on the dumbbell, on the hip by means of a belt, on the arm, and on the forearm via a glove. So we will split the training data into one subset per sensor placement. We also carve out two hold-out sets: tst, for evaluating the individual models and training the ensemble, and test2, for evaluating the final ensemble.

set.seed(4567)
# hold out 15% of the data (test2) to evaluate the final ensemble
intrain <- createDataPartition(tr$classe, p = 0.85, list = FALSE)
test2 <- tr[-intrain, ]
tr <- tr[intrain, ]
# hold out a further 25% (tst) to evaluate the individual models
intrain <- createDataPartition(tr$classe, p = 0.75, list = FALSE)
tst <- tr[-intrain, ]
tr <- tr[intrain, ]
# one feature subset per sensor; column 60 is the outcome (classe)
belt <- tr[, c(grep("belt", names(tr)), 60)]
# grep("arm") also matches "forearm", so keep only the first 13 matches
arm <- tr[, c(grep("arm", names(tr))[1:13], 60)]
forearm <- tr[, c(grep("forearm", names(tr)), 60)]
dumbell <- tr[, c(grep("dumbbell", names(tr)), 60)]
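
Each subset should contain the same 13 sensor features plus the outcome; a quick check (a sketch, assuming the standard cleaned 60-column frame):

# expect 14 columns each: 13 features + classe
sapply(list(belt = belt, arm = arm, forearm = forearm, dumbell = dumbell), ncol)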

We will build an LDA model on each of these subsets to see how well each sensor can predict the outcome on its own, using the corresponding per-sensor subsets of the hold-out sets for evaluation.

# per-sensor subsets of the model-evaluation hold-out set (tst)
belttest <- tst[, c(grep("belt", names(tst)), 60)]
armtest <- tst[, c(grep("arm", names(tst))[1:13], 60)]
forearmtest <- tst[, c(grep("forearm", names(tst)), 60)]
dumbelltest <- tst[, c(grep("dumbbell", names(tst)), 60)]
# per-sensor subsets of the final hold-out set (test2)
belttest2 <- test2[, c(grep("belt", names(test2)), 60)]
armtest2 <- test2[, c(grep("arm", names(test2))[1:13], 60)]
forearmtest2 <- test2[, c(grep("forearm", names(test2)), 60)]
dumbelltest2 <- test2[, c(grep("dumbbell", names(test2)), 60)]
# fit one LDA model per sensor
fitbelt <- lda(classe ~ ., data = belt)
fitdumb <- lda(classe ~ ., data = dumbell)
fitfore <- lda(classe ~ ., data = forearm)
fitarm <- lda(classe ~ ., data = arm)
# predict on the hold-out set
predblet <- predict(fitbelt, belttest)
predarm <- predict(fitarm, armtest)
predfore <- predict(fitfore, forearmtest)
preddumb <- predict(fitdumb, dumbelltest)
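
Each predict() call on an lda fit returns a list with the predicted class, the class posterior probabilities, and the linear discriminant (LD) scores; the scores are what we plot below, and all three components are fed to the ensemble later. A quick way to inspect this:

str(predblet, max.level = 1)
# $ class    : factor of predicted classes
# $ posterior: matrix of class posterior probabilities
# $ x        : matrix of LD scores (LD1 to LD4)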

Let's see how well each of these differentiates the different categories.

BELT

confusionMatrix(predblet$class,belttest$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 952 497 498 390 258
##          B 223 288 224 201  82
##          C   7  13   3   6  22
##          D   1   2   0  41  60
##          E   2   7   2  45 344
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3906          
##                  95% CI : (0.3757, 0.4056)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1894          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B  Class: C Class: D Class: E
## Sensitivity            0.8034   0.3569 0.0041265 0.060029  0.44909
## Specificity            0.4492   0.7828 0.9860506 0.981923  0.98354
## Pos Pred Value         0.3669   0.2829 0.0588235 0.394231  0.86000
## Neg Pred Value         0.8519   0.8352 0.8241438 0.842028  0.88800
## Prevalence             0.2843   0.1936 0.1744242 0.163868  0.18378
## Detection Rate         0.2284   0.0691 0.0007198 0.009837  0.08253
## Detection Prevalence   0.6226   0.2442 0.0122361 0.024952  0.09597
## Balanced Accuracy      0.6263   0.5698 0.4950886 0.520976  0.71631

How well the LDA splits the different classes: the squared singular values give the proportion of between-class variance explained by each linear discriminant.

fitbelt$svd^2/sum(fitbelt$svd^2)
## [1] 0.76845407 0.14625408 0.07316790 0.01212396

The plot of the first two discriminant scores, colored by class:

d <- data.frame(lda = predblet$x, class = belttest$classe)
pl <- ggplot(d, aes(x = lda.LD1, y = lda.LD2, col = class))
pl + geom_point(alpha = 0.5) + labs(title = "BELT", x = "LD1", y = "LD2")

ARM

confusionMatrix(predarm$class,armtest$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 769 209 327 172 187
##          B 119 327 125  56 195
##          C  97  66  48  24  15
##          D 155  47 138 316 109
##          E  45 158  89 115 260
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4127          
##                  95% CI : (0.3977, 0.4278)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2454          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6489  0.40520  0.06602  0.46266  0.33943
## Specificity            0.7000  0.85272  0.94130  0.87116  0.88036
## Pos Pred Value         0.4621  0.39781  0.19200  0.41307  0.38981
## Neg Pred Value         0.8339  0.85655  0.82670  0.89215  0.85547
## Prevalence             0.2843  0.19362  0.17442  0.16387  0.18378
## Detection Rate         0.1845  0.07845  0.01152  0.07582  0.06238
## Detection Prevalence   0.3992  0.19722  0.05998  0.18354  0.16003
## Balanced Accuracy      0.6745  0.62896  0.50366  0.66691  0.60990

How well the LDA splits the different classes (proportion of between-class variance per discriminant):

fitarm$svd^2/sum(fitarm$svd^2)
## [1] 0.66480231 0.25710826 0.06385559 0.01423384

The plot

d <- data.frame(lda = predarm$x, class = armtest$classe)
pl <- ggplot(d, aes(x = lda.LD1, y = lda.LD2, col = class))
pl + geom_point(alpha = 0.5) + labs(title = "ARM", x = "LD1", y = "LD2")

FOREARM (GLOVE)

confusionMatrix(predfore$class,forearmtest$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 710 213 227  65 136
##          B 174 167 108 108 150
##          C 133 153 240 131 160
##          D 123 141 104 336 149
##          E  45 133  48  43 171
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3896          
##                  95% CI : (0.3748, 0.4046)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2253          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.5992  0.20694  0.33012  0.49195  0.22324
## Specificity            0.7851  0.83933  0.83232  0.85165  0.92093
## Pos Pred Value         0.5255  0.23621  0.29376  0.39390  0.38864
## Neg Pred Value         0.8314  0.81508  0.85467  0.89532  0.84040
## Prevalence             0.2843  0.19362  0.17442  0.16387  0.18378
## Detection Rate         0.1703  0.04007  0.05758  0.08061  0.04103
## Detection Prevalence   0.3241  0.16963  0.19602  0.20465  0.10557
## Balanced Accuracy      0.6921  0.52314  0.58122  0.67180  0.57208

How well the LDA splits the different classes (proportion of between-class variance per discriminant):

fitfore$svd^2/sum(fitfore$svd^2)
## [1] 0.82267463 0.08678683 0.07609845 0.01444009

The plot

d <- data.frame(lda = predfore$x, class = forearmtest$classe)
pl <- ggplot(d, aes(x = lda.LD1, y = lda.LD2, col = class))
pl + geom_point(alpha = 0.5) + labs(title = "FOREARM", x = "LD1", y = "LD2")

DUMBBELL

confusionMatrix(preddumb$class,dumbelltest$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 900 301 318 312 301
##          B   6 271  23  29 138
##          C 103 120 318  79 118
##          D 161  55  60 215  84
##          E  15  60   8  48 125
## 
## Overall Statistics
##                                          
##                Accuracy : 0.4388         
##                  95% CI : (0.4237, 0.454)
##     No Information Rate : 0.2843         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2694         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7595  0.33581   0.4374  0.31479  0.16319
## Specificity            0.5870  0.94168   0.8779  0.89670  0.96149
## Pos Pred Value         0.4221  0.58030   0.4309  0.37391  0.48828
## Neg Pred Value         0.8600  0.85517   0.8808  0.86975  0.83615
## Prevalence             0.2843  0.19362   0.1744  0.16387  0.18378
## Detection Rate         0.2159  0.06502   0.0763  0.05158  0.02999
## Detection Prevalence   0.5115  0.11204   0.1771  0.13796  0.06142
## Balanced Accuracy      0.6732  0.63875   0.6577  0.60574  0.56234

How well the LDA splits the different classes (proportion of between-class variance per discriminant):

fitdumb$svd^2/sum(fitdumb$svd^2)
## [1] 0.48540143 0.27432608 0.20712718 0.03314531

The plot

d <- data.frame(lda = preddumb$x, class = dumbelltest$classe)
pl <- ggplot(d, aes(x = lda.LD1, y = lda.LD2, col = class))
pl + geom_point(alpha = 0.5) + labs(title = "DUMBBELL", x = "LD1", y = "LD2")
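
Before combining the models it helps to see the four individual hold-out accuracies side by side; a minimal sketch using the prediction objects above (the values match the Accuracy entries of the confusion matrices):

# overall hold-out accuracy of each single-sensor LDA model
sapply(list(belt = predblet$class, arm = predarm$class,
            forearm = predfore$class, dumbbell = preddumb$class),
       function(p) mean(p == tst$classe))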

Ensembling

As you can see, these models individually do not do a very good job of segregating the different executions, so it is best to combine them in an optimal way. Each LDA tells us how well its sensor can separate the classes on its own; stacking the four sets of predictions and training a boosting model ("gbm") on them yields a combination of the models weighted by their individual ability to separate the classes.

# stack the four LDA predictions (class, posteriors, LD scores) as features
g <- data.frame(pred1 = predarm, pred2 = predblet, pred3 = predfore,
                pred4 = preddumb, classe = tst$classe)
# train a boosted-tree model on the stacked predictions
modelgbm <- train(classe ~ ., data = g, method = "gbm", verbose = FALSE)
# repeat the per-sensor predictions on the final hold-out set
predblet2 <- predict(fitbelt, belttest2)
predarm2 <- predict(fitarm, armtest2)
predfore2 <- predict(fitfore, forearmtest2)
preddumb2 <- predict(fitdumb, dumbelltest2)
g2 <- data.frame(pred1 = predarm2, pred2 = predblet2, pred3 = predfore2,
                 pred4 = preddumb2, classe = test2$classe)
predtes2 <- predict(modelgbm, g2)
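
By default train() estimates performance by bootstrap resampling. For a cross-validated estimate of the ensemble's out-of-sample error, a trainControl object can be passed instead; this is a sketch of an alternative call, not what was run above:

# 5-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 5)
modelgbm_cv <- train(classe ~ ., data = g, method = "gbm",
                     trControl = ctrl, verbose = FALSE)
modelgbm_cv$results  # resampled accuracy and kappa per tuning combination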

Now let's see how well this classifies the data.

confusionMatrix(predtes2,test2$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 796  46  21  20   7
##          B  14 455  46  18  18
##          C   4  42 429  33  10
##          D  18  14  13 396  13
##          E   5  12   4  15 493
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8732         
##                  95% CI : (0.8607, 0.885)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8392         
##  Mcnemar's Test P-Value : 6.477e-06      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9510   0.7996   0.8363   0.8216   0.9113
## Specificity            0.9553   0.9595   0.9634   0.9764   0.9850
## Pos Pred Value         0.8944   0.8258   0.8282   0.8722   0.9319
## Neg Pred Value         0.9800   0.9523   0.9653   0.9654   0.9801
## Prevalence             0.2845   0.1934   0.1744   0.1638   0.1839
## Detection Rate         0.2706   0.1547   0.1458   0.1346   0.1676
## Detection Prevalence   0.3025   0.1873   0.1761   0.1543   0.1798
## Balanced Accuracy      0.9532   0.8796   0.8998   0.8990   0.9481

It does a very good job of segregating the data, with an accuracy of about 87% as shown above.

Result

As seen above, after ensembling the different models the combined model becomes very good at predicting whether the particular exercise, in this case lifting a relatively light dumbbell of 1.25 kg, is performed properly, based on the sensor data. The accuracy is about 87%, with a kappa of 0.84.
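
To score new sensor data with this ensemble the whole pipeline must be replayed: subset the columns per sensor, run each LDA, stack the predictions, and hand them to the gbm. A minimal sketch wrapping the objects fitted above (predict_stack is a hypothetical helper, and it assumes newdata has the same cleaned 60-column layout as the training frame):

# hypothetical wrapper around the fitted objects from the analysis above
predict_stack <- function(newdata) {
  pb <- predict(fitbelt, newdata[, c(grep("belt", names(newdata)), 60)])
  pa <- predict(fitarm, newdata[, c(grep("arm", names(newdata))[1:13], 60)])
  pf <- predict(fitfore, newdata[, c(grep("forearm", names(newdata)), 60)])
  pd <- predict(fitdumb, newdata[, c(grep("dumbbell", names(newdata)), 60)])
  # column names must match the training frame g (pred1..pred4)
  feats <- data.frame(pred1 = pa, pred2 = pb, pred3 = pf, pred4 = pd)
  predict(modelgbm, feats)
}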

Appendix

The data for this project come from: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite the authors, as they have been very generous in allowing their data to be used for this kind of assignment.

Paper used

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013. https://web.archive.org/web/20161217164008/http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201