Here we focus on building a model to predict how well a particular activity (weight lifting) is being performed, rather than which activity is being performed. Using activity trackers such as Fitbit and the Apple Watch, people are producing a great deal of data on their physical activities, whether to improve their health or simply out of interest in the technology. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (the Weight Lifting Exercises dataset).
Many columns consist largely of missing values, so we first remove those columns from the dataset and then convert the remaining integer columns to numeric.
library(caret)    # createDataPartition, train, confusionMatrix
library(MASS)     # lda
library(ggplot2)
tr<-read.csv("training.csv")
#flagging columns that contain NA
narm<-sapply(tr, function(x) any(is.na(x)))
#flagging factor columns that contain empty strings
narmfactor <-sapply(tr, function(x) any(levels(x)==""))
#removing the flagged columns
tr<-tr[,-which(narm|narmfactor)]
#coercing the integer sensor columns to numeric
for (i in 8:59) {
  tr[,i]<-as.numeric(tr[,i])
}
During data collection the sensors were placed on the dumbbell, on the hip using a belt, on the upper arm using an armband, and on the forearm using a glove. We first hold out part of the data for evaluating the final ensemble, split the remainder into a set for training the individual models and a set for assessing them and training the ensemble, and then divide the training set into subsets for each sensor placement.
set.seed(4567)
#holding out 15% for evaluating the stacked model
intrain <- createDataPartition(tr$classe,p=0.85,list = F)
test2<-tr[-intrain,]
tr<-tr[intrain,]
#splitting the rest: 75% to train the base models, 25% to assess them and train the ensemble
intrain <- createDataPartition(tr$classe,p=0.75,list = F)
tst<-tr[-intrain,]
tr<-tr[intrain,]
#subsetting the predictors by sensor placement (column 60 is classe);
#grep("arm") also matches "forearm", so only the first 13 matches are kept
belt<-tr[,c(grep("belt",names(tr)),60)]
arm<-tr[,c(grep("arm",names(tr))[1:13],60)]
forearm<-tr[,c(grep("forearm",names(tr)),60)]
dumbell<-tr[,c(grep("dumbbell",names(tr)),60)]
We will build an LDA model for each of these subsets to gauge how well each sensor can predict the outcome on its own.
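Before fitting the real models, here is a minimal, self-contained sketch of what MASS::lda() produces, using the built-in iris data as a stand-in for the sensor subsets (the split and variable names here are illustrative only):

```r
library(MASS)  # lda

# Illustrative only: iris stands in for the sensor subsets
set.seed(1)
idx  <- sample(nrow(iris), 100)
fit  <- lda(Species ~ ., data = iris[idx, ])
pred <- predict(fit, iris[-idx, ])

head(pred$class)     # predicted class labels
dim(pred$posterior)  # one posterior-probability column per class
dim(pred$x)          # projections onto the linear discriminants

# Proportion of between-class variance captured by each discriminant;
# the same quantity is reported for each sensor model in this analysis
fit$svd^2 / sum(fit$svd^2)
```

With 3 classes iris yields 2 discriminants; the 5-class sensor data yields 4, which is why four proportions appear in the outputs of this analysis.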
#the same sensor subsets for the assessment set
belttest<-tst[,c(grep("belt",names(tst)),60)]
armtest<-tst[,c(grep("arm",names(tst))[1:13],60)]
forearmtest<-tst[,c(grep("forearm",names(tst)),60)]
dumbelltest<-tst[,c(grep("dumbbell",names(tst)),60)]
#the same sensor subsets for the held-out evaluation set
dumbelltest2<-test2[,c(grep("dumbbell",names(test2)),60)]
forearmtest2<-test2[,c(grep("forearm",names(test2)),60)]
armtest2<-test2[,c(grep("arm",names(test2))[1:13],60)]
belttest2<-test2[,c(grep("belt",names(test2)),60)]
fitbelt<-lda(classe~.,data= belt)
fitdumb<-lda(classe~.,data= dumbell)
fitfore<-lda(classe~.,data= forearm)
fitarm<-lda(classe~.,data= arm)
predblet<-predict(fitbelt,belttest)
predarm<-predict(fitarm,armtest)
predfore<-predict(fitfore,forearmtest)
preddumb<-predict(fitdumb,dumbelltest)
confusionMatrix(predblet$class,belttest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 952 497 498 390 258
## B 223 288 224 201 82
## C 7 13 3 6 22
## D 1 2 0 41 60
## E 2 7 2 45 344
##
## Overall Statistics
##
## Accuracy : 0.3906
## 95% CI : (0.3757, 0.4056)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1894
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8034 0.3569 0.0041265 0.060029 0.44909
## Specificity 0.4492 0.7828 0.9860506 0.981923 0.98354
## Pos Pred Value 0.3669 0.2829 0.0588235 0.394231 0.86000
## Neg Pred Value 0.8519 0.8352 0.8241438 0.842028 0.88800
## Prevalence 0.2843 0.1936 0.1744242 0.163868 0.18378
## Detection Rate 0.2284 0.0691 0.0007198 0.009837 0.08253
## Detection Prevalence 0.6226 0.2442 0.0122361 0.024952 0.09597
## Balanced Accuracy 0.6263 0.5698 0.4950886 0.520976 0.71631
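The per-class statistics can be reproduced directly from the confusion matrix: sensitivity is the true positives over each reference column, and specificity is the true negatives over all actual non-members of a class. As a check, in base R:

```r
# The belt confusion matrix printed above (rows = predictions, columns = reference)
cm <- matrix(c(952, 497, 498, 390, 258,
               223, 288, 224, 201,  82,
                 7,  13,   3,   6,  22,
                 1,   2,   0,  41,  60,
                 2,   7,   2,  45, 344),
             nrow = 5, byrow = TRUE,
             dimnames = list(pred = LETTERS[1:5], ref = LETTERS[1:5]))

# Sensitivity: correct predictions over each reference column
sens <- diag(cm) / colSums(cm)
# Specificity: true negatives over all actual non-members of each class
spec <- sapply(1:5, function(i) {
  tn <- sum(cm[-i, -i])
  tn / (tn + sum(cm[i, -i]))
})
round(sens, 4)          # matches the "Sensitivity" row above
round(spec, 4)          # matches the "Specificity" row above
sum(diag(cm)) / sum(cm) # overall accuracy, 0.3906 as reported
```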
The squared singular values show how well the LDA separates the classes: each entry is the proportion of between-class variance captured by the corresponding linear discriminant.
fitbelt$svd^2/sum(fitbelt$svd^2)
## [1] 0.76845407 0.14625408 0.07316790 0.01212396
d<-data.frame(lda = predblet$x,class=belttest$classe)
pl<-ggplot(data = d, aes(x=lda.LD1,y=lda.LD2,col = class))
pl+geom_point(alpha = 0.5)+labs(title ="BELT",y = "LD2",x = "LD1")
confusionMatrix(predarm$class,armtest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 769 209 327 172 187
## B 119 327 125 56 195
## C 97 66 48 24 15
## D 155 47 138 316 109
## E 45 158 89 115 260
##
## Overall Statistics
##
## Accuracy : 0.4127
## 95% CI : (0.3977, 0.4278)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2454
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6489 0.40520 0.06602 0.46266 0.33943
## Specificity 0.7000 0.85272 0.94130 0.87116 0.88036
## Pos Pred Value 0.4621 0.39781 0.19200 0.41307 0.38981
## Neg Pred Value 0.8339 0.85655 0.82670 0.89215 0.85547
## Prevalence 0.2843 0.19362 0.17442 0.16387 0.18378
## Detection Rate 0.1845 0.07845 0.01152 0.07582 0.06238
## Detection Prevalence 0.3992 0.19722 0.05998 0.18354 0.16003
## Balanced Accuracy 0.6745 0.62896 0.50366 0.66691 0.60990
The squared singular values show how well the LDA separates the classes: each entry is the proportion of between-class variance captured by the corresponding linear discriminant.
fitarm$svd^2/sum(fitarm$svd^2)
## [1] 0.66480231 0.25710826 0.06385559 0.01423384
d<-data.frame(lda = predarm$x,class=armtest$classe)
pl<-ggplot(data = d, aes(x=lda.LD1,y=lda.LD2,col = class))
pl+geom_point(alpha = 0.5)+labs(title ="ARM",y = "LD2",x = "LD1")
confusionMatrix(predfore$class,forearmtest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 710 213 227 65 136
## B 174 167 108 108 150
## C 133 153 240 131 160
## D 123 141 104 336 149
## E 45 133 48 43 171
##
## Overall Statistics
##
## Accuracy : 0.3896
## 95% CI : (0.3748, 0.4046)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2253
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.5992 0.20694 0.33012 0.49195 0.22324
## Specificity 0.7851 0.83933 0.83232 0.85165 0.92093
## Pos Pred Value 0.5255 0.23621 0.29376 0.39390 0.38864
## Neg Pred Value 0.8314 0.81508 0.85467 0.89532 0.84040
## Prevalence 0.2843 0.19362 0.17442 0.16387 0.18378
## Detection Rate 0.1703 0.04007 0.05758 0.08061 0.04103
## Detection Prevalence 0.3241 0.16963 0.19602 0.20465 0.10557
## Balanced Accuracy 0.6921 0.52314 0.58122 0.67180 0.57208
The squared singular values show how well the LDA separates the classes: each entry is the proportion of between-class variance captured by the corresponding linear discriminant.
fitfore$svd^2/sum(fitfore$svd^2)
## [1] 0.82267463 0.08678683 0.07609845 0.01444009
d<-data.frame(lda = predfore$x,class=forearmtest$classe)
pl<-ggplot(data = d, aes(x=lda.LD1,y=lda.LD2,col = class))
pl+geom_point(alpha = 0.5)+labs(title ="FOREARM",y = "LD2",x = "LD1")
confusionMatrix(preddumb$class,dumbelltest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 900 301 318 312 301
## B 6 271 23 29 138
## C 103 120 318 79 118
## D 161 55 60 215 84
## E 15 60 8 48 125
##
## Overall Statistics
##
## Accuracy : 0.4388
## 95% CI : (0.4237, 0.454)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2694
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7595 0.33581 0.4374 0.31479 0.16319
## Specificity 0.5870 0.94168 0.8779 0.89670 0.96149
## Pos Pred Value 0.4221 0.58030 0.4309 0.37391 0.48828
## Neg Pred Value 0.8600 0.85517 0.8808 0.86975 0.83615
## Prevalence 0.2843 0.19362 0.1744 0.16387 0.18378
## Detection Rate 0.2159 0.06502 0.0763 0.05158 0.02999
## Detection Prevalence 0.5115 0.11204 0.1771 0.13796 0.06142
## Balanced Accuracy 0.6732 0.63875 0.6577 0.60574 0.56234
The squared singular values show how well the LDA separates the classes: each entry is the proportion of between-class variance captured by the corresponding linear discriminant.
fitdumb$svd^2/sum(fitdumb$svd^2)
## [1] 0.48540143 0.27432608 0.20712718 0.03314531
d<-data.frame(lda = preddumb$x,class=dumbelltest$classe)
pl<-ggplot(data = d, aes(x=lda.LD1,y=lda.LD2,col = class))
pl+geom_point(alpha = 0.5)+labs(title ="DUMBBELL",y = "LD2",x = "LD1")
As you can see, the individual models do not do a very good job of segregating the different executions, so it is best to combine them. Each LDA captures how well one sensor can separate the classes on its own; stacking the four sets of predictions and feeding them to a boosting method such as gbm, as we do next, lets the ensemble weight each model according to its individual ability to separate the classes.
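The stacking idea can be sketched in a self-contained way. In this toy version, iris stands in for the sensor data and nnet::multinom stands in for gbm; both substitutions are for brevity, and this is not the model actually fitted in this analysis:

```r
library(MASS)   # lda
library(nnet)   # multinom, standing in for gbm in this sketch

# Two base models, each seeing a different subset of features
# (analogous to the four per-sensor subsets), combined by a meta-model
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

fit1 <- lda(Species ~ Sepal.Length + Sepal.Width, data = train)
fit2 <- lda(Species ~ Petal.Length + Petal.Width, data = train)

# Base-model posteriors become the features of the meta-model
stack <- function(d) data.frame(p1 = predict(fit1, d)$posterior,
                                p2 = predict(fit2, d)$posterior,
                                Species = d$Species)
meta <- multinom(Species ~ ., data = stack(train), trace = FALSE)
acc  <- mean(predict(meta, stack(test)) == test$Species)
acc  # accuracy of the stacked model on held-out rows
```

The real analysis does the same thing with four LDA base models and a gbm meta-model, using the full predict() output rather than only the posteriors.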
#stacking the full predict() output of each base model (class, posteriors and discriminants) as features
g<-data.frame(pred1=predarm,pred2=predblet,pred3=predfore,pre4=preddumb,classe=tst$classe)
modelgbm<-train(classe~.,data = g,method="gbm",verbose =F)
#base-model predictions on the held-out evaluation set
predblet2<-predict(fitbelt,belttest2)
predarm2<-predict(fitarm,armtest2)
predfore2<-predict(fitfore,forearmtest2)
preddumb2<-predict(fitdumb,dumbelltest2)
#stacked in the same way as the training features for the gbm
g2<-data.frame(pred1=predarm2,pred2=predblet2,pred3=predfore2,pre4=preddumb2,classe=test2$classe)
predtes2<-predict(modelgbm,g2)
Now let's see how well the stacked model classifies the held-out data.
confusionMatrix(predtes2,test2$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 796 46 21 20 7
## B 14 455 46 18 18
## C 4 42 429 33 10
## D 18 14 13 396 13
## E 5 12 4 15 493
##
## Overall Statistics
##
## Accuracy : 0.8732
## 95% CI : (0.8607, 0.885)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8392
## Mcnemar's Test P-Value : 6.477e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9510 0.7996 0.8363 0.8216 0.9113
## Specificity 0.9553 0.9595 0.9634 0.9764 0.9850
## Pos Pred Value 0.8944 0.8258 0.8282 0.8722 0.9319
## Neg Pred Value 0.9800 0.9523 0.9653 0.9654 0.9801
## Prevalence 0.2845 0.1934 0.1744 0.1638 0.1839
## Detection Rate 0.2706 0.1547 0.1458 0.1346 0.1676
## Detection Prevalence 0.3025 0.1873 0.1761 0.1543 0.1798
## Balanced Accuracy 0.9532 0.8796 0.8998 0.8990 0.9481
As the confusion matrix shows, the stacked model does a much better job of segregating the data. After ensembling the individual models, the new model becomes very good at predicting whether the exercise, in this case lifting a relatively light dumbbell of 1.25 kg, is performed properly, with an accuracy of 87% and a kappa of 0.84.
The data for this project come from: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite the authors, as they have been very generous in allowing their data to be used for this kind of assignment.
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013. https://web.archive.org/web/20161217164008/http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201