Validating Exercise Quality with Body Sensors
1. Introduction
Devices such as the Jawbone Up, Nike FuelBand, and Fitbit make it possible to collect a large amount of data about personal activity. These types of devices are part of the quantified self movement: a group of enthusiasts who regularly take measurements about themselves to improve their health, to find patterns in their behaviour, or simply because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of six participants, shown in figure 1 (Velloso et al., 2013).
Figure 1: Sensing setup.
Participants performed one set of ten repetitions of the Unilateral Dumbbell Biceps Curl correctly (Class A) and incorrectly in four ways:
- Throwing the elbows to the front (Class B)
- Lifting the dumbbell only halfway (Class C)
- Lowering the dumbbell only halfway (Class D)
- Throwing the hips to the front (Class E)
The video below demonstrates the correct method to perform unilateral biceps curls.
Brett Taylor, D B Unilateral Curls.
This report partially reproduces the research by Velloso et al. (2013), who developed a live feedback system that helps people perform biceps curls correctly. The model developed here predicts the type of biceps curl (classes A to E).
2. Data Preparation
The data is available from the UCI Machine Learning Repository. Each inertial measurement unit (IMU) records x, y, and z values plus Euler angles (roll, pitch, and yaw). For each time window (1 s of data), the set also contains several derived statistics, such as kurtosis and variance. The `classe` column contains the dependent variable.
2.1. Load Data
The data for this assignment is loaded directly from the course website.
```r
# Download data
library(readr)
raw_data <- read_csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
validate_raw <- read_csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
```
The training data contains 160 variables and 19622 observations.
2.2. Pre-Processing
The pre-processing step removes variables that are unlikely to contribute to the prediction. Near-zero variance variables have a low proportion of unique values relative to the sample size. The data also contains a large number of missing values because many variables hold periodic descriptive statistics of other variables. Predictors with more than 95% missing values are removed from the data set. This omission should not influence the error rate of the prediction model, since these are summary statistics that correlate highly with the underlying measurements. The first seven variables are metadata and can also be removed.
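The report does not show the exact calls used for this step; a minimal sketch with caret's `nearZeroVar` function, assuming the `raw_data` object loaded above, could look like this:

```r
library(caret)

# Remove near-zero variance variables
nzv <- nearZeroVar(raw_data)
clean_data <- raw_data[, -nzv]

# Remove variables with more than 95% missing values
mostly_na <- which(colMeans(is.na(clean_data)) > 0.95)
clean_data <- clean_data[, -mostly_na]

# Drop the first seven metadata columns (row id, user name, timestamps, windows)
clean_data <- clean_data[, -(1:7)]
```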
This feature-reduction step leaves 53 variables: thirteen measurements for each of the four locations (belt, arm, dumbbell, forearm) plus the dependent variable. Acceleration, gyroscope, and magnetometer values were measured in three orthogonal directions; pitch, roll, yaw, and total acceleration each have one dimension (Table 1).
```r
# Summarise predictors by measurement type and sensor location
names_clean_data <- gsub("total_accel", "total-accel", names(clean_data))
vars <- strsplit(names_clean_data[-53], "_")
measurements <- unlist(lapply(vars, function(x) x[1]))
locations <- unlist(lapply(vars, function(x) x[2]))
table(measurements, locations)
```
measurements | arm | belt | dumbbell | forearm
---|---|---|---|---
accel | 3 | 3 | 3 | 3
gyros | 3 | 3 | 3 | 3
magnet | 3 | 3 | 3 | 3
pitch | 1 | 1 | 1 | 1
roll | 1 | 1 | 1 | 1
total-accel | 1 | 1 | 1 | 1
yaw | 1 | 1 | 1 | 1
3. Training and Testing Data
The clean data is partitioned into a training set (70% of the data) and a testing set (the remaining 30%).
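The partitioning code is not shown in the report; with caret, a stratified split could be produced as follows (the seed value is illustrative):

```r
library(caret)

set.seed(1234)
# Stratified 70/30 split on the outcome variable
in_train <- createDataPartition(clean_data$classe, p = 0.7, list = FALSE)
training <- clean_data[in_train, ]
testing  <- clean_data[-in_train, ]
```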
4. Modelling
A random forest model is fitted to the training data with three-fold cross-validation.
```r
# Use three-fold cross-validation to select optimal tuning parameters
fit_control <- trainControl(method = "cv", number = 3, verboseIter = FALSE)

# Fit the random forest model
fit <- train(classe ~ ., data = training, method = "rf",
             trControl = fit_control, allowParallel = TRUE)
fit$finalModel
```
```
Call:
 randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)), allowParallel = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 27

        OOB estimate of  error rate: 0.68%
Confusion matrix:
     A    B    C    D    E  class.error
A 3899    6    1    0    0  0.001792115
B   15 2639    3    1    0  0.007148232
C    0   17 2373    6    0  0.009599332
D    0    1   28 2220    3  0.014209591
E    0    1    5    7 2512  0.005148515
```
4.1. Testing the model
The model is applied to the testing data to estimate the out-of-sample error.
```r
# Use the model to predict classe in the testing set
predictions <- predict(fit, newdata = testing)

# Show the confusion matrix to get an estimate of the out-of-sample error
confusionMatrix(as.factor(testing$classe), predictions)
```
```
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1670    2    1    0    1
         B    3 1134    2    0    0
         C    0    3 1020    3    0
         D    0    1    6  957    0
         E    0    0    5    3 1074

Overall Statistics

               Accuracy : 0.9949
                 95% CI : (0.9927, 0.9966)
    No Information Rate : 0.2843
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.9936
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9982   0.9947   0.9865   0.9938   0.9991
Specificity            0.9991   0.9989   0.9988   0.9986   0.9983
Pos Pred Value         0.9976   0.9956   0.9942   0.9927   0.9926
Neg Pred Value         0.9993   0.9987   0.9971   0.9988   0.9998
Prevalence             0.2843   0.1937   0.1757   0.1636   0.1827
Detection Rate         0.2838   0.1927   0.1733   0.1626   0.1825
Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
Balanced Accuracy      0.9986   0.9968   0.9926   0.9962   0.9987
```
The out-of-sample accuracy lies, with 95% confidence, between 0.993 and 0.997, which corresponds to less than 1% error.
The random forest model has a built-in variable importance score that illustrates the influence each predictor has on the outcomes. The image in figure 2 visualises the top twenty variables. This analysis shows that the roll of the belt is the most influential variable.
Figure 2: Top-twenty variables sorted by importance.
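The plotting code for figure 2 is not shown; one way to produce it is with caret's `varImp` wrapper around the random forest importance scores:

```r
library(caret)

# Plot the twenty most important predictors of the fitted model
plot(varImp(fit), top = 20)
```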
5. Validating the prediction model
The prediction model `fit` is applied to the validation data set to test the accuracy of the prediction. The feedback from the quiz shows that all answers are correct.
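A sketch of this step, assuming the `validate_raw` data frame loaded earlier contains the same predictor columns as the training data:

```r
# Apply the fitted model to the 20 validation cases
validation_predictions <- predict(fit, newdata = validate_raw)
validation_predictions
```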
Case | Prediction |
---|---|
Case 1 | B |
Case 2 | A |
Case 3 | B |
Case 4 | A |
Case 5 | A |
Case 6 | E |
Case 7 | D |
Case 8 | B |
Case 9 | A |
Case 10 | A |
Case 11 | B |
Case 12 | C |
Case 13 | B |
Case 14 | A |
Case 15 | E |
Case 16 | E |
Case 17 | A |
Case 18 | B |
Case 19 | B |
Case 20 | B |
6. References
Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., & Fuks, H. (2013). Qualitative Activity Recognition of Weight Lifting Exercises. AH '13: Proceedings of the 4th Augmented Human International Conference, 116–123. https://doi.org/10.1145/2459236.2459256