\(~\)
Bellabeat is a tech-driven wellness company, founded in 2013 by Sandro Mur and Urška Sršen, that specializes in health-focused smart products for women. It offers a line of products that help empower users with knowledge about their own health and habits. These products include:
Bellabeat App: Provides users with a hub to track their health data related to their habits and activities.
Leaf: Wellness tracker that can be worn as a bracelet, necklace, or clip that tracks activity, sleep, and stress. Connects to the Bellabeat app to provide the user insights into their daily wellness.
Time: A watch that tracks user activity, sleep, and stress. Connects to the Bellabeat app to provide the user insights into their daily wellness.
Spring: A water bottle that tracks daily water intake. Connects to the Bellabeat app to provide the user insights into their daily wellness.
Membership: Gives users 24/7 access to fully personalized guidance based on their lifestyle and goals.
\(~\)
Using data gathered from non-Bellabeat smart devices, we will perform exploratory data analysis to try to identify trends in smart device usage. Some questions that we will attempt to answer include:
\(~\)
\(~\)
The data used in this analysis is from Kaggle and was made available by the user Möbius (CC0: Public Domain). These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk. 30 eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
Upon initial examination of the datasets in Excel, we immediately notice a few things:
We will keep these observations in mind as we delve deeper into the data.
\(~\)
Let’s start by loading the necessary libraries:
library(tidyverse)
library(lubridate)
library(ggcorrplot)
We will elect to use the datasets containing the daily data only.
Importing the datasets:
daily_activity <- read_csv("dailyActivity_merged.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Id = col_double(),
## ActivityDate = col_character(),
## TotalSteps = col_double(),
## TotalDistance = col_double(),
## TrackerDistance = col_double(),
## LoggedActivitiesDistance = col_double(),
## VeryActiveDistance = col_double(),
## ModeratelyActiveDistance = col_double(),
## LightActiveDistance = col_double(),
## SedentaryActiveDistance = col_double(),
## VeryActiveMinutes = col_double(),
## FairlyActiveMinutes = col_double(),
## LightlyActiveMinutes = col_double(),
## SedentaryMinutes = col_double(),
## Calories = col_double()
## )
daily_sleep <- read_csv("sleepDay_merged.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Id = col_double(),
## SleepDay = col_character(),
## TotalSleepRecords = col_double(),
## TotalMinutesAsleep = col_double(),
## TotalTimeInBed = col_double()
## )
weight_log <- read_csv("weightLogInfo_merged.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Id = col_double(),
## Date = col_character(),
## WeightKg = col_double(),
## WeightPounds = col_double(),
## Fat = col_double(),
## BMI = col_double(),
## IsManualReport = col_logical(),
## LogId = col_double()
## )
\(~\)
Next, we’ll quickly examine the structure of each of the three tables using head() and glimpse():
head(daily_activity)
## # A tibble: 6 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
head(daily_sleep)
## # A tibble: 6 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 12:00:0~ 1 327 346
## 2 1.50e9 4/13/2016 12:00:0~ 2 384 407
## 3 1.50e9 4/15/2016 12:00:0~ 1 412 442
## 4 1.50e9 4/16/2016 12:00:0~ 2 340 367
## 5 1.50e9 4/17/2016 12:00:0~ 1 700 712
## 6 1.50e9 4/19/2016 12:00:0~ 1 304 320
glimpse(daily_sleep)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
head(weight_log)
## # A tibble: 6 x 8
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1.50e9 5/2/2016 ~ 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1.50e9 5/3/2016 ~ 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1.93e9 4/13/2016~ 134. 294. NA 47.5 FALSE 1.46e12
## 4 2.87e9 4/21/2016~ 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2.87e9 5/12/2016~ 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4.32e9 4/17/2016~ 72.4 160. 25 27.5 TRUE 1.46e12
glimpse(weight_log)
## Rows: 67
## Columns: 8
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~
We can see that in the activity dataset, the “ActivityDate” column has a data type of character. Similarly, the date columns in the sleep and weight datasets are also stored as character. We will need to convert these columns to the date data type. Furthermore, the date columns in the sleep and weight datasets contain both a date and a time, so we will need to strip the time component as well.
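As a minimal sketch of what that conversion involves, the snippet below parses three sample strings copied from the previews above. It uses lubridate's `mdy()` for the date-only format and `mdy_hms()` followed by `as_date()` for the timestamp format. This is one possible route, shown only for illustration; the variable names are hypothetical.

```r
library(lubridate)

# Sample strings copied from the glimpse() previews above
activity_date <- "4/12/2016"
sleep_day     <- "4/12/2016 12:00:00 AM"
weight_date   <- "5/2/2016 11:59:59 PM"

# mdy() parses month/day/year strings into Date objects
parsed_activity <- mdy(activity_date)

# mdy_hms() parses the full timestamp; as_date() then drops the time of day
parsed_sleep  <- as_date(mdy_hms(sleep_day))
parsed_weight <- as_date(mdy_hms(weight_date))

print(parsed_activity)
print(parsed_sleep)
print(parsed_weight)
```

Either route should end with a column of class Date, which is what the join on “Id” and “Date” relies on later.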
\(~\)
To start off, we will remove columns that are redundant or unnecessary.
In the daily_activity dataset, the columns “TrackerDistance” and “LoggedActivitiesDistance” are not needed since the “TotalDistance” column already covers them (in the rows previewed above, “TotalDistance” matches “TrackerDistance” and “LoggedActivitiesDistance” is 0). We’ll go ahead and remove those 2 columns.
daily_activity_clean <-
  daily_activity %>%
  select(-TrackerDistance,
         -LoggedActivitiesDistance)
head(daily_activity_clean)
## # A tibble: 6 x 13
## Id ActivityDate TotalSteps TotalDistance VeryActiveDista~ ModeratelyActiv~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 1.88 0.550
## 2 1.50e9 4/13/2016 10735 6.97 1.57 0.690
## 3 1.50e9 4/14/2016 10460 6.74 2.44 0.400
## 4 1.50e9 4/15/2016 9762 6.28 2.14 1.26
## 5 1.50e9 4/16/2016 12669 8.16 2.71 0.410
## 6 1.50e9 4/17/2016 9705 6.48 3.19 0.780
## # ... with 7 more variables: LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
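Before dropping columns as redundant, it can help to check the redundancy directly. The sketch below does this on the six rows shown by head(daily_activity) only; the data frame `preview` and the tolerance value are assumptions for illustration, and the same check would need to be run on the full table to support the claim for all 940 rows.

```r
# First six rows of the distance columns, copied from head(daily_activity)
preview <- data.frame(
  TotalDistance            = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48),
  TrackerDistance          = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48),
  LoggedActivitiesDistance = c(0, 0, 0, 0, 0, 0)
)

# Rows where the two columns disagree (small tolerance guards against float noise)
n_mismatch <- sum(abs(preview$TotalDistance - preview$TrackerDistance) > 1e-8)

# Rows where a manually logged distance is present at all
n_logged <- sum(preview$LoggedActivitiesDistance != 0)

print(c(mismatches = n_mismatch, logged = n_logged))
```

If either count were non-zero on the full dataset, the dropped columns would carry information not present in “TotalDistance”.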
In the daily_sleep dataset, we can remove the “TotalSleepRecords” column as it is irrelevant for this analysis.
daily_sleep_clean <-
  daily_sleep %>%
  select(-TotalSleepRecords)
head(daily_sleep_clean)
In the weight_log dataset, we can remove the “Fat” (only has 2 records, the rest are all NA), “WeightKg” (we don’t need 2 weight columns), “IsManualReport”, and “LogId” columns as they are unnecessary.
weight_log_clean <-
  weight_log %>%
  select(-Fat,
         -WeightKg,
         -IsManualReport,
         -LogId)
head(weight_log_clean)
## # A tibble: 6 x 4
## Id Date WeightPounds BMI
## <dbl> <chr> <dbl> <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM 116. 22.6
## 2 1503960366 5/3/2016 11:59:59 PM 116. 22.6
## 3 1927972279 4/13/2016 1:08:52 AM 294. 47.5
## 4 2873212765 4/21/2016 11:59:59 PM 125. 21.5
## 5 2873212765 5/12/2016 11:59:59 PM 126. 21.7
## 6 4319703577 4/17/2016 11:59:59 PM 160. 27.5
\(~\)
We’ll now check for duplicate rows in each dataset and remove any that we find:
daily_activity_clean <-
  daily_activity_clean %>%
  distinct()
daily_sleep_clean <-
  daily_sleep_clean %>%
  distinct()
weight_log_clean <-
  weight_log_clean %>%
  distinct()
# Uncleaned dataset had 940 rows and 15 columns
nrow(daily_activity_clean)
## [1] 940
# Uncleaned dataset had 413 rows and 5 columns
nrow(daily_sleep_clean)
## [1] 410
# Uncleaned dataset had 67 rows and 8 columns
nrow(weight_log_clean)
## [1] 67
From the above we see that the “daily_sleep” dataset had 3 duplicate rows removed while the other 2 datasets were unmodified.
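Comparing row counts before and after works, but the number of duplicates can also be counted directly. Here is a small base-R sketch on a made-up table (`toy_sleep` and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical sleep log in which the second row repeats the first
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  Date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 327, 384, 412)
)

# duplicated() flags every row that exactly repeats an earlier row
n_duplicates <- sum(duplicated(toy_sleep))

# unique() keeps the first copy of each row
deduplicated <- unique(toy_sleep)

print(n_duplicates)
print(nrow(deduplicated))
```

Counting first makes it explicit how many rows the deduplication step is expected to remove.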
\(~\)
As mentioned earlier, the date columns in all 3 tables are stored as character data. We will convert them to the date data type.
daily_activity_clean$ActivityDate <- mdy(daily_activity_clean$ActivityDate)
daily_activity_clean <-
  daily_activity_clean %>%
  rename(Date = ActivityDate)
head(daily_activity_clean)
## # A tibble: 6 x 13
## Id Date TotalSteps TotalDistance VeryActiveDista~ ModeratelyActive~
## <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 13162 8.5 1.88 0.550
## 2 1.50e9 2016-04-13 10735 6.97 1.57 0.690
## 3 1.50e9 2016-04-14 10460 6.74 2.44 0.400
## 4 1.50e9 2016-04-15 9762 6.28 2.14 1.26
## 5 1.50e9 2016-04-16 12669 8.16 2.71 0.410
## 6 1.50e9 2016-04-17 9705 6.48 3.19 0.780
## # ... with 7 more variables: LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
# Separate the date column into 2 columns, then drop the Time column
daily_sleep_clean <-
  daily_sleep_clean %>%
  separate(SleepDay, c("Date", "Time"), sep=" ") %>%
  select(-Time)
# Parse the month/day/year strings in the Date column into Date values
daily_sleep_clean$Date <- mdy(daily_sleep_clean$Date)
head(daily_sleep_clean)
## # A tibble: 6 x 4
## Id Date TotalMinutesAsleep TotalTimeInBed
## <dbl> <date> <dbl> <dbl>
## 1 1503960366 2016-04-12 327 346
## 2 1503960366 2016-04-13 384 407
## 3 1503960366 2016-04-15 412 442
## 4 1503960366 2016-04-16 340 367
## 5 1503960366 2016-04-17 700 712
## 6 1503960366 2016-04-19 304 320
# Separate the date column into 2 columns, then drop the Time column
weight_log_clean <-
  weight_log_clean %>%
  separate(Date, c("Date", "Time"), sep=" ") %>%
  select(-Time)
# Parse the month/day/year strings in the Date column into Date values
weight_log_clean$Date <- mdy(weight_log_clean$Date)
head(weight_log_clean)
## # A tibble: 6 x 4
## Id Date WeightPounds BMI
## <dbl> <date> <dbl> <dbl>
## 1 1503960366 2016-05-02 116. 22.6
## 2 1503960366 2016-05-03 116. 22.6
## 3 1927972279 2016-04-13 294. 47.5
## 4 2873212765 2016-04-21 125. 21.5
## 5 2873212765 2016-05-12 126. 21.7
## 6 4319703577 2016-04-17 160. 27.5
From the output above, all of the date columns were successfully formatted from character to date.
\(~\)
We will use summary() to get descriptive statistics and check each variable for any obvious issues. summary() also comes with the added benefit of providing us with the count of the NA’s for each column, if any.
daily_activity_clean %>%
  select(-Id,
         -Date) %>%
  summary()
## TotalSteps TotalDistance VeryActiveDistance ModeratelyActiveDistance
## Min. : 0 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 7406 Median : 5.245 Median : 0.210 Median :0.2400
## Mean : 7638 Mean : 5.490 Mean : 1.503 Mean :0.5675
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.: 2.053 3rd Qu.:0.8000
## Max. :36019 Max. :28.030 Max. :21.920 Max. :6.4800
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## Min. : 0.000 Min. :0.000000 Min. : 0.00
## 1st Qu.: 1.945 1st Qu.:0.000000 1st Qu.: 0.00
## Median : 3.365 Median :0.000000 Median : 4.00
## Mean : 3.341 Mean :0.001606 Mean : 21.16
## 3rd Qu.: 4.782 3rd Qu.:0.000000 3rd Qu.: 32.00
## Max. :10.710 Max. :0.110000 Max. :210.00
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8 1st Qu.:1828
## Median : 6.00 Median :199.0 Median :1057.5 Median :2134
## Mean : 13.56 Mean :192.8 Mean : 991.2 Mean :2304
## 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :143.00 Max. :518.0 Max. :1440.0 Max. :4900
daily_sleep %>%
  select(TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
## TotalMinutesAsleep TotalTimeInBed
## Min. : 58.0 Min. : 61.0
## 1st Qu.:361.0 1st Qu.:403.0
## Median :433.0 Median :463.0
## Mean :419.5 Mean :458.6
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
weight_log_clean %>%
  select(-Id,
         -Date) %>%
  summary()
## WeightPounds BMI
## Min. :116.0 Min. :21.45
## 1st Qu.:135.4 1st Qu.:23.96
## Median :137.8 Median :24.39
## Mean :158.8 Mean :25.19
## 3rd Qu.:187.5 3rd Qu.:25.56
## Max. :294.3 Max. :47.54
Based on the output, there do appear to be some high values; for example, an individual logged 4900 calories burned. However, this does appear to be within reason. The summary output also shows that there are no NA’s in any of our variables, so we are good in that regard.
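Missing values can also be counted explicitly rather than read off the summary() printout. The sketch below uses the first three rows of the “WeightPounds” and “Fat” columns from the weight preview earlier; `toy_weight` is a hypothetical name used only for this illustration.

```r
# First three rows of two weight_log columns, copied from the glimpse() output
toy_weight <- data.frame(
  WeightPounds = c(115.9631, 115.9631, 294.3171),
  Fat          = c(22, NA, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy_weight))

print(na_counts)
```

A named vector like this is easier to scan, or to assert on, than the formatted summary table.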
We can now merge the three datasets.
\(~\)
For each of the three datasets, we notice that the “Id” and “Date” fields are present, so we can join the tables on these columns.
Let’s first check the number of unique IDs for each dataset:
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_sleep$Id)
## [1] 24
n_distinct(weight_log$Id)
## [1] 8
From this we see that 33 individuals logged data for activity while only 24 and 8 users logged data for sleep and weight, respectively. In this case we’ll use 2 left joins to join the 3 datasets on the “Id” and “Date” columns so we don’t lose any data.
combined_data <- left_join(left_join(daily_activity_clean,
                                     daily_sleep_clean, by=c("Id", "Date")),
                           weight_log_clean, by=c("Id", "Date"))
glimpse(combined_data)
## Rows: 940
## Columns: 17
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-~
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
## $ TotalMinutesAsleep <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 32~
## $ TotalTimeInBed <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 36~
## $ WeightPounds <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ BMI <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
Everything looks good so far and we can proceed with our exploratory data analysis!
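To illustrate why a left join keeps every activity row, here is a small sketch on made-up tables (`activity`, `sleep`, and their values are hypothetical). Rows without a match in the right-hand table are kept, with NA filled in for the columns that come from it.

```r
library(dplyr)

# Hypothetical activity log: three rows across two users
activity <- tibble(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9762)
)

# Hypothetical sleep log: only one of those user-days has sleep recorded
sleep <- tibble(
  Id = 1,
  Date = as.Date("2016-04-12"),
  TotalMinutesAsleep = 327
)

# left_join() keeps every row of `activity`, matched or not
joined <- left_join(activity, sleep, by = c("Id", "Date"))

print(joined)
```

The row count matching the left table, as in the 940 rows above, is a quick sign that no key pair was duplicated on the right-hand side.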
\(~\)
To begin, we will create some correlation matrices to get a general idea of what relationships exist between our variables.
Please note that I omitted the variables related to distance because they are redundant with the “Minutes” variables and because leaving them out makes the visualization easier to read.
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>% # Keep only the rows that have sleep logged
  select(-Id, -Date, -WeightPounds, -BMI, -ends_with("Distance")) %>%
  cor() %>%
  round(2) %>%
  ggcorrplot(hc.order = TRUE,
             ggtheme = ggplot2::theme_gray,
             colors = c("#6D9EC1", "white", "#E46726"),
             lab=TRUE)
Now this is a lot of information, so we will focus on the relationships that have an r-value above 0.5 or below -0.5.
We do see some moderately strong to strong positive correlations with the following:
These relationships all make intuitive sense, as more activity results in more calories burned and more steps taken. The relationship between total time in bed and time asleep is also obvious.
We also see some moderately strong negative correlations with the following:
This suggests that as the amount of sleep increases, the number of minutes spent sedentary decreases.
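The pairs that pass the 0.5 cutoff can also be pulled out of the correlation matrix programmatically instead of being read off the plot. The sketch below runs on made-up numbers (the `toy` data frame is hypothetical), so the pairs it prints say nothing about the real dataset.

```r
# Hypothetical daily values, chosen so that exactly one pair is strongly correlated
toy <- data.frame(
  TotalSteps       = c(1000, 4000, 7000, 10000, 13000, 16000),
  Calories         = c(1600, 1900, 2150, 2500, 2750, 3100),
  SedentaryMinutes = c(900, 700, 950, 720, 930, 710)
)

r <- cor(toy)

# Use the upper triangle only, so each pair is listed once and the diagonal is skipped
idx <- which(upper.tri(r) & abs(r) > 0.5, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 2)
)

print(strong_pairs)
```

Listing the pairs this way keeps the cutoff in one place, which helps if the threshold is changed later.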
\(~\)
We will also create a similar correlation matrix that includes the variables associated with weight. We are doing this separately because there are only 35 rows of data that have both sleep and weight information logged, which may result in unreliable findings due to the small sample size. Nonetheless, let’s create the correlation matrix and see if any relationships exist between weight and the other variables. I have excluded BMI from the matrix due to its redundancy with weight.
# Subsetting the data for the correlation matrix
combined_data %>%
  filter(!is.na(WeightPounds) & !is.na(TotalMinutesAsleep)) %>% # Keep only the rows that have both weight and sleep logged
  select(-Id, -Date, -BMI, -ends_with("Distance")) %>%
  glimpse()
## Rows: 35
## Columns: 9
## $ TotalSteps <dbl> 14727, 15103, 356, 3428, 12231, 10199, 5652, 1551~
## $ VeryActiveMinutes <dbl> 41, 50, 0, 0, 200, 50, 8, 0, 0, 50, 5, 13, 35, 48~
## $ FairlyActiveMinutes <dbl> 15, 24, 0, 0, 37, 14, 24, 0, 0, 3, 13, 42, 41, 4,~
## $ LightlyActiveMinutes <dbl> 277, 254, 32, 190, 159, 189, 142, 86, 217, 280, 2~
## $ SedentaryMinutes <dbl> 798, 816, 986, 1121, 525, 796, 548, 862, 837, 741~
## $ Calories <dbl> 2004, 1990, 2151, 1692, 4552, 1994, 1718, 1466, 1~
## $ TotalMinutesAsleep <dbl> 277, 273, 398, 115, 549, 366, 630, 508, 370, 357,~
## $ TotalTimeInBed <dbl> 309, 296, 422, 129, 583, 387, 679, 535, 386, 366,~
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 154.1031, 199.9593,~
combined_data %>%
  filter(!is.na(WeightPounds) & !is.na(TotalMinutesAsleep)) %>% # Keep only the rows that have both weight and sleep logged
  select(-Id, -Date, -BMI, -ends_with("Distance")) %>%
  cor() %>%
  round(1) %>%
  ggcorrplot(hc.order = TRUE,
             ggtheme = ggplot2::theme_gray,
             colors = c("#6D9EC1", "white", "#E46726"),
             lab=TRUE)
Focusing only on the correlations related to weight, we can see that the only moderately strong relationship is a negative relationship between weight and minutes doing light activity.
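To get a feel for how much uncertainty 35 rows leave around a correlation, the sketch below computes an approximate 95% confidence interval using the Fisher z-transformation. The value `r_example` is a hypothetical placeholder, not the coefficient from the matrix above, and the calculation treats the 35 rows as independent, which repeated measurements from the same users may not be.

```r
# Hypothetical correlation for illustration only; not a value reported above
r_example <- -0.5
n <- 35  # rows with both weight and sleep logged, per the glimpse() output

# Fisher z-transformation: atanh(r) is roughly normal with SE = 1 / sqrt(n - 3)
z  <- atanh(r_example)
se <- 1 / sqrt(n - 3)

# Transform the interval bounds back to the correlation scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(ci, 2))
```

With real data, `cor.test()` reports the same kind of interval directly.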
\(~\)
Next, let’s see if there is any relationship between day of the week and activity intensity for the participants.
Creating a column for day of the week and setting it as a factor:
combined_data$DayOfWeek <-
  weekdays(combined_data$Date) %>%
  factor(levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday"))
Creating a new data frame, “activity_df”, containing the average minutes spent at each intensity level, grouped by day of the week. We will also need to convert this data frame into a long format in order to create the visualization.
activity_df <-
  combined_data %>%
  group_by(DayOfWeek) %>%
  summarize(VeryActive = round(mean(VeryActiveMinutes), 2),
            FairlyActive = round(mean(FairlyActiveMinutes), 2),
            LightlyActive = round(mean(LightlyActiveMinutes), 2),
            Sedentary = round(mean(SedentaryMinutes), 2)) %>%
  pivot_longer(!DayOfWeek, names_to = "Intensity", values_to = "Minutes")
Setting intensity level as a factor:
activity_df$Intensity <-
  activity_df$Intensity %>%
  factor(levels = c("VeryActive", "FairlyActive", "LightlyActive", "Sedentary"))
head(activity_df)
## # A tibble: 6 x 3
## DayOfWeek Intensity Minutes
## <fct> <fct> <dbl>
## 1 Sunday VeryActive 20.0
## 2 Sunday FairlyActive 14.5
## 3 Sunday LightlyActive 174.
## 4 Sunday Sedentary 990.
## 5 Monday VeryActive 23.1
## 6 Monday FairlyActive 14
Creating the visualization:
activity_df %>%
  ggplot(aes(x=DayOfWeek, y=Minutes, fill=Intensity)) +
  geom_col(color='azure2', alpha=0.7, position="dodge") +
  scale_fill_brewer(palette = "PuOr") +
  theme_minimal() +
  labs(title = "Average Active Minutes by Intensity and Day",
       x = "Day of the Week",
       y= "Minutes Active",
       fill = "Intensity Level")
We observe that the bulk of the logged minutes were spent being sedentary, and there does not seem to be any specific pattern from day to day.
Let’s also take a look at the visualization without the sedentary intensity level:
activity_df %>%
  filter(Intensity != "Sedentary") %>%
  ggplot(aes(x=DayOfWeek, y=Minutes, fill=Intensity)) +
  geom_col(color='azure2', position="dodge") +
  scale_fill_brewer(palette = "Accent") +
  theme_minimal() +
  labs(title = "Average Active Minutes by Intensity and Day",
       x = "Day of the Week",
       y= "Minutes Active",
       fill = "Intensity Level")
We see that minutes of light intensity activity don’t really follow a particular pattern either.
Lastly, let’s remove lightly active intensity and focus on just the very active and fairly active intensities:
activity_df %>%
  filter(!Intensity %in% c("LightlyActive", "Sedentary")) %>%
  ggplot(aes(x=DayOfWeek, y=Minutes, fill=Intensity)) +
  geom_col(color='azure2') +
  geom_text(aes(label = Minutes), vjust = 1.5) +
  # Label each stacked bar with its total; after_stat(y) replaces the deprecated ..y.. notation
  stat_summary(fun = sum, aes(label = after_stat(y), group = DayOfWeek), geom = "text", vjust= -.35) +
  scale_fill_brewer(palette = "Spectral") +
  theme_minimal() +
  labs(title = "Average Active Minutes by Intensity and Day",
       x = "Day of the Week",
       y= "Minutes Active",
       fill = "Intensity Level")
According to Mayo Clinic, most healthy adults should aim for at least 30 minutes of moderate activity a day [1]. This graph shows that on average these individuals are indeed hitting those 30 minutes of moderate activity per day (assuming fairly active intensity equates to at least moderate activity by Mayo Clinic’s definition).
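The comparison against the 30-minute guideline can be made explicit with a few lines of R. The sketch below uses only the Sunday and Monday averages printed by head(activity_df) earlier (already rounded there); the column names `ModerateOrHigher` and `MeetsGuideline` are hypothetical, and the other five days would need the same check.

```r
# Rounded daily averages copied from the head(activity_df) output above
daily_avg <- data.frame(
  DayOfWeek    = c("Sunday", "Monday"),
  VeryActive   = c(20.0, 23.1),
  FairlyActive = c(14.5, 14.0)
)

# Assumption: fairly active and very active minutes both count as at least moderate activity
daily_avg$ModerateOrHigher <- daily_avg$VeryActive + daily_avg$FairlyActive

# Compare against the 30-minute daily guideline
daily_avg$MeetsGuideline <- daily_avg$ModerateOrHigher >= 30

print(daily_avg)
```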
\(~\)
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  ggplot(aes(x=TotalMinutesAsleep)) +
  geom_histogram(binwidth= 25, fill='darkturquoise', color="azure1") +
  theme_light() +
  labs(title = "Distribution of Minutes Slept",
       x = "Minutes Asleep",
       y = "Count") +
  geom_vline(aes(xintercept=median(TotalMinutesAsleep)), color="black", linetype="dashed") +
  annotate("text", x=370, y=53, label="Median = 433")
The distribution of logged sleep roughly follows a normal distribution. The median amount of sleep logged for a given day is 7 hours and 13 minutes. According to the Sleep Foundation, adults between 18 and 64 are recommended to get 7 to 9 hours of sleep per day [2].
Let’s also take a look at average sleep duration on different days of the week:
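The conversion from the median of 433 minutes to hours and minutes is simple integer arithmetic, sketched here in base R (the variable names are illustrative):

```r
median_minutes <- 433L

# Integer division gives the whole hours; the remainder gives the leftover minutes
hours   <- median_minutes %/% 60L
minutes <- median_minutes %% 60L

label <- sprintf("%d h %d min", hours, minutes)
print(label)
```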
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  group_by(DayOfWeek) %>%
  summarize(AvgSleep = mean(TotalMinutesAsleep)) %>%
  ggplot(aes(x=DayOfWeek, y=AvgSleep, fill=DayOfWeek)) +
  geom_col(color = "azure1", alpha=0.8) +
  geom_text(aes(label = round(AvgSleep,2)), vjust = 1.5) +
  scale_fill_brewer(palette = "Accent") +
  theme_minimal() +
  labs(title = "Average Sleep by Day",
       x = "Day of Week",
       y = "Average Sleep Duration (min)") +
  theme(legend.position='none')
It appears that average sleep duration is below the recommended 7 hours for 5 out of the 7 days (Monday, Tuesday, Thursday, Friday, Saturday).
\(~\)
One last thing we can look into is sleep latency, that is, the time it takes to fall asleep. According to Healthline, it typically takes 10 to 20 minutes for an individual to fall asleep; a sleep latency outside of that range may indicate an underlying sleep condition [3].
We will make the assumption that the variable “TotalTimeInBed” only includes the time sleeping and the time attempting to sleep. Let’s go ahead and calculate the average and median sleep latency and plot the distribution:
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  summarize(AverageSleepLatency = mean(TotalTimeInBed - TotalMinutesAsleep),
            MedianSleepLatency = median(TotalTimeInBed - TotalMinutesAsleep))
## # A tibble: 1 x 2
## AverageSleepLatency MedianSleepLatency
## <dbl> <dbl>
## 1 39.3 25.5
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>%
  ggplot(aes(x=SleepLatency)) +
  geom_histogram(binwidth= 8, fill='coral1', color="azure1") +
  theme_light() +
  labs(title = "Distribution of Sleep Latency",
       x = "Sleep Latency (mins)",
       y = "Count") +
  geom_vline(aes(xintercept=median(SleepLatency)), color="black", linetype="dashed") +
  annotate("text", x=60, y=92, label="Median = 25.5")
The distribution of sleep latency is skewed right, as the bulk of the observations are on the lower end with a few large outliers, which could indicate insomnia. This is also underlined by the fact that the median sleep latency is less than the mean sleep latency (25.5 vs 39.3). We note that even the median sleep latency is above the typical sleep latency range of 10 to 20 minutes.
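As a worked illustration of the latency calculation, the sketch below applies it to the six rows shown by head(daily_sleep) earlier. Six rows from a single user are not representative of the full dataset; the point is only to show the subtraction and the comparison against the 10 to 20 minute range.

```r
# First six rows of the sleep table, copied from head(daily_sleep)
time_in_bed    <- c(346, 407, 442, 367, 712, 320)
minutes_asleep <- c(327, 384, 412, 340, 700, 304)

# Assumption (as in the text): time in bed but not asleep is time spent falling asleep
latency <- time_in_bed - minutes_asleep

# Flag nights that fall outside the typical 10 to 20 minute range
outside_range <- latency < 10 | latency > 20

print(latency)
print(median(latency))
print(sum(outside_range))
```

Here the latencies are 19, 23, 30, 27, 12 and 16 minutes, so the median is 21 and three of the six nights fall outside the typical range.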
\(~\)
Since the average sleep latency seems to be on the higher side, it is possible that sleep latency is related to minutes of activity. Let’s plot it against the 4 different intensities to check.
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>%
  ggplot(aes(x=SleepLatency, y=SedentaryMinutes)) +
  geom_point() +
  theme_light() +
  labs(title = "Sedentary Minutes vs Sleep Latency",
       x = "Sleep Latency (mins)",
       y = "Time Spent Sedentary (mins)") +
  annotate("text", x=275, y=1125, label="r = -0.17")
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>%
  ggplot(aes(x=SleepLatency, y=LightlyActiveMinutes)) +
  geom_point() +
  theme_light() +
  labs(title = "Lightly Active Minutes vs Sleep Latency",
       x = "Sleep Latency (mins)",
       y = "Lightly Active Minutes") +
  annotate("text", x=275, y=425, label="r = -0.15")
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>%
  ggplot(aes(x=SleepLatency, y=FairlyActiveMinutes)) +
  geom_point() +
  theme_light() +
  labs(title = "Fairly Active Minutes vs Sleep Latency",
       x = "Sleep Latency (mins)",
       y = "Fairly Active Minutes") +
  annotate("text", x=275, y=137, label="r = 0.32")
combined_data %>%
  filter(!is.na(TotalMinutesAsleep)) %>%
  mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>%
  ggplot(aes(x=SleepLatency, y=VeryActiveMinutes)) +
  geom_point() +
  theme_light() +
  labs(title = "Very Active Minutes vs Sleep Latency",
       x = "Sleep Latency (mins)",
       y = "Very Active Minutes") +
  annotate("text", x=275, y=175, label="r = 0.08")
We notice that there is only a weak positive correlation between fairly active minutes and sleep latency (r = 0.32), while the other intensities show little to no relationship with sleep latency (r between -0.17 and 0.08).
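The r values in the plot annotations above are typed in as fixed strings. One way to keep a label in sync with the data is to compute it with `cor()` and format it with `sprintf()`. The sketch below does this for the six user-days that appear in both head() previews earlier, so the coefficient it produces describes those six rows only, not the full dataset.

```r
# Sleep latency (TotalTimeInBed - TotalMinutesAsleep) for the six previewed sleep rows
sleep_latency <- c(19, 23, 30, 27, 12, 16)

# SedentaryMinutes for the same six dates, copied from the combined_data preview
sedentary_minutes <- c(728, 776, 726, 773, 539, 775)

# Pearson correlation, formatted to two decimals for use as a plot label
r <- cor(sleep_latency, sedentary_minutes)
r_label <- sprintf("r = %.2f", r)

print(r_label)
```

The resulting string can then be passed to `annotate("text", ..., label = r_label)` in place of a hard-coded value.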
\(~\)
\(~\)
Findings:
Suggestions for Further Investigation:
There are definitely several things we can improve with regards to the data if we want to do any follow up analyses:
\(~\)
Thanks for reading! The files, data, and images for this case study can be found on GitHub.
[1] Edward R. Laskowski, M. D. (2019, April 27). How much exercise do you really need? Mayo Clinic. https://www.mayoclinic.org/healthy-lifestyle/fitness/expert-answers/exercise/faq-20057916
[2] Sleep Statistics - Facts and Data About Sleep 2020 | Sleep Foundation. (2021). Retrieved from https://www.sleepfoundation.org/how-sleep-works/sleep-facts-statistics
[3] Silver, N. (2020, June 5). How long does it take to fall asleep? Average time and tips. Healthline. https://www.healthline.com/health/healthy-sleep/how-long-does-it-take-to-fall-asleep#normal-sleep