Bellabeat Case Study

Introduction

Bellabeat is a tech-driven wellness company, founded in 2013 by Sandro Mur and Urška Sršen, that specializes in health-focused smart products for women. Its product line helps empower users with knowledge about their own health and habits. These products include:

Bellabeat App: Provides users with a hub to track their health data related to their habits and activities.
Leaf: Wellness tracker that can be worn as a bracelet, necklace, or clip that tracks activity, sleep, and stress. Connects to the Bellabeat app to provide the user insights into their daily wellness.
Time: A watch that tracks user activity, sleep, and stress. Connects to the Bellabeat app to provide the user insights into their daily wellness.
Spring: A water bottle that tracks daily water intake. Connects to the Bellabeat app to provide the user insights into their daily wellness.
Membership: Gives users 24/7 access to fully personalized guidance based on their lifestyle and goals.

Objective

Using data gathered from non-Bellabeat smart devices, we will perform exploratory data analysis to try to identify trends in smart device usage. Some questions that we will attempt to answer include:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

The Data

The data used in this analysis is from Kaggle and was made available by the user Möbius (CC0: Public Domain). These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

Upon initial examination of the datasets in Excel, we immediately notice a few things:

Data was collected between 3/12/2016 and 5/12/2016.
Datasets have differing levels of granularity, with some datasets containing minute-to-minute data and others containing daily data.
Several of the datasets are available in both long and wide formats.
There does not appear to be any data regarding water intake.

We will keep these observations in mind as we delve deeper into the data.

Loading the Data

Let’s start by loading the necessary libraries:

library(tidyverse)
library(lubridate)
library(ggcorrplot)

We will elect to use the datasets containing the daily data only.

Importing the datasets:

daily_activity <- read_csv("dailyActivity_merged.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   ActivityDate = col_character(),
##   TotalSteps = col_double(),
##   TotalDistance = col_double(),
##   TrackerDistance = col_double(),
##   LoggedActivitiesDistance = col_double(),
##   VeryActiveDistance = col_double(),
##   ModeratelyActiveDistance = col_double(),
##   LightActiveDistance = col_double(),
##   SedentaryActiveDistance = col_double(),
##   VeryActiveMinutes = col_double(),
##   FairlyActiveMinutes = col_double(),
##   LightlyActiveMinutes = col_double(),
##   SedentaryMinutes = col_double(),
##   Calories = col_double()
## )

daily_sleep <- read_csv("sleepDay_merged.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   SleepDay = col_character(),
##   TotalSleepRecords = col_double(),
##   TotalMinutesAsleep = col_double(),
##   TotalTimeInBed = col_double()
## )

weight_log <- read_csv("weightLogInfo_merged.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   Date = col_character(),
##   WeightKg = col_double(),
##   WeightPounds = col_double(),
##   Fat = col_double(),
##   BMI = col_double(),
##   IsManualReport = col_logical(),
##   LogId = col_double()
## )

Next, we’ll quickly examine the structure of each of the three tables using head() and glimpse():

Activity Dataset

head(daily_activity)

## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~

Sleep Dataset

head(daily_sleep)

## # A tibble: 6 x 5
##          Id SleepDay           TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
##       <dbl> <chr>                         <dbl>             <dbl>          <dbl>
## 1    1.50e9 4/12/2016 12:00:0~                1               327            346
## 2    1.50e9 4/13/2016 12:00:0~                2               384            407
## 3    1.50e9 4/15/2016 12:00:0~                1               412            442
## 4    1.50e9 4/16/2016 12:00:0~                2               340            367
## 5    1.50e9 4/17/2016 12:00:0~                1               700            712
## 6    1.50e9 4/19/2016 12:00:0~                1               304            320

glimpse(daily_sleep)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~

Weight Dataset

head(weight_log)

## # A tibble: 6 x 8
##         Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport     LogId
##      <dbl> <chr>         <dbl>        <dbl> <dbl> <dbl> <lgl>              <dbl>
## 1   1.50e9 5/2/2016 ~     52.6         116.    22  22.6 TRUE             1.46e12
## 2   1.50e9 5/3/2016 ~     52.6         116.    NA  22.6 TRUE             1.46e12
## 3   1.93e9 4/13/2016~    134.          294.    NA  47.5 FALSE            1.46e12
## 4   2.87e9 4/21/2016~     56.7         125.    NA  21.5 TRUE             1.46e12
## 5   2.87e9 5/12/2016~     57.3         126.    NA  21.7 TRUE             1.46e12
## 6   4.32e9 4/17/2016~     72.4         160.    25  27.5 TRUE             1.46e12

glimpse(weight_log)

## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~

We can see that in the Activity Dataset, the “ActivityDate” column has a data type of character. Similarly, the date columns in the sleep and weight datasets (“SleepDay” and “Date”) also have a data type of character. We will need to convert these columns to a data type of date. Furthermore, the date columns in the sleep and weight datasets contain both the date and the time, so we will need to clean that up as well.

Cleaning the Data

Dropping unnecessary columns

To start off, we will remove columns that are redundant or unnecessary.

In the daily_activity dataset, the columns “TrackerDistance” and “LoggedActivitiesDistance” are not needed since we already have a “TotalDistance” column, which is the sum of those 2 columns. We’ll go ahead and remove them.

daily_activity_clean <- 
    daily_activity %>% 
    select(-TrackerDistance,
           -LoggedActivitiesDistance)

head(daily_activity_clean)

## # A tibble: 6 x 13
##       Id ActivityDate TotalSteps TotalDistance VeryActiveDista~ ModeratelyActiv~
##    <dbl> <chr>             <dbl>         <dbl>            <dbl>            <dbl>
## 1 1.50e9 4/12/2016         13162          8.5              1.88            0.550
## 2 1.50e9 4/13/2016         10735          6.97             1.57            0.690
## 3 1.50e9 4/14/2016         10460          6.74             2.44            0.400
## 4 1.50e9 4/15/2016          9762          6.28             2.14            1.26 
## 5 1.50e9 4/16/2016         12669          8.16             2.71            0.410
## 6 1.50e9 4/17/2016          9705          6.48             3.19            0.780
## # ... with 7 more variables: LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
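
The assumption that “TotalDistance” equals the sum of the two dropped columns can be spot-checked on the original table before the columns are removed. The sketch below is only an illustration: `toy_activity` is a hypothetical stand-in built from the first three rows of the head() output above, and the 0.01 tolerance is an arbitrary choice. Running the same filter on `daily_activity` would show whether the assumption holds for every row.

```r
library(dplyr)

# Hypothetical toy table: values copied from the first rows printed above
toy_activity <- tibble(
    TotalDistance            = c(8.50, 6.97, 6.74),
    TrackerDistance          = c(8.50, 6.97, 6.74),
    LoggedActivitiesDistance = c(0, 0, 0)
)

# Count rows where the total differs from the sum by more than a small tolerance
mismatches <- 
    toy_activity %>% 
    filter(abs(TotalDistance - (TrackerDistance + LoggedActivitiesDistance)) > 0.01) %>% 
    nrow()

mismatches
```

A non-zero count on the real data would suggest keeping the columns or revisiting the reasoning for dropping them.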

In the daily_sleep dataset, we can remove the “TotalSleepRecords” column as it is irrelevant for this analysis.

daily_sleep_clean <- 
    daily_sleep %>% 
    select(-TotalSleepRecords)

head(daily_sleep_clean)

In the weight_log dataset, we can remove the “Fat” (only has 2 records, the rest are all NA), “WeightKg” (we don’t need 2 weight columns), “IsManualReport”, and “LogId” columns as they are unnecessary.

weight_log_clean <- 
    weight_log %>% 
    select(-Fat,
           -WeightKg,
           -IsManualReport,
           -LogId)

head(weight_log_clean)

## # A tibble: 6 x 4
##           Id Date                  WeightPounds   BMI
##        <dbl> <chr>                        <dbl> <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM          116.  22.6
## 2 1503960366 5/3/2016 11:59:59 PM          116.  22.6
## 3 1927972279 4/13/2016 1:08:52 AM          294.  47.5
## 4 2873212765 4/21/2016 11:59:59 PM         125.  21.5
## 5 2873212765 5/12/2016 11:59:59 PM         126.  21.7
## 6 4319703577 4/17/2016 11:59:59 PM         160.  27.5

Removing Duplicate Rows

We’ll now check for and remove duplicate rows in each dataset:

daily_activity_clean <- 
    daily_activity_clean %>% 
    distinct()

daily_sleep_clean <- 
    daily_sleep_clean %>% 
    distinct()

weight_log_clean <- 
    weight_log_clean %>% 
    distinct()

# Uncleaned dataset had 940 rows and 15 columns
nrow(daily_activity_clean)

## [1] 940

# Uncleaned dataset had 413 rows and 5 columns
nrow(daily_sleep_clean)

## [1] 410

# Uncleaned dataset had 67 rows and 8 columns
nrow(weight_log_clean)

## [1] 67

From the above we see that the “daily_sleep” dataset had 3 duplicate rows removed while the other 2 datasets were unmodified.
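
Comparing row counts before and after distinct() works, but the number of duplicates can also be counted directly before anything is removed. A minimal sketch, assuming a hypothetical `toy_sleep` table with one deliberately repeated row (these are not the real records):

```r
library(dplyr)

# Hypothetical toy table: the second row is an exact copy of the first
toy_sleep <- tibble(
    Id = c(1, 1, 1, 2),
    Date = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
    TotalMinutesAsleep = c(327, 327, 384, 412)
)

# duplicated() flags every row that repeats an earlier row exactly
n_duplicates <- sum(duplicated(toy_sleep))

# distinct() keeps one copy of each unique row
n_after <- nrow(distinct(toy_sleep))

n_duplicates
n_after
```

Applied to the cleaned tables, `sum(duplicated(...))` gives the number of rows that distinct() will drop.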

Formatting the Date Columns

As mentioned earlier, the date columns in all 3 tables were read in as character data. We will convert them to a date data type.

daily_activity_clean$ActivityDate <- mdy(daily_activity_clean$ActivityDate)

daily_activity_clean <- 
    daily_activity_clean %>% 
        rename(Date = ActivityDate)

head(daily_activity_clean)

## # A tibble: 6 x 13
##        Id Date       TotalSteps TotalDistance VeryActiveDista~ ModeratelyActive~
##     <dbl> <date>          <dbl>         <dbl>            <dbl>             <dbl>
## 1  1.50e9 2016-04-12      13162          8.5              1.88             0.550
## 2  1.50e9 2016-04-13      10735          6.97             1.57             0.690
## 3  1.50e9 2016-04-14      10460          6.74             2.44             0.400
## 4  1.50e9 2016-04-15       9762          6.28             2.14             1.26 
## 5  1.50e9 2016-04-16      12669          8.16             2.71             0.410
## 6  1.50e9 2016-04-17       9705          6.48             3.19             0.780
## # ... with 7 more variables: LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>

# Split the timestamp into Date and Time columns, then drop the Time column
# (extra = "drop" silently discards the trailing AM/PM piece)
daily_sleep_clean <- 
    daily_sleep_clean %>% 
    separate(SleepDay, c("Date", "Time"), sep=" ", extra="drop") %>% 
    select(-Time)

# Parse the month/day/year strings into Date objects
daily_sleep_clean$Date <- mdy(daily_sleep_clean$Date)

head(daily_sleep_clean)

## # A tibble: 6 x 4
##           Id Date       TotalMinutesAsleep TotalTimeInBed
##        <dbl> <date>                  <dbl>          <dbl>
## 1 1503960366 2016-04-12                327            346
## 2 1503960366 2016-04-13                384            407
## 3 1503960366 2016-04-15                412            442
## 4 1503960366 2016-04-16                340            367
## 5 1503960366 2016-04-17                700            712
## 6 1503960366 2016-04-19                304            320

# Split the timestamp into Date and Time columns, then drop the Time column
# (extra = "drop" silently discards the trailing AM/PM piece)
weight_log_clean <- 
    weight_log_clean %>% 
    separate(Date, c("Date", "Time"), sep=" ", extra="drop") %>% 
    select(-Time)

# Parse the month/day/year strings into Date objects
weight_log_clean$Date <- mdy(weight_log_clean$Date)

head(weight_log_clean)

## # A tibble: 6 x 4
##           Id Date       WeightPounds   BMI
##        <dbl> <date>            <dbl> <dbl>
## 1 1503960366 2016-05-02         116.  22.6
## 2 1503960366 2016-05-03         116.  22.6
## 3 1927972279 2016-04-13         294.  47.5
## 4 2873212765 2016-04-21         125.  21.5
## 5 2873212765 2016-05-12         126.  21.7
## 6 4319703577 2016-04-17         160.  27.5

From the output above, all of the date columns were successfully formatted from character to date.
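
As an aside, the split-then-parse approach is not the only option. A possible alternative, sketched here on two timestamp strings copied from the glimpse() output above, is to parse the full timestamp with lubridate’s mdy_hms() and then keep only the date part with as_date(). The variable names are hypothetical.

```r
library(lubridate)

# Two raw timestamps in the same format as the SleepDay and Date columns
raw_timestamps <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# Parse month/day/year plus the 12-hour clock time, then drop the time of day
parsed_dates <- as_date(mdy_hms(raw_timestamps))

parsed_dates
```

This avoids creating a temporary Time column, at the cost of parsing a time component that is discarded anyway.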

Summarizing the Data/Checking for NA’s

We will use summary() to get descriptive statistics and check each variable for any obvious issues. summary() also comes with the added benefit of providing the count of NA’s for each column, if any.

daily_activity_clean %>% 
    select(-Id,
           -Date) %>% 
    summary()

##    TotalSteps    TotalDistance    VeryActiveDistance ModeratelyActiveDistance
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000     Min.   :0.0000          
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 0.000     1st Qu.:0.0000          
##  Median : 7406   Median : 5.245   Median : 0.210     Median :0.2400          
##  Mean   : 7638   Mean   : 5.490   Mean   : 1.503     Mean   :0.5675          
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 2.053     3rd Qu.:0.8000          
##  Max.   :36019   Max.   :28.030   Max.   :21.920     Max.   :6.4800          
##  LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
##  Min.   : 0.000      Min.   :0.000000        Min.   :  0.00   
##  1st Qu.: 1.945      1st Qu.:0.000000        1st Qu.:  0.00   
##  Median : 3.365      Median :0.000000        Median :  4.00   
##  Mean   : 3.341      Mean   :0.001606        Mean   : 21.16   
##  3rd Qu.: 4.782      3rd Qu.:0.000000        3rd Qu.: 32.00   
##  Max.   :10.710      Max.   :0.110000        Max.   :210.00   
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   :   0  
##  1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8   1st Qu.:1828  
##  Median :  6.00      Median :199.0        Median :1057.5   Median :2134  
##  Mean   : 13.56      Mean   :192.8        Mean   : 991.2   Mean   :2304  
##  3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :4900

# Note: this summarizes the original daily_sleep table, so the 3 duplicate rows are still included
daily_sleep %>% 
    select(TotalMinutesAsleep, 
           TotalTimeInBed) %>% 
    summary()

##  TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.:403.0  
##  Median :433.0      Median :463.0  
##  Mean   :419.5      Mean   :458.6  
##  3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :796.0      Max.   :961.0

weight_log_clean %>% 
    select(-Id,
           -Date) %>% 
    summary()

##   WeightPounds        BMI       
##  Min.   :116.0   Min.   :21.45  
##  1st Qu.:135.4   1st Qu.:23.96  
##  Median :137.8   Median :24.39  
##  Mean   :158.8   Mean   :25.19  
##  3rd Qu.:187.5   3rd Qu.:25.56  
##  Max.   :294.3   Max.   :47.54

Based on the output, there do appear to be some high values. For example, an individual logged 4900 calories burned. However, this does appear to be within reason. The summary output also shows no NA’s in any of our variables, so we are good in that regard.
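
Reading NA counts off the summary() output relies on noticing a missing row in a dense printout. A more direct check is to count missing values per column. The sketch below uses a hypothetical `toy_weight` table whose values are copied from the first three rows of the weight log shown earlier, including the “Fat” column that was dropped because of its NA’s:

```r
# Hypothetical toy table: values copied from the weight log output above
toy_weight <- data.frame(
    WeightPounds = c(115.9631, 115.9631, 294.3171),
    Fat          = c(22, NA, NA)
)

# is.na() marks each missing cell; colSums() totals them per column
na_counts <- colSums(is.na(toy_weight))

na_counts
```

Running `colSums(is.na(...))` on each cleaned table would confirm the absence of NA’s explicitly.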

We can now merge the three datasets.

Merging the Datasets

For each of the three datasets, we notice that the ID and Date fields are present so we can join the tables on these columns.

Let’s first check the number of unique IDs for each dataset:

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(daily_sleep$Id)

## [1] 24

n_distinct(weight_log$Id)

## [1] 8

From this we see that 33 individuals logged data for activity while only 24 and 8 users logged data for sleep and weight, respectively. In this case we’ll use 2 left joins to join the 3 datasets on the “Id” and “Date” columns so we don’t lose any data.

combined_data <- 
    daily_activity_clean %>% 
    left_join(daily_sleep_clean, by=c("Id", "Date")) %>% 
    left_join(weight_log_clean, by=c("Id", "Date"))

glimpse(combined_data)

## Rows: 940
## Columns: 17
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date                     <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
## $ TotalMinutesAsleep       <dbl> 327, 384, NA, 412, 340, 700, NA, 304, 360, 32~
## $ TotalTimeInBed           <dbl> 346, 407, NA, 442, 367, 712, NA, 320, 377, 36~
## $ WeightPounds             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ BMI                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
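
To illustrate why a left join keeps every activity row, here is a small sketch on hypothetical toy tables (`toy_activity` and `toy_sleep` are made up for the example). Rows without a match are kept and filled with NA, and anti_join() lists exactly which rows had no match:

```r
library(dplyr)

# Hypothetical toy tables: only one of the three activity rows has sleep logged
toy_activity <- tibble(
    Id = c(1, 1, 2),
    Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
    TotalSteps = c(13162, 10735, 9000)
)
toy_sleep <- tibble(
    Id = 1,
    Date = as.Date("2016-04-12"),
    TotalMinutesAsleep = 327
)

# Left join: all rows of toy_activity survive, unmatched rows get NA
joined <- left_join(toy_activity, toy_sleep, by = c("Id", "Date"))

# Anti join: the activity rows that have no sleep record
unmatched <- anti_join(toy_activity, toy_sleep, by = c("Id", "Date"))

joined
unmatched
```

A left join can also add rows when the right-hand table has several records for the same key, so comparing the row count of the result against the left table (940 rows here) is a useful sanity check.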

Everything looks good so far and we can proceed with our exploratory data analysis!

Exploratory Data Analysis

Correlation Matrices

To begin, we will create some correlation matrices to get a general idea of what relationships exist between our variables.

Please note that I omitted the variables related to distance because they are redundant with the “Minutes” variables and because leaving them out makes the visualization easier to read.

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>% # Filtering out the data that do not have any sleep logged
    select(-Id, -Date, -WeightPounds, -BMI, -ends_with("Distance")) %>% 
    cor() %>% 
    round(2) %>% 
    ggcorrplot(hc.order = TRUE, 
               ggtheme = ggplot2::theme_gray,
               colors = c("#6D9EC1", "white", "#E46726"),
               lab=TRUE)

Now this is a lot of information, so we will focus on the relationships that have an r-value of > 0.5 or < -0.5.

We do see some moderately strong to strong positive correlations with the following:

Total Steps vs Very Active Minutes (r = 0.54)
Calories vs Very Active Minutes (r = 0.61)
Fairly Active Minutes vs Total Steps (r = 0.57)
Total Time in Bed vs Total Minutes Asleep (r = 0.93)

These relationships all make intuitive sense, as more activity results in more calories burned and more steps taken. The relationship between total time in bed and time asleep is also obvious.

We also see some moderately strong negative correlations with the following:

Total Time in Bed vs Sedentary Minutes (r = -0.62)
Total Minutes Asleep vs Sedentary Minutes (r = -0.6)

This suggests that as the amount of sleep increases, the number of sedentary minutes decreases.
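
Picking out the strong correlations by eye from the plot can be error-prone. One possible way to list them programmatically is to reshape the correlation matrix into long format and filter on the absolute value of r. The sketch below assumes a hypothetical `toy` table with made-up numbers. With the real data, `toy` would be replaced by the filtered and selected columns of `combined_data`.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data, for illustration only
toy <- tibble(
    TotalSteps         = c(1000, 4000, 7000, 10000, 13000),
    VeryActiveMinutes  = c(0, 5, 20, 30, 60),
    SedentaryMinutes   = c(1200, 1100, 900, 950, 700),
    TotalMinutesAsleep = c(300, 420, 380, 450, 400)
)

strong_pairs <- 
    toy %>% 
    cor() %>% 
    as.data.frame() %>% 
    rownames_to_column("var1") %>% 
    pivot_longer(-var1, names_to = "var2", values_to = "r") %>% 
    # var1 < var2 keeps each pair once and drops the diagonal
    filter(var1 < var2, abs(r) > 0.5) %>% 
    arrange(desc(abs(r)))

strong_pairs
```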

We will also create a similar correlation matrix that includes the variables associated with weight. We are doing this separately because there are only 35 rows of data that have both sleep and weight information logged, which may result in unreliable findings due to the small sample size. Nonetheless, let’s create the correlation matrix and see if any relationships exist between weight and the other variables. I have excluded BMI from the matrix due to its redundancy with weight.

#Subsetting the data for the correlation matrix
combined_data %>% 
    filter(!is.na(WeightPounds) & !is.na(TotalMinutesAsleep)) %>% # Keeping only the rows that have both weight and sleep logged
    select(-Id, -Date, -BMI, -ends_with("Distance")) %>%
    glimpse()

## Rows: 35
## Columns: 9
## $ TotalSteps           <dbl> 14727, 15103, 356, 3428, 12231, 10199, 5652, 1551~
## $ VeryActiveMinutes    <dbl> 41, 50, 0, 0, 200, 50, 8, 0, 0, 50, 5, 13, 35, 48~
## $ FairlyActiveMinutes  <dbl> 15, 24, 0, 0, 37, 14, 24, 0, 0, 3, 13, 42, 41, 4,~
## $ LightlyActiveMinutes <dbl> 277, 254, 32, 190, 159, 189, 142, 86, 217, 280, 2~
## $ SedentaryMinutes     <dbl> 798, 816, 986, 1121, 525, 796, 548, 862, 837, 741~
## $ Calories             <dbl> 2004, 1990, 2151, 1692, 4552, 1994, 1718, 1466, 1~
## $ TotalMinutesAsleep   <dbl> 277, 273, 398, 115, 549, 366, 630, 508, 370, 357,~
## $ TotalTimeInBed       <dbl> 309, 296, 422, 129, 583, 387, 679, 535, 386, 366,~
## $ WeightPounds         <dbl> 115.9631, 115.9631, 294.3171, 154.1031, 199.9593,~

combined_data %>% 
    filter(!is.na(WeightPounds) & !is.na(TotalMinutesAsleep)) %>% # Keeping only the rows that have both weight and sleep logged
    select(-Id, -Date, -BMI, -ends_with("Distance")) %>%
    cor() %>% 
    round(1) %>% 
    ggcorrplot(hc.order = TRUE, 
               ggtheme = ggplot2::theme_gray,
               colors = c("#6D9EC1", "white", "#E46726"),
               lab=TRUE)

Focusing only on the correlations related to weight, we can see that the only moderately strong relationship is a negative relationship between weight and minutes doing light activity.

Trends in Activity

Next, let’s see if there is any relationship between day of the week and activity intensity for the participants.

Creating a column for day of the week and setting it as a factor:

combined_data$DayOfWeek <- 
    weekdays(combined_data$Date) %>%
        factor(levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                          "Thursday", "Friday", "Saturday"))

Creating a new dataframe, “activity_df”, containing the average minutes at each intensity level, grouped by day of the week. We will also need to convert this dataframe into a long format in order to create the visualization.

activity_df <- 
    combined_data %>% 
    group_by(DayOfWeek) %>% 
    summarize(VeryActive = round(mean(VeryActiveMinutes), 2),
              FairlyActive = round(mean(FairlyActiveMinutes), 2),
              LightlyActive = round(mean(LightlyActiveMinutes), 2),
              Sedentary = round(mean(SedentaryMinutes), 2)) %>% 
    pivot_longer(!DayOfWeek, names_to = "Intensity", values_to = "Minutes")

Setting intensity level as a factor:

activity_df$Intensity <- 
    activity_df$Intensity %>% 
        factor(levels = c("VeryActive", "FairlyActive", "LightlyActive", "Sedentary"))

head(activity_df)

## # A tibble: 6 x 3
##   DayOfWeek Intensity     Minutes
##   <fct>     <fct>           <dbl>
## 1 Sunday    VeryActive       20.0
## 2 Sunday    FairlyActive     14.5
## 3 Sunday    LightlyActive   174. 
## 4 Sunday    Sedentary       990. 
## 5 Monday    VeryActive       23.1
## 6 Monday    FairlyActive     14

Creating the visualization:

activity_df %>% 
    ggplot(aes(x=DayOfWeek, y=Minutes, fill=Intensity)) + 
    geom_col(color='azure2', alpha=0.7, position="dodge") + 
    scale_fill_brewer(palette = "PuOr") +
    theme_minimal() +
    labs(title = "Average Active Minutes by Intensity and Day",
         x = "Day of the Week",
         y= "Minutes Active",
         fill = "Intensity Level")

We observe that the bulk of the logged minutes were spent being sedentary, and there does not seem to be any specific pattern day by day.

Let’s also take a look at the visualization without the sedentary intensity level:

activity_df %>% 
    filter(Intensity != "Sedentary") %>% 
    ggplot(aes(x=DayOfWeek, y=Minutes, fill=Intensity)) + 
    geom_col(color='azure2', position="dodge") + 
    scale_fill_brewer(palette = "Accent") +
    theme_minimal() +
    labs(title = "Average Active Minutes by Intensity and Day",
         x = "Day of the Week",
         y= "Minutes Active",
         fill = "Intensity Level")

We see that minutes of light intensity activity don’t really have a particular pattern either.

Lastly, let’s remove lightly active intensity and focus on just the very active and fairly active intensities:

activity_df %>% 
    filter(!Intensity %in% c("LightlyActive", "Sedentary")) %>% 
    ggplot(aes(x=DayOfWeek, y=Minutes, fill=Intensity)) + 
    geom_col(color='azure2') + 
    geom_text(aes(label = Minutes), vjust = 1.5) +
    stat_summary(fun = sum, aes(label = after_stat(y), group = DayOfWeek), geom = "text", vjust= -.35) + # after_stat(y) replaces the deprecated ..y.. notation
    scale_fill_brewer(palette = "Spectral") +
    theme_minimal() +
    labs(title = "Average Active Minutes by Intensity and Day",
         x = "Day of the Week",
         y= "Minutes Active",
         fill = "Intensity Level")

According to Mayo Clinic, most healthy adults should aim for at least 30 minutes of moderate activity a day¹. This graph shows that on average these individuals are indeed hitting those 30 minutes of moderate activity per day (assuming fairly active intensity equates to at least moderate activity by Mayo Clinic’s definition).
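
The comparison against the 30-minute guideline can also be made explicit in code rather than read off the chart. The sketch below assumes a hypothetical `toy_activity_df` holding the Sunday and Monday values as printed (and therefore rounded) in the head(activity_df) output above. The 30-minute threshold and the treatment of “fairly active” as moderate activity are the same assumptions stated in the text.

```r
library(dplyr)

# Hypothetical toy table: rounded values copied from the head(activity_df) output
toy_activity_df <- tibble(
    DayOfWeek = c("Sunday", "Sunday", "Monday", "Monday"),
    Intensity = c("VeryActive", "FairlyActive", "VeryActive", "FairlyActive"),
    Minutes   = c(20.0, 14.5, 23.1, 14)
)

# Add very active and fairly active minutes per day and compare with 30 minutes
guideline_check <- 
    toy_activity_df %>% 
    group_by(DayOfWeek) %>% 
    summarize(ModerateOrHigher = sum(Minutes)) %>% 
    mutate(MeetsGuideline = ModerateOrHigher >= 30)

guideline_check
```

Because these are averages across all participants, a day that meets the guideline on average may still include many individuals who fall short of it.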

Trends in Sleep

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>% 
    ggplot(aes(x=TotalMinutesAsleep)) +
    geom_histogram(binwidth= 25, fill='darkturquoise', color="azure1") +
    theme_light() +
    labs(title = "Distribution of Minutes Slept",
         x = "Minutes Asleep",
         y = "Count") +
    geom_vline(aes(xintercept=median(TotalMinutesAsleep)), color="black", linetype="dashed") +
    annotate("text", x=370, y=53, label="Median = 433")

The distribution of logged sleep roughly follows a normal distribution. The median amount of sleep logged for a given day is 7 hours and 13 minutes. According to the Sleep Foundation, adults between 18 and 64 are recommended to get 7 to 9 hours of sleep per day².
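
The conversion from the median of 433 minutes to hours and minutes is simple integer arithmetic, shown here as a short sketch with hypothetical variable names:

```r
median_minutes <- 433

# Integer division gives whole hours, the remainder gives leftover minutes
hours   <- median_minutes %/% 60
minutes <- median_minutes %% 60

sprintf("%d hours and %d minutes", hours, minutes)
```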

Let’s also take a look at average sleep duration on different days of the week:

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>% 
    group_by(DayOfWeek) %>% 
    summarize(AvgSleep = mean(TotalMinutesAsleep)) %>% 
    ggplot(aes(x=DayOfWeek, y=AvgSleep, fill=DayOfWeek)) +
    geom_col(color = "azure1", alpha=0.8) +
    geom_text(aes(label = round(AvgSleep,2)), vjust = 1.5) +
    scale_fill_brewer(palette = "Accent") +
    theme_minimal() + 
    labs(title = "Average Sleep by Day",
         x = "Day of Week",
         y = "Average Sleep Duration (min)") +
    theme(legend.position='none')

It appears that average sleep duration is below the recommended 7 hours for 5 out of the 7 days (Monday, Tuesday, Thursday, Friday, Saturday).

One last thing we can look into is sleep latency, that is, the time it takes to fall asleep. According to Healthline, it typically takes 10 to 20 minutes for an individual to fall asleep, and a sleep latency outside of that range may indicate an underlying sleep condition³.

We will make the assumption that the variable “TotalTimeInBed” only includes the time sleeping and the time attempting to sleep. Let’s go ahead and calculate the average and median sleep latency and plot the distribution:

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>%
    summarize(AverageSleepLatency = mean(TotalTimeInBed - TotalMinutesAsleep),
              MedianSleepLatency = median(TotalTimeInBed - TotalMinutesAsleep))

## # A tibble: 1 x 2
##   AverageSleepLatency MedianSleepLatency
##                 <dbl>              <dbl>
## 1                39.3               25.5

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>%
    mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>% 
    ggplot(aes(x=SleepLatency)) +
    geom_histogram(binwidth= 8, fill='coral1', color="azure1") +
    theme_light() +
    labs(title = "Distribution of Sleep Latency",
         x = "Sleep Latency (mins)",
         y = "Count") +
    geom_vline(aes(xintercept=median(SleepLatency)), color="black", linetype="dashed") +
    annotate("text", x=60, y=92, label="Median = 25.5")

The distribution of sleep latency is skewed right: the bulk of the observations are on the lower end, with a few large outliers that could indicate insomnia. This is also reflected in the fact that the median sleep latency is less than the mean sleep latency (25.5 vs 39.3 minutes). We note that even the median sleep latency is above the typical sleep latency range of 10 to 20 minutes.
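To make the comparison against the 10 to 20 minute range concrete, nights can be grouped into latency bands. This is a minimal sketch under stated assumptions: `sleep_sample` and its values are hypothetical placeholders (not rows from the dataset), and the band labels are names chosen here for illustration.

```r
library(dplyr)

# Hypothetical nights (minutes): illustrative values only, not taken from the dataset
sleep_sample <- tibble(
  TotalTimeInBed     = c(400, 455, 380, 520, 430, 610),
  TotalMinutesAsleep = c(385, 430, 372, 480, 412, 450)
)

latency_summary <- sleep_sample %>%
  mutate(
    SleepLatency = TotalTimeInBed - TotalMinutesAsleep,
    # Assign each night to a band relative to the 10 to 20 minute range
    LatencyBand = case_when(
      SleepLatency < 10  ~ "Below 10 min",
      SleepLatency <= 20 ~ "10 to 20 min",
      TRUE               ~ "Above 20 min"
    )
  ) %>%
  count(LatencyBand) %>%
  mutate(Share = n / sum(n))

print(latency_summary)

# A right-skewed sample has a mean above its median
latencies <- sleep_sample$TotalTimeInBed - sleep_sample$TotalMinutesAsleep
mean_above_median <- mean(latencies) > median(latencies)
```

In this toy sample a single very long night pulls the mean well above the median, which is the same pattern described for the real distribution.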


Since sleep latency appears to be on the higher side, it is possible that it is related to minutes of activity. Let’s plot it against the 4 activity intensities to check.

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>%
    mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>% 
    ggplot(aes(x=SleepLatency, y=SedentaryMinutes)) +
    geom_point() +
    theme_light() +
    labs(title = "Sedentary Minutes vs Sleep Latency",
         x = "Sleep Latency (mins)",
         y = "Time Spent Sedentary (mins)") +
    annotate("text", x=275, y=1125, label="r = -0.17")

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>%
    mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>% 
    ggplot(aes(x=SleepLatency, y=LightlyActiveMinutes)) +
    geom_point() +
    theme_light() +
    labs(title = "Lightly Active Minutes vs Sleep Latency",
         x = "Sleep Latency (mins)",
         y = "Lightly Active Minutes") +
    annotate("text", x=275, y=425, label="r = -0.15")

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>%
    mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>% 
    ggplot(aes(x=SleepLatency, y=FairlyActiveMinutes)) +
    geom_point() +
    theme_light() +
    labs(title = "Fairly Active Minutes vs Sleep Latency",
         x = "Sleep Latency (mins)",
         y = "Fairly Active Minutes") +
    annotate("text", x=275, y=137, label="r = 0.32")

combined_data %>% 
    filter(!is.na(TotalMinutesAsleep)) %>%
    mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>% 
    ggplot(aes(x=SleepLatency, y=VeryActiveMinutes)) +
    geom_point() +
    theme_light() +
    labs(title = "Very Active Minutes vs Sleep Latency",
         x = "Sleep Latency (mins)",
         y = "Very Active Minutes") +
    annotate("text", x=275, y=175, label="r = 0.08")

We notice that there is only a weak positive correlation between fairly active minutes and sleep latency (r = 0.32), while the other intensities show little to no linear relationship with sleep latency (|r| ≤ 0.17).
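The r values annotated on the plots are typed in by hand. As a hedged alternative, the correlations could be computed in one pass by reshaping the four intensity columns into long form and calling `cor()` per group. In this sketch `toy_data` is a hypothetical stand-in for `combined_data`, filled with made-up values, so the resulting coefficients are not the ones reported above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for combined_data: values are illustrative only
toy_data <- tibble(
  TotalTimeInBed       = c(400, 455, 380, 520, 430, 610),
  TotalMinutesAsleep   = c(385, 430, 372, 480, 412, 450),
  SedentaryMinutes     = c(720, 690, 800, 650, 710, 600),
  LightlyActiveMinutes = c(210, 180, 150, 240, 200, 170),
  FairlyActiveMinutes  = c(10, 25, 5, 30, 15, 40),
  VeryActiveMinutes    = c(20, 0, 35, 10, 25, 5)
)

latency_cor <- toy_data %>%
  mutate(SleepLatency = TotalTimeInBed - TotalMinutesAsleep) %>%
  # One row per (night, intensity) pair so all four can be summarized together
  pivot_longer(
    cols = c(SedentaryMinutes, LightlyActiveMinutes,
             FairlyActiveMinutes, VeryActiveMinutes),
    names_to = "Intensity",
    values_to = "Minutes"
  ) %>%
  group_by(Intensity) %>%
  # Pearson correlation between sleep latency and minutes at each intensity
  summarize(r = cor(SleepLatency, Minutes), .groups = "drop")

print(latency_cor)
```

On the real data, the same `filter(!is.na(TotalMinutesAsleep))` step used in the plots would need to come first, since `cor()` returns `NA` when either input contains missing values.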


Conclusion and Parting Thoughts


Findings:

There were a fair number of individuals who were getting too much sleep, too little sleep, or had a high sleep latency. Bellabeat could look into providing users with recommendations when any of their sleeping patterns fall outside the optimal range.
Logging was inconsistent, especially for weight. Bellabeat should look into ways to make logging more consistent, possibly by sending reminder notifications via the Bellabeat app. For weight in particular, a smart scale could help by automatically updating the user's weight in the app whenever they use the scale.

Suggestions for Further Investigation:

There are several things we could improve about the data for any follow-up analyses:

Survey more individuals over a longer period of time.
For any data that requires manual logging, make sure individuals log it consistently.
Collect some basic demographic information about the participants to ensure we have a representative sample.
Obtain water intake data, if available, since none was included in these datasets.


Thanks for reading! The files, data, and images for this case study can be found on GitHub.

1. Edward R. Laskowski, M. D. (2019, April 27). How much exercise do you really need? Mayo Clinic. https://www.mayoclinic.org/healthy-lifestyle/fitness/expert-answers/exercise/faq-20057916
2. Sleep Statistics - Facts and Data About Sleep 2020 | Sleep Foundation. (2021). Retrieved from https://www.sleepfoundation.org/how-sleep-works/sleep-facts-statistics
3. Silver, N. (2020, June 5). How long does it take to fall asleep? Average time and tips. Healthline. https://www.healthline.com/health/healthy-sleep/how-long-does-it-take-to-fall-asleep#normal-sleep

Bellabeat Case Study

Brian Wu

8/6/2021
