1 BUSINESS TASK

  1. Business Task assigned: Determine how Casual riders differ from Members.
  2. Accounting has determined that Members are more profitable than Casual riders.
  3. From these findings, develop three recommendations for converting Casual riders into Members.

1.0.0.1 If Using .Rmd File Directly

To go directly to Data Viz, just run CHUNKs A, B, and C.

To evaluate the Prepare, Clean & Process phases, continue step-by-step beginning with CHUNK A.

  • CHUNK A is next
  • For CHUNK B, use the Table of Contents on your right >>>
  • CHUNK C follows B, right after PROCESS DATA

1.0.1 CHUNK A: Setup Environment

## [1] "C:/Users/rock/iCloudDrive/R/RStudio/Cyclistic_Case_Study"

2 PREPARE DATA

##    ride_id          rideable_type       started_at          ended_at        
##  Length:103771      Length:103771      Length:103771      Length:103771     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:103771      Length:103771      Length:103771      Length:103771     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.65   Min.   :-87.83   Min.   :41.65   Min.   :-87.83  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.89   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.90   Mean   :-87.65   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :45.64   Max.   :-73.80   Max.   :42.07   Max.   :-87.52  
##  NA's   :1       NA's   :1        NA's   :87      NA's   :87      
##  member_casual     
##  Length:103771     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

2.0.0.1 Bad Data Located:

  • From the above we see an out-of-bounds start_lng of -73.80. It is probably a mistyped -87.38. Let’s see how many lng values are greater than -87.
  • We expect an error if none are found.
##    Mode   FALSE    TRUE    NA's 
## logical  103769       1       1
## # A tibble: 60 × 13
##    ride_id       ridea…¹ start…² ended…³ start…⁴ start…⁵ end_s…⁶ end_s…⁷ start…⁸
##    <chr>         <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <dbl>
##  1 C2F7DD78E82E… electr… 1/13/2… 1/13/2… Glenwo… 525     Clark … RP-007     42.0
##  2 A6CF8980A652… electr… 1/10/2… 1/10/2… Glenwo… 525     Clark … RP-007     42.0
##  3 BD0F91DFF741… classi… 1/25/2… 1/25/2… Sheffi… TA1306… Greenv… TA1307…    41.9
##  4 CBB80ED41910… classi… 1/4/20… 1/4/20… Clark … KA1504… Paulin… TA1309…    42.0
##  5 DDC963BFDDA5… classi… 1/20/2… 1/20/2… Michig… TA1309… State … TA1305…    41.9
##  6 A39C6F6CC058… classi… 1/11/2… 1/11/2… Wood S… 637     Honore… TA1305…    41.9
##  7 BDC4AB637EDF… classi… 1/30/2… 1/30/2… Oakley… KA1504… Broadw… 13323      42.0
##  8 81751A3186E5… classi… 1/22/2… 1/22/2… Sheffi… TA1306… Damen … 13271      41.9
##  9 154222B86A33… electr… 1/17/2… 1/17/2… Racine… 13304   Clinto… WL-012     41.9
## 10 72DC25B2DD46… classi… 1/28/2… 1/28/2… LaSall… TA1309… Clinto… WL-012     41.9
## # … with 50 more rows, 4 more variables: start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​started_at, ³​ended_at, ⁴​start_station_name,
## #   ⁵​start_station_id, ⁶​end_station_name, ⁷​end_station_id, ⁸​start_lat

The summary shows this was the one single out-of-bounds error.
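The check just described can be sketched in R as follows. This is a minimal sketch assuming the data has been read into a data frame named `trips` with the column names shown in the summary above:

```r
# Flag longitudes east of -87; Chicago sits near -87.6, so anything
# greater than -87 is out of bounds for this system
bad_lng <- trips$start_lng > -87
summary(bad_lng)   # expect exactly one TRUE (plus the NA's)

# Inspect the offending rows before deciding how to handle them
trips[which(bad_lng), c("ride_id", "start_lat", "start_lng")]

# One conservative option: blank the coordinate rather than guess a fix
trips$start_lng[which(bad_lng)] <- NA
```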

2.0.0.2 Load year round data into SQL:

  • It was requested that we use the most recent 12 months, which is all of 2022.
  • There are far fewer trips during the winter months, indicating altered behavior we need to understand.
  • RStudio Desktop loads all data into memory, and the full dataset turns out to be over 1 GB and around 5,000,000 rows. I therefore elected to load all months into SQL Server. During loading, the data failed on the 9th month with undetermined errors. I considered skipping the remaining months, but testing without them left a hole in the analysis. I eventually discovered that the last four months all had hidden errors in their datetime fields (the fields flagged as erroneous showed no visible problems).

2.0.0.3 FIX: Load each month into R and clean them using this code.

  • This has already been done, and the resulting ‘*_clean.csv’ files are in ‘Monthly_csv_files’.
  • MAKE SURE YOU ARE IN THE RIGHT PROJECT AND MODIFY folder/file VARIABLES AS NEEDED.
  • The following code loops through the files listed in the myfile_2 vector, in this case the last four months.
  • The full path and filename are constructed on the fly.
  • You will be asked whether you wish to do this even if you elect to run all chunks.
## Do you want to import and clean Divvy 09-12 again? (Yes/no/cancel) 
## [1] "Monthly_csv_files/202209-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202210-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202211-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202212-divvy-tripdata.csv"
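The import-and-clean loop described above might look roughly like this in base R. The folder and file names come from the output above; the exact cleaning steps (re-parsing the datetimes and dropping unparseable rows) are my assumption about what “clean” means here:

```r
folder   <- "Monthly_csv_files"
myfile_2 <- c("202209-divvy-tripdata.csv", "202210-divvy-tripdata.csv",
              "202211-divvy-tripdata.csv", "202212-divvy-tripdata.csv")

clean_month <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  # Re-parse the datetime columns; values that cannot be parsed become NA
  df$started_at <- as.POSIXct(df$started_at, format = "%m/%d/%Y %H:%M", tz = "UTC")
  df$ended_at   <- as.POSIXct(df$ended_at,   format = "%m/%d/%Y %H:%M", tz = "UTC")
  # Drop rows with unparseable datetimes: these were blocking the SQL load
  df[!is.na(df$started_at) & !is.na(df$ended_at), ]
}

for (f in myfile_2) {
  path <- file.path(folder, f)
  if (!file.exists(path)) next        # skip months that are not present
  print(path)                         # matches the output above
  write.csv(clean_month(path),
            file.path(folder, sub("\\.csv$", "_clean.csv", f)),
            row.names = FALSE)
}
```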

2.0.0.4 With clean data I was able to finish loading the last four months into SQL

2.0.1 Data Cleaning on ~5 million records

Using COUNT(DISTINCT [field]), all Start|End pairs should match: a trip can’t start without an end! Here’s what I found:

Results are mostly reasonable except for start_lat and start_lng, indicating some kind of issue here. I suspected insignificant decimal resolution, so I tried rounding these two values. It turns out the optimal rounding is 4 decimal places. It is still very odd that only 70.1% of start lat/lng values match while 99.96% of end lat/lng values match.
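The distinct-count comparison, with and without rounding, can be sketched in R. This is a minimal sketch assuming a data frame `trips`; the 4-decimal-place value is the one settled on above:

```r
# Count distinct lat/lng pairs, optionally rounding first
distinct_pairs <- function(lat, lng, digits = NULL) {
  if (!is.null(digits)) {
    lat <- round(lat, digits)
    lng <- round(lng, digits)
  }
  length(unique(paste(lat, lng)))
}

# Raw vs rounded distinct counts for start and end coordinates
c(start_raw = distinct_pairs(trips$start_lat, trips$start_lng),
  end_raw   = distinct_pairs(trips$end_lat,   trips$end_lng),
  start_4   = distinct_pairs(trips$start_lat, trips$start_lng, digits = 4),
  end_4     = distinct_pairs(trips$end_lat,   trips$end_lng,   digits = 4))
```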

Rounding gives the following results:

Much closer, but still significantly off. Further, I would expect the Start|End LAT|LNG counts to approximately equal the Start|End station name and station_id counts. Again, all close but not identical. This could be missing station info.

The LAT|LNG discrepancy could also be explained by variances in exactly where the bike was parked when the prior trip ended, i.e., different ends of the bike rack, as these positions must be reported electronically by the bike’s instrumentation and not manually transcribed.

Another possibility is LAT or LNG values that “line up” for one value without both values lining up, e.g., multiple stops on a North/South street where the LNGs are the same while the LATs differ. This does not, however, explain why the End LAT|LNG values are a near-perfect match.

I will plot this data geographically and compute the straight-line distance traveled from lat|lng as well. If it’s way off we will see it on the plots; if it’s good, we will have lines departing and ending at essentially the same locations.
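The straight-line distance can be computed from the lat|lng pairs with the haversine (great-circle) formula. A sketch in base R, where the Earth-radius constant and the choice to output feet (to match the Distance_Feet column later in this report) are my assumptions:

```r
# Haversine distance in feet between two lat/lng points (vectorized)
haversine_feet <- function(lat1, lng1, lat2, lng2) {
  r_feet <- 20902231                  # mean Earth radius, ~6371 km, in feet
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r_feet * asin(pmin(1, sqrt(a)))
}

# Sanity check: one minute of latitude is about 6,080 feet
haversine_feet(41.90, -87.64, 41.90 + 1/60, -87.64)
```

Applied to the trips data, this would fill a Distance_Feet column from (start_lat, start_lng) and (end_lat, end_lng).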

2.1 CONCLUSION:

Because this data must have been transmitted electronically from GPS units, we can assume it is Reliable, Original, Current, and Cited.

However, there is an issue with “Comprehensive”: there is no column identifying the customer. I would not expect, nor want, any personal data, but a Customer_ID with gender, age, monthly billing, trip IDs, etc. would greatly improve the analysis. Without a customer_id I cannot tell how many rides each customer took, the duration (lifetime) of a customer’s use of the product, how much money they spent, and so on.

Another shortcoming of the supplied data is the bike station names and locations: there are more blanks here than data. Fortunately, we have close to 100% latitude/longitude coverage, so we know where rides start and finish; we just don’t know the names or addresses of the stations, nor have an easy way to tabulate the busiest stations (not a Business Task of this project). This data could be merged in if we had a list of stations with their latitudes/longitudes.

2.1.1 Number and Cost of Trips by Customer

We need to find which Casual customers are getting close to spending enough money that a Membership plan would make sense, especially if they think they would ride more often on the unlimited plan. I have already seen enough to know this will be part of my proposed Marketing Plan, but it will require this data to know which casual users to contact.

I would also like to know customer’s “lifetime” as users. This should be in the customer database as well.

2.1.2 Follow-up with customer records when they become available

------------------------------------------

2.1.2.1 Statistical Sampling of Data

Looking at the amount of data loaded, there are 5,666,173 rows covering all of 2022. From this I can take a statistical sample of 9,442 rows using SQL for a 99.99% confidence level with a 2% margin of error. This will easily fit into RStudio for continued analysis. Sample-size calculation from https://www.calculator.net/sample-size-calculator.html?type=1&cl=99.99&ci=2&pp=50&ps=5666173&x=58&y=21

  • Using SQL, take every 600th row (5,666,173 / 9,442 ≈ 600), starting at a random value between 1 and 600. Place the results in a .csv file for import into Excel/SQL/RStudio as needed.
  • Add calculated columns for:
    • start_date
    • start_time
    • start_MON(th)
    • start_month
    • Duration_Min(utes)
    • Distance_Feet
  • To re-create the above, use ’SQL_Scripts/Grab_Statistical_No_Rows.sql’ if you loaded the data into SQL, or just run the next Chunk, as I have already created that file for you as data/Cyclistic_Nth_Rows.csv
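The SQL script performs a systematic (every-Nth-row) sample; an equivalent sketch in R, assuming the full year could be held in a data frame, would be:

```r
# Systematic sample: every step-th row, starting at a random offset
nth_rows <- function(df, step = 600) {
  start <- sample(step, 1)            # random offset between 1 and step
  df[seq(from = start, to = nrow(df), by = step), ]
}

# Usage (hypothetical `all_trips` data frame holding all of 2022):
# sampled <- nth_rows(all_trips)
# write.csv(sampled, "data/Cyclistic_Nth_Rows.csv", row.names = FALSE)
```

A systematic sample like this behaves like a simple random sample as long as the data has no periodic pattern aligned with the step size.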
##      RowNum        rideable_type        started_at                    
##  Min.   :    405   Length:9443        Min.   :2022-01-01 01:35:00.00  
##  1st Qu.:1416705   Class :character   1st Qu.:2022-05-28 19:14:30.00  
##  Median :2833005   Mode  :character   Median :2022-07-22 14:33:00.00  
##  Mean   :2833005                      Mean   :2022-07-20 06:42:44.34  
##  3rd Qu.:4249305                      3rd Qu.:2022-09-16 03:57:00.00  
##  Max.   :5665605                      Max.   :2022-12-31 20:45:00.00  
##                                                                       
##    start_date          start_time        start_MON         start_month       
##  Min.   :2022-01-01   Length:9443       Length:9443        Length:9443       
##  1st Qu.:2022-05-28   Class1:hms        Class :character   Class :character  
##  Median :2022-07-22   Class2:difftime   Mode  :character   Mode  :character  
##  Mean   :2022-07-19   Mode  :numeric                                         
##  3rd Qu.:2022-09-16                                                          
##  Max.   :2022-12-31                                                          
##                                                                              
##     ended_at                       Duration_Min       start_lat    
##  Min.   :2022-01-01 01:40:00.00   Min.   :   0.00   Min.   :41.66  
##  1st Qu.:2022-05-28 19:26:00.00   1st Qu.:   6.00   1st Qu.:41.88  
##  Median :2022-07-22 14:37:00.00   Median :  10.00   Median :41.90  
##  Mean   :2022-07-20 07:00:41.97   Mean   :  17.96   Mean   :41.90  
##  3rd Qu.:2022-09-16 04:07:00.00   3rd Qu.:  18.00   3rd Qu.:41.93  
##  Max.   :2022-12-31 21:02:00.00   Max.   :6702.00   Max.   :42.06  
##                                                                    
##    start_lng         end_lat         end_lng       Distance_Feet  
##  Min.   :-87.83   Min.   :41.65   Min.   :-87.93   Min.   :    0  
##  1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66   1st Qu.: 2851  
##  Median :-87.64   Median :41.90   Median :-87.64   Median : 5099  
##  Mean   :-87.65   Mean   :41.90   Mean   :-87.65   Mean   : 6938  
##  3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.: 8971  
##  Max.   :-87.52   Max.   :42.07   Max.   :-87.53   Max.   :96052  
##                   NA's   :6       NA's   :6        NA's   :6      
##  member_casual     
##  Length:9443       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
##      RowNum        rideable_type        started_at                    
##  Min.   :    405   Length:9437        Min.   :2022-01-01 01:35:00.00  
##  1st Qu.:1417605   Class :character   1st Qu.:2022-05-28 19:46:00.00  
##  Median :2833005   Mode  :character   Median :2022-07-22 14:33:00.00  
##  Mean   :2833438                      Mean   :2022-07-20 07:08:17.13  
##  3rd Qu.:4250205                      3rd Qu.:2022-09-16 07:06:00.00  
##  Max.   :5665605                      Max.   :2022-12-31 20:45:00.00  
##    start_date          start_time        start_MON         start_month       
##  Min.   :2022-01-01   Length:9437       Length:9437        Length:9437       
##  1st Qu.:2022-05-28   Class1:hms        Class :character   Class :character  
##  Median :2022-07-22   Class2:difftime   Mode  :character   Mode  :character  
##  Mean   :2022-07-19   Mode  :numeric                                         
##  3rd Qu.:2022-09-16                                                          
##  Max.   :2022-12-31                                                          
##     ended_at                      Duration_Min      start_lat    
##  Min.   :2022-01-01 01:40:00.0   Min.   :   0.0   Min.   :41.66  
##  1st Qu.:2022-05-28 19:56:00.0   1st Qu.:   6.0   1st Qu.:41.88  
##  Median :2022-07-22 14:37:00.0   Median :  10.0   Median :41.90  
##  Mean   :2022-07-20 07:24:23.0   Mean   :  16.1   Mean   :41.90  
##  3rd Qu.:2022-09-16 07:23:00.0   3rd Qu.:  18.0   3rd Qu.:41.93  
##  Max.   :2022-12-31 21:02:00.0   Max.   :1318.0   Max.   :42.06  
##    start_lng         end_lat         end_lng       Distance_Feet  
##  Min.   :-87.83   Min.   :41.65   Min.   :-87.93   Min.   :    0  
##  1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66   1st Qu.: 2851  
##  Median :-87.64   Median :41.90   Median :-87.64   Median : 5099  
##  Mean   :-87.65   Mean   :41.90   Mean   :-87.65   Mean   : 6938  
##  3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.: 8971  
##  Max.   :-87.52   Max.   :42.07   Max.   :-87.53   Max.   :96052  
##  member_casual     
##  Length:9437       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

If you wish to re-import the monthly Divvy files, you will have to re-run the above chunks. Otherwise, just run CHUNK B.

2.1.3 CHUNK B: Load cleaned data for visualizations

On first pass, some of the rideable_type and member_casual values had quotation marks around them. This caused duplicates in the plots, because values with and without quotes are treated as separate groups, fragmenting the charts into ‘fake’ groups. Solution: simply remove the quotes from those columns.

It turns out fixing this was not a simple problem for me in R. I spent literally hours researching and was not able to find a solution. I could spend much more time looking, but that would not be productive. Using all the tools available to me, I loaded Cyclistic_clean.csv into Excel and used Ctrl-H to find and replace every ” with nothing. This took about 10 seconds and we’re back in business.
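For the record, one R approach that might have worked is a plain gsub over the two affected columns. This is a sketch, not what was actually done (the Excel fix above is what produced the file used below):

```r
# Strip straight and curly double quotes from a character vector
strip_quotes <- function(x) gsub("[\"\u201C\u201D]", "", x)

strip_quotes(c("\"member\"", "casual"))
```

Applied as, e.g., `trips$member_casual <- strip_quotes(trips$member_casual)` and likewise for rideable_type.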

3 PROCESS DATA

3.0.0.1 Now that the data is clean, let’s look at some stats.

3.0.1 CHUNK C: Tabular Summary of Data

## $casual
##        time  (-0.1,100] (100,500] (500,1320] (1320,2640] (2640,5280]
##   (-0.1,15]  8.4% (192)  0.3% (6)  1.9% (43) 11.8% (271) 40.0% (916)
##     (15,30]  5.8%  (53)  0.1% (1)  1.2% (11)  2.9%  (27) 11.6% (106)
##     (30,45] 12.7%  (39)  0.0% (0)  1.6%  (5)  2.9%   (9) 16.0%  (49)
##     (45,60] 16.9%  (23)  0.0% (0)  5.9%  (8)  5.1%   (7) 17.6%  (24)
##    (60,120] 23.1%  (39)  0.0% (0)  3.0%  (5)  7.7%  (13) 15.4%  (26)
##  (120,1000] 25.0%  (14)  1.8% (1)  7.1%  (4) 19.6%  (11)  8.9%   (5)
##        <NA>  0.0%   (0)  0.0% (0)  0.0%  (0) 33.3%   (1)  0.0%   (0)
##  (5280,100000]
##    37.6% (860)
##    78.4% (719)
##    66.7% (204)
##    54.4%  (74)
##    50.9%  (86)
##    37.5%  (21)
##    66.7%   (2)
## 
## $member
##        time  (-0.1,100] (100,500] (500,1320] (1320,2640]  (2640,5280]
##   (-0.1,15]  5.0% (214) 0.4% (16) 3.5% (149) 17.8% (754) 38.2% (1620)
##     (15,30]  3.0%  (30) 0.1%  (1) 0.9%   (9)  2.0%  (20)  6.3%   (62)
##     (30,45]  9.3%  (23) 0.0%  (0) 1.2%   (3)  4.9%  (12)  8.5%   (21)
##     (45,60] 16.3%   (8) 0.0%  (0) 2.0%   (1) 12.2%   (6)  4.1%    (2)
##    (60,120]  3.8%   (1) 0.0%  (0) 7.7%   (2) 11.5%   (3) 19.2%    (5)
##  (120,1000] 27.3%   (3) 0.0%  (0) 0.0%   (0) 27.3%   (3)  9.1%    (1)
##  (5280,100000]
##   35.1% (1486)
##   87.7%  (868)
##   76.1%  (188)
##   65.3%   (32)
##   57.7%   (15)
##   36.4%    (4)
##   Duration_Min    Distance_Feet   member_casual              time     
##  Min.   :   0.0   Min.   :    0   Length:9437        (-0.1,15] :6527  
##  1st Qu.:   6.0   1st Qu.: 2851   Class :character   (15,30]   :1907  
##  Median :  10.0   Median : 5099   Mode  :character   (30,45]   : 553  
##  Mean   :  16.1   Mean   : 6938                      (45,60]   : 185  
##  3rd Qu.:  18.0   3rd Qu.: 8971                      (60,120]  : 195  
##  Max.   :1318.0   Max.   :96052                      (120,1000]:  67  
##                                                      NA's      :   3  
##             dist     
##  (-0.1,100]   : 639  
##  (100,500]    :  25  
##  (500,1320]   : 240  
##  (1320,2640]  :1137  
##  (2640,5280]  :2837  
##  (5280,100000]:4559  
## 
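The crosstab above can be reproduced by binning Duration_Min and Distance_Feet with cut() and taking row percentages per rider type. A minimal sketch; the bin edges are taken from the output above:

```r
# Cross-tabulate ride time vs distance bins, as row percentages
make_crosstab <- function(d) {
  time_bins <- cut(d$Duration_Min,
                   breaks = c(-0.1, 15, 30, 45, 60, 120, 1000),
                   dig.lab = 6)
  dist_bins <- cut(d$Distance_Feet,
                   breaks = c(-0.1, 100, 500, 1320, 2640, 5280, 100000),
                   dig.lab = 6)
  counts <- table(time = time_bins, dist = dist_bins)
  round(100 * prop.table(counts, margin = 1), 1)   # percent within each row
}

# One table per rider type, as in the $casual / $member output:
# lapply(split(trips, trips$member_casual), make_crosstab)
```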

3.0.1.1 Notes on Summary Data:

  • Distance_Feet < 100 in the (-0.1,100] bin likely means the trip started and ended at the same station.

  • 70% of all rides are < 15 minutes

  • 90% of all rides are < 30 minutes

  • Longest ride is about 18 miles (96,052 feet)

  • The raw numbers are hard to compare.

  • See charts below for a clearer picture.

4 VISUALIZE DATA

  • Members take longer rides.
  • Cost might be a factor, since Members have unlimited use.