To get directly to Data Viz, just run CHUNKs A, B and C.
To evaluate Prepare, Clean & Process phases continue step-by-step beginning with CHUNK A.
## [1] "C:/Users/rock/iCloudDrive/R/RStudio/Cyclistic_Case_Study"
## [1] "C:/Users/rock/iCloudDrive/R/RStudio/Cyclistic_Case_Study"
## ride_id rideable_type started_at ended_at
## Length:103771 Length:103771 Length:103771 Length:103771
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:103771 Length:103771 Length:103771 Length:103771
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.65 Min. :-87.83 Min. :41.65 Min. :-87.83
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.89 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80 Max. :42.07 Max. :-87.52
## NA's :1 NA's :1 NA's :87 NA's :87
## member_casual
## Length:103771
## Class :character
## Mode :character
##
##
##
##
## Mode FALSE TRUE NA's
## logical 103769 1 1
## # A tibble: 60 × 13
## ride_id ridea…¹ start…² ended…³ start…⁴ start…⁵ end_s…⁶ end_s…⁷ start…⁸
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 C2F7DD78E82E… electr… 1/13/2… 1/13/2… Glenwo… 525 Clark … RP-007 42.0
## 2 A6CF8980A652… electr… 1/10/2… 1/10/2… Glenwo… 525 Clark … RP-007 42.0
## 3 BD0F91DFF741… classi… 1/25/2… 1/25/2… Sheffi… TA1306… Greenv… TA1307… 41.9
## 4 CBB80ED41910… classi… 1/4/20… 1/4/20… Clark … KA1504… Paulin… TA1309… 42.0
## 5 DDC963BFDDA5… classi… 1/20/2… 1/20/2… Michig… TA1309… State … TA1305… 41.9
## 6 A39C6F6CC058… classi… 1/11/2… 1/11/2… Wood S… 637 Honore… TA1305… 41.9
## 7 BDC4AB637EDF… classi… 1/30/2… 1/30/2… Oakley… KA1504… Broadw… 13323 42.0
## 8 81751A3186E5… classi… 1/22/2… 1/22/2… Sheffi… TA1306… Damen … 13271 41.9
## 9 154222B86A33… electr… 1/17/2… 1/17/2… Racine… 13304 Clinto… WL-012 41.9
## 10 72DC25B2DD46… classi… 1/28/2… 1/28/2… LaSall… TA1309… Clinto… WL-012 41.9
## # … with 50 more rows, 4 more variables: start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹rideable_type, ²started_at, ³ended_at, ⁴start_station_name,
## # ⁵start_station_id, ⁶end_station_name, ⁷end_station_id, ⁸start_lat
The Summary shows it was just the one single error being out-of-bounds.
## Do you want to import and clean Divvy 09-12 again? (Yes/no/cancel)
## [1] "Monthly_csv_files/202209-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202210-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202211-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202212-divvy-tripdata.csv"
COUNT(DISTINCT[field]) All Start|End pairs should match. Can’t have a trip start without an end! Here’s what I found:
Results are somewhat reasonable except for start_lat and start_lng indicating there is some kind of issue here. I suspect issues with insignificant decimal resolution. I tried rounding these two values. It turns out there is an optimal rounding value of 4 places. just very weird that only 70.1% of start lat/lng match while 99.96% of end lat/lng match.
Rounding gives the following results:
Much closer, but still significantly off. Further, I would expect Start|End LAT|LNG to approximately equal Start|End station names and station_id. Again, all close but not identical. Could be missing station info.
The LAT|LNG discrepancy could also be explained by variances in exactly where the bike was parked when the prior trip ended, i.e., different ends of the bike rack, as these positions must be reported electronically by the bike’s instrumentation and not manually transcribed.
Another probability is LAT or LNG values that “line-up” for one or the other values without both values lining up , e.g., multiple stops on a North/South street where LNGs are the same while LATs differ. This does not explain why End LAT|LNG are a near perfect match.
I will plot this data geographically and compute straight line distance traveled from lat|lng as well. If it’s way off we will see it on the plots. If it’s good we will have lines departing and ending at essentially the same locations.
This data must have been transmitted electronically from GPS units, therefore, we have to assume this data is Reliable, Original, Current and Cited.
However, there is an issue with “Comprehensive”. There is no column identifying the customer. I would not expect, nor want any personal data, but a Customer_ID with gender, age, monthly billing, trip_IDs, etc. would greatly improve Data Analysis. Without a customer_id I cannot tell how many rides were taken per customer nor the duration (lifetime) of a customer’s use of the product, how much money they spent, and so on.
Another failure of the supplied data is bike station names and locations. There are more blanks here than data. Fortunately, we have close to 100% latitude / longitude data so we know where rides start and finish, we just don’t know names or addresses of the stations or an easy way to tabulate the busiest stations (not a Business Task of this project). This data could be merged if we had a list of stations and their latitude / longitude.
We need to find which Casual Customers are getting close to spending enough money that a Membership plan would make sense, especially if they think they would ride more often on the unlimited plan. I have already seen enough to know this will be part of my proposed Marketing Plan, but it’s going to require this data to know which casual users to contact.
I would also like to know customer’s “lifetime” as users. This should be in the customer database as well.
------------------------------------------
Looking at the amount of data loaded it is 5,666,173 rows covering all of 2022. From this I can take a statistical sample using SQL of 9,442 rows to get 99.99% accuracy with 2% error factor. This will easily fit into RStudio for continued analysis. Sample calculation from https://www.calculator.net/sample-size-calculator.html?type=1&cl=99.99&ci=2&pp=50&ps=5666173&x=58&y=21
## RowNum rideable_type started_at
## Min. : 405 Length:9443 Min. :2022-01-01 01:35:00.00
## 1st Qu.:1416705 Class :character 1st Qu.:2022-05-28 19:14:30.00
## Median :2833005 Mode :character Median :2022-07-22 14:33:00.00
## Mean :2833005 Mean :2022-07-20 06:42:44.34
## 3rd Qu.:4249305 3rd Qu.:2022-09-16 03:57:00.00
## Max. :5665605 Max. :2022-12-31 20:45:00.00
##
## start_date start_time start_MON start_month
## Min. :2022-01-01 Length:9443 Length:9443 Length:9443
## 1st Qu.:2022-05-28 Class1:hms Class :character Class :character
## Median :2022-07-22 Class2:difftime Mode :character Mode :character
## Mean :2022-07-19 Mode :numeric
## 3rd Qu.:2022-09-16
## Max. :2022-12-31
##
## ended_at Duration_Min start_lat
## Min. :2022-01-01 01:40:00.00 Min. : 0.00 Min. :41.66
## 1st Qu.:2022-05-28 19:26:00.00 1st Qu.: 6.00 1st Qu.:41.88
## Median :2022-07-22 14:37:00.00 Median : 10.00 Median :41.90
## Mean :2022-07-20 07:00:41.97 Mean : 17.96 Mean :41.90
## 3rd Qu.:2022-09-16 04:07:00.00 3rd Qu.: 18.00 3rd Qu.:41.93
## Max. :2022-12-31 21:02:00.00 Max. :6702.00 Max. :42.06
##
## start_lng end_lat end_lng Distance_Feet
## Min. :-87.83 Min. :41.65 Min. :-87.93 Min. : 0
## 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.: 2851
## Median :-87.64 Median :41.90 Median :-87.64 Median : 5099
## Mean :-87.65 Mean :41.90 Mean :-87.65 Mean : 6938
## 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 8971
## Max. :-87.52 Max. :42.07 Max. :-87.53 Max. :96052
## NA's :6 NA's :6 NA's :6
## member_casual
## Length:9443
## Class :character
## Mode :character
##
##
##
##
## RowNum rideable_type started_at
## Min. : 405 Length:9437 Min. :2022-01-01 01:35:00.00
## 1st Qu.:1417605 Class :character 1st Qu.:2022-05-28 19:46:00.00
## Median :2833005 Mode :character Median :2022-07-22 14:33:00.00
## Mean :2833438 Mean :2022-07-20 07:08:17.13
## 3rd Qu.:4250205 3rd Qu.:2022-09-16 07:06:00.00
## Max. :5665605 Max. :2022-12-31 20:45:00.00
## start_date start_time start_MON start_month
## Min. :2022-01-01 Length:9437 Length:9437 Length:9437
## 1st Qu.:2022-05-28 Class1:hms Class :character Class :character
## Median :2022-07-22 Class2:difftime Mode :character Mode :character
## Mean :2022-07-19 Mode :numeric
## 3rd Qu.:2022-09-16
## Max. :2022-12-31
## ended_at Duration_Min start_lat
## Min. :2022-01-01 01:40:00.0 Min. : 0.0 Min. :41.66
## 1st Qu.:2022-05-28 19:56:00.0 1st Qu.: 6.0 1st Qu.:41.88
## Median :2022-07-22 14:37:00.0 Median : 10.0 Median :41.90
## Mean :2022-07-20 07:24:23.0 Mean : 16.1 Mean :41.90
## 3rd Qu.:2022-09-16 07:23:00.0 3rd Qu.: 18.0 3rd Qu.:41.93
## Max. :2022-12-31 21:02:00.0 Max. :1318.0 Max. :42.06
## start_lng end_lat end_lng Distance_Feet
## Min. :-87.83 Min. :41.65 Min. :-87.93 Min. : 0
## 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.: 2851
## Median :-87.64 Median :41.90 Median :-87.64 Median : 5099
## Mean :-87.65 Mean :41.90 Mean :-87.65 Mean : 6938
## 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 8971
## Max. :-87.52 Max. :42.07 Max. :-87.53 Max. :96052
## member_casual
## Length:9437
## Class :character
## Mode :character
##
##
##
**If you wish to re-import the monthly Divvy files you will have to re-run the above chunks. Otherwise, just run CHUNK B.
On first pass the above had quotation marks around some of both rideable_type and member_casual records. This was causing duplicates in the plots because they are seen separately with and without quotes. This fragments the charts into ‘fake’ groups. Solution: simply remove the “quotes” from those columns.
Turns out fixing this is not a simple problem using R. I spent a literally hours researching and was not able to find a solution. I could spend much more time looking, but this is not productive. Using all the tools available to me, I loaded Cyclistic_clean.csv into Excel and used ctrl-h to find and replace any ” with nothing. This took about 10 seconds and we’re back in business.
## $casual
## time (-0.1,100] (100,500] (500,1320] (1320,2640] (2640,5280]
## (-0.1,15] 8.4% (192) 0.3% (6) 1.9% (43) 11.8% (271) 40.0% (916)
## (15,30] 5.8% (53) 0.1% (1) 1.2% (11) 2.9% (27) 11.6% (106)
## (30,45] 12.7% (39) 0.0% (0) 1.6% (5) 2.9% (9) 16.0% (49)
## (45,60] 16.9% (23) 0.0% (0) 5.9% (8) 5.1% (7) 17.6% (24)
## (60,120] 23.1% (39) 0.0% (0) 3.0% (5) 7.7% (13) 15.4% (26)
## (120,1000] 25.0% (14) 1.8% (1) 7.1% (4) 19.6% (11) 8.9% (5)
## <NA> 0.0% (0) 0.0% (0) 0.0% (0) 33.3% (1) 0.0% (0)
## (5280,100000]
## 37.6% (860)
## 78.4% (719)
## 66.7% (204)
## 54.4% (74)
## 50.9% (86)
## 37.5% (21)
## 66.7% (2)
##
## $member
## time (-0.1,100] (100,500] (500,1320] (1320,2640] (2640,5280]
## (-0.1,15] 5.0% (214) 0.4% (16) 3.5% (149) 17.8% (754) 38.2% (1620)
## (15,30] 3.0% (30) 0.1% (1) 0.9% (9) 2.0% (20) 6.3% (62)
## (30,45] 9.3% (23) 0.0% (0) 1.2% (3) 4.9% (12) 8.5% (21)
## (45,60] 16.3% (8) 0.0% (0) 2.0% (1) 12.2% (6) 4.1% (2)
## (60,120] 3.8% (1) 0.0% (0) 7.7% (2) 11.5% (3) 19.2% (5)
## (120,1000] 27.3% (3) 0.0% (0) 0.0% (0) 27.3% (3) 9.1% (1)
## (5280,100000]
## 35.1% (1486)
## 87.7% (868)
## 76.1% (188)
## 65.3% (32)
## 57.7% (15)
## 36.4% (4)
## Duration_Min Distance_Feet member_casual time
## Min. : 0.0 Min. : 0 Length:9437 (-0.1,15] :6527
## 1st Qu.: 6.0 1st Qu.: 2851 Class :character (15,30] :1907
## Median : 10.0 Median : 5099 Mode :character (30,45] : 553
## Mean : 16.1 Mean : 6938 (45,60] : 185
## 3rd Qu.: 18.0 3rd Qu.: 8971 (60,120] : 195
## Max. :1318.0 Max. :96052 (120,1000]: 67
## NA's : 3
## dist
## (-0.1,100] : 639
## (100,500] : 25
## (500,1320] : 240
## (1320,2640] :1137
## (2640,5280] :2837
## (5280,100000]:4559
##
Distance_Feet < 100’ in [-0.1,100] likely means the trips started and ended at the same Station.
70% of all rides are < 15 minutes
90% of all rides are < 30 minutes
Longest ride is 20 miles
The raw numbers are hard to compare.
See charts below for a clearer picture.