To get directly to Data Viz, just run CHUNKs A, B and C.
To evaluate Prepare, Clean & Process phases continue step-by-step beginning with CHUNK A.
## [1] "C:/Users/rock/iCloudDrive/R/RStudio/Cyclistic_Case_Study"
## [1] "C:/Users/rock/iCloudDrive/R/RStudio/Cyclistic_Case_Study"
## ride_id rideable_type started_at ended_at
## Length:103771 Length:103771 Length:103771 Length:103771
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:103771 Length:103771 Length:103771 Length:103771
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.65 Min. :-87.83 Min. :41.65 Min. :-87.83
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.89 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80 Max. :42.07 Max. :-87.52
## NA's :1 NA's :1 NA's :87 NA's :87
## member_casual
## Length:103771
## Class :character
## Mode :character
##
##
##
##
## Mode FALSE TRUE NA's
## logical 103769 1 1
## # A tibble: 60 × 13
## ride_id ridea…¹ start…² ended…³ start…⁴ start…⁵ end_s…⁶ end_s…⁷ start…⁸
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 C2F7DD78E82E… electr… 1/13/2… 1/13/2… Glenwo… 525 Clark … RP-007 42.0
## 2 A6CF8980A652… electr… 1/10/2… 1/10/2… Glenwo… 525 Clark … RP-007 42.0
## 3 BD0F91DFF741… classi… 1/25/2… 1/25/2… Sheffi… TA1306… Greenv… TA1307… 41.9
## 4 CBB80ED41910… classi… 1/4/20… 1/4/20… Clark … KA1504… Paulin… TA1309… 42.0
## 5 DDC963BFDDA5… classi… 1/20/2… 1/20/2… Michig… TA1309… State … TA1305… 41.9
## 6 A39C6F6CC058… classi… 1/11/2… 1/11/2… Wood S… 637 Honore… TA1305… 41.9
## 7 BDC4AB637EDF… classi… 1/30/2… 1/30/2… Oakley… KA1504… Broadw… 13323 42.0
## 8 81751A3186E5… classi… 1/22/2… 1/22/2… Sheffi… TA1306… Damen … 13271 41.9
## 9 154222B86A33… electr… 1/17/2… 1/17/2… Racine… 13304 Clinto… WL-012 41.9
## 10 72DC25B2DD46… classi… 1/28/2… 1/28/2… LaSall… TA1309… Clinto… WL-012 41.9
## # … with 50 more rows, 4 more variables: start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹rideable_type, ²started_at, ³ended_at, ⁴start_station_name,
## # ⁵start_station_id, ⁶end_station_name, ⁷end_station_id, ⁸start_lat
The Summary shows it was just the one single error being out-of-bounds.
## Do you want to import and clean Divvy 09-12 again? (Yes/no/cancel)
## [1] "Monthly_csv_files/202209-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202210-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202211-divvy-tripdata.csv"
## [1] "Monthly_csv_files/202212-divvy-tripdata.csv"
COUNT(DISTINCT[field]) All Start|End pairs should match. Can’t have a trip start without an end! Here’s what I found:
Results are somewhat reasonable except for start_lat and start_lng indicating there is some kind of issue here. I suspect issues with insignificant decimal resolution. I tried rounding these two values. It turns out there is an optimal rounding value of 4 places. just very weird that only 70.1% of start lat/lng match while 99.96% of end lat/lng match.
Rounding gives the following results:
Much closer, but still significantly off. Further, I would expect Start|End LAT|LNG to approximately equal Start|End station names and station_id. Again, all close but not identical. Could be missing station info.
The LAT|LNG discrepancy could also be explained by variances in exactly where the bike was parked when the prior trip ended, i.e., different ends of the bike rack, as these positions must be reported electronically by the bike’s instrumentation and not manually transcribed.
Another probability is LAT or LNG values that “line-up” for one or the other values without both values lining up , e.g., multiple stops on a North/South street where LNGs are the same while LATs differ. This does not explain why End LAT|LNG are a near perfect match.
I will plot this data geographically and compute straight line distance traveled from lat|lng as well. If it’s way off we will see it on the plots. If it’s good we will have lines departing and ending at essentially the same locations.
This data must have been transmitted electronically from GPS units, therefore, we have to assume this data is Reliable, Original, Current and Cited.
However, there is an issue with “Comprehensive”. There is no column identifying the customer. I would not expect, nor want any personal data, but a Customer_ID with gender, age, monthly billing, trip_IDs, etc. would greatly improve Data Analysis. Without a customer_id I cannot tell how many rides were taken per customer nor the duration (lifetime) of a customer’s use of the product, how much money they spent, and so on.
Another failure of the supplied data is bike station names and locations. There are more blanks here than data. Fortunately, we have close to 100% latitude / longitude data so we know where rides start and finish, we just don’t know names or addresses of the stations or an easy way to tabulate the busiest stations (not a Business Task of this project). This data could be merged if we had a list of stations and their latitude / longitude.
We need to find which Casual Customers are getting close to spending enough money that a Membership plan would make sense, especially if they think they would ride more often on the unlimited plan. I have already seen enough to know this will be part of my proposed Marketing Plan, but it’s going to require this data to know which casual users to contact.
I would also like to know customer’s “lifetime” as users. This should be in the customer database as well.
------------------------------------------
Looking at the amount of data loaded it is 5,666,173 rows covering all of 2022. From this I can take a statistical sample using SQL of 9,442 rows to get 99.99% accuracy with 2% error factor. This will easily fit into RStudio for continued analysis. Sample calculation from https://www.calculator.net/sample-size-calculator.html?type=1&cl=99.99&ci=2&pp=50&ps=5666173&x=58&y=21
## RowNum rideable_type started_at
## Min. : 405 Length:9443 Min. :2022-01-01 01:35:00.00
## 1st Qu.:1416705 Class :character 1st Qu.:2022-05-28 19:14:30.00
## Median :2833005 Mode :character Median :2022-07-22 14:33:00.00
## Mean :2833005 Mean :2022-07-20 06:42:44.34
## 3rd Qu.:4249305 3rd Qu.:2022-09-16 03:57:00.00
## Max. :5665605 Max. :2022-12-31 20:45:00.00
##
## start_date start_time start_MON start_month
## Min. :2022-01-01 Length:9443 Length:9443 Length:9443
## 1st Qu.:2022-05-28 Class1:hms Class :character Class :character
## Median :2022-07-22 Class2:difftime Mode :character Mode :character
## Mean :2022-07-19 Mode :numeric
## 3rd Qu.:2022-09-16
## Max. :2022-12-31
##
## ended_at Duration_Min start_lat
## Min. :2022-01-01 01:40:00.00 Min. : 0.00 Min. :41.66
## 1st Qu.:2022-05-28 19:26:00.00 1st Qu.: 6.00 1st Qu.:41.88
## Median :2022-07-22 14:37:00.00 Median : 10.00 Median :41.90
## Mean :2022-07-20 07:00:41.97 Mean : 17.96 Mean :41.90
## 3rd Qu.:2022-09-16 04:07:00.00 3rd Qu.: 18.00 3rd Qu.:41.93
## Max. :2022-12-31 21:02:00.00 Max. :6702.00 Max. :42.06
##
## start_lng end_lat end_lng Distance_Feet
## Min. :-87.83 Min. :41.65 Min. :-87.93 Min. : 0
## 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.: 2851
## Median :-87.64 Median :41.90 Median :-87.64 Median : 5099
## Mean :-87.65 Mean :41.90 Mean :-87.65 Mean : 6938
## 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 8971
## Max. :-87.52 Max. :42.07 Max. :-87.53 Max. :96052
## NA's :6 NA's :6 NA's :6
## member_casual
## Length:9443
## Class :character
## Mode :character
##
##
##
##
## RowNum rideable_type started_at
## Min. : 405 Length:9437 Min. :2022-01-01 01:35:00.00
## 1st Qu.:1417605 Class :character 1st Qu.:2022-05-28 19:46:00.00
## Median :2833005 Mode :character Median :2022-07-22 14:33:00.00
## Mean :2833438 Mean :2022-07-20 07:08:17.13
## 3rd Qu.:4250205 3rd Qu.:2022-09-16 07:06:00.00
## Max. :5665605 Max. :2022-12-31 20:45:00.00
## start_date start_time start_MON start_month
## Min. :2022-01-01 Length:9437 Length:9437 Length:9437
## 1st Qu.:2022-05-28 Class1:hms Class :character Class :character
## Median :2022-07-22 Class2:difftime Mode :character Mode :character
## Mean :2022-07-19 Mode :numeric
## 3rd Qu.:2022-09-16
## Max. :2022-12-31
## ended_at Duration_Min start_lat
## Min. :2022-01-01 01:40:00.0 Min. : 0.0 Min. :41.66
## 1st Qu.:2022-05-28 19:56:00.0 1st Qu.: 6.0 1st Qu.:41.88
## Median :2022-07-22 14:37:00.0 Median : 10.0 Median :41.90
## Mean :2022-07-20 07:24:23.0 Mean : 16.1 Mean :41.90
## 3rd Qu.:2022-09-16 07:23:00.0 3rd Qu.: 18.0 3rd Qu.:41.93
## Max. :2022-12-31 21:02:00.0 Max. :1318.0 Max. :42.06
## start_lng end_lat end_lng Distance_Feet
## Min. :-87.83 Min. :41.65 Min. :-87.93 Min. : 0
## 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.: 2851
## Median :-87.64 Median :41.90 Median :-87.64 Median : 5099
## Mean :-87.65 Mean :41.90 Mean :-87.65 Mean : 6938
## 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 8971
## Max. :-87.52 Max. :42.07 Max. :-87.53 Max. :96052
## member_casual
## Length:9437
## Class :character
## Mode :character
##
##
##
**If you wish to re-import the monthly Divvy files you will have to re-run the above chunks. Otherwise, just run CHUNK B.
On first pass the above had quotation marks around some of both rideable_type and member_casual records. This was causing duplicates in the plots because they are seen separately with and without quotes. This fragments the charts into ‘fake’ groups. Solution: simply remove the “quotes” from those columns.
Turns out fixing this is not a simple problem using R. I spent a literally hours researching and was not able to find a solution. I could spend much more time looking, but this is not productive. Using all the tools available to me, I loaded Cyclistic_clean.csv into Excel and used ctrl-h to find and replace any ” with nothing. This took about 10 seconds and we’re back in business.
## $casual
## time (-0.1,100] (100,500] (500,1320] (1320,2640] (2640,5280]
## (-0.1,15] 8.4% (192) 0.3% (6) 1.9% (43) 11.8% (271) 40.0% (916)
## (15,30] 5.8% (53) 0.1% (1) 1.2% (11) 2.9% (27) 11.6% (106)
## (30,45] 12.7% (39) 0.0% (0) 1.6% (5) 2.9% (9) 16.0% (49)
## (45,60] 16.9% (23) 0.0% (0) 5.9% (8) 5.1% (7) 17.6% (24)
## (60,120] 23.1% (39) 0.0% (0) 3.0% (5) 7.7% (13) 15.4% (26)
## (120,1000] 25.0% (14) 1.8% (1) 7.1% (4) 19.6% (11) 8.9% (5)
## <NA> 0.0% (0) 0.0% (0) 0.0% (0) 33.3% (1) 0.0% (0)
## (5280,100000]
## 37.6% (860)
## 78.4% (719)
## 66.7% (204)
## 54.4% (74)
## 50.9% (86)
## 37.5% (21)
## 66.7% (2)
##
## $member
## time (-0.1,100] (100,500] (500,1320] (1320,2640] (2640,5280]
## (-0.1,15] 5.0% (214) 0.4% (16) 3.5% (149) 17.8% (754) 38.2% (1620)
## (15,30] 3.0% (30) 0.1% (1) 0.9% (9) 2.0% (20) 6.3% (62)
## (30,45] 9.3% (23) 0.0% (0) 1.2% (3) 4.9% (12) 8.5% (21)
## (45,60] 16.3% (8) 0.0% (0) 2.0% (1) 12.2% (6) 4.1% (2)
## (60,120] 3.8% (1) 0.0% (0) 7.7% (2) 11.5% (3) 19.2% (5)
## (120,1000] 27.3% (3) 0.0% (0) 0.0% (0) 27.3% (3) 9.1% (1)
## (5280,100000]
## 35.1% (1486)
## 87.7% (868)
## 76.1% (188)
## 65.3% (32)
## 57.7% (15)
## 36.4% (4)
## Duration_Min Distance_Feet member_casual time
## Min. : 0.0 Min. : 0 Length:9437 (-0.1,15] :6527
## 1st Qu.: 6.0 1st Qu.: 2851 Class :character (15,30] :1907
## Median : 10.0 Median : 5099 Mode :character (30,45] : 553
## Mean : 16.1 Mean : 6938 (45,60] : 185
## 3rd Qu.: 18.0 3rd Qu.: 8971 (60,120] : 195
## Max. :1318.0 Max. :96052 (120,1000]: 67
## NA's : 3
## dist
## (-0.1,100] : 639
## (100,500] : 25
## (500,1320] : 240
## (1320,2640] :1137
## (2640,5280] :2837
## (5280,100000]:4559
##
Distance_Feet < 100’ in [-0.1,100] likely means the trips started and ended at the same Station.
70% of all rides are < 15 minutes
90% of all rides are < 30 minutes
Longest ride is 20 miles
The raw numbers are hard to compare.
See charts below for a clearer picture.
The main take-a ways from this chart are:
Casual Rider use drops off significantly during Winter months
Only Casual Riders use “Docked Bikes”
I interpret these findings to mean Members rely more on Bikes all year meaning a good portion of them must be using them to commute to work.
We can see more obviously in the above chart, usage falls considerably in the winter months…more so with Casual Riders than Members. Nothing unexpected here for a Northern climate city, but targeting people where they work might yield an increase in Members.
This confirms, from the spikes around 7-9AM and again 4-6PM, that a lot of members use their bikes for going to/from work.
Also note that casual rider usage peaks in the same 4-6PM time frame. It is not clear why.
Finally member rides in the 4-6PM time frame volume do not match the morning volume so there must be additional member riders in the afternoon than just those commuting to and from work. Could this be partially due to an evening shift of workers arriving?
In light of the above, we should target casual users for getting to/from work from either their home, if in the area, or from mass-transit locations. Possibly using billboard ads at mass-transit points and signs and flyers in major office buildings as well as at our Bike Stations.
Casual Riders also use the bike service more often on weekends approximately matching Member rider weekend usage. Member riders not only have higher use M-F, it’s even higher than weekend use. This confirms again that Member riders are using our service for work and pleasure riding. It’s also possible that some casual riders simply do not work in the Chicago area served by Cyclistic Bike Services. It would be useful to plot Casual and Member user’s home location from address or zip code data in Customer records.
More confirmation of bikes being used for work as weekend use has a nice curve around the best riding times for pleasure while M-F use dips between 9 and 5 while most people are working.
Member usage drops a little on Thursdays and even more on Fridays…a 4-day work week?
Interesting Chart, but only confirms the obvious; Winter months are not good for Bike rentals. This may be partially a financial deterrent for Casual Riders. I would need pricing and customer data to confirm.
90% of all rides are less than 30 minutes, 70% are less than 15 minutes
Longest ride is 20 miles
Casual Rider use drops off significantly during Winter months
Member use drops during Winter months too, just not as much -> using bikes for work
Only Casual Riders use “Docked Bikes”
Pleasure rides for all members are still big
use their bikes for going to/from work
take more lengthy rides (see limitations below)
ride more (see limitations below)
Since I do not (yet) have access to Customer records I can’t compare rides/customer, just rides per customer “type”.
Trip distance is point-to-point. Actual trip distance is not included.
Post billboards at all mass-transit locations in our service area.
Post signs and flyers in major office buildings as well as at our Bike Stations.
Direct mail/email to Casual users with the same message.
Use customer database to prioritize Casual users that spend more money and are closer to the price of a Membership.
Point out once they become a Member they can ride when ever and where ever they like at no extra cost.
Do not forget to cultivate new users, Casual and Members.
Customers have a defined lifetime and we need a constant supply of new customers, especially if enticing Casual users to convert to Members.
I can make these recommendations once I have access to Customer data.
We don’t want to run out of opportunities to upgrade!