About me
My name is Kinsley, and I am a Data Professional. I hold a bachelor's degree in Business Management Studies from the University of Cape Town. Last year, I earned the Google Data Analytics Professional Certificate, and I am also enrolled in the Microsoft Power BI Data Analysis Certificate programme.
As a child growing up and discovering the world, my curiosity was at times almost a nuisance to my dad. To his relief, that inquisitive spirit has now found a new home in the world of Data Analytics. Through deep dives into datasets, I uncover insights that explain phenomena and support data-driven decisions for your organization.
Driven by my love for insights, I have been working at Drapac, an Atlanta-based company, where I have played a key role in driving data-driven decision-making. Within just three months, I earned recognition as a 'superstar' for my contributions to both Drapac and the Global Employment Challenge programme. I am eager to bring that same degree of excellence to your company and to ensure that your data processes are top-notch. Last year, I completed the Google Data Analytics Capstone Project, which is displayed below in the portfolio section.
The Project focused on leveraging the Divvy bikeshare dataset to enhance membership and revenue generation for the company. In 2022 I also worked for about a year at MSCI as a Financial Data Analyst, where I gathered, analyzed and stored large volumes of financial data.
Portfolio
The Google Data Analytics Capstone Project is an optional but recommended Project that allows graduates of the course to showcase their acquired skills.
The Capstone Project seeks to simulate a real-world business challenge faced by a data analyst working for a fictional company called "Cyclistic". The company operates as a bike share company based in Chicago.
In addressing the business challenge of the Project, I mainly used three tools: Microsoft Excel, R, and Tableau, although the Certificate also equipped me with SQL skills.
Before we delve into the project, I want to outline my analytic flow, which is divided into the following 5 phases:
1. Ask
This is the diagnostic phase, in which I ask questions and come to understand the business setting/context in which the problem exists. It is not a one-off phase; it runs throughout the analysis and forms part of the fabric of my analytic flow. An open line of communication is therefore kept throughout the analysis between me, my team, and the other key stakeholders in the business problem.
2. Prepare (Extraction)
The data preparation phase marks the beginning of ensuring that the data is reliable and suitable for extraction from its various sources for subsequent processing and analysis. Data can be extracted from sources such as databases and web pages. When extracting data from databases, I typically use SQL, while for web pages I can use methods such as direct downloads or web scraping. In this phase, I assess the data's reliability by looking at factors such as the representativeness of the data set and the quality and quantity of its sources.
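For illustration only, here is a minimal sketch of what such a database extraction could look like in R using an SQL query; the database file "company.db" and table "bike_trips" are hypothetical and not part of the Cyclistic project:
# Hypothetical sketch: extracting trip records from a database with SQL via R
install.packages("DBI")
install.packages("RSQLite")
library(DBI)
# Connect to an example SQLite database (assumed name, for illustration only)
con <- dbConnect(RSQLite::SQLite(), "company.db")
# Pull only the columns and date range needed for the analysis
trips_sample <- dbGetQuery(con, "SELECT started_at, ended_at, member_casual
                                 FROM bike_trips
                                 WHERE started_at >= '2022-05-01'")
dbDisconnect(con)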
3. Process (Transform)
Here, I continue to ensure the data's reliability by processing it from dirty to clean. I do not only clean the prepared data; I also document the steps I took to clean it. Depending on the size of the data set, I may use SQL, R, Python, or Excel to carry out the cleaning.
4. Analyse & Share (Load)
At this point the data is ready to be loaded for further analytic purposes, as the whole ETL (Extract, Transform, Load) cycle is complete. Here I typically present visuals generated in Tableau and unveil the interesting insights that I deem key for the analysis and the subsequent conclusions and recommendations.
5. Act (Conclusions & Recommendations)
1. Ask phase
The business settings/context
The setting is that I am a data analyst working in the marketing analyst team at Cyclistic. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Also, the finance team says annual member riders are more profitable than casual bike riders. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members.
Questions I asked:
· How do annual members and casual riders use Cyclistic bikes differently?
· Why would casual riders buy Cyclistic annual memberships?
· How can Cyclistic use digital media to influence casual riders to become members?
With the above phase done, the sections below will seek to answer these questions.
2. Prepare phase
The data shows no significant issues of bias or credibility, as it is collected from a fairly representative sample and verified for accuracy by both the University and the City of Chicago. The data meets the ROCCC criteria, being reliable, original, comprehensive, current, and cited. Furthermore, it was also reviewed by experts. Licensing, privacy, and security are appropriately addressed, with the data being both freely accessible and protected. Its integrity was verified by cross-checking against other sources, such as the City of Chicago's transportation data portal, and by statistical testing.
With some confidence in the data's reliability established, I proceeded to download the dataset from its source, the Divvy website https://divvytripdata.s3.amazonaws.com. The dataset comprised 12 separate spreadsheets, one for each of 12 consecutive months of bike trip data (May 2022 to April 2023). Using Excel, I noticed that the bikeshare data in each spreadsheet is well organized and clearly labelled into self-explanatory columns, such as the "started_at" and "ended_at" columns indicating the start and end times of the bike trips. The data appears able to help me answer the questions established in the Ask phase.
However, I noticed that the dataset was too large for Excel to load, so I switched to using R for further inspection. The R code below shows how I loaded the data into R:
#=================================================
# Install & load required packages
#=================================================
install.packages("data.table") # For loading and then combining the spreadsheets conveniently (and efficiently)
install.packages("readr")      # For reading the spreadsheets
install.packages("tidyverse")  # tidyverse for data import and wrangling
install.packages("lubridate")  # lubridate for date functions
library(data.table)
library(readr)
library(tidyverse)
library(lubridate)
#=================================
# STEP 1: COLLECT DATA
#=================================
# Set the Windows working directory where the Divvy datasets (csv files) are saved
setwd("C:/cyclistic/12 months bike trips data")
trips <- list.files(pattern="*.csv")
print(trips)
# Read the files printed above into appropriately named data frames
apr_2023 <- read_csv("202304-Trips.csv")
mar_2023 <- read_csv("202303-Trips.csv")
feb_2023 <- read_csv("202302-Trips.csv")
jan_2023 <- read_csv("202301-Trips.csv")
dec_2022 <- read_csv("202212-Trips.csv")
nov_2022 <- read_csv("202211-Trips.csv")
oct_2022 <- read_csv("202210-Trips.csv")
sep_2022 <- read_csv("202209-Trips.csv")
aug_2022 <- read_csv("202208-Trips.csv")
jul_2022 <- read_csv("202207-Trips.csv")
jun_2022 <- read_csv("202206-Trips.csv")
may_2022 <- read_csv("202205-Trips.csv")
Through my inspection of the data in Excel, I saw that all the columns in each spreadsheet are identical. So, in the R script below I verify this assertion before combining the data into one table:
#================================================================
# STEP 2: INSPECT THE DATA FOR CONSISTENCY
#================================================================
# Compare the column names of each of the files
# While the names don't have to be in the same order, they DO need to match perfectly before we can use a command to join them into one file
str(apr_2023)
str(mar_2023)
str(feb_2023)
str(jan_2023)
str(dec_2022)
str(nov_2022)
str(oct_2022)
str(sep_2022)
str(aug_2022)
str(jul_2022)
str(jun_2022)
str(may_2022)
Because there were no inconsistencies or discrepancies in the data's column names, I continued to bind the data in the script below:
trips_ <- list.files("C:/cyclistic/12 months bike trips data", pattern="*.csv")
print(trips_) # Inspect the loaded file names for any incongruencies
trips_ <- rbindlist(lapply(trips_, fread))
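Beyond inspecting each data frame with str(), a compact way to confirm that all twelve spreadsheets share exactly the same column names is the base R sketch below (not part of the original script; it assumes the twelve data frames created above):
# Put the twelve monthly data frames into a list and compare their column names
monthly_frames <- list(may_2022, jun_2022, jul_2022, aug_2022, sep_2022, oct_2022,
                       nov_2022, dec_2022, jan_2023, feb_2023, mar_2023, apr_2023)
all(sapply(monthly_frames, function(df) identical(names(df), names(may_2022))))
# TRUE confirms every spreadsheet has the same column names, so they can be bound safely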
3. Process (Transform)
In processing (transforming) the data from dirty to clean, I started by removing columns that I deemed not useful for answering the questions asked in the first phase, using the following script:
#===============================================
# STEP 3: CLEAN AND ADD MORE LAYERS TO THE DATA
#===============================================
# Remove ride_id, rideable_type,start_station_id, end_station_id
trips_ <- trips_%>%
select(-c(ride_id, rideable_type,start_station_id, end_station_id))
Then I inspected the columns that remained after removing those columns:
str(trips_)#See list of columns and data types (numeric, character, etc)
summary(trips_) #Statistical summary of data. Mainly numerical
Upon inspection, I saw it fit to rename the "member_casual" column to "rider_type", as it is a more precisely descriptive column name:
colnames(trips_)[colnames(trips_) == "member_casual"] <- "rider_type"
names(trips_) #Inspect to see if indeed the member_casual column is changed to rider_type
I also removed duplicate rows:
nrow(trips_)
[1] 5859061
trips_<- distinct(trips_) #Remove duplicate rows.
nrow(trips_)
[1] 5859027
#The above shows that 34 rows (5859061 - 5859027) were removed, <- cleaning effort highlight.
After the above, I added insightful layers (columns) to the data. One of the main layers I deemed necessary was a column for the length of each trip, which I named "trip_length":
trips_$trip_length <- difftime(trips_$ended_at, trips_$started_at, units = "hours")
trips_<- arrange(trips_, trip_length)
#Inspect the newly added column for any anomalies.
summary(trips_$trip_length)
#I found that I could not obtain statistical summaries from the newly added column, which means it must be converted to a format that supports them
Converted "trip_length" from Factor to numeric so we can run statistical summaries and calculations on the data:is.factor(trips_$ride_length)
trips_$trip_length <- as.numeric(as.character(trips_$trip_length))
is.numeric(trips_$trip_length)
#The above step is essential; otherwise you won't be able to obtain statistical attributes from the column data, such as the mean, min, max, and Q3.
summary(trips_$trip_length)
Now, from the above code I was able to obtain statistical summaries and also found anomalies in the data, such as a minimum ride length of -172 hours, which obviously calls for further inspection of the data:
negative_values <- trips_[trips_$trip_length <= 0, ]
nrow(negative_values) #Found that there are 544 values that are negative or zero
print(negative_values$trip_length)
summary(negative_values$trip_length)
# Remove "bad" data
nrow(trips_)
[1] 5859027
trips_ <- trips_[!(trips_$trip_length <= 0)]
nrow(trips_)
[1] 5858483
#So 544 trip duration values that were zero or negative were removed, -> cleaning effort highlight.
I continued to add other columns (layers) to the data, such as "date", "day_of_week" and "month_year", which provided me with additional insights about the data:
# Add columns that list the date, month, day, and year of each ride
# This will allow us to aggregate ride data for each month, day, or year ... before completing these operations we could only aggregate at the ride level
trips_$date <- as.Date(trips_$started_at) #The default format is yyyy-mm-dd
trips_$month_year <- format(as.Date(trips_$started_at), "%Y-%m")
trips_$day_of_week <- format(as.Date(trips_$date), "%A")
# Inspect the structure of the columns
str(trips_)
After cleaning the data, I converted the cleaned data to CSV format for analysis and visualization in Tableau:
# Specify the file path and name for the CSV file
csv_file <- "trips_data.csv"
# Write the data frame to a CSV file
write.csv(trips_, file = csv_file, row.names = FALSE)
# Print a message to confirm the CSV file creation
cat("CSV file created:", csv_file, "\n")
4. Analyse & Share (Load)
Please note that for this section, all the displayed (dashboard) visuals are interactive. Click on them to open the website for a fully interactive experience.
Throughout my breakdown of each of the above visuals, I will mainly compare Casual riders versus (VS) Member riders in an attempt to answer the questions established in the first phase. In doing so, I will comment on the insights each respective visual unveils:
The above visual illustrates that the cumulative number of bike trips taken by Casual riders during our period of interest (May 2022 to April 2023) is approximately 2.3 million, compared with roughly 3.5 million trips taken by Member riders. Member riders therefore exceed Casual riders by about 1.1 million bike trips over the period measured.
However, I also explored the trip length visuals for the two kinds of riders:
The above indicates that Member riders have cumulatively travelled fewer hours than Casual riders: Member riders rode approximately 730 thousand hours in total, whereas Casual riders rode around 1.1 million hours.
To get a clear understanding of which type of rider spends more time per trip, I calculated the average trip length in hours per ride for the two rider types:
Average trip length for Casual riders = 1 119 831 / 2 358 035 = 0.47 hours, which is about 28.5 minutes per trip.
Average trip length for Member riders = 729 330 / 3 500 461 = 0.21 hours, which is about 12.5 minutes per trip.
So, on average, Casual riders take more than twice as long on their bike trips as Member riders:
Average trip length for Casual riders / Average trip length for Member riders = 28.5 / 12.5 ≈ 2.28 times.
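The figures above come from the Tableau visuals; as a quick cross-check, the same per-rider-type summaries could be reproduced in R from the cleaned trips_ table (a sketch assuming the rider_type and trip_length columns created earlier):
# Number of trips, total hours, and average trip length per rider type
trips_ %>%
  group_by(rider_type) %>%
  summarise(number_of_trips = n(),
            total_hours = sum(trip_length),
            avg_length_hours = mean(trip_length),
            avg_length_minutes = mean(trip_length) * 60)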
Now let's examine the performance of Casual riders versus Member riders in terms of both the number and length of their trips per month, and then per weekday.
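Before turning to the visuals, here is a sketch of how the monthly and weekday aggregates behind them could be computed in R (again assuming the month_year and day_of_week columns added during processing):
# Trips and total hours per rider type for each month
trips_ %>%
  group_by(rider_type, month_year) %>%
  summarise(number_of_trips = n(), total_hours = sum(trip_length)) %>%
  arrange(month_year)
# Trips and total hours per rider type for each day of the week
trips_ %>%
  group_by(rider_type, day_of_week) %>%
  summarise(number_of_trips = n(), total_hours = sum(trip_length))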
As we can see from the monthly visuals above, both the number of trips taken and the trip length for the two types of riders follow a generally cyclic pattern, peaking during summer (2022-06 to 2022-09) and reaching a minimum during winter (2022-11 to 2023-04).
These observations might be because, during summer, the weather is usually warmer and daylight hours are longer, which makes biking a more appealing option for recreational activities such as bike tour excursions. In contrast, colder temperatures and shorter daylight hours in winter discourage people from riding bikes, leading to reduced usage.
The visual above displays that the number of Casual cyclists' trips peaks on the weekend, that is, Saturday (467 869 trips) and Sunday (388 777 trips). For Member cyclists, however, the number of trips taken peaks on Thursday (560 836 trips) and Wednesday (556 880 trips), followed closely by Tuesday.
Similarly, from the above graph we can see that the trip length of Casual riders also peaks on the weekend. For Member riders, just like the number of trips, trip length peaks on Thursday (113 118 hours) and Wednesday (110 373 hours), followed closely by the other weekdays. It is clearly noticeable that Member riders are more consistent throughout the week, in terms of both trip length and number of trips taken, than Casual riders.
A possible explanation of this pattern is that annual Member riders use their bicycles to commute daily to their workplaces and errands, hence the number and length of their trips appear more consistent in the two visuals above.
Casual cyclists, on the other hand, use their bicycles for excursions or recreational trips, which are typically taken on weekends when most people are not at work. This explains why Casual riders' trip counts and trip lengths generally peak on weekends.
The visuals above have answered a key question formulated in the Ask phase: "How do annual members and casual riders use Cyclistic bikes differently?". We can therefore proceed to answer the remaining questions in the Act phase (Conclusions and Recommendations), where we propose a way for Cyclistic to ultimately boost its revenue by converting casual riders into member riders.