About me


My name is Kinsley, and I am a Data Professional. I hold a bachelor's degree in Business Management Studies from the University of Cape Town. Last year, I earned the Google Data Analytics Professional Certificate, & I am also enrolled with the Microsoft Power BI Data Analysis Certificate programme.

Kinsley_Intro.mp4

As a little child growing up & discovering the world, my curiosity, was once almost a nuisance to my dad. However, to his relief now my inquisitive spirit has found a new home in the world of Data Analytics. Through deep dives into datasets, I uncover insights that explain phenomena and offer data-driven insights for your organization.

Driven by my love for insights, I've been working at Drapac, an Atlanta-based company, where I’ve played a key role in driving data-driven decision-making. Within just 3-months, I earned recognition as a 'superstar' for my contributions to both Drapac & the Global Employment Challenge programme. So, I’m eager to bring the same degree of excellence to your company & ensuring that your data processes are top-notch and high like the STARS. Last year, I completed the Google Data Analytics Capstone Project, which is displayed below in the portfolio section.

The Project focused on leveraging the Divvy bikeshare dataset to enhance membership and revenue generation for the company. In 2022 I also worked for about a year at MSCI as a Financial Data Analyst, where I gathered, analyzed and stored large volumes of financial data.

Portfolio


The Google Data Analytics Capstone Project is an optional but recommended Project that allows graduates of the course to showcase their acquired skills.


The Capstone Project seeks to simulate a real-world business challenge faced by a data analyst working for a fictional company called "Cyclistic". The company operates as a bike share company based in Chicago.

In addressing the business challenge on the Project, I mainly utilized 3 tools: Microsoft Excel, R, and Tableau - despite also being equipped with SQL skills, through the Certificate.

Before we delve into the project, I want to outline my analytic flow, which is divided into the following 5 phases:

1. Ask

This is a diagnostic phase. So, I ask questions and understand the business settings/context in which the problem exists. This phase is not a once off phase, rather it's an ongoing phase and forms part of the fabric of my analytic flow. Hence an open line of communication will be kept throughout the analysis, between me, my team and other relevant key stakeholders who have stakes in the business problem.


2. Prepare (Extraction)

The data preparation phase marks the beginning of ensuring that the data is reliable and suitable to be extracted for subsequent processing (& analysis), from its various sources. Data can be extracted from various sources; this includes sources such as databases and web pages. When extracting data from databases, I typically utilize SQL, while for web pages, I can employ methods such as direct downloads or web scraping. For this phase I only ensure reliability of the data through assessing factors such as the representativeness of the data set, and the quality & quantity of the data's sources. 

3. Process (Transform)

Here, I still continue to ensure the data's reliability, but through processing the data from being dirty to clean. However, I am not only just going to clean the prepared data, but I am also going to document the steps I took to clean the data. Depending on how big the data set is I might opt to use SQL, R, Python or Excel to carry out the cleaning.

4. Analyse & Share (Load)

At this point the data is ready to be loaded for further analytic purposes, as the whole ETL(Extract, Tranform,Load) framework is done. Here I typically present visuals generated via Tableau, as well as unveil the interesting insights, that I deemed to be key for both our analysis and subsequent conclusions (& recommendations).

5.Act (Conclusions & Recommendations)

1.Ask phase

The business settings/context 

The setting is that I am a data analyst working in the marketing analyst team at Cyclistic. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Also, the finance team says annual member riders are  more profitable than casual bike riders. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members.

Questions I asked:

·        How do annual members and casual riders use Cyclistic bikes differently?

·        Why would casual riders buy Cyclistic annual memberships?

·        How can Cyclistic use digital media to influence casual riders to become members?

 With the above phase, done the below sections will seek to answer these questions eventually.

2. Prepare phase


The data shows no significant issues of bias or credibility, as the data is collected from a from a fairly representative sample and verified for accuracy by both the University and City of Chicago. The data meets the ROCCC criteria, being reliable, original, comprehensive, coherent, and cited. Furthermore, it was also reviewed by experts. Licensing, privacy, and security are appropriately addressed, with the data being both freely accessible and protected. Its integrity was verified by cross-checking with other sources such as the City of Chicago's transportation data portal and statistical testing. 

With some reliability on the data established I proceeded to download the dataset from the source which is the divvy website https://divvytripdata.s3.amazonaws.com. The dataset included 12 separate spreadsheets. Each spreadsheet represents bike trips data for 12 consecutive months (May 2022 to April 2023). Via Excel, I was able to notice that the bikeshare data in each spreadsheet is well-organized, and clearly labelled into self-explanatory columns such as the ‘started_at” and “ended_at” column to indicate both the start time and end time of the bike trips. The data appears to be able to help me answer questions established on the ask phase.

However, I noticed that the dataset was too large for Excel to load, so I switched to using R for further inspection. The R code below shows how I loaded the data into R:

# # # # # # # # # # # # # # # # # # # # # # # #### ##<<__Install & load required packages__>>##### # # # # # # # # # # # # # # # # # # # # # # ## install.packages("data.table")# For loading and subsequently, combining the spreadsheets conveniently( and efficiently).install.packages("readr") # For reading the spreadsheetsinstall.packages("tidyverse")# tidyverse for data import and wrangling!install.packages("lubridate")# lubridate for date functions! library(data.table)library(readr)library(tidyverse) library(lubridate)   #=================================# STEP 1: COLLECT DATA#=================================#Set windows destination & Upload Divvy datasets (csv files) here setwd("C:/cyclistic/12 months bike trips data")trips <- list.files(pattern="*.csv")print(trips_)#Rename the above printed files to an appropriate dataframe name
apr_2023<- read_csv("202304-Trips.csv")mar_2023<- read_csv("202303-Trips.csv")feb_2023<- read_csv("202302-Trips.csv")jan_2023<- read_csv("202301-Trips.csv")dec_2022 <- read_csv("202212-Trips.csv")nov_2022<- read_csv("202211-Trips.csv")oct_2022<- read_csv("202210-Trips.csv")sep_2022<- read_csv("202209-Trips.csv")aug_2022<- read_csv("202208-Trips.csv")jul_2022<- read_csv("202207-Trips.csv")jun_2022<- read_csv("202206-Trips.csv")may_2022 <- read_csv("202205-Trips.csv") Through my inspection of the data in excel, I saw that all the columns in each spreadsheet are identical. So, in the below R script I will verify this assertion before I can consider combining the data into one spreadsheet: #================================================================# STEP 2: INSPECTED THE DATA FOR#================================================================# Compare column names each of the files# While the names don't have to be in the same order, they DO need to match perfectly before we can use a command to join them into one file str (apr_2023)str (mar_2023)str (feb_2023)str (jan_2023)str (dec_2022)str (nov_2022)str (oct_2022)str (sep_2022)str (jul_2022)str (jun_2022)str (jun_2022)str (may_2022) Because there were no inconsistencies or discrepancies in the data’s column names, I continued to bind the data in the coding script below:trips_<- list.files("C:/cyclistic/12 months bike trips data", pattern="*.csv") print(filenames) #inspect the loaded files for any incongruencies. trips_ <- rbindlist(lapply(trips_, fread)) 

3. Process (Transform)


In processing(transforming) the data from dirty to clean I started by removing columns that I deemed as not useful, for answering the questions asked in the 1st phase, through the following script:

#===============================================

# STEP 3: CLEAN  AND ADD MORE LAYERS TO THE DATA

#===============================================

 

# Remove ride_id, rideable_type,start_station_id, end_station_id

trips_ <- trips_%>%  

select(-c(ride_id, rideable_type,start_station_id, end_station_id))

Then I Inspected the new table columns that remained after removing those columns:

str(trips_)#See list of columns and data types (numeric, character, etc)

summary(trips_) #Statistical summary of data. Mainly numerical 

Upon inspection I saw it fit to rename the "member_casual" column to "rider_type" as its a more precisely descriptive column name:

colnames(trips_)[colnames(trips_) == "member_casual"] <- "rider_type"

 

names(trips_) #Inspect to see if indeed the member_casual column is changed to rider_type 

I also removed duplicate rows:

nrow(trips_)

[1] 5859061

trips_<- distinct(trips_) #Remove duplicate rows.

nrow(trips_)

[1] 5859027

#The above is showing that 21 rows (5859061-5859027) around were removed, <-cleaning effort highlight. 

After the above, I did addition of insightful layers(columns) to the data and one of the main layers I dimmed necessary to add was a column for length of trip and I named it "trip_length”:

trips_$trip_length <- difftime(trips_$ended_at, trips_$started_at, units = "hours")

 

trips_<- arrange(trips_, trip_length)

 

#Inspect the new added column for any anomalies.

summary(trips_$trip_length)

 

#I found that I am not able to obtain statistical summaries from the new added column then that means we need to convert it to format which we can be able to obtain statistical summaries from 

Converted "trip_length" from Factor to numeric so we can run statistical summaries and calculations on the data:

is.factor(trips_$ride_length)

trips_$trip_length <- as.numeric(as.character(trips_$trip_length))

is.numeric(trips_$trip_length)

 

#The above step is very paramount otherwise you won’t be able to find the statistical attributes from the column data like mean,min,max and Q3.

 

summary(trips_$trip_length) 

Now, from the above code I was able to obtain statistical summaries and also found anomalies in the data such as a minimum value of -172 hours of ride length ,this obviously calls for further inspection of the data:

negative_values <- trips_ [trips_$ trip_length <= 0, ]

nrow(negative_values) #found out that there are 544 negative values or values with zero

print(negative_values$trip_length)

summary(negative_values$trip_length)

 

# Remove "bad" data

nrow(trips_)= 5859027

trips_ <- trips_[!(trips_$trip_length <= 0)]

nrow(trips_)= 5858483

#so around 544 trip duration values with zero or negative figures were removed, ->cleaning effort highlight. 

I continued to add other additional columns(layers) to the data such as "date", "day_of_week" and "month_year", that provided me with additional insights about the data:

# Add columns that list the date, month, day, and year of each ride

# This will allow us to aggregate ride data for each month, day, or year ... before completing these operations we could only aggregate at the ride level

trips_$date <- as.Date(trips_$started_at) #The default format is yyyy-mm-dd

trips_$month_year <- format(as.Date(trips_$started_at), " %Y-%m")

trips_$day_of_week <- format(as.Date(trips_$date), "%A")

   

# Inspect the structure of the columns

str(trips_) 

After cleaning the data, I then convert the clean data to csv format for analyses and visualization within Tableau:

# Specify the file path and name for the CSV file

 csv_file <- "trips_data.csv"

 # Write the data frame to a CSV file

write.csv(trips_, file = csv_file, row.names = FALSE)

# Print a message to confirm the CSV file creation

 cat("CSV file created:", csv_file, "\n")

4. Analyse & Share (Load)

Please note that for this section, all the displayed (dashboard) visuals are interactive. Click on them to open the website for a fully interactive experience. 

Throughout my breakdown of each of the above visuals, I will mainly compare Casual riders Versus (VS) Member riders in attempt to answer the questions established on the first phase. In doing this I will make commentary on the insights each respective visual unveils:

The above visual illustrates that the cumulative number of bike trips taken by Casual riders during our period of interest (May 2022 to April 2023) is approximately 2.3 million. This number is less than the 3.5 million number of trips taken by member riders. So, member bike riders exceed casual riders by 1.1 million in terms of the number of bike trips taken for the period measured. 

However, l also explored the trip length visuals for the two kinds of riders:

The above indicates that the Member bike riders have cumulatively travelled less hours compared to the Casual bike riders, as the Member bike riders cumulatively ridden -/+ 730 thousand hours. Whereas the Casual bike riders ridden a total of around 1.1 million hours.


In order for us to get a clear understanding of which type of bike rider is spending more time riding on the bike trips, I found the average trip length in hours per ride, of the two riders:

So, on average Casual bikers are taking more than two times longer, in their bike trips compared to Member bike riders. This is because:

 

Now let's examine the performance of Casual riders VS Member riders in terms of both the quantity and length of their trips per month, and then per weekday.

As we can see from above, both the number of trips taken and trip length of the two types of riders each month has a generally cyclic pattern and peaks during summer (2022-06 to 2022-09) and goes to a minimum during winter (2022-11 to 2023-04).

The above observations might be because during summer, the weather is usually warmer and daylight hours are longer. These factors make biking a more appealing option for recreational activities such as bike tour excursions. In contrast, colder temperatures, and shorter daylight hours in winter, discourage people from riding bikes more often, leading to reduced usage.

The visual above, displays that the number of casual cyclists peak on weekends that is Saturday (467 869 trips) and Sunday (388 777 trips). However, for the member cyclists the number of trips taken peaks on Thursday (560 836 trips) and Wednesday (556 880 trips), followed closely by Tuesday.

Similarly, from the above graph we can see from that the trip length of casual bike riders also peaks on weekends. However, for the member riders just like the number of trips, the trip length also peaks on Thursday (113 118 Hours) and Wednesday (110 373 Hours), followed closely by other days. It is clearly noticeable that the member riders are more consistent throughout the week in terms of both trip length and number of trips taken than casual bike riders.

A possible explanation of the observation in the data, is that the annual Member bike riders are using their bicycles to commute daily to their workplaces/errands. Hence the number of trips and their lengths is appearing to be more consistent as displayed on the above two visuals. 

Whereas the Casual cyclists are using their bicycles to go on excursions or recreational trips, which are typically done on weekends when most people don’t have work. This clearly justifies why high number of Trips and trip length generally occur on weekends for Casual riders.

The visuals above have answered a key question formulated on the Ask phase of   "How do annual members and casual riders use Cyclistic bikes differently?". Therefore, we can then proceed to answer the other questions via the Act phase (Conclusions and Recommendations) as we propose to Cyclistic, a way to ultimately boost their revenue through converting casual riders to member riders.

5.Act 


Conclusions 

The total number of bike trips taken by casual bike riders is significantly lessor than that of member bike riders.

However, on the flipside, average length in hours taken by casual bike riders is at least 2 times more than that of member bike riders.

The number and length of bike trips taken by casual bike riders are both higher during the summer and on weekends

Member bike riders’ number of bike trips and their duration is also higher, during summer but relatively stable throughout the week.


Recommendations


The data has shown that the average trip time per ride for casual cyclists is significantly more than that of member cyclists. So, I recommend that the company introduces rewards for being an annual member rider, with specific target on cyclists who take longer trips. This should be matched by escalating fees that disincentivizes riders from taking longer trips without becoming a member. Below are some of the steps to implement this:

Discounted fees for longer trips, this can be matched with    escalating fees for longer trips to non-member riders. 

Premium bikes tailor designed for longer distances in terms of both comfort and efficiency, should be made available only member riders.

A gamified riding experience can be a huge source of making the value and rewards for taking longer rides more interesting by:

·Establishing an online community where riders can interact, share achievements, and compete with each other in the game.

·Offering virtual game trophies and incentives to riders who exceed certain trip lengths or mileage.

·The preceding point should be matched by accelerated reward system for members taking longer trips 

The answer to the second question is well embedded in the above recommendations. This is due to the fact that casual riders on average take longer bike trips, so they would want to switch to being a member in order to experience discounts, premium bikes, and an accelerated progression in a gamified bike riding experience, as outlined above.

The recommendations below, will conclude this analysis by answering the question " How can Cyclistic use digital media to influence casual riders to become members?". They will do so by also leveraging insights deduced from the data about both peak weekdays and month via the following:

Create special discounted membership rates specifically targeted to casual riders during the summer season. This discount should make it more cost-effective and convenient for frequent summer rides.

Introduce weekend-specific discounts that cater to casual riders who primarily use the service on weekends. This can include reduced membership fees or bonus ride credits exclusively available for weekend rides. 

Advertise the perks of being an annual member during high-demand weekends, such as guaranteed bike availability and faster check-out processes, as you can book and reserve for your preferred bike type in advance

Finally, advertise all of the above perks and discounts offered to members on social media mainly just before and during summer as the weather would be conducive for people to highly consider biking and other out-door activities.


"Ignite Growth and Propel Your Business with Actionable Insights!"

Kindly contact me via email for inquiries.