In processing(transforming) the data from dirty to clean I started by removing columns that I deemed as not useful, for answering the questions asked in the 1st phase, through the following script:
#===============================================
# STEP 3: CLEAN AND ADD MORE LAYERS TO THE DATA
#===============================================
# Remove ride_id, rideable_type,start_station_id, end_station_id
trips_ <- trips_%>%
select(-c(ride_id, rideable_type,start_station_id, end_station_id))
Then I Inspected the new table columns that remained after removing those columns:
str(trips_)#See list of columns and data types (numeric, character, etc)
summary(trips_) #Statistical summary of data. Mainly numerical
Upon inspection I saw it fit to rename the "member_casual" column to "rider_type" as its a more precisely descriptive column name:
colnames(trips_)[colnames(trips_) == "member_casual"] <- "rider_type"
names(trips_) #Inspect to see if indeed the member_casual column is changed to rider_type
I also removed duplicate rows:
nrow(trips_)
[1] 5859061
trips_<- distinct(trips_) #Remove duplicate rows.
nrow(trips_)
[1] 5859027
#The above is showing that 21 rows (5859061-5859027) around were removed, <-cleaning effort highlight.
After the above, I did addition of insightful layers(columns) to the data and one of the main layers I dimmed necessary to add was a column for length of trip and I named it "trip_length”:trips_$trip_length <- difftime(trips_$ended_at, trips_$started_at, units = "hours")
trips_<- arrange(trips_, trip_length)
#Inspect the new added column for any anomalies.
summary(trips_$trip_length)
#I found that I am not able to obtain statistical summaries from the new added column then that means we need to convert it to format which we can be able to obtain statistical summaries from
Converted "trip_length" from Factor to numeric so we can run statistical summaries and calculations on the data:is.factor(trips_$ride_length)
trips_$trip_length <- as.numeric(as.character(trips_$trip_length))
is.numeric(trips_$trip_length)
#The above step is very paramount otherwise you won’t be able to find the statistical attributes from the column data like mean,min,max and Q3.
summary(trips_$trip_length)
Now, from the above code I was able to obtain statistical summaries and also found anomalies in the data such as a minimum value of -172 hours of ride length ,this obviously calls for further inspection of the data:negative_values <- trips_ [trips_$ trip_length <= 0, ]
nrow(negative_values) #found out that there are 544 negative values or values with zero
print(negative_values$trip_length)
summary(negative_values$trip_length)
# Remove "bad" data
nrow(trips_)= 5859027
trips_ <- trips_[!(trips_$trip_length <= 0)]
nrow(trips_)= 5858483
#so around 544 trip duration values with zero or negative figures were removed, ->cleaning effort highlight.
I continued to add other additional columns(layers) to the data such as "date", "day_of_week" and "month_year", that provided me with additional insights about the data:# Add columns that list the date, month, day, and year of each ride
# This will allow us to aggregate ride data for each month, day, or year ... before completing these operations we could only aggregate at the ride level
trips_$date <- as.Date(trips_$started_at) #The default format is yyyy-mm-dd
trips_$month_year <- format(as.Date(trips_$started_at), " %Y-%m")
trips_$day_of_week <- format(as.Date(trips_$date), "%A")
# Inspect the structure of the columns
str(trips_)
After cleaning the data, I then convert the clean data to csv format for analyses and visualization within Tableau:
# Specify the file path and name for the CSV file
csv_file <- "trips_data.csv"
# Write the data frame to a CSV file
write.csv(trips_, file = csv_file, row.names = FALSE)
# Print a message to confirm the CSV file creation
cat("CSV file created:", csv_file, "\n")