Bus ride times¶

In summer 2024 I got a job downtown and started to take the bus. I was curious about the average ride times and other information, so I began logging the time I enter and exit the bus to a Google spreadsheet via a Google form. The bus ride is approximately 6.25 miles one-way.

The code below does some manipulation to that data and produces a few graphs. Generally, the main conclusion is that morning ride times (to work) are a little shorter than afternoon ride times (from work).

Get data¶

Either download data from the google sheet or use cached copy in data.csv.

In [1]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb
from scipy import stats

warnings.filterwarnings("ignore")

sb.set_theme()

DOWNLOAD_FRESH_DATA = True
data_file = "data.csv"

if DOWNLOAD_FRESH_DATA:
    from pathlib import Path

    import requests
    url = Path("url.txt").read_text()
    print("downloading fresh data")
    resp = requests.get(url)
    resp.raise_for_status()
    Path(data_file).write_text(resp.text)


df = pd.read_csv(data_file)

display(df.head())
display(df.describe())
downloading fresh data
Timestamp Event
0 7/16/2024 8:01:00 Enter
1 7/16/2024 8:33:00 Exit
2 7/16/2024 19:33:00 Enter
3 7/16/2024 20:05:00 Exit
4 7/17/2024 17:13:29 Enter
Timestamp Event
count 716 716
unique 716 2
top 6/19/2025 20:19:01 Enter
freq 1 358

Destructure time¶

Extract meaningful variables from the logged time.

In [2]:
df.Timestamp = pd.to_datetime(df.Timestamp)
df["morning"] = df.Timestamp.dt.hour < 12
df["hour"] = df.Timestamp.dt.hour + df.Timestamp.dt.minute/60 + df.Timestamp.dt.second/3600
df["day"] = df.Timestamp.dt.date
df.head()
Out[2]:
Timestamp Event morning hour day
0 2024-07-16 08:01:00 Enter True 8.016667 2024-07-16
1 2024-07-16 08:33:00 Exit True 8.550000 2024-07-16
2 2024-07-16 19:33:00 Enter False 19.550000 2024-07-16
3 2024-07-16 20:05:00 Exit False 20.083333 2024-07-16
4 2024-07-17 17:13:29 Enter False 17.224722 2024-07-17
In [3]:
# ensure no more errant spaces break the notebook
df.Event = df.Event.str.strip()

See some basic summary stats for each of the 4 combinations of morning and Event corresponding to when I get on or off the bus on the way to or from work.

In [4]:
from itertools import product


g = df.set_index(["morning", "Event"])
for index in product([True, False], ["Enter", "Exit"]):
    display(index)
    display(g.loc[index].hour.describe())
(True, 'Enter')
count    186.000000
mean       7.124108
std        0.541980
min        5.168333
25%        6.940417
50%        7.121806
75%        7.287847
max        9.043056
Name: hour, dtype: float64
(True, 'Exit')
count    186.000000
mean       7.645827
std        0.567747
min        5.620556
25%        7.467847
50%        7.655833
75%        7.924722
max        9.507222
Name: hour, dtype: float64
(False, 'Enter')
count    172.000000
mean      17.367600
std        1.098670
min       13.498889
25%       16.648264
50%       17.422500
75%       17.894861
max       20.247222
Name: hour, dtype: float64
(False, 'Exit')
count    172.000000
mean      17.943891
std        1.070946
min       14.117500
25%       17.225347
50%       17.977361
75%       18.466042
max       20.777778
Name: hour, dtype: float64

Transform¶

Reshape data to a more useful form where each day will have 1 or 2 rows representing each ride instead of each event. The primary variables are now:

  • ridetime : bus trip length in minutes
  • Direction : pretty-formatted label for rides to or from work
In [5]:
# reshape data
trip = df.pivot_table(index=["day", "morning"], columns="Event").reset_index()
# flatten columns
trip.columns = ["_".join(c) if c[1] else c[0] for c in trip.columns.to_flat_index()]
# calculate new information
trip["ridetime"] = (trip.hour_Exit - trip.hour_Enter) * 60.0
trip["Direction"] = trip.morning.map({True: "To work", False: "From work"})
# preview and save copy
display(trip.head())
trip.to_csv("trip.csv", index=False)
day morning Timestamp_Enter Timestamp_Exit hour_Enter hour_Exit ridetime Direction
0 2024-07-16 False 2024-07-16 19:33:00 2024-07-16 20:05:00 19.550000 20.083333 32.000000 From work
1 2024-07-16 True 2024-07-16 08:01:00 2024-07-16 08:33:00 8.016667 8.550000 32.000000 To work
2 2024-07-17 False 2024-07-17 17:13:29 2024-07-17 17:49:28 17.224722 17.824444 35.983333 From work
3 2024-07-18 False 2024-07-18 17:30:37 2024-07-18 18:06:03 17.510278 18.100833 35.433333 From work
4 2024-07-18 True 2024-07-18 07:25:46 2024-07-18 07:56:16 7.429444 7.937778 30.500000 To work
In [6]:
display(trip[trip.morning == True].ridetime.describe())
display(trip[trip.morning == False].ridetime.describe())
count    186.000000
mean      31.303136
std        3.523828
min       23.983333
25%       29.300000
50%       31.066667
75%       32.541667
max       51.066667
Name: ridetime, dtype: float64
count    172.000000
mean      34.577422
std        3.954678
min       25.466667
25%       31.704167
50%       33.983333
75%       36.737500
max       45.216667
Name: ridetime, dtype: float64

Plots¶

Distribution of all ride times¶

In [7]:
ax: plt.Axes = sb.histplot(data=trip, x="ridetime")
ax.set_xlabel("Ride length (min)")
ax.figure.tight_layout()
No description has been provided for this image

Distribution of entry time by direction¶

Clearly, I have a more regular schedule in the morning.

In [8]:
ax: plt.Axes = sb.histplot(data=trip, x="hour_Enter", hue="Direction", binwidth=0.5)
ax.set_xlabel("Enter time, 30 min bins")
ax.figure.tight_layout()
No description has been provided for this image

Box plot of ride times by direction¶

In [9]:
ax: plt.Axes = sb.boxplot(data=trip, x="ridetime", hue="Direction")
ax.figure.set_size_inches(w=10, h=4)
ax.set_xlabel("Ride length (min)")
ax.figure.tight_layout()
No description has been provided for this image

Time series of ride times by direction¶

I wanted to see if there were any obvious temporal trends or spikes.

In [10]:
ax: plt.Axes = sb.lineplot(data=trip, x="Timestamp_Enter", y="ridetime", hue="Direction")
ax.figure.set_size_inches(w=14, h=4)
ax.set_xlabel("Date")
ax.set_ylabel("Ride length (min)")
ax.figure.tight_layout()
No description has been provided for this image

T-Test of mean directional ride times¶

The summary stats above show that directional ride times are different only by about 4 minutes. This is not really a noticable difference to a person on the bus, but is the difference statistically significant?

$$\begin{align*} H_0&: \mu_{to} = \mu_{from} \\ H_a&: \mu_{to} \neq \mu_{from} \\ \alpha&: 0.01 \end{align*}$$

In [11]:
ride_to_work = trip[trip.morning == True].ridetime
ride_from_work = trip[trip.morning == False].ridetime
if ride_to_work.hasnans or ride_from_work.hasnans:
    print("NaNs found in ridetimes")

n = min(len(ride_to_work), len(ride_from_work))
print(f"{n=}")

result = stats.ttest_ind(ride_to_work.to_list()[:n], ride_from_work.to_list()[:n], equal_var=False)
print(result)
print(result.confidence_interval())

alpha = 0.01  # 99%
if result.pvalue < alpha:
    print("Reject H0 in favor of HA that average to-work and from-work ride times are not equal.")
else:
    print("Do not reject H0 that average to-work and from-work ride times are equal.")
n=172
TtestResult(statistic=np.float64(-7.923175942018825), pvalue=np.float64(3.375544224704507e-14), df=np.float64(338.3723532885828))
ConfidenceInterval(low=np.float64(-4.014870168078449), high=np.float64(-2.417881769906071))
Reject H0 in favor of HA that average to-work and from-work ride times are not equal.

Conclusion¶

It appears that to-work and from-work ride times are different at a 99% confidence level.