Introduction
I live on a busy street. Many people who drive by are in a hurry, and speeding has been a chronic problem. However, it's been difficult to describe how bad the speeding is. Our only evidence has come from accident reports on our street. Therefore, I'd like to use some statistics to see if I can prove that people routinely speed down our residential street.
To do this, I'll set up an experiment to sample the speeds of cars driving past our house and do some hypothesis tests.
I'll test whether the average speed of a car driving by our house is greater than 35 mph (the speed limit is 30 mph).
Read the full analysis alongside the Python code used for the calculations on GitHub.
Experiment
- The population parameter we want to estimate is from an infinite population, so we won't have to make any corrections for finite populations.
- The speed of cars is a continuous random variable
- The population distribution is unknown
- The population standard deviation is unknown
When I first designed the experiment, I wanted to use the margin of error formula to determine the sample size needed for my desired margin of error. However, because I knew neither the population standard deviation nor whether the population was normally distributed, I would have had to use the t-distribution for this calculation. That creates a circular problem: the t-statistic depends on the degrees of freedom (DOF), which depend on the sample size I was trying to find. I also lacked a standard deviation, although a pilot sample could have supplied an estimate. Consequently, I was unable to calculate the sample size ahead of time.
But, if we were to assume:
- The population is normally distributed
- The population standard deviation is 5.07 (from our pilot sample)
Then we could use the z-statistic instead, choose a margin of error, say 1.5 mph, and calculate that we would need 44 samples to achieve that margin at 95% confidence.
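As a rough sketch of that calculation, the snippet below solves the z-based margin of error formula for the sample size. The function name `required_sample_size` is hypothetical, and the result only holds under the two assumptions listed above (a normal population with a known standard deviation of 5.07).

```python
import math
from statistics import NormalDist


def required_sample_size(std_dev, margin_of_error, confidence=0.95):
    """Smallest n whose z-based margin of error is at most margin_of_error."""
    # Two-sided critical value, roughly 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Solve margin = z * std_dev / sqrt(n) for n, then round up
    return math.ceil((z * std_dev / margin_of_error) ** 2)


n_needed = required_sample_size(std_dev=5.07, margin_of_error=1.5)
print(f'Samples needed: {n_needed}')
```

Rounding up matters here: rounding down would leave the margin of error slightly wider than the target.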
Making Observations
There are several options for measuring speed. However, I used a less accurate approach because this experiment was more about understanding statistical concepts than about precision. To calculate speed, I measured a stretch of road outside my house and timed how long it took cars to travel that distance. Then I converted the distance to miles and the time to hours and divided the distance by the time.
The distance I used in the experiment was 1,382 in. I came to this number by walking from the end of one lane-dividing line (a yellow dash in the middle of the roadway) to the start of a different lane-dividing line. I took 45 paces between these two lines. Next, I looked up how to estimate the length of a person's stride based on their height. I'm 6'2" (74 in.), and I read that one can estimate stride length by multiplying height by 0.415. I then multiplied my approximate stride length by the number of strides I took. This calculation gave me the distance in inches. Finally, I converted the value to miles.
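The arithmetic behind that distance can be sketched as below. The constant names are my own, and the 0.415 stride factor is the rule of thumb quoted above, not a measured value.

```python
HEIGHT_IN = 6 * 12 + 2      # 6'2" expressed in inches
STRIDE_FACTOR = 0.415       # rule-of-thumb stride length as a share of height
PACES = 45                  # paces counted between the two lane-dividing lines
INCHES_PER_MILE = 63_360    # 5,280 ft * 12 in

stride_in = HEIGHT_IN * STRIDE_FACTOR          # approximate stride length in inches
distance_in = stride_in * PACES                # approximate distance in inches
distance_miles = distance_in / INCHES_PER_MILE

print(f'Stride: {stride_in:.2f} in')
print(f'Distance: {distance_in:.0f} in ({distance_miles:.5f} miles)')
```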
After calculating the distance I sampled car speeds using my iPhone's stopwatch app. When a car reached the end of the first lane dividing line I started the timer, and when it reached the start of the other lane dividing line I stopped the timer. I did this for 31 cars.
I chose 31 cars because a sample size above 30 is a common rule-of-thumb minimum for using the t-distribution when the population distribution is unknown. Alternatively, I could have increased the accuracy by sampling 101 cars, because the t-distribution and the normal distribution are almost identical at 100 degrees of freedom.
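To see how close the two distributions get, here is a small sketch that compares two-sided 95% critical values. It assumes scipy is installed, and the variable names are illustrative only.

```python
from scipy import stats

# Two-sided 95% critical value from the standard normal distribution
z_crit = float(stats.norm.ppf(0.975))

# The same critical value from the t-distribution at 30 and 100 degrees of freedom
t_crits = {df: float(stats.t.ppf(0.975, df)) for df in (30, 100)}

print(f'Normal: {z_crit:.4f}')
for df, t_crit in t_crits.items():
    print(f't with {df} DOF: {t_crit:.4f} (difference {t_crit - z_crit:.4f})')
```

The gap shrinks as the degrees of freedom grow, which is why a larger sample gives a slightly narrower interval for the same standard deviation.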
Descriptive Statistics
Before calculating descriptive statistics on the data, I needed to convert the time values from seconds to hours. To do this, I divided each value by 3,600, because there are 60 seconds in a minute and 60 minutes in an hour.
Next, I divided the distance in miles by each converted time in hours to get the speed in MPH. Finally, we can run a few calculations and check out some stats from our sample:
- count: 31.000000
- mean: 32.431042
- std: 5.067071
- min: 24.461024
- 25%: 29.518754
- 50%: 31.661245
- 75%: 34.441197
- max: 53.780744
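The conversion described above can be sketched as follows. The function name `speed_mph` and the stopwatch readings in `example_times` are hypothetical and are not the recorded observations.

```python
INCHES_PER_MILE = 63_360
SECONDS_PER_HOUR = 3_600
DISTANCE_IN = 1_382  # measured stretch of road, in inches


def speed_mph(elapsed_seconds, distance_in=DISTANCE_IN):
    """Convert one stopwatch reading into a speed in miles per hour."""
    distance_miles = distance_in / INCHES_PER_MILE
    elapsed_hours = elapsed_seconds / SECONDS_PER_HOUR
    return distance_miles / elapsed_hours


# Hypothetical stopwatch readings in seconds, for illustration only
example_times = [2.1, 2.5, 3.0]
speeds = [speed_mph(t) for t in example_times]

for t, s in zip(example_times, speeds):
    print(f'{t:.1f} s -> {s:.1f} MPH')
```

Because the measured stretch is so short, a fraction of a second of reaction time changes the computed speed by several MPH, which is one reason this approach is less accurate.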
Confidence Interval
A confidence interval is a range built around a sample mean that we expect to contain the true population parameter at a stated level of confidence, typically expressed as a percentage.
For example, we might say we are 95% confident the true population average is between 34 and 37 MPH based on our sample mean. In other words, if we took a large number of samples and built a 95% confidence interval around each sample mean, we would expect about 95% of those intervals to contain the true population mean.
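A quick simulation can make that interpretation concrete. The sketch below is only an illustration: the population (normal, mean 32, standard deviation 5) is hypothetical, and the critical value 2.042 is the t-table value for 30 degrees of freedom.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

TRUE_MEAN = 32.0   # hypothetical population mean (MPH)
TRUE_STD = 5.0     # hypothetical population standard deviation (MPH)
N = 31             # sample size per simulated experiment
T_CRIT = 2.042     # t-table value, 97.5th percentile, 30 degrees of freedom
TRIALS = 2_000


def interval_contains_mean():
    """Draw one sample, build a 95% interval, and check whether it covers TRUE_MEAN."""
    sample = [random.gauss(TRUE_MEAN, TRUE_STD) for _ in range(N)]
    margin = T_CRIT * statistics.stdev(sample) / math.sqrt(N)
    return abs(statistics.mean(sample) - TRUE_MEAN) <= margin


coverage = sum(interval_contains_mean() for _ in range(TRIALS)) / TRIALS
print(f'Share of intervals containing the true mean: {coverage:.3f}')
```

The printed share should land close to 0.95, although it will vary a little with the seed and the number of trials.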
To get our 95% confidence interval around the mean, we need to calculate the margin of error, which is the t-statistic multiplied by the standard error of the mean (the sample standard deviation divided by the square root of the sample size). Using these values, I calculated the margin of error at ~1.86 MPH.
To get the confidence interval, I simply added and subtracted the margin of error from the sample mean. Therefore, we can say with 95% confidence that the true population mean for the speed of cars driving outside my house is between 31 and 34 MPH (after rounding to the nearest whole number).
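# Using scipy to calculate the confidence interval
# Assumes `from scipy import stats`, that sampled_times holds the speeds in MPH,
# and that dof(n) is a helper from the full analysis returning n - 1
CI = stats.t.interval(0.95, dof(number_of_samples), sampled_times.mean(), stats.sem(sampled_times))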
# Print the confidence interval
print(f'Confidence interval: {CI[0]} to {CI[1]}')
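For comparison, the same interval can be worked out by hand from the summary statistics. This is a minimal sketch that hard-codes the sample values reported earlier and takes the critical value 2.042 from a t-table for 30 degrees of freedom instead of computing it.

```python
import math

N = 31                    # number of cars sampled
SAMPLE_MEAN = 32.431042   # sample mean in MPH
SAMPLE_STD = 5.067071     # sample standard deviation in MPH
T_CRIT_975 = 2.042        # t-table value, 97.5th percentile, 30 degrees of freedom

standard_error = SAMPLE_STD / math.sqrt(N)
margin = T_CRIT_975 * standard_error
lower, upper = SAMPLE_MEAN - margin, SAMPLE_MEAN + margin

print(f'Margin of error: {margin:.2f} MPH')
print(f'95% confidence interval: {lower:.2f} to {upper:.2f} MPH')
```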
Hypothesis Test
Critical Values
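From the very beginning, I wanted to test the claim that cars average more than 35 MPH driving by my house despite the speed limit being 30 MPH. That claim is the alternative hypothesis, so we can write the hypotheses like this:
- Null: avg. speed of cars is less than or equal to 35 MPH.
- Alternative: avg. speed of cars is greater than 35 MPH.
The first way to test this claim is to compare the t-test statistic to the critical t-value for a one-tailed test at the 5% significance level.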
- T-Critical Value: 1.697
- T-Test Stat: -2.8228051488149637
t_crit_value = stats.t.ppf(0.95, dof(number_of_samples))
# Print the critical value
print(f'Critical value: {t_crit_value}')
# Critical value: 1.6972608865939571
sample_mean = sampled_times.mean()
t_test_stat = div(sub(sample_mean, 35), div(sample_std_dev, math.sqrt(number_of_samples)))
# Print the t-test statistic
print(f'T-test statistic: {t_test_stat}')
# T-test statistic: -2.8228051488149637
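For this right-tailed test, we would reject the null hypothesis only if the test statistic were greater than the critical value. Our test statistic is negative and far below it. Consequently, we fail to reject the null hypothesis.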
P-Value
The second way to test our hypothesis is to calculate and compare a p-value. A p-value is the probability of observing a result at least as extreme as the one observed if the null hypothesis is true.
Given our desire for 95% confidence, our alpha value is 0.05 (1 - 0.95). We would need to observe a p-value less than 0.05 to reject the null hypothesis. Unfortunately for us, the p-value is huge: 0.9958.
# Calculate the p-value
p_value = stats.t.sf(t_test_stat, dof(number_of_samples))
# Print the p-value
print(f'P-value: {p_value}')
# P-value: 0.9958139556652821
Because this p-value is far greater than 0.05, we once again fail to reject the null hypothesis.
Compare Sample Means
Early on in the analysis, we observed that the sample mean was approx. 32 MPH. The final way we can conduct the hypothesis test is to calculate what sample mean we would need to observe to reject the null and compare that to our observed sample mean.
We can do this by solving the t-test stat equation for the sample mean, using the critical value in place of the test statistic. After a bit of algebra, I found that we would need a sample mean of about 36.5 MPH to reject the null.
critical_sample_mean = add(35, mul(float(t_crit_value), div(sample_std_dev, math.sqrt(number_of_samples))))
# Print the critical sample mean
print(f'Critical sample mean: {critical_sample_mean}')
# Critical sample mean: 36.544631119869806
# Use scipy to calculate the critical sample mean
critical_sample_mean_scipy = stats.t.ppf(0.95, dof(number_of_samples), loc=35, scale=div(sample_std_dev, math.sqrt(number_of_samples)))
# Print the critical sample mean
print(f'Critical sample mean: {critical_sample_mean_scipy}')
# Critical sample mean: 36.544631119869806
Because our sample mean (about 32 MPH) is well below the sample mean needed to reject the null (about 36.5 MPH), we fail to reject the null hypothesis.
Rule Breakers
Although most people are not speeding down our street, some go quite fast. Using our observed data, we can build a normal distribution and find that the probability of a car going faster than 45 MPH down our street is less than 1% (about 0.66%), assuming the population is normally distributed with a true mean of ~32 MPH and a true standard deviation of ~5 MPH.
prob_greater_than_45 = 1 - stats.norm.cdf(45, sample_mean, scale=sample_std_dev)
# Print the probability
print(f'Probability of getting a speed greater than 45 mph: {prob_greater_than_45}')
# Probability of getting a speed greater than 45 mph: 0.00655959427893249
Conclusion
Despite what our eyes tell us, the sample gives no evidence that cars average more than 35 MPH, so by that measure we don't have a consistent speeding problem on our street. Still, just after I took these sample observations, construction started to modify an intersection just up the street as a "traffic calming" measure. In the future, when the construction is done (if ever), it will be fun to take another sample and test whether the population mean has changed.
Additionally, it would be interesting to perform a test to determine if the population is indeed normally distributed.
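One possible way to run such a check is the Shapiro-Wilk test from scipy, sketched below. This is only an assumption about how the follow-up might look: the data here are simulated stand-ins generated from a normal distribution, not the recorded speeds, and the test strictly assesses the sample, not the population.

```python
import random

from scipy import stats

random.seed(0)  # fixed seed so the run is repeatable

# Simulated stand-in for the 31 observed speeds (hypothetical, not real data)
simulated_speeds = [random.gauss(32, 5) for _ in range(31)]

# Shapiro-Wilk tests the null hypothesis that the data come from a normal distribution
result = stats.shapiro(simulated_speeds)
print(f'W statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3f}')
```

A p-value below 0.05 would suggest the data depart from normality, while a larger p-value only means the test found no evidence against it.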