Crafting an Artificial Dataset
In this post, I explain how I simulated vehicle records for the Mopac Express Lane in Austin, Texas.
Motivation
I am currently working towards the release of my second R package, sift. This package is essentially an amalgam of various ideas I’ve had over the past few years. Since grouping these ideas into one package was very important to me, I needed a common theme to tie all of the functions together.
The dataset created in this post, express, serves as a realistic case study where one can effectively utilize the full capability of sift.
library(tidyverse)
library(lubridate)
library(jsonlite)
options(readr.default_locale = locale(tz = "US/Central"))
theme_set(theme_minimal())
Step 1: Reconnaissance
How frequent are various vehicle types on Mopac? Lots of F-150s? Toyota Camrys?
By no means does the intended application of the dataset require 100% accuracy. However, achieving something somewhat realistic was still important to me. After coming up dry on Google, I decided to walk out my front door and get the answer from the primary source.
Discreetly nestled in a grove overlooking Mopac, I aimed my camera at rush hour traffic and let the shutter snap for 2.5 minutes. After repeating this over the course of 1 week, I had collected over 600 frames.
These photos present a clear opportunity to apply an Image Recognition ML algorithm - but I’m saving that for a future post. 😉
The resulting dataset is named rush_hour.
rush_hour <- read_csv("https://raw.githubusercontent.com/sccmckenzie/mopac/master/inst/extdata/rush_hour.csv")
# also available in mopac::rush_hour
rush_hour %>%
sample_n(5)
rush_hour

| day | time¹ | commercial | color | type | make | model |
|-----|----------|------------|-------|-------|---------|---------|
| Sun | 17:29:15 | FALSE | Grey | SUV | Lexus | GX |
| Sat | 15:06:09 | TRUE | White | Truck | GMC | Sierra |
| Thu | 18:49:33 | TRUE | Blue | Semi | Volvo | VNL 670 |
| Mon | 18:24:21 | FALSE | Black | Truck | Ford | F-250 |
| Mon | 18:25:37 | FALSE | White | Sedan | Hyundai | NA |

¹ <dttm> converted to <hms> for display
If you live in Texas, the below results shouldn’t come as a surprise.
rush_hour %>%
drop_na(make, model) %>%
# ^ drop NAs before uniting: unite() would otherwise turn NA into the string "NA"
unite(make_model, make, model, sep = " ") %>%
count(make_model) %>%
slice_max(order_by = n, n = 10) %>%
ggplot(aes(n, fct_reorder(make_model, n))) +
geom_col()
Although rush_hour satisfies the need for realistic vehicle make/model frequencies, it cannot serve as a basis for inferring express lane traffic density since observations were obtained from the Mopac mainlanes.
To avoid further speculation, I made another trip to acquire express lane timestamps (collected during afternoon rush hour).
express_counts <- read_csv("https://raw.githubusercontent.com/sccmckenzie/mopac/master/inst/extdata/express_counts.csv")
glimpse(express_counts)
## Rows: 250
## Columns: 1
## $ time <dttm> 2021-03-04 17:57:16, 2021-03-04 17:57:18, 2021-03-04 17:57:22, 2~
Step 2: Defining Scope
Mopac contains an express lane stretching 11 miles from downtown Austin to Parmer Lane. There is an intermediate access point near RM 2222, boxed in pink below.
- Our mock dataset, express, will feature vehicle descriptions + timestamps as if captured at the RM 2222 checkpoint.
- We’ll obtain peak traffic distribution by bootstrapping express_counts, then adjusting based on City of Austin data.
- Vehicle make/model/color frequencies will be inferred from rush_hour.¹
Step 3: Traffic Distribution
First, we extract timestamp spacing from express_counts.
set.seed(10)
t_delta <- express_counts %>%
mutate(t_delta = time_length(time - lag(time))) %>%
drop_na(t_delta) %>%
mutate(t_delta = t_delta + 2 * rbeta(n(), 2, 3)) %>%
# ^ we add some jitter into timestamps
# (observations were recorded with 1 sec resolution)
pull(t_delta)
t_delta %>%
qplot(binwidth = 0.25) # histogram
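A quick aside on the jitter term (my addition, not from the original post): Beta(2, 3) has mean 2/5, so 2 * rbeta(n(), 2, 3) lies in (0, 2) and averages 0.8 seconds - just enough to break up the 1-second quantization of the recorded timestamps.
set.seed(1)
# sanity check: range and mean of the jitter term
jit <- 2 * rbeta(1e5, 2, 3)
range(jit) # within (0, 2)
mean(jit)  # ~0.8 seconds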
To simulate timestamps for express, we perform bootstrap sampling from the above distribution.
# set timeframe (5am - 8pm)
t1 <- 5
t2 <- 20
total_seconds <- (t2 - t1) * 3600
set.seed(20)
express <- tibble(direction = c("North", "South")) %>%
rowwise(direction) %>%
summarize(vehicle_spacing = sample(t_delta, size = total_seconds, replace = TRUE)) %>%
# ^ generate temporal vehicle spacing
transmute(time = make_datetime(2020, 5, 20, t1, tz = "US/Central") + cumsum(vehicle_spacing)) %>%
# ^ add temporal vehicle spacing together
filter(time < make_datetime(2020, 5, 20, t2, tz = "US/Central"))
# ^ cut off timestamps later than 8pm
express %>%
group_by(direction,
t15 = floor_date(time, unit = "15 minutes")) %>%
summarize(volume = n()) %>%
ggplot(aes(t15, volume)) +
geom_line(aes(color = direction))
While we have achieved some randomness, the consistent baseline around 150 vehicles per 15-minute window is unrealistic. In reality, hardly anyone uses the express lane at 6:00am.
Luckily, I managed to find a table containing real traffic counts from the City of Austin Open Data Portal. The low-level details here aren’t super exciting - I’ve only included the code for the sake of reproducibility. Essentially, I am projecting the real traffic density profile onto the above baseline distribution.
steck <- jsonlite::fromJSON('https://data.austintexas.gov/resource/sh59-i6y9.json?atd_device_id=6409&year=2020&month=5&day=20&heavy_vehicle=false') %>%
as_tibble() %>%
janitor::clean_names() %>%
filter(direction == "SOUTHBOUND") %>%
transmute(read_date = as_datetime(read_date, tz = "US/Central"),
direction,
movement,
volume = as.integer(volume)) %>%
with_groups(read_date,
summarize,
volume = sum(volume)) %>%
transmute(id = row_number(),
read_date,
volume = volume/max(volume))
# join Steck volume with express timestamps
set.seed(25)
express <- express %>%
mutate(id = findInterval(time, steck$read_date)) %>%
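# ^ findInterval() assigns each timestamp to the Steck reading interval containing it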
left_join(steck, by = "id") %>%
rowwise() %>%
mutate(keep = sample(c(FALSE, TRUE), prob = c(1 - volume, volume), size = 1)) %>%
# ^ treat volume as probability of keeping row in express
ungroup() %>%
filter(keep) %>%
select(direction, time) %>%
arrange_all()
express %>%
group_by(direction,
t15 = floor_date(time, unit = "15 minutes")) %>%
summarize(volume = n()) %>%
ggplot(aes(t15, volume)) +
geom_line(aes(color = direction))
Voila! The above distribution is much more believable: starting around 6am, traffic increases, levels out, and finally recedes around 6pm.
Step 4: Trip Modeling
We’ve arrived at the main course.
Though we could arbitrarily assign a unique vehicle to each row in express, this wouldn’t be realistic. Consider the below scenarios:
- Vehicle uses Express Lane once in one direction.
- Vehicle uses Express Lane once in both directions (e.g. commuting to & from work).
- Vehicle uses Express Lane thrice in any combination of directions (e.g. rideshare).
Before we jump in, let’s create a helper to extract hour of day as a decimal. This will improve readability of our downstream code.
hourday <- function(t) {
time_length(t - make_datetime(2020, 5, 20, tz = "US/Central"), unit = "hours")
}
# example
hourday(as_datetime("2020-05-20 12:30:00", tz = "US/Central"))
## [1] 12.5
For Scenario 1, each vehicle corresponds to one timestamp. We’ll obtain 5000 observations using a single unweighted sample.
set.seed(254)
scenario_1 <- express %>%
sample_n(size = 5000) %>%
mutate(v_id = row_number())
scenario_1 %>%
ggplot(aes(time)) +
geom_histogram(aes(fill = direction), binwidth = 900)
Note the above distribution contains roughly the same number of north & south observations.
For Scenario 2, we’ll need to enforce exact equality between directions so that each north timestamp has a corresponding south timestamp.
Each vehicle will contain 2 timestamps:
- Timestamp A (driving to work)
- Timestamp B (driving home)
We sample A timestamps weighted by \(\mathcal{N}(\mu = 7\textrm{am}, \sigma = 2\textrm{hr})\).
set.seed(400)
scenario_2A <- express %>%
anti_join(scenario_1) %>%
# ^ exclude observations already sampled into scenario_1
group_by(direction) %>%
# ^ need equal amounts of north & south samples
sample_n(size = 1000, weight = dnorm(hourday(time), mean = 7, sd = 2)) %>%
mutate(id = row_number()) %>%
ungroup()
We’ll assume most trips take ~10 hours (a 9-hour workday + a 30-minute commute each way). This implies a B timestamp distribution of \(\mathcal{N}(\mu = 5\textrm{pm}, \sigma = 2\textrm{hr})\).
scenario_2B <- express %>%
anti_join(scenario_1) %>%
anti_join(scenario_2A) %>%
group_by(direction) %>%
sample_n(size = 1000, weight = dnorm(hourday(time), mean = 17, sd = 2)) %>%
mutate(id = row_number()) %>%
ungroup()
bind_rows(
mutate(scenario_2A, grp = "A"),
mutate(scenario_2B, grp = "B")
) %>%
ggplot(aes(time)) +
geom_histogram(aes(fill = grp))
Notice there is some overlap between the A & B distributions. This is intentional, as there will inevitably be some drivers who start their commute in the afternoon.
Currently, there are no unique vehicles tying observations together between scenario_2A & scenario_2B. Timestamp pairing is deceptively challenging if we wish to avoid inefficient for-loop structures (a naive loop-based alternative is sketched after this list for contrast). After some trial-and-error, I found the below procedure to work quite nicely:
- Randomly generate multiple sets - each containing 1000 pairs of observations (only pairing opposite directions together).
# Example of one 'set' for North -> South
# Within one set, each observation occurs once
# North[i] is 'paired' with South[i]
set.seed(451)
North <- sample(filter(scenario_2A, direction == "North")$time)
South <- sample(filter(scenario_2B, direction == "South")$time)
- Eliminate sets that contain impossible pairs (e.g. negative or implausibly short trip lengths).
- Examine the trip length distributions and select the set that appears most normal.
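As promised, here is roughly what the loop-based alternative would look like. This sketch is my own illustration (not part of the original pipeline): it greedily matches each northbound A timestamp to a random unused southbound B timestamp at least 30 minutes later. Besides being O(n²), it gives no control over the shape of the resulting trip length distribution - which is exactly what the set-based procedure provides.
# naive greedy pairing - for illustration only
north_A <- sort(filter(scenario_2A, direction == "North")$time)
south_B <- filter(scenario_2B, direction == "South")$time
used <- rep(FALSE, length(south_B))
pairs <- vector("list", length(north_A))
for (i in seq_along(north_A)) {
  # candidate partners: unused B timestamps at least 30 minutes later
  ok <- which(!used & south_B > north_A[i] + 30 * 60)
  if (length(ok) == 0) next
  j <- ok[sample.int(length(ok), 1)]
  used[j] <- TRUE
  pairs[[i]] <- tibble(North = north_A[i], South = south_B[j])
}
naive_trips <- bind_rows(pairs)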
The full implementation of this procedure is shown below.
set.seed(90)
# I arbitrarily set the number of repetitions to 100,
# which is ultimately more than enough to achieve the desired result
scenario_2_sim <- bind_rows(
map_dfr(1:100, ~ {
tibble(i = ..1,
direction = "NS",
North = filter(scenario_2A, direction == "North")$time %>% sample(),
South = filter(scenario_2B, direction == "South")$time %>% sample())
}),
# now we flip the directions
map_dfr(1:100, ~ {
tibble(i = ..1,
direction = "SN",
South = filter(scenario_2A, direction == "South")$time %>% sample(),
North = filter(scenario_2B, direction == "North")$time %>% sample())
})
) %>%
mutate(l = if_else(direction == "NS",
time_length(South - North, unit = "hours"),
time_length(North - South, unit = "hours"))) %>%
with_groups(c(direction, i), filter, !any(l < 0.5))
scenario_2_sim %>%
ggplot(aes(l)) +
geom_density(aes(color = factor(i))) +
facet_wrap(~ direction, ncol = 1)
Most of these distributions appear normal. We’ll pick the one with the highest Shapiro-Wilk normality test statistic.
library(broom)
scenario_2 <- scenario_2_sim %>%
group_by(direction, i) %>%
summarize(vec = list(l)) %>%
rowwise() %>%
mutate(shapiro.test(vec) %>% tidy()) %>%
filter(p.value > 0.1) %>%
# ^ keep only sets where normality is not rejected
group_by(direction) %>%
slice_max(order_by = statistic) %>%
semi_join(scenario_2_sim, .) %>%
transmute(North, South, v_id = max(scenario_1$v_id) + row_number()) %>%
pivot_longer(North:South, names_to = "direction", values_to = "time")
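Before moving on, a quick sanity check (my addition): every Scenario 2 vehicle should now have exactly one timestamp in each direction.
scenario_2 %>%
group_by(v_id) %>%
summarize(n_directions = n_distinct(direction), n_trips = n()) %>%
count(n_directions, n_trips)
# expect a single row: n_directions = 2, n_trips = 2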
For Scenario 3, we won’t impose any distributional or directional constraints on timestamp selection, simplifying the procedure considerably.
Since the average driver is unlikely to use the express lane more than twice a day, we will assign only 100 vehicles to Scenario 3.
First, we shuffle the remaining rows in express and assign 3 timestamps per vehicle.
set.seed(100)
scenario_3 <- express %>%
anti_join(scenario_1) %>%
anti_join(scenario_2) %>%
sample_n(size = n()) %>%
mutate(v_id = (row_number() - 1) %% 3 == 0,
v_id = cumsum(v_id) + max(scenario_2$v_id)) %>%
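# ^ flag every 3rd row, then cumsum allots 3 consecutive timestamps per vehicle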
arrange(v_id, time)
scenario_3 %>%
slice_head(n = 9)
| direction | time | v_id |
|-----------|---------------------|------|
| North | 2020-05-20 11:19:17 | 7001 |
| South | 2020-05-20 15:07:51 | 7001 |
| South | 2020-05-20 17:54:40 | 7001 |
| South | 2020-05-20 08:47:28 | 7002 |
| South | 2020-05-20 13:21:30 | 7002 |
| South | 2020-05-20 17:14:47 | 7002 |
| South | 2020-05-20 13:08:55 | 7003 |
| South | 2020-05-20 13:24:35 | 7003 |
| South | 2020-05-20 17:55:40 | 7003 |
All that remains for Scenario 3 is to exclude vehicles with unrealistically close timestamps. This time, we’ll use a lower threshold of 1 hour between consecutive trips.
scenario_3 <- scenario_3 %>%
group_by(v_id) %>%
mutate(delta = time_length(time - lag(time), unit = "hours")) %>%
filter(!any(delta < 1, na.rm = TRUE)) %>%
ungroup() %>%
slice_head(n = 300) %>%
# ^ 3 rows / vehicle * 100 vehicles = 300 rows
select(direction, time, v_id)
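As with Scenario 2, a quick check (my addition) confirms the bookkeeping: 100 vehicles with exactly 3 timestamps each.
scenario_3 %>%
count(v_id) %>%
count(n)
# expect a single row: n = 3, nn = 100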
There is also an Easter Egg I wish to plant: 2 vehicles that travel close together twice in one day. This is part of a fictional “bank robbery.”
We’ll sample one pair of timestamps near noon and another near 5:00pm (assuming the robbery takes place within that timeframe).
set.seed(99)
robbery_A <- express %>%
anti_join(scenario_1) %>%
anti_join(scenario_2) %>%
anti_join(scenario_3) %>%
sample_n(size = 2, weight = dnorm(hourday(time), mean = 12, sd = 0.1)) %>%
mutate(v_id = max(scenario_3$v_id) + row_number())
robbery_B <- express %>%
anti_join(scenario_1) %>%
anti_join(scenario_2) %>%
anti_join(scenario_3) %>%
sample_n(size = 2, weight = dnorm(hourday(time), mean = 17, sd = 0.1)) %>%
mutate(v_id = max(scenario_3$v_id) + row_number())
robbery <- bind_rows(robbery_A, robbery_B)
Finally, we’ll fold the remaining timestamps into Scenario 1, assigning each one its own vehicle.
express <- express %>%
anti_join(scenario_1) %>%
anti_join(scenario_2) %>%
anti_join(scenario_3) %>%
anti_join(robbery) %>%
mutate(v_id = max(robbery$v_id) + row_number()) %>%
bind_rows(scenario_1, scenario_2, scenario_3, robbery) %>%
arrange(direction, time)
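Since the chain of anti_joins partitions express into disjoint pieces, two things should hold after reassembly: every row carries a v_id, and no (direction, time) pair appears more than once. A quick check (my addition):
stopifnot(!anyNA(express$v_id))
express %>%
count(direction, time) %>%
filter(n > 1)
# expect zero rows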
Step 5: Make/Model Assignment
We’ve successfully assigned a v_id to each timestamp - but v_id is only an integer. Now, we need to assign a vehicle make/model/color to each v_id.
It’s finally time to leverage rush_hour from Step 1.
For simplicity, we’ll assume the true frequency of each make/model does not depend upon the day of the week. Hence, we treat each day as an independent sample. The below code calculates the weight of each make/model within each sample, then averages the weights across samples.
vehicle_probs <- rush_hour %>%
drop_na() %>%
count(day, make, model) %>%
group_by(day) %>%
mutate(wt = n / sum(n)) %>%
group_by(make, model) %>%
summarize(wt_mean = mean(wt), .groups = "drop") %>%
mutate(wt = wt_mean / sum(wt_mean), .keep = "unused")
For colors, we’ll pool the entire dataset and compute color frequencies within each make/model.
color_probs <- rush_hour %>%
drop_na() %>%
count(make, model, color) %>%
group_by(make, model) %>%
mutate(wt = n / sum(n)) %>%
ungroup()
We use sample_n to assign make/model/color in a vectorized fashion. Here, dplyr allows us to seamlessly change grouping as we progress through the operation.
set.seed(98)
express <- express %>%
distinct(v_id) %>%
# make/model
bind_cols(sample_n(vehicle_probs,
size = nrow(.),
weight = wt,
replace = TRUE)) %>%
select(!wt) %>%
# color
full_join(color_probs) %>%
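# ^ full_join expands each vehicle to every color observed for its make/model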
group_by(v_id) %>%
sample_n(size = 1, weight = wt) %>%
select(!(n:wt)) %>%
inner_join(express, .) # join it all back with express
express %>%
sample_n(5)
express

| direction | time¹ | v_id | make | model | color |
|-----------|----------|-------|--------|------------|--------|
| North | 11:24:54 | 5126 | Kia | Santa Fe | White |
| South | 14:47:39 | 10309 | Nissan | Pathfinder | White |
| North | 12:06:57 | 7852 | Scion | xB | Yellow |
| South | 17:02:42 | 3597 | Mazda | CX-3 | Black |
| South | 15:46:51 | 5130 | Subaru | Forester | Grey |

¹ <dttm> converted to <hms> for display
Step 6: Plate Assignment
License plate statistics (e.g. which states supply the most frequent out-of-state plates) would certainly be worth investigating - but for now, I’m choosing to generate plates that solely adhere to Texas formatting: 3 letters + 4 digits. Again, the tidyverse saves us from expensive for-loop operations.
set.seed(97)
plate_letters <- crossing(L1 = LETTERS, L2 = LETTERS, L3 = LETTERS) %>%
mutate(st = str_c(L1, L2, L3, sep = "")) %>%
pull(st) %>%
sample(., n_distinct(express$v_id), replace = TRUE)
plate_numbers <- 0:9999
plate_numbers <- str_pad(plate_numbers, side = "left", pad = "0", width = 4) %>%
sample(., n_distinct(express$v_id), replace = TRUE)
plates <- str_c(plate_letters, plate_numbers, sep = "-")
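One caveat worth flagging (my note, not from the original post): because the letter and digit blocks are drawn independently with replacement, two vehicles could in principle receive the same plate. With roughly 10,000 vehicles drawn from 26³ × 10⁴ ≈ 176 million possible plates, a collision is very unlikely - but it’s cheap to verify.
sum(duplicated(plates)) # expect 0; if not, resample with a different seed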
At last, plate replaces the role of v_id.
express <- express %>%
distinct(v_id) %>%
mutate(plate = plates) %>%
inner_join(express, .) %>%
relocate(plate, .before = make) %>%
select(!v_id)
express %>%
sample_n(5)
express

| direction | time¹ | plate | make | model | color |
|-----------|----------|----------|-------|-----------|-------|
| North | 08:05:41 | PZD-7441 | Chevy | Avalanche | White |
| North | 06:16:58 | INR-7989 | Mazda | 6 | Black |
| South | 12:30:03 | IGV-1988 | Nissan | Murano | Black |
| North | 12:19:49 | UYD-5351 | Ford | Mustang | Black |
| North | 09:49:00 | WLM-5323 | Honda | Civic | White |

¹ <dttm> converted to <hms> for display
¹ Although public service vehicles like metro buses utilize the Express Lane, we will exclude them here.