The same five data-shape questions, asked of the same dataset, in
Ibex, Pandas, Polars,
DuckDB SQL, and R. No installation
beyond each language's own runtime — every snippet pulls the data
straight from a public HTTPS URL, so you can paste any block into a REPL
and see the result immediately.
January 2024 yellow-cab trips from the NYC Taxi & Limousine
Commission — about 3 million rows, 50 MB Parquet, served as
a public ZSTD-compressed file from cloudfront. The zone lookup is a
separate CSV from the same host.
Each task shows Ibex
first, then the Pandas
equivalent inline. The Polars, DuckDB, and R versions are tucked into
collapsible sections — click to expand. Every snippet is a leaf
expression, written for an interactive REPL where the trailing value
auto-prints (Ibex ;, Python's >>>,
the DuckDB CLI, R's >). All Ibex outputs shown below
are real — captured from running the same script against the
live URL.
Ibex setup. The Ibex snippets use
import "parquet" and import "csv", so the REPL
must be able to find the bundled plugins. After a release build, start
it with
./build-release/tools/ibex --plugin-path ./build-release/tools.
Without that plugin path, read_parquet and
read_csv are not registered.
Heads up — Pandas + CloudFront. CloudFront
occasionally returns HTTP 403 to Python's default
urllib user-agent after many requests in a row. If
pd.read_parquet(URL) fails with a 403, either download
the file once with curl -A 'Mozilla/5.0' … and
point pd.read_parquet at the local copy, or use
pyarrow.parquet.read_table(URL).to_pandas() which uses
Arrow's HTTP client. Polars (via object_store), DuckDB
(via its httpfs extension), and Ibex (via libcurl) don't
hit this.
Task 1
Trip count and average fare by pickup hour
How does the city's taxi load vary across the 24 hours of the day?
Ibex
import "parquet";
let trips = read_parquet(
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet");
trips[head 5];
trips[select { trips = count(), avg_fare = mean(fare_amount) },
by { hour = hour(tpep_pickup_datetime) },
order { hour asc }];
INSTALL httpfs; LOAD httpfs;
SELECT hour(tpep_pickup_datetime) AS hour,
count(*) AS trips,
avg(fare_amount) AS avg_fare
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
GROUP BY hour
ORDER BY hour;
R (arrow + dplyr)
# R's arrow reads naive Parquet timestamps as UTC and lubridate's hour()
# returns the hour in the system-local timezone — pin to UTC first so
# hour() matches the values stored in the file.
Sys.setenv(TZ = "UTC")
library(arrow); library(dplyr); library(lubridate)
trips <- read_parquet(
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet")
trips |>
mutate(hour = hour(tpep_pickup_datetime)) |>
group_by(hour) |>
summarise(trips = n(), avg_fare = mean(fare_amount)) |>
arrange(hour)
Task 2a
Top 10 pickup zones by revenue
Which pickup zones brought in the most fare revenue?
Ibex
let zones = trips[select { revenue = sum(total_amount), n = count() },
by PULocationID];
zones[order { revenue desc }, head 10];
SELECT PULocationID,
sum(total_amount) AS revenue,
count(*) AS n
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
GROUP BY PULocationID
ORDER BY revenue DESC
LIMIT 10;
Same idea, but with the twist every analyst eventually hits: top-N
inside each group. SQL needs a window function; Pandas pushes you
into groupby().apply(lambda x: x.nlargest(...)); Ibex says
head 3, by zone.
Ibex
let by_zh = trips[select { n = count() },
by { zone = PULocationID,
h = hour(tpep_pickup_datetime) }];
by_zh[order { n desc }, head 3, by zone];
WITH counted AS (
SELECT PULocationID AS zone,
hour(tpep_pickup_datetime) AS h,
count(*) AS n,
row_number() OVER (PARTITION BY PULocationID
ORDER BY count(*) DESC) AS rk
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
GROUP BY zone, h
)
SELECT zone, h, n FROM counted WHERE rk <= 3;
R (arrow + dplyr)
Sys.setenv(TZ = "UTC") # see Task 1 R note
trips |>
mutate(h = hour(tpep_pickup_datetime), zone = PULocationID) |>
count(zone, h, name = "n") |>
group_by(zone) |>
slice_max(n, n = 3, with_ties = FALSE)
Task 3
Rolling 1-hour average fare at JFK pickups
For every JFK pickup, what was the average fare and trip count in the
previous hour? This is the canonical streaming-window shape — Ibex
needs the table promoted to a TimeFrame, then a
window clause does the rest.
Ibex
let jfk = trips[filter PULocationID == 132];
let jfk_ts = as_timeframe(jfk, "tpep_pickup_datetime");
let smoothed = jfk_ts[window 1h,
update { rolling_avg_fare = rolling_mean(fare_amount),
rolling_trips = rolling_count() }];
smoothed[select { tpep_pickup_datetime, fare_amount,
rolling_avg_fare, rolling_trips },
head 10];
SELECT tpep_pickup_datetime, fare_amount,
avg(fare_amount) OVER w AS rolling_avg_fare,
count(*) OVER w AS rolling_trips
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
WHERE PULocationID = 132
WINDOW w AS (ORDER BY tpep_pickup_datetime
RANGE BETWEEN INTERVAL 1 HOUR PRECEDING AND CURRENT ROW)
ORDER BY tpep_pickup_datetime
LIMIT 10;
A naive mean(tip_amount / fare_amount) is a mess: zero-fare
rows give infinite ratios, refunds give negatives, and the occasional
5000% tip (probably a typo) swamps the mean. Strip those first, then
average per pickup hour.
Ibex
let clean = trips[filter fare_amount > 5.0
&& tip_amount >= 0.0
&& tip_amount < fare_amount];
clean[select { n = count(),
avg_tip_rate = mean(tip_amount / fare_amount) },
by { hour = hour(tpep_pickup_datetime) },
order { hour asc }];
SELECT hour(tpep_pickup_datetime) AS hour,
count(*) AS n,
avg(tip_amount / fare_amount) AS avg_tip_rate
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
WHERE fare_amount > 5
AND tip_amount >= 0
AND tip_amount < fare_amount
GROUP BY hour
ORDER BY hour;
Locations are encoded as integer zone IDs; borough names live in a
second file. Join the two and roll up to borough level.
Ibex
import "csv";
let zones = read_csv(
"https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv");
let joined = trips join zones on PULocationID == LocationID;
joined[select { revenue = sum(total_amount), n = count() },
by Borough,
order { revenue desc }];
# keep_default_na=False so the literal string "N/A" (a real Borough value)
# isn't silently turned into a NaN and dropped at groupby time.
zones = pd.read_csv(
"https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv",
keep_default_na=False)
joined = trips.merge(zones, left_on="PULocationID", right_on="LocationID",
how="inner")
(joined.groupby("Borough", as_index=False)
.agg(revenue=("total_amount", "sum"), n=("total_amount", "size"))
.sort_values("revenue", ascending=False))
SELECT z.Borough,
sum(t.total_amount) AS revenue,
count(*) AS n
FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet' t
JOIN read_csv_auto('https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv') z
ON t.PULocationID = z.LocationID
GROUP BY z.Borough
ORDER BY revenue DESC;
R (arrow + dplyr)
zones <- read_csv_arrow(
"https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv")
trips |>
inner_join(zones, by = c("PULocationID" = "LocationID")) |>
group_by(Borough) |>
summarise(revenue = sum(total_amount), n = n()) |>
arrange(desc(revenue))
Try it
Run any of this locally
Ibex needs only a build of the REPL and its parquet/csv plugins:
Paste any of the Ibex blocks above. Everything else (the Parquet file,
the zone lookup CSV) is fetched over HTTPS — no AWS credentials,
no environment setup, no Docker.
The other languages each take their normal install
(pip install pandas pyarrow polars duckdb,
install.packages(c("arrow", "dplyr", "lubridate", "slider")));
the snippets above are otherwise self-contained too.