OpenStreetMap (OSM) is an invaluable resource for all kinds of geographic applications and research. This collection of volunteered geographic information provides highly detailed and semantically diverse data. For a recent paper, for example, I used OSM to derive local-scale public green availability in a nationwide analysis.

At the start of the analysis, the question arose whether it was feasible at all to perform millions of spatial queries within an acceptable timeframe. After some testing, it became clear that the analysis I had in mind was feasible using a PostgreSQL/PostGIS database. This, however, came with the need for some tedious setup – at least for this use case.

Recently, I learned about DuckDB and found that it would have been a perfect fit for this kind of analysis without much setup. With the promise of being as easy as SQLite but fast for data analytics, I was immediately hooked. Here, I want to document how DuckDB can be used to analyze large amounts of OSM data with ease.

DuckDB

DuckDB is an in-process SQL OLAP database management system

For me as a geographic data scientist and programmer who uses SQLite in all kinds of places (e.g. as GPKG to store geodata), this basically means that DuckDB could help speed up analyses without the overhead of a more elaborate DBMS like PostgreSQL. Some promises from the website sound very enticing:

  • Simple and portable (In-process, serverless; C++11, no dependencies, single-file build; APIs for Python, R, Java, Julia, Swift, …)
  • Feature-rich (Transactions, persistence; Extensive SQL support; Direct Parquet, CSV, and JSON querying; Joins, aggregates, window functions)
  • Fast (Optimized for analytics; Vectorized and parallel engine; Larger than memory processing; Parallel Parquet, CSV, and NDJSON loaders)
  • Free and extensible (Free & open-source; Permissive MIT License; Flexible extension mechanism)

Installing DuckDB

Detailed installation instructions can be found in the official documentation. I tend to download the binary from GitHub and put it on my $PATH (e.g. into ~/.local/bin). Thus, I can start duckdb from anywhere.

Using DuckDB

Once installed, DuckDB provides a minimal CLI that is perfect for testing some queries. It can be launched via duckdb.

$ duckdb
v1.1.0 fa5c2fe15f
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

Inside this prompt, one can craft queries, perform simple tests during development, or even run entire analyses. To quit DuckDB, just press Ctrl + D or type .quit.
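For instance, a quick sanity check to confirm the CLI works (a trivial example, nothing OSM-specific yet):

SELECT version() AS duckdb_version;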

Many DuckDB client APIs exist, but I will probably use it most from Python and R, as it bridges the gap between convenient development and performance – an area where both languages have their known limitations.

DuckDB Spatial

Of special interest to me for geographic applications is the Spatial extension, which enables spatial data analysis in DuckDB.

INSTALL spatial;
LOAD spatial;

This extension provides support for spatial geometry types; spatial scalar functions for creating, manipulating, aggregating, and analyzing spatial objects; as well as read/write functionality for common spatial formats such as GeoPackage. Also, OSM data can be queried directly from .osm.pbf files.
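As a minimal sketch of the format support (the file names here are hypothetical), a GeoPackage can be read with the GDAL-based ST_Read table function, and query results can be written back out via COPY:

-- Read features from a (hypothetical) GeoPackage file.
SELECT * FROM ST_Read('my_layers.gpkg') LIMIT 5;

-- Write a query result to a new GeoPackage using the GDAL 'GPKG' driver.
COPY (SELECT * FROM ST_Read('my_layers.gpkg'))
TO 'my_layers_copy.gpkg' WITH (FORMAT GDAL, DRIVER 'GPKG');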

Analyzing OSM with DuckDB

For an interactive example of how to use DuckDB for analyzing OSM data, let's acquire a decently sized OSM file. At the time of writing, the OSM data for Germany, as provided by Geofabrik, is about 4 GiB.

$ wget https://download.geofabrik.de/europe/germany-latest.osm.pbf

With the data acquired, we can now start duckdb. Specifying a database file, here osm.duckdb, will store all tables we create inside this single file.

duckdb osm.duckdb

Setting the timer option to on will report the processing times of each query.

.timer on

With the Spatial extension loaded (via LOAD spatial;, see above), we can now directly query data from within the .osm.pbf file:

FROM 'germany-latest.osm.pbf' LIMIT 10;
/*┌──────────────────────┬────────┬──────────────────────┬───┬───────────┬──────────────────────┐
  │         kind         │   id   │         tags         │ … │ ref_roles │      ref_types       │
  │ enum('node', 'way'…  │ int64  │ map(varchar, varch…  │   │ varchar[] │ enum('node', 'way'…  │
  ├──────────────────────┼────────┼──────────────────────┼───┼───────────┼──────────────────────┤
  │ node                 │    100 │ {description=Weltk…  │ … │           │                      │
  │ node                 │ 122317 │ {TMC:cid_58:tabcd_…  │ … │           │                      │
  │ node                 │ 122318 │                      │ … │           │                      │
  │ node                 │ 122320 │                      │ … │           │                      │
  │ node                 │ 122322 │                      │ … │           │                      │
  │ node                 │ 122323 │                      │ … │           │                      │
  │ node                 │ 122324 │                      │ … │           │                      │
  │ node                 │ 122329 │ {TMC:cid_58:tabcd_…  │ … │           │                      │
  │ node                 │ 122332 │                      │ … │           │                      │
  │ node                 │ 122338 │                      │ … │           │                      │
  ├──────────────────────┴────────┴──────────────────────┴───┴───────────┴──────────────────────┤
  │ 10 rows                                                                 8 columns (5 shown) │
  └─────────────────────────────────────────────────────────────────────────────────────────────┘*/

While this works, querying the .osm.pbf file directly is certainly slower than a native DuckDB format. In my case, counting all rows takes about 28.4 seconds.

SELECT count(*) FROM 'germany-latest.osm.pbf';
/* 100% ▕████████████████████████████████████████████████████████████▏
   ┌──────────────┐
   │ count_star() │
   │    int64     │
   ├──────────────┤
   │    458058020 │
   └──────────────┘
   Run Time (s): real 28.412 user 202.569638 sys 9.707245 */
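If the file is going to be queried repeatedly, one option is to materialize it into a native table first – a sketch, reusing the file name from above (note that this keeps the raw OSM element structure; assembling nodes, ways, and relations into actual geometries is what the tool in the next section is for):

CREATE TABLE raw_osm AS
SELECT * FROM 'germany-latest.osm.pbf';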

QuackOSM for converting OSM data to DuckDB

Converting the data for convenient use in DuckDB is made simple by the tool QuackOSM. It encodes all the logic for converting the OSM-style relational storage of nodes, ways (open & closed), and relations into the more familiar points, lines, and polygons. Running quackosm on a file converts it to a GeoParquet file.

pip install quackosm
quackosm germany-latest.osm.pbf

Although I have recently been working on a Pull Request to also allow direct import into a native DuckDB database, the data can just as easily be imported from the GeoParquet files:

CREATE TABLE osm AS
SELECT * REPLACE (st_geomfromwkb(geometry) as geometry)
FROM read_parquet('files/germany-latest_nofilter_noclip_compact.geoparquet');
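To check that the import worked as intended, the table schema can be inspected (the exact column set depends on the QuackOSM version used):

DESCRIBE osm;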

Easy access to all Tags

The feature that makes OSM especially valuable is its diverse tags: key=value pairs that are used to annotate features in the OSM database. DuckDB makes these tags accessible inside one column as a Map data type, which closely mimics how attribute data are stored in OSM.

SELECT tags FROM osm WHERE tags IS NOT NULL LIMIT 10;
/*┌───────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │                                                 tags                                                  │
  │                                         map(varchar, varchar)                                         │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │ {description=Weltkriegsdenkmal, historic=memorial, memorial=war_memorial}                             │
  │ {TMC:cid_58:tabcd_1:Class=Point, TMC:cid_58:tabcd_1:Direction=positive, TMC:cid_58:tabcd_1:LCLversi…  │
  │ {TMC:cid_58:tabcd_1:Class=Point, TMC:cid_58:tabcd_1:Direction=positive, TMC:cid_58:tabcd_1:LCLversi…  │
  │ {bus=yes, name=Michaeliskirche, network=HVV, public_transport=stop_position}                          │
  │ {bicycle=yes, button_operated=yes, crossing=traffic_signals, highway=crossing, tactile_paving=no, t…  │
  │ {TMC:cid_58:tabcd_1:Class=Point, TMC:cid_58:tabcd_1:Direction=negative, TMC:cid_58:tabcd_1:LCLversi…  │
  │ {crossing:activation=local, crossing:barrier=no, crossing:chicane=no, crossing:light=yes, crossing:…  │
  │ {TMC:cid_58:tabcd_1:Class=Point, TMC:cid_58:tabcd_1:Direction=positive, TMC:cid_58:tabcd_1:LCLversi…  │
  │ {crossing=no, highway=traffic_signals, traffic_signals=signal, traffic_signals:direction=forward}     │
  │ {TMC:cid_58:tabcd_1:Class=Point, TMC:cid_58:tabcd_1:Direction=both, TMC:cid_58:tabcd_1:LCLversion=8…  │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────┤
  │                                                10 rows                                                │
  └───────────────────────────────────────────────────────────────────────────────────────────────────────┘
  Run Time (s): real 0.003 user 0.009330 sys 0.002276*/

In DuckDB, they can be accessed with a syntax similar to Python's Pandas or R's data frames, which makes them very easy to query. The following example shows how to count how many OSM objects in Germany are tagged with leisure=park.

SELECT count(*) FROM osm WHERE list_contains(tags['leisure'], 'park');
/*┌──────────────┐
  │ count_star() │
  │    int64     │
  ├──────────────┤
  │        44931 │
  └──────────────┘
  Run Time (s): real 5.900 user 44.860880 sys 0.090244*/
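Extracting individual values works the same way. In this DuckDB version, map extraction returns a list, so the first element holds the value – a quick sketch (output not shown) that lists a few park names:

SELECT tags['name'][1] AS name
FROM osm
WHERE list_contains(tags['leisure'], 'park')
  AND tags['name'] != []
LIMIT 5;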

Spatial Analyses

Of course, the true power of the combination of OSM and DuckDB in my view lies in the ability to perform spatial analyses. Similar to PostgreSQL with the PostGIS extension, the spatial extension allows for spatial operations on the geometry column in the database. For example, let's find out the total area in m² of all buildings in Germany (at least the ones mapped in OSM). But first, let's check how many buildings are in the dataset.

  SELECT COUNT(*) FROM osm WHERE tags['building'] != [];
  /*┌──────────────┐
    │ count_star() │
    │    int64     │
    ├──────────────┤
    │     37451546 │
    └──────────────┘ */
  SELECT sum(ST_Area(ST_Transform(geometry, 'EPSG:4326', 'EPSG:3035')))
  FROM osm
  WHERE tags['building'] != [] AND ST_GeometryType(geometry)::VARCHAR = 'POLYGON';
  /*┌────────────────────────────────────────────────────────────────┐
    │ sum(st_area(st_transform(geometry, 'EPSG:4326', 'EPSG:3035'))) │
    │                             double                             │
    ├────────────────────────────────────────────────────────────────┤
    │                                              9664382739.825466 │
    └────────────────────────────────────────────────────────────────┘*/

What still blows my mind is how fast DuckDB handles these kinds of queries. After all, the query first had to find all the buildings using WHERE inside the nested tags and filter out all geometries but polygons, then ST_Transform the geometries, then calculate the ST_Area per building, and finally add everything up. All that on a five-year-old laptop in under five minutes. This just makes me giddy. It enables large-scale spatial analyses without the overhead of a server-hosted PostGIS database.

Currently, the spatial extension provides most of the functionality needed for basic spatial operations and manipulations, but it will surely evolve in the future. The ecosystem around spatial data is promising. One honorable mention here must be the H3 community extension.
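As a small sketch of what using it looks like (assuming the function name h3_latlng_to_cell from the extension's documentation), community extensions install just like the official ones:

INSTALL h3 FROM community;
LOAD h3;

-- Index a coordinate (roughly Berlin) into an H3 cell at resolution 8.
SELECT h3_latlng_to_cell(52.52, 13.405, 8) AS cell;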

Bottom Line – why I think DuckDB is awesome, not just for OSM

The main reason why I think DuckDB is such a blessing for spatial data analytics is that it makes "just trying things out" easy. After I had the idea for the last paper in my PhD (the code of which is published here), I started out with the question "Is this even possible in a reasonable amount of processing time?". The admittedly not-too-pretty solution (this also being my first bigger Python project) eventually used PostgreSQL/PostGIS to do the heavy lifting of several million spatial operations.

Had I known about DuckDB at the time, it would have saved me from setting up a whole Postgres server. And being an OLAP database, it would probably have been faster at those queries, too.

If it's not clear by now: I am stoked to find out where DuckDB will take the data-curious and how it will help me solve the odd overly complicated spatial question.