Customizing OSM Data Loading with Lua Scripts in Osm2pgsql

Sam Chance

on

Osm2pgsql is a powerful command-line tool designed to efficiently import OSM data into a PostgreSQL/PostGIS database. One of its noteworthy features is the ability to use the Lua programming language at runtime to execute user-defined scripts. This allows for extensive customization and transformation of the OSM data as it is being loaded by the tool. This enables us to create very specific tables, and store the data exactly how we need it.

In this blog, I’ll give an overview of the osm2pgsql command using OSM data. I’ll explain how we can easily transform and customize the data using simple Lua scripts to produce a concise and well-structured dataset that meets our specific needs.

The osm2pgsql command

The osm2pgsql program is a simple command line tool that can load a variety of OSM data-types into a PostgreSQL/Postgis database. Basic usage of the tool looks like this:

osm2pgsql --create --output flex --schema osmplanet --style styles/style.lua osm-planet.osm.pbf

The --create flag tells the tool to create a new table (--append can be used to add to an existing table). --output flex specifies that we will customise the tool with the Lua script specified by the --script flag. And finally, the --schema flag tells the tool which schema to use in the database (the tool uses public by default). More details about the command and flags can be found in the Man pages: man osm2pgsql.

During the import process, Osm2pgsql typically loads all of the OSM nodes into RAM before adding to the database. This makes loading very fast and efficient, but can use vast amounts of RAM. When I run the Osm2pgsql tool on a planet-wide osm.pbf file, I usually max out at around 90GB of RAM! My system has 128GB of DDR4 ECC RAM and a 12/24 core processor, so if I’m not doing anything else, my system can just about manage this.

If the system runs out of RAM during runtime, then loading will fail. If your system is short on RAM, then you can use the --slim and --flat-nodes arguments to store nodes on disk or in the database instead. This comes with a massive performance hit though.

If using osm2pgsql with the --output flex option, it’s recommended to use a build compiled with LuaJIT support. This will speed up parsing the OSM data by about 15%

Connecting to the database

Firstly, a PostgreSQL instance is needed with the Postgis extension installed. For loading planet-wide data, be aware that the database will need fine-tuning, as default PostgreSQL configuration is not suitable for large databases,

Information on preparing the database for use by osm2pgsql can be read here.

Once the database is configured with the necessary extensions installed, you can add the database credentials to a .env file in the same directory as the osm2pgsql command will be run. It should look something like this:

export PGHOST=localhost
export PGPORT=5432
export PGDATABASE=gisdb
export PGUSER=gisuser
export PGPASSWORD=abc1234password

This should be sourced before running the osm2pgsql command. For convenience, this can all be done in a single script:

#!/usr/bin/env bash

source .env
osm2pgsql --create --output flex --schema osmplanet --style styles/style.lua osm-planet.osm.pbf

The data file

OSM data files can be obtained from various sources. I use the planet-wide file as I like to have all OSM data available in my database, but you can easily just use an extract for a particular region instead. If you do intend to import planet-wide data, I’d advise testing the script with a small region first to make sure everything imports correctly, as planet-wide imports can take several hours.

The latest planet-wide file can be imported using cURL:

curl -OL https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf

Or alternatively, regional extracts can be downloaded from geofabrik, or city level extracts from bbbike.

This will download Singapore (which is only about 25MB):

curl -OL https://download.bbbike.org/osm/bbbike/Singapore/Singapore.osm.pbf

The Lua script

The Lua script has two parts, the first defines the structure of the output table, and the second defines functions that transforms the data to the new structure.

Defining our table

A table is defined in Lua like so:

local placenames = osm2pgsql.define_table({
  name = "placenames",
  ids = { type = "any", id_column = "osm_id", type_column = "osm_type" },
  columns = {
    { column = "name", type = "jsonb" },
    { column = "type", type = "text" },
    { column = "geom", type = "point", projection = 4326, not_null = true },
  },
})

Here we define a table called placenames. The idea of this table is to indicate the name and type of a particular location as a single point. This includes the names of significant human settlements such as villages, towns and cities or regions such as countries or islands. More information on the place tag can be found on the OSM Wiki

I often use place names in my maps and GIS analyses to label significant locations, therefore it’s useful to have a global place names table that I can easily access in my database.

The ids field in the above snippet defines the ID columns. Here we specify that the object ID column will be named osm_id, and the column that specifies the object type as osm_type. So, for example, the osm_id could be 331952687 and the osm_type N for a node. The type = "any" part specifies the OSM types allowed within in the table. More information on id handling can be found here.

It’s useful to set these values, as you can later refer back to the original OSM object by the ID and type in the future. This is also mandatory information if you later wish to update your table from OSM diff files. I wont go into updating tables in this blog, but more information can be found here.

The columns field defines the columns that we wish to populate in the script. The name of the column is defined, along with other information such as the data type and projection (for geometry columns).

In the above example, we have created a name column of type text, a type column of type text, and a geom column of geometry type point with a projection of 4326. Of course, more columns can be defined here depending on your use-case, for example, for the place tag, you could include the population and admin_level fields too if you wanted.

Here is a list of the different data types available:

  • text
  • bool, boolean
  • int2, smallint
  • int4, int, integer
  • int8, bigint
  • real
  • hstore
  • json
  • jsonb
  • direction
  • geometry
  • point
  • linestring
  • polygon
  • multipoint
  • multilinestring
  • multipolygon
  • geometrycollection

The final table structure should look something like this:

osm_type osm_id name type geom
N 1234 abcd abcd 0000

Data mapping functions

The data mapping functions are where the magic happens. Custom logic can be implemented here to filter, transform and manipulate the data in any you can think of. There are also several built-in functions that can be applied to OSM objects before they are inserted into the table. More details on the built-in functions can be found here and here.

There are three main processing callback functions available:

  • osm2pgsql.process_node(object)
  • osm2pgsql.process_way(object)
  • osm2pgsql.process_relation(object)

There are also an additional three functions for processing untagged objects too, but I wont go in to that. More information can be found here.

These functions act on each of the osm objects in sequence from the input file, and is where we can define our custom logic. At least one of these functions needs to be defined, but not all of them are necessary. If you don’t want to process relations, then the process_relation function can be omitted. If there is an object that you do not want included in the final table, then the object can be filtered out using conditional logic and returning the function.

A very simple implementation of this for our placenames example could look like this:

local function process_nodes(object)
  if object.tags.place then
    local names, has_names
    names, has_names = lib.create_name_field(object)

    if has_names then
      placenames:insert({
        name = names,
        type = object.tags.place,
        geom = object:as_point(),
      })
    end
  end
end

There are a few things happening here.

  1. Firstly, we are processing only nodes with the process_nodes function.
  2. We then only include objects if the object.tags.place field exists.
  3. Then we define some custom logic for testing if the place tag has a name field. In this instance we want to remove objects where the name field is blank (as this is a place name table after-all!), then we refactor that field using the create_name_field() function (see below).
  4. If the name field exists, we then insert the object into the table with the insert() function.
  5. At this point we also convert the object to a point geometry using the built-in as_point() function.

Place names can exist in several different languages. The convention with OSM is that the local name (as known by people on the ground at the location) is the default name, and is often populated under the name tag. Sometimes additional languages are included, and for example are defined in a subtag like this name:en for English or name:ar for Arabic etc.

For my placenames table example, I’d like to prioritise English names. So ideally, if an name:en tag exists, that tag should be used in my name field before the OSM default name tag. This is the logic I’d like to use:

name:en > name > name:any

An example tag with multiple place names in different languages looks like this:

Key Value
addr:postcode 35470
name Bain-de-Bretagne
name:br Baen-Veur
name:fr Bain-de-Bretagne
name:fr-x-gallo Bouin
name:uk Бен-де-Бретань
place village

In this example, the name tag Bain-de-Bretagne will be preferred over the others.

I’ve created a custom function in Lua to do just that:

-- Create custom name field that prioritises English names

local function create_name_field(object)
  local name, name_en, name_default, name_other
  local has_name = false
  for k, v in pairs(object.tags) do
    if k == "name" then
      name_default = v
      has_name = true
    elseif k == "name:en" then
      name_en = v
      has_name = true
    elseif string.match(k, "^name:") then
      name_other = v
      has_name = true
    end
  end
  name = name_en or name_default or name_other
  return name, has_name
end

This function uses the name:en tag if available, else it’ll fallback to the name tag (i.e. local language), then as a final fallback, it’ll pick any one of the other languages matching name:*.

Sometimes places are not defined by a node, instead they can be relations or ways. For example, islands are often depicted as relations. So in our placenames script, we also want to process nodes and ways. However, I am only interested in including point geometries for this table. Here is an example of a place that is a relation.

In this circumstance, we can use the built-in centroid() function on a line/polygon geometry to get the center-point of that geometry.

Here is an example of the process_ways and process_relations callback functions in action:

local function process_ways(object)
  if object.tags.place then
    local names, has_names
    names, has_names = lib.create_name_field(object)

    if has_names then
      placenames:insert({
        name = names,
        type = object.tags.place,
        geom = object:as_linestring():centroid(),
      })
    end
  end
end

local function process_relations(object)
  if object.tags.place then
    local names, has_names
    names, has_names = lib.create_name_field(object)

    if has_names then
      placenames:insert({
        name = names,
        type = object.tags.place,
        geom = object:as_multipolygon():centroid(),
      })
    end
  end
end

Here we convert ways to a linestring geometry, and relations into a polygon geometry, before finding the centroid of these geometries and inserting into the table as a point.

The whole script

The whole script looks like this:

--
-- lua script to be used with the osm2pgsql command
-- creates a PostgreSQL table called `placenames` with the following columns:
-- osm_type|osm_id|name|type|admin_level|population|geom
--

-- define a table
local placenames = osm2pgsql.define_table({
  name = "placenames",
  ids = { type = "any", id_column = "osm_id", type_column = "osm_type" },
  columns = {
    { column = "name", type = "text" },
    { column = "type", type = "text" },
    { column = "admin_level", type = "text" },
    { column = "population", type = "text" },
    { column = "geom", type = "point", projection = 4326, not_null = true },
  },
})

-- custom function for creating a name field that prioritises English names
local function create_name_field(object)
  local name, name_en, name_default, name_other
  local has_name = false
  for k, v in pairs(object.tags) do
    if k == "name" then
      name_default = v
      has_name = true
    elseif k == "name:en" then
      name_en = v
      has_name = true
    elseif string.match(k, "^name:") then
      name_other = v
      has_name = true
    end
  end
  name = name_en or name_default or name_other
  return name, has_name
end

-- the callback functions for processing and transforming OSM objects
function osm2pgsql.process_node(object)
  if object.tags.place then
    local names, has_names
    names, has_names = create_name_field(object)

    if has_names then
      placenames:insert({
        name = names,
        type = object.tags.place,
        admin_level = object.tags.admin_level,
        population = object.tags.population,
        geom = object:as_point(),
      })
    end
  end
end

function osm2pgsql.process_way(object)
  if object.tags.place then
    local names, has_names
    names, has_names = create_name_field(object)

    if has_names then
      placenames:insert({
        name = names,
        type = object.tags.place,
        admin_level = object.tags.admin_level,
        population = object.tags.population,
        geom = object:as_linestring():centroid(),
      })
    end
  end
end

function osm2pgsql.process_relation(object)
  if object.tags.place then
    local names, has_names
    names, has_names = create_name_field(object)

    if has_names then
      placenames:insert({
        name = names,
        type = object.tags.place,
        admin_level = object.tags.admin_level,
        population = object.tags.population,
        geom = object:as_multipolygon():centroid(),
      })
    end
  end
end

Using our table

To use our placenames table, we can write a simple query like this. I also have an administrative_boundaries table in my Postgis database (Lua script available here), so we can use that to select all places in the City of London using the following query:

with
    placenames as (select * from osmplanet.placenames),

    admin_area as (
        select st_polygonize(geom) as geom
        from osmplanet.administrative_boundaries
        where admin_level = '6' and name = 'City of London'
    )

select a.osm_id, a.geom, a.name
from placenames a
join admin_area b on st_intersects(b.geom, a.geom)

We can then import that into GIS software such as QGIS:

City of London with place names

References

Attribution

© OpenStreetMap