Ali Anari

7 PostgreSQL data migration hacks you should be using (but aren't)

Ever find yourself needing to add a new column to your million-row users table? Small caveat: this needs to be done without blocking concurrent writes. You've also just learned that you need to re-hash all of your users’ passwords using the scrypt key derivation function to harden their credentials against attackers. Oh, and no users are allowed to log in for the duration of this data migration. At the breakneck pace that many of today’s high-tech startups operate, this kind of pressure is not uncommon. Fortunately, you have elected to use PostgreSQL or some other industrial-strength RDBMS as your database workhorse of choice, and you will soon have the necessary tools to nail down these newfound requirements with ease.

Elephant-grade storage

We love using Postgres at Tilt, and it makes our lives easier every day. It’s stable, extensible, supports high volume, and has so many advanced features that keep getting better with each new release (JSONB, Lateral Joins, Materialized Views, and Foreign Data Wrappers just to name a few). I could talk about the exciting future of Postgres all day, but for now, I’ll share some examples of how it has impacted us directly here at Tilt. Read on for some battle-tested hacks that you can incorporate right away into your data migrations playbook!

Rank and file

Need to group rows by some criteria and extract the most relevant ones? Use window functions (or analytic functions if you’re coming from an Oracle background) to efficiently return this data, and use the WITH clause to chain these results into the rest of your data migration. Here is an example that retrieves the most recent user for each external_id.
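A minimal sketch of what such a query might look like, assuming a users table with external_id and created_at columns (the table and column names are illustrative):

    WITH ranked_users AS (
        SELECT
            *,
            -- Number rows within each external_id group, newest first
            ROW_NUMBER() OVER (
                PARTITION BY external_id
                ORDER BY created_at DESC
            ) AS rn
        FROM users
    )
    SELECT *
    FROM ranked_users
    WHERE rn = 1;  -- keep only the most recent user per external_id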

The ROW_NUMBER() ranking function is extremely useful, as it emulates the LIMIT and OFFSET clauses within each partition, in a single and more efficient query. It’s trivial to modify this example to return the first, middle, or last n rows, or to combine the query with other statistical functions.

Note: Use DENSE_RANK() instead if duplicate records should receive the same rank, with no gaps between consecutive ranks.

In through the out query

Want to massage some rows before shuffling them into another table? You can use a SELECT statement and pipe its output directly into a table via INSERT (without using VALUES), and save yourself a lot of hassle. This is a fairly common but nevertheless useful SQL idiom that MySQL supports as well.
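A sketch of the idiom, using hypothetical users and audit_users tables (the source column and the 'backfilled' marker are assumptions for illustration):

    -- Copy and transform rows from users into audit_users in one statement
    INSERT INTO audit_users (user_id, email, source)
    SELECT
        id,
        LOWER(email),      -- arbitrary per-row transformation
        'backfilled'       -- mark this batch as coming from the migration
    FROM users
    WHERE created_at < NOW() - INTERVAL '1 year';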

The benefit of this approach is being able to apply arbitrary functions to each row being inserted in the SELECT portion (in this example, we’re marking this batch of rows as “backfilled” as part of our data migration, presumably to differentiate them from rows that already existed). This is useful for transforming data, analogous to how the higher-order map function works in many programming languages, and, except in more complicated scenarios, replaces the need for a PL/pgSQL loop.

Note: If you have a very large volume of data, the INSERT … SELECT approach may take too long for your particular data migration. In that case, I would recommend transforming the data on disk first, then performing a bulk insert or COPY operation to load the transformed records faster.

In-N-Out query

Use this technique to consume rows as soon as they become available from an INSERT statement, without blocking. By taking advantage of the WITH clause again, you avoid having to hand-roll loops in PL/pgSQL, keeping your query convenient and highly readable.
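Here is a minimal sketch of a data-modifying CTE combined with RETURNING; the events table and its id and scheduled_at columns are hypothetical:

    WITH new_events AS (
        -- Insert 100 rows, one per generated value, and hand the new rows back
        INSERT INTO events (scheduled_at)
        SELECT NOW() + n * INTERVAL '1 minute'
        FROM GENERATE_SERIES(1, 100) AS n
        RETURNING id, scheduled_at
    )
    SELECT id, scheduled_at
    FROM new_events;  -- the inserted rows are usable immediately, no second query needed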

This utilizes the optional RETURNING clause, which can return specified columns from the rows that are being inserted (use OUTPUT if you’re coming from MSSQL). Combine this with the GENERATE_SERIES() set returning function, and you’ve got a formidable force for generating indices or timestamps for rapid insertion!

Delete-fu query

This is very similar to the last insertion example, but instead uses the DELETE statement to return data and chain queries.
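A sketch with a hypothetical sessions table, returning a single aliased column from the deleted rows:

    WITH deleted_sessions AS (
        DELETE FROM sessions
        WHERE expires_at < NOW()
        -- Return just one column, renamed on the fly with AS
        RETURNING user_id AS expired_user_id
    )
    SELECT expired_user_id
    FROM deleted_sessions;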

A key difference here is that we are returning a single column from the deleted table, instead of using the asterisk (*) character.

Note: You can use AS to alias the columns that are coming back from the RETURNING clause to rename them on the fly, as we are doing above.

Keep calm and archive on

Now let’s combine the last few techniques, and observe how seamless it is to delete, archive, and return specific rows from a fictitious customers table, all in a single statement:
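(The column names on customers, orders, and archived_customers below are assumptions for illustration, and archived_customers is assumed to share the customers layout; the shape of the statement follows the techniques above.)

    WITH latest_orders AS (
        -- Find each customer's most recent order date
        SELECT customer_id, MAX(ordered_at) AS last_ordered_at
        FROM orders
        GROUP BY customer_id
    ),
    stale_customers AS (
        -- Delete customers whose newest order is more than 10 years old
        DELETE FROM customers
        USING latest_orders
        WHERE customers.id = latest_orders.customer_id
          AND latest_orders.last_ordered_at < NOW() - INTERVAL '10 years'
        RETURNING customers.*
    )
    -- Move the deleted rows into the archive table and show what was moved
    INSERT INTO archived_customers
    SELECT * FROM stale_customers
    RETURNING id, name;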

This deletes all customers whose orders are exclusively older than 10 years (preserving customers who have placed any newer orders), moves them to the archived_customers table, and lastly displays the relevant rows that were moved.

Note: The USING clause is a Postgres extension that we used to join the latest_orders CTE onto customers. A more standard way to do this would be to include a sub-select, but sometimes the join style is easier to write and even faster to execute.

Batch at will, commander!

Finally, we may only want to lock and process so many rows at once during a particularly sensitive data migration – here’s how you can safely dispatch a large number of updates in batches. I recently learned this technique from rosser, who shared it on Hacker News. Here’s a concrete example of how we can safely upgrade our most trusted “moderator” users to the “admin” role on a large internet forum, without contending with concurrent writes or incurring substantial I/O all at once.
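Here is a sketch of the pattern on a hypothetical users table with a role column; run it repeatedly from your application or a script until the returned count reaches zero:

    WITH upgraded AS (
        UPDATE users
        SET role = 'admin'
        WHERE id IN (
            -- Lock at most 1000 candidate rows, erroring out immediately
            -- if another transaction already holds a lock on any of them
            SELECT id
            FROM users
            WHERE role = 'moderator'
            LIMIT 1000
            FOR UPDATE NOWAIT
        )
        RETURNING id
    )
    SELECT COUNT(1) FROM upgraded;  -- loop until this yields zero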

In this example, FOR UPDATE NOWAIT immediately locks the rows retrieved in the first step (as if they were about to be updated), which prevents them from being modified or deleted by other transactions until the current transaction ends. Then we simply set the LIMIT (1000 here) to however many records n we want to process per iteration, and keep issuing the query until the COUNT(1) expression yields zero. Clever!

Summary

Hopefully you will have gleaned a thing or two from this post, and are now ready to venture out and write some robust and awesome data migrations of your own.

If you haven’t configured your psql interactive shell yet, go ahead and feel free to use my config file and simply drop it in your home directory as ~/.psqlrc. Voila: sensible defaults, a better shell, and an all-around better Postgres experience.
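If you just want a starting point, here are a few commonly used settings (this is not my config file from above, just an illustrative sketch of sensible defaults):

    -- ~/.psqlrc
    \set QUIET 1
    \timing on
    \x auto
    \pset null '(null)'
    \set HISTCONTROL ignoredups
    \set COMP_KEYWORD_CASE upper
    \unset QUIET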

I tried to keep this post fairly concise, but for the boldest of explorers, here is an excellent wiki page containing many other Postgres gems that you also may not have known about.

As you can see, with so many powerful data manipulation functions at our disposal, the possibilities are truly endless. Even if your organization runs NoSQL data stores in production to solve its day-to-day engineering challenges, Postgres remains a flexible, viable, and powerful Swiss-Army chainsaw of persistence mechanisms to have in your arsenal.

Further Reading

MySQL vs PostgreSQL: http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL
Select: http://www.postgresql.org/docs/9.1/static/sql-select.html
Insert: http://www.postgresql.org/docs/9.1/static/sql-insert.html
Update: http://www.postgresql.org/docs/9.1/static/sql-update.html
Delete: http://www.postgresql.org/docs/9.1/static/sql-delete.html
COPY: http://www.postgresql.org/docs/9.1/static/sql-copy.html
WITH Queries: http://www.postgresql.org/docs/9.1/static/queries-with.html
Window Functions: http://www.postgresql.org/docs/9.1/static/functions-window.html
Set Returning Functions: http://www.postgresql.org/docs/9.1/static/functions-srf.html
rosser’s HN thread: https://news.ycombinator.com/item?id=9018756
