A Guide to Logical Replication and CDC in PostgreSQL

2024-01-11 · 8 min read

PostgreSQL replication CDC logical replication data engineering

#A Guide to Logical Replication and CDC in PostgreSQL

By Jacob Prall · January 11, 2024 · 10 mins

PostgreSQL is a database that needs no introduction. Started as an open-source project out of UC Berkeley, it has evolved through decades of careful stewardship to become one of the world's most relied on and beloved relational database management systems (RDBMS).

There will inevitably come a time when the data captured and stored in your PostgreSQL database needs to exist somewhere else. Perhaps you need to replicate data between Postgres servers, to ensure your system survives a critical failure or heavy spikes in traffic. Alternatively, you may need to move your data to an environment better suited for data analysis and activation.

In this blog, we'll look at how to satisfy both of these requirements using logical replication. We'll review the internal mechanisms that enable replication in PostgreSQL, compare different types of replication, and finally, provide a guide to efficiently replicate your data between Postgres instances and other data stores.

Let's jump right in!

##What is Database Replication?

Database replication is the process of copying and maintaining database objects (in our case today, tables), in multiple locations. This can happen synchronously or asynchronously, at the byte, block, or logical level, and is crucial for high availability, load balancing, and data activation.

Replication in PostgreSQL is handled by several processes and features. We'll discuss how these components work below.

##PostgreSQL Replication Mechanisms

Let's look at a couple of the key internal components in PostgreSQL that make replication possible.

The first component to know is the Write-Ahead Log (WAL). The idea behind the WAL is that changes to data files (e.g. tables, indexes) must be written to disk only after those changes have been logged. The WAL, then, is that log — an append-only ledger that records every change in the database.

Using a WAL comes with a few benefits. It ensures PostgreSQL can recover from a crash, even if the crash occurs in the middle of a transaction. It also allows for point-in-time recovery, and helps optimize I/O operations.

WAL records are created every time a table's contents are modified. They are first written to the WAL buffer (whose size is determined by the `wal_buffers` setting). By writing to memory first, PostgreSQL can optimize and reduce the number of disk I/O operations. When the buffer is full, the data is flushed to disk as a WAL segment.

Each WAL record entry describes a change at the byte or block level in the database. The insert position of the record is described by its Log Sequence Number (LSN), a byte offset that increases with each new record. WAL files are stored in the `pg_wal` directory, and are a maximum of 16 MB by default (though this is configurable).

Through the process of logical decoding, WAL files can be turned into a readable format that represents high-level database operations like INSERT, UPDATE, and DELETE. So, for example, the record might originally say "byte A in file B was changed to C", but through logical decoding, it can be read as "row R in table T was updated to value V". These logical change records can be "published" by a database instance to "subscribers". After WAL records are flushed to the publisher's disk, the WAL Sender Process streams the committed WAL segment data to subscribed standby servers. This, in a nutshell, is how CDC replication works in PostgreSQL.

A PostgreSQL database can only have so many subscribers. The relationship between publisher and subscriber is mediated by PostgreSQL's replication slots — a persistent data structure used to track the progress of replication across subscribers, and to ensure that WAL data needed for replication is not prematurely removed (or recycled). They work by storing the LSN of the WAL record most recently received by each subscriber. When a replication slot is active, PostgreSQL will retain all relevant WAL segments necessary for the subscriber to stay in sync with the publisher, even if they are older than the retention policy would normally allow.

There are two kinds of replication slots in PostgreSQL — physical (for physical replication) and logical (for logical replication).

###Physical vs Logical Replication

Physical replication is at the byte-level. The exact binary data from the master server's disk are copied to the replica (including, but not limited to, the WAL). Physical replication is typically used when setting up a master-replica architecture and has native support for streaming, making it useful for setups that ensure high availability via standby servers.

Logical replication, on the other hand, is at the transaction level. Rather than copying bytes off the disk, the logical change records (detailing INSERTs, UPDATEs, DELETEs) are copied over. Because it's at the logic level, filters can be applied to only replicate specific tables, rows, or columns, making it much more flexible than physical replication. Logical replication is ideal for syncing your transactional data to a data lake or data warehouse. Both approaches are suitable for read-replica load-balancing, though logical replication introduces replication lag.

It's important to note that when using logical replication, the database schema and DDL commands are not replicated. Schema changes must be kept in sync manually. When the schema between producer and subscriber are out of sync, the replication process will error — this can be avoided by applying any additive schema changes to the subscriber first.

##Replicating Data Between PostgreSQL Servers

Load-balancing read requests to replicas is a common approach to reduce the load on your primary database. In this guide, we'll look at how to implement the primary-standby pattern using logical replication. This approach is only suitable for workloads without hard real-time requirements, as there will always be some degree of replication lag when implementing logical replication.

To set up logical replication following the primary-standby pattern, start by configuring your primary database.

Edit your `postgresql.conf` file to enable logical replication by setting the `wal_level` to `logical`, and adjust `max_replication_slots` and `max_wal_senders` to accommodate the number of replicas you need.
Create a user role with replication privileges:

`CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'password';`

Edit the `pg_hba.conf` file to allow the replication role to connect from the replicas' IP addresses.
Create a Publication on the primary:

`CREATE PUBLICATION my_publication FOR TABLE table1, table2;`

On each standby server, create a subscription:

`CREATE SUBSCRIPTION my_subscription
CONNECTION 'host=master_ip dbname=db_name user=replicator password=password'
PUBLICATION my_publication;`

Before creating your subscription, you may want to initialize the standby's database with a snapshot of the primary's data. This can be achieved with `pg_dump` and `pg_restore`, Postgres utilities for backup and restoration.

##Replicating Data Between PostgreSQL and External Data Stores with Airbyte

Airbyte works by providing two connectors — a source connector and a destination connector — which together create your EL pipeline. The PostgreSQL source connector is certified, meaning the Airbyte team maintains it and provides a production-readiness guarantee.

To begin replicating your PostgreSQL database, start with steps 1 and 2 from above. Create a replication slot on your Postgres database to track changes you want to sync. Then, create publication and replication identities for each Postgres table you want to replicate. Publication identities specify the set of tables (and optionally, specific rows within those tables) whose changes you want to publish. Replication identities are configurations on the replica side that determine how the replicated data should be handled or applied.

From there, it's as simple as walking through Airbyte's UI to set up your PostgreSQL source connector and a destination connector (for example, BigQuery, Snowflake, or Redshift).

##Replicating Data Between Neon and External Data Stores with Airbyte

Effectively managing and scaling PostgreSQL deployments in the cloud can be expensive and impractical for smaller teams. Neon is one platform offering serverless PostgreSQL with features like autoscaling, separation of compute and storage, and advanced features like database branching.

Neon recently added support for logical replication and is fully compatible with Airbyte's CDC solution. Pairing a cloud database like Neon with Airbyte Cloud can deliver a scalable, reliable, and low-cost solution for your OLTP and replication needs.

##Final Thoughts

Before signing off, here are some WAL configurations to keep in mind when setting up logical replication:

`wal_compression` — Minimizes the impact of WAL accumulation between Airbyte syncs. Setting a compression policy saves storage at the cost of some extra CPU.
`max_wal_size` — Sets the amount of disk usage the WAL is allocated between checkpoints. The default is 1 GB. Set this large enough for the WAL to be easily stored between syncs.
`min_wal_size` — Sets the limit at which WAL files will be removed rather than recycled between checkpoints. The default is 80 MB. Adjusting this to align with your workload can optimize disk space and improve replication performance.

PostgreSQL is a time- and battle-tested workhorse. Its robust community, dedicated contributors, and flexible feature-set make it an excellent choice for a wide range of use cases. Today, we've examined how PostgreSQL implements logical replication and how Airbyte can be used in your CDC replication setup.

##Relevant Resources

https://www.postgresql.org/docs/current/different-replication-solutions.html
https://www.postgresql.org/docs/current/wal-internals.html
https://www.postgresql.org/docs/current/logical-replication.html
https://www.postgresql.org/docs/16/runtime-config-replication.html