Saturday, 26 October 2024

pg

During the early APAC hours, we observed an incident affecting the synchronization of PostgreSQL replicas. Below is a summary of the issue, actions taken, and key takeaways.

Incident Details

  • Time of Detection: Early APAC hours
  • Observation: Both PostgreSQL replicas were out of sync with the leader.

Initially, we noticed that while one replica was gradually catching up with the leader, the other continued to fall further behind. Fortunately, we held off on reinitializing the first replica, and it synchronized with the leader on its own after a couple of hours.

Upon further investigation, we found that swap usage on the leader node was significantly elevated, though a more detailed analysis was not possible due to other scheduled activities, including China's Business Continuity Management (BCM) tests.
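
As a rough sketch of the checks involved, swap pressure on the leader can be inspected with standard Linux tools; the commands below assume a Linux host with /proc available and PostgreSQL backends running as the postgres OS user:

```bash
# Overall memory and swap usage on the leader node
free -h

# Approximate swap used per PostgreSQL backend, largest first
# (reads VmSwap from /proc; assumes backends run as the "postgres" OS user)
for pid in $(pgrep -u postgres); do
  awk -v pid="$pid" '/^VmSwap/ {print pid, $2, $3}' "/proc/$pid/status" 2>/dev/null
done | sort -k2 -rn | head
```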

Actions Taken

  1. Replica Reinitialization: Because lag on the second replica kept increasing, a reinitialization was initiated (the corresponding commands are sketched after this list). However, the reinitialization ran for more than four hours.
  2. Scheduled Change Impact: Around 3 PM, a scheduled change required a switchover to the second replica. Fortunately, this replica was already prepared and synchronized, so we proceeded with the switchover without additional issues.
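
For reference, both steps map onto standard patronictl operations; this is only a sketch, and the config path, cluster name, and member name below are placeholders:

```bash
# Reinitialize the lagging replica from the current leader
# (wipes and rebuilds its data directory; cluster and member names are placeholders)
patronictl -c /etc/patroni.yml reinit my-cluster replica2

# Planned switchover to the rebuilt replica once it is back in sync
patronictl -c /etc/patroni.yml switchover my-cluster --candidate replica2
```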

Following the switchover, the original leader entered a "starting" state. Multiple attempts to restart the Patroni service and perform a pg_rewind were unsuccessful. Further analysis indicated that a required WAL file was no longer available on any node.
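
The recovery attempts roughly followed the sequence below; the systemd unit name, data directory, connection string, and WAL segment name are illustrative only:

```bash
# Restart the Patroni-managed member (assumes a systemd unit named "patroni")
sudo systemctl restart patroni

# Manual pg_rewind attempt against the new leader
# (data directory and connection string are illustrative)
pg_rewind \
  --target-pgdata=/var/lib/postgresql/data \
  --source-server="host=new-leader port=5432 user=postgres dbname=postgres" \
  --progress

# Check whether the WAL segment pg_rewind reports as missing is still present locally
# (the segment name below is hypothetical)
ls /var/lib/postgresql/data/pg_wal | grep 000000020000001200000034
```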

Impact on Resiliency

The incident has temporarily reduced system resiliency, as one of the replicas now requires a full rebuild. Fortunately, there is no immediate plan to fail the application back to the previous nodes, so operational continuity is maintained for the time being.

Key Takeaways and Recommendations

  1. Monitor System Resources:

    • If a replica falls out of sync without apparent triggers like server or database restarts, it is essential to investigate memory and CPU usage, including swap utilization on the leader. Proactively monitoring these resources can help preempt further lag or sync issues.
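
    As a sketch, replication lag and overall cluster state can be checked from the leader with a query against pg_stat_replication and with patronictl; the config path is illustrative, and the column names assume PostgreSQL 10 or later:

```bash
# Per-replica replication lag as seen from the leader (PostgreSQL 10+ column names)
psql -x -c "
  SELECT application_name,
         state,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
         replay_lag
  FROM pg_stat_replication;"

# Cluster-wide view of members, roles, and lag (config path is illustrative)
patronictl -c /etc/patroni.yml list
```
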
  2. Optimize Large Database Reinitializations:

    • When dealing with large databases, rebuilds can be time-consuming; in this case, the 1TB database took over 7 hours to reinitialize. pg_basebackup itself runs as a single process, so the practical levers for speeding it up are a fast checkpoint, streaming WAL during the copy, and relaxing (or removing) any rate limit; genuinely parallel copies would require a different replica-creation method.

    Here’s an example of setting these pg_basebackup options in the Patroni configuration:

```yaml
postgresql:
  basebackup:
    max-rate: '100M'       # limit backup bandwidth (optional; remove to copy at full speed)
    checkpoint: 'fast'     # request an immediate checkpoint so the copy starts right away
    wal-method: 'stream'   # stream WAL alongside the base backup
```
    • checkpoint: 'fast' forces an immediate checkpoint on the leader, so the base backup begins without waiting for a spread checkpoint.
    • wal-method: 'stream' streams WAL files during the copy, reducing the catch-up time after the reinit completes.
    • max-rate caps the bandwidth used by the copy; loosening or removing it lets the rebuild run at full speed if the leader can absorb the extra load.
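
    For comparison, a manual rebuild with the same options would look roughly like the following; the host, user, and target directory are placeholders:

```bash
# Manual base backup from the leader using the same tuning options
# (host, user, and target directory are placeholders)
pg_basebackup \
  -h leader-host -U replicator \
  -D /var/lib/postgresql/data \
  --checkpoint=fast \
  --wal-method=stream \
  --max-rate=100M \
  -R -P
```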

We will continue monitoring the health of all nodes and proceed with necessary rebuilds and adjustments to restore full resiliency. Please reach out if there are any immediate concerns or additional questions.
