Saturday, 26 October 2024

pg

During the early APAC hours, we observed an incident affecting the synchronization of PostgreSQL replicas. Below is a summary of the issue, actions taken, and key takeaways.

Incident Details

  • Time of Detection: Early APAC hours
  • Observation: Both PostgreSQL replicas were out of sync with the leader.

Initially, we noticed that while one replica was gradually catching up with the leader, the other continued to fall further behind. Fortunately, we held off on reinitializing the first replica, and it synchronized with the leader on its own after a couple of hours.

Upon further investigation, we found that swap usage on the leader node was significantly elevated, though a more detailed analysis was not possible due to other scheduled activities, including China's Business Continuity Management (BCM) tests.
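
As a rough sketch of the checks involved, swap pressure on the leader can be inspected with standard Linux tools; the commands below assume a Linux host with /proc available and PostgreSQL backends running as the postgres OS user:

```bash
# Overall memory and swap usage on the leader node
free -h

# Approximate swap used per PostgreSQL backend, largest first
# (reads VmSwap from /proc; assumes backends run as the "postgres" OS user)
for pid in $(pgrep -u postgres); do
  awk -v pid="$pid" '/^VmSwap/ {print pid, $2, $3}' "/proc/$pid/status" 2>/dev/null
done | sort -k2 -rn | head
```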

Actions Taken

  1. Replica Reinitialization: Because lag on the second replica kept increasing, a reinitialization was initiated (the corresponding commands are sketched after this list). However, the reinitialization ran for more than four hours.
  2. Scheduled Change Impact: Around 3 PM, a scheduled change required a switchover to the second replica. Fortunately, this replica was already prepared and synchronized, so we proceeded with the switchover without additional issues.
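
For reference, both steps map onto standard patronictl operations; this is only a sketch, and the config path, cluster name, and member name below are placeholders:

```bash
# Reinitialize the lagging replica from the current leader
# (wipes and rebuilds its data directory; cluster and member names are placeholders)
patronictl -c /etc/patroni.yml reinit my-cluster replica2

# Planned switchover to the rebuilt replica once it is back in sync
patronictl -c /etc/patroni.yml switchover my-cluster --candidate replica2
```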

Following the switchover, the original leader entered a "starting" state. Multiple attempts to restart the Patroni service and perform a pg_rewind were unsuccessful. Further analysis indicated that a required WAL file was no longer available on any node.
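
The recovery attempts roughly followed the sequence below; the systemd unit name, data directory, connection string, and WAL segment name are illustrative only:

```bash
# Restart the Patroni-managed member (assumes a systemd unit named "patroni")
sudo systemctl restart patroni

# Manual pg_rewind attempt against the new leader
# (data directory and connection string are illustrative)
pg_rewind \
  --target-pgdata=/var/lib/postgresql/data \
  --source-server="host=new-leader port=5432 user=postgres dbname=postgres" \
  --progress

# Check whether the WAL segment pg_rewind reports as missing is still present locally
# (the segment name below is hypothetical)
ls /var/lib/postgresql/data/pg_wal | grep 000000020000001200000034
```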

Impact on Resiliency

The incident has temporarily reduced system resiliency, as one of the replicas now requires a full rebuild. Fortunately, there is no immediate plan to fail the application back to the previous nodes, so operational continuity is maintained for the time being.

Key Takeaways and Recommendations

  1. Monitor System Resources:

    • If a replica falls out of sync without apparent triggers like server or database restarts, it is essential to investigate memory and CPU usage, including swap utilization on the leader. Proactively monitoring these resources can help preempt further lag or sync issues.
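
    As a sketch, replication lag and overall cluster state can be checked from the leader with a query against pg_stat_replication and with patronictl; the config path is illustrative, and the column names assume PostgreSQL 10 or later:

```bash
# Per-replica replication lag as seen from the leader (PostgreSQL 10+ column names)
psql -x -c "
  SELECT application_name,
         state,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
         replay_lag
  FROM pg_stat_replication;"

# Cluster-wide view of members, roles, and lag (config path is illustrative)
patronictl -c /etc/patroni.yml list
```
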
  2. Optimize Large Database Reinitializations:

    • When dealing with large databases, rebuilds can be time-consuming; in this case, the 1TB database took over 7 hours to reinitialize. pg_basebackup itself runs as a single process, so the practical levers for speeding it up are a fast checkpoint, streaming WAL during the copy, and relaxing (or removing) any rate limit; genuinely parallel copies would require a different replica-creation method.

    Here’s an example of setting these pg_basebackup options in the Patroni configuration:

```yaml
postgresql:
  basebackup:
    max-rate: '100M'       # limit backup bandwidth (optional; remove to copy at full speed)
    checkpoint: 'fast'     # request an immediate checkpoint so the copy starts right away
    wal-method: 'stream'   # stream WAL alongside the base backup
```
    • checkpoint: 'fast' forces an immediate checkpoint on the leader, so the base backup begins without waiting for a spread checkpoint.
    • wal-method: 'stream' streams WAL files during the copy, reducing the catch-up time after the reinit completes.
    • max-rate caps the bandwidth used by the copy; loosening or removing it lets the rebuild run at full speed if the leader can absorb the extra load.
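
    For comparison, a manual rebuild with the same options would look roughly like the following; the host, user, and target directory are placeholders:

```bash
# Manual base backup from the leader using the same tuning options
# (host, user, and target directory are placeholders)
pg_basebackup \
  -h leader-host -U replicator \
  -D /var/lib/postgresql/data \
  --checkpoint=fast \
  --wal-method=stream \
  --max-rate=100M \
  -R -P
```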

We will continue monitoring the health of all nodes and proceed with necessary rebuilds and adjustments to restore full resiliency. Please reach out if there are any immediate concerns or additional questions.
