Thursday, 31 October 2024


 

What should they continue doing?

  • Continue Leveraging Expertise in PostgreSQL and Audit Requirements: Their insights into PostgreSQL, especially in an on-premises setup, and understanding of audit requirements are highly valuable. By continuing to offer guidance in these areas, they help the team maintain compliance and optimize database performance.

  • Sustain Efforts in Cloud Knowledge and Automation: Their experience with Azure Cloud and automation provides the team with a strategic edge in cloud management and efficiency. Their expertise here is essential for any initiatives involving cloud migration, cost optimization, or automated operations.

What could they do to have more impact?

  • Lead Automation and Cloud Strategy Workshops: Given their strong background in Azure Cloud and automation, they could lead workshops or working sessions focused on best practices for automating workflows and managing resources in the cloud. This would empower team members to increase productivity and improve cloud management skills.

  • Develop PostgreSQL and Audit Playbooks: To leverage their PostgreSQL and audit expertise, they could create playbooks or guides for managing on-premises PostgreSQL environments and ensuring compliance with audit requirements. This documentation would be a practical resource for the team, particularly in onboarding and troubleshooting.

Posted By Nikhil 18:19

Tuesday, 29 October 2024


 Meeting Speech on Recent Incident – Wrong Task Closure Impacting Workflow Progress


Good [morning/afternoon], everyone. Thank you all for joining this important meeting. I wanted to address a recent incident that had a significant impact on our operations, specifically within our workflow system. It’s essential we discuss this openly so we can understand what went wrong, mitigate any ongoing risks, and ensure it doesn’t happen again.


Incident Overview


As many of you are aware, there was a mistake involving the closure of a critical task in our workflow. This task was closed incorrectly, which caused a ripple effect in the system. As a result, subsequent tasks in the workflow – most notably those related to the server lifecycle change stage – were not triggered.


The missed lifecycle change stage impacted our ability to carry out key server transitions, including upgrades, patching, and other planned activities. This delay introduced unnecessary risks, including potential exposure to security vulnerabilities and operational slowdowns.


Impact Summary


1. Disrupted Workflow: Further tasks could not be opened, effectively stalling the entire workflow.



2. Server Lifecycle Management Delays: Important stages, such as patch updates and decommissioning, were not triggered on time.



3. Operational Risk: The delay in server changes leaves us open to compliance and performance risks, especially in environments requiring strict timelines.



4. Time and Effort: Teams had to scramble to identify the issue and resolve the blockage, taking valuable time away from their primary tasks.




Root Cause


After investigation, it was identified that the root cause was an incorrect task closure, either due to human error or a misunderstanding of task dependencies in the workflow system. The system didn’t flag the improper closure, and no escalation mechanism was in place to detect the workflow failure immediately.


Corrective Measures Taken


1. Immediate Task Re-opening: The affected tasks have been reopened, and the workflow has resumed.



2. Lifecycle Tasks Prioritized: The server lifecycle changes have been scheduled to minimize further delays.



3. Temporary Alerts Implemented: Teams will now receive notifications if critical tasks are closed improperly.




Preventive Measures Going Forward


1. Clearer Task Closure Guidelines: We will provide all relevant teams with specific instructions on how and when tasks should be closed.



2. System Enhancements: We are exploring adding validation rules or automated checks to prevent the improper closure of critical tasks.



3. Training and Awareness: A refresher session will be conducted for all staff handling workflow tasks to avoid future errors.



4. Monitoring and Alerts: We are implementing enhanced monitoring to flag and escalate blocked workflows in real-time.




Conclusion


I want to emphasize that while this incident was unfortunate, it’s also a valuable learning opportunity. Mistakes happen, but it’s how we respond that defines us. I appreciate everyone’s quick efforts to resolve the issue. Let’s use this experience to strengthen our processes and prevent similar issues in the future.


Thank you all for your attention. If anyone has further insights or questions, let’s open the floor for discussion now. Your input will be crucial in refining our workflows and ensuring smooth operations moving forward.


Posted By Nikhil 17:40

 Subject: Meeting Request: Review of Escalated Issue Handling


Hi [Team/Participants' Names],


I’d like to set up a short meeting to review how we handled the recent escalated issue around task closures and the resulting audit work. The goal is to reflect on what went well, what could have been handled differently, and gather everyone’s perspective to improve our processes going forward.


Please let me know your availability.


Best regards,

[Your Name]


Posted By Nikhil 07:09

Monday, 28 October 2024


 Subject: Clarification on Ticket Cancellation


Dear [Escalation Manager's Name],


The tickets in question were opened in June 2023 and remained inactive. During the queue cleanup in July 2024, I canceled them as they were outdated, rather than reassigning them to the migration team, which originally owned the task.


Additionally, I identified two more hosts in maintenance mode and have raised a ticket with the correct team to resolve the issue.


Let me know if further clarification is needed.


Best regards,

[Your Name]

[Your Position]


Posted By Nikhil 23:36

 https://mc.gov.sg/mc/bplm3bti4c6w4lou4aq7u6e4r4

Posted By Nikhil 02:07

Saturday, 26 October 2024


 It was a great collaboration today—thank you, everyone, for your contributions and teamwork!

Attached below are the evidence files from today’s test. Please review them as needed, and let me know if there are any questions or further actions required.

Looking forward to our continued progress!

Posted By Nikhil 04:00

pg


During the early APAC hours, we observed an incident affecting the synchronization of PostgreSQL replicas. Below is a summary of the issue, actions taken, and key takeaways.

Incident Details

  • Time of Detection: Early APAC hours
  • Observation: Both PostgreSQL replicas were out of sync with the leader.

Initially, we noticed that while one replica was catching up with the leader, the other replica continued to experience an increasing lag. Fortunately, we held off on performing a reinit/rebuild for the first replica, which eventually synchronized with the leader after a couple of hours.
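
For context, a quick way to see how far each replica is behind is to query pg_stat_replication on the leader, or to ask Patroni directly. A minimal sketch (connection details and the config path are illustrative):

```bash
# On the leader: replay lag per replica, in bytes
psql -x -c "SELECT application_name, state, sync_state,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
            FROM pg_stat_replication;"

# Cluster-wide view (leader, replicas, and lag) from any Patroni-managed node
patronictl -c /etc/patroni/patroni.yml list
```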

Upon further investigation, we identified that swap memory usage was significantly high on the leader node, though a more detailed analysis was not possible due to other scheduled activities, including China’s Business Continuity Management (BCM) tests.
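
To confirm whether swap pressure on the leader is actually the trigger, plain OS tooling is enough; a minimal sketch of the checks (thresholds and process names will vary by host):

```bash
# Overall memory and swap usage
free -h

# Swap-in/swap-out activity over ~10 seconds (si/so should ideally stay near 0)
vmstat 1 10

# Largest resident processes, to see whether PostgreSQL backends are being squeezed
ps -eo pid,comm,rss --sort=-rss | head -n 15
```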

Actions Taken

  1. Replica Reinitialization: Due to the continuous lag increase on the second replica, a reinitialization was initiated (see the sketch after this list). However, the reinitialization extended beyond four hours.
  2. Scheduled Change Impact: Around 3 PM, a scheduled change required a switchover to the second replica. Fortunately, this replica was already prepared and synchronized, so we proceeded with the switchover without additional issues.
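
For reference, a reinitialization of this kind is normally driven through Patroni rather than by running pg_basebackup by hand; a minimal sketch, assuming a standard patronictl setup (the config path, cluster name, and member name are placeholders):

```bash
# Confirm which member is lagging and which node is the leader
patronictl -c /etc/patroni/patroni.yml list

# Wipe the lagging member's data directory and rebuild it from the current leader
patronictl -c /etc/patroni/patroni.yml reinit <cluster-name> <lagging-member>
```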

Following the switchover, the original leader entered a "starting" state. Multiple attempts to restart the Patroni service and perform a pg_rewind were unsuccessful. Further analysis indicated that a required WAL file was no longer available on any node.
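
For completeness, a pg_rewind attempt in a Patroni cluster typically looks like the following (the data directory and hostname are placeholders). pg_rewind needs the WAL segments from the point of divergence, so once those have been recycled on every node, a full rebuild from the new leader is the only remaining option:

```bash
# Run as the postgres OS user on the old leader, with PostgreSQL stopped on that node
pg_rewind \
  --target-pgdata=/var/lib/pgsql/data \
  --source-server="host=<new-leader> port=5432 user=postgres dbname=postgres" \
  --progress
```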

Impact on Resiliency

The incident has temporarily reduced system resiliency, as one of the replicas now requires a full rebuild. Fortunately, the application has no immediate plan to switch back to the previous replica, so operational continuity is maintained for the time being.

Key Takeaways and Recommendations

  1. Monitor System Resources:

    • If a replica falls out of sync without apparent triggers like server or database restarts, it is essential to investigate memory and CPU usage, including swap utilization on the leader. Proactively monitoring these resources can help preempt further lag or sync issues.
  2. Optimize Large Database Reinitializations:

    • When dealing with large databases, rebuilds can be time-consuming; in this case, the 1TB database took over 7 hours to reinitialize. pg_basebackup itself is single-threaded, so the practical levers are streaming WAL during the copy, requesting a fast checkpoint, and relaxing any rate limit; a truly parallel rebuild would need an external backup tool such as pgBackRest.

    Here’s an example of tuning the pg_basebackup options in the Patroni configuration:

    ```yaml
    postgresql:
      basebackup:
        max-rate: '100M'      # limit the backup rate (optional; remove to let storage and network set the pace)
        checkpoint: 'fast'    # request an immediate checkpoint instead of a spread one
        wal-method: 'stream'  # stream WAL files alongside the data copy
    ```
    • checkpoint: 'fast': avoids waiting for a spread checkpoint before the copy starts.
    • wal-method: 'stream': streams WAL during the backup, reducing the catch-up time once the reinit completes.

We will continue monitoring the health of all nodes and proceed with necessary rebuilds and adjustments to restore full resiliency. Please reach out if there are any immediate concerns or additional questions.

Posted By Nikhil 02:32

Friday, 25 October 2024


 Thanks for informing us about the backup disablement for certain development databases.

[DBA's Name], please take note of this update in the thread below regarding the backup disablement. Given this change, I suggest we consider moving these databases to NOARCHIVELOG mode to prevent potential space issues, which could lead to outages.
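
For reference, the switch itself is a short, offline operation; a minimal sketch (run as SYSDBA during an agreed outage window, and only on the development databases whose backups have been disabled):

```bash
# Requires a brief outage: the database must be restarted in MOUNT state
sqlplus / as sysdba <<'EOF'
SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE NOARCHIVELOG;
ALTER DATABASE OPEN;
-- the log mode should now read "No Archive Mode"
ARCHIVE LOG LIST
EOF
```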

Additionally, please liaise with the backup team should you need to re-enable backups at any point. I assume any such reactivation would require exceptional approvals.

Posted By Nikhil 18:15

Saturday, 19 October 2024


 As discussed, we will keep the BCP switchback/rollback task open until Monday as requested.

In case a switchback is requested, please assist with the process. Otherwise, kindly proceed with closing the CTASK before the change window closes to avoid any CM18 compliance issues.

Let me know if you need any further details.

Best regards,

Posted By Nikhil 01:46

Friday, 18 October 2024


I have a sense that the user may escalate the issue further since they're not getting responses from the available team members. For now, I'm replying with some details and have asked Prashant to allocate a resource to look into it. I’ve looped myself in as well, in case there’s any follow-up needed.

Posted By Nikhil 06:49

 I’m confirming that I will be the point of contact for tomorrow’s APAC activity, as always.

In the future, it would be really helpful if we could receive notifications by Thursday so we can better plan for implementation day.

Looking forward to supporting the activity!

Best regards,

Posted By Nikhil 05:57

 Assistance with Identifying Database/Block Corruption and Verifying Backup Integrity

Dear EDB Support Team,

I hope this message finds you well. We would like to raise a case to understand how we can identify potential database or block corruption within our PostgreSQL environment. Specifically, we are looking for guidance on the following:

  1. Identifying Database or Block Corruption:
    Could you provide insights on the utilities or methods available through native PostgreSQL tools to detect database or block corruption? We would appreciate a detailed procedure or steps for performing such checks.

  2. Verifying Backup Integrity:
    Is there a way to verify whether there is any corruption in backups that have recently been taken? Any best practices or tools within PostgreSQL for ensuring the reliability and integrity of backups would be very helpful.

We are primarily looking for recommendations on using native PostgreSQL utilities and tools, and any documentation or step-by-step procedures would be greatly appreciated.

Looking forward to your insights on these matters.

Posted By Nikhil 05:30

 Hi [Lead's Name],

I hope you’re doing well. We’ve received an escalation email from [User's Name] regarding an issue where they are unable to view database targets in OEM. Could you kindly allocate a resource from the team to assist with this matter and guide [User's Name] through the resolution process?


Hi [User's Name],

I wanted to apologize for the delay in responding to your email, as it was sent during my off-work hours. I also want to mention that I am not the sole point of contact for these issues, and the correct path would have been to reach out via the designated chat channel or the Team DL email for a more immediate response.

Regarding your concern about missing database targets in OEM, could you please confirm if you have raised a formal ticket with all the relevant information? This is our standard process, and having a ticket with complete details will help us address the issue effectively.

I also noticed that you didn’t specify which databases you expect to see in the Production OEM environment; your email only includes a screenshot of the entitlements you currently have. Could you clarify which specific targets appear to be missing?

Before proceeding, I’d recommend referring to the available assist pages for troubleshooting steps, and then raise a ticket through the standard support system if the issue persists. This will ensure that the right teams can be looped in and work towards a resolution in a timely manner.

Let me know how you wish to proceed, and we will coordinate accordingly.

Best regards,

Posted By Nikhil 04:53

 I’ve noticed that DBAs are raising tickets for TPAI access, even though we previously agreed to use existing ones. While the volume isn’t high, it still impacts our metrics.

Please see the attached report and let me know if you can’t access it.

Posted By Nikhil 00:26

Thursday, 17 October 2024


 I often find myself copying large blocks of text that are in a tabular format or require specific spacing for clarity. However, when I paste this content into Snow tickets, the formatting is lost, making it difficult to read and understand.

Is there a workaround to preserve the formatting when pasting? Any tips or best practices for maintaining readability in our Snow tickets would be greatly appreciated.

Posted By Nikhil 19:09

 Thank you for your recent change request regarding grant privileges.

I would like to inform you that this request does not require a formal change ticket. Instead, you can utilize our standard catalog and submit a request through that system. This approach is more efficient for granting privileges.

If you need any assistance with the catalog or have further questions, please feel free to reach out.

Posted By Nikhil 17:52

stack


 When I was on my certification journey, I was introduced to StackOverflow within my organization. At that time, I was diving deep into various topics, especially related to databases. As you know, preparing for a certification means constantly solving problems and clarifying concepts. Naturally, I kept looking for answers, and that’s when I discovered how StackOverflow could be an incredible platform—not just for asking questions, but for learning from experts and sharing knowledge.


Within our organization, we had a similar platform integrated with StackOverflow's framework where team members would ask questions related to databases, and many of these questions, unfortunately, went unanswered for days. It was obvious that the platform needed more active engagement.

I started looking at the unanswered questions and quickly realized that some of them were right up my alley. With my focus on Oracle and PostgreSQL databases, I could step in and provide solutions, often related to query optimization, performance tuning, or error diagnostics—things I had been working on myself.

As I kept answering more and more questions, something amazing happened—I started gaining points on the platform, which not only felt rewarding but also pushed me to get more involved. The community within the organization started recognizing my efforts, and it created a ripple effect. The same people who had unanswered questions began actively participating, and we all started helping each other out, making it a much more collaborative space.

In the end, not only did this engagement help me solidify my own understanding and prepare better for my certification, but it also helped others in the organization. And, of course, those extra points I earned? They were a nice bonus!

Posted By Nikhil 05:00

Wednesday, 16 October 2024


 ChatGPT 4o mini

Posted By Nikhil 16:25