PLTcloud Performance Degradation on 2024-10-23

For the first time in over 5 years, PLTcloud suffered a noticeable performance degradation, which prevented many test reports from being uploaded from PLTs to PLTcloud. This is core functionality of the system, so we treated it as a blocker with maximum priority.

What caused the issue?

All of the health monitoring dashboards in AWS were green, which made this hard to track down. We did see longer-than-usual latency on calls to the database, but this was not the root cause. We suspect that an AWS ECS patch issued at the same time was related.

How did we address the issue?

We restarted a PLTcloud component that was showing degraded performance, and this visibly reduced latency and CPU load.

How can you find meaningful information when there are issues with PLTcloud?

The first place to check for issues like this is status.pltcloud.com, which unfortunately was of no help this time. Opening a ticket is good because we can flag it and link it to bugs we are tracking internally. In this case, the issue was completely new. It is fine to call any of us at Blue Clover for an urgent matter like this. We will share updates here in the Troubleshooting section of the Support Center.

What are we doing to prevent this in the future?

1. Improve the status page to include actions that go deeper than connecting to PLTcloud (2~3 weeks).
2. Implement a store-and-forward feature so reports can be saved temporarily on the PLTs until service is fully restored (2~3 months). Note that this will not help for test plans that include webhooks.
3. Carry out the database upgrades that were already planned for the next 2 months, which will likely also improve the resiliency of PLTcloud.
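
For readers unfamiliar with the term, the general store-and-forward idea in item 2 can be sketched as follows. This is only an illustration of the technique under stated assumptions, not the PLTcloud implementation: the class name `StoreAndForward`, the spool directory, and the `upload` callable are all hypothetical, and a failed upload is assumed to raise `OSError` (which includes `ConnectionError` and timeouts in Python).

```python
import json
from pathlib import Path
from typing import Callable


class StoreAndForward:
    """Keep reports on local disk when an upload fails, and retry later.

    Hypothetical sketch: `upload` stands in for whatever call sends a
    report to the server and is assumed to raise OSError on failure.
    """

    def __init__(self, spool_dir: Path, upload: Callable[[dict], None]) -> None:
        self.spool_dir = spool_dir
        self.upload = upload
        self.spool_dir.mkdir(parents=True, exist_ok=True)
        # Continue numbering after the newest file already in the spool.
        existing = sorted(self.spool_dir.glob("*.json"))
        self._seq = int(existing[-1].stem) if existing else 0

    def submit(self, report: dict) -> bool:
        """Try to upload now. Returns True if sent, False if stored locally."""
        try:
            self.upload(report)
            return True
        except OSError:
            self._store(report)
            return False

    def _store(self, report: dict) -> None:
        # Zero-padded sequence numbers keep files sorted oldest first.
        self._seq += 1
        path = self.spool_dir / f"{self._seq:08d}.json"
        path.write_text(json.dumps(report))

    def pending(self) -> int:
        """Number of reports still waiting on disk."""
        return len(list(self.spool_dir.glob("*.json")))

    def flush(self) -> int:
        """Forward stored reports oldest first; stop at the first failure."""
        sent = 0
        for path in sorted(self.spool_dir.glob("*.json")):
            report = json.loads(path.read_text())
            try:
                self.upload(report)
            except OSError:
                break  # service still unavailable; keep the rest for later
            path.unlink()  # delete only after a successful upload
            sent += 1
        return sent
```

In a sketch like this, `flush()` would be called periodically or once connectivity returns. A stored report is deleted only after its upload succeeds, so a failure partway through a flush leaves the remaining reports on disk.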

As we refine our plan, we will update this post.
