Feature Flag target segments are failing in Prod-1/2 for customers with no target groups

Incident Report for Harness

Postmortem

Summary

Customers with no target groups configured received null instead of [] from the /target-segments request when their SDKs started up. This could lead to null pointer exceptions and, for some SDKs, a failure to initialise.

SDK Customer Impact

The issue affected a number of SDK versions, so we tested those versions and the latest versions to ascertain impact.

Java

1.3.1:
- Behaviour: the waitForInitialization call never unblocks once the exception is thrown and caught in the polling thread. This would have caused user code to “freeze” while the SDK was initialising.
- Impact: critical impact, the SDK blocks user code from executing.
1.6.0 - latest version:
- Behaviour: An error is logged that the group was null and that the size of the group could not be calculated. Flags are loaded correctly, the waitForInitialization call unblocks, and evaluation calls return the correct variation.
- Impact: no functional impact outside of error logs. The correct evaluation will be returned.

Node.js

1.3.1:
- Behaviour: UnhandledPromiseRejection causes the SDK and application to crash. If the SDK client and waitForInitialization were used in a try-catch block, then an error would be logged and the SDK would serve the correct evaluations.
- Impact:
  - If no exception handling was used on the client, critical impact: the user’s application would crash.
  - If exception handling was used, no functional impact outside of error logs. The correct evaluation will be returned.
1.8.1 - latest version: same behaviour and impact as 1.3.1

Other SDK impact

The remaining server SDKs have been tested on their latest versions to ascertain impact. While there were no direct customer reports of issues, this is useful to understand the scope of this issue.

Erlang 3.0.0: Critical impact, exception thrown could cause an application not to start, depending on how the SDK has been integrated.
Python 1.6.2: High impact, SDK fails to initialise and serves default variations.
.NET 1.7.0: No impact, but error is logged.
Go v0.1.23: No impact.
Ruby 1.3.0: No impact.

RCA

Why did some customers experience null pointer exceptions?

In the scenario where a customer had 0 target groups, the /client/target-segments endpoint returned the value null instead of an empty array [].

Why was null being returned instead of an empty array?

A change was made in the DB layer of the backend to return an empty array rather than a not found error when no target groups exist for a customer. As a result, these requests took a different code path. This code path copies all groups into a new array and changes some data before marshalling and returning the JSON response. Because no groups existed, this copy mistakenly returned a nil object instead of an empty array, which was then marshalled into a null JSON response.
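
The failure mode described above can be reproduced in a few lines. The following is a minimal sketch, assuming a Go backend that uses the standard `encoding/json` package; the `Segment` type and the function names are hypothetical and are not taken from the actual service. It shows how a copy into a declared but unallocated slice stays nil when there is nothing to copy, and how that nil is encoded as null rather than [].

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Segment is a hypothetical stand-in for a target group record.
type Segment struct {
	Identifier string `json:"identifier"`
	Name       string `json:"name"`
}

// copySegmentsNil mirrors the failure mode: the destination slice is declared
// but never allocated, so with zero inputs it stays nil.
func copySegmentsNil(in []Segment) []Segment {
	var out []Segment
	for _, s := range in {
		out = append(out, s)
	}
	return out
}

// copySegmentsEmpty allocates the destination up front, so zero inputs yield
// an empty, non-nil slice.
func copySegmentsEmpty(in []Segment) []Segment {
	out := make([]Segment, 0, len(in))
	for _, s := range in {
		out = append(out, s)
	}
	return out
}

// encode marshals a slice of segments to its JSON text.
func encode(segments []Segment) string {
	b, err := json.Marshal(segments)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(copySegmentsNil([]Segment{})))   // prints: null
	fmt.Println(encode(copySegmentsEmpty([]Segment{}))) // prints: []
}
```

Allocating the destination with `make` is one way to guarantee an empty array in the response; the actual fix applied to the service may differ.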

Why was that change made to begin with?

Because of our high request rates, we use many layers of caching. A side effect of returning errors from the DB layer when no target groups exist was that we wouldn’t cache that response. With some high-volume customers having no target groups, this led to tens of millions of unnecessary requests hitting the database per week when flags are evaluated, which we were attempting to avoid.
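
To illustrate why the error response defeated caching, here is a minimal sketch in Go. The `cachedLoader` type, the loader functions and the account name are all hypothetical; the real caching layers are more involved. It assumes a cache that stores successful results only, so a not found error is looked up again on every request, while an empty result is stored once and then reused.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("no target groups")

// cachedLoader caches successful loads only; errors are never stored.
type cachedLoader struct {
	cache   map[string][]string
	dbCalls int
	load    func(account string) ([]string, error)
}

func newCachedLoader(load func(string) ([]string, error)) *cachedLoader {
	return &cachedLoader{cache: map[string][]string{}, load: load}
}

// Get returns the cached value if present, otherwise loads it from the
// (simulated) database and caches it only when the load succeeds.
func (c *cachedLoader) Get(account string) ([]string, error) {
	if v, ok := c.cache[account]; ok {
		return v, nil
	}
	c.dbCalls++
	v, err := c.load(account)
	if err != nil {
		return nil, err
	}
	c.cache[account] = v
	return v, nil
}

// loadAsError models the old behaviour: no groups is reported as an error.
func loadAsError(string) ([]string, error) { return nil, errNotFound }

// loadAsEmpty models the new behaviour: no groups is an empty result.
func loadAsEmpty(string) ([]string, error) { return []string{}, nil }

// countDBCalls issues the given number of requests for one account and
// reports how many of them reached the simulated database.
func countDBCalls(load func(string) ([]string, error), requests int) int {
	c := newCachedLoader(load)
	for i := 0; i < requests; i++ {
		_, _ = c.Get("account-with-no-groups")
	}
	return c.dbCalls
}

func main() {
	fmt.Println(countDBCalls(loadAsError, 1000)) // prints: 1000
	fmt.Println(countDBCalls(loadAsEmpty, 1000)) // prints: 1
}
```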

Why was this scenario not caught by tests?

Unit tests, end-to-end tests and SDK-specific tests exist for this endpoint; however, the case where target groups are empty wasn’t fully covered. This change was primarily meant to improve performance for the /client/evaluations endpoint, which uses this code path and which was manually tested and confirmed to work correctly. The side effects of this change on the /client/target-segments code path weren’t anticipated or caught by automated testing.

Follow up actions

Test enhancements
- Add unit tests for the target groups, covering zero, one and multiple groups
- Add new validation and logging, to ensure valid JSON is always returned by the endpoints
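
As a rough illustration of these two items, the sketch below runs the zero, one and multiple group cases through a response check. It assumes a Go backend; `isJSONArray`, `encodeGroups` and the group names are hypothetical and do not describe the actual service code or test suite.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// isJSONArray reports whether body is valid JSON whose top-level value is an
// array. A body of null is valid JSON but is rejected here.
func isJSONArray(body []byte) bool {
	trimmed := bytes.TrimSpace(body)
	return len(trimmed) > 0 && trimmed[0] == '[' && json.Valid(trimmed)
}

// encodeGroups is a hypothetical response encoder that normalises a nil slice
// to an empty one before marshalling, so the body is never null.
func encodeGroups(groups []string) []byte {
	if groups == nil {
		groups = []string{}
	}
	b, err := json.Marshal(groups)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	// The three cases named in the follow-up action: none, one and multiple.
	cases := []struct {
		name   string
		groups []string
	}{
		{"none", nil},
		{"one", []string{"beta-users"}},
		{"multiple", []string{"beta-users", "internal"}},
	}
	for _, tc := range cases {
		body := encodeGroups(tc.groups)
		fmt.Printf("%s: %s array=%t\n", tc.name, body, isJSONArray(body))
	}
}
```

In a real test suite the same cases would more naturally live in a table-driven test using the `testing` package, with the check applied to the endpoint's actual response body.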

Posted Jun 27, 2024 - 07:24 PDT

Resolved

Between 09:18 and 16:08 UTC, customers with no evaluation groups were seeing `NullPointerException` errors in the SDKs when pulling evaluation rules. In the scenario where a customer had 0 target groups, the /client/target-segments endpoint returned the value null instead of an empty array [].

A change was made in the DB layer of the backend to return an empty array rather than a not found error when no target groups exist for a customer. As a result, these requests took a different code path. This code path copies all groups into a new array and changes some data before marshalling and returning the JSON response. Because no groups existed, this copy mistakenly returned a nil object instead of an empty array, which was then marshalled into a null JSON response.

We can confirm normal operation. Get Ship Done!
We will continue to monitor and ensure stability.

Posted Jun 25, 2024 - 09:15 PDT