Tech @ Amber: It all started with a very large AWS bill...

This is the first blog post from Amber's engineering team. Want to tackle climate change using technology? Check out our current job openings.

There comes a time in every young startup’s life (usually when the free credits that you scooped up from your accelerator program run out) when you discover you’ve accidentally run up a giant infra bill and you have no idea why.

It happened to Segment with “the million dollar engineering problem”, it happened to Nike, and, naturally, it happened to us.

In November 2020 DynamoDB had cost us $1k USD to run.

In December 2020 our DynamoDB bill increased fivefold to $5k USD. 

In January 2021 we spent $13k USD on DynamoDB.

The pattern of the increase in cost was strange. As you can see in the chart above, there was a steady increase from mid-December, a steep drop-off in mid-January, followed by another steep increase.

While our customer base was growing each month, our DynamoDB costs were anything but linear. What was going on?

Barking up the wrong tree

At the beginning of my investigation I noticed that the mid-December inflection point coincided with the soft-release of a very popular feature: custom price notifications. This feature allows users to get a push notification whenever the wholesale price of energy passes a threshold they choose.

Although we hadn't publicly promoted the feature our customers steadily started to discover it and use it. I wondered if our exploding DynamoDB costs might be related to scaling issues with this feature.

A simplified diagram of our custom price notifications architecture.

Our initial architecture for the feature worked like this: if a notification hadn’t been sent in the previous two hours, and the price was above the notification threshold, we sent a push notification to the user's devices. Regardless if the notification had been sent or not we recorded the result of this operation to our database.

My investigation revealed an issue with this architecture. New prices are published every 5 minutes, and as more users opted into the feature, the lambda could no longer complete all the devices before new 5 minute prices were published. Writes to the database were also being throttled.

We re-architected away from this problem, delegating the work to queues which allowed the work to be performed concurrently.

Meanwhile, writes to our demand-response table had been steadily increasing since mid-December. This didn't account for the sharp drop-off in DynamoDB costs in mid-January, but it seemed like a potential culprit.

As wholesale prices are normally lower than the thresholds customers set, the majority of writes to the database were a record that no notifications had been sent. We mainly used these for occasional debugging. Since good logs should serve the same purpose, we decided to only write to the database when a notification had actually been sent.

This adjustment reduced the writes to the database (and cost) by a bit, but it wasn’t enough to account for the sharp increase in our DynamoDB costs on its own.

As we can see from the below graph of costs associated with the demand-response table, the change took effect on the 10th of February. We decided to keep the change, but it didn’t address the root cause of our problem.

Getting closer

The above graph shows the four tables accessible to CloudWatch Metrics Explorer. You can see that writes to the customer-details table peaked at around the same time as we’d reduced writes to demand-response, which could have been hiding any gains we'd made. The costs associated with the customer-details table had the opposite pattern to our bills: a mid-month peak and then a gradual taper off.

We wondered if there was a business reason for this pattern. I checked with our marketing team to see if there were any mid-month customer acquisition pushes that could explain the pattern, but our customer acquisition is fairly consistent. If anything, we acquire more customers at the beginning of the month.

I started trying to cost what I was seeing. How much should a customer, or a SmartShift™ device cost us? (SmartShift™ is technology we’ve developed to automatically optimise run times for home appliances based on the wholesale price of energy.) And while I didn't get to an exact figure, I realised that our costs per customer should be much lower than what was reflected on our bill.

The CloudWatch Metrics Explorer is great for creating multiple graphs to look at, but can only access tagged resources. I called another member of Amber’s engineering team, Kent, to try to determine the cost of a SmartShift™ device. We decided that couldn't be the culprit. Kent asked pointed questions about the investigation so far. Using the older CloudWatch metrics interface, we started overlaying additional tables and noticed that:

  • The usage table decreased its number of consumed WriteCapacityUnits every time customer-details increased.
  • The usage table patterns matched our AWS bill very well.
  • The usage table metrics could not be plotted on the same axis as the other tables as it was several magnitudes larger.

The true culprit

The usage table’s consumed WriteCapacityUnits had increased much faster than we’d acquired customers with actual usage to store. This unit consumption followed the same pattern as our bill in Cost Explorer: a gradual increase and a sharp mid-month cut off.

So what had happened back in December 2020 to cause this change?

Customer usage data for the previous day usually arrives in the early hours of the morning. We would store a cursor on the customer-details table containing the last valid date for which we have usage. A cron job would run every half hour and do the following:

  • Pull all our customers
  • Check each customer to see if we had valid usage for yesterday. If we did it would ignore the customer
  • If not, we would attempt to pull usage for the time period that we did not have usage for
  • Update the cursor stored on customer-details with the last valid date received

Unfortunately, some customers’ usage was often published out of chronological order.

Here’s an example. Let’s say a customer had missing usage on the 10th, 11th, and 12th. On the 13th, we’d attempt to fetch the missing data for the 10th, 11th and 12th. We’d receive usage for the 11th and 12th, and update the cursor date to the 12th. We didn’t go back to retrieve data for the 10th, because the last valid day of usage was the 12th.

In December I pushed a change to the way we calculate the cursor date. If we got fewer valid days than we requested, we would not update the cursor.

In the example above we requested 3 days of usage data but only received 2. We would leave the cursor date as the 10th, and try to fetch the data again later in the day. When the usage data for the 10th was eventually published we’d request data for 10th, 11th and 12th, hopefully receive 3 days worth of data, and update the cursor to the 12th.

We also decided that if the cursor date was more than 30 days ago, we would update the cursor to the last valid usage date (sometimes there are gaps in usage that are never properly filled).

This change worked very well for the two customers who frequently experienced this problem. It also worked on the few accounts I spot checked. Unfortunately, those accounts all had one thing in common; they only had one meter channel. Many customers actually have multiple meter channels - with additional channels being used for solar or controlled load.

When fetching the previous day's data for a customer with three meters, we would get three days of valid usage (one for each meter). The code would check whether we got the expected number of days (1) and then wouldn’t update the cursor because we’d received three days of usage data instead.

We were writing increasing amounts of usage data for those customers every half hour. Usage is recorded in half hour increments, so each day there are 48 entries per meter channel. This meant that for some customers, we were writing 48 * 3 * 30 = 4320 entries, every half hour as we approached 30 days from the original change.

By the time we'd gotten to 30 days, the code would assume an un-fillable gap in usage, and reset the cursor to the last valid usage day fetched. This is what caused the mid-month drop off, and then gradual increase.

After the investigation the fix itself was quite straightforward: we needed to divide the number of valid usage days by the number of meters before checking whether we got the number of days we expected. And several days on, our costs and consumed WriteCapacityUnits looked a lot more normal.

All in all, this was a lesson in not letting preconceived ideas guide an investigation too much, as the custom notifications feature was not to blame.

This was also a lesson in making sure that our explanation fit all observations, as user adoption of the custom notifications feature didn’t have the same ramp up and drop-off that usage did.