The 3000-dollar typo
Aah, the wondrous world of AWS and its time-based billing…
A couple of jobs back, we were working on transforming a big legacy application to run as a cloud-based service, AWS being the cloud provider of choice.
So far, so good. One day, I was trying to make sure some of the legacy SPOFs could migrate freely between availability zones (multi-AZ EBS volumes weren’t a thing back then). The idea was quite nifty, I thought: use the Data Lifecycle Manager to create snapshots of each data volume, so the last snapshot would never be more than 2 hours old (the minimum snapshot interval at the time), and then, in case an EC2 instance switched AZs, take a final snapshot and recreate the volume in the correct AZ.
So there I went and wrote up a Lambda function that could be triggered by the EC2 startup and shutdown hooks. It would halt the machine, make sure the volume was available in the right AZ and then continue the boot. After a couple of tries it all worked reasonably well, and I decided to try it on one of our development environments. These were shut down every night, so I expected to see some AZ movements there.
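For the curious, the "make sure the volume is in the right AZ" step can be sketched roughly like this. This is not the original Lambda, just a minimal illustration of the idea, assuming `ec2` is a boto3 EC2 client (or anything with the same methods). The function name `ensure_volume_in_az` and its parameters are made up for this example.

```python
def ensure_volume_in_az(ec2, volume_id, target_az):
    """Return the ID of a volume with the same data that lives in target_az.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    if volume["AvailabilityZone"] == target_az:
        # Already in the right AZ, nothing to do.
        return volume_id

    # Take a final snapshot so nothing written since the last
    # scheduled snapshot gets lost.
    snapshot_id = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"final snapshot before moving {volume_id} to {target_az}",
    )["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Recreate the volume from that snapshot in the AZ the instance landed in.
    new_volume_id = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=target_az,
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume_id])
    return new_volume_id
```

Passing the client in as an argument keeps the function easy to exercise with a fake client before it ever touches a real account. Attaching the new volume and cleaning up the old one are left out here, and a real version would also have to keep the waiters within the Lambda timeout.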
After applying the Terraform code, I found a log message that could have been more helpful and adjusted it manually, as you sometimes do. I called it a day and went on to do something different the next day. At the end of that next day, I finally got around to fixing the log message in the code and applying it to the environment.
Now... A couple of days later, my manager came up to the team to ask about the 3000 dollars in logging costs that had run up within just a day. Upon investigation, we found out that it was my Lambda :D
Usually, ordering a new EC2 instance and waiting for it to boot feels like it takes forever. But somehow, halting an instance, having a Lambda function throw a Python stack trace and discarding the instance to try again is fast enough to rack up a couple of thousand dollars in CloudWatch logging :D