What alerts should you have for serverless applications?

A key metric for measuring how well you handle system outages is the Mean Time To Recovery (MTTR), which is basically the time it takes you to restore the system to a working condition. The shorter the MTTR, the faster problems are resolved, the less impact your users experience, and (hopefully) the more likely they are to keep using your product!

And the first step to resolving any problem is knowing that you have one. The Mean Time To Discovery (MTTD) measures how quickly you detect problems, and for that you need alerts, and lots of them.

Exactly what alerts you need depends on your application and what metrics you are collecting. Managed services such as Lambda, SNS, and SQS report important system metrics to CloudWatch out of the box. So, depending on the services you use in your architecture, there are some common alerts you should have. Here are the ones I always make sure to set up.

You might have noticed the regional metrics in the Lambda dashboard page (I know, the dashboard says Account-level even though its own description says they are "in the AWS Region").

The regional ConcurrentExecutions is an important metric to alert on. Set the alert threshold to ~80% of your current regional concurrency limit (which starts at 1000 for most regions).

This way, you will be alerted when your Lambda usage is approaching your current limit so you can ask for a limit raise before functions are throttled.
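
If you manage your alarms in code, a minimal boto3 sketch for this alarm might look like the following. The alarm name, the SNS topic ARN, and the 800 threshold (80% of the default 1,000 limit) are placeholders you would adjust for your own account:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when regional Lambda concurrency exceeds ~80% of a 1,000 default limit.
# Alarm name, SNS topic, and threshold are illustrative placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="lambda-regional-concurrency-80pct",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",  # no dimensions = regional aggregate
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=800,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```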

You may also wish to add alerts to the regional Throttles metric, but this depends on whether or not you're using Reserved Concurrency. Reserved Concurrency limits how much concurrency a function can use, and throttling excess invocations shows that it's doing its job. That throttling can also trigger false positives in your alert, though.

(Note: depending on the function’s trigger, some of these alerts might not be applicable.)

Use CloudWatch metric math to calculate the error rate of a function — i.e., 100 * Errors / MAX([Errors, Invocations]). Align the alert threshold with your Service Level Agreements (SLAs). For example, if your SLA states that 99% of requests should succeed then set the error rate alert to 1%.
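
Here is a sketch of what that metric math alarm could look like with boto3, assuming a hypothetical function name and a 99% success SLA (so a 1% error rate threshold):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
function_name = "my-function"  # hypothetical function name

cloudwatch.put_metric_alarm(
    AlarmName=f"{function_name}-error-rate",
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=5,
    Threshold=1.0,  # 1% error rate, i.e. a 99% success SLA
    TreatMissingData="notBreaching",
    Metrics=[
        {
            "Id": "errors",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Errors",
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "invocations",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Invocations",
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            # the error rate expression from above; this is the value the alarm evaluates
            "Id": "errorRate",
            "Expression": "100 * errors / MAX([errors, invocations])",
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ],
)
```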

Unless you’re using Reserved Concurrency, you probably shouldn't expect the function's invocations to be throttled. So you should have an alert against the Throttles metric.

For async functions with a dead letter queue (DLQ), you should set up an alert against the DeadLetterErrors metric. This tells you when the Lambda service is not able to forward failed events to the configured DLQ.

Similar to above, for functions with Lambda Destinations, you should set up an alert against the DestinationDeliveryFailures metric. This tells you when the Lambda service is not able to forward events to the configured destination.

For functions triggered by Kinesis or DynamoDB streams, the IteratorAge metric tells you the age of the messages they receive. When this metric starts to creep up, it's an indicator that the function is not keeping pace with the rate of new messages and is falling behind. The worst-case scenario is data loss, since data in the streams is only kept for 24 hours by default. This is why you should set up an alert against the IteratorAge metric, so you can detect and rectify the situation before it gets worse.
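
The per-function alarms from the last few paragraphs (Throttles, DeadLetterErrors, DestinationDeliveryFailures, and IteratorAge) all follow the same pattern, so here is a single sketch covering them. The function name and thresholds are illustrative, not prescriptive:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
function_name = "my-function"  # hypothetical function name

# Metric -> (statistic, threshold). The count metrics alarm on any occurrence;
# IteratorAge is in milliseconds, so 15 minutes of lag is an illustrative limit.
per_function_alarms = {
    "Throttles": ("Sum", 0),
    "DeadLetterErrors": ("Sum", 0),
    "DestinationDeliveryFailures": ("Sum", 0),
    "IteratorAge": ("Maximum", 15 * 60 * 1000),
}

for metric, (statistic, threshold) in per_function_alarms.items():
    cloudwatch.put_metric_alarm(
        AlarmName=f"{function_name}-{metric}",
        Namespace="AWS/Lambda",
        MetricName=metric,
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        Statistic=statistic,
        Period=60,
        EvaluationPeriods=1,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
    )
```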

Even if you know what alerts you should have, it still takes a lot of effort to set them up. This is where 3rd-party tools like Lumigo can also add a lot of value. For example, Lumigo enables a number of built-in alerts (using sensible, industry-recognized defaults) for auto-traced functions so you don’t have to manually configure them yourself. But you still have the option to disable alerts for individual functions should you choose to.

Here are a few of the alerts that Lumigo offers:

Furthermore, Lumigo integrates with a number of popular messaging platforms so you can be alerted promptly through your favorite channel.

By default, API Gateway aggregates metrics for all its endpoints. For example, you will have one 5xxError metric for the entire API, so when there is a spike in 5xx errors you will have no idea which endpoint was the problem.

You need to Enable Detailed CloudWatch Metrics in the stage settings of your APIs to tell API Gateway to generate method-level metrics. This adds to your CloudWatch cost but without them, you will have a hard time debugging problems that happen in production.
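
For reference, detailed metrics can also be enabled programmatically for a REST API stage. A sketch with boto3, where the API ID and stage name are placeholders:

```python
import boto3

apigateway = boto3.client("apigateway")

# Turn on detailed (method-level) CloudWatch metrics for every method in a stage.
apigateway.update_stage(
    restApiId="a1b2c3d4e5",  # placeholder API ID
    stageName="prod",        # placeholder stage name
    patchOperations=[
        {"op": "replace", "path": "/*/*/metrics/enabled", "value": "true"},
    ],
)
```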

Once you have per-method metrics handy, you can set up alerts for individual methods.

Seriously, always use percentiles.

So when you set up latency alerts for individual methods, keep two things in mind: use a percentile statistic such as p95 or p99 rather than the Average, and align the alert threshold with your latency SLAs, just as you did with error rates.

When you use the Average statistic for API Gateway's 4XXError and 5XXError metrics, you get the corresponding error rate. Set up alerts against these so you know when you start to see an unexpected number of errors.
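
Putting the two together, here is a sketch of a p99 latency alarm and a 5xx error-rate alarm for a single method. The API name, stage, resource, method, and thresholds (a 1-second latency SLA and a 1% error rate) are all assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Per-method dimensions are only reported once detailed metrics are enabled.
dimensions = [
    {"Name": "ApiName", "Value": "my-api"},
    {"Name": "Stage", "Value": "prod"},
    {"Name": "Resource", "Value": "/orders"},
    {"Name": "Method", "Value": "GET"},
]

# p99 latency alarm, e.g. against a 1-second latency SLA (Latency is in ms).
cloudwatch.put_metric_alarm(
    AlarmName="my-api-get-orders-p99-latency",
    Namespace="AWS/ApiGateway",
    MetricName="Latency",
    Dimensions=dimensions,
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)

# 5xx error-rate alarm: the Average of 5XXError is the error rate (0.01 = 1%).
cloudwatch.put_metric_alarm(
    AlarmName="my-api-get-orders-5xx-rate",
    Namespace="AWS/ApiGateway",
    MetricName="5XXError",
    Dimensions=dimensions,
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=0.01,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```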

When working with SQS, you should set up alerts against the ApproximateAgeOfOldestMessage metric for an SQS queue. It tells you the age of the oldest message in the queue. When this metric trends upwards, it means your SQS function is not able to keep pace with the rate of new messages.
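
A sketch of that alarm, assuming a placeholder queue name and an illustrative 10-minute threshold (the metric is reported in seconds):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="my-queue-oldest-message-age",
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],  # placeholder queue
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=600,  # 10 minutes, in seconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```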

For Step Functions, there are a number of metrics that you should alert on: ExecutionsFailed, ExecutionsTimedOut, ExecutionsAborted, and ExecutionThrottled.

They represent the various ways state machine executions can fail. And since Step Functions are often used to model business-critical workflows, I would usually set the alert threshold to 1.
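
Since the pattern is the same for each metric, a small loop does the job. The state machine ARN and alarm names below are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
state_machine_arn = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:order-flow"  # placeholder
)

# One alarm per failure metric; a single failed execution triggers the alarm.
for metric in ["ExecutionsFailed", "ExecutionsTimedOut",
               "ExecutionsAborted", "ExecutionThrottled"]:
    cloudwatch.put_metric_alarm(
        AlarmName=f"order-flow-{metric}",
        Namespace="AWS/States",
        MetricName=metric,
        Dimensions=[{"Name": "StateMachineArn", "Value": state_machine_arn}],
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",
    )
```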

