
SES Disaster Recovery for Transactional Email

SES is regional and us-east-1 has bad days. Build real regional resilience for transactional email — with trade-offs at each tier.

The week of the great us-east-1 SES degradation a couple of years ago, I watched a B2B SaaS team learn in real time that their entire transactional email infrastructure was single-region. Their application was multi-region and well-architected. Their database had read replicas in three regions. Their Lambda functions were deployed everywhere. Their SES sending? us-east-1 only.

When SES us-east-1 started slowing down sends and rejecting some API calls, the team's runbook had no entry for it. They watched their queue of pending password resets grow into the tens of thousands. They scrambled to verify domain identities in eu-west-1, only to discover that domain verification takes time and DKIM key propagation is not instant. By the time they had a working secondary, the primary had recovered. Two days of customer pain for an outage they could have planned around.

This is the typical SES DR story. The application is resilient. The email infrastructure is not, because nobody thought hard about it before the outage made them.

Here's how to actually plan for SES regional failure.

SES regional architecture, briefly

SES is a regional service. Every relevant resource — domain identities, configuration sets, email templates, suppression lists, sending statistics, account-level quotas — exists in a specific region. There is no automatic cross-region replication.

What this means in practice:

A template you create in eu-west-1 does not exist in eu-central-1. Calling SendEmail in eu-central-1 referencing that template name returns a TemplateDoesNotExist error.
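A domain identity verified in us-east-1 is not verified in eu-west-1. Verification and DKIM setup are per-region, so DMARC alignment for mail sent from the second region depends on DKIM (and any custom MAIL FROM domain) being configured there too.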
Your account-level sending quota in eu-west-1 is independent of your quota in eu-central-1. If you're sending 10,000 emails an hour from eu-west-1 because that's where your application primarily runs, your eu-central-1 quota may still be the sandbox 200 per day until you ask AWS to raise it.
Suppression lists are per-region. An address that hard-bounced or complained in eu-west-1 is not on the suppression list in eu-central-1 unless you put it there.

The asymmetry is the point. Most multi-region failover plans assume "deploy to the second region and resume." For SES, that requires the second region to already be set up — verified, quota'd, populated with templates, in sync with suppression lists.
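One way to make "already set up" checkable is a small readiness report per region. The sketch below is a minimal version, assuming an SES v2 client from boto3 is passed in; the function name, the report shape, and the quota threshold are illustrative choices of mine, not part of any AWS tooling.

```python
from typing import Any, Dict, List


def check_region_readiness(client: Any, domain: str, required_templates: List[str],
                           min_daily_quota: float) -> Dict[str, Any]:
    """Return a readiness report for one region, given that region's SES v2 client.

    Hypothetical helper: the name and report format are illustrative.
    """
    problems: List[str] = []

    # Account-level state: sandbox status and the 24-hour sending quota.
    account = client.get_account()
    if not account.get("ProductionAccessEnabled", False):
        problems.append("account is still in the sandbox")
    quota = account.get("SendQuota", {}).get("Max24HourSend", 0)
    if quota < min_daily_quota:
        problems.append(f"daily quota {quota:.0f} is below required {min_daily_quota:.0f}")

    # Identity state: verified for sending, and DKIM in a working state.
    try:
        identity = client.get_email_identity(EmailIdentity=domain)
        if not identity.get("VerifiedForSendingStatus", False):
            problems.append(f"{domain} is not verified for sending")
        if identity.get("DkimAttributes", {}).get("Status") != "SUCCESS":
            problems.append(f"DKIM for {domain} is not in SUCCESS state")
    except client.exceptions.NotFoundException:
        problems.append(f"{domain} identity does not exist")

    # Templates: every name the application references must exist in this region.
    existing = set()
    kwargs: Dict[str, Any] = {}
    while True:
        page = client.list_email_templates(**kwargs)
        existing.update(t["TemplateName"] for t in page.get("TemplatesMetadata", []))
        if not page.get("NextToken"):
            break
        kwargs = {"NextToken": page["NextToken"]}
    for name in required_templates:
        if name not in existing:
            problems.append(f"template {name} is missing")

    return {"ready": not problems, "problems": problems}
```

In use, you would build one client per region with boto3.client("sesv2", region_name=...) and run the check on a schedule, alerting whenever the DR region reports ready as False. Suppression parity is deliberately left out here because it needs its own approach.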

Multi-region template replication

The first concrete piece of DR is keeping templates in sync across regions. If your DR region's template namespace is empty, your application can fail over but cannot send.

There are two patterns.

Active-active replication. Every template change is published to all DR regions simultaneously. The publishing CI job (or your control layer) iterates over a list of regions and calls UpdateEmailTemplate for each.

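import boto3

# Every region that should hold a copy of each template.
REGIONS = ["eu-west-1", "eu-central-1", "us-east-1"]

def publish_template(name: str, content: dict) -> None:
    """Create or update the template in every region, stopping at the first failure."""
    for region in REGIONS:
        client = boto3.client("sesv2", region_name=region)
        try:
            client.update_email_template(TemplateName=name, TemplateContent=content)
        except client.exceptions.NotFoundException:
            # First publish to this region: the template does not exist yet.
            client.create_email_template(TemplateName=name, TemplateContent=content)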

Pro: regions stay in sync without an explicit DR drill. Con: every publish is now subject to the latency and failure mode of the slowest region. If us-east-1 is degraded, your publish CI job hangs there and you discover that your template publish is no longer reliable.

The mitigation is to make per-region publishes independent and best-effort, with monitoring on whether each region is current:

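from botocore.config import Config

# Illustrative values: short timeouts and no SDK retries, so a degraded region
# fails fast instead of hanging the publish job.
PUBLISH_CONFIG = Config(connect_timeout=3, read_timeout=5, retries={"total_max_attempts": 1})

def publish_template_safely(name: str, content: dict) -> dict:
    results = {}
    for region in REGIONS:
        try:
            client = boto3.client("sesv2", region_name=region, config=PUBLISH_CONFIG)
            try:
                client.update_email_template(TemplateName=name, TemplateContent=content)
            except client.exceptions.NotFoundException:
                # First publish to this region: the template does not exist yet.
                client.create_email_template(TemplateName=name, TemplateContent=content)
            results[region] = "ok"
        except Exception as e:
            # Record the failure and move on; one bad region must not block the rest.
            results[region] = f"error: {e}"
    return results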

The CI job posts the results to a dashboard. Regions that fail to publish raise alerts but don't block the deploy. The drift report described in the drift-detection post runs per region and catches drift in DR regions before they're needed.
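To make the per-region drift check concrete, here is a minimal sketch. It assumes the source of truth is available as a dict of template name to TemplateContent and that you pass in one SES v2 client per region; content_fingerprint and find_drift are names I made up for illustration.

```python
import hashlib
import json
from typing import Any, Dict


def content_fingerprint(content: Dict[str, str]) -> str:
    """Stable hash over the Subject, Text, and Html parts of a template."""
    # Missing parts are treated as empty so absent and empty fields compare equal.
    normalized = {k: content.get(k) or "" for k in ("Subject", "Text", "Html")}
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()


def find_drift(expected: Dict[str, Dict[str, str]],
               clients: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
    """Compare each region's templates against the expected content.

    expected maps template name -> TemplateContent dict (the source of truth).
    clients maps region name -> an SES v2 client for that region.
    Returns region -> {template name: "missing" | "drifted" | "error: ..."},
    listing only regions that have at least one issue.
    """
    report: Dict[str, Dict[str, str]] = {}
    for region, client in clients.items():
        issues: Dict[str, str] = {}
        for name, content in expected.items():
            try:
                actual = client.get_email_template(TemplateName=name)["TemplateContent"]
            except client.exceptions.NotFoundException:
                issues[name] = "missing"
                continue
            except Exception as e:
                # A region that cannot be queried is reported, not silently skipped.
                issues[name] = f"error: {e}"
                continue
            if content_fingerprint(actual) != content_fingerprint(content):
                issues[name] = "drifted"
        if issues:
            report[region] = issues
    return report
```

An empty report means every region matches the source of truth. Anything else is worth an alert, because a drifted DR region only shows itself during a failover.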

Lazy replication on failover. Templates are published only to the primary region under normal operation. On detected primary failure, a DR script reads templates from the source of truth (Git, IaC, control layer) and pushes them to the DR region.

Pro: simpler steady state. Con: the DR region is uninitialized until you need it, which means your DR procedure has to bootstrap the region (verify the identity, configure DKIM, populate the suppression list) at the moment you can least afford delays.

For most teams, active-active replication is the right answer. The steady-state cost is small. The DR-time cost of lazy replication is large.

Application-layer failover

Replicated templates aren't useful if the application doesn't know how to send through the DR region.

The pattern is a small wrapper around the SES client that picks a region based on health.

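import time
from typing import Optional

import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Illustrative values: fail fast, and let this wrapper (not the SDK) decide on retries.
SEND_CONFIG = Config(connect_timeout=1, read_timeout=3, retries={"total_max_attempts": 1})

class SESFailoverClient:
    def __init__(self, primary: str, secondary: str, cooldown: float = 60.0):
        self.primary = boto3.client("sesv2", region_name=primary, config=SEND_CONFIG)
        self.secondary = boto3.client("sesv2", region_name=secondary, config=SEND_CONFIG)
        self.cooldown = cooldown
        self.primary_unhealthy_until: Optional[float] = None

    def send(self, **kwargs) -> dict:
        # While the primary is marked unhealthy, go straight to the secondary.
        if self.primary_unhealthy_until and time.time() < self.primary_unhealthy_until:
            return self.secondary.send_email(**kwargs)
        try:
            return self.primary.send_email(**kwargs)
        except ClientError as e:
            # A 4xx other than throttling is a problem with the request itself;
            # sending it through another region will not fix it.
            status = e.response.get("ResponseMetadata", {}).get("HTTPStatusCode", 500)
            if status < 500 and status != 429:
                raise
        except BotoCoreError:
            pass  # timeouts and connection failures
        self.primary_unhealthy_until = time.time() + self.cooldown
        return self.secondary.send_email(**kwargs)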

In practice you want this layered with a circuit breaker (don't keep retrying primary if it keeps failing) and metrics (count failovers per minute, alert when nonzero). For high-volume senders, the secondary region needs sufficient quota to handle the full load — request the quota increase well in advance, not during the incident.
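A circuit breaker for this purpose can be small. The sketch below is one possible shape, not a standard library: CircuitBreaker and send_with_breaker are hypothetical names, the thresholds are placeholders, and the two send functions are whatever callables wrap your primary and secondary SES clients.

```python
import time
from typing import Any, Callable, Dict, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures and stays open for `cooldown` seconds.

    After the cooldown, calls are let through again; the first new failure
    re-opens the breaker immediately, and a success closes it.
    """

    def __init__(self, threshold: int = 3, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if the protected call (the primary region) should be attempted."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def send_with_breaker(breaker: CircuitBreaker,
                      send_primary: Callable[..., Dict[str, Any]],
                      send_secondary: Callable[..., Dict[str, Any]],
                      **message: Any) -> Dict[str, Any]:
    """Try the primary unless the breaker is open; fall back to the secondary."""
    if not breaker.allow():
        return send_secondary(**message)
    try:
        result = send_primary(**message)
    except Exception:
        breaker.record_failure()
        return send_secondary(**message)
    breaker.record_success()
    return result
```

The difference from a plain "unhealthy until" timestamp is the threshold: a single slow call does not divert all traffic, but a run of failures does. Counting calls to record_failure is also a natural place to emit the failover metric.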

A few real-world refinements:

Idempotency. If the primary call partially succeeds (the message was sent but the response was lost), retrying on the secondary sends a duplicate. SES SendEmail is not natively idempotent. Either accept occasional duplicates as the cost of failover, or implement an idempotency key at the application layer and check it before sending.

Latency budgets. Failover only helps if you detect the primary failure quickly. A 30-second timeout on SendEmail is too long for a password reset — the user gives up and tries again, generating a queue of duplicates. A 3-second timeout with circuit-breaker tripping is more useful.

Suppression check. If your application maintains its own suppression list (which is recommended), check it before sending regardless of region. The SES regional suppression lists may not be in sync, and you don't want to violate an opt-out because the secondary region didn't know about it.
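The idempotency refinement above can be sketched as follows. This is a minimal illustration under stated assumptions: the key is derived from recipient, template, and an application event ID (all hypothetical field choices), and InMemoryKeyStore stands in for a shared store with an atomic put-if-absent, such as a conditional write in a database.

```python
import hashlib
from typing import Any, Callable, Dict, Optional


def idempotency_key(recipient: str, template: str, event_id: str) -> str:
    """Derive a stable key from what makes this send unique."""
    raw = f"{recipient.strip().lower()}|{template}|{event_id}"
    return hashlib.sha256(raw.encode()).hexdigest()


class InMemoryKeyStore:
    """Stand-in for a shared store; a real one must make claim() atomic across workers."""

    def __init__(self) -> None:
        self._keys: Dict[str, bool] = {}

    def claim(self, key: str) -> bool:
        """Return True if this caller claimed the key, False if it was already claimed."""
        if key in self._keys:
            return False
        self._keys[key] = True
        return True

    def release(self, key: str) -> None:
        self._keys.pop(key, None)


def send_once(store: InMemoryKeyStore, key: str, send: Callable[[], Dict[str, Any]],
              at_least_once: bool = True) -> Optional[Dict[str, Any]]:
    """Run send() unless the key is already claimed. Returns None for a skipped duplicate."""
    if not store.claim(key):
        return None
    try:
        return send()
    except Exception:
        # An exception does not prove the message was not sent: the response may
        # have been lost. Releasing the key allows a retry (at-least-once, possible
        # duplicate); keeping it blocks retries (at-most-once, possible loss).
        if at_least_once:
            store.release(key)
        raise
```

Here send would be the whole failover send, so one key covers both regions. The key reliably stops repeats caused by queue redelivery or an impatient user; it cannot by itself resolve the ambiguous lost-response case, which is why the at_least_once choice is explicit.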

Suppression list parity across regions

The suppression list is the trickiest piece of cross-region DR.

SES maintains an account-level suppression list per region, capturing addresses that hard-bounced or generated complaints. If your primary region adds an address to its suppression list, the DR region won't have it unless you copy it across.

The honest answer is that account-level suppression lists are not designed for multi-region replication, and the API for managing them is awkward. The pragmatic patterns:

Application-layer suppression. Maintain your own suppression list in DynamoDB or your application database. Update it from SES bounce/complaint event destinations. Check it before every send. Treat the SES regional suppression list as a defense-in-depth layer, not the source of truth. This is the pattern I see working most often.

SES suppression API + sync job. Use the SES v2 suppression operations (ListSuppressedDestinations to read, PutSuppressedDestination to write) to replicate entries yourself. A scheduled job in each non-primary region reads suppression entries from the primary and applies them locally. Possible but fragile, and the API rate limits make full sync slow on large lists.

Avoid double-bouncing. Whatever pattern you use, the goal is that an address suppressed in the primary never receives mail from the secondary during a failover. Application-layer suppression solves this directly. Per-region suppression alone does not.
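For the application-layer pattern, the core logic is small: turn SES bounce and complaint events into suppressed addresses, and consult that set before every send. The sketch below assumes the event has already been JSON-decoded from whatever destination delivers it; SuppressionList is an in-memory stand-in for the DynamoDB or database table, and both names are illustrative.

```python
from typing import Any, Dict, List, Set


def addresses_to_suppress(event: Dict[str, Any]) -> List[str]:
    """Extract addresses from one SES bounce or complaint event."""
    # Event publishing uses "eventType"; feedback notifications use "notificationType".
    kind = event.get("eventType") or event.get("notificationType")
    if kind == "Bounce":
        bounce = event.get("bounce", {})
        # Only permanent (hard) bounces suppress; transient ones may recover.
        if bounce.get("bounceType") != "Permanent":
            return []
        recipients = bounce.get("bouncedRecipients", [])
    elif kind == "Complaint":
        recipients = event.get("complaint", {}).get("complainedRecipients", [])
    else:
        return []
    return [r["emailAddress"].strip().lower() for r in recipients if r.get("emailAddress")]


class SuppressionList:
    """In-memory stand-in for a table keyed by normalized email address."""

    def __init__(self) -> None:
        self._addresses: Set[str] = set()

    def record_event(self, event: Dict[str, Any]) -> int:
        """Apply one SES event; returns how many addresses it contributed."""
        found = addresses_to_suppress(event)
        self._addresses.update(found)
        return len(found)

    def is_suppressed(self, address: str) -> bool:
        return address.strip().lower() in self._addresses
```

Because the list lives outside SES, it is the same list no matter which region the send goes through, which is the property the per-region SES lists cannot give you.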

Runbooks and tabletop exercises

The best DR plan you've never tested is worse than the worst plan you've practiced.

The minimum viable runbook for SES DR has these pieces:

Detection. What signal tells you SES primary is unhealthy? CloudWatch metrics on send latency? AWS Health Dashboard? Customer reports? Define the trigger.
Decision. Who decides to fail over? Under what conditions? Failing over too eagerly creates churn; failing over too late prolongs customer pain.
Action. What commands or button-presses execute the failover? If it's automated (the wrapper from earlier), what manual interventions might still be needed?
Verification. How do you confirm sends are flowing through the secondary region successfully?
Communication. Who tells customers, internal stakeholders, and on-call? What channels?
Recovery. When primary is healthy again, how do you fail back? In what order? How do you reconcile state (sent messages, suppression entries, audit logs) that may have diverged during the incident?
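For the detection step, "define the trigger" can be as literal as a few lines of code. The sketch below is one hypothetical trigger, an error rate over a sliding window with a minimum sample size; the class name and every threshold are placeholders to tune against your own traffic, and a CloudWatch alarm on the same numbers would serve equally well.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SendHealthWindow:
    """Tracks recent send outcomes and reports whether the failover trigger is met."""

    def __init__(self, window_seconds: float = 300.0, min_samples: int = 20,
                 max_error_rate: float = 0.2,
                 clock: Callable[[], float] = time.monotonic):
        self.window_seconds = window_seconds
        self.min_samples = min_samples
        self.max_error_rate = max_error_rate
        self.clock = clock  # injectable so the trigger can be tested without waiting
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, ok: bool) -> None:
        """Record the outcome of one send attempt against the primary region."""
        self._events.append((self.clock(), ok))
        self._trim()

    def _trim(self) -> None:
        # Drop outcomes that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def error_rate(self) -> float:
        self._trim()
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_fail_over(self) -> bool:
        """True when enough recent sends exist and too many of them failed."""
        self._trim()
        # The minimum sample size keeps one failed send in a quiet hour from tripping it.
        return len(self._events) >= self.min_samples and self.error_rate() > self.max_error_rate
```

Writing the trigger down this precisely also settles the decision step: the runbook can say "fail over when should_fail_over is true for N consecutive checks" instead of leaving it to judgment during the incident.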

Once a quarter, run a tabletop exercise. The exercise doesn't have to be a real failover — it can be a paper walk-through with the on-call team. Read the runbook. Identify what's outdated. Identify what assumptions have changed. Update accordingly.

Once a year, run a real failover drill. Pick a low-traffic time, fail over, watch what breaks, fail back. The first one will reveal three things you didn't think of.

RTO and RPO targets

Disaster recovery vocabulary asks two questions: how long can you tolerate the system being down, and how much data can you afford to lose?

For transactional email specifically:

RTO (Recovery Time Objective). How long after primary failure can transactional sends resume? For most B2B SaaS, the right target is minutes to tens of minutes. Password resets are time-sensitive. An hour-long outage of password reset email is a customer support incident even if the application itself is up.

RPO (Recovery Point Objective). How much in-flight email can be lost? For transactional, the answer is "as little as possible," but the practical achievable RPO depends on whether you have queued sends or fire-and-forget sends.

If your application calls SendEmail directly and never retries a failed call, your RPO is zero for emails SES accepted (already sent) and total for failed sends: those messages are lost unless something retries them, which the application probably won't do.

If your application enqueues sends through SQS or similar, with a worker that calls SES and retries on failure, your RPO is bounded by the queue depth at the moment of failure. The queue can drain through the secondary region as soon as the wrapper fails over.

For high-criticality transactional flows, a queue-with-retry pattern is the right architecture. The latency cost (a few hundred milliseconds for normal sends) is negligible. The reliability gain on failure is significant.
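A minimal sketch of the worker side of that pattern follows. It assumes each queue message body is a JSON object holding the keyword arguments for the send (an illustrative convention, not an SQS or SES requirement), that sqs is a boto3 SQS client, and that send is something like the failover wrapper's send method; drain_once is a hypothetical name.

```python
import json
from typing import Any, Callable, Dict


def drain_once(sqs: Any, queue_url: str,
               send: Callable[..., Dict[str, Any]]) -> Dict[str, int]:
    """Process one batch of queued sends. Returns counts of sent and failed messages."""
    counts = {"sent": 0, "failed": 0}
    # Long polling: wait up to 10 seconds for up to 10 messages.
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10
    )
    for message in response.get("Messages", []):
        try:
            send(**json.loads(message["Body"]))
        except Exception:
            # Leave the message on the queue. It becomes visible again after the
            # visibility timeout and is retried, possibly through the other region.
            counts["failed"] += 1
            continue
        # Delete only after the send call returned successfully.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        counts["sent"] += 1
    return counts
```

Deleting only after a successful send is what bounds the loss: a message that failed during the outage is still on the queue when the secondary takes over. Pair the queue with a dead-letter queue so a permanently bad message does not retry forever, and with an idempotency check, since redelivery means the same message can be processed twice.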

Sample architecture

For a SaaS with global customers and SOC 2 obligations:

Primary SES sending in eu-west-1 (matching primary application region).
Secondary SES sending in eu-central-1 (a different EU region, same compliance posture).
Verified domains in both regions, with DKIM keys published for each.
Account-level quotas raised to match steady-state production volume in both regions.
Templates replicated active-active to both regions on every publish.
Application sends go through SQS, with a worker that retries on failure and uses the SESFailoverClient pattern.
Suppression list maintained in DynamoDB at the application layer, populated from SES bounce/complaint event destinations.
Drift detection runs hourly, checking that templates in both regions match the source of truth.
DR runbook in the team's runbook repo, reviewed quarterly, with the failover decision tree and the communication plan.

The cost above a single-region setup is real but not exotic: a second region's worth of verified domains, doubled template publishing in CI, an application-layer suppression list, a queue worker. Everything else is configuration.

Where Sovy fits

Multi-region template management is one of the operational headaches that motivated Sovy's design. Sovy treats regions as a first-class concept: a template is published to a set of regions in a single operation, with per-region success reporting. Drift detection runs per region. Audit logs capture which regions a publish reached.

The point isn't that Sovy does DR for you — your application code still has to fail over, and your AWS architecture decisions are still yours. The point is that the template plane stays consistent across regions without you having to write the iteration logic, monitor it, and deal with the per-region failure modes.

For teams that don't have the engineering bandwidth to build active-active template replication and the monitoring around it, the alternative is usually "single region until something forces us to fix it." That something is usually an outage.

Where to start

If your SES setup is single-region today, the highest-leverage first move is requesting account quota increases and verifying domains in a secondary region. Do it before you need it. Domain verification, DKIM key publication, and quota requests have lead time. The DR region you set up the day before an outage is the DR region that fails when you need it.

After that: replicate templates, write the wrapper, write the runbook, run the tabletop. None of these is a quarter of work. All of them combined are still less work than handling an unplanned outage.

The objection to investing here is always "it's never happened to us." Every team that's had a regional SES incident said exactly that the day before.

Sovy is a control layer for Amazon SES templates with first-class multi-region support. Templates are published consistently across the regions you choose, with per-region drift detection and audit. If your DR plan for transactional email is a sticky note, we'd like to hear from you.