Techsense Developers
Cloud & Infrastructure · 7 min read · Jun 27, 2026

How to Build a 24/7 SRE Pipeline with Terraform and AWS

To build a Terraform SRE pipeline that runs 24/7 on AWS, you codify three things and wire them together: your infrastructure (compute, networking, data stores), your observability stack (metrics, logs, traces, alerts), and your remediation paths (auto-scaling, self-healing, escalation). The pipeline then provisions, validates, and continuously reconciles all of it through version-controlled Terraform, gated by automated checks in CI/CD. The result is reliability that does not depend on a human being awake at 3 a.m.

This post walks through how I structure that pipeline in production, the AWS primitives that do the heavy lifting, and the failure modes to plan for before they page you.

Why a Terraform SRE Pipeline Beats Manual Reliability Work

Most reliability problems are not exotic. They are drift, inconsistent environments, and alerting that nobody trusts. When a human applies changes by hand, three things happen over time:

  • Configuration drift. The console gets a "quick fix" that never makes it back to code.
  • Tribal knowledge. The person who knows the runbook goes on vacation.
  • Inconsistent recovery. The same incident gets handled differently depending on who is on call.

Infrastructure as Code (IaC) for reliability fixes the root cause. When your monitoring, alerting, and remediation live in the same repository as your infrastructure, every change is reviewed, versioned, and reproducible. You stop debating what the production state should be because the repository is the source of truth.

The goal is not "no humans." The goal is that humans handle judgment calls while the pipeline handles the predictable work.

Architecture: The Four Layers of AWS SRE Automation

I structure a Terraform SRE pipeline in four layers. Each is its own Terraform module so blast radius stays contained and teams can own pieces independently.

1. Foundation layer

Networking, IAM, encryption keys, and account guardrails. This layer changes rarely and gets the strictest review.

module "network" {
  source = "./modules/network"

  vpc_cidr            = "10.40.0.0/16"
  availability_zones  = ["us-east-1a", "us-east-1b", "us-east-1c"]
  enable_flow_logs    = true
  flow_log_retention  = 90
}

Three availability zones is the baseline for 24/7 operation. A single-AZ design cannot survive a zone-level event, and zone-level disruptions, while uncommon, do happen.

2. Workload layer

The actual services: ECS, EKS, Lambda, RDS, whatever runs your product. The key reliability decisions live here, especially health checks and auto-scaling.

resource "aws_appautoscaling_target" "api" {
  max_capacity       = 20
  min_capacity       = 3
  resource_id        = "service/prod-cluster/api"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "api_cpu" {
  name               = "api-cpu-target"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  service_namespace  = aws_appautoscaling_target.api.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 60.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

Note the asymmetric cooldowns. I scale out fast (60 seconds) and scale in slowly (300 seconds) so a brief dip in load does not strip capacity right before the next spike.
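
The health-check half of this layer is not shown above. A minimal sketch of what it could look like for an ECS service behind an Application Load Balancer follows. The /healthz path, port 8080, the thresholds, and the vpc_id variable are all assumptions for illustration, not values from a real deployment.

```hcl
# Hypothetical input: the VPC created by the foundation layer.
variable "vpc_id" {
  type = string
}

resource "aws_lb_target_group" "api" {
  name        = "api"
  port        = 8080 # assumed container port
  protocol    = "HTTP"
  target_type = "ip" # tasks using awsvpc networking register by IP
  vpc_id      = var.vpc_id

  health_check {
    path                = "/healthz" # assumed health endpoint
    matcher             = "200"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }

  # Give in-flight requests a short window to drain before a task is removed.
  deregistration_delay = 30
}
```

With these numbers a failing task is marked unhealthy after roughly 45 seconds (three checks, 15 seconds apart). Tighter intervals detect failures faster but make the check more sensitive to brief slowdowns.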

3. Observability layer

This is where most "monitoring" projects stall. 24/7 infrastructure monitoring is not a dashboard; it is the alerting and the data behind it. I provision CloudWatch alarms, metric filters, and SNS topics in Terraform so alert thresholds are reviewed like any other code change.

resource "aws_cloudwatch_metric_alarm" "api_5xx" {
  alarm_name          = "api-5xx-rate-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  datapoints_to_alarm = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 25
  treat_missing_data  = "notBreaching"

  dimensions = {
    LoadBalancer = aws_lb.api.arn_suffix
  }

  alarm_actions = [aws_sns_topic.pager.arn]
  ok_actions    = [aws_sns_topic.pager.arn]
}

Setting datapoints_to_alarm = 2 with evaluation_periods = 3 is deliberate. The alarm fires only when two of the last three one-minute datapoints breach the threshold, which suppresses single-minute blips that would otherwise create noise. Alert fatigue is a reliability risk in its own right: if on-call engineers learn to ignore the pager, your AWS SRE automation is actively making things worse.
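
The alarm above references aws_sns_topic.pager, and this section also mentions metric filters, but neither is shown. A minimal sketch of both follows. The topic name, the webhook variable, the /ecs/api log group, and the assumption that the service writes JSON logs with a level field are all hypothetical.

```hcl
# Hypothetical input: HTTPS endpoint of your paging provider.
variable "pager_webhook_url" {
  type      = string
  sensitive = true
}

resource "aws_sns_topic" "pager" {
  name = "prod-pager"
}

resource "aws_sns_topic_subscription" "pager_https" {
  topic_arn = aws_sns_topic.pager.arn
  protocol  = "https"
  endpoint  = var.pager_webhook_url
}

# Turns application log lines into a metric that alarms can watch.
# Assumes structured JSON logs such as {"level": "error", ...}.
resource "aws_cloudwatch_log_metric_filter" "api_errors" {
  name           = "api-error-count"
  log_group_name = "/ecs/api" # assumed log group
  pattern        = "{ $.level = \"error\" }"

  metric_transformation {
    name      = "ApiErrorCount"
    namespace = "Custom/Api"
    value     = "1" # each matching log line counts as one error
  }
}
```

Because the filter lives in code, a change to what counts as an error goes through the same review as a change to the threshold.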

4. Remediation layer

This is what makes the pipeline "24/7." Some incidents should never require a human. Wire CloudWatch alarms to automated responses through SNS, Lambda, or EventBridge.

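resource "aws_cloudwatch_event_rule" "unhealthy_host" {
  name        = "ec2-status-check-failed"
  description = "Trigger recovery on instance status check failure"

  event_pattern = jsonencode({
    source      = ["aws.cloudwatch"]
    detail-type = ["CloudWatch Alarm State Change"]
    detail = {
      # Without an alarm filter this rule would match every alarm in the
      # account. The "ec2-status-check-" prefix is an assumed naming
      # convention for the status-check alarms; adjust it to yours.
      alarmName = [{ prefix = "ec2-status-check-" }]
      state     = { value = ["ALARM"] }
    }
  })
}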
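
A rule on its own does nothing until it has a target. One way to wire the rule above to a recovery function is sketched below. The Lambda itself is not part of this post, so its ARN comes in through a hypothetical variable.

```hcl
# Hypothetical input: ARN of a Lambda function that performs the recovery.
variable "recovery_lambda_arn" {
  type = string
}

# Send matching alarm events to the recovery function.
resource "aws_cloudwatch_event_target" "recover" {
  rule = aws_cloudwatch_event_rule.unhealthy_host.name
  arn  = var.recovery_lambda_arn
}

# EventBridge needs explicit permission to invoke the function.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.recovery_lambda_arn
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.unhealthy_host.arn
}
```

Scoping source_arn to this one rule keeps other rules in the account from triggering the recovery function.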

For stateless workloads, the safest automated remediation is "replace, don't repair." Let the auto-scaling group terminate and relaunch the unhealthy instance rather than trying to fix it in place.
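
"Replace, don't repair" can be expressed with an auto-scaling group that trusts the load balancer's health check. The sketch below assumes EC2-based workers; the group name, sizes, grace period, and all three variables are hypothetical.

```hcl
# Hypothetical inputs supplied by the foundation and workload layers.
variable "private_subnet_ids" {
  type = list(string)
}

variable "target_group_arn" {
  type = string
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "workers" {
  name                = "workers"
  min_size            = 3
  max_size            = 12
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [var.target_group_arn]

  # Use the load balancer's view of health, not just EC2 status checks,
  # so an instance that boots but cannot serve traffic is replaced.
  health_check_type         = "ELB"
  health_check_grace_period = 120 # assumed warm-up time in seconds

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }
}
```

The grace period matters: set it too low and the group terminates instances that are still starting, which turns a slow deploy into a replacement loop.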

Building the Pipeline: From Commit to Reconciled State

The CI/CD flow is what turns four Terraform modules into a living system. Here is the sequence I run on every change.

  1. Format and validate. terraform fmt -check and terraform validate fail fast on obvious errors.
  2. Static analysis. Run a policy and security scanner (such as tfsec or checkov) to catch open security groups, unencrypted volumes, and over-broad IAM before they merge.
  3. Plan with locking. Generate terraform plan against remote state in S3 with state locking (a DynamoDB lock table, or the S3 backend's native lockfile on Terraform 1.10 and later) so two pipelines never apply at once.
  4. Human review of the plan. For production, a reviewer approves the actual plan output, not just the code diff.
  5. Apply. Merging to the protected branch triggers terraform apply, ideally against the saved plan file the reviewer approved, so what runs is what was reviewed.
  6. Drift detection. A scheduled job runs terraform plan on a cadence and alerts if real state has diverged from code.

A minimal GitHub Actions stage looks like this:

jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC, no long-lived AWS keys
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/terraform-ci
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      # Step 1 of the sequence: fail fast on formatting and syntax.
      - run: terraform fmt -check -recursive
      - run: terraform init -input=false
      - run: terraform validate
      # Save the plan so the reviewed plan is the one that gets applied.
      - run: terraform plan -input=false -out=tfplan

Use OIDC federation, not stored access keys. Long-lived credentials in CI are one of the most common ways a pipeline becomes the breach.
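
On the AWS side, OIDC federation comes down to a role whose trust policy accepts GitHub's tokens. The sketch below assumes the GitHub OIDC provider already exists in the account and that the repository is example-org/infra, which is a placeholder. The role name matches the terraform-ci role in the workflow above.

```hcl
# Look up the GitHub OIDC provider, assumed to be registered already.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "ci_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Restrict to one repository. "example-org/infra" is a placeholder;
    # tighten the wildcard to specific branches where you can.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infra:*"]
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci"
  assume_role_policy = data.aws_iam_policy_document.ci_trust.json
}
```

The sub condition is the important one. Without it, a workflow in any GitHub repository could assume the role.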

Remote state is non-negotiable

State is the heart of any Terraform SRE pipeline. Store it remotely, encrypt it, version it, and lock it.

terraform {
  backend "s3" {
    bucket         = "techorg-tfstate-prod"
    key            = "workload/api/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-locks"
    encrypt        = true
  }
}

Separate state per environment and per layer. A corrupted or locked state file should never be able to take down more than the module it belongs to.
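
The backend block assumes the bucket and lock table already exist. A minimal sketch of those resources follows, usually applied once from a separate bootstrap configuration. The names match the backend block above; KMS encryption and on-demand billing are assumptions.

```hcl
resource "aws_s3_bucket" "tfstate" {
  bucket = "techorg-tfstate-prod"
}

# Versioning lets you roll back to a previous state file after a bad apply.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State files often contain secrets, so block every form of public access.
resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend's DynamoDB locking requires a string hash key named LockID.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

One lock table can serve every state file in the account, because each lock entry is keyed by the state file's bucket and path.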

Making It Genuinely 24/7

Provisioning is the easy part. Staying reliable at 3 a.m. requires three habits the pipeline must enforce.

  • Drift detection on a schedule. A nightly terraform plan -detailed-exitcode (exit code 2 means the plan contains changes) that posts non-empty diffs to your alerting channel keeps the console-edit problem from accumulating.
  • Synthetic monitoring. Real user impact often shows up before any infrastructure metric breaches. Provision CloudWatch Synthetics canaries that exercise critical user paths every minute.
  • Tested runbooks as code. For anything not yet automated, store the runbook next to the Terraform that owns the resource, and reference it in the alarm description so the on-call engineer has context in the page.
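
The synthetic monitoring and runbook habits above can be combined in one sketch: a canary that runs every minute, plus an alarm whose description points at the runbook. The canary name, handler, script location, runbook path, threshold, and every variable here are hypothetical.

```hcl
# Hypothetical inputs.
variable "canary_artifact_bucket" {
  type = string
}

variable "canary_role_arn" {
  type = string
}

variable "canary_runtime_version" {
  type = string # a current Synthetics runtime, e.g. a syn-nodejs-puppeteer release
}

variable "pager_topic_arn" {
  type = string
}

resource "aws_synthetics_canary" "checkout" {
  name                 = "checkout-path"
  artifact_s3_location = "s3://${var.canary_artifact_bucket}/checkout/"
  execution_role_arn   = var.canary_role_arn
  handler              = "checkout.handler"
  runtime_version      = var.canary_runtime_version
  zip_file             = "${path.module}/canaries/checkout.zip" # assumed script bundle
  start_canary         = true

  schedule {
    expression = "rate(1 minute)"
  }
}

resource "aws_cloudwatch_metric_alarm" "checkout_canary" {
  alarm_name          = "checkout-canary-failing"
  alarm_description   = "Checkout canary failing. Runbook: runbooks/checkout-canary.md"
  namespace           = "CloudWatchSynthetics"
  metric_name         = "SuccessPercent"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  datapoints_to_alarm = 2
  comparison_operator = "LessThanThreshold"
  threshold           = 90
  treat_missing_data  = "breaching" # a silent canary is also a failure

  dimensions = {
    CanaryName = aws_synthetics_canary.checkout.name
  }

  alarm_actions = [var.pager_topic_arn]
}
```

Treating missing data as breaching is a deliberate difference from the 5xx alarm earlier: no 5xx datapoints usually means no errors, while no canary datapoints means the check itself has stopped running.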

The teams that get the most value from IaC for reliability treat every incident as a backlog item: if a human fixed it manually, the next step is to encode that fix so it never needs a human again. That feedback loop is the difference between a pipeline that decays and one that compounds.

If you are weighing whether to build this in-house or bring in help, our cloud and infrastructure capabilities cover the patterns above end to end. We also tailor reliability targets to context, because a fintech ledger and a content site have very different definitions of acceptable downtime; see how that plays out across the industries we work with.

A Sensible Rollout Order

If you are starting from manual operations, do not try to land all four layers at once. I sequence it like this:

  1. Get state remote and locked. Everything else depends on it.
  2. Import existing infrastructure into Terraform so code matches reality.
  3. Add the observability layer and tune alerts until on-call trusts the pager.
  4. Add auto-scaling and self-healing for stateless workloads.
  5. Turn on scheduled drift detection and synthetic canaries.

Each step is independently valuable, so you ship reliability improvements continuously instead of waiting for a big-bang cutover.

FAQ

What is a Terraform SRE pipeline?

It is a version-controlled, CI/CD-driven workflow that provisions and continuously reconciles your infrastructure, monitoring, and automated remediation as code. Rather than configuring reliability by hand in the AWS console, you define alarms, scaling policies, and recovery actions in Terraform so they are reviewed, repeatable, and auditable.

Can a Terraform SRE pipeline fully replace on-call engineers?

No, and that is not the goal. Automated remediation handles predictable, well-understood failures like replacing an unhealthy stateless instance. Humans still own judgment calls, novel incidents, and decisions with business risk. A good pipeline reduces the volume and severity of pages so the humans you do have are not burning out on toil.

How do I prevent configuration drift in AWS?

Run a scheduled terraform plan against production and alert whenever the diff is non-empty. Combine that with restricting console write access in production, so the only sanctioned path to change infrastructure is through the reviewed pipeline. Drift detection plus tight permissions is far more effective than either alone.

What should I monitor first for 24/7 infrastructure monitoring?

Start with user-facing signals: error rate, latency, and availability of critical paths via synthetic canaries. These correlate most directly with customer impact. Add resource-level metrics like CPU, memory, and queue depth afterward to support auto-scaling and capacity planning.

Is Terraform or CloudFormation better for AWS SRE automation?

Both can do the job. Terraform's advantages are a consistent workflow across multiple providers and a large module ecosystem, which matters if your stack is not purely AWS. CloudFormation is tightly integrated with AWS native services. The more important factor is disciplined state management and CI/CD gating, which you can achieve with either tool.