Axon Shield

ACME Protocol Implementation: Step-by-Step Technical Guide

Part of the ACME Certificate Automation Guide

After implementing ACME-based certificate automation at multiple enterprises, I've learned that technical capability is only half the battle. The other half is understanding organizational constraints, mapping dependencies, and choosing deployment patterns that match your actual operational reality.

This guide provides the technical implementation playbook based on what actually works in production environments managing thousands to hundreds of thousands of certificates.


Phase 0: Discovery and Assessment (Week 1-2)

Don't skip this. Organizations that start implementing before understanding their current state spend 3x longer fixing problems they could have anticipated.

Certificate Inventory Discovery

What you think you have vs. what you actually have:

Expected: 5,000 certificates tracked in CMDB
Actual (after discovery): 12,000-15,000 certificates

Where the extras come from:
- Let's Encrypt certificates issued by DevOps teams (not tracked)
- Acquired company infrastructure (not integrated)
- Kubernetes service mesh auto-generated certificates
- Development/staging environments (assumed "temporary")

Discovery methods (run all in parallel):

  1. Network scanning - Scan for TLS endpoints and, where you can, capture TLS handshakes passively for 30 days to identify all certificate usage
    • Tools: Censys, Shodan for external, nmap for internal networks
    • What it finds: Certificates you didn't know existed
  2. Log analysis - Certificate issuance/renewal logs from all known sources
    • Check: CA logs, web server logs, load balancer logs
    • What it finds: Historical patterns revealing renewal processes
  3. Application inventory - Survey every application owner
    • Questions: What certificates do you use? Who manages them? How do you renew?
    • What it finds: Shadow PKI systems and manual processes
  4. Cloud provider audit
    • AWS Certificate Manager inventory
    • Azure Key Vault certificate list
    • GCP Certificate Manager and Certificate Authority Service usage
  5. Container orchestration
    • Kubernetes: Check cert-manager deployments across all clusters
    • Service mesh: Istio, Linkerd certificate generation
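
For the network scanning step, a small script can turn a list of known or discovered endpoints into an inventory of issuers and expiry dates. The sketch below is one way to do it with openssl; the function names and the input file format (one "host" or "host:port" per line) are assumptions for illustration, not part of any tool named above.

```shell
# Default to port 443 when the endpoint has no explicit port.
# (Bare IPv6 literals are not handled in this sketch.)
normalize_endpoint() {
  case "$1" in
    *:*) printf '%s\n' "$1" ;;
    *)   printf '%s:443\n' "$1" ;;
  esac
}

# Print "endpoint<TAB>issuer<TAB>notAfter" for one endpoint, or return 1.
probe_endpoint() {
  local endpoint host info issuer not_after
  endpoint="$(normalize_endpoint "$1")"
  host="${endpoint%%:*}"
  # s_client fetches the served certificate; x509 extracts the fields we need.
  info="$(openssl s_client -connect "$endpoint" -servername "$host" \
            </dev/null 2>/dev/null \
          | openssl x509 -noout -issuer -enddate 2>/dev/null)" || return 1
  issuer="$(printf '%s\n' "$info" | sed -n 's/^issuer=//p')"
  not_after="$(printf '%s\n' "$info" | sed -n 's/^notAfter=//p')"
  printf '%s\t%s\t%s\n' "$endpoint" "$issuer" "$not_after"
}

# Probe every endpoint listed in a file, one per line.
scan_endpoints() {
  local line
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    probe_endpoint "$line" || printf '%s\tUNREACHABLE\t-\n' "$line"
  done < "$1"
}
```

Running `scan_endpoints endpoints.txt > inventory.tsv` produces a tab-separated file you can compare against the CMDB. There is no connection timeout here, so unreachable hosts can stall the scan; wrapping the openssl call in a timeout is a reasonable addition.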

Current Process Mapping

Document your actual renewal process, not what you think it is:

Example enterprise manual process (30-day timeline):
Day 1: Application owner realizes certificate expiring soon
Day 2-3: Searches wiki/Slack to find renewal process
Day 7: Infrastructure team generates CSR, submits ITSM ticket
Day 9: Security reviews request (2-day approval backlog)
Day 12: Certificate team submits to CA
Day 15: Certificate arrives, Change Advisory Board approval needed
Day 22: Change approved, scheduled for next maintenance window
Day 30: Certificate deployed, services restarted

Total visible cost: $200 CA fee
Total invisible cost: $2,000-$3,000 in labor
Teams involved: 5
Approval gates: 3
Risk: Any delay means outage

Key questions to answer:

  • How many approval gates actually exist?
  • Which approvals mitigate real risk vs. organizational inertia?
  • Can automation eliminate approvals or just speed them up?
  • What are your actual change windows for production systems?

Phase 1: Architecture Decisions (Week 2-4)

Decision Point 1: Public CA vs. Private CA vs. Hybrid

Public CA (Let's Encrypt) via ACME:

  • Best for: External-facing web services, APIs, customer-facing infrastructure
  • Pros: Free, automated, broadly trusted
  • Cons: 90-day validity max, requires domain validation, external dependency
  • Cost: Zero certificate fees, implementation only

Private CA with ACME support:

  • Best for: Internal services, IoT devices, code signing, regulated environments
  • Pros: Full control, custom validity periods, internal trust
  • Cons: Setup complexity, trust distribution challenges, operational overhead
  • Options: Smallstep CA (open source), HashiCorp Vault PKI, commercial CAs with ACME

Hybrid approach (our recommendation for most enterprises):

  • Public-facing: Let's Encrypt via ACME (significant portion of certificates)
  • Internal services: Private CA with ACME support (smaller portion)
  • Special cases: Commercial CA with extended validation or specific requirements (edge cases)

Decision Point 2: Tool Selection by Environment

Environment    | Recommended Tool          | Alternatives                             | Timeline to Production
Kubernetes     | cert-manager              | External Secrets Operator + ACME         | 1-2 weeks
Linux VMs      | Certbot                   | acme.sh, Caddy (web server replacement)  | 2-3 weeks
Windows/IIS    | win-acme                  | Certify The Web, ACMESharp               | 2-4 weeks
Load Balancers | Vendor-specific + Certbot | Custom integration                       | 3-4 weeks
Multi-cloud    | Terraform + cert-manager  | Cloud-native solutions (ACM, Key Vault)  | 4-6 weeks

Decision Point 3: Centralized vs. Distributed Management

Centralized approach:

  • Single platform manages all certificates across environments
  • Pros: Unified visibility, consistent policies, audit trail
  • Cons: Single point of failure, higher initial complexity
  • Best for: Regulated industries, compliance-heavy environments

Distributed approach:

  • Each environment manages its own certificates (cert-manager in each K8s cluster)
  • Pros: Environment isolation, simpler per-environment setup, resilient to platform failures
  • Cons: Harder to get unified visibility, potential policy inconsistencies
  • Best for: Tech companies, DevOps-mature organizations

Hybrid approach (recommended):

  • Distributed automation with centralized monitoring and reporting
  • Each environment auto-renews independently
  • Central platform aggregates certificate data for visibility

Phase 2: Implementation (Week 4-12)

Kubernetes Implementation with cert-manager

Week 1-2: Setup and Configuration

# Install cert-manager via Helm
kubectl create namespace cert-manager

helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.14.0 \
  --set installCRDs=true

# Create Let's Encrypt ClusterIssuer
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-team@example.com  # replace with your contact address
    privateKeySecretRef:
      name: letsencrypt-production-key
    solvers:
    - http01:
        ingress:
          class: nginx
    # DNS validation for wildcard certificates
    - dns01:
        cloudDNS:
          project: your-gcp-project
          serviceAccountSecretRef:
            name: clouddns-dns01-solver-sa
            key: key.json
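
After applying the ClusterIssuer, it is worth confirming that the ACME account was registered before creating any Certificate resources. The following is a minimal sketch that polls the issuer's Ready condition; the function names and the default retry values are assumptions.

```shell
# Return 0 when the named ClusterIssuer reports Ready=True.
issuer_ready() {
  local status
  status="$(kubectl get clusterissuer "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" || return 1
  [ "$status" = "True" ]
}

# Poll until the issuer is ready or the attempts run out.
# Arguments: issuer name, attempts (default 30), delay in seconds (default 10).
wait_for_issuer() {
  local name="$1" attempts="${2:-30}" delay="${3:-10}" i
  for ((i = 1; i <= attempts; i++)); do
    if issuer_ready "$name"; then
      echo "ClusterIssuer $name is Ready"
      return 0
    fi
    sleep "$delay"
  done
  echo "ClusterIssuer $name not Ready after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_issuer letsencrypt-production` blocks for up to five minutes. If you do not need the custom message, `kubectl wait --for=condition=Ready clusterissuer/letsencrypt-production --timeout=300s` does much the same thing.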

Week 2-4: Certificate Definitions

# Example: Automated certificate for application
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-com-tls
  namespace: production
spec:
  secretName: app-example-com-tls
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
  dnsNames:
  - app.example.com
  - www.app.example.com

Week 4-8: Rollout Strategy

  1. Start with non-production clusters (dev, staging)
  2. Validate automatic issuance and renewal
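  3. Monitor through the first renewal cycle. With 90-day certificates, cert-manager by default renews about 60 days after issuance (when one third of the lifetime remains), so trigger a renewal manually in non-production (for example with cmctl renew) if you want to validate sooner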
  4. Migrate production workloads in phases:
    • Non-critical services first
    • Customer-facing services during low-traffic windows
    • Critical infrastructure last with full rollback plan
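
To validate issuance during rollout (step 2 above), it helps to list every Certificate that is not Ready across the cluster. This sketch assumes cert-manager's CRDs are installed and that kubectl points at the cluster being checked; the function names are illustrative.

```shell
# Print "namespace<TAB>name<TAB>ready<TAB>notAfter" for every Certificate.
list_certificates() {
  kubectl get certificates --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.notAfter}{"\n"}{end}'
}

# Keep only rows whose Ready status is anything other than "True"
# (this includes certificates that have no status yet).
not_ready() {
  awk -F '\t' '$3 != "True"'
}
```

`list_certificates | not_ready` should print nothing once a cluster is healthy, which makes it easy to run per cluster before moving to the next rollout phase.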

Traditional Infrastructure Implementation with Certbot

Week 1-2: Installation and Testing

# Install Certbot (Ubuntu/Debian example)
sudo apt update
sudo apt install certbot python3-certbot-nginx

# Test certificate issuance for single domain
sudo certbot --nginx -d example.com -d www.example.com

# Verify automatic renewal is configured
sudo systemctl status certbot.timer
sudo certbot renew --dry-run

Week 2-4: Automation Setup

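# Create central renewal script: /usr/local/bin/certbot-renew-all.sh
#!/bin/bash
# Runs "certbot renew" and alerts if the run fails.

LOGFILE="/var/log/certbot-renewal.log"
echo "$(date): Starting certificate renewal check" >> "$LOGFILE"

# The deploy hook runs only when a certificate was actually renewed.
if certbot renew --quiet \
    --deploy-hook "systemctl reload nginx" \
    >> "$LOGFILE" 2>&1; then
    echo "$(date): Renewal check completed successfully" >> "$LOGFILE"
else
    echo "$(date): Renewal check failed" >> "$LOGFILE"
    # Alert on failure (send-alert.sh is a site-specific script you provide)
    /usr/local/bin/send-alert.sh "Certbot renewal failed on $(hostname)"
fi

# Add to root's crontab (sudo crontab -e). If certbot.timer is already active,
# disable one of the two so renewals are not scheduled twice.
0 2 * * * /usr/local/bin/certbot-renew-all.sh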

Week 4-8: Enterprise-scale Deployment

  1. Create Ansible/Puppet playbook for standardized deployment
  2. Roll out to dev/test servers first
  3. Establish monitoring and alerting for renewal failures
  4. Document rollback procedures before production deployment
  5. Production rollout in phases with change windows

Common Pitfalls and Solutions

Pitfall 1: Rate Limiting

Problem: Let's Encrypt limits issuance to 50 certificates per registered domain per week
Solution:
- Use the staging environment for testing (higher limits)
- Plan production rollout to stay under limits
- Use wildcard certificates where appropriate
- Spread rollout across multiple weeks if needed
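
Spreading a rollout across weeks is simple ceiling division, but it is easy to get wrong under deadline pressure. The sketch below takes the 50-per-week figure from the text as its default and reserves some headroom for unplanned issuance; the headroom value of 10 is an assumption, and the current limits should be checked against the Let's Encrypt documentation before planning.

```shell
# Weeks needed to issue N certificates under one registered domain
# at a given per-week limit (ceiling division).
weeks_needed() {
  local certs="$1" limit="${2:-50}"
  echo $(( (certs + limit - 1) / limit ))
}

# Print a week-by-week plan that stays below the limit by a headroom margin.
# Arguments: certificate count, weekly limit (default 50), headroom (default 10).
rollout_plan() {
  local certs="$1" limit="${2:-50}" headroom="${3:-10}"
  local per_week=$((limit - headroom)) week=1 batch
  [ "$per_week" -gt 0 ] || { echo "headroom must be below the limit" >&2; return 1; }
  while [ "$certs" -gt 0 ]; do
    batch=$(( certs < per_week ? certs : per_week ))
    echo "week $week: issue $batch certificates"
    certs=$((certs - batch))
    week=$((week + 1))
  done
}
```

For 100 certificates under one registered domain, `weeks_needed 100` prints 2, while `rollout_plan 100` schedules 40, 40, and 20 certificates over three weeks because of the headroom.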

Pitfall 2: DNS Propagation Delays

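Problem: DNS-01 challenges fail because the TXT record has not propagated when validation runs
Solution:
- Tune cert-manager's DNS-01 propagation check, for example by pointing it at resolvers that see the record promptly (the --dns01-recursive-nameservers and --dns01-recursive-nameservers-only controller flags)
- Use HTTP-01 validation where possible (faster)
- Pre-provision DNS records before certificate requests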
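
When a DNS-01 challenge fails, the first thing I check is whether every authoritative nameserver actually serves the TXT record. The sketch below does that with dig; the function names are assumptions, and the nameserver list has to come from your own zone.

```shell
# Map a certificate name to its DNS-01 challenge record name.
# Wildcards validate at the parent name:
#   *.example.com -> _acme-challenge.example.com
challenge_record() {
  printf '_acme-challenge.%s\n' "${1#\*.}"
}

# Return 0 when every listed nameserver serves the expected TXT value.
# Arguments: record name, expected token, then one or more nameservers.
txt_visible_everywhere() {
  local record="$1" expected="$2" ns answer
  shift 2
  for ns in "$@"; do
    answer="$(dig +short TXT "$record" "@$ns")"
    case "$answer" in
      *"\"$expected\""*) ;;   # this nameserver already has the value
      *) echo "not yet visible on $ns" >&2; return 1 ;;
    esac
  done
  return 0
}
```

A typical call is `txt_visible_everywhere "$(challenge_record '*.example.com')" "$TOKEN" ns1.example.net ns2.example.net`, where the token is the value your ACME client published and the nameserver names are placeholders.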

Pitfall 3: Renewal Failures During Outages

Problem: If the service is down during the renewal window, renewal fails
Solution:
- Start renewing 30 days before expiration so that failed attempts can be retried many times
- Serve /.well-known/acme-challenge/ from the web server or ingress layer so that HTTP-01 validation does not depend on application availability
- Alert on renewal status 30 days before expiration, not 7 days
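
The 30-day check can also run locally on each host, independent of the central monitoring stack. This is a minimal sketch built on openssl's -checkend option; it assumes Certbot's default /etc/letsencrypt/live layout, and the function names are illustrative.

```shell
# Return 0 if the PEM certificate in $1 expires within $2 days (default 30).
# openssl's -checkend exits 0 when the certificate is still valid after the
# given number of seconds, so the result is inverted here. An unreadable or
# invalid file is also reported as expiring, which errs on the safe side.
expires_within() {
  local cert="$1" days="${2:-30}"
  ! openssl x509 -checkend $((days * 86400)) -noout -in "$cert" >/dev/null 2>&1
}

# List every certificate under a Certbot-style live directory that is
# inside the expiry window.
report_expiring() {
  local dir="${1:-/etc/letsencrypt/live}" days="${2:-30}" cert
  for cert in "$dir"/*/cert.pem; do
    [ -e "$cert" ] || continue
    if expires_within "$cert" "$days"; then
      echo "EXPIRING: $cert"
    fi
  done
}
```

On a healthy host `report_expiring` prints nothing, because Certbot normally renews before the 30-day mark. Any output means renewals have been failing for a while.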

Phase 3: Monitoring and Operations (Week 12+)

Essential Monitoring

Certificate expiration alerts:

# Prometheus alert example
- alert: CertificateExpiringSoon
  expr: (x509_cert_not_after - time()) / 86400 < 30
  labels:
    severity: warning
  annotations:
    summary: "Certificate {{ $labels.name }} expires in {{ $value }} days"

- alert: CertificateExpiryCritical
  expr: (x509_cert_not_after - time()) / 86400 < 7
  labels:
    severity: critical
  annotations:
    summary: "Certificate {{ $labels.name }} expires in {{ $value }} days"

Renewal success rate:

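# Track renewal attempts vs. successes
# (custom counters: Certbot does not export Prometheus metrics itself)
certbot_renewal_attempts_total{}
certbot_renewal_success_total{}

# Alert on declining success rate
- alert: CertbotRenewalFailureRate
  expr: rate(certbot_renewal_success_total[24h]) / rate(certbot_renewal_attempts_total[24h]) < 0.95
  labels:
    severity: warning
  annotations:
    summary: "Certbot renewal success rate on {{ $labels.instance }} is below 95%"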
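
The two counters have to be produced by something, since Certbot has no built-in metrics endpoint. One option is node_exporter's textfile collector, which reads *.prom files from a configured directory. The wrapper below is a sketch under that assumption: both directory paths and all function names are hypothetical, and the counters count runs of the wrapped command, not individual certificates.

```shell
# Both paths are assumptions; TEXTFILE_DIR must match the directory passed to
# node_exporter via --collector.textfile.directory.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
STATE_DIR="${STATE_DIR:-/var/lib/certbot-metrics}"

# Read a persisted counter value (0 if it does not exist yet).
read_counter() {
  if [ -f "$STATE_DIR/$1" ]; then cat "$STATE_DIR/$1"; else echo 0; fi
}

# Increment a persisted counter by one.
bump_counter() {
  local value
  value="$(read_counter "$1")"
  printf '%s\n' "$((value + 1))" > "$STATE_DIR/$1"
}

# Write both counters in Prometheus text format, replacing the file atomically.
write_metrics() {
  local tmp
  tmp="$(mktemp "$TEXTFILE_DIR/certbot.prom.XXXXXX")" || return 1
  {
    echo "# TYPE certbot_renewal_attempts_total counter"
    echo "certbot_renewal_attempts_total $(read_counter attempts)"
    echo "# TYPE certbot_renewal_success_total counter"
    echo "certbot_renewal_success_total $(read_counter success)"
  } > "$tmp"
  chmod 644 "$tmp"   # mktemp creates the file with mode 0600
  mv "$tmp" "$TEXTFILE_DIR/certbot.prom"
}

# Run a command, count the attempt and its outcome, and keep its exit status.
record_renewal() {
  local status=0
  bump_counter attempts
  "$@" || status=$?
  if [ "$status" -eq 0 ]; then
    bump_counter success
  fi
  write_metrics
  return "$status"
}
```

In the renewal script this would wrap the existing call, for example `record_renewal certbot renew --quiet --deploy-hook "systemctl reload nginx"`. Persisting the counts in files keeps them monotonic across runs, which is what rate() expects from a counter.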

Operational Runbooks

Renewal Failure Response:

  1. Check the Let's Encrypt status page for incidents, and the ACME client logs for rate-limit errors
  2. Verify DNS/HTTP validation paths are accessible
  3. Check that the ACME account key is present and the account is still valid
  4. Review firewall rules for ACME server communication
  5. Fall back to the documented manual renewal procedure
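
Steps 2 and 4 of this runbook can be scripted so that the on-call engineer gets an answer in seconds. The following sketch uses curl; the function names and the "connectivity-test" token are made up for illustration, and the default URL is the Let's Encrypt production directory used earlier in this guide.

```shell
# Step 4: confirm the ACME directory is reachable from this host and looks
# like an RFC 8555 directory object (it must advertise a newOrder URL).
acme_directory_ok() {
  local url="${1:-https://acme-v02.api.letsencrypt.org/directory}"
  curl -fsS --max-time 10 "$url" | grep -q '"newOrder"'
}

# Step 2: print the HTTP status for a path under the HTTP-01 challenge
# directory of a domain. A 404 for this made-up token is expected; "000"
# means the connection itself failed.
http01_path_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' \
    "http://$1/.well-known/acme-challenge/connectivity-test"
}
```

For example, `acme_directory_ok || echo "cannot reach ACME server"` covers outbound firewall problems, and `http01_path_status example.com` run from outside your network shows whether port 80 reaches the web server at all.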

Emergency Manual Certificate Issuance:

# When automation fails and certificate expired
# Option 1: Force immediate Certbot renewal
sudo certbot renew --cert-name example.com --force-renewal

# Option 2: Manual certificate request
sudo certbot certonly --manual \
  -d example.com -d www.example.com \
  --preferred-challenges dns

Phase 4: Scaling and Optimization

Multi-Region Deployment

For global enterprises, deploy ACME automation regionally:

  • North America: Primary Let's Encrypt ACME endpoint
  • Europe: Consider regional ACME providers for GDPR compliance
  • Asia-Pacific: Local ACME providers or cached ACME responses

Private CA Integration

When you need private certificates with ACME automation:

# Smallstep CA setup with ACME support
step ca init --acme

# Configure cert-manager to use private CA
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  acme:
    server: https://internal-ca.example.com/acme/acme/directory
    skipTLSVerify: false  # Set true only for testing
    privateKeySecretRef:
      name: internal-ca-key
    solvers:
    - http01:
        ingress:
          class: internal
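
With skipTLSVerify left at false, cert-manager has to trust the private CA's root before it can talk to the ACME endpoint. Recent cert-manager versions accept a base64-encoded PEM bundle in the issuer's spec.acme.caBundle field for this. The helper below is a small sketch for producing that value; the function name is an assumption, and so is the root certificate path, which follows step-ca's default layout.

```shell
# Print a PEM file as single-line base64, the form the caBundle field expects.
ca_bundle_b64() {
  base64 < "$1" | tr -d '\n'
}
```

Assuming the default step-ca layout, `ca_bundle_b64 "$(step path)/certs/root_ca.crt"` prints the string to paste under spec.acme.caBundle in the ClusterIssuer above.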

Success Metrics

Week 4 (After Initial Deployment):

  • 50-100 certificates automated
  • First successful renewal observed (manually triggered if needed, since unattended renewals of 90-day certificates start around day 60)
  • Monitoring and alerting operational
  • Runbooks documented and tested

Week 12 (Full Production):

  • Significant portion of public certificates automated
  • Zero manual renewals for automated certificates
  • Renewal success rate >99%
  • Engineering time saved: 20-40 hours weekly

Month 6 (Mature Operations):

  • 95%+ of target certificates automated
  • Zero certificate-related outages
  • Certificate management operational cost reduced by 70-90%
  • Team capacity freed for strategic projects

When to Get Expert Help

You can implement ACME automation yourself if you have:

  • 2+ engineers comfortable with Kubernetes/Linux ops
  • 2-3 months implementation timeline
  • Ability to learn from mistakes and iterate
  • Under 10,000 certificates initially

Consider expert help if you have:

  • Complex multi-environment requirements (50+ systems)
  • Compliance requirements (PCI, HIPAA, SOC 2)
  • Aggressive timelines (8-12 weeks to production)
  • Previous failed automation attempts
  • Need for private CA integration from day one
