ACME Protocol Implementation: Step-by-Step Technical Guide

Part of the ACME Certificate Automation Guide

After implementing ACME-based certificate automation at multiple enterprises, I've learned that technical capability is only half the battle. The other half is understanding organizational constraints, mapping dependencies, and choosing deployment patterns that match your actual operational reality.

This guide provides the technical implementation playbook based on what actually works in production environments managing thousands to hundreds of thousands of certificates.

Phase 0: Discovery and Assessment (Week 1-2)

Don't skip this. Organizations that start implementing before understanding their current state spend 3x longer fixing problems they could have anticipated.

Certificate Inventory Discovery

What you think you have vs. what you actually have:

Expected: 5,000 certificates tracked in CMDB
Actual (after discovery): 12,000-15,000 certificates

Where the extras come from:
- Let's Encrypt certificates issued by DevOps teams (not tracked)
- Acquired company infrastructure (not integrated)
- Kubernetes service mesh auto-generated certificates
- Development/staging environments (assumed "temporary")

Discovery methods (run all in parallel):

Network scanning - Capture TLS handshakes for 30 days to identify all certificate usage
- Tools: Censys and Shodan for external endpoints, nmap for internal networks
- What it finds: Certificates you didn't know existed
Log analysis - Certificate issuance/renewal logs from all known sources
- Check: CA logs, web server logs, load balancer logs
- What it finds: Historical patterns revealing renewal processes
Application inventory - Survey every application owner
- Questions: What certificates do you use? Who manages them? How do you renew?
- What it finds: Shadow PKI systems and manual processes
Cloud provider audit
- AWS Certificate Manager inventory
- Azure Key Vault certificate list
- GCP Certificate Manager and Certificate Authority Service usage
Container orchestration
- Kubernetes: Check cert-manager deployments across all clusters
- Service mesh: Istio, Linkerd certificate generation
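
Because these methods overlap, their results need to be merged and compared against the CMDB before you can see the real gap. A minimal Python sketch of that step, assuming each method can export records keyed by certificate SHA-256 fingerprint; the `CertRecord` shape, the source names, and the sample data are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertRecord:
    """One certificate sighting reported by a discovery method (hypothetical shape)."""
    fingerprint: str  # SHA-256 fingerprint of the certificate
    subject: str      # subject common name or first SAN
    source: str       # discovery method that reported it


def merge_inventory(*sources):
    """Merge record lists from several discovery methods, keyed by fingerprint.

    Returns {fingerprint: sorted list of sources that saw the certificate}.
    """
    merged = {}
    for records in sources:
        for rec in records:
            merged.setdefault(rec.fingerprint, set()).add(rec.source)
    return {fp: sorted(srcs) for fp, srcs in merged.items()}


def untracked(merged, cmdb_fingerprints):
    """Return fingerprints found by discovery but missing from the CMDB."""
    return sorted(fp for fp in merged if fp not in cmdb_fingerprints)


# Illustrative data only
scan = [CertRecord("aa11", "app.example.com", "network-scan"),
        CertRecord("bb22", "dev.example.com", "network-scan")]
k8s = [CertRecord("bb22", "dev.example.com", "kubernetes"),
       CertRecord("cc33", "mesh.internal", "kubernetes")]

merged = merge_inventory(scan, k8s)
print(untracked(merged, cmdb_fingerprints={"aa11"}))
```

Keying on the fingerprint rather than the hostname keeps one certificate deployed on many hosts from being counted many times.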

Current Process Mapping

Document your actual renewal process, not what you think it is:

Example enterprise manual process (30-day timeline):
Day 1: Application owner realizes certificate expiring soon
Day 2-3: Searches wiki/Slack to find renewal process
Day 7: Infrastructure team generates CSR, submits ITSM ticket
Day 9: Security reviews request (2-day approval backlog)
Day 12: Certificate team submits to CA
Day 15: Certificate arrives, Change Advisory Board approval needed
Day 22: Change approved, scheduled for next maintenance window
Day 30: Certificate deployed, services restarted

Total visible cost: $200 CA fee
Total invisible cost: $2,000-$3,000 in labor
Teams involved: 5
Approval gates: 3
Risk: Any delay means outage
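
To see where the 30 days actually go, it helps to compute the wait before each step. A small Python sketch using the example timeline above ("Day 2-3" is recorded as day 3; the step labels are shortened for illustration):

```python
# Day on which each step of the example process completes.
timeline = [
    (1, "Owner notices upcoming expiry"),
    (3, "Finds renewal process"),
    (7, "CSR generated, ITSM ticket submitted"),
    (9, "Security review"),
    (12, "Submitted to CA"),
    (15, "Certificate arrives"),
    (22, "Change approved"),
    (30, "Certificate deployed"),
]


def waits(steps):
    """Return (days_waited, step) pairs for each step after the first, longest first."""
    gaps = [(day - prev_day, step)
            for (prev_day, _), (day, step) in zip(steps, steps[1:])]
    return sorted(gaps, reverse=True)


for days, step in waits(timeline)[:3]:
    print(f"{days} days waiting before: {step}")
```

In this example the two longest waits (7 and 8 days) both occur after the certificate has already arrived, which points at change approval and maintenance windows rather than at the CA.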

Key questions to answer:

How many approval gates actually exist?
Which approvals mitigate real risk vs. organizational inertia?
Can automation eliminate approvals or just speed them up?
What are your actual change windows for production systems?

Phase 1: Architecture Decisions (Week 2-4)

Decision Point 1: Public CA vs. Private CA vs. Hybrid

Public CA (Let's Encrypt) via ACME:

Best for: External-facing web services, APIs, customer-facing infrastructure
Pros: Free, automated, broadly trusted
Cons: 90-day validity max, requires domain validation, external dependency
Cost: Zero certificate fees, implementation only

Private CA with ACME support:

Best for: Internal services, IoT devices, code signing, regulated environments
Pros: Full control, custom validity periods, internal trust
Cons: Setup complexity, trust distribution challenges, operational overhead
Options: Smallstep CA (open source), HashiCorp Vault PKI, commercial CAs with ACME

Hybrid approach (our recommendation for most enterprises):

Public-facing: Let's Encrypt via ACME (significant portion of certificates)
Internal services: Private CA with ACME support (smaller portion)
Special cases: Commercial CA with extended validation or specific requirements (edge cases)

Decision Point 2: Tool Selection by Environment

Environment	Recommended Tool	Alternatives	Timeline to Production
Kubernetes	cert-manager	External Secrets Operator + ACME	1-2 weeks
Linux VMs	Certbot	acme.sh, Caddy (web server replacement)	2-3 weeks
Windows/IIS	win-acme	Certify The Web, ACMESharp	2-4 weeks
Load Balancers	Vendor-specific + Certbot	Custom integration	3-4 weeks
Multi-cloud	Terraform + cert-manager	Cloud-native solutions (ACM, Key Vault)	4-6 weeks

Decision Point 3: Centralized vs. Distributed Management

Centralized approach:

Single platform manages all certificates across environments
Pros: Unified visibility, consistent policies, audit trail
Cons: Single point of failure, higher initial complexity
Best for: Regulated industries, compliance-heavy environments

Distributed approach:

Each environment manages its own certificates (cert-manager in each K8s cluster)
Pros: Environment isolation, simpler per-environment setup, resilient to platform failures
Cons: Harder to get unified visibility, potential policy inconsistencies
Best for: Tech companies, DevOps-mature organizations

Hybrid approach (recommended):

Distributed automation with centralized monitoring and reporting
Each environment auto-renews independently
Central platform aggregates certificate data for visibility
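
The aggregation side of the hybrid model can stay very simple. A sketch in Python, assuming each environment periodically pushes a list of its certificates to a central collector; the report format, environment names, and issuer names here are hypothetical:

```python
from collections import Counter


def summarize(reports):
    """Aggregate per-environment certificate reports into counts for a central view.

    `reports` maps an environment name to a list of dicts with an "issuer" key.
    Renewal stays with each environment; this only reads what they report.
    """
    by_env = {env: len(certs) for env, certs in reports.items()}
    by_issuer = Counter(cert["issuer"] for certs in reports.values() for cert in certs)
    return {"total": sum(by_env.values()),
            "by_environment": by_env,
            "by_issuer": dict(by_issuer)}


reports = {
    "k8s-prod": [{"name": "app-example-com-tls", "issuer": "letsencrypt-production"}],
    "linux-vms": [{"name": "example.com", "issuer": "letsencrypt-production"},
                  {"name": "intranet.example.com", "issuer": "internal-ca"}],
}
summary = summarize(reports)
print(summary)
```

The central platform never touches private keys in this design, so an outage there costs visibility but not renewals.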

Phase 2: Implementation (Week 4-12)

Kubernetes Implementation with cert-manager

Week 1-2: Setup and Configuration

# Install cert-manager via Helm
kubectl create namespace cert-manager

helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.14.0 \
  --set installCRDs=true

# Create Let's Encrypt ClusterIssuer (Kubernetes manifest; save to a file and apply with kubectl apply -f)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com  # placeholder; use a monitored team address
    privateKeySecretRef:
      name: letsencrypt-production-key
    solvers:
    - http01:
        ingress:
          class: nginx
    # DNS validation for wildcard certificates
    - dns01:
        cloudDNS:
          project: your-gcp-project
          serviceAccountSecretRef:
            name: clouddns-dns01-solver-sa
            key: key.json

Week 2-4: Certificate Definitions

# Example: Automated certificate for application
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-com-tls
  namespace: production
spec:
  secretName: app-example-com-tls
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
  dnsNames:
  - app.example.com
  - www.app.example.com

Week 4-8: Rollout Strategy

Start with non-production clusters (dev, staging)
Validate automatic issuance and renewal
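Monitor through the first automatic renewal; cert-manager renews at two-thirds of the certificate lifetime by default, so expect it around day 60 for a 90-day certificate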
Migrate production workloads in phases:
- Non-critical services first
- Customer-facing services during low-traffic windows
- Critical infrastructure last with full rollback plan
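
Before moving to the next rollout phase, it is worth checking that every Certificate in the current one is actually Ready. A minimal Python sketch that inspects the parsed output of `kubectl get certificates -A -o json`; the sample below is a trimmed, illustrative version of that structure and the staging certificate name is made up:

```python
import json


def not_ready(cert_list):
    """Return "namespace/name" for Certificates whose Ready condition is not "True"."""
    failing = []
    for item in cert_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                    for c in conditions)
        if not ready:
            meta = item["metadata"]
            failing.append(f'{meta["namespace"]}/{meta["name"]}')
    return sorted(failing)


# Trimmed sample of the JSON structure, for illustration only
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "production", "name": "app-example-com-tls"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"namespace": "staging", "name": "api-example-com-tls"},
   "status": {"conditions": [{"type": "Ready", "status": "False"}]}}
]}
""")
print(not_ready(sample))
```

A Certificate with no status yet is treated as not ready, which is the safe default for a rollout gate.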

Traditional Infrastructure Implementation with Certbot

Week 1-2: Installation and Testing

# Install Certbot (Ubuntu/Debian example)
sudo apt update
sudo apt install certbot python3-certbot-nginx

# Test certificate issuance for single domain
sudo certbot --nginx -d example.com -d www.example.com

# Verify automatic renewal is configured
sudo systemctl status certbot.timer
sudo certbot renew --dry-run

Week 2-4: Automation Setup

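# Create central renewal script
#!/bin/bash
# /usr/local/bin/certbot-renew-all.sh
# Runs "certbot renew", logs the outcome, and alerts if the run fails

LOGFILE="/var/log/certbot-renewal.log"
echo "$(date): Starting certificate renewal check" >> "$LOGFILE"

# The deploy hook runs only when a certificate was actually renewed
if certbot renew --quiet \
  --deploy-hook "systemctl reload nginx" \
  >> "$LOGFILE" 2>&1; then
    echo "$(date): Renewal check completed successfully" >> "$LOGFILE"
else
    echo "$(date): Renewal check failed" >> "$LOGFILE"
    # Alert on failure (send-alert.sh is a site-specific script, not part of Certbot)
    /usr/local/bin/send-alert.sh "Certbot renewal failed on $(hostname)"
fi

# Make the script executable (chmod +x), then add to root's crontab (runs daily at 02:00)
# If certbot.timer is already active, disable one of the two to avoid duplicate runs
0 2 * * * /usr/local/bin/certbot-renew-all.sh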
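
A successful exit code does not prove that the certificate on disk is fresh, so an independent expiry check is a useful companion to the renewal job. A Python sketch that shells out to the `openssl` CLI (assumed to be installed) and parses its `notAfter=` line; the certificate path in the commented example follows Certbot's default layout and the name must be adjusted:

```python
import ssl
import subprocess
import time


def parse_enddate(line):
    """Convert openssl's "notAfter=Jan  1 00:00:00 2030 GMT" line to epoch seconds."""
    prefix = "notAfter="
    if not line.startswith(prefix):
        raise ValueError(f"unexpected openssl output: {line!r}")
    return ssl.cert_time_to_seconds(line[len(prefix):].strip())


def days_remaining(cert_path, now=None):
    """Days until the certificate at cert_path expires, read via the openssl CLI."""
    out = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    ).stdout
    now = time.time() if now is None else now
    return (parse_enddate(out.strip()) - now) / 86400


# Example call (run as a user that can read the file):
# print(days_remaining("/etc/letsencrypt/live/example.com/cert.pem"))
```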

Week 4-8: Enterprise-scale Deployment

Create an Ansible playbook or Puppet manifest for standardized deployment
Roll out to dev/test servers first
Establish monitoring and alerting for renewal failures
Document rollback procedures before production deployment
Production rollout in phases with change windows

Common Pitfalls and Solutions

Pitfall 1: Rate Limiting

Problem: Let's Encrypt limits issuance to 50 new certificates per registered domain per week
Solution:
- Use the staging environment (https://acme-staging-v02.api.letsencrypt.org/directory) for testing; its limits are much higher
- Plan production rollout to stay under limits
- Use wildcard certificates where appropriate
- Spread rollout across multiple weeks if needed
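
Spreading a rollout across weeks is easy to plan up front. A minimal Python sketch, assuming every hostname falls under the same registered domain and gets its own certificate; the `headroom` value is an arbitrary safety margin I picked for illustration, not something the CA specifies:

```python
def plan_batches(hostnames, weekly_limit=50, headroom=10):
    """Split hostnames under one registered domain into weekly issuance batches.

    `headroom` reserves part of the weekly limit for unplanned issuance.
    """
    per_week = weekly_limit - headroom
    if per_week <= 0:
        raise ValueError("headroom must be smaller than weekly_limit")
    names = sorted(set(hostnames))  # de-duplicate so no name is requested twice
    return [names[i:i + per_week] for i in range(0, len(names), per_week)]


hosts = [f"svc{n:03d}.example.com" for n in range(130)]
batches = plan_batches(hosts)
for week, batch in enumerate(batches, start=1):
    print(f"week {week}: {len(batch)} certificates")
```

With 130 hostnames and 40 issuances per week, the rollout takes four weeks.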

Pitfall 2: DNS Propagation Delays

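Problem: DNS-01 challenges fail when the CA checks before the TXT record has propagated to all authoritative nameservers
Solution:
- Give propagation more time: cert-manager's self-check can be tuned with the --dns01-recursive-nameservers and --dns01-check-retry-period controller flags; Certbot DNS plugins accept a --dns-<plugin>-propagation-seconds option
- Use HTTP-01 validation where possible (faster)
- Pre-provision a CNAME that delegates _acme-challenge to a zone you can update quickly; the TXT value itself is issued per challenge and cannot be created in advance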
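
A propagation wait boils down to polling until the challenge record is visible, with a hard timeout. A minimal Python sketch of that loop; the `check` callable is a stand-in for a real TXT lookup against your authoritative nameservers (not shown here), and the timeout and interval defaults are assumptions:

```python
import time


def wait_for(check, timeout=300.0, interval=10.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in for a real DNS lookup: reports success on the third poll.
answers = iter([False, False, True])
print(wait_for(lambda: next(answers), timeout=5, interval=0))
```

Using a monotonic clock keeps the timeout correct even if the system clock is adjusted during the wait.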

Pitfall 3: Renewal Failures During Outages

Problem: If the service is down during the renewal window, renewal fails
Solution:
- Configure multiple renewal attempts over a 30-day window
- Use HTTP-01 challenges with .well-known/acme-challenge paths served directly by the web server or proxy, so validation does not depend on application availability
- Monitor renewal status from 30 days before expiration, not 7 days
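
The size of the retry window follows directly from the certificate lifetime and the renewal lead time. A small Python sketch of the arithmetic, assuming a 30-day lead time (Certbot's long-standing default) and one attempt per day as in the daily cron job; both are parameters to adjust for your client:

```python
from datetime import datetime, timedelta, timezone


def renewal_window(not_before, not_after, renew_before=timedelta(days=30),
                   attempts_per_day=1):
    """Return (first_attempt_time, number_of_attempts) before the certificate expires."""
    # Never start before the certificate is valid (short-lived certificates)
    start = max(not_after - renew_before, not_before)
    days = (not_after - start) / timedelta(days=1)
    return start, int(days * attempts_per_day)


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
start, attempts = renewal_window(issued, expires)
print(start.date(), attempts)
```

For a 90-day certificate this gives 30 daily attempts, so a multi-day outage still leaves weeks of retries as long as someone is alerted early in the window.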

Phase 3: Monitoring and Operations (Week 12+)

Essential Monitoring

Certificate expiration alerts:

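# Prometheus alert example
# Assumes an exporter that publishes x509_cert_not_after as a Unix timestamp
# (for example x509-certificate-exporter); adjust label names to match yours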
- alert: CertificateExpiringSoon
  expr: (x509_cert_not_after - time()) / 86400 < 30
  labels:
    severity: warning
  annotations:
    summary: "Certificate {{ $labels.name }} expires in {{ $value }} days"

- alert: CertificateExpiryCritical
  expr: (x509_cert_not_after - time()) / 86400 < 7
  labels:
    severity: critical
  annotations:
    summary: "Certificate {{ $labels.name }} expires in {{ $value }} days"

Renewal success rate:

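# Track renewal attempts vs. successes
# Certbot does not export these metrics itself; they are custom counters
# that your renewal wrapper script must publish
certbot_renewal_attempts_total{}
certbot_renewal_success_total{}

# Alert on declining success rate
- alert: CertbotRenewalFailureRate
  expr: rate(certbot_renewal_success_total[24h]) / rate(certbot_renewal_attempts_total[24h]) < 0.95
  labels:
    severity: warning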
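
One way to publish these two counters is node_exporter's textfile collector, which scrapes `*.prom` files from a configured directory. A minimal Python sketch, assuming your renewal wrapper keeps running totals and calls this after each run; the output path is whatever directory your node_exporter is configured to read:

```python
import os


def render_metrics(attempts, successes):
    """Render the renewal counters in the Prometheus text exposition format."""
    lines = [
        "# HELP certbot_renewal_attempts_total Renewal runs started.",
        "# TYPE certbot_renewal_attempts_total counter",
        f"certbot_renewal_attempts_total {attempts}",
        "# HELP certbot_renewal_success_total Renewal runs that exited 0.",
        "# TYPE certbot_renewal_success_total counter",
        f"certbot_renewal_success_total {successes}",
    ]
    return "\n".join(lines) + "\n"


def write_metrics(path, attempts, successes):
    """Write the metrics file atomically so the collector never reads a partial file."""
    tmp = f"{path}.tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(render_metrics(attempts, successes))
    os.replace(tmp, path)


print(render_metrics(attempts=10, successes=9), end="")
```

The totals must be persisted between runs (for example in a small state file) so the counters only ever increase.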

Operational Runbooks

Renewal Failure Response:

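Check client logs for rate-limit errors, and check the Let's Encrypt public status page for service incidents
Verify DNS/HTTP validation paths are accessible
Check that the ACME account key is still present and the account has not been deactivated
Review firewall rules for ACME server communication
If automated renewal still fails, follow the documented manual renewal fallback procedure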

Emergency Manual Certificate Issuance:

# When automation fails and certificate expired
# Option 1: Force immediate Certbot renewal
sudo certbot renew --cert-name example.com --force-renewal

# Option 2: Manual certificate request
sudo certbot certonly --manual \
  -d example.com -d www.example.com \
  --preferred-challenges dns

Phase 4: Scaling and Optimization

Multi-Region Deployment

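For global enterprises, deploy ACME automation regionally. Let's Encrypt publishes a single global ACME endpoint, so the regional choices are mostly about where clients run and which CA each region uses:

North America: Let's Encrypt ACME endpoint as the primary CA
Europe: Consider regional ACME providers if GDPR or data-residency policy requires it
Asia-Pacific: Consider local ACME providers; ACME requests are signed and nonce-protected, so responses cannot be cached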

Private CA Integration

When you need private certificates with ACME automation:

# Smallstep CA setup with ACME support
step ca init --acme

# Configure cert-manager to use private CA (Kubernetes manifest; apply with kubectl apply -f)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
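  acme:
    server: https://internal-ca.example.com/acme/acme/directory
    # caBundle: <base64-encoded PEM of the internal root CA>  # needed when the CA's TLS certificate is not publicly trusted
    skipTLSVerify: false  # Set true only for testing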
    privateKeySecretRef:
      name: internal-ca-key
    solvers:
    - http01:
        ingress:
          class: internal
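
Before pointing cert-manager at a private CA, it is worth confirming that the directory URL really serves an ACME directory. A Python sketch that checks for the endpoints a client needs to issue a certificate under RFC 8555; the `cafile` path in the commented example is hypothetical, and the URL is the one from the manifest above:

```python
import json
import ssl
import urllib.request

# Endpoints an ACME client needs in order to request a certificate
REQUIRED_FIELDS = ("newNonce", "newAccount", "newOrder")


def missing_directory_fields(directory):
    """Return required ACME directory fields absent from the parsed JSON."""
    return [field for field in REQUIRED_FIELDS if field not in directory]


def fetch_directory(url, cafile=None, timeout=10):
    """Download and parse an ACME directory, trusting `cafile` if given."""
    context = ssl.create_default_context(cafile=cafile)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return json.load(resp)


# Example call against the internal CA:
# directory = fetch_directory(
#     "https://internal-ca.example.com/acme/acme/directory",
#     cafile="/etc/ssl/internal-root-ca.pem")
# print(missing_directory_fields(directory))
```

Passing the internal root as `cafile` exercises the same trust path cert-manager will need, which catches certificate-chain problems before they show up as failed orders.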