
Migrating from HashiCorp Vault

Despite being a DevOps tool, HashiCorp Vault has become the go-to tool for companies that extend the platform to manage secrets tied to human interactions. In this blog, I aim to demonstrate how Axon Shield can help secure Vault further by adding secure interfaces for sharing secrets. I will also describe a comprehensive method for migrating secrets without significant effort. In simple terms, I will explain how AWS KMS can be used to encrypt the database that backs the new secret store.

12/22/2024 · 8 min read

HashiCorp Vault is an excellent platform for managing secrets for CI/CD pipelines and software integrations, among other things. Many IT companies have gone a step further and use Vault as a policy engine to manage human or shared secrets, for people as well as for other software. These use cases are questionable, because the humans themselves end up with access to the secrets, and very often a secret can be compromised the moment it is created.

Combining Axon Shield with HashiCorp Vault gives you a process that either extends Vault with secure interfaces for people who share secrets, or removes Vault altogether with little or no disruption to current users.

The planned procedure takes a systematic approach that puts the key components to proper use while collecting the usage data needed to make the right decisions. An architecture built from API Gateway, a Lambda proxy, DynamoDB, and CloudWatch provides a seamless transition and monitoring, with everything running serverless on AWS. The guide gives a high-level view of the request flow, the error handling, and the monitoring that keeps operations efficient and shows how the migration is progressing. Key features of the migration include gradual transition, detailed monitoring, and security measures, all of which aim to guarantee that secrets remain intact and available throughout the conversion.

The guide closes with a checklist of migration steps that helps businesses make the shift with less downtime and risk.

Introduction

This guide details a systematic approach to migrating secrets from HashiCorp Vault to a cloud-based solution that uses AWS KMS and is backed by a database. Part of that approach is understanding what will change along the way.

Access control is not among the topics covered here, but we can go into more detail on it if you would like. Please let us know.

The solution described here runs entirely on serverless AWS services, which keeps costs down, scales well, and keeps the infrastructure easy to manage. The whole system can be integrated into a Terraform-based CI/CD pipeline and takes about 5 minutes to build, tear down, or rebuild.

System Components

The architecture consists of several key components working together to facilitate a seamless migration from HashiCorp Vault to a new database backend while maintaining continuous service:

API Gateway: The main entry point for all client requests
Lambda Proxy: Intelligent routing and replication logic
HashiCorp Vault: Original secret storage
DynamoDB: New secret backend
AWS KMS: Encryption service providing cryptographic security for the new backend
CloudWatch: Monitoring and logging

The API Gateway accepts requests and passes them to a Lambda integration. The Lambda logic checks whether the secret has already been replicated into the new DynamoDB-based backend.

Encryption is provided by AWS KMS (FIPS 140-2 Level 3 security). Detailed error and operational logs are pulled asynchronously from CloudWatch Logs into an S3 bucket. QuickSight, the AWS-integrated BI service, shows basic metrics such as how many requests each backend serves, along with statistics for clients and "applications". This information feeds back into the transition process.

Request Flow Examples

1. Reading a Secret (GET)

Scenario A: Secret exists in new backend

GET /secret/myapp/database/credentials
Host: api.vault-proxy.example.com
X-Vault-Token: hvs.xxxxxxx

Control and data flow:

API Gateway receives request
Lambda checks DynamoDB for path "secret/myapp/database/credentials"
Secret found in DynamoDB, decrypted using KMS
Response returned directly from new backend
Response includes header X-Backend-Source: new

Scenario B: Secret not yet in new backend

Control and data flow:

API Gateway receives request
Lambda checks DynamoDB for path "secret/myapp/api/key"
Secret not found in DynamoDB
Request forwarded to Vault
Secret retrieved from Vault
Lambda replicates secret to new backend:
- Encrypts data using KMS
- Stores in DynamoDB
Original Vault response returned to client
Response includes header X-Backend-Source: vault

Error Scenario: Vault Unreachable

Flow if secret exists in new backend:

Lambda fails to reach Vault
Since secret exists in DynamoDB, returns from new backend
Service continues without interruption

Flow if secret not in new backend:

Lambda fails to reach Vault
Returns 503 Service Unavailable
Error logged to CloudWatch
Metric emitted for monitoring
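
The read flows above can be condensed into a short sketch. This is only an illustration under stated assumptions: the three callables (read_new_backend, read_vault, replicate) and the VaultUnreachable exception are hypothetical stand-ins for the real DynamoDB, Vault, and replication code, and the result is a simplified dictionary rather than a full API Gateway response.

```python
from typing import Callable, Optional


class VaultUnreachable(Exception):
    """Hypothetical error raised by the Vault client when Vault is down."""


def read_secret(
    path: str,
    read_new_backend: Callable[[str], Optional[dict]],
    read_vault: Callable[[str], dict],
    replicate: Callable[[str, dict], None],
) -> dict:
    """Serve a GET for `path`, preferring the new backend over Vault."""
    secret = read_new_backend(path)
    if secret is not None:
        # Scenario A: served from the new backend, Vault is not contacted.
        return {"status": 200, "headers": {"X-Backend-Source": "new"}, "body": secret}

    try:
        secret = read_vault(path)
    except VaultUnreachable:
        # Error scenario: not replicated yet and Vault cannot be reached.
        return {"status": 503, "headers": {}, "body": {"errors": ["Vault unreachable"]}}

    # Scenario B: copy the secret across, then return the Vault response.
    replicate(path, secret)
    return {"status": 200, "headers": {"X-Backend-Source": "vault"}, "body": secret}
```

Passing the backends in as callables keeps the flow easy to test without AWS or Vault access; logging and metric emission are left out for brevity.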

Let's see how we are implementing the service.

Phase 1: Implementing the Monitoring Proxy

The monitoring proxy is built on Amazon API Gateway backed by AWS Lambda. Together they intercept Vault API calls and provide a scalable, managed solution. Here is the detailed architecture.

Infrastructure - API Gateway

API Gateway provides the basic infrastructure components that allow API requests to be processed.

1. REST API Definition

We start by creating a regional API Gateway endpoint. Why a regional endpoint rather than an edge-optimized one?

Edge-optimized endpoints route requests through the CloudFront network, which is aimed at geographically distributed clients rather than clients inside a single AWS region.
For clients in the same region, edge-optimized endpoints can sometimes have worse latency than regional endpoints.
A regional endpoint can also compare favorably on cost.

2. Proxy Resource Configuration

Configuring the path with the {proxy+} path parameter is indispensable because:

It captures all paths, including ones that are added in the future
It mirrors Vault's hierarchical path structure
It supports dynamic keys
This way, every path such as /secret/myapp/credentials is mapped correctly
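
As a rough illustration of how the Lambda can recover the Vault path, the sketch below reads the value that API Gateway captures for the {proxy+} parameter. It assumes the greedy resource is defined directly under the API root, so the captured value is the full secret path; the function name is hypothetical.

```python
def vault_path_from_event(event: dict) -> str:
    """Extract the secret path captured by the {proxy+} resource."""
    # API Gateway places the greedy path match under pathParameters["proxy"].
    params = event.get("pathParameters") or {}
    path = params.get("proxy", "").strip("/")
    if not path:
        raise ValueError("request did not include a secret path")
    return path
```

A request to /secret/myapp/credentials would then yield "secret/myapp/credentials", which can be used as the lookup key in both backends.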

3. Method Configuration

The configuration must handle any request, which is why we use an ANY method configuration:

Permits all HTTP methods (GET, POST, PUT, DELETE)
Gives the most flexible coverage of the Vault API
Is extensible, so further HTTP methods are handled without changes
Preserves the original request method for Lambda

4. Lambda Integration

The Lambda proxy integration lets API Gateway call Lambda functions directly, which helps scalability. Note that this approach limits the size of the payload that can be returned (Lambda caps synchronous invocation payloads at 6 MB), which is lower than the limit HashiCorp Vault itself imposes.

AWS_PROXY is used to directly integrate with Lambda
Request details are mapped to the Lambda event automatically
Headers, query strings, and body are retained
Transformation overhead is reduced

Security Considerations

Authorization:
- Authorization is handled at the Lambda level
- Preserves Vault token authentication
- Allows for future auth method additions
Request Validation:
- Path parameters are required
- Method validation at Lambda layer
- Preserves Vault's security model
SSL/TLS:
- HTTPS enforced by default
- TLS termination at API Gateway
- Backend communication secured

Logic - Lambda Proxy

1. Main Handler Function

The Lambda handler is the gateway through which all requests pass. Its main responsibilities are:

Parse incoming API Gateway events
Route requests based on the HTTP method
Apply the read caching strategy
Handle errors and generate metrics
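
A minimal sketch of such a handler is shown here. It assumes the REST API proxy event format (httpMethod, pathParameters, body) and takes the read and write logic as hypothetical callables, on_read and on_write. Methods other than GET, POST, and PUT are rejected in this sketch, and metric emission is omitted.

```python
import json
from typing import Callable


def make_handler(
    on_read: Callable[[str], dict],
    on_write: Callable[[str, dict], dict],
) -> Callable[..., dict]:
    """Build a Lambda handler that routes requests by HTTP method."""

    def handler(event: dict, context: object = None) -> dict:
        method = event.get("httpMethod", "")
        path = (event.get("pathParameters") or {}).get("proxy", "")
        try:
            if method == "GET":
                result = on_read(path)
            elif method in ("POST", "PUT"):
                result = on_write(path, json.loads(event.get("body") or "{}"))
            else:
                return _response(405, {"errors": [f"unsupported method {method}"]})
            return _response(200, result)
        except Exception as exc:
            # Report only the error type so secret material never leaks.
            return _response(500, {"errors": [type(exc).__name__]})

    return handler


def _response(status: int, payload: dict) -> dict:
    """Shape a result the way the Lambda proxy integration expects it."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }
```

Building the handler from injected functions is a design choice of this sketch; it keeps the routing logic separate from the backend code and easy to exercise locally.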

2. New Backend Read Function

The get_secret_from_new_backend function implements the read path for the new backend:

Queries DynamoDB for the latest version of the secret
Decrypts the data using KMS
Returns a response formatted to match Vault's format
Returns None if the secret is not found

Error handling:

DynamoDB errors do not cause the request to fail
KMS decryption errors are recorded and the fallback to Vault is used
The read path keeps working even under partial failures
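
The behavior described above might look roughly like the following. This is a sketch, not the actual implementation: it assumes `table` behaves like a boto3 DynamoDB Table resource and `kms` like a boto3 KMS client, that each secret is stored as a single item keyed by a hypothetical `path` attribute (so the stored item is the latest version), and that the ciphertext is kept base64-encoded in a hypothetical `ciphertext` attribute. The returned envelope is a minimal Vault-like shape.

```python
import base64
import json
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def get_secret_from_new_backend(path: str, table: Any, kms: Any) -> Optional[dict]:
    """Return the secret from the new backend, or None to trigger fallback."""
    try:
        item = table.get_item(Key={"path": path}).get("Item")
    except Exception:
        # DynamoDB problems must not fail the request: fall back to Vault.
        logger.exception("DynamoDB read failed for %s", path)
        return None
    if item is None:
        return None

    try:
        blob = base64.b64decode(item["ciphertext"])
        plaintext = kms.decrypt(CiphertextBlob=blob)["Plaintext"]
        return {"data": json.loads(plaintext)}
    except Exception:
        # Decryption problems are recorded and also lead to the fallback.
        logger.exception("KMS decrypt failed for %s", path)
        return None
```

Returning None for every failure mode means the caller only has to handle one case: anything other than a usable secret results in a read from Vault.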

3. Secret Replication Function

The replicate_secret function copies secrets from the existing backend to the new one. Its design has these properties:

Three separate stages (verify, encrypt, store)
Each stage is tracked in detail
Non-blocking operation
Idempotent design

Operation flow:

replication:
    Check if secret exists
    IF writing OR (reading AND not exists):
        Encrypt data with KMS
        Store in DynamoDB
        Emit success metrics
    Record metrics for stage completion
    Handle errors without blocking main request
END replication
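
One possible Python rendering of this flow is sketched below. The assumptions are the same kind as before and are not taken from the original implementation: `table` behaves like a boto3 DynamoDB Table resource, `kms` like a boto3 KMS client, and the attribute names `path` and `ciphertext` as well as the `operation` values "read" and "write" are hypothetical. Metric emission is left out.

```python
import base64
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def replicate_secret(
    path: str, data: dict, operation: str, table: Any, kms: Any, key_id: str
) -> bool:
    """Copy a secret to the new backend. Returns True only if it was stored."""
    try:
        # Stage 1 (verify): is the secret already in the new backend?
        exists = "Item" in table.get_item(Key={"path": path})
        if operation == "read" and exists:
            return False  # nothing to do, reads never overwrite

        # Stage 2 (encrypt): protect the payload with the KMS key.
        plaintext = json.dumps(data).encode("utf-8")
        blob = kms.encrypt(KeyId=key_id, Plaintext=plaintext)["CiphertextBlob"]

        # Stage 3 (store): put_item overwrites, so repeating the call is safe.
        table.put_item(
            Item={"path": path, "ciphertext": base64.b64encode(blob).decode("ascii")}
        )
        return True
    except Exception:
        # Replication failures are logged but never fail the main request.
        logger.exception("replication failed for %s", path)
        return False
```

Two caveats apply to this sketch. KMS Encrypt accepts at most 4 KB of plaintext, so larger secrets would need envelope encryption with a data key. And although errors never propagate to the caller, the call itself is still synchronous here.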

4. Error Handling Strategy

The Lambda implements comprehensive error handling:

Vault Errors:
- Network timeouts
- Authentication failures
- Permission issues
- Records error type and returns appropriate status
DynamoDB Errors:
- Throttling
- Consistency issues
- Permission problems
- Allows fallback to Vault
KMS Errors:
- Key access issues
- Encryption/decryption failures
- Records for monitoring

5. Metric Emission

The Lambda emits detailed metrics for monitoring:

Operation Metrics:
- Read vs Write operations
- Backend source (new vs Vault)
- Response times
- Error rates
Stage Metrics:
- Success/failure per stage
- Stage duration
- Error categorization
Migration Progress:
- Replication success rate
- Backend usage distribution
- Error patterns
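
To make the operation metrics concrete, the sketch below builds a payload in the shape accepted by the CloudWatch PutMetricData API. The metric names, dimension names, and the namespace mentioned in the comment are assumptions chosen for illustration, not names from the original implementation.

```python
def build_metric_data(
    operation: str, backend: str, latency_ms: float, error: bool
) -> list:
    """Build CloudWatch metric entries for one proxied request."""
    dimensions = [
        {"Name": "Operation", "Value": operation},  # "read" or "write"
        {"Name": "Backend", "Value": backend},      # "new" or "vault"
    ]
    return [
        {"MetricName": "Requests", "Dimensions": dimensions,
         "Value": 1, "Unit": "Count"},
        {"MetricName": "Latency", "Dimensions": dimensions,
         "Value": latency_ms, "Unit": "Milliseconds"},
        {"MetricName": "Errors", "Dimensions": dimensions,
         "Value": 1 if error else 0, "Unit": "Count"},
    ]


# With boto3 this could be sent as, for example:
#   cloudwatch.put_metric_data(Namespace="VaultMigration",
#                              MetricData=build_metric_data("read", "new", 45, False))
```

Keeping the backend as a dimension is what later allows a dashboard to show how many requests each backend serves.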

Key Design Aspects

1. Read-Through Strategy

The Lambda implements an intelligent read-through strategy:

Checks new backend first for reads
Falls back to Vault if not found
Automatically replicates missing secrets
Maintains consistency during migration

2. Write Handling

Write operations follow a specific pattern:

Always write to Vault first
Only replicate on successful Vault write
Ensures Vault remains source of truth
Maintains consistency across backends
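
The write pattern can be sketched in a few lines. The callables write_vault and replicate are hypothetical stand-ins for the real Vault client call and the replication function; the sketch assumes write_vault raises an exception when the Vault write fails.

```python
from typing import Callable


def write_secret(
    path: str,
    data: dict,
    write_vault: Callable[[str, dict], dict],
    replicate: Callable[[str, dict], None],
) -> dict:
    """Write to Vault first, then replicate only if Vault accepted the write."""
    # If this raises, the error propagates and nothing is replicated,
    # so Vault stays the source of truth.
    vault_response = write_vault(path, data)

    try:
        replicate(path, data)
        replicated = True
    except Exception:
        # A failed replication must not turn a successful write into an error.
        replicated = False

    return {"vault_response": vault_response, "replicated": replicated}
```

A secret whose replication failed is simply picked up later by the read-through path on its next access.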

3. Performance Considerations

The implementation optimizes for performance:

Asynchronous replication where possible
Minimal blocking operations
Efficient error handling
Request pipelining

4. Security Implementation

Security measures include:

Token forwarding to Vault
KMS encryption for new backend
Secure error handling
Audit logging

Migration Features

1. Progressive Migration

The design supports gradual migration:

No downtime required
Secrets migrate on first access
Write operations maintain consistency
Fallback capabilities

2. Monitoring and Visibility

Comprehensive monitoring through:

CloudWatch metrics
Structured logging
Error tracking
Migration progress metrics

3. Operational Controls

The implementation includes:

Circuit breakers for backend failures
Configurable timeouts
Error thresholds
Monitoring alerts

Error Categories

The implementation categorizes errors into:

Infrastructure Errors:
- Network issues
- Service unavailability
- Timeout problems
Data Errors:
- Validation failures
- Format issues
- Version conflicts
Permission Errors:
- Authentication failures
- Authorization issues
- Token problems
Replication Errors:
- Encryption failures
- Storage issues
- Consistency problems

Monitoring and Logging Infrastructure

1. Short-term Logging (CloudWatch)

The system implements a tiered logging approach starting with CloudWatch:

1-week retention in CloudWatch Logs
Structured JSON log format
Real-time log ingestion
Immediate searchability

Example log structure:

{
  "timestamp": "2024-12-22T10:15:30Z",
  "request_data": {
    "path": "secret/myapp/credentials",
    "method": "GET",
    "client_ip": "10.0.1.100",
    "user_agent": "python-requests/2.28.1"
  },
  "response_data": {
    "status_code": 200,
    "latency_ms": 45,
    "backend_source": "new"
  },
  "metadata": {
    "token_hash": "abc123...",
    "operation_id": "op-123"
  }
}
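
A log entry of this shape could be produced along the following lines. The sketch assumes the REST API proxy event format and uses a plain SHA-256 digest for token_hash; the original does not state which hash is used, so treat that choice (and the function names) as assumptions. A keyed hash such as HMAC would be a reasonable alternative.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def hash_token(token: str) -> str:
    """Hash the Vault token so requests can be correlated without storing it."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def build_log_entry(
    event: dict,
    status_code: int,
    latency_ms: int,
    backend_source: str,
    operation_id: str,
    now: Optional[datetime] = None,
) -> str:
    """Render one structured JSON log line for a proxied request."""
    # Header names are case-insensitive, so normalize them first.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    identity = (event.get("requestContext") or {}).get("identity") or {}
    now = now or datetime.now(timezone.utc)
    entry = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "request_data": {
            "path": (event.get("pathParameters") or {}).get("proxy", ""),
            "method": event.get("httpMethod", ""),
            "client_ip": identity.get("sourceIp", ""),
            "user_agent": headers.get("user-agent", ""),
        },
        "response_data": {
            "status_code": status_code,
            "latency_ms": latency_ms,
            "backend_source": backend_source,
        },
        "metadata": {
            "token_hash": hash_token(headers.get("x-vault-token", "")),
            "operation_id": operation_id,
        },
    }
    return json.dumps(entry)
```

The raw token never appears in the output; only its digest is logged.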

2. Long-term Storage (S3)

Logs are archived to S3 with:

Structured directory hierarchy
Compression for storage efficiency
Lifecycle policies for cost optimization
Athena-optimized format

Directory structure:

vault-logs/
└── YYYY/
    └── MM/
        └── DD/
            └── HH/
                ├── operation_logs.json.gz
                └── replication_logs.json.gz
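
The object key for an archived batch follows directly from its timestamp. The sketch below derives the key and compresses a batch of log entries as newline-delimited JSON, a layout Athena can read. The function names are hypothetical, and the upload itself (for example with the S3 PutObject API) is not shown.

```python
import gzip
import json
from datetime import datetime


def archive_key(ts: datetime, kind: str = "operation") -> str:
    """Build the S3 key for the hour that `ts` falls into."""
    return f"vault-logs/{ts:%Y/%m/%d/%H}/{kind}_logs.json.gz"


def compress_entries(entries: list) -> bytes:
    """Serialize entries as one JSON object per line and gzip the result."""
    lines = "\n".join(json.dumps(entry) for entry in entries)
    return gzip.compress(lines.encode("utf-8"))
```

For a timestamp of 2024-12-22 10:15 the operation log key is vault-logs/2024/12/22/10/operation_logs.json.gz.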

Metrics Collection

1. Operational Metrics

Real-time metrics tracking:

Request Metrics
Replication Metrics
Client Metrics

2. Migration Progress Metrics

Tracking migration status:

Percentage of secrets in new backend
Replication success rate
Access patterns
Usage distribution

Monitoring Dashboards

1. CloudWatch Dashboards

Operational monitoring includes:

API Performance
Replication Status
System Health

2. QuickSight Analytics

QuickSight lets the business track the migration process and explore usage data for each secret. Metrics include:

Migration Overview: progress tracking, success rates, error patterns, timeline projections
Access Analysis: client usage patterns, application behavior, secret popularity, access frequency
Performance Analysis: response time trends, error rate patterns, backend comparison, resource utilization

Alerting Infrastructure

1. Operational Alerts

Immediate alerting for: high error rates, latency spikes, replication failures, system availability

Alert thresholds (examples):

Error Rate: > 5% over 5 minutes
Latency: > 500ms p95 over 5 minutes
Replication: < 95% success rate
System: Any component unavailable
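
To make the example thresholds concrete, the sketch below evaluates them against a set of measurements that are assumed to have been aggregated already over the relevant window. In practice CloudWatch alarms would normally perform this evaluation; the function and alert names here are hypothetical.

```python
def evaluate_alerts(
    error_rate: float,
    p95_latency_ms: float,
    replication_success_rate: float,
    components_up: dict,
) -> list:
    """Return the names of the alerts that the measurements trigger."""
    alerts = []
    if error_rate > 0.05:                 # more than 5% errors
        alerts.append("error_rate")
    if p95_latency_ms > 500:              # p95 latency above 500 ms
        alerts.append("latency")
    if replication_success_rate < 0.95:   # below 95% replication success
        alerts.append("replication")
    down = sorted(name for name, up in components_up.items() if not up)
    if down:                              # any component unavailable
        alerts.append("system:" + ",".join(down))
    return alerts
```

Rates are expressed as fractions between 0 and 1, so 0.05 corresponds to the 5% threshold listed above.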

2. Migration Alerts

Migration-specific monitoring: replication lag, consistency issues, usage patterns, progress metrics

Service Operational Procedures

1. Monitoring Response

Defined procedures for:

Alert investigation
Error remediation
Performance issues
System recovery

2. Maintenance

Regular maintenance includes:

Log rotation
Metric cleanup
Dashboard updates
Report generation

Service Business Features

Scalability
- API Gateway automatically scales to handle varying loads
- Lambda concurrency handles multiple simultaneous requests
- DynamoDB auto-scaling for access logs
Security
- IAM roles for fine-grained access control
- Optional request authentication at API Gateway
- SSL/TLS termination at API Gateway
- Token hashing for secure correlation
Monitoring
- CloudWatch metrics for API Gateway and Lambda
- X-Ray tracing for request analysis
- CloudWatch Logs for detailed Lambda logs
- DynamoDB streams for log processing
High Availability
- Multi-AZ deployment through API Gateway
- Lambda automatic retries
- DynamoDB global tables option for multi-region setup

Phase 2: Secret Replication Process

Secret replication is tightly integrated with the proxy and captures secrets during both read and write operations:

Write Operations (POST/PUT)
- Triggered when a secret is created or updated in Vault
- Runs after a successful Vault write
- Completes before the response is returned to the client
Read Operations (GET)
- Triggered when a secret is read from Vault
- Runs after the response is received from Vault
- Completes before the response is returned to the client

Phase 3: Analyzing Usage Patterns

Flag rarely accessed secrets for review
Consider a notification mechanism that alerts when a secret approaches its inactivity threshold
Prepare a detailed procedure for retiring unused secrets
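
The first step could start from something as simple as the sketch below, which flags secrets that have not been accessed for a given number of days. The input mapping of secret path to last access time is assumed to have been derived from the collected access logs, and the 90-day default is an arbitrary example value.

```python
from datetime import datetime, timedelta


def find_stale_secrets(
    last_access: dict, now: datetime, max_idle_days: int = 90
) -> list:
    """Return the paths whose last access is older than the idle limit."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(path for path, ts in last_access.items() if ts < cutoff)
```

The resulting list is a starting point for a human review, not an automatic deletion list.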

Migration Checklist

Deploy the monitoring proxy service
Collect usage data for at least one month
Configure AWS KMS infrastructure
Implement secret replication
Analyze usage patterns
Plan the migration schedule based on the usage patterns
Verify that the new system can retrieve secrets
Transition applications gradually
Decommission unused secrets
Plan Vault decommissioning

Conclusion

This migration approach is built on systematic steps that preserve operational integrity. For the procedure to work well, detailed usage data has to be collected first; only then can that information be used to devise the right migration strategy.