Managing Terraform State while using Azure PostgreSQL Flexible Server Point-in-Time Recovery
Addressing the complexities of state management during and after live DR scenarios or required production tests when using Azure PSQL Flexi Server
Overview
A growing trend in public cloud is leveraging PaaS to reduce operational overhead; one tradeoff, however, is losing granular control. This can present acute issues when dealing with resiliency mechanisms that involve dynamically changing configuration, where IT professionals are used to working directly with the core services rather than through an abstraction layer. One such PaaS offering is Azure Database for PostgreSQL Flexible Server, which has become increasingly popular because it provides a robust, scalable database service with key features such as high availability, automated patching of the underlying OS and hardware, and automated backups, while still allowing some degree of management and customization flexibility. This article focuses on Azure PSQL Flexi Server's point-in-time recovery (PITR) feature, which allows recovery to a specific timestamp and is useful for scenarios such as accidental deletion, data corruption, or cloning (see here for more information), and on how to manage the unique challenges this feature presents when one's environment is managed with Infrastructure as Code (IaC) tools such as Terraform.
Terraform, widely adopted for its declarative infrastructure management, works by using state to track resource configurations. While this state-based approach ensures consistency, it can complicate management when leveraging the PITR feature for Azure PSQL Flexi Server. For organizations operating governed and controlled cloud infrastructure environments, where direct state manipulation tends to be avoided, there are technical hurdles to getting PITR operationalized.
This article explores these challenges and presents a comprehensive, Terraform-driven solution for managing state while using PITR with Azure PostgreSQL Flexible Server. The solution is designed to address the complexities of Terraform state management and comply with immutable infrastructure principles, enabling a seamless, automated PITR process.
Problem Statement
Azure PostgreSQL Flexi Server provides PITR functionality to recover databases to a specific point in time, which is critical for addressing accidental data loss, corruption, or testing disaster recovery strategies. However, the PITR feature introduces several limitations and constraints:
Azure-Specific PITR Behavior:
- PITR always creates a new server, requiring a unique server name. In-place recovery of the original server is not currently supported.
- Restored servers inherit attributes such as pricing tier, compute generation, backup retention, and networking configurations from the original server.
- High Availability (HA) is not enabled during the restore process, even if the original server was configured with HA.
- Networking settings are immutable during restore — transitioning between private and public access modes is not supported.
These constraints are workable from an Azure perspective, but introduce challenges for Terraform users operating within a highly governed environment following an immutable, deployment stamp model. See below for more information:
The Deployment Stamp Model (Immutable Infrastructure):
Organizations following immutable infrastructure practices enforce strict controls on Terraform code. In a deployment stamp model:
- Terraform code remains consistent across environments (e.g., dev, test, prod) and is not modified between deployments (while technically one can use hotfixes to make updates, this would still require creating branches, getting approvals, and eventually cleaning up after the fact).
- Ideally all changes, including DR tests, would be driven through input variables rather than updates to the Terraform configuration itself (since the TF code cannot be easily updated in-place).
- Manual state interventions are often restricted, particularly in regulated environments or when using tools like Terraform Enterprise.
Terraform State Management Challenges within the Immutable Model:
For the Azure PSQL Terraform resource (see here for details), changing the create mode triggers a re-create, which is not desired behavior for a DR test or even a live DR scenario. Additionally, since Azure does not support in-place restores, this could break the deployment.
If a PITR is performed outside of Terraform, directly in the Azure environment, the new server would need to be imported after the fact in order to be managed via Terraform. This extends even further to associated resources such as firewall rules or databases, already provisioned on the restored server, which cause Terraform state conflicts unless explicitly imported.
Terraform's state-driven architecture means restored servers and their dependencies must be manually reconciled, often through direct state modifications — a process incompatible with immutable infrastructure principles.
Solution
To address these challenges, I developed a Terraform-based solution that leverages variables and dynamic resource management to streamline PITR for Azure PostgreSQL Flexible Server. The solution adheres to the principles of immutable infrastructure and avoids direct state manipulation, making it suitable for highly regulated, governed infrastructure environments.
Design Overview
Before getting into the design itself, it’s helpful to see the base deployment which is a PSQL flexible server and associated resources such as firewall rules, server parameters, databases, and private networking configuration.
resource "azurerm_postgresql_flexible_server" "psql_flexible" {
  name                          = var.psql_server_name
  resource_group_name           = azurerm_resource_group.psql_test_resource_group.name
  location                      = azurerm_resource_group.psql_test_resource_group.location
  delegated_subnet_id           = azurerm_subnet.psql_subnet.id
  private_dns_zone_id           = azurerm_private_dns_zone.psql_private_dns.id
  public_network_access_enabled = false # Disable public access
  version                       = "16"  # Latest stable version
  zone                          = "2"
  administrator_login           = var.psql_admin_username
  administrator_password        = var.psql_admin_password
  sku_name                      = "GP_Standard_D2ds_v4" # Smallest SKU
  storage_mb                    = 32768 # 32 GB for lowest cost
  storage_tier                  = "P4"
  auto_grow_enabled             = true  # Enable auto-grow for storage
  backup_retention_days         = 7     # Min backup retention for testing
  geo_redundant_backup_enabled  = false # Disable for cost

  high_availability {
    mode                      = "ZoneRedundant" # Enable zone-redundant high availability
    standby_availability_zone = 1
  }

  depends_on = [
    azurerm_virtual_network.vnet, # Ensure VNet is created
    azurerm_subnet.psql_subnet    # Ensure Subnet is created
  ]

  tags = {
    environment = "development"
  }
}

# Firewall Rule
resource "azurerm_postgresql_flexible_server_firewall_rule" "allow_current_ip" {
  name             = var.allow_current_ip_firewall_rule_name
  server_id        = azurerm_postgresql_flexible_server.psql_flexible.id
  start_ip_address = "136.226.60.93"
  end_ip_address   = "136.226.60.93"
}

# Custom Server Parameter (log_retention_days)
resource "azurerm_postgresql_flexible_server_configuration" "custom_param" {
  name      = "logfiles.retention_days"
  value     = "6"
  server_id = azurerm_postgresql_flexible_server.psql_flexible.id
}

# Additional Database
resource "azurerm_postgresql_flexible_server_database" "additional_db" {
  name      = var.test_db_name
  server_id = azurerm_postgresql_flexible_server.psql_flexible.id
  collation = "en_US.utf8"
  charset   = "utf8"
}

resource "azurerm_private_dns_zone" "psql_private_dns" {
  name                = "acitera-test.postgres.database.azure.com" # Custom readable DNS format
  resource_group_name = azurerm_resource_group.psql_test_resource_group.name

  depends_on = [
    azurerm_virtual_network.vnet # Ensure VNet is created first
  ]
}

resource "azurerm_private_dns_zone_virtual_network_link" "psql_dns_vnet_link" {
  name                  = "psql-vnet-link"
  resource_group_name   = azurerm_resource_group.psql_test_resource_group.name
  private_dns_zone_name = azurerm_private_dns_zone.psql_private_dns.name
  virtual_network_id    = azurerm_virtual_network.vnet.id

  depends_on = [
    azurerm_virtual_network.vnet # Ensure VNet is created first
  ]
}
For the purposes of the design overview this is the server that is being backed up and requires a PITR to be performed. Given the aforementioned constraints, this requires a separate resource to be created to perform the restore; however, since we don’t always want it to exist the solution leverages dynamic and conditional resource creation as well as import blocks.
The solution is primarily centered around three TF variables that have been defined:
- restore_mode: Toggles between regular server management and performing a PITR
- post_restore_mode: Enables post-restore configurations such as HA and other inherited settings for the restored server
- restore_timestamp: Provides the specific recovery point, e.g., "2024-12-04T22:42:42.8258553Z"
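The article does not show the variable declarations themselves; a minimal variables.tf sketch consistent with how they are used below might look like this (types and defaults are my assumptions, not taken from the original solution):

```hcl
# variables.tf -- declarations assumed from how the variables are used in this article
variable "restore_mode" {
  description = "Toggles between regular server management and performing a PITR"
  type        = bool
  default     = false
}

variable "post_restore_mode" {
  description = "Enables post-restore configuration (HA and inherited settings) on the restored server"
  type        = bool
  default     = false
}

variable "restore_timestamp" {
  description = "Recovery point in UTC, e.g. 2024-12-04T22:42:42.8258553Z"
  type        = string
  default     = null
}
```

Defaulting both flags to false keeps the regular, non-restore deployment as the baseline behavior.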
The first two variables act almost like feature flags, dynamically and conditionally controlling PITR-related resource creation and resource imports. Furthermore, since HA cannot be enabled on the restored server during the restore itself, the solution requires the operator to run terraform apply twice: first to restore the server, and a second time to apply HA and configure related resources. Note also that the restored server is defined in Terraform to carry forward the same configuration as the original server, although the code can be modified to behave differently if desired. The code is shown below:
resource "azurerm_postgresql_flexible_server" "psql_restore" {
  count               = var.restore_mode ? 1 : 0
  name                = var.restored_psql_server_name
  resource_group_name = azurerm_resource_group.psql_test_resource_group.name
  location            = azurerm_resource_group.psql_test_resource_group.location

  # Setting the mode to restore from point in time
  create_mode = var.post_restore_mode ? "Default" : "PointInTimeRestore"

  # Reference the original server
  source_server_id = var.post_restore_mode ? null : azurerm_postgresql_flexible_server.psql_flexible.id

  # The timestamp (UTC) to restore to
  point_in_time_restore_time_in_utc = var.restore_timestamp

  # Existing subnet and DNS
  delegated_subnet_id = azurerm_subnet.psql_subnet.id
  private_dns_zone_id = azurerm_private_dns_zone.psql_private_dns.id

  backup_retention_days = var.post_restore_mode ? azurerm_postgresql_flexible_server.psql_flexible.backup_retention_days : null

  # PostgreSQL version and other important attributes carried over
  version                = var.post_restore_mode ? azurerm_postgresql_flexible_server.psql_flexible.version : null
  administrator_login    = var.post_restore_mode ? var.psql_admin_username : null
  administrator_password = var.post_restore_mode ? var.psql_admin_password : null

  # Storage and performance settings carried over
  storage_mb   = var.post_restore_mode ? azurerm_postgresql_flexible_server.psql_flexible.storage_mb : null
  storage_tier = var.post_restore_mode ? azurerm_postgresql_flexible_server.psql_flexible.storage_tier : null
  sku_name     = var.post_restore_mode ? azurerm_postgresql_flexible_server.psql_flexible.sku_name : null

  auto_grow_enabled = true # Enable auto-grow for storage

  # Disable public network access as in the original
  public_network_access_enabled = false
  zone                          = "2"

  # Include HA configuration only in post-restore mode
  dynamic "high_availability" {
    for_each = var.post_restore_mode ? [1] : []
    content {
      mode                      = "ZoneRedundant" # Enable zone-redundant high availability
      standby_availability_zone = 1
    }
  }

  depends_on = [
    azurerm_virtual_network.vnet, # Ensure VNet is created
    azurerm_subnet.psql_subnet    # Ensure Subnet is created
  ]

  tags = {
    environment = "development"
  }

  lifecycle {
    ignore_changes = [create_mode, source_server_id, point_in_time_restore_time_in_utc]
  }
}

import {
  for_each = var.post_restore_mode ? [1] : []
  to       = azurerm_postgresql_flexible_server_firewall_rule.allow_current_ip_restored[0]
  id       = "/subscriptions/${data.azurerm_subscription.primary.subscription_id}/resourceGroups/${azurerm_resource_group.psql_test_resource_group.name}/providers/Microsoft.DBforPostgreSQL/flexibleServers/${var.restored_psql_server_name}/firewallRules/${var.allow_current_ip_firewall_rule_name}"
}

# Firewall Rule with Dynamic IP
resource "azurerm_postgresql_flexible_server_firewall_rule" "allow_current_ip_restored" {
  count            = var.post_restore_mode ? 1 : 0
  name             = var.allow_current_ip_firewall_rule_name
  server_id        = azurerm_postgresql_flexible_server.psql_restore[0].id
  start_ip_address = "136.226.60.93"
  end_ip_address   = "136.226.60.93"

  depends_on = [
    azurerm_postgresql_flexible_server.psql_restore
  ]
}

import {
  for_each = var.post_restore_mode ? [1] : []
  to       = azurerm_postgresql_flexible_server_configuration.custom_param_restored[0]
  id       = "/subscriptions/${data.azurerm_subscription.primary.subscription_id}/resourceGroups/${azurerm_resource_group.psql_test_resource_group.name}/providers/Microsoft.DBforPostgreSQL/flexibleServers/${var.restored_psql_server_name}/configurations/logfiles.retention_days"
}

# Custom Server Parameter (log_retention_days)
resource "azurerm_postgresql_flexible_server_configuration" "custom_param_restored" {
  count     = var.post_restore_mode ? 1 : 0
  name      = "logfiles.retention_days"
  value     = "6"
  server_id = azurerm_postgresql_flexible_server.psql_restore[0].id

  depends_on = [
    azurerm_postgresql_flexible_server.psql_restore
  ]
}

import {
  for_each = var.post_restore_mode ? [1] : []
  to       = azurerm_postgresql_flexible_server_database.additional_db_restored[0]
  id       = "/subscriptions/${data.azurerm_subscription.primary.subscription_id}/resourceGroups/${azurerm_resource_group.psql_test_resource_group.name}/providers/Microsoft.DBforPostgreSQL/flexibleServers/${var.restored_psql_server_name}/databases/${var.test_db_name}"
}

# Additional Database
resource "azurerm_postgresql_flexible_server_database" "additional_db_restored" {
  count     = var.post_restore_mode ? 1 : 0
  name      = var.test_db_name
  server_id = azurerm_postgresql_flexible_server.psql_restore[0].id
  collation = "en_US.utf8"
  charset   = "utf8"

  depends_on = [
    azurerm_postgresql_flexible_server.psql_restore
  ]
}
Key Terraform Code Elements
- Use of the count meta-argument to conditionally create resources. The expression var.restore_mode ? 1 : 0 is a conditional (ternary) expression that evaluates var.restore_mode: if true, count is 1 and the resource is created; if false, count is 0 and the resource is not created.
- Use of the dynamic block, which allows Terraform to programmatically generate nested blocks within a resource. It is used when the inclusion of certain configuration is conditional or depends on iterating over a collection (such as needing to enable HA after the initial PSQL Flexi Server is restored).
- Use of the import block, introduced in Terraform 1.5 (with for_each support for import blocks arriving in Terraform 1.7). It imports an external resource into Terraform's state directly from the configuration, without needing to run the terraform import command.
- Use of the for_each construct, which allows Terraform to iterate over a collection and create one instance for each element. In this solution, for_each evaluates var.post_restore_mode and generates the block when the expression yields the non-empty list [1], or skips it when the expression yields the empty list [].
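One consequence of count-based conditional resources is that references to them elsewhere in the configuration must handle the zero-instance case. A small sketch of how this could be done (the output name here is my own illustration, not part of the original solution):

```hcl
# Returns the restored server's FQDN when the restore instance exists,
# or null when count is 0. one() (Terraform >= 0.15) collapses the
# 0- or 1-element splat list into a single value or null.
output "restored_server_fqdn" {
  value = one(azurerm_postgresql_flexible_server.psql_restore[*].fqdn)
}
```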
The above elements allow the two feature-flag variables to control the PITR process and its state impact without having to modify the Terraform code or state directly.
Steps to Use the Solution
(1) First Terraform Apply — Initiate a PITR
• Set restore_mode to true
• Set restore_timestamp to the desired value, such as 2024-12-04T22:42:42.82585553Z
• Run terraform apply to create the server from a restore point
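Expressed as a terraform.tfvars fragment (the file name and layout are illustrative), the first apply might be driven by:

```hcl
# terraform.tfvars -- phase 1: perform the restore
restore_mode      = true
post_restore_mode = false
restore_timestamp = "2024-12-04T22:42:42.82585553Z"
```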
(2) Second Terraform Apply — Configure the Restored Server
• Keep restore_mode set to true
• Set post_restore_mode to true
• Run terraform apply to configure the restored server and import associated resources
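The second apply then only flips the post-restore flag (again an illustrative tfvars fragment):

```hcl
# terraform.tfvars -- phase 2: enable HA and import associated resources
restore_mode      = true
post_restore_mode = true
restore_timestamp = "2024-12-04T22:42:42.82585553Z"
```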
At this point the original server and the restored server are both created, configured, and under Terraform management, without requiring any direct TF code edits beyond modifying input variables. Once the DR test is complete or the data has been recovered from the restored server, one can clean up the environment by setting both restore_mode and post_restore_mode back to false, followed by a terraform apply to remove the restored resources.
Closing Remarks
This solution demonstrates how Terraform can be leveraged to fully automate Azure PostgreSQL Flexible Server PITR while adhering to immutable infrastructure principles. By relying on variable-driven logic and dynamic resource management, the solution avoids the pitfalls of manual state modifications and enables a seamless DR testing process.
While alternatives such as manual state imports or ad-hoc Terraform code changes may be viable in some scenarios, this approach is particularly suited for organizations with strict regulatory or TF operational constraints. By integrating PITR into a deployment stamp model, it ensures consistency, compliance, and ease of management across environments.
Thank you for taking the time to read this deep dive. I’d love to hear your thoughts, feedback, or any challenges you’ve faced in similar scenarios — feel free to share your insights or ask questions in the comments or reach out directly. Let’s keep the conversation going!
The views expressed in this post are my own and do not necessarily reflect the official policies, positions, or views of the global EY organization or its member firms.