Terraform Data Sources

Terraform data sources let a configuration read information that is defined outside of it, which makes infrastructure management more dynamic and resilient. Declared with the data keyword, they retrieve details from cloud APIs, other Terraform states, local files, or HTTP endpoints.

The "Read-Only" Principle & Key Benefits

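Data sources are strictly read-only: they fetch information without creating, updating, or deleting external objects. When a lookup cannot find what it asks for, most data sources return an error, so a missing dependency surfaces during planning rather than midway through an apply. Key benefits include: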

  • Dynamic Configurations: Adapt to changing information (e.g., latest AMI IDs) without hardcoding.
  • Modularity: Modules become more self-sufficient by discovering environmental data.
  • External Data Integration: Standardized way to use data from various systems.
  • Error Prevention: Validate external data existence during terraform plan.

Data Sources vs. Managed Resources

Resource blocks define infrastructure that Terraform manages through its full lifecycle (create, read, update, delete), while data blocks provide read-only information used to configure those resources. An object should be either managed by a resource or referenced by a data block, not both in the same configuration. Despite its name, terraform_data is a managed resource rather than a data source: it stores arbitrary values in state and is not meant for querying external information.
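
To make the terraform_data distinction concrete, here is a minimal sketch, assuming Terraform 1.4 or later (where terraform_data is available). The variable name app_revision is hypothetical. Note that the block is declared with resource, so Terraform tracks it in state like any other managed object.

```hcl
# Hypothetical input, used only for illustration
variable "app_revision" {
  type    = string
  default = "1"
}

# A managed resource (not a data source): it stores the input value in state
resource "terraform_data" "revision" {
  input = var.app_revision
}

# The stored value is exposed through the resource's output attribute
output "stored_revision" {
  value = terraform_data.revision.output
}
```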

How Data Sources Function: Syntax & Lifecycle

data "<PROVIDER_NAME>_<DATA_SOURCE_TYPE>" "<LOCAL_NAME>" {
  # Configuration arguments (filters/identifiers)
  [argument_name = expression]
  ...
}
  • <PROVIDER_NAME>_<DATA_SOURCE_TYPE>: Specifies the data source type (e.g., aws_ami).
  • <LOCAL_NAME>: A name local to the module (e.g., latest_ubuntu), used to reference results as data.aws_ami.latest_ubuntu.id.
  • Configuration Block ({...}): Arguments for the provider to fetch specific data.

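Providers use these arguments to query external systems. Terraform normally reads data sources during the plan phase, so their results are available when the plan is computed. If an argument depends on a value that is not known until apply, the read is deferred to the apply phase. Dependencies are inferred from references, and depends_on allows explicit ordering when no reference exists.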
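
The difference between inferred and explicit dependencies can be sketched as follows. This is an illustration under stated assumptions: aws_autoscaling_group.web is assumed to be declared elsewhere in the same configuration, and the Role tag is hypothetical.

```hcl
# Assumption: aws_autoscaling_group.web is declared elsewhere in this configuration.

# Implicit dependency: referencing the group's name tells Terraform to read
# this data source only after the group has been created or updated.
data "aws_instances" "web_implicit" {
  instance_tags = {
    "aws:autoscaling:groupName" = aws_autoscaling_group.web.name
  }
}

# Explicit dependency: no argument references the group, so depends_on
# states the ordering that Terraform cannot infer on its own.
data "aws_instances" "web_explicit" {
  instance_tags = {
    Role = "web" # hypothetical tag propagated by the group
  }

  depends_on = [aws_autoscaling_group.web]
}
```

Prefer the implicit form where possible. With depends_on, Terraform defers the read to the apply phase whenever the dependency has pending changes, which leaves more values unknown in the plan.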

Practical Applications and Examples

1. Fetching Existing Infrastructure Details

Example: Get existing AWS VPC details

data "aws_vpc" "selected_vpc" {
  id = var.target_vpc_id // Assumes var.target_vpc_id is defined
}
resource "aws_subnet" "new_subnet" {
  vpc_id = data.aws_vpc.selected_vpc.id
  # ...
}

2. Dynamic Configuration Inputs

Example: Use the latest Amazon Linux 2 AMI

data "aws_ami" "latest_amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  # HCL does not allow ";" separators, so each argument goes on its own line
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}
resource "aws_instance" "app_server" {
  ami = data.aws_ami.latest_amazon_linux.id
  # ...
}

3. Cross-Configuration Data Sharing

Use terraform_remote_state to access outputs from another state.

Example: terraform_remote_state (HCP Terraform)

data "terraform_remote_state" "network" {
  backend = "remote"
  config = {
    organization = "my-org"
    workspaces = {
      name = "prod-network"
    }
  }
}
# Use data.terraform_remote_state.network.outputs.some_output
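
A sketch of consuming such an output, building on the data.terraform_remote_state.network block above. The output name private_subnet_id and the variable worker_ami_id are assumptions for illustration; the other workspace must actually declare that output at its root module.

```hcl
# Hypothetical input for the AMI to launch
variable "worker_ami_id" {
  type = string
}

# Assumption: the "prod-network" workspace declares a root-level output
# named "private_subnet_id".
resource "aws_instance" "worker" {
  ami           = var.worker_ami_id
  instance_type = "t3.micro"
  subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_id
}
```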

4. Local and HTTP Data Sources

Example: http (fetch public IP)

data "http" "my_public_ip" { url = "https://api.ipify.org?format=json" }
# Use jsondecode(data.http.my_public_ip.response_body).ip
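
One possible use of the decoded address, building on the data.http.my_public_ip block above: restricting SSH ingress to the caller's IP. This sketch assumes version 3.0 or later of the hashicorp/http provider (which exports response_body); the variable vpc_id and the security group name are hypothetical.

```hcl
# Hypothetical input: the VPC that will hold the security group
variable "vpc_id" {
  type = string
}

locals {
  # Build a /32 CIDR from the address returned by the endpoint
  my_ip_cidr = "${jsondecode(data.http.my_public_ip.response_body).ip}/32"
}

resource "aws_security_group" "ssh_from_me" {
  name   = "ssh-from-my-ip" # hypothetical name
  vpc_id = var.vpc_id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [local.my_ip_cidr]
  }
}
```

Because the lookup runs on every plan, the rule follows whichever machine runs Terraform, which may not be desirable in CI.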

Example: local_file (read SSH key)

data "local_file" "ssh_key" { filename = pathexpand("~/.ssh/id_ed25519.pub") } # Terraform does not expand "~" on its own
# Use data.local_file.ssh_key.content
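
As a sketch of where that content might go, the key read by data.local_file.ssh_key above can be registered as an EC2 key pair. The key_name value is hypothetical.

```hcl
resource "aws_key_pair" "deployer" {
  key_name = "deployer-key" # hypothetical name

  # trimspace drops the trailing newline that key files usually end with
  public_key = trimspace(data.local_file.ssh_key.content)
}
```

For a file that already exists before Terraform runs, the built-in file() function is an alternative that avoids the extra provider.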

Quick Reference Table: Common Data Sources

Data Source Type        Provider(s)       Common Use Case
----------------------  ----------------  --------------------------------------------
aws_ami                 AWS               Find latest/specific AMI ID.
aws_vpc                 AWS               Get existing VPC details.
terraform_remote_state  (Terraform Core)  Access outputs from another Terraform state.
http                    hashicorp/http    Fetch data from an HTTP endpoint.
local_file              hashicorp/local   Read content from a local file.

Arguments, Attributes, and Filtering

Data sources take arguments that define the query and export attributes that hold the results. The meta-arguments provider, depends_on, count, and for_each work as they do for resources. In a data block, lifecycle is used for precondition and postcondition checks. Provider-specific arguments act as filters: in the AWS provider, for example, multiple filter blocks are ANDed together, while multiple values within one filter are ORed.
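
The filter semantics and the for_each meta-argument can be sketched together. The variable vpc_id, the availability zone names, and the Tier tag are assumptions chosen for illustration.

```hcl
# Hypothetical input: the VPC to search
variable "vpc_id" {
  type = string
}

data "aws_subnets" "private_in_two_azs" {
  # Separate filter blocks are combined with AND
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }

  filter {
    name   = "availability-zone"
    values = ["us-east-1a", "us-east-1b"] # either value may match (OR)
  }

  tags = {
    Tier = "private" # hypothetical tag
  }
}

# for_each on a data source: one lookup per subnet ID found above
data "aws_subnet" "details" {
  for_each = toset(data.aws_subnets.private_in_two_azs.ids)
  id       = each.value
}

# Exported attributes are read like resource attributes
output "subnet_cidrs" {
  value = { for id, s in data.aws_subnet.details : id => s.cidr_block }
}
```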

Error Handling and Validation

  • ignore_errors: Not a Terraform meta-argument; it exists only where a provider's data source defines it, which is rare. Use it cautiously, as it can mask real issues.
  • Custom Conditions (precondition/postcondition): Preferred method. precondition validates inputs before reading; postcondition validates returned data, halting with a custom error if a condition fails.

Example: postcondition for aws_ami

data "aws_ami" "validated_app_ami" {
  # ... arguments ...
  lifecycle {
    postcondition {
      condition     = lookup(self.tags, "Validated", "") == "true" # lookup avoids an index error when the tag is absent
      error_message = "AMI must have 'Validated:true' tag."
    }
  }
}
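
A precondition can sit next to a postcondition in the same lifecycle block. The following is a sketch, assuming Terraform 1.2 or later; the variable ami_name_prefix and the three-character rule are hypothetical.

```hcl
# Hypothetical input: the name prefix of AMIs built by your own account
variable "ami_name_prefix" {
  type = string
}

data "aws_ami" "by_prefix" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["${var.ami_name_prefix}-*"]
  }

  lifecycle {
    # Checked before the read: rejects inputs that would match too broadly
    precondition {
      condition     = length(var.ami_name_prefix) >= 3
      error_message = "ami_name_prefix must be at least 3 characters long."
    }

    # Checked after the read: validates what the provider returned
    postcondition {
      condition     = self.architecture == "x86_64"
      error_message = "The selected AMI must use the x86_64 architecture."
    }
  }
}
```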

Security Considerations

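  • terraform_remote_state: Exposes only root module outputs, but the caller needs read access to the entire state snapshot, which often contains sensitive data.
  • tfe_outputs (HCP Terraform/Enterprise): More secure; retrieves only the workspace's defined outputs and does not require access to the full state.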
  • Best Practices: Avoid exposing sensitive data in outputs; secure underlying systems; manage credentials securely (env vars, IAM roles, Vault); encrypt state files.
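
A minimal sketch of the tfe_outputs alternative, assuming the hashicorp/tfe provider is configured with credentials for the organization. The organization and workspace names mirror the earlier remote state example; the output name vpc_id and the CIDR are hypothetical.

```hcl
data "tfe_outputs" "network" {
  organization = "my-org"
  workspace    = "prod-network"
}

# Assumption: the "prod-network" workspace declares an output named "vpc_id".
# The provider marks everything under "values" as sensitive, so Terraform
# redacts it in plan output.
resource "aws_subnet" "app" {
  vpc_id     = data.tfe_outputs.network.values.vpc_id
  cidr_block = "10.0.10.0/24" # hypothetical CIDR
}
```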

Advanced Topics and Best Practices

  • Dynamic Blocks: Construct repeatable nested blocks (like filter) programmatically.
  • Performance: Data sources are re-read on every plan, and each read typically costs at least one API call, which adds to plan time. Minimize lookups, use specific filters, and centralize common lookups.
  • Scaling in Enterprises: Managing data sources at scale involves ensuring consistency, security, performance, and governance. Platforms like Scalr can help by offering centralized variable/output management, RBAC, policy enforcement (OPA), and optimized execution environments, enhancing an organization's ability to use data sources safely and efficiently.
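
The dynamic block and centralized lookup points can be sketched together. The variable ami_filters and its defaults are assumptions for illustration.

```hcl
# Hypothetical input: filter name => accepted values
variable "ami_filters" {
  type = map(list(string))
  default = {
    "name"                = ["amzn2-ami-hvm-*-x86_64-gp2"]
    "virtualization-type" = ["hvm"]
  }
}

data "aws_ami" "selected" {
  most_recent = true
  owners      = ["amazon"]

  # One filter block is generated per map entry
  dynamic "filter" {
    for_each = var.ami_filters
    content {
      name   = filter.key
      values = filter.value
    }
  }
}

locals {
  # Single lookup, referenced everywhere else as local.ami_id
  ami_id = data.aws_ami.selected.id
}
```

Routing references through local.ami_id keeps the lookup in one place, so the query only needs to change once.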

Conclusion and Recommendations

Terraform data sources are vital for dynamic IaC.

  1. Choose Appropriately: Understand data source types and filtering.
  2. Secure Sharing: Prefer tfe_outputs over terraform_remote_state in HCP Terraform/Enterprise.
  3. Validate Robustly: Use precondition and postcondition.
  4. Manage Performance: Minimize lookups and monitor API usage.
  5. Consult Documentation: Stay updated on provider specifics.
  6. Consider Management Platforms for Scale: Evaluate IaC platforms for enterprise-wide consistency, security, and governance.

Following these recommendations helps leverage data sources for sophisticated, secure, and maintainable infrastructure automation.