Managing 200+ resources across Morocco, Senegal, and Côte d'Ivoire required more than Terraform modules. We built a custom provider, a drift detection system, and a deployment pipeline that validates before it applies.

Harch Intelligence's infrastructure spans 200+ resources across three countries: Morocco (Dakhla, Tangier, Casablanca), Senegal (Dakar), and Côte d'Ivoire (Abidjan). These resources include GPU clusters, network fabric, storage arrays, cooling systems, power distribution units, and monitoring infrastructure — each with a configuration surface that is more complex than a typical cloud VM. Managing this infrastructure manually was never an option. We adopted Terraform as our infrastructure-as-code platform in early 2025, and within four months we had hit the limits of what off-the-shelf Terraform could do for our use case. This article describes those limits, the custom tooling we built to overcome them, and the deployment pipeline that ensures every infrastructure change is validated before it reaches production.
The first limitation was provider support. Terraform has excellent providers for AWS, GCP, and Azure, but no provider exists for our GPU cluster management API, our custom cooling control system, or our sovereign data residency enforcement engine. We needed to manage these resources through Terraform, which meant writing a custom provider. The HarchOS Terraform Provider, written in Go, exposes 23 resource types and 12 data sources that map to HarchOS API endpoints. Resource types include harchos_gpu_cluster (manages a GPU cluster's configuration, including topology, cooling, and jurisdiction tags), harchos_pipeline (manages a SENSE-THINK-ACT pipeline definition), harchos_sovereignty_policy (manages data residency rules), and harchos_monitoring_alert (manages SENTINEL alerting rules). The provider implements full CRUD operations with plan-time validation — for example, a sovereignty policy that references a non-existent jurisdiction is caught during terraform plan, not during terraform apply. The provider is open source under the Apache 2.0 license and available on the Terraform Registry.
The second limitation was drift detection. Terraform's state file represents the desired state of infrastructure, but the actual state can diverge due to manual changes, hardware failures, or API inconsistencies. In a conventional cloud environment, drift is an inconvenience. In our environment, drift can violate data sovereignty — a GPU that is physically moved from a Moroccan hub to a Senegalese hub without a corresponding Terraform state update would cause the sovereignty policy to route data to the wrong jurisdiction. We built a drift detection system that runs every 15 minutes and compares the actual state of every resource (queried from the HarchOS API) against the Terraform state file. If any attribute differs, the system generates a drift report that includes the resource identifier, the attribute that changed, the expected value, and the actual value. Critical drifts — those affecting jurisdiction tags, GPU topology, or security policies — trigger an automatic alert to the on-call engineer and a suggestion to run terraform plan to visualize the remediation. Non-critical drifts (for example, a cooling setpoint that was adjusted manually) are logged for review during the next maintenance window. The drift detection system has caught 47 drifts in 9 months, of which 6 were jurisdiction-critical and would have resulted in data being routed to the wrong hub if left uncorrected.
The third limitation was deployment safety. A terraform apply that modifies GPU cluster configuration can disrupt running workloads if the change is applied naively — for example, reducing a cluster's GPU count while training jobs are running will cause those jobs to fail. We built a deployment pipeline that wraps terraform apply with three safety checks. The first check is a workload impact analysis: before applying any change that affects GPU clusters, the pipeline queries the scheduler for running workloads on the affected resources and estimates the disruption. If the disruption exceeds a configurable threshold (default: 5% of running workloads), the pipeline requires manual approval before proceeding. The second check is a sovereignty compliance validation: any change that modifies jurisdiction tags or sovereignty policies is validated against a rules engine that ensures no combination of changes would result in data being routed to a non-compliant jurisdiction. This is a static analysis check — it runs before any changes are applied, using the planned state from terraform plan. The third check is a canary deployment: infrastructure changes are applied to a single hub first, and the pipeline monitors the hub's health for 30 minutes before applying the same change to other hubs. If the canary hub shows degraded performance — increased error rates, higher latency, or scheduling failures — the change is automatically rolled back and the engineering team is alerted.
The pipeline is implemented as a GitHub Actions workflow with custom actions for each safety check. The workflow is triggered by a pull request to the infrastructure repository, which contains all Terraform code organized by environment (production, staging, development) and region (morocco, senegal, cote-divoire). A pull request that modifies production code requires approval from two infrastructure engineers and must pass all three safety checks before it can be merged. After merge, the deployment pipeline runs automatically, applying changes in the order defined by the dependency graph (network changes before compute changes, compute changes before pipeline changes, pipeline changes before policy changes). The average time from pull request to production deployment is 4.5 hours, of which approximately 2 hours is the canary monitoring period. This is slower than "terraform apply and hope," but the safety margin has prevented 12 production incidents in 9 months — incidents that would have affected running workloads or, worse, violated sovereignty constraints.
Our infrastructure-as-code journey is not complete. Three items are on the roadmap. First, we are migrating from Terraform's HCL to OpenTofu to avoid HashiCorp's license change and maintain an open-source toolchain. Second, we are building a custom policy-as-code engine using Open Policy Agent (OPA) that will replace the ad-hoc sovereignty compliance checks with a formal policy language. Third, we are implementing GitOps-style continuous reconciliation, where the drift detection system automatically generates pull requests to correct detected drifts, rather than relying on manual intervention. Each of these improvements addresses a limitation we encountered in production, not a theoretical concern. The best infrastructure-as-code practices are the ones you discover by operating at scale — and we are documenting every lesson along the way.