A Comprehensive Guide to MLOps Terraform: Infrastructure as Code (IaC)

GCP provisioning using Terraform

Senthil E
Better Programming


Photo by Andrea Leopardi on Unsplash

Introduction

Terraform is one of the most widely used tools for managing infrastructure and has become a popular choice for IaC. It is also cloud-agnostic, so you can work with AWS, GCP, Azure, OCI, and other providers. I started learning Terraform to use in my personal projects; it is simple and easy to learn. It eliminates repetitive manual provisioning work, and even for a small project it saves a lot of time. As an MLOps person, do you really need to know about IaC? I think it's worth learning. I will walk through the basics of Terraform and provision some of the GCP objects I used in my project.

In this article, let's see how we can use Terraform to build GCP objects in an ML project.

Contents:

  • What is Infrastructure as Code (IaC)
  • Advantages of IaC
  • Imperative vs. Declarative Approach
  • Major Players in IaC
  • Terraform
  • Provisioning GCP Objects Using Terraform

What is Infrastructure As Code (IaC)

According to Microsoft documentation

Infrastructure as Code (IaC) is the management of infrastructure (networks, virtual machines, load balancers, and connection topology) in a descriptive model, using the same versioning as DevOps team uses for source code. Like the principle that the same source code generates the same binary, an IaC model generates the same environment every time it is applied. IaC is a key DevOps practice and is used in conjunction with continuous delivery.

In simple terms, you provision the infrastructure by writing code instead of provisioning it manually.

Advantages of Using IaC

  • Speed
  • Cost reduction
  • Repeatability
  • Standardization
  • Scalability
  • Consistency
  • Automation (no manual intervention)
  • Reliability
  • Version management (GitOps)
  • Security
  • Documentation
  • Faster disaster recovery
  • CI/CD integration

Imperative vs. Declarative

There are two ways you can write the code:

  • Imperative Approach
  • Declarative Approach

Imperative Approach

  • The developer defines the exact steps to be carried out.
  • The system follows the instructions in the code.

Declarative Approach

  • The developer defines the required end state.
  • The platform or the system handles the steps needed to achieve the end state.

Check out this StackOverflow post and this article for detailed insights on the difference between Imperative and declarative approaches.

Terraform uses the declarative approach.

The major players in IaC are

  • Terraform
  • Ansible
  • AWS Cloud Formation
  • Google Cloud Deployment Manager
  • Azure Resource Manager
  • Puppet
  • SaltStack
  • Chef
Image credit Hashicorp Documentation

Terraform

According to Hashicorp documentation-

Terraform is an infrastructure as code (IaC) tool that allows you to build, change, and version infrastructure safely and efficiently. This includes low-level components such as compute instances, storage, and networking, as well as high-level components such as DNS entries, SaaS features, etc. Terraform can manage both existing service providers and custom in-house solutions.

Image credit Hashicorp Documentation
  • Terraform is an IaC tool.
  • Terraform has gained a lot of popularity recently and is used by many tech companies.
  • Terraform is written in Go.
  • Terraform is open source.
  • Terraform is a cloud-agnostic tool.
  • Terraform works with the providers' APIs. According to its documentation, Terraform can work with 1,700+ providers as of now.
  • Terraform uses a declarative approach.
  • Terraform uses a language called HashiCorp Configuration Language (HCL).
  • The developer creates the configuration file, which describes the required state.

For example, say you want 3 Kubernetes clusters and create the configuration file accordingly.

Terraform will then create the 3 Kubernetes clusters and also create a state file.

Now you want only 2 Kubernetes clusters. Change the config file to say you need only 2 GKE clusters.

Terraform compares the config file with the state file and deletes the 3rd Kubernetes cluster.

Now you want 4 Kubernetes clusters. Again, make the change to the config file. Terraform compares the state file with the config file and creates 2 more clusters.

So you just provide the end state through the configuration file, and Terraform figures out how to achieve it. This is the beauty of Terraform. It is similar to Kubernetes: you define the number of replicas (how many pods you want to run), and Kubernetes handles the rest on its own.
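The desired-state idea can be sketched in HCL (the cluster name, location, and count here are hypothetical, not from the original project):

```hcl
# Hypothetical sketch: the desired end state is three GKE clusters.
# Changing count to 2 or 4 and re-running `terraform apply` makes
# Terraform delete or add clusters to match the configuration.
resource "google_container_cluster" "ml_cluster" {
  count    = 3
  name     = "ml-cluster-${count.index}"
  location = "us-central1"

  initial_node_count = 1
}
```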

The other important part of Terraform is the provider. Terraform is a cloud-agnostic tool: you can create infrastructure in any cloud provider, such as AWS, GCP, Azure, or Oracle Cloud.

Image credit Hashicorp - Terraform Documentation

As of this writing, Terraform has 1,781 providers.

  • One of the most important features of Terraform is that you can create infrastructure across multiple cloud providers with one configuration. For example, CloudFormation can only create resources in AWS, whereas with Terraform you can create resources in AWS, GCP, Azure, etc. from a single configuration.

To start learning Terraform, please refer to the following documentation and tutorials:

Provisioning GCP using Terraform

Let's use Terraform to provision the following

  • Create a Google Cloud Storage bucket.
  • Create a notebook instance for ML training.
  • Enable the required APIs.
  • Create service accounts for Vertex Training and Vertex Pipelines.
  • Create a GKE cluster to deploy an ML app.

Watch the HashiCorp training videos or the YouTube videos first and then work through the steps below; otherwise, they may be a little difficult to follow. Again, Terraform is one of the easiest tools to learn and implement.

  1. Login into the GCP console.
  2. Create a project. It's better to create a new project ID so that you can create all the resources in that project and then destroy them when you are done, avoiding any ongoing billing.
  3. Open the cloud shell.

4. File Structure:

Image by the author

The Terraform file structure above contains the following files:

  • Code in the Terraform language is stored in plain text files with the .tf file extension. There is also a JSON-based variant of the language that is named with the .tf.json file extension.
  • A module is a collection of .tf and/or .tf.json files kept together in a directory. A Terraform module only consists of the top-level configuration files in a directory; nested directories are treated as completely separate modules, and are not automatically included in the configuration.
  • main.tf: contains locals, module, and resource definitions.
  • data.tf: contains data-resources for input data used in main.tf.
  • outputs.tf: contains outputs from the resources created in main.tf.
  • providers.tf: contains provider and provider version definitions. Here we are using the provider information in the main.tf file.
  • variables.tf: contains declarations of variables (i.e. inputs or parameters) used in main.tf.
  • terraform.tfvars file contains the values assigned to the variables.
  • For example, variables.tf contains the variable declarations, and the tfvars file contains the corresponding values.
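As a minimal sketch (the region variable is illustrative; project_id and its value come from the article), the two files might look like this:

```hcl
# variables.tf -- variable declarations
variable "project_id" {
  description = "The GCP project ID"
  type        = string
}

variable "region" {
  description = "The GCP region"
  type        = string
  default     = "us-central1"
}
```

```hcl
# terraform.tfvars -- values assigned to the variables
project_id = "mlops2022"
region     = "us-central1"
```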

We have separate .tf files for the storage bucket, notebook instance, GKE cluster creation, service account creation, and enabling APIs. Terraform loads all the configuration files within the specified directory in alphabetical order.

This StackOverflow question discusses why we can use different tf files for each configuration instead of using all the configurations in the main.tf file.

Some important syntax from Terraform documentation

  • Line comments start with #
  • Multi-line comments are wrapped with /* and */
  • Values are assigned with the syntax of key = value (whitespace doesn't matter). The value can be any primitive (string, number, boolean), a list, or a map.
  • Strings are in double-quotes.
  • Strings can interpolate other values using syntax wrapped in ${}, such as ${var.foo}.
  • Multiline strings can use shell-style "here doc" syntax, with the string starting with a marker like <<EOF and ending with EOF on a line of its own. The lines of the string and the end marker must not be indented.
  • Numbers are assumed to be base 10. If you prefix a number with 0x, it is treated as a hexadecimal number.
  • Boolean values: true, false.
  • Lists of primitive types can be made with square brackets ([]). Example: ["foo", "bar", "baz"].
  • Maps can be made with braces ({}) and colons (:): { "foo": "bar", "bar": "baz" }. Quotes may be omitted on keys unless the key starts with a number, in which case quotes are required. Commas are required between key/value pairs for single line maps. A new line between key/value pairs is sufficient in multi-line maps.

For more please refer to Interpolation Syntax.

  • { } is used to define the block.
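The syntax rules above can be illustrated in one fragment (the keys and values are placeholders for illustration, not a working configuration):

```hcl
# Line comment
/* Multi-line
   comment */
name    = "example"            # string in double quotes
size    = 3                    # number, base 10
enabled = true                 # boolean
names   = ["foo", "bar"]       # list of primitives
labels = {                     # multi-line map: newlines separate pairs
  team  = "ml"
  owner = "mlops"
}
greeting = "Hello, ${var.foo}" # interpolation of another value
doc = <<EOF
Multi-line string using here-doc syntax.
EOF
```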

Let's dive into the files.

  • You can have one .tf file or many. I created one .tf file for GKE, one for the cloud storage bucket, one for the service accounts, and one for the notebook instance.

main.tf:

This contains mainly the provider information.

  • The provider is defined in the provider block, which starts with the keyword provider. Here our provider is Google: provider "google" { project = var.project_id }.
  • When you execute terraform init, it downloads the Google provider automatically.
  • You can refer to the full list of providers on the HashiCorp site. I have given the link above.
  • Terraform stores the provider in the project folder, in a special directory called .terraform.
  • The format is terraform-provider-<NAME>_vX.Y.Z.
  • When you run terraform init, Terraform checks whether the version is already available in that directory. If not, it downloads it.
  • We can also add a required_providers block inside the terraform block to restrict which provider versions can be used.
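Putting the pieces above together, a main.tf provider setup might look like this (the version constraint is an assumption for illustration, not from the original project):

```hcl
# Restrict which provider versions may be used.
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0" # illustrative constraint
    }
  }
}

# The provider block itself; project_id comes from variables.tf.
provider "google" {
  project = var.project_id
}
```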

variables.tf:

The variables.tf file consists of the variables you use in your project.

  • Declare a variable using the variable keyword.
  • Then specify the name of the variable in double quotes.
  • The description argument is optional.
  • The type argument is also optional, but it is recommended.
  • A type constraint allows you to be specific about the type of a variable. There are three simple types: string, number, and bool.
  • There are also additional type constraints: list, set, tuple, map, object, and any.
  • A string should be declared with quotes.
  • A number can be declared with or without quotes.
  • To use the value of a variable in the code, use the syntax var.<variable_identifier>. For example, in our case it is var.project_id, var.region, etc.
  • Variables can be set in different places:
  1. As environment variables — use an export statement
  2. As command-line flags — -var
  3. In a file — a tfvars file
  4. As variable defaults — in the variables.tf file
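These four options can be sketched as follows (the variable and file names are illustrative; the TF_VAR_ prefix is how Terraform maps environment variables to input variables):

```shell
# 1. Environment variable: the TF_VAR_ prefix maps to the variable name.
export TF_VAR_region="us-central1"

# 2. Command-line flag:
terraform plan -var="region=us-central1"

# 3. A variables file (terraform.tfvars is loaded automatically;
#    other files can be passed explicitly):
terraform plan -var-file="prod.tfvars"

# 4. The default value in variables.tf is used when nothing else sets it.
```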

terraform.tfvars:

  • terraform.tfvars is a special file name used by Terraform to identify the values of the variables declared in the variables.tf file.
  • project_id = "mlops2022": you write the variable identifier, then an equals sign (=), and then the value. In this case, project_id is the variable and mlops2022 is its value.
  • You can also use file names ending with .auto.tfvars if you want to split variable value assignments across multiple files.

Also, there are some other concepts of which you need to be aware.

Data Sources:

  • Data sources allow Terraform to use information defined outside of the current configuration.
Image by the author
  • A data source is defined by the data block.
  • After that, specify the type of the object — in this case, a Google Cloud Storage bucket object.
  • Then the name of the data source.
  • Then the name of the already-existing bucket: image-store.
  • Then the name of the object: folder/butterfly01.jpg.
  • The data source block is opened with { and closed with }.
  • You specify the properties of the object between the braces. Please refer to the documentation for the properties available for each object type.
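Based on the description above, the data source would look roughly like this (the data source name is illustrative):

```hcl
# Reference an object that already exists in a storage bucket.
data "google_storage_bucket_object" "picture" {
  bucket = "image-store"             # the existing bucket
  name   = "folder/butterfly01.jpg"  # the object inside it
}
```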

Outputs:

  • Outputs show data from the Terraform state file after the Terraform run is completed.
  • The output block starts with the keyword output, followed by the name of the output — in this case, "ip".
  • The output block is opened with { and closed with }.
  • Then you use the value argument. Here, the value google_compute_instance.vm_instance.network_interface.0.network_ip will be displayed after a successful terraform apply.
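The output block described above would look like this (the referenced compute instance is assumed to be defined elsewhere in the configuration):

```hcl
# Display the instance's internal IP after `terraform apply` completes.
output "ip" {
  value = google_compute_instance.vm_instance.network_interface.0.network_ip
}
```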

Please refer to the documentation for more info on the outputs.

Locals:

  • A local value is defined once and can be used multiple times. A local may be a fixed string, or it can refer to other expressions, such as two other locals.
  • A locals block starts with the locals keyword.
  • The locals block is opened with { and closed with }.
  • The first local, service_name, is assigned the string "forum" with =.
  • The second local, owner, is assigned the string "Community Team" with =.
  • A local can be referenced in the code using local.<local_identifier>.
  • Local values are created by a locals block (plural), but you reference them as attributes on an object named local (singular). Make sure to leave off the "s" when referencing a local value!
  • In this case, tags = local.common_tags (it should be local, not locals).
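A sketch of the locals block described above (the common_tags map is an assumed composite local built from the other two):

```hcl
# locals block (plural) defines the values...
locals {
  service_name = "forum"
  owner        = "Community Team"
  # A local can refer to other locals:
  common_tags = {
    Service = local.service_name
    Owner   = local.owner
  }
}

# ...but you reference them with the singular `local`, for example:
#   tags = local.common_tags
```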

Modules:

  • Modules are a way of organizing your code.
  • You can use modules from external vendors — for example, modules published for AWS, or modules on GitHub.
Image by the Author
  • You can reuse your own code as a module.
  • You can use a module from an external source like GitHub or from a local folder.
  • You can pass arguments to modules.
  • You can use outputs to pass values back to the calling module.
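A hypothetical module call illustrating these points (the module path, arguments, and output name are all made up for illustration):

```hcl
# Call a local module and pass it arguments.
module "gcs_bucket" {
  source = "./modules/gcs-bucket" # local folder; could also be a Git URL

  project_id  = var.project_id
  bucket_name = "my-ml-artifacts"
}

# The calling configuration can then read the module's outputs, e.g.:
#   module.gcs_bucket.bucket_url
```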

There are a few other concepts like Workspaces and Provisioners which I haven’t covered here. Please refer to the documentation for more information.

5. The source repository is available below:

Now copy the files to GCP:

Image by the Author
  • Set the variable to the GitHub repository.
  • Set the local directory variable.
  • kpt pkg get fetches a remote package from a Git subdirectory and writes it to a new local directory.
  • The syntax is: kpt pkg get REPO_URI[.git]/PKG_PATH[@VERSION] [LOCAL_DEST_DIRECTORY] [flags]

Now the files are copied to GCP:

6. Now we need to start creating the infrastructure. Before that, let's look at the workflow of Terraform.

Please watch the YouTube videos and then check the .tf files in the above example. Then you will understand the provider, resources, output, variables, data sources, etc.

The main standard workflow is:

Image by the Author

This is my workflow.

Image by the Author

Terraform Init:

terraform init is the first command you need to execute; it is mandatory. It downloads all the plugins associated with the provider. In our case, that is GCP.

Image by the Author

Check out the terraform init docs to learn more about the command.

Terraform Validate:

The terraform validate command validates the configuration files in a directory, referring only to the configuration and not accessing any remote services such as remote state, provider APIs, etc.

Image by the Author

It checks for issues such as:

  • Undeclared variables
  • Providers declared multiple times
  • Invalid module names
  • Resources declared multiple times

Terraform Plan

According to the HashiCorp documentation, the terraform plan command creates an execution plan, which lets you preview the changes that Terraform plans to make to your infrastructure. By default, when Terraform creates a plan, it:

  • Reads the current state of any already-existing remote objects to make sure that the Terraform state is up-to-date.
  • Compares the current configuration to the prior state and notes any differences.
  • Proposes a set of change actions that should, if applied, make the remote objects match the configuration.

You can save that plan to a file by using the -out parameter. For example, terraform plan -out mlops saves it to a file called mlops.

Image by the Author

For terraform apply and terraform destroy, Terraform will ask for confirmation.

Image by the Author

It says Plan: 39, meaning 39 objects will be added.

Terraform Apply:

The terraform apply command executes the actions proposed in a Terraform plan.

There are two ways to execute terraform apply:

  • Execute terraform apply without any arguments. In this case, it first creates the Terraform plan and then asks for confirmation; if you answer yes, it executes the plan.
  • Alternatively, pass terraform apply the filename of a saved plan file you created earlier, in which case Terraform applies the changes in the plan without any confirmation prompt.
Image by the Author

Now terraform apply reports 39 added, which matches the 39 additions the plan proposed.

After terraform apply completes, check the directory:

Image by the Author

Now you see 2 new files created

  • terraform.tfstate file
  • kubeconfig-production

terraform.tfstate file
The state file is created by Terraform. It records all the resources Terraform created from the configuration (.tf) files you provided.

Whenever a resource is removed, its corresponding entry in the state file is also removed. Terraform also records the dependencies of the objects it created and maintains those dependency relationships in the state file.

There is also a backup file of the previous state version in terraform.tfstate.backup.

The Terraform state file is used during:

  • terraform plan
  • terraform apply
  • terraform destroy

When you run terraform apply, a new state file is created and the old state is written to the backup. Note that the state file can contain sensitive values, so for team use it is generally better to keep it in a remote backend rather than commit it to a version control system like GitHub.

Check out the state commands and all the commands in the below cheat sheets:

Kubeconfig-production:

The kubeconfig-production is the kubeconfig for the newly created cluster.

Inspect the GKE cluster using the following command:
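A sketch of how to inspect the cluster, assuming the kubeconfig-production file generated above sits in the working directory:

```shell
# Point kubectl at the newly generated kubeconfig...
export KUBECONFIG="$PWD/kubeconfig-production"

# ...then inspect the cluster's nodes and running pods.
kubectl get nodes
kubectl get pods --all-namespaces
```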

Image by the Author
Image by the Author

Now you can see that GKE is up and running with all the pods.

Next, check all the objects you created:

GCP Storage bucket:

Image by the Author

GCP Service accounts:

Image by the Author

GCP enabled APIs:

Image by the Author

GCP GKE Cluster:

Image by the Author

More about Terraform apply is available in the docs.

Terraform Destroy:

terraform destroy destroys the resources that Terraform created.

  • terraform destroy without any arguments destroys all the resources created.
  • terraform destroy with the -target flag lets you destroy a specific resource.
  • In our case, 39 resources were created and I wanted to delete all 39, hence I used terraform destroy.
Image by the Author

So all the 39 resources created are destroyed.

So the important steps are:

  • terraform init
  • terraform apply (runs plan and asks for confirmation)
  • terraform destroy (if you want to destroy all the resources created)
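As a quick recap, the full command sequence used in this walkthrough looks like this (the plan file name mlops matches the earlier example):

```shell
terraform init             # download the provider plugins (mandatory first step)
terraform validate         # check the configuration files for errors
terraform plan -out mlops  # preview the changes and save the plan to a file
terraform apply mlops      # apply the saved plan (no confirmation prompt)
terraform destroy          # tear down all 39 resources when finished
```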

To install Terraform, please refer to this link.

Additional Resources

1. If you plan to prepare for Terraform Associate Certificate, this YouTube Video is a great place to get started:

Video credit freecodecamp.org

2. Nice blog on Terraform Standards

3. A CLI tool to generate terraform files from existing infrastructure (reverse Terraform)

4. Terraform Variables Standard

5. Terraform examples

6. Terraform zero to hero

Conclusion

Again, Terraform is a very useful tool to learn and use. The Terraform documentation itself is a very good place to start. Also, check out the YouTube videos I mentioned above. There are good courses available on Pluralsight, Udemy, and Coursera. Just check them out.

Try to automate the creation of resources in the cloud provider you work with, whether AWS, GCP, or Azure. It will be an interesting project. Good luck!

References

  1. https://github.com/GoogleCloudPlatform/mlops-with-vertex-ai
  2. https://github.com/hashicorp/terraform-provider-google
  3. https://github.com/terraform-google-modules
  4. https://learnk8s.io/terraform-gke
  5. https://www.terraform.io/docs
  6. https://learn.hashicorp.com/terraform
  7. https://docs.microsoft.com/en-us/azure/developer/terraform/
  8. https://cloud.google.com/docs/terraform
  9. https://github.com/upgundecha/howtheysre
  10. https://github.com/bridgecrewio/checkov
Want to connect? Please feel free to connect with me on LinkedIn.


ML/DS - Certified GCP Professional Machine Learning Engineer, Certified AWS Machine Learning Specialty, Certified GCP Professional Data Engineer.