Generate a Docker Compose File Using PyYAML

Marin Aglić · Published in Better Programming · Jan 29, 2023

In this article, I’ll summarise my experience using the PyYAML package to generate a docker-compose file. I’ve been learning about setting up Spark clusters lately, which led me to check out PyYAML.

I've been setting up Spark clusters on Docker: a standalone cluster and one on YARN. However, the problem I faced was that some web UIs weren't accessible, and some links, e.g., on the Spark Master UI, didn't work. The solution I came up with was to generate a docker-compose file before starting up the containers, which was previously done with shell scripts.

I decided to switch to a Python script because of the following reasons:

  1. I know Python better than shell scripting
  2. It seems easier to achieve what I want with Python
  3. I believe the code will be easier to maintain in Python

Therefore, enter PyYAML. You can find the installation instructions on the project's PyPI page. The only issue I have with it is that I'm unsure how actively it's maintained.

I will use PyYAML to generate a docker-compose file for a Spark standalone cluster with an arbitrary number of workers.

PyYAML

PyYAML is a Python package for reading, serialising, and emitting YAML content. I'm also using the click Python package to let the user pass options to the script. Currently, only one option is supported: the number of workers.

@click.command()
@click.option(
    "-w", "--spark-worker-count", default=1, help="Number of spark workers to include."
)
def generate_docker_compose(spark_worker_count):
    pass


if __name__ == "__main__":
    generate_docker_compose()

We start the script by defining a click command and its options. I named the script compose_generator.py. Once this is defined, you can run python compose_generator.py --help from the terminal to see how to run the script:

Usage: compose_generator.py [OPTIONS]

Options:
  -w, --spark-worker-count INTEGER
                                  Number of spark workers to include.
  --help                          Show this message and exit.

To pass in the number of workers, we simply call the script with python compose_generator.py -w 3 if we want three workers.

In my case, I provide some template YAML files for constructing the final docker-compose file. This makes it simpler to get the output I want, as I don’t need to provide all the values through Python. The two templates are located under the templates folder:

  • templates/worker.tmpl.yml
  • templates/docker-compose.tmpl.yml

Here is their content (first for the worker template):

image: spark-image
depends_on:
  - spark-master
environment:
  - SPARK_PUBLIC_DNS=localhost
env_file:
  - .env.spark
volumes:
  - ./data:/opt/spark/data
  - ./spark_apps:/opt/spark/apps
  - spark-logs:/opt/spark/spark-events

And the other one:

version: '3.8'

services:
  spark-master:
    container_name: spark-master
    build: .
    image: spark-image
    entrypoint: ['./entrypoint.sh', 'master']
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8080" ]
      interval: 5s
      timeout: 3s
      retries: 3
    volumes:
      - ./data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - spark-logs:/opt/spark/spark-events
    env_file:
      - .env.spark
    ports:
      - '9090:8080'
      - '7077:7077'

  spark-history-server:
    container_name: spark-history
    image: spark-image
    entrypoint: ['./entrypoint.sh', 'history']
    depends_on:
      - spark-master
    env_file:
      - .env.spark
    volumes:
      - spark-logs:/opt/spark/spark-events
    ports:
      - '18080:18080'

volumes:
  spark-logs:

OK, so I wanted the generated docker-compose file to conform to these rules:

  1. I want the entrypoint and healthcheck.test values to be flow-style sequences (that means I want them to have the brackets)
  2. The quotation marks need to be preserved
  3. The indentation needs to be preserved (or at least nice)

Loading the template

To make the code more manageable, I defined a class called TemplateManager that will manage the content of the YAML templates.

Here is the constructor of the TemplateManager class:

class TemplateManager:
    def __init__(self):
        path = Path("templates/")
        self.worker_tmpl = Path(f"{path.name}/worker.tmpl.yml")
        self.dc_tmpl = Path(f"{path.name}/docker-compose.tmpl.yml")

Once the user starts the script and sets the number of workers they want, the generate_docker_compose command first creates a configuration object using a data class and an instance of the TemplateManager.

@dataclass
class Config:
    spark_worker_count: int
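
At this stage, the body of the click command is just glue code. Roughly, it looks like this (the full version is in the complete script at the end of the article):

def generate_docker_compose(spark_worker_count):
    config = Config(spark_worker_count)
    template_manager = TemplateManager()
    # ... generate the YAML text and write it to a file (shown later)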

The TemplateManager class defines a generate_docker_compose method that accepts an instance of the data class. The method currently looks like this:

def generate_docker_compose(self, config: Config):
    spark_workers = self.prep_spark(config)

    return self.finalize_docker_compose(spark_workers)

Assigning keys

The prep_spark method loads the Spark worker template and assigns the missing keys. To load the files using PyYAML, we first read in the text using the Path object (you can also pass a TextIOWrapper to yaml.load) and then load it using the package.

worker_template = self.worker_tmpl.read_text()
worker_template = yaml.load(worker_template, Loader=yaml.Loader)

Once loaded, the data is represented as a dictionary. This means we can access elements using the keys or add them using the dictionary update method.
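
To make this concrete, the worker template shown above loads into the following dictionary (an illustrative check, not part of the script):

worker_template == {
    "image": "spark-image",
    "depends_on": ["spark-master"],
    "environment": ["SPARK_PUBLIC_DNS=localhost"],
    "env_file": [".env.spark"],
    "volumes": [
        "./data:/opt/spark/data",
        "./spark_apps:/opt/spark/apps",
        "spark-logs:/opt/spark/spark-events",
    ],
}  # True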

For example, to set the entry point for the service, we can do the following:

worker_template["entrypoint"] = ["./entrypoint.sh", "worker", f"{port}"]

Once all the keys are set, we return the dictionary containing the Spark workers definition and pass it to the finalize_docker_compose method.

In the finalize_docker_compose method, we first load the docker-compose file template and update the dictionary:

doc_comp_template = self.dc_tmpl.read_text()
doc_comp_yaml = yaml.load(doc_comp_template, Loader=yaml.Loader)

doc_comp_yaml["services"].update(workers)

pp(yaml.dump(doc_comp_yaml, sort_keys=False))

We can see that we update the services dictionary with the keys from the workers dictionary.

We can then use yaml.dump to dump the contents in YAML format and print them using pprint (I use from pprint import pprint as pp ). The sort_keys=False parameter tells the dump method that we don’t want it to sort the keys. The output looks like this (truncated):

(truncated)
'  spark-worker-1:\n'
'    image: spark-image\n'
'    depends_on:\n'
'    - spark-master\n'
'    environment:\n'
'    - SPARK_PUBLIC_DNS=localhost\n'
'    env_file:\n'
'    - .env.spark\n'
'    volumes:\n'
'    - ./data:/opt/spark/data\n'
'    - ./spark_apps:/opt/spark/apps\n'
'    - spark-logs:/opt/spark/spark-events\n'
'    container_name: spark-worker-1\n'
'    entrypoint:\n'
'    - ./entrypoint.sh\n'
'    - worker\n'
"    - '8081'\n"
'    ports:\n'
'    - 8081:8081\n'
'  spark-worker-2:\n'
'    image: spark-image\n'
(truncated)

Defining custom representers

OK, we got our YAML, but it doesn't conform to my style of writing docker-compose files. How do we fix this? Remember the first point:

I want the entrypoint and healthcheck.test values to be a flow-style sequence (that means I want them to have the brackets).

This can be achieved by defining a custom class and a new representer function, and reinitializing the values as instances of the new class.

We define the following:

class keepblockseq(list):
    pass


def keepblockseq_repr(dumper, data):
    return dumper.represent_sequence("tag:yaml.org,2002:seq", data, flow_style=True)

We then need to register this function:

REPRESENTER_CONFIGS = [
    {"class": keepblockseq, "repr": keepblockseq_repr},
]


def add_representers():
    for c in REPRESENTER_CONFIGS:
        yaml.add_representer(c["class"], c["repr"])

Call the add_representers function in the generate_docker_compose command. Any instance of the keepblockseq type will then use the flow representation, i.e., we can define which lists should keep their brackets.
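
Here is a quick, illustrative check of what registering the representer does (assuming the keepblockseq class and keepblockseq_repr function defined above):

add_representers()

print(yaml.dump(
    {"entrypoint": keepblockseq(["./entrypoint.sh", "worker"]), "depends_on": ["spark-master"]},
    sort_keys=False,
))
# entrypoint: [./entrypoint.sh, worker]
# depends_on:
# - spark-master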

We define the configuration for specific keys. Here’s what the code looks like:

KEY_REPR_CONFIG = {
    "spark-master": {
        "entrypoint": lambda x: keepblockseq(x),
        "healthcheck.test": lambda x: keepblockseq(x),
    },
    "spark-worker": {
        "entrypoint": lambda x: keepblockseq(x),
    },
}

I use dot notation to access nested keys. Next, define a _reinitialize_key_types method. It is quite long, so I won't provide the entire method yet. The gist of it is:

  1. Iterate over the services
  2. The service name and the key in the config don't need to match exactly (e.g., spark-worker). So, determine the service_key used in the config:

service_key = (
    service_name
    if service_name in KEY_REPR_CONFIG
    else next((k for k in KEY_REPR_CONFIG if service_name.startswith(k)), None)
)

if service_key is None:
    continue

key_map_config = KEY_REPR_CONFIG[service_key]

  3. Iterate over the YAML keys defined in the config object (KEY_REPR_CONFIG) for that service. Split each key by the dot (.)

  4. Iterate over the resulting list of keys, descending into nested dictionaries. When the value is not a dictionary, apply the transformation function:

nested_setting = service

for key in keys:
    yaml_element = nested_setting[key]

    if isinstance(yaml_element, dict):
        nested_setting = yaml_element
        continue

    nested_setting[key] = transformer(yaml_element)
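
As a standalone illustration of this traversal (with made-up values based on the spark-master healthcheck from the template above, not part of the script):

# Hypothetical, self-contained check of the dot-notation traversal.
service = {
    "healthcheck": {
        "test": ["CMD", "curl", "-f", "http://localhost:8080"],
        "interval": "5s",
    }
}
transformer = keepblockseq  # wrap the list so it dumps in flow style
keys = "healthcheck.test".split(".")

nested_setting = service
for key in keys:
    yaml_element = nested_setting[key]

    if isinstance(yaml_element, dict):
        nested_setting = yaml_element
        continue

    nested_setting[key] = transformer(yaml_element)

print(type(service["healthcheck"]["test"]).__name__)  # keepblockseq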

If we rerun the script with these settings, this is the output (truncated):

(truncated)
'  spark-master:\n'
'    container_name: spark-master\n'
'    build: .\n'
'    image: spark-image\n'
'    entrypoint: [./entrypoint.sh, master]\n'
'    healthcheck:\n'
"      test: [CMD, curl, -f, 'http://localhost:8080']\n"
'      interval: 5s\n'
'      timeout: 3s\n'
'      retries: 3\n'
(truncated)

So, the first point is achieved. However, there are still two things to do. How can we preserve the quotation marks?

We define another class and representer function for preserving quotes. Here's what that looks like:

class keepquotes(str):
    pass


def keepquotes_repr(dumper, data):
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="'")

We then define a helper function and redefine our representation config dictionaries:

REPRESENTER_CONFIGS = [
    {"class": keepquotes, "repr": keepquotes_repr},
    {"class": keepblockseq, "repr": keepblockseq_repr},
]


def keep_quotes_and_block_list(ls):
    return keepblockseq([keepquotes(x) for x in ls])


KEY_REPR_CONFIG = {
    "spark-master": {
        "entrypoint": keep_quotes_and_block_list,
        "healthcheck.test": keep_quotes_and_block_list,
    },
    "spark-worker": {
        "entrypoint": keep_quotes_and_block_list,
    },
}
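
A quick, illustrative sanity check of the combined helper (assuming the representers above have been registered with add_representers):

add_representers()

print(yaml.dump(
    {"entrypoint": keep_quotes_and_block_list(["./entrypoint.sh", "master"])},
    sort_keys=False,
))
# entrypoint: ['./entrypoint.sh', 'master']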

If we try to generate the docker-compose after this change, we will get (truncated):

(truncated)
'  spark-master:\n'
'    container_name: spark-master\n'
'    build: .\n'
'    image: spark-image\n'
"    entrypoint: ['./entrypoint.sh', 'master']\n"
'    healthcheck:\n'
"      test: ['CMD', 'curl', '-f', 'http://localhost:8080']\n"
'      interval: 5s\n'
'      timeout: 3s\n'
'      retries: 3\n'
'    volumes:\n'
'    - ./data:/opt/spark/data\n'
'    - ./spark_apps:/opt/spark/apps\n'
'    - spark-logs:/opt/spark/spark-events\n'
'    env_file:\n'
'    - .env.spark\n'
'    ports:\n'
'    - 9090:8080\n'
'    - 7077:7077\n'
(truncated)

We can see that the entrypoint and healthcheck.test lists retain their brackets, and each element inside retains its quotes.

Overriding the Dumper class

The final point is to have nice indentations. I’m not claiming this is the indentation we started with; it conforms to my style of writing docker-compose files.

The simplest solution I came across was to inherit from the yaml.Dumper class and redefine the increase_indent method:

class Dumper(yaml.Dumper):
    def increase_indent(self, flow=False, *args, **kwargs):
        return super().increase_indent(flow=flow, indentless=False)
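
To illustrate what this changes, here is a small before/after sketch with a made-up volumes mapping (a demonstration of indentless vs. indented block sequences, not part of the script):

data = {"volumes": ["./data:/opt/spark/data", "spark-logs:/opt/spark/spark-events"]}

print(yaml.dump(data))                 # default Dumper: indentless block sequences
# volumes:
# - ./data:/opt/spark/data
# - spark-logs:/opt/spark/spark-events

print(yaml.dump(data, Dumper=Dumper))  # custom Dumper: items indented under their key
# volumes:
#   - ./data:/opt/spark/data
#   - spark-logs:/opt/spark/spark-events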

Now, when dumping, we change the line to this (remember that pp is pprint):

pp(yaml.dump(doc_comp_yaml, sort_keys=False, Dumper=Dumper))

The output now looks like this (truncated):

(truncated)
'  spark-master:\n'
'    container_name: spark-master\n'
'    build: .\n'
'    image: spark-image\n'
"    entrypoint: ['./entrypoint.sh', 'master']\n"
'    healthcheck:\n'
"      test: ['CMD', 'curl', '-f', 'http://localhost:8080']\n"
'      interval: 5s\n'
'      timeout: 3s\n'
'      retries: 3\n'
'    volumes:\n'
'      - ./data:/opt/spark/data\n'
'      - ./spark_apps:/opt/spark/apps\n'
'      - spark-logs:/opt/spark/spark-events\n'
'    env_file:\n'
'      - .env.spark\n'
'    ports:\n'
'      - 9090:8080\n'
'      - 7077:7077\n'
(truncated)

We can see that there is better indentation with the non-flow styled lists (block-styled lists).

The Entire Script

Here is the code for the entire script. I also added some code that inserts blank lines at specific places to make the generated file easier to read:

from dataclasses import dataclass
from pathlib import Path
from pprint import pprint as pp
import re

import click

import yaml


class Dumper(yaml.Dumper):
    def increase_indent(self, flow=False, *args, **kwargs):
        return super().increase_indent(flow=flow, indentless=False)


class keepquotes(str):
    pass


def keepquotes_repr(dumper, data):
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="'")


class keepblockseq(list):
    pass


def keepblockseq_repr(dumper, data):
    return dumper.represent_sequence("tag:yaml.org,2002:seq", data, flow_style=True)


REPRESENTER_CONFIGS = [
    {"class": keepquotes, "repr": keepquotes_repr},
    {"class": keepblockseq, "repr": keepblockseq_repr},
]


def keep_quotes_and_block_list(ls):
    return keepblockseq([keepquotes(x) for x in ls])


KEY_REPR_CONFIG = {
    "*": {"ports": lambda xs: [keepquotes(x) for x in xs]},
    "spark-master": {
        "entrypoint": keep_quotes_and_block_list,
        "healthcheck.test": keep_quotes_and_block_list,
    },
    "spark-worker": {
        "entrypoint": keep_quotes_and_block_list,
    },
    "spark-history-server": {
        "entrypoint": keep_quotes_and_block_list,
    },
}


@dataclass
class Config:
    spark_worker_count: int


class TemplateManager:
    def __init__(self):
        path = Path("templates/")
        self.worker_tmpl = Path(f"{path.name}/worker.tmpl.yml")
        self.dc_tmpl = Path(f"{path.name}/docker-compose.tmpl.yml")

    def prep_spark(self, config: Config):
        num_workers = config.spark_worker_count
        port = 8081

        workers = {}

        for i in range(num_workers):
            worker_template = self.worker_tmpl.read_text()
            worker_template = yaml.load(worker_template, Loader=yaml.Loader)

            worker_template["container_name"] = f"spark-worker-{i + 1}"
            worker_template["entrypoint"] = ["./entrypoint.sh", "worker", f"{port}"]
            worker_template["ports"] = [f"{port}:{port}"]

            workers[f"spark-worker-{i + 1}"] = worker_template

            port = port + 1

        return workers

    def _reinitialize_key_types(self, doc_comp_yaml):
        services = doc_comp_yaml["services"]

        apply_to_all = KEY_REPR_CONFIG.get("*", {})

        for yaml_key, config in KEY_REPR_CONFIG.items():
            config.update(apply_to_all)

        for service_name, service in services.items():
            service_key = (
                service_name
                if service_name in KEY_REPR_CONFIG
                else next((k for k in KEY_REPR_CONFIG if service_name.startswith(k)), None)
            )

            if service_key is None:
                continue

            key_map_config = KEY_REPR_CONFIG[service_key]

            for yaml_key, transformer in key_map_config.items():
                if yaml_key == "*":
                    continue

                keys = yaml_key.split(".")

                nested_setting = service

                for key in keys:
                    yaml_element = nested_setting[key]

                    if isinstance(yaml_element, dict):
                        nested_setting = yaml_element
                        continue

                    nested_setting[key] = transformer(yaml_element)

    def finalize_docker_compose(self, workers):
        doc_comp_template = self.dc_tmpl.read_text()
        doc_comp_yaml = yaml.load(doc_comp_template, Loader=yaml.Loader)

        doc_comp_yaml["services"].update(workers)

        self._reinitialize_key_types(doc_comp_yaml)

        pp(yaml.dump(doc_comp_yaml, sort_keys=False, Dumper=Dumper))

        dump = yaml.dump(doc_comp_yaml, sort_keys=False, Dumper=Dumper)

        for part_name in list(doc_comp_yaml.keys())[1:]:
            dump = re.sub(f"^{part_name}:$", f"\n{part_name}:", dump, flags=re.MULTILINE)

        for service_name in list(doc_comp_yaml['services'].keys())[1:]:
            dump = dump.replace(f"{service_name}:", f"\n  {service_name}:")

        return dump

    def generate_docker_compose(self, config: Config):
        spark_workers = self.prep_spark(config)

        return self.finalize_docker_compose(spark_workers)


def add_representers():
    for c in REPRESENTER_CONFIGS:
        yaml.add_representer(c["class"], c["repr"])


def write_docker_compose(text: str, filename: str = "docker-compose.generated.yml"):
    with open(filename, "w") as file:
        file.write(text)


@click.command()
@click.option(
    "-w", "--spark-worker-count", default=1, help="Number of spark workers to include."
)
def generate_docker_compose(spark_worker_count):
    config = Config(spark_worker_count)
    template_manager = TemplateManager()

    add_representers()
    docker_compose_text = template_manager.generate_docker_compose(config)

    write_docker_compose(docker_compose_text, filename="test.yml")


if __name__ == "__main__":
    generate_docker_compose()

Takeaways

I think the key takeaways from this article are the following:

  • You can use PyYAML to load, parse, and manipulate YAML data
  • Use load or safe_load to load the YAML data, and dump or safe_dump to convert the dictionaries back to YAML strings (I didn't go into the safe variants of the methods; there is a short sketch after this list)
  • You can define custom classes and representer functions to represent different types and adjust the generator to your writing style. Remember to register the representers
  • You can implement your own Dumper class to control the indentation.
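
For completeness, a minimal sketch of the safe variants (not something the script above uses; they only handle standard YAML tags, which makes them the safer choice for untrusted input):

import yaml

data = yaml.safe_load("services:\n  spark-master:\n    image: spark-image\n")
print(data)  # {'services': {'spark-master': {'image': 'spark-image'}}}

print(yaml.safe_dump(data, sort_keys=False))
# services:
#   spark-master:
#     image: spark-image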

The entire code is in the article, including the template files required.
