Generate a Docker Compose File Using PyYAML
Load, parse, and manipulate YAML data

In this article, I’ll summarise my experience using the PyYAML package to generate a docker-compose file.
Lately, I’ve been learning how to set up Spark clusters on Docker, which is what led me to PyYAML. I’ve set up a standalone cluster and one on YARN. However, the problem I faced was that some web UIs weren’t accessible, and some links, e.g., on the Spark Master UI, didn’t work. The solution I came up with was to generate a docker-compose file before starting the containers. This was previously done with shell scripts.
I decided to switch to a Python script for the following reasons:
- I know Python better than shell scripting
- It seems easier to achieve what I want with Python
- I believe the code will be easier to maintain in Python
Therefore, enter PyYAML. You can find the installation instructions in the package documentation. The only issue I have with it is that I’m unsure how actively it’s maintained.
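If you want to follow along, both packages used below (PyYAML and click) are on PyPI, so a plain pip install should do:
pip install pyyaml click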
I will use PyYAML to generate a docker-compose file for a Spark standalone cluster with an arbitrary number of workers.
PyYAML
PyYAML is a Python package for reading, serialising, and emitting YAML content. I’m also using the click package to let the user pass options to the script. Currently, only one option is supported: the number of workers.
@click.command()
@click.option(
    "-w", "--spark-worker-count", default=1, help="Number of spark workers to include."
)
def generate_docker_compose(spark_worker_count):
    pass


if __name__ == "__main__":
    generate_docker_compose()
We start the script by defining a click command and its options. I named the script compose_generator.py. Once this is defined, you can run python compose_generator.py --help from the terminal to see instructions on how to run the script:
Usage: compose_generator.py [OPTIONS]

Options:
  -w, --spark-worker-count INTEGER
                                  Number of spark workers to include.
  --help                          Show this message and exit.
To pass in the argument (the number of workers) to the script, we can simply call it with python compose_generator.py -w 3 if we want three workers.
In my case, I provide some template YAML files for constructing the final docker-compose file. This makes it simpler to get the output I want, as I don’t need to provide all the values through Python. The two templates are located under the templates folder:
templates/worker.tmpl.yml
templates/docker-compose.tmpl.yml
Here is their content (first for the worker template):
image: spark-image
depends_on:
  - spark-master
environment:
  - SPARK_PUBLIC_DNS=localhost
env_file:
  - .env.spark
volumes:
  - ./data:/opt/spark/data
  - ./spark_apps:/opt/spark/apps
  - spark-logs:/opt/spark/spark-events
And the other one:
version: '3.8'

services:
  spark-master:
    container_name: spark-master
    build: .
    image: spark-image
    entrypoint: ['./entrypoint.sh', 'master']
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8080" ]
      interval: 5s
      timeout: 3s
      retries: 3
    volumes:
      - ./data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - spark-logs:/opt/spark/spark-events
    env_file:
      - .env.spark
    ports:
      - '9090:8080'
      - '7077:7077'

  spark-history-server:
    container_name: spark-history
    image: spark-image
    entrypoint: ['./entrypoint.sh', 'history']
    depends_on:
      - spark-master
    env_file:
      - .env.spark
    volumes:
      - spark-logs:/opt/spark/spark-events
    ports:
      - '18080:18080'

volumes:
  spark-logs:
OK, so I wanted to generate a docker-compose file that conformed to these rules:
- I want the entrypoint and healthcheck.test values to be a flow-style sequence (that means I want them to have the brackets)
- the quotation marks need to be preserved
- the indentation needs to be preserved (or at least look nice)
Loading the template
To make the code more manageable, I defined a class called TemplateManager that will manage the content of the YAML templates. Here is the constructor of the TemplateManager class:
class TemplateManager:
    def __init__(self):
        path = Path("templates/")
        self.worker_tmpl = Path(f"{path.name}/worker.tmpl.yml")
        self.dc_tmpl = Path(f"{path.name}/docker-compose.tmpl.yml")
Once the user starts the script and sets the number of workers they want, the generate_docker_compose function (the click command) first creates a configuration object using a data class and an instance of the TemplateManager.
@dataclass
class Config:
    spark_worker_count: int
The TemplateManager instance defines a generate_docker_compose method that accepts an instance of the data class. The method currently looks like this:
def generate_docker_compose(self, config: Config):
    spark_workers = self.prep_spark(config)
    return self.finalize_docker_compose(spark_workers)
Assigning keys
The prep_spark method loads the Spark worker template and assigns the missing keys. To load the files using PyYAML, we first read in the text using the Path object (you can also pass in a TextIOWrapper) and then load it using the package.
worker_template = self.worker_tmpl.read_text()
worker_template = yaml.load(worker_template, Loader=yaml.Loader)
Once loaded, the data is represented as a dictionary. This means we can access elements using the keys or add them using the dictionary update method.
For example, to set the entry point for the service, we can do the following:
worker_template["entrypoint"] = ["./entrypoint.sh", "worker", f"{port}"]
Once all the keys are set, we return the dictionary containing the Spark worker definitions and pass it to the finalize_docker_compose method.
In the finalize_docker_compose method, we first load the docker-compose file template and update the dictionary:
doc_comp_template = self.dc_tmpl.read_text()
doc_comp_yaml = yaml.load(doc_comp_template, Loader=yaml.Loader)
doc_comp_yaml["services"].update(workers)
pp(yaml.dump(doc_comp_yaml, sort_keys=False))
We can see that we update the services dictionary with the keys from the workers' dictionary.
We can then use yaml.dump to dump the contents in YAML format and print them using pprint (I use from pprint import pprint as pp). The sort_keys=False parameter tells the dump method that we don’t want it to sort the keys. The output looks like this (truncated):
(truncated)
'  spark-worker-1:\n'
'    image: spark-image\n'
'    depends_on:\n'
'    - spark-master\n'
'    environment:\n'
'    - SPARK_PUBLIC_DNS=localhost\n'
'    env_file:\n'
'    - .env.spark\n'
'    volumes:\n'
'    - ./data:/opt/spark/data\n'
'    - ./spark_apps:/opt/spark/apps\n'
'    - spark-logs:/opt/spark/spark-events\n'
'    container_name: spark-worker-1\n'
'    entrypoint:\n'
'    - ./entrypoint.sh\n'
'    - worker\n'
"    - '8081'\n"
'    ports:\n'
'    - 8081:8081\n'
'  spark-worker-2:\n'
'    image: spark-image\n'
(truncated)
Defining custom representers
OK, we got our YAML, but it doesn’t conform to my style of writing docker-compose files. How do we fix this? Remember the first point:
I want the entrypoint and healthcheck.test values to be a flow-style sequence (that means I want them to have the brackets).
This can be achieved by defining a custom class and a new representer function, and re-initialising the values as instances of the new class.
We define the following:
class keepblockseq(list):
pass
def keepblockseq_repr(dumper, data):
return dumper.represent_sequence("tag:yaml.org,2002:seq", data, flow_style=True)
We then need to register this function:
REPRESENTER_CONFIGS = [
    {"class": keepblockseq, "repr": keepblockseq_repr},
]


def add_representers():
    for c in REPRESENTER_CONFIGS:
        yaml.add_representer(c["class"], c["repr"])
Call the add_representers function in the generate_docker_compose function. Any instance of the type keepblockseq will use the flow representation, i.e., we can define which lists should keep their brackets.
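To see the effect in isolation, here is a minimal sketch with toy data (not part of the compose generation) comparing a plain list with a keepblockseq:
yaml.add_representer(keepblockseq, keepblockseq_repr)

print(yaml.dump({"plain": [1, 2], "flow": keepblockseq([1, 2])}, sort_keys=False))
# plain:
# - 1
# - 2
# flow: [1, 2]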
We define the configuration for specific keys. Here’s what the code looks like:
KEY_REPR_CONFIG = {
    "spark-master": {
        "entrypoint": lambda x: keepblockseq(x),
        "healthcheck.test": lambda x: keepblockseq(x),
    },
    "spark-worker": {
        "entrypoint": lambda x: keepblockseq(x),
    },
}
I use dot notation to access nested keys. Next, I define a _reinitialize_key_types method. It is quite long, so I won’t show the entire method yet. The gist of it is:
1. Iterate over the services.
2. The service name and the key in the config don’t need to match exactly (e.g., spark-worker). So, determine the service_key used in the config:
service_key = (
    service_name
    if service_name in KEY_REPR_CONFIG
    else next((k for k in KEY_REPR_CONFIG if service_name.startswith(k)), None)
)
if service_key is None:
    continue
key_map_config = KEY_REPR_CONFIG[service_key]
3. Iterate over the YAML keys defined in the config object (KEY_REPR_CONFIG) for that service. Split each key by the . character.
4. Iterate over the resulting list of keys, descending into nested dictionaries along the way; once the value is no longer a dictionary, apply the transformation function:
nested_setting = service
for key in keys:
    yaml_element = nested_setting[key]
    if isinstance(yaml_element, dict):
        nested_setting = yaml_element
        continue
    nested_setting[key] = transformer(yaml_element)
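To make the traversal concrete, here is a toy run on hypothetical data (the real method walks the services loaded from the templates):
service = {"healthcheck": {"test": ["CMD", "curl", "-f", "http://localhost:8080"]}}
keys = "healthcheck.test".split(".")
transformer = keepblockseq

nested_setting = service
for key in keys:
    yaml_element = nested_setting[key]
    if isinstance(yaml_element, dict):  # "healthcheck" is a dict, so descend into it
        nested_setting = yaml_element
        continue
    nested_setting[key] = transformer(yaml_element)  # "test" is a list, so wrap it

print(type(service["healthcheck"]["test"]).__name__)  # keepblockseq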
If we rerun the script with these settings, this is the output (truncated):
(truncated)
'  spark-master:\n'
'    container_name: spark-master\n'
'    build: .\n'
'    image: spark-image\n'
'    entrypoint: [./entrypoint.sh, master]\n'
'    healthcheck:\n'
"      test: [CMD, curl, -f, 'http://localhost:8080']\n"
'      interval: 5s\n'
'      timeout: 3s\n'
'      retries: 3\n'
(truncated)
So, the first point is achieved. However, there are still two things to do. How can we preserve the quotation marks? We define another class and another representer function for preserving quotes. Here’s what that looks like:
class keepquotes(str):
    pass


def keepquotes_repr(dumper, data):
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="'")
We then define a helper function and redefine our representation config dictionaries:
REPRESENTER_CONFIGS = [
    {"class": keepquotes, "repr": keepquotes_repr},
    {"class": keepblockseq, "repr": keepblockseq_repr},
]


def keep_quotes_and_block_list(ls):
    return keepblockseq([keepquotes(x) for x in ls])


KEY_REPR_CONFIG = {
    "spark-master": {
        "entrypoint": keep_quotes_and_block_list,
        "healthcheck.test": keep_quotes_and_block_list,
    },
    "spark-worker": {
        "entrypoint": keep_quotes_and_block_list,
    },
}
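As a quick check of the combined behaviour (a sketch with toy data; it assumes the representers above have been registered via add_representers):
add_representers()
print(yaml.dump({"entrypoint": keep_quotes_and_block_list(["./entrypoint.sh", "master"])},
                sort_keys=False))
# entrypoint: ['./entrypoint.sh', 'master']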
If we try to generate the docker-compose after this change, we will get (truncated):
(truncated)
'  spark-master:\n'
'    container_name: spark-master\n'
'    build: .\n'
'    image: spark-image\n'
"    entrypoint: ['./entrypoint.sh', 'master']\n"
'    healthcheck:\n'
"      test: ['CMD', 'curl', '-f', 'http://localhost:8080']\n"
'      interval: 5s\n'
'      timeout: 3s\n'
'      retries: 3\n'
'    volumes:\n'
'    - ./data:/opt/spark/data\n'
'    - ./spark_apps:/opt/spark/apps\n'
'    - spark-logs:/opt/spark/spark-events\n'
'    env_file:\n'
'    - .env.spark\n'
'    ports:\n'
'    - 9090:8080\n'
'    - 7077:7077\n'
(truncated)
We can see that the entrypoint and healthcheck.test lists keep their brackets (flow style), and each element inside retains its quotes.
Overriding the Dumper class
The final point is to have nice indentation. I’m not claiming this is the indentation we started with; it conforms to my style of writing docker-compose files. The simplest solution I came across was to inherit from the yaml.Dumper class and redefine the increase_indent method:
class Dumper(yaml.Dumper):
    def increase_indent(self, flow=False, *args, **kwargs):
        return super().increase_indent(flow=flow, indentless=False)
Now, when dumping, we change the line to this (remember that pp is pprint):
pp(yaml.dump(doc_comp_yaml, sort_keys=False, Dumper=Dumper))
(truncated)
'  spark-master:\n'
'    container_name: spark-master\n'
'    build: .\n'
'    image: spark-image\n'
"    entrypoint: ['./entrypoint.sh', 'master']\n"
'    healthcheck:\n'
"      test: ['CMD', 'curl', '-f', 'http://localhost:8080']\n"
'      interval: 5s\n'
'      timeout: 3s\n'
'      retries: 3\n'
'    volumes:\n'
'      - ./data:/opt/spark/data\n'
'      - ./spark_apps:/opt/spark/apps\n'
'      - spark-logs:/opt/spark/spark-events\n'
'    env_file:\n'
'      - .env.spark\n'
'    ports:\n'
'      - 9090:8080\n'
'      - 7077:7077\n'
(truncated)
We can see that the block-style (non-flow) lists now have nicer indentation.
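For a self-contained comparison (a sketch with toy data, independent of the compose templates), the override only changes how block sequences are indented:
data = {"volumes": ["./data:/opt/spark/data", "spark-logs:/opt/spark/spark-events"]}

print(yaml.dump(data))                 # default Dumper: "indentless" list items
# volumes:
# - ./data:/opt/spark/data
# - spark-logs:/opt/spark/spark-events

print(yaml.dump(data, Dumper=Dumper))  # overridden Dumper: items indented under the key
# volumes:
#   - ./data:/opt/spark/data
#   - spark-logs:/opt/spark/spark-events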
The Entire Script
Here is the code for the entire script. I also added some code to insert blank lines at specific places to make the file easier to read:
from dataclasses import dataclass
from pathlib import Path
from pprint import pprint as pp
import re

import click
import yaml


class Dumper(yaml.Dumper):
    def increase_indent(self, flow=False, *args, **kwargs):
        return super().increase_indent(flow=flow, indentless=False)


class keepquotes(str):
    pass


def keepquotes_repr(dumper, data):
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="'")


class keepblockseq(list):
    pass


def keepblockseq_repr(dumper, data):
    return dumper.represent_sequence("tag:yaml.org,2002:seq", data, flow_style=True)


REPRESENTER_CONFIGS = [
    {"class": keepquotes, "repr": keepquotes_repr},
    {"class": keepblockseq, "repr": keepblockseq_repr},
]


def keep_quotes_and_block_list(ls):
    return keepblockseq([keepquotes(x) for x in ls])


KEY_REPR_CONFIG = {
    "*": {"ports": lambda xs: [keepquotes(x) for x in xs]},
    "spark-master": {
        "entrypoint": keep_quotes_and_block_list,
        "healthcheck.test": keep_quotes_and_block_list,
    },
    "spark-worker": {
        "entrypoint": keep_quotes_and_block_list,
    },
    "spark-history-server": {
        "entrypoint": keep_quotes_and_block_list,
    },
}


@dataclass
class Config:
    spark_worker_count: int


class TemplateManager:
    def __init__(self):
        path = Path("templates/")
        self.worker_tmpl = Path(f"{path.name}/worker.tmpl.yml")
        self.dc_tmpl = Path(f"{path.name}/docker-compose.tmpl.yml")

    def prep_spark(self, config: Config):
        num_workers = config.spark_worker_count
        port = 8081
        workers = {}
        for i in range(num_workers):
            worker_template = self.worker_tmpl.read_text()
            worker_template = yaml.load(worker_template, Loader=yaml.Loader)
            worker_template["container_name"] = f"spark-worker-{i + 1}"
            worker_template["entrypoint"] = ["./entrypoint.sh", "worker", f"{port}"]
            worker_template["ports"] = [f"{port}:{port}"]
            workers[f"spark-worker-{i + 1}"] = worker_template
            port = port + 1
        return workers

    def _reinitialize_key_types(self, doc_comp_yaml):
        services = doc_comp_yaml["services"]
        apply_to_all = KEY_REPR_CONFIG.get("*", {})
        for yaml_key, config in KEY_REPR_CONFIG.items():
            config.update(apply_to_all)
        for service_name, service in services.items():
            service_key = (
                service_name
                if service_name in KEY_REPR_CONFIG
                else next((k for k in KEY_REPR_CONFIG if service_name.startswith(k)), None)
            )
            if service_key is None:
                continue
            key_map_config = KEY_REPR_CONFIG[service_key]
            for yaml_key, transformer in key_map_config.items():
                if yaml_key == "*":
                    continue
                keys = yaml_key.split(".")
                nested_setting = service
                for key in keys:
                    yaml_element = nested_setting[key]
                    if isinstance(yaml_element, dict):
                        nested_setting = yaml_element
                        continue
                    nested_setting[key] = transformer(yaml_element)

    def finalize_docker_compose(self, workers):
        doc_comp_template = self.dc_tmpl.read_text()
        doc_comp_yaml = yaml.load(doc_comp_template, Loader=yaml.Loader)
        doc_comp_yaml["services"].update(workers)
        self._reinitialize_key_types(doc_comp_yaml)
        pp(yaml.dump(doc_comp_yaml, sort_keys=False, Dumper=Dumper))
        dump = yaml.dump(doc_comp_yaml, sort_keys=False, Dumper=Dumper)
        for part_name in list(doc_comp_yaml.keys())[1:]:
            dump = re.sub(f"^{part_name}:$", f"\n{part_name}:", dump, flags=re.MULTILINE)
        for service_name in list(doc_comp_yaml['services'].keys())[1:]:
            dump = dump.replace(f"{service_name}:", f"\n  {service_name}:")
        return dump

    def generate_docker_compose(self, config: Config):
        spark_workers = self.prep_spark(config)
        return self.finalize_docker_compose(spark_workers)


def add_representers():
    for c in REPRESENTER_CONFIGS:
        yaml.add_representer(c["class"], c["repr"])


def write_docker_compose(text: str, filename: str = "docker-compose.generated.yml"):
    with open(filename, "w") as file:
        file.write(text)


@click.command()
@click.option(
    "-w", "--spark-worker-count", default=1, help="Number of spark workers to include."
)
def generate_docker_compose(spark_worker_count):
    config = Config(spark_worker_count)
    template_manager = TemplateManager()
    add_representers()
    docker_compose_text = template_manager.generate_docker_compose(config)
    write_docker_compose(docker_compose_text, filename="test.yml")


if __name__ == "__main__":
    generate_docker_compose()
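With everything in place, generating the file for, say, two workers is a single command (note that the entry point above writes the output to test.yml, the filename passed to write_docker_compose):
python compose_generator.py -w 2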
Takeaways
I think the key takeaways from this article are the following:
- You can use PyYAML to load, parse, and manipulate YAML data
- Use load or safe_load to load the YAML data, and dump or safe_dump to convert the dictionaries back to YAML strings (I didn’t go into the safe variants of the methods)
- You can define custom classes and representer functions to represent different types and adjust the generator to your writing style. Remember to register the representers
- You can implement your own Dumper class to control the indentation.
The entire code is in the article, including the template files required.