Let Me In: Designing Host Authentication

Layering multiple architectures for an ideal ecosystem

Elliot Graebert
Better Programming


While trying to write a post about authentication, I waffled between various ways to tackle the subject. I started by trying to write an enormous post covering every aspect of authentication, but I barely finished the outline before throwing that idea away. Instead, I’ve chosen to focus on one of the most critical and open areas for authentication: hosts!

From a security perspective, if an adversary can crack your host authentication system, that’s game over. However, if you make your host authentication so secure that the DevOps team cannot work effectively, you will negatively affect your recovery time during severe outages.

For this reason, you need to design an architecture that supports the key qualities of authentication: security, traceability, reliability, and usability.

In this post, I cover the generic architectures that all host authentication solutions fall into.

  • The local domain model hard-codes identities directly on each host, with no external system of record.
  • The remote domain model pulls or pushes identities from a centralized domain, with or without local caching.
  • The remote execution model has users authenticate to an application that executes the desired command on the user’s behalf.

The number of possible authentication solutions is massive. However, by keeping simple categories like the above in your mind, it becomes significantly easier to categorize and investigate the market of authentication solutions.

Interested in other DevOps topics like backups, vulnerabilities, or logging? Then check out the broader series:

Key Features in Host Authentication

The key features of an authentication system depend strongly on whether you are on the security team or the DevOps team. Still, both teams agree that the following five properties are essential:

  • Security: In the context of host authentication, security refers to the ability of an adversary to bypass the authentication requirement, establish an identity as an existing user, or escalate beyond the designed role of a user.
  • Traceability: The authentication system must have a clear trail from user to action (and vice versa). Security teams use this information in both detection and response. This requirement diminishes the value of tools like HashiCorp’s Vault SSH Helper that do not support an easy path from command to user.
  • Reliability: The more complicated the authentication path, the more the failure modes will compound. Authentication failures are notoriously difficult to recover from as they usually require an authenticated session to do advanced debugging and system recovery. Authentication systems also must be reliable at scale.
  • Usability: A key selection criterion when implementing host authentication is ensuring that the workflows required can be executed quickly (this is critical in emergencies). Thinking through the exact authentication workflow is especially important when it comes to multi-host command execution.
  • Maintainability: Finally, maintainability is a major concern when designing your authentication system. Security teams tend to push for large, complex solutions that are difficult to maintain but have excellent security controls, whereas DevOps teams tend to push for simple, reliable, and usable tools that don’t offer the features the security team needs.

Balancing these features will often feel like a zero-sum game: as you increase security, you’ll decrease usability. You can increase reliability by adding a fallback authentication system, but that will decrease maintainability and security. This is the nature of the game with authentication. Your goal is to strike an ideal balance between these five attributes.

Host Authentication Architectures

Local domain model

This model is awful. The only positive thing I can say about the local domain model is that it is the simplest. In this model, users and their identification methods are coded directly on each host, with no external system of record. Unfortunately, this means manual intervention is needed for every new user who needs access and for every host you add. The only reason to call out this design is to illustrate the base case that we are trying to avoid.
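As a minimal sketch of what this model implies in practice: an administrator pastes each user’s public key onto each host by hand. The username and key below are placeholders, and a temporary staging directory stands in for the real /home tree.

```shell
# Hard-coding an identity on a host: paste the user's public key straight
# into their authorized_keys file. A staging directory stands in for /home.
STAGE=$(mktemp -d)
mkdir -p "$STAGE/alice/.ssh"
echo 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA alice@laptop' \
    > "$STAGE/alice/.ssh/authorized_keys"
# OpenSSH refuses keys with lax permissions, so lock them down.
chmod 700 "$STAGE/alice/.ssh"
chmod 600 "$STAGE/alice/.ssh/authorized_keys"
# Now repeat for every user, on every host, by hand -- the core weakness.
```

Every onboarding, offboarding, and key rotation means touching every host again, which is exactly why this is the base case to avoid.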

Remote domain with runtime validation

The remote domain with runtime validation is a standard (but brittle) model. In this model, a domain administrator generates the new user in the domain, and the hosts reach out directly to the domain to verify the user’s identity. The biggest downside to this model is the increased complexity of the login workflow due to the runtime dependency between the host and the domain.

If the domain is down, the engineers will be unable to log into the hosts. This becomes even worse if fixing the domain is dependent on the engineer connecting to a host.

Example technologies that implement this model:

  • Simple LDAP Integration: Hosts join the domain and verify the engineer’s identity with an LDAP lookup. Common domains include Active Directory, FreeIPA, or Centrify.
  • Vault SSH Secret Engine: Engineers log in to a Vault cluster and generate an OTP for a specific host. Hosts are configured with a remote agent that validates the OTP with the Vault cluster.
  • Vault Signed SSH Certificates: Hosts are configured to trust a signing key (Vault), and engineers authenticate with Vault to sign their local key. The best practice is to use short TTLs, meaning that there is essentially a runtime dependency on Vault.
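To make the signed-certificate flow concrete, here is a sketch of the engineer-side commands. The mount path `ssh-client-signer` and role `devops` are hypothetical names you would configure in Vault; the sshd directive is the standard OpenSSH setting for trusting a CA.

```shell
# Engineer's laptop: authenticate to Vault, then have Vault's CA sign the
# local public key with a short TTL.
vault login -method=oidc
vault write -field=signed_key ssh-client-signer/sign/devops \
    public_key=@"$HOME/.ssh/id_ed25519.pub" ttl=30m \
    > "$HOME/.ssh/id_ed25519-cert.pub"

# Host side, in /etc/ssh/sshd_config: trust any cert signed by Vault's CA.
#   TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pem

ssh ubuntu@host.example.com   # ssh presents the certificate automatically
```

The 30-minute TTL is what creates the effective runtime dependency: once the certificate expires, the engineer must reach Vault again before the next login.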

Remote domain with local caching

The remote domain with local caching eliminates the runtime dependency on the domain. This model is excellent because it combines the simplicity of the local domain model with the advantages of an external domain. The key aspect of this design is the hosts’ resilience to the domain being down due to their local cache of the identities.

Example technologies that implement this model include the following:

  • Puppet/Ansible laying out SSH keys: In this case, the “domain” is the Puppet or Ansible configuration, and the “cache” is the authorized key file.
  • SSSD Credential Caching: Using SSSD, one can cache the credentials locally and be resilient against the domain going down. Common domains include Active Directory, FreeIPA, or Centrify.
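As a sketch of the SSSD option, the two settings that matter for this model are `cache_credentials` and the cache expiration window. The domain name and providers below are placeholders; a real deployment would point at your actual directory.

```shell
# Append a caching-enabled domain to sssd.conf (illustrative values only).
# cache_credentials lets previously seen users log in while the domain is
# down; account_cache_expiration bounds how long the offline cache is valid.
sudo tee -a /etc/sssd/sssd.conf <<'EOF'
[domain/example.com]
id_provider = ldap
auth_provider = ldap
cache_credentials = True
account_cache_expiration = 7
EOF
sudo systemctl restart sssd
```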

Remote execution model

The remote execution model is becoming the preferred model in many organizations. The key aspect of the design is that users no longer get direct access to the underlying host, and instead execute their commands via a process controlled by the remote management application.

If done correctly, this model can improve both productivity and security over the direct-access models described above. Productivity improves when the most common commands are made even easier through automation provided by the remote management application.

Remote management applications usually support more granular user access, better visibility, and restrictions on the commands available to users, all of which have the potential to improve security.

The disadvantage of this model is the significantly increased complexity of the user workflow. In this model, there is a runtime dependency on the domain, the remote management application, and often an agent running on the host. If any of these systems fails, access breaks.

Example technologies that implement this model include the following:

  • Ansible Tower: In the Ansible model, the engineer tells Ansible Tower (the application) to execute Ansible (a script) on the remote host.
  • Teleport: In the Teleport model, a Teleport agent lives on the remote host, which has a trusting relationship with the Teleport server. After an engineer has established trust with the Teleport server, they can issue commands to be executed by the remote agent.
  • AWS Systems Manager/Azure VM Agent: This model is similar to the Teleport service, but it is cheaper and has fewer features.
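A sketch of the AWS Systems Manager flavor of this model, using real AWS CLI commands with illustrative tag, instance ID, and command values:

```shell
# Fan one command out to every instance carrying a tag. The SSM agent on
# each host pulls the command from AWS; no inbound SSH port is required.
aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets "Key=tag:team,Values=devops" \
    --parameters 'commands=["uptime"]'

# Or open an interactive shell on a single instance through the same agent:
aws ssm start-session --target i-0123456789abcdef0
```

Note the runtime dependencies this implies: the AWS control plane, your identity provider, and the agent on the host all have to be healthy for either command to work.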

Putting It All Together

A very common approach is to layer multiple models together in a single architecture. In a well-managed environment, this is done to ensure redundancy in the face of both infrastructure and configuration failures. Below is a common example:

  • As a primary path, the DevOps team leverages a remote execution model to grant limited access to the application development teams. The goal is to enable developers to debug their applications without the security risks introduced by granting them full access to the underlying hosts.
  • As a secondary path, the DevOps team leverages a configuration management tool to deploy users and credentials to be used by robots for inter-cluster operations or vulnerability scanning. This is an example of a remote domain with local caching.
  • In emergencies, the DevOps team uses break glass keys to debug machines. This is essentially a remote domain with a local caching model where the cache is never updated or invalidated. This is almost identical to a local domain model, except that the cloud provider is the source of truth for identities.

The naive approach is to hope that one technique will work for all three types of use cases: humans, robots, and emergencies. Unfortunately, this isn’t the reality that we live in. Applications will go down, the configuration will get corrupted, and workflows that are human-compatible aren’t always robot compatible.

Rather than ignore these disparate use cases, let’s instead look at the critical features of host authentication and command execution and make sure that each of our connectivity paths supports the features we need.

Concrete Example of a Layered Authentication Architecture

In this example, the remote execution model is implemented using HashiCorp’s Boundary and Okta. This is the primary way for humans to access all AWS EC2 instances. The DevOps team also uses Puppet to configure the authorized key file with the public keys needed by automated processes like a vulnerability scanner.

These vulnerability scanners would hold the corresponding private key, enabling them to connect directly to the underlying hosts. And finally, in emergencies, the DevOps team could connect directly to the underlying hosts via the AWS-provided EC2 key pairs. This gives them a backup in case Boundary is down.
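The emergency path in this example might look like the following sketch. The instance ID, key path, and IP are placeholders; in practice, the break-glass private key would live in a tightly audited secret store.

```shell
# Break glass: bypass Boundary entirely and fall back to the EC2 launch
# key pair. First confirm which key pair the instance was launched with:
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[0].Instances[0].KeyName' --output text

# Then connect directly with the corresponding private key:
ssh -i /secure/break-glass/prod-emergency.pem ec2-user@10.0.1.17
```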

Multifactor Authentication

No host authentication design can be considered secure without multi-factor authentication. Backend access to your hosts (especially as a root user) is one of the juiciest targets an adversary can go for: with full root access, your network is their playground.

Here are three of the most common techniques for implementing multi-factor authentication. As discussed above, your go-to solution might require layering a couple of these together.

  • Use a bastion host to implement a multi-factor auth chokepoint. In this model, you configure all hosts in your environment to only allow SSH connections from a single host (known colloquially as a “bastion”). Using a tool like Duo’s PAM module, you require all users connecting to the bastion host to perform multi-factor authentication. This can also be done with a desktop bastion.
  • Enforce multifactor auth on the application layer. When using a remote execution application (e.g., Teleport), you can configure it to use an external authentication platform (usually via SAML) and enforce multifactor auth on that layer. The same technique works if you are using a tool like Hashicorp’s Vault to generate short-lived SSH credentials.
  • Use a hardware token for SSH identities. A great example of this would be a Yubikey. In this model, the user has to perform a multi-factor operation in order to access their SSH key.
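For the hardware-token option, OpenSSH 8.2+ supports FIDO2 authenticators (such as a YubiKey) natively. A sketch, assuming a token is plugged in:

```shell
# Generate an SSH key whose private half is bound to the hardware token;
# every signature requires a physical touch on the token.
ssh-keygen -t ed25519-sk -f ~/.ssh/id_ed25519_sk

# Add "-O verify-required" to also demand the token's PIN on each use:
ssh-keygen -t ed25519-sk -O verify-required -f ~/.ssh/id_ed25519_sk_pin
```

Possessing the key file alone is then insufficient: the token (something you have) plus the touch or PIN forms the second factor.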

It’s also worth noting that it’s possible to implement multi-factor authentication poorly such that it cripples productivity while providing little security. Unfortunately, deep diving into this topic is complex enough to warrant its own dedicated post. Here are a couple of example pitfalls:

  • Watch out for multi-factor auth in an environment with >100 hosts. Requiring multi-factor auth on every connection is fine if you have five hosts, but what if you have 50,000 on which you need to execute an emergency command? Tapping your phone 50,000 times in a row is an unacceptable workflow. Make sure you design around this use case.
  • Ensure you are setting reasonable session limits. The longer your session window, the more opportunities you give an adversary to exploit the session. However, setting a really short session window will cause unnecessary friction with your engineers. This is often a great reason to generate multiple accounts per engineer: a read-only user with long session limits and an admin user with short session limits.

This topic is very complex, and I am doing an inadequate job of describing how to secure the authentication process. This process should be designed collaboratively with your security team, with an eye on both productivity and security.

Wrap Up

This post covered four architectures for host-based authentication: local domains, remote domains without caching, remote domains with caching, and remote execution. Each of these architectures has its own strengths and weaknesses, and most mature companies layer multiple systems in order to achieve the right balance of security, traceability, reliability, usability, and maintainability.

As always, I recommend starting simple and adding complexity only as the rewards outweigh the risks.

Do you have another model of authentication that you think I missed, or have thoughts on host-based authentication that you want to share? Send me a comment, and I’ll respond.

Want to connect? You can find me on LinkedIn.

Director of Engineering at Skydio, Ex-Palantir, Infrastructure and Security Nerd, Gamer, Dad