Processing Payments in a Distributed System

Explained through a real-world example

Adrian Olariu
Better Programming

--

This article is intended for those looking to better understand distributed systems, such as microservices. It can also help those seeking an approachable example of how distributed patterns can be applied in practice: it walks through a real-world example of how a common capability, like processing payments, can be designed and implemented as a distributed system.

Introduction

As businesses grow and evolve, so do their technology needs. One common challenge organizations face is scaling their systems to meet increasing demands. One solution to this problem is to migrate from a monolithic architecture to a distributed system.

A monolithic architecture is characterized by a single, self-contained codebase that handles all aspects of the application. While this approach can be simple and efficient for small-scale projects, it can become unwieldy as the application grows. On the other hand, a distributed system comprises multiple smaller, modular components that can be developed, tested, and deployed independently. This modularity makes it easier to scale and maintain a distributed system as the application grows.

What Problem Are We Trying To Solve?

The problem is that distributed systems are often more complex than monolithic systems as they involve multiple components and layers of communication, making it more difficult to understand how all the pieces fit together.

This article aims to provide a practical example of how distributed systems work, which can help demystify the concepts and provide a starting point for further learning.

What Use Case Are We Addressing?

Let’s keep it simple for now — imagine we want to build a system that allows customers to submit one-off credit card (CC) payments in exchange for a digital product (e.g., ebook, insurance policy) they purchased from our platform.

This simple use case could later be extended, for example by adding subscriptions. The fundamental operation of processing the payment stays the same; with a subscription, the payment is simply triggered automatically rather than by the user. Our example will not include subscriptions, but the goal is to design a system that can easily be extended to support them as well.

What Are the System Requirements?

  1. Authentication: The system should have a method for verifying the identity of users before allowing them to make payments.
  2. Process CC payments: We will assume Stripe as the payment processor. We will also assume the customer already has an account within our system and a CC on file associated with the customer record.
  3. Consistency: Any payment processed by Stripe should eventually be reflected in our system.
  4. Idempotency: Every payment request should be processed once, resulting in a single payment (there should never be duplicate charges against the same CC).
  5. Traceability: The system should provide detailed information about historical updates for payment transactions.
  6. Monitoring and alerting: The system should provide monitoring of current operations and alerting of (unrecoverable) errors.

Can I See a High-Level System Diagram?

Before we go into the details, let’s see how our system is organized.

The frontend application

The frontend application layer is responsible for the application's user interface and user experience. In our example, this layer will allow users to submit the payment request for processing.

The backend-for-frontend (BFF)

The BFF layer sits between the frontend and the core services. It is meant to improve the maintainability of a microservices-based architecture by reducing the amount of code that needs to be written in the frontend — and making the frontend less dependent on the core services. The typical responsibilities found at this layer are:

  1. Adapting the API: The BFF layer adapts the backend services API to suit the frontend needs better. This includes things like filtering, sorting, and pagination of data, as well as mapping between the data models of the backend and the frontend.
  2. Authentication and authorization: The BFF layer may also handle authentication and authorization for the frontend, acting as a gatekeeper for the backend services.
  3. Aggregating services: The BFF layer can also aggregate multiple backend services to provide a single API for the frontend. This can simplify the frontend code and hide the complexity of the backend services from the frontend.
  4. Caching: BFF can also act as a cache layer for the data in the frontend, reducing the load on the backend services and improving the frontend performance.
  5. Handling business logic: BFF can handle a portion of the business logic that is specific to the frontend, making the frontend more decoupled from the backend.

The Core Services

The Core Services layer handles the heavy lifting of our system; it stores the data and processes it, contains the business logic, implements security measures, and handles scalability and communication. In our example, core services provide a REST HTTP API to manage the state and publish state change event notifications via the pub-sub.

The Publish/Subscribe

The Pub-Sub layer is responsible for routing messages between different system components in a decoupled way, allowing them to communicate without being tightly coupled. The Pub-Sub can offer additional capabilities, such as built-in retry mechanisms and event filtering.

The Workers

In a distributed architecture, workers perform specific tasks in response to events on the Pub-Sub. These tasks help offload the load from the main thread and help to improve the performance, scalability, and fault tolerance. In our example, the Payment Worker will be responsible for processing payment state change events published by the Payment Service.

Can I See the Sequence Diagram for Our Use Case?

Let’s look at how these high-level system components communicate with each other to complete our payment processing flow:

User Authentication Requirement

The user authentication requirement is handled at the BFF layer.

In our system, the BFF is the only public API; the Core Services run on a private network that only the BFF layer can access. We could add authentication at the lower levels of the system as well, but for this specific use case, that is not necessary.
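
To illustrate, here is a minimal sketch of the BFF acting as the authentication gatekeeper, assuming an Express-based BFF and JWT bearer tokens; the route, environment variables, and the downstream Payment Service URL are illustrative assumptions, not part of the original design.

```typescript
// Sketch of the BFF as the only public entry point: verify the caller, then forward the
// adapted request to the private Payment Service. Names and env variables are assumptions.
import express from "express";
import jwt from "jsonwebtoken";

const app = express();
app.use(express.json());

// Reject unauthenticated requests before anything reaches the private Core Services network.
function requireAuth(req: express.Request, res: express.Response, next: express.NextFunction) {
  const token = req.headers.authorization?.replace("Bearer ", "");
  if (!token) {
    res.status(401).json({ error: "missing token" });
    return;
  }
  try {
    (req as any).user = jwt.verify(token, process.env.JWT_SECRET!); // assumed signing secret
    next();
  } catch {
    res.status(401).json({ error: "invalid token" });
  }
}

// Adapt the frontend payload and forward it to the Payment Service on the private network.
app.post("/api/payments", requireAuth, async (req, res) => {
  const response = await fetch(`${process.env.PAYMENT_SERVICE_URL}/payments`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      customerId: (req as any).user.sub, // assumed claim identifying the customer
      amount: req.body.amount,
      currency: req.body.currency,
    }),
  });
  res.status(response.status).json(await response.json());
});
```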

Handle CC Payments Requirement

The Payment Service handles the CC Payment requirement with support from the Pub-Sub and the Payment Worker; a code sketch of the service side of this flow follows the list below.

  1. Upon receiving the request to create a payment, the Payment Service will:
    a. Initiate a database transaction
    b. Create a NEW payment record in the database as part of the transaction created in the previous step.
    c. Publish an event notification to the Pub-Sub. To keep things simple, the event notification is a thin event containing just the event name and the ID of the payment record — here’s an example: {name:'payment.created', id: 'abc123'}.
    d. Commit the transaction to the DB, which will make the payment record permanent.
  2. Upon receiving a new event from the Pub-Sub, the Payment Worker will:
    a. Pull the event from the Pub-Sub and mark it as in-flight so that no other worker can pick up the same event while it is being processed.
    b. Retrieve the payment object from Payment Service, given that amount and payment source are needed for the next operation, and our thin event does not contain those details.
    c. Create a Stripe charge for the CC and the amount referenced by the payment object.
    d. Update the payment object with the response from Stripe. This includes a status update and an external reference to the Stripe charge in case of a successful operation.
    e. Remove the event from the pub/sub.
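
To make step 1 more concrete, here is a minimal sketch of the Payment Service side, assuming a Postgres database accessed through the node-postgres (pg) client and a generic publish() helper standing in for whatever Pub-Sub client is used. The table name, column names, and status value are illustrative assumptions, not a prescribed schema.

```typescript
// Sketch of step 1: create the payment record and publish a thin event inside one transaction.
// The payments table, its columns, and the "payments" topic are assumptions for this example.
import { Pool } from "pg";
import { randomUUID } from "crypto";

const pool = new Pool(); // connection settings come from the standard PG* environment variables

// Stand-in for the real Pub-Sub client (SNS, Google Pub/Sub, RabbitMQ, ...).
declare function publish(topic: string, message: object): Promise<void>;

export async function createPayment(customerId: string, amount: number, currency: string) {
  const client = await pool.connect();
  const id = randomUUID();
  try {
    await client.query("BEGIN"); // 1a. initiate the database transaction
    await client.query(          // 1b. create the NEW payment record within the transaction
      "INSERT INTO payments (id, customer_id, amount, currency, status) VALUES ($1, $2, $3, $4, 'NEW')",
      [id, customerId, amount, currency]
    );
    await publish("payments", { name: "payment.created", id }); // 1c. thin event: name + ID only
    await client.query("COMMIT"); // 1d. commit, making the payment record permanent
    return { id, status: "NEW" };
  } catch (err) {
    await client.query("ROLLBACK"); // on failure, the record never becomes visible
    throw err;
  } finally {
    client.release();
  }
}
```

Note that the event is published before the transaction commits, exactly as described in steps 1c and 1d; this ordering is what produces the side effect discussed in the "Possible Side Effects" section below.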

Consistency Requirement

We want to ensure Stripe will process every payment created in our system and that the response from Stripe is reflected in our Payment Service.

In the previous section, we talked about how every “create payment” request is wrapped in a DB transaction:
→ which means the DB transaction will not be committed unless the event is published on the Pub-Sub
→ which means that for every payment record, we will have an event on the pub/sub.

On top of that, we should add that:

  • the Pub-Sub should guarantee at-least-once delivery (most messaging systems offer this out of the box)
  • our Payment Worker must remove the event from the Pub-Sub only after processing is complete.

If the above is true, every payment created in our system will be processed at least once.

Idempotency Requirement

The previous section indicates that some events might be processed multiple times. Since the event handler in the Payment Worker submits charge requests to Stripe, there is a risk of processing the same payment twice, resulting in duplicate charges.

The same risk arises when the Payment Worker fails after submitting the request to Stripe (2c above). The system will retry the event, which could result in a double charge if the previous attempt submitted the request to Stripe successfully and we only failed to record the Stripe response in our system.

Both risks can be mitigated by using the payment ID as the idempotency key when submitting Stripe charges. Stripe will recognize a retried request carrying the same key and return the original charge instead of creating a new one, effectively giving us exactly-once processing of each payment.
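
Here is a minimal sketch of the Payment Worker handler for step 2 above, assuming the official stripe Node library and hypothetical getPayment/updatePayment helpers that call the Payment Service; the status values and field names are assumptions. Newer Stripe integrations would typically create a PaymentIntent rather than a Charge, but the idempotency key option works the same way.

```typescript
// Sketch of step 2: process a payment.created event, using the payment ID as the
// Stripe idempotency key so retried events cannot produce a duplicate charge.
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// Hypothetical clients for the Payment Service; the real implementation would call its REST API.
declare function getPayment(id: string): Promise<{
  id: string;
  amount: number;          // smallest currency unit, e.g. cents
  currency: string;
  stripeCustomerId: string;
}>;
declare function updatePayment(id: string, fields: object): Promise<void>;

export async function handlePaymentCreated(event: { name: string; id: string }) {
  // 2b. fetch the full payment; the thin event only carries the ID
  const payment = await getPayment(event.id);

  // 2c. create the charge; a retry of the same event reuses the same idempotency key,
  //     so Stripe returns the original charge instead of creating a second one
  const charge = await stripe.charges.create(
    {
      amount: payment.amount,
      currency: payment.currency,
      customer: payment.stripeCustomerId,
    },
    { idempotencyKey: payment.id }
  );

  // 2d. record the outcome and keep the external reference to the Stripe charge
  await updatePayment(payment.id, { status: "PAID", stripeChargeId: charge.id });
  // 2e. returning without throwing lets the consumer acknowledge (remove) the event
}
```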

Traceability Requirement

Our system must provide detailed information about historical updates for payment transactions.

A simple solution is to make the payment records immutable once created, except for the status and timestamp fields, which are the only values updated as the payment moves through its lifecycle.

This approach works for our simple use case. Once we start having more complex state transitions, though, it might be worth considering richer solutions such as log tables or event sourcing.

Note that the status field will be used to manage the state machine of our payment object, and the Payment Service will be responsible for validating state transitions based on the current status of the payment object.
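
As an illustration, the transition validation inside the Payment Service could look like the sketch below; the status names and the allowed transitions are assumptions chosen for this example.

```typescript
// Sketch of status-transition validation; statuses and transitions are illustrative.
type PaymentStatus = "NEW" | "PROCESSING" | "PAID" | "FAILED";

const allowedTransitions: Record<PaymentStatus, PaymentStatus[]> = {
  NEW: ["PROCESSING", "FAILED"],
  PROCESSING: ["PAID", "FAILED"],
  PAID: [],    // terminal state
  FAILED: [],  // terminal state
};

// Called by the Payment Service before persisting any status update.
export function assertTransition(current: PaymentStatus, next: PaymentStatus): void {
  if (!allowedTransitions[current].includes(next)) {
    throw new Error(`Invalid payment status transition: ${current} -> ${next}`);
  }
}
```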

Monitoring Requirement

Monitoring in distributed systems is crucial for ensuring the system's availability, scalability, and performance. It allows us to identify and diagnose the root cause of problems, which can help to improve performance and the user experience.

We expect our system to log the operations performed by every system component. Ideally, every request submitted by the user should be assigned a unique tracing ID which will be printed as part of every log.
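
A minimal sketch of assigning such a tracing ID at the BFF is shown below, assuming Express; the header name and JSON log format are assumptions, and in practice this is often delegated to an API gateway or a tracing library such as OpenTelemetry.

```typescript
// Sketch: assign (or reuse) a tracing ID per request and include it in every log line.
// The x-trace-id header and the log shape are assumptions for this example.
import express from "express";
import { randomUUID } from "crypto";

export function tracing(req: express.Request, res: express.Response, next: express.NextFunction) {
  const traceId = (req.headers["x-trace-id"] as string) ?? randomUUID();
  (req as any).traceId = traceId;           // downstream calls forward this ID to core services
  res.setHeader("x-trace-id", traceId);     // echoed back so the client can report it on errors
  console.log(JSON.stringify({ traceId, msg: "request received", path: req.path }));
  next();
}
```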

Since we rely on the Pub-Sub’s retry mechanism, any event that the Payment Worker repeatedly fails to process will eventually be sent to the dead-letter queue (DLQ). The DLQ size is an important metric for measuring the system’s health.

Possible Side Effects

There are many failure points in this flow, but if the correct tools and processes are used, the system should be able to recover automatically and complete the flow successfully.

One possible side effect of the proposed design occurs when the operation fails immediately after the Payment Service publishes the payment.created event. Since the create operation is wrapped in a DB transaction, the payment record will be rolled back, but the published event cannot be recalled, so the worker will still receive and try to process it.

In this case, the expectation is that the Payment Worker will handle the error — when retrieving the payment by ID, the service will return a NotFound (404) error, in which case the worker should skip the event.

In practice, I have seen (rare) cases where the worker picked up the event before the DB transaction was committed — to address this, it’s advisable to retry the lookup a few times (three retries with exponential backoff should cover it) before treating the payment as rolled back. A superior solution to this problem is the Outbox pattern, but it is also more expensive to implement and maintain long term.
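
A sketch of that tolerant lookup in the worker is shown below; the NotFoundError class and the getPayment helper are hypothetical stand-ins for the real Payment Service client.

```typescript
// Sketch: tolerate the "event seen before commit" race by retrying the lookup a few times
// with exponential backoff, and skipping the event if the payment never appears.
class NotFoundError extends Error {}

// Hypothetical Payment Service client; assumed to throw NotFoundError on a 404 response.
declare function getPayment(id: string): Promise<object>;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function getPaymentWithRetry(id: string, attempts = 3): Promise<object | null> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await getPayment(id);
    } catch (err) {
      if (!(err instanceof NotFoundError)) throw err; // unexpected errors follow the DLQ path
      await sleep(200 * 2 ** attempt);                // 200ms, 400ms, 800ms
    }
  }
  return null; // still missing: the transaction was rolled back, so the event can be skipped
}
```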

Conclusions

In a distributed system, the functionality is separated into multiple individual components or services, each with a specific responsibility. This can result in a larger number of “moving pieces” in the overall system.

However, by breaking down the functionality into smaller, focused services, each component or layer becomes simpler and easier to understand. Each service has a clearly defined responsibility, which makes it easier to reason about its behavior and to test, deploy, and maintain it. This is in contrast to a monolithic system, where all the functionality lives in a single large codebase, making it harder to understand and reason about.

For example, in a monolithic payment system, all the payment-related functionality, such as transaction processing, user management, and fraud detection, may be implemented in a single codebase, making it harder to understand and reason about. But in a distributed system, each of these functionalities can be implemented in a separate service, making it easier to reason about and maintain each component separately.

Additionally, having smaller, focused services makes it easier to scale, deploy and upgrade each component independently, making the system more resilient and adaptable to changing requirements.

In summary, as the system becomes more distributed, the number of moving pieces in the system increases, but by breaking down the functionality into smaller, focused services, each component or layer becomes simpler and easier to understand, reason about, test, deploy, and maintain.
