Phew! We’ve covered a lot of ground in this book and for most of our audience all of these ideas are new. With that in mind, we can’t hope to make you experts in these techniques. All we can really do is show you the broad-brush ideas, and just enough code for you to go ahead and write something from scratch.
The code we’ve shown in this book isn’t battle-hardened production code: it’s a set of lego blocks that you can play with to make your first house, spaceship, and skyscraper.
That leaves us with two big tasks for our final chapter: we want to talk about how to start applying these ideas for real in an existing system, and we need to warn you about some of the things we had to skip. We’ve given you a whole new arsenal of ways to shoot yourself in the foot, so we should discuss some basic firearms safety.
Chances are that a lot of you are thinking something like this:
"OK Bob and Harry, that’s all well and good, and if I ever get hired to work on a green-field new service, I know what to do. But in the meantime, I’m here with my big ball of Django mud, and I don’t see any way to get to your nice, clean, perfect, untainted, simplistic model. Not from here."
We hear you. Once you’ve already built a big ball of mud, it’s hard to know how to start improving things. Really, we need to tackle things step by step.
First things first: what problem are you trying to solve? Is the software too hard to change? Is the performance unacceptable? Have you got weird inexplicable bugs?
Having a clear goal in mind will help you to prioritise the work that needs to be done and, importantly, communicate the reasons for doing it to the rest of the team. Businesses tend to have very pragmatic approaches to technical debt and refactoring, so long as engineers can make a reasoned argument for fixing things.
Tip
|
Making complex changes to a system is often an easier sell if you link it to feature work. Perhaps you’re launching a new product or opening your service to new markets? This is the right time to spend engineering resources on fixing the foundations. With a six-month project to deliver, it’s easier to make the argument for three weeks of clean-up work. Bob refers to this as "architecture tax". |
At the beginning of the book, we said that the main characteristic of a big ball of mud is homogeneity: every part of the system looks the same, because we haven’t been clear about the responsibilities of each component. To fix that, we’ll need to start separating responsibilities out and introducing some clear boundaries. One of the first things we can do is to start building a service layer.
Many years ago, Bob worked for a software company that had outsourced the first version of their application, an online collaboration platform for sharing and working on files.
When the company brought development in-house, it passed through several generations of developers' hands, and each wave of new developers added more complexity to the code’s structure.
At its heart, the system was an ASP.NET Web Forms application, built with NHibernate as its ORM. Users would upload documents into workspaces, where they could invite other workspace members to review, comment on, or modify their work.
Most of the complexity of the application was in the permissions model since each document was contained in a folder, and folders allowed read, write, and edit permissions, much like a Linux filesystem.
Additionally, each workspace belonged to an account, and the account had quotas attached to it via a billing package.
As a result, every read or write operation against a document had to load an enormous number of objects from the database in order to test permissions and quotas. Creating a new workspace involved hundreds of database queries as we set up the permissions structure, invited users, and set up sample content.
Some of the code for operations was in web handlers, and ran when a user clicked a button or submitted a form, some of it was in "manager" objects that held code for orchestrating work, and some of it was in the domain model. Model objects would make database calls, or copy files on disk, and the test coverage was abysmal.
To fix the problem, we first introduced a service layer so that all of the code for creating a document or workspace was in one place and could be understood. This involved pulling data access code out of the domain model and into command handlers. Likewise, we pulled orchestration code out of the managers and the web handlers, and pushed it into handlers.
The resulting command handlers were long and messy, but we’d made a start at introducing order to the chaos.
TODO: Add diagram
This was the system where Bob first learned how to break apart a ball of mud, and it was a doozy. There was logic everywhere - in the web pages, in "manager" objects, in "helpers", in fat "service" classes that we’d written to abstract the managers and helpers, and in hairy "command" objects that we’d written to break apart the services.
If you’re working in a system that’s reached this point, it can feel hopeless, but it’s never too late to start weeding an overgrown garden. Eventually we hired an architect who knew what he was doing, and he helped us get things back under control.
Start by working out the use cases of your system. If you have a user interface, what actions does it perform? If you’ve got some back-end processing component, then maybe each cron job or Celery job is a single use case. Each of your use cases needs to have an imperative name: "Apply Billing Charges", "Clean Abandoned Accounts", or "Raise Purchase Order".
In our case, most of our use-cases were part of the manager classes and had names like "Create Workspace" or "Delete Document Version". Each use-case was invoked from a web front end.
Our goal is to create a single function or class for each of these supported operations that deals with orchestrating the work to be done. Each use case should:
- Start its own database transaction if needed
- Fetch any required data
- Check any preconditions (see the ensure pattern in the validation appendix)
- Update the domain model
- Persist any changes
Each use case should succeed or fail as an atomic unit. You might need to call one use case from another - that’s okay, just make a note of it, and try to avoid long-running database transactions.
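To make this concrete, here’s a minimal sketch of what one of these use-case functions might look like. None of the names are from the case study: the command class, the unit of work with its accounts repository, and the Account methods are all invented for illustration.

from dataclasses import dataclass


@dataclass
class CreateWorkspace:
    account_id: int
    name: str
    requested_by: str


class QuotaExceeded(Exception):
    pass


def create_workspace(cmd: CreateWorkspace, uow):
    with uow:                                       # start a database transaction
        account = uow.accounts.get(cmd.account_id)  # fetch any required data
        if not account.has_spare_quota():           # check preconditions
            raise QuotaExceeded(cmd.account_id)
        account.add_workspace(                      # update the domain model
            name=cmd.name, owner=cmd.requested_by
        )
        uow.commit()                                # persist any changes

Everything the operation needs is visible in one place, and the unit of work gives it a clearly defined start and finish.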
Note
|
One of the biggest problems we had was that manager methods called other manager methods, and data access could happen from the model objects themselves. It was hard to understand what each operation did without going on a treasure hunt across the codebase. Pulling all the logic into a single method, and using a unit of work to control our transactions, made the system easier to reason about. |
This is a good opportunity to pull any data-access or orchestration code out of the domain model and up into the use cases. We should also try to pull IO concerns (e.g., sending emails, writing files) out of the domain model and up into the use-case functions. We apply the techniques from [chapter_03_abstractions] on abstractions to keep our handlers unit testable even when they’re performing IO.
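As a rough illustration, here’s one way to keep a use case unit testable even though it sends email. The AbstractNotifications port and the invite_user use case are invented names, in the spirit of [chapter_03_abstractions]: the use case depends on an abstraction, the SMTP details stay at the edge, and the test swaps in a fake.

import abc


class AbstractNotifications(abc.ABC):
    @abc.abstractmethod
    def send(self, destination: str, message: str): ...


class EmailNotifications(AbstractNotifications):    # real implementation, lives at the edge
    def send(self, destination, message):
        ...  # SMTP details go here, not in the use case


class FakeNotifications(AbstractNotifications):     # used in unit tests
    def __init__(self):
        self.sent = []

    def send(self, destination, message):
        self.sent.append((destination, message))


def invite_user(email, workspace_id, notifications: AbstractNotifications):
    # the use case stays unit testable because the IO is injected
    notifications.send(email, f"You've been invited to workspace {workspace_id}")


def test_invite_sends_notification():
    notifications = FakeNotifications()
    invite_user("bob@example.com", 42, notifications)
    assert notifications.sent == [("bob@example.com", "You've been invited to workspace 42")]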
Tip
|
It’s fine if you’ve got duplication in the use-case functions. We’re not trying to write perfect code, we’re just trying to extract some meaningful layers. It’s better to duplicate some code in a few places than to have use-case functions calling one another in a long chain. |
These use-case functions will mostly be about logging, data access, and error handling. Once you’ve done this step, you’ll have a grasp of what your program actually does, and a way to make sure each operation has a clearly defined start and finish. We’ll have taken a step toward building a pure domain model.
Read Working Effectively with Legacy Code by Michael Feathers for guidance on how to get legacy code under test, and how to start separating responsibilities: https://www.oreilly.com/library/view/working-effectively-with/0131177052/
Part of the problem with the codebase in our case study was that the object graph was highly connected. Each account had many workspaces, each workspace had many members, all of whom had their own accounts. Each workspace contained many documents, which had many versions.
You can’t express the full horror of the thing in an entity-relationship diagram. For one thing, there wasn’t really a single account related to a user. Instead, there was some bizarre rule where you had to enumerate all of the accounts associated with the user via their workspaces and take the one with the earliest creation date.
Every object in the system was part of an inheritance hierarchy that included SecureObject and Version, and this inheritance hierarchy was mirrored directly in the database schema, so that every query had to join across ten different tables and look at a discriminator column just to tell what kind of objects you were working with.
The codebase made it easy to "dot" your way through these objects like so:
user.account.workspaces[0].documents.versions[1].owner.account.workspaces[0].settings
It’s easy to build a system this way with the Django ORM or SQLAlchemy, but it’s best avoided. While it’s convenient, it makes it very hard to reason about performance, because each property access might trigger a lookup to the database.
Tip
|
Aggregates are a consistency boundary. In general, each use case should update a single aggregate at a time. One handler fetches one aggregate from a repository, modifies its state, and raises any events that happen as a result. If you need data from another part of the system, it’s totally fine to use a read model, but avoid updating multiple aggregates in a single transaction. When we choose to separate code into different aggregates, we’re explicitly choosing to make them eventually consistent with one another. [chapter_06_uow] has much more on this topic. |
There were a bunch of operations that required us to loop over objects this way, for example:
# Lock a user's workspaces for non-payment
def lock_account(user):
    for workspace in user.account.workspaces:
        workspace.archive()
Or even recurse over collections of folders and documents:
def lock_documents_in_folder(folder):
    for doc in folder.documents:
        doc.archive()
    for child in folder.children:
        lock_documents_in_folder(child)
These operations killed performance but fixing them meant giving up our single object graph. Instead we began to identify aggregates and to break the direct links between objects.
Note
|
We talked about the infamous "select N+1 problem" in [chapter_11_external_events], and how we might choose to use different techniques when reading data for queries vs reading data for commands. |
Mostly we did this by replacing direct references with identifiers:
Before:
class Workspace:
    folders: List[Folder]


class Folder:
    permissions: PermissionSet
    documents: List[Document]
    parent: Folder
    children: List[Folder]


class Document:
    parent: Folder
    workspace: Workspace
    versions: List[DocumentVersion]
After:
class Document:
    id: int
    workspace_id: int
    parent_id: int
    # Note that our Document Aggregate continued to hold all its versions,
    # so that we could treat the whole document as a single unit.
    versions: List[DocumentVersion]


class Folder:
    id: int
    permissions: PermissionSet
    workspace_id: int


class Workspace:
    id: int
Tip
|
Bi-directional links are often a sign that your aggregates aren’t right.
In our original code, a Document knew about its containing Folder, and the
Folder had a collection of Documents. This makes it easy to traverse the
object graph but stops us from thinking properly about the consistency
boundaries we need. We break apart aggregates by using references instead.
In the new model, a Document had a folder_id but no way to directly access
the Folder.
|
If we needed to read data, we avoided writing complex loops and transforms and tried to replace them with straight SQL. For example, one of our screens was a tree view of folders and documents.
This screen was incredibly heavy on the database, because it relied on nested for loops, each of which triggered lazy loading in the ORM.
Tip
|
We use this same technique in the book in [chapter_11_external_events] where we replace a nested loop over ORM objects with a simple SQL query. It’s the first step in a CQRS approach. |
After a lot of head-scratching, we replaced the ORM code with a big, ugly stored procedure. The code looked horrible, but it was much faster and it helped us to break the links between Folder and Document.
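We used a stored procedure, but in Python terms the same move looks roughly like the sketch below: one query that returns plain dicts for the view, instead of walking the object graph. The table and column names are invented, and we’re assuming SQLAlchemy 1.4 or later for .mappings().

from sqlalchemy import text


def folder_tree_view(session, workspace_id):
    # one round trip to the database instead of N+1 lazy loads
    rows = session.execute(
        text(
            "SELECT f.id AS folder_id, f.parent_id, d.id AS document_id, d.title "
            "FROM folders AS f "
            "LEFT JOIN documents AS d ON d.folder_id = f.id "
            "WHERE f.workspace_id = :workspace_id "
            "ORDER BY f.parent_id, f.id"
        ),
        dict(workspace_id=workspace_id),
    )
    return [dict(row) for row in rows.mappings()]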
When we needed to write data, we changed a single aggregate at a time, and we introduced a message bus to handle events. For example, in the new model, when we locked an account, we could first query for all the affected workspaces:

SELECT id FROM workspace WHERE account_id = ?

We could then raise a new command for each workspace:

for workspace_id in workspaces:
    bus.handle(LockWorkspace(workspace_id))
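Each of those commands then gets its own handler, which loads and modifies a single Workspace aggregate. A sketch, assuming a unit of work with a workspaces repository (our invented names):

def handle_lock_workspace(cmd, uow):
    with uow:
        workspace = uow.workspaces.get(cmd.workspace_id)  # one aggregate per handler
        workspace.archive()
        uow.commit()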
MADE.com started out with two monoliths: one for the front-end e-commerce application, and one for the back-end fulfilment system.
The two systems communicated through XML-RPC. Periodically, the back-end system would wake up and query the front-end system to find out about new orders. When it had imported all the new orders, it would send RPC commands to update the stock levels.
Over time this synchronisation process became slower and slower until, one Christmas, it took longer than 24 hours to import a single day’s orders. Bob was hired to break the system into a set of event-driven services.
First, we identified that the slowest part of the process was calculating and synchronising the available stock. What we needed was a system that could listen to external events and keep a running total of how much stock was available.
We exposed that information via an API, so that the user’s browser could ask how much stock was available for each product, and how long it would take to deliver to their house.
Whenever a product ran out of stock completely, we would raise a new event that the e-commerce platform could use to take a product off sale. Because we didn’t know how much load we would need to handle, we wrote the system with a CQRS pattern. Whenever the amount of stock changed, we would update a Redis database with a cached view model. Our Flask API queried these "view models" instead of running the complex domain model.
As a result, we could answer the question "How much stock is available?" in two to three milliseconds, and the API frequently handles hundreds of requests a second for sustained periods.
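By way of illustration only, the shape of that read side is roughly the sketch below: an event handler writes a denormalised view model into Redis, and a Flask endpoint reads it straight back out. The key format and event fields here are invented, not the production code.

import json

import redis
from flask import Flask, jsonify

r = redis.Redis()
app = Flask(__name__)


def handle_stock_changed(event):
    # update the cached view model whenever the amount of stock changes
    view_model = {"sku": event.sku, "available": event.available_quantity}
    r.set(f"stock:{event.sku}", json.dumps(view_model))


@app.route("/stock/<sku>")
def get_stock(sku):
    # the query side reads the view model; it never touches the domain model
    raw = r.get(f"stock:{sku}")
    if raw is None:
        return "not found", 404
    return jsonify(json.loads(raw))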
If this all sounds a little familiar, well, now you know where our example app came from!
When building the availability service, we used a technique called event interception to move functionality from one place to another. This is a three-step process:

- Raise events to represent the changes happening in a system you want to replace.
- Build a second system that consumes those events and uses them to build its own domain model.
- Replace the older system with the new.
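A sketch of what step 2 can look like in practice: handlers that consume the old system’s events and translate them into the new model. The event fields, repository, and Product methods are invented for illustration.

def handle_batch_created(event, uow):
    # translate a legacy fulfilment event into our new availability model
    with uow:
        product = uow.products.get(sku=event.sku)
        product.add_batch(ref=event.ref, qty=event.qty, eta=event.eta)
        uow.commit()


def handle_order_placed(event, uow):
    with uow:
        product = uow.products.get(sku=event.sku)
        product.allocate(orderid=event.order_id, sku=event.sku, qty=event.qty)
        uow.commit()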
We used event interception to move from this:
TODO: Context diagram, E-commerce and Fulfilment over XMLRPC
to this:
TODO: Context diagram, E-Commerce, Availability, Fulfilment, Event-Broker
If you’re thinking about carving a new system out of a big ball of mud, you’re probably suffering problems with reliability, performance, maintainability or all three simultaneously. Deep intractable problems call for drastic measures!
We recommend domain modelling as a first step. In many overgrown systems, the engineers, product owners, and customers no longer speak the same language. Business stakeholders speak about the system in abstract, process-focused terms, while developers are forced to speak about the system as it physically exists in its wild and chaotic state.
We mentioned earlier that the account and user model in our first system were bound together by a "bizarre rule". This is a perfect example of how engineering and business stakeholders can drift apart.
In this system, accounts parented workspaces, and users were members of workspaces. Workspaces were the fundamental unit for applying permissions and quotas. If a user joined a workspace and didn’t already have an account, we would associate them with the account that owned that workspace.
This was messy and ad-hoc, but it worked fine until the day a product owner asked for a new feature:
When a user joins a company, we want to add them to some default workspaces for the company, like the HR workspace or the Company Announcements workspace.
We had to explain to them that there was no such thing as a company, and there was no sense in which a user joined an account. Moreover, a "company" might have many accounts, owned by different users, and a new user might be invited to any one of them.
Years of adding hacks and workarounds to a broken model caught up with us, and we had to rewrite the entire user management function as a brand new system.
Figuring out how to model your domain is a complex task, and the subject of many decent books in its own right. We like to use interactive techniques like Event Storming and CRC modelling, because humans are good at collaborating through play.
Event Modelling is a technique that brings engineers and product owners together to understand a system in terms of commands, queries, and events.
TODO: Link event modelling
The goal is to be able to talk about the system using the same ubiquitous language, so that you can agree on where the complexity lies.
We’ve found a lot of value in treating domain problems as TDD katas. For example, the first code we wrote for the availability service was the batch and order line model. You can treat this as a lunchtime workshop, or as a spike at the beginning of a project. Once you can demonstrate the value of modelling, it’s easier to make the argument for structuring the project to make modelling easy.
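To give you a flavour of the kata, here’s the kind of test we started from, simplified from memory rather than copied from the real code:

from datetime import date

from model import Batch, OrderLine, allocate  # the domain code the kata asks you to write


def test_prefers_in_stock_batches_to_shipments():
    in_stock = Batch("in-stock-batch", "RETRO-CLOCK", qty=100, eta=None)
    shipment = Batch("shipment-batch", "RETRO-CLOCK", qty=100, eta=date(2025, 1, 1))
    line = OrderLine("oref", "RETRO-CLOCK", 10)

    allocate(line, [in_stock, shipment])

    assert in_stock.available_quantity == 90
    assert shipment.available_quantity == 100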
Once you’ve managed to convince your team to try a new way of building software, it’s easy to get overwhelmed by the sheer number of new things to think about. What message broker should we use? How should we:

- Decide which piece of the old system to carve out first?
- Get the old system to produce events, both as its main outputs and as inputs to the new system?
- Consume those events in the new service, which now has its own database and its own bounded context?
- Have the new system produce events of its own: either the same events the old one did (so we can switch those old parts off), or new ones, switching the downstream consumers over progressively?
Hi, I’m David - one of the tech reviewers on this book. I’ve worked on several complex Django monoliths, and so I’ve known the pain that Bob and Harry have made all sorts of grand promises about soothing.
When I was first exposed to the patterns described here, I was rather excited. I had successfully used some of the techniques already on smaller projects, but here was a blueprint for much larger, database-backed systems like the one I work on in my day job. So, I started trying to figure out how I could implement it at my current organization.
I chose to tackle a problem area of the code base that had always bothered me. I began by implementing it as a use case. But I found myself running into unexpected questions. There were things that I hadn’t considered while reading that now made it difficult to see what to do. Was it a problem if my use case interacted with two different aggregates? Could one use case call another? And how was it going to exist within a system that followed different architectural principles without resulting in a horrible mess?
What happened to that oh-so-promising blueprint? Did I actually understand the ideas well enough to put them into practice? Was it even suitable for my application? Even if it was, would any of my colleagues agree to such a major change? Were these just nice ideas for me to fantasize about while I got on with real life?
It took me a while to realize that I could start small. I didn’t need to be a purist, or to 'get it right' first time: I could experiment, finding what worked for me.
And so that’s what I’ve done. I’ve been able to apply some of the ideas in a few places. I’ve built new features whose business logic can be tested without the database or mocks. And as a team we’ve introduced a service layer to help define the jobs the system does.
If you start trying to apply these patterns in your work, you may go through similar feelings to begin with. When the nice theory of a book meets the reality of your code base, it can be demoralizing.
My advice is to focus on a specific problem and ask yourself how you can put the relevant ideas to use, perhaps in an initially limited and imperfect fashion. You may discover, as I did, that the first problem you pick might be a bit too difficult - if so, move on to something else. Don’t try to boil the ocean, and don’t be too afraid of making mistakes. It will be a learning experience, and you can be confident that you’re moving roughly in a direction that others have found useful.
So, if you’re feeling the pain too - give these ideas a try. Don’t feel you need permission to rearchitect everything - just look for somewhere small to start. And above all, do it to solve a specific problem. If you’re successful in solving it, you know you got something right - and others will too.
- Monolith to Microservices by Sam Newman, and his original book, Building Microservices. The Strangler Fig pattern is mentioned as a favorite, among many others. These are good if you’re thinking of moving to microservices, and they’re also good on integration patterns and the considerations of async messaging-based integration.
- Clean Architectures in Python by Leonardo Giordani, which came out in 2019, is one of the few previous books on application architecture in Python.
- Do I need to do all of this at once? Can I just do a bit at a time?
-
No, you can absolutely adopt these techniques bit by bit. We recommend, if you have an existing system, building a service layer to try and keep orchestration in one place. Once you have that, it’s much easier to push logic into the model and edge concerns like validation, or error handling, to the entry points.
It’s worth having a service layer even if you still have a big messy Django ORM, because it’s a way to start understanding the boundaries of operations.
- Extracting use cases will break a lot of my existing code; it’s too tangled.
-
Just copy-paste. It’s okay to cause more duplication in the short-term. Think of this as a multi-step process. Your code is in a bad state now, so copy and paste it to a new place, and then make that new code clean and tidy.
Once you’ve done that, you can replace uses of the old code with calls to your new code, and finally delete the mess. Fixing large codebases is a messy and painful process. Don’t expect things to get instantly better, and don’t worry if some bits of your application stay messy.
- Do I need to do CQRS? That sounds weird. Can’t I just use repositories for my reads?
-
Of course you can! The techniques we’re presenting in this book are intended to make your life easier, they’re not some kind of ascetic discipline with which to punish yourself.
In our first case-study system, we had a lot of "View Builder" objects that used repositories to fetch data and then performed some transformations to return dumb read models. The advantage is that when you hit a performance problem, it’s easy to rewrite a view builder to use custom queries or raw SQL.
- How should use cases interact across a larger system? Is it a problem for one to call another?
-
This might be an interim step. Again, in the first case-study, we had handlers that would need to invoke other handlers. This gets really messy, though, and it’s much better to move to using a message bus to separate these concerns.
Generally, your system will have a single message bus implementation and a bunch of different subdomains that center on a particular aggregate or set of aggregates. When your use case has finished, it can raise an event, and a handler elsewhere can run.
- Is it a smell for a use case to use multiple repositories, and why?
-
Yes! A repository is a pattern that we use for reading aggregates from our persistent store. By definition, we should only ever be updating one aggregate at a time. If you need to read data to figure out what to do, then consider a read-model that returns just the data you need, even if you cheat and build it with a repo and domain objects under the hood.
- What if I have a read-only but business-logic heavy system?
-
View models can have complex logic in them. In this book we’ve encouraged you to separate your read and write models because they have different consistency and throughput requirements. Mostly, we can use simpler logic for reads, but that’s not always true. In particular, permissions and authorization models can add a lot of complexity to our read-side.
We’ve written systems where the view models needed extensive unit tests. In those systems, we split a view builder from a view fetcher.
TODO: Diagram
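Here’s a hypothetical example of that split: the fetcher does all of the IO, while the builder is a pure function over plain dicts, so its unit tests never need a database.

def fetch_document_rows(session, workspace_id):
    # the "view fetcher": all of the IO lives here, returning plain dicts
    return session.execute(...)  # query elided


def build_document_view(rows, current_user):
    # the "view builder": pure logic, unit tested with hand-written dicts
    return [
        {"title": row["title"], "can_edit": current_user in row["editors"]}
        for row in rows
        if current_user in row["viewers"]
    ]


def test_hides_documents_the_user_cannot_see():
    rows = [
        {"title": "secret", "viewers": ["alice"], "editors": []},
        {"title": "shared", "viewers": ["alice", "bob"], "editors": ["bob"]},
    ]
    assert build_document_view(rows, "bob") == [{"title": "shared", "can_edit": True}]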
This makes it easy to test the view builder by giving it mocked data, e.g., a list of dicts.
"Fancy CQRS" with event handlers is really a way of running our complex view logic whenever we write, so that we can avoid running it when we read.
- Do I need to build microservices to do this stuff?
-
Egads, no! These techniques pre-date microservices by a decade or so. Aggregates, domain events, and dependency inversion are ways to control complexity in large systems. It so happens that when you’ve built a set of use cases and a model for a business process, it’s relatively easy to move it to its own service, but that’s not a requirement.
- I’m using Django. Can I still do this?
-
We have an entire appendix, just for you! [appendix_django]
- How can I convince other engineers to try this stuff?
-
tbc
Okay, so we’ve given you a whole bunch of new toys to play with; here’s the fine print. Harry and Bob do not recommend that you copy-paste our code into a production system and rebuild your automated trading platform on Redis pub/sub. For reasons of brevity and simplicity, we’ve hand-waved a lot of tricky subjects. Here’s a list of things we think you should know before trying this for real.
Redis pub/sub is not reliable and shouldn’t be used as a general-purpose messaging tool. We picked it because it’s familiar and easy to run. At MADE we run EventStore as our messaging tool, but we’ve had experience with RabbitMQ and AWS EventBridge.
Tyler Treat has some excellent blog posts on his site bravenewgeek.com; you should read at least:
- https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
- https://bravenewgeek.com/what-you-want-is-what-you-dont-understanding-trade-offs-in-distributed-messaging/
In [chapter_08_events_and_message_bus] we update our process so that deallocating an order line and reallocating the line happen in two separate units of work. You will need monitoring to know when these transactions fail, and tooling to replay events. Some of this is made easier by using a transaction log as your message broker (e.g., Kafka or EventStore). You might also look at the Outbox pattern:
- https://microservices.io/patterns/data/transactional-outbox.html
We haven’t given any real thought to what happens when handlers are retried. In practice you will want to make handlers idempotent so that calling them repeatedly with the same message will not make repeated changes to state. This is a key technique for building reliability, because it enables us to safely retry events when they fail.
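One common approach, sketched here with an invented processed-messages store, is to deduplicate on a message ID in the same transaction as the work itself:

def handle(message, uow):
    # do_the_real_work stands in for whatever the handler actually does
    with uow:
        if uow.processed_messages.already_seen(message.id):
            return                                   # a retry becomes a harmless no-op
        do_the_real_work(message, uow)
        uow.processed_messages.record(message.id)    # recorded atomically with the changes
        uow.commit()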
There’s a lot of good material on idempotent message handling; try starting here:
- https://blog.sapiensworks.com/post/2015/08/26/How-To-Ensure-Idempotency
- https://lostechies.com/jimmybogard/2013/06/03/un-reliability-in-messaging-idempotency-and-de-duplication/
You’ll need to find some way of documenting your events and sharing schemas with consumers. We like using JSON Schema and Markdown because they’re simple, but there is other prior art.
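For example, here’s what a schema for a hypothetical stock_adjusted event might look like, written as a Python dict so it can be checked with the jsonschema library:

from jsonschema import validate  # pip install jsonschema

STOCK_ADJUSTED_V1 = {
    "type": "object",
    "properties": {
        "sku": {"type": "string"},
        "qty": {"type": "integer"},
        "occurred_at": {"type": "string", "format": "date-time"},
    },
    "required": ["sku", "qty", "occurred_at"],
    "additionalProperties": False,
}

validate(
    instance={"sku": "RETRO-CLOCK", "qty": 5, "occurred_at": "2024-01-01T00:00:00Z"},
    schema=STOCK_ADJUSTED_V1,
)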
Greg Young wrote an entire book on managing event-driven systems over time: https://www.goodreads.com/book/show/34327067-versioning-in-an-event-sourced-system
Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf is a pretty good start for messaging patterns.