8 fallacies of distributed computing:
1. The network is reliable

Mateusz Wilczyński, CTO

Intro

Have you ever had problems with your internet connection? It depends on the region you live in, the type of connection you have, and how long you have been using the internet, but in most cases the answer is yes. Not very often, but it happens from time to time. Since we know that our own internet connection is not reliable, why do we assume that the network is reliable when designing distributed systems and microservices, or even when making a simple call from a web application to another system? Probably because the solutions used in professional data centres are much better. But the fact is that even they can fail. Not often, but believe me, they will fail at some point, probably at the worst possible moment.

Problem

Making network calls is very common these days. A distributed system contains many of them, for example when using the microservices approach, but due to the growing popularity of API products and other IT trends, even a simple monolithic application needs to make a lot of calls to external services. Realizing exactly how often and when an application makes a network request is not as trivial as it may seem. This is an issue especially for less experienced programmers, but in an application with bad system design and lots of hacks, it can trip up even senior developers. I have personally seen a system containing an object method that used to be a plain getter but was changed into an API call because of an ugly hack. That change was made under the wrong assumption that the network is always reliable; the decision hit the development team many times, and reverting it was not easy.

Hidden network calls can also lead to distributed transactions without anyone realizing it, and that is an issue even when the network call succeeds. Distributed transactions are too big a topic for this article; the simplest advice is to avoid them as much as possible, often by redesigning the system and its module boundaries.

The assumption that the network is reliable may also come from the way we execute and test our applications before pushing them to production. We often do it in a local environment or in a staging/test/QA environment that is much simpler than real production. In such environments, network issues don't exist at all or occur very rarely, so it is very easy to overlook them.
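To make the anti-pattern concrete, here is a hypothetical sketch of a "getter" that secretly became a network call; the class name, field, and endpoint URL are invented for illustration, not taken from the system described above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Customer {

    private final String id;

    public Customer(String id) {
        this.id = id;
    }

    // To every caller this still looks like a cheap in-memory getter,
    // but it now performs a synchronous HTTP request that can be slow,
    // time out, or fail outright when the network misbehaves.
    public String getCreditLimit() throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://limits.example.com/customers/" + id))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Every piece of code that innocently calls `getCreditLimit()` inside a database transaction has now become part of a distributed transaction, which is exactly how such transactions appear without anyone noticing.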

Solution

Our job as Software Engineers and Architects is to build reliable systems and applications on top of an unreliable network. It is not an easy task, but it is achievable. The simplest solution is an automatic retry mechanism. Another is a circuit breaker library such as Hystrix. Generally, everything is much easier if your system is designed around asynchronous communication: in that case you can use queuing systems, which are good at handling errors and retrying automatically. Unfortunately, switching to async communication in a system that was not designed for it is not an easy task. In places where you only use the network to fetch data, you can use a fallback mechanism, which means returning a default value, or more often a cached value, when the network request fails. A sketch of the retry-plus-fallback idea follows.
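This is a minimal sketch in plain Java, assuming a remote call wrapped in a `Supplier`; the class name, attempt count, and backoff values are invented for illustration, and a real system would more likely rely on a library such as Hystrix or on a message queue rather than hand-rolled code.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class ResilientClient {

    // Last successfully fetched value, used as a fallback when the network fails.
    private final AtomicReference<String> cached = new AtomicReference<>("{}");

    public String fetchWithRetryOrFallback(Supplier<String> remoteCall) {
        int maxAttempts = 3;
        long backoffMillis = 200;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String result = remoteCall.get();  // the actual network call
                cached.set(result);                // refresh the fallback value
                return result;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    break;                         // give up and fall back below
                }
                try {
                    Thread.sleep(backoffMillis);   // wait before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
                backoffMillis *= 2;                // simple exponential backoff
            }
        }
        // Fallback: return the last known good (cached) value instead of failing.
        return cached.get();
    }
}
```

Note that blind retries are only safe for idempotent operations such as reads; for anything that changes state, a retry needs deduplication on the other side, which is one more reason queues and circuit breakers earn their keep.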

Conclusion

You may say that these days the network in professional data centres is good enough to treat as a reliable asset. That may be an acceptable assumption in simple systems, but if you are responsible for a fintech solution that processes a lot of money every day, it is not. In fintech applications, losing even a single transaction is a big issue from the business, financial, and compliance perspectives. This doesn't mean you should rewrite the entire system right away; it is always about managing the risk. In particular, you need to protect the most critical areas, but even for the less important parts of the system, you should consider what will happen if a network call fails.