Network Error Logs: Invisible Error Collection

Every software developer is familiar with logs. They serve as the backbone for debugging and understanding an application's behaviour. Yet not many are aware of Network Error Logs (NEL), which provide visibility into hard-to-see last-mile issues and are invaluable for investigating network-related problems that often fly under the radar. In this post, we'll explore what NEL logs are and why they provide important insights into areas that are typically overlooked.

Understanding NEL

Network Error Logging (NEL) is a browser specification designed to shed light on typically invisible network errors on the web. The idea is simple: the first time a user visits a website, the server sends a response header instructing the browser to report network errors to a specified URL. Then, one day, the user might not be able to open the website for some reason and is instead presented with the error view that we've all seen before. Because the browser remembers the earlier instruction, it can still report that failure to the specified URL.
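As a rough illustration, the instruction consists of two response headers: Report-To, which names a group of reporting endpoints, and NEL, which points at that group. The sketch below builds both values. The collector URL and the group name are placeholders chosen for this example, not values from any particular service.

```python
import json


def build_nel_headers(report_url: str, max_age_seconds: int = 30 * 24 * 3600) -> dict[str, str]:
    """Build the Report-To and NEL response header values for one endpoint."""
    group = "network-errors"  # hypothetical name; both headers just have to agree on it
    report_to = {
        "group": group,
        "max_age": max_age_seconds,  # how long the browser remembers the endpoint
        "endpoints": [{"url": report_url}],
    }
    nel = {
        "report_to": group,
        "max_age": max_age_seconds,  # how long the browser keeps the NEL policy
        "failure_fraction": 1.0,  # report every failed request
        "success_fraction": 0.0,  # do not report successful requests
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


# Placeholder collector URL, assumed to live outside the main site's infrastructure.
headers = build_nel_headers("https://nel-collector.example/reports")
for name, value in headers.items():
    print(f"{name}: {value}")
```

The two max_age values are in seconds. Setting them to 30 days here is an arbitrary choice for the sketch.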

image

While network error logs can capture 4xx and 5xx errors, the real value comes from capturing network errors like TCP resets, DNS errors, and other issues that indicate users never even reached the server.
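To make the distinction concrete, each NEL report carries a phase field (dns, connection or application) alongside the error type. A minimal sketch of separating the failures that happened before the server was reached could look like this. The sample report is made up for illustration and the helper name is hypothetical.

```python
def failed_before_reaching_server(report: dict) -> bool:
    """Return True when the request failed during DNS resolution or connection setup."""
    phase = report.get("body", {}).get("phase")
    # The "application" phase covers failures after a connection was established,
    # which includes HTTP 4xx and 5xx responses.
    return phase in ("dns", "connection")


# Illustrative report, shaped like the payload a browser would POST.
sample_report = {
    "age": 0,
    "type": "network-error",
    "url": "https://www.example.com/",
    "body": {
        "sampling_fraction": 1.0,
        "server_ip": "",
        "protocol": "",
        "method": "GET",
        "status_code": 0,
        "elapsed_time": 143,
        "phase": "dns",
        "type": "dns.name_not_resolved",
    },
}

print(failed_before_reaching_server(sample_report))
```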

Debugging these issues without NEL logs is usually an engineering nightmare because it involves a lot of back-and-forth between engineers and users to understand and fix the root cause. Very often, network issues are specific to a particular region, and can affect a single town due to ISP (mis)configurations.

Help! My website is down, but only in Kentucky

Network-related outages are often geographically localised, so the number of users affected by them usually isn’t huge.

However, network failures are usually catastrophic, meaning your website becomes completely unavailable. What doesn't help is that the "support" pages are often hosted on the same infrastructure, causing them to be unavailable as well. Not only can your users not access the website, but they can’t even get help! So, these failures often result in a large number of negative reviews and messages on social media. Such failures erode trust quite quickly, and users can often just switch to a competitor.

There are a few common reasons these failures occur.

CDN is down in city X

Modern CDNs have servers in a large number of data centers around the world, e.g., have a look at Cloudflare – 300+ locations and counting. While this means that the content is delivered to users very quickly, that’s a lot of locations to take care of, and it’s not unusual for a CDN to temporarily have an outage in a specific region. However, you as a consumer of their product often have very limited visibility into these failures because you don’t control their servers.

Synthetic tests are commonly used to catch those failures. However, the number of networks and ISPs is so large that it’s impossible to catch these failures unless they affect large regions. It doesn't help that most synthetic providers use data centres which often have much better connectivity to CDN nodes compared to residential networks.

When such errors occur, you can often see a spike in NEL errors such as tcp.reset. However, depending on the type of failure, the error type and the symptoms can vary. Reverse-proxy CDN errors are the main reason to actually record 5xx errors, as you can compare the number of errors your users are experiencing with the number of errors your website metrics are recording to see any deviations.
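The comparison described above can be sketched in a few lines. This assumes NEL reports are collected unsampled for the same time window as the origin metrics. The function name and the idea of passing the origin count as a plain integer are illustrative only.

```python
def unrecorded_5xx(nel_reports: list[dict], origin_5xx_count: int) -> int:
    """Estimate how many 5xx responses users saw that the origin never logged.

    Assumes the reports and the origin count cover the same time window and
    that reports are not sampled (sampling_fraction of 1.0).
    """
    seen_by_users = 0
    for report in nel_reports:
        status = report.get("body", {}).get("status_code", 0)
        if 500 <= status <= 599:
            seen_by_users += 1
    # A positive difference hints at errors generated between the user and the
    # origin, for example by a reverse-proxy CDN node.
    return max(0, seen_by_users - origin_5xx_count)


reports = [
    {"body": {"status_code": 502, "type": "http.error"}},
    {"body": {"status_code": 503, "type": "http.error"}},
    {"body": {"status_code": 502, "type": "http.error"}},
    {"body": {"status_code": 0, "type": "tcp.reset"}},
]
print(unrecorded_5xx(reports, origin_5xx_count=1))
```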

ISP blocked my website

Sometimes ISPs block a website at the DNS level: their resolvers, which are the default DNS servers for the majority of their users, point the domain to a "sinkhole" address instead of the real one.

Some Internet Service Providers (ISPs) might restrict access to certain websites based on false positive security reports. Sometimes, ISPs might block entire TLDs like the .so ccTLD managed by Somalia, causing all websites with that TLD to be inaccessible.

Websites hosting user-generated content are especially prone to these issues due to abuse reports. One user might upload some malicious content for use in a phishing campaign, and after a few reports, an ISP might just block the whole domain because they aren't able to block individual URLs. You can often see a similar effect when

When this happens, your users will either see the ISP’s message that the website is malicious, which erodes trust, or, often, your users will see an SSL error due to a certificate mismatch, especially if HSTS is enabled on your website.

SSL error
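One way to spot this pattern, sketched here under the assumption that the reports you receive include the server_ip field, is to count reports whose server IP falls outside the address ranges you expect your site to be served from. The ranges below are documentation-only example networks, and the helper name is hypothetical.

```python
import ipaddress


def unexpected_server_ips(reports: list[dict], expected_networks: list[str]) -> dict[str, int]:
    """Count reports per server IP that is not inside any expected network."""
    expected = [ipaddress.ip_network(net) for net in expected_networks]
    counts: dict[str, int] = {}
    for report in reports:
        raw_ip = report.get("body", {}).get("server_ip", "")
        if not raw_ip:
            continue  # no IP available, e.g. the name did not resolve at all
        try:
            ip = ipaddress.ip_address(raw_ip)
        except ValueError:
            continue  # ignore malformed values
        if not any(ip in net for net in expected):
            counts[raw_ip] = counts.get(raw_ip, 0) + 1
    return counts


# Example ranges only; in practice these would come from your hosting or CDN provider.
known_ranges = ["203.0.113.0/24", "2001:db8::/32"]
reports = [
    {"body": {"server_ip": "203.0.113.7", "type": "tcp.timed_out"}},
    {"body": {"server_ip": "198.51.100.9", "type": "tls.cert.name_invalid"}},
    {"body": {"server_ip": "", "type": "dns.name_not_resolved"}},
]
print(unexpected_server_ips(reports, known_ranges))
```

A cluster of certificate errors from one unfamiliar address is only a hint, not proof of ISP blocking, so it is worth confirming with users on the affected network.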

Network congestion

Sometimes, however, network failures don’t make the website unavailable but rather make it very slow, especially at certain times of the day. Users might wait 10 seconds for your website to load, and then they’ll just give up and complain about how slow your website is.

NEL logs help identify these issues: a high number of reports with the abandoned error type suggests that users are giving up on your site because of long wait times.
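A minimal sketch of looking for the time-of-day pattern: group abandoned reports by the hour in which the failure happened. It assumes the collector stores the time each report was received as Unix seconds, and it uses the report's age field (in milliseconds) to shift back to when the failure occurred. The function name and input shape are illustrative.

```python
from collections import Counter
from datetime import datetime, timezone


def abandoned_by_hour(received: list[tuple[float, dict]]) -> Counter:
    """Count 'abandoned' reports per UTC hour of day.

    Each item is (received_at_unix_seconds, report).
    """
    counts: Counter = Counter()
    for received_at, report in received:
        if report.get("body", {}).get("type") != "abandoned":
            continue
        # "age" is the delay in milliseconds between the failure and the report delivery.
        happened_at = received_at - report.get("age", 0) / 1000
        hour = datetime.fromtimestamp(happened_at, tz=timezone.utc).hour
        counts[hour] += 1
    return counts


evening = 20 * 3600  # 20:00 UTC on 1 Jan 1970, used only as sample data
sample = [
    (evening, {"age": 0, "body": {"type": "abandoned"}}),
    (evening + 60, {"age": 0, "body": {"type": "abandoned"}}),
    (evening + 120, {"age": 0, "body": {"type": "tcp.reset"}}),
    (3 * 3600, {"age": 0, "body": {"type": "abandoned"}}),
]
print(abandoned_by_hour(sample))
```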

How do you start collecting these logs?

You can set up your own NEL collection infrastructure, but it's crucial to host it separately from your main website's infrastructure to avoid losing reports during network failures. At CiThru, we've built a NEL log collection system that makes it very easy to start collecting NEL reports. All you need to do is sign up for free. You'll then need to add a special header. This can be done through your hosting service or directly within the application framework your website uses. After adding the header, CiThru will start collecting network errors, and you can view the analytics on your dashboard.

CiThru
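For readers curious about the self-hosted route mentioned above, here is a very small sketch of a collector using only the Python standard library. It is not how CiThru works. It only shows the basic shape: browsers POST a JSON array of reports, and the endpoint stores the ones of type network-error. The handler name, the port, and printing instead of storing are all placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_report_batch(raw: bytes) -> list[dict]:
    """Decode a POSTed batch and keep only network-error reports."""
    try:
        payload = json.loads(raw)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return []
    if not isinstance(payload, list):
        return []
    return [r for r in payload if isinstance(r, dict) and r.get("type") == "network-error"]


class NelCollector(BaseHTTPRequestHandler):
    """Hypothetical handler that accepts report batches on any path."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        for report in parse_report_batch(self.rfile.read(length)):
            print(json.dumps(report))  # stand-in for writing to real storage
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the collector until interrupted (not called automatically)."""
    HTTPServer(("", port), NelCollector).serve_forever()
```

A real deployment would also need HTTPS and, most likely, handling of cross-origin preflight requests, and it should run on infrastructure that does not share failure modes with the main site.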

It's important to note that NEL logs are not a replacement for traditional monitoring. Instead, they add another layer to your existing monitoring infrastructure by providing insights into the last mile of your users' network requests.

Similarly to traditional monitoring, having historical data prior to discovering an issue is critical for successful debugging. There is always a level of background noise, and it's the change relative to that baseline that's important, not the absolute number of errors. So even a small, heavily sampled amount of data is much more helpful than no data at all. However, without a sufficiently large volume of historical data, it might be hard to filter events by various parameters, which is what makes the root cause easier to understand.
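A small sketch of working with sampled data against a historical baseline: each report includes the sampling_fraction it was collected with, so one report stands for roughly 1 / sampling_fraction failures. The function names and the use of a simple average as the baseline are assumptions made for this example.

```python
def estimated_errors(reports: list[dict]) -> float:
    """Estimate the real number of failures behind a set of sampled reports."""
    total = 0.0
    for report in reports:
        fraction = report.get("body", {}).get("sampling_fraction") or 1.0
        total += 1.0 / fraction  # one report represents 1 / fraction failures
    return total


def ratio_to_baseline(current: float, historical_counts: list[float]) -> float:
    """Compare the current window with the average of earlier, comparable windows."""
    baseline = sum(historical_counts) / len(historical_counts)
    if baseline == 0:
        return float("inf") if current > 0 else 1.0
    return current / baseline


# Five reports collected at a 10% sampling rate stand for about fifty failures.
window = [{"body": {"sampling_fraction": 0.1, "type": "tcp.reset"}} for _ in range(5)]
current = estimated_errors(window)
print(current, ratio_to_baseline(current, [10.0, 10.0, 10.0]))
```

A ratio well above 1 for a specific region, ISP, or error type is the kind of signal worth investigating, which is why the baseline needs enough history to be split along those dimensions.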

Conclusion

Network Error Logs are an exceptional tool for uncovering hidden network-related issues that could affect your site's accessibility and user experience. By integrating NEL logs into your monitoring strategy, you gain a more comprehensive view of your site's health and are better equipped to respond to issues proactively. Be aware of what they are and what they're not, and use them to their fullest potential to keep your site reliable and your users happy.

Written by Sergey Tselovalnikov

Published on 19 Apr 2024
