
How We Troubleshot a Network Performance Problem in Minutes

jake

There comes a time in every network engineer's career when they have to troubleshoot slow network performance. Wouldn't it be nice if you had a fast, reliable tool to make this task easier? In this blog, I want to cover a use case we recently saw at our main HQ and how we used our NetFlow and metadata collector to troubleshoot the issue in a matter of minutes rather than hours.

What is slow network performance?

Since every network is different, there are no hard-and-fast numbers on exactly what counts as slow, but I think we all know it when we see it. In this case, our IT team received complaints of jitter on phone calls, and we also had a few alarms on our VoIP dashboard. Upon first investigation, we found that link utilization didn't seem to be the problem.

[Figure: Network performance troubleshooting - link utilization report]

As you can see, there is a slight spike in bandwidth, but this is hardly out of the ordinary.
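If you want to reproduce this first check by hand, here is a minimal Python sketch (not our collector's actual code) that estimates link utilization from two interface octet-counter samples; the sample values, polling interval, and link speed below are illustrative assumptions.

```python
# Rough link-utilization estimate from two interface octet-counter samples
# (e.g. SNMP ifHCOutOctets). The sample values below are placeholders.

def utilization_pct(octets_t0: int, octets_t1: int, interval_s: float, link_bps: float) -> float:
    """Percent utilization over one polling interval (ignores counter wrap)."""
    bits_sent = (octets_t1 - octets_t0) * 8
    return 100.0 * bits_sent / (interval_s * link_bps)

if __name__ == "__main__":
    # Two hypothetical readings taken 60 seconds apart on a 1 Gbps uplink
    print(f"{utilization_pct(1_200_000_000, 1_950_000_000, 60, 1e9):.1f}% utilized")
```

If the number that comes back is nowhere near line rate, as in our case, raw bandwidth is probably not the culprit and it is time to look elsewhere.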

Troubleshooting with IPFIX:

Since we ruled out a bandwidth issue, what can we do now? When this happens, I usually pivot to one of my favorite reports: Source > Host Flows. Instead of trending bits/s, it trends the number of connections per host, which also makes it a great way to see if any devices might be flooding the network with connections.
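To make the idea concrete, here is a small, hypothetical Python sketch of what a host-flows style aggregation looks like: count flow records per source address per second, then rank the peaks. The record format is an assumption for illustration, not our collector's export schema.

```python
from collections import Counter, defaultdict

# Hypothetical flow records: (epoch_second, source_ip, dest_ip, dest_port)
flows = [
    (1700000000, "10.1.1.50", "8.8.8.8", 443),
    (1700000000, "10.1.1.50", "1.1.1.1", 80),
    (1700000000, "10.1.1.99", "10.1.2.10", 5060),
    (1700000001, "10.1.1.50", "9.9.9.9", 123),
]

# flows/s per source host: second -> host -> count
per_second = defaultdict(Counter)
for ts, src, _dst, _dport in flows:
    per_second[ts][src] += 1

# Peak flows/s seen for each host across all sampled seconds
peaks = Counter()
for counts in per_second.values():
    for host, n in counts.items():
        peaks[host] = max(peaks[host], n)

for host, peak in peaks.most_common():
    print(f"{host}: peak {peak} flows/s")
```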

[Figure: Root cause analysis - Source > Host Flows report]

Very quickly, we see this machine is peaking at over 2,300 flows/s! Compare that against the second host in the list, which peaks at around 170 flows/s. With our NetFlow and metadata collector, we can also baseline this over the course of a week to see if the behavior is normal (in our case, it wasn't).
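A simple way to answer the "is this normal?" question is to compare the current peak against a baseline built from the previous week's samples, for example the mean plus a few standard deviations. Here is a hedged sketch of that check; the sample data and the three-sigma threshold are assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical flows/s samples for one host over the previous week
weekly_samples = [140, 155, 160, 170, 150, 165, 158]
current_peak = 2300

baseline = mean(weekly_samples)
threshold = baseline + 3 * stdev(weekly_samples)  # three-sigma rule of thumb

if current_peak > threshold:
    print(f"Anomaly: {current_peak} flows/s vs. baseline ~{baseline:.0f} "
          f"(threshold {threshold:.0f})")
else:
    print("Within normal range")
```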

Root cause analysis with NetFlow:

Upon investigation, we found that this host was an old development resource running Ubuntu Server. That alone helped us stop the issue, since we could shut down the system to cease all communications. This fixed the immediate problem, but it still didn't explain why it happened in the first place, so let's continue the investigation using some other complementary tools at our disposal.

Using a third-party integration we have set up, we quickly pivoted to our Endace packet capture appliance to look at the raw packets. It became apparent that this server was retransmitting a large number of TCP packets, all aimed at an external cloud server that never responded, so the host just kept retrying.
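If you don't have a packet capture appliance handy, you can approximate the same check offline. The sketch below uses Scapy (an assumption on my part; our workflow used the Endace integration) to flag TCP segments whose sequence number repeats within the same flow, a rough proxy for retransmissions. The pcap filename is a placeholder.

```python
from collections import defaultdict
from scapy.all import rdpcap, TCP, IP  # pip install scapy

def count_retransmissions(pcap_path: str) -> dict:
    """Very rough retransmission count: repeated (seq, payload length) per flow."""
    seen = defaultdict(set)
    retrans = defaultdict(int)
    for pkt in rdpcap(pcap_path):
        if IP in pkt and TCP in pkt:
            flow = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
            key = (pkt[TCP].seq, len(pkt[TCP].payload))
            if key in seen[flow]:
                retrans[flow] += 1
            else:
                seen[flow].add(key)
    return dict(retrans)

if __name__ == "__main__":
    # "capture.pcap" is a placeholder path for your own capture file
    for flow, count in count_retransmissions("capture.pcap").items():
        if count:
            print(flow, "->", count, "suspected retransmissions")
```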

[Figure: NetFlow integrations - packet capture view from the Endace appliance]

Now that we knew the host could not communicate with that external cloud server, we could update the server's repository list to make sure it only reached out to known-good servers and fix the issue for good. In our case, this server was not in use and could simply be decommissioned, but that might not always be an option.
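One way to verify that kind of cleanup is to confirm that every repository the server is configured to use actually answers. This is a hypothetical sketch that parses /etc/apt/sources.list and attempts a TCP connection to each repository host; the file path and port defaults are assumptions, and it intentionally ignores the newer deb822 sources format for brevity.

```python
import re
import socket
from urllib.parse import urlparse

def repo_hosts(sources_path: str = "/etc/apt/sources.list"):
    """Yield (host, port) for each repository URL in an apt sources file."""
    with open(sources_path) as fh:
        for line in fh:
            line = line.strip()
            if not line.startswith(("deb ", "deb-src ")):
                continue
            match = re.search(r"https?://\S+", line)
            if match:
                url = urlparse(match.group(0))
                yield url.hostname, url.port or (443 if url.scheme == "https" else 80)

def check_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port in repo_hosts():
        status = "OK" if check_reachable(host, port) else "UNREACHABLE"
        print(f"{host}:{port} {status}")
```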

Network anomaly detection:

Now that we have tracked down the root cause, we can set up proactive alarms to warn us if it ever happens again (a minimal sketch of one such check closes out this post). I hope this blog showed you the benefits of collecting NetFlow, IPFIX, and other forms of metadata. If you need help tracking down any issues on your network, feel free to reach out to our team!
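As a parting example, here is a minimal, hypothetical sketch of the kind of proactive check you could schedule: poll the peak flows/s for each host and raise an alert when any host crosses a threshold. The get_peak_flows_per_host function is a stand-in for whatever your collector's reporting API exposes, and the threshold should come from your own weekly baseline.

```python
# Hypothetical periodic check: alert when any host exceeds a flows/s threshold.
# get_peak_flows_per_host() is a stand-in for your collector's reporting API.

FLOWS_PER_SEC_THRESHOLD = 1000  # tune to your own weekly baseline

def get_peak_flows_per_host() -> dict[str, int]:
    """Placeholder: return {host_ip: peak flows/s over the last polling window}."""
    return {"10.1.1.50": 2300, "10.1.1.99": 170}

def check_and_alert(threshold: int = FLOWS_PER_SEC_THRESHOLD) -> None:
    """Print an alarm line for every host whose peak exceeds the threshold."""
    for host, peak in get_peak_flows_per_host().items():
        if peak > threshold:
            print(f"ALARM: {host} peaked at {peak} flows/s (threshold {threshold})")

if __name__ == "__main__":
    check_and_alert()
```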