I was working with a customer who saw one of their devices in Scrutinizer repeatedly flipping between inactive and active. After ruling out the database as the cause, we started analyzing the packets coming into the Scrutinizer server, and what we saw was puzzling. Typically, a device in Scrutinizer that goes up and down indicates that we’re not getting NetFlow from that device during those time frames, but in this case flows were reaching the server while Scrutinizer showed the device as inactive.

What now?

It was time to do some digging. The customer was exporting NetFlow from a Cisco ASA at their main site and NetFlow v5 from Cisco routers at their external sites. The external sites sent NetFlow over an encrypted IPSec tunnel back to the Scrutinizer server, which was located at the main site. Using a trusty packet capture tool, Wireshark, we started inspecting the NetFlow packets coming into the Scrutinizer server. This is where we found that even though the NetFlow packets were making it to the Scrutinizer server, their checksums were incorrect, which caused them to be discarded before they could make it up the IP stack to Scrutinizer. Below is what a bad checksum looks like in Wireshark.
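
For readers who want to see what the receiving stack is checking, here is a minimal Python sketch of the standard Internet checksum (RFC 1071) applied to a UDP datagram over IPv4. It assumes the failing checksum was the UDP one, which the capture described above does not spell out. The helper names, the addresses, and port 2055 are illustrative choices, not values from the customer's network.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: ones' complement of the ones' complement sum."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip: bytes, dst_ip: bytes, udp_length: int) -> bytes:
    """IPv4 pseudo header that UDP includes in its checksum (protocol 17)."""
    return src_ip + dst_ip + struct.pack("!BBH", 0, 17, udp_length)


def build_udp_segment(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    """Hypothetical helper: build a UDP header plus payload with a valid checksum."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = internet_checksum(pseudo_header(src_ip, dst_ip, length) + header + payload)
    csum = csum or 0xFFFF  # a computed 0 is transmitted as 0xFFFF
    return struct.pack("!HHHH", src_port, dst_port, length, csum) + payload


def udp_checksum_ok(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Return True if the segment would pass the receiver's checksum test."""
    (stored,) = struct.unpack("!H", segment[6:8])
    if stored == 0:
        return True  # over IPv4, 0 means the sender did not compute a checksum
    # With the stored checksum included, a valid segment sums to all ones,
    # so the complemented result is 0.
    return internet_checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment) == 0


if __name__ == "__main__":
    src, dst = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
    good = build_udp_segment(src, dst, 2055, 2055, b"netflow")
    damaged = good[:-1] + bytes([good[-1] ^ 0x01])  # alter one bit of the payload
    print(udp_checksum_ok(src, dst, good), udp_checksum_ok(src, dst, damaged))
```

A datagram that fails this test is dropped by the operating system before any application sees it, which is consistent with packets showing up in a capture while the collector behaves as if nothing arrived.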

What causes bad checksums?

Bad checksums can have a number of causes, but in our case the culprit was the IPSec tunnel and packet fragmentation. Every network path has a maximum transmission unit (MTU) that a packet cannot exceed, and the MTU can change depending on how the traffic is being sent.

In our case, the MTU of the internal network was different from the MTU of the IPSec tunnel, so the packets had to be fragmented into multiple parts to cross the network. When the fragments reached the other side of the IPSec tunnel and were reassembled by the customer's Cisco ASA, the result no longer matched the original packet, which caused the checksum to be incorrect and the packet to be discarded.
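
To make the MTU mismatch concrete, the arithmetic can be sketched in Python. The NetFlow v5 sizes are the standard ones (24-byte header, 48-byte flow records, at most 30 records per datagram). The 60 bytes of tunnel overhead is an assumed figure for illustration only; the real overhead depends on the IPSec mode and ciphers, and the customer's actual MTU values are not known.

```python
NETFLOW_V5_HEADER = 24   # bytes
NETFLOW_V5_RECORD = 48   # bytes per flow record
UDP_HEADER = 8
IPV4_HEADER = 20         # assumes no IP options


def netflow_v5_ip_packet_size(records: int) -> int:
    """Size of the IPv4 packet carrying one NetFlow v5 export datagram."""
    return IPV4_HEADER + UDP_HEADER + NETFLOW_V5_HEADER + records * NETFLOW_V5_RECORD


def fragment_count(ip_packet_size: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to carry the packet over a link with this MTU."""
    payload = ip_packet_size - IPV4_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes of data.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return -(-payload // per_fragment)  # ceiling division


if __name__ == "__main__":
    ASSUMED_TUNNEL_OVERHEAD = 60  # hypothetical value, not measured
    size = netflow_v5_ip_packet_size(30)  # a full export datagram
    print(size, fragment_count(size, 1500), fragment_count(size, 1500 - ASSUMED_TUNNEL_OVERHEAD))
```

Under these assumptions a full NetFlow v5 export is a 1492-byte IP packet: it fits a 1500-byte Ethernet MTU in one piece, but once the tunnel overhead is subtracted it no longer fits and has to be split in two.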

How is it fixed?

In our case, the problem was resolved by turning on Pre-fragmentation for IPSec VPNs on their Cisco ASAs. With this setting, the ASA determines the lowest MTU on the path to the destination and fragments the packet before it’s encrypted, rather than fragmenting the encrypted packet. Once this change was made, we no longer saw bad checksums in Wireshark and the device in Scrutinizer no longer showed as inactive. In the end, the device flipping between inactive and active in Scrutinizer was a direct reflection of when packets were arriving with good and bad checksums.

Paul Dube

Paul Dube is the Director of Technical Services at Plixer. He has a passion for enabling individuals and organizations to use highly complex systems to solve business and personal objectives. This passion for problem solving has Paul working with some of the largest enterprises to solve their security and networking challenges and also educating his young daughters on how to enrich their lives with technology. When he's not working, you will find him enjoying time with his family, cooking something delicious on the Big Green Egg, and enjoying the best brews that the locals have to offer.
