Hourly server overload?

Hi All.

I’m new here.

I brought my first server up a week ago, and every hour, on the hour, I see what looks like “abuse” of my server. If (and that is a big if) I have set my appliance up correctly, it can handle 40K requests per second, yet it still logs “Excessive traffic on port X” at one minute past the hour, about 50% of the time. This doesn’t seem to be caused by DNS, but by abusive clients in my case?

Looking at pfTop on my firewall I see high request rates from some IPs, but I don’t know if I’m misinterpreting the data. I have also been unable to catch the hourly overload in pfTop, as I forget to look on the hour. :sleeping_face:

Example from logs:

Oct 15 14:01:13 TimeProvider alarmd: Id: 112, Index: 002, Severity: major, Alarm: set, Msg: Excessive traffic on Ethernet port 2
Oct 15 14:01:13 TimeProvider alarmd: Id: 112, Index: 003, Severity: major, Alarm: set, Msg: Excessive traffic on Ethernet port 3
Oct 15 14:01:25 TimeProvider alarmd: Id: 112, Index: 002, Severity: major, Alarm: clear, Msg: Excessive traffic on Ethernet port 2 cleared
Oct 15 14:01:25 TimeProvider alarmd: Id: 112, Index: 003, Severity: major, Alarm: clear, Msg: Excessive traffic on Ethernet port 3 cleared

As this is an appliance (Microchip TimeProvider 4100) I’m unable to get real request rates. I plan to move the NTP traffic to its own firewall port to track it on the firewall.

The TP4100, according to its tech specs, can handle 20K pps per port. I am currently load balancing between two ports. I will try to bring more ports online, but the unit uses SFP ports, which makes that a bit harder.

Thank you,
Errol

It might be that there are (lots of) people who have set up a cron job to update the clock at some specific minute each hour (which is not a good idea).

If this traffic passes through your firewall and there’s a possibility to use tcpdump on your firewall, capturing the traffic would likely give you the best information. Something like “tcpdump -w ntp.pcap udp and dst port 123 and dst host 160.119.230.39”, run at the appropriate time. You could then examine the biggest offenders from the capture file.

Useful one-liner for examining the capture file: tcpdump -nn -r ntp.pcap | cut -d" " -f3 | cut -d. -f1-4 | sort | uniq -c | sort -rn | head
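If you also want to see how the rate ramps up around the top of the hour, the same capture can be grouped by second. A sketch, assuming the ntp.pcap file from the capture above:

```
# -tt prints epoch timestamps; truncating at the decimal point and
# counting duplicates gives packets per second
tcpdump -nn -tt -r ntp.pcap | cut -d. -f1 | uniq -c
```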


Thanks for adding your server!
I am an appliance fan; while I only operate Meinberg equipment, I really like your TimeProvider 4100.

With regard to your issue, some zones have trouble with request spikes around the hour / half hour. The current understanding is that these spikes mostly come either from badly behaved appliances sending NTP requests all at the same time, or from internet clients behind CGNAT (Carrier-Grade NAT) doing the same thing.

It would help to understand what the request volume is (queries/s). Comparing this to the capabilities of your server helps in pinpointing the problem. It could also be that the router is unable to handle the traffic volume, or that your load balancer is having issues.

I would love a Meinberg, but can’t afford one, not even secondhand. I bought two TP4100 servers at my local e-waste recycler for under $50 total: one with an OCXO clock, one with an Rb atomic clock. The OCXO one worked fine; the Rb one had an issue with its Rb clock, which I was able to fix. So I had two time servers doing nothing, and I decided to join the pool. I only have one IP though, so I can only run one at a time; the other serves as a backup during maintenance.

I have 500Mb/s fiber entering the site.

My router/firewall is a Sophos XG115 running OPNsense. It seems to be running fine, but it might bog down during the peaks, though I don’t notice it on the rest of the network. It is also the one doing the load balancing. I will upgrade to an XG125 as soon as I can find VLP DDR4 RAM so I can upgrade the router from its default 4GB.

The TP4100 has 7 ports for time serving, but every port needs to be on a separate network.
TP4100 config, eth1 is management, rest are NTP ports:

  IPv4 Config

  ----------------------------------------------------------
  |Port |Address         |Subnet Mask     |Gateway         |
  |-----|----------------|----------------|----------------|
  |eth1 |192.168.1.254   |255.255.255.0   |192.168.1.1     |
  |.....|................|................|................|
  |eth2 |192.168.200.2   |255.255.255.0   |192.168.200.1   |
  |.....|................|................|................|
  |eth3 |192.168.201.2   |255.255.255.0   |192.168.201.1   |
  |.....|................|................|................|
  |eth4 |192.168.202.2   |255.255.255.0   |192.168.202.1   |
  |.....|................|................|................|
  |eth5 |192.168.203.2   |255.255.255.0   |192.168.203.1   |

So, on the router I added virtual IPs to the LAN interface, 192.168.200.1, 192.168.201.1 etc.

I added those IPs to a firewall alias, and the incoming NTP rule routes round-robin to the servers in that alias.
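For reference, in raw pf terms (which is what OPNsense builds on) such a round-robin redirect might look roughly like this. The interface macro is an assumption, and OPNsense normally generates the actual rules from the GUI, so this is only a sketch of the idea:

```
# pf.conf sketch (hypothetical interface name): spread inbound NTP
# across two of the TP4100's NTP ports, round-robin
ext_if = "igb0"
ntp_servers = "{ 192.168.200.2, 192.168.201.2 }"
rdr pass on $ext_if proto udp from any to ($ext_if) port 123 \
    -> $ntp_servers round-robin
```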

These IPs will move to a separate firewall port once I have run a separate network connection from the router to the switch at the TP4100.

The CPU usage on the router hovers around 0.6 on a 4-core CPU, so about 15% CPU usage.

Firewall state table usage is at 16%: 156,918 entries out of a max of 1,000,000.

I will try what avij suggests and see what the log file looks like.

Sounds good so far. The Sophos should be OK, I think, even though it’s not hugely powerful.
TimeProvider 4100s for $50 sounds like a good deal to me.

I owned a Symmetricom S200 some years ago, which did its job.
The key problem with these devices is that they are no longer supported by the manufacturer and don’t receive updates anymore. Even getting updates that were released some years back requires a paid support subscription. They also have common problems with GPS date rollovers.

Try to capture some data during the peak and share so we can help out.

I’ll add that it’s possible that there really isn’t a problem and your server handles the traffic peaks just fine. See if there’s a setting for configuring the warning threshold. Maybe the warning threshold simply needs to be increased.

On that note, there is some confusion in the manual.

From the specs:

High performance: 790 PTP unicast clients at 128 PPS, NTPr at 20,000 tps per port for a total of 160,000 tps per unit, and NTPd at 500 tps on 3 ports for a total of 1,500 tps per unit

I’m using NTPr, as NTPd is for the authenticated NTP service and runs at very low rates.

But the error I see in the logs has this footnote in the manual:

The excessive traffic alarm is set if the count of Ethernet packets received in one second exceeds the threshold. The threshold for the overall number of packets to the system is 13,000 packets per second. The detection level is a fixed 3,000 packets per second for the MGMT port or internal ETH9. The rate limit for any other interface which isn’t the active management port is 1,000 packets per second. All traffic received by the TimeProvider 4100 Ethernet ports is counted, such as ARP, ICMP, IGMP.

From this I take it that rate limiting is happening, but I also assume that the packet limit for an NTPr port is 20,000 instead of the usual 1,000. And, as I’m only forwarding UDP traffic on port 123 to these Ethernet ports, it can’t be any of the other traffic?

Also, my monitor status for my server shows regular, hourly dips in performance.

I did a trace at the last top of the hour and saw a small spike at the hour, but nothing significant, and my server didn’t trigger the excess traffic warning. I will continue to tcpdump every top of the hour until I get both a trace and a log message. I suspect this only happens when a specific client is using my IP.
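To avoid having to remember to look on the hour, the capture itself can be scheduled. A sketch (the interface name and output path are assumptions; note that % must be escaped in a crontab command):

```
# crontab entry: at minute 59 of every hour, capture 3 minutes of
# inbound NTP traffic into an hour-stamped file
59 * * * * timeout 180 tcpdump -i igb1 -w /var/tmp/ntp-$(date +\%H).pcap udp and dst port 123
```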

A 2016 NIST paper shows the per-second distribution of some major time servers (figure 2).


I have not been able to catch the excess traffic event. But just FYI, here is a trace from 30 minutes ago. The trace starts at 8:59, the first blip is at 9:00, and the spike is at 9:01. I don’t know if tcpdump dropped packets; I have added logging for that now.

OK, I got the excess traffic messages again, but I need to investigate further. The spike only goes to 17K pps? I will change the capture to catch anything that is going to the IPs of my server.

Just an idea, as it’s 9 o’clock and 13 o’clock… maybe office hours for many.

When computers start up and update their time?

Looks to me like it’s offices starting computers: in the morning and again after lunch.

Just an idea.

Ok, I suspect this is a non-issue.

I stress tested one port on my server with ntpperf (I wish there were something I could test the full system with, including the load balancer) and got this:

               |          responses            |        response time (ns)
rate   clients |  lost invalid   basic  xleave |    min    mean     max stddev
1000       100   0.00%   0.00% 100.00%   0.00%    +4333   +4401   +5807     64
1500       150   0.00%   0.00% 100.00%   0.00%    +4332   +4401   +5662     59
2250       225   0.00%   0.00% 100.00%   0.00%    +4333   +4403   +5836     82
3375       337   0.00%   0.00% 100.00%   0.00%    +4333   +4401   +5831     71
5062       506   0.00%   0.00% 100.00%   0.00%    +4333   +4401   +5845     67
7593       759   0.00%   0.00% 100.00%   0.00%    +4332   +4402   +5869     74
11389     1138   0.00%   0.00% 100.00%   0.00%    +4333   +4401   +5886     70
17083     1708   0.00%   0.00% 100.00%   0.00%    +4332   +4402   +7620     74
25624     2562  19.78%   0.00%  80.22%   0.00%    +4333   +4401   +5886     69

It gracefully started to rate limit at 20K pps and, more interestingly, did NOT raise the “excessive traffic” log message. Instead it raised a “Service load limit exceeded on Port2” message.

I suspect non-NTP packets are causing the warning. I looked at a tcpdump, excluded the NTP messages, and found a bunch of BOOTP, DNS, TFTP, and echo packets, all on port 123. Not enough that I would think it should trigger a 1,000 pps warning, but what do I do now?
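One way to quantify that, given that the low three bits of an NTP packet’s first byte encode the mode (3 = client): filter the capture for port-123 packets that are not ordinary client requests. A sketch, assuming the capture file is still named ntp.pcap:

```
# udp[8] is the first NTP payload byte (LI/VN/mode); mode 3 = client.
# This shows everything sent to port 123 that is NOT a normal client query.
tcpdump -nn -r ntp.pcap 'udp dst port 123 and (udp[8] & 0x7) != 3'
```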

Sorry to have wasted everyone’s time. Now, if I can just figure out what is causing my bad score graph…

I’d say the graphs and scores are fine. As of now, your overall score is 20, which is the maximum. Note that the graph combines both score plots and offset plots on the same graph. Try hovering the mouse over some individual monitors and you’ll see the green offset plots (near the middle) and the blue score dots (top).

Yes, 20 is the max, but I see dropouts: active monitors that de-rate me. I don’t like that, as it signals an unreliable connection?

But, I’ll let it run for a few weeks and see what happens…

No worries about wasting anyone’s time! :slight_smile: You wouldn’t have known that before looking at a tcpdump.

Have you discovered which IP address(es) those packets are coming from?
If most of them are coming from, say, the same network segment or provider, you could inform them about the misuse, or block and ignore them.

I don’t think it is worth the effort.

I have a list of IPs, like 41.164.163.243, which sends a lot of DNS requests (src port is 53 but dst port is 123? How does that even happen? If they were trying to use the pool as a DNS server, the dst port should still be 53?), but they don’t have an abuse contact registered in their whois database. Wonder why…

I just deprioritized the “Excessive traffic” error to an info message so I can ignore it…

I’m wondering if I can’t tell my OPNsense firewall to reject packets on port 123 that aren’t NTP.

I do see TIME and DAYTIME packets though. Is that still used?

It’s an ISP from South Africa. Do you provide time for that zone?
Their website is neotel.co.za.

  • You can try to reach them via the website
  • Block the IPs or the whole subnet
  • IIRC iptables can do that via a filter, but it takes some resources.

They are really rare. But they do use different ports (13 / 37).

Those requests coming from UDP port 53 — can you double check that they are actually DNS queries? You can use tcpdump to see which protocol they are, and in case they’re DNS queries you should see the actual DNS query inside. But I’m guessing they’re actually NTP queries.

I find it more likely that the source address of those packets is forged and the goal is to make your NTP server send NTP packets to some victim’s DNS server. On my servers I drop all traffic to my server’s UDP port 123 if the source port is either 53 (DNS) or 443 (HTTP 3). Technically it is possible to send legit queries from those source ports but I’ve chosen not to worry about that minimal possibility.
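With iptables that drop could be expressed as below. A sketch only: on OPNsense the equivalent would be two pf block rules created in the GUI, since OPNsense doesn’t use iptables:

```
# Drop likely-forged NTP queries whose source port is 53 (DNS) or
# 443 (HTTP/3); legitimate clients almost never send from these ports
iptables -A INPUT -p udp --sport 53  --dport 123 -j DROP
iptables -A INPUT -p udp --sport 443 --dport 123 -j DROP
```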


Yes, already did. This is what Wireshark says:

Those are all dst port 123 and fed to my NTP server.

That’s odd, my whois search said Afrinic/LIQUID-TOL-MNT.