
Domesticating applications, OpenBSD style

Posted Jul 23, 2015 21:59 UTC (Thu) by dlang (guest, #313)
In reply to: Domesticating applications, OpenBSD style by mathstuf
Parent article: Domesticating applications, OpenBSD style

> Maybe I'm just unfamiliar with this level of sysadmining, but what do you *do* with all these logs?

Fair question. Different things have different uses.

The archive is there to let you recreate anything else as needed and to provide an "authoritative source" in case of lawsuits. How long you keep the logs depends on your company policies, but 3-7 years are common numbers (contracts with your customers when doing SaaS may drive this).

Being able to investigate breaches, or even just fraud, is a reason for the security folks to care.

For outage investigations (root cause analysis), you want to have the logs from the systems for the timeframe of the outage (and not just the logs from the systems that were down; you want the logs from all the other systems in case there are dependencies you need to track down). For this you don't need a huge timeframe, but being able to look at the logs from a time of similar load (which may be a week/month/year ago, depending on your business) to see what's different may help.

By generating rates of logs in different categories, you can spot trends in usage, load, etc.

By categorizing the logs and storing them by category, you can notice "hey, normally these logs are this size, but they were much larger during the time we had problems", and by doing it per type in addition to per server you can easily see whether different servers are logging significantly differently when one is having problems.

Part of categorizing the logs can be normalizing them. If you parse the logs, you can identify all the 'login' messages from your different apps, extract the useful info from them, and output a message in the same format for every login, no matter what the source. This makes it much easier to spot issues and alert on problems.
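As a rough sketch of what I mean by normalizing (in Python; the regexes, field names, and sample messages here are invented for illustration, not taken from any particular product):

    import re

    # Hypothetical patterns for 'login' messages from two different sources;
    # a real deployment would have one pattern per application that can log a login.
    PATTERNS = [
        # e.g. "sshd[1234]: Accepted password for alice from 10.0.0.5 port 2222"
        re.compile(r"sshd\[\d+\]: Accepted \w+ for (?P<user>\S+) from (?P<src>\S+)"),
        # e.g. "webapp: user=bob login ok ip=192.0.2.7"
        re.compile(r"webapp: user=(?P<user>\S+) login ok ip=(?P<src>\S+)"),
    ]

    def normalize_login(line):
        """Return one source-independent record for any login message, else None."""
        for pat in PATTERNS:
            m = pat.search(line)
            if m:
                return {"event": "login", "user": m.group("user"), "src": m.group("src")}
        return None

    for raw in (
        "Jul 23 21:59:01 host1 sshd[1234]: Accepted password for alice from 10.0.0.5 port 2222",
        "Jul 23 21:59:03 host2 webapp: user=bob login ok ip=192.0.2.7",
    ):
        print(normalize_login(raw))

Once every login looks like that one record, a single alert rule covers all of your applications.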

A good approach is what Marcus Ranum coined "Artificial Ignorance":

Start with your full feed of logs and sort it to find the most common log messages. If they are significant, categorize those logs and push them off to something that knows that category to report on.

Remember that the number of times that an insignificant thing happens can be significant, so generate a rate of insignificant events and push that off to be monitored.

Repeat for the next most common log messages.

As you progress through this, you will very quickly get to the point where you start spotting log messages that indicate problems. Pass those logs to an Event Correlation engine to alert on them (and rate-limit your alerts so you don't get 5000 pages).

Much faster than you'd imagine, you will get to the point where the remaining uncategorized logs are not that significant, and there also aren't very many of them. At that point you can do something like generate a daily/weekly report of the uncategorized messages and have someone eyeball it for oddities (and keep an eye out for new message types you should categorize).
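A minimal sketch of that loop (the masking rules are made up for illustration; real ones grow out of whatever your logs actually contain):

    import re
    import sys
    from collections import Counter

    def mask(line):
        """Collapse the variable parts of a message so identical *types* group together."""
        line = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", "<IP>", line)  # IPv4 addresses
        line = re.sub(r"\b\d+\b", "<NUM>", line)                     # pids, ports, counts
        return line

    def top_message_types(lines, n=20):
        """The most common message types: categorize these first, then re-run on what's left."""
        return Counter(mask(line) for line in lines).most_common(n)

    if __name__ == "__main__":
        with open(sys.argv[1]) as f:            # a raw log file to survey
            for pattern, count in top_message_types(f):
                print(count, pattern.rstrip())

Whatever is still uncategorized after a few passes goes into the daily/weekly eyeball report, and the per-type counts are what feed the "rate of insignificant events" above.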

This seems like a gigantic amount of work, but it actually scales well. The bigger your organization the more logs you have, but the number of different _types_ of logs that you have grows much slower than the total log volume.

> It seems that, to me, these log databases are larger than the actual meat of the data being manipulated in many cases.

That's very common, but it doesn't mean the log data isn't valuable. Remember that I'm talking about a SaaS-type environment, not HPC, even if the service is only being provided to your employees. HPC and scientific simulations use a lot of CPU and run through a lot of data, but they don't generate much in the way of log info.

For example, your bank records are actually very small (what's your balance, what transactions took place), but the log records of your banks systems are much larger because they need to record every time that you accessed the system and what you did (or what someone did with your userid). When you then add the need to keep track of what your admins are doing (to be able to show that they are NOT accessing your accounts and catch any who try), you end up with a large number of log messages for just routine housekeeping.

But text logs are small, and they compress well (xz compression is running ~100:1 for my logfiles), so it ends up being a lot easier to store the data than you'd initially think. If you work to do this efficiently, you can also use cheap storage, and you end up finding that the money you spend on the logs is a trivial fraction of your budget.
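If you'd rather measure than take my word for the ratio, a quick check like this works (in-memory xz at the default preset; your ratio depends entirely on how repetitive your messages are):

    import lzma
    import sys

    raw = open(sys.argv[1], "rb").read()    # a sample log file to measure
    packed = lzma.compress(raw)             # xz container, default preset
    print(f"{len(raw)} -> {len(packed)} bytes, ~{len(raw) / max(len(packed), 1):.0f}:1")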

It doesn't take many problems solved, or frauds tracked down, to pay for it (completely ignoring the value of the logs in the case of lawsuits).



Domesticating applications, OpenBSD style

Posted Jul 24, 2015 1:12 UTC (Fri) by pizza (subscriber, #46) (2 responses)

> The archive is there to let you recreate anything else as needed and to provide an "authoritative source" in case of lawsuits. How long you keep the logs depends on your company policies, but 3-7 years are common numbers (contracts with your customers when doing SaaS may drive this).

You aren't using "logs" in the same sense that most sysadmins mean "logs" -- your definition is more akin to what journalling filesystems (or databases) refer to as logs -- i.e., a serial sequence of all transactions or application state changes.

I think that's why so many folks (myself included) express incredulity at your "logging" volume.

Domesticating applications, OpenBSD style

Posted Jul 24, 2015 1:49 UTC (Fri) by dlang (guest, #313) (1 response)

When I say logs, I'm talking about the stuff generated by operating systems and appliances into syslog, plus the logs that the applications write (sometimes to syslog, more frequently to local log files that then have to be scraped to be gathered). This includes things like webserver logs (which I find to be a significant percentage of the overall logs, but only ~1/3).

I do add some additional data to the log stream, but it's low volume compared to the basic logs I refer to above (a few log messages/min per server).

Also, keep in mind that when I talk about sizing a logging system, most of the time I'm talking about the peak data rate: what it takes to keep up with the logs at the busiest part of the busiest day.

I want a logging system that can process all logs within about 2 minutes of when they are generated. That's about the limit, as far as I've found, for having the system react to log entries or for changes to start showing up in graphs.

There is also the average volume of logs per day. This comes into play when you are sizing your storage.

So when I talk about 100K logs/sec or 1Gb of logs being delivered, this is the peak time.

100K logs/sec @ 256 bytes/log = 25MB/sec (1.5GB/min, 90GB/hour). If you send this logging traffic to four destinations (archive, search, alerting, reporting), you are at ~100MB/sec of the theoretical 125MB/sec signalling rate that gig-E gives you. In practice this is right at or just above the limit of what you can do with default network settings (without playing games like jumbo frames, compressing the network stream, etc.). The talk I posted the link to earlier goes into the tricks for supporting this sort of thing.
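For anyone who wants to plug in their own numbers, the arithmetic behind those (rounded) figures is just:

    logs_per_sec = 100_000
    avg_bytes    = 256      # average message size assumed in this example
    destinations = 4        # archive, search, alerting, reporting

    per_dest = logs_per_sec * avg_bytes     # bytes/sec to each destination
    total    = per_dest * destinations

    print(f"per destination: {per_dest / 1e6:.1f} MB/sec "
          f"({per_dest * 60 / 1e9:.1f} GB/min, {per_dest * 3600 / 1e9:.0f} GB/hour)")
    print(f"all four: {total / 1e6:.0f} MB/sec vs ~125 MB/sec theoretical gig-E")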

But it's important to realize that this data rate may only be sustained for a few minutes per day on a peak day, so the daily volume of logs can be <1TB/day on a peak day (which compresses to ~10GB), and considerably less on off-peak days. Over a year, this may average out to 500GB/day (since I'm no longer there I can't look up the numbers, but these are in the ballpark).

This was at a company providing banking services for >10M users.

Now, the company that I recently started at is generating 50GB of Windows eventlog data per day most weekdays (not counting application logs, firewall logs, IDS logs, etc.) from a couple hundred production systems. I don't yet have a feed of those logs, so I can't break it down any further than that yet, but the type of business that we do is very cyclical, so I would expect that on peak days of the month/year the Windows eventlog volume will easily be double or triple that.

If you parse the log files early and send both the parsed data and the raw data, the combined parsed data can easily be 2x-3x the size of the original raw data (metadata that you gather just adds to this).

As a result, ElasticSearch needs to be sized to handle somewhere around 500GB/day (3*3*50GB/day) for its retention period, to handle the peak period with a little headroom.
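Spelled out, that estimate is just the two 3x factors from above (peak-day multiple and parsed-data expansion) applied to the weekday baseline:

    raw_per_day     = 50   # GB/day of Windows eventlog on a typical weekday
    peak_multiplier = 3    # peak days can be double or triple a normal day
    parse_expansion = 3    # parsed + raw + metadata can be 2x-3x the raw size

    print(raw_per_day * peak_multiplier * parse_expansion, "GB/day")  # 450, call it ~500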

Domesticating applications, OpenBSD style

Posted Jul 24, 2015 1:55 UTC (Fri) by dlang (guest, #313)

As always, your log sizes and patterns may vary; measure them yourself. But I'll bet that you'll be surprised at how much log data you actually have around.

