How we centralized our logs using ELK stack

It all started with a discussion about having a nice dashboard, like a NOC, to view the status of our infra in real time. Many proposals were made and we decided to do a POC with the Elasticsearch, Logstash, Kibana (ELK) stack.

We decided to start with the error and performance logs, as we already had scripts collecting these stats on a daily basis. The plan was to modify them to output the error and performance stats in a predefined pattern so we could ingest them into ELK.

Now what’s the fun if it isn’t challenging? So here’s our share of complexities:

  • Our estate is a mix of AIX and Linux servers.
  • AIX and Linux handle error logs differently. AIX uses errpt whereas Linux uses syslog to write to a file.
  • The management directive is that no client should be installed on the servers for collecting the logs/data.

To cut to the chase, once we were done with the trials, we settled on the approach below for its simplicity.

Here’s the dashboard we came up with. It concisely shows, almost in real time, the events as they come in and their trend. The pie charts or the tag cloud can be used to drill down.

So here’s a brief outline of our plan and an explanation of how we did it.

  • The error logs should arrive almost in real time. As for perf, we wanted to log those metrics which breached a threshold five consecutive times at a three-minute interval. Yes, we have a legacy monitoring tool that does the monitoring/alerting, but we wanted to see if the threshold breaches have a trend.
  • AIX — a custom script runs every three minutes to collect errpt entries and performance stats and sends them to the central log server via syslog.
  • Linux — a custom script runs every three minutes to collect performance stats and sends them over to the central server.
  • The error logs will be sent to the central server via rsyslog in real-time. A filter in the rsyslog.conf ensures unwanted messages are filtered at source.
  • We created a dedicated VM to act as the central log server. It also hosts Logstash, which groks the logs and forwards the processed events to the Elasticsearch cluster (a minimal pipeline skeleton is sketched after this list).
  • The scripts and the logstash filters are available at https://github.com/kottapar/scripts_and_filters
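For reference, the Logstash side is conceptually just a syslog input, the grok filters described in the next section and an Elasticsearch output. Here’s a minimal pipeline skeleton; the port, hosts and index name are placeholders rather than our exact values:

input {
  syslog {
    port => 514
  }
}

filter {
  # the grok filters described below go here
}

output {
  elasticsearch {
    hosts => ["http://es01:9200"]
    index => "infra-logs-%{+YYYY.MM.dd}"
  }
}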

The scripts in detail

  • Performance stats:

Here’s a snippet of the logic we used. It collects the current value and then compares all five values (the current plus the previous four) against the threshold. If all of them breach it, an event is logged.

## RUN QUEUE:
# current value = average of the last three samples in $vmlog (first column), truncated to an integer
rqc=$(tail -n3 $vmlog | awk '{ sum+=$1 } END { print sum/3 }' | awk -F. '{print $1}')
# previous four samples, written out by earlier runs of this script
rqp1=$(cat $tmp_dir/rqp1)
rqp2=$(cat $tmp_dir/rqp2)
rqp3=$(cat $tmp_dir/rqp3)
rqp4=$(cat $tmp_dir/rqp4)
# log an event only if all five values breached the threshold
if [[ ( $rqc -gt $RQTHRESH ) && ( $rqp1 -gt $RQTHRESH ) && ( $rqp2 -gt $RQTHRESH ) && ( $rqp3 -gt $RQTHRESH ) && ( $rqp4 -gt $RQTHRESH ) ]]; then
    echo "$hname $ip_addr runq $rqp4,$rqp3,$rqp2,$rqp1,$rqc" >> $perflog
fi
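One thing the snippet doesn’t show is how the four previous samples end up in $tmp_dir. Conceptually, at the end of each run the script shifts the history files along and saves the current value. A minimal sketch of one way to do that rotation (the actual script on GitHub may do it differently):

# shift the run-queue history: the oldest value drops off, the current value becomes rqp1
mv $tmp_dir/rqp3 $tmp_dir/rqp4
mv $tmp_dir/rqp2 $tmp_dir/rqp3
mv $tmp_dir/rqp1 $tmp_dir/rqp2
echo $rqc > $tmp_dir/rqp1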

A syslog facility, local5, is configured to forward these events to the logstash server. Here’s the logstash grok that parses the perf events. The perf_msg field added here is the actual log we see in the dashboard; in the example below it is “runq 15,19,16,19,19”.

grok {
    match => { "message" => "%{DATA:server} %{IP:ip_address} %{WORD:metric-name} %{GREEDYDATA:mtr-values}" }
    add_field => { perf_msg => "%{metric-name} %{mtr-values}" }
}

Event displayed in Kibana:

Server      Timestamp             OS      ip_address   Event or log
aixserver   02/10/2018 21:01:01   AIX_P   1.2.3.4      runq 15,19,16,19,19

The above filter works for all perf events except the disk ones. The disk-related events also carry the disk name. The format of the message is

"<hostname> <ip_addr> disk_busy <diskname> dbusyp4 dbusyp3 dbusyp2 dbusyp1 dbusyc"

and the corresponding grok filter is given below.

if "disk" in [message] {
grok {
match => { "message" => "%{DATA:server} %{IP:ip_address} %{WORD:metric-name} %{DATA:disk} %{GREEDYDATA:mtr-values}" }
add_field => { perf_msg => "%{metric-name} %{disk} %{mtr-values}" }
}
}
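To illustrate (the hostname, IP, disk and values here are made up), a message like

aixserver 1.2.3.4 disk_busy hdisk3 91 88 95 90 93

would come out with a perf_msg of "disk_busy hdisk3 91 88 95 90 93", which is what shows up in the dashboard.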
  • AIX error logs:

For AIX, the error logs are displayed via the errpt command.

errpt -s mmddhhmmyy displays records on and after the specified date. As the script runs every three minutes, the entries from the last three minutes are logged to a file and sent to logstash via syslog.

# timestamp three minutes ago, in the mmddhhmmyy format errpt expects
timestamp=$(perl -MPOSIX -le 'print strftime "%m%d%H%M%y",localtime(time()-180)')

# 23-10-2017 : prevent LUNZ related errors from being reported
lunzdsk=$(lsdev -Cc disk | grep LUNZ | awk '{print $1}' | xargs | tr ' ' '|')

# squeeze repeated spaces, prefix each entry with "<hostname> <ip>", drop LUNZ and excluded identifiers, de-duplicate
errpt -s $timestamp | grep -v IDENTIFIER | sed 's/ \{1,\}/ /g;s/^/'"$hname $ip_addr"' /g' | egrep -v "$lunzdsk|D1E21BA3" | sort -u > $errlog

# send anything we collected to the central server via syslog facility local5
if [[ -s $errlog ]]; then
    /usr/bin/logger -p local5.alert -t AIX_E -f $errlog
fi
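The forwarding itself is plain syslog: both the perf and error events are tagged with facility local5, so a single entry in /etc/syslog.conf on the AIX hosts can send them to the central log server. Something along these lines (the IP is a placeholder), followed by refresh -s syslogd to pick up the change:

## forward everything on facility local5 to the central log server
local5.info    @1.2.3.4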

The date format for errpt is also a bit different. So the pattern

aix_timestamp %{MONTHNUM}%{MONTHDAY}%{HOUR}%{MINUTE}%{YEAR}

is defined in logstash. It lives in a file referenced via the patterns_dir setting. We use the aix_timestamp pattern to capture the err_time field, and then use the date filter to match the event's timestamp to it.

Say, for example, the timestamp in the error entry is 02/10/2018 20:47:00, but the log may have been picked up at 20:50 because we run the script every three minutes. With the date filter, Elasticsearch inserts this event with the 20:47 timestamp instead of the default, i.e. the time at which it received the event.

grok {
    patterns_dir => "/etc/logstash/patterns"
    match => { "message" => "%{DATA:server} %{IP:ip_address} %{WORD:err_id} %{aix_timestamp:err_time} %{WORD:err_type} %{WORD:err_class} %{WORD:err_rsrc} %{GREEDYDATA:err_log}" }
    add_field => { error_msg => "%{err_id} %{err_time} %{err_type} %{err_class} %{err_rsrc} %{err_log}" }
}
date {
    match => [ "err_time", "MMddHHmmyy" ]
}
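To make that concrete, an errpt entry forwarded by the script might look like this after the hostname and IP are prepended (the values are illustrative):

aixserver 1.2.3.4 A6DF45AA 1002204718 I O RMCdaemon The daemon is started.

The grok above splits that into err_id A6DF45AA, err_time 1002204718, err_type I, err_class O, err_rsrc RMCdaemon and err_log "The daemon is started.", and the date filter parses 1002204718 as 02 Oct 2018 20:47, which becomes the event's @timestamp.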
  • Linux error logs:

Finally, something straightforward :) All we did was add an rsyslog filter to stop the unwanted messages right at the source.

We specified an rsyslog template to define the pattern for the messages. Any message with a syslog priority below 5 (i.e. warning or more severe) that doesn't match the message filter is sent to the logstash server. That message filter is ugly, but it's what I came up with at the time; I looked for a cleaner way than that chain of $msg contains clauses but couldn't find one.

## This file is to be copied to /etc/rsyslog.d/ on the RHEL servers
# This ensures that any errors are sent over to ELK in the template format specified

$template linux_err,"<%PRI%>%TIMESTAMP% %hostname% %syslogtag% server1 LNX_E %msg%"

## This line ensures that the PERF alerts from the perf script in crontab will be sent to central syslog
if $syslogfacility-text == 'local5' and $syslogpriority == '6' then @1.2.3.4

if $syslogpriority < '5' and not ($msg contains 'savd:' or $msg contains 'makeconnection' or $msg contains 'martian' or $msg contains 'ff:ff:ff' or $msg contains 'messages suppressed.' or $msg contains 'imjournal: journal reloaded') then @1.2.3.4;linux_err

But how about maintaining and updating this file on all the servers? We used Ansible.
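If you're curious what that looks like, a minimal playbook for the job might be something like this; the group name and file paths are illustrative, not our actual layout:

## push the rsyslog filter to the RHEL servers and restart rsyslog when it changes
- hosts: rhel_servers
  become: true
  tasks:
    - name: deploy rsyslog error forwarding config
      copy:
        src: files/10-lnx_error.conf
        dest: /etc/rsyslog.d/10-lnx_error.conf
      notify: restart rsyslog
  handlers:
    - name: restart rsyslog
      service:
        name: rsyslog
        state: restarted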

But if you don’t use Ansible, we have an alternative. We weren't using Ansible earlier either, so we had the hack below.

Whenever a change was made, we'd copy the file from a central server to a landing directory on the servers. The perf_err script checks whether the file has been updated recently and, if it has, copies it over to the rsyslog.d directory and restarts the service.

# declaring syslog file
rsfile=/infra/scripts/10-lnx_error.conf
dtnow=$(date +%s)
dtfile=$(date -r $rsfile +%s)
fileday=$(date -r $rsfile +%d)
daynow=$(date +%d)

# print just the OS major version, e.g. 5, 6 or 7
os=$(awk -F. '{print $1}' /etc/redhat-release | awk '{print $NF}')

if [[ ( $os -eq 6 ) || ( $os -eq 7 ) ]]; then
    if [[ ( -s $rsfile ) && ( $fileday -eq $daynow ) ]]; then
        # time difference in minutes between now and the file's timestamp
        dff=$(( ($dtnow - $dtfile) / 60 ))
        if [[ $dff -lt 6 ]]; then
            /bin/cp $rsfile /etc/rsyslog.d/10-lnx_error.conf
            # We couldn't find a way to get the hostname into the template via rsyslog; hence the hack below
            sed -i 's/LNX_E/'"$hname LNX_E"'/g' /etc/rsyslog.d/10-lnx_error.conf
            /sbin/service rsyslog restart
        fi
    fi
fi

For the grok pattern for Linux events, I took the easy (greedy) way out.

grok {
    break_on_match => true
    match => { "message" => "%{IP:ip_address} %{DATA:server} %{GREEDYDATA:error_msg}" }
}

So, in short, that's how we set up our perf and error logs. Buoyed by the success of this, we went on to bring the VMware ESX, vCenter and Windows logs into ELK, and even found a new use case.

We started ingesting our Unix, VMware and storage inventory data into ELK to build beautiful dashboards. We can just glance at a dashboard to get an idea of how our infra is doing. We've started trending storage and datastore consumption so we can better understand and plan our infra.
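One simple way to get inventory data in (not necessarily exactly how we did it) is to dump it as delimited text and let Logstash's csv filter split the columns. A rough sketch, with illustrative column names and paths:

input {
  file {
    path => "/data/inventory/unix_inventory.csv"
    start_position => "beginning"
  }
}

filter {
  csv {
    separator => ","
    columns => ["hostname", "os", "os_version", "cpu_count", "memory_gb", "environment"]
  }
}

output {
  elasticsearch {
    hosts => ["http://es01:9200"]
    index => "unix-inventory"
  }
}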

Here's a sample of our UNIX inventory dashboard. For obvious reasons I had to recreate it and delete/mask most of the details, but you get the gist. In addition to what's shown, we had data tables that you can download as a CSV and a dashboard with line graphs for the trends.

In short, once you have your data in ELK, the possibilities are endless.

So what’s your story? Please do let us know in the comments below. Also, please don't hesitate to provide feedback, even if it's critical, for that's how we improve.
