What to monitor on a (Linux) server
It is surprising how many articles out there cover server monitoring in terms of how to use a specific tool, and how few sources document what you actually need to monitor from a best-practices point of view.
A well-monitored server lets you fix potential issues proactively and resolve service interruptions a lot faster, because the problem can be located quickly.
So here is my list of things I always monitor, regardless of the specific purpose of the server.
- hardware status - whether the fans are spinning, CPU temperature, mainboard temperature, ambient temperature, physical memory status, power supply status, CPUs online. Most of the well-known vendors (Dell, HP, IBM) provide tools to check the hardware for the items above
- disk drive S.M.A.R.T. status - you can find out things like whether the HDD is starting to count bad blocks, or whether the bad blocks are increasing fast, which gives you a heads-up that you need to prepare to replace the disk. Most of the time you can also monitor the HDD's temperature
- hardware RAID array status / software RAID status - you really want to know when an array is degraded. Unfortunately, most organizations don't actually monitor this
- file system space available - I start with a warning when usage reaches 80% and a critical alarm when usage is above 90%. For big file systems (>= 100G) this of course needs to be customized, as 20% means at least 20G
- inodes available on the file system - again I use 80% warning, 90% critical. Running out of inodes isn't always obvious and can create a whole range of other problems. Of course, it applies only to file systems which have a finite number of inodes, like ext2/3/4
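As a rough illustration of the last three items, here is a minimal sketch of the 80% / 90% thresholds applied to `df -P` output, plus a check for a degraded software RAID array. The function names (`classify_usage`, `check_df`, `mdstat_degraded`) are made up for this example, and the RAID check assumes the usual `[U_]` member-status notation of /proc/mdstat; a real monitoring system would wrap this differently.

```shell
#!/bin/bash
# Map a usage percentage to a status: warning at 80%, critical above 90%.
# The thresholds can be overridden for big file systems.
classify_usage() {
    local pct=$1 warn=${2:-80} crit=${3:-90}
    if [ "$pct" -gt "$crit" ]; then
        echo CRITICAL
    elif [ "$pct" -ge "$warn" ]; then
        echo WARNING
    else
        echo OK
    fi
}

# Read "df -P" (or "df -Pi") output on stdin and print one status per mount point.
check_df() {
    local fs total used avail pct mount
    while read -r fs total used avail pct mount; do
        pct=${pct%\%}
        case $pct in
            ''|*[!0-9]*) continue ;;   # header line, or "-" when inodes don't apply
        esac
        echo "$mount $pct% $(classify_usage "$pct")"
    done
}

# Succeed if the mdstat text on stdin shows an array with a missing member,
# e.g. "[U_]" instead of "[UU]".
mdstat_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]'
}

# Usage:
#   df -P  | check_df      # disk space
#   df -Pi | check_df      # inodes
#   mdstat_degraded < /proc/mdstat && echo "RAID array degraded"
```

The functions only read text from stdin, so they can be tried against saved output before being pointed at a live system.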
nato phonetic alphabet translator
Tired of looking at a table with the NATO phonetic alphabet while spelling words over the phone, I decided to build this: http://spellme.info in order to simplify and speed up the whole thing. GoDaddy sells .info domains for $2, so I got one for it.
Multiple domain selfsigned ssl/tls certificates for Apache (namebased ssl/tls vhosts)
This is an old problem: how to have SSL/TLS name-based virtual hosts with Apache.
The issue is that the SSL/TLS connection is established before Apache even receives an HTTP request. By the time Apache receives the request, the SSL connection has already been established for a particular combination of hostname/IP and SSL certificate, which means Apache can serve name-based virtual hosts only for that particular SSL/TLS certificate.
There are two possible solutions here:
- Multi-domain or wildcard SSL/TLS certificates. These are certificates which are configured with more than one name, so you can create virtual hosts (in the case of Apache) for those domains. This is fairly easy to set up and, at least for me, it has worked fine in the past.
- Server Name Indication (SNI), an extension to the SSL/TLS protocol which allows the client to specify the desired domain earlier, so the server can supply the correct SSL/TLS certificate for the requested hostname. The problem is that SNI is fairly new, so little server-side software supports it, and the client-side software also needs to be fairly recent. In the long run this is going to be the best solution, as it was designed to overcome this specific problem
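For the first option, a self-signed multi-domain certificate can be produced by listing the hostnames as subjectAltName entries in an OpenSSL config file. The sketch below only generates such a config; the helper name `san_config`, the hostnames and the file names are placeholders chosen for this example, not something taken from a specific setup.

```shell
#!/bin/bash
# Print a minimal OpenSSL request config for a self-signed certificate
# covering every hostname given as an argument (the first one becomes the CN).
san_config() {
    local i=1 name
    cat <<EOF
[req]
distinguished_name = dn
x509_extensions = v3_ext
prompt = no

[dn]
CN = $1

[v3_ext]
subjectAltName = @alt_names

[alt_names]
EOF
    for name in "$@"; do
        echo "DNS.$i = $name"
        i=$((i + 1))
    done
}

# Usage (hostnames and file names are placeholders):
#   san_config www.example.com shop.example.org > san.cnf
#   openssl req -x509 -nodes -newkey rsa:2048 -days 365 \
#       -config san.cnf -keyout multi.key -out multi.crt
```

The resulting key and certificate can then be referenced from the SSL virtual host, with every listed name served from the same IP address and port.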
KSM (Kernel Samepage Merging) status
KSM allows physical memory de-duplication in Linux, so basically you can get a lot more out of your memory at the expense of some CPU usage (because there is a thread which scans memory for duplicate pages). The typical use case is servers running virtual machines on top of KVM, but applications aware of this capability can also use it on OS instances which aren't running KVM guests.
The requirements are a kernel version of at least 2.6.32 and CONFIG_KSM=y. For more details you can check the official documentation and a tutorial on how to enable it.
Below is a small script (called ksm_stat) which I wrote in order to see how much memory is "shared" and how much memory is actually being saved by using this feature.
#!/bin/bash
# ksm_stat: report how much memory KSM is sharing and saving.
ksm=/sys/kernel/mm/ksm

if [ "$(cat $ksm/run)" -ne 1 ]; then
    echo "KSM is not enabled. Run 'echo 1 > $ksm/run' to enable it."
    exit 1
fi

page_size=$(getconf PAGE_SIZE)
pages_shared=$(cat $ksm/pages_shared)
pages_sharing=$(cat $ksm/pages_sharing)
pages_unshared=$(cat $ksm/pages_unshared)

echo "Shared memory is $((pages_shared * page_size / 1024 / 1024)) MB"
echo "Saved memory is $((pages_sharing * page_size / 1024 / 1024)) MB"

if ! type bc >/dev/null 2>&1; then
    echo "bc is missing or not in path, skipping ratio calculation"
    exit 1
fi

# pages_sharing > 0 implies pages_shared > 0, so neither division is by zero.
if [ "$pages_sharing" -ne 0 ]; then
    echo "Shared pages usage ratio is $(echo "scale=2; $pages_sharing / $pages_shared" | bc -q)"
    echo "Unshared pages usage ratio is $(echo "scale=2; $pages_unshared / $pages_sharing" | bc -q)"
fi
Example from a machine where KSM has just been enabled, so it will take a while until all pages are scanned:
# ksm_stat
Shared memory is 67 MB
Saved memory is 328 MB
Shared pages usage ratio is 4.87
Unshared pages usage ratio is 17.04
#
Zarafa templates for Zabbix
Recently I had to create Zabbix templates to monitor Zarafa Collaboration Platform installations. My employer was kind enough to make them available.
Some screenshots follow below; you can get the templates from Accelcloud's site.
upstart (System-V init replacement on Ubuntu) tips
Since Ubuntu Server 10.04 LTS (Lucid), Upstart, Canonical's System-V init replacement, has had most of the init scripts converted to Upstart jobs. Upstart is event-based and quite different from SysV init, so one needs to adjust to its config file structure and terminology. It has been present in the server release since 8.04 LTS, but back then the init scripts had not been converted to its format, so it didn't really matter that it had taken over from SysV init.
Reading the documentation is mandatory, but here are some quick tips for things that I, at least, found difficult to discover on the project's website or in the man pages:
The default runlevel is defined in /etc/init/rc-sysinit.conf and of course it can be overridden on the kernel command line. /etc/inittab is gone and everything has moved to /etc/init/, while legacy init scripts (= not yet converted to Upstart format) can still be found in /etc/init.d/ together with symlinks to converted init jobs.
Managing jobs: initctl start <job> / initctl stop <job> / initctl restart <job> / initctl reload <job>. Listing all jobs and their status: initctl list
Now here comes the horror story: it seems there is no CLI tool which lists the Upstart jobs that will start in a particular runlevel, or better, the Upstart and /etc/rc*.d jobs that will start in a runlevel. There are two GUI tools (jobs-admin and Boot-Up Manager) but no CLI tools, so you are left with things like sysv-rc-conf / chkconfig / update-rc.d for the legacy System-V style /etc/rc*.d folders, and for Upstart jobs you need to look manually at the files in /etc/init/. That is cumbersome because, besides the runlevel entry, you also need to take into account events/dependencies like net-device-up
It seems that Canonical thinks that nowadays a server sysadmin must also install GUI tools in order to manage basic things like which services start with the server.
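Lacking such a tool, a crude approximation is to grep the job files for their "start on" stanza. The sketch below is only that: the function name is made up, it matches only jobs whose "start on" line mentions "runlevel [...]" on the same line, and it ignores multi-line stanzas and other events such as net-device-up, so treat the result as a starting point rather than the full picture.

```shell
#!/bin/bash
# List Upstart jobs whose "start on" line names the given runlevel.
# Assumption: the runlevel condition sits on the same line as "start on".
upstart_jobs_for_runlevel() {
    local level=$1 dir=${2:-/etc/init} f
    for f in "$dir"/*.conf; do
        [ -e "$f" ] || continue
        if grep -Eq "^[[:space:]]*start on .*runlevel \[[0-9S]*${level}[0-9S]*\]" "$f"; then
            basename "$f" .conf
        fi
    done
}

# Usage:
#   upstart_jobs_for_runlevel 2
```

Combined with a listing of /etc/rc2.d/ this gives a rough view of what starts in runlevel 2.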
Running Linux on Sparc hardware
I have a SunFire V210 lying around, which I bought in order to learn Solaris and get accustomed to its hardware platform.
Now I need to move a personal project from its current server to another one, and I thought I might put the Sun server to use. It turns out that not many people run Linux on the SPARC architecture, and some things that I expected to just work (like software RAID) turned out to be complicated to set up.
Linux distribution - basically you have two options if you're looking for an up-to-date and maintained distro: Debian or Gentoo. When I first installed this server (a few months ago) Squeeze was not out yet, and because I was looking for something with newer software I decided to go with Gentoo, despite the fact that I don't like compiling everything over and over again.
Linux software RAID - there isn't a single good place to find relevant information, so I had to read through a lot of mailing lists, forums and blogs until I got it working as expected; after I finished my setup Debian Squeeze was released, and its installer does help a lot.
How to find out all of the IP addresses of a Europe-based ISP
You may want to block IP traffic from a particular Internet Service Provider for various reasons, for example because a lot of crawlers and spammers are hosted there.
For Europe-based providers this can be done by querying the RIPE NCC database: "The RIPE Database contains registration information for networks in the RIPE NCC service region and related contact details". Providers cannot avoid being listed there, so the data is genuine.
To query it, either use the web interface or, better, the whois command-line client on Linux/*nix. You need to know the AS (Autonomous System) number of the provider, which is easy to establish if you know an IP address belonging to that provider:
$ whois -- yyy.yyy.yyy.yyy | grep '^origin:' | awk '{print $2}'
ASxxxx
$ whois -h whois.ripe.net -- -i or ASxxxx | grep '^route:' | awk '{print $2}'
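Once the route objects are listed, turning them into block rules is mostly text processing. The sketch below is one possible way to do it, assuming iptables is the firewall in use; the function name is made up, and it only prints the commands so they can be reviewed before anything is applied.

```shell
#!/bin/bash
# Read whois output on stdin and print one iptables DROP rule per unique
# "route:" prefix. Nothing is executed; the rules are only printed.
routes_to_drop_rules() {
    awk '/^route:/ {print $2}' | sort -u | while read -r prefix; do
        echo "iptables -A INPUT -s $prefix -j DROP"
    done
}

# Usage (ASxxxx is the AS number found in the previous step):
#   whois -h whois.ripe.net -- -i or ASxxxx | routes_to_drop_rules
```

For a provider with many prefixes, loading them into an ipset and matching the set with a single rule may be the better choice.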
Linux: realtime traffic monitoring and path determination
There are situations when one needs to answer questions like:
- a) - which application/process is listening for inbound connections
- b) - which application/process is causing network traffic
- c) - which hosts are exchanging network traffic with our server right now
- d) - what is the current rate of traffic going through the network interfaces
- e) - how much traffic each workstation/server directly connected to the Linux server is causing
- f) - which path an outgoing packet is going to take when you have multiple network cards and several routes (and more than one routing table)
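As a small illustration for question d), the byte counters in /proc/net/dev can be sampled twice to estimate the current rate on an interface. This is a minimal sketch with made-up function names, not a replacement for a dedicated tool; it assumes the standard /proc/net/dev layout, where the first counter after the interface name is received bytes and the ninth is transmitted bytes.

```shell
#!/bin/bash
# Print "rx_bytes tx_bytes" for one interface, reading /proc/net/dev
# formatted text from stdin.
iface_bytes() {
    awk -v dev="$1" '{ sub(/:/, " ") } $1 == dev { print $2, $10 }'
}

# Sample the counters twice and print the average rate over the interval.
iface_rate() {
    local dev=$1 interval=${2:-1} rx1 tx1
    set -- $(iface_bytes "$dev" < /proc/net/dev)
    [ -n "$1" ] || { echo "no such interface: $dev" >&2; return 1; }
    rx1=$1 tx1=$2
    sleep "$interval"
    set -- $(iface_bytes "$dev" < /proc/net/dev)
    echo "$dev: rx $(( ($1 - rx1) / interval )) B/s, tx $(( ($2 - tx1) / interval )) B/s"
}

# Usage:
#   iface_rate eth0 5
```

For a) and f), `ss -tlnp` (or `netstat -tlnp`) lists listening TCP sockets with their owning processes, and `ip route get <address>` shows the route and interface the kernel would pick for a destination.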
Visualize sar reports with awk and gnuplot
On several systems where only sar (part of sysstat) is collecting and storing performance data, I needed to troubleshoot performance issues which had occurred several hours earlier. Sar is a great tool, but it is annoying that it has no option to output different reports (like CPU usage, memory usage and disk usage) at the same time, on the same page. If you request those three together, it will output each report on its own page, and from there it's hard to see how each performance indicator evolved at a specific point in time. One solution would have been to load the data into a spreadsheet application and use the VLOOKUP function to group it, but this is time-consuming and, with my spreadsheet skills, I don't think it can be automated.
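The underlying idea, lining up values from separate reports by their timestamp, is essentially what VLOOKUP would do, and awk can do it with an associative array. The sketch below is a simplified illustration: the function name is made up, and it assumes two pre-filtered files with just "timestamp value" on each line, whereas real sar output has header lines and many more columns.

```shell
#!/bin/bash
# Join two "timestamp value" files on the timestamp and print
# "timestamp value_from_first value_from_second" for matching rows.
merge_by_time() {
    awk 'NR == FNR { first[$1] = $2; next }     # remember the first file
         $1 in first { print $1, first[$1], $2 }' "$1" "$2"
}

# Usage (cpu.txt and mem.txt are hypothetical, pre-filtered files):
#   merge_by_time cpu.txt mem.txt
```

The same pattern extends to more reports by chaining the output into another merge.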
I used awk to create a report from sar output, choosing the fields I consider useful 95% of the time. Because my display resolution width is 900 I managed to squeeze in a lot of fields. To get a report for the 18th, from 10 AM to 6 PM, I use:
