For a few weeks I've been working on setting up a large op5 Monitor environment for testing purposes, and I thought it would be a good idea to share my findings. In this first post, let's talk about the server setup and the problem we want to solve.
In this scenario we want to monitor 15,000 hosts with 10 services per host (yes, 150k+ services), all SNMP-enabled devices. We want to produce up to 100 reports per week, and all metrics should be graphed as well. Our infrastructure is spread across 32 locations (regions or countries), with an even distribution of devices per location. All data should be presented in a single view and all configuration handled from a single point.
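Some quick back-of-the-envelope numbers for this scenario. Note that the 5-minute check interval below is my own assumption for the sake of the calculation; it is not stated anywhere above:

```python
hosts = 15_000
services_per_host = 10
locations = 32

services = hosts * services_per_host      # 150,000 service checks in total
hosts_per_location = hosts / locations    # roughly 469 devices per region

# Assumed 5-minute check interval (an assumption, not a stated requirement)
interval_s = 5 * 60
checks_per_sec = services / interval_s    # steady-state check rate

print(services, round(hosts_per_location), checks_per_sec)
```

At a 5-minute interval this works out to about 500 checks per second across the whole environment.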
Let's start with setting up our infrastructure. To keep things simple and deploy this within a really short timeframe, I've used a cloud service for this test. There are many cloud providers out there, but knowing this would be an IO-intensive task requiring lots of VMs, AWS was not an option for me; I have struggled with IO-intensive workloads in that environment before. So I will be using City Cloud instead (https://www.citycloud.com/). I have worked with these guys before, and they are always there to help optimize things along the way. But if you would like to set this up in AWS and compare numbers, go ahead if you feel lucky and have the time, and ping me with the result.
For a customer we would suggest running this on physical hardware, and we have three different specs: Entry, Standard and Large servers. I've used the same (as close as possible) for my VMs. I'm setting up three master servers with a Large profile and 31 pollers using the Standard profile (more about our hardware specs), so 34 VMs in total. The result is an impressive environment with 222 CPU cores and 700 GB of RAM!
I have configured the three master servers as peers, both for redundancy and to be able to spread the reporting load between them (for example, 33 reports per server). This is my main DC, where all reports, configuration and incidents will be handled. For the actual monitoring, each region will have its own instance of op5 Monitor configured as a poller, so let's set up 31 pollers. (If you want to know how, check out our admin manual.) "mon node ctrl" is your friend when doing something like this!
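For reference, registering a poller from a master looks roughly like the sketch below. The node name, address and hostgroup are made up for illustration, and the exact `mon node` subcommands can differ between Merlin versions, so verify against the admin manual:

```shell
# Register a poller and tie it to the hostgroup it should monitor
# (name, address and hostgroup are placeholders)
mon node add poller-se01 type=poller address=10.1.0.11 hostgroup=se01-hosts

# List the configured nodes to verify the setup
mon node list

# Push the current object configuration out to the pollers
mon oconf push
```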
(Fast forward a "few hours".) Now that we have our basic infrastructure ready to go, we need something to monitor as well. Switches may not like (depending on brand and model) being hammered with 500 SNMP queries per second, so we need something else to run our test against that still gives us the same kind of data a switch would. For this test I provisioned a Linux VM with snmpd, hoping it would handle the load (spoiler alert: no problems).
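If you want to replicate this, a minimal net-snmp configuration on the target VM could look like the fragment below. The community string and subnet are placeholders, not values from my setup:

```
# /etc/snmp/snmpd.conf - minimal read-only agent for load testing
agentAddress udp:161
# Allow read-only access from the pollers' network (placeholder values)
rocommunity public 10.0.0.0/8
```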
Finally, configure op5 Monitor to add all 15k hosts and 150k services! With the way op5 Monitor works with pollers, each poller is responsible for one hostgroup, so I generated the configuration using a simple script. At first I used our configuration API, but cURL can be slow when working with this amount of data. The API did handle 1,000 objects at a time (including a save that deploys the configuration) with no problem. Well, not entirely true: to get things working smoothly I needed to bump the PHP memory limit a "bit" (>1 GB) and also raise the max execution time (>1800 sec), but once that was done it worked just fine. Since I wanted to get started faster, I generated the configuration directly on the file system instead.
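As a sketch of the file-system approach, the script below generates Nagios-style object definitions with one hostgroup per poller. All names here (hosts, hostgroups, templates, the address) are illustrative assumptions, not op5 defaults:

```python
# Sketch: generate flat-file Nagios-style object config, one hostgroup per poller.
# Host names, hostgroup names and template names are made up for illustration.

N_POLLERS = 31
N_HOSTS = 15_000
SERVICES_PER_HOST = 10

def host_block(name, address, hostgroup):
    return (
        "define host{\n"
        f"    host_name   {name}\n"
        f"    address     {address}\n"
        f"    hostgroups  {hostgroup}\n"
        "    use         default-host-template\n"
        "}\n"
    )

def service_block(host, idx):
    return (
        "define service{\n"
        f"    host_name           {host}\n"
        f"    service_description snmp_metric_{idx}\n"
        "    use                 default-service-template\n"
        "}\n"
    )

def generate(poller_idx):
    """Config for one poller's hostgroup (remainder hosts ignored for brevity)."""
    hostgroup = f"poller{poller_idx:02d}-hosts"
    per_poller = N_HOSTS // N_POLLERS
    blocks = []
    for h in range(per_poller):
        name = f"host_{poller_idx:02d}_{h:04d}"
        # All checks point at the single snmpd test VM (placeholder address)
        blocks.append(host_block(name, "192.0.2.10", hostgroup))
        for s in range(SERVICES_PER_HOST):
            blocks.append(service_block(name, s))
    return "".join(blocks)

cfg = generate(1)
```

Writing the result of `generate()` for each poller into that poller's object-config file gives you the whole environment in a few seconds instead of thousands of API round trips.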
Even the sun has its spots; I made a configuration error (this one is on me):
This impressive load was a result of:
1. Not using a ramdisk for the performance data (remember IO performance?).
2. Telling Merlin NOT to try to take over the pollers' checks when they are down (or iptables not being configured properly).
3. Deploying 150k checks in one go.
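On the ramdisk point: mounting tmpfs over the performance-data spool takes the disk out of the hot path entirely. The fstab line below is a sketch; the spool path and size are assumptions, so check where your perfdata processing actually writes on your system:

```
# /etc/fstab - keep the perfdata spool in RAM (path and size are assumptions)
tmpfs  /opt/monitor/var/spool/perfdata  tmpfs  rw,noatime,size=2g  0  0
```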
Once sorted, the load dropped to less than 10, and I was impressed that the VM handled the earlier load without becoming unresponsive!
A few performance numbers (more to come):