Understanding SNMP for Network Monitoring
The Comprehensive Guide to SNMP
What is SNMP?
SNMP stands for Simple Network Management Protocol. In real life, it is often not simple; does not only apply to network devices; and often cannot be used for management of devices, only monitoring. It is definitely a protocol, however. :-)

SNMP is mainly used for the collection of data about devices, such as CPU load, memory usage, etc. SNMP is supported on practically all network equipment (switches, routers, load balancers, for example), but also on most server operating systems, some storage devices, and even some server application software. However, the extent of what “supporting” SNMP really means can vary wildly, but more on that later.

If you’re reading this, you are probably responsible for the performance, availability and capacity of some IT infrastructure. (If you are reading this because you thought it was the complete guide to the State of New Mexico Police - this is not for you.) If you have a non-trivial (i.e. greater than zero) amount of IT infrastructure whose availability matters - because it generates revenue or enables others to do their jobs - then you need a way to be sure your infrastructure is working, and working well. If you want to sleep, you need an alternative to staying up all night watching the output of the command line tool top to watch the processes on your server. Especially if you have 1000 servers and 100 routers.

SNMP provides a standard message format that the monitoring system, routers, switches, servers, storage arrays, UPS devices, etc., can all speak - even though they will be running different operating systems. Of course, there are different versions of SNMP, and different security issues, and different types of information that the different devices can report.

But if you are responsible for ensuring the performance, availability and capacity of your infrastructure, enabling and using SNMP and a monitoring system to collect and alert on data is the way to go. This system can scale from monitoring one device to tens of thousands, alerting you when something is wrong (and hopefully, letting you sleep when everything is OK.)
The SNMP agent is a software process that receives SNMP queries, retrieves the data being asked for, and replies. Most routers, switches, firewalls, and other systems without a full operating system will have SNMP support built into the software. General purpose servers (Linux, Solaris, AIX, Windows, FreeBSD, etc.) may not have an SNMP agent installed by default, depending on the installation options chosen, but one can be added at any time. The most common SNMP agent for Linux and Unix based systems is the net-snmp agent, which runs as snmpd (the SNMP daemon). Installing, configuring and running this agent adds SNMP support to any system the agent can run on.
A Network Management Station (NMS) is harder to pin down. It could be anything from a single Linux machine with snmpwalk that is used to do ad hoc command line queries against devices, to a simple management system like WhatsUp Gold, to a complete, powerful system like LogicMonitor (where the collectors initiate the SNMP questions, but the storage, analysis, and alerting is centralized in a SaaS infrastructure). But as noted above: if a system initiates SNMP questions, it can be thought of as an NMS. (Note that a system can have both the SNMP agent and an NMS installed.)
SNMP agents and NMSs talk SNMP to each other: that is, a defined IP protocol, the standard message format mentioned earlier.
[Figure: the OID tree, from the Root Node down to Internet (1)]
The long text file extract above defines the object .1.3.6.1.2.1.1.1.0 to be the sysDescr object, and specifies that when an SNMP agent is queried for this OID, it should return a textual description of the system.
To make this all a little less abstract, we can perform this query using snmpwalk, a simple tool that is included in most Linux distributions.
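As a minimal sketch of that query, the Python below builds a net-snmp snmpwalk command line for sysDescr (OID .1.3.6.1.2.1.1.1.0 in the standard SNMPv2-MIB) and splits one response line into its parts. The host address, the community string MyCommunity, and the sample response line are placeholders chosen for illustration, not output from a real device.

```python
import shlex

SYS_DESCR_OID = ".1.3.6.1.2.1.1.1.0"  # sysDescr.0 in the standard SNMPv2-MIB


def snmpwalk_command(host, community, oid=SYS_DESCR_OID):
    """Build the argument list for a net-snmp v2c walk with numeric OID output."""
    return ["snmpwalk", "-v", "2c", "-c", community, "-On", host, oid]


def parse_response_line(line):
    """Split one 'OID = TYPE: value' output line into (oid, type, value)."""
    oid, _, rest = line.partition(" = ")
    value_type, _, value = rest.partition(": ")
    return oid.strip(), value_type.strip(), value.strip()


# Hypothetical host and community string, for illustration only.
command = snmpwalk_command("192.0.2.10", "MyCommunity")
print(shlex.join(command))

# A made-up response line in the general shape net-snmp prints.
sample = ".1.3.6.1.2.1.1.1.0 = STRING: Linux web01 5.14.0 x86_64"
print(parse_response_line(sample))
```

Running the printed command against a device with an SNMP agent and a matching community should return a line of the same general shape as the sample.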
One thing to note is that OIDs can represent objects in a table, if the SNMP agent may have multiple items with the object in question. In this case, each row in the table will be about one of the items. Take interfaces, for example: there is an OID for the Interface Description, and another for the number of Octets received on that interface. But a computer may have many interfaces, so each interface gets its own row.
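To illustrate how table OIDs fit together, here is a small sketch that groups walked values into one row per interface. It assumes the standard IF-MIB column OIDs for ifDescr and ifInOctets, and a row index that is the single last number of the OID; the walk data itself is made up.

```python
from collections import defaultdict

# Column OIDs from the standard IF-MIB interfaces table (ifTable).
COLUMNS = {
    "1.3.6.1.2.1.2.2.1.2": "ifDescr",
    "1.3.6.1.2.1.2.2.1.10": "ifInOctets",
}


def rows_from_walk(varbinds):
    """Group (oid, value) pairs into one dict per table row, keyed by row index."""
    rows = defaultdict(dict)
    for oid, value in varbinds:
        # The last OID component is the row index; the rest names the column.
        column_oid, _, index = oid.lstrip(".").rpartition(".")
        name = COLUMNS.get(column_oid)
        if name is not None:
            rows[int(index)][name] = value
    return dict(rows)


# Made-up walk results for a host with two interfaces.
walk = [
    (".1.3.6.1.2.1.2.2.1.2.1", "lo"),
    (".1.3.6.1.2.1.2.2.1.2.2", "eth0"),
    (".1.3.6.1.2.1.2.2.1.10.1", 1200),
    (".1.3.6.1.2.1.2.2.1.10.2", 987654),
]
for index, row in sorted(rows_from_walk(walk).items()):
    print(index, row)
```

Each printed row pairs the description and the octet counter of one interface, which is the "one row per item" idea described above.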
Versions of SNMP
SNMP version 2c: in practical terms, v2c is identical to version 1, except it adds support for 64-bit counters. This matters, especially for interfaces: even a 1 Gbps interface can wrap a 32-bit counter in 34 seconds. This means that a 32-bit counter being polled at one-minute intervals is useless, as it cannot tell whether successive samples of 30 and 40 are due to the fact that only 10 octets were sent in that minute, or due to the fact that 4294967306 (2^32 + 10) octets were sent in that minute. Most devices support SNMP v2c nowadays, and generally do so automatically. There are some devices that require you to explicitly enable v2c - in which case, you should always do so. There is no downside.
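The arithmetic behind the 34-second figure and the 30-versus-40 ambiguity can be checked with a short sketch. It assumes the interface runs at full line rate and that the counter counts octets; the function names are illustrative.

```python
def seconds_to_wrap(counter_bits, link_bits_per_second):
    """Time for an octet counter to wrap when the link runs at full line rate."""
    octets_per_second = link_bits_per_second / 8
    return 2 ** counter_bits / octets_per_second


def possible_deltas(previous, current, counter_bits, max_wraps=1):
    """Octet counts consistent with two samples, allowing 0..max_wraps wraps."""
    modulus = 2 ** counter_bits
    base = (current - previous) % modulus
    return [base + wraps * modulus for wraps in range(max_wraps + 1)]


# A 32-bit counter on a 1 Gbps link: roughly 34 seconds to wrap.
print(round(seconds_to_wrap(32, 1e9), 1))

# The samples 30 and 40 from the text: 10 octets, or 2^32 + 10 octets.
print(possible_deltas(30, 40, 32))

# A 64-bit counter on the same link, expressed in years.
print(seconds_to_wrap(64, 1e9) / (3600 * 24 * 365))
```

With one-minute polling, a 32-bit counter on a busy gigabit link can wrap between samples, so both candidate deltas are plausible; a 64-bit counter takes so long to wrap that the ambiguity does not arise in practice.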
SNMP version 3: adds security on top of the 64-bit counters. Version 3 adds both encryption and authentication, which can be used together or separately. Setup is more complex than just defining a community string, but then, what security is not? If you require security, this is the way to do it.
The only security measures for SNMP versions 1 and 2c are a community string sent in plaintext, and the ability to limit the IP addresses that can issue queries. This is effectively no security from someone with access to the network: such a person will be able to see the community string in plaintext, and spoofing a UDP packet’s source IP is trivial. However, if your device is set up to only allow SNMP read-only access, the risk is fairly small, and confined to evil people with access to your network. If you have evil people with this access, people reading device statistics by SNMP is probably not what you need to be worrying about. So, if you can accept the weak security model of SNMP v2c, use that. If not, use v3 with encryption and authentication.
[Figure: SNMP version features: plaintext, 64 bit counters, encryption, authentication]
Questions, Answers and Traps

There are two methods of information transfer in SNMP. One is to query an OID, and receive an answer (given that this act of querying is usually done periodically, this is often called polling). In order to check the temperature of a Cisco device, you can poll the rows of the OID table .1.3.6.1.4.1.9.9.13.1.3.1.3 to get the temperature, and .1.3.6.1.4.1.9.9.13.1.3.1.6 to see if the temperature is triggering any warning or error states.

The other method of information transfer is to use Traps. Traps are initiated by the SNMP agent, i.e. instead of the NMS polling an OID periodically to see if the temperature state is a cause for alarm, the device can just send the NMS a notification when the temperature exceeds a threshold. This sounds good, in that you will get immediate notification as soon as an alert condition occurs, instead of having to wait for a poll to detect the condition. Another possible advantage is that there is no load on the NMS, network or monitored device, to support the periodic polling. However, traps have some significant disadvantages.

Firstly, consider what a trap is: a single UDP datagram, sent from a device to notify you that something is going wrong with that device. Now, UDP (User Datagram Protocol) packets (unlike TCP) are not acknowledged, and not retransmitted if they get lost and don’t arrive, since the sender has no way of knowing if they arrived or not. So, a trap is a single, unreliable notification, sent from a device at the exact time that a UDP packet is least likely to make it to the management station - as, by definition, something is going wrong.

The thing going wrong may be causing spanning tree to recompute, or routing protocols to reconverge, or interface buffers to reset due to a switchover to a redundant power supply - not the time to rely on a single packet to tell you about critical events. Traps are not a reliable means to tell you of things that can critically affect your infrastructure; this is the main reason to avoid them if possible.

Secondly, the trap destination has to be configured on every switch, every router, every server. But, you may ask, don’t you have to do this to set up the SNMP community on the devices anyway, to enable polling? Yes, but usually when SNMP communities are defined, polling is enabled for entire networks or subnets. You can move your monitoring system to another IP on the same subnet, and not have to change any configuration.

But if you rely on traps, you now have to touch every device and reconfigure it to send traps to the new destination. And more significantly, it’s very hard to test that traps will work. With polling, it’s easy to see (and be alerted on) data not returning due to a misconfigured community, firewall or access list. It is much harder to be confident that a system is set up to trap to the right place, and that access lists are set correctly to allow the traps. (And of course, traps use a different port than regular SNMP queries, so the fact that polling works tells you nothing about whether traps will work.)

By definition, polls are tested every minute or so. A trap is usually sent only when a critical event occurs, with no notification or feedback if it fails. Which would you rather depend on for the health of your infrastructure and applications?
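The reliability argument can be made concrete with a toy probability model. This is only a sketch under loud assumptions: every packet is lost independently with the same probability, a poll succeeds only if both the request and the reply get through, and the 20% loss figure is invented for illustration.

```python
def trap_missed(loss):
    """Probability that a single, unacknowledged trap never arrives."""
    return loss


def polls_all_missed(loss, polls):
    """Probability that every one of `polls` request/reply cycles fails.

    A poll needs both the request and the reply to get through.
    """
    poll_fails = 1 - (1 - loss) ** 2
    return poll_fails ** polls


# Assumed 20% packet loss during an incident; purely illustrative.
loss = 0.2
print("trap never arrives:", trap_missed(loss))
for polls in (1, 3, 5):
    print(polls, "polls all fail:", polls_all_missed(loss, polls))
```

The trap gets one chance, so its miss probability stays at the loss rate, while the chance that repeated polls all fail shrinks with every cycle. A failed poll is also visible to the NMS as missing data, which a lost trap never is.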
[Figure: time-series graph, 6 Dec to 14 Dec, vertical axis 0 to 80, comparing Device 1 and Device 2]
Pretty much any device you see being sold into the datacenter or IT space will claim “SNMP support”. This is kind of like saying that a two-year old toddler and Usain Bolt are both capable of running - there are wildly disparate differences in what “SNMP support” can mean. Some devices will support a very limited set of information that is available through SNMP; some will support all the standard mgmt objects; and some will support the standard objects, as well as thousands of OIDs they publish in their own MIBs.

The fact that a device vendor may provide a MIB that has lots of useful information in it does not necessarily solve your problems. For example, while APC does provide very powerful SNMP agents, and a detailed private MIB - their MIB has over 4500 objects in it - not all objects are supported by all APC devices; and most are not meaningful to ordinary use of the devices (e.g. 1.[Link].[Link].[Link].1.4: “the rectifier physical address (the address on the bus).”).
As noted above, an NMS can be as simple as a Linux workstation with SNMP utilities installed, so that you can
perform SNMP gets. In theory, you could then wrap some scripts around snmpget and snmpwalk, to query the
data you care about; compare it to some hardcoded thresholds; and run the script out of a cron job so that it
repeats every 5 minutes.
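A minimal sketch of such a script, in Python rather than shell, might look like the following. The host address, community string and threshold are placeholders; the OID is hrSystemProcesses from the standard HOST-RESOURCES-MIB, picked only as an example; and the live query shells out to net-snmp's snmpget, so it assumes that tool is installed. The demo at the bottom swaps in a stand-in fetch function so the sketch runs without a device.

```python
import subprocess

HR_SYSTEM_PROCESSES = ".1.3.6.1.2.1.25.1.6.0"  # HOST-RESOURCES-MIB process count

# Hypothetical host and hardcoded threshold, for illustration only.
CHECKS = [("192.0.2.10", HR_SYSTEM_PROCESSES, 500)]


def snmp_get_value(host, oid, community="MyCommunity"):
    """Fetch one value with net-snmp's snmpget; -Oqv prints only the value."""
    result = subprocess.run(
        ["snmpget", "-v", "2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, timeout=10, check=True,
    )
    return result.stdout.strip()


def run_checks(checks, fetch=snmp_get_value):
    """Return a list of alert strings for missing data or exceeded thresholds."""
    alerts = []
    for host, oid, threshold in checks:
        try:
            value = float(fetch(host, oid))
        except (OSError, ValueError, subprocess.SubprocessError) as exc:
            alerts.append(f"{host} {oid}: no data ({exc.__class__.__name__})")
            continue
        if value > threshold:
            alerts.append(f"{host} {oid}: {value:g} exceeds {threshold}")
    return alerts


# Stand-in for a live query, so the sketch runs without a device.
fake_values = {("192.0.2.10", HR_SYSTEM_PROCESSES): "612"}
print(run_checks(CHECKS, fetch=lambda host, oid: fake_values[(host, oid)]))
```

Dropping the stand-in and calling run_checks(CHECKS) from a cron entry would give the five-minute loop described above, with every host, OID and threshold maintained by hand.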
This is probably not what you would consider a real NMS, however.
So while any system that can query and show the response to an SNMP query could be called an NMS, there are a few fundamental things that need to be there at a practical level:
Easily define what OIDs to query.
Ideally, this isn’t something you even need to think about. An NMS with true SNMP support will discover the kind of device; then have knowledge of which OIDs are appropriate to query for that device; and also periodically check to see if there are changes in the device’s configuration requiring new different OIDs to check. (For example, enabling Power-over Ethernet in a switch will turn on a whole new section of the MIB tree that should be queried.) The worst case is an NMS that requires you to manually define what OIDs to check. Yes, it’s technically supporting SNMP, but it’s not making your life any easier if you have to go through the 4500 objects in the APC MIB, just to ensure your UPS’s are correctly monitored.

Easily define how to interpret the data that is returned.
SNMP data can be returned as gauges (e.g. the current temperature in Celsius); counters (how many packets have passed through the interface since the system started); strings, bitmaps, etc. Counters need to be converted to a rate, in most cases, by subtracting the prior counter value from the current, and dividing by the time interval between samples. This should be automatically handled by the NMS.

Easily define the thresholds that should trigger alerts.
Again, ideally the NMS should take away a lot of the need for this, and have pre-defined alerts for everything that could impact production systems, but there will always be customization required - either for systems that are not mission-critical, and so have a greater tolerance for performance issues; or for custom metrics that are not pre-defined. This tuning should be an easy task.
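The counter-to-rate conversion mentioned above (subtract the prior value from the current one, then divide by the interval) can be sketched as follows. The wrap handling is an assumption added for illustration: a counter that goes backwards is treated as having wrapped exactly once, although in practice it may also mean the device restarted. The sample values are made up.

```python
def counter_rate(previous, current, interval_seconds, counter_bits=64):
    """Convert two counter samples into a per-second rate.

    If the counter went backwards, assume it wrapped exactly once.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current - previous
    if delta < 0:
        delta += 2 ** counter_bits
    return delta / interval_seconds


# Two made-up octet samples taken 60 seconds apart.
print(counter_rate(1_000_000, 1_600_000, 60))

# A 32-bit counter that wrapped between samples.
print(counter_rate(4_294_967_000, 704, 60, counter_bits=32))
```

A gauge such as a temperature needs no such conversion, which is why the NMS has to know the type of each object it collects.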
You’ll notice that the items above all focus on ease of use, which should be the main goal of using an NMS: to make the job of ensuring the operational availability, capacity and performance of the systems easier. NMSs that require you to modify text or XML files, or pore through thousands of MIB files and configure all the SNMP OIDs to query, may technically be NMSs - but only in the loosest sense of the word.
There are many other things that an NMS may do in this regard,
which will be of differing utility to different organizations:
• graph the variables being collected, so you can see the
historical trend of the objects being collected.
• route and deliver the alerts via different mechanisms (chat,
email, sms, voice calls) to different people, and escalate
through different people and teams. This can alternatively
be done by a separate tool.
• discover devices to be monitored via different mechanisms.
• map devices logically at different OSI layers,
or geographically.
• use different data collection mechanisms other than SNMP,
to support devices that do not provide any, or limited,
support of SNMP. An NMS that can also collect data via other
protocols such as WMI, JMX, and various other APIs can be
used to consolidate and replace multiple tools into one, and
provide a more cohesive view of the whole environment.
and so on.
However, in most uses, a simple configuration is reasonable (assuming the host is behind a firewall, and not exposed to the Internet). The simplest configuration is to set the contents of /etc/snmp/snmpd.conf to this:

    rocommunity MyCommunity

Then enable the agent at boot and restart it:

    chkconfig snmpd on
    service snmpd restart

• Some distributions will include a version of the snmp agent that uses hosts.allow and hosts.deny files in the /etc directory, in conjunction with tcpwrappers, to further limit access. You can test this by adding the line snmpd: ALL to /etc/hosts.allow, and retrying your test.
• Of course, ensure you are using the correct SNMP community string.
Working at Scale

If you have more than one server to manage, you will need to set up SNMP access on all your devices. This is easily done with any of the popular configuration management tools (Ansible, Chef, Puppet, CFEngine, etc).

Conclusion

Hopefully you’ve gained an understanding of what SNMP is: why it is used; how it is configured; the type of systems that use it; and some of the pitfalls in talking about SNMP support. SNMP is the most widely deployed management protocol; it is simple to understand (although not always to use), and enjoys ubiquitous support. While some systems have alternate management systems - most notably Windows, which uses WMI in preference - a good knowledge of SNMP will take you a long way in being equipped to monitor a variety of devices and servers.