In "Network monitoring is a commodity myth" I argued that network monitoring is far from being a commodity and, on the contrary, needs innovation to cope with increasing complexity.
As cote mentioned in the comments of that post, there has been some fresh blood in the IT management industry. Several open source companies/projects are tackling the monitoring problem, which is a good thing, yet I feel we're still missing some pieces. AFAIK, most of the monitoring solutions seem to be following existing paradigms:
monitoring the devices (nodes) through SNMP agents
synthetic transactions to determine the status of services running on nodes
The understanding of the network topology is missing in both paradigms. In other words, nodes are what's being monitored. Not the network. The network topology (except layer 3) is largely unknown. This limits the effectiveness of the monitoring. Monitoring tools (or rather functionality offered by the tools) can be categorized broadly as the following:
Polling the devices: The most common approach in IP networks. Most IP networking devices have an SNMP agent that supports at least MIB-II, so basic availability and performance information can be obtained. For more detailed information, however, proprietary MIBs are needed. Many IT management guys have spent long hours trying to understand these MIBs, figuring out which data is where, compiling them for use by their monitoring tools, etc.
Listening for exceptions: Not every network device has an agent that can be polled, especially in the layers below IP. And even when one is available, the ability to listen for information is useful as it can be more immediate. In IP networks, these are typically SNMP traps or syslog events. In others, there are often element managers that convey messages. Again, IT management folks have spent countless, often frustrating hours trying to make sense of the traps, syslog events, etc.: normalizing them, translating them into human language, identifying what is important and what is not.
Listening to the pipes: It is possible to learn a lot by listening to what goes on in the network. Flow tools (NetFlow and its kin cFlow, J-Flow, NetStream, sFlow, etc.) generate end-to-end traffic statistics based on the flow of data through the network devices that support them. Another approach is analyzing the traffic going through a device using a SPAN port, although this method seems to be popular mainly for analyzing application traffic. I don't have a lot of personal experience with these tools, so I'll leave it to others to explain them better or correct me. From what I see, these tools often require hardware distributed throughout the network to get full visibility, which may be a hurdle for adoption.
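To make the polling category a bit more concrete, here is a minimal sketch of what a tool has to do with even the most basic MIB-II data: turn two samples of an interface octet counter (ifInOctets, a 32-bit counter that wraps) into a utilization figure. The function names are made up for illustration, and the sketch assumes the samples were already fetched by whatever SNMP library the tool uses and that the counter wrapped at most once between polls.

```python
COUNTER32_MODULUS = 2 ** 32  # ifInOctets in MIB-II is a 32-bit counter that wraps


def octet_delta(previous: int, current: int) -> int:
    """Octets counted between two polls, assuming at most one counter wrap."""
    return (current - previous) % COUNTER32_MODULUS


def utilization_percent(previous: int, current: int,
                        interval_seconds: float, if_speed_bps: int) -> float:
    """Average inbound utilization of a link over one polling interval."""
    if interval_seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    bits = octet_delta(previous, current) * 8  # octets to bits
    return 100.0 * bits / (interval_seconds * if_speed_bps)
```

For example, 37,500,000 octets received over a 300 second interval on a 10 Mbit/s link works out to 10% utilization. On fast links a 32-bit counter can wrap more than once per interval, which this sketch does not detect.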
IMHO, all of the approaches I've tried to summarize above have some shortcomings. As far as I can see, the situation may improve in two ways:
Someone may come up with a new technology, a clever way to monitor the network and identify the problems, maybe discover and represent the network, etc. IMO, this can only happen if some of the investment and attention in tools that target “business users” with sexy, shiny UIs flows back to the muck. When the payoff is so low (who wants to tackle a “commodity” problem?) significant investment is not likely.
The power of the community is harnessed to solve tedious problems once and share the results, rather than each user struggling to solve the same problems over and over independently. There are already some examples of this: Splunk is attempting to create a repository of log events and what they mean. The ZipTie open source project is working on solving device configuration through collaboration of vendors and customers (how come they are not a member?)
There is a lot more that can be done in the monitoring realm if we can manage to set up the right collaboration platform (commercially, legally as well as technically) to facilitate sharing, which is sorely lacking in IT management, whatever the reasons may be.
From what I can see, the ZipTie model is particularly interesting and suitable. The ability to collaborate and share is potentially a major competitive advantage for open source projects. I believe there are opportunities here for collaboration among open source projects/companies and their users/customers.
For example, in the case of discovery and representation of the network topology, the know-how for getting the topology data out of the vast number of different types of devices can be shared. If a common model can be defined to represent the topology, adapters to populate the model for each device can be developed.
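As a rough sketch of that idea, the common model can be as small as a set of port-to-port links, with one adapter per device type translating its output into the model. Everything below is hypothetical: the two "vendor" output formats, the adapter names and the registry are invented for illustration and do not correspond to any real device or to ZipTie.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Iterator, Set, Tuple

Endpoint = Tuple[str, str]          # (device name, port name)
LinkPair = Tuple[Endpoint, Endpoint]


@dataclass
class Topology:
    """Common model: an undirected set of links between device ports."""
    links: Set[FrozenSet[Endpoint]] = field(default_factory=set)

    def add_link(self, a: Endpoint, b: Endpoint) -> None:
        # A frozenset makes A-B and B-A the same link, so reports from
        # both ends of a cable collapse into one entry.
        self.links.add(frozenset((a, b)))

    def neighbors(self, device: str) -> Set[str]:
        found: Set[str] = set()
        for link in self.links:
            names = {name for name, _port in link}
            if device in names:
                found |= names - {device}
        return found


def parse_vendor_a(device: str, raw: str) -> Iterator[LinkPair]:
    """Hypothetical format, one link per line: 'local_port -> neighbor:port'."""
    for line in raw.splitlines():
        if "->" not in line:
            continue
        local_port, remote = (part.strip() for part in line.split("->", 1))
        neighbor, neighbor_port = remote.split(":", 1)
        yield (device, local_port), (neighbor, neighbor_port)


def parse_vendor_b(device: str, raw: str) -> Iterator[LinkPair]:
    """Hypothetical format, one link per line: 'neighbor,neighbor_port,local_port'."""
    for line in raw.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) == 3:
            neighbor, neighbor_port, local_port = fields
            yield (device, local_port), (neighbor, neighbor_port)


# Registry of adapters; supporting a new device type means adding one entry.
ADAPTERS: Dict[str, Callable[[str, str], Iterator[LinkPair]]] = {
    "vendor_a": parse_vendor_a,
    "vendor_b": parse_vendor_b,
}


def load(topology: Topology, device: str, device_type: str, raw: str) -> None:
    """Run the adapter for this device type and merge its links into the model."""
    for a, b in ADAPTERS[device_type](device, raw):
        topology.add_link(a, b)
```

The point of the split is that the model and the queries on it are written once, while each adapter is a small, independent piece that a vendor or a user could contribute.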
In the case of trap and event log processing, the know-how of what each trap may mean and what the varbinds are can be shared. And again, if a common model can be defined to represent the traps/events, adapters to convert the traps into the common model can be developed.
I think these problems naturally lend themselves to being solved through collaboration, and life in the trenches would improve significantly if we were tackling them together instead of drowning in them alone.
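A minimal sketch of such a common event model, using the standard linkDown and linkUp notifications as the shared know-how. The Event fields, the severity labels and the adapter registry are assumptions made for illustration; the sketch also assumes the trap has already been received and decoded into a trap OID plus a mapping of varbind OIDs to values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping

LINK_DOWN = "1.3.6.1.6.3.1.1.5.3"        # standard linkDown notification
LINK_UP = "1.3.6.1.6.3.1.1.5.4"          # standard linkUp notification
IF_INDEX_PREFIX = "1.3.6.1.2.1.2.2.1.1."  # ifIndex column; the instance follows


@dataclass(frozen=True)
class Event:
    """Common model for a normalized event (fields are illustrative)."""
    source: str
    kind: str
    severity: str
    summary: str


def _if_index(varbinds: Mapping[str, object]) -> str:
    # Find the ifIndex varbind regardless of which instance it carries.
    for oid, value in varbinds.items():
        if oid.startswith(IF_INDEX_PREFIX):
            return str(value)
    return "unknown"


def link_down(source: str, varbinds: Mapping[str, object]) -> Event:
    return Event(source, "link_down", "major",
                 f"interface {_if_index(varbinds)} went down")


def link_up(source: str, varbinds: Mapping[str, object]) -> Event:
    return Event(source, "link_up", "info",
                 f"interface {_if_index(varbinds)} came up")


# Shared knowledge lives here: one adapter per trap OID.
TRAP_ADAPTERS: Dict[str, Callable[[str, Mapping[str, object]], Event]] = {
    LINK_DOWN: link_down,
    LINK_UP: link_up,
}


def normalize(source: str, trap_oid: str, varbinds: Mapping[str, object]) -> Event:
    """Convert a decoded trap into the common model, flagging unknown traps."""
    adapter = TRAP_ADAPTERS.get(trap_oid)
    if adapter is None:
        return Event(source, "unknown", "info", f"unrecognized trap {trap_oid}")
    return adapter(source, varbinds)
```

Proprietary traps would be handled the same way, by adding an entry to the registry, which is exactly the kind of tedious work that only needs to be done once if it is shared.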
One of the things I keep thinking about as it pertains to issues like this is "Has this problem been solved before?" Related to that, I think "The more things change, the more they stay the same."
Ehh?
We have seen problems like this in the past, and from where I sit it seems to me that when an idea occurs, it races out the door and no one sits back and thinks about the problems it might have... or might cause... long term.
It could be one of several things, but as it relates to this I think "Gee, when device / protocol X were designed, did they not think about how it might be instrumented up front?" I relate this (because I am high mileage) to the way IBM used to roll out new stuff. A new device or a new network subprotocol would come with its measurement capabilities built right in. A new device controller would have a fully architect'ed control blocks book available, and would already be tied into the newest version of the monitoring (EREP) and measurement (SMF) tool.
3rd party vendors made money making these tools easier to use, but the basic infrastructure was thought out and designed in.
When I was a subcontractor at IBM, I used to hear hallway talk basically laughing at the idea that anyone would leave something like VTAM behind for TCP/IP. VTAM was stable, measured, architect'ed, stateful, and manageable. TCP/IP "didn't even guarantee packets would be delivered!".
VTAM got slaughtered.
Yet here we are, trying to figure out how to make devices and protocols monitorable and manageable.
Of course, the way that IBM was able to do what they did was simple in concept. They controlled the standard. They controlled the releases. They made the devices. The reports took freaking experts to read, decode, and understand. Shoot, you had to be unbelievably careful about making sure you enabled only the things you needed to monitor, or your whole computer / device / line was tied up with measuring itself. But IBM was incent'ed to do this. To not do it meant that they, the single throat to choke, would not be able to figure out what was wrong when things went wrong. Customers would lynch them for downtime / outages / poor performance. PC software had not yet lowered the expectations of the customer as to what they could expect from computers.
So, we have been here before. People are starting to realize the real costs of downtime, and are raising their expectations again. But the stack is no longer made by one vendor. There is not one throat to choke. It is easy to point the finger across the way and deflect blame.
For this to work, the incentive has to come back around so that everyone is on the line. Everyone can get choked if we do not all work together.
I don't think anyone wants to go back to the monolithic days of yore, except for maybe those that yearn for the days when they could just beat up one vendor. That had its limits too, of course: as long as they were the only game in town, they could just take their lumps.
As long as there is no incentive to design / architect this stuff in from the get-go, we are going to stay in the Tower of Babel.