Expositus Procuratio

 

In the "Network monitoring is a commodity myth" post, I argued that network monitoring is far from being a commodity and, on the contrary, needs innovation to cope with the increasing complexity.

 

 

As cote mentioned in the comments of that post, there has been some fresh blood in the IT management industry. Several open source companies/projects are tackling the monitoring problem, which is a good thing, yet I feel we're still missing some pieces. AFAIK, most of the monitoring solutions seem to follow existing paradigms:

 

  • monitoring the devices (nodes) through SNMP agents

  • synthetic transactions to determine the status of services running on nodes

 

An understanding of the network topology is missing in both paradigms. In other words, nodes are what's being monitored, not the network. The network topology (except layer 3) is largely unknown, and this limits the effectiveness of the monitoring. Monitoring tools (or rather the functionality offered by the tools) can be broadly categorized as follows:

 

  • Polling the devices: The most common approach in IP networks. Most IP networking devices have an SNMP agent that supports at least MIB-II, so basic availability and performance information can be obtained. For more detailed information, however, proprietary MIBs are needed. Many IT management guys have spent long hours trying to understand these MIBs, working out which data is where, compiling them for use by their monitoring tools, etc.

  • Listening for exceptions: Not every network device has an agent that can be polled, especially in the layers below IP. And even when one is available, the ability to listen for information is useful because it can be more immediate. In IP networks, these are typically SNMP traps or syslog events. In others, there are often element managers that convey messages. Again, IT management folks have spent countless, often frustrating, hours trying to make sense of the traps, syslog events, etc.: normalizing them, translating them into human language, and identifying what is important and what is not.

  • Listening to the pipes: It is possible to learn a lot by listening to what goes on in the network. Flow tools (NetFlow and its kin cFlow, J-Flow, NetStream, sFlow, etc.) generate end-to-end traffic statistics based on the flow of data through the network devices that support them. Another approach seems to be analyzing the traffic going through a device using a SPAN port, although it seems this method is mostly popular for analyzing application traffic. I don't have a lot of personal experience with these tools, so I'll leave it to others to explain it better or correct me. From what I see, these tools often require hardware distributed throughout the network to get full visibility, which may be a hurdle for adoption.
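To make the polling category above a little more concrete, here is a minimal sketch of what a tool typically does with two successive samples of a MIB-II octet counter such as ifInOctets: take the difference, allow for the 32-bit counter wrapping, and turn it into a utilization figure. The function names and sample values are made up for illustration, and the actual SNMP fetch is left out on purpose.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps back to zero at this value


def octets_delta(previous: int, current: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Octets counted between two polls, allowing for a single counter wrap."""
    return (current - previous) % modulus


def utilization_percent(previous: int, current: int,
                        interval_seconds: float, if_speed_bps: int) -> float:
    """Average inbound utilization of a link over one polling interval."""
    bits = octets_delta(previous, current) * 8
    return 100.0 * bits / (interval_seconds * if_speed_bps)


# Two hypothetical samples taken 300 seconds apart on a 100 Mbit/s interface.
print(utilization_percent(1_000_000, 376_000_000, 300, 100_000_000))
```

Note that the modulo trick only handles one wrap per interval; on fast links a 32-bit counter can wrap more than once between polls, which is one reason 64-bit counters exist.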

 

IMHO, all of the approaches I've tried to summarize above have some shortcomings. As far as I can see, the situation may improve in two ways:

 

  • Someone may come up with a new technology, a clever way to monitor the network and identify the problems, maybe discover and represent the network, etc. IMO, this can only happen if some of the investment and attention going to tools that target “business users” with sexy, shiny UIs flows back to the muck. When the payoff is so low (who wants to tackle a “commodity” problem?), significant investment is not likely.

  • The power of the community is harnessed to solve tedious problems once and share the results, rather than each user struggling to solve the same problems over and over independently. There are already some examples of this: Splunk is attempting to create a repository of log events and what they mean. The ZipTie open source project is working on solving device configuration through collaboration of vendors and customers (how come they are not a member?).

 

There is a lot more that can be done in the monitoring realm if we can manage to set up the right collaboration platform (commercially and legally as well as technically) to facilitate sharing, which is sorely lacking in IT management, for whatever reasons.

 

 

From what I can see, the ZipTie model is particularly interesting and suitable. The ability to collaborate and share is potentially a major competitive advantage for open source projects. I believe there are opportunities here for collaboration among open source projects/companies and their users/customers.

 

 

For example, in the case of discovery and representation of the network topology, the knowledge of how to get the topology data out of the vast number of different types of devices can be shared. If a common model can be defined to represent the topology, adapters that populate the model for each device can be developed.
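As a rough illustration of the "common model plus adapters" idea, the sketch below defines a tiny link model and two adapters that populate it. Everything here is hypothetical: the `Link` class, the adapter names, the row layout for LLDP data, and the made-up vendor export format are assumptions for the example, not an existing standard or any project's API.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol, Tuple

Endpoint = Tuple[str, str]  # (device name, port name)


@dataclass(frozen=True)
class Link:
    """One undirected layer 2 adjacency in the hypothetical common model."""
    a: Endpoint
    b: Endpoint

    @staticmethod
    def between(dev1: str, port1: str, dev2: str, port2: str) -> "Link":
        # Sort the endpoints so a link reported from either side compares equal.
        first, second = sorted([(dev1, port1), (dev2, port2)])
        return Link(first, second)


class TopologyAdapter(Protocol):
    """Anything that can turn device-specific data into common-model links."""
    def links(self) -> Iterable[Link]: ...


class LldpAdapter:
    """Adapter for rows shaped like (local_port, remote_device, remote_port)."""
    def __init__(self, local_device, rows):
        self.local_device = local_device
        self.rows = rows

    def links(self):
        for local_port, remote_device, remote_port in self.rows:
            yield Link.between(self.local_device, local_port, remote_device, remote_port)


class VendorCsvAdapter:
    """Adapter for a made-up vendor export: 'devA:portA,devB:portB' per line."""
    def __init__(self, text):
        self.text = text

    def links(self):
        for line in self.text.splitlines():
            if not line.strip():
                continue
            left, right = line.strip().split(",")
            yield Link.between(*left.split(":"), *right.split(":"))


def build_topology(adapters):
    """Merge the links from every adapter into one de-duplicated set."""
    topology = set()
    for adapter in adapters:
        topology.update(adapter.links())
    return topology


topology = build_topology([
    LldpAdapter("sw1", [("Gi0/1", "sw2", "Gi0/24")]),
    VendorCsvAdapter("sw2:Gi0/24,sw1:Gi0/1\nsw2:Gi0/2,sw3:Gi0/1\n"),
])
print(len(topology))  # the sw1-sw2 link is reported twice but stored once
```

The point of the design is that only the adapters know anything device-specific; a new device type means a new adapter, which is exactly the kind of piece that could be written once and shared.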

 

 

In the case of trap and event log processing, the know-how of what each trap may mean and what the varbinds are can be shared. And again, if a common model can be defined to represent the traps/events, adapters that convert the traps into the common model can be developed.
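A similar sketch for traps and events: one common `Event` record, plus two converters that map an already-decoded SNMP trap and a raw syslog line into it. The `Event` fields, the function names, and the choice of "major" as the severity for a link-down are assumptions made for illustration. The linkDown and ifIndex OIDs and the syslog priority arithmetic (severity is the priority value modulo 8) are the standard ones.

```python
import re
from dataclasses import dataclass

LINK_DOWN_OID = "1.3.6.1.6.3.1.1.5.3"  # standard linkDown notification
IF_INDEX_OID = "1.3.6.1.2.1.2.2.1.1"   # ifIndex column of the MIB-II ifTable

SYSLOG_SEVERITIES = ["emergency", "alert", "critical", "error",
                     "warning", "notice", "info", "debug"]
SYSLOG_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$")


@dataclass(frozen=True)
class Event:
    """Hypothetical common model shared by all event sources."""
    source: str
    kind: str
    severity: str
    detail: str


def from_trap(source, trap_oid, varbinds):
    """Convert a decoded trap (trap OID plus a dict of varbinds) into an Event."""
    if trap_oid == LINK_DOWN_OID:
        # The ifIndex varbind is instance-qualified: <IF_INDEX_OID>.<index>
        index = next((value for oid, value in varbinds.items()
                      if oid.startswith(IF_INDEX_OID + ".")), "unknown")
        return Event(source, "link-down", "major", f"ifIndex {index}")
    # Traps nobody has written a mapping for yet still land in the model.
    return Event(source, "unknown-trap", "info", trap_oid)


def from_syslog(source, line):
    """Convert a syslog line with a leading <priority> field into an Event."""
    match = SYSLOG_PATTERN.match(line)
    if match is None:
        return Event(source, "syslog", "info", line)
    priority = int(match.group(1))
    severity = SYSLOG_SEVERITIES[priority % 8]
    return Event(source, "syslog", severity, match.group(2).strip())


print(from_trap("sw1", LINK_DOWN_OID, {IF_INDEX_OID + ".7": 7}))
print(from_syslog("r1", "<187>%LINK-3-UPDOWN: Interface Gi0/1, changed state to down"))
```

The shared know-how would live in the mapping inside `from_trap`: each trap OID that someone has figured out becomes one more branch (or table entry) that everyone else can reuse.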

 

 

I think these problems are naturally suited to being solved through collaboration, and life in the trenches would improve significantly if we were tackling them together instead of drowning in them alone.

 

 

 

 



Feb 28, 2008 3:54 PM Steve Carl

One of the things I keep thinking about as it pertains to issues like this is "Has this problem been solved before?" Related to that, I think "The more things change, the more they stay the same."

 

Ehh?

 

We have seen problems like this in the past, and from where I sit it seems to me that when an idea occurs, it races out the door and no one sits back and thinks about the problems it might have... or might cause... long term.

 

It could be one of several things, but as it relates to this I think, "Gee, when device / protocol X was designed, did they not think about how it might be instrumented up front?" I relate this (because I am high mileage) to the way IBM used to roll out new stuff. A new device or a new network subprotocol would come with its measurement capabilities built right in. A new device controller would have a fully architect'ed control blocks book available, and would already be tied into the newest version of the monitoring (EREP) and measurement (SMF) tools.

 

3rd party vendors made money making these tools easier to use, but the basic infrastructure was thought out and designed in.

 

When I was a subcontractor at IBM, I used to hear hallway talk basically laughing at the idea that anyone would leave something like VTAM behind for TCP/IP. VTAM was stable, measured, architect'ed, stateful, and manageable. TCP/IP "didn't even guarantee packets would be delivered!".

 

VTAM got slaughtered.

 

Yet here we are, trying to figure out how to make devices and protocols monitorable and manageable.

 

Of course, the way that IBM was able to do what they did was simple in concept. They controlled the standard. They controlled the releases. They made the devices. The reports took freaking experts to read, decode, and understand. Shoot, you had to be unbelievably careful about making sure you only enabled just the things you needed to monitor, or your whole computer / device / line was tied up with measuring itself. But IBM was incent'ed to do this. To not do it meant that they, the single throat to choke, would not be able to figure out what was wrong when things went wrong. Customers would lynch them for downtime / outages / poor performance. PC software had not yet lowered the expectations of the customer as to what they could expect from computers.

 

So, we have been here before. People are starting to realize the real costs of downtime, and are raising their expectations again. But the stack is no longer made by one vendor. There is not one throat to choke. It is easy to point the finger across the way and deflect blame.

 

For this to work, the incentive has to come back around so that everyone is on the line. Everyone can get choked if we do not all work together.

 

I don't think anyone wants to go back to the monolithic days of yore, except for maybe those that yearn for the days when they could just beat up one vendor. That had its limits too of course: as long as they were the only game in town, they could just take their lumps.

 

As long as there is no incentive to design / architect this stuff in from the get go, we are going to stay in the Tower of Babel.

Mar 25, 2008 10:50 PM Louis DiMeglio

Sometimes I think that you have to look at the 90/10 rule. In the case of discovering network topology, if you can cover discovery of connected devices through the BRIDGE-MIB and perhaps supplement that with CDP and/or LLDP, you should be able to discover the Layer 2 topology of a network in most cases. It's the method that my company uses in our product, and in almost all of our customer environments it gets 90% or better accuracy in the discovery. The last 5-10% can then be corrected/supplemented by hand and there you have it.
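One way to picture the "discover most of it, fix the rest by hand" split is a reconciliation pass over neighbor reports: a link that both ends report is treated as confirmed, and a link only one end reports goes on the manual review list. This is only a sketch of one possible heuristic, not a description of any particular product; the report tuple layout and the function name are assumptions.

```python
from collections import defaultdict


def reconcile(reports):
    """Split neighbor reports into confirmed links and links needing review.

    Each report is (reporting_device, local_port, neighbor_device, neighbor_port),
    for example one row gathered from a CDP or LLDP neighbor table.
    """
    reporters = defaultdict(set)
    for device, port, neighbor, neighbor_port in reports:
        # Normalize so both ends of the same link produce the same key.
        link = tuple(sorted([(device, port), (neighbor, neighbor_port)]))
        reporters[link].add(device)
    confirmed = sorted(link for link, devices in reporters.items() if len(devices) == 2)
    review = sorted(link for link, devices in reporters.items() if len(devices) < 2)
    return confirmed, review


confirmed, review = reconcile([
    ("sw1", "Gi0/1", "sw2", "Gi0/24"),
    ("sw2", "Gi0/24", "sw1", "Gi0/1"),
    ("sw2", "Gi0/2", "sw3", "Gi0/1"),  # sw3 never reported back
])
print(confirmed)
print(review)
```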

Mar 26, 2008 5:42 AM Berkay in response to: Louis DiMeglio

Hi Louis,

 

Welcome to the discussion. In my experience, layer 2 topology discovery has been more problematic. Most solutions that I'm aware of (the big 4, etc.) can't do it right and require a lot of custom work. The BRIDGE-MIB does not always have the information, and CDP is Cisco-only and increasingly turned off thanks to the rising dominance of the "security" contingent in the enterprise. Topology information is often in proprietary MIBs, and heuristics need to be used to process the data and infer the topology.

 

But layer 2 is only part of the story. Similarly, you'd need the routing (BGP, OSPF, etc.) topology, the VPN topology, etc., and the dependencies among these layers.

 

When people talk about silos, they typically refer to application, systems, network, etc. in broad terms. But without this topology, we essentially have several silos within the network: domains where the monitoring systems provide information that, at best, cannot be easily related to each other and, at worst, is false and misleading.

Mar 26, 2008 10:00 AM Louis DiMeglio in response to: Berkay

You're right, the overall picture is much more complex. All vendors, including my company, struggle with understanding the full landscape at any given customer. If only everyone implemented their network the same way, things would be much easier! We're watching the OMC carefully and looking forward to the solutions that you guys come up with that everyone can benefit from. Thanks for the work in these areas.

Mar 26, 2008 10:15 PM Steve Carl in response to: Louis DiMeglio

My main problems around discovery as they pertain to this are scale and identity management.

 

90/10 works on relatively small networks, but as the enterprise gets larger the numbers start to need to verge closer to 100%. For example, with 20,000 end nodes (not even counting all the plumbing it takes to hook together that much stuff) 10% is 2000 systems, and that is going to take someone a very long time to deal with manually.

 

Now, add in security policy: The larger the enterprise, the more likely the security policy requires that passwords on end nodes and infrastructure be fluid enough to avoid compromise. Say they change every 90 days in our supposed 20,000 end node network. It is highly unlikely one can effectively deal with 2000 systems every 90 days, and even worse, the discovery mechanism itself has to be re-educated every 90 days to be able to keep all the maps up to date. The right thing to do of course would be to have the identity management tool tap the discovery tool on the shoulder whenever critical passwords that discovery needs are changed, but as far as I am aware, no one has done that.
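The scale argument is easy to check with a few lines of arithmetic. The 20,000 end nodes, the 90% accuracy, and the 90-day cycle come from the comments above; the 30 minutes of manual work per missed node is purely an assumed figure for illustration, as is the simple 8-hour working day.

```python
def manual_backlog(end_nodes: int, accuracy_percent: int, minutes_per_node: int):
    """Return (nodes left for manual handling, hours needed to handle them)."""
    missed = end_nodes * (100 - accuracy_percent) // 100
    return missed, missed * minutes_per_node / 60


# 30 minutes per node is an assumption, not a measured value.
missed, hours = manual_backlog(20_000, 90, 30)

# Rough working hours in one 90-day password cycle: about 64 working days of 8 hours.
hours_per_cycle = (90 * 5 // 7) * 8

print(missed, hours, hours_per_cycle)
```

Under these assumptions the manual work comes to 1,000 hours against roughly 512 working hours in a cycle, so one person could not finish a pass before the passwords rotate again.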

Berkay

Thoughts on IT management