This is a posting in HTML format of a paper originally presented as an invited talk at the Network Analysis User Group meeting in Washington, DC in September 1993. While the specific technology and numbers are dated, the design principles have stood the test of time.
University Networking Services, +1 612 625 8888, 130 Lind Hall, firstname.lastname@example.org
The University of Minnesota's network management system is a philosophy that pervades all aspects of our data network. This paper will present the design of our data network and show how network management concerns entered into each part of the design.
We operate a typical large enterprise network. It extends to five campuses throughout the state plus another dozen smaller sites. Our "flagship" protocol is TCP/IP. We also support DECNET, AppleTalk, and Novell IPX. We could route OSI CLNS if necessary, but at this time only one site within our network has requested it, and that only for a single, specialized connection.
We support Ethernet and LocalTalk hardware interfaces. All protocols are available to those hosts with Ethernet interfaces, but only AppleTalk and TCP/IP are available to LocalTalk devices.
Our network connects more than 15,000 computers across 470-odd IP network numbers (subnets). There are 203 AppleTalk zones.
We implement this user access with nearly 60 Cisco routers, over 460 twisted-pair hubs, over 100 Ethernet-to-Ethernet bridges, and over 250 Shiva FastPaths that connect LocalTalk to Ethernet networks.
We follow these overall design principles:
The answer to this question was never in doubt: from our experience, we had to fully route all protocols in order to survive at all. This conclusion came from the following points:
So, what does routing gain for us?
The last point is the real key item.
Without it, we have people "hooking into a network," one which can't be changed or grown without affecting all existing users.
With it, we are in the business of offering packet delivery services, with the interface to those services specified by hardware/software combinations. Now, we can grow and change "our" part of the network without disturbing the existing users.
We reviewed the various topologies and settled on a compound star for these reasons:
The core of our star is a cluster of six Cisco AGS+ routers, each with 18 Ethernet interfaces and interconnected with an FDDI ring. These systems are located within a few feet of each other at the central telecommunications facility. We consider the FDDI ring to be more of a Cisco "bus extender" than a network with an identity of its own.
We currently have fibre run to about 70% of our buildings. Where it exists, we use it as a fibre Ethernet which runs to a central point within each building. Links shorter than the 1 000 m in the fibre Ethernet standard are implemented as repeater links (see the next section). Links longer than this distance terminate in a router (preferred) or bridge at the remote end.
Our campus is large, and not all fibre runs directly back to the telecommunications center. We have established roughly a dozen remote routers (each Cisco AGS+) which feed connections to their local area, then have one connection back to the main switching hub. These connections are all Ethernet for now, but can be upgraded as traffic warrants. We have set up the star to be as small and "bushy" as possible to minimize the number of hops that packets must make.
Note that this design uses the Cisco routers as network switches. A fully-configured router can switch traffic at rates approaching 1 Gbps. We use networks (i.e., Ethernet) as a point-to-point communications protocol. This architecture often goes by the name "inverted backbone."
The only statistics in the core MIB of the SNMP protocol are counts of events crossing an interface. The compound star design matches the physical network to the model used for statistics. This matching makes it easy for us to gather any required statistics.
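As a minimal sketch of how such interface counts turn into traffic figures: the MIB-II interface octet counters are 32-bit values that wrap around, so a rate is computed from the difference of two samples taken modulo 2^32. The function names, the sample values, and the 10 Mbps default link speed (Ethernet) are assumptions made for illustration, not part of the system described here.

```python
COUNTER32_MODULUS = 2 ** 32  # MIB-II interface counters are 32-bit and wrap around


def counter_delta(previous: int, current: int) -> int:
    """Return how far a 32-bit counter advanced, allowing for one wraparound."""
    return (current - previous) % COUNTER32_MODULUS


def utilization(previous: int, current: int, seconds: float,
                link_bps: float = 10_000_000) -> float:
    """Fraction of link capacity used between two samples of an octet counter."""
    bits = counter_delta(previous, current) * 8
    return bits / (seconds * link_bps)


if __name__ == "__main__":
    # Two hypothetical samples of one interface's octet counter, 300 s apart.
    # The counter wrapped past 2**32 between the samples.
    print(f"{utilization(4_290_000_000, 32_532_704, 300):.1%}")
```

Because every link in the compound star terminates on a router or hub interface, one pair of counter samples per interface is enough to describe the traffic on that link.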
We have on the order of one hundred building connections, but something like 150 times that many host connections. We have thus spent a great deal of time developing a good growth plan for the local connections. The plan we use is nicknamed the "country road" plan. It is modeled after that of the road system, in which a road starts off as a dirt lane and is monitored and improved as required to handle the traffic.
The first connection in a building is made like this:
    fibre       ---------     twisted pair
=============== |  hub  | xxxxxxxxxxxxxxxxxxxxx host
  (< 1 000 m)   |       |
                |       |
                |       |
                ---------

    fibre       ---------               ---------     twisted pair
=============== | Cisco | ------------- |  hub  | xxxxxxxxxxxxxxxxxxxxx host
  (> 1 000 m)   | IGS/L |               |       |
                ---------               |       |
                                        |       |
                                        ---------
The fibre is on a dedicated Ethernet interface in the telecommunications center or remote router. The local router is only required for longer fibre runs. The router can be attached to the hub by fibre, twisted pair, or thinnet.
The second through twelfth connections in each building (we use 12 port hubs) are handled in the obvious way.
Connections 13 through 24 are handled as:
    fibre       ---------     twisted pair
=============== | hub 1 | xxxxxxxxxxxxxxxxxxxxx host
  (< 1 000 m)   |       | xxxxxxxxxxxxxxxxxxxxx host
                |       |          ...
              +-|       | xxxxxxxxxxxxxxxxxxxxx host
              | ---------
              |
              | ---------
              +-| hub 2 | xxxxxxxxxxxxxxxxxxxxx host
                |       | xxxxxxxxxxxxxxxxxxxxx host
                |       |          ...
                |       | xxxxxxxxxxxxxxxxxxxxx host
                ---------
In other words, the hubs are co-located where possible and connected to each other with thinnet. This is the only allowable use of thinnet in our network.
If, for distance reasons, a hub must be located remotely, we use twisted pair for the connection. If the hub's AUI port is free, we use a twisted pair transceiver off the hub's AUI port for inter-hub connections. This keeps counting as clean as possible (12 ports, 12 devices).
In many cases we are attaching to "private" networks. These can be old thicknet or thinnet networks or new 10BaseT networks for which all of the devices and hubs are owned by the local department. Such networks hang off one of the twisted pair interfaces, and a bridge is always inserted between the private network and the backbone.
In fact, the term "private" mainly refers to who paid for the equipment. Regardless of who owns it, we follow the same host registration procedures (all computers attached to the Ethernet must be registered), and all hubs and other network equipment are monitored. The user is also responsible for ensuring that their equipment is running the correct software releases.
We also keep an eye on the total number of hosts attached to one router interface. Our current guidelines call for a maximum of one hundred hosts per interface, and a typical network should have no more than sixty or so. If the network gets larger than this, a second building backbone is established and operated in parallel.
This limit comes not from pure performance issues, but from fault prevention and isolation concerns. We have found that if networks are kept small, they operate reliably. As they get large -- even well within nominal specifications -- they tend to fail. The causes of failure are numerous, ranging from faulty equipment to below-par installations to host software misconfigurations. However, it is far cheaper to simply keep the size small and not worry about these problems than it is to try to fix them all and still wind up having to split things.
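The host-count guideline can be expressed as a simple check. The sketch below is only an illustration: the thresholds come from the text (a maximum of one hundred, a typical ceiling of sixty or so), while the function names and the "ok" / "watch" / "split" labels are invented here.

```python
import math

MAX_HOSTS = 100      # guideline maximum of hosts per router interface
TYPICAL_HOSTS = 60   # "sixty or so" for a typical network


def classify(hosts: int) -> str:
    """Label an interface by how its host count compares with the guidelines."""
    if hosts > MAX_HOSTS:
        return "split"   # establish a second building backbone
    if hosts > TYPICAL_HOSTS:
        return "watch"   # within the limit, but larger than typical
    return "ok"


def backbones_needed(hosts: int, limit: int = MAX_HOSTS) -> int:
    """Number of parallel building backbones needed to stay within the limit."""
    return max(1, math.ceil(hosts / limit))


if __name__ == "__main__":
    for count in (45, 75, 130):
        print(count, classify(count), backbones_needed(count))
```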
We reuse existing pairs when possible. All pairs are tested before being okayed for use. When new wire has to be pulled, we are pulling level 5 cabling but terminated according to level 3 techniques. When level 5 termination equipment is available, we will reconsider this decision.
Most of our existing fibre is 50 or 62.5 micron multimode. For typical building distances, the data rate for such fibre tops out at about 500 Mbps. We will switch to using 8 micron single mode as soon as devices (transceivers, modems, etc.) are available: we expect this to happen soon.
We have only a few types of supported network equipment, so stocking spares is fairly easy. We always make sure that we have one of everything in our spare stock. For some devices (e.g., FastPaths and hubs), we keep more spares because of the large number of installed units. These devices fail rarely, so our spare stock doubles as a pool of devices (all but one, that is) available for quick installation.
We assume that we have to have spares to cover a complete chassis and all cards. In the case of a core hub Cisco AGS+, this ties up over $70,000 in a fully-configured spare. This sounds like (and is) a lot of money, but if someone should ever drop an A/C line into one of our routers and fry the whole thing, our users will be very happy that they can be up and running now instead of having to wait until the next business day for parts.
We have all routers and monitoring equipment on UPS power. Buildings with only bridges and hubs do not have UPS. Our reasoning is that if a building loses power, it doesn't buy us much to have the hubs up if the computers are down. However, if a router should go down, that can cause a change in the network routing tables, thus violating the principle that what happens in one building should not be able to affect other buildings.
Most of our power outages are short term (under a second to a few minutes). Thus, we use UPS units with a nominal 20 minute power supply (which is more like an hour in practice). We feel that this is a reasonable compromise between protection and expense.
We try to have good working relationships with our networking equipment vendors. While we look for many things in a vendor, I would say that the most important is active participation on their part in the evolution of networking. There is no way that we can specify enough detail in a purchase request to ensure that a router or other device will work. Besides, if we did, no one would bother reading the document. Instead, we need to be sure that as new problems are uncovered and new needs arise, the vendor will be improving and adapting their products to meet those needs.
We see adding FDDI to the range of interfaces available for our users. We consider it a stopgap, but an important one that will last for a number of years. Our current plans are to add a parallel compound star network of FDDI concentrator hubs. Local connections will be over twisted pair. We have not set an upper limit on the number of hosts on a single ring, but it will be fairly small.
We see our network moving to ATM over the next few years. The ultimate configuration will be something like:
Our ultimate goal will be to have a dedicated Ethernet available for every existing host and twisted pair ATM (running from 20-50 Mbps) for all new host installations. (At last, fast enough to download printer inits!) Finally, 155 and 600 Mbps fibre connections will be available on demand anywhere. We expect that only about one to ten percent of connections will require this higher level of throughput.
As ATM switching technology evolves, we foresee migrating away from a compound star to a mesh architecture to gain performance and reliability. But this assumes that they get the software working (:-).
So far, I have reviewed our hardware design. But what about the software and data management aspects of network management? The next section will describe the data that we collect. The following sections will cover what we do with that data.
At this time, we collect the following categories of data for each host:
and this information about how it is connected:
We also collect additional information for a few specific types of hosts, for example, Shiva FastPaths and Novell servers.
While we do not have complete information on all hosts (if only!), we do have enough by and large to manage the network. For almost all hosts, we either have the data or have enough information to (eventually) track down who does. I say eventually, because a University has high turnover among system administrators, who are often graduate or undergraduate students, and it can take perseverance to convince (xxxxxxxx find) someone to take responsibility for a computer.
We use this information for a number of purposes:
At this time, we use the IP address as the unique identifier in our database. We chose this identifier for the following reasons:
So far, it works fine. However, we need to do something else to accommodate dynamic address assignment, a need that is here now and will grow in the future. Right now, if a user has a problem, they can give their IP address. This is sufficient to locate them in our and Telecommunication's databases. We can then start tracking down the problem. But imagine this dialogue:
User: I'm having problems.
Help Person: What's your IP address?
User: I don't know. It isn't working.
Help Person: What's your Ethernet address?
User: I don't know. What's an "ethernet address?"
Help Person: Where are you?
User: I don't know. In some building.
Help Person: AAAARRRGGGHH!!!
We use this data for these purposes:
% lookup-dns -v norge.unet.umn.edu
looking in /home/named/umn/unet
norge A 126.96.36.199
;isup 24 Medium Iecho -
;host Sun - Unix
;room 130 lindh
 lindh 031 Lind Hall
MX 0 unet.unet.umn.edu.
;enet 08:00:20:09:80:cd g
----- unit -----
unet.umn.edu @ns Networking Services
@ns -;CIS/NS;130 Lind;5 8888;email@example.com
----- network -----
188.8.131.52 @ns net lindh NS internal network
@ns -;CIS/NS;130 Lind;5 8888;firstname.lastname@example.org

This program extracts the DNS information and expands the maintainer and location information as it goes. It gives you just about everything that you need to contact the host's maintainer.
A key part of our system is a series of programs that are run each night. These programs verify and expand upon the data in our files. Key steps:
We communicate with each Shiva FastPath and fetch copies of:
A program scans our file of IP network number assignments. It expands the building abbreviations and checks that the assigned status (in use, open) matches the running IP routing table. It also checks that the default route (the IP address a.b.c.254) is assigned (if in use) or not assigned (if open).
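A check of this kind might look like the sketch below. It is not the original program: the input shapes (a dict of assignments, a set of routed networks, a set of assigned host addresses), the function name, and the sample addresses are all assumptions, and it assumes subnets where the a.b.c.254 convention applies (a /24, for example).

```python
import ipaddress


def check_assignments(assignments, routed, assigned_hosts):
    """Compare the assignment file against the live routing table.

    assignments:    dict mapping a network such as "10.1.2.0/24" to "in use" or "open"
    routed:         iterable of networks present in the running routing table
    assigned_hosts: iterable of host addresses that have been assigned
    Returns a list of human-readable problem strings.
    """
    routed_nets = {ipaddress.ip_network(r) for r in routed}
    hosts = {ipaddress.ip_address(h) for h in assigned_hosts}
    problems = []
    for text, status in sorted(assignments.items()):
        net = ipaddress.ip_network(text)
        gateway = net.network_address + 254      # the a.b.c.254 convention
        in_use = status == "in use"
        # An in-use network must be routed; an open one must not be.
        if in_use != (net in routed_nets):
            where = "missing from" if in_use else "present in"
            problems.append(f"{net}: marked {status} but {where} the routing table")
        # The default route address must be assigned exactly when the network is in use.
        if in_use != (gateway in hosts):
            state = "not assigned" if in_use else "assigned"
            problems.append(f"{net}: marked {status} but {gateway} is {state}")
    return problems


if __name__ == "__main__":
    sample = {"10.1.1.0/24": "in use", "10.1.2.0/24": "open", "10.1.3.0/24": "in use"}
    for line in check_assignments(sample, {"10.1.1.0/24", "10.1.2.0/24"}, {"10.1.1.254"}):
        print(line)
```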
Unlike IP where we can tightly control the routing tables, AppleTalk and Novell IPX routing is more open. We therefore follow a different strategy. For AppleTalk network numbers and zone names, and IPX network numbers, we have:
network 50162:
00:80:19:0C:07:EE 50162.57 unknown device
00:80:D3:A0:0A:76 50162.246 184.108.40.206 [atalk.out]
08:00:89:A2:17:81 50162.130 220.127.116.11 [atalk.out]
08:00:89:A2:20:41 50162.193 18.104.22.168 [atalk.out]
AA:00:04:00:D7:DB 50162.71 22.214.171.124 [Cisco ip-arp]

008019 Dayna Communications "Etherprint" product
0080D3 Shiva Appletalk-Ethernet interface
080089 Kinetics AppleTalk-Ethernet interface
AA0004 DEC Local logical address for systems running DECNET
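The vendor lines at the end of the sample report come from the first three octets of each Ethernet address. A minimal sketch of that lookup follows; the table holds only the four prefixes shown in the sample, and the function names are invented for this illustration.

```python
# Vendor prefixes copied from the sample report above.
OUI_VENDORS = {
    "008019": 'Dayna Communications "Etherprint" product',
    "0080D3": "Shiva Appletalk-Ethernet interface",
    "080089": "Kinetics AppleTalk-Ethernet interface",
    "AA0004": "DEC Local logical address for systems running DECNET",
}


def oui(mac: str) -> str:
    """Return the first three octets of a MAC address as six uppercase hex digits."""
    digits = "".join(ch for ch in mac if ch not in ":-. ").upper()
    if len(digits) != 12 or any(ch not in "0123456789ABCDEF" for ch in digits):
        raise ValueError(f"not a 48-bit Ethernet address: {mac!r}")
    return digits[:6]


def vendor(mac: str) -> str:
    """Look up the vendor description for a MAC address, if the prefix is known."""
    return OUI_VENDORS.get(oui(mac), "unknown vendor")


if __name__ == "__main__":
    for address in ("00:80:D3:A0:0A:76", "AA:00:04:00:D7:DB"):
        print(oui(address), vendor(address))
```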
We have programs that fetch data from the Shiva FastPaths and Novell servers and cross-check that data against our configuration files. Here is a sample FastPath result:
IPAddress: 126.96.36.199
kbox.geom.umn.edu:
host kbox.geom.umn.edu <=> 188.8.131.52
looking in /home/named/umn/geom
kbox A 184.108.40.206
;fast 24 public -
;host FastPath - Native
kbox-81 A 220.127.116.11
kbox-82 A 18.104.22.168
kbox-83 A 22.214.171.124
kbox-84 A 126.96.36.199
kbox-85 A 188.8.131.52
kbox-86 A 184.108.40.206
kbox-87 A 220.127.116.11
kbox-88 A 18.104.22.168
kbox-89 A 22.214.171.124
----- address counts -----
Assigned static 0, unassigned static 0, dynamic 9 addresses
atalk-snmp-check for router 126.96.36.199, is Shiva FastPath4, K-STAR Patch 9.1p2 92/07/14
has been up 1584939 seconds (18.34 days)
fpPromVersion is 510
fpBufferAvail is 224 => FastPath4 with/kit
fpBufferDrops is 40
ifPhysAddress.1 is 08 00 89 a0 26 28

start  status        net config    zone config   zone
    0  operational   unconfigured  unconfigured
31947  operational   configured    configured    Geometry Center
52467  operational   garnered      garnered      Geometry Center
    0  unconfigured  unconfigured  unconfigured
--------------------------------------------------
router is 188.8.131.52 name is fmc.gw.umn.edu / Ethernet 1 appletalk zone Geometry Center
--------------------------------------------------
IPAddress: 184.108.40.206
Status: running
What: FastPath4+
Where: 5th.floor.machine.room x1300
 u-office x1300 * - 1300 South Second St (fmc)
Contact: @levy
 @levy Stuart Levy;Geometry;? x1300;-;email@example.com
Schedule: -
LANmarkPort: -
IPTalk: -
LocalTalk: 31947 Geometry Center
EtherTalk: 52467 Geometry Center
NovellTalk: -
OtherTalk: -
RemoteTalk: -
SerialNo: 101962
Failures: -
Notes: Old S/N 80086
LastChange: phil, 13 Apr 93; new default net number
In addition, we run a number of "housekeeping" programs. These programs manage the "old" copies of these input and output files, check syntax, and perform other tasks.
Remember this? It is, after all, the first thing that comes to mind when one uses the term "network management." Well, we do it, too. We use a locally-written program that, in essence, reads in a configuration file and sits in a loop, firing off queries to devices and recording the results. The basic queries that we use are:
For each directive, you can specify its checking interval and the actions to take if it is down or changes state. The actions include running an arbitrary Unix command.
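The shape of such a loop can be sketched as follows. This is an illustration under stated assumptions, not the locally written program: the `Directive` fields, the `poll_once` function, and the injected `probe` callable are all invented names, and the action is an arbitrary Python callable (which could in turn run a Unix command).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class Directive:
    target: str                              # device to query
    interval: int                            # seconds between checks
    on_change: Callable[[str, bool], None]   # action run with (target, is_up)
    next_due: int = 0                        # time of the next scheduled check


def poll_once(directives: Iterable[Directive], probe: Callable[[str], bool],
              now: int, state: Dict[str, bool]) -> None:
    """Run every directive that is due, firing its action on a state change."""
    for d in directives:
        if now < d.next_due:
            continue                         # not yet time to check this one
        up = probe(d.target)
        d.next_due = now + d.interval
        previous: Optional[bool] = state.get(d.target)
        state[d.target] = up
        # Fire when the state flips, or when the very first check finds it down.
        if previous != up and (previous is not None or not up):
            d.on_change(d.target, up)


if __name__ == "__main__":
    answers = iter([True, False, True])      # hypothetical probe results
    checks = [Directive("hub-a", 60, lambda t, up: print(t, "up" if up else "DOWN"))]
    seen: Dict[str, bool] = {}
    for tick in (0, 60, 120):
        poll_once(checks, lambda target: next(answers), tick, seen)
```

Passing the probe in as a parameter keeps the scheduling logic separate from the query mechanism, so the same loop can drive any kind of query.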
The configuration file is built from the per-host data that we maintain. For example, the entry for a twisted pair hub is:
lindh-hub-1 A 220.127.116.11
;hubh 24 SNMP-public !
;host HP Hub Native
;room S22 lindh ; lines 001-012
This expands into the following series of directives:
The program that does the expansion keys off the device type, so we can change the monitoring that we do on all devices in a class in one simple operation.
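Keying the expansion off the device type amounts to a table of per-class templates. The sketch below assumes that the tag of the first `;` annotation in an entry (`hubh` in the hub entry above) names the device class; that reading, the template table, and the directive syntax are all hypothetical and are not taken from the original program.

```python
import re

# Hypothetical per-class templates. Editing one list changes the monitoring
# for every device of that class the next time the file is expanded.
TEMPLATES = {
    "hubh": ["ping {name}", "snmp-get {name} sysUpTime"],
    "fast": ["ping {name}"],
}


def device_class(entry: str) -> str:
    """Return the tag of the first ';' annotation (assumed to be the device class)."""
    match = re.search(r";\s*(\w+)", entry)
    if match is None:
        raise ValueError("entry has no ';' annotation")
    return match.group(1)


def expand(entry: str) -> list:
    """Expand one per-host entry into monitoring directives for its device class."""
    name = entry.split()[0]                  # the host name leads the entry
    return [t.format(name=name) for t in TEMPLATES.get(device_class(entry), [])]


if __name__ == "__main__":
    hub = ("lindh-hub-1 A 192.0.2.10\n"      # placeholder address for illustration
           ";hubh 24 SNMP-public !\n"
           ";host HP Hub Native\n"
           ";room S22 lindh ; lines 001-012")
    for directive in expand(hub):
        print(directive)
```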
Aside from simply being able to handle the number of devices that it does (we currently monitor about 6,800 data points in the network), the program has these features:
Those dependencies are calculated by a program, which takes as input a starting place in the network. The easiest way to debug this program was to print out its idea of the network map. A little (well, a lot of) polishing later, we now have a program that can automatically create and print a map of the network, with all data fetched from the configuration files or the network itself. This has proven very useful.
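One way to calculate dependencies from a starting place is a breadth-first search over the list of links, recording for each device the neighbour one hop closer to the start. The sketch below uses that technique as an assumption; the text does not say which algorithm the original program uses, and the device names and function names here are invented.

```python
from collections import deque


def upstream_map(links, start):
    """Map each reachable device to the neighbour one hop closer to `start`."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in parent:            # first visit gives the shortest path
                parent[nxt] = node
                queue.append(nxt)
    return parent


def root_causes(down, parent):
    """Devices that are down while the device they depend on is not down."""
    return sorted(d for d in down if parent.get(d) not in down)


if __name__ == "__main__":
    links = [("core", "remote-1"), ("remote-1", "hub-a"),
             ("remote-1", "hub-b"), ("core", "hub-c")]
    parents = upstream_map(links, "core")
    print(root_causes({"remote-1", "hub-a", "hub-b"}, parents))
```

In a compound star each device has a single path back to the core, so the resulting parent map also describes the network map itself.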
The operations staff needs to know what (if any) devices are down right now. So, we have a program that sorts through all the data collected by the network monitoring program and shows which ones are down at any given time. This program displays simple ASCII text. No fancy graphics, but the only people who see it aren't the sort who are impressed by gloss.
We also have programs to post-process the statistics and produce reports. We collect over 10 MBytes of raw statistics each day, and each daily report is about a megabyte. Of all this, we only actually look at parts that are important on any given day.
So far, the people who have looked in detail at our system have all told us that we are way in advance of commercially-available packages. That said, we would like to eventually be able to switch to commercial packages, so that we can put our energy into improving the network in other ways. After all, networking became much better when we could buy Cisco routers instead of maintaining (or writing) gated on a Sun...
In particular, we would like to improve these areas:
All counts and statistics on our network were produced by our network management software. While accurate at the time produced, these numbers change over time and so all numbers were converted to approximations.
Finseth, Craig (1992) "Thoughts on Network Management at the University of Minnesota". Published in ConneXions December 1992 and available from mail.unet.umn.edu in ~ftp/unet/wiring/net-management.
Midden, Marshall (1993) "Making a Network Map". Published in ConneXions June 1993.
Craig A. Finseth Networking Services Computer and Information Services University of Minnesota 130 Lind Hall 207 Church St SE Minneapolis MN 55455-0134 Craig.A.Finsethfirstname.lastname@example.org email@example.com +1 612 624 3375 desk +1 612 626 1002 fax
I am Craig A. Finseth.
Last modified Tuesday, 2010-03-02T20:27:09-06:00.