1 of 92

Title: Do’s and Don’ts for a Large-Scale OV Deployment

Session #: 262

Speaker: Tom Santanello

Company: Castello Systems

2 of 92

Agenda

  • Background

  • General Do’s and Don’ts

  • Product Specific Implementations

  • Configuration Management of EM Systems

  • Correlation Composer

3 of 92

Background

  • Outsourced Infrastructure

    • Large outsourcing effort with only a portion of the total infrastructure outsourced
      • Outsourcing effort covers Local and Campus networks
      • Three major “air-gapped” networks
      • 2000+ Servers in three major Data Centers as well as “local” servers
      • User workstations in 100+ locations within a 50 Mile radius of the main campus
    • No existing Consolidated Operations Center
    • Client retained complete control over governance process for everything deployed
    • No coherent Change or Configuration Management process
    • 350 SLA’s to start with (approx 160 today)

  • Our Task

    • Deploy a modernized Enterprise Management capability
    • Replace “Owner Monitoring” with “Operations Management”
    • Deployed across multiple networks
    • Function from a single Operations Center
    • Provide for SLA monitoring and reporting capabilities

AND Complete all deployments in 30 Months

- Piece of Cake – Right…….

4 of 92

Build Approach - Three Primary Builds

  • Build 1 – “Quick Hits”
    • OV Performance Insight
    • OV Network Node Manager 6.3
    • OV Internet Services 3.5
      • All three networks

  • Build 2
    • Foundry IronView
    • Cisco CiscoWorks
      • All three networks

  • Build 3
    • OV Operations Unix
    • OV Operations Windows
    • OV Network Node Manager 6.4.1
    • OV Extended Topology
    • OV Service Desk
    • OV Service Information Portal
    • OV Reporter
    • OV Storage Area Manager
    • OV Performance Manager
    • SupportSoft
      • Main Network

  • Builds 4 - XX
    • Upgrade to NNM 7.01
    • Upgrade to NNM 7.5
    • Multiple upgrades to OVIS, OVR, OVPM
    • Upgrade to OVOW 7.2
    • Patch Upgrades
    • Deploy Build 3 servers to the other two networks
    • Upgrades to IronView and CiscoWorks

5 of 92

Architecture

[Architecture diagram; recoverable label: OVPI]

6 of 92

Data Flow

[Data flow diagram; recoverable label: OVPI]

7 of 92

Organizational Issues

  • Total Chaos!!
    • Multiple organizations responsible for each IT area
    • No clear understanding of what EM is within the delivery organization
    • No buy-in by the Delivery organizations
    • No direct access to servers
    • No dedicated Server SA’s
    • Failing SLA’s

  • Other Modernization projects were also way behind schedule
    • Impact: We had to account for an environment that was mainly legacy when we had planned and tested on a modernized infrastructure
      • Ex. Windows servers were supposed to be deployed on Active Directory which was not available

AND …

Management continually asked …Where’s my Operations Center?

8 of 92

Agenda

  • Background

  • General Do’s and Don’ts

  • Product Specific Implementations

  • Configuration Management of EM Systems

  • Correlation Composer

9 of 92

Organizational Do’s and Don’ts

  • Do – Make sure that Management fully understands what they are committing to
    • Make sure they understand the “people” commitment necessary

  • Do – Make sure Management understands what they are getting in each phase of the project

  • Do – Get agreements from Management in writing on staffing and responsibility issues

  • Don’t – Expect that Management will “get it” the first time
    • Make sure you have regular briefings leading up to deployment

  • Don’t – Let Management set expectations in their own mind. Ensure that their expectations match what you are delivering

  • Don’t – Assume that once Management commits to a multi-million dollar purchase they will champion the effort

10 of 92

User Community Do’s and Don’ts

  • Do – Start engaging end-users early and frequently
    • We did this but……..

  • Do – Expect resistance
    • Time
    • Prior tools
    • Trust

  • Don’t – Assume that users will understand the goals easily

  • Don’t – Underestimate the reluctance of users to use new tools – even simple tools

  • Don’t – Expect that all users will “get it” just because you do

  • Do – Find pain points that you can address from day one of the deployment

  • Do – Publish deployment strategy and follow-up from day one to the largest possible audience

  • Do – Focus on early adopters

11 of 92

Deployment Do’s and Don’ts

  • Do – take bite sized chunks

  • Do – Come up with something to deploy early that shows value
    • In our case this was upgrading the existing OVNNM’s, OVPI and deploying new instances of OVIS
      • In the case of OVIS - Don’t – assume the value will be recognized (previous slide!!)

  • Don’t – Deploy something early that you don’t fully understand or don’t agree on

    • PI deployment was doomed from the beginning
      • Wrong personnel on our side
      • Scaled incorrectly
      • Conflict with end-users on requirements

  • Don’t – Spend a lot of time in the lab “designing the system”

    • The real environment will often throw you too many curves
    • Focus on a limited set of results that you know will pay off on “day one”
    • “Toolbox” approach proved that requirements would come in once we deployed. That being the case, we should have deployed OVO much earlier!

12 of 92

Agenda

  • Background

  • General Do’s and Don’ts

  • Product Specific Implementations

  • Configuration Management of EM Systems

  • Correlation Composer

13 of 92

Specific Implementations

  • SLA Support
    • OpenView Internet Services

  • Configuration Management Database
    • OpenView Service Desk

  • Outages
    • OpenView Operations, Internet Services, and Service Desk

  • Metrics Collection and Presentation
    • OpenView Operations, Performance Manager, Reporter

  • Management Servers
    • OpenView Operations Unix and Windows

  • End-User View & Event Instructions
    • OpenView Operations Java Console

  • Management Deployment Strategy

14 of 92

OV Internet Services SLA Polling

Every Area has:

OV Internet Services Probe

OV Problem Diagnosis Probe

OV NDAOM Probe

ICD Function

Probes Provide:

Service Availability and Response Time (HTTP, DNS, EMAIL, …)

Path Availability and Response Time

[Diagram: probes distributed across the network areas]

Network 1: 39 pollers

Network 2: 25 pollers

Network 3: 25 pollers

Currently 2000+ pairs

Oops!!

15 of 92

OV Internet Services SLA Response Time and Availability Measurements

  • SLA Support
  • Configuration Management Database
  • Outages
  • Metrics Collection and Presentation
  • Management Servers
  • Network View
  • Event Instructions
  • OVO Java Console

16 of 92

OV Internet Services – Do’s and Don’ts

  • Don’t – underestimate complexity of setting up customer and service groups that make sense.
    • OVIS has an easy mechanism to add targets/groups however...
      • You may quickly realize that what you have in place does not make sense the larger it grows
      • Once the groups are setup, they are not easy to change!!

  • Do – Ensure you have a big enough machine if you go over 1000 pairs

  • Do - Build OVR reports for the OVIS administrator
    • This helps during baselining period to identify which pairs are a problem

  • Do – Set up DNS and Mail probes as early as possible

  • Don’t – Underestimate the work involved in keeping HTTP probes up to date with valid pages

  • Do – Set up OVO Scheduled Actions on Web Servers to see what pages are valid and are accessed by end-users

17 of 92

OV Internet Services Custom Probes – iperf probe

Do – Use Custom Probes

LogMessage("OVISTarget == "+OVISTarget());
OVISAvailable=false; // probe unavailable
Cmd="iperf.exe -c "+OVISTarget+" -u ";
WshShell=new ActiveXObject("WScript.Shell");
OVISStart("ResponseTime");
Exec=WshShell.Exec(Cmd);
sO=Exec.StdOut;
while (!sO.AtEndOfStream)
………….
LogMessage("Jitter size is "+Opt[7]);
OVISSetMetric("Jitter",Opt[7]);
………

  • Use Cases
  • iperf, NIS, File Shares, NFS
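The probe above elides the step between reading iperf's output and calling OVISSetMetric. A minimal Python sketch of that parsing step follows. It assumes the UDP summary line reports jitter as a number followed by "ms"; the sample line is illustrative rather than captured from this deployment, and `parse_jitter_ms` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed shape of an iperf UDP summary line (illustrative only).
SAMPLE_LINE = "[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec  0.015 ms    0/  893 (0%)"

# Jitter is taken to be a decimal number immediately followed by the unit "ms".
_JITTER_RE = re.compile(r"(\d+(?:\.\d+)?)\s+ms\b")


def parse_jitter_ms(output: str) -> Optional[float]:
    """Return the last jitter value (milliseconds) found in the output, or None."""
    jitter = None
    for line in output.splitlines():
        match = _JITTER_RE.search(line)
        if match:
            jitter = float(match.group(1))
    return jitter
```

Returning None when no jitter line is found lets the caller leave the probe marked unavailable, which mirrors the `OVISAvailable=false` default in the slide.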

18 of 92

OVIS Redundancy

  • Problem – Requirement to have redundancy on all systems

    • OVIS does not have a high-availability solution
    • Remote probes do not have a way to switch to a second OVIS Management Server
    • Replicate as little data as possible on a system running Internet Services, Reporter and Performance Manager
    • Minimal time to implement – bare bones solution

  • Solution – Manual Failover Process between two Management Servers running:
    • Windows Advanced Server 2000
    • SQL 2000
    • Openview Internet Services
    • Openview Reporter
    • Openview Performance Manager

19 of 92

OVIS Redundancy - cont

Step 1 – Create a virtual Nodename in DNS that will migrate between the two Management servers

Step 2 – Add the virtual IP address to the “Advanced TCP/IP Settings” tab in the Local Area Connection Properties configuration box on the Primary server

    • By adding the Virtual IP only to the Primary server, you allow only that server to conduct transactions with the OVIS Probes

    • Normal OVR collection and reporting is still conducted at scheduled times
      • This restricts the amount of data that has to be kept in synch between the two servers to a minimum

20 of 92

OVIS Redundancy - cont

Step 3 – Change the OVIS Services entry on the Backup server to Manual and stop the services

Step 4 – Setup Backup Plan on both servers

    • Consists of three SQL Server jobs run on the Primary server:
      • Full – Daily
      • Differential – Every four hours
      • Transaction Log – Every 15 Minutes

    • Backups are copied after each run to the Backup server

    • Scheduled Restore Job run on Backup server
      • Daily (after full backup run on primary)
      • Goal is to use only latest Differential and Transaction backups when doing the actual failover
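The backup plan above can be read as a restore chain: the latest full backup, then the latest differential taken after it, then every transaction log backup taken after that differential. A minimal Python sketch of that selection follows. The `Backup` record and `restore_chain` function are hypothetical names, and ordering by completion time is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    kind: str           # "full", "diff" or "log"
    finished: datetime  # completion time of the backup job


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Return the backups to restore, in order: latest full, latest later
    differential (if any), then every later transaction log backup."""
    fulls = [b for b in backups if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    full = max(fulls, key=lambda b: b.finished)
    chain = [full]
    base = full
    diffs = [b for b in backups if b.kind == "diff" and b.finished > full.finished]
    if diffs:
        base = max(diffs, key=lambda b: b.finished)
        chain.append(base)
    logs = sorted(
        (b for b in backups if b.kind == "log" and b.finished > base.finished),
        key=lambda b: b.finished,
    )
    return chain + logs
```

Because the scheduled job already restores the daily full backup on the Backup server, only the tail of this chain (the differential and the logs) would remain to be applied at failover time.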

21 of 92

OVIS Redundancy cont

  • Step 5 – Actual Failover – Primary Server
    • Virtual IP removed from Network Configuration on Primary server. OVIS probe transactions start queuing up on OVIS probes.
    • Run iopscollector.exe and iopsmaint.exe to flush probe data.
      • Verify each has completed in order before continuing
    • Stop OVIS services and change to Manual in Services configuration
    • Start SQL Enterprise Manager and expand to SQL Server Agent -> Jobs
      • Highlight “Reporter Backup-Transaction Log” and select “start job”

22 of 92

OVIS Redundancy - cont

  • Step 6 – Actual Failover – Backup server
    • Start SQL Server 2000 Enterprise Manager -> Query Analyzer -> Reporter
    • Open RestoreFrom <primary server> script and start by hitting F5
      • If primary server went down hard then run RestoreFrom <backup server>
    • Add Virtual IP to Network Settings
    • Change Services entry for OVIS to Auto and start the OVIS services.
    • Start OVIS Configuration Manager and verify that data is coming in

    • In SQL 2000 Enterprise Manager, drill down to SQL Server Agent -> Jobs, highlight “Reporter Backup-Full”, and then select “start job”

23 of 92

OV Service Desk Configuration Management Database

  • Current input/outputs
    • OVO, NNM, ET, CW2K, IronView, Marimba, McAfee, Legacy Databases

  • Input - Layered approach to building the CMDB
    • 2 NNM instances dedicated solely to continuous discovery of the network
      • This is the base for the CMDB. Nothing beyond this point gets into the CMDB unless NNM has discovered the device
    • CW2K and IronView are the next two layers to fill in data on network devices
    • Legacy Server Database, OVO and Marimba fill in Server layers
    • Legacy Workstation Database and Marimba provide data on workstations

  • Output
    • Data source for Aperture, Remedy Asset module, Security reports
    • Data Source for tracking of workstations that have been migrated to the Modernized network

  • Future
    • Source for by port/speed billing
    • Synchronization of all EM Tools

  • Pitfalls
    • Is NNM Discovering all areas?
    • Name resolution a huge problem across all the input sources
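The layered approach above can be sketched as a merge keyed on the NNM-discovered device list. This is a minimal Python sketch. The `build_cmdb` function, the attribute names, the rule that later layers only fill in unset attributes, and the lower-casing of names (a nod to the name resolution pitfall) are all assumptions for illustration, not the Service Desk import logic.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def build_cmdb(discovered: Iterable[str],
               layers: List[Tuple[str, Dict[str, Record]]]) -> Dict[str, Record]:
    """Merge attribute layers onto the set of NNM-discovered devices.

    Devices unknown to the discovery layer are ignored, and later layers
    only fill in attributes that earlier layers left unset.
    """
    cmdb: Dict[str, Record] = {name.lower(): {"source": "NNM"} for name in discovered}
    for _source, records in layers:
        for name, attrs in records.items():
            entry = cmdb.get(name.lower())
            if entry is None:
                continue  # not discovered by NNM, so it stays out of the CMDB
            for key, value in attrs.items():
                entry.setdefault(key, value)
    return cmdb
```

Usage would pass the layers in the order listed on the slide, for example CW2K and IronView first, then the legacy server database, OVO and Marimba.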

24 of 92

OV Service Desk Configuration Management

[Diagram: configuration management data flow. Recoverable labels: Network Devices, Desktops, UNIX Servers, WINTEL Servers, Security Devices, HPOU Operations, HPOW Operations, NNM / ET, CiscoWorks/IronView, Marimba / Supportsoft, Problem Diagnosis, NDAOM, Oracle, Data Repository, Service Desk, CM Database, Remedy, TMS, SLA Reports, EA Portal, Web Access, Op Access, Events]

CM Sources

  • NNM – Discovery of all in-scope devices (Network/Servers/Workstations)
  • IronView/CiscoWorks – Configuration/Inventory
  • Extended Topology – Configuration/Inventory
  • Marimba – Configuration/Inventory
  • McAfee – Version
  • OVOU/OVOW – Service Data
  • Legacy Server Databases

Diagram callouts (numbered 1 to 6 on the slide):

  • Data Consolidation into Service Desk
  • Events generated from CM changes if they are unauthorized
  • CM changes mapped to Remedy ticket changes to identify true events
  • Mapping of SD CM data to Remedy Asset Module

25 of 92

OV Service Desk Configuration Management Database

26 of 92

OV Service Desk Operator Java Console Drill Down

JSP Interface

IronView

NNM

Marimba

27 of 92

OV Service Desk Operator Java Console Drill Down

Foundry Switch Report

28 of 92

OV Service Desk Do’s and Don’ts

  • Don’t exclude OVSD because you don’t need all the components

    • In our case the Client has a huge investment in Remedy and that’s not going to change – BUT – we did need a CMDB
      • Remedy, Aperture, and home built databases were not going to go away
      • Long ramp-up time to pare down the number of sources
      • As new tools are deployed new sources become a CM source
      • CMDB portion of OVSD allowed us to import/export from/to an unlimited number of sources and provide an immediate consolidated source to users

  • Do – Take advantage of certain components if they fulfill a need

  • Do – Take advantage of the Data Exchange component

  • Do – Take advantage of the OV API’s

  • Don’t – Think that anything you do in OVSD is easy in the long run!!!

  • Do – Use a Web-Based Interface

    • Custom JSP interface is slowly becoming the Customer Facing mechanism for the entire Toolset along with the OVO Java Console
    • Goal is to provide pointers to every possible source of information on a device

29 of 92

Scheduled Outage Integration – Use Case

  • Scheduled outages must be accounted for in order for SLA measurements to be correct

    • All scheduled outage information enters the system via Remedy

    • OV Service Desk acts as the broker to the other OV products

    • Outage data pushed to OVIS to stop probing a given target

    • Outage data pushed to OVO to suppress events

[Diagram: Remedy -> Service Desk -> OVO and OVIS. Labels: Scheduled Outage Entered; WO Created; Outage information pushed to both OVO and OVIS; Events/Polling Suppressed; Data Repository; Crystal Reports; Operator Console]
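To illustrate why scheduled outages must be accounted for, here is a minimal Python sketch that excludes probe samples taken during an outage window from an availability figure. The data shapes and function names are hypothetical and are not OVIS or Service Desk APIs. Treating a target with no countable samples as fully available is also an assumption.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of a scheduled outage


def in_scheduled_outage(outages: Dict[str, List[Window]],
                        target: str, when: datetime) -> bool:
    """True if `when` falls inside any scheduled outage window for `target`."""
    return any(start <= when < end for start, end in outages.get(target, []))


def availability(samples: List[Tuple[datetime, bool]],
                 outages: Dict[str, List[Window]], target: str) -> float:
    """Fraction of successful probe samples, ignoring samples taken during outages."""
    counted = [ok for when, ok in samples
               if not in_scheduled_outage(outages, target, when)]
    if not counted:
        return 1.0  # nothing measured outside outages; treated as available here
    return sum(counted) / len(counted)
```

The same window test could gate event suppression: an event whose timestamp falls inside a window for its node would be dropped rather than shown to the operator.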

30 of 92

Mass Outages using OVOU

  • Problem – Series of outages in local datacenters over a several month period
    • Need to shut down all managed servers
    • Need to follow a set order bringing the servers down
    • Need to track the server going down and when it is completely down
    • Need to track when a server starts to come back up and when all services are started

  • Solution – Use OVO and external script to manage the process of:
    • Shutting down the servers
    • Tracking which servers are actually down
    • Tracking when servers start to come back up
    • Tracking when servers are completely back up

  • Considerations
    • Not all servers can be completely powered off
    • Need to avoid message storms
    • Management needs to know exactly what servers are down and track progress of the servers coming back up

31 of 92

Mass Outages using OVOU - cont

Step 1 - Create Application Group “Outage”

    • Create Applications to stop Solaris, Windows 2003/2000 and Windows NT servers
      • tsshutdn is used on 2000/2003 servers
      • ovntshutdown is used on NT nodes
        • Note NT servers do not completely power off
      • shutdown is used on Solaris servers

    • Create Applications to send a message from the server being shut down to start the outage

    • Create single Application that:
      • Logs the start time of the outage for each server
      • Starts external process on OVO Mgmt Svr that tracks progress of servers going down and up

    • Create Outage Message group

32 of 92

Mass Outages using OVOU - cont

33 of 92

Mass Outages using OVOU - cont

Step 2 - Create Node groups that contain the servers according to the time/platform that they belong to. For example:

    • 1630_NT
    • 1630_2000
    • 1630_Sol
    • 1730_NT
    • ……
      • Do this step rather than trying to find the nodes at the time of the outage
        • Less confusing!!
      • Could also use Node Hierarchy and make available to non-admin users
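A short Python sketch of how the node group names in Step 2 could be derived ahead of time from a shutdown plan. The input tuple layout and the `plan_node_groups` name are assumptions. The `<time>_<platform>` naming follows the examples on the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def plan_node_groups(servers: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group (node, shutdown time as 'HHMM', platform) tuples into
    '<HHMM>_<platform>' node groups, such as '1630_NT'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node, shutdown_time, platform in servers:
        groups[f"{shutdown_time}_{platform}"].append(node)
    # Sorted group names put the earliest shutdown slot first.
    return {name: sorted(nodes) for name, nodes in sorted(groups.items())}
```

Building the groups from a reviewed list before the outage matches the slide's advice not to hunt for nodes on the day.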

34 of 92

Mass Outages using OVOU - cont

Step 3 - Create external process (mon_outage) on OVO

This is called by the OVO application created in Step 1. The process does the following:

    • Monitor agent status of managed node by running opcragt against the node every 25 seconds
      • Send Major message when agent no longer available

    • Monitor down status of managed node via ICMP request
      • Send Critical message when node no longer answers request

    • Wakes up every 60 seconds and checks the status of the node via ICMP request
      • Send Minor message when server is reachable again

    • Monitors agent status running opcragt every 25 seconds
      • Sends Normal message when opcragt is successful
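The four phases above form a small state machine. The Python sketch below is not the mon_outage script itself: the agent check (opcragt in the slide) and the ICMP check are passed in as callables so that no command-line flags are assumed, and the 25-second interval for the ICMP "going down" phase is an assumption because the slide does not state it.

```python
import time
from typing import Callable, List, Tuple

# (check to run, result that ends the phase, poll interval in seconds, severity)
PHASES: List[Tuple[str, bool, int, str]] = [
    ("agent", False, 25, "major"),     # agent no longer available
    ("ping",  False, 25, "critical"),  # node no longer answers ICMP (interval assumed)
    ("ping",  True,  60, "minor"),     # node reachable again
    ("agent", True,  25, "normal"),    # agent answers again
]


def track_outage(node: str,
                 agent_ok: Callable[[str], bool],
                 ping_ok: Callable[[str], bool],
                 send: Callable[[str, str], None],
                 sleep: Callable[[float], None] = time.sleep) -> None:
    """Walk one node through down and up phases, sending one message per phase."""
    for check, wanted, interval, severity in PHASES:
        probe = agent_ok if check == "agent" else ping_ok
        while bool(probe(node)) != wanted:
            sleep(interval)
        send(node, severity)
```

Sending each message with the same message key for the node would keep a single message per server in the console, as described on the following slide.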

35 of 92

Mass Outages using OVOU - cont

Why did we do it this way?

    • Detach from OVO in case of an outage on OVO

    • Allow for non-managed nodes to be tracked as well
      • Run mon_outage against a file containing nodenames – opcragt always fails, but ICMP request does not

    • Some of the outages are up to 24 hours
      • Did not want to hold onto a message that long

    • Uses single Message Group “Outage” with Message Keys. Only one message is in the console at a given time for a server.
      • Tracking is easy – If it is not Green………

    • Low overhead
      • Put together very quickly with nothing deployed to the managed node

36 of 92

OV Performance Manager - Problem

  • Problem - Need a graph in the Operations Center showing status of all mail server queues
      • Exchange Information Stores
      • Exchange Connector Servers
      • Sendmail Servers

    • No graph in place currently met this requirement
    • No Sendmail metric collection out-of-the box
    • MTA queues were measured but not stored
    • Screen Real Estate on Operation Center Video Wall is an issue
    • No notion of overall mail queues

  • Solution – Web-based graph of the queue sizes built using OVPM, DSI, OVO opctranm, and Exchange SPI

*Developed by HPC&I

37 of 92

OV Performance Manager - Steps

  • Exchange

    • Configure MTA Work Queue monitoring on OVOW
      • Store metrics
    • Create OVPM graph template for MTA and Connector MTA Queues

  • Sendmail
    • Create script to retrieve Sendmail queue lengths from Solaris servers
    • Create DSI measurement configuration file for Sendmail servers
    • Create OVPM graph template for Sendmail Queues

  • OpenView Performance Manager
    • Create multi-frame html file to display all three graphs in the same browser window.

38 of 92

OV Performance Manager

Step 1 - Exchange - Configure MTA Work Queue monitoring on OVOW

Need to turn on DSI collection

39 of 92

OV Performance Manager

Step 2 – Exchange

    • Create OVPM graph template for MTA and Connector MTA Queues

FAMILY: Exchange Mail Queues

GRAPH: Exchange MTA QUEUES

DESCRIPTION: Exchange MTA QUEUES

GRAPHTITLE: Exchange MTA QUEUES

YAXISTITLE: Messages

GRAPHBACKGROUND: None

JAVAGRAPHS: Yes

GRAPHMULTIPLEGRAPHS: Yes

GRAPHTYPE: bar

STACKED:

DATARANGE: 4 Hours

ENDDATE: now

GRAPHMETRICSPERGRAPH: 21

AUTOFRESH:

POINTSEVERY: auto

DSN: S2

DATASOURCE: CODA

SYSTEMNAME: exch_svr_1

CLASS: EA:Ex55MTAWorkQUEUE

METRIC: WorkQueueLength

LABEL: exch_svr1

COLOR: Orange

MARKER: marble

…………………..

This was created in previous step

40 of 92

OV Performance Manager

Step 3 - Sendmail - mailqchk.sh - Run on Managed Node

# Requires ksh or bash for the [[ ... ]] pattern test below.
# Prints "<hostname> <timestamp> <queue length>" on success.
SENDMAIL=/usr/lib/sendmail
# Ask sendmail for its version macro ($Z) in address test mode.
SENDMAILVER=`echo \\$Z | ${SENDMAIL} -bt -d0 | tail -1 | cut -f 2 -d ' '`
# Determine the current sendmail mail queue length.
QueueHeader=`${SENDMAIL} -bp | head -1 | sed -e "s/(/ /"`
if [[ "${SENDMAILVER}" = 8.1?.* ]]; then   # newer sendmail versions have a different output format
    QueueLength=`echo $QueueHeader | awk '/empty/ {print 0}; /request/ {print $2}'`
else
    QueueLength=`echo $QueueHeader | awk '/empty/ {print 0}; /request/ {print $3}'`
fi
if [ -n "$QueueLength" ]
then
    echo `hostname` `date +%m.%d-%H:%M:%S` ${QueueLength}
    exit 0
else
    echo Error: `hostname` `date +%m.%d-%H:%M:%S`
    exit 1
fi

41 of 92

OV Performance Manager

Step 4 - Sendmail – Create mailqsget.sh which collects and logs metrics

sub() {
……………
JOBFILE="/tmp/mailqsget.job"
MAILQCHK="/var/opt/OV/bin/OpC/cmds/mailqchk.sh"
………..
echo $fqnode > $JOBFILE
echo "!" $MAILQCHK >> $JOBFILE
ret=`$OPCTRANM -t $TIMEOUT $JOBFILE 2>&1`   ### opctranm starts mailqchk.sh on the sendmail svr
……….
Line=`echo $ret | grep "^$node"`   ######## Grab output of mailqchk.sh
……….
if [ $? -eq 0 ] ; then
    qlen=`echo $Line | cut -f 3 -d ' '`   ######### Grab Queue length
else
    qlen="-1"
fi
………..
dsiinput="$dsiinput $qlen"
………
echo $dsiinput
……….
}
……….
sub | /opt/perf/bin/dsilog /var/opt/perf/datafiles/mailqs MAILQS >>$DSILOG 2>&1
######## Last line pipes stdout of sub into stdin of the dsilog program

42 of 92

OV Performance Manager

Step 5 - Sendmail - Create DSI measurement configuration file for Sendmail servers – mailqs.sp

  • This is compiled by running sdlcomp and creates three files in /var/opt/perf/databases
      • Mailqs
      • Mailqs.MAILQS
      • Mailqs.desc

CLASS MAILQS = 10001

INDEX BY DAY

MAX INDEXES 30

ROLL BY DAY

RECORDS PER HOUR 12;

METRICS

Sendmail_svr_1 = 101

PRECISION 1;

Sendmail_svr_2 = 102

PRECISION 1;

………………

43 of 92

OV Performance Manager

Step 6 - OVPM Server - Create multi-frame html file to display all three graphs

<html>
<head><title>Multi-Pane Mail Graphs</title></head>
<frameset framespacing="2" rows="40%,20%,40%">
<frame name="F1" scrolling="no" marginwidth="0" marginheight="0" src="http://ovis.com/hpov_iops/cgi-bin/analyzer.exe?-GRAPHTEMPLATE:+ExchangeMailQueues+-GRAPH:+&quot;Exchange+MTA+Queues&quot;">
<frame name="F2" scrolling="no" marginwidth="0" marginheight="0" src="http://ovis.com/hpov_iops/cgi-bin/analyzer.exe?-GRAPHTEMPLATE:+ExchangeMailQueues+-GRAPH:+&quot;Exchange+Connector+MTA+Queues&quot;">
<frame name="F3" scrolling="no" marginwidth="0" marginheight="0" src="http://ovis.com/hpov_iops/cgi-bin/analyzer.exe?-GRAPHTEMPLATE:+MailQueues+-GRAPH:+&quot;Sendmail+Queues&quot;">
<noframes><body><h1>No frames</h1></body></noframes>
</frameset>
</html>

44 of 92

OV Performance Manager – Mail Graph

45 of 92

OV Operations Unix or OV Operations Windows?

  • Do - Both!!

    • Both products have good and bad points

    • We use OVOW to provision the Windows nodes

    • Using flexible management policies Windows agents send messages directly to OVOU

    • This takes advantage of the strong knowledge that OVOW has in the Windows world, but still keeps all the operators on one console.

46 of 92

OV Operations Unix / OV Operations Windows

  • Problem – How do I keep OVOU in synch with the nodes in OVOW when using both systems?

  • Solution

    • Schedule an Action on OVOW server that uses ovpmutil to dump Service Map to a file in a URL space

      • ovpmutil cfg xml dnl "c:\inetpub\wwwroot\<your path>"

    • Schedule an Action on OVOU server to run a Perl script performing an HTTP GET that retrieves the service file from the OVOW server

    • A Perl script is then run which …

      • Uploads the OVOW Service Map into OVOU
      • Dumps all nodes in OVOU
      • Parses the OVOW Service Map file
      • Generates an event for any node in the OVOW Service Map that is not in OVOU
      • Generates an event for any Windows node in OVOU that is not in OVOW
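The comparison at the heart of the Perl script can be sketched as two set differences. This is a Python illustration, not the original script: how the node lists are obtained is out of scope, the platform labels are hypothetical, and comparing on the lower-cased short host name is an assumption.

```python
from typing import Dict, Iterable, List, Set


def _norm(name: str) -> str:
    """Compare on the lower-cased short host name (an assumption about naming)."""
    return name.strip().lower().split(".")[0]


def sync_events(ovow_nodes: Iterable[str], ovou_nodes: Dict[str, str]) -> List[str]:
    """ovou_nodes maps node name -> platform label; returns event texts."""
    ovow: Set[str] = {_norm(n) for n in ovow_nodes}
    ovou = {_norm(n): platform for n, platform in ovou_nodes.items()}
    events = [f"{n}: in OVOW Service Map but not in OVOU"
              for n in sorted(ovow - set(ovou))]
    events += [f"{n}: Windows node in OVOU but not in OVOW"
               for n, platform in sorted(ovou.items())
               if platform == "windows" and n not in ovow]
    return events
```

Non-Windows nodes in OVOU are deliberately ignored, since OVOW is only expected to know about the Windows servers.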

47 of 92

OV Operations Unix – OPC Internal Messages

  • Problem – How to manage environment where huge numbers of legacy servers exist that are not stable

  • Solution – Intercept OPC Internal messages at both the Agent and Mgmt Server level and make conditions for all repeating events

    • Assign all messages to OVO Admin initially

    • Over time split between Tier 1, 2 and 3

    • Make distinction between critical errors and those that can be fixed when time allows.

    • Identify problems that will NEVER be fixed on old servers!!!!
      • Good Composer task

Use OpC Number in Description

48 of 92

OVO Java Console Operator View

  • Do – Make the presentation to the operator the number one priority !

  • Do – Put yourself in the shoes of the operator!!!!

Because …

    • If the operators cannot make sense of the tool or …
    • If the operators cannot get what they want out of the tool or …
    • If the operators have to go to multiple tools to get an answer or …
    • If the operators’ platforms won’t support the tool then …

The operators are not going to use the tool, or in the best case they will not get what they need out of it

49 of 92

OVO Java Console Operator View

Do – Think about how you want to tie multiple groups of users together

Operators view in OVO linked to OSPF Areas

Loc 1, Loc 2, …

Network Infrastructure + Collection + Topology + Events == Operators View

NE/DC/ISS/EM all have the same view

[Diagram labels: OSPF Area 0; OSPF Area 1n++ (repeated); OSPF Area 0 + 1n; Poller; Server Farm; NNM Collection Station]

50 of 92

Message Instructions

  • Critical task from day one to have valid instructions that include:

    • Message Description, Thresholds, Troubleshooting Steps, Escalation Steps

  • Web-based

    • Include hyperlinks to other sources
      • Troubleshooting steps -> OVPM, OVR, OVIS
      • Escalation Steps ->Phone Listings, Outages

  • How are you going to account for the messages?

    • What events have instructions
    • What events have the same instructions
    • Do the severities make sense
      • Especially in conjunction with like events

51 of 92

Message Instructions

  • Our Approach

    • Use Message Type field in OVO to hold the Event ID

      • EA000X – Servers
      • EA500X – Network devices

    • Provide Application which runs from Java Console and points directly to the correct Message Instruction

      • Web Based

      • Stored in SupportSoft Knowledge Base
        • Searches can also be run against all instructions
        • Associated “Hits” are also displayed

52 of 92

Message Instructions

  • CM Controlled via Excel

  • Spreadsheet is published to Web via Macro

  • Event line item in Excel links to actual Event Webpage

53 of 92

Protected Enclave Mgmt

  • Do – develop a strategy ahead of time on how to manage protected areas

54 of 92

Protected Enclave Mgmt

Least to most capability

55 of 92

Agenda

  • Background

  • General Do’s and Don’ts

  • Product Specific Implementations

  • Configuration Management of EM Systems

  • Correlation Composer

56 of 92

Tying it all together - Or how do we keep ourselves sane?

  • Problem

    • No notion of CM in OVO

    • No easy way to see things like operator assignments, node assignments, nested profiles

    • No way to figure out if all the tools are synched up between systems as far as Nodes are concerned

    • Responsibility Matrix is impossible to figure out if you have lots of Node Groups/Message groups

    • Too many areas where Message Traffic can spiral out of control

57 of 92

Tying it all together – Do’s and Don’ts

  • Do – Think about how you can make use of the Web for reporting
    • Keep out of consoles where it makes sense
      • OVOU, OVOW, OVIS

  • Don’t - Create a life-cycle mess with CM and documentation

  • Do – Think about how you can merge all documentation into an on-line search engine and tie this in with existing Operations Center knowledge bases.

  • Do - find a way to allow for quick retrievals, updates and creation – We used SupportSoft as a Knowledge Base

  • Don’t – Overlook the complexity of keeping Nodes, Users, Events, etc straight once you’ve deployed

    • This was the area that gave us the most problems – even after giving it a lot of thought and having a plan in place

    • Synching up nodes across all EM systems was the area we underestimated the most – one year into the large deployment and we continue to struggle with this issue.

58 of 92

Tying it all together – Do’s and Don’ts - cont – Exception Reports

2. Do – Build exception reports on everything you can

59 of 92

Tying it all together – Do’s and Don’ts - cont - Synching Up EM Systems

  • Problem – How do we ensure that all systems have the correct nodes
  • Solution – Enforce a hierarchy

[Diagram: node synchronization hierarchy with numbered steps 1 to 8. Labels: OVO, Inventory report, Seed File, Reporting, ET, Filter, NNM, IV, Fdry Config Bkups, inFdryNotinNNM, CMDB, Events, HTTPget]

60 of 92

Tying it all together – Do’s and Don’ts - cont

2. Do – Take advantage of the EM systems to monitor themselves and other EM systems

    • OVR reports, OVPM DSI Collection, OVIS port availability

OVR report with summarized OVOU message rate over last 96 hrs

61 of 92

Tying it all together – Do’s and Don’ts - cont

  • Create OVR reports on OVO Message Rate for:
    • 4, 8, 12, 24, 48 hours
    • 1, 2, 3, 4, 7, 14, 21, 30 days
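As a rough illustration of what such reports summarize, the Python sketch below computes an average message rate over each of the listed windows from a list of message receive times. It is not an OVR report definition: the input shape, the window labels and the choice of messages per hour as the unit are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List

HOUR_WINDOWS = [4, 8, 12, 24, 48]
DAY_WINDOWS = [1, 2, 3, 4, 7, 14, 21, 30]


def message_rates(received: Iterable[datetime], now: datetime) -> Dict[str, float]:
    """Average messages per hour over each trailing window ending at `now`."""
    stamps: List[datetime] = list(received)
    windows = [(f"{h}h", timedelta(hours=h)) for h in HOUR_WINDOWS]
    windows += [(f"{d}d", timedelta(days=d)) for d in DAY_WINDOWS]
    rates: Dict[str, float] = {}
    for label, span in windows:
        count = sum(1 for t in stamps if now - span < t <= now)
        rates[label] = count / (span.total_seconds() / 3600)
    return rates
```

A short window whose rate sits far above the long-window rate is the kind of signal that points at a message storm in progress.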

62 of 92

Tying it all together – Do’s and Don’ts - cont

3. Don’t – Forget to monitor the individual EM systems

    • One error can cause huge problems if left undetected
    • We had three problems that were accounting for 80K messages a day that were undetected for a period of time

  • Problem
    • What’s going on under the hood during the initial deployment stages
      • Malformed template/policy messages
      • DNS resolution issues
      • IP address issues when monitoring switches

  • Solution
    • 1. Turn on MSG tracing and leave it on
      • OPC_TRACE TRUE
      • OPC_TRACE_AREA MSG
      • OPC_TRACE_TRUNC FALSE
    • 2. Create Template to monitor trace file for discarded messages
    • 3. Action can do DNS lookups, DB lookups, etc.
    • 4. No apparent overhead
      • Will need scheduled action to truncate trace file with opcsv -trace

63 of 92

Tying it all together – Do’s and Don’ts - cont

  • Problem – Need to know if OVIS SLA probes are having problems or maybe have bad targets

  • Solution – Use the Tool!!!
    • (Well, somewhat)
    • Create ASP page to query OVIS on current SLA levels.
    • If there is a problem area the first step is to see if the probes are the problem

  • We also have Web pages for:
    • Status of OVIS Pollers
    • Last time received
    • Distribution of various targets among pollers
    • Worst Performing pollers
    • The poller or the path to the target can just as easily be the problem – You have to manage that variable aggressively

64 of 92

Tying it all together – Do’s and Don’ts - cont

Problem – How to manage User Responsibilities

Solution – Build a repeatable process outside of OVO to manage responsibilities

Step 1 – Setup an Excel Spreadsheet to track assignments and use an Excel Macro to publish the spreadsheet to a Web Page

Step 2 – Decide on a clear distinction between the Tier levels (Sol-Adm-T2, Sol-Adm-T3)

Step 3 - Start out with higher level Message Groups and then break them down to smaller groups

65 of 92

Tying it all together – Do’s and Don’ts - cont

Step 4 - Get rid of all the Message Groups that don’t make sense and simplify

    • Gotcha – Don’t get rid of the Misc Message Group!! All messages with an invalid MsgGrp use this MsgGrp

Step 5 - Track various states of Message Groups

    • Approved
    • Need to be baselined
    • Deleted

Step 6 – Decide how you are going to classify your users

  • Tier Level
  • Function
  • Authority

66 of 92

OVO Operator Assignment

Tier Profiles(Users)

Step 7 – Break down application profiles so that they match your classification of users

  • Create profiles for
    • All Users
    • Network Tier 1
    • Network Tier 2
    • ………….

Step 8 – Now map your users and applications to each other

App Profiles

67 of 92

OVO Operator Assignment

Tier Profiles(Users)

Step 9 – Now do the same with your events. Develop classifications by:

  • Tier Levels
  • Responsibilities
  • …….

Step 10 – Map your Event Profiles to Users the same way you did with Applications

Event Profiles

68 of 92

OVO Operator Assignment

Step 11 – Now map your User Profiles to User Templates

  • This is just a One-to-one mapping

Step 12 – Make the actual assignments in the User Profile Bank

  • All items in Row One are created

Step 13 – Create the User Templates in the User Bank

  • Number at the beginning is important as the number simplifies use when many users exist

User Template Assignments

Tier Profiles(Users)

69 of 92

OVO Operator Assignment

Step 14 – Create the Users

A. Copy User Template

B. Change Data for User

  • Never Change the user again!!!!

  • What does all this buy us?
    • Simple way to see what is actually assigned
    • Simple way to walk through the effect of making a change.
    • Make changes at the lowest level
      • Ex. Want to assign new app to All Users – Only one template is changed
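The layering in Steps 7 to 14 can be pictured as three small lookup tables. This is a sketch in Python with invented profile, template, and user names; in OVO the real objects live in the User Profile Bank and User Bank, not in code.

```python
# Hypothetical names throughout.
APP_PROFILES = {
    "App-All-Users": {"Ping", "Traceroute"},
    "App-Net-T1": {"Show Interfaces"},
    "App-Net-T2": {"Show Interfaces", "Reset Port"},
}

# Each user template is only a list of lower-level profiles.
USER_TEMPLATES = {
    "01-Net-T1": ["App-All-Users", "App-Net-T1"],
    "02-Net-T2": ["App-All-Users", "App-Net-T2"],
}

# Users are copied from a template and then never edited directly.
USERS = {"alice": "01-Net-T1", "bob": "02-Net-T2"}


def applications_for(user):
    """Resolve a user's applications by walking user -> template -> profiles."""
    apps = set()
    for profile in USER_TEMPLATES[USERS[user]]:
        apps |= APP_PROFILES[profile]
    return apps


# One change at the lowest level reaches every user.
APP_PROFILES["App-All-Users"].add("DNS Lookup")
print(sorted(applications_for("alice")))
print(sorted(applications_for("bob")))
```

Because users only point at templates, walking through the effect of a change means reading one table instead of inspecting every user.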

User Profiles

App and Event Profiles

70 of 92

OVO Operator Assignment

  • Remember - Number at the beginning of the User Template is important in order to put the User Templates at the Beginning of the GUI display

    • Much easier to find
    • Avoids mistakes

71 of 92

Agenda

  • Background

  • General Do’s and Don’ts

  • Product Specific Implementations

  • Configuration Management of EM Systems

  • Correlation Composer

72 of 92

Correlation Composer – Problem/Solution

  • Problem
    • Request to add fields to the OVO Java Console in an effort to assist end-users in identifying and classifying devices
    • Some of the data is accessible from the CMDB JSP interface, but users would like the data in the event
    • Some of the data requested is easily accessible directly from OVO
      • Some of the data is available via Node Group assignments
    • Some of the data needs to be pulled from CMDB

  • Solution
    • Use Composer to access data from ECS datastore and from external data source

  • Considerations
    • Performance overhead of datastores
    • No way to get multiple elements associated with a single key from a datastore using Composer and built-in functions
    • Overhead of using external perl function
    • Size of Java Console display
      • What impact will the additional fields have?

73 of 92

Correlation Composer – General Outline

  • Create Directory Structure
  • Create OVO Node Groups for Customer / Special categories
  • Add any necessary fields to CMDB
  • Build OVO and CMDB reports that will be used as input
  • Modify CO.conf for new Custom Message Attributes (CMA)
  • Create namespace file

  • Create Scripts
    • Create script to build factstore
    • Create script to build datastore

  • Create Correlators
    • Create Enhance Correlator to perform Lookup to datastore
    • Create Enhance Correlator to perform Lookup to external source through perl function
    • Create Enhance Correlator to adhere to precedence order
    • Modify Original Message Text to include original source

74 of 92

  1. Create Directory Structure in line with ECS 3.3

    • /etc/opt/OV/share/conf/ecs/CIB/scripts – scripts to build datastores and factstores
    • /etc/opt/OV/share/conf/ecs/CIB/stores – repository for data and fact stores
    • /etc/opt/OV/share/conf/ecs/CIB/tmp – tmp directory to process input data to build stores
    • /etc/opt/OV/share/conf/ecs/CIB/drivers – drivers to test Correlator rules

      • Existing Directories

    • /etc/opt/OV/share/conf/ecs/CIB – Default repository for OVO and NNM fact and datastores. Also the location for the namespace file, in our case OVONameSpace.conf
    • /opt/OV/contrib/ecs/external/perl – location for perl function that is read in as well as flat file containing lookup data

Correlation Composer – Step 1

75 of 92

2. Determine input sources and data that you want to populate Custom Message Attributes (CMA) fields with

    • Create Node Groups representing the categories and services we want to use as CMAs (Cust A, Cust B, High Priority, DNS, AT, etc.) in the format of:
      • S_<customer> (Ex. S_CustA_Nt, S_CustB_Ux …….)

    • Build OVO report containing Nodename and all Node Groups that the node belongs to in the format of:
      • Nodename NodegroupA NodegroupB (Ex. Svr1 S_CustA S_CustB …)

    • Create additional field in CMDB that contains all services associated with a server
      • Location and Remedy Queue already existed

    • Generate CMDB report that has all OVO nodes and associated Locations, Remedy Queues, and OVO Services in the format of:
      • Nodename Location Remedy Queue Services
      • (Ex. Svr1 Bldg1 RemedyQueue1 File,Print )
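A sketch of how the two report formats above could be parsed, in Python for illustration. It assumes whitespace-separated columns with no spaces inside a field and services given as a comma-separated list, as in the examples; the function names are hypothetical.

```python
def parse_ovo_report(lines):
    """Map each node name to the list of node groups it belongs to."""
    groups_by_node = {}
    for line in lines:
        fields = line.split()
        if fields:
            groups_by_node[fields[0]] = fields[1:]
    return groups_by_node


def parse_cmdb_report(lines):
    """Map each node name to its location, Remedy queue, and services."""
    records = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip incomplete rows rather than guess
        node, location, queue, services = fields[:4]
        records[node] = {
            "location": location,
            "queue": queue,
            "services": services.split(","),
        }
    return records


print(parse_ovo_report(["Svr1 S_CustA S_CustB"]))
print(parse_cmdb_report(["Svr1 Bldg1 RemedyQueue1 File,Print"]))
```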

Correlation Composer – Step 2

76 of 92

Correlation Composer– Step 3

3. Modify $OV_BIN/CO.conf to set up CMAs

    • CMA_Location == Location of device
    • CMA_Net == Network device resides on
    • CMA_Cat == Special Category that the device falls into
    • CMA_Service == OVO defined services
    • CMA_Customer == Special Customer that the device supports
    • CMA_WatchList == Six Sigma High Priority servers
    • CMA_SWTicket == Remedy Queue

77 of 92

Correlation Composer– Step 4

4. Create $OV_CONF/ecs/CIB/OVONameSpace.conf

  • Defines names of all the individual factstores we are going to use
    • Intended use is not for developer mode, but this works very well for us since we have so many rules.

    • OVO_NNM.fs == NNM and Syslog
    • OVO_Security.fs == Firewall
    • OVO_OVIS.fs == OVIS
    • OVO_OVOU.fs == Solaris
    • OVO_OVOW.fs == Windows
    • OVO_Specific.fs == Node specific
    • OVO_OVOUAdm.fs == Mgmt Svr
    • OVO_Lookup.fs == Lookup rules
      • Lookup defined as accessing datastore or external data source

  • All factstores stored in $OV_CONF/ecs/CIB/stores

78 of 92

Correlation Composer– Step 5

5. Create script that builds ecs_comp.fs – $OV_CONF/ecs/CIB/scripts/fact_commit

  • Merge

  • Backup

  • Copy

  • Load
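The four stages above might look roughly like this. It is a Python sketch, not the original fact_commit script, and it makes two assumptions: that merging means concatenating the individual .fs files, and that the load step is supplied by the caller, since the command that reloads the store is not shown on the slide.

```python
import shutil
import tempfile
from pathlib import Path


def fact_commit(store_dir, target, load=None):
    """Merge *.fs files, back up the old target, copy the merge in, then load."""
    store_dir, target = Path(store_dir), Path(target)

    # Merge: concatenate the individual factstores in a stable order.
    merged = store_dir / "merged.tmp"
    with merged.open("w") as out:
        for part in sorted(store_dir.glob("*.fs")):
            if part.resolve() != target.resolve():
                out.write(part.read_text())

    # Backup: keep the previous factstore so a bad merge can be rolled back.
    if target.exists():
        shutil.copy2(target, target.with_name(target.name + ".bak"))

    # Copy: put the merged file in place.
    shutil.move(str(merged), str(target))

    # Load: site-specific; the caller supplies whatever reloads the engine.
    if load is not None:
        load(target)
    return target


# Exercise the function in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    stores = Path(tmp) / "stores"
    stores.mkdir()
    (stores / "OVO_NNM.fs").write_text("# NNM rules\n")
    (stores / "OVO_OVIS.fs").write_text("# OVIS rules\n")
    result = fact_commit(stores, Path(tmp) / "ecs_comp.fs").read_text()
print(result)
```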

79 of 92

Correlation Composer– Step 6

  6. Create script that builds ecs_comp.ds – data_commit

    • Retrieve and Parse CMDB report from Service Desk that contains Nodename, Location, Remedy Queue, and Services via perl httpget

    • Build $OV_CONTRIB/ecs/external/perl/teams.txt

      • Nodename Location Remedy Queue Services (Services in CSV format)

    • Retrieve and Parse OVO report that contains Nodename and all Node Group assignments

    • Build elements going into datastore based on Node Group name ( S_<customer> )

        • For each special customer
          • grep customer from OVO report
          • Build Add Fact line
          • Append onto new $OV_CONF/ecs/CIB/stores/ecs_comp.ds in the format of:

            • ADD DATA("CMA_CustAList" , ["svr1.com" , "svr2.com" , ….. ])
            • ADD DATA("CMA_CustBList" , ["svr3.com" , "svr4.com" , ….. ])

    • We have 7 separate categories ranging from 5 to 112 servers/network devices
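One possible shape for the loop that turns node group assignments into datastore lines, sketched in Python (the original data_commit script is not reproduced on the slide). It assumes the customer name is everything after the S_ prefix and that the input is a node-to-groups mapping; both the function name and the sample nodes are hypothetical.

```python
def build_add_data_lines(groups_by_node, prefix="S_"):
    """Return one ADD DATA line per customer node group."""
    nodes_by_customer = {}
    for node, groups in sorted(groups_by_node.items()):
        for group in groups:
            if group.startswith(prefix):
                customer = group[len(prefix):]
                nodes_by_customer.setdefault(customer, []).append(node)

    lines = []
    for customer, nodes in sorted(nodes_by_customer.items()):
        quoted = " , ".join(f'"{n}"' for n in nodes)
        lines.append(f'ADD DATA("CMA_{customer}List" , [{quoted}])')
    return lines


# Hypothetical node group assignments.
assignments = {
    "svr1.com": ["S_CustA"],
    "svr2.com": ["S_CustA", "Solaris"],
    "svr3.com": ["S_CustB"],
}
for line in build_add_data_lines(assignments):
    print(line)
```

Sorting both nodes and customers keeps the generated file stable between runs, which makes a diff against the backup copy meaningful.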

80 of 92

Correlation Composer– Step 6 – cont

  • $OV_CONF/ecs/CIB/scripts/data_commit script

  • Create Special Customer datastore entries

  • Verify

  • Copy

  • Load

  • Backup

81 of 92

Correlation Composer– Step 7

82 of 92

Correlation Composer– Step 8

8. Lookup To Datastore

  • Ignore if Forwarded

  • Set Constant Key
  • Type “Lookup”
  • Set text string that goes in CMA field

  • Do Lookup – Is the current Node name in the CMA_Cust1 list?

  • From Step 6

ADD DATA(“CMA_Cust1List” , [“svr1.com” , “svr2.com” , ….. ])

83 of 92

Correlation Composer– Step 8 - Cont

  • Alter Specification
    • Alter the specification of the alarm and put the Constant Value in the CMA_Customer CMA field
    • Callouts: _CMA_CustAList; Defined in Step 3; Evaluates to “Cust1”
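The logic of Step 8 reduces to a membership test followed by setting one field. The sketch below models it in Python with an event as a plain dictionary; Composer itself is configured through its GUI, so this only illustrates the decision being made. The "forwarded" key and the sample data are assumptions.

```python
# Stand-in for the datastore built in Step 6 (contents are illustrative).
DATASTORE = {
    "CMA_CustAList": ["svr1.com", "svr2.com"],
    "CMA_CustBList": ["svr3.com", "svr4.com"],
}


def enhance_customer(event, key, constant):
    """Set CMA_Customer to `constant` when the event's node is in list `key`."""
    if event.get("forwarded"):
        return event  # ignore events forwarded from another server
    if event["node"] in DATASTORE.get(key, []):
        event = dict(event, CMA_Customer=constant)
    return event


print(enhance_customer({"node": "svr1.com"}, "CMA_CustAList", "CustA"))
print(enhance_customer({"node": "svr3.com"}, "CMA_CustAList", "CustA"))
```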

84 of 92

Correlation Composer– Step 9

  • Step 9 - Access external data source via perl function:

  • Open file

  • Match

  • Set var for

    • Nodename
    • SWTicket
    • Location
    • Service
  • Return values
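The original external function is perl and is not reproduced on the slide. As a rough Python equivalent of the same open, match, set, and return sequence, assuming the teams.txt layout from Step 6 (Nodename, Location, Remedy Queue, Services, whitespace-separated):

```python
import os
import tempfile


def lookup_node(nodename, path):
    """Return (nodename, swticket, location, service) for the first matching row."""
    with open(path) as handle:  # the file is opened on every call
        for line in handle:
            fields = line.split()
            if len(fields) >= 4 and fields[0] == nodename:
                node, location, queue, services = fields[:4]
                return node, queue, location, services
    return None


# Exercise the function against a throwaway file with one made-up row.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Svr1 Bldg1 RemedyQueue1 File,Print\n")
result = lookup_node("Svr1", tmp.name)
missing = lookup_node("Svr9", tmp.name)
os.remove(tmp.name)
print(result)
```

Opening the file for every event keeps the function stateless, at the cost of the per-event overhead noted later in the deck.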

85 of 92

Correlation Composer – Step 9 - Cont

  • External Lookup

  • Set Key

  • Location == Index 4
  • Service == Index 5
  • SwTicket == Index 3

  • Debug Line that is put in Message Text when debugging

Get data in index #

Call Perl function

86 of 92

Correlation Composer– Step 9 - Cont

  • Alter Specification

87 of 92

Correlation Composer– Step 9 - Cont

  • Precedence

  • If Key matches AND it is not in the higher precedence category

Two keys in this case – one to match primary key and one to match secondary key
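The precedence rule can be read as "match the key, then make sure no higher-precedence list also contains the node". A small Python sketch of that test, with hypothetical list names and contents:

```python
# Ordered from highest to lowest precedence; names are hypothetical.
PRECEDENCE = ["CMA_PrimaryList", "CMA_SecondaryList"]

DATASTORE = {
    "CMA_PrimaryList": ["svr1.com"],
    "CMA_SecondaryList": ["svr1.com", "svr2.com"],
}


def rule_fires(node, key):
    """A rule for `key` fires only if no higher-precedence list holds the node."""
    if node not in DATASTORE[key]:
        return False
    higher = PRECEDENCE[:PRECEDENCE.index(key)]
    return not any(node in DATASTORE[h] for h in higher)


print(rule_fires("svr1.com", "CMA_SecondaryList"))
print(rule_fires("svr2.com", "CMA_SecondaryList"))
```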

88 of 92

Correlation Composer – Step 10

  • Do – 1. Check to see if an event has already gone through a Correlator
  • Do – 2. Capture original Message Source and append to Original Message Field
  • Do – 3. Change the new Message ID back to the Original Message ID
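The three checks can be sketched as follows, in Python with events as dictionaries. The "correlated" marker, the field names, and the sample event are assumptions made for illustration; the slide does not say how Composer records that an event has already passed through a Correlator.

```python
def restore_original(original, enhanced):
    """Apply the three Step 10 checks to an enhanced copy of an event."""
    # 1. Skip events that already carry the (hypothetical) marker.
    if original.get("correlated"):
        return original
    result = dict(enhanced)
    # 2. Append the original message source to the original message text.
    result["orig_text"] = f"{original['orig_text']} (source: {original['source']})"
    # 3. Put the original message ID back on the new event.
    result["msg_id"] = original["msg_id"]
    result["correlated"] = True
    return result


original = {"msg_id": "id-1", "orig_text": "Link down", "source": "NNM"}
enhanced = dict(original, msg_id="id-2", CMA_Location="Bldg1")
out = restore_original(original, enhanced)
print(out)
```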


89 of 92

Correlation Composer– Step 11

  • Sit back and have a beer

[Figure: Node_A through Node_I grouped under Location, Remedy Queue, Special Customer, Special Category, and Service]

90 of 92

Correlation Composer – Do’s and Don’ts

  • Do – Spend some time in the lab
    • Not as well documented as it should be
    • Lots of little nuances

  • Don’t - forget about other Management servers
    • Informational servers are not really an impact
    • Full Message Forwarding has to be taken into account
      • FORWARDED != TRUE

  • Do – Use the namespace.conf file
    • After a point you will have problems keeping everything straight if you have many rules

  • Do – Make sure you understand the precedence of Correlator Rules, Duplicate Message Suppression, and Message Keys.

91 of 92

Correlation Composer – Do’s and Don’ts - cont

  • Do – Monitor performance impact of External Perl function

    • File Open for every event that comes in has overhead
      • Sun Blade 100 – Backup when > 1000 messages in one minute
      • Sun 280R – Backup when > 2000 messages in one minute
      • Sun 880R (8 900 MHz CPUs, 32 GB Memory, SAN) – Backup when > 5000 messages in one minute
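One way to put a number on that overhead in the lab is to time the lookup in isolation and convert it to events per minute. The helper below is a Python sketch with a hypothetical name; it measures only the lookup call passed in, not the rest of the ECS pipeline, so treat the result as an upper bound.

```python
import time


def events_per_minute(lookup, sample_nodes, repeats=200):
    """Estimate how many lookups per minute `lookup` can sustain."""
    start = time.perf_counter()
    for _ in range(repeats):
        for node in sample_nodes:
            lookup(node)
    elapsed = time.perf_counter() - start
    calls = repeats * len(sample_nodes)
    return 60.0 * calls / elapsed if elapsed > 0 else float("inf")


# A do-nothing lookup, only to show the call shape.
rate = events_per_minute(lambda node: None, ["svr1.com", "svr2.com"])
print(f"{rate:.0f} lookups per minute")
```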

  • Don’t - forget about tracing
    • Lots of output, but invaluable during testing and troubleshooting

  • Do – Test with drivers or a dump of ECS events as you add more and more rules

92 of 92

Questions?

Tools enable the process, they are not the process

A Fool With a Tool is Still a Fool