3 of 92

Background

Outsourced Infrastructure

Large outsourcing effort with only a portion of the total infrastructure outsourced

Outsourcing effort covers Local and Campus networks
Three major “air-gapped” networks
2000+ Servers in three major Data Centers as well as “local” servers
User workstations in 100+ locations within a 50 Mile radius of the main campus

No existing Consolidated Operations Center
Client retained complete control over governance process for everything deployed
No coherent Change or Configuration Management process
350 SLA’s to start with (approx 160 today)

Our Task

Deploy a modernized Enterprise Management capability
Replace “Owner Monitoring” with “Operations Management”
Deployed across multiple networks
Function from a single Operations Center
Provide for SLA monitoring and reporting capabilities

AND Complete all deployments in 30 Months

- Piece of Cake – Right…….

4 of 92

Build Approach - Three Primary Builds

Build 1 – “Quick Hits”

OV Performance Insight
OV Network Node Manager 6.3
OV Internet Services 3.5

All three networks

Build 2

Foundry IronView
Cisco CiscoWorks

All three networks

Build 3

OV Operations Unix
OV Operations Windows
OV Network Node Manager 6.4.1
OV Extended Topology
OV Service desk
OV Service Information Portal
OV Reporter
OV Storage Area Manager
OV Performance Manager
SupportSoft

Main Network

Builds 4 - XX

Upgrade to NNM 7.01
Upgrade to NNM 7.5
Multiple upgrades to OVIS, OVR, OVPM
Upgrade to OVOW 7.2
Patch Upgrades
Deploy Build 3 servers to the other two networks
Upgrades to IronView and CiscoWorks

5 of 92

Architecture

OVPI

6 of 92

Data Flow

OVPI

7 of 92

Organizational Issues

Total Chaos!!

Multiple organizations responsible for each IT area
No clear understanding of what EM is within the delivery organization
No buy-in by the Delivery organizations
No direct access to servers
No dedicated Server SA’s
Failing SLA’s

Other Modernization projects were also way behind schedule

Impact: We had to account for an environment that was mainly legacy when we had planned and tested on a modernized infrastructure

Ex. Windows servers were supposed to be deployed on Active Directory which was not available

AND …

Management continually asked …Where’s my Operations Center?

8 of 92

Agenda

Background

General Do’s and Don’ts

Product Specific Implementations

Configuration Management of EM Systems

Correlation Composer

9 of 92

Organizational Do’s and Don’ts

Do – Make sure that Management fully understands what they are committing to

Make sure they understand the “people” commitment necessary

Do – Make sure Management understands what they are getting in each phase of the project

Do – Get agreements from Management in writing on staffing and responsibility issues

Don’t – Expect that Management will “get it” the first time

Make sure you have regular briefings leading up to deployment

Don’t – Let Management set expectations in their own mind. Ensure that their expectations match what you are delivering

Don’t – Assume that once Management commits to a multi-million dollar purchase they will champion the effort

10 of 92

User Community Do’s and Don’ts

Do – Start engaging end-users early and frequently

We did this but……..

Do – Expect resistance

Time
Prior tools
Trust

Don’t – Assume that users will understand the goals easily

Don’t - underestimate the reluctance of users to use new tools – even simple tools

Don’t - expect that all users will “Get it” just because you do

Do – Find pain points that you can address from day one of the deployment

Do – Publish deployment strategy and follow-up from day one to the largest possible audience

Do – Focus on early adopters

11 of 92

Deployment Do’s and Don’ts

Do – take bite sized chunks

Do – Come up with something to deploy early that shows value

In our case this was upgrading the existing OVNNM’s, OVPI and deploying new instances of OVIS

In the case of OVIS - Don’t – assume the value will be recognized (previous slide!!)

Don’t – Deploy something early that you don’t fully understand or don’t agree on

PI deployment was doomed from the beginning

Wrong personnel on our side
Scaled incorrectly
Conflict with end-users on requirements

Don’t – Spend a lot of time in the lab “designing the system”

The real environment will often throw you too many curves
Focus on a limited set of results that you know will pay off on “day one”
“Toolbox” approach proved that requirements would come in once we deployed. That being the case we should have deployed OVO much earlier !!!!!!

12 of 92

Agenda

Background

General Do’s and Don’ts

Product Specific Implementations

Configuration Management of EM Systems

Correlation Composer

13 of 92

Specific Implementations

SLA Support

Openview Internet Services

Configuration Management Database

Openview Service Desk

Outages

Openview Operations, Internet Services, and Service Desk

Metrics Collection and Presentation

Openview Operations, Performance Manager, Reporter

Management Servers

Openview Operations Unix and Windows

End-User View & Event Instructions

Openview Operations Java Console

Management Deployment Strategy

14 of 92

OV Internet Services SLA Polling

Every Area has:

OV Internet Services Probe

OV Problem Diagnosis Probe

OV NDAOM Probe

ICD Function

Probes Provide:

Service Availability and Response Time (HTTP, DNS, EMAIL, …)

Path Availability and Response Time

Probe

Network 1 == 39 Pollers

Network 2 == 25 Pollers

Network 3 == 25 Pollers

Currently 2000+ pairs

Oops!!

15 of 92

OV Internet Services SLA Response Time and Availability Measurements

SLA Support
Configuration Management Database
Outages
Metrics Collection and Presentation
Management Servers
Network View
Event Instructions
OVO Java Console

16 of 92

OV Internet Services – Do’s and Don’ts

Don’t – underestimate complexity of setting up customer and service groups that make sense.

OVIS has an easy mechanism to add targets/groups however...

You may quickly realize that what you have in place does not make sense the larger it grows
Once the groups are setup, they are not easy to change!!

Do – Ensure you have a big enough machine if you go over 1000 pairs

Do - Build OVR reports for the OVIS administrator

This helps during baselining period to identify which pairs are a problem

Do – Set up DNS and Mail probes as early as possible

Don’t – Underestimate the work involved in keeping HTTP probes up to date with valid pages

Do – Set up OVO Scheduled Actions on Web Servers to see what pages are valid and are accessed by end-users

17 of 92

OV Internet Services Custom Probes – iperf probe

Do – Use Custom Probes

LogMessage("OVISTarget == "+OVISTarget());
OVISAvailable=false; //probe unavailable

// Build the iperf client command (UDP mode) for the probe target
Cmd="iperf.exe -c "+OVISTarget+" -u ";
WshShell=new ActiveXObject("WScript.Shell");

// Start the response-time timer, then run iperf and read its output
OVISStart("ResponseTime");
Exec=WshShell.Exec(Cmd);
sO=Exec.StdOut;
while (!sO.AtEndOfStream)
………….
LogMessage("Jitter size is "+Opt[7]);
OVISSetMetric("Jitter",Opt[7]);
………

Use Cases
iperf, NIS, File Shares, NFS
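The parsing step that the probe excerpt elides can be sketched as follows. This is a minimal illustration in Python, not the probe's own JScript, and it assumes the report line format printed by iperf version 2 in UDP mode; the function name and the sample line are made up for the example.

```python
import re

# Matches the tail of an iperf 2 UDP report line, for example:
# "[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec   0.012 ms    0/  893 (0%)"
REPORT_RE = re.compile(r"bits/sec\s+([\d.]+)\s+ms\s+(\d+)/\s*(\d+)")

def parse_iperf_udp(output):
    """Return (jitter_ms, lost, total) from the last report line, or None."""
    result = None
    for line in output.splitlines():
        match = REPORT_RE.search(line)
        if match:
            result = (float(match.group(1)), int(match.group(2)), int(match.group(3)))
    return result

sample = "[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec   0.012 ms    0/  893 (0%)"
print(parse_iperf_udp(sample))
```

Returning None when no report line is found lets the caller leave the probe marked unavailable, in the same spirit as the initial OVISAvailable=false in the excerpt.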

18 of 92

OVIS Redundancy

Problem – Requirement to have redundancy on all systems

OVIS does not have a high-availability solution
Remote probes do not have a way to switch to a second OVIS Management Server
Replicate as little data as possible on a system running Internet Services, Reporter and Performance Manager
Minimal time to implement – bare bones solution

Solution – Manual Failover Process between two Management Servers running:

Windows Advanced Server 2000
SQL 2000
Openview Internet Services
Openview Reporter
Openview Performance Manager

19 of 92

OVIS Redundancy - cont

Step 1 – Create a virtual Nodename in DNS that will migrate between the two Management servers

Step 2 – Add the virtual IP address to the “Advanced TCP/IP Settings” tab in the Local Area Connection Properties configuration box on the Primary server

By adding the Virtual IP only to the Primary server, you allow only that server to conduct transactions with the OVIS Probes

Normal OVR collection and reporting is still conducted at scheduled times

This keeps the amount of data that has to be kept in sync between the two servers to a minimum

20 of 92

OVIS Redundancy - cont

Step 3 – Change the OVIS Services entry on the Backup server to Manual and stop the services

Step 4 – Setup Backup Plan on both servers

Consists of three SQL Server jobs run on the Primary server –

Full – Daily
Differential – Every four hours
Transaction Log – Every 15 Minutes

Backups are copied after each run to the Backup server

Scheduled Restore Job run on Backup server

Daily (after full backup run on primary)
Goal is to use only latest Differential and Transaction backups when doing the actual failover
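The restore order implied by this backup plan can be sketched as below. It is an illustration only: the tuple layout and function name are assumptions, and it simply encodes the usual SQL Server sequence of latest full backup, then the latest differential taken after it, then every later transaction log in order.

```python
from datetime import datetime

def restore_chain(backups):
    """backups: list of (kind, finished_at), kind in {'full', 'diff', 'log'}.
    Returns the backups to restore, in the order they must be applied."""
    fulls = [b for b in backups if b[0] == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    full = max(fulls, key=lambda b: b[1])
    chain = [full]
    base_time = full[1]
    # Only the most recent differential after the full backup is needed.
    diffs = [b for b in backups if b[0] == "diff" and b[1] > base_time]
    if diffs:
        diff = max(diffs, key=lambda b: b[1])
        chain.append(diff)
        base_time = diff[1]
    # Every transaction log after that point must be applied, oldest first.
    logs = sorted((b for b in backups if b[0] == "log" and b[1] > base_time),
                  key=lambda b: b[1])
    return chain + logs
```

Because the Backup server already restores the full backup daily, only the tail of this chain (latest differential plus later logs) remains to be applied at failover time.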

21 of 92

OVIS Redundancy cont

Step 5 – Actual Failover – Primary Server

Virtual IP removed from Network Configuration on Primary server. OVIS probe transactions start queuing up on OVIS probes.
Run iopscollector.exe and iopsmaint.exe to flush probe data.

Verify each has completed in order before continuing

Stop OVIS services and change to Manual in Services configuration
Start SQL Enterprise Manager and expand to SQL Server Agent -> Jobs

Highlight “Reporter Backup-Transaction Log” and select “start job”

22 of 92

OVIS Redundancy - cont

Step 6 – Actual Failover – Backup server

Start SQL Server 2000 Enterprise Manager -> Query Analyzer -> Reporter
Open RestoreFrom <primary server> script and start by hitting F5

If primary server went down hard then run RestoreFrom <backup server>

Add Virtual IP to Network Settings
Change Services entry for OVIS to Auto and start the OVIS services.
Start OVIS Configuration Manager and verify that data is coming in

In SQL 2000 Enterprise Manager, drill down to SQL Server Agent -> Jobs, highlight “Reporter Backup-Full” and then select “start job”

23 of 92

OV Service Desk Configuration Management Database

Current input/outputs

OVO, NNM, ET, CW2K, IronView, Marimba, McAfee, Legacy Databases

Input - Layered approach to building the CMDB

2 NNM instances dedicated solely to continuous discovery of the network

This is the base for the CMDB. Everything from this point will not get into the CMDB unless NNM has discovered the device

CW2K and IronView are the next two layers to fill in data on network devices
Legacy Server Database, OVO and Marimba fill in Server layers
Legacy Workstation Database and Marimba provide data on workstations

Output

Data source for Aperture, Remedy Asset module, Security reports
Data Source for tracking of workstations that have been migrated to the Modernized network

Future

Source for by port/speed billing
Synchronization of all EM Tools

Pitfalls

Is NNM Discovering all areas?
Name resolution a huge problem across all the input sources
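The layered approach can be sketched roughly as follows. Everything here is hypothetical (function names, record layout, the rule that earlier layers win on conflicts); the one point taken from the slide is that a record only enters the CMDB if NNM has discovered the device. The crude name normalization is a nod to the name-resolution pitfall and assumes hostnames rather than IP addresses.

```python
def normalize(name):
    """Reduce a hostname to a comparable key: lower case, domain stripped."""
    return name.strip().lower().split(".")[0]

def build_cmdb(nnm_nodes, layers):
    """nnm_nodes: names discovered by NNM (the base layer).
    layers: ordered list of (source_name, {hostname: {attribute: value}}).
    Returns (cmdb, rejected); rejected holds records with no NNM match."""
    cmdb = {normalize(n): {"name": n} for n in nnm_nodes}
    rejected = []
    for source, records in layers:
        for host, attrs in records.items():
            key = normalize(host)
            if key not in cmdb:
                rejected.append((source, host))   # NNM never saw this device
                continue
            for attr, value in attrs.items():
                cmdb[key].setdefault(attr, value)  # assumption: earlier layers win
    return cmdb, rejected
```

The rejected list doubles as a check on the first pitfall: a long list from one source suggests an area NNM is not discovering.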

24 of 92

OV Service Desk Configuration Management

Diagram labels: Network Devices, Desktops, UNIX Servers, WINTEL Servers, Security Devices, NNM / ET, CiscoWorks/IronView, Marimba / Supportsoft, HPOU Operations, HPOW Operations, Problem Diagnosis, NDAOM, Service Desk, Data Repository, Oracle, Access, Web Access, EA Portal, Remedy, SLA Reports, TMS, Events, CM Database

CM Sources

NNM – Discovery of all in-scope devices (Network/Servers/Workstations)
IronView/CiscoWorks – Configuration/Inventory
Extended Topology – Configuration/Inventory
Marimba – Configuration/Inventory
McAfee – Version
OVOU/OVOW – Service Data
Legacy Server Databases

Data Consolidation into Service Desk

Events generated from CM changes if they are unauthorized

CM changes mapped to Remedy ticket changes to identify true events

Mapping of SD CM data to Remedy Asset Module

25 of 92

OV Service Desk Configuration Management Database

26 of 92

OV Service Desk Operator Java Console Drill Down

JSP Interface

IronView

NNM

Marimba

27 of 92

OV Service Desk Operator Java Console Drill Down

Foundry Switch Report

28 of 92

OV Service Desk Do’s and Don’ts

Don’t exclude OVSD because you don’t need all the components

In our case the Client has a huge investment in Remedy and that’s not going to change – BUT – we did need a CMDB

Remedy, Aperture, and home built databases were not going to go away
Long ramp-up time to pare down the number of sources
As new tools are deployed new sources become a CM source
CMDB portion of OVSD allowed us to import/export from/to an unlimited number of sources and provide an immediate consolidated source to users

Do – Take advantage of certain components if they fulfill a need

Do – Take advantage of the Data Exchange component

Do – Take advantage of the OV API’s

Don’t – Think that anything you do in OVSD is easy in the long run!!!

Do – Use a Web-Based Interface

Custom JSP interface is slowly becoming the Customer Facing mechanism for the entire Toolset along with the OVO Java Console
Goal is to provide pointers to every possible source of information on a device

29 of 92

Scheduled Outage Integration – Use Case

Scheduled outages must be accounted for in order for SLA measurements to be correct

All scheduled outage information enters the system via Remedy

OV Service Desk acts as the broker to the other OV products

Outage data pushed to OVIS to stop probing a given target

Outage data pushed to OVO to suppress events

Diagram: Scheduled Outage Entered (Remedy) -> Service Desk -> Outage information pushed to both OVO and OVIS -> Events/Polling Suppressed, WO Created. Other labels: Data Repository, Crystal Reports, Operator Console.
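Why scheduled outages matter for the SLA figures can be shown with a small calculation. This is a sketch under assumed data shapes, not how OVIS computes availability: here probe samples that fall inside a scheduled window are simply left out, which has the same effect on the result as stopping the probe for that target.

```python
def availability(samples, outages):
    """samples: list of (timestamp, is_up) probe results.
    outages: list of (start, end) scheduled windows, end exclusive.
    Returns percent availability over samples outside every window,
    or None when no sample counts."""
    def scheduled(ts):
        return any(start <= ts < end for start, end in outages)

    counted = [up for ts, up in samples if not scheduled(ts)]
    if not counted:
        return None
    return 100.0 * sum(counted) / len(counted)

samples = [(0, True), (5, False), (10, False), (15, True)]
print(availability(samples, []))           # outage not accounted for
print(availability(samples, [(5, 15)]))    # outage window excluded
```

With the same four samples, ignoring the window halves the reported availability, which is exactly the kind of false SLA miss the integration is meant to prevent.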

30 of 92

Mass Outages using OVOU

Problem – Series of outages in local datacenters over a several month period

Need to shut down all managed servers
Need to follow a set order bringing the servers down
Need to track the server going down and when it is completely down
Need to track when a server starts to come back up and when all services are started

Solution – Use OVO and external script to manage the process of:

Shutting down the servers
Tracking which servers are actually down
Tracking when servers start to come back up
Tracking when servers are completely back up

Considerations –

Not all servers can be completely powered off
Need to avoid message storms
Management needs to know exactly what servers are down and track progress of the servers coming back up

31 of 92

Mass Outages using OVOU - cont

Step 1 - Create Application Group “Outage”

Create Applications to stop Solaris, Windows 2003/2000 and Windows NT servers

tsshutdn is used on 2000/2003 servers
ovntshutdown is used on NT nodes

Note NT servers do not completely power off

shutdown is used on Solaris servers

Create Applications to send message from server being shutdown to start the outage

Create single Application that:

Logs the start time of the outage for each server
Starts external process on OVO Mgmt Svr that tracks progress of servers going down and up

Create Outage Message group

32 of 92

Mass Outages using OVOU - cont

33 of 92

Mass Outages using OVOU - cont

Step 2 - Create Node groups that contain the servers according to the time/platform that they belong to. For example:

1630_NT
1630_2000
1630_Sol
1730_NT
……

Do this step rather than trying to find the nodes at the time of the outage

Less confusing!!

Could also use Node Hierarchy and make available to non-admin users

34 of 92

Mass Outages using OVOU - cont

Step 3 - Create external process (mon_outage) on OVO

- This is called by the OVO application created in Step 1. The process does the following:

Monitor agent status of managed node by running opcragt against the node every 25 seconds

Send Major message when agent no longer available

Monitor down status of managed node via ICMP request

Send Critical message when node no longer answers request

Wakes up every 60 seconds and checks status of node via ICMP request

Send Minor message when server is reachable again

Monitors agent status running opcragt every 25 seconds

Sends Normal message when opcragt is successful
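The four phases of mon_outage can be sketched as a small state machine. This is not the original script: the agent check (opcragt in the source), the ICMP check and the message sender are passed in as caller-supplied functions, and the 25-second interval in the second phase is an assumption, since the slide gives no interval for it.

```python
import time

def track_outage(node, agent_ok, ping_ok, send, sleep=time.sleep):
    """Follow one node down and back up, sending one message per transition.
    agent_ok(node) and ping_ok(node) return True or False; send(severity, text)
    delivers the message. All three are supplied by the caller."""
    while agent_ok(node):            # phase 1: wait for the agent to stop
        sleep(25)
    send("major", f"{node}: agent no longer available")
    while ping_ok(node):             # phase 2: wait for the node to go down
        sleep(25)                    # interval assumed, not given in the source
    send("critical", f"{node}: node no longer answers ICMP request")
    while not ping_ok(node):         # phase 3: wait for the node to come back
        sleep(60)
    send("minor", f"{node}: node reachable again")
    while not agent_ok(node):        # phase 4: wait for the agent to start
        sleep(25)
    send("normal", f"{node}: agent running")
```

Passing the checks and the sender in as arguments keeps the loop testable without a live node.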

35 of 92

Mass Outages using OVOU - cont

Why did we do it this way?

Detach from OVO in case of an outage on OVO

Allow for non-managed nodes to be tracked as well

Run mon_outage against a file containing nodenames – opcragt always fails, but ICMP request does not

Some of the outages are up to 24 hours

Did not want to hold onto a message that long

Uses single Message Group “Outage” with Message Keys. Only one message is in the console at a given time for a server.

Tracking is easy – If it is not Green………

Low overhead

Put together very quickly with nothing deployed to the managed node

36 of 92

OV Performance Manager - Problem

Problem - Need a graph in the Operations Center showing status of all mail server queues

Exchange Information Stores
Exchange Connector Servers
Sendmail Servers

No graph in place currently met this requirement
No Sendmail metric collection out-of-the box
MTA queues were measured but not stored
Screen Real Estate on Operation Center Video Wall is an issue
No notion of overall mail queues

Solution – Web-based graph of the queue sizes built using OVPM, DSI, OVO opctranm, and Exchange SPI

*Developed by HPC&I

37 of 92

OV Performance Manager - Steps

Exchange

Configure MTA Work Queue monitoring on OVOW

Store metrics

Create OVPM graph template for MTA and Connector MTA Queues

Sendmail

Create script to retrieve Sendmail queue lengths from Solaris servers
Create DSI measurement configuration file for Sendmail servers
Create OVPM graph template for MTA and Connector MTA Queues

OpenView Performance Manager

Create multi-frame html file to display all three graphs in the same browser window.

38 of 92

OV Performance Manager

Step 1 - Exchange - Configure MTA Work Queue monitoring on OVOW

Need to turn on DSI collection

39 of 92

OV Performance Manager

Step 2 – Exchange

Create OVPM graph template for MTA and Connector MTA Queues

FAMILY: Exchange Mail Queues

GRAPH: Exchange MTA QUEUES

DESCRIPTION: Exchange MTA QUEUES

GRAPHTITLE: Exchange MTA QUEUES

YAXISTITLE: Messages

GRAPHBACKGROUND: None

JAVAGRAPHS: Yes

GRAPHMULTIPLEGRAPHS: Yes

GRAPHTYPE: bar

STACKED:

DATARANGE: 4 Hours

ENDDATE: now

GRAPHMETRICSPERGRAPH: 21

AUTOFRESH:

POINTSEVERY: auto

DSN: S2

DATASOURCE: CODA

SYSTEMNAME: exch_svr_1

CLASS: EA:Ex55MTAWorkQUEUE

METRIC: WorkQueueLength

LABEL: exch_svr1

COLOR: Orange

MARKER: marble

…………………..

This was created in previous step

40 of 92

OV Performance Manager

Step 3 - Sendmail - mailqchk.sh - Run on Managed Node

# The [[ ... ]] pattern test below needs ksh or bash, not a plain Bourne shell

SENDMAIL=/usr/lib/sendmail

SENDMAILVER=`echo \\$Z | ${SENDMAIL} -bt -d0 | tail -1 | head -n 1 | cut -f 2 -d ' '`

# Determine the current sendmail mail queue length.
QueueHeader=`${SENDMAIL} -bp | head -1 | sed -e "s/(/ /"`

if [[ "${SENDMAILVER}" = 8.1?.* ]]; then # newer sendmail version has other output format
  QueueLength=`echo $QueueHeader | awk ' /empty/ {print 0}; /request/ {print $2 }' -`
else
  QueueLength=`echo $QueueHeader | awk ' /empty/ {print 0}; /request/ {print $3 }' -`
fi

# Print "<hostname> <timestamp> <queue length>", or an error line if no length was found
if [ -n "$QueueLength" ]
then
  echo `hostname` `date +%m.%d-%H:%M:%S` ${QueueLength}
  exit 0
else
  echo Error: `hostname` `date +%m.%d-%H:%M:%S`
  exit 1
fi

41 of 92

OV Performance Manager

Step 4 - Sendmail – Create mailqsget.sh which collects and logs metrics

sub() {
……………
JOBFILE="/tmp/mailqsget.job"
MAILQCHK="/var/opt/OV/bin/OpC/cmds/mailqchk.sh"
………..
echo $fqnode > $JOBFILE
echo "!" $MAILQCHK >>$JOBFILE
ret=`$OPCTRANM -t $TIMEOUT $JOBFILE 2>&1` ### opctranm to start mailqchk.sh on sendmail svr
……….
Line=`echo $ret | grep "^$node"` ######## Grab output of mailqchk.sh
……….
if [ $? -eq 0 ] ; then
qlen=`echo $Line | cut -f 3 -d ' '` ######### Grab Queue length
else
qlen="-1"
………..
dsiinput="$dsiinput $qlen"
………
echo $dsiinput
……….
}
……….
sub | /opt/perf/bin/dsilog /var/opt/perf/datafiles/mailqs MAILQS >>$DSILOG 2>&1
######## Last line pipes stdout of sub into stdin of dsilog program

42 of 92

OV Performance Manager

Step 5 - Sendmail - Create DSI measurement configuration file for Sendmail servers – mailqs.sp

This is compiled by running sdlcomp and creates three files in /var/opt/perf/databases

Mailqs
Mailqs.MAILQS
Mailqs.desc

CLASS MAILQS = 10001

INDEX BY DAY

MAX INDEXES 30

ROLL BY DAY

RECORDS PER HOUR 12;

METRICS

Sendmail_svr_1 = 101

PRECISION 1;

Sendmail_svr_2 = 102

PRECISION 1;

………………

43 of 92

OV Performance Manager

Step 6 - OVPM Server - Create multi-frame html file to display all three graphs

<html><head><title>Multi-Pane Mail Graphs</title></head>

<frameset rows="*,*">

<frame name="F2" scrolling="no" marginwidth="0" marginheight="0" src='http://ovis.com/hpov_iops/cgi-bin/analyzer.exe?-GRAPHTEMPLATE:+ExchangeMailQueues+-GRAPH:+"Exchange+Connector+MTA+Queues"'>

<frame name="F3" scrolling="no" marginwidth="0" marginheight="0" src='http://ovis.com/hpov_iops/cgi-bin/analyzer.exe?-GRAPHTEMPLATE:+MailQueues+-GRAPH:+"Sendmail+Queues"'>

<noframes><body>

<h1>No frames</h1>

</body></noframes>

</frameset>

</html>

44 of 92

OV Performance Manager – Mail Graph

45 of 92

OV Operations Unix or OV Operations Windows?

Do - Both!!

Both products have good and bad points

We use OVOW to provision the Windows nodes

Using flexible management policies Windows agents send messages directly to OVOU

This takes advantage of the strong knowledge that OVOW has in the Windows world, but still keeps all the operators on one console.

46 of 92

OV Operations Unix / OV Operations Windows

Problem – How do I keep OVOU in synch with the nodes in OVOW when using both systems?

Solution

Schedule an Action on OVOW server that uses ovpmutil to dump Service Map to a file in a URL space

ovpmutil cfg xml dnl "c:\inetpub\wwwroot\<your path>"

Schedule an Action on OVOU server to run a Perl script performing a HTTP Get that extracts the service file from the OVOW server

A Perl script is then run which …

Uploads OVOW Service Map into OVOU
Dumps all nodes in OVOU
Parses the OVOW Service Map file
Generates an event for any node in the OVOW Service Map that is not in OVOU
Generates an event for any Windows node in OVOU that is not in OVOW
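The comparison at the heart of that script amounts to two set differences. The sketch below is in Python rather than the Perl used on site and takes plain lists of node names as input; how those lists are extracted from the Service Map and from OVOU is left out, and the function and key names are made up.

```python
def compare_nodes(ovow_nodes, ovou_windows_nodes):
    """Both arguments are iterables of node names.
    Returns the two exception lists, each of which would raise one event per node."""
    ovow = {n.lower() for n in ovow_nodes}
    ovou = {n.lower() for n in ovou_windows_nodes}
    return {
        "in_ovow_not_in_ovou": sorted(ovow - ovou),
        "in_ovou_not_in_ovow": sorted(ovou - ovow),
    }

print(compare_nodes(["WinA", "WinB"], ["winb", "winc"]))
```

Comparing in lower case is an assumption that node names differ only by case between the two systems; fully qualified versus short names would need further normalization.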

47 of 92

OV Operations Unix – OPC Internal Messages

Problem – How to manage environment where huge numbers of legacy servers exist that are not stable

Solution – Intercept OPC Internal messages at both the Agent and Mgmt Server level and make conditions for all repeating events

Assign all messages to OVO Admin initially

Over time split between Tier 1, 2 and 3

Make distinction between critical errors and those that can be fixed when time allows.

Identify problems that will NEVER be fixed on old servers!!!!

Good Composer task

Use OpC Number in Description

48 of 92

OVO Java Console Operator View

Do – Make the presentation to the operator the number one priority !

Do – Put yourself in the shoes of the operator!!!!

Because …

If the operators cannot make sense of the tool or …
If the operators cannot get what they want out of the tool or …
If the operators have to go to multiple tools to get an answer or …
If the operators’ platforms won’t support the tool then …

The operators are not going to use the tool, or in the best case they will not get what they need out of the tool

49 of 92

OVO Java Console Operator View

Do – Think about how you want to tie multiple groups of users together

Operators view in OVO linked to OSPF Areas

Loc 1, Loc 2, …

Network Infrastructure + Collection + Topology + Events == Operators View

NE/DC/ISS/EM all have the same view

Diagram labels: OSPF Area 0, OSPF Area 1n++ (repeated), OSPF Area 0 + 1n, Poller, Server Farm, NNM Collection Station

50 of 92

Message Instructions

Critical task from day one to have valid instructions that include:

Message Description, Thresholds, Troubleshooting Steps, Escalation Steps

Web-based

Include hyperlinks to other sources

Troubleshooting steps -> OVPM, OVR, OVIS
Escalation Steps ->Phone Listings, Outages

How are you going to account for the messages?

What events have instructions
What events have the same instructions
Do the severities make sense

Especially in conjunction with like events

51 of 92

Message Instructions

Our Approach

Use Message Type field in OVO to hold the Event ID

EA000X – Servers
EA500X – Network devices

Provide Application which runs from Java Console and points directly to the correct Message Instruction

Web Based

Stored in SupportSoft Knowledge Base

Searches can also be run against all instructions
Associated “Hits” are also displayed
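The lookup performed by the Java Console application can be pictured as turning the Event ID held in the Message Type field into a page address. The sketch below is purely illustrative: the base URL is hypothetical, and the assumption that an Event ID is "EA" followed by digits is read off the two examples above rather than taken from a specification.

```python
from urllib.parse import quote

BASE_URL = "http://kb.example.com/instructions"   # hypothetical location

def instruction_url(event_id):
    """Build the instruction page address for an event ID such as EA0001."""
    event_id = event_id.strip().upper()
    if not (event_id.startswith("EA") and event_id[2:].isdigit()):
        raise ValueError(f"not an event ID: {event_id!r}")
    return f"{BASE_URL}/{quote(event_id)}.html"

print(instruction_url("EA5001"))
```

Rejecting malformed IDs up front gives a simple way to spot events whose Message Type field was never populated.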

52 of 92

Message Instructions

CM Controlled via Excel

Spreadsheet is published to Web via Macro

Event line item in Excel links to actual Event Webpage

53 of 92

Protected Enclave Mgmt

Do – develop a strategy ahead of time on how to manage protected areas

54 of 92

Protected Enclave Mgmt

Least to most capability

55 of 92

Agenda

Background

General Do’s and Don’ts

Product Specific Implementations

Configuration Management of EM Systems

Correlation Composer

56 of 92

Tying it all together - Or how do we keep ourselves sane?

Problem

No notion of CM in OVO

No easy way to see things like operator assignments, node assignments, nested profiles

No way to figure out if all the tools are synched up between systems as far as Nodes are concerned

Responsibility Matrix is impossible to figure out if you have lots of Node Groups/Message groups

Too many areas where Message Traffic can spiral out of control

57 of 92

Tying it all together – Do’s and Don’ts

Do – Think about how you can make use of the Web for reporting

Keep out of consoles where it makes sense

OVOU, OVOW, OVIS

Don’t - Create a life-cycle mess with CM and documentation

Do – Think about how you can merge all documentation into an online search engine and tie this in with existing Operations Center knowledge bases.

Do - find a way to allow for quick retrievals, updates and creation – We used SupportSoft as a Knowledge Base

Don’t – Overlook the complexity of keeping Nodes, Users, Events, etc straight once you’ve deployed

This was the area that gave us the most problems – even after giving it a lot of thought and having a plan in place

Synching up nodes across all EM systems was the area we underestimated the most – one year into the large deployment and we continue to struggle with this issue.

58 of 92

Tying it all together – Do’s and Don’ts - cont – Exception Reports

2. Do – Build exception reports on everything you can

59 of 92

Tying it all together – Do’s and Don’ts - cont - Synching Up EM Systems

Problem – How do we ensure that all systems have the correct nodes
Solution – Enforce a hierarchy

Diagram labels: OVO, Inventory report, Seed File, Reporting, Filter, NNM, Fdry Config Bkups, inFdryNotinNNM, CMDB, Events, HTTPget

60 of 92

Tying it all together – Do’s and Don’ts - cont

2. Do – Take advantage of the EM systems to monitor themselves and other EM systems

OVR reports, OVPM DSI Collection, OVIS port availability

OVR report with summarized OVOU message rate over last 96 hrs

61 of 92

Tying it all together – Do’s and Don’ts - cont

Create OVR reports on OVO Message Rate -

4,8,12,24,48 hour

1, 2, 3, 4, 7, 14, 21, 30 day
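The idea behind these reports is a count of messages over several trailing windows. The sketch below only illustrates that calculation; the actual reports are built in OVR, and the function name and input (a list of message timestamps) are assumptions.

```python
from datetime import datetime, timedelta

WINDOWS_HOURS = [4, 8, 12, 24, 48]

def message_counts(timestamps, now, windows=WINDOWS_HOURS):
    """Count messages received within each trailing window ending at `now`."""
    return {hours: sum(1 for t in timestamps
                       if now - timedelta(hours=hours) <= t <= now)
            for hours in windows}
```

Comparing a short window against a long one (for example 4 hours against 48) makes a sudden jump in message rate stand out without reading the message browser.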

62 of 92

Tying it all together – Do’s and Don’ts - cont

3. Don’t – Forget to monitor the individual EM systems

One error can cause huge problems if left undetected
We had three problems that were accounting for 80K messages a day that were undetected for a period of time

Problem

What’s going on under the hood during the initial deployment stages

Malformed template/policy messages
DNS resolution issues
IP address issues when monitoring switches

Solution

1. Turn on MSG tracing and leave it on

OPC_TRACE TRUE
OPC_TRACE_AREA MSG
OPC_TRACE_TRUNC FALSE

2. Create Template to monitor trace file for discarded messages
3. Action can do DNS lookups, DB lookups, etc.
4. No apparent overhead

Will need scheduled action to truncate trace file with opcsv -trace

63 of 92

Tying it all together – Do’s and Don’ts - cont

Problem – Need to know if OVIS SLA probes are having problems or maybe have bad targets

Solution – Use the Tool!!!

(Well, somewhat)
Create ASP page to query OVIS on current SLA levels.
If there is a problem area the first step is to see if the probes are the problem

We also have Web pages for:

Status of OVIS Pollers
Last time received
Distribution of various targets among pollers
Worst Performing pollers
The poller or the path to the target can just as easily be the problem – You have to manage that variable aggressively
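A "worst performing pollers" page boils down to ranking pollers by how often their probes fail. The sketch below assumes a flat list of probe results; the real pages query OVIS through ASP, and none of the names here come from the product.

```python
from collections import defaultdict

def worst_pollers(results, top=3):
    """results: iterable of (poller, target, ok).
    Returns up to `top` (poller, failure_rate) pairs, worst first."""
    totals = defaultdict(lambda: [0, 0])    # poller -> [failures, samples]
    for poller, _target, ok in results:
        totals[poller][1] += 1
        if not ok:
            totals[poller][0] += 1
    rates = [(p, failures / count) for p, (failures, count) in totals.items()]
    return sorted(rates, key=lambda r: (-r[1], r[0]))[:top]
```

A poller that fails against every one of its targets points at the poller or its network path rather than at the services being measured.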

64 of 92

Tying it all together – Do’s and Don’ts - cont

Problem – How to manage User Responsibilities

Solution – Build a repeatable process outside of OVO to manage responsibilities

Step 1 – Setup an Excel Spreadsheet to track assignments and use an Excel Macro to publish the spreadsheet to a Web Page

Step 2 – Decide on a clear distinction between the Tier levels (Sol-Adm-T2, Sol-Adm-T3)

Step 3 - Start out with higher level Message Groups and then break them down to smaller groups

65 of 92

Tying it all together – Do’s and Don’ts - cont

Step 4 - Get rid of all the Message Groups that don’t make sense and simplify

Gotcha – Don’t get rid of the Misc Message Group!! All messages with an invalid MsgGrp use this MsgGrp

Step 5 - Track various states of Message Groups

Approved
Need to be baselined
Deleted

Step 6 – Decide how you are going to classify your users

Tier Level
Function
Authority

66 of 92

OVO Operator Assignment

Tier Profiles(Users)

Step 7 – Break down application profiles so that they match your classification of users

Create profiles for

All Users
Network Tier 1
Network Tier 2
………….

Step 8 – Now map your users and applications to each other

App Profiles

67 of 92

OVO Operator Assignment

Tier Profiles(Users)

Step 9 – Now do the same with your events. Develop classifications by:

Tier Levels
Responsibilities
…….

Step 10 – Map your Event Profiles to Users the same way you did with Applications

Event Profiles

68 of 92

OVO Operator Assignment

Step 11 – Now map your User Profiles to User Templates

This is just a One-to-one mapping

Step 12 – Make the actual assignments in the User Profile Bank

All items in Row One are created

Step 13 – Create the User Templates in the User Bank

Number at the beginning is important as the number simplifies use when many users exist

User Template Assignments

Tier Profiles(Users)

69 of 92

OVO Operator Assignment

Step 14 – Create the Users

A. Copy User Template

B. Change Data for User

Never Change the user again!!!!

What does all this buy us?

Simple way to see what is actually assigned
Simple way to walk through the effect of making a change.
Make changes at the lowest level

Ex. Want to assign new app to All Users – Only one template is changed

User Profiles

App and Event Profiles
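The payoff of the layered assignment can be shown with a small model: a user points at one User Template, the template lists profiles, and the profiles hold the applications or message groups. The structures and names below are hypothetical stand-ins for what lives in the OVO User Bank and User Profile Bank.

```python
def effective_assignments(user, users, templates, profiles):
    """users: user -> template name; templates: template -> [profile names];
    profiles: profile -> set of applications or message groups.
    Returns everything the user ends up with."""
    items = set()
    for profile in templates[users[user]]:
        items |= profiles[profile]
    return items

users = {"jdoe": "10_Net_T1"}
templates = {"10_Net_T1": ["All Users", "Network Tier 1"]}
profiles = {"All Users": {"Ping"}, "Network Tier 1": {"Show Interfaces"}}
print(sorted(effective_assignments("jdoe", users, templates, profiles)))
```

Adding an application to the "All Users" profile changes the result for every user without touching any user record, which is the "make changes at the lowest level" point.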

70 of 92

OVO Operator Assignment

Remember - Number at the beginning of the User Template is important in order to put the User Templates at the Beginning of the GUI display

Much easier to find
Avoids mistakes

71 of 92

Agenda

Background

General Do’s and Don’ts

Product Specific Implementations

Configuration Management of EM Systems

Correlation Composer

72 of 92

Correlation Composer – Problem/Solution

Problem

Request to add fields to the OVO Java Console in an effort to assist end-users in identifying and classifying devices
Some of the data is accessible from the CMDB JSP interface, but users would like the data in the event
Some of the data requested is easily accessible directly from OVO

Some of the data is available via Node Group assignments

Some of the data needs to be pulled from CMDB

Solution

Use Composer to access data from ECS datastore and from external data source

Considerations

Performance overhead of datastores
No way to get multiple elements associated with a single key from a datastore using Composer and built-in functions
Overhead of using external perl function
Size of Java Console display

What impact will the additional fields have?

73 of 92

Correlation Composer – General Outline

Create Directory Structure
Create OVO Node Groups for Customer / Special categories
Add any necessary fields to CMDB
Build OVO and CMDB reports that will be used as input
Modify CO.conf for new Custom Message Attributes (CMA)
Create namespace file

Create Scripts

Create script to build factstore
Create script to build datastore

Create Correlators

Create Enhance Correlator to perform Lookup to datastore
Create Enhance Correlator to perform Lookup to external source through perl function
Create Enhance Correlator to adhere to precedence order
Modify Original Message Text to include original source

74 of 92

Correlation Composer – Step 1

Create Directory Structure in line with ECS 3.3

/etc/opt/OV/share/conf/ecs/CIB/scripts – scripts to build datastores and factstores
/etc/opt/OV/share/conf/ecs/CIB/stores – repository for datastores and factstores
/etc/opt/OV/share/conf/ecs/CIB/tmp – temporary directory for processing the input data used to build the stores
/etc/opt/OV/share/conf/ecs/CIB/drivers – drivers to test Correlator rules

Existing Directories

/etc/opt/OV/share/conf/ecs/CIB – default repository for OVO and NNM factstores and datastores; also the location of the namespace file (in our case, OVONameSpace.conf)
/opt/OV/contrib/ecs/external/perl – location of the Perl function that is read in, as well as the flat file containing lookup data
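
The four new directories could be created with a few lines of script. This is a minimal sketch: the function name `create_cib_dirs` is hypothetical, and the root path is a parameter so the code can be tried somewhere other than a live management server.

```python
import os

# Subdirectories named on the slide, relative to the CIB directory.
NEW_SUBDIRS = ("scripts", "stores", "tmp", "drivers")

def create_cib_dirs(cib_root="/etc/opt/OV/share/conf/ecs/CIB"):
    """Create the four working directories under cib_root; return their paths."""
    created = []
    for name in NEW_SUBDIRS:
        path = os.path.join(cib_root, name)
        os.makedirs(path, exist_ok=True)   # no error if it already exists
        created.append(path)
    return created
```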

75 of 92

Correlation Composer – Step 2

2. Determine the input sources and the data you want to populate the Custom Message Attribute (CMA) fields with

Create Node Groups representing the categories and services we want to use as CMAs (Cust A, Cust B, High Priority, DNS, AT, etc.) in the format of:

S_<customer> (Ex. S_CustA_Nt, S_CustB_Ux …)

Build an OVO report containing the Nodename and all Node Groups that the node belongs to, in the format of:

Nodename NodegroupA NodegroupB (Ex. Svr1 S_CustA S_CustB …)

Create an additional field in CMDB that contains all services associated with a server

Location and Remedy Queue already existed

Generate a CMDB report that has all OVO nodes and their associated Locations, Remedy Queues, and OVO Services, in the format of:

Nodename Location Remedy Queue Services
(Ex. Svr1 Bldg1 RemedyQueue1 File,Print)
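
Both report formats are simple enough to parse line by line. The sketch below assumes the columns are separated by whitespace and that no value contains a space; the function names and dictionary keys are hypothetical.

```python
def parse_nodegroup_line(line):
    """Split an OVO report line into the node name and its Node Groups."""
    fields = line.split()
    return fields[0], fields[1:]

def parse_cmdb_line(line):
    """Split a CMDB report line into its four columns."""
    node, location, queue, services = line.split()
    return {
        "node": node,
        "location": location,
        "remedy_queue": queue,
        "services": services.split(","),   # Services arrive comma-separated
    }
```

Applied to the example lines from the slide, `parse_nodegroup_line("Svr1 S_CustA S_CustB")` yields the node `Svr1` with two Node Groups, and `parse_cmdb_line("Svr1 Bldg1 RemedyQueue1 File,Print")` yields the services `File` and `Print`.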

76 of 92

Correlation Composer – Step 3

3. Modify $OV_BIN/CO.conf to set up CMAs

CMA_Location == Location of device
CMA_Net == Network device resides on
CMA_Cat == Special Category that the device falls into
CMA_Service == OVO defined services
CMA_Customer == Special Customer that the device supports
CMA_WatchList == Six Sigma High Priority servers
CMA_SWTicket == Remedy Queue

77 of 92

Correlation Composer – Step 4

4. Create $OV_CONF/ecs/CIB/OVONameSpace.conf

Defines names of all the individual factstores we are going to use

Intended use is not for developer mode, but this works very well for us since we have so many rules.

OVO_NNM.fs == NNM and Syslog
OVO_Security.fs == Firewall
OVO_OVIS.fs == OVIS
OVO_OVOU.fs == Solaris
OVO_OVOW.fs == Windows
OVO_Specific.fs == Node specific
OVO_OVOUAdm.fs == Mgmt Svr
OVO_Lookup.fs == Lookup rules

Lookup defined as accessing datastore or external data source

All factstores stored in $OV_CONF/ecs/CIB/stores

78 of 92

Correlation Composer – Step 5

5. Create script that builds ecs_comp.fs – $OV_CONF/ecs/CIB/scripts/fact_commit

Merge

Backup

Copy

Load
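
The four stages might be strung together roughly as follows. This is a sketch under stated assumptions, not the script from the slide: it assumes that "Merge" means concatenating the individual `OVO_*.fs` factstores, and the `load` argument is a hypothetical hook because the actual load command is not shown in the slides.

```python
import glob
import os
import shutil

def fact_commit(stores_dir, target, load=None):
    """Merge the OVO_*.fs files in stores_dir, back up target, copy, then load."""
    # Merge: concatenate the individual factstores in name order.
    merged = os.path.join(stores_dir, "ecs_comp.fs.new")
    with open(merged, "w") as out:
        for path in sorted(glob.glob(os.path.join(stores_dir, "OVO_*.fs"))):
            with open(path) as src:
                out.write(src.read().rstrip("\n") + "\n")
    # Backup: keep the previous factstore if there is one.
    if os.path.exists(target):
        shutil.copy2(target, target + ".bak")
    # Copy: put the merged file in place.
    shutil.move(merged, target)
    # Load: placeholder hook; the real load command is not shown in the slides.
    if load is not None:
        load(target)
    return target
```

Taking the backup before the copy means the previous factstore is still on disk if the new one turns out to be bad.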

79 of 92

Correlation Composer – Step 6

Create script that builds ecs_comp.ds – data_commit

Retrieve and Parse CMDB report from Service Desk that contains Nodename, Location, Remedy Queue, and Services via perl httpget

Build $OV_CONTRIB/ecs/external/perl/teams.txt

Nodename Location Remedy Queue Services (Services in CSV format)

Retrieve and Parse OVO report that contains Nodename and all Node Group assignments

Build elements going into datastore based on Node Group name ( S_<customer> )

For each special customer

grep the customer from the OVO report
Build the ADD DATA line
Append it to the new $OV_CONF/ecs/CIB/stores/ecs_comp.ds in the format of:

ADD DATA("CMA_CustAList" , ["svr1.com" , "svr2.com" , ….. ])
ADD DATA("CMA_CustBList" , ["svr3.com" , "svr4.com" , ….. ])

We have 7 separate categories ranging from 5 to 112 servers/network devices
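
A minimal sketch of the datastore-building step, assuming the OVO report is whitespace-separated and that the customer name is the part of the Node Group name between the first and second underscore (so `S_CustA_Nt` and `S_CustA_Ux` both map to `CustA`). The function name is hypothetical; the output follows the ADD DATA format shown on the slide.

```python
def build_add_data_lines(report_lines):
    """Group nodes by S_<customer> Node Group and emit one ADD DATA line each."""
    members = {}
    for line in report_lines:
        fields = line.split()
        for group in fields[1:]:
            parts = group.split("_")
            # Only Node Groups named S_<customer>[...] feed the datastore.
            if parts[0] != "S" or len(parts) < 2 or not parts[1]:
                continue
            nodes = members.setdefault(parts[1], [])
            if fields[0] not in nodes:      # avoid listing a node twice
                nodes.append(fields[0])
    lines = []
    for customer in sorted(members):
        quoted = " , ".join('"%s"' % node for node in members[customer])
        lines.append('ADD DATA("CMA_%sList" , [%s])' % (customer, quoted))
    return lines
```

Each returned string is one line to append to the new `ecs_comp.ds`.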

80 of 92

Correlation Composer – Step 6 – cont

$OV_CONF/ecs/CIB/scripts/data_commit script

Create Special Customer datastore entries

Verify

Copy

Load

Backup

81 of 92

Correlation Composer – Step 7

82 of 92

Correlation Composer – Step 8

8. Lookup To Datastore

Ignore if Forwarded

Set Constant Key
Type “Lookup”
Set text string that goes in CMA field

Do Lookup – Is the current Node name in the CMA_Cust1 list?

From Step 6

ADD DATA("CMA_Cust1List" , ["svr1.com" , "svr2.com" , ….. ])

83 of 92

Correlation Composer – Step 8 - Cont

Alter Specification

Alter the specification of the alarm and put the Constant Value in the CMA_Customer CMA field

_CMA_CustAList

Defined in Step 3

Evaluates to “Cust1”
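
The logic of the lookup Correlator, expressed outside Composer, is roughly the following. This is an illustration only: the event is modelled as a plain dictionary, and the key names `forwarded`, `node` and `cma` are hypothetical rather than Composer or OVO attribute names.

```python
def enhance_with_customer(event, datastore, list_key, constant):
    """Set CMA_Customer to `constant` when the event's node is in the list."""
    if event.get("forwarded"):                      # Ignore if Forwarded
        return event
    if event["node"] in datastore.get(list_key, []):    # Do Lookup
        # Alter Specification: put the constant value in the CMA field.
        event.setdefault("cma", {})["CMA_Customer"] = constant
    return event
```

For example, with a datastore holding `"CMA_Cust1List": ["svr1.com", "svr2.com"]`, an event from `svr1.com` gets `CMA_Customer` set to `Cust1`, while an event from any other node passes through unchanged.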

84 of 92

Correlation Composer – Step 9

Step 9 - Access external data source via perl function:

Open file

Match

Set var for

Nodename
SWTicket
Location
Service

Return values
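
The original function is written in Perl and is not reproduced in the slides. The Python sketch below only mirrors the four stages listed above, assuming the flat file uses the teams.txt column order from Step 6 (Nodename, Location, Remedy Queue, Services) with whitespace-separated columns. The function name `lookup_node` is hypothetical.

```python
def lookup_node(path, nodename):
    """Return (nodename, swticket, location, services) for the first match."""
    with open(path) as handle:                                # Open file
        for line in handle:
            fields = line.split()
            if len(fields) == 4 and fields[0] == nodename:    # Match
                node, location, swticket, services = fields   # Set vars
                return node, swticket, location, services     # Return values
    return None   # node not present in the flat file
```

Scanning the file on every call is where the overhead noted under Considerations comes from.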

85 of 92

Correlation Composer – Step 9 - Cont

External Lookup

Set Key

Location == Index 4
Service == Index 5
SwTicket == Index 3

Debug Line that is put in Message Text when debugging

Get data in index #

Call Perl function

86 of 92

Correlation Composer – Step 9 - Cont

Alter Specification

87 of 92

Correlation Composer – Step 9 - Cont

Precedence

If Key matches AND it is not in the higher precedence category

Two keys in this case – one to match primary key and one to match secondary key
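
The precedence test reduces to two membership checks. A minimal sketch, with a hypothetical function name and plain Python collections standing in for the two datastore keys:

```python
def applies(node, own_members, higher_members):
    """True when the node matches this category and not the higher one."""
    in_own = node in own_members          # primary key match
    in_higher = node in higher_members    # secondary key match
    return in_own and not in_higher
```

A node that sits in both lists is left to the higher-precedence Correlator, so only one value is written to the CMA field.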

88 of 92

Correlation Composer – Step 10

Do – 1. Check to see if an event has already gone through a Correlator
Do – 2. Capture original Message Source and append to Original Message Field
Do – 3. Change the new Message ID back to the Original Message ID
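
The three checks can be read as one small routine. In this sketch the event is a plain dictionary and every key name (`correlated`, `text`, `source`, `msg_id`, `orig_msg_id`) is hypothetical; it shows the order of operations, not the Composer configuration.

```python
def finalize(event):
    """Apply the three 'Do' items once per event."""
    # 1. Skip events that have already gone through a Correlator.
    if event.get("correlated"):
        return event
    # 2. Append the original message source to the message text.
    event["text"] = "%s (source: %s)" % (event["text"], event["source"])
    # 3. Restore the original message ID.
    event["msg_id"] = event["orig_msg_id"]
    event["correlated"] = True
    return event
```

The flag set at the end is what makes the first check work, so running the routine twice does not append the source a second time.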

89 of 92

Correlation Composer – Step 11

Sit back and have a beer

[Figure: Node_A through Node_I shown against Location, Remedy Queue, Special Customer, Special Category and Service]