PeopleSoft

Oracle 11g Database Fault Diagnostic Infrastructure

September 28, 2008

Automatic Diagnostic Repository (ADR)

Oracle 11g introduces a new directory structure for organizing the flat file administrative objects such as the alert log, trace files, dumps, audit logs, etc.

The structure looks like the following screenshot:

Picture 1.png

Oracle has introduced not only a new directory template for organizing the various Oracle administrative files (alert log, audit logs, trace files, core dumps, etc.) but also an entirely new methodology for managing diagnostic data. It is a comprehensive infrastructure for collecting and managing that data. In Oracle’s terms, diagnostic data includes the trace files, dumps and core files that existed in previous releases, along with new types of diagnostic data that enable Oracle users as well as Oracle’s Support group to identify, investigate, trace and resolve problems.

Now when a critical error occurs, it is automatically assigned an incident number, and diagnostic data for that error (such as its associated trace files) is captured and tagged with this incident number. The data is then stored in the Automatic Diagnostic Repository (ADR), a flat-file-based repository outside the database that looks like the above picture. This data can later be retrieved by incident number and analyzed.

The motivating factors behind this from Oracle’s perspective are (taken from Oracle’s Database Administrator’s Guide, 11g Release 1 (11.1)):

* First-failure diagnosis
* Problem prevention
* Limiting damage and interruptions after a problem is detected
* Reducing problem diagnostic time
* Reducing problem resolution time
* Simplifying customer interaction with Oracle Support

To accomplish these goals, Oracle focused on the following:

* Automatic capture of diagnostic data, stored outside the database in the new ADR structure
* Standardized trace formats
* Health checks, which the DBA can invoke manually and which can also run automatically
* Data Recovery Advisor, which integrates with the database health checks and RMAN to display data corruption problems, assess the extent of each problem, classify it (critical, high priority, low priority), describe its impact, make recommendations and automate the repair process
* SQL Test Case Builder. For many SQL-related problems, obtaining a reproducible test case is key to diagnosing or resolving the issue accurately. This tool automates the sometimes difficult and time-consuming process of gathering as much information as possible about the problem and the environment in which it occurred. You can upload this information to Oracle Support so that their support personnel can easily and accurately reproduce the problem; that is the concept according to Oracle.
* Incident Packaging Service (IPS) and incident packages, which are the key to Oracle’s automation of the whole diagnostic methodology and infrastructure

The IPS gathers the diagnostic data (traces, dumps, health check reports, …) pertaining to a critical error and packages it into a zip file for transmission to Oracle Support.

Since the key to managing all of this information is the incident number, this number is tagged to all of the related files. That makes it possible to search through all of this information and select every file with a specific incident number for addition to a zip file for transmission to Oracle Support. The Incident Packaging Service identifies the required files automatically during the first phase of the collection process: IPS collects these files and stores them in an intermediate logical structure called an incident package (package). Packages are stored in the Automatic Diagnostic Repository (ADR). If you choose to, you can access this intermediate logical structure, view and modify its contents, and add or remove diagnostic data at any time. When you are ready, you can create or recreate the zip file and transmit it to Oracle Support.
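To make the idea of collecting files by incident number concrete, here is a minimal Python sketch. It is not how IPS is implemented. The `incident/incdir_<id>` directory layout, the function names and the file names are assumptions made for illustration, and the demo runs against a throwaway directory rather than a real ADR home.

```python
import tempfile
import zipfile
from pathlib import Path


def collect_incident_files(adr_home: Path, incident_id: int) -> list[Path]:
    """Return every file under the (assumed) per-incident directory."""
    incident_dir = adr_home / "incident" / f"incdir_{incident_id}"
    return sorted(p for p in incident_dir.rglob("*") if p.is_file())


def build_package(adr_home: Path, incident_id: int, zip_path: Path) -> list[str]:
    """Zip one incident's files and return the archive member names."""
    files = collect_incident_files(adr_home, incident_id)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store paths relative to the repository root inside the zip.
            archive.write(path, arcname=path.relative_to(adr_home).as_posix())
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


def demo() -> list[str]:
    """Create a fake repository with two incidents and package one of them."""
    with tempfile.TemporaryDirectory() as tmp:
        adr_home = Path(tmp) / "adr_home"
        wanted = adr_home / "incident" / "incdir_1234"
        other = adr_home / "incident" / "incdir_5678"
        wanted.mkdir(parents=True)
        other.mkdir(parents=True)
        (wanted / "dump_a.trc").write_text("dump for incident 1234\n")
        (wanted / "dump_b.trc").write_text("second dump for incident 1234\n")
        (other / "dump_c.trc").write_text("dump for incident 5678\n")
        return build_package(adr_home, 1234, Path(tmp) / "package.zip")
```

Calling `demo()` returns only the two files that belong to incident 1234; the file for incident 5678 is left out of the archive.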

Incidents and Problems

The new infrastructure introduces two concepts for Oracle Database: problems and incidents.

A problem is a critical error in the database. Critical errors manifest as internal errors, such as ORA-00600, or other severe errors, such as ORA-07445 (operating system exception) or ORA-04031 (out of memory in the shared pool). Problems are tracked in the ADR. Each problem has a *problem key*, which is a text string that describes the problem. It includes an error code (such as ORA 600) and in some cases, one or more error parameters.

An incident is a single occurrence of a problem. When a problem (critical error) occurs multiple times, an incident is created for each occurrence. Incidents are time stamped and tracked in the Automatic Diagnostic Repository (ADR). Each incident is identified by a numeric incident ID, which is unique within the ADR. When an incident occurs, the database:

* Makes an entry in the alert log
* Sends an incident alert to Oracle Enterprise Manager (Enterprise Manager)
* Gathers first-failure diagnostic data about the incident in the form of dump files (incident dumps)
* Tags the incident dumps with the Incident ID
* Stores the incident dumps in an ADR subdirectory created for that incident
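The relationship between problems and incidents can be pictured as a one-to-many mapping from a problem key to incident IDs. The Python sketch below is only a mental model, not Oracle's internal representation. The `Incident` class, the `group_by_problem` function and the problem key strings are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    incident_id: int   # numeric ID, unique within the repository
    problem_key: str   # text key: error code plus optional parameters
    created: datetime  # incidents are time stamped


def group_by_problem(incidents: list[Incident]) -> dict[str, list[int]]:
    """Map each problem key to the IDs of its incidents, in input order."""
    problems: dict[str, list[int]] = defaultdict(list)
    for incident in incidents:
        problems[incident.problem_key].append(incident.incident_id)
    return dict(problems)


# Hypothetical sample data: one problem seen twice, another seen once.
SAMPLE = [
    Incident(101, "ORA 600 [arg_a]", datetime(2008, 9, 28, 9, 0)),
    Incident(102, "ORA 7445 [arg_b]", datetime(2008, 9, 28, 9, 5)),
    Incident(103, "ORA 600 [arg_a]", datetime(2008, 9, 28, 9, 30)),
]
```

Here `group_by_problem(SAMPLE)` yields two problems: the first key owns incidents 101 and 103, and the second owns incident 102.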

Diagnosis and resolution of a critical error usually begins with an incident alert, which is displayed on the Enterprise Manager Database Home page. You can then drill down to view the associated details.

Incident Flood Control

Given that Oracle is allowed to generate diagnostic data automatically, you may wonder what prevents runaway incident generation from consuming too much space, and possibly bringing your Oracle instances to a halt because the filesystem where the ADR is located becomes full and the OS can no longer write to the alert logs. To address this potential hazard, Oracle applies what it terms flood control to incident generation after certain thresholds are reached. A flood-controlled incident is an incident that generates an alert log entry and is recorded in the ADR, but does not generate incident dumps. Flood-controlled incidents inform you that a critical error is occurring repeatedly, while Oracle essentially prevents itself from adding to the problem by limiting the amount of data generated. You can choose to view or hide flood-controlled incidents when viewing incidents with Enterprise Manager or the ADR utility ADRCI.

The basic threshold levels for flood control are predetermined and cannot be changed. They are defined as follows:

After five incidents occur for the same problem key in one hour, any subsequent incidents for that problem key are flood-controlled. Normal (non-flood-controlled) recording of incidents for that problem key begins again after the hour has expired.

After twenty-five (25) incidents occur for the same problem key in one day, subsequent incidents for this problem key are flood-controlled. Normal recording of incidents for that problem key resumes after the 24-hour window has expired.

In addition, after fifty (50) incidents for the same problem key occur in one hour, or two hundred fifty (250) incidents for the same problem key occur in one day, subsequent incidents for this problem key are not recorded at all in the ADR. In these cases, the database writes a message to the alert log indicating that no further incidents will be recorded. As long as incidents continue to be generated for this problem key, this message is added to the alert log every ten minutes until the hour or the day expires. Upon expiration of the hour or day, normal recording of incidents for that problem key begins again.
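The thresholds above can be sketched as a small classifier in Python. This only illustrates the rules as described here and is not Oracle's implementation. The `FloodControl` class and its method are hypothetical, and the sketch makes two assumptions that the description leaves open: the hour and the day are treated as trailing (sliding) windows, and every occurrence counts toward the thresholds, including ones that were flood-controlled or not recorded.

```python
from collections import deque
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)
DAY = timedelta(days=1)


class FloodControl:
    """Classify incidents per problem key; timestamps must not go backwards."""

    def __init__(self) -> None:
        self._seen: dict[str, deque] = {}  # problem key -> earlier timestamps

    def classify(self, problem_key: str, when: datetime) -> str:
        history = self._seen.setdefault(problem_key, deque())
        # Forget occurrences that fell out of the trailing 24-hour window.
        while history and when - history[0] >= DAY:
            history.popleft()
        in_day = len(history)
        in_hour = sum(1 for t in history if when - t < HOUR)
        history.append(when)  # assumption: every occurrence is counted

        if in_hour >= 50 or in_day >= 250:
            return "not_recorded"      # only an alert log message
        if in_hour >= 5 or in_day >= 25:
            return "flood_controlled"  # recorded, but no incident dumps
        return "normal"                # recorded with incident dumps


def simulate_burst(count: int) -> list[str]:
    """Classify `count` incidents for one made-up key, one second apart."""
    control = FloodControl()
    start = datetime(2008, 9, 28, 12, 0, 0)
    return [
        control.classify("ORA 600 [arg_a]", start + timedelta(seconds=i))
        for i in range(count)
    ]
```

Under these assumptions, a burst of 60 incidents for one problem key gives 5 normal incidents, then 45 flood-controlled ones, and the remaining 10 are not recorded. A different problem key is unaffected, because the counts are kept per key.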
