oswbb runs as a set of background processes on the server and gathers OS data at regular intervals by invoking Unix utilities such as vmstat, netstat, and iostat. oswbb can be downloaded from this note.
NOTE: oswbb is available through MOS and can be downloaded as a tar file. Copy the file oswbb.tar to the directory where oswbb is to be installed and issue the following commands.
tar xvf oswbb.tar
traceroute -r -F node1
traceroute -r -F node2
ARG1 = snapshot interval in seconds.
ARG2 = the number of hours of archive data to store.
ARG3 = (optional) the name of a compress utility to compress each file automatically after it is created.
ARG4 = (optional) an alternate (non default) location to store the archive directory.
./startoswbb.sh 60 10 gzip
./startoswbb.sh 60 10 gzip /u02/tools/oswbb/archive
./startoswbb.sh 60 48 NONE /u02/tools/oswbb/archive
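The positional arguments above can be assembled programmatically. The following is a minimal Python sketch, not part of oswbb itself; the function name `build_start_command` is hypothetical. It assumes, based on the third example above, that `NONE` is the placeholder passed as ARG3 when an archive location is given without compression.

```python
import shlex


def build_start_command(interval_s, archive_hours, compress=None, archive_dir=None):
    """Assemble a startoswbb.sh command line from the four positional arguments.

    Hypothetical helper for illustration only.
    """
    if interval_s <= 0 or archive_hours <= 0:
        raise ValueError("interval and archive hours must be positive")
    args = ["./startoswbb.sh", str(interval_s), str(archive_hours)]
    if archive_dir is not None:
        # ARG4 is positional, so ARG3 must be present. The sample invocation
        # with an archive directory and no compression passes NONE here.
        args.append(compress or "NONE")
        args.append(archive_dir)
    elif compress:
        args.append(compress)
    # Quote each argument so that paths containing spaces survive the shell.
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(build_start_command(60, 10, "gzip"))
    print(build_start_command(60, 48, archive_dir="/u02/tools/oswbb/archive"))
```

Building the command from named parameters makes it harder to shift ARG4 into the ARG3 position by accident.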
Sample iostat file produced by oswbb
|r/s||Shows the number of reads/second|
|w/s||Shows the number of writes/second|
|kr/s||Shows the number of kilobytes read/second|
|kw/s||Shows the number of kilobytes written/second|
|wait||Average number of transactions waiting for service (queue length)|
|actv||Average number of transactions actively being serviced|
|wsvc_t||Average service time in wait queue, in milliseconds|
|asvc_t||Average service time of active transactions, in milliseconds|
|%w||Percent of time there are transactions waiting for service|
|%b||Percent of time the disk is busy|
- Average service times greater than 20 ms for a long duration.
- High average wait times.
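The first check can be sketched in Python as follows. This is an illustrative helper, not part of oswbb: it assumes the asvc_t values have already been extracted from consecutive iostat snapshots for one device, and the name `sustained_slow_service` and the default of five consecutive snapshots as "long duration" are assumptions.

```python
def sustained_slow_service(asvc_t_samples, threshold_ms=20.0, min_consecutive=5):
    """Return (start_index, length) for each run of snapshots whose asvc_t
    exceeds threshold_ms for at least min_consecutive snapshots in a row."""
    runs = []
    start = None  # index where the current above-threshold run began
    for i, value in enumerate(asvc_t_samples):
        if value > threshold_ms:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_consecutive:
                runs.append((start, i - start))
            start = None
    # Close a run that is still open at the end of the data.
    if start is not None and len(asvc_t_samples) - start >= min_consecutive:
        runs.append((start, len(asvc_t_samples) - start))
    return runs


if __name__ == "__main__":
    samples = [5, 25, 30, 22, 8, 40, 41, 42, 43, 44]
    print(sustained_slow_service(samples, min_consecutive=3))
```

With a 60-second snapshot interval, five consecutive snapshots correspond to roughly five minutes of slow service; adjust `min_consecutive` to match the interval in use.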
Sample mpstat file produced by oswbb
|xcal||Processor cross-calls (when one CPU wakes up another by interrupting it).|
|ithr||Interrupts as threads (except clock)|
|icsw||Involuntary context switches|
|migr||Thread migrations to another processor|
|smtx||Number of times a CPU failed to obtain a mutex|
|srw||Number of times a CPU failed to obtain a read/write lock on the first try|
|syscl||Number of system calls|
|usr||Percentage of CPU cycles spent on user processes|
|sys||Percentage of CPU cycles spent on system processes|
|wt||Percentage of CPU cycles spent waiting on event|
|idl||Percentage of unused CPU cycles or idle time when the CPU is basically doing nothing|
- Involuntary context switches (icsw). This is probably the most relevant statistic when examining performance issues.
- Number of times a CPU failed to obtain a mutex (smtx). Values consistently greater than 200 per CPU cause system time to increase.
- Processor cross-calls (xcal). This statistic is very important because it reflects cross-processor activity.
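The smtx guideline can be checked with a short Python sketch. It assumes the smtx readings per CPU have already been collected from successive mpstat snapshots; the function name and the choice of 80% of snapshots as the meaning of "consistently" are assumptions made for illustration.

```python
def cpus_with_mutex_contention(smtx_by_cpu, threshold=200, min_fraction=0.8):
    """Return the sorted CPU ids whose smtx reading exceeds threshold in at
    least min_fraction of the snapshots.

    smtx_by_cpu maps a CPU id to the list of smtx values seen for that CPU.
    """
    flagged = []
    for cpu, readings in smtx_by_cpu.items():
        if not readings:
            continue  # no data for this CPU, nothing to judge
        over = sum(1 for value in readings if value > threshold)
        if over / len(readings) >= min_fraction:
            flagged.append(cpu)
    return sorted(flagged)


if __name__ == "__main__":
    data = {0: [250, 300, 210, 220, 400], 1: [10, 250, 20, 30, 15]}
    print(cpus_with_mutex_contention(data))
```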
|-a||The command output will use the logical names of the interface. It will also report the name of the IP address found through normal IP address resolution methods.|
|-i||This triggers the interface-specific statistics, the columns of which are outlined in the interface table below|
|-n||This causes the output to use IP addresses instead of the resolved names|
Sample netstat file produced by oswbb
|name||Device name of interface|
|Mtu||Maximum transmission unit|
|Net||Network Segment Address|
|address||Network address of the device|
|queue||Number in the Queue|
- RAWIP (raw IP) packets
- TCP packets
- IPv4 packets
- ICMPv4 packets
- IPv6 packets
- ICMPv6 packets
- UDP packets
- IGMP packets
- Collisions (Collis)
- Output packets (Opkts)
- Input errors (Ierrs)
- Input packets (Ipkts)
Network collision rate = Collis / Opkts
Input Error Rate = Ierrs / Ipkts.
%segment-retrans=(tcpRetransSegs / tcpOutDataSegs) * 100
%byte-retrans = ( tcpRetransBytes / tcpOutDataBytes ) * 100
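The four formulas above translate directly into code. The Python sketch below is illustrative; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made to keep the example safe on idle interfaces.

```python
def ratio(numerator, denominator):
    """Divide safely, returning 0.0 when there is no traffic to divide by."""
    return numerator / denominator if denominator else 0.0


def network_rates(collis, opkts, ierrs, ipkts):
    """Collision rate and input error rate from the netstat -i counters."""
    return {
        "collision_rate": ratio(collis, opkts),
        "input_error_rate": ratio(ierrs, ipkts),
    }


def tcp_retrans_percent(retrans_segs, out_data_segs, retrans_bytes, out_data_bytes):
    """Percentage of retransmitted TCP segments and bytes."""
    return {
        "segment_retrans_pct": 100.0 * ratio(retrans_segs, out_data_segs),
        "byte_retrans_pct": 100.0 * ratio(retrans_bytes, out_data_bytes),
    }


if __name__ == "__main__":
    print(network_rates(collis=5, opkts=1000, ierrs=2, ipkts=400))
    print(tcp_retrans_percent(10, 1000, 5000, 1000000))
```

Because the netstat counters are cumulative, rates for a given period are best computed from the difference between two snapshots rather than from a single snapshot.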
Sample file produced by oswbb
- Example 1: Interface is up and responding:
traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 1492 byte packets
1 X.X.X.X 1.015 ms 0.766 ms 0.755 ms
- Example 2: Target interface is not on a directly connected network, so validate that the address is correct and that the switch it is plugged into is on the same VLAN (or check for another issue):
traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 40 byte packets
traceroute: host X.X.X.X is not on a directly-attached network
- Example 3: Network is unreachable:
traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 40 byte packets
Network is unreachable
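The three cases above can be told apart by matching on the message text. The Python sketch below is a rough illustration that relies only on the strings shown in the examples; the function name and the returned labels are hypothetical, and other traceroute implementations may word their messages differently.

```python
def classify_traceroute(output):
    """Classify traceroute output into one of the three cases shown above."""
    text = output.lower()
    if "network is unreachable" in text:
        return "unreachable"
    if "not on a directly-attached network" in text:
        return "not_directly_attached"
    for line in output.splitlines():
        parts = line.split()
        # A hop line starts with the hop number and reports round-trip times in ms.
        if parts and parts[0].isdigit() and "ms" in parts:
            return "responding"
    return "unknown"


if __name__ == "__main__":
    sample = (
        "traceroute to X.X.X.X, (X.X.X.X) 30 hops max, 1492 byte packets\n"
        " 1 X.X.X.X 1.015 ms 0.766 ms 0.755 ms\n"
    )
    print(classify_traceroute(sample))
```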
Sample ps file produced by oswbb
|f||Flags|
|s||State of the process|
|uid||The effective user ID number of the process|
|pid||The process ID of the process|
|ppid||The process ID of the parent process.|
|c||Processor utilization for scheduling (obsolete).|
|pri||The priority of the process.|
|ni||Nice value, used in priority computation.|
|addr||The memory address of the process.|
|sz||The total size of the process in virtual memory, including all mapped files and devices, in pages.|
|wchan||The address of an event for which the process is sleeping (if blank, the process is running).|
|stime||The starting time of the process, given in hours, minutes, and seconds.|
|tty||The controlling terminal for the process (the message ?, is printed when there is no controlling terminal).|
|time||The cumulative execution time for the process.|
|cmd||The command name process is executing.|
- The information from the ps command is primarily used as supporting information for RAC diagnostics. For example, the status of a process prior to a system crash may be important for root cause analysis. The amount of memory a process is consuming is another example of how this data can be used.
The top utility is designed to:
- provide an accurate snapshot of the system and process state,
- not be one of the top processes itself,
- be as portable as possible.
Sample top file produced by oswbb
load averages: 0.11, 0.07, 0.06 12:50:36
136 processes: 133 sleeping, 2 running, 1 on cpu
Memory: 2048M real, 1061M free, 542M swap in use, 1605M swap free
Individual process fields
|PID||Process ID of process|
|USERNAME||Username of process|
|THR||Process threads|
|PRI||Priority of process|
|NICE||Nice value of process|
|SIZE||Total size of a process, including code and data, plus the stack space in kilobytes|
|RES||Amount of physical memory used by the process|
|STATE||Current CPU state of process. The states can be S for sleeping, D for uninterruptible sleep, R for running, T for stopped/traced, and Z for zombie|
|TIME||The CPU time that a process has used since it started|
|%CPU||The CPU time that a process has used since the last update|
|COMMAND||The task's command name|
- Large run queue. A large number of processes waiting in the run queue may be an indication that your system does not have sufficient CPU capacity.
- Process consuming lots of CPU. A process which is "hogging" CPU is always suspect. If this process is an Oracle foreground process, it is most likely running an expensive query that should be tuned. Oracle background processes should not hog CPU for long periods of time.
- High load averages. Processes should not be backed up on the run queue for extended periods of time.
- Low swap space. This is an indication you are running low on memory.
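To track load averages across snapshots, the header line of each top sample can be parsed. The Python sketch below assumes the header format shown in the sample above (`load averages: 0.11, 0.07, 0.06 12:50:36`); the function name is hypothetical, and other platforms print this line differently.

```python
import re

# Matches the three load averages in a header such as
# "load averages: 0.11, 0.07, 0.06 12:50:36".
LOAD_RE = re.compile(r"load averages:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_averages(line):
    """Return the three load averages as floats, or None if the line does not match."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


if __name__ == "__main__":
    print(parse_load_averages("load averages: 0.11, 0.07, 0.06 12:50:36"))
```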
Sample vmstat file produced by oswbb
|r||Number of processes that are in a wait state and basically not doing anything but waiting to run|
|b||Number of processes that were in sleep mode and were interrupted since the last update|
|w||Number of processes that have been swapped out by mm and vm subsystems and have yet to run|
|swap||The amount of swap space currently available|
|free||The size of the free list|
|pi||kilobytes paged in|
|po||kilobytes paged out|
|de||anticipated short-term memory shortfall (Kbytes)|
|sr||pages scanned by clock algorithm|
|bi||Disk blocks sent to disk devices in blocks per second|
|in||Interrupts per second, including the CPU clocks|
|cs||Context switches per second within the kernel|
|us||Percentage of CPU cycles spent on user processes|
|sy||Percentage of CPU cycles spent on system processes|
|id||Percentage of unused CPU cycles or idle time when the CPU is basically doing nothing|
- Large run queue. Adrian Cockcroft defines anything over 4 processes per CPU on the run queue as the threshold for CPU saturation. This is certainly a problem if it lasts for any long period of time.
- CPU utilization. The amount of time spent running system code should not exceed 30% especially if idle time is close to 0%.
- A combination of large run queue with no idle CPU is an indication the system has insufficient CPU capacity.
- Memory bottlenecks are determined by the scan rate (sr). The scan rate is the number of pages scanned by the clock algorithm per second. If the scan rate is continuously over 200 pages per second, there is a memory shortage.
- Disk problems may be identified if the number of processes blocked exceeds the number of processes on the run queue.
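The guidelines above can be combined into a single check per vmstat snapshot. The Python sketch below is illustrative only: it assumes one snapshot has already been parsed into a dictionary keyed by the column names r, b, sr, sy, and id, and the function name and finding labels are hypothetical. A finding from a single snapshot is only a hint; the guidelines refer to conditions that persist over time.

```python
def vmstat_findings(sample, cpu_count):
    """Apply the rule-of-thumb thresholds to one parsed vmstat snapshot.

    sample is a dict with the keys r, b, sr, sy and id.
    """
    findings = []
    saturated = sample["r"] > 4 * cpu_count  # more than 4 runnable per CPU
    if saturated:
        findings.append("run_queue")
    if sample["sy"] > 30:
        findings.append("system_time")
    if saturated and sample["id"] == 0:
        findings.append("cpu_capacity")
    if sample["sr"] > 200:
        findings.append("scan_rate")
    if sample["b"] > sample["r"]:
        findings.append("blocked_gt_runnable")
    return findings


if __name__ == "__main__":
    snapshot = {"r": 20, "b": 1, "sr": 350, "sy": 35, "id": 0}
    print(vmstat_findings(snapshot, cpu_count=4))
```

Running the check over every snapshot in an archive file and counting how often each label appears gives a rough view of which condition is sustained.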