Rules-of-thumb for monitoring Sun Solaris statistics

by Harry Murray

Level: Intermediate
Works with: Domino
Updated: 01-Jul-2002

Ensuring that your Domino server is running at peak performance is a daunting task. You need to monitor both the Domino statistics and the operating system (OS) statistics, and you need to know how to interpret those statistics to make sure that no system resources are restricting performance. You may also need to try to tune for the workload running on the server. The art of tuning involves making sure that OS settings and Domino settings provide the best performance.

The column this month focuses on monitoring the OS statistics for Sun Solaris. It helps you with the interpretation of those statistics and also suggests some possible OS tuning changes that may help performance. This is the second part of a series on performance monitoring and performance tuning tips. The first part of the series covered the Windows operating system, and in the future, we'll cover the AIX operating system (IBM UNIX).

We have attempted to extract the most important OS statistics and resource usage bottleneck thresholds from the written information available and from Domino performance engineers. We also work very closely with Sun engineers and appreciate their many ideas, which we've incorporated here. We'll tell you only what you need to know to make sure your system has no resource bottlenecks and what standard tools you can use to accomplish this (for at least the majority of situations). We'll also include some OS tuning possibilities as options to pursue as well as a list of references in case you want to dig deeper.

For this article, we assume that you have a basic knowledge of the Sun Solaris OS. The performance monitoring tools that we discuss are the ones shipped with Solaris. We also discuss the Domino Platform statistics, which offer an alternate way to monitor OS statistics.

We are attempting to put a stake in the ground based on what has worked for us. We hope this stimulates discussion that will help to further refine these ideas. Please let us know what has worked well for you so that we can modify our recommendations if necessary. These recommendations might also need modification as the OS and Domino change and as we do more performance analysis. We're hoping that the result will be a simple set of rules-of-thumb that can serve as a guide to ensuring optimal performance from your servers.

Guidelines for monitoring system performance and tuning systems
Before focusing on Sun Solaris, let's review some of the fundamental ideas about monitoring system performance and tuning systems that we discussed in the first article:

Before you concern yourself with bottlenecks, you need to set up your system optimally. Bottlenecks may disappear after properly setting up your system. For example, if the pagefile is too small or on a slow disk, CPU utilization may be high as the system tries to compensate for this problem.
You most certainly want to extract the best performance from your servers. This will ensure the best response time for your users and minimize the number of servers needed.
It’s important to collect performance data to establish a baseline before tuning or making either software or hardware changes to the server. This will help determine what effect the change has on performance. It’s also best, if possible, to make only one change at a time and then to monitor its effect on performance. The baseline is also needed to compare future trends in resource usage.
You should collect performance data on a daily, weekly, and monthly basis for trend analysis. This allows you to solve problems proactively before they result in degraded response times for users. If data is being collected when a problem occurs, it is easier to determine the cause of the problem. You should maintain the historical data for capacity planning.
When collecting data, collect only the minimum amount of data needed to determine if there is a problem and to determine trends. We'll list the minimum statistics to collect to be able to detect if a system is in trouble or headed for trouble. Initially, you can collect data at intervals of every five minutes or longer during peak time. If a problem surfaces, you then may need to collect more detailed performance data.
Remember that the resources are all interrelated. A bottleneck in one will affect the others. For example, if you don’t have enough memory, you will see higher CPU utilization because the CPU must work harder paging and swapping.
Also remember that you should not need to change OS tuning parameters unless you have a specific problem. You might adversely affect performance if you are not careful. It’s logical to expect that, if there were a tuning setting that would “magically” improve performance, then the OS vendor would have shipped it that way. If you do make changes, make them one at a time and check the performance impact prior to making another change.
The important categories of statistics to monitor are CPU, disk I/O, memory, network, and Domino statistics. Domino statistics will not be covered in this article with the exception of the Domino Platform stats.

UNIX performance monitoring tools
Next, let's take a look at the tools and statistics for Solaris and Domino Platform stats. Some of the standard UNIX tools shipped with UNIX that are used to monitor the performance of the system resources are:

vmstat, which is one of the simplest and most useful tools because it reports important data in the categories of CPU, memory utilization, and disk I/O
iostat, which is also important because you can use it in conjunction with vmstat to determine if there is a disk bottleneck
netstat, which is used to monitor critical network activity
sar, or system activity reporter, which you can use to record a large set of system statistics that includes time stamping of the statistics. (This tool is most useful for capacity planning and trend analysis. It does not collect network stats.)

You can run these four tools from the OS prompt. For example, to use vmstat:

$ vmstat 300

In this case, data is averaged and output over 5 minute intervals (300 seconds) for an indefinite period of time.

To capture the output to a file, use:

$ vmstat 300 > vmstat.out

To send both to the screen and to an output file, use:

$ vmstat 300 | tee vmstat.out

(Unlike with Intel, there is minimal additional overhead incurred when sending output to the screen on the latest UltraSPARC-III.)

You can use the UNIX man pages to get help on the various switches and definitions of terms. For example:

$ man vmstat

It’s relatively easy to create a UNIX shell script that automatically runs a few of the utilities at the desired interval. Here's a sample script; note that the & sets the tool to run in the background. (Note that you could run each of these commands manually at the OS prompt.)

#!/bin/csh
vmstat 300 >vmstat.out &
iostat -xtc 300 >iostat.out &
netstat -i 300 >netstat.out &
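Unlike sar, vmstat, iostat, and netstat do not time stamp their output, which makes long captures harder to line up for trend analysis. The following is a minimal sketch of one way to add time stamps, not part of the original script. The function name stamp_lines is hypothetical, and the sketch is written for the Bourne shell (sh) rather than csh.

```sh
# Prefix every line read from standard input with the current date and time.
# stamp_lines is a hypothetical helper name used only for this illustration.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Possible usage (not run here):
#   vmstat 300 | stamp_lines > vmstat.out &
```

Because each line is stamped when it is read, the time shown is the end of the reporting interval that the line describes.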

The following sections provide more details about these tools and the statistics that are important to monitor. The acceptable ranges for the important statistics are discussed later in the Resource bottleneck threshold rules-of-thumb section.

Details about vmstat
vmstat is one of the simplest and most useful tools because it reports important data in the categories of CPU, memory utilization, and disk I/O. To see the system activity for 3 seconds with a 1-second reporting interval, use:

vmstat 1 3

The following is an example of the results of doing a vmstat 1 3. The bolded columns (r, b, sr, us, and sy) are most important. (Note that the first output line of the data is really the accumulated statistics since the system startup. That line is not included in the examples in this article.)

Sample vmstat results

In the process (procs) group of statistics, there are two important stats, r and b:

r is the number of processes in the CPU run queue.
b is the number of processes blocked waiting for resources (I/O, paging, and so forth).

In the memory group of statistics, the important stat is sr:

sr is the scan rate, that is, the number of pages scanned per second by the page scanner. A high value can be an indicator of a RAM shortage.

The cpu group of statistics gives a breakdown of the percentage usage of CPU time. On MP systems, this is an average across all processors.

us is the percentage of user CPU time.
sy is the percentage of system CPU time.
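The five columns described above can be pulled out of a vmstat capture with a small filter. This is a minimal sketch under stated assumptions: the function name vmstat_key_columns is hypothetical, the header row is assumed to begin with the r and b column names, and a POSIX awk is assumed (on Solaris this may mean nawk or /usr/xpg4/bin/awk instead of the older /usr/bin/awk).

```sh
# Print only r, b, sr, us, and sy from vmstat output read on standard input.
# Columns are located by name because the number of disk columns varies.
# vmstat_key_columns is a hypothetical name used only for this illustration.
vmstat_key_columns() {
    awk '
    # Header row: remember the position of every column name.
    # "sy" appears twice (faults and cpu); the later one, cpu, wins.
    $1 == "r" && $2 == "b" {
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    # Data row: first field is numeric and a header has been seen.
    ("sr" in col) && $1 ~ /^[0-9]+$/ {
        printf "r=%s b=%s sr=%s us=%s sy=%s\n",
            $(col["r"]), $(col["b"]), $(col["sr"]), $(col["us"]), $(col["sy"])
    }'
}

# Possible usage (not run here):
#   vmstat 300 | vmstat_key_columns
```

Remember that the first data line vmstat prints reflects activity since system startup, so it should be disregarded when judging current load.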

Details about iostat
To get two reports at a 300-second reporting interval (the first report covers the time since system startup, and the second covers the 300 seconds that follow), use:

iostat 300 2

You can add the -x switch to provide extended statistics, which makes the output more readable because each disk has its own line. You can also add the -c switch to report the percentage of time the system has spent in user mode, in system mode, waiting for I/O, and idling. The following is an example of the results of running iostat with these switches, specifically:

iostat -cx 300 1

The bolded columns (svc_t, %b, us, sy, and wt) are most important.

results of iostat -cx 300 1

In the extended device group of statistics, there are two important stats, svc_t and %b:

svc_t is the average service time, in milliseconds, of the disk.
%b is the percent of time the disk is busy (transactions in progress).

In the cpu group of statistics, there are three important statistics, us, sy, and wt:

us and sy are the percentages of CPU time spent in user mode and system mode, respectively.
wt is the percentage of time the CPU spends waiting for I/O.

Details about netstat
The netstat command can show you how healthy your network is. Using the -i switch summarizes all the network interfaces. Another switch that can be useful is the -s switch, which lists all the protocols. The following is an example of the results of doing a netstat -i:

Sample results of netstat -i

The column colls indicates the number of collisions of the network packets. If there are no collisions, then the network is probably not experiencing a performance problem. (Note that you will not see collisions on servers on switched networks.)

Because netstat's viewpoint is that of just one node on the network, its usefulness is limited in assessing overall Domino server performance. A single 100 Mb/s pipe can deliver requests and absorb responses much faster than all but the very largest Domino servers can handle. Network issues are best viewed from the perspective of the entire network using network sniffer hardware or software.

Domino Platform stats
Domino Platform stats let you capture OS statistics such as CPU utilization from within Domino. As a Domino systems administrator, you may find it easier to collect these statistics using Domino rather than the OS tools.

The Platform stats are integrated into Domino Monitoring (Events4.nsf) and Reports (Statrep.nsf). You can set thresholds and alarms for server resource usages within Domino. Platform stats are available via the server console, the new Domino 6 Administration client, and the Web client. Both Domino statistics and OS statistics can be collected together, both in real-time and for historical trend analysis. Platform stats are grouped into five categories: logical disk, memory, network, CPU, and miscellaneous system stats. For some stats, average, minimum, and peak values are calculated.

Platform stats were introduced for Solaris in Domino 5.0.2. Domino R5 supports Solaris 2.6 through 8. Domino 6 supports Solaris 8.

Note that in R5, you need to include the following NOTES.INI setting to activate Platform stats:

Platform_Statistics_Enabled=1

When you issue a "show stat platform" command in Domino 6 on Solaris, over 150 statistics are generated. (The naming convention for the statistics that are reported is the same across platforms to make it easier for you to support multiple server platforms.) Some of the particularly interesting statistics include:

Platform.LogicalDisk.10.PctUtil.Avg = 0
Platform.LogicalDisk.10.ServiceTimeinmsecs.Avg = 0
Platform.Memory.ScanRatePagesPerSec.Avg = 0
Platform.Network.3.PctCollisionRate = 0
Platform.Network.3.PctUtilBandwidth = Not Available
Platform.System.PctCombinedCpuUtil.Avg = 20.7

See the Domino 6 Platform stats sidebar for a complete listing of the reported statistics.

When you issue a "show stat platform" command in Domino R5 on Solaris, fewer statistics are reported. Some of the particularly interesting reported statistics include:

Platform.LogicalDisk._Total.1._Total.1.PctTime.avg = .1
Platform.LogicalDisk._Total.1._Total.1.ServiceTime.avg = 49.8
Platform.Memory.PagesPerSec.avg = 86.7
Platform.System.TotalUtil.avg = 99

See the Domino R5 Platform stats sidebar for a complete listing of the reported statistics.

Resource bottleneck threshold rules-of-thumb
This section attempts to establish rules-of-thumb to determine if important system resources are on the verge of being a bottleneck and limiting server performance.

Disk bottlenecks
Disk bottlenecks are the most likely bottlenecks. Here are the thresholds you should look for using the different monitoring tools.

Using iostat:

The significant bottleneck threshold is %b (percent time disk busy) > 20% AND (20 ms < svc_t (ServiceTime) < 30 ms)
The critical bottleneck threshold is %b (percent time disk busy) > 20% AND ( svc_t (ServiceTime) > 30 ms)
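The two iostat thresholds above can be applied to a captured report with a short filter. This is a sketch only: the function name disk_bottlenecks is hypothetical, the header row is assumed to start with the word device and to contain svc_t and %b, and a POSIX awk is assumed. A service time of exactly 30 ms falls between the two stated ranges; this sketch treats it as significant, which is an assumption.

```sh
# Flag disks that cross the rule-of-thumb thresholds in "iostat -x" output.
# disk_bottlenecks is a hypothetical name used only for this illustration.
disk_bottlenecks() {
    awk '
    # Header row: locate svc_t and %b by name.
    $1 == "device" {
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    # Device rows: must be wide enough to contain the %b column.
    ("svc_t" in col) && ("%b" in col) && NF >= col["%b"] {
        svc  = $(col["svc_t"]) + 0
        busy = $(col["%b"]) + 0
        if (busy > 20 && svc > 30)        print $1, "critical"
        else if (busy > 20 && svc > 20)   print $1, "significant"
    }'
}

# Possible usage (not run here):
#   iostat -x 300 | disk_bottlenecks
```

Disks below both thresholds produce no output, so an empty result means no disk was flagged in that capture.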

Using vmstat:

A significant bottleneck threshold occurs if b (processes blocked for resources) approaches r (# in run queue)
A critical bottleneck threshold occurs if b (processes blocked for resources) equals or exceeds r (# in run queue)

The relevant Domino R5 Platform stats are:

Platform.LogicalDisk._Total.1._Total.1.PctTime.avg
Platform.LogicalDisk._Total.1._Total.1.ServiceTime.avg

The relevant Domino 6 Platform stats are:

Platform.LogicalDisk.#.PctUtil.avg
Platform.LogicalDisk.#.ServiceTimeinmsec.avg

Memory bottlenecks
Here are the memory bottleneck thresholds you should look for using the different monitoring tools.

According to the book Sun Performance and Tuning: Java and the Internet, by Adrian Cockcroft and Richard Pettit, using vmstat:

A significant bottleneck threshold occurs if sr (free page scan rate) > 200 scans/second
A critical bottleneck threshold occurs if sr (free page scan rate) > 300 scans/second

(Note that the last point could indicate thrashing because active and inactive pages will be stolen from the Process Working Set.)
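The two scan-rate thresholds can be expressed as a simple classification. This is a minimal sketch; the function name scan_rate_level is hypothetical, and it applies the book's figures literally, so on Solaris 2.6 and 7 systems with heavy file system activity it is likely to report critical even when the rate is normal for Domino.

```sh
# Classify a vmstat sr value against the 200 and 300 scans/second thresholds.
# scan_rate_level is a hypothetical name used only for this illustration.
scan_rate_level() {
    # $1 = sr (pages scanned per second)
    awk -v sr="$1" 'BEGIN {
        sr += 0                              # force numeric comparison
        if (sr > 300)      print "critical"
        else if (sr > 200) print "significant"
        else               print "ok"
    }'
}

# Possible usage: scan_rate_level 250
```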

You can also issue the command vmstat -S to report on swapping rather than paging. The si and so columns show swap-ins and swap-outs.

The relevant Domino R5 Platform stat is:

Platform.Memory.PagesPerSec.avg

The relevant Domino 6 Platform stat is:

Platform.Memory.ScanRatePagesPerSec.avg

Note, however, that for Solaris 2.6 and 7 systems running Domino, we see very high scan rates, upwards of 4000 pages per second at the high end. Despite Cockcroft's guideline, we think such rates are perfectly normal for an application that uses the file system heavily. Unfortunately, the high scanning rates caused by file system activity will completely mask any other scanning sources.

The file system in Solaris 8 generates little or no page scanning. Any scanning that does occur must arise from somewhere else. A few bursts here and there are no cause for alarm, but sustained scanning probably means something's not right. Sustained scanning at low rates may mean that NSF_Buffer_Pool_Size should be reduced a little. Try reducing it in 50 MB steps. High rates may mean that more RAM is needed or that the system is very badly mistuned.

A low (relatively speaking) scan rate does not prove that you've got the optimal amount of RAM. Domino scales its own memory use to the RAM size and may avoid scanning and paging, but it might still benefit from larger caches and so on if more RAM were made available.

CPU bottlenecks
Here are the CPU bottleneck thresholds you should look for using the different monitoring tools.

Using vmstat:

A significant bottleneck occurs if r (run queue) is > 2 times the number of CPUs, and/or %CPU > 75%
A critical bottleneck occurs if sy (% CPU system time) is greater than 0.25 times us (user/application % CPU time), and/or %CPU > 85%
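The CPU rules above can be sketched as a small classification function. Several things here are assumptions rather than part of the original guidance: the function name cpu_level is hypothetical, %CPU is taken to mean us + sy, and the critical test is checked before the significant one.

```sh
# Classify CPU load from vmstat values using the rules of thumb above.
# cpu_level is a hypothetical name used only for this illustration.
cpu_level() {
    # $1 = r (run queue), $2 = number of CPUs, $3 = us, $4 = sy
    awk -v r="$1" -v ncpu="$2" -v us="$3" -v sy="$4" 'BEGIN {
        r += 0; ncpu += 0; us += 0; sy += 0
        total = us + sy                     # assumption: %CPU means us + sy
        if (sy > 0.25 * us || total > 85)        print "critical"
        else if (r > 2 * ncpu || total > 75)     print "significant"
        else                                     print "ok"
    }'
}

# Possible usage (psrinfo prints one line per processor on Solaris):
#   cpu_level 10 "$(psrinfo | wc -l)" 60 10
```

Applied literally, the sy-to-us ratio test also fires on a nearly idle system (for example, us = 2 and sy = 1), so you may want to apply it only when total utilization is already noticeable.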

The relevant Domino R5 Platform stat is:

Platform.System.TotalUtil.avg

The relevant Domino 6 Platform stats are:

Platform.System.CPUQueueLen.avg
Platform.System.PctCombinedCpuUtil.avg (total CPU utilization average)

When you run “uptime” at the OS prompt, it gives the load average, which is the sum of the run queue length and number of jobs currently on the CPUs. It gives an average over 1 minute, 5 minute, and 15 minute periods. This can be used as an average run queue length to see if you need more CPU power.
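To compare the load average with the number of processors, you can divide the 1-minute figure by the CPU count. This is a rough sketch: the function name load_per_cpu is hypothetical, and it assumes that the uptime line contains the text "load average: " followed by three comma-separated numbers.

```sh
# Print the 1-minute load average divided by the number of CPUs.
# load_per_cpu is a hypothetical name used only for this illustration.
load_per_cpu() {
    # $1 = number of CPUs; the uptime line is read from standard input
    awk -v n="$1" '{
        sub(/.*load average: */, "")    # keep only "x, y, z"
        split($0, la, /, */)
        printf "%.2f\n", la[1] / n
    }'
}

# Possible usage (not run here):
#   uptime | load_per_cpu "$(psrinfo | wc -l)"
```

Because the load average includes jobs currently running on the CPUs, it is not directly comparable to the vmstat r column.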

Network bottlenecks
Here are the network bottleneck thresholds you should look for using the different monitoring tools.

Using netstat, when collisions are greater than 5 percent of the packets sent, you are starting to experience network saturation. That is, colls / packets > 5%.
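The collision ratio can be computed per interface from a one-shot netstat -i report. This is a sketch under assumptions: the function name collision_rates is hypothetical, the header row is assumed to start with Name and to contain an Opkts column plus a collision column named Collis or colls, and a POSIX awk is assumed. The interval form (netstat -i 300) uses a different layout and is not handled here.

```sh
# Print the collision percentage for each interface in "netstat -i" output.
# collision_rates is a hypothetical name used only for this illustration.
collision_rates() {
    awk '
    # Header row: find the output-packet and collision columns by name.
    $1 == "Name" {
        for (i = 1; i <= NF; i++) {
            if ($i == "Opkts") o = i
            if ($i == "Collis" || $i == "colls") c = i
        }
        next
    }
    # Interface rows: skip interfaces that have sent no packets.
    o && c && $o > 0 {
        pct = 100 * $c / $o
        printf "%s %.1f%% %s\n", $1, pct, (pct > 5 ? "saturating" : "ok")
    }'
}

# Possible usage (not run here):
#   netstat -i | collision_rates
```

The counters in a one-shot report are totals since the interface came up, so the result is a long-term average rather than a reading for the current interval.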

Domino R5 Platform stats do not have a network stat available.

The relevant Domino 6 Platform stats are:

Platform.Network.#.PctCollisionRate
Platform.Network.#.PctUtilBandwidth

Note that this last statistic is an example of a stat that is calculated from other stats.

Remember that you will not have collisions if you have a switched network. In that case, you should determine if the byte rate is approaching saturation. If you are running Domino 6, you can use the Platform stat Platform.Network.#.PctUtilBandwidth. A value over 30 percent may indicate that the network is near saturation. If you are running R5, then you can compute the percent network utilization or obtain it from a network sniffer. The netstat -i command discussed previously reports input and output packet counts rather than bytes, so you need an average packet size to turn them into an estimate of the percent utilization.
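The percent utilization is simple arithmetic once you have a byte count for an interval: bits transferred divided by the bits the link could have carried. The following sketch is illustrative only. The function name net_util_pct is hypothetical, and it sums input and output bytes against a single link rate; on a full-duplex link you may prefer to evaluate each direction separately.

```sh
# Percent of link bandwidth used over an interval.
# net_util_pct is a hypothetical name used only for this illustration.
net_util_pct() {
    # $1 = bytes in + bytes out during the interval
    # $2 = interval length in seconds
    # $3 = link speed in Mb/s (for example, 100)
    awk -v bytes="$1" -v secs="$2" -v mbps="$3" 'BEGIN {
        printf "%.1f\n", (bytes * 8) / (secs * mbps * 1000000) * 100
    }'
}

# Possible usage: net_util_pct 1125000000 300 100
```

For example, 1,125,000,000 bytes in 300 seconds on a 100 Mb/s link is 9,000,000,000 bits out of a possible 30,000,000,000, or 30 percent, which is the level the guideline above treats as possibly near saturation.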

A few tuning tips
It’s worth repeating that usually the best performance is achieved using the default OS settings. However, after collecting good performance baseline data, you can try making changes to the /etc/system file one at a time to see if they improve performance. Note that you will need to reboot the machine after changing a parameter. You can find a discussion of most of the recommended settings in the white paper, “Domino 5 on Solaris: Common Tuning Tips.”

The following recommendations apply to Domino servers handling large amounts of HTTP traffic, for example, WebMail or iNotes Web Access. They'll have little or no effect on systems devoted primarily to NRPC. These tuning recommendations can be found in the white paper, “Capacity Planning of Lotus Domino Server on Sun Enterprise Servers for iNotes Web Access.”

Listen backlog
Increase the size of the listen backlog. Backlog specifies the maximum size the queue of pending connections can grow to. By using a bigger backlog, the Domino server can deal with bursts of traffic more gracefully by queuing requests. The size of the listen backlog can be increased as follows:

Increase the size of the TCP listen queues by executing the following commands as root:

/usr/bin/ndd -set /dev/tcp tcp_conn_req_max_q 4096
/usr/bin/ndd -set /dev/tcp tcp_conn_req_max_q0 4096

Add the following line to the httpd.cnf file in the Notes data directory before bringing up the Domino server:

listenbacklog 4096

Segmap percent
Segmap percent determines the amount of kernel virtual address space reserved for file I/O operations. Increasing the segmap percent results in a reduction of virtual-to-physical memory mappings/unmappings and crosscalls. Note that, although Domino is a 32-bit application, it can be hosted on 64-bit Solaris and can take advantage of the 64-bit kernel virtual address space for caching the file system data.

Segmap percent can be increased by adding the following line to the /etc/system file and then rebooting the system:

set segmap_percent=32

Increase the buffering in the streams device driver to deal with the bursts in network traffic more gracefully. To increase it, add the following line to the /etc/system file and then reboot:

set sq_max_size=512

File descriptor limit
Domino requires a large number of file descriptors to manage databases and user connections, and the default file descriptor limit is 1024. The file descriptor limit must be increased to support a large number of HTTP requests to the Domino server. Sun recommends changing this parameter to 8192. To set this parameter, you must append the following line to the /etc/system file to increase the number of file descriptors for the Domino server:

set rlim_fd_max=8192

After making any change to the /etc/system file, reboot Solaris so that the new settings take effect.

Note that this recommendation is only valid for stateless protocols such as HTTP. For other protocols, such as NRPC, Sun recommends that you append the following line to the file /etc/system to increase the number of file descriptors for the Domino server:

set rlim_fd_max=65536

After making this or any change in the /etc/system file, reboot Solaris to have the new settings take effect.
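After the reboot, you can confirm from a shell that the hard file descriptor limit is at least the value you intended. This is a small sketch, not a Domino or Solaris utility: the function name fd_limit_ok is hypothetical, and it assumes a shell whose ulimit built-in accepts -H -n (ksh and bash do).

```sh
# Succeed if the reported hard descriptor limit meets the required minimum.
# fd_limit_ok is a hypothetical name used only for this illustration.
fd_limit_ok() {
    # $1 = hard limit as printed by the shell, $2 = required minimum
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

# Possible usage (not run here):
#   fd_limit_ok "$(ulimit -H -n)" 8192 && echo "hard limit is sufficient"
```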

In addition, if you upgrade to a new version of Solaris, any line added to /etc/system should be removed and added again only after verifying that it is still valid.

Also, if you run older application software and Domino on the same Solaris server, make sure that this setting does not affect the older applications. Such programs may have problems managing a large number of files and may need to be run on a different server. Consider consulting your support analyst regarding the system commands that are needed to document file descriptor limit modifications to accommodate older applications.

Conclusion
The rules-of-thumb for performance tuning are constantly changing as the OS, applications, and our knowledge change. This article presents a snapshot of what the Domino Performance team thinks is important at this point. We hope it has given you some new and helpful ideas. As always, we appreciate hearing from you, especially if you have drawn conclusions different from ours based on your experience. Look for rules-of-thumb for other operating systems in future Performance Perspective columns.

Sun Solaris references

Sun Performance and Tuning: Java and the Internet, 2nd edition, by Adrian Cockcroft and Richard Pettit, Prentice Hall, 1998. ISBN: 0130952494
Solaris Internals: Core Kernel Architecture, by Jim Mauro and Richard McDougall, Prentice Hall, 2000. ISBN: 0130224960
IBM Redbook: Lotus Domino for R5 Sun Solaris (SG24-5969-1)
The Sun and Lotus Software page of the Sun Web site
The Domino for Sun Solaris page of Lotus.com
The Lotus Performance Zone
The NotesBench Consortium Web site
Information on the priority paging patch on the Sun Web site
Sun patches and patch clusters at SunSolve Online on the Sun Web site
Sun product documentation and help on the Sun Web site

ABOUT THE AUTHOR
Harry Murray has worked on the Domino Performance team for four years. Prior to that, he worked on the Compaq Computer Application Systems Engineering Performance team.