IS YOUR SYSTEM EXPERIENCING HIGH CPU use, memory faulting, poor response time, high disk activity, long-running batch jobs, poorly performing SQL or client/server or Web requests? Are you about to spend thousands of dollars on more memory, faster disk drives, or other upgrades? Before you do, make sure you know why you're having performance issues. A couple of days gathering data from your system and analyzing it from the highest level down to individual lines of code is time well spent - and will likely yield a result you may not have considered. New hardware may have little, if any, effect on system responsiveness, because the root cause of 90 percent of all performance issues is excessive I/O in one or more applications on your system.
This white paper will focus on part of the process of identifying and solving the root causes of these application performance issues. The sample data was collected using a toolset for the iSeries from MB Software called the Workload Performance Series®. It collects data from a system and presents it in a way that lets you analyze your environment from various perspectives and different levels of detail. The data is gleaned from a real analysis of an actual system, so it accurately reflects what you might expect to find in your own environment. But regardless of whether you use the Workload Performance Series or another data-gathering solution, what's important is how you use the data to address the root causes of performance issues within your environment.
First, look at a high-level view of what's going on in your system. View CPU use by subsystem. This particular system had significant response time issues, and when we look at the data over a 24-hour period, we find that QBATCH consumed 55 percent of the CPU on this system. MIMIXSBS consumed another 15 percent, and QINTER took 11 percent. Typically, we find that a one-to-one relationship exists between CPU use and I/O performed, and the subsystem that consumes the most CPU also performs the most I/O. We explore this relationship in more detail later on; but if we can reduce these 8.5 billion I/Os, we're going to improve the duration of jobs, consume less CPU, experience better response time, less memory use, less memory faulting, and less disk activity.
Physical I/O is one of the slowest things on a box, relative to accessing data that's already in memory or being used internally in your programs. We want to block data as efficiently as possible, so whenever disk arms are physically going to disk to retrieve data we want to make sure its being done with the fastest disk drives, the most disk arms, and the newest caching algorithms. But we also want to make sure that we're only performing I/O when absolutely necessary to accomplish a task. Looking at the same data in another way, confirming the one-to-one relationship between CPU and I/O. QBATCH, which was responsible for 55 percent of the CPU consumption, is responsible for 52 percent of the I/O.
On the other hand, sorting the subsystem data by number of jobs shows that more jobs run in the QPGMBATCH subsystem than in QBATCH. Initiating and terminating jobs creates a tremendous amount of overhead, even if the job only takes one CPU second each time it's run. It starts a new job number, opens 15 files, loads 12 programs into memory just to do a tiny bit of work; then it has to remove all those programs from memory, close all the files, terminate the job, and perhaps even generate a job log each time.
In QPGMBATCH, that happened 288,793 times in a week - that could have been one job sitting in the background, waiting on a data queue. Those files could have been left open, and the programs left in memory; then QPGMBATCH would only be responsible for one percent of the jobs on the system, not 23 percent. In turn, you may have that much less I/O, that much less CPU.
If we look at the same data by job name, we find more areas to drill down into for root cause. (Figure 4) shows the top 10 CPU-consuming jobs on the system. Thousands of jobs run on this system all day long, but on job (PCC7C8619) was responsible for 22 percent of all CPU consumed on this system - that's a great opportunity. If you optimize that process to consume only 1 percent of the CPU, the whole CPU usage curve that rises at 8 a.m. and drops at 5 p.m. in a typical bell curve would drop by 21 percent. That's a dramatic improvement not only in CPU and system capacity, but also in I/O - this job performs almost two billion physical I/Os (remember, I/O typically is the largest resource-consuming function on the system). The second-largest CPU hog (CL1072CL) takes up 14 percent, which isn't much better.
The same job data, sorted by physical I/O (Figure 5), shows that PCC7C8619 also performs the most I/O, clearly causing the CPU issue. However, CL1072CL isn't here; maybe that job doesn't have an I/O issue, but another problem, such as excessive initiation and termination. From the previous Figures, we know that it impacts the system negatively, but we'll have to dig deeper to find the reason.
Sorting the data by number of jobs (Figure 6) shows that the jobs running under the name ROBOTCNL comprise 38 percent of all jobs on the system. The QZRCSRVS jobs are remote procedure calls coming from other systems; the chart shows that 60,126 times during the test period, another platform did a remote procedure call and triggered a call to a native program - that's something that could have been done more efficiently. It's simpler to just use that built-in Client Access capability for doing these remote procedure calls, but when you discover what kind of impact it has on the system from a performance standpoint, you might consider doing things more efficiently. For example, you could implement a TCP/IP socket client and server capability within that application, or use data queues instead of remote procedure calls - either could have consumed a lot less resource.
QZDASOINIT is ODBC requests responsible for 10 percent of the jobs on the system. All day long, these job start connections, open SQL requests, do a little bit of work and close the requests. Over and over again, these requests start and stop jobs, open and close files, and load and unload programs into and out of memory.
Typically, we believe that the end user using the WRKQRY command, SQL, or some type of client/server application like MS Access is the performance issue on our box. But when we look at utilization by user we may find that what's being done in operations is actually the culprit. Regardless of what application you're running, or what tools you're using to schedule jobs or replicate data for high availability, you must address the underlying root causes of issues within your environment.
(Figure 7) immediately shows that two user IDs (ROBOT and MIMIXOWN) are responsible for a significant amount of the CPU resource consumed on this box. The third, TRANSFER, is probably transferring data between platforms. BATCH and EDIOPR round out the top five resource consumers. But the hundreds of other users on the system represent a very small piece of what's actually being utilized from a resource standpoint. (Figure 8) shows once again that the top CPU consumer is the top I/O performer, and hammers home the old message: focus on the I/O and you will address many of the performance issues that you experience on your systems.
(Figure 9) shows the data sorted by number of jobs, and immediately highlights an area for improvement. Note that the EDI process - EDIOPR - accounts for 57 percent of the jobs on the system. This job might be running every three seconds to check for data to send or receive through EDI. If you have a job that's running every three seconds, checking, checking, and checking again all day long, but never actually doing work, you're wasting a lot of resource. Using triggers, data queues, or other better-performing techniques for detecting transactions at the proper status could dramatically reduce the amount of resource consumed by this process.
Continue drilling down into the data; which files are most often accessed, and which could be archived or even purged? The key thing to remember is that you only want to process the data that will be selected - don't read a million records if you're only going to select 1,000. If you have seven years' worth of history in a general ledger transaction history file, and your accounting users only use it to look for recent transactions to close the books at month's end, you need to streamline that file. Move the old data from the current history file into an archived history file. Don't delete it - it's hard to get users to agree to delete anything. Instead, create a file that has only the last month's data in it for daily use, and move the previous six years' data to an archived file. If you users want current information, they can access one file, and if they want old history (which doesn't happen nearly as often) they can access the other.
(Figure 10) shows that reorganizing database files could help reduce I/O of the applications that access them. One main production library (MHSFLP) is responsible for 43 percent of the storage on the system, using almost a terabyte of disk space and comprising nearly 2 billion records. How much of that library could be archived?
The same data, when sorted by deleted records, shows that library IHA440FP has 700 million active records and 72 million deleted ones (Figure 11). Interestingly, it's not even the main production data library - in fact, it's only a fraction of the size - but it accounts for 70 percent of the deleted records on the system. Performing some basic system management tasks, such as database file reorgs, could dramatically improve performance of all processes that access these files.
(Figure 12) goes even deeper into the data, to the user level. It shows that user QSECOFR is responsible for creating database files that account for 46 percent of the storage, TESTBENCH accounts for 18 percent, and ROBOT accounts for 13 percent. But SMARINO, an end user, is consuming 300 GB of disk; is this end user going wild with queries? You should keep a close eye on this data.
Looking at file size by database name (Figure 13) yields a few surprises as well. The CLMFILESAV file immediately jumps out; this is a save file that's apparently never been used, but it's 112 GB - 13 percent of the disk on the system. OBNMAS accounts for 100 GB, but has only been used 16 times in its history; is this another 100 GB of wasted space?
The files on (Figure 14) offer great opportunity for improvement through simple file reorgs. IER200P1 has 29 million deleted records and only 9 million active ones. IIT001W03 is an even better opportunity, simply because it will take less time to clean up - it only has 88,000 active records.
Looking at queries provides an even more detailed level of analysis. This example (Figure 15) shows the system during normal business hours. QINTER had 13 billion skipped records; put another way, 13 billion records were read and not selected, and therefore didn't need to be read. If you read a record and don't select it, that's a full table scan, and full table scans unnecessarily consume tons of resource.
This example is particularly bad. In one day, users performed a total of 3,000 queries - which resulted in 14 billion I/Os - to select 8,000 records. This seems like a ridiculous example, but it's real. This invisible use of resource occurs every day in many environments. It's not truly 'invisible,' but it happens so quickly that most people never see it. These 3,000 requests might have appeared only momentarily, and by the time you have refreshed your screen they were gone. The 14 billion unnecessary I/Os could have paralyzed the system from an interactive response time standpoint, but were individually invisible until we analyzed the data this way.
CO405JTCP isn't much better; it performed three billion I/Os via almost 3,000 queries to select 38 million records. However, another issue exists in this subsystem - why is it selecting so much? What in the world are they doing with so many records for each query? This problem results from applications that preload entire data sets into memory, while the user only accesses the first small set of records. When the user pages down, the application accesses the next page of data out of memory. From a coding perspective, it's much easier to preload 10,000 records into memory and let the user page through them; from a performance perspective - with hundreds of simultaneous users - it's deadly. For system performance, it's worth the additional time and effort to code the application properly - the application should read 10 records at a time, and not a single record more. When the user pages down, read 10 more records. It's horribly inefficient to load 10,000 records into memory that the user will never even think of reading.
Let's look at it another way (Figure 16), sorted by records selected. Remember CO405JTCP? Not surprisingly, it accounts for 89 percent of all of the selected records. QBATCH, which made an appearance in our high-level analysis, is number two. The worst performer, however, is QINTER. It reads 13 billion records to select only 8,000. (Figure 17) shows the same data organized by number of queries, and again, QINTER stands out; it accounts for half of all the queries on the system. Unnecessary initiation and termination is the likely culprit here.
(Figure 18) shows the data by job. Remember PCC7C8619? It was our largest CPU consumer, our largest physical I/O performer, and now we that it has the most skipped records: 36 percent of the total. That's more than 3 billion records skipped to select 38 million. This job is consuming 22 percent of our CPU, and now you see why.
We need to address a few issues here. First, we must make sure that this job has the proper permanent indexes so that it reads 38 million records instead of 3 billion. Next, we should find out why it's selecting so many records each time.
Finally, we need to ask ourselves, 'Why is it running so many times?' Most likely, you have 15 users throughout the company all wanting a copy of a single report, so they're all running the job individually. It would've been a lot more efficient to run the job once and print 15 copies of that report.
Ultimately, you want to get to the level of detail that shows what database needs tuning. (Figure 19) shows what looks to be the master file (ENBMAS) that can clearly benefit from additional indexing. The application performed 2,000 queries to select 300,000 records, and it skipped 7 billion records along the way. Fixing this problem won't involve the building of 1,000 new logical files on your system. Some people are very reluctant to build any new logical files, and in fact some corporate standards prohibit it. What a mistake that is from a performance standpoint. You don't want redundant indexes of course, but you sometimes need a new one, and developers or database administrators must be allowed to properly index and tune databases so that applications can perform well. In this example, we see three databases that desperately need indexing - they're reading billions of records to select just a few million records.
(Figure 20) shows the data sorted by selected records, while (Figure 21) shows it sorted by queries. Notice that the same files are showing up in each of these graphs; when you can see this happen, the areas that need tweaking become obvious. From a program standpoint, (Figure 22) shows that IOMEM001 is skipping almost 12 billion records in 3,000 queries to select 3,000 records. That's 3,000 full table scans in one day to select one record each time.
Even beyond the database level is the index level. (Figure 23) seems confusing at first, but it's saying this: if the file ENPMAS had a new permanent logical file keyed by MEMBNO, GRPNUM, EFFDAT, and RECTYP, you would eliminate almost 7 billion I/Os on this system. These I/Os were occurring because an access path didn't exist as a permanent logical or it didn't get used. Further analysis will show what's causing that - look at the job log of a job that's running queries with debug turned on, and see what that operating system is doing. See what decisions it's making.
(Figure 24) sorts the query data by selected records. SVCMAS is another master file that's missing an index key by MODCOD. Because of that there is 10 times more I/O than is necessary. But another issue exists here as well: why did we select 40 million records? The reason is that we probably neglected to put another criteria in this query. We selected by MODCOD and downloaded 10 million records, then did another pass to further subset the data. If we had done all of that in one pass, we might have been able to select just a few thousand records.
Use Good Judgment
This white paper focuses primarily on SQL-related issues. You also may have code issues with your RPG, Cobol, CL, or C code. Or, you may have 2 GB of journal receivers per day that your high availability tool is having trouble trying to process. Other MB Software white papers focus on these other areas.
No matter what analysis you do, you must use good judgment. You would never want to build 1,000 new permanent logical files. You have to know when to stop. Look at different periods of time and different samples of data; maybe the numbers vary dramatically during month end, weekend or day end processing. Look at the data during a variety of key business times, and correct the issues that will have the greatest impact on your system.