Regardless of the intended use of journaling on your systems, 50 to 80 percent of the typical journal receiver transactions on large systems are generated unintentionally and serve no useful purpose. This huge inflation in transaction volume can significantly impact system performance and can hinder the ability of your high availability or data warehousing solutions to keep data current.
This white paper covers three main areas: journaling uses and benefits; understanding the costs and implications of journaling; and ongoing management and administration challenges.
Basically, journaling is recording every change that occurs on the system. That can be a change occurring inside a database file; a record or transaction being added, updated, or deleted; a new program being installed; a data area change; or a stream file changing on the integrated file system (IFS).
Journaling Uses and Benefits
Journaling can give great benefit to many aspects of your day-to-day business operations, from high availability issues to auditing to commitment control. Let's look at a few of these areas and examine journaling considerations within each.
Your main production system will have a lot of data to back up for disaster recovery purposes, and regardless of what tool you use for high availability, it will require that you turn journaling on. Depending on what level of high availability you're looking for, you might journal just your production data libraries where your physical and logical files are stored, or you might journal the whole IFS if you have applications that use it to store documents or images.
Regardless of whether you're doing selective journaling or journaling the entire system, 50 to 80 percent of the transactions in your journal receivers are wasted - they're transactions that don't need to be occurring. Your first instinct might be to journal everything on your system, but there is a huge cost associated with that, and the bulk of that data isn't necessary. This becomes apparent when you determine that temporary user spaces on the IFS account for 89 percent of the transactions in your journals, placing huge DASD, CPU, I/O, and memory demands on your system and forcing a data replication tool to filter through gigabytes of data that shouldn't exist. Those temporary user spaces were used internally to keep track of pointers and data; they never needed to be replicated to another box. That volume of data has a huge effect on the system, whether it's the source system, the target system, or the network in between. And if offloading workload - sending query users to another system, or creating a hot backup - forces you to upgrade the source system, something has gone wrong. The whole premise is that you're moving workload to free up resources on the source system, so let's make sure we're actually doing that.
Real-time replication can be a challenge, depending upon the volume of transactions in your journal receivers. Many times, weekly processing can cause replication to be up to 10 hours behind. If processing finishes on Sunday, it can take until noon on Monday for the data replication tool to filter through all the transactions and find what needs to be sent to the other system.
Regardless of whether you do local journaling and have the data replication tool filter those transactions for you, or you use the remote journaling capability available in newer releases of the operating system, it's still the same issue. Remote journaling can give you an advantage, but if you're piping 15 or 50 GB of transactions when you only need to transmit 1 GB, there can still be a huge performance impact. If we can get those transactions down to the 1 GB they're supposed to be, performance will be optimized, whether you're doing local or remote journaling.
You want to keep an eye on the increase in network traffic. It's important to look at transactions per hour and megabytes per hour that are being sent across the network. Let's minimize the source data. If we only have to process a fraction of the data to begin with, the demands we're going to place on the network are going to be a fraction of what they would be otherwise.
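As a back-of-the-envelope illustration of that point, the sketch below converts a journal transaction rate into network load in MB per hour. The rates and average entry sizes used here are hypothetical assumptions for illustration, not measurements from any real system:

```python
# Rough estimate of replication network demand, using hypothetical figures.

def replication_mb_per_hour(txns_per_hour: int, avg_entry_bytes: int) -> float:
    """Convert a journal transaction rate into network load in MB/hour."""
    return txns_per_hour * avg_entry_bytes / (1024 * 1024)

# Suppose 800,000 entries/hour at ~600 bytes each before filtering...
before = replication_mb_per_hour(800_000, 600)
# ...and only 10 percent of those entries are legitimate.
after = replication_mb_per_hour(80_000, 600)

print(f"before filtering: {before:.0f} MB/hour")
print(f"after filtering:  {after:.0f} MB/hour")
```

Whatever the real numbers are in your shop, the ratio is the point: cutting the source transaction volume cuts the network demand by the same factor.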
With any type of high availability implementation, you need to be sure that it actually works. You should do a role swap on a weekly or monthly basis to ensure that this high availability solution of yours is actually a proven one. It's great to do backups, but if you've never done a restore, you can't be sure how good those backups are. If you have a hot backup but you never swap over to it until there's a disaster, you will have problems. But if you routinely role swap, you will ensure that you're journaling appropriately and you're replicating properly.
Journaling is necessary for data warehousing as well - for transactions, sales history, inventory transactions, customer records, and more. As with high availability, we're talking about the replication of data, using the record of every add, update, or delete that occurs on your system. And just like high availability, offloading of data warehousing workload should not require an upgrade on the source system. It should not require upgrading a T1 to a T9. Let's make sure that we are in fact offloading workload rather than just creating additional workload and moving problems. You haven't fixed anything if you move query users from your main production system over to a backup system, and they're still experiencing the same kind of horrible response time to their query requests that they were back in production. In fact, the additional load of journaling on the source system may even be worse than the query load was to begin with. Make sure you're not just making it a more complex environment - you want to decrease the load that this whole solution has on your production system.
Many data warehousing solutions can also transform data, summarizing and reformatting it as it's being replicated to another iSeries, another library on the same iSeries, or an Intel or Unix system. Is your application sending 800,000 transactions an hour saying that the last activity date and time changed on a customer record? If that application is supposed to be changing the balance due on 10 percent of the records in the file, it shouldn't be changing the last activity date and time on every record. It's a very simple oversight for the programmer to read through the file, move the system date and time to the last activity date and time, recalculate a balance due, and update the record. A simple IF statement in that code could dramatically reduce the load those updates place on your system and improve the ability of the replication solution to keep data current and replicated in the proper fashion.
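A minimal sketch of that "simple IF statement," using Python with SQLite standing in for the application and database (the customer table, column names, and values here are hypothetical, for illustration only):

```python
import sqlite3

# Only update a customer record when a field you care about actually changed,
# so that unchanged records generate no update and hence no journal entry.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, balance_due REAL, last_activity TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 100.0, '2004-01-01')")

def update_customer(conn, cust_id, new_balance, today):
    # Read the current balance first...
    (old_balance,) = conn.execute(
        "SELECT balance_due FROM customer WHERE id = ?", (cust_id,)).fetchone()
    # ...and issue an UPDATE only when something really changed.
    if new_balance != old_balance:
        conn.execute(
            "UPDATE customer SET balance_due = ?, last_activity = ? WHERE id = ?",
            (new_balance, today, cust_id))
        return True
    return False  # no change: no update, no journal entry

print(update_customer(conn, 1, 100.0, "2004-06-01"))  # unchanged -> False
print(update_customer(conn, 1, 250.0, "2004-06-01"))  # changed   -> True
```

The same guard in the RPG read-update loop described above would skip the 90 percent of records whose balance due did not change.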
Again, we don't just want to move issues, we want to solve them. Let's address the performance issue back on the source system; let's not just move it over to a target. End user querying can be a reason why you end up implementing data warehousing solutions. Giving the user a basic querying tool is good for them from a functionality standpoint, but it's not necessarily good from a performance standpoint. Look carefully at SQL performance, performance of queries, proper database tuning, and indexing on your system, whether it's your core business system on the source or the data warehouse that you've created on a target system.
Also, keep in mind that data warehousing means duplicating data; keep a close eye on the integrity of that data. There are things that we can do to reduce the volume of transactions, but we want to make sure that we're maintaining data with good integrity.
Often you may be inclined to exclude the opens and closes of database files from your journaling because they might not be necessary for your data replication purposes. That can be a mistake as well; constantly opening and closing files from jobs, applications, or programs has a huge performance impact on the system. If you have one job that runs for four hours at night, and you determine that 60,000 opens and closes a minute are occurring inside it, there's something wrong. That job ought not to be doing that. Turn that journaling on just for purposes of analysis, and let's identify the program that's responsible. It's a whole lot more efficient to leave files open and move record pointers around than to constantly open and close files. 60,000 opens and closes a minute inside one job is a huge performance issue; if you are in fact journaling those, it's an even bigger one - because now they're getting written to the journal receivers, the replication software is filtering them, and they're causing downstream issues as well.
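The open-once-versus-reopen point can be demonstrated in miniature. This Python sketch (generic file I/O, not iSeries-specific) times reopening a file on every access against opening it once and moving the pointer around:

```python
import os
import tempfile
import time

# Create a small scratch file to read from.
path = os.path.join(tempfile.mkdtemp(), "orders.dat")
with open(path, "w") as f:
    f.write("x" * 1024)

N = 5000  # number of accesses to simulate

# Style 1: open and close the file on every single access.
start = time.perf_counter()
for _ in range(N):
    with open(path) as f:
        f.read(1)
open_close = time.perf_counter() - start

# Style 2: open once and move the "record pointer" around.
start = time.perf_counter()
with open(path) as f:
    for _ in range(N):
        f.seek(0)
        f.read(1)
open_once = time.perf_counter() - start

print(f"open/close each time: {open_close:.4f}s; open once: {open_once:.4f}s")
```

The exact ratio depends on the platform, but the reopen loop pays the full open/close cost on every access, which is exactly the overhead a 60,000-opens-a-minute job is paying.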
Another popular use of journaling is commitment control. In more sophisticated applications, commitment control might be embedded within the application, which accumulates or caches a series of transactions and, at a certain point, either commits or rolls back those transactions. It's important to look at the journaling to see whether there's an opportunity to improve performance, save money on hardware, improve response time for your users, build better-performing Web applications, or finish nightly job streams more quickly. Journaling can have a significant impact on all of those things - we want to make sure that we understand where that impact is coming from.
But when you're doing commitment control and you decide to do data replication for high availability or data warehousing purposes, the two goals can conflict with one another. If we're journaling some files purely for commitment control, and that's creating a large volume of transactions, we may want to segment or separate our journal receivers, because those transactions aren't needed for data warehousing or high availability purposes.
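As an illustration of the commit-or-roll-back pattern described above, here is a minimal Python/SQLite sketch (the table and order data are hypothetical) that accumulates a batch of inserts and applies them all or none:

```python
import sqlite3

# Application-level commitment control: cache a batch of related inserts
# and either commit them as a unit or roll the whole batch back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (order_id INTEGER, line INTEGER, qty INTEGER)")

def post_order(conn, order_id, lines):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            for line_no, qty in lines:
                if qty <= 0:
                    raise ValueError("bad quantity")  # force a rollback
                conn.execute("INSERT INTO order_detail VALUES (?, ?, ?)",
                             (order_id, line_no, qty))
        return True
    except ValueError:
        return False

print(post_order(conn, 1, [(1, 5), (2, 3)]))   # True: both lines committed
print(post_order(conn, 2, [(1, 5), (2, -1)]))  # False: whole order rolled back
(count,) = conn.execute("SELECT COUNT(*) FROM order_detail").fetchone()
print(count)  # 2: only the first order's lines survive
```

On the iSeries the same pattern runs through the journal, which is why commitment-control journaling can generate a large transaction volume that replication doesn't need.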
You might have security auditing turned on in your system. You will have a QAUDJRN journal that might be 15 GB a day or 1 GB a day, depending on the size of your environment. It's interesting when you start looking inside that QAUDJRN to see what's in there. We're all anxious to hurry up and delete these journal receivers, because they're difficult to manage. But they have to be backed up, and we need to have sufficient DASD available for storing them. It's important to hold off for a second and take a look at what's in there. You'll be surprised when you find that 60 percent of what's in QAUDJRN is from some inadvertent thing that's occurring on your system.
If you're journaling object changes within your native file system, it's important to identify those work files. A physical file that the developer ought to have put in QTEMP can be a huge journaling issue, because now there may be 15 or 20 work files (in a data library of 2,000 files) that account for 80 to 90 percent of the transactions in the journal. If we find out that there are a million puts and one clear, what's the point? Why did we journal a file that's just getting cleared? The end result of journaling these files is a whole lot of resource being consumed - disk storage, I/O, CPU, memory, and network traffic on the source system, the target system, and the network in between - only to clear the files a millisecond later. Those files were not needed for high availability purposes, and you will consume a lot less resource if you don't journal them to begin with. Maybe journaling the entire system is the right thing to do, but let's have some exceptions to that policy based on the work files that have been identified.
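To show what identifying those work files might look like, here is a hypothetical analysis sketch in Python. In practice you would first dump the journal entries to an outfile with DSPJRN; the object names, entry codes, and counts below are invented for illustration:

```python
from collections import Counter

# Hypothetical parsed journal entries: (object_name, entry_type).
entries = (
    [("WORKFIL1", "PT")] * 900_000 +   # puts into a temporary work file...
    [("WORKFIL1", "ZC")] * 1 +         # ...followed by one clear
    [("CUSTMAST", "UP")] * 50_000 +    # legitimate customer updates
    [("ORDDTL",   "PT")] * 40_000      # legitimate order detail adds
)

# Rank objects by how many journal entries they generate.
by_object = Counter(obj for obj, _ in entries)
total = sum(by_object.values())
for obj, n in by_object.most_common(3):
    print(f"{obj:10s} {n:>8,d} entries ({100 * n / total:.1f}%)")
```

In this invented example a single work file accounts for roughly 90 percent of the journal volume - exactly the kind of finding that justifies an exception to a journal-everything policy.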
Security audit journal ZC transactions are commonly found as well. Ninety-three percent of your security audit journal may be from changing the authority on an object 90,000 times an hour. It may have been easier for the developer writing the program to explicitly change the authority each time to ensure that it is set properly. But when the authority gets changed 90,000 times an hour to the same value every time, it inflates the impact on the system.
Application debugging is another valuable use of journaling data. You may not be using it for this, but journals are often an underutilized resource from an application standpoint. If you have inventory quantities or order balances getting out of sync, that journal data can be of significant value in helping a developer troubleshoot and debug application issues like these.
You also can identify application performance issues. That change of the last activity date and time on every record was not only a journaling issue; it was causing the application to run 50 to 60 percent worse than it should have. If we see the order status on your order detail or order header file changing from one to two to three and back to one, with an update issued for every one of those changes, the end result was the same record we started with. There shouldn't have been so many updates; an IF statement around the update would have checked whether there really was a change to the important fields.
Troubleshooting application issues, system management issues, security concerns, disaster recovery, and system availability is important - how important varies depending upon your business. If you run a 24x7 shop, it's pretty important how long it takes for that replication software to put current data on another system. If a disaster causes journal receivers to accumulate on the source system, it's not uncommon to see it take two or three days to get the systems back in sync after a failure.
But whether we're talking about day-to-day system availability, a failure, or just a weekend or month-end processing window, things are going to be a lot better if the journals are a fraction of their current size.
Costs and Implications
There are clearly hardware costs associated with having to upgrade your source system, your target system, or the network in between. But there are also costs from a data integrity standpoint. What's the cost to your business for having inaccurate data, data that customers don't trust, or a system or application that users don't trust? If they don't trust the data, they're not going to use it. They're going to use manual procedures rather than the system or the application.
There are costs from a system availability standpoint. If your users are down at month end for an extra day because of processing durations, a significant part of that downtime can be from journaling that's turned on during those big month end updates. It's critical that you analyze those transactions to make sure that your month end processing is done as quickly as possible.
When we're talking about performance implications, remember that just turning journaling on could make the nightly process run 20 or 30 percent longer (we've seen it as high as 60 or 70 percent longer) depending upon what the issues are in your particular environment. Let's understand what the implications are, how much journaling is contributing to the nightly processing duration. Let's get down into these receivers and understand what truly is causing these delays in processing.
There are things that you can do from a hardware standpoint: You could put your journals on a separate auxiliary storage pool (ASP), for example. But we ultimately want to focus on the source of the data, the application at the source of these transactions. Let's make sure that every transaction is a legitimate one. We don't want to focus on the things that only occur five or 10 times a day; we want to focus on those things that are big. Whatever the issue is, let's identify it and see if there's a better, more efficient way of getting it done.
Real-time data replication might not be needed - maybe that day-old, week-old, or month-old data is okay for a particular query need. But where real-time data is important, we want to make sure that we're actually able to provide it. A 10-hour delay after the weekend process or five-minute delays in an order entry system may be critical, depending on your line of business.
There are performance implications in lots of ways and lots of places. It's not purely how much disk space a journal receiver is taking; it's also how much CPU, I/O, memory, and disk activity results from these transactions that don't need to be created.
We want to make sure that any upgrade done for a journaling need is an appropriate one, and if there are other ways to offset the cost, let's use them. Maybe we can avoid a certain kind of upgrade altogether, or change the kind of upgrade that needs to be done. Maybe there's some database tuning that would improve the performance of all the SQL and queries running on your system; if inefficient database access via SQL is consuming 30 percent of your CPU, let's make some improvement there. Make room for the additional journaling load by cleaning up disk space: if there is additional disk load from journaling (and there will be), let's clean up the old objects on your system that you aren't using and shift things around - look at some things that have nothing to do with journaling to see if we can free up resources before adding this burden to the system.
We want to make sure that we understand - regardless of whether we're optimizing journals, databases, or code - the cost and the benefits of doing that optimization. Very frequently, we find that people have systems that are I/O bound to begin with, or become I/O bound as a result of journaling, and the first thing they want to do is a hardware upgrade. They want to upgrade the CPU, get more memory, faster disk drives, and more disk arms. That's the wrong approach. First, let's make sure we know what the problem is. If the problem is an I/O-bound system, is more CPU going to help? It might not; or, it might not help enough to offset its cost. If the system is I/O bound, let's understand where that I/O bottleneck is coming from. If it's journaling related, let's address it. If it's application or database related, let's address it and then see what truly needs to be upgraded based on a properly performing and efficient application environment.
There might be some very simple system management techniques that you can use to offset some of these costs; for example, if you have access to code or have development staff or a vendor that's cooperative, there might be some simple application changes that will yield good results. The things that we've talked about here are not major changes - we're not rewriting applications, re-architecting and redesigning databases, or completely changing the way your operation is running. There are usually some very simple, basic things that can be done to dramatically improve performance on a system. It's just a matter of identifying them and making sure you are focused on the things that matter the most.
Data integrity clearly has an impact on your business. We need to understand what cost the business incurs from inaccurate data or data that's not timely. These things are usually very difficult to identify and solve. A user can usually just go to a screen and show you some data that's wrong, but it's a far different thing to be able to trace that back through your entire job stream to figure out what caused the inaccuracy. Journal transactions are a record of every add, update, and delete that occurs on your system. Depending on what kind of journaling you're doing, you may even have a before image and an after image. That data is a great record of everything that has occurred during a specified period of time, giving you the information you need to identify the source of the data integrity issue - to find out which programs did what, when, what user, what job, over what file, and so forth.
We want to make sure we have the confidence of the users. It doesn't matter how good an application is from a functionality standpoint; applications need to do many things. They need to perform well, and they need data that users can rely on and have confidence in. Data integrity is critical for a system to actually be usable to an end user.
System availability is just as vital, whether it's day-to-day system uptime or recovery time after a failure. We need to practice failures on a weekly or monthly basis to ensure that our recovery time is as minimal as possible. 24x7 operation is a difficult thing to accomplish, but with a lot of these techniques we can have not only a good backup system, but one that's usable and kept in sync in real time - where everything needed for that system to operate standalone is truly getting replicated in a usable form.
This is a real-time world. All of our businesses gain an advantage by being able to respond quickly to customers' requests and needs. The faster we can fulfill that order or reply to a customer request, the better our advantage over the competition. Real-time data synchronization is a challenge, but you want to look like a truly integrated system to end users and to your customers. Your system might not be very well integrated in the background; you have a mix of all kinds of applications, all kinds of technologies, different database formats, and different programming languages. But you can make it look real time by not having a 24-hour delay before an order status is visible on a Web site.
Ongoing Management and Administration
We want to make sure we're doing appropriate journaling based on the true goals that we're trying to accomplish. Journal receiver retention is something that's important to look at. We want to make sure that we're keeping an appropriate number of transactions or receivers on the system, but we're not deleting them too quickly. Keep them around long enough to analyze them at least once to see where that inefficiency is. Error detection and recovery is a good use as well. Journaling is, of course, good for performance analysis. With it, we can make sure we're getting optimum use out of the resource that we have.
When dealing with system backups and recovery, we want to make sure that we're being reasonable. Taking a system down for six hours every night to back everything up isn't reasonable. There are more sophisticated, more complex backups that you can be doing. You want a reasonable amount of protection without impacting the business too much on a daily or ongoing basis. You also want to analyze the performance of your backups - not just to see what kind of impact the journals are having, but to find out how much time is being spent backing up what.
If you aren't already multithreading your backups, you should consider it; this is a great technique for getting your backups done more quickly. Backing up to disk in the middle of the night into another library, perhaps on another ASP, and then backing up from that second ASP to tape during the day can shorten the backup window considerably. Plus, it can eliminate the need to have people feeding tapes into tape drives at night; the daytime staff can do that.
Journal only what is appropriate. Remember, 50 to 80 percent of what's in these journal receivers is waste. If you're doing any significant amount of journaling, this holds true 100 percent of the time. You just have to know what's in there. You need to do a detailed analysis of your journal receiver transactions, and be selective. Journaling everything often contributes to significant inflation in CPU, I/O, and memory use on your system.
We want to make reasonable choices: analyze those receivers before we delete them, and prevent the wasted transactions from being generated to begin with rather than just filtering them out in the replication tool. There are a lot of options, from simple code changes to basic system management changes, and there's a lot to be learned purely by analyzing this data and seeing what it tells us.