Software Reliability




Reliability Functionality vs Business Functionality

Where should your priorities be?

Copyright © 2000 Theodore Louis Palmer

All rights reserved.

Permission to reprint may be obtained by contacting the author. or 314-994-1621

If you are a manager in the part of your company that provides information technology (IT) to a user community, or a manager in the user community that relies on information technology to run your company's business, then this message is for you. If you have been managing in one of these two areas for some time, then you have probably contributed to, been a part of, or shared war stories with those who have had to deal with applications of information technology that have been marginally reliable and a major source of frustration and consternation to everyone touched by them, including your company's customers. There exist many applications of information technology that are such a pain to use, or be the victim of, but have cost so much to create that those who are stuck with these applications find themselves plagued by the question: "Is it worth fixing or should we trash it and start over?" IT DOESN'T HAVE TO BE THAT WAY!

How these marginally reliable applications came to exist and why they are so common is the subject of this essay. Keep in mind during this discovery process that Reliability Functionality has a big side benefit that I'm saving for dessert: I will tell you about it at the end. Let's start out chronologically to track the evolution of applications that are marginally reliable. Applications of information technology are created to satisfy a business need. A business leader recognizes an area of the business operation that can be dramatically improved by automating the way information is handled by that operation. Such a leader discusses the feasibility of such an application with information technology technocrats and, if the concept seems doable at a favorable cost-benefit ratio, a new project is born. If the application is going to make use of new leading-edge information technology that has never been implemented in the experience of those involved, a "proof of concept" phase may be required.

Managers and subject matter specialists from the user organization meet with managers and subject matter specialists from the information technology organization. They create a project control specification, which details, in a combination of business language and technical language, the business functionality requirements of the project. This is where the problem of marginally reliable applications starts. Right here! It starts right here because the project control specification only includes the requirements of business functionality and makes no statement about reliability functionality whatsoever. It is just generally assumed by all that the implementers of information technology will execute their responsibility without flaw, or that any flaws that occur will be minor and will be promptly detected and corrected as part of the natural order of business life.

How can otherwise seemingly intelligent people make such a stupid assumption? The answer lies in acknowledging the interests and priorities of those involved. The people from the user organization are interested in business operations that are going to increase revenue or reduce expenses. This is a good time to mention another source of projects: the salesperson, by whatever title, who sells a customer an information service product that has not been created yet in order to gain a big new source of revenue, or the salesperson who negotiates a business deal that will require a major alteration of the business activity that current application software models. In any case, the focus of attention of the people in control from the user community is the details of business operations that are as highly profitable as possible. So all of their requirements are going to be stated in terms of the business operations needed to make MONEY.


Knowledge vs Ignorance

Where should your priorities be?

When I was a student at Texas Christian University (1978 to 1982) I took a course in business law at the M. J. Neeley School of Business. One day I reported to class as usual and our professor, Greg Franzwa, told our class to assemble in a nearby auditorium. We were about to be graced by the presence of a guest lecturer. Our guest lecturer was Thomas Carol who at the time was the CEO, or functional equivalent thereof, of Lever Brothers. To break the ice and establish a camaraderie with the assembled college students, he told a joke on himself. Mr. Carol said: "I'm in general management. I have learned that the higher and higher you go in general management, the less and less you have to know about more and more until you finally get to the top when you know absolutely nothing about everything."

The ruling dynasties of China for thousands of years had a status symbol that no one else in the land could display -- grotesquely long curled fingernails. Why was this a status symbol? Unless you have personal servants to tend to your every need, it is not possible to perform the routine tasks of daily life with such fingernails. Have you ever heard a high-ranking manager say that they owe their success in business to the fact that they have people working for them who are smarter than they are? Thus it is that to climb the organization chart of most large companies a person must be more successful than their competitors at displaying ignorance and relying on the detailed knowledge of others. The success of Silicon Valley has made a big change in that formula, but those companies are the vendors of information technology, not the users or the implementers. It remains in the preponderance of cases that in the halls of corporate America ignorance is a status symbol and knowledge is a stigma.

The knowledge vs ignorance phenomenon is why you never see any managers above the first level participating in software requirements specification meetings. They don't want to know the details of the business operation at the level necessary to specify software requirements. This is why users who specify their software requirements do NOT recognize the need to give higher priority and precedence to reliability functionality over business functionality. They don't want to know about it and just assume that quality and reliability of information technology is the responsibility of information technologists. On the flip side of the issue, the information technologist first level managers know that when their next performance appraisal comes due their immediate manager is going to contact the managers of the user community and ask if all their information technology needs are being satisfied. Of course the information technologist first level managers want the response to be a resounding YES! So they give the user managers what they ask for without insisting that reliability functionality come before business functionality. This is like giving small children a steady diet of cake, candy, cookies, and ice cream when what they really need is a steady diet of meat, fish, fruit, and vegetables. In both situations a very high price is paid for this mistake at a much later time, and the more time that passes, the more obscure the cause and effect relationship becomes.


What is Reliability Functionality?

Reliability Functionality has five main components:

      1. Audit Trails
      2. Control Reports
      3. Event Logs
      4. Error Logs
      5. Internal Source Code Documentation


I listed Audit Trails first because it is the easiest one for most people to relate to. But, measured in terms of ongoing beneficial effect, the component of Reliability Functionality that has the greatest value is Control Reports. The higher in the company organization chart a manager is placed the more likely he or she will be looking at reports that have dollar signs in front of the numbers. The lower in the company organization chart a manager is placed the more likely he or she will be looking at reports that do not have dollar signs in front of the numbers. Application Control Reports, in most cases, do not have dollar signs in front of the numbers. The managers in the user community of course want to show their upper management how big a contribution their business operation is making to the company's bottom line. Plus, upper management likes to look at reports that have dollar signs in front of the numbers. So guess which kind of reports they are most likely to specify first, and assign the highest priority to, when designing the requirements for information technology?

Control Reports provide information on how the transaction records of an application are being processed; i.e., how the business rules are being applied. The user community managers are the subject matter experts at knowing what the business rules are and interpreting the business rules for the managers of information technology implementation. If they want their business operation to run reliably, they should be requesting reports that tell them how the business rules are being applied. They should also be eager to get them and review them. Are they? Not in my experience. Why not? Because, in the preponderance of cases, user community managers perceive that the ultimate responsibility for the overall reliability of the data processing operation lies in the hands of the managers of IT.


Control Reports and Control Tables

If Control Reports provide information about how the business rules are being applied, then where are the business rules stored? In the control tables. Well, not entirely. The business rules are implemented in the algorithms of computer software that model the activity they automate. The control tables store the values that are part of the criteria that determine how transaction records are processed. What happens when the values stored in the Control Tables are flawed? The processing of the transaction records is flawed. In a business operation that processes thousands, hundreds of thousands, or possibly even millions of transaction records per day, an error in control table setup is likely to be very obscure. To keep from learning of these flaws after customers do, somebody who has a personal interest in the successful operation of their part of the business must monitor these control reports daily in order to discover mistakes and get them corrected as soon as possible.
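As a rough illustration of the relationship between a control table and a control report, consider the Python sketch below. Everything in it is hypothetical and invented for illustration: the discount rule, the names `CONTROL_TABLE`, `process`, and `control_report`, and the threshold values. The point is only that the report counts how often each branch of a business rule was taken, with no dollar signs in front of the numbers.

```python
from collections import Counter

# Hypothetical control table: values that steer processing, kept outside the code.
CONTROL_TABLE = {
    "discount_threshold": 1000.00,  # orders at or above this amount get a discount
    "discount_rate": 0.05,
}


def process(transactions, control=CONTROL_TABLE):
    """Apply the (assumed) discount rule and tally how often each branch ran."""
    tally = Counter()
    results = []
    for amount in transactions:
        if amount >= control["discount_threshold"]:
            results.append(round(amount * (1 - control["discount_rate"]), 2))
            tally["discount applied"] += 1
        else:
            results.append(amount)
            tally["no discount"] += 1
    tally["records read"] = len(transactions)
    return results, tally


def control_report(tally):
    """Format the tally as plain report lines (counts, not dollar amounts)."""
    return [f"{name:<20}{count:>8}" for name, count in sorted(tally.items())]
```

If someone mistypes the threshold as 10.00 instead of 1000.00, no program fails, but the "discount applied" count on the daily report jumps to nearly every record read. That jump is the kind of signal a person who knows the business rule would notice.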

Mistakes in Control Table setup are not the only source of flaws in the processing of transaction records. In a business operation large enough to be called an "enterprise", there are many application programs executing, possibly on different hardware platforms, at the same time or in a defined structured sequence. Enterprise application programs share data via flat files. This has been a time-honored tradition since the days before magnetic tape, when punched tape was used. It is not going to change or go away any time soon. The software interfaces that these flat files pass through were created based on certain assumptions. What happens when somebody on the sending end of a flat file does something that conflicts with these assumptions and doesn't tell anybody at the receiving end? Maybe the change they made at the sending end was a mistake that they didn't know they made. If the mistake they made doesn't cause an abnormal termination of the application program on the receiving end, and there is a good possibility that it will not, how is the mistake going to be discovered in time to keep it from having a large negative impact on the profitability of the business? The answer is in the Control Reports.
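One way the receiving end might feed such a Control Report is to count the records that violate its assumptions instead of silently processing them. The Python sketch below assumes a made-up layout (three pipe-delimited fields, the third a numeric amount); the function name `check_flat_file` and the layout are illustrative assumptions, not a description of any particular interface.

```python
def check_flat_file(lines, expected_fields=3, delimiter="|"):
    """Count records that match or violate the receiver's layout assumptions."""
    counts = {"accepted": 0, "wrong field count": 0, "non-numeric amount": 0}
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != expected_fields:
            counts["wrong field count"] += 1
            continue
        try:
            float(fields[2])  # assumption: the third field is a numeric amount
        except ValueError:
            counts["non-numeric amount"] += 1
            continue
        counts["accepted"] += 1
    return counts
```

A day on which "wrong field count" moves from zero to several thousand tells the receiving side that the sender changed something, even though the receiving program never abended.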

By monitoring the fluctuations in the numbers on the Control Reports on a daily basis and knowing the cause of any fluctuations that are outside the range of normal, it is possible to discover problems before the impact is dramatic. But who should be responsible for this constant monitoring? What makes the best business sense? That is a management decision. The statistical data in these Control Reports is not sophisticated. General business problems are not rocket science that requires a degree in astrophysics to solve. They generally don't even require a degree in statistics or computer science. Monitoring just requires someone who has a personal and professional interest in their part of the business and who has detailed knowledge of the business rules. It is also important that it be somebody who is aware of changes that affect the part of the business involved. If the managers of the user community that uses IT to run the business don't have an overall awareness of what is going on in the business that affects them, who does?
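To show how little statistics this takes, here is a minimal Python sketch of one possible "outside the range of normal" check. The rule of flagging a count more than three standard deviations from the recent mean is an assumption chosen for illustration, as is the function name `out_of_range`; a business might reasonably pick a different rule.

```python
from statistics import mean, stdev


def out_of_range(history, today, tolerance=3.0):
    """Return True when today's count lies more than `tolerance` standard
    deviations from the mean of the recent daily counts."""
    if len(history) < 2:
        raise ValueError("need at least two prior days to estimate spread")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every prior day was identical, so any difference is unusual.
        return today != center
    return abs(today - center) > tolerance * spread
```

With recent daily counts of 10120, 9980, 10050, 10010, and 9940, a count of 10100 passes quietly while a count of 7400 is flagged. The check only says that something changed; knowing why still takes a person who understands the business rules.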


Audit Trails

Audit Trails have the second greatest value measured in terms of ongoing beneficial effect on reliability. Audit Trails shouldn't require a lot of explanation. Many applications have them and many more of them should. The primary transaction records of an application should have an event log record of the date and time of a record's addition to the application database and all modifications made to it. Audit Trails are a form of event log. The Audit Trail record should also record the user ID of the user making the addition or change and the name of the program and the function within it that was used to make the change. In the case of batch programs, the operating system software will probably show the user ID of the person who submitted the batch program for processing as the owner of the process making the changes to the database. That will make that person's user ID show up in the Audit Trail a lot, but that is OK. Primary Control Table records should also have audit trail information.
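The fields named above can be sketched as a small record type. This is a minimal Python illustration, not a prescribed schema: the names `AuditEntry` and `record_audit`, and the use of an in-memory list in place of a database table, are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    record_key: str      # which transaction or control table record was touched
    action: str          # "ADD" or "CHANGE"
    user_id: str         # who made the addition or change
    program: str         # name of the program used
    function: str        # function within the program that made the change
    timestamp: datetime  # date and time of the addition or change


def record_audit(trail, record_key, action, user_id, program, function, now=None):
    """Append one audit entry to `trail` (a list standing in for a table)."""
    entry = AuditEntry(record_key, action, user_id, program, function,
                       now or datetime.now(timezone.utc))
    trail.append(entry)
    return entry
```

For a batch run, `user_id` would simply carry the ID of whoever submitted the job, as described above.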


Internal Source Code Documentation

Internal source code documentation is third in beneficial effect in my opinion. I am not the only one who believes this but it should be a lot more common than it is. The managers of the financial investment company Edward Jones, formerly Edward D. Jones, share my opinion of the value of internal source code documentation. I only worked there for about 7 weeks in November and December of 1991. It remains, in my experience, the best shop for software coding standards that are enforced by an objective and predictable business process that works. The coding standards at Edward Jones are very comprehensive; perhaps too much so. But it is better to err on the side of caution when considering the cost of software bugs. Besides, they are keeping track of people's money.

The part of Edward Jones coding standards that appeals to me most is the requirement that all COBOL paragraphs in the Procedure Division must have a comment box above them that provides information to the maintenance programmer. I have forgotten exactly what information they required but I believe that comment box, which should appear in all programs no matter what the language, should provide the answer to the following 3 questions:

      1. What does it do?
      2. Why does it do it?
      3. How does it work?

The answer to the last question can be optional depending on the complexity of the function or procedure (logical equivalent of a paragraph in COBOL).
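For a concrete picture, here is what such a comment box might look like above a small Python function. The late-fee rule, the function name `late_fee`, and its default values are hypothetical, invented only to give the three questions something to answer; this is not a reconstruction of the Edward Jones standard.

```python
# ------------------------------------------------------------------
# WHAT: Returns the late fee, in cents, owed on an overdue invoice.
# WHY:  Hypothetical business rule: customers get a grace period
#       before any fee is charged, so mail delays are not penalized.
# HOW:  Days beyond the grace period are multiplied by a flat daily
#       fee; the number of billable days is never negative.
# ------------------------------------------------------------------
def late_fee(days_overdue, grace_days=10, fee_per_day_cents=50):
    billable_days = max(0, days_overdue - grace_days)
    return billable_days * fee_per_day_cents
```

The WHY line is the one a maintenance programmer cannot recover from the code alone, which is why it is worth the minute it takes to write.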

How does this improve the reliability of the computer software component of IT? Companies are in business to stay, I hope. They're not here today, gone tomorrow. The business of each company is as unique as a fingerprint. If they are not unique, how are they going to convince customers to buy their products and not those of some other company? The one thing that is constant in any business is change. That is not just a play on words. Businesses are compelled to change for a lot of reasons: fad or fashion; change in technology; legislative, bureaucratic, or judicial mandate; weather; international politics. Just to name a few. Computer software models the activity it automates; therefore as businesses change so must the software that automates business processes. It's not good enough that computer software meet all the technical requirements specified when it was first created. It must lend itself well to being changed without the introduction of bugs. It is a fact of IT life that good IT talent moves around a lot. I'm not a social scientist. I don't have an explanation for it. That is just the way it is.

Three years after an application is put into production it would be a safe bet that none of the original programmers that coded an application will even be working for the company that paid for its creation. Many maintenance changes will have been made in those three years. Some by the original programmers, but more by programmers who were brought on board as the need arose. In many cases those programmers will be consultants; not regular employees of the company that owns the software. Every project that enhances the functionality of software or just changes the way it works has a sense of urgency. I have already enumerated some of the sources of inspiration for that. So therefore each programmer that makes changes to application software is going to feel the pressure to produce results fast. There isn't time to study and analyze semidocumented software to be sure how it really works. Just code it and get it working. Being the first to market has a lot of incentive. Just ask Bill Gates.

So what programmers do under pressure is look for some strategic point in the software program to key off of; then go off and write their own little mini program within a program. As changes are made to the software, many little mini software programs pile up within the original now very, very, . . ., very big program. This makes for software programs that are fat, bloated, and slow. Just ask some operations manager who has a batch process that normally runs all night. Its margin of time to be completed before the batch window closes and the online business day begins keeps getting smaller and smaller until the weeknight isn't long enough to finish in time. So the job has to be rescheduled for the weekend. And there is more than one program for which that is getting to be a problem. Heaven forbid that one of those programs should go into an infinite loop, running in a tight little never-ending circle so that the job never finishes no matter how long the batch window stays open. Or perhaps, after hours of running, the program bombs off the system with one of those ever popular error messages like: "Division by zero not allowed." or the classic IBM S0C7 "Attempt to perform arithmetic operation on non-numeric data." IT DOESN'T HAVE TO BE THAT WAY!

How long are managers of IT going to keep mortgaging the future? In the long run we may all be dead, but I still plan to be around for a long time. You can have your cake and eat it too if you do it right. So here is how to do it right when it comes to coding reliable software programs. Consider this. It takes a lot less time to answer those three little questions that I said should be answered in a comment box at the beginning of every paragraph, function, or procedure than it does to study and analyze those units of source code. If managers of the software development process would require all their programmers, employees and consultants alike, to observe minimum coding standards that include the answers to these three little questions, it would not be necessary to find strategic points in software programs then branch off to mini programs within big big programs. Of course those answers must be more than trivial tokens to meet the minimum requirement. The cumulative effect of not requiring and enforcing this minimum coding standard is that as time passes the software program becomes more and more obfuscated; more complicated, expensive, and risky to maintain.

Consider this also. Programmers write their source code programs in formal languages like COBOL, PL/I, PowerBuilder, Visual Basic, C, C++, SQL, etc. If they can code in formal languages well enough to make a program work to specifications, then they should be able to articulate in natural language a narrative that answers the three questions that I believe ought to be answered. And do it in a reasonable time frame that has a favorable cost-benefit ratio. In the history of the cosmos there will never be an intelligent thinking creature that knows better what inspired an individual programmer to declare a paragraph, a function, or a procedure name and what they intended for it to do. So who better could there possibly be than the original programmer to answer these questions?


Sorry for the 2 page diatribe.

The above may be an interesting explanation of what makes many software programs slow and inefficient, but it only partially explains the effect of Internal Source Code Documentation on software reliability. Some slow, inefficient programs are very reliable. Every time a programmer adds one of those mini programs inside of an already big program it has to work with other mini programs that are already there. That means that they have to have a way of communicating with each other. The easiest and therefore the most common way (nobody has time to do it the right way) is to pass parameters through variables that have been declared as global. There evolves within the big program a protocol for passing parameters through global variables that is unique to each big program and defined ad hoc by each programmer who previously made changes to the big program. Since this protocol was defined ad hoc by many different people who are not talking to each other, it is chaotic, idiosyncratic, and difficult to understand. It also requires a lot of time to figure out. I already stated the cause and effect of time pressure.

So what happens is that the programmer adding this one more mini program to a big program figures out the protocol as best he or she can. Their understanding of this protocol is usually complete enough to get positive test results on all the test cases that they can think of. The problem is that, given the number of combinations and permutations of parameters, their testing is never exhaustive, so when the program is placed into production it will from time to time encounter a combination of parameters that was not taken into account when the new mini program was coded. This is the cause of the program abends with error messages like "Division by zero not allowed" or IBM's S0C7. This usually occurs between the hours of 2 AM and 3 AM. There is more to it than that. It's not just the combinations and permutations of parameters that cause the abends. It is also an incomplete understanding of the protocol, so that one mini program clobbers the parameters being passed by another mini program.
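The clobbering problem can be reproduced in a few lines. In the Python sketch below, a shared dictionary stands in for global working storage; the names `shared`, `load_batch`, `purge_rejects`, and the two averaging functions are all hypothetical. Two "mini programs" use the same slot for different purposes, and a third routine that trusts the slot fails with a division by zero.

```python
# Stand-in for global working storage shared by every routine in a big program.
shared = {"count": 0}


def load_batch(records):
    # Mini program A: leaves the record count in the shared slot for later use.
    shared["count"] = len(records)


def purge_rejects(records):
    # Mini program B, added later: reuses the same slot as its own tally of
    # negative (rejected) records, unaware that A's value is still needed.
    shared["count"] = 0
    for value in records:
        if value < 0:
            shared["count"] += 1


def average_via_global(total):
    # Assumes the shared slot still holds the count that A stored.
    return total / shared["count"]


def average_explicit(total, record_count):
    # Same calculation with the parameter passed explicitly and checked.
    if record_count == 0:
        raise ValueError("no records to average")
    return total / record_count
```

Calling `load_batch`, then `average_via_global`, works. Insert a call to `purge_rejects` between them and the same code raises `ZeroDivisionError`, even though neither routine changed. The explicit version cannot be clobbered this way because its inputs are visible at the call site.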

Internal Source Code Documentation can reduce the frequency of occurrence of these kinds of failures by making it easier to understand that protocol. It also can reduce the number of mini programs within a big program by making it easier for maintenance programmers to discover the logic of existing mini programs within a big program. They can therefore add the functionality requested by the user without adding more chaos and idiosyncrasies.


Event Logs and Error Logs

In today's business environment dumb terminals are pretty much anachronistic. The PC now owns the desktop. When I came to St. Louis in January 1988, the largest disk device in common use on an IBM 3090 mainframe held less than 2 gigabytes and was the size of a clothes washing machine. A few months ago (it's now December 1999) I added a supplemental hard disk to an older PC of mine. It had more than 6 gigabytes of storage, cost $125, and fit in the pocket of my jacket. Higher end PCs are now commonly sold with hard disks that store more than 20 gigabytes. That much storage space is difficult to comprehend even for a seasoned IT professional. The hard disks on the desktops of most users are grossly underutilized. So there is no excuse for not including event logging and error logging to the user's hard disk in client software written to execute on the user's desktop PC.

The software can be written to create the log file if it doesn't exist or append to it if it does exist. When the PC locks up because the hard disk is full in two or three years, a PC technician can delete the log file and the whole process can begin anew. I only thought I was finished discussing reliability when coding software. I have had to fix many an IF statement or SELECT CASE statement that did not have an ELSE clause. These programming statements must have been written by programmers who thought one of the Boolean expressions was always going to test true, so there was no need for an ELSE clause. What happens when a subsequent maintenance programmer changes something in another part of the program not knowing about the missing ELSE clause? I suspect that the original programmer didn't know what to do if the ELSE clause were executed, and since all was well when the software was first placed into production, what was the incentive to figure out an intelligent way of handling the error condition? He or she could have logged it to the user's hard drive after displaying an appropriate error message to the user and gone on. When the user reported the error message as a sign that something was wrong, a routine inspection of the error log would lead to a solution.
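Both ideas fit in a short Python sketch. The file name `app_error.log`, the functions `log_error` and `shipping_cost`, and the shipping methods are hypothetical examples. Opening a file in append mode creates it if it does not exist, and the final `else` branch records the case the programmer believed could never happen.

```python
from datetime import datetime


def log_error(log_path, message):
    """Append one line to the log, creating the file if it does not exist."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{datetime.now().isoformat()} ERROR {message}\n")


def shipping_cost(method, log_path="app_error.log"):
    if method == "ground":
        return 5.00
    elif method == "air":
        return 15.00
    else:
        # The branch the original programmer assumed could never be reached.
        log_error(log_path, f"shipping_cost: unexpected method {method!r}")
        # The exception carries a message the caller can display to the user.
        raise ValueError(f"Unknown shipping method: {method}")
```

When a later change elsewhere starts passing an unexpected method, the user sees a message instead of a silent wrong answer, and the log line names the routine and the offending value.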


BIG Side Benefit

Reliability functionality in IT has a big side benefit to the software maintenance and development process. Testing software is a lot like performing a scientific laboratory experiment. It is a stimulus and response process with instrumentation that tells the experimenter what all the responses were. The more responses that can be measured simultaneously, the more the experimenter knows about the results of the test. Testing software maintenance changes usually involves looking at all the normal processing results that an end user would see and verifying that the software had the intended effect. Analyzing all the data that was not supposed to be affected by the software changes can be an exhausting process and usually is. So what happens is that the analysis of results for unintended effects is either not performed or done superficially. The custom is to wait and see if anybody complains about something, then go look at it. This is how unintended results are usually discovered.

It cannot be guaranteed that analyzing the event logs and error logs as part of the testing process will reveal any unintended results but, once this kind of functionality is in place, it is easy to examine. Control reports are also easy to examine. Since control reports summarize processing effects, they give a high level view of overall results. Audit trail information also provides additional measurement of the processing result. Cumulatively, reliability functionality components give the programmer analyst an easy way to verify that no unintended results occurred. They provide another opportunity to discover and fix bad things before users and customers are affected by them.
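As one possible way to fold the error log into a test run, the Python sketch below notes the size of the log before the test and then reads only what was appended afterward. The helper names `log_size` and `new_error_lines`, and the assumption that error lines contain the text " ERROR ", are illustrative choices, not a standard.

```python
import os


def log_size(path):
    """Size of the log in bytes, or 0 if it has not been created yet."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def new_error_lines(path, start_offset):
    """Return ERROR lines appended to the log after `start_offset` bytes."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as log:
        log.seek(start_offset)
        tail = log.read().decode("utf-8", errors="replace")
    return [line for line in tail.splitlines() if " ERROR " in line]
```

A tester would call `log_size` before exercising the change and `new_error_lines` afterward. An empty result does not prove that nothing went wrong, but a non-empty one points straight at an unintended effect before any user has to complain.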