Fri Jan 20 11:55:30 PST 2006

Disclaimer. CSUSB and the CS Dept have no responsibility for the content of this page.

Copyright. Richard J. Botting ( Fri Jan 20 11:55:30 PST 2006 ). Permission is granted to quote and use this document as long as the source is acknowledged.


    On Estimating the Understandability of Software Documents

      Richard J. Botting, Computer Science Dept., California State University, San Bernardino,

      5500 State University Parkway, San Bernardino, CA 92407

      Voice Mail: USA-(909)-880-5327


      This paper points out well known tools and techniques that can measure the readability, content, and relevance of many documents used in developing software.


      Kari Laitinen (ACM SIGSOFT SEN Vol 21 no 4 (July 1996) pages 81-92) argues that the number of different words in a document is a good measure of the document's complexity and hence of its readability. Laitinen does not discuss some easily available tools and techniques that provide similar or better metrics. This note describes them.

      Laitinen also does not mention that understanding is not a property of a document by itself. It is a relationship between a document and a reader. What is understandable to a programmer may be incomprehensible to users. Different readers also have different educational backgrounds. This note describes a way to allow for this variation among readers.

      Ambiguous, incomplete, or inconsistent documents can cause problems in a project. Unless care is taken, our documentation is ambiguous, incomplete, and inconsistent. Here I will point out existing tools and techniques that help reduce these problems. These tools also improve understandability.


      Consider a typical document D. It could be anything that can be read and written: a memorandum, an SRS, a module specification, a piece of code, a manual page, a help screen, and so on. For the moment, ignore the structure of D. So D looks like a string of characters.

      It is easy to analyze this into a series of lexemes: words, numbers, punctuation, strings, etc. Remove everything but the words. Convert all the words to the same case. As in Laitinen, identifiers that contain words should be broken up. So "largest_element_in_matrix_5" becomes ("largest", "element", "in", "matrix"), "largestMatrixElement" becomes ("largest", "matrix", "element"), and (in COBOL) "CUSTOMER-INVOICE-PAYMENT" becomes ("customer", "invoice", "payment"). In some UNIX systems this is all contained in a program called 'prep'. It is easy to write a script with the same effect on any UNIX system.

      Now sort the list of words alphabetically, remove duplicates, and save in a file. Call the result the vocabulary file of the document D. Let the vocabulary of D V(D) be the set of words in the vocabulary file generated from document D.
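      Such a script can be sketched with standard tools. This is only a sketch: the filename and the camel-case splitting rule are illustrative assumptions, not part of any particular 'prep'.

```shell
# Sketch of a vocabulary-extraction pipeline:
# 1. insert a space at each lower-to-upper case boundary (camelCase),
# 2. turn every run of non-letters (digits, "_", "-", ...) into a newline,
# 3. fold everything to lower case,
# 4. sort and remove duplicates.
sed 's/\([a-z]\)\([A-Z]\)/\1 \2/g' document |
tr -cs 'A-Za-z' '\n' |
tr 'A-Z' 'a-z' |
sort -u > document.vocabulary
```

      The result is one word per line, sorted, with duplicates removed: the vocabulary file of the document.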

      Laitinen claims that the larger the vocabulary V(D) is, the harder the document is to understand. I claim that we must remove from V(D) all the words that are in the reader's natural language and use the size of the remaining set as a measure of the readability of the document.

      There is a program in UNIX that performs the whole process automatically. It is called "spell". It is used like this:

            $ spell document
      Jon Bentley discusses the design of spell checking programs [Bentley 86, J Bentley, Programming Pearls, Addison Wesley Reading MA 1986]. All software documentation should be checked for spelling errors!

      In any project there will be a set of words that are meaningful in that project and yet are not in the normal language. Therefore each project should maintain a set of vocabularies of words that are permitted besides those in the natural language of the people doing the project. It is easy to uncover unnatural and undocumented terms on a UNIX system:

            $ spell document | fgrep -v -f user.vocabulary
      If the vocabulary file has been sorted then
            $ spell document | comm   -23   -   designer.vocabulary
      will be faster. Notice that each class of readers may have a different "natural" language and vocabulary.

      Perhaps these vocabularies should be more than lists of words. Perhaps they should contain definitions of the words. Perhaps we also need glossaries that list terms and give examples and glosses on the use of the word. After all some methods already have data-dictionaries. Perhaps each project needs an official dictionary defining all the terms in its vocabulary. These dictionaries and glossaries can be written in a form of Extended BNF (XBNF):

            term :: context = definition.
      Samples can be found on the Internet [ ] or in the *.mth files that can be FTP'd from [ ].

      A vocabulary is a one-dimensional projection of the content of its document. It ignores many details. The client's vocabulary has been the starting point for conceptual data-base design and for some early object oriented methods. The vocabulary of a specification or of a high level algorithm forms the basis of some object oriented methods and some structured analysis and design methods. Perhaps the size of the vocabulary of the requirements could be used to help predict the size and cost of a software development project.

      When the content of a document changes the vocabulary is likely to change. Call this change drift. Measure it by the size of the symmetric difference between the vocabularies of the two documents. Given the two vocabularies in files voc1 and voc2, then the UNIX command

            comm -3 voc1 voc2 | wc -l
      counts the words that are in one vocabulary and not in the other, and vice versa. This is a measure of how much has been added to and/or subtracted from the original content. It imposes a metric on the space of documents.
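      As a toy check (the file names and contents are illustrative), comm -3 prints the lines unique to either sorted file, so the count below is the drift between two three-word vocabularies:

```shell
# Two sorted vocabularies that each contain one word the other lacks.
printf 'alpha\nbeta\ngamma\n' > voc1
printf 'beta\ndelta\ngamma\n' > voc2
comm -3 voc1 voc2 | wc -l    # "alpha" and "delta" differ: drift of 2
```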

      Measuring drift may be useful in a software development project. First, as the client/user's needs change so will their vocabulary. A large change in the vocabulary of a set of requirements probably signals the need for a large change in the software. Thus large vocabulary drifts in a set of requirements may be associated with higher maintenance costs.

      Second, if the requirements vocabulary is also in the design, code, and user documentation then the result is better protected against catastrophic changes caused by small changes in requirements. In other words, as a project progresses each later document may have a slightly larger vocabulary, but the vocabulary should not shrink!

      Third, large increases in vocabulary need to be accounted for in a project. A typical change of this kind comes when a programmer introduces data structures and algorithms that are not specified or required explicitly. For example a user might state:

            I want the words that are not in the dictionary.
      but the programmer might add:
            The dictionary will be held in a hash-table.
      This is a design decision. It might have been "B-Tree", "random access file", or "sorted serial file" [Bentley 86, J Bentley, Programming Pearls, Addison Wesley Reading MA 1986]. Ideally the relationship between ends and means should be documented - either via Quality Function Deployment techniques, scenario/use-case analysis, functional decomposition, or by some form of traceability.

      Parnas documented the importance of using design decisions to determine module boundaries. It seems likely that significant drift in the vocabulary of a project shows the need for modularization. Perhaps, the new words need to be in a separate document with a HyperText link from the basic document. Perhaps we can combine information hiding with HyperText!

      Higher Order Vocabulary

      It is possible to think of vocabularies as a first order approximation to the content of a document. The second order approximation consists of all pairs of adjacent words in the document. The n'th order approximation consists of all n-tuples of adjacent words (perhaps with punctuation). The complete content, as a string, is the highest order vocabulary: a singleton set.
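      The second order approximation can also be sketched with standard tools, by pairing the one-word-per-line stream with itself shifted by one word (the filenames are illustrative):

```shell
# Second-order vocabulary: the set of adjacent word pairs.
tr -cs 'A-Za-z' '\n' < document | tr 'A-Z' 'a-z' > words
tail -n +2 words > shifted       # the same stream, shifted by one word
paste -d ' ' words shifted | sed '$d' | sort -u
```

      The sed '$d' discards the final line, which pairs the last word with nothing.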


      The vocabulary is a rough measure of the complexity of a document. There are finer measures. One simple one is a concordance of the words in the document. This is a list of the words plus their frequency of use. Again it is easy to generate concordances of documents. A concordance acts as a kind of fingerprint that can be used to decide who wrote a document. Perhaps we can use this as an egotism detector.
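      A concordance in this sense can be generated with the same tokenization as before (a sketch; the filenames are illustrative):

```shell
# Word-frequency list, most frequent words first.
tr -cs 'A-Za-z' '\n' < document |
tr 'A-Z' 'a-z' |
sort | uniq -c | sort -rn > document.concordance
```

      Each output line has the form "count word", the format produced by uniq -c.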

      Each document D has a function that maps each word w into its frequency F[D](w). F[D] is the concordance of D. There are several simple measures of the distance between two documents based on comparing concordances. In one metric the distance between two documents D1 and D2 is estimated by the sum over words w in the union of V(D1) and V(D2) of the absolute value of the difference between F[D1](w) and F[D2](w):

    1. d(D1,D2)::= Sum[w:V(D1)|V(D2)]abs(F[D1](w)-F[D2](w)).

      Another metric is the square root of the sum of the squared differences.
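      Given two concordances in the "count word" format produced by uniq -c, the first metric (formula 1) can be computed with a short awk script. This is a sketch; the filenames are illustrative:

```shell
# Metric 1: the sum over all words w of |F[D1](w) - F[D2](w)|.
# Counts from the first file are added, counts from the second subtracted,
# so each f[w] ends up holding the signed difference.
awk '{ if (FILENAME == ARGV[1]) f[$2] += $1; else f[$2] -= $1 }
     END { for (w in f) d += (f[w] < 0 ? -f[w] : f[w]); print d }' \
    doc1.concordance doc2.concordance
```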

      It is also possible to use information theory to map concordances onto a different numeric measure of the complexity of the document. The entropy or variety of the document can be estimated because the concordance is a distribution function and so

    2. Variety(D)::= Sum [w:V(D)] (- p[w] * log[2](p[w]) )

      where for all w:V(D), p[w]::= F[D](w) / Sum [w:V(D)](F[D](w)) .
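      Formula 2 can likewise be estimated from a concordance file (a sketch, assuming the uniq -c "count word" format above):

```shell
# Variety(D): Shannon entropy of the word-frequency distribution, in bits.
# n is the total word count; p is the relative frequency of each word.
awk '{ f[$2] = $1; n += $1 }
     END { for (w in f) { p = f[w] / n; H -= p * log(p) / log(2) }
           print H }' document.concordance
```

      For instance, a document whose words are all distinct and equally frequent has the maximum possible variety for its vocabulary size.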

      It would be interesting to gather data on how the above measures correlate with other properties of projects.


      The size of a vocabulary is one measure of the unreadability of a document. There is a well-known and finer metric for English documents. In English, the readability of a document is decreased by the use of long words and long sentences. This has been codified into a metric called the Fog Index. There exist proprietary packages such as Grammatik that will measure this and related values and interpret them for authors. Such a tool is effective for improving a written document's understandability.

      For example, according to my copy of Grammatik, this document has a Fog Index of 14. Perhaps all software documentation should be checked for style.

      It is again easy to process a vocabulary file to select the longest words that are not in the project's technical dictionary. The following UNIX pipeline filters out the 20 longest words for example:

        awk '{print length($0) "\t" $0}' non-tech.vocabulary | sort -nr | head -20
      These words are likely to be a problem for other readers and perhaps they should be eliminated.


      There have been several proposals for making documentation less ambiguous. One idea is to use mathematics to clarify it. This idea is unpopular with those who cannot do it or do not want to risk the time learning to do it. However there are projects that have used mathematical notation successfully.

      However, such formal language is not enough! A common error made by novice programmers is to assume that the comments in a program do not matter. This re-appears in the idea that only the mathematics in documentation is important. Practical projects and classroom experiments suggest otherwise. Formal documentation is not, of itself, understandable. It must be embedded in and connected with natural language examples and glosses.

      Perhaps better documentation will be a network of mathematical formulae and natural language. The formulae can disambiguate the informal parts. Formulae also allow a great deal to be said with very few symbols - an observation made by Charles Babbage. Formulae can also be manipulated safely to solve problems etc. - algebra. The informal parts are essential to make the formulae understandable to the reader.

      However the tools and methods are not matched to their marketplace. The commonest markup language for mathematicians (LaTeX) is not a formal language. It defines the look and feel of a document, not its meaning. TeX is also more complex than is needed for software engineering and its clients. It adds a large vocabulary of words to documents that have nothing to do with the problem being solved or the methods used to solve it.

      Most software engineering documentation can be formalized with discrete mathematics. So TeX or LaTeX may not be appropriate for software engineering. I and other research teams have been developing well-defined languages for software engineering documentation. My own notation is experimental. It is designed to be readable in ASCII but easy to translate into the HyperText Markup Language (HTML) or TeX. See [ ] for details.

      Syntax and Grammar

      Software engineers are used to having the syntax of source code checked by a program. Most word processing systems provide tools that look for common errors and ambiguities in prose. They also spot problematic English such as overuse of the passive voice. There is no reason why all documents should not have their grammar checked as well.

      Mathematical formulae can have their syntax checked. They can also be checked for non-context-free dependencies - type checking. The formal validity of an argument can be tested by algorithms described in texts on formal logic.

      It may even be possible to develop combined syntax and grammar checkers that scan documents written in a mixture of formal and natural languages. Perhaps checking the grammar of documents in a software project could help the developers do a better job by ensuring that the documents are more readable and less ambiguous.

      Other Techniques

      This short note has no space to discuss the pros and cons of using graphics and tables in computer documentation. I believe that we can and should develop documents that have parts that can be viewed in many different ways. SGML points the way to documents with an underlying logical structure with multiple renderings: text, formula, graphic, and tabular.

      Neither is there space to discuss the work on the best structure for documentation. It is clear that the use of contents lists and indexes makes a document easier to understand. Some research suggests that a book-like format works well. It is again easy to develop or find tools that aid structuring and also generate lists and indexes automatically. We could prepare a concordance of each document and make sure that all the words that are used infrequently in the document are indexed and defined.

      Last and Best

      I have left the most accurate way of assessing the understandability of documents until last. It has been used for many years on many successful software projects. Its inventor was even given a prize by his company - IBM. To assess the qualities of a document get a collection of people to read it and report on it. Make sure they represent those who will be working with the document and give them the right to send it back to the author for rework if the document is not good enough. In a word: Inspections.


      There exist many tools and techniques that can help software engineers prepare documents that are less ambiguous, more relevant, and more understandable. They are not the mythical silver bullets of a method or program. They are merely antiseptics and antibiotics that destroy the germs that infect software - they may therefore be magic bullets.
