Disclaimer. CSUSB and the CS Dept have no responsibility for the content of this page.
Copyright. Richard J. Botting ( Fri Jan 20 11:55:30 PST 2006 ). Permission is granted to quote and use this document as long as the source is acknowledged.
5500, State University Parkway, San Bernardino, CA 92407
rbotting@wiley.csusb.edu dick@csci.csusb.edu http://www.csci.csusb.edu/dick/home.html
Voice Mail: USA-(909)-880-5327
Abstract
This paper points out well-known tools and techniques that
can measure the
readability, content, and relevance of many documents used in developing software.
Introduction
Kari Laitinen (ACM SIGSOFT SEN Vol 21 no 4 (July 1996) pages 81-92) argues that the
number of different words in a document is a good measure of the document's complexity
and hence of its readability. Laitinen does not discuss some easily available tools and techniques that
provide similar or better metrics. This note describes them.
Laitinen does not mention that understanding is not a property of the document by itself. It is a relationship between a document and a reader. What is understandable to a programmer may be incomprehensible to users. Different readers have different educations as well. This note describes a way to allow for the variation in the readers.
Ambiguous, incomplete, or inconsistent documents can cause problems in a project. Unless care is taken, our documentation is ambiguous, incomplete, and inconsistent. Here I will point out existing tools and techniques that help reduce these problems. These tools also improve understandability.
Vocabulary
Consider a typical document D. It could be anything that can be read and written: a memorandum,
an SRS, a module specification, a piece of code, a manual page, a help screen, and so on. For the
moment, ignore the structure of D. So D looks like a string of characters.
It is easy to analyze this into a series of lexemes: words, numbers, punctuation, strings, etc. Remove everything but the words. Convert all the words to the same case. As in Laitinen, identifiers that contain words should be broken up. So "largest_element_in_matrix_5" becomes ("largest", "element", "in", "matrix"), "largestMatrixElement" becomes ("largest", "matrix", "element"), and (in COBOL) "CUSTOMER-INVOICE-PAYMENT" becomes ("customer", "invoice", "payment"). In some UNIX systems this is all contained in a program called 'prep'. It is easy to write a script with the same effect on any UNIX system.
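The word-extraction step can be sketched in a few lines of Python. This is only an illustration, not the 'prep' program: the function name `words` and the rule used to split mixed-case identifiers are assumptions made for the example.

```python
import re

def words(text):
    """Return the lower-cased words of text, breaking identifiers into parts."""
    # Put a space at each lower-to-upper boundary so that
    # "largestMatrixElement" separates into three words.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Runs of letters are words; digits, underscores, hyphens and
    # punctuation all act as separators and are dropped.
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]

print(words("largest_element_in_matrix_5"))   # ['largest', 'element', 'in', 'matrix']
print(words("largestMatrixElement"))          # ['largest', 'matrix', 'element']
print(words("CUSTOMER-INVOICE-PAYMENT"))      # ['customer', 'invoice', 'payment']
```

This sketch treats an apostrophe as a separator, so "document's" yields "document" and "s". A real tool would need a rule for such cases.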
Now sort the list of words alphabetically, remove duplicates, and save the result in a file. Call this the vocabulary file of the document D. Let the vocabulary of D, written V(D), be the set of words in the vocabulary file generated from document D.
Laitinen claims that the larger the vocabulary V(D) is, the harder the document is to understand. I claim that we must remove from V(D) all the words that are in the reader's natural language and use the size of the remaining set as a measure of how hard the document is for that reader to read.
There is a program in UNIX that performs the whole process automatically. It is called "spell". It is used like this:
    $ spell document
Jon Bentley discusses the design of spell checking programs [Bentley 86, J Bentley, Programming Pearls, Addison Wesley Reading MA 1986]. All software documentation should be checked for spelling errors!
In any project there will be a set of words that are meaningful in that project and yet are not in the normal language. Therefore each project should maintain a set of vocabularies of words that are permitted besides those in the natural language of the people doing the project. It is easy to uncover unnatural and undocumented terms on a UNIX system:
    $ spell document | fgrep -vx -f user.vocabulary
If the vocabulary file has been sorted then
    $ spell document | comm -23 - designer.vocabulary
will be faster. Notice that each class of readers may have a different "natural" language and vocabulary.
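The same filtering can be written as set subtraction. The sketch below is an illustration only: the function names, the tiny word lists, and the sample sentence are all invented for the example, and a real check would load a full dictionary and the project's vocabulary files.

```python
import re

def vocabulary(text):
    """The set of lower-cased words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def unfamiliar(document, natural_language, project_vocabulary=frozenset()):
    """Words of the document that this class of reader is not expected to know."""
    return sorted(vocabulary(document) - set(natural_language) - set(project_vocabulary))

# Hypothetical word lists for two classes of reader.
english = {"the", "dictionary", "will", "be", "held", "in", "a"}
designer_terms = {"btree"}

memo = "The dictionary will be held in a btree."
print(unfamiliar(memo, english))                  # what a user sees as jargon
print(unfamiliar(memo, english, designer_terms))  # what a designer sees as jargon
```

The size of the returned list is the measure proposed above: it depends on the reader as well as on the document.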
Perhaps these vocabularies should be more than lists of words. Perhaps they should contain definitions of the words. Perhaps we also need glossaries that list terms and give examples and glosses on the use of the word. After all some methods already have data-dictionaries. Perhaps each project needs an official dictionary defining all the terms in its vocabulary. These dictionaries and glossaries can be written in a form of Extended BNF (XBNF):
    term :: context = definition.
Samples can be found on the Internet: [ http://www.csci.csusb.edu/dick/samples/ ] or the *.mth files that can be FTP'd from [ http://ftp.csci.csusb.edu/dick/samples/ ]
A vocabulary is a one-dimensional projection of the content of its document. It ignores many details. The client's vocabulary has been the starting point for conceptual data-base design and some early object oriented methods. The vocabulary of a specification or a high level algorithm forms the basis of some object oriented methods and some structured analysis and design methods. Perhaps the size of the vocabulary of the requirements could be used to help predict the size and cost of a software development project.
When the content of a document changes the vocabulary is likely to change. Call this change drift. Measure it by the size of the symmetric difference between the vocabularies of the two documents. Given the two vocabularies in sorted files voc1 and voc2, the UNIX command
    comm -3 voc1 voc2 | wc -l
counts the words that are in one vocabulary but not in the other. This is a measure of how much has been added to and/or subtracted from the original content. It is a true distance between vocabularies, and so places documents in a metric space.
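Drift, as defined here, is the size of a symmetric difference, which Python's set type computes directly. A minimal sketch, with a hypothetical function name and made-up vocabularies:

```python
def drift(voc1, voc2):
    """Number of words that are in exactly one of the two vocabularies."""
    return len(set(voc1) ^ set(voc2))

# Hypothetical vocabularies of two versions of a requirements document.
old = {"customer", "invoice", "payment"}
new = {"customer", "invoice", "refund", "ledger"}

print(drift(old, new))   # 3: "payment" was dropped, "refund" and "ledger" were added
print(drift(old, old))   # 0: a document has no drift from itself
```

The measure is symmetric and is zero exactly when the two vocabularies are equal, which is what allows it to be used as a distance.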
Measuring drift may be useful in a software development project. First, as the client/user's needs change so will their vocabulary. Probably a large change in the vocabulary of a set of requirements signals the need for a large change in the software. Thus large vocabulary drifts in a set of requirements may be associated with higher maintenance costs.
Second, if the requirements vocabulary is also in the design, code, and user documentation then the result is better protected against catastrophic changes caused by small changes in requirements. In other words, as a project progresses each later document may have a slightly larger vocabulary, but the vocabulary should not shrink!
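One way to watch for a shrinking vocabulary is to compare each document with the one before it in the project. The sketch below assumes the vocabularies have already been extracted; the function name `lost_words`, the stage names, and the word sets are invented for illustration.

```python
def lost_words(stages):
    """Given (name, vocabulary) pairs in project order, report for each later
    stage the words of the previous stage that it no longer contains."""
    report = {}
    for (_, earlier), (name, later) in zip(stages, stages[1:]):
        report[name] = sorted(set(earlier) - set(later))
    return report

# Hypothetical vocabularies for three successive project documents.
stages = [
    ("requirements", {"customer", "invoice", "payment"}),
    ("design", {"customer", "invoice", "payment", "table"}),
    ("code", {"customer", "invoice", "table", "hash"}),
]

print(lost_words(stages))   # {'design': [], 'code': ['payment']}
```

A non-empty list is a prompt for a question rather than proof of an error: here someone should ask where "payment" went between design and code.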
Third, large increases in vocabulary need to be accounted for in a project. A typical change comes when a programmer introduces data structures and algorithms that are not specified or required explicitly. For example a user might state:
    I want the words that are not in the dictionary.
but the programmer might add:
    The dictionary will be held in a hash-table.
This is a design decision. It might have been "B-Tree", "random access file", or "sorted serial file" [Bentley 86, J Bentley, Programming Pearls, Addison Wesley Reading MA 1986]. Ideally the relationship between ends and means should be documented - either via Quality Function Deployment techniques, scenario/use-case analysis, functional decomposition, or by some form of traceability.
Parnas documented the importance of using design decisions to determine module boundaries. It seems likely that significant drift in the vocabulary of a project shows the need for modularization. Perhaps, the new words need to be in a separate document with a HyperText link from the basic document. Perhaps we can combine information hiding with HyperText!
Higher Order Vocabulary
It is possible to think of vocabularies as a first order approximation to the content of a document.
The second order approximation consists of all pairs of adjacent words in the document. The n'th
order approximation consists of all n-tuples of adjacent words (perhaps with punctuation). The complete
content, as a string, is the highest order vocabulary: a singleton set.
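The n'th order vocabulary is simply the set of distinct runs of n adjacent words. A minimal sketch, assuming the document has already been reduced to a list of words; the function name is hypothetical:

```python
def higher_order_vocabulary(words, n):
    """The set of all n-tuples of adjacent words in the list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

doc = "the cat saw the cat".split()

print(len(higher_order_vocabulary(doc, 1)))   # 3 distinct words
print(len(higher_order_vocabulary(doc, 2)))   # 3 distinct pairs ("the cat" occurs twice)
print(len(higher_order_vocabulary(doc, 5)))   # 1: the whole document, a singleton set
```

When n equals the length of the document the result has exactly one member, as the text observes.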
Concordances
The vocabulary is a rough measure of the complexity of a document. There are finer measures.
One simple one is a concordance of the words in the document. This is a list of words plus their
frequency of use. Again it is easy to generate concordances of documents. A concordance
acts as a kind of fingerprint that can be used to decide who wrote a document. Perhaps we can
use this as an egotism detector.
Each document D has a function that maps each word w into its frequency F[D](w). F[D] is the concordance of D. There are several simple measures of the distance between two documents based on comparing concordances. In one metric the distance between two documents D1 and D2 is estimated by the sum over words w in the union of V(D1) and V(D2) of the absolute value of the difference between F[D1](w) and F[D2](w):
    d1(D1, D2) = Sum [w: V(D1) union V(D2)]( abs(F[D1](w) - F[D2](w)) )
where F[D](w) is taken to be 0 for any word w that is not in V(D).
Another metric is the square root of the sum of the squares of the differences.
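Both distances are easy to compute once the concordances are held as word counts. The following is a sketch under assumptions: the function names are invented, the word splitting is deliberately crude, and the second metric is read as the ordinary Euclidean distance (the square root of the sum of squared differences).

```python
from collections import Counter
import math
import re

def concordance(text):
    """Map each lower-cased word of text to its frequency."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def absolute_distance(f1, f2):
    """Sum of absolute frequency differences over the union of the vocabularies."""
    # A Counter reports 0 for a word it has not seen.
    return sum(abs(f1[w] - f2[w]) for w in f1.keys() | f2.keys())

def euclidean_distance(f1, f2):
    """Square root of the sum of squared frequency differences."""
    return math.sqrt(sum((f1[w] - f2[w]) ** 2 for w in f1.keys() | f2.keys()))

f1 = concordance("The cat sat on the mat")
f2 = concordance("The dog sat")

print(absolute_distance(f1, f2))    # 5
print(euclidean_distance(f1, f2))   # the square root of 5
```

Raw counts make a long document look distant from a short one on the same subject. Dividing each count by the document's length first is one possible refinement, not something the note prescribes.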
It is also possible to use information theory to map concordances onto a different numeric measure of the complexity of the document. The entropy or variety of the document can be estimated because the concordance, once normalized, is a probability distribution and so
    H(D) = - Sum [w:V(D)]( p[w] * log(p[w]) )
where for all w:V(D), p[w]::= F[D](w) / Sum [v:V(D)](F[D](v)). With logarithms to base 2 the entropy is measured in bits.
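The entropy estimate can be sketched as follows. It uses the standard Shannon formula with logarithms to base 2; the base and the function name are assumptions, since the note does not fix either.

```python
from collections import Counter
import math

def entropy(concordance):
    """Shannon entropy, in bits, of the word distribution in a concordance."""
    total = sum(concordance.values())
    result = 0.0
    for count in concordance.values():
        if count > 0:
            p = count / total            # p[w] = F[D](w) / total number of words
            result += p * math.log2(1 / p)
    return result

print(entropy(Counter("a b c d".split())))   # 2.0 bits: four equally likely words
print(entropy(Counter("a a a a".split())))   # 0.0 bits: one word, no variety
```

A document that repeats a few words scores low, and one that spreads its usage evenly over many words scores high.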
It would be interesting to gather data on how the above measures correlate with other properties of projects.
Style
The size of a vocabulary is one measure of the unreadability of a document. There is a
well-known and finer metric for English documents. In English, the readability of a document is
decreased by the use of long words and long sentences. This has been codified into a metric
called the Fog Index. There exist proprietary packages such as Grammatik that will measure this
and related values and interpret them for authors. Such a package is an effective tool for improving a written
document's understandability.
For example, according to my copy of Grammatik, this document has a Fog index of 14. Perhaps all software documentation should be checked for style.
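A rough version of this index can be computed without a commercial package. The sketch below uses the usual Gunning formula, 0.4 times the sum of the average sentence length and the percentage of words with three or more syllables. The syllable counter is a crude assumption (it counts runs of vowels), so its results will not match a commercial checker exactly.

```python
import re

def syllables(word):
    """Crude syllable estimate: the number of runs of vowels (an assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    """Approximate Gunning Fog index of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    hard = [w for w in words if syllables(w) >= 3]
    average_sentence_length = len(words) / len(sentences)
    percent_hard_words = 100.0 * len(hard) / len(words)
    return 0.4 * (average_sentence_length + percent_hard_words)

print(fog_index("The cat sat. The dog ran."))
print(fog_index("Documentation is incomprehensible."))
```

The first sample scores far lower than the second, which has two long words out of three.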
It is again easy to process a vocabulary file to select the longest words that are not in the project's technical dictionary. The following UNIX pipeline selects the 20 longest words, for example:
    awk '{print length($0) "\t" $0}' non-tech.vocabulary | sort -nr | head -20
These words are likely to be a problem for other readers and perhaps they should be eliminated.
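The same selection can be sketched in Python for systems without these UNIX tools. The function name and the sample words are hypothetical.

```python
def longest_words(vocabulary, technical=(), count=20):
    """The longest words of the vocabulary that are not technical terms."""
    candidates = set(vocabulary) - set(technical)
    # Longest first; words of equal length are listed alphabetically.
    return sorted(candidates, key=lambda w: (-len(w), w))[:count]

vocab = {"a", "understandability", "hash", "incomprehensible", "documentation"}

print(longest_words(vocab, technical={"hash"}, count=2))
# ['understandability', 'incomprehensible']
```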
Formulae
There have been several proposals for making documentation less ambiguous. One idea is to use
mathematics to clarify documents. This idea is unpopular with those who cannot do it or don't want to
risk the time learning to do it. However there are projects that have used mathematical notation
successfully.
Such formal language is not enough however! A common error made by novice programmers is to assume that the comments in a program do not matter. This re-appears in the idea that only the mathematics in documentation is important. Practical projects and class room experiments suggest otherwise. Formal documentation is not, of itself, understandable. It must be embedded and connected with natural language examples and glosses.
Perhaps better documentation will be a network of mathematical formulae and natural language. The formulae can disambiguate the informal parts. Formulae also allow a great deal to be said with very few symbols - an observation made by Charles Babbage. Formulae can also be manipulated safely to solve problems etc. - algebra. The informal parts are essential to make the formulae understandable to the reader.
However the tools and methods are not matched to their market place. The commonest markup language for mathematicians ( LaTeX ) is not a formal language. It defines the look and feel of a document not its meaning. TeX is also more complex than is needed for software engineering and its clients. It adds a large vocabulary of words to documents that have nothing to do with the problem being solved or the methods used to solve it.
Most software engineering documentation can be formalized with discrete mathematics. So TeX or LaTeX may not be appropriate for software engineering. Several research teams and I have been developing well-defined languages for software engineering documentation. My own notation is experimental. It is designed to be readable in ASCII and easy to translate into the HyperText Markup Language (HTML) or TeX. See http://www.csci.csusb.edu/dick/research.html for details.
Syntax and Grammar
Software engineers are used to having the syntax of source code checked by a program. Most
word processing systems provide tools that look for common errors and ambiguities. They also
spot problematic English like the "Passive Voice". There is no reason why all documents should
not have their grammar checked as well.
Mathematical formulae can have their syntax checked. They can also be checked for non-context-free dependencies - type checking. The formal validity of an argument can be tested by algorithms described in texts on formal logic.
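As one small instance of such an algorithm, the validity of an argument in propositional logic can be tested by trying every assignment of truth values. The sketch below is an illustration only: the function name and the encoding of formulae as Python functions are assumptions, and the method covers propositional arguments, not logic in general.

```python
from itertools import product

def is_valid(premises, conclusion, variables):
    """True if no truth assignment makes every premise true and the conclusion false."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False   # found a counterexample
    return True

def p_implies_q(env):
    return (not env["p"]) or env["q"]

# Modus ponens: from "p implies q" and "p", conclude "q".
print(is_valid([p_implies_q, lambda e: e["p"]], lambda e: e["q"], ["p", "q"]))   # True

# Affirming the consequent: from "p implies q" and "q", conclude "p".
print(is_valid([p_implies_q, lambda e: e["q"]], lambda e: e["p"], ["p", "q"]))   # False
```

The cost doubles with each extra variable, so this brute-force approach only suits small arguments.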
It may even be possible to develop combined syntax and grammar checkers that scan documents written in a mixture of formal and natural languages. Perhaps checking the grammar of documents in a software project could help the developers do a better job by ensuring that the documents are more readable and less ambiguous.
Other Techniques
This short note has no space to discuss the pros and cons of using graphics and tables in
computer documentation. I believe that we can and should develop documents that have parts
that can be viewed in many different ways. SGML points the way to documents with an
underlying logical structure with multiple renderings: text, formula, graphic, and tabular.
Neither is there space to discuss the work on the best structure for documentation. It is clear that the use of contents lists and indexes makes a document easier to understand. Some research suggests that a book-like format works well. It is again easy to develop or find tools that aid structure and also generate lists and indexes automatically. We could prepare a concordance of each document and make sure that all the words that are used infrequently in the document are indexed and defined.
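The last suggestion can be sketched directly from a concordance. The threshold of one use, the function name, and the sample text are assumptions chosen for the example.

```python
from collections import Counter
import re

def words_to_index(text, glossary=(), max_uses=1):
    """Words used at most max_uses times that the glossary does not yet define."""
    defined = set(glossary)
    concordance = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(w for w, n in concordance.items()
                  if n <= max_uses and w not in defined)

text = "The parser reads tokens. The parser builds a tree. The lexer makes tokens."

print(words_to_index(text, glossary={"tree"}))
# ['a', 'builds', 'lexer', 'makes', 'reads']
```

In practice the reader's natural language would be removed first, so that common words such as "a" and "makes" do not appear in the list.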
Last and Best
I have left the most accurate way of assessing the understandability of documents until last. It
has been used for many years on many successful software projects. Its inventor was even given
a prize by his company - IBM. To assess the qualities of a document get a collection of people to
read it and report on it. Make sure they represent those who will be working with the document
and give them the right to send it back to the author for rework if the document is not good
enough. In a word: Inspections.
Conclusion
There exist many tools and techniques that can help software engineers prepare documents that
are less ambiguous, more relevant, and more understandable. They are not the mythical silver
bullets of a method or program. They are merely antiseptics and antibiotics that destroy the
germs that infect software - they may therefore be magic bullets.
( End of section On Estimating the Understandability of Software Documents )