Using the UML to Specify Languages

Richard J. Botting
California State University, San Bernardino
7700 State University Parkway, San Bernardino CA 92407, USA

rbotting@csusb.edu

Abstract. The UML and the OCL complement BNF. The author shows how the UML can be used to describe the structure and semantics of languages. He gives examples of: the relationships between BNF and the UML, the semantics of LISP1.5, the Composite pattern, static scoping, and polymorphic Object-oriented operational semantics. Some alternatives are discussed and future work described.

1 Introduction

External software design describes the software's interfaces. All interfaces have rules (syntax) and meaning(semantics). Implementers and users need to be able to find out about them(Figure 1). Some interfaces are clearly programming language. Data streams, protocols, and dialogues are also modeled as formal languages. For samples, see my World Wide Web site [ 2]. In this paper I will show how the UML complements the most popular form of syntax description. The UML supplies missing context sensitive constraints and semantics.

For syntax, the designer is forced to use "BNF." Nearly all Language Reference Manuals (LRMs) have syntax rules defined using some variation of the "BNF" meta- language developed by J. Bachus and P. Naur for Algol 60. BNF is based on a mathematical theory of grammar. Extended BNF (EBNF) is a class of later BNF-based notations. My Web site[2] includes specifications of several such meta-languages. The EBNF is simple, usable, formal, and ubiquitous. Many software tools use it. SGML and XML include their own variation on EBNF. However, it is well known that BNF cannot express all the rules (syntax) and only hints at the meaning (semantics) of the language. In practice, the designer uses EBNF and adds disclaimers and commentary to fill the gaps. Recent examples are in books on XML[ 10 ] and VRML[1]. There are no popular, simple, readable, and rigorous techniques for expressing the "context dependencies", "Well-Formedness Rules"(UML Specification], or "static semantics" that EBNF cannot handle. Even so, the designer has little choice but to use some kind of EBNF for syntax.

The opposite happens with semantics. Mathematical methods for expressing meaning are not used in LRMs [16]. Very few LRMs (outside the ACM SIGPLAN Notices) provide any formal semantics. Most use natural language. Natural language specifications are often hard to read and usually have inconsistencies and ambiguities. A popular solution is to create a self-referential description of the language like the UML meta-model. Other designers rely on source code to define the meaning of input and output. The code may be for a real machine or a virtual one. Source code is usually full of irrelevant detail, hard to read, user-hostile, and sometimes a trade secret. In this paper I introduce Object-Oriented Operational Semantics expressed in the UML in the hope that it will can become a popular yet precise technique for defining languages.

2 The ROOT Project

This paper is a side-effect of the ROOT project at the Computer Science Department, CSUSB [9]. ROOT integrated object-oriented technology into the core of our Computer Science Bachelor's degree. The project was supported by an award from a CSUSB Learning Productivity Grant from September 1997 to December 1998. I was tasked to add the UML and OOA/D (Object-Oriented Analysis and Design ) to the "Concepts of Programming Language" course[3]. This is a 10-week about the ideas of languages. The students can already program in C++ and have met the UML. The UML is used throughout the course in lectures, handouts, and exercises. An OOA/D project that used the UML was added to the course.

The question I had to answer before teaching it was: "How should the UML be used in describing languages?" The answer was in a simple diagram of LISP data in the UML. It shows ideas that McCarthy used in the LISP1.5 LRM[14]. This LRM used the LISP M-language to define the semantics of LISP using half-a-dozen primitive functions. All these functions are either in Figure 2 or easily added to it (details in section 6.1 later). Rational Rose 4.0 on a laptop supplied by the CSUSB grant made it possible to model the semantics of LISP1.5 completely in the UML plus the Object Constraint Language (OCL). This model is on the World Wide Web[5]. Most of the course materials are now on the World-Wide Web. They include dozen's of models[3]. The outcomes will be described elsewhere [8]. The rest of this paper describes how to use BNF+UML+OCL to define languages.

3 Relations Between BNF and the UML

There is no simple mapping between BNF and the UML. The UML expresses ideas. BNF describes how to recognize and parse sequences of data. The UML class model is perfect for expressing what formal linguists call abstract syntax. This is the structure generated by a parser that turns input into a set of objects of different types. The UML shows the types of objects and interconnections between them. BNF defines concrete syntax. The lexemes, rules of priority, associativity, and so on, are all questions of concrete syntax. They are best expressed in BNF and left out of the UML. BNF describes a shell and the UML the oyster inside. So, the UML class diagram does not always have one class for each BNF term. Sometimes a term defined in the syntax is most naturally modeled as an attribute or role in the UML rather than as a class.

Theorists demonstrate their linguistic techniques using a simple structured language called "While." "While" has assignments, compound statements, while-loops, and if-then-else statements. Figure 3 shows the abstract syntax of While using the UML. Similar structures occur for all modern languages. The Rational Rose 4.0 model (on the World Wide Web[6]) includes the BNF as documentation.

4 Common Patterns

Abstract classes appear naturally in the UML diagrams of many programming languages. For example, a Statement is defined as something that can be executed. Yet, each type of Statement executes differently. So Statement is an abstract class. A loop statement is a special kind of Statement. All loops control a loop body. So the body is a part of all loops. However, different kinds of loops behave differently, so the Loop class is abstract. A conditional loop has a condition. Conditional loops are further classified into the while-loop, the until-loop, the do-until-loop, etc. Each defines execution differently in terms of the same inherited condition and body. Circa 1980 Smalltalk used inheritance in a similar way to define its control structures.

The Composite pattern (described by the "Gang of Four"[10]) is ubiquitous in programming languages. In this pattern objects are either composite or elementary. Composite objects contain or refer to several component sub-objects. The meaning of a composite object is defined in terms of the meaning of its components. The Figure 2 and Figure 3 above are examples. Two more examples are expressions (Figure5) and block structures(Figure 4) later in this paper.

5 Context Dependency

EBNF cannot express all the rules defining a correct part of a complex language. Some call these rules "static semantics." Clearly adding multiplicities and constraints to a class diagram of the abstract syntax can express many context dependent syntax rules. Other formal ways to express non-context free constraints have been invented: attribute- grammars, multilevel or W-grammars, concurrent grammars, etc. The UML is used more widely than any of these. I will show that the UML combined with the Object Constraint Language (OCL) can be used to define a subtle rule that occurs in most block- structured languages.

Consider the ideas of "Block Structure" and "Static Scoping"as examples. For a program to be syntactically correct it must fit the BNF syntax and all variables that occur in it must be declared somewhere. In C and Java 1.0 a variable can be declared locally in a function or in the file or class surrounding its declaration. In Algol, Pascal, Ada, C++, and Java1.1 however a function or subprogram can be defined inside the definition of another one. The correct declaration for a variable can be found in the local block, in the surrounding block (if any), or the block that contains the function definition(if any), or its surrounding block, and so on, see Figure 4.

Figure 4 shows that Statements can be Blocks and Blocks can contain both Statements and Declarations. The Declarations can declare Variables or Subprograms. Since a Block has one or more Statements and a Statement may be a Block, Blocks can be inside Blocks. A Block can be the body of a SubprogramDefinition or be a Statement inside another Block, but not both. Entities are occurrences of things that should be declared: Variables and SubrogramCall.

The operation decl(e) tests if a Declaration could declare an Entity e. This involves comparing the identifiers, signatures and data types. Many Declarations can match an Entity. If none exist, then the program has an error. Figure 4 has associations that link a Variable v to a unique correct Declaration v.vd and link a SubprogramCall s to its Definition s.sd. The default multiplicities on these links model a correct program. Scoping defines the Declaration that is associated with each Entity. The associations are derived using 'decl(e)' and an operation that is the responsibility of a Block. The operation Block::chain(Entity e) models the "Static Scoping" rule. The correct declaration for entity x is found by asking the local block (x.s.b) to search the "static chain" for the first declaration y such that y.decl( x ).

block:Block::chain(e:Entity):Declaration 
post:
 if block.d->exists( y | y.decl(e))
 then
    result = block.d->select(y | y.decl(e))
 else
  if block.b->exists then
    result = block.b.chain(e)
  else
   if block.sd->exists then
    result = block.sd.b.chain(e)
   else
    result = false
   end if
  end if
 end if

This section showed how a complex non-context-free rule can be expressed in the UML + the OCL. The next section develops techniques for defining the meaning of common programming language constructs using the UML + the OCL.

6 Defining Semantics in the UML

Semantics does more than describe validity. Semantics must supply the meaning for statements in the language. Placing operations in the model and specifying them with the OCL defines the object-oriented operational semantics of the language.

6.1 LISP

As an example, the primitive functions used to define LISP in [14] were: car, cdr -- roles in Figure 2 above. cons -- <<constructor>> ConsPair::cons(List, List) null -- true of NIL else false. atom -- true of Atoms else false equal -- test for same object. The last four are added as operations to Figure 2 above. Polymorphism means that their specifications are simple. For example, the operation null() needs to be a member of only in List and NIL in Figure 2. The following two post-conditions define null precisely:

:List::null()
   result=NIL 
:NIL::null()
   result=T 

A comprehensive model of the LISP1.5 LRM is on the World Wide Web[5].

6.2 Simple Expressions

To save space in this paper, my next example is drastically simplified is typical. An Expression is either a constant(a Real), an addition, or a multiplication (Figure 5 below). The notes in Figure 5 show the post-conditions for the operation specification of Addition::eval() and Constant::eval(). Multiplication::eval() has the following post-condition

result=subexpr.eval()->iterate(v, a=1 | a*v). 
6.3 Variables

There are no variables in the simple Expressions in Figure 5. Normally, the meaning of a program is expressed in terms of Values and Variables. An abstract State class encapsulates the bindings between Variables and Values(Figure 6 on the next page). The State class is used (1) to hide the different ways that different languages handle Variables, and (2) to allow the evaluation operation (Expression::eval) to refer to many Variables. Typically State hides symbol tables (environment) and vectors of values ( stores )[15].

The operations on a State (Figure 6) have three properties:

  1. Operation 'get' does not change the State.
  2. After 'set(v, c)', 'get(v)' returns c.
  3. After 'set(v1,c),' 'get(v2)' is not changed unless v1=v2.
6.4 Side-Effects

Imperative languages have complex semantics because statements and expressions can have side-effects. A side-effect changes the State. A simple technique is to define an operation that has a State as an argument and returns the new State. I use the identifier "exec" to name this operation. The signature is: exec(s:State):State For example, the semantics for the While language ( Figure 2 above) requires that exec is abstract in Statement and implemented in Assignment, While, IfThenElse, and Compound. Their specifications are:

c:Compound::exec( s:State )
 post: result = c.statements->iterate(v, a = s | v.exec(a) ) 
a:Assignment::exec( s:State )
 post: result = s.set(a.lhs, a.rhs.eval(s)) 
w:While::exec( s:State )
 post:
 if w.condition.eval(s) then result = w.exec(w.body.exec(s))
 else result = s
 end if 
i:IfThenElse::exec( s:State)
 post:
 if i.condition.eval(s)
 then
   return = i.thenpart.exec(s)
 else
   return = i.elsepart.exec(s)
 end if 

The complete model of the structure and semantics of "While" is on World Wide Web[6].

7 Discussion

7.1 Are Grammars Useful for Anything but Languages?

It might be argued that grammars are only needed for programming languages. However, grammars have been used as tools for markup languages, Internet Request-For Comments, translators, data processing[12], and management information systems[13]. EBNF is a well-understood formalism. It can be used to clarify any sequential structure.

7.2 Can the UML Describe Syntax without BNF?

It can be argued that the UML already has the capacity to describe syntax since it includes state charts. Start charts can be used to describe "acceptors" for languages. They should be used when the states are already defined. For example, they can be used for documenting and designing communication protocols. However, the syntax of a language does not explicitly give the states. EBNF documents meaningful structures: (sequences, alternatives, repetitions) without introducing states. Theory also indicates that BNF is more powerful than finite state diagrams. State charts would have to allow recursive states before they have equal descriptive power. Experienced designers like Per Brinch-Hansen state that BNF-style syntax helps the designer more than a statechart-like diagram [7].

7.3 Why use Statement Operations to define Semantics?

Using 'eval' and 'exec' operations are not the best way to document the meaning of all languages. Class diagrams describe the VRML well, for example, because a VRML world is a collection of objects[1].

Alternative Classification.

An alternative to associating an 'exec(s:State)' with each type of Statement is to define an operation State::exec(v:Statement). The 'State::get' and 'State::set' operations are no longer needed. The State is more tightly encapsulated. However, the dispatch on the type of Statement must now be explicit. The result is as complex as earlier forms of operational semantics. It is also worth separating State structure from the execution of Statements so that State absorbs the changes caused by adding complications like declarations.

Alternative Methods.

Programming languages have three kinds of semantic methods[15]:

Name Provides
Operational A trace of a series of computations.
Denotational A mathematical function relating input to output.
Axiomatic A set of rules for reasoning about program correctness.

In this paper I have presented operational semantics. It is possible that there are object-oriented denotational and axiomatic semantics. This is left for further research.

7.4 How do you connect UML to BNF?

So far I have relied on the reader's intuitions or placed the EBNF in the documentation of classifications. Perhaps, one could use a tagged value with name "syntax" and a value indicating the document and BNF non-terminal.

7.5 Theoretical Objections

Theorists will worry that any object-oriented semantics is incomplete until there is a mathematical semantics for objects. This is an active research topic[15, 16]. In particular my semantics for the "While" language includes a specification where the object sends an 'exec' back to itself. Perhaps the OCL needs a published mathematical theory that proves that the above recursion is well defined.

8 Conclusion

The UML complements and does not replace BNF. The UML and the OCL are a popular notation that can express context-dependent rules("static semantics"). They can also express semantics. The main advantages of this are:

Perhaps the almost impossible job of writing precise and understandable definitions may become a merely difficult one.

9 Plans

This paper covers topics in a typical undergraduate Programming Language Concepts class. These topics are also reviewed at the start of graduate courses on the theory of programming languages. I plan to express the ideas found later in the CSUSB graduate courses using the UML+OCL as well [ 4 ].

I am incorporating the UML into my documentation of notations on the World Wide Web [2]. This is a repository of information for practicing programmers and students of computing. It is a semiformal, indexed, and searchable hypertext based on a highly extended BNF. It has many specifications from Algol 60 to Z including Email and the OCL. Anyone can view the EBNF parts using Netscape or MS Internet-Explorer. The UML models must be downloaded as text and loaded by another tool before the details can be examined. The UML would be more useful if it were better integrated with the World Wide Web.

References

1. Ames, A.L., Nadeau, D.R., Moreland, J.L.: VRML 2.0 Source Book(2nd Edn.), John Wiley & Sons New York (1997)

2. Botting, R.J.: Documentation Samples, http://www.csci.csusb.edu/dick/samples/index.html

3. Botting, R.J.: CS320 Programming Languages , http://www.csci.csusb.edu/dick/cs320

4. Botting, R.J.: CS620 Theory of Programming Languages, http://www.csci.csusb.edu/cs620

5. Botting, R.J.: LISP in UML, http://www.csci.csusb.edu/samples/dick/lisp.mdl

6. Botting, R.J.: While in UML, http://www.csci.csusb.edu/samples/dick/while.mdl

7. Brinch-Hansen, P.: The Edison Papers, Software -- Practice and Experience Vol. 11, No 4 (Apr 1981) and Programming a Personal Computer, Prentice Hall International 1983

8. Concepcion, A.I., Botting, R., Mendoza, J., Karant, Y., Scroggins, D., Desar, S: ROOT: Refashioning Object-Oriented Technology Teaching. in preparation for next OOPSLA.

9. Concepcion, A.I., Scroggins, D., Desar, S: Refashioning Object-Oriented Technology Teaching, http://www.csci.csusb.edu/rootproj

10. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Reading MA, 1994

11. Graham, I.S., Quian, L., XML Specification Guide, John Wiley & Sons, New York (1999)

12. Jackson, M.A.: Principles of Program Design, Academic Press, New York (1975)

13. Jackson, M.A.: System Development, Prentice-Hall International, London UK (1983)

14. McCarthy, J., Levin, M. I.: LISP 1.5 Manual, the Computation Center and Research Laboratory of Electronics, Massachusetts Institute of Technology, MIT Press Cambridge Mass. (1962)

15. Schmidt, D.A.: Programming Language Semantics, ACM Computing Surveys Vol. 28, No. 41(March 1996)pp265-267

16.Schmidt, D.A.: On the Need for a Popular Formal Semantics, ACM SIGPLAN Notices, Vol. 32, No. 1 (January 1997)pp115-116, and also ACM Computing Surveys Vol. 28, No. 4es (Dec. 1996), Pages 175-es http://www.acm.org/pubs/citations/journals/surveys/1996-28-4es/a175-schmidt/