[Skip Navigation] [CSUSB] / [CNS] / [CSE] / [R J Botting] / [Samples] / c.semantics
[Index] [Contents] [Source Text] [About] [Notation] [Copyright] [Comment/Contact] [Search ]
Fri Apr 15 09:50:20 PDT 2011


    Towards a Semantics for C and C++


      I want to find out if mathematical methods can be used in a badly defined area to model and so clarify that area. There are at least two versions of C - K&R C(K_R) and ANSI C(ANSI_C). Both of these are described in detail by example and informal definition plus a few formula. Both explicitly allow implementations to differ in quite significant ways. And yet, it is possible to construct C programs that work and portable across several architectures.

      If such a model exists, then it may be useful in teaching, learning, and using C-like languages. If such a model does not exist then the process of finding this out will improve the methods used to construct mathematical models used in software development.


      In this document I try to develop, step by step, a model of what a C program means in terms of elementary discrete mathematics. I do this by explaining the difficulties as they occur and a possible way to work round them.

      In what follows the aim will be to define various sets, functions, and relations that form the semantic domains for C. The meaning of statements will be defined in terms of relations describing possible changes of state plus formulae describing - loosely - the timing of the statements.

      C is defined in terms of Random Access Memory.

      It is difficult to explain the meaning of a C program without doing it in terms of a hypothetical Von Neuman machine because of the use of pointers to define even simple parts of the languages. Arrays, pointers, strings, union, etc etc get their meaning from the random access memory in which they are embedded. In other words, every object in C has an address and every address has contents. Confusing the issue further is the fact that the contents of an address can be another address.

      In normal denotational semantics of a statically scoped language like C the state of the system is described by two functions: one describes a binding of variables to a set of locations and the other a binding from locations to values. Traditionally these locations have been identified with the natural numbers.

      C is described in terms of bytes. It seems necessary to take a more machine-oriented view of this part of the semantics of C.

      Defining RAM and Objects

      To define the meaning of even simple programs we need the ideas of: address, object, and a relation between them: an object is at a particular address.
    1. Addresses::Finite_Sets.
    2. Objects::Finite_Sets.
    3. at::Objects->Addresses.


        The double colon notation comes from BNF.
        I extend it by adding:
        It indicates a term that is going to be used in several places in a document - or even outside this document. Binding local to a formula use a single colon. For more on the notation see MATHS_lexicon & MATHS_syntax.

      However, knowing the address of an object does not tell us its value. A char and an int can have the same address but have different values. So Objects are of different types:

    4. type::Objects->Types

      For a given type and address there is a value:

    5. RAM::Types><Addresses->Values where
    6. Values::=#Byte, -- where "#" indicates a sequence of zero or more bytes.

      One value in #Byte is the sequence with no bytes in it. We can identify this with the dreaded "void" found in C programs.

    7. Void::=("".Values).

      Each type in C has a size, this is the number of bytes that is used to represent values in the particular implementation being used. As a rule C standards do not specify the precise sizes of most data types except for the type indicated by :"char".

    8. sizeof::Types->Number,
    9. (size0): For Types t, address a, sizeof(t) = len(RAM(t,a)).
    10. (size1): sizeof("char")=1. Notice that an address does not have a value until a type is introduced.

      Normal Semantics for RAM

    11. NORMAL_RAM::=
        The basic operation on RAM is to change a single object. Suppose the we have an object of type t, at Address a, and change it to a different sequence of sizeof(t) bytes b. (a.t:=b)::RAM->RAM.
      1. (a.t:=b)(r) = r' where r'(t,a)=b and for all other bytes r'=r.

      (End of Net)

      The problem with the above (NORMAL_RAM) is that there is no guarantee that an assignment may not effect several pieces of storage, in weird ways. For example the following piece of C code:
       		int i=0, *pi=i; char *pc=(char*)&i; *pc='a';
      when then *pc and *pi are partially aliased, and so the assignement to *pc changes i. The expression (*pc.char='a'.NORMAL_RAM) does not allow i to change (since i is not*pc).

      So if we define assignment by using (a.t:=b) with NORMAL_RAM assumptions then we implicitly invalidate a large number of C programs.

      Instead we start by defining the set of addresses associated with a given address and type: For Address a, Type t, area(a,t)={ a':Address. a <= a' < a+sizeof(t) }.

      Now we can define (a.t:=b)::RAM->RAM.

    12. (ram0): if r'=(a.t:=b)(r) then r'.t=b and for all a':Address~area(a,t)( r(a')=r'(a') ).

      Notice this does not define the map (a.t:=b) precisely enough to predict the outcome of the code with the partial alias or overlap shown above. The C specifications refuse to state whether the first byte in an integer is a high order or a low order byte. The precise effect on i of changing its first byte may be on the most significant or least significant bits. In fact there is nothing to stop a truly wierd implementation of the 'int' and 'char' data types making the middle of the integer change.

      However we can hope that a implementor could give a precise definition of the function if necessary. Further we can still prove some useful results about the behavior of most C programs without having the precise effect of (a.t:=b) defined.

      Hence we can define a relationship between parts of the address space:

    13. For a1,a2:Addresses, t1,t2:Types, a1.t1 overlaps a2.t2::=some (area(a1,t1) & area(a2,t2)).

      Now suppose that variable v1 is assigned value b and is bound to address a1 with type t1, and v2 is a variable of type t2 with address a2 then we can show that

    14. (ram0)|- (ram1): if not(a1.t1 overlaps a2.t2) and r'=(a1.t1:=b)(r) then r'(v2)=r(v2).


      The semantics of C is also messy because of side-effects. Every expression can change the state of memory as well as return a value. The semantics of C is defined by giving values to expressions:
    15. Expressions::Sets, A set that will be described in what follows.
    16. Values::=#Byte,
    17. value::Expressions><RAM->Values.
    18. Byte::Sets, a finite set of 256 given elements.

      We will assume that some expressions are elementary - variables, constants, and so on:

    19. Elementary_expressions::@Expressions.

      Expressions also have a type - the type of value returned when the expression is evaluated:

    20. type::Expressions->Types. We can easily list axioms describing how this type is determined by the syntax of the expression:
    21. (t0): if type(e1)=int and type(e2)=int then type(e1+e2)=int.
    22. (t1): if type(e1)=int and type(e2)=char then type(e1+e2)=int.
    23. ...

      We assume that the value returned has the right size to fit in data of the type:

    24. (expr0): for all Expressions e, RAM r, sizeof(type(e))=len(value(e,r)).

      This (expr0) together with (t0) and some facts about binary representations of numbers has an interesting effect. Sums and products are forced to produce values in a finite subset of the numbers.

      In C expressions have side effects - unlike theoretical languages. Expressions also define a function that describes the effect of an expression on RAM:

    25. effect::Expressions><RAM->change_of_state,

      A change_of_state has to be left fairly loose, for now, and so will be modelled as a relation between two States:

    26. change_of_state::=@(States,States), -- relations between two states.

      Some useful state changes are

    27. none::change_of_state::= s'=s.
    28. crash::= {}. -- no outcome is possible
    29. undefined::= States><States. -- any outcome is possible

      In some cases we can only assign a value to an expression if its subexpressions have no side-effects. I'll write the property of not having side-effects using the word "pure".

    30. For Expressionse e, pure(e)::@= effect(e)=none.
    31. For Pure::@Espressions= {e:Expressions. effect(e)=none}.

      If an expression is not Pure then it will be difficult to predict the behavior of expression that contain it unless we can show that the effected part of RAM is not involved in any other subexpression. So I will assume that for any expression e we can define a collection of bytes in RAM that can be change when the expression is evaluated:

    32. For Expressions e, Effected(e)::@Addresses.

      Suppose that an Expression e has the effect of change state s to s', then we can (in theory) compare the two states and list the addresses that have changed: { Addresses a. s(a)<>s'(a) }. An address can be effected if there exists a state s such that it can be changed so we can define:

    33. Effected(e)::={a:Addresses. for some s,s'( s(effect(e)s' and s(a)<>s'(a)) }.

      We can sow the following properties: For Pure e, Effected(e)={}.

      Note. Because change of state is a relation (not necessarily a function) it would allow us to describe changes of state that give rise to no future states... in other words crashes. Similarly, we could (if needed) allow a change_of_state to be non-deterministic. However there is a subtle problem with this approach [ Limitations in rjb95a.Relations.vs.Programs ]

      A simple resolution to this problem is to include a model of timing as part of the meaning of the command or expression[Hehner Circa 1990]. So define

    34. delay::Expressions><RAM->change_of_time.
    35. change_of_time::@Real= { t:Real || t>0 }.

      Clearly if one expression contains another then the delay will be at least the delay caused by the subexpression plus the delay caused by the rest of the expression. However in the case of the conditional and boolean expressions short circuit evalutaion will mean that we can not just add up the delays of all the subexpressions to get a lower bound on the delay of the whole.

      For the elementary parts of expressions we want way to express the fact that they terminate at some future and finite time. However we do not want to get into over-specifying the precise timings of each element. A suitable assumption is that there is a worst elementary Expression - one that takes longer than the rest. So we suppose that there is a constant

    36. T::Real&Positive. For all elementary expressions e, RAM r, delay(e,r)<=T.
    37. (T0): for all e:Elementary_expressions, r:RAM, delay(e,r)<=T.

      Increment Operations

      Consider these three, easily confused expressions:
      1. e1 := v+1,
      2. e2 := v++,
      3. e3 := ++v.

      We assume:
    38. |-e1, e2, e2 in Elementary_expressions.


    39. type(e1)=type(e2)=type(e3)=type(v).

    40. value(e2,r) = r(type(v), address(v)).
    41. value(e1,r) = value(e3,r) = 1+r(type(v), address(v)) where the addition is in sizeof(type(v)) byte arithmetic.

    42. effect(e1,r) = none.
    43. effect(e2,r) = ( address(v).type(v) := 1 + r(type(v), address(v)) ).
    44. effect(e3,r) = ( address(v).type(v) := 1 + r(type(v), address(v)) ).

      Meaning of Arithmetic Operators

      The description of the ++ operator used the mathematical symbol for addition - however the arithmetic in C can only be assumed to be arithmetic modulus a particular value that depends on the type of the expression being evaluated. For example in the above v may have been a char and hence have values either in range 0..127 or in 0..255. If it was an integer, then the range will be bigger but the exact size is not specified by any standard.

      In other words the arithmetic operations in C and C++ are defined by the compiler - even by compiler options. For more on this see Hoare's ground breaking (and readable) classic on the axiomatization of computer programs. [Hoare69]

      From here on in I will assume that I can use the normal symbols for addition, subtraction and so on, but without postulating any properties for them other than closure. In other words when arithmetic is used to model an operation on a value of a given C type all we can be sure of is that the result is also of the same type.

      Pointer arithmetic will be tackled later. [ Pointer Arthmetic ]

      Even with ignoring the ambiguous nature of the arithmetic in C one can not give precise semantics for something as simple as an addition because the two added subexpressions may have side-effects and C standards follow the Algol tradition of not specifying the order in which subexpressions are evaluated. For example the following expression does not have a clear value:

      The obvious rules are that when subexpressions are pure then the arithmetic operators add the results:
    45. For Pure e1,e2, value(e1+e2,s) = value(e1,s)+value(e2,s).
    46. For Pure e1,e2, value(e1*e2,s) = value(e1,s)*value(e2,s).
    47. For Pure e1,e2, value(e1-e2,s) = value(e1,s)-value(e2,s).
    48. For Pure e1,e2, if value(e1,s)<>0 then value(e1/e2,s) = value(e1,s)/value(e2,s).
    49. For Pure e1,e2, if value(e1,s)<>0 and type(e1,s)=type(e2,s)=int then value(e1%e2,s) = value(e1,s)%value(e2,s).

      It may be possible to suggest some stronger rules. For example if the areas of storage used in two subexpressions are free of the side-effects of the other subexpression than the result is predicatable:

      If Effected(e1)&Uses(e2) = Effected(e2)&Uses(e1) = {} then value(e1 + e2,s)=value(e1,s)+value(e2,s) and effect(e1+e2)=effect(e1);effect(e2)=effect(e2);effect(e1).

      Hence we need a model of the areas of storage that are used in an expression. These are the addresses that when changed may change the value or effect of an expression:

    50. Uses(e)::@Addresses={a:Adresses. for some s,s'( s(a)<>s'(a) and (value(e,s)<>value(e,s') or s.effect(e)<>s'.effect(e)) and for all a'<>a(s(a')=s'(a')) }.

      I will delay discussing the ways that the areas or RAM used by an expression actually depends on the state of the RAM until Pointers are discussed.

      The above does not discusss long, char, unsigned, and short versions of the '%' operation. It also begs the question of what the precise version of integer division is expected. Unlike in Ada 83 the standards permit integer division with negative number to behave in several different ways. For example 5/-3 may be either -2 (round down) or -1 (round towards zero).

      Implicit coercions and casts further confuse the issue as well.

      It is somewhat surprising that C programs look predictable to the moderately experienced programmer.


      The comma operator however is camparatively well defined... it does not depend on the implementation and has simple semantics:
    51. For Expressions e1,e2, e1, e2::Expressions,
    52. For Expressions e1,e2, type(e1, e2)::=type(e2),
    53. For Expressions e1,e2, RAM r, value(e1,e2, r)::=value(e2, effect(e1,r)r).
    54. For Expressions e1,e2, RAM r, effect(e1,e2,r)::= effect(e2) o effect(e1).
    55. For e1,e2,r, delay(e1,e2, r)::=delay(e1,r)+delay(e2, effect(e1,r)r).

      Conditional Expressions

      C allows conditional expressions like this:
       		x != 0 ? y/x : 0
      Notice that these are defined to give short-circuit evaluation.

    56. For Expressions e1,e2,e3, if type(e2)=type(e3) then e1 ? e2 : e3 :: exprssion,
    57. For Expressions e1, e2, e3, type(e1 ? e2 : e3)::=type(t2).
    58. For Expressions e1,e2,e3, RAM r, if value(e1, r) <> 0 then value(e1?e2:e3, r)::= value(e2, effect(e1,r)).
    59. For Expressions e1,e2,e3, RAM r, if value(e1, r) = 0 then value(e1?e2:e3, r)::= value(e3, effect(e1,r)).
    60. For Expressions e1,e2,e3, RAM r, if value(e1, r) <> 0 then effect(e1?e2:e3, r)::= effect(e2, effect(e1,r)).
    61. For Expressions e1,e2,e3, RAM r, if value(e1, r) = 0 then effect(e1?e2:e3, r)::= effect(e3, effect(e1,r)).
    62. For Expressions e1,e2,e3, RAM r, if value(e1, r) <> 0 then delay(e1?e2:e3, r)::= delay(e1, r)+delay(e2, effect(e1,r)).
    63. For Expressions e1,e2,e3, RAM r, if value(e1, r) = 0 then delay(e1?e2:e3, r)::= delay(e1, r)+delay(e3, effect(e1,r)).

      Logical Operators

      The logical operators in C convert expressions into zero (for false) and one(for true):
    64. bit(false)=0
    65. bit(true)=1. The logical operators '||' and '&&' use short circuit evaluation. I will show this for one of the operators (||):
    66. For Expressions e1,e2, e1 || e2 :: expression,
    67. For Expressions e1, e2, e3, type(e1||e2)::=int,
    68. For Expressions e1,e2, RAM r, if value(e1,r) <> 0 then value(e1||e2, r)::= 1.
    69. For Expressions e1,e2, RAM r, if value(e1,r) = 0 then value(e1||e2, r)::= bit(value(e3, effect(e1,r))<>0).
    70. For Expressions e1,e2, RAM r, if value(e1,r) <> 0 then effect(e1||e2, r)::= effect(e1,r).
    71. For Expressions e1,e2, RAM r, if value(e1,r) = 0 then effect(e1||e2, r)::= effect(e2, effect(e1,r)).
    72. For Expressions e1,e2, RAM r, if value(e1,r) <> 0 then delay(e1||e2, r)::= delay(e1, r)
    73. For Expressions e1,e2, RAM r, if value(e1,r) = 0 then delay(e1||e2, r)::= delay(e1, r)+delay(e2, effect(e1,r)).

      The formula for e1 && e2 are similar and left as an exercise.


      An assignment first evaluates its right hand size - which can have an effect on the state and then the new value is placed in RAM.
    74. For variable v, Expressions e, v=e::Expressions.
    75. value(v=e,r) = value(e,r),
    76. effect(v=e,r) = (effect(e); (address(v).type(v) := value(e,r)).
    77. delay(v=e, r) > delay(e,r),
    78. delay(v=e, r) <= delay(e,r)+ 2 * T.

      Note: Allows for the extraction of the address of v and copying as an elementary operations.

      It avoids the issue of what happens when the left hand side is not a simple variable:

       		a[++i] = i++;


      Any expression becomes a statement when followed by a semicolon in C. A statement has an effect on RAM, but no value:
    79. Statements::Sets.
    80. effect::Statements->change_of_state.

      We also have void expressions:

    81. Void_expressions::={e:Expressions || value(e) in Void }.

      Expression Statements

      This approach has the effect that we can handle expression statements simply.
    82. expression_statement::= Expressions ";". We map statements into a change of state - but they have no value and no type.
    83. For Expressions e, e;::Statement.
    84. (s0): For Expressions e, effect(e;,r)= effect(e).
    85. (s1): delay(e;,r) <= delay(e,r)+T. Note: removing an item from the stack is also an elementary operation.

      Values of pointers, arrays, and such.


      We will suppose that some values are also addresses. We can express this be defining a partial map from Values into Addresses:

    86. A::Values<>->Addresses

      Now all pointers are the same size:

    87. (A0): For all v1,v2:pre(A), sizeof(A(v1))=sizeof(A(v2)).

      Further, only values of type address(T) are defined...


      Pointer Arithmetic





    88. K_R::=The version of C described by Kernighan and Ritchie.

    89. ANSI_C::=The version of C described by the standard promulgated by ANSI.
    90. ANSI::=American National Standards Institute.

    91. TBA::=To Be Announced


    92. MATHS_lexicon::= See http://cse.csusb.edu/dick/samples/math.lexicon.html
    93. MATHS_syntax::= See http://cse.csusb.edu/dick/samples/math.syntax.html

    94. C_syntax::= See http://cse.csusb.edu/dick/samples/comp.c.syntax.html

      (UML model of the data types in C): [ C.png?root=atlantic-zoos ]


    . . . . . . . . . ( end of section Towards a Semantics for C and C++) <<Contents | End>>