.Open Syntax (XBNF) XBNF::="Extreme BNF", a form of syntax description that extends Backus-Naur-Form as far as it can go. XBNF uses space, bars ('|') and the symbol '::=' as in BNF. It uses '()' as parentheses and '#' to mean 'any number of'. This choice follows more than 10 years of testing other forms in class and on computers. The EBNF symbols '<>{}[]' are used for other purposes Here is comparison of BNF, EBNF and XBNF. .As_is BNF ::=| .As_is Ada EBNF number::= digit {digit} .As_is XBNF number::= N(digit). You can see which is shorter. XBNF is more powerful than EBNF as well. It is not as short as the theoretical notations for a grammar because it is designed to work in ASCII (no Greek or special symbols) and yet say more. . A Mapping into English .As_is "::=" is read "is defined to be" .As_is "&" is read "and also" .As_is "~" is read as "but not" .As_is "|" is read as "or" .As_is "#" is read as "any number of including none" .As_is "(" is read as "a sequence of" .As_is ")" is read as ", end of sequence" .As_is space between 2 parts in a sequences can be read as "and then" .As_is "_" becomes a space. .As_is "N(_)" is read as "one or more _" and short for "_ #(_)" .As_is "O(_)" is read as "optional _" and is short for "(|_)" . A Meta-Grammar Here is a set of definitions that defines a XBNF grammar. grammar ::= #( $definition | $comment ), a grammar has any number of definitions and comments. A useful grammar has comments that describe why a syntax is needed and lots of examples of how to use it. The definition above is followed by this comment for example. definition ::= $defined_term "::=" $selection, a definition has a "::=" between the defined term and an expression defining a set. This definition defines what a $definition `is`: it is something that defines a $defined_term (what ever that `is`) plus a special colon-colon-equals opearator and then an expression that defines a selection of alternative forms (see below). selection ::= $alternative #( "|" $alternative ), a selection is made of alternatives. It means making a choice of forms. Mathematically is no more or less than the union of the alternative forms. alternative ::= $intersection | $complementary_form. Each alternative is either a intersection or complementary_form. intersection::= $sequence #("&" $sequence ), the ampersand("&") character indicates that all the sequences must hold at once (set intersection). complementary_form::= $sequence "~" $sequence. In a complementary form ( `A~B` say) then any `A` which is not a `B` is in `A~B`. An example would be `#(a|b) ~ (a b)` : any sequenced of a's and b's except the sequence `ab`. sequence ::= #$phase , phase ::= $O("#") $item, The 'number sign'(#) indicates the occurrence of any number of the following item - including none. The above shows an `optional item`: O("#"). So the definition of $phase above implies () |- phase = "#" item | item. item ::= $element | $defined_term | "("$selection ")", an item can be a more complex sructure in parentheses of some simple name, or a string of characters in quotes. comment ::= (#$character )~$definition. Any number of characters that are not definition form a comment. This sentence is therefore part of a comment in this grammar. Here is the definition of what an element or terminal looks like - its actually a C string. element::=$quote #($char~($quote| $backslash)|$backslash $char) $quote. quote::="\"". backslash::="\\". Here is a more complex definition the defines what a defined term can be: defined_term::= ( ($letter | $digit) #$identifier_character) & ( #identifier_character (letter|digit)) & correctly_spelled & defined. A defined_term does not start or end with an underscore - an identifier is made of letters, digits and underscore characters, and must always start and end with either a letter or digit. Notice that a defined_term's definition includes the item '...& defined' which indicates a special non-syntactic constraint - that a defined_term must be in the set of terms which are defined. identifier_character::=($letter | $digit | $underscore). A defined term is made of numbers, words and the underscore character: correctly_spelled::=#($word | $number | $underscore). Numbers can have one or more digits: number::= $digit # $digit. Words are at least two letters long and are correfctly spelled: word::=($letter $letter #$letter) & `in some dictionary`. . Lexemes It helps if you define the basic elements of your language (the lexemes) in a special section -- a Lexicon. It also helps to indicate them: .As_is name::lexeme="string", purpose. .As_is double_plus::lexeme="++", used to add 1 to a variable. or .As_is name::lexeme, purpose. .As_is typedef::lexeme, use to introduce type definitions in C++. In the second form the string is equal to the characters in the name: .As_is typedef::lexeme="typedef". Doing this lets the string that represents the lexeme be indexed and found by search engines. In practice many queries of language reference manuals on the web are trying to find the meaning a lexeme. . Lexicon The following terms are defined here for completeness and so you can use them in your own documents. Normally you can take them as given merely by including a link to this section of this web page: .As_is http://www.csci.csusb.edu/dick/maths/intro_ebnf.html#Lexicon .Box character::lexeme=`Any ASCII character`. char::lexeme=`any ASCII character`. digit::lexeme= "0".."9". capital_letter::lexeme="A".."Z". upper_case::lexeme=$capital_letter. lower_case::lexeme="a".."z". Note: MATHS/XBNF/EBNF is case sensitive, but is used to define case insensitive languages. The following define some maps that may help to define the syntax of case insensitive languages. to_upper::lower_case---upper_case. See table below... to_upper::char->char= to_upper|->Id. to_upper::#char->#char= ""+>"" |+> (first;to_upper|rest;to_upper). .Table c to_upper(c) .Row a A .Row b B .Row c C .Row ... ... .Close.Table to_lower::upper_case---lower_case= /to_lower. to_upper::char->char= to_lower|+>Id. to_upper::#char->#char= ""->"" |+> (first;to_lower|rest;to_lower). ignore_case::char->@char=map[c:char]( to_lower(c) | to_upper(c) ). ignore_case::#char->@#char= ""+>{""} |+> (first;ignore_case|rest;ignore_case). letter::lexeme=capital_letter | "a".."z". underscore::lexeme="_". sign::lexeme=$plus|$minus plus::lexeme= "+", minus::lexeme="-". comma::lexeme=",", semicolon::lexeme=";", left_bracket::lexeme="[", right_bracket::lexeme="]". quotes::lexeme="\'". space::lexeme=" ". non_quote::lexeme=char ~ quotes. l_paren::lexeme="(", r_paren::lexeme=")". .Close.Box If you need to define a language that uses the ASCII code then .See http://www/dick/samples/comp.text.ASCII.html gives definitions of characters (including the control characters) by name and purpose. . Precedence of Operators. The definitions of XBNF above .See A Meta-Grammar imply the following consequences and interpretations: Concatenation (space) has a higher precedence than a vertical bar ("|"). In an XBNF formula with both spaces and "|" symbols the bars separate the sequences. For example, If a, b, c and d are items, then a b | c d = (a b) | (c d) =`either a sequence of a and then b, end of sequence or a sequence of c and then d, end of sequence`. `a b | c d` is not equal to `a (b|c) d`. Indeed a (b|c) d = a b d | a c d = (a b d) | (a c d). The number sign("#") has lower precedence than a space and so always applies to the next item. Thus: file1::=header #data_record sentinel, indicates one header, then a number of data_record, and then one sentinel. It does not allow the header to re-occur. Neither can sentinel appear other than once in each file. However, if file2::=#(header data_record) sentinel, then there can be any number of header's but each header is followed by an data_record and there is still exactly one sentinel in an file. The next description file3::=#(header data_record sentinel). implies that headers and sentinels can occur many times - but always with an data_record in between them. Further in any file there will be the same number of sentinel's, data_record's and header's because the "#" does not permit a part of the repeated sequence to appear with out the rest of it also following. file4::=#(header #data_record sentinel). implies that headers and sentinel's can occur many times and with zero or more data_record's in between them. Similarly, If `a` and `b` are items then (`a` | #`b` ) indicates either an `a` or any number of `b` 's, as does (#`b` |`a` ). This, plus the rule relating spaces and the "|" symbol means that (`a` #`b` `c` | #`b` )=`either a single a followed by many b's and one c or else many b 's alone`. Parentheses (()) are put around selections within sequences, and iterations. (a | b) (c | d)= `a sequence of a or b, end of sequence and then a sequence of c or d, end of sequence`. Or informally `the first item is either an a or a b followed by either a c or a d.` Notice that this describes four alternative forms because: (set_theory)|- (a | b ) (c | d) = (a c | a d | b c | b d). #(a | b )=`any number of a's and b's in some order`. (similarly)|- #(a | b )= `empty` | a | b | a a | a b | b a | b b | a a a | a a b | ... So, `#(a|b)` is different from `#a | #b` because ()|- #a | #b = `empty` | a | a a | a a a| ... | b | b b | b b b | ... So, `#(a|b)` includes alternatives like `a b` and `a a b` and `b b a` that are not permitted by `#a | #b`. The following common form indicates a series of pairs: #(a b )=`any number of pairs. where each pair has one a followed by one b`. The `#(a b)` form implies that each `a` must be followed by a `b`. `#a #b` implies that all the `a`s are followed by all the `b`s. `#(a|b)` implies that there is a series of `a`s and `b`s in some order... a kind of muddled list. . Shorthand Idioms Certain patterns appear again and again in practice and XBNF defines shorthand "macro"s for these patterns: .Box For x, O(x) ::=(x |). O(_) means 0 or 1 occurrences of its argument. So for example: (O1)|- for all x, y, $O(x) y = (x y| y). N(X) ::= X | X $N(X), N(_) indicates 1 or more occurrence. L(X) ::=X | X $comma L(X), L(_) indicates a comma separated list. comma::=",". .Close.Box Indeed the "#" notation is in essence another abbreviation because of the following rules: |-(hashon): #(X) = O( N(X)). |-(nhash): N(X) = X #(X) = #(X) X. . Semantics Informally each defined term maps into a set of sequences. Thus it extends $set_theory and a theory of sequences. The structure of these sequences is determined by the definitions of the defined terms. This is easy to see until one gets doubts about definitions like this: train:: = engine carriage | train carriage. The above defines a $train in terms of the meaning of a train. The formal semantics tackles and resolves the problem of such recursive definitions see $grammar_theory below. . See Also grammar_theory::=http://www/dick/maths/intro_grammar.html. set_theory::=http://www/dick/maths/logic_30_Sets.html. similarly::=using the same reasoning as in the previous theorem. . Acknowledgment Thanks to Larry Evans who pointed out the broken and bogus links in these notes on Mon May 21. Also thanks to claudiu on Tue, 23 Oct 2001 who spotted errors and made wise suggestions. Problems that remain are dick botting's. .Close Syntax (XBNF)