[Skip Navigation] [CSUSB] / [CNS] / [Comp Sci Dept] / [R J Botting] / [Samples] / regular_expressions
[Index] [Contents] [Source Text] [About] [Notation] [Copyright] [Comment/Contact] [Search ]
Tue Sep 18 15:26:52 PDT 2007

Contents


    Definition of the form and meaning of Regular Expressions

      Introduction

      This is a sample definition of the way that regular expressions are used in the UNIX awk command to define a pattern to be searched for. This is used in my Common Gateway Interface lookup to select bibliographic entries [ lookup.html ]

      Method

      This piece of documentation defines the syntax and semantics of such regular expressions. The syntax is written in XBNF (stretched Backus-Naur Form). XBNF is a general version of EBNF that allows alternative definitions to be written. This allows the formal definition of several varieties of UNIX regular expression in one document:
    1. program::{ed, grep, egrep, sed, ex, vi, awk, ...}.

      The semantics by a function that maps each syntactic item into a set of matching strings.

    2. syntax::=@#char, sets of ASCII character strings in a pattern.
    3. semantics::=@#char,st of ASCII character strings.
    4. m::=matches, an abbreviation for
    5. matches::=syntax->semantics. This map is defined piecemeal in the following.

      Selected lines

    6. Line::=#char.

      A given line either matches or does not match a pattern... the set of matching Lines is written M: M:: pattern->@Line. The lines that match a given pattern

      An item is selected by a pattern if any of its lines matches the pattern:

    7. For Item I, pattern p, selected(I,p)::= for some L:I(L in M(p) ).

      Patterns

      A pattern can specify where a given regular expression is searched for in a line: at the start, at the end, any where, or filling the whole line.
    8. pattern::=caret re | re dollar | re | caret re dollar.

    9. For re r, M("^" r "$")= m(r).
    10. For re r, M("^" r )= m(r) ! any.
    11. For re r, M( r "$")= any ! m(r).
    12. For re r, M( r )= any ! m(r) ! any.

    13. Where (!) = "followed by" or concatenation.

    14. any::=#char, -- any string of ASCII characters.

      Note. Anchored patterns (with "^" and "$") may not be useful for looking things up in this web site.

      Regular Expressions

    15. regular_expression::syntax,
    16. re::abbreviation=regular_expression,
    17. If program=awk or egrep then re=alternative | alternative bar regular_expression
    18. else re=alternative.
    19. For alternative a, regular_expression r, m( a bar r) = m(a) | m(r).

    20. alternative::= # element. An empty pattern matches anything and so is deprecated.
    21. For element e1,e2,e3,..., m(e1 e2 e3 ...) = m(e1) ! m(e2) ! m(e3) ! ...
    22. where (!) = "followed by" or concatenation.

    23. element::= single_piece | option | repeated | iteration | word.

      Words can be recognized in the two UNIX editirs: ex and vi.

    24. word::syntax,
    25. if program=ex or vi then word = "\\<" element "\\>" else word=no.

      Options are only available in awk and egrep:

    26. option::syntax.
    27. If program=awk or egrep then option =single_piece query else option=no.
    28. For single_piece c, m( c query)= (empty | m(c)), optional.

    29. repeated::syntax. If program=awk or egrep then repeated::=single_piece plus else repeated=no. For some programs repeated = single_piece "\\{" O(number) O(, O(number)) "\\}".

    30. m( c plus)= m(c) ! #(m(c)). -- one or more
    31. m( c "\\{" n "," m "\\}")= between n and m copies inclusive.

    32. iteration::=single_piece star, -- any number including none.
    33. m( c star)= #(m(c)).

    34. single_piece::= single_char | set_of_single_chars | sub_expression.

    35. subexpression::syntax,
    36. If program=awk or egrep then subexpression = left pattern right else subexpression = no.
    37. m( left p right) = m(p).

    38. set_of_single_chars::= dot | lbracket string rbracket.
    39. m(dot)=char, -- the wild-card symbol is a period.
    40. For string s, m("[" s "]")= m2(s).

      Sets of single characters

      Strings of form "[" s1 s2 s3 ... "]" have a special syntax:
    41. string::=normal | complemented.
    42. complemented::=caret normal
    43. normal::= range_or_single | range_or_single normal.
    44. range_or_normal::=range | char~brackets.
    45. brackets::=lbracket | rbracket.

      Their meaning can be described by the functions m2 and m3 that interpret a string of characters(#char) as a set of single character strings (@#char) as follows:

    46. m2::string->@#$char. -- handles complemented forms.
    47. m3::normal->@#$char. -- handles normal forms only.

      For normal s, char c,c1,c2.

    48. m2("^" s )= char ~ m3(s). -- complement, but not,...
    49. m2( s )= m3(s).
    50. m3(c1 "-" c2 ) = set{ c:char. c1<=c<=c2 }.
    51. m3(c1 "-" c2 s ) = set{ c:char. c1<=c<=c2 } | m3( s ).
    52. m3(c )= {c}.
    53. m3(c s )= {c} | m3(s).

      Ommission

      I haven't documented the rules for putting brackets inside sets of single characters...

      Lexemes

    54. char::Sets=the set of ASCII characters.
    55. magic::=caret | dollar | plus | star | query | bar | left | right | dot | lbracket | rbracket | backslash.
    56. |-magic ==> char.
    57. non_magic::=char ~ magic.
    58. for non_magic c, m(c) = {c}.
    59. escaped_char::= backslash char.
    60. For char c, backslash b, m(b c) = {c}.
    61. single_char::=non_magic | escaped_char. This automatically defines matches (m) for any single_char.

    62. caret::="^". Indicates pattern occurs at start of line
    63. dollar::="$". Indicates pattern occurs at end of line.
    64. plus::="+". Indicates one or more of the previous.
    65. star::="*". Indicates zero or more of the previous.
    66. query::="?". follows an option piece
    67. bar::="|". separates alternative forms.
    68. left::="(". starts a subexpressions.
    69. right::=")", ends a subexpression.
    70. dot::=".". indicates a wild card character.
    71. lbracket::="[". starts a set of alternate characters.
    72. rbracket::="]". ends a set of alternate characters.
    73. backslash::="\\". used to indicate the character itself rather than its meaning.

      Note. It is probably not necessary to escape caret and dollar except at the start (respectively end) of a regular expression since they only have a magic meaning in those positions.

      Glossary


      (concatenation):
    74. For a,b:#char, a ! b::= (a1, a2, a3, ..., b1,b2,b3, ...).
    75. For A,B:@#char, A ! B::= set{ s:#char. for some a:A, b:B( s=a!b) }.

    76. For A, #A::= the smallest(X. empty in X and X!A in X).

    77. img(s)::set of characters listed in string s,
    78. img(s)::=set( s[i]. i:1..len(s) )

    79. empty::@#char= {""}, -- the set containg an empty or null string.
    80. no::@#char={}, -- the empty or null set. In an alternative shows that something is note allowed.

    81. magic::UNIX.jargon. Symbols meaning something other than themselves. Terminology dating back to the vi 'nomagic' option (I guess).

      set{ v:S. P} ::= the set of c in S satisfying P.

    . . . . . . . . . ( end of section Definition of the form and meaning of Regular Expressions) <<Contents | End>>

End