Fri Jun 10 06:57:46 PDT 2011


    Syntax of the HyperText Markup Language - HTML


      HTML is the Markup Language used to describe pages on the World-Wide-Web. See [ www.html ] for tutorials and background information. Also see the "Bare Bones Guide to HTML" [ http://werbach.com/barebones/ ]

      HTML is designed to describe the logical structure of a large number of interlinked pages. It is a special document type described using the rules of SGML (Standard Generalized Markup Language). It has been updated several times. The January 2000 version is called XHTML: [ xhtml.html ] And we have now moved on to HTML4, HTML5, ...

      The W3 Consortium (W3C) supports the Web. The site [ http://w3schools.com/ ], which is independent of the W3C, provides a family of tools for learning the technology.

      This page defines the syntax of a useful subset of HTML, see [ Basic Ideas ]

      For information on SGML see

    1. SGML::=Standard Generalized Markup Language, [ sgml.html ] [ comp.text.SGML.html ]

      For the official definitions of HTML see the official defining documents for HTML2.0: [ html-spec_toc.html ]

      For a more up-to-date and complex definition in SGML see [ htmlpro.html ]

      For a UML model of an HTML document see [ HTML.png?root=atlantic-zoos ]

      For information about [ Semantic Markup ] (attaching meaning to parts of an HTML document by adding extra tags and/or attributes) see the following possibilities

    2. RDFa::= See http://en.wikipedia.org/wiki/Rdfa (W3C),
    3. Microdata::= See http://www.google.com/support/webmasters/bin/answer.py?answer=176035 (HTML5 microdata),
    4. SchemaDotOrg::= See http://schema.org/ (backed by Microsoft, Yahoo, Google for searching), etc.

      Basic Ideas

      There are five ideas that make up HTML
      1. SGML tags: <tag> .... </tag> [ Documents ]
      2. XML-style tags in XHTML: <tag/>
      3. SGML Elements for characters -- entity [ Lexicon ]
      4. (URLs): how://where/what... [ Universal Resource Locators ]
      5. (CGIs): Writing programs that produce pages when run [ Common Gateway Interface ]

      For help with the Three Letter Acronyms(TLAs) used in talking about HTML see
    5. glossary::= See http://cse.csusb.edu/dick/samples/comp.html.glossary.html


    6. For all X, O(X)::=optional X
    7. For all X, #(X)::=zero or more X


      1. HTML_control_char::=( lt | gt | semicolon | ampersand | quote ).

      2. normal_character::= char ~ HTML_control_char.
      3. char::= See http://cse.csusb.edu/dick/samples/comp.text.ASCII.html#char

      4. ampersand::="&".
      5. semicolon::=";".
      6. lt::="<".
      7. gt::=">".
      8. quote::= See http://cse.csusb.edu/dick/samples/comp.text.ASCII.html#quotes -- the double quote character of ASCII.

        SGML allows special symbols to be indicated in a form known as an entity. For example, in HTML the less_than character has a special use and so real less than signs are encoded like this: &lt;

        The ampersand and the semicolon bracket SGML & HTML entities:
      9. entity::= ampersand identifier semicolon | ampersand number semicolon. An entity allows a symbol to be described by an identifier rather than as itself. This has two purposes. First, it allows symbols used in HTML to appear in the rendered document. For example "&quot;" is written in HTML where you want a double quotation mark to appear. The second use of an entity is to express, in ASCII, characters that are not ASCII symbols. There are a small number of predefined HTML entities. See the ISO Latin 1 character set: [ SEC101 in html-spec_9 ] and the HTML Coded Character Set: [ SEC106 in html-spec_13 ]
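
        As a rough illustration of the two purposes of entities, here is a small Python sketch. It relies on the standard library's html module, which is not mentioned on this page, and the sample strings are made up for the example.

```python
import html

# First purpose: let HTML control characters appear in the rendered text.
raw = 'if a < b & c then say "hi"'
escaped = html.escape(raw)  # by default the double quote is encoded too
print(escaped)

# Second purpose: name non-ASCII characters using only ASCII.
# Named and numeric forms are both decoded.
decoded = html.unescape("caf&eacute; &#233; &lt;")
print(decoded)
```

        Note that real HTML writes numeric entities with a "#" after the ampersand, as in the second sample string.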

        The structure of an SGML/HTML document is described by inserting tags into the raw text - this is known as "Marking Up the text".

        The general syntax of a tag is:

      10. tag::=lt tag_identifier #attribute gt | lt "/" tag_identifier gt.

        So, tags take two forms - those that indicate the start of something, and those that indicate the end of something. Here is a typical pair:

                <strong> ... </strong>

        that indicate the start and end (respectively) of a piece of text that needs strong emphasis.


      11. For X:tag_identifier, start(X)::= lt X #attributes(X) gt.

        Each type of tag has its own attributes.... and the set of attributes for a given tag has varied with the edition of HTML and the browser. However they all have the same syntax:

      12. For X:tag_identifier, attributes(X)::attribute.
      13. attribute::= attribute_identifier O("=" attribute_value).
      14. tag_identifier::@identifier.
      15. attribute_identifier::@identifier.


      16. For X:tag_identifier, end(X)::= lt "/" X gt.

        I used to have

      17. identifier= letter #(letter|digit), but I knew it was a guess and asked for contributions. Ted Taylor (Nov/28/2010) observed
          The HTML4.1 standard does not define identifier directly but the implication is that it is the same as id and name.

          The reference in the standard [ type-name in types ] allows underscore after the initial letter (no surprise) as well as hyphens (slight surprise), colons and periods (bigger surprise).

          It appears that even this syntax does not constrain all use - initial underscore or hyphen is allowed but intended to be reserved (by convention). In some contexts where leading hyphen is allowed, two leading hyphens is disallowed.

          If you have more on this, I would be interested.

        I can add that the stuff about leading hyphens is a virus from the early C libraries.
      18. identifier::=letter #(letter | digit| "_" | "-" | colon | period).
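
        Rule 18 translates almost directly into a regular expression. The Python sketch below is only a reading of that rule: the name is_identifier is invented here, and "letter" is assumed to mean an ASCII letter.

```python
import re

# Rule 18: a letter followed by letters, digits, "_", "-", ":" or ".".
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_\-:.]*")

def is_identifier(candidate):
    """True when the whole string fits rule 18."""
    return IDENTIFIER.fullmatch(candidate) is not None

print(is_identifier("xml:lang"))  # colon after the first letter is allowed
print(is_identifier("-moz"))      # a leading hyphen is outside rule 18
```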

        Upper and lower case are ignored in tag and attribute identifiers but not in attribute values.

        An attribute value can be a string or an identifier:

      19. attribute_value::= identifier | quote #(char~quote) quote

      20. comment::=lt "!--" text that will not affect the rendered page "--" gt.

      21. html_input::lexical= #(comment | tag | entity | normal_character).
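
        The lexical rules above can be tried out with a small tokenizer. This Python sketch is an approximation under stated assumptions: the names TOKEN and lex are invented here, numeric entities are accepted with or without a "#", and an attribute value containing ">" would confuse it.

```python
import re

# Identifier as in rule 18 (ASCII letters assumed).
IDENT = r"[A-Za-z][A-Za-z0-9_\-:.]*"

# One alternative per lexical class in rule 21; comments are tried first
# so that "<!--" is not mistaken for the start of a tag.
TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<tag></?" + IDENT + r"(?:\s+[^<>]*)?>)"
    r"|(?P<entity>&(?:" + IDENT + r"|#?[0-9]+);)"
    r"|(?P<text>[^<&]+|[<&])",
    re.DOTALL,
)

def lex(html_input):
    """Split the input into (class, lexeme) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(html_input)]

for kind, lexeme in lex('<p class="x">a &lt; b<!-- note --></p>'):
    print(kind, repr(lexeme))
```

        A stray "<" or "&" that starts neither a tag nor an entity comes out as a one-character text token, which is roughly how forgiving browsers treat it.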


        Universal Resource Locators

          Universal Resource Locators (URLs) are attribute values that tell a browser where to find things on the Internet. There was a general introduction at http://www.ncsa.uiuc.edu/demoweb/url-primer.html but now this is a forbidden link. I must thank Erika Lynch for pointing this out and suggesting [ http://www.investintech.com/content/beginnersurl/ ] as an alternative pointer to resources.

          The following XBNF is an approximation to the standard defined at [ 5_BNF.html ]

          Notice that there is a special URL_encoding used to transmit symbols that have special meanings in the syntax below.

        1. URL::= protocol ":" O(where) what.
        2. where::=site O(port).
        3. what::=path O("/" O(file O("#" identifier | "?" query ))).
        4. path::=#("/"directory).

        5. query::= name_value_pair #( ampersand name_value_pair).
        6. name_value_pair::= name "=" value. The value in the URL can be any string because it uses URL_encoding.

        7. protocol::="http" | "ftp" | "mailto" | "telnet" | "file" | "gopher" | "news" |... .
        8. site::= "//" internet_address.
        9. port::=":" decimal_number.
        10. directory::=file_name.
        11. file::=file_name O("."file_type). File names can include periods.
        12. file_type::="html" | "gif" | "xbm" | "au" | "jpeg" | "mpeg" | "aiff" | "mov" |... Browsers often use the file_type (or extension or suffix) to determine what they should do with the resource. The protocol is tied in to the Multimedia EMail proposals (MIME) [ comp.mail.MIME.html ]
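
          To see how the parts named in this grammar line up with a real URL, here is a Python sketch using urllib.parse.urlsplit from the standard library. The function name describe and the example.com addresses are assumptions made for the illustration, and the split into path and file follows the XBNF above rather than any standard.

```python
from urllib.parse import urlsplit

def describe(url):
    """Label the pieces of a URL with the names used in the XBNF above."""
    parts = urlsplit(url)
    # Everything before the last "/" is the path, the rest is the file.
    directories, _, file = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "site": parts.hostname,
        "port": parts.port,          # None when no port is given
        "path": directories,
        "file": file,
        "query": parts.query,
        "identifier": parts.fragment,
    }

print(describe("http://www.example.com:8080/dir/sub/page.html#top"))
print(describe("http://www.example.com/cgi-bin/test?a=1&b=2"))
```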

          URL Encoding

          To ensure that URLs with strange characters are transmitted across the Internet correctly the characters in the URL are encoded as strings of one or three characters.

        13. URL_encoding::char-->#char=A special encoding of ASCII characters that uses plus in place of spaces and URL_hex_code in place of characters other than letters & digits.
        14. URL_encoded::="result of URL_encoding".

          URL_encoding= (letter|digit);Id | " "->"+"|->"%"hex(lower 8-bits of character code). [ comp.text.ASCII.html ]

          For example "Space Plus+" +> "Space+Plus%2B".

        15. Function_table::=following,
           Character	Encoded as
           letter	Id (unchanged)
           digit	Id (unchanged)
           " "	"+"
           other	"%" hex
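
          The mapping can be written out by hand in a few lines. The Python sketch below follows the rules above literally; the function name url_encode is invented here, and encoding non-ASCII characters as UTF-8 bytes is an assumption of the sketch. The standard library's urllib.parse.quote_plus does much the same job but also leaves "_", ".", "-" and "~" unchanged.

```python
from urllib.parse import quote_plus

def url_encode(text):
    """Letters and digits pass through, space becomes "+", the rest "%XX"."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)

print(url_encode("Space Plus+"))   # hand-written mapping
print(quote_plus("Space Plus+"))   # library mapping
```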

          Sadly, different browsers do URL-encoding differently! Some encode using the letters "a".."f" for hexadecimal and some use "A".."F". Worse, the implementation for spaces and plus-signs is FUBAR. Some quick tests show the following mappings when a test string "space plus+" is sent from a form by different browsers:

          1. space::=" ",
          2. plus::="+".
          3. (above)|-URL_encoding = space+>plus | plus+> "%2B" | ...
          4. MS_IExplorer4::= space+>plus | plus+>plus | ...
          5. lynx2.3::= space+>"%20" | plus+>plus | ...
          6. lynx2.8::= URL_encoding.
          7. Java URLencoder::= URL_encoding. [ java.net.URLEncoder.html ]
          8. Netscape::= URL_encoding.

          9. MS_IExplorer4 is a many-to-one mapping and so can not be inverted:
          10. (MS_IExplorer4)|-MS_IExplorer4 in #char(1..2)--(1)#char.

          (End of Net)
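
          The many-to-one problem is easy to reproduce. This Python sketch models only the two characters listed above (space and plus); the function names are invented for the example and everything hidden behind the "..." in the mappings is ignored.

```python
def url_encoding(text):
    """Mapping 3: plus becomes "%2B", then space becomes plus."""
    return text.replace("+", "%2B").replace(" ", "+")

def ms_iexplorer4(text):
    """Mapping 4: space becomes plus, plus is left alone."""
    return text.replace(" ", "+")

# Two different inputs collide under mapping 4, so it cannot be inverted.
print(ms_iexplorer4("a b"), ms_iexplorer4("a+b"))
# Under mapping 3 they stay distinct.
print(url_encoding("a b"), url_encoding("a+b"))
```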

            To see what your browser does with a simple form and/or CGI (and it is not a pretty sight) try my [ test.form.html ]

        16. URL_hex_code::= "%" hexadecimal_digit^2.

          There is a MIME format called "application/x-www-form-urlencoded" that is used in forms.

          There is a special Java class for handling the URL encoding: [ java.net.URLEncoder.html ]

          There is a local UNIX Shell script that will reverse URL_encoding at [ urlunencode ]
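
          For readers without that script, the reversal can also be sketched in Python with urllib.parse.unquote_plus from the standard library (not something this page relies on). It accepts upper and lower case hexadecimal digits alike, which sidesteps one of the browser differences noted above.

```python
from urllib.parse import unquote_plus

# "+" turns back into a space and "%XX" into the character it encodes.
print(unquote_plus("Space+Plus%2B"))
print(unquote_plus("Space+Plus%2b"))  # lower case hex decodes the same way
```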

        . . . . . . . . . ( end of section Universal Resource Locators)


      1. document::= O(start("HTML" )) O(header) body.
      2. header::= start("HEAD" ) #header_elements end("HEAD" )
      3. body::= start("BODY")untagged_body end("BODY" ) | untagged_body.
      4. untagged_body::= #( element | named(element) | hypertext_refed(element) ).

      5. attributes("BODY")::=often used to specify the background, and the color of text, links and so on.


        You can select a graphic to form a background to your page by
         		<BODY BACKGROUND="Graphic">
        Be careful to select something that lets the message on the page be read!


          The wise author either lets the browser or user select the default colors for the body, or specifies a complete set of attributes. Also the wise author is very careful to make sure that the colors chosen are readable! Notice that a significant number of people can not tell Red from Green and so these colors are problematic. Also notice that some browsers (on hand-held computers in particular) are in black and white... so choose a background color that is much darker or lighter than the colors for text, links etc.

          The values in the body specification use a form of hexadecimal coding:

        1. color_codes::= "#"red green blue. As the red, green, and blue numbers get smaller the color gets darker. So "#000000" indicates black and "#FFFFFF" indicates white. Example color codings are listed after rule 4.
        2. red::=hexadecimal_digit^2.
        3. blue::=hexadecimal_digit^2.
        4. green::=hexadecimal_digit^2.
             		#0000FF	Blue
             		#00FFFF	Cyan
             		#00FF00	Green
             		#FFFF00	Yellow
             		#FF0000	Red
             		#FF00FF	Purple
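
          A color code can be pulled apart into its three numbers with a few lines of Python. This is a sketch only; the name parse_color is invented for the example.

```python
import string

def parse_color(code):
    """Split "#RRGGBB" into (red, green, blue), each in the range 0..255."""
    digits = code[1:]
    if (len(code) != 7 or code[0] != "#"
            or any(d not in string.hexdigits for d in digits)):
        raise ValueError("expected '#' followed by six hexadecimal digits")
    # Each component is two hexadecimal digits, as in rules 2 to 4.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(parse_color("#00FFFF"))  # Cyan: no red, full green and blue
print(parse_color("#000000"))  # black: every component at its smallest
```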

          Here is a set of body attributes that should be specified and values that are close to the Netscape "classic":
             	Background	BGCOLOR=#B8B8B8	Grey
             	Normal Text	TEXT=#000000 	Black
             	Unused Link	LINK=#0000FF 	Blue
             	Visited Link	VLINK=#8000AF	Purple
             	Active Link	ALINK=#FF0000 	Red

        Some Elements

        1. element::=special_text | header | list | table | image | series_of_paragraphs | break | horizontal_rule | link | form | ...

          Here is a short page with a sample of lists and tables on it: [ ttt.html ]


        2. series_of_paragraphs::=paragraph #(start("P" ) paragraph)
        3. break::= start("br" ).
        4. horizontal_rule::=start("hr" )...


        5. named::=start("a name=" name ) (_) end("a" ). Note. The above is used like this: named(header). In this syntax the argument (header) replaces the (_) above.

        6. hypertext_refed::=start("a href=" quote URL quote ) (_) end("a" ). Again, hypertext_refed(X) means start("a href=" quote URL quote ) X end("a").


          This fails to express a complicated set of rules about what elements can, can not, or should not appear nested inside other elements. These are in the Document Type Definition (DTD) for HTML documents - written in the Standard Generalized Markup Language (SGML) and held at CERN (the European Organization for Nuclear Research).

        7. special_text::= |[ x:special_text_type] (start(x ) simpler_text end(x )).
        8. special_text_type::= "pre"|"listing"|"blockquote".

          Note. The above summarizes 3 different alternatives with a different x in each one.

        9. header::= |[i:"1".."6"] ( start("H" i ) text end("H" i ) ).
           		<H1>This is the most prominent header</H1>
        10. Note. The above describes the 6 levels of headers with "H1" being most prominent and "H6" least prominent. The actual and relative styles and sizes can not be specified but are chosen by the user and the browser.

        11. paragraph::= #(piece | text ),
        12. piece::= |[s:styles]( start(s ) text end(s ) ).
        13. styles::=logical_styles | physical_styles.

        14. logical_styles::= "em" | "strong" | "code" | "samp" | "kbd" | "var" | "dfn" | "cite" | "address",
        15. physical_styles::= "b" | "i" | "u" | "tt". Note. Physical styles are "deprecated".

          The browser and user determine the precise meaning of these styles with the following guidelines:

          Table of Styles

           Style	Meaning
           em	Emphasized - "notice me"
           strong	Emphasized even more
           code	This is a piece of computer code
           samp	This is sample output from a program
           kbd	This is text to be typed by the user
           var	This is a syntactic variable
           dfn	This is a definition
           cite	This is a citation of a source
           address	This is an address (Real or Email)
           b	looks bold (deprecated)
           i	looks italic (deprecated)
           u	looks underlined (deprecated)
           tt	looks like a typewriter (deprecated)

          The SGML specifies rules about what is recommended, normal and deprecated.

        16. text::=untagged_body & ( recommended_DTD_rules | DTD_rules | deprecated_DTD_rules),
        17. DTD::=Document Type Definition.


        18. image::=start("img"), -- note there is no need for an end img tag.
           		<img src="local.gif" alt="[description]">
        19. attributes("img")::= | following,
        20. alignment::="left" | "right" | "center" .

          The 'alt' attribute is what is shown to a browser that does not show the image - some browsers do not show graphics. Some users turn off the graphics to get to the information quicker!

          The ismap attribute indicates that parts of the graphic are hot buttons that act as links to other pages etc. Maps take time to construct without special purpose tools.

          Remember that each graphic takes time to transfer over the network. Animated graphics, in particular, are resource hogs. If you need to have a large and complex GIF then use an image processing tool to create a small thumbnail version. I've used "SnagIt", "the GIMP", "xv", etc. Then link the thumbnail to the full image:

           		<a href="bigfig.gif"><img src="thumbnail.gif" alt="Download a graphic!"></a>
          My home and personal pages have examples: [ index.html ] [ me.html ]


        21. list::=ordered_list |unordered_list | definition_list | menu |directory... .
        22. definition_list::= start("DL") #definition end("DL").
        23. attributes("DL")::= O("compact").
        24. definition::=start( "dt" ) term #(start("dd") text ) .
        25. term::=text.
        26. ordered_list::=start("OL") list_body end("OL" ).
        27. unordered_list::=start("UL") list_body end("UL" ).
        28. menu::=start("menu") list_body end("menu" ).
        29. directory::=start("dir") list_body end("dir" ).
        30. list_body::=#list_item.
        31. list_item::= start("li") text

          Lists are a simple and effective way to organize your pages. Notice that an item in a list can be further split into lines with <br>, paragraphs with <p> and pieces with <hr>. You can also have lists inside lists. So bulleted lists and outlines are easy in HTML.
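
          Lists inside lists are naturally produced by recursion. The Python sketch below turns a nested Python list into unordered list markup following the list_body and list_item rules; the function name unordered_list and the sample outline are made up, and the item text is assumed to be already safe to place in a page.

```python
def unordered_list(items, indent=0):
    """Return the lines of a <ul>, recursing into nested Python lists."""
    pad = "  " * indent
    lines = [pad + "<ul>"]
    for item in items:
        if isinstance(item, list):
            # A nested list follows the previous <li> and belongs to it.
            lines.extend(unordered_list(item, indent + 1))
        else:
            lines.append(pad + "  <li>" + item)
    lines.append(pad + "</ul>")
    return lines

print("\n".join(unordered_list(["Lexicon", ["entity", "tag"], "Documents"])))
```

          As in the grammar, no end tag is written for the list items.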


          A table is a two-dimensional grid. Each cell in the grid can be just about any HTML element. The browser has an interesting task of making sure that all columns are wide enough and rows tall enough for all the elements to fit.

          Tables are a standard part of HTML but some text based browsers may not support them.

        32. table::= start("table") #row end("table" ), Table tags can have a numeric BORDER attribute.
        33. row::=start("tr") #table_item O(end("tr" )),
        34. table_item::=table_header_item | table_normal_item.
        35. table_normal_item::=start("td") #element O(end("td" )).
        36. table_header_item::=start("th") #element O(end("th" )).
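
          As an illustration of these rules, the following Python sketch builds the markup for a small table with one header row. The function name table and its parameters are invented for the example; the optional end tags are written out, and cell text is escaped with html.escape from the standard library.

```python
from html import escape

def table(header, rows, border=1):
    """Build table markup: one row of <th> cells, then rows of <td> cells."""
    def row(cell_tag, cells):
        inner = "".join(
            "<{0}>{1}</{0}>".format(cell_tag, escape(str(cell))) for cell in cells
        )
        return "<tr>" + inner + "</tr>"

    lines = ['<table border="{}">'.format(border)]
    lines.append(row("th", header))
    lines.extend(row("td", cells) for cells in rows)
    lines.append("</table>")
    return "\n".join(lines)

print(table(["Style", "Meaning"], [["em", "Emphasized"], ["b", "looks bold"]]))
```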

        . . . . . . . . . ( end of section Some Elements)

        Common Attributes

      6. For s, numeric_attribute(s)::= O( s "=" number ).
      7. For s1, string_attribute(s1)::= O( s1 "=" string ) .
      8. For s1,s2, value_attribute(s1,s2)::= O( s1 "=" s2) .
      9. name_attribute::= value_attribute("NAME", name).

        HTML Forms

          HTML forms are a quick and easy way to gather information from a user and send it through a [ Common Gateway Interface ] into a program on a server. See [ HTML_quick.html ] for a quick introduction.
        1. form::= form_tag #( element | form_element ) "</FORM>".
        2. form_tag::= "<FORM " form_attributes ">".
        3. form_attributes::= name_attribute O(action_attribute) O(method_attribute).

        4. method_attribute::= "METHOD" "=" ("GET" | "POST"), use GET for small forms and POST for large ones.

        5. action_attribute::= "ACTION" "=" quote action quote.
        6. action::= URL, -- special semantics:
          • URL can use the HTTP protocol to call a CGI program on a server.
          • URL can use the MAILTO protocol to send the form as EMAIL.
          • URL can use any protocol to refer to a page on the WWW.
          • Can URL use TELNET?

          Form Elements

          1. form_element::= textarea | action_element | input_element | selection.

          2. textarea::="<TEXTAREA" textarea_attributes ">" ASCII_text "</TEXTAREA>", multiple line text box.
          3. textarea_attributes::=name_attribute numeric_attribute("ROWS") numeric_attribute("COLS") numeric_attribute("MAXLENGTH") O("WRAP").

          4. action_element::= |[t:action_type] (input(t)).
          5. action_type::= submit | reset | image.
          6. (above, MATHS)|-action_element = input(submit) | input(reset) | input(image).

            1. reset::=ignore_case("RESET"), appears as a button to be selected, resets all input elements to previous values.
            2. submit::=ignore_case("SUBMIT"), appears as a button to be selected, and transmits data in form according to the method_attribute and action_attribute.
            3. image::=ignore_case("IMAGE"), like a submit but includes the (x,y) coordinates of the click in the image in arguments name.x and name.y

          7. input_element::= | [t:input_type] (input(t)).
          8. input_type::= button | checkbox | hidden | password | radio | reset | submit | text.
            1. button::=ignore_case("BUTTON").
            2. checkbox::=ignore_case("CHECKBOX"), multiple boxes can be checked with same name and different values.
            3. hidden::=ignore_case("HIDDEN").
            4. password::=ignore_case("PASSWORD"), user input is possible and hidden on screen, but not encrypted.
            5. radio::=ignore_case("RADIO"), only one of each set can be selected.
            6. text::=ignore_case("TEXT"), one line text box.

          9. For t:input_type, input(t)::= "<INPUT" "TYPE" "=" t name_attribute input_attributes(t) ">".

          10. input_attributes::input_type -> input_attributes = following

          11. selection::= "<" select select_attributes ">" #option "</SELECT>".
          12. select_attributes::=name_attribute numeric_attribute("SIZE") O("MULTIPLE"). Normally SELECT generates a "pop-up" menu listing the options and letting the user select a single item from the list. MULTIPLE plus a SIZE signals a browser to offer the user a scrollable list of options and allow them to check off any number of them. This generates a comma separated list of URL-encoded options.
          13. select::=ignore_case("SELECT"), allows one from a menu of options.
          14. option::= "<OPTION" string_attribute("VALUE") O("SELECTED") ">" string. The VALUE is returned in place of the following string if it exists.
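
          To check which form elements a page actually contains, the standard library's html.parser can walk the tags. The Python sketch below is only an illustration: the class name FormFields and the sample form (including its example.com action) are made up here.

```python
from html.parser import HTMLParser

class FormFields(HTMLParser):
    """Collect (tag, type, name) for each INPUT, SELECT and TEXTAREA."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; values keep their case.
        if tag in ("input", "select", "textarea"):
            attributes = dict(attrs)
            kind = (attributes.get("type") or "").lower()
            self.fields.append((tag, kind, attributes.get("name")))

SAMPLE = """<FORM ACTION="http://www.example.com/cgi-bin/test" METHOD="GET">
<INPUT TYPE="TEXT" NAME="who">
<SELECT NAME="color" SIZE=3 MULTIPLE><OPTION VALUE="r" SELECTED>Red<OPTION>Blue</SELECT>
<TEXTAREA NAME="notes" ROWS=2 COLS=20></TEXTAREA>
<INPUT TYPE="SUBMIT" NAME="go">
</FORM>"""

parser = FormFields()
parser.feed(SAMPLE)
print(parser.fields)
```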

          . . . . . . . . . ( end of section Form Elements)

          What is Sent by a Form

          If the form's action is "mailto:" or a call to a CGI then a URL encoded string is sent called the query:
        7. query::= pair #( "&" pair).
        8. pair::= name"="value. The name and value come from the values selected when a Submit is selected:
          • Checkbox and radio: name and value attribute(s) selected.
          • Select: the name of the selection and a comma separated list of selected OPTIONs.
          • Text and textarea: the name and the content.
          • Image: coordinates clicked in image.
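
          On the receiving side the query has to be split back into pairs and decoded. The Python sketch below follows rules 7 and 8 directly; the function name parse_query and the sample query string are invented for the example. The standard library's urllib.parse.parse_qsl gives the same result for this input.

```python
from urllib.parse import parse_qsl, unquote_plus

def parse_query(query):
    """Split on "&", then on the first "=", then undo the URL encoding."""
    pairs = []
    for pair in query.split("&"):
        name, _, value = pair.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs

sample = "who=A+Student&size=large&notes=a%26b%3Dc"
print(parse_query(sample))
print(parse_qsl(sample))
```

          Decoding after splitting matters: an encoded "&" or "=" inside a value must not be mistaken for a separator.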

        . . . . . . . . . ( end of section HTML forms)

        Common Gateway Interface

        1. (glossary)|-
        2. CGI::=Common Gateway Interface. The CGI rules define how data is given to a program on the server. The program then runs and generates a page that is returned to the browser, using a standard format. [ CGI in www ]

          This program can be written in any language - but many prefer to use a language called PERL. Personally I use UNIX shell scripts. To interface a MIS database to the Web.... you could use COBOL. Almost certainly Java is going to be another popular way to write CGIs.
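
          To make the idea concrete, here is a minimal sketch of such a program in Python (a language not discussed on this page). It assumes the form was sent with the GET method, so the server passes the query in the QUERY_STRING environment variable; the function name respond and the page it produces are invented for the example.

```python
import os
from html import escape
from urllib.parse import parse_qsl

def respond(environ):
    """Build a CGI response that echoes the submitted name/value pairs."""
    fields = parse_qsl(environ.get("QUERY_STRING", ""))
    items = "".join(
        "<li>{} = {}".format(escape(name), escape(value)) for name, value in fields
    )
    page = (
        "<html><head><title>Echo</title></head>"
        "<body><ul>{}</ul></body></html>".format(items)
    )
    # A header, then a blank line, then the page itself.
    return "Content-Type: text/html\n\n" + page

if __name__ == "__main__":
    print(respond(os.environ), end="")
```

          Escaping the values before echoing them keeps a user from injecting tags into the generated page.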

          The following program [ unpost.c ] is useful for converting CGI posted input into normal but URL-encoded standard input ready for a UNIX shell script or program to handle. [ URL Encoding ]

        . . . . . . . . . ( end of section Common Gateway Interface)


      10. FUBAR::="Fouled Up Beyond All Recognition", an extreme form of SNAFU.
      11. SNAFU::="Situation Normal -- All Fouled Up", an acronym used in the US Army in the Second World War.
      12. deprecated::=they don't like it because it is physical not logical. (HTML is not for word processing!)