Wed Oct 24 10:48:38 PDT 2012



    eXtensible Markup Language


      Somewhere [ acknowledgement1 ] between the complexity of SGML and the rigidity of HTML lies the eXtensible Markup Language (XML). XML lets you describe the structure of a document. In return, you must have a Document Type Definition (DTD) before you can process an XML document properly.

      XML documents are well_formed if they follow some simple rules that allow them to be parsed. These rules are outlined below. A well_formed document is also valid if it matches a Document Type Definition (DTD). The DTD has to be declared at the start of the document along with things like the XML version and the character encoding. There are many different DTDs for different purposes.
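
      As a minimal sketch of the well_formed test, the Python standard library parser can be asked to parse a document and report failure. The helper name is_well_formed and the sample strings are assumptions made for illustration. This parser only checks well-formedness; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


good = '<?xml version="1.0"?><novel><title>War and Peace</title></novel>'
unclosed = "<novel><title>War and Peace</novel>"      # <title> is never closed
two_roots = "<title>War</title><title>Peace</title>"  # more than one root element

print(is_well_formed(good), is_well_formed(unclosed), is_well_formed(two_roots))
```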

      XML documents can be processed if the processing is described in XSL. To display an XML document you need to supply some kind of mapping into a particular "style". Thus we now have style sheet languages: XSL, PSL, P, ... See the W3C information [ Style ] on style sheets and style sheet languages.

      The W3 Consortium supports the Web. A separate site, [ http://w3schools.com/ ], offers a family of tutorials for learning the technology; despite its name it is not part of the W3C.

      Examples of Document Types

    1. SimpleNovel::= See http://www.megginson.com/texts/darkness/novel.dtd.

      (MathML): Structure of mathematical formula. [ REC-MathML ] with syntax [ appendixE.html ] + XML dsssl stylesheets rtf tex jade [ mml-files ]

      (OpenMath): [ http://www.openmath.org/ ]

      (DITA): Darwin Information Typing Architecture (DITA XML) [ dita.html ]

      (CML): Chemical Markup Language -- [ http://www.xml-cml.org/ ]

      (W3D): Replacement for the Virtual Reality Modeling Language. [ Specifications ]

      (HRMML): XML based Human Resource Management Markup Language: [ main.html ]

      (DocBook): Structure of documentation for software documents. DocBook is (in 1999) actually an SGML based way to document software. See [ DocBook in comp.text.SGML ] for more information.

      (XMI): XMI::="Presents meta-data for modeling objects", by CORBA and the Object Management Group. The following needs an Id and password //ftp.pmg.org/pub/docs/ad/98-1005.pdf

      XMI is also integrated with the Unified Modeling Language 1.3 standard [ uml.html ]

      (CBL): XML Common Business Library [ cblfaq.html ]

      There are many more sample DTDs at [ resources.html?keys=*5266 ]

      XML examples

      [ http://www.xmltree.com/ ]

      (XMLRepository.com): [ http://xmlrepository.com/ ]

      XML Links

      XML information by Dick Baldwin [ http://xml.about.com/ ] About.com is the new name for the organization that was previously known as The Mining Company.

      (XML Query Language - Frequently Asked Questions): xml [ http://metalab.unc.edu/xql/ ]
      (Human Resources Markup Language): xml [ home.htm ]
      (MathML Files: DSSSL style sheet): xml [ mml-files ]
      (Cover pages documentation): xml [ xml.html ]
      (XML FAQ): xml [ http://www.ucc.ie/xml/ ]
      (XML-QL: A Query Language for XML): xml [ http://www.w3.org/TR/NOTE-xml-ql/ ]
      (XML and web services at DDJ): languages [ http://www.ddj.com/topics/xml/ ]

      Well-Formed Documents

      First, XML is like HTML; however, there are vital differences:
      1. None of the tags used in HTML are predefined in XML.
      2. You can add new tags to XML.
      3. XML is case sensitive.
      4. In XML, white space is significant.
      5. XML is not about layout and look-and-feel. It is about structure and meaning.
      6. There are five predefined entities: gt(>), lt(<), quot("), amp(&), apos(').
      7. End tags are never omitted: <t....> ... </t>
      8. There is a special kind of tag that does not enclose any content: <.../>
      9. Comments look like this: <!-- ..... -->
      10. Processing instructions can be embedded: <?....?>
      11. Attributes always have a name and a value, and the value is between matching double or single quotes: name="value" or name='value'.
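
      Rules 3, 6 and 11 can be tried out with the Python standard library. This is only a sketch: the sample text, the attribute name label and the element name note are invented for illustration. It escapes text with the predefined entities, lets quoteattr choose a quote character for an attribute value, and then parses the result back.

```python
from xml.sax.saxutils import escape, quoteattr, unescape
import xml.etree.ElementTree as ET

raw_text = "Tom & Jerry <1940>"
raw_attr = 'the "first" draft'

escaped = escape(raw_text)   # replaces &, < and > with predefined entities
attr = quoteattr(raw_attr)   # escapes the value and wraps it in suitable quotes

doc = f"<note label={attr}>{escaped}</note>"
elem = ET.fromstring(doc)

# Rule 3: a start tag and an end tag that differ only in case do not match.
try:
    ET.fromstring("<Title>War and Peace</title>")
    case_sensitive = False
except ET.ParseError:
    case_sensitive = True

print(doc)
print(elem.text == raw_text, elem.get("label") == raw_attr, case_sensitive)
```

      Because the attribute value contains double quotes, quoteattr wraps it in apostrophes, which the parser accepts just as readily.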

      Syntax of Well Formed Documents

      Here is a simple description of all documents that might be in XML -- ignoring the context dependencies:
    2. XMLBNF::=following,
        After a prolog, comes a single entity called the root, and then some miscellaneous stuff that is probably meaningless:
      1. document::= prolog root miscellaneous.

      2. prolog::=xml_type #comment dtd. A well formed document must start with a prolog that identifies the version of XML it uses. For example
         		 <?xml version="1.0"?>
        is the current version of xml. The prolog should also identify the character code - especially if you need to use any non-"ASCII" characters. It can also identify some namespaces:
      3. xml_namespace::lexeme= "xmlns".

      4. root::= tagged_element | empty_element.

      5. miscellaneous::= #(comment | processing | WS).
      6. WS::=white space.

      7. tagged_element::= "<" tag #attribute ">" content "</" tag ">", -- the tag at the start and end must be the same. To be valid the tag must be defined in a DTD and have attributes and content that match the rules in the DTD. A tagged element contains other data -- between the two tags.
         		<title>War and Peace</title>

      8. empty_element::="<" empty_tag #attribute "/>".
         		<timestamp date="1999/06/22" time="11:00"/>
      9. singleton::= empty_element.

      10. content::= #( parsed_data | element | comment ), the valid sequences of pieces in a content are described by a regular expression form in the DTD. An element is either a tagged element or an empty_element:
      11. (element)|-element==>tagged_element | empty_element.
      12. parsed_data::= #(char ~ ("<" | "&") | entity ).
      13. entity::= predefined_entity | defined_entity.
      14. predefined_entity::=gt | lt | quot | amp | apos,
        1. gt::="&gt;", stands for ">".
        2. lt::="&lt;", stands for "<".
        3. quot::="&quot;", stands for "\"".
        4. amp::="&amp;", stands for "&".
        5. apos::="&apos;", stands for "'".

      15. comment::= "<!--" ... "-->".
          	<!-- this is a comment -->
      16. attribute::= name "=" quoted_value.
      17. quoted_value::=quotes value quotes | apostrophe value apostrophe.
      18. quotes::="\"".
      19. apostrophe::="'".

      20. defined_entity::=defined in prolog.
      21. parsed_data::=defined in prolog.
      22. tag::=defined in prolog or namespace,
      23. |-tag ==> O( namespace ":") name.

      24. name::=defined in prolog.
      25. value::=defined in prolog.

      (End of Net XMLBNF)
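
      The grammar above can be turned into a rough recognizer. The following is a sketch under stated assumptions, not a real XML parser: the names TOKEN and check are invented here, attributes are skipped rather than parsed, entities are not checked, and DOCTYPE and CDATA sections are not handled. It does enforce two rules from the net: there is a single root, and the tag at the start and end of a tagged_element must be the same.

```python
import re

# One alternative per lexical form in the simplified grammar above.
TOKEN = re.compile(
    r"""
      (?P<comment><!--.*?-->)
    | (?P<processing><\?.*?\?>)
    | (?P<empty><(?P<empty_tag>[^\s<>/!?]+)[^<>]*/>)
    | (?P<end></(?P<end_tag>[^\s<>/]+)\s*>)
    | (?P<start><(?P<start_tag>[^\s<>/!?]+)[^<>]*>)
    | (?P<data>[^<]+)
    """,
    re.VERBOSE | re.DOTALL,
)


def check(text: str) -> bool:
    """Return True if `text` has exactly one root and all tags match up."""
    open_tags = []   # names of the tagged elements that are still open
    roots = 0        # number of top-level elements seen so far
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            return False                      # for example a stray "<"
        pos = m.end()
        if m["start"] is not None:
            if not open_tags:
                roots += 1
            open_tags.append(m["start_tag"])
        elif m["empty"] is not None:
            if not open_tags:
                roots += 1
        elif m["end"] is not None:
            # The end tag must name the most recently opened element.
            if not open_tags or open_tags.pop() != m["end_tag"]:
                return False
        elif m["data"] is not None:
            # Only white space may appear outside the root element.
            if not open_tags and m["data"].strip():
                return False
        # Comments and processing instructions are simply skipped.
    return not open_tags and roots == 1


sample = (
    '<?xml version="1.0"?>'
    "<novel><title>War and Peace</title>"
    '<timestamp date="1999/06/22" time="11:00"/></novel>'
)
print(check(sample), check("<title>War and Peace</Title>"), check("<a/><b/>"))
```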

      The actual rule for quoting is a little more complex in that the quote character can not appear inside the value:

    3. quoted_value::= |[ q:quotes|apostrophe ] (q #(char~q) q), the union, with q equal to quotes or apostrophe, of the strings that start and end with q and contain no q in between.
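
      The quoting rule can be sketched as a regular expression with a back-reference that plays the part of q. The names QUOTED_VALUE and unquote are assumptions made for this illustration. Full XML also forbids "<" and a bare "&" inside attribute values, which this sketch ignores.

```python
import re

# Group 1 captures q; the body may contain any character except q.
QUOTED_VALUE = re.compile(r"""(["'])((?:(?!\1).)*)\1""", re.DOTALL)


def unquote(text: str):
    """Return the value inside a quoted_value, or None if `text` is not one."""
    m = QUOTED_VALUE.fullmatch(text)
    return m.group(2) if m else None


print(unquote('"value"'))
print(unquote("'it is \"quoted\"'"))
print(unquote('"a"b"'))   # the quote character appears inside the value
```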


      To be valid the entities, tags and their attributes must match a set of rules given in a DTD.

      Suppose that we specify a DTD that has a set of normal tag names T and a set of content-free (empty element) tag names C. For each tag t:T|C we must have a set of attribute names N(t), and for each tag t:T|C, quote character q:quotes|apostrophe, and attribute n:N(t), we have a set of valid values V(t,n,q). If D is the raw data in our document, then define

    4. a(t,q)::= ![n:N(t), q](n= q V(t,n,q) q), a sequence of names with valid quoted values, and
    5. c(t, e)::=an expression describing the valid content of tag t in terms of elements e, and then an element of type t, is defined by
    6. e(t)::= ( "<" t a(t) ">" c(t, e) "</" t ">" ) | [t:C]( "<" t a(t) "/>" ), and an element is the union over all tags
    7. element::= D | |[t:T](e(t)). Note
      1. There is a trick above... the content expression c(t,e) depends on all the elements as a function associating tag names to elements of that type. It is probably best to think of this as an array or vector indexed by entity names. The resulting grammar is context dependent but can be formalized using only a small variation of context free grammars.

        The "data" (D above) can include elements that indicate some processing to be done to the data like this "<?.....?>".

      2. processing::= "<?" tag parameters "?>".
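
      As a small sketch, the standard library DOM exposes each processing instruction as a node with a target (the tag in the rule above) and a data string (the parameters). The xml-stylesheet instruction and the file name novel.xsl are made up for the example.

```python
from xml.dom import minidom, Node

source = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="novel.xsl"?>'
    "<novel><title>War and Peace</title></novel>"
)

doc = minidom.parseString(source)

# Collect (target, data) for every processing instruction before the root.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

      The XML declaration in the prolog looks like a processing instruction but is not reported as one.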

        It is possible to name things (like files of data or strings) and use the names in place of the things -- but the rules are a little convoluted.

      Document Type Declarations

      The dtd above is a document type declaration and has many forms. Here are some simple ones:
    8. dtd::= "<!DOCTYPE " WS name O(WS externalId) OWS O( localdtd ) ">".
    9. externalId::= ("PUBLIC" | "SYSTEM") WS string_identifying_a_dtd_file.
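
      A sketch of how a parser reports the pieces of a DOCTYPE, using the standard library DOM. The document type name novel and the file name novel.dtd are invented for the example. The parser records the identifier but does not fetch or apply the external file.

```python
from xml.dom import minidom

source = (
    '<!DOCTYPE novel SYSTEM "novel.dtd">'
    "<novel><title>War and Peace</title></novel>"
)

doc = minidom.parseString(source)
doctype = doc.doctype   # the node built from the <!DOCTYPE ...> declaration

print(doctype.name, doctype.systemId, doc.documentElement.tagName)
```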

    10. localdtd::= "[" #(markup_declaration| ... | WS) "]" OWS. Local DTDs are interpreted before external ones so that they can define terms used in the external ones. Unlike most other languages, the first definition of a markup overrides the later ones. Thus local DTDs both override and inform the external ones!
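
      The first-definition-wins rule can be observed with the Python standard library parser. This sketch declares the same entity twice inside a local DTD, because that parser does not load external DTD files by default; the entity name author and its values are invented.

```python
import xml.etree.ElementTree as ET

source = """<!DOCTYPE note [
  <!ENTITY author "first definition">
  <!ENTITY author "second definition">
]>
<note>&author;</note>"""

note = ET.fromstring(source)
print(note.text)   # the earlier declaration is the one that is used
```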

      The DOCTYPE defines the structure that the root element of the document must have for the document to be valid.

    11. markup_declaration::=element_declaration|entity_declaration|attribute_list_declaration | notation_declaration | process_indication | WS.

    12. element_declaration::="<!ELEMENT" element_name type_description ">".
    13. element_name::@name, the set of names occurring in element_declarations.

    14. attribute_list_declaration::="<!ATTLIST" element_name #attribute_declaration ">", attaches a set of attributes to the element named.
    15. attribute_declaration::=attribute_name attribute_type attribute_default.
    16. attribute_name::@name, the set of names appearing in attribute declarations.
    17. attribute_type::= "CDATA" | "ENTITY" | "NMTOKEN" | "NMTOKENS" | "ID" | "IDREF" | "IDREFS".
      type      Syntax of Attribute values   Semantics
      CDATA     CDATA_section                block of text
      ENTITY    TBA                          Name of data
      ID        identifier                   Can be used as an IDREF
      IDREF     identifier                   refers to another ID

    18. attribute_default::= required | implied | fixed | default_value.
    19. default_value::=literal data token.
    20. required::="#REQUIRED", implies that the element must specify a value and so no default is needed.
    21. implied::="#IMPLIED", no default is given and no value has to be given. Note however if the attribute name is mentioned it must be assigned a value.
    22. fixed::="#FIXED" default_value, meaning that the default is also the only value and so cannot be changed in any occurrence.
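
      The effect of attribute defaults can be sketched with the Python standard library parser, which applies defaults declared in a local DTD even though it does not validate. The element name book and the attribute names are invented. Since the parser does not validate, it would not report a missing #REQUIRED attribute either.

```python
import xml.etree.ElementTree as ET

source = """<!DOCTYPE book [
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book
      lang     CDATA #IMPLIED
      format   CDATA "paperback"
      currency CDATA #FIXED "USD">
]>
<book>War and Peace</book>"""

book = ET.fromstring(source)

# The element carries no attributes in the text, so everything here was
# supplied from the ATTLIST declaration; the #IMPLIED one stays absent.
print(sorted(book.attrib.items()))
```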

    23. entity_declaration::="<!ENTITY" O("%") entity_name entity_meaning ">".
    24. entity_name::@name, the set of names occurring in entity_declarations. These add new entities. An entity is an abbreviation. Some (with the '%') are to be used in DTDs and are expanded there. They are written as "%"entity_name";" and are replaced by the associated entity_meaning as the DTD is elaborated. Others (with no "%") are ready to be used in an actual XML document in the form "&"entity_name";".

    25. notation_declaration::="<!NOTATION" TBA ">".

    26. CDATA_section::= "<![CDATA[" TBA "]]>".

    27. pcdata::="#PCDATA", keyword indicating a block of parsed character data -- but no XML style marking up.
    28. identifier

      More TBA.

      Standards on the WWW

      W3C specifications [ REC-xml-19980210 ] and Tim Bray's Annotated Specification [ axml.html ]


    29. FOP::= See http://www.jtauber.com/fop/, XSL to PDF converter.
    30. XT::= See http://www.jclark.com/xml/xt.html, processes XSL transformations.


      1. IBM [ http://click.softwaredevelopment.email-publisher.com/maaac9gaaQhCea89bdEb/ ]
      2. Apache XML Project's Xerces Java [ http://click.softwaredevelopment.email-publisher.com/maaac9gaaQhCfa89bdEb/ ]
      3. James Clark's XP [ http://click.softwaredevelopment.email-publisher.com/maaac9gaaQhCga89bdEb/ ]
      4. Microstar's Aefred [ http://click.softwaredevelopment.email-publisher.com/maaac9gaaQhCha89bdEb/ ]
      5. Sun's Java API for XML [ http://click.softwaredevelopment.email-publisher.com/maaac9gaaQhCja89bdEb/ ]
      6. Oracle's XML parser [ http://click.softwaredevelopment.email-publisher.com/maaac9gaaQhCka89bdEb/ ]

      . . . . . . . . . ( end of section Parsers)


    31. namespace_rules::= See http://www.w3.org/TR/RECxml-names.

      Lars Marius Garshol <larsga@ifi.uio.no> wrote(comp.text.xml,13 May 1999) "The namespace URI does not point to anything meaningful, it's just a globally unique identifier. So your application will have to understand the DTD to make use of its elements. It would need that even if the URI did refer to a DTD. But at least they are now identified as being fitting elements, and your application can make a decision as to whether it should just ignore them or whether it should try to support them."
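
      The point of the quotation can be sketched with the Python standard library: the parser never fetches the namespace URI and simply uses it as part of each element's name. The URI http://example.org/library and the prefixes lib and b are invented for the example.

```python
import xml.etree.ElementTree as ET

source = (
    '<lib:novel xmlns:lib="http://example.org/library">'
    "<lib:title>War and Peace</lib:title>"
    "</lib:novel>"
)

root = ET.fromstring(source)

# Any prefix works when searching; only the URI has to agree.
ns = {"b": "http://example.org/library"}
title = root.find("b:title", ns)

print(root.tag, title.text)
```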


    32. API::="Application Programming Interface".

    33. DAML::=DARPA Agent Markup Language, [ http://www.daml.org/ ]

    34. DOM::="Document Object Model".
    35. DTD::="Document Type Definition", [ DTD in comp.text.SGML ]

    36. ebXML::=Electronic Business eXtensible Markup Language, [ http://www.ebxml.org/ ]

    37. HTML::markup_language= HTML_glossary & HTML_syntax.
    38. HTML_glossary::= See http://cse.csusb.edu/dick/samples/comp.html.glossary.html.
    39. HTML_syntax::= See http://cse.csusb.edu/dick/samples/comp.html.syntax.html.

    40. language::="a set of syntactic and semantic rules defining the correct form, structure, and meaning of strings of characters", the chief product of computer science research.

    41. ML::="in an acronym often indicates a markup_language" | "a programming language".
    42. markup_language::language="a language that describes how to mark up text to give it added meaning, richness, or layout and style".


    43. For x, O(x)::= an optional x.

    44. OReilly_Books::= See http://www.xml.com.

    45. P::stylesheet_language, [ Thot ] the Thot structured document language and the P stylesheet language.
    46. PSL::stylesheet_language, part of the Proteus library and style sheet library. [ ~multimedia ]

    47. RDF::="Resource Description Framework", [ http://www.w3.org/RDF/ ]

    48. RSS::="a lightweight multipurpose extensible metadata description and syndication format", "a Semantic Web vocabulary", [ http://web.resource.org/rss/1.0/ ]

    49. SAML::=Security Assertion Markup Language, [ tc_home.php?wg_abbrev=security ]

    50. SAX::="Simple API for XML".

    51. Schema::= See http://www.w3.org/XML/Schema, [/w3/org/TR/xmlschema-0]

    52. SensorML::=Sensor Modeling Language, [ http://vast.uah.edu/SensorML/ ]

    53. SGML::markup_language="Standard Generalized Markup Language", [ comp.text.SGML.html] . [ sgml.html ]

    54. stylesheet::="A description in a special stylesheet_language of the way a user or client wants some data interpreted and/or displayed".
    55. stylesheet_language::="A computer language defining how to specify the style for displaying or processing a document".

    56. SVG::=Scalable Vector Graphics, [ http://www.w3.org/TR/SVG/ ]

    57. SMIL::=Synchronized Multimedia Integration Language, [ AudioVideo] .

    58. SOAP::= See http://www.w3.org/TR/soap, [ REC-soap12-part0-20030624 ]

    59. TBA::="To Be Announced".

    60. VML::=Vector Markup Language, [ http://www.w3.org/TR/1998/NOTE-VML-19980513/ ]

    61. VoiceXML::= See http://www.w3.org/TR/2003/CR-voicexml20-20030220/ [ http://www.voicexml.org ]

    62. WSDL::=Web Services Description Language, [ default.asp] . [ wsdl ]

    63. XHTML::= See http://www.w3.org/TR/xhtml1, Extensible HTML -- a version of HTML that follows the rules of XML. [ http://www.xhtml.org/ ] [ html_xhtml.asp ] (thanks to Eric October 19th 2012 for this link).

    64. XML::markup_language="eXtensible Markup Language". [ http://www.w3.org/XML/ ]

      See the BNF syntax XMLBNF above or the W3C specs [ http://www.w3.org/TR/1998/REC-xml-19980210/ ] or Tim Bray's Annotated Specification [ axml.html ] or the Italian translation http://www.xml.it/REC-xml-19980210-it.html

      (OASIS): Organization for the Advancement of Structured Information Standards.

    65. XML.org::= See http://www.xml.org,

    66. XSL::stylesheet_language="eXtensible Stylesheet Language". [ http://www.w3.org/TR/REC-xml/ ]

    67. element::=an identifiable(and so tagged) piece of data.
    68. entity::=a string that symbolizes a character | something that contains data.

      See Also

      The Annotated XML Spec at [ axml.html ]

      Mapping runtime objects into XML formatted data [ XML_Serialization ] [ xtal.html ]

    . . . . . . . . . ( end of section eXtensible Markup Language)


    [ languages.html#YAML ] [ JSON.html ]


    Thanks to "Edward Szumski" <eszumski@csci.csusb.edu>


  1. Larry Evans <jcampbell3@prodigy.net> for correcting my many other errors.


  2. Thanks to Mark Doernhoefer for his excellent series of articles "Surfing the Net for Software Engineering Notes", and in particular for the URLS in SIGSOFT V29n3(May 2004)pp15-24.