HTML is designed to describe the logical structure of a large number of interlinked pages. It is a special document type described using the rules of SGML (Standard Generalized Markup Language). It has been updated several times. XHTML, a reformulation of HTML4 in XML, dates from January 2000: [ xhtml.html ] The standard has since moved on to HTML5.
The W3 Consortium (W3C) supports the Web and maintains its standards. The independent site [ http://w3schools.com/ ], which is not run by the W3C, provides a family of tools for learning the technology.
This page defines the syntax of a useful subset of HTML; see [ Basic Ideas ]
For information on SGML see
For the official definition of HTML2.0 see the defining documents: [ html-spec_toc.html ]
For a more up-to-date and complex definition in SGML see [ htmlpro.html ]
For a UML model of an HTML document see [ HTML.png?root=atlantic-zoos ]
For information about [ Semantic Markup ] (attaching meaning to parts of an HTML document by adding extra tags and/or attributes) see the following possibilities
SGML allows special symbols to be indicated in a form known as an entity. For example, in HTML the less_than character has a special use, and so real less-than signs are encoded like this:
&lt;
The ampersand and the semicolon bracket SGML and HTML entities.
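As a rough illustration of entity encoding, here is a minimal Python sketch. It uses the standard library's `html.escape` and `html.unescape`; the sample string is made up for this note and is not part of the original page.

```python
import html

# A made-up sample containing the characters that HTML treats specially.
raw = "if a < b & b > c"

# Replace &, < and > with the entities &amp; &lt; &gt;
# (quote=False leaves quotation marks alone).
encoded = html.escape(raw, quote=False)

# Turn the entities back into the original characters.
decoded = html.unescape(encoded)

print(encoded)
print(decoded)
```

Each entity starts with an ampersand and ends with a semicolon, which is why a real ampersand must itself be encoded.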
The structure of an SGML/HTML document is described by inserting tags into the raw text - this is known as "Marking Up the text".
The general syntax of a tag is:
So, tags take two forms: those that indicate the start of something, and those that indicate the end of something. Here is a typical pair:
<strong> ... </strong>
These indicate the start and end (respectively) of a piece of text that needs strong emphasis.
Each type of tag has its own attributes, and the set of attributes for a given tag has varied with the edition of HTML and the browser. However, they all have the same syntax:
I used to have
The reference in the standard [ type-name in types ] allows underscore after the initial letter (no surprise) as well as hyphens (slight surprise), colons and periods (bigger surprise).
It appears that even this syntax does not constrain all use - an initial underscore or hyphen is allowed but intended to be reserved (by convention). In some contexts where a leading hyphen is allowed, two leading hyphens are disallowed.
If you have more on this, I would be interested.
Upper and lower case are ignored in tag and attribute identifiers but not in attribute values.
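One way to see the case rule in action is a small Python sketch using the standard library's `html.parser.HTMLParser`, which reports tag and attribute names in lower case but leaves attribute values untouched. The class name `TagLogger` and the sample markup are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Collects (tag, attributes) pairs for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lower-cased the tag and attribute names here.
        self.seen.append((tag, attrs))


parser = TagLogger()
# The same tag written twice with different capitalization.
parser.feed('<BODY BgColor="#B8B8B8"><body bgcolor="#b8b8b8">')
parser.close()

for tag, attrs in parser.seen:
    print(tag, attrs)
```

Both tags come out as `body` with a `bgcolor` attribute, but the two values keep their original capitalization.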
An attribute value can be a string or an identifier:
The following XBNF is an approximation to the standard defined at [ 5_BNF.html ]
Notice that there is a special URL_encoding used to transmit symbols that have special meanings in the syntax below.
URL_encoding= (letter|digit);Id | " "+>"+" | +>"%"hex(lower 16-bits of character code). [ comp.text.ASCII.html ]
For example "Space Plus+" +> "Space+Plus%2B".
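The example above can be checked with a short Python sketch. It assumes the standard library function `urllib.parse.quote_plus` as the encoder, which is one implementation among many and not the one this page was written against.

```python
from urllib.parse import quote_plus, unquote_plus

original = "Space Plus+"

# Letters and digits pass through, the space becomes "+",
# and the literal plus sign becomes "%" followed by its hex code.
encoded = quote_plus(original)

# Reversing the encoding recovers the original text.
round_trip = unquote_plus(encoded)

print(encoded)
print(round_trip)
```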
Sadly, different browsers do URL-encoding differently! Some encode using the letters "a".."f" for hexadecimal and some use "A".."F". Worse, the implementation for spaces and plus-signs is FUBAR. Some quick tests show the following mappings when a test string "space plus+" is sent from a form by different browsers.
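A decoder therefore has to be tolerant. The Python sketch below feeds a few plausible encodings of "space plus+" to the standard library's `urllib.parse.unquote_plus`. The three variants are illustrative assumptions and are not the results of the browser tests mentioned above.

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical encodings of the same test string: upper-case hex,
# lower-case hex, and a space sent as %20 instead of "+".
variants = ["space+plus%2B", "space+plus%2b", "space%20plus%2B"]

# unquote_plus accepts either hex case and maps "+" back to a space.
decoded = [unquote_plus(v) for v in variants]

# Plain unquote does not treat "+" as a space, so the encoded space and
# the real plus sign can no longer be told apart.
ambiguous = unquote("space+plus%2B")

print(decoded)
print(ambiguous)
```

The last line shows why the treatment of spaces and plus signs matters: a decoder that skips the "+" rule silently returns different text.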
There is a MIME format called "application/x-www-form-urlencoded" that is used in forms.
There is a special Java class for handling the URL encoding: [ java.net.URLEncoder.html ]
There is a local UNIX Shell script that will reverse URL_encoding at [ urlunencode ]
. . . . . . . . . ( end of section Universal Resource Locators)
<BODY BACKGROUND="Graphic">
Be careful to select something that lets the message on the page be read!
The color values in the body specification use a form of hexadecimal coding:
Purpose        Attribute          Color
Background     BGCOLOR=#B8B8B8    Grey
Normal Text    TEXT=#000000       Black
Unused Link    LINK=#0000FF       Blue
Visited Link   VLINK=#8000AF      Purple
Active Link    ALINK=#FF0000      Red
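To make the hexadecimal coding concrete: each pair of hex digits gives the amount of red, green and blue on a scale from 0 to 255. The helper `parse_hex_color` below is a hypothetical name used for this sketch only.

```python
def parse_hex_color(value):
    """Convert '#RRGGBB' into a (red, green, blue) tuple of integers 0..255."""
    digits = value.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected six hex digits, got {value!r}")
    # Take the digits two at a time and read each pair as a base-16 number.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


grey = parse_hex_color("#B8B8B8")
purple = parse_hex_color("#8000AF")
blue = parse_hex_color("#0000FF")

print(grey, purple, blue)
```

Equal amounts of all three components give a grey, which is why the background value repeats the same pair three times.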
Here is a short page with a sample of lists and tables on it: [ ttt.html ]
Note. The above summarizes 3 different alternatives, with a different x in each one.
<H1>This is the most prominent header</H1>
The browser and user determine the precise meaning of these styles with the following guidelines:
em        Emphasized - "notice me"
strong    Emphasized even more
code      This is a piece of computer code
samp      This is sample output from a program
kbd       This is the name of a key on the keyboard
var       This is a syntactic variable
dfn       This is a definition
cite      This is a citation of a source
address   This is an address (Real or Email)
b         looks bold (deprecated)
i         looks italic (deprecated)
u         looks underlined (deprecated)
tt        looks like a typewriter (deprecated)
The SGML definition of HTML specifies which of these are recommended, normal, or deprecated.
<img src="local.gif" alt="[description]">
The 'alt' attribute is what is shown by a browser that does not show the image - some browsers do not show graphics. Some users turn off the graphics to get to the information quicker!
The ismap attribute indicates that parts of the graphic are hot buttons that act as links to other pages etc. Maps take time to construct without special purpose tools.
Remember that each graphic takes time to transfer over the network. Animated graphics, in particular, are resource hogs. If you need to have a large and complex GIF then use an image processing tool to create a small thumbnail version. I've used "SnagIt", "the GIMP", "xv", etc. Then link the thumbnail to the full image:
<a href="bigfig.gif"><img src="thumbnail.gif" alt="Download a graphic!"></a>
My home and personal pages have examples: [ index.html ] [ me.html ]
Lists are a simple and effective way to organize your pages. Notice that an item in a list can be further split into lines with <br>, into paragraphs with <p>, and into pieces with <hr>. You can also have lists inside lists. So bulleted lists and outlines are easy in HTML.
Tables are a standard part of HTML, but some text-based browsers may not support them.
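As a sketch of lists inside lists, the Python function below turns a small outline into nested `<ul>` markup, placing each inner list inside the item it belongs to. The function name `render_list`, the (text, children) data layout and the sample outline are all assumptions made for this illustration.

```python
from html import escape


def render_list(items, indent=0):
    """Render (text, children) pairs as the lines of a nested <ul> outline."""
    pad = "  " * indent
    lines = [pad + "<ul>"]
    for text, children in items:
        if children:
            # An inner list lives inside the <li> of its parent item.
            lines.append(f"{pad}  <li>{escape(text)}")
            lines.extend(render_list(children, indent + 2))
            lines.append(f"{pad}  </li>")
        else:
            lines.append(f"{pad}  <li>{escape(text)}</li>")
    lines.append(pad + "</ul>")
    return lines


# A made-up two-level outline.
outline = [
    ("Markup", [("Tags", []), ("Entities & symbols", [])]),
    ("Forms", []),
]
html_text = "\n".join(render_list(outline))
print(html_text)
```

Swapping `<ul>` for `<ol>` in the sketch would give a numbered outline instead of a bulleted one.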
. . . . . . . . . ( end of section Some Elements)
. . . . . . . . . ( end of section Form Elements)
. . . . . . . . . ( end of section HTML forms)
A CGI program can be written in any language, but many prefer to use a language called Perl. Personally I use UNIX shell scripts. To interface an MIS database to the Web you could use COBOL. Java is almost certainly going to be another popular way to write CGIs.
The following program [ unpost.c ] is useful for converting CGI posted input into normal but URL-encoded standard input ready for a UNIX shell script or program to handle. [ URL Encoding ]
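For comparison, here is a minimal Python sketch of what a CGI program has to do with posted form data once it has the URL-encoded text in hand. This is not a translation of unpost.c. The function name `decode_form` and the sample input are hypothetical, and the sketch takes the posted text as a string rather than reading it from the server.

```python
from urllib.parse import parse_qs


def decode_form(body):
    """Split URL-encoded form data into a dict mapping names to lists of values."""
    # keep_blank_values=True keeps fields that were submitted empty.
    return parse_qs(body, keep_blank_values=True)


# In a real CGI program the body of a POST arrives on standard input,
# with its length given by the CONTENT_LENGTH environment variable.
# A fixed, made-up string stands in for that input here.
posted = "name=Space+Plus%2B&lang=shell&lang=perl&note="
fields = decode_form(posted)

for name, values in fields.items():
    print(name, values)
```

Each name maps to a list because a form may send the same field more than once, for example from a group of checkboxes.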
. . . . . . . . . ( end of section Common Gateway Interface)