This document tries to define the technical terms it uses or to provide links to definitions. If you find terms which are unknown to you and not defined here, please consult eg the Terms section of the HTML 2.0 specification or some of the general Internet glossaries. (The most authoritative Internet glossary is probably RFC 1983.)
People who have heard about HTML 3.0 should notice that HTML 3.2 is not an extension or a variant of HTML 3.0, which has now been withdrawn. (The version numbers 3.0 and 3.2 are misleading!) More exactly, HTML 3.2 contains
For a good summary of the new features in HTML 3.2 as compared with HTML 2.0, consult the article What's New in HTML 3.2 in the World Wide Web Journal, but please notice that it contains a few mistakes.
HTML 3.2 has been defined by the World Wide Web Consortium. It is supported by several browsers to a large extent, and it will probably become the common basis understood by almost all relevant Web software. The next version, an extension to HTML 3.2, is being developed under the code name Cougar.
An older standard, HTML 2.0, is supported to an even larger extent, since HTML 3.2 is an extension of HTML 2.0.
However, to be exact, the following HTML 2.0 features have been removed in HTML 3.2:
This document does not discuss general issues of Web authoring, such as overall design of documents and document collections. As regards them, see my list of suggested reading.
In addition to such issues, you need to know where to put your HTML document to make it accessible to the world; this may involve things like setting up directory and file protections suitably. Please consult your local Web support for information relevant at your site.
This document concentrates on basic HTML usage. In particular, this document does not give realistic examples about applets or image maps. (The main reason for this is that the author felt that a basic document was urgently needed, and providing good examples about such complicated and somewhat controversial issues would have taken too much time.)
For printing on paper, you may wish to use the PostScript version (generated from the HTML version with Netscape), which also exists in a much smaller form, as compressed (with the Unix compress utility).
In general, you should be able to read this document on any decent WWW browser. However, tables (TABLE elements) have been used in this document, mainly in the description of attributes, since they are essentially tabular information best presented so. Unfortunately this means that parts of this document are almost illegible when viewed with browsers which cannot present tables (eg most versions of Lynx).
The author hereby gives general permission to copy and distribute this document or parts thereof in any medium, provided that all copies contain, in a manner appropriate for the medium, an acknowledgement of authorship and the URL of the original document, ie http://www.hut.fi/%7ejkorpela/HTML3.2/
The permission granted above does not imply permission to distribute this document in a modified form or as a translation. Please contact the author to discuss the conditions for such actions.
Explanation: The author wishes to preserve the integrity of the document. This includes specifying the context when distributing or using excerpts and informing the reader about the availability of the entire document in its most up-to-date form.
Please notice that most introductory texts on HTML do not present the language exactly as defined by HTML 3.2; some of them might differ a lot from it. This is understandable, since the language HTML evolves rapidly (and even divergently).
The specification is relatively short and technical, and consulting the older HTML 2.0 specification (also known as RFC 1866) can be useful, since the current HTML 3.2 specification can sometimes be understood only by assuming HTML 2.0 as a background document.
In order to understand the HTML specifications exactly, some fluency in reading SGML (the metalanguage used to describe the syntax of HTML formally) is required. SGML as a whole is rather complicated, and the SGML standard is only available in printed form. However, for the purpose of understanding the SGML descriptions of the syntax of HTML (that is, HTML DTDs), the following material usually gives you enough information:
There are some minor internal inconsistencies in the HTML 3.2 specification.
Notice that documents on HTML (even some of the above-mentioned) very often contain information about features which do not belong to HTML 3.2.
Even if you know HTML 3.2 well, you will occasionally violate the specification by mistake; for instance, just forgetting a closing quote can cause a lot of such violations. You may not notice the error in your environment but your readers may get confused.
It is not sufficient to check that "it works" on your browser. Other people will use that browser in a different environment or with different settings, different versions of the browser, or even quite different browsers. Browsers very often pass invalid HTML without giving error messages, perhaps even handling it in such a way that things seem to work fine. For other people, it might be a mess. Looking at your document on a few different browsers may help to detect problems, but it would be too tedious to do that for all important browsing environments.
Therefore, validate your code. You can use eg the HTML Validation Service of WebTechs, which is easy to use.
Passing validation means that there are no violations of HTML syntax (provided that the validator does its job right). Checking the quality of the document is a different thing. There are some checkers such as WebLint which can be used to test the document for various common problems - for things which, although technically legal, are likely to provoke known browser bugs, etc. Checkers may of course perform an HTML syntax check too, but typically they are rougher than validators. They might declare a document's syntax legal when it isn't, or declare it illegal when it is legal. Nevertheless, they are useful tools, both for alerting newcomers to potential problems, and for picking up errors made by even the most experienced.
For more information, see Heikki Kantola's nice compact list of validators and checkers and WDG's (annotated) rather extensive list of validators and checkers.
In addition to character repertoire and encoding (of characters by bit combinations), there is a special feature which is fixed in HTML: the interpretation of numerical character escapes of the form &#n; where n is a number. Such an escape is to be interpreted as the character corresponding to n in ISO 10646 and Unicode. In practice, browsers cannot represent all ISO 10646 characters, but the specifications imply that if a browser presents &#n; as a character, it must use the ISO 10646 character. (Unfortunately, browsers may violate this.)
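The relation between a character and its numerical escape can be illustrated with a few lines of Python. This is only a sketch: the function names numeric_escape and decode_numeric are invented for this illustration, and only the decimal form &#n; discussed above is handled.

```python
import html


def numeric_escape(ch: str) -> str:
    """Return the numerical escape &#n; for a single character.

    n is the code position of the character in ISO 10646 / Unicode.
    """
    return f"&#{ord(ch)};"


def decode_numeric(ref: str) -> str:
    """Decode a decimal escape of the form &#n; into the character it denotes."""
    if not (ref.startswith("&#") and ref.endswith(";")):
        raise ValueError(f"not a numerical character escape: {ref!r}")
    return chr(int(ref[2:-1]))


print(numeric_escape("ä"))        # &#228;
print(decode_numeric("&#169;"))   # ©
# The standard library decodes such escapes the same way.
print(html.unescape("&#228;"))    # ä
```

For ISO Latin 1 characters the number n is simply the code value of the character, so the escape can be computed from the character and vice versa.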
In practice, you should use ISO Latin 1 characters only. Currently or in the near future you can hardly expect general support for extensions to it, although support for some national alphabets may exist nationally. Support for ISO Latin 1 should exist in all browsers, but there are problems even with this. You may of course decide to stick to the ASCII character set, which is a subset of ISO Latin 1, especially if you do not need letters with diacritic marks (or, in general, letters other than English a - z).
The printable characters of ASCII (with code values from 32 to 126 in decimal) are the following:
! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~
The other printable characters of ISO Latin 1 (with code values from 160 to 255 in decimal) are the following:
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Note: The presentation of some characters in the copy of this document may be defective eg due to lack of font support. Naturally, the appearance of characters varies from one font to another.
If your keyboard or text editor does not allow you to enter (ie to type directly) some ISO Latin 1 characters such as ä or ñ, you can use the character escape conventions.
Some practical warnings to those who create HTML documents on microcomputers:
<H1> <H1 ALIGN=LEFT>
<H1>Foreword</H1>
In such cases the two tags and the part of the document enclosed by them form a unit which is called an HTML element. Some tags, eg <HR>, are HTML elements by themselves, and for them the corresponding end tag would be illegal. - In the sequel we will usually refer to tags by their name only, omitting the obligatory angle brackets.
For some elements which logically consist of a start tag, some content and an end tag, it is legal to omit the end tag, possibly even the start tag. For example, you can omit the end tag </P> and let browsers and other software imply it when necessary. The exact rules for allowable tag omission are given in the HTML specification, often only in the formal (SGML) syntax, so they can be hard to read. Moreover, some browsers are known to misbehave if you omit some end tags even when the specs allow it, and this can have drastic effects eg when nested tables are involved. Thus it is wisest to use explicit end tags always for all elements which logically have an end tag.
You can also omit the quotes from an attribute value if the value consists of the following characters only (cf the technical concept of name):
Within attribute values, no HTML tags are recognized. On the other hand, escape sequences are recognized and interpreted.
There is a minimized syntax for attributes when the attribute value is the same as the attribute name. For instance, <UL COMPACT="COMPACT"> can be abbreviated as <UL COMPACT> (and it is common practice to do so). Some user agents even require minimization for some attributes (COMPACT, ISMAP, CHECKED, NOWRAP, NOSHADE, NOHREF), so perhaps it is best to use the minimized syntax when applicable.
Successive attribute specifications must be separated with blanks (or newlines).
The general syntax of URLs is the following:
scheme://host:port/path/filename
where
| scheme | meaning |
|---|---|
| http | a Web document (to be accessed using Hypertext Transfer Protocol, HTTP) |
| ftp | a file in a so-called FTP server, to be retrieved using File Transfer Protocol |
| gopher | a file in a Gopher server |
| mailto | electronic mail address |
| news | a newsgroup or an article in Usenet news |
| telnet | for starting an interactive session via the Telnet protocol (which is part of TCP/IP) |
www.hut.fi (or sometimes a numerical TCP/IP
address); notice that typically, but not necessarily, Web
servers have domain names starting with www
:port
http
URLs. For other URLs, simplifications and special interpretations are applied. For example, a mailto URL is just of the form mailto:address where address is a normal Internet E-mail address like Jukka.Korpela@hut.fi.
Please notice that appending anything to the E-mail address in a mailto URL is nonstandard and may result in lost mail without anyone noticing!
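To see how the parts of the general syntax scheme://host:port/path/filename fit together, one can let a program split a URL into its components. The following sketch uses the urlsplit function of the Python standard library; the explicit port number 80 in the sample URL is an assumption made for this illustration only.

```python
from urllib.parse import urlsplit

# An http URL with all the parts of the general syntax present.
# (The port number 80 is an assumption made for this example.)
parts = urlsplit("http://www.hut.fi:80/%7ejkorpela/HTML3.2/")
print(parts.scheme, parts.hostname, parts.port, parts.path)
# http www.hut.fi 80 /%7ejkorpela/HTML3.2/

# A mailto URL has only a scheme and an address, no host or port part.
mail = urlsplit("mailto:Jukka.Korpela@hut.fi")
print(mail.scheme, mail.path)
# mailto Jukka.Korpela@hut.fi
```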
As explained above, it is safest to enclose URLs in quotes when writing them as attribute values in HTML.
For an overview of URLs, see W3C material on addressing.
As regards the technical specifications of the syntax of URLs, see RFC 1738 (absolute URLs) and RFC 1808 (relative URLs).
In particular, the specifications say that within a URL only a limited set of characters can be used as such:
letters and digits (A to Z, a to z, 0 to 9)
$-_.+!*'(),
;/?:@=&# provided that they are used in the special meaning reserved for them in the RFCs mentioned above.
;/?:@=&# must also be encoded, if they are not used in the special meaning.)
This encoding (which is defined by URL specifications, not HTML specifications) consists of using the percent sign followed by two hexadecimal digits, presenting the code position. For example, tilde (~) should be presented as %7E and space as %20. (Violating the rules is much more likely to cause problems in the latter case than in the former.)
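The encoding rule can be sketched in Python as follows. The function name percent_encode and its parameter reserved_ok are invented for this illustration, and the sketch assumes that the text consists of ISO Latin 1 characters, so that each character has a code position which fits into two hexadecimal digits. Notice that the ready-made quote function of the Python standard library follows a later URL specification and leaves the tilde unencoded, which is why the rule is written out here.

```python
# Characters which may appear as such: letters, digits and $-_.+!*'(),
SAFE = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "$-_.+!*'(),"
)


def percent_encode(text: str, reserved_ok: str = "") -> str:
    """Encode every character which is not allowed as such as %XX.

    reserved_ok lists reserved characters (eg "/") which are used in
    their special meaning and must therefore be left alone.
    Assumption: text is ISO Latin 1; other characters raise an error.
    """
    out = []
    for code in text.encode("latin-1"):
        ch = chr(code)
        if ch in SAFE or ch in reserved_ok:
            out.append(ch)
        else:
            out.append(f"%{code:02X}")
    return "".join(out)


print(percent_encode("~jkorpela"))                     # %7Ejkorpela
print(percent_encode("my file.html"))                  # my%20file.html
print(percent_encode("/~jkorpela/", reserved_ok="/"))  # /%7Ejkorpela/
```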
In this document, upper case letters are used for the above-mentioned constructs. This may help the reader distinguish HTML code from normal text.
However, the following constructs are (in general) case sensitive:
The term newline is used to denote an end of line designation. Theoretically SGML specifies that a line (record) should begin with a record start character (line feed, LF, ASCII code 10) and end with a record end character (carriage return, CR, ASCII code 13). In practice, HTML documents are presented and transmitted using the newline presentation convention of the computer system used. Therefore, HTML browsers are encouraged to accept any of the three common representations, namely CR LF sequence, CR only, and LF only, as line separators and to infer the missing record end and start characters.
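A program which reads HTML files can treat the three representations alike by normalizing them first. The following Python sketch shows one way of doing it; the function name normalize_newlines is invented here, and the choice of LF as the common representation is just an assumption of the example.

```python
import re


def normalize_newlines(text: str) -> str:
    """Replace CR LF, CR only and LF only by a single LF.

    CR LF is listed first in the pattern so that the pair is treated
    as one line separator and not as two.
    """
    return re.sub(r"\r\n|\r|\n", "\n", text)


print(repr(normalize_newlines("one\r\ntwo\rthree\nfour")))
# 'one\ntwo\nthree\nfour'
```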
Thus, it does not matter how you divide the text into lines, since a newline is equivalent to a blank. Notice, however, that you must not divide a word into two lines in HTML. If you eg divide the word international into two lines as follows:
inter-
national
it will be interpreted as equivalent to
inter- national
and the result is not what you want.
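The effect can be imitated with a small Python function which treats any run of spaces, tabs and newlines as a single blank. This is only a rough model of what browsers do outside PRE elements, not a description of any particular browser, and the function name collapse_whitespace is invented for the illustration.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Treat each run of spaces, tabs and newlines as one blank.

    The non-breaking space (code 160) is deliberately not included
    in the pattern, so it is left untouched.
    """
    return re.sub(r"[ \t\r\n]+", " ", text).strip()


print(collapse_whitespace("inter-\nnational"))
# inter- national
print(collapse_whitespace("some   text\n   divided\tinto lines"))
# some text divided into lines
```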
Thus, you must use HTML tags such as P or BR to force line breaks, if they are necessary for the logical representation of your document.
Browsers usually do not divide words into two lines, except possibly when a word contains a hyphen. The HTML 3.2 Reference Specification is not very explicit in this matter; it just says, in the discussion of tables, the following:
For some user agents it may be necessary or desirable to break text lines within words. In such cases a visual indication that this has occurred is advised.
Beware that the line length is outside your control. It depends on the browser, device, and settings used by the people who look at your document. You can force line breaks but not prevent line breaks between words, in general. (You can try to prevent line breaks by using non-breaking spaces.)
As regards newlines in conjunction with HTML tags, there are special rules:
<P> Text
is equivalent to
<P>Text
Text </P>
is equivalent to
Text</P>
The horizontal tab character (HT) can appear in the HTML source. Within PRE elements, tabs have a special interpretation. Otherwise a tab is equivalent to a space. Thus, it does not imply tabulation of any kind. (In order to present tabular data, use the TABLE element.) It is best to avoid tabs in HTML code and to use a suitable number of spaces instead, if one wants to format the HTML source code into tabular form.
Apart from the elements at the topmost levels, namely HTML, HEAD and BODY, the HTML elements are classified into three major categories:
Any text element (including plain text) can appear wherever a block element is allowed, by virtue of implicitly forming a paragraph (P element) when necessary.
A rule of thumb which may help in remembering which elements are block elements and which are text elements: block elements cause paragraph breaks, text elements do not.
Note: Often block elements can contain both text elements and other block elements, ie blocks can be nested. Text elements can be nested, too. On the other hand, text elements may not contain block elements. For example,
<CITE><H3>Origin of Species</H3></CITE>
is invalid (since CITE is a text element and H3 is a block element) and also illogical (you don't really mean that the heading as a structure is a citation, do you?) whereas
<H3><CITE>Origin of Species</CITE></H3>
would be legal, although different browsers might treat it differently (letting either H3 or CITE determine the rendering, or possibly using a mixture of the two). Similarly, don't embed headings in A NAME elements; do it the other way round. It is also illegal to have a paragraph break (P tag) within eg a STRONG element; although several browsers can handle it, the semantics is ambiguous and you should use separate start and end STRONG tags within each paragraph (if you really want to emphasize such large portions of text!).
The same information is presented in the individual tag descriptions, in their Allowed context and Contents parts. Here it is presented in a compact form. This form does not cover all details but might be more illustrative.
Legend:
A, ADDRESS, APPLET, B, BIG, BLOCKQUOTE, BODY, CAPTION, CENTER, CITE, CODE, DD, DFN, DIV, DT, EM, FONT, FORM, H1, H2, H3, H4, H5, H6, HTML, I, KBD, LI, P, PRE (with restrictions), SAMP, SMALL, STRIKE, STRONG, SUB, SUP, TD, TH, TT, U, VAR.
The following are not text containers but may contain text elements indirectly, ie contain elements which are text containers:
DIR, DL, MENU, OL, TABLE, TR, UL.
The following may not contain text elements at all:
AREA, BASE, BASEFONT, BR, HEAD, HR, IMG, INPUT, ISINDEX, LINK, MAP, META, OPTION, PARAM, SCRIPT, SELECT, STYLE, TEXTAREA, TITLE.
Similarly I will use the term block container to denote any element which may contain a block element directly (as opposed to containing an element which contains a block element). Block containers are: BLOCKQUOTE, BODY, CENTER, DD, DIV, FORM, HTML, LI (when within UL or OL), TD, TH.
Obviously, since some characters such as < are used with a very special meaning in HTML, there must be some way of expressing them as data characters, ie when they should appear eg as part of the document itself or in a URL. The convention is that the following notations are used:
| character | notation | usual name(s) of the character |
|---|---|---|
| < | &lt; | less than character, left angle bracket |
| > | &gt; | greater than character, right angle bracket |
| & | &amp; | ampersand |
There was the notation &quot; for the double quote (") in HTML 2.0, but it does not belong to HTML 3.2 (for certain technical reasons). The double quote can be typed as such within normal text, and within quoted strings as well if single quotes are used as the outermost quotes. (In the rare cases where this does not work, you can use &#34; to represent the double quote.)
Notice that the semicolon is part of the escape sequence. In principle, it is necessary only if the following character would otherwise be recognized as part of the name. In practice, it is best to adopt the habit of always terminating an escape sequence with a semicolon.
In escape sequences, the case of letters is significant. For example, the ampersand & may not be represented as &AMP; (this escape sequence is undefined), and the escape sequences &auml; and &Auml; denote two distinct characters, a umlaut (a dieresis, the letter a with two dots above it) in lower case and in upper case (ä and Ä); notice the principle of uppercasing only the first letter in the escape notation (&AUML; is undefined).
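When HTML is generated by a program, the escaping of data characters is best left to a library routine. The following Python sketch uses the html module of the standard library; the function name escape_data is invented for this illustration. Beware that this module follows the entity tables of later HTML versions, so it is an approximation of the HTML 3.2 rules rather than an implementation of them.

```python
import html


def escape_data(text: str) -> str:
    """Escape the characters <, > and & so that they are taken as data.

    quote=False leaves the double quote alone, since only the three
    notations &lt; &gt; &amp; are needed in normal text.
    """
    return html.escape(text, quote=False)


print(escape_data("a < b & c > d"))
# a &lt; b &amp; c &gt; d

# Decoding shows that the case of letters matters in escape sequences.
print(html.unescape("&auml; and &Auml;"))
# ä and Ä
```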
The need for the above-mentioned escape sequences arises from the syntax of HTML. In fact there are escape sequences for all characters in the ISO Latin 1 character set. There are
| &copy; | copyright sign, © |
| &reg; | registered trademark sign, ® |
| &nbsp; | non-breaking space |
However, there is usually little reason to use other escape sequences than &lt; and &gt; and &amp;. Using &auml; instead of ä might seem to give some character code independency, but it does not; if a browser can display &auml; correctly, it can also display correctly a document in which the character ä is specified directly. But notice that sometimes you cannot input some special characters directly due to keyboard restrictions, and in such cases you can have use for notations like &auml;.
And please notice that "character ä" means the ISO Latin 1 character with name "small letter a with diaeresis" (diaeresis = umlaut), with code 344 in octal, 228 in decimal. It can be entered into an HTML document in various ways. It is possible that pressing a key labeled with ä or Ä is not among those ways. For instance, on a Macintosh with Scandinavian keyboard the ä key normally produces a character quite different from ä in ISO Latin 1. Various programs may or may not handle this by performing character code conversions.
Some browsers support other escape sequences than those mentioned above, for example &trade; and &cbsp;. The use of such notations is strongly discouraged. (Notation &trade; refers to a symbol which does not belong to ISO Latin 1 at all; you may wish to use the HTML 3.2 conformant notation <SUP><SMALL>TM</SMALL></SUP> instead. Notation &cbsp; stands for "conditional breaking space", not in ISO Latin 1 and possibly not intended to be a character at all.)
This name concept occurs in the description of HTTP-EQUIV and NAME attributes of the META element and in the description of NAME attribute of the PARAM element.
In other contexts, a string which is used to name something may contain other characters as well but then it must be quoted.
It is of course possible that due to software or hardware limitations all colors cannot be presented. On some devices, the actual rendering might be just black and white or different shades of grey.
When a color is specified as the value of an attribute, there are two possibilities:
It is not necessary to know the numerical equivalents of the predefined color names in order to use them. However, the following table specifies them as well, since they might help authors who wish to define colors by slightly modifying the predefined ones.
| Black = "#000000" | Green = "#008000" |
| Silver = "#C0C0C0" | Lime = "#00FF00" |
| Gray = "#808080" | Olive = "#808000" |
| White = "#FFFFFF" | Yellow = "#FFFF00" |
| Maroon = "#800000" | Navy = "#000080" |
| Red = "#FF0000" | Blue = "#0000FF" |
| Purple = "#800080" | Teal = "#008080" |
| Fuchsia = "#FF00FF" | Aqua = "#00FFFF" |
These colors were originally picked as being the standard 16 colors supported with the Windows VGA palette. The HTML 3.2 Reference Specification contains a section on colors with sample images in each of the 16 colors. Notice that these colors are rather striking in their brightness. Normally you should use paler colors.
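For authors who wish to derive paler colors from the predefined ones, the arithmetic can be sketched in Python as follows. The function names parse_color, format_color and paler are invented for this illustration, and blending each component towards white by a given percentage is just one possible way of making a color paler.

```python
import re


def parse_color(value: str) -> tuple:
    """Split a color of the form #RRGGBB into (red, green, blue), each 0..255."""
    if not re.fullmatch(r"#[0-9A-Fa-f]{6}", value):
        raise ValueError(f"not of the form #RRGGBB: {value!r}")
    return tuple(int(value[i:i + 2], 16) for i in (1, 3, 5))


def format_color(rgb) -> str:
    """Build the #RRGGBB notation from three components."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)


def paler(value: str, percent: int = 50) -> str:
    """Move each component the given percentage of the way towards white."""
    return format_color(
        tuple(c + (255 - c) * percent // 100 for c in parse_color(value))
    )


print(parse_color("#C0C0C0"))   # (192, 192, 192)
print(paler("#008000"))         # #7FBF7F, a paler variant of Green
```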
See also:
A browser should multiply the pixel values by an appropriate factor when rendering to very high resolution devices such as laser printers. For instance if a user agent has a display with 75 pixels per inch and is rendering to a laser printer with 600 dots per inch, then it should multiply the pixel values given in HTML attributes by a factor of 8.
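The arithmetic of this example is simple enough to be written out in Python. The function name scale_pixels and the default values (taken from the example above) are assumptions of this sketch; a real user agent would of course use the actual resolutions of its devices.

```python
def scale_pixels(pixels: int, screen_ppi: int = 75, device_dpi: int = 600) -> int:
    """Scale a pixel value given in an HTML attribute for a denser device."""
    return round(pixels * device_dpi / screen_ppi)


print(scale_pixels(1))     # 8, ie the factor mentioned above
print(scale_pixels(100))   # 800
```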
The question whether &nbsp; should prevent line breaks when rendering HTML documents is ambiguous. The HTML 2.0 specification says:
Use of the non-breaking space and soft hyphen indicator characters is discouraged because support for them is not widely deployed.
The soft hyphen should really be avoided; it serves no useful purpose in HTML. But as regards the non-breaking space, you can well use it to try to prevent line breaks where you don't want them. And although the HTML 3.2 Reference Specification is not explicit about the matter in general, it suggests, in the discussion of the NOWRAP attribute of TH and TD elements, that &nbsp; should act as non-breaking space within table cells at least.
If you use non-breaking spaces, use them instead of normal spaces, not in addition to them. For instance, if you wish to prevent a line break between version and 3, type version&nbsp;3 (not version &nbsp; 3).
On the other hand, within a table in HTML 3.2, &nbsp; can have a quite different meaning, which can be described as non-empty space: when a table is presented with borders, cells with empty contents are drawn without them, and spaces only do not constitute contents - but &nbsp; does! This peculiar semantics does not prevent &nbsp; from acting as a non-breaking space as well.
For further confusion, some people use &nbsp; to force spaces into the visible presentation of a document, eg by putting an &nbsp; or a few of them into the beginning of a paragraph to get its first line indented. This may actually work on some browsers, but it is unwise to rely on that, and it is normally useless to try to enforce such presentation features anyway.
You can begin a comment with the four-character sequence <!-- (less than sign, exclamation sign, two hyphens) and terminate it with the three-character sequence --> (two hyphens, greater than sign). Don't use the character pair -- or the character > within a comment. For example:
<!-- Written by Jukka Korpela -->
(For a more thorough discussion of comment syntax, see document HTML comments by WDG.)
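A program which writes comments into HTML files can check the simple rule given above before producing the comment. The following Python sketch does so; the function name make_comment is invented for this illustration, and the check is deliberately stricter than the formal comment syntax, just like the rule of thumb itself.

```python
def make_comment(text: str) -> str:
    """Wrap text into an HTML comment, refusing the risky character sequences."""
    if "--" in text or ">" in text:
        raise ValueError("a comment should not contain -- or >")
    return f"<!-- {text} -->"


print(make_comment("Written by Jukka Korpela"))
# <!-- Written by Jukka Korpela -->
```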
It is generally preferable to include metainformation about the document into HTML elements, such as META. Consider making information about purpose, author, creation and last update time etc a visible part of the document itself, too.
Thus, comments should be inserted in rare cases only, eg to comment the HTML code itself to explain things that may look odd. Remember that a comment is part of an HTML file, to be transmitted whenever the document is delivered. Therefore, to avoid wasting bandwidth, if you have a long story to tell, put it into a separate document and insert just its URL into a comment.
HTML editors and converters often insert a few comment lines into the beginning of an HTML file. Such indications can be helpful and should not be removed.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<TITLE>Hello</TITLE>
Hello world
In fact, this document implicitly has the following structure, ie it is equivalent to the following:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>Hello</TITLE>
</HEAD>
<BODY>
Hello world
</BODY>
</HTML>
This means that apart from the first line, the entire file is an HTML element which contains a HEAD element, with the TITLE element as contents, and a BODY element, with the plain text as contents.
Thus, in the absence of HTML, HEAD, and BODY tags a browser implicitly assumes them in suitable places. Therefore, your document always contains a head and a body.
Here we will simply emphasize that every HTML document should contain certain basic information about its origin. The local recommendations may specify in detail the form in which that information should be provided.
The importance of providing origin information becomes evident if we think how people increasingly find documents using search engines or link lists. In such contexts the document pops up as such, in isolation, even if you may have intended that people find it by following links which you have carefully designed so that they give background information. When a user has eg found your document using AltaVista, he most probably wants to know what kind of document it is. Therefore, each HTML file should provide the very basic information (or link to information) about its origin and nature. For example, in a book-like document collection divided into small files, every file should contain at least a link to the "front page" of the "book".
At least the following origin information should be provided:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>A sample HTML document</TITLE>
<LINK REV="made" HREF="mailto:jukka.korpela@hut.fi">
</HEAD>
<BODY>
<H1>A sample HTML document</H1>
This is a sample HTML document exemplifying a suggested way of presenting basic origin information.
<HR>
<P>
<A HREF="http://www.hut.fi/~jkorpela/">Jukka Korpela</A>,
<a href="mailto:Jukka.Korpela@hut.fi">Jukka.Korpela@hut.fi</a>
<BR>
This document belongs to the context of
<a href="index.html">Learning HTML 3.2 by Examples</a>
<BR>
The URL for this document is
<KBD> http://www.hut.fi/~jkorpela/HTML3.2/skel.html </KBD>
<BR>
Created: December 5, 1996
</BODY>
</HTML>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> (where you theoretically should have HTML 3.2 Final instead of HTML 3.2)
<TITLE>Introduction to General Absurdity</TITLE>
Most browsers don't complain if you omit these, but they are required by the HTML 3.2 definition. More importantly, there are good practical reasons to include them:
Optionally, the HEAD element may contain the following elements in addition to a TITLE element:
The tags for expressing major structural features, so-called block level tags, are the following:
A recommendable approach, which may need adjustments to fit your local recommendations, is the following:
Lists can be nested in the sense that an item in a list, ie an LI (or DD) element, may in turn contain a list element.
Notice that the basic paragraph element P is not nestable, ie you cannot have P elements within a P element to create subparagraphs. However, the various list elements effectively provide an itemization structure which essentially corresponds to subparagraph division. Moreover, the list elements are nestable.
Logical markup should be preferred. Use physical markup only if it is really relevant that part of a text is displayed in a particular physical way (if possible). The need for physical markup may arise when referring to information in fixed presentation form, such as text in a book or in an image. Such situations occur rarely.
For instance, use the STRONG element for strong emphasis, letting the various Web browsers express the emphasis in the way which is the best in the environment where they are used. Do not use the B element (indicating bolding), except in the rare occasions where you are writing about some text appearing in boldface somewhere.
When style sheets become generally usable, both authors and readers will be able to affect the rendering (eg font, color, and background) of elements. For instance, someone might wish to have all program code extracts presented with yellow background and larger than normal font whereas someone might prefer some quite different methods of distinguishing them from normal text. Such operations will be much easier if logical markup has been used consistently.
In addition to being more flexible with respect to various browsers and rendering environments, logical markup has the following advantage over physical markup: Increasingly, computer programs are used for extracting information from HTML documents for various purposes like indexing. For this to work, it is much better to have logical markup indicating eg that some text is more important than the rest or a quotation of computer printout, rather than having designations of physical fonts.
Both logical and physical markup is done using HTML elements with start and end tags. It follows from the nature of HTML language that markups must not overlap. For instance, the following is in error:
This has some <B>bold and <I></B>italic text</I>.
On the other hand, markup elements can be nested. User agents should do their best when rendering structures like the following:
This is <I>italic text which contains <U>underlined text</U> within it </I> whereas <U>this is normal underlined text</U>.
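The difference between nesting and overlapping can be checked mechanically by keeping a stack of the elements which are currently open. The following Python sketch does this for simple cases only: it assumes that every element has explicit start and end tags, so it is not suitable for elements like BR or for documents where end tags have been omitted, and it is by no means a substitute for a validator. The names TAG and find_overlap are invented for the illustration.

```python
import re

# Matches a start tag or an end tag and captures the element name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def find_overlap(markup: str):
    """Return (innermost open element, offending end tag) or None.

    An end tag must always close the element which was opened last;
    otherwise the elements overlap instead of being nested.
    """
    stack = []
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).upper()
        if not closing:
            stack.append(name)          # start tag: element is now open
        elif stack and stack[-1] == name:
            stack.pop()                 # end tag closes the innermost element
        else:
            return (stack[-1] if stack else None, name)
    return None


print(find_overlap("This has some <B>bold and <I></B>italic text</I>."))
# ('I', 'B')
print(find_overlap("This is <I>italic with <U>underlined</U> text</I>."))
# None
```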
Obviously, browsers with limited font repertoire can have difficulties in presenting text markup.
Avoid emphasizing too much, since emphasizing everything is tantamount to saying everything with the same emphasis, ie not emphasizing anything! (The proverbial student who underlines everything in his textbook has not grasped the idea of emphasizing.)
Unfortunately there is no phrase element for "de-emphasis", ie for indicating segments of text as less important. If you really need that, you may consider using the SMALL element. But especially if the less important text is relatively long, it might often be a better idea to put it "behind hyperlinks", into separate documents to which there are links in the main document. A person who follows such a link is probably interested in the text, so he probably prefers seeing it as normal text, and there is no need for any de-emphasis.
The DFN element can be regarded as a special kind of emphasis, too, but logically it indicates that a term is used in a context where it is defined. This is a very useful element in principle but unfortunately many browsers, including Netscape, do not effectively support it.
The VAR element indicates that a piece of text (typically, a word) is a variable, ie a generic notation to be replaced by different actual expressions.
The other phrase elements involve different kinds of citations or quotations:
| CITE | citation (title of a book or article or equivalent) |
|---|---|
| CODE | program code or equivalent (eg HTML code) |
| SAMP | sample output from programs, scripts, commands etc |
| KBD | text to be typed from a keyboard by a user; typically used when giving instructions |
Please do not identify eg the concept of emphasis with its physical representation on your browser (or even its typical representation on several browsers). See below for notes and examples on rendering markup.
| TT | "teletype" text, ie monospaced text |
|---|---|
| I | italics |
| B | bold |
| U | underlined |
| STRIKE | strike-through text |
| BIG | large font |
| SMALL | small font |
| SUB | subscript |
| SUP | superscript |
Note: SUB and SUP might reasonably be regarded as phrase-level markup, and as mentioned above, SMALL might be used as a substitute for the missing phrase markup for de-emphasis.
The FONT (and BASEFONT) element offers more possibilities to control font sizes than BIG and SMALL. However, all use of font size control in HTML should be avoided.
For example, some browsers (eg Internet Explorer) render TT (and CODE) so that the font is significantly smaller than normal text font, and this disproportion is preserved when the setting for font size is changed; moreover, Internet Explorer renders VAR with monospaced font whereas most graphical browsers use (much more naturally) italics. On the other hand, in Netscape these font sizes are separately settable and by default the same font size is used for both, but "the same" is the technical size in points - in practice monospaced font looks bigger than normal proportional font!
Thus, avoid meddling with font sizes; use phrase markup and other structural elements and let the users, if they dislike the font sizes, define fonts in their browser settings the best they can.
The following table is intended to give an idea of the variation. It (verbally) presents the rendering of markup elements in Netscape Navigator, Microsoft Internet Explorer, and Lynx. Notice that there is variation even within each of these programs, depending on version, platform, and system-wide or user's own configuration, so this is just a typical situation. Thus, treat the table as an indication of the different things that might happen rather than as a description of what actually happens in some particular program.
| element | Netscape | Internet Explorer | Lynx |
|---|---|---|---|
| EM | italics | italics | underlined |
| DFN | normal text | italics | normal (monospaced) |
| CODE | monospaced | monospaced small | normal (monospaced) |
| SAMP | monospaced | monospaced small | normal (monospaced) |
| KBD | monospaced | monospaced small | normal (monospaced) |
| VAR | italics | monospaced small | normal (monospaced) |
| CITE | italics | italics | underlined |
| TT | monospaced | monospaced small | normal (monospaced) |
| I | italics | italics | underlined |
| B | bold | bold | underlined |
| U | normal text | underlined | underlined |
| STRIKE | strike-through | strike-through | text between [DEL: and :DEL] |
| BIG | larger than normal | larger than normal | normal text |
| SMALL | smaller than normal | slightly smaller than normal | normal text |
| SUB | lowered, slightly smaller | lowered | normal text |
| SUP | raised, slightly larger | raised | normal text |
These relate to unnested elements. Nesting of text elements may affect the rendering.
The following example illustrates the approach in the context of an introduction to the Perl programming language.
<P>The following Perl script prints out its input so that each line begins with
a running line number:</P>
<PRE><CODE>
#!/usr/bin/perl
$line = 1;
while (&lt;&gt;) {
print $line++, " ", $_; }
</CODE></PRE>
<P>The scalar variable <CODE>$line</CODE> is of course the line counter.</P>
<P>The loop construct is of the form<BR>
<CODE>while (&lt;&gt;) {</CODE><BR>
<VAR>process one line of input</VAR> <CODE>}</CODE><BR>
</P>
<P>Assuming that you have written this script (the simpler version of it) into a
file named <KBD>lines</KBD>, you could test it using a command of the form<BR>
<KBD>./lines</KBD> <VAR>datafile</VAR><BR>
In particular, using the script as input to itself, you would do as follows
(the details of system output vary from one system to another):
</P>
<PRE>
<SAMP>lk-hp-23 perl 251 % </SAMP><KBD>./lines lines</KBD>
<SAMP>1 #!/usr/bin/perl
2 $line = 1;
3 while (&lt;&gt;) {
4 print $line++, " ", $_; }
lk-hp-23 perl 252 % </SAMP>
</PRE>
Notes on the example:
Thus, on the Web there is no such thing as the layout of a document. As an author you cannot dictate layout, just make some efforts to affect it. The following notes, and all information related to layout-oriented features of HTML, should be read with this in mind.
Several HTML elements have optional attributes which can be used to affect the way in which the element is rendered. Consult the detailed descriptions of individual HTML tags to see the possibilities and to read notes about them.
In particular, you may wish to center parts of the text to make them more distinguishable from normal text. You can use the ALIGN=CENTER attribute in several elements like P or DIV (or the separate CENTER element).
If you wish to separate major portions of your document visually from each other, you can use the HR element. Typically it is rendered as a full width horizontal line. But please use this in addition to structuring tools like headings, not as a substitute for them.
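The following sketch combines the two features just mentioned: centering with ALIGN=CENTER and a horizontal rule between major parts. The heading and paragraph texts are placeholders invented for illustration:

```html
<H2 ALIGN=CENTER>Results</H2>
<P ALIGN=CENTER>This paragraph is centered on browsers which support ALIGN.</P>

<!-- The rule separates two major parts; each part still has its own heading. -->
<HR>

<H2>Discussion</H2>
<P>The rule above supplements the headings; it does not replace them.</P>
```

A browser which ignores ALIGN or renders HR differently still presents the structure correctly, since the headings carry it.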
As regards detailed layout issues such as forcing or preventing line breaks, see section Division into lines and the use of blanks and tabs. Font issues were discussed above.
Technically links are specified using A (anchor) elements, and the technical issues are discussed in the description of the A tag. Here we just present the basic idea, a very simple example, and a few pragmatic or stylistic notes.
A link is a directed connection between a particular point in a document and another particular point in the same or another document. The points are often called anchors in HTML terminology.
The two ends of a link (the anchors) are in different logical positions: the link is from one point to another. The latter, called the target of the link, is very often the beginning of a document or, perhaps more logically speaking, an entire document.
In the simplest case, you create a link from one point of your document to another document (which could be your own or written by someone else, perhaps physically located at the other side of the globe). You have to decide which words act as a visual representation of the link, ie as the phrase which refers to the other document, and you need to know the Web address (the URL) of that document. Then you just put the pieces together into a suitable A element. For instance:
I work at <A HREF="http://www.hut.fi/english.html">HUT</a>.
This might, in one environment, be rendered as follows:
I work at HUT.
The link text, here the abbreviation HUT, acts as a link to a Web document which explains what the abbreviation means and also provides a lot of information about it. The renderings vary a lot - the link text might be underlined, colored, or otherwise distinguishable from normal text. The user (reader) is assumed to know how links are rendered in the particular environment.
Although it is technically easy to set up links, it is pragmatically often very difficult to use them the right way. Here are some practical guidelines:
Assuming that we have some graphics in some format in a file, there are two essentially different ways to use it in a Web document. You can either link to it or embed it in your document. In the first case, you use an anchor (A) element; in the latter case, an IMG element. In the first case, when a user accesses your document he sees eg a verbal phrase which acts as a link, and activating that link causes an image to be displayed, either in the same window or in another, depending on the browser and its settings. On the other hand, an embedded image is part of your document; when a user accesses your document, the image is loaded along with it and displayed as part of it.
In both cases, the user will see the image only if the browser supports the particular graphics format. The most commonly supported formats are GIF and JPEG. They are often the only formats supported for embedded images. For linked images, the support is typically wider (it might include eg PostScript, PDF, and PNG) and extensible by the user (by installing new viewers and making suitable additions to the settings of the browser). The reason is that linked images are typically implemented so that the browser knows nothing of the graphics format itself but only knows how to launch a separate program to present it.
As a special case, it is possible to combine linking and embedding in a sense: you can create a document which contains an image which acts (instead of verbal link text) as a link to another image. Typically, the embedded image is rather small, stamp-like, often a small coarse version of the image to which it points as a link.
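A minimal sketch of this combination follows. The file names photo-small.gif and photo-large.gif are hypothetical; the small image is embedded with IMG, and the surrounding A element makes it a link to the large one:

```html
<!-- photo-small.gif and photo-large.gif are hypothetical file names. -->
<A HREF="photo-large.gif"><IMG SRC="photo-small.gif"
ALT="[Small version of the photo; follow the link to see it in full size]"></A>
```

The ALT text matters here in particular, since on a browser which does not show images it is the only thing that can act as the link text.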
Linking to an image is usually permitted without specific permission. On the other hand, embedding an image means using it in a way which requires the author's permission, and the author must be mentioned. (See Web Law FAQ.) Obviously, some images are so simple that copyright is not applicable. Moreover, there is a large number of collections of images, some of which are in the public domain.
To illustrate linking to images and embedding images, let us consider a GIF image which has been put onto a suitable place so that it is accessible using the URL http://www.hut.fi/%7elsarakon/sae.gif. Now I could refer to it in the following way:
<A HREF="http://www.hut.fi/~lsarakon/">Liisa Sarakontu</A> has drawn <A HREF="http://www.hut.fi/~lsarakon/sae.gif">a picture of Siamese algae eater</A>.
On the other hand, since Liisa has given me the permission to do so, I could embed the image into a document of mine as follows:
The Siamese algae eater (<I>Crossocheilus siamensis</I>) is often mixed up with another algae eating fish, the "false Siamensis" (<I>Garra taeniata</I> or <I>Epalzeorhynchus sp.</I>). Below you can see drawings of them by <A HREF="http://www.hut.fi/~lsarakon/">Liisa Sarakontu</A>.
<P>
<IMG SRC="http://www.hut.fi/~lsarakon/sae.gif" ALT="[Picture of Siamese algae eater]">
<P>
<IMG SRC="http://www.hut.fi/~lsarakon/false.gif" ALT='[Picture of "false Siamensis"]'>
The issue of good use of images is very difficult and many-faceted. No attempt to cover it will be made here. The author has written a separate treatise How to use images in communication in general and on the Web in particular.
There is no general support in HTML 3.2 for presenting mathematical formulas. Consult the W3C document on Math Markup to see what work is in progress in this respect. However, you can use some software (eg TeX) to produce the representation of a formula as an image, eg in PostScript form, and use the IMG tag to embed it into your document or the A tag to create a link to it. The latter method is often worth considering, especially for large formulas. The reader may prefer reading the text without distractions and looking at the formula (image) at the very moment he is prepared to do so. Moreover, he may prefer looking at it in a separate window (which is separately adjustable in size and positionable on the screen).
In some cases, when just a few separate symbols are needed within the text and they have reasonable textual alternatives, the following kind of approach can be suitable:
The Greek letter <IMG SRC="http://www.ece.cmu.edu/icons/Sigma.xbm" ALT="sigma"> is often used to denote summation.
There is a problem, however: since an image has fixed dimensions whereas the size of letters is browser-dependent, there might be an unesthetic disproportion.
Sometimes it is best to present mathematical expressions in linearized notation. For example, instead of trying to find a way of presenting the square root of 2 in the normal mathematical way, you might write just sqrt(2). It depends on the intended audience whether you need to explain such notations.
Table cells are often called table elements, but it is best to avoid that in the HTML context, since it might cause confusion eg with the TABLE element, which is the HTML description of an entire table.
Tables are the most important improvement in HTML 3.2 in comparison with HTML 2.0. On the other hand, the table constructs of HTML 3.2 are only a subset of The HTML3 Table Model (RFC 1942).
Unfortunately tables are not yet supported by all browsers, and even if support exists it may be of poor quality. (Text-only browsers and speech-based user agents will always have difficulties with complicated tables, of course.) See Alan Flavell's review Tables on non-table browser for information about making tables look somewhat reasonable, if possible, also on browsers which do not support tables.
Another unfortunate situation is that people have started using table elements just to get a desired layout of pages, not to represent data which is logically matrix-like in structure.
<TABLE>
<TR> <TD> 1 </TD> <TD> 0 </TD> </TR>
<TR> <TD> 0 </TD> <TD> 1 </TD> </TR>
</TABLE>
and it looks like the following on a typical browser:
| 1 | 0 |
| 0 | 1 |
Thus, the TABLE tags enclose the table rows, each of which is enclosed by TR tags and encloses table cells enclosed by TD tags. This corresponds to the logical structure of a table as a set of rows consisting of cells. You can abbreviate the table structure by omitting the TD and TR end tags (since a browser implicitly assumes them), but at the expense of losing the logical clarity to some extent:
<TABLE> <TR> <TD> 1 <TD> 0 <TR> <TD> 0 <TD> 1 </TABLE>
Moreover, although omitting those end tags is legal HTML 3.2, it may in practice confuse some browsers (including Netscape) in some cases.
The use of blanks and newlines in the HTML code for a table is irrelevant to the visual appearance of a table when viewed with a browser, since that appearance is controlled by HTML tags. However, it is often useful to position table elements suitably in the HTML code so that items in the same column are adjusted to make the structure clear for you (or whoever has to maintain the HTML document).
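As a sketch of such source layout, the rows below are padded with blanks between the cells so that each column starts at the same position in the HTML code. The data is the same as in the animal name example of this document; the particular spacing is just one possible choice:

```html
<TABLE>
<TR><TD>hirvi</TD> <TD>elk</TD>      <TD><I>Alces alces</I></TD></TR>
<TR><TD>orava</TD> <TD>squirrel</TD> <TD><I>Sciurus vulgaris</I></TD></TR>
<TR><TD>susi</TD>  <TD>wolf</TD>     <TD><I>Canis lupus</I></TD></TR>
</TABLE>
```

The extra blanks are placed between cells rather than inside them, so they cannot become part of any cell's content.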
<P>An illustration of the use of the TABLE element in HTML.</P>
<TABLE BORDER=1>
<CAPTION>Finnish, English, and scientific names for some animals</CAPTION>
<TR><TH>Finnish name</TH><TH>English name</TH><TH>Scientific name</TH></TR>
<TR><TD>hirvi</TD><TD>elk</TD><TD><I>Alces alces</I></TD></TR>
<TR><TD>orava</TD><TD>squirrel</TD><TD><I>Sciurus vulgaris</I></TD></TR>
<TR><TD>susi</TD><TD>wolf</TD><TD><I>Canis lupus</I></TD></TR>
</TABLE>
Notice that some table cells in the example contain text markup; in this case, there is a specific reason for using the I element.
In the simplest case you can just write a TABLE element (with attributes defaulted) which contains a single row which contains two data cells, each of which contains a paragraph.
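A minimal sketch of this simplest case: one row, two data cells, one paragraph in each, and no attributes. The two sentences are borrowed from the Genesis example of this document purely as sample text:

```html
<TABLE>
<TR>
<TD><P>In principio creavit Deus caelum et terram.</P></TD>
<TD><P>In the beginning God created the heaven and the earth.</P></TD>
</TR>
</TABLE>
```

With longer texts the cells may differ in height, which is one reason for dividing the texts into several rows in the more general case.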
In a more general case, you should divide the parallel texts into logical parts, such as paragraphs, and make each part a cell of the table. This may require a lot of work (unless you have a suitable program to do the job), since you must take care of "merging" the text: after the first part of the first text, you must have the first part of the second text, etc.
The following example presents a passage from the Bible in three versions and translations:
<TABLE>
<CAPTION><STRONG>The beginning of Genesis in three languages</STRONG></CAPTION>
<TR ALIGN=LEFT VALIGN=TOP>
<TH><TH>Latin (Vulgate)</TH><TH>English (King James version)</TH>
<TH>Finnish (1992 version)</TH>
</TR><TR ALIGN=LEFT VALIGN=TOP>
<TH>1</TH>
<TD>In principio creavit Deus caelum et terram.</TD>
<TD>In the beginning God created the heaven and the earth.</TD>
<TD>Alussa Jumala loi taivaan ja maan.</TD>
</TR><TR ALIGN=LEFT VALIGN=TOP>
<TH>2</TH>
<TD>Terra autem erat inanis et vacua et tenebrae super faciem abyssi et spiritus Dei ferebatur super aquas.</TD>
<TD>And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.</TD>
<TD>Maa oli autio ja tyhjä, pimeys peitti syvyydet, ja Jumalan henki liikkui vetten yllä. </TD>
</TR><TR ALIGN=LEFT VALIGN=TOP>
<TH>3</TH>
<TD>Dixitque Deus "Fiat lux" et facta est lux.</TD>
<TD>And God said, Let there be light: and there was light.</TD>
<TD>Jumala sanoi: "Tulkoon valo!" Ja valo tuli.</TD>
</TR></TABLE>
Notice that the ALIGN and VALIGN attributes can be essential for achieving good rendering. Browsers cannot know the nature of tables from their contents, so there are situations where the document author may need to control formatting issues like alignment.
Using a TABLE element for a definition list is perhaps not an intended use of that element but it is often useful, especially since the author can control things like alignment and use of borders. Consult the document Examples of various list elements in HTML for a very simple example of presenting a definition list as a table with default attribute settings. Usually you probably want the "definition terms" to be left-aligned, as in the following example:
<TABLE> <CAPTION>The first three letters of the Greek alphabet</CAPTION> <TR><TH ALIGN=LEFT>alpha</TH> <TD> the first letter of the Greek alphabet </TD> </TR> <TR><TH ALIGN=LEFT>beta</TH> <TD> the second letter of the Greek alphabet </TD> </TR> <TR><TH ALIGN=LEFT>gamma</TH> <TD> the third letter of the Greek alphabet. </TD> </TR> </TABLE>
For numerical tables, proper alignment is usually crucial for easily readable rendering. (It is in a sense a structural feature, since it relates to the comparability of items of a column.)
Integer values in a column should be right aligned. This is easy to achieve in principle. There are two alternatives:
Values containing a decimal point (or, in many languages, a decimal comma) should be aligned according to that separator, but unfortunately this is not possible in HTML 3.2. (There are suggested ways of expressing such requests, but currently there is little if any support for them.) One solution is to present such values so that there is the same number of digits to the right of the decimal point in every value in a column, and use ALIGN=RIGHT.
However, the rendering might be unsatisfactory if numbers are presented using a proportional font so that digits are of essentially different sizes. It is possible but tedious to overcome this by putting the data in each numerical cell within a TT element. (Notice that it is not legal for a TT element to contain a TABLE element!)
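A short sketch of the TT approach, using a few of the measurement values from the example in this document. Each numerical cell gets its own TT element inside the TD; the TT elements are never placed around the TABLE:

```html
<TABLE>
<TR><TH>time</TH><TH>temperature</TH></TR>
<!-- TT goes inside each cell so that all digits have the same width. -->
<TR ALIGN=RIGHT><TD><TT>12:00</TT></TD><TD><TT>26.00</TT></TD></TR>
<TR ALIGN=RIGHT><TD><TT>12:45</TT></TD><TD><TT>3.30</TT></TD></TR>
<TR ALIGN=RIGHT><TD><TT>13:00</TT></TD><TD><TT>0.05</TT></TD></TR>
</TABLE>
```

Since every value in the temperature column has two digits after the decimal point, right alignment combined with a monospaced font should line up the decimal points, though the actual rendering still depends on the browser.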
The following example contains first a hand-formatted table presented using the PRE element, then the same data using a TABLE element. In general, it takes more work and care to use a TABLE element but the result is often much better.
Measurement results:
<PRE>
time   temperature  pressure
12:00  26           12.8
12:15  22.5         9.8
12:30  11           1.65
12:45  3.3          0.03
13:00  0.05         0.002
</PRE>
<TABLE>
<CAPTION>Measurement results</CAPTION>
<TR><TH>time</TH><TH>temperature</TH><TH>pressure</TH></TR>
<TR ALIGN=RIGHT><TD>12:00 </TD><TD>26.00 </TD><TD>12.800 </TD></TR>
<TR ALIGN=RIGHT><TD>12:15 </TD><TD>22.50 </TD><TD> 9.810 </TD></TR>
<TR ALIGN=RIGHT><TD>12:30 </TD><TD>11.00 </TD><TD> 1.650 </TD></TR>
<TR ALIGN=RIGHT><TD>12:45 </TD><TD> 3.30 </TD><TD> 0.030 </TD></TR>
<TR ALIGN=RIGHT><TD>13:00 </TD><TD> 0.05 </TD><TD> 0.002 </TD></TR>
</TABLE>
The index is implemented in HTML using normal
links, eg
<A HREF="af.html">Afghanistan</A>
What we will discuss here is how to present the link names, or some
other pieces of text, as a list, table, or some other structure.
If you only read HTML specifications, the obvious answer is to use the DIR or MENU construct. However, as mentioned and exemplified in the general discussion of lists, this is not practically feasible. Thus, if we prefer having the menu in multicolumn format, as we usually do, we must use other constructs.
One possibility is to format the menu by hand and enclose it in a PRE element. If the menu items are link texts, you should first format the menu as text only and then add the anchor (A) tags, since adding them obscures the layout. For clarity, therefore, the following example is presented without links (unlike the other alternatives):
<PRE>
Afghanistan          Albania       Algeria    American Samoa
Andorra              Angola        Anguilla   Antarctica
Antigua and Barbuda  Arctic Ocean  Argentina  Armenia
</PRE>
Another possibility, which should be the normal one, is to present the items simply as a text paragraph, using eg a blank or a blank and a comma as separator. This means that the browser takes care of dividing the text into lines and the presentation is very compact:
<BASE HREF="http://www.odci.gov/cia/publications/nsolo/factbook/">
<P>
<A HREF="af.htm">Afghanistan</A>,
<A HREF="al.htm">Albania</A>,
<A HREF="ag.htm">Algeria</A>,
<A HREF="aq.htm">American Samoa</A>,
<A HREF="an.htm">Andorra</A>,
<A HREF="ao.htm">Angola</A>,
<A HREF="av.htm">Anguilla</A>,
<A HREF="ay.htm">Antarctica</A>,
<A HREF="ac.htm">Antigua and Barbuda</A>,
<A HREF="ocat.htm">Arctic Ocean</A>,
<A HREF="ar.htm">Argentina</A>,
<A HREF="am.htm">Armenia</A>
</P>
Of course, it is possible to force line breaks by using a BR element (eg to make a change in the initial letter cause a new line in an example like above). If you think the items are not distinguishable enough in the rendering, consider prefixing each item with a special character like * (and using just spaces as separator).
However, if for some reason the presentation must be such that all items occupy the same amount of space, then one can either use the PRE method described above or take the effort of designing a suitable TABLE element. Example:
<BASE HREF="http://www.odci.gov/cia/publications/nsolo/factbook/">
<TABLE><TR>
<TD WIDTH=160><A HREF="af.htm">Afghanistan</A></TD>
<TD WIDTH=160><A HREF="al.htm">Albania</A></TD>
<TD WIDTH=160><A HREF="ag.htm">Algeria</A></TD>
<TD WIDTH=160><A HREF="aq.htm">American Samoa</A></TD>
</TR><TR>
<TD WIDTH=160><A HREF="an.htm">Andorra</A></TD>
<TD WIDTH=160><A HREF="ao.htm">Angola</A></TD>
<TD WIDTH=160><A HREF="av.htm">Anguilla</A></TD>
<TD WIDTH=160><A HREF="ay.htm">Antarctica</A></TD>
</TR><TR>
<TD WIDTH=160><A HREF="ac.htm">Antigua and Barbuda</A></TD>
<TD WIDTH=160><A HREF="ocat.htm">Arctic Ocean</A></TD>
<TD WIDTH=160><A HREF="ar.htm">Argentina</A></TD>
<TD WIDTH=160><A HREF="am.htm">Armenia</A></TD>
</TR></TABLE>
Alternatively, you might wish to consider the effect of using a table with borders.
Notice that this solution is rather unclean. It involves a TABLE structure where the division into lines is (normally) made for layout purposes only, and adding new items usually requires complete restructuring of the table. You typically need to insert WIDTH attributes to ensure that table columns are of the same width, and the specification is inherently device-dependent since it must be given in pixels. In particular, the pr