HTML

Introduction

HTML is the language of the web. Almost every web page is written in HTML, and web browsers exist specifically to process and display it. However, many websites ignore basic HTML good practice, and some of these lapses directly reduce the speed at which users can interact with the site.

HTML stands for Hypertext Markup Language. Markup languages are ways to annotate a text document with additional information (markup) that sits "on top" of the text, for example saying that X is bold, Y is a link, Z is a heading.

The HTML standard is maintained by the World Wide Web Consortium (http://www.w3.org), who periodically revise and update it. The latest version of the standard at the time of writing is HTML 4.01, released in 1999.

Separating Content and Format

HTML was originally supposed to be a language that described the structure of the document, while the visual appearance was left to the browser. However, website authors wanted more control over the page display and so HTML was expanded to include presentational elements such as font tags. To determine the page layout it became standard practice for web designers to use nested tables, which were originally intended only for displaying data in a spreadsheet format.

These practices persist today, despite advances in browser support for newer web technologies such as style sheets. The designers of HTML now recommend that HTML should be used to specify the structure of the document, and that style sheets should be used to control the appearance:

"As HTML matures, more and more of its presentational elements and attributes are being replaced by other mechanisms, in particular style sheets. Experience has shown that separating the structure of a document from its presentational aspects reduces the cost of serving a wide range of platforms, media, etc., and facilitates document revisions."[1]

Separating content from format generally results in a good page structure that not only makes maintenance easier but also saves bandwidth.
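As a rough illustration of the difference, the sketch below marks up the same heading twice: once with presentational font tags and once structurally, with the appearance moved into a style rule. The heading text, font and colour are invented for the example, and the rule is shown in a style element only to keep the sketch in one piece; on a real site it would normally sit in an external style sheet.

```html
<!-- Presentational: the appearance is repeated in the markup of every heading -->
<font face="Arial" size="5" color="#003366"><b>Latest News</b></font>

<!-- Structural: the markup only says that this text is a heading -->
<h1>Latest News</h1>

<!-- In the head of the document: the appearance is stated once
     and applies to every h1 on the page -->
<style type="text/css">
  h1 { font-family: Arial, sans-serif; color: #003366 }
</style>
```

The structural version is shorter for every heading after the first, and changing the look of all headings means editing one rule rather than every page.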

Versions of HTML

There are three versions of HTML worth discussing in this context:

  • HTML 4.01 Strict
  • HTML 4.01 Transitional
  • XHTML

Strictly speaking, XHTML is not a version of HTML but a separate language in its own right. However, it is very similar to HTML and fulfils the same function. Below we describe each version and explain why we think HTML 4.01 Strict is the best choice.

HTML Strict and Transitional

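There are two main "dialects" of HTML 4.01, called Strict and Transitional, each defined by its own document type definition (DTD). The Strict DTD excludes the deprecated presentational elements and attributes, such as font tags. For backwards compatibility, you can use the Transitional DTD instead, which still allows them. The W3C says: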

"Authors should use the Strict DTD when possible, but may use the Transitional DTD when support for presentation attribute and elements is required."

Using the Strict DTD will force you to write your web pages without the deprecated presentational markup. This pushes the formatting out of the HTML and into style sheets, thereby separating the format from the content.

An important part of standards compliance is to validate your documents, for example using the W3C Validation Service at http://validator.w3.org/. This assures you that you have actually followed the standard correctly.
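A minimal page that declares the Strict DTD might look like the sketch below. The title, text and the style sheet name site.css are placeholders for the example. The document type declaration on the first lines is what tells the validator (and the browser) which dialect the page claims to follow.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Example page</title>
<!-- site.css is a hypothetical external style sheet holding all formatting -->
<link rel="stylesheet" type="text/css" href="site.css">
</head>
<body>
<h1>Example page</h1>
<!-- In Strict, text in the body must sit inside a block-level element -->
<p>This paragraph carries content only; its appearance comes from the style sheet.</p>
</body>
</html>
```

Submitting a page like this to the validation service is a quick way to check that the declaration and the markup agree.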

XHTML

A standard closely related to HTML is XHTML, where the "X" stands for XML. XHTML is a modified version of HTML that is fully XML compatible, so that it can be processed by XML parsers. The syntax of XHTML is simpler and more consistent and therefore easier to parse than HTML. This makes it more attractive for applications involving the automated processing of web pages.

There are two main differences in XHTML. The first is that it allows empty elements like this:

<textarea />

An element written this way both opens and closes itself, instead of using the traditional form:

<textarea></textarea>

Older browsers have problems with this form, and may ignore the tag entirely or misinterpret it.

The second difference is that all elements that could be left unclosed in HTML, such as paragraph p and horizontal rule hr elements, are required to be closed in XHTML. This may increase the size of your document, although not significantly.
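The second difference can be seen by writing the same fragment in both styles. This is only a sketch with invented text: in HTML 4.01 the end tag of a paragraph may be omitted and a horizontal rule has no end tag at all, whereas XHTML needs every element closed.

```html
<!-- HTML 4.01: paragraph end tags omitted, hr has no end tag -->
<p>First paragraph
<p>Second paragraph
<hr>

<!-- XHTML: every element is explicitly closed -->
<p>First paragraph</p>
<p>Second paragraph</p>
<hr />
```

The XHTML form is a few bytes longer per element, which is why the size difference is real but small.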

Which Flavour of HTML?

From a bandwidth perspective there isn't too much difference between XHTML and HTML. XHTML documents can be slightly bigger because all tags must be closed, but not significantly bigger.

Both HTML 4.01 Strict and the Strict form of XHTML force you to separate content from format, which is more likely to result in a bandwidth-saving design.

XHTML is not supported as well by older browsers. Strictly speaking even Internet Explorer 6 doesn't support XHTML, although browsers generally make a reasonable attempt at rendering XHTML and for simple web pages it can be hard to see the differences between HTML and XHTML.

If you are starting from scratch then we would recommend using HTML 4.01 with the Strict DTD. However, if your website is already in XHTML, it would not be worth the effort of changing it to HTML.

Content Development Tools

Reading and writing HTML takes a bit of getting used to, with its many angle brackets. Consequently there are many tools to create and maintain HTML pages, usually with a graphical display that looks like a cross between a word processor and a web browser. However, to truly understand the behaviour of web pages, or to see where optimisations are possible, you should look directly at the HTML source code. Some tools make this easy, while in others it is nearly impossible. Luckily, the source code is in plain text and can be loaded into almost any text editor. All the major web browsers have the option to view the source of a page, if not to edit it.

Not all tools are equal, and some do not produce very good HTML code. Microsoft Word has a particularly bad reputation. While it is possible to create HTML documents with Word, we strongly recommend against doing so. Word inserts a lot of extra information into its HTML documents to prevent loss of data when loading the pages back into the word processor. This information is generally not needed if you are making a website. A simple "Hello World" web page in Word is ten times bigger than it should be.
(Example: Optimising a Web Page Generated by Microsoft Word)

Removing Redundant Information

As with all parts of a website, the easiest way to reduce the size is to remove information that is simply unnecessary. The best way to start this process is to examine your HTML code closely. Look for anything that doesn't immediately relate to your content and ask yourself if it could be shortened or removed entirely.

Meta Tags

HTML meta tags[2] (or meta elements) serve two functions: to provide instructions to browsers and to search engines.

Every meta element has a name and a value. The name is set with the name or http-equiv attribute, and the value with the content attribute. An example of a meta element:

<meta name='description' content='Aptivate aims to enable access to 
 information and communications for all. Our two priority areas: 
 Universal Access to ICTs and Bandwidth Management.'/>

The instructions for search engines include the description, keywords and robots names. The keywords name has been widely abused in the past and is now ignored by many search engines. The robots name provides a way to prevent search engines from crawling your site, but the more efficient method of using a robots.txt file should now be used instead.

Instructions for browsers are generally helpful only when you have no direct control over the web server. They include HTTP header equivalents such as the refresh directive, which substitutes for the ability to send a Refresh: header in the HTTP response; and similarly the meta Content-Type, which allows you to specify the character set of the document without changing the HTTP Content-Type header. If possible, configure the web server to serve content correctly, and remove these browser instructions.

Other meta names, such as Author and Generator, are ignored by web browsers and can be removed.
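As a sketch of how this advice might be applied, the head section below is shown before and after trimming. The generator and author values are invented, and dropping the Content-Type meta element assumes that the web server already sends the correct character set in its HTTP header.

```html
<!-- Before -->
<head>
<title>Example page</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="generator" content="ExampleEditor 1.0">
<meta name="author" content="A. N. Author">
<meta name="keywords" content="example, sample, demonstration">
<meta name="description" content="A short summary of the page for search results.">
</head>

<!-- After: only the description is kept
     (assumes the server sends the character set itself) -->
<head>
<title>Example page</title>
<meta name="description" content="A short summary of the page for search results.">
</head>
```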

Empty Elements

Non-breaking spaces (&nbsp;) can usually be replaced with a space, unless of course the words either side should not be separated. Almost-empty elements like:

<p>&nbsp;</p>

are just being used for presentation (creating white space) and can be replaced by specifying margins or padding with CSS.
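For instance, a spacer paragraph in front of each heading could be replaced by a single margin rule, as in the sketch below. The 2em value and the text are arbitrary choices for the example, and the rule would normally live in an external style sheet rather than a style element.

```html
<!-- Before: an almost-empty paragraph used as a spacer -->
<p>End of the first section.</p>
<p>&nbsp;</p>
<h2>Next Section</h2>

<!-- After: no spacer element in the markup -->
<p>End of the first section.</p>
<h2>Next Section</h2>

<!-- In the head of the document: one rule creates the gap before every h2 -->
<style type="text/css">
  h2 { margin-top: 2em }
</style>
```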

Comments

Comments, such as:

<!-- this is a start of a heading -->

are helpful to web developers but add to the page size and are ignored by browsers. You should therefore remove all comments from your pages before making them public. Optimisation tools such as HTML Tidy[3] can strip comments from HTML source using the hide-comments: yes configuration option. This way you can keep a commented version of your website for development purposes.

White Space

As with comments, white space used to indent nested elements makes HTML source easier to read, but it also adds to the file size. Again, tools such as HTML Tidy[3] can strip white space, allowing you to keep separate development and production versions of your website.
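The sketch below shows the same list as it might appear in a development version and in a stripped production version. It is written by hand to illustrate the idea and is not the exact output of any particular tool; the link targets are invented.

```html
<!-- Development version: commented and indented for readability -->
<ul>
    <!-- main navigation -->
    <li><a href="news/">News</a></li>
    <li><a href="about/">About</a></li>
</ul>

<!-- Production version: comments and indentation removed
     (the labelling comments in this example would be removed too) -->
<ul><li><a href="news/">News</a></li><li><a href="about/">About</a></li></ul>
```

Both versions should display identically, because the white space removed here sits between block-level elements.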

Inline CSS and JavaScript

Inline style and script sections can be moved out to an external file, which will be cached by the browser. The exception to this is for very small quantities of CSS or JavaScript, where the time taken to load the separate file would be greater than the saving gained.
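A head section that pulls its style and script from external files might look like the following sketch. The paths /css/site.css and /js/site.js are hypothetical; the point is that the browser fetches each file once and can reuse its cached copy on every later page.

```html
<head>
<title>Example page</title>
<!-- Shared by every page, so downloaded once and then served from the cache -->
<link rel="stylesheet" type="text/css" href="/css/site.css">
<script type="text/javascript" src="/js/site.js"></script>
</head>
```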

Remove Default Attributes

The HTML 4.01 Specification[4] lists the attributes of each element, along with their default values. Where an attribute has a default value, there is no need to specify that value explicitly. In the example below of a text input field, the type attribute is unnecessary because by default an input element is rendered as a text field.

<input size="31" name="q" type="text" />
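The same idea applies to other elements. In the sketch below, which uses a hypothetical /search address and invented cell text, the method, type, colspan and rowspan attributes all repeat their default values and can be dropped without changing how the page behaves.

```html
<!-- Before: method, type, colspan and rowspan repeat their defaults -->
<form action="/search" method="get">
<p><input size="31" name="q" type="text"></p>
</form>
<table>
<tr><td colspan="1" rowspan="1">Total</td></tr>
</table>

<!-- After: the defaults are left implicit -->
<form action="/search">
<p><input size="31" name="q"></p>
</form>
<table>
<tr><td>Total</td></tr>
</table>
```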

Links

There are a number of techniques for saving the space and time taken up by links, particularly on pages with many links such as home pages and site maps.

Show Link Sizes

If you link to something that is over 75kB you should show the size next to the link. For example:

PDF Report (237kB)
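In the markup this needs nothing more than a short piece of text beside the link. The file path below is made up for the example, and whether the size sits inside or outside the link text is a matter of taste.

```html
<!-- The size is plain text next to the link, so it costs only a few bytes -->
<a href="reports/annual-report.pdf">PDF Report</a> (237kB)
```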

Use Relative URLs

Absolute URLs take up more space than relative URLs. So for example, if we wanted to link internally to a page whose absolute URL was:

http://www.example.org/directories/news/sport/

and the base directory of the document we were linking from was:

http://www.example.org/directories/news/

then the relative URL would be simply:

sport/
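In the page itself, the links might be written as in the sketch below. It assumes the linking document lives at http://www.example.org/directories/news/ as above; the weather and about directories are invented to show the other common relative forms.

```html
<!-- Absolute URL: always works, but is the longest form -->
<a href="http://www.example.org/directories/news/sport/">Sport</a>

<!-- Relative to the current directory:
     resolves to http://www.example.org/directories/news/sport/ -->
<a href="sport/">Sport</a>

<!-- Up one directory, then down:
     resolves to http://www.example.org/directories/weather/ -->
<a href="../weather/">Weather</a>

<!-- Relative to the root of the site:
     resolves to http://www.example.org/about/ -->
<a href="/about/">About</a>
```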

Keep Directory and File Names Short

Long URLs such as this:

http://www.example.org/Regional/Africa/Sao_Tome_and_Principe/Science_and_Environment/

could be abbreviated to this:

http://www.example.org/reg/af/st/SciEnv/

URL Rewriting

You can use URL rewriting on your web server to automate link abbreviations. This allows you to further reduce the size of each link without losing the semantic meaning encapsulated by your directory structure. That meaning is useful both for website editors and for search engine relevance rankings, since some search engines give higher rankings according to keywords in links. Links are given short URLs in the HTML, and the server substitutes longer, more meaningful URLs for them. For example, this relative URL:

<a href="/r/24">

could be set up to resolve to:

http://www.example.org/medicine/

This technique is supported by many servers, including Apache and IIS. In Apache the mod_rewrite module[5] uses regular expressions to handle URL expansion. ISAPI_Rewrite[6] operates in a similar manner for IIS.

Use a Trailing Slash at the End of Directory Links

Put a trailing slash (/) at the end of links to directories, for example:

<a href="http://www.example.org/directoryname">

should be:

<a href="http://www.example.org/directoryname/">

If a user requests a URL that points to a directory but omits the trailing slash, the server typically sends a redirect back to the client telling it to request the URL again with the slash added. On a slow connection, this redirect can add several seconds to the page loading time.

Avoid Using Classes

Even if you use CSS (Cascading Style Sheets), there are ways to use it inefficiently. In theory, every element in your page could be a generic grouping element (div or span) with an appropriate class, but that would not be using the markup language as intended. The document would no longer have much human-readable meaning, and would display very badly without the CSS or on a browser with limited CSS support.

Wherever possible, use standard HTML elements such as paragraphs, images, lists and horizontal rules. When you need to change the appearance of an element, consider using a CSS selector to identify it, instead of giving it a class. CSS selectors are a way of applying style rules to all elements that match a particular pattern.

For example, suppose you want to make the first paragraph after any h2 heading appear in bold. You could write this CSS:

p.bold { font-weight: bold }

and this HTML:

<h2>Second-Level Heading</h2>
<p class="bold">Bold First Paragraph</p>
<p>...</p>

However, you might have a lot of such paragraphs, and repeatedly specifying class="bold" for each one could become tedious and waste bandwidth. A simpler solution is to write a CSS selector which always applies to such paragraphs. This removes the need to specify the class unless you want to override the default.

Here's the revised CSS:

h2 + p { font-weight: bold }
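With that selector in place, the HTML from the earlier example no longer needs the class attribute. The selector matches a paragraph that immediately follows an h2 heading, so the markup can shrink to the sketch below. Some older browsers do not support this kind of selector, so it is worth testing in the browsers your visitors use.

```html
<h2>Second-Level Heading</h2>
<!-- No class needed: this paragraph directly follows the h2, so the rule applies -->
<p>Bold First Paragraph</p>
<!-- This paragraph follows another paragraph, so it is not affected -->
<p>...</p>
```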

These practices should make your site easier to develop, easier to maintain, and help to promote a consistent appearance, which makes it easier to navigate and to use.

Test your site with CSS disabled and make sure that it's easy to read and use, even if it's not pretty.

For more detail on optimising CSS see the Style Sheets chapter.

Incremental Loading

We mentioned in the Introduction that website users are prepared to wait for up to 30 seconds for a page to load if useful data starts to appear within 2 seconds. Even if the rest of the page is loading, visitors to your site might be able to start reading an article or navigate to a different page.

An important part of incremental loading is that the site navigation should load and display as quickly as possible. This includes navigation bars and the most used links within the page. For this reason, the number of navigational links should be kept down to a reasonable number and should load as soon as possible by occurring at the beginning of the file. Remember that style sheets can, and should, be used to specify position and layout. This allows you to write high-level navigation as an "unordered list" (using the ul and li tags in HTML) right near the top of the file, so it loads first, but position your navigation anywhere on the page.
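A sketch of this arrangement is shown below. The nav and content identifiers, the widths and the link targets are all assumptions made for the example, and the style rules would normally sit in an external style sheet. The navigation list comes first in the source, so it arrives first, while the style rules place it on the right-hand side of the page.

```html
<!-- In the head of the document -->
<style type="text/css">
  #nav     { float: right; width: 12em }
  #content { margin-right: 13em }
</style>

<!-- In the body: navigation first, so it loads before the article -->
<ul id="nav">
<li><a href="/">Home</a></li>
<li><a href="/news/">News</a></li>
<li><a href="/contact/">Contact</a></li>
</ul>
<div id="content">
<h1>Page Title</h1>
<p>The main article text follows the navigation in the source.</p>
</div>
```

With the style sheet disabled, the same page still reads sensibly: a list of links followed by the article.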

Modern browsers are capable of reformatting web pages on the fly and loading them incrementally. However, certain page layouts can stop them from doing so.

Complex arrangements of nested tables whose widths are not fixed are difficult to render properly until the browser has seen the entire table. This is because the width of each column depends on all of the others. If the table resizes whilst the page is loading, it will cause the layout to jump about and annoy the user. Use styled div elements with style sheets instead of tables for layout.

You can ensure that data tables will load incrementally by telling the browser the number of columns in the table and the width of each column. This is achieved using the colgroup and col elements.
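A data table that declares its columns up front might look like the sketch below. The column headings, widths and cell values are invented for illustration. Because the browser is told at the start that there are three columns and how wide each one is, it does not have to wait for the last row before it can begin drawing the table.

```html
<table width="100%" summary="Example order with item, quantity and price">
<!-- Three columns declared before any rows arrive -->
<colgroup>
<col width="50%">
<col width="25%">
<col width="25%">
</colgroup>
<thead>
<tr><th>Item</th><th>Quantity</th><th>Price</th></tr>
</thead>
<tbody>
<tr><td>Paper</td><td>10</td><td>5.00</td></tr>
<tr><td>Pens</td><td>4</td><td>2.50</td></tr>
</tbody>
</table>
```

How much this helps depends on the browser. Setting the CSS property table-layout to fixed for the table is another hint in the same direction that may be worth testing.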

Scripts are executed at the point where they appear in the page, so they must normally be downloaded and executed before page loading can continue. This can be avoided by deferring JavaScript loading with the defer attribute, as described in the Interactive Sites chapter.
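In the markup, deferring a script is a single extra attribute, as in the sketch below. The path /js/menu.js is hypothetical. The attribute is only a hint, support for it varies between browsers, and it is only appropriate for scripts that do not write content into the page while it loads.

```html
<head>
<title>Example page</title>
<!-- defer tells the browser that this script does not generate page content,
     so parsing and rendering may continue without waiting for it -->
<script type="text/javascript" src="/js/menu.js" defer></script>
</head>
```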

To slow down page loading and test that it does load incrementally, you can use the Loband Simulator[7] and see it happening. The Firebug[8] extension of Firefox displays the order that the browser loads files and which ones are delaying others.

Framesets

Framesets are a way to split the browser window into separate areas, which normally display different pages. They can seem like an easy way to place a navigation menu onto every page of a site without modifying the main pages, and to reduce bandwidth usage because the menu is not reloaded with each page.

However, framesets have usability problems. They can reduce the amount of available space on a page. They make it difficult or impossible for users to link directly to or bookmark a specific page within the frameset, and they often prevent the browser's back button from working as expected.

Therefore we recommend that you avoid using framesets. Page layouts similar to those provided by framesets can now be achieved using nested styled div elements. If page size is kept to a minimum and style information is stored in a separate file, the amount of extra bandwidth in having to reload the navigation for each page is negligible.

Summary

We make the following recommendations for optimising HTML:

  • Use HTML 4.01 Strict
  • Avoid using framesets
  • Don't use tables to format your pages
  • Use style sheets to separate formatting from content
  • Remove comments and white space using tools like HTML Tidy[3]
  • Navigation and important links should load first
  • Add "trailing slashes" to links to directories
  • Remove default attributes

By following our recommendations, you should be able to create pages that look good and work well, not only on a modern desktop browser, but also on older browsers, mobile devices and accessibility technologies.

Footnotes

[1] HTML 4.01 Specification, W3C, http://www.w3.org/TR/html4/intro/intro.html (retrieved 06/08/2007)

[2] http://www.w3.org/TR/html401/struct/global.html#h-7.4.4

[3] http://tidy.sourceforge.net/

[4] http://www.w3.org/TR/html401/index/attributes.html

[5] http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html

[6] http://www.isapirewrite.com/

[7] http://www.loband.org/loband/simulator.jsp

[8] http://www.getfirebug.com/