Executive Summary
XML (Extensible Markup Language) is a textual data format that represents
information as a hierarchy of tagged elements, similar to a tree. Each XML
document has exactly one root element, and all other elements nest within
it[1]. Elements have names (tags) and may carry attributes and text. XML
was designed to be both human-readable and machine-readable, and it
relies on strict rules for well-formedness (correct syntax) and validity
(meeting a schema or DTD). This report thoroughly covers XML syntax
(elements, attributes, namespaces, processing instructions, CDATA,
comments, entities), best practices (e.g. meaningful tag names[2]), and
contrasts well-formed vs valid XML. It then delves into DTDs (Document
Type Definitions) – the original XML schema mechanism – explaining
internal vs external subsets, declarations for elements, attributes, entities,
notations, and content models (ANY, EMPTY, mixed, sequences/choices with
occurrence indicators). We show parameter entities and sample DTDs, and
outline validating XML against a DTD. Next, XML Schema (XSD) is
examined: its XML-based schema language, built-in/simple/complex types,
elements and attributes, namespaces, <import>, <include>, <redefine>, type
derivation (extension/restriction), substitution groups, and identity
constraints (<xsd:key>, <xsd:unique>, <xsd:keyref>). We provide example
schemas and validation steps. The report also offers practical guidance on
choosing DTD vs XSD (and migrating), common tools (e.g. xmllint, Xerces,
IDE plugins) for validation, and numerous examples. A comparison table
highlights differences in expressiveness, data typing, namespace support,
extensibility, and tooling. Throughout, authoritative sources (W3C specs) are
cited for definitions and rules, and diagrams (e.g. a timeline of XML/XSD and
a node-tree illustration) clarify structure. The content is organized with clear
headings and concise paragraphs, suitable for learning or teaching XML
concepts deeply and rigorously.
Figure: Timeline of XML and Schema Standards
1998: XML 1.0 (W3C Rec.)
1999: XML Namespaces 1.0 (W3C Rec.)
2001: XML Schema 1.0 (W3C Rec., Parts 1 & 2)
2004: XML Schema 1.0 Second Edition (W3C Rec., Parts 1 & 2)
2012: XML Schema 1.1 (W3C Rec., Parts 1 & 2)
XML Elements, Attributes, and Document Structure
An XML document is structured as nested elements. Each element has a
start-tag <name> and an end-tag </name> (or can be empty with <name/>). For
example:
<note> ... </note>
Here <note> is the root element. Elements may contain text, other child
elements, or be empty. By definition, an XML document must have exactly
one root (document) element, and all other elements must nest properly
within it[1]. This is why XML data naturally forms a tree structure (see
diagram below).
Figure: XML document as a node-tree (root element with children and text
nodes). XML’s hierarchical structure enforces a single root and proper
nesting[1].
Example: In the simple XML below, <bookstore> is the root, containing
multiple <book> children, each with their own subelements and attributes:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
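As a rough illustration of the tree structure, the sketch below loads a shortened copy of the bookstore document with Python's standard xml.etree.ElementTree module and walks the children of the root. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bookstore example (author and year omitted).
BOOKSTORE = """\
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE)      # the single root element
titles = []
for book in root:                    # iterate over direct children of the root
    title = book.find("title")       # first <title> child of this <book>
    titles.append((book.get("category"), title.text, title.get("lang")))

print(root.tag)
print(titles)
```

Attributes are read with get(), while text content is read from the text property of the element that owns it.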
Each element name (tag) must follow XML naming rules: the first character
must be a letter, an underscore (_), or a colon (best left to namespaces), and
names beginning with “xml” (in any case) are reserved[1][2]. After the first character, letters, digits, hyphens,
underscores, and dots are allowed. Authors should choose meaningful names
(words or combinations) and avoid spaces or purely symbolic names[2]. For
example, use <invoiceDate> instead of <d>. Certain symbols (like <, &) cannot
appear in names because they serve as markup delimiters.
Attributes are name–value pairs inside a start-tag, used to add metadata to
an element:
<book id="bk101" available="true">
Here id and available are attributes. Attribute values must be quoted
(single or double quotes), and an attribute name may appear only once per
element. Attribute order is not significant. A non-validating parser accepts
any attribute name and passes it to the application; only a DTD or schema
restricts which attributes may appear (see below). Attribute types available
in a DTD include CDATA (any text), ID, IDREF, ENTITY, etc. We cover attribute
declarations under DTD and XSD below.
Character Data & Entities: Text between tags is called character data.
Certain characters have special meanings in XML (<, >, &, quotes). To include
them as literal data, use entity references or CDATA sections (below).
The predefined entities are &lt; for <, &gt; for >, &amp; for &, &apos; for
', and &quot; for "[3]. For example, to write 5 < 10 in XML, one could write:
<note>5 &lt; 10</note>
or use a CDATA section:
<note><![CDATA[5 < 10]]></note>
CDATA sections (marked <![CDATA[ ... ]]>) tell the parser to treat enclosed
text literally (so < and & need not be escaped)[3]. They cannot nest, so the
sequence ]]> must not appear inside a CDATA block[4][3]. CDATA is useful
for embedding chunks of text that include characters which would otherwise
be seen as markup.
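The equivalence of the two forms can be checked with a short sketch using Python's standard library. It parses both variants and also shows xml.sax.saxutils.escape, which replaces &, < and > so that arbitrary text can be placed in element content.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

escaped = ET.fromstring("<note>5 &lt; 10</note>")
cdata = ET.fromstring("<note><![CDATA[5 < 10]]></note>")

# Both forms yield the same character data once parsed.
same_text = escaped.text == cdata.text

# escape() produces text that is safe to embed between tags.
safe = escape("AT&T: 5 < 10")

print(same_text)
print(safe)
```

Which form to use is a matter of readability; the parsed result is the same.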
Comments and Processing Instructions: XML supports comments (<!--
comment text -->) anywhere in a document outside other markup[5]. Comments
cannot contain the sequence -- (double hyphen). They are notes for human
readers; a parser may discard them or expose them to the application.
Processing instructions (PIs) allow embedding information for applications.
They take the form <?target ...?>. For example,
<?xml-stylesheet type="text/xsl" href="[Link]"?> can link an XSLT
stylesheet. Per the XML specification, a PI begins with a target (a name
identifying the application it is addressed to) and continues until ?>; PIs
are not part of the document’s character data but must be passed through to
the application[6].
Namespaces: To avoid naming collisions when mixing vocabularies, XML
uses namespaces. A namespace is identified by a URI; element/attribute
names can be placed in a namespace via prefixes or default declarations[7].
For example:
<root xmlns:h="[Link]"
xmlns:f="[Link]">
<h:table><h:tr><h:td>Apples</h:td></h:tr></h:table>
<f:table><f:name>African Coffee Table</f:name></f:table>
</root>
Here h: and f: are prefixes bound to URIs. The W3C Namespaces spec
defines: “An XML namespace is identified by a URI reference; element and
attribute names may be placed in an XML namespace using the mechanisms
described…”[7]. In practice, this means you declare xmlns:prefix="URI" on
an element; the prefix is then in scope for that element and its descendants,
and only names written with the prefix belong to the namespace. A default
namespace (no prefix) is set with xmlns="URI" and applies to unprefixed
element names, but not to unprefixed attributes. DTDs are not namespace-aware
(another reason to prefer XSD, as we will see).
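A minimal sketch of how a namespace-aware parser sees such a document, using Python's xml.etree.ElementTree. The two namespace URIs are hypothetical placeholders chosen for this illustration. ElementTree reports each qualified name as {URI}localname, so the two table elements are clearly distinct.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used only for this illustration.
HTML_NS = "http://example.com/html"
FURNITURE_NS = "http://example.com/furniture"

DOC = f"""\
<root xmlns:h="{HTML_NS}" xmlns:f="{FURNITURE_NS}">
  <h:table><h:tr><h:td>Apples</h:td></h:tr></h:table>
  <f:table><f:name>African Coffee Table</f:name></f:table>
</root>"""

root = ET.fromstring(DOC)

# Each child tag carries its namespace URI, not its prefix.
tags = [child.tag for child in root]

# Lookups take a prefix-to-URI mapping; the prefixes here are local choices.
name = root.find("f:table/f:name", {"f": FURNITURE_NS}).text

print(tags)
print(name)
```

The unprefixed root element is in no namespace, because the document declares no default namespace.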
Best Practices – Tag Naming and Structure: XML tags should be
meaningful words (e.g. <customerAddress>, not <cAddr>), written in one
consistent casing convention (lower case, camelCase, or PascalCase). Avoid
spaces or punctuation in names[2]. Structure your
document logically (e.g. group related elements under a container element).
By convention, use one element per concept and nest semantically. Also,
minimize redundant levels. E.g., for a list of items, use
<items><item>…</item></items>. The tree structure should be clear, with a
single root. And always quote attribute values and close empty tags with />.
For example: <br/> instead of <br> in XML.
Well-Formed vs Valid XML
A well-formed XML document obeys XML syntax rules: one root element,
properly nested tags, all tags closed, attributes quoted, no illegal characters,
etc[1][8]. It need not conform to any schema or DTD, but it must satisfy
XML’s grammar. For example, this is well-formed XML (but has no schema,
so it’s not valid):
<?xml version="1.0"?>
<greeting>Hello, world!</greeting>
A valid XML document is well-formed and complies with the constraints
defined in its DTD or XML Schema. The W3C XML spec states: “An XML
document is valid if it has an associated document type declaration and if
the document complies with the constraints expressed in it.”[8]. That means
the document’s elements and attributes appear in allowed contexts, with
correct content and datatypes as dictated by the DTD/XSD.
Example – Well-Formed but Not Valid:
Given the above <greeting> example, it has no <!DOCTYPE> or schema
declaration, so it is well-formed but not valid. Alternatively, if a DTD required
two child elements <to> and <from>, but the XML omitted one, it would still
be well-formed but fail validation.
Example – Not Well-Formed:
<note><to>Tove</to><from>Jani</to></note>
This is not well-formed (the <from> tag is closed with </to>). The parser will
report an error. An unescaped & has the same effect: <text>AT&T</text> is not
well-formed; it must be <text>AT&amp;T</text>.
Key Distinctions:
- Well-formedness is strictly syntactic and can be checked by any XML
parser.
- Validity requires a schema/DTD and ensures the content model and data
types match expectations.
In summary, any valid XML is necessarily well-formed, but a well-formed
document isn’t automatically valid (it must explicitly reference a DTD or
Schema and meet its rules).
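Because well-formedness is purely syntactic, any parser can check it. The sketch below wraps Python's xml.etree.ElementTree in a hypothetical helper, is_well_formed, and applies it to the examples from this section. It says nothing about validity, since this parser does not validate.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, False on any syntax error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<greeting>Hello, world!</greeting>"))
print(is_well_formed("<note><to>Tove</to><from>Jani</to></note>"))
print(is_well_formed("<text>AT&T</text>"))
print(is_well_formed("<text>AT&amp;T</text>"))
```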
XML: Processing Instructions, CDATA, and Comments
Beyond elements and attributes, XML supports several other constructs:
Processing Instructions (PI): As noted, PIs start with <? and end
with ?>. They are intended for applications. E.g.,
<?xml-stylesheet type="text/xsl" href="[Link]"?>
Here xml-stylesheet is the target. PIs carry instructions for
applications; they are passed to the application but are not part of
the element content[6]. Target names matching “xml” (in any case) are
reserved; the <?xml ...?> line at the top of a document is the XML
declaration, which looks like a PI but is not one.
CDATA Sections: Introduced above. Useful for embedding chunks of
text (like code snippets) that contain < or &. CDATA is not markup –
only the closing ]]> is recognized. For example:
<script><![CDATA[
if (a < b && c > d) alert("X");
]]></script>
Within the CDATA, characters < and & do not need escaping. The spec
notes that within CDATA “only ]]> is recognized as markup”[3].
Comments: Written <!-- comment -->, comments can appear almost
anywhere outside other markup[5]. They are strictly for humans, not
processed by XML parsers. Per the spec, “Comments may appear
anywhere in a document outside other markup; in addition, they may
appear within the DTD.”[5]. Comments must not contain -- (double
hyphen). Example:
<!-- This is a comment explaining the note element -->
<note> … </note>
Entity References: Apart from the five predefined entities, you can
define general entities (text substitutions) in a DTD (e.g. <!ENTITY
writer "Donald Duck">) and use them like &writer;. These allow
reusing text or marking reserved characters. We’ll see entity
declarations under DTD.
These constructs do not affect the element hierarchy but are important for
embedding instructions, text blocks, or annotations.
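To see that these constructs sit beside the element hierarchy rather than inside it, the sketch below parses a small document with Python's xml.dom.minidom and lists the top-level nodes. The stylesheet name style.xsl is a hypothetical placeholder.

```python
from xml.dom import minidom

# style.xsl is a made-up file name for illustration.
DOC = """\
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<!-- This is a comment explaining the note element -->
<note>Hello</note>"""

doc = minidom.parseString(DOC)

kinds = []
for node in doc.childNodes:
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        kinds.append(("pi", node.target))          # target names the application
    elif node.nodeType == node.COMMENT_NODE:
        kinds.append(("comment", node.data.strip()))
    elif node.nodeType == node.ELEMENT_NODE:
        kinds.append(("element", node.tagName))

print(kinds)
```

A DOM-style API exposes comments and PIs as their own node types, so an application can read or skip them as it sees fit.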
DTD (Document Type Definition)
A DTD defines the legal structure and vocabulary (elements, attributes,
entities) of a class of XML documents. It can be internal (inside <!DOCTYPE>
in the same XML file) or external (in a separate .dtd file referenced by a
SYSTEM or PUBLIC identifier).
DOCTYPE Declaration
The DTD is declared with a DOCTYPE. Example of internal DTD in an XML file:
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to><from>Jani</from><heading>Reminder</heading><body>…</body>
</note>
This declares note as the root element containing to, from, heading, body,
each text-only (#PCDATA)[9]. You can also use an external DTD:
<!DOCTYPE note SYSTEM "[Link]">
with [Link] containing those <!ELEMENT> lines.
Element Declarations
An element declaration has the form <!ELEMENT name contentspec>, where
contentspec describes what children or text are allowed. Content models
include:
- EMPTY (element must be empty, no content), e.g. <!ELEMENT br EMPTY>[10].
- ANY (any content allowed, not usually recommended), e.g. <!ELEMENT
container ANY>[11].
- A sequence or choice of sub-elements: (a,b,c) means element “a” then “b”
then “c” in order; (a|b|c) means a choice of one of those. E.g. <!ELEMENT
para (text|emph)*> means a paragraph can have any number of text or
<emph> children in any order[10].
- Mixed content: (#PCDATA | child1 | child2)* allows text interspersed with
specified child elements. E.g. <!ELEMENT p (#PCDATA|a|i|b)*> allows text and
<a>, <i>, <b> in any order[12].
You can use occurrence indicators: ? (0 or 1), * (0 or more), + (1 or more)
after an element or a group. For example:
<!ELEMENT person (firstname, lastname, phone*)>
means <person> must contain exactly one <firstname>, one <lastname>, and
zero or more <phone> children. The spec details that an element model is
essentially a regular expression over child element types[13].
Examples:
<!ELEMENT list (item+)>  <!-- list has one or more item elements -->
<!ELEMENT section (title, (p|list)*, note?)>  <!-- title, then any number of p or list, then optional note -->
<!ELEMENT emptyExample EMPTY>  <!-- emptyExample must be empty -->
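The remark that a content model is essentially a regular expression over child element types can be made concrete. The sketch below is a simplified illustration, not how any particular parser works: it translates an element-only content model into a Python regular expression and matches it against a list of child names. It does not handle #PCDATA, EMPTY, or ANY, and the function names are hypothetical.

```python
import re

def content_model_to_regex(model: str) -> re.Pattern:
    """Translate an element-only DTD content model into a regex.

    Each child name becomes a group matching "name "; a comma is plain
    concatenation; |, ?, * and + keep their regex meaning.
    """
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.\-]*|[(),|?*+]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue                      # sequence = concatenation
        elif token in ")|?*+":
            parts.append(token)
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))

def children_match(model: str, children: list) -> bool:
    """Check a list of child element names against a content model."""
    text = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(text) is not None

SECTION = "(title, (p|list)*, note?)"
print(children_match(SECTION, ["title", "p", "list", "p"]))
print(children_match(SECTION, ["p", "title"]))
print(children_match("(item+)", []))
```

The trailing space after each name keeps a name such as p from matching the start of a longer name such as para.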
Attribute-List Declarations
After element declarations, you can declare attributes for an element with
<!ATTLIST>. Syntax: <!ATTLIST element-name attrName attrType defaultDecl>.
For example:
<!ATTLIST person
id ID #REQUIRED
lang CDATA #IMPLIED
status (single|married|divorced) "single">
This says <person> has an id attribute of type ID (a unique XML ID) and it is
required, a lang attribute of type CDATA (optional), and a status attribute
whose value must be one of the enumerated choices; if omitted, the default
is "single". Key points from the spec: ID values must be unique document-
wide[14], and an element may have at most one ID attribute[15]. Other
types include IDREF(S), ENTITY(IES), NMTOKEN(S), NOTATION, as defined in XML
1.0[16][17].
Default declarations control optional/required status:
- #IMPLIED means optional.
- #REQUIRED means it must appear.
- #FIXED "value" means if present, it must equal that value; if absent, it is as
if that value is used anyway.
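The effect of a default value can be observed even without a validating parser. In the sketch below, Python's xml.etree.ElementTree (built on the non-validating expat parser) reads an internal DTD subset and supplies the declared default for status. Because the parser does not validate, it would not complain if the #REQUIRED id attribute were missing.

```python
import xml.etree.ElementTree as ET

DOC = """\
<!DOCTYPE person [
  <!ELEMENT person EMPTY>
  <!ATTLIST person
    id     ID                        #REQUIRED
    lang   CDATA                     #IMPLIED
    status (single|married|divorced) "single">
]>
<person id="p1"/>"""

person = ET.fromstring(DOC)

# status was not written in the document; the DTD default fills it in.
print(person.get("status"))
# lang is #IMPLIED with no default, so it is simply absent.
print(person.get("lang"))
```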
Entity Declarations
Entities are reusable pieces of text. Two kinds exist: general entities (used in
content) and parameter entities (used in DTD content). A general entity is
declared like:
<!ENTITY writer "Donald Duck">
<!ENTITY copy "©2026">
Then &writer; or &copy; can be used in the XML content. These are often
used for special characters or common phrases[18]. Numeric character
references (like &#169; for ©) are also allowed.
Parameter entities (used in DTD only) start with %. For example:
<!ENTITY % htmlstruct "(head, body)">
<!ELEMENT html %htmlstruct;>
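This lets you factor out DTD definitions. Note that a parameter-entity
reference inside a declaration, as in the second line, is only permitted in
the external subset; in the internal subset, such references may appear only
between declarations. Parameter entities and external subsets allow modular
DTDs. Detailed rules about parameter entity expansion and nesting are in the
XML spec[19][20].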
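General entity expansion can be seen with a short sketch. Python's xml.etree.ElementTree expands entities declared in the internal subset as it parses, along with numeric character references. The entity names follow the examples above; the copy value is written here with a numeric reference.

```python
import xml.etree.ElementTree as ET

DOC = """\
<!DOCTYPE note [
  <!ENTITY writer "Donald Duck">
  <!ENTITY copy "&#169;2026">
]>
<note>&writer; &copy; &#169;</note>"""

note = ET.fromstring(DOC)

# The application sees only the replacement text, never the references.
print(note.text)
```

The application cannot tell from the parsed text which characters came from an entity and which were typed directly.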
Notation Declarations
Notations declare formats for non-XML (binary) data referenced in attributes.
E.g.: <!NOTATION imgpng SYSTEM "image/png">. Notation attributes tie an
element to an external data type. These are rare in typical XML usage but
part of the DTD spec[21][17].
Content Models in DTDs
Key content model keywords: EMPTY, ANY, and mixed models with
(#PCDATA|...)*. Sequences and choices use commas and pipes respectively,
within parentheses. Example content model forms:
- Sequence: (title, author, year)
- Choice (one of): (yes|no|maybe)
- Mixed (text+elements): (#PCDATA|b|i|u)*
- Occurrence: (item)+, child?, etc.
DTD content models must be deterministic (unambiguous): the parser should
always know, reading left to right, which element type to expect. Non-
deterministic models (like (a|b|a)) are not allowed[22].
Example DTD and Validation
Consider the earlier <note> example. The DTD is:
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
To validate an XML file ([Link]) against this DTD ([Link]), a tool like
xmllint can be used:
xmllint --noout --dtdvalid [Link] [Link]
If [Link] is well-formed and follows the DTD, xmllint will produce no
output (no errors). If invalid, it will report which element/attribute is wrong.
For example, if <note> had <to> and <heading> only, it might say:
Element note content does not follow the DTD, expecting
(to,from,heading,body).
Step-by-step, a validating parser reads the DTD, constructs the grammar,
then parses the XML, checking each element’s children and attributes
against the declarations.
Well-Formedness Constraints (DTD Section)
Even in a DTD context, the XML document must be well-formed: properly
nested tags, matching start/end tags, unique attribute names in one
element, etc. The DTD adds extra constraints but does not override basic
XML syntax rules.
XML Schema (XSD)
XML Schema (often called XSD) is a W3C standard (Parts 1 and 2) for
defining XML document structure and data types in XML syntax. It is far more
powerful and expressive than DTDs[23]. Schemas are themselves XML
documents, and use XML Namespaces (usually the namespace
[Link]).
Basics of XSD
A schema document begins with <xs:schema> (or <xsd:schema>) with
namespace declarations, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
xmlns:xs="[Link]"
targetNamespace="[Link]"
xmlns="[Link]"
elementFormDefault="qualified">
...
</xs:schema>
- targetNamespace declares the namespace of elements defined in this
schema.
- elementFormDefault="qualified" means local element names must be
qualified with this namespace.
- The schema uses <xs:element>, <xs:complexType>, <xs:simpleType>, etc., to
declare the valid structure.
XML schemas support namespaces, enabling mixing multiple schemas. You
can <xs:import> another namespace’s schema, <xs:include> another
schema in the same namespace, or <xs:redefine> to override declarations.
Elements and Types
In XSD, you declare elements either globally or locally. A global element is
top-level and can be referenced by name. For example:
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
This defines a global element <note> whose content is a complex type: a
sequence of four string elements to, from, heading, body. Attributes can be
declared inside the complexType with <xs:attribute>.
Simple Types: If an element has only text, with no sub-elements and no
attributes, it has a simple type. You can use built-in types like xs:string,
xs:int, xs:date, xs:boolean, etc., or derive a custom simple type by
restricting a base type (or by building a list or union of other types).
For example:
<xs:simpleType name="ZipCodeType">
<xs:restriction base="xs:string">
<xs:pattern value="\d{5}(-\d{4})?"/>
</xs:restriction>
</xs:simpleType>
<xs:element name="zip" type="ZipCodeType"/>
This defines ZipCodeType as a string matching US ZIP code patterns (five
digits, optionally followed by a hyphen and four more digits). The element
<zip> uses that type.
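A pattern facet must match the entire value, which corresponds to a full match in most regex libraries. The sketch below reproduces the ZipCodeType check in Python; is_valid_zip is a hypothetical helper name, and the approach only carries over to patterns whose syntax is shared by XSD and Python regular expressions.

```python
import re

ZIP_PATTERN = r"\d{5}(-\d{4})?"

def is_valid_zip(value: str) -> bool:
    """Mimic the xs:pattern facet: the whole value must match."""
    return re.fullmatch(ZIP_PATTERN, value) is not None

print(is_valid_zip("12345"))
print(is_valid_zip("12345-6789"))
print(is_valid_zip("123456789"))
```

Nine digits without the hyphen are rejected, since the pattern requires the separator.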
Schemas distinguish simpleType (no child elements, just a value) vs
complexType (can contain elements and attributes)[23].
Example: A complex type with attributes:
<xs:complexType name="PersonType">
<xs:sequence>
<xs:element name="FirstName" type="xs:string"/>
<xs:element name="LastName" type="xs:string"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="required"/>
<xs:attribute name="lang" type="xs:string" default="en"/>
</xs:complexType>
<xs:element name="Person" type="PersonType"/>
This says <Person> has subelements <FirstName> and <LastName> and
attributes id (must be unique ID) and optional lang (default "en").
Derivation (Extension and Restriction)
XSD lets you derive new types from existing ones. For complex types, you
can extend by adding elements/attributes, or restrict by removing or
narrowing them. For simple types, you typically restrict (by pattern,
maxLength, enumerations, etc.). Example of extension:
<xs:complexType name="EmployeeType">
<xs:complexContent>
<xs:extension base="PersonType">
<xs:sequence>
<xs:element name="Position" type="xs:string"/>
</xs:sequence>
<xs:attribute name="salary" type="xs:decimal"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
This EmployeeType extends PersonType by adding a <Position> element and a
salary attribute.
Substitution Groups
XSD supports substitution groups, where one global element can be
substituted by others. For instance, if <payment> is in a substitution group
with <credit> and <debit>, an element <payment> in the XML could legally be
<credit> or <debit> instead. This is an advanced feature enabling
polymorphism. A sub-group is declared by substitutionGroup="headElement"
on the substituting element.
Identity Constraints (key, unique, keyref)
XML Schema allows you to enforce relational constraints:
- <xs:unique name="uniqueID"> ensures a field or combination of fields is
unique among all selected elements.
- <xs:key name="keyName"> is similar to unique but also requires the field is
non-null (like a primary key).
- <xs:keyref name="keyRef"> says a field’s values must match some key’s
values (like a foreign key).
These use XPath-like selectors and fields. For example:
<xs:element name="catalog">
<xs:complexType>
<xs:sequence>
<xs:element name="book" maxOccurs="unbounded">
<xs:complexType>
<xs:attribute name="isbn" type="xs:string" use="required"/>
<xs:attribute name="title" type="xs:string"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:unique name="uniqueISBN">
<xs:selector xpath="book"/>
<xs:field xpath="@isbn"/>
</xs:unique>
</xs:element>
This enforces each <book> in <catalog> must have a unique isbn attribute. A
<keyref> could enforce a reference from one element’s field to another’s key
field (e.g. an order referencing a customer key).
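What xs:unique checks can be approximated by hand. The sketch below is an illustration of the idea, not a schema processor: it takes a scope element, a selector path, and an attribute name, and reports values that occur more than once. The helper name duplicate_values and the sample ISBN values are made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_values(scope, selector, attribute):
    """Return attribute values that appear on more than one selected element.

    Elements lacking the attribute are skipped, as xs:unique only
    constrains elements for which the field is present.
    """
    values = [el.get(attribute) for el in scope.findall(selector)
              if el.get(attribute) is not None]
    return sorted(v for v, count in Counter(values).items() if count > 1)

catalog = ET.fromstring("""\
<catalog>
  <book isbn="111" title="A"/>
  <book isbn="222" title="B"/>
  <book isbn="111" title="C"/>
</catalog>""")

print(duplicate_values(catalog, "book", "isbn"))
```

An empty result corresponds to a document that satisfies the uniqueness constraint.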
Namespaces in Schema and Documents
Schemas usually define xmlns:xs="[Link]". An
XML document referencing the schema uses xsi:schemaLocation or
xsi:noNamespaceSchemaLocation to indicate which schema to use. For
example:
<note xmlns:xsi="[Link]"
xsi:noNamespaceSchemaLocation="[Link]">
...
</note>
or if using a targetNamespace:
<ns:note xmlns:ns="[Link]"
xmlns:xsi="[Link]"
xsi:schemaLocation="[Link] [Link]">
...
</ns:note>
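The value of xsi:schemaLocation is a whitespace-separated list of pairs, each a namespace URI followed by a schema location. A minimal sketch of reading those pairs in Python follows. The example.com namespace and the note.xsd file name are hypothetical; the XSI constant is the standard XML Schema instance namespace.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def schema_locations(root):
    """Map each namespace URI to its schema location hint."""
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    return dict(zip(tokens[0::2], tokens[1::2]))

# Hypothetical namespace and file name, for illustration only.
doc = ET.fromstring("""\
<ns:note xmlns:ns="http://example.com/note"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://example.com/note note.xsd"/>""")

print(schema_locations(doc))
```

These attributes are hints; a validator may be configured to ignore them and use a schema supplied by other means.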
Example Schema and Validation
Consider the earlier <note> example. An equivalent XSD might be:
<!-- [Link] -->
<xs:schema xmlns:xs="[Link]">
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
With an XML file ([Link]):
<?xml version="1.0"?>
<note xmlns:xsi="[Link]"
xsi:noNamespaceSchemaLocation="[Link]">
<to>Tove</to><from>Jani</from><heading>Reminder</heading><body>...</body>
</note>
To validate using xmllint (for example):
xmllint --noout --schema [Link] [Link]
If the content matches the schema, no error is reported. If an element is
missing or an unexpected element appears, you get a validation error along
the lines of "Element 'note': Missing child element(s). Expected: 'body'."
(exact wording varies by tool). Type mismatches are reported too:
<to>123</to> with type xs:string is fine, but <year>abc</year> with type
xs:int fails because 'abc' is not a valid value for xs:int.
Built-in and Custom Data Types
XSD Part 2 provides a rich hierarchy of built-in types (numeric, string,
date/time, binary, etc.)[23]. There are primitive types like string,
decimal, boolean, date, and derived types like integer, positiveInteger,
normalizedString, token, etc. You can constrain these with facets:
minInclusive, pattern, length, etc. For example, to restrict a year:
<xs:simpleType name="YearType">
<xs:restriction base="xs:gYear">
<xs:minInclusive value="2000"/>
<xs:maxInclusive value="2025"/>
</xs:restriction>
</xs:simpleType>
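The two range facets amount to a lexical check followed by a bounds check. The sketch below mirrors YearType in Python under a simplifying assumption: it accepts only four-digit years without a time zone, whereas xs:gYear also allows a time zone suffix. The function name is hypothetical.

```python
import re

def is_valid_year_type(text: str) -> bool:
    """Approximate YearType: a four-digit year between 2000 and 2025."""
    if re.fullmatch(r"[0-9]{4}", text) is None:   # simplified gYear lexical form
        return False
    return 2000 <= int(text) <= 2025              # minInclusive / maxInclusive

print(is_valid_year_type("2005"))
print(is_valid_year_type("1999"))
print(is_valid_year_type("abc"))
```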
Unions and lists allow combining types: a union might allow an element to be
one of several types, and a list allows space-separated lists of values. See
the XML Schema spec for full details[24][25].
Schema Extensibility and Versioning
XML Schema is extensible. You can <xs:include> to split a schema into files,
<xs:import> for multiple namespaces, and <xs:redefine> to alter included
components. The schema element carries an optional version attribute, and the
final and block attributes (with schema-wide defaults finalDefault and
blockDefault) can prevent further derivation or substitution where needed.
XSD borrows object-oriented ideas: types can be derived from one another and
reused.
Example – Derived Types
As an example of type derivation, the W3C XSD tutorial shows creating a
Dutch_ZIP_Code type by restricting xs:string with a pattern[26] (here’s a
paraphrase):
<xs:simpleType name="Dutch_ZIP_Code">
<xs:restriction base="xs:string">
<xs:pattern value="\d{4} {0,1}[A-Z]{2}"/>
</xs:restriction>
</xs:simpleType>
Then an element can use type="Dutch_ZIP_Code". The schema is checked
with XML tools (like xmllint --schema) ensuring each <zip> matches that
regex. This illustrates the power of XSD over DTD for typed data.
DTD vs XML Schema (XSD) Comparison
Feature | DTD | XML Schema (XSD) | Notes
--------|-----|------------------|------
Syntax | SGML-based (non-XML syntax) | XML-based (uses namespaces) | XSD is itself an XML document
Data Typing | None (all content is text) | Rich built-in types (string, integer, date, etc.)[23]; custom types via restriction/extension | e.g. validate number format, dates
Namespaces | Not supported | Full support; targetNamespace, import/include allowed | DTDs are not namespace-aware
Element Models | ANY, EMPTY, mixed, sequence, choice[27][12] | All of the above plus <xs:sequence>, <xs:choice>, <xs:all> in complex types | XSD can express more complex models
Attributes | Basic (CDATA, ID, IDREF, ENTITY, NMTOKEN, NOTATION)[16][17] | Any simpleType (incl. lists/unions); default values, fixed values, required/optional via use | E.g. enforce integer vs arbitrary text
Extensibility | Limited; no inheritance of element types | High: complexType inheritance (extension/restriction), substitutionGroups | Supports reuse and libraries
Identity Constraints | ID/IDREF only | Yes: <xs:key>, <xs:unique>, <xs:keyref> support relational integrity | Unique IDs, foreign-key-like constraints
Validation Complexity | Can only check structure/order; no datatype checks | Validates both structure and data (patterns, ranges)[23] | e.g. pattern facets, numeric ranges
Mixed Content | Allowed via mixed (#PCDATA|...)*[12] | Explicitly supported via mixed="true" in complexType; more flexible |
Entities | Supports general and parameter entities; notations | Cannot declare entities; documents that need them must still use a DTD | Parameter entities allow DTD macros
Tooling & Adoption | Older; most XML parsers support it natively; simpler to learn | Modern; wide tool support (xmllint, Xerces, .NET, etc.)[23]; steeper learning | Newer and more powerful
Expressiveness | Simpler (no data typing, no namespaces) | More expressive (types, namespaces, modularity)[23] | Close to a superset of DTD capabilities
As one expert notes, “an XML Schema provides [the structure defined by a
DTD] plus a detailed way to define what the data can and cannot contain. It
provides far more control for the developer over what is legal, and it
provides an Object Oriented approach”[28]. The table above underscores that
XSD covers nearly everything DTDs offer and adds datatypes, namespaces, and
inheritance[23].
When to Use DTD vs XSD
There remain scenarios for DTDs: legacy systems, simplicity, or
performance. XSDs (especially complex or large schemas) are more verbose
and can be slower to load and parse. SitePoint observes that DTDs are “mature and complex”
with existing libraries available, whereas XML Schema validation can impose
startup overhead (loading namespaces, DTDs for the schema itself, etc.)[29].
In environments where every millisecond counts or where only simple
validation is needed, a DTD might suffice. However, for most modern
applications (especially involving data exchange, web services, or where
data types matter), XSD is preferred for its precision.
The consensus:
- Use DTD if you need a quick, simple declaration of element structure, or if
working with old tools/standards that only support DTD.
- Use XML Schema if you need strong typing, namespace support, or
complex constraints (and have tools that support it).
Migrating from DTD to XSD
To convert a DTD to XML Schema, you can often use automated tools (e.g.
trang, OxygenXML converter) or manually rewrite:
- Declare a schema with the same root element.
- For each <!ELEMENT>, create an <xs:element> and an <xs:complexType> (with
<xs:sequence>/<xs:choice>) matching the DTD model.
- DTD attribute types (CDATA, ID, enums) map to XSD types (xs:string, xs:ID,
xs:NMTOKEN/enum restrictions).
- Entities and notation declarations have no direct schema equivalent;
hardcode entity values or use CDATA as needed.
As practical guidance, start by generating a basic schema from the DTD
(some XML editors can import DTD), then refine types and add namespaces.
Validate iteratively.
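The manual steps above can be sketched for the simplest case. The toy converter below handles only text-only elements, (#PCDATA), and flat comma-separated sequences, which is enough for the note DTD. It ignores attributes, choices, occurrence indicators, and recursive models, and dtd_to_xsd is a hypothetical name; real conversions are better left to dedicated tools.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

NOTE_DTD = """\
<!ELEMENT note (to, from, heading, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
"""

def dtd_to_xsd(dtd_text: str, root_name: str) -> str:
    """Convert a tiny DTD subset: (#PCDATA) and flat sequences only."""
    models = dict(re.findall(r"<!ELEMENT\s+(\S+)\s+\(([^()]*)\)\s*>", dtd_text))
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")

    def declare(parent, name):
        element = ET.SubElement(parent, f"{{{XS}}}element", {"name": name})
        model = models[name].strip()
        if model == "#PCDATA":
            element.set("type", "xs:string")       # text-only element
            return
        complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
        sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
        for child in model.split(","):
            declare(sequence, child.strip())       # nested local declaration

    declare(schema, root_name)
    return ET.tostring(schema, encoding="unicode")

xsd_text = dtd_to_xsd(NOTE_DTD, "note")
print(xsd_text)
```

The output declares the children locally inside note, which matches the style of the hand-written schema shown earlier.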
Validation Tools and Commands
A number of tools can check XML well-formedness and validity against DTDs
or XSDs:
- xmllint (libxml2): Common on UNIX/Linux.
  - Check well-formedness: xmllint --noout [Link] (no output = well-formed).
  - Validate against a DTD: xmllint --noout --dtdvalid [Link] [Link].
  - Validate against an XSD: xmllint --noout --schema [Link] [Link].
- Xerces (Apache): A Java-based parser. It can be run from the command line
  through its bundled sample programs or called from code. Validates against
  both DTD and XSD.
- XML IDEs/Editors: e.g. Oxygen XML Editor, XMLSpy, or even Visual
  Studio/VS Code with XML plugins. They highlight validation errors against
  a DTD or XSD.
- Online Validators: Many websites allow pasting XML and XSD/DTD to
  check validity.
- Browsers: Some older browsers validate XML against a DTD (given a
  DOCTYPE), but most only check well-formedness, so this is not a reliable
  way to validate.
When validating, common error messages include: missing required
elements/attributes, element not allowed here, datatype mismatch, or entity
undeclared. Always start by checking well-formedness before validating.
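When validation is part of a script, the xmllint invocations above can be wrapped in a small helper. The sketch below uses only the flags shown in this section; it assumes xmllint is installed and on the PATH, and the function names are hypothetical.

```python
import subprocess

def xmllint_command(xml_path, dtd_path=None, xsd_path=None):
    """Build the xmllint argument list for the requested check."""
    command = ["xmllint", "--noout"]
    if dtd_path is not None:
        command += ["--dtdvalid", dtd_path]
    if xsd_path is not None:
        command += ["--schema", xsd_path]
    return command + [xml_path]

def run_xmllint(xml_path, dtd_path=None, xsd_path=None):
    """Run xmllint and return (ok, messages). Requires xmllint on PATH."""
    result = subprocess.run(
        xmllint_command(xml_path, dtd_path, xsd_path),
        capture_output=True,
        text=True,
    )
    # A zero exit status means no well-formedness or validity errors.
    return result.returncode == 0, result.stderr

# File names below are placeholders.
print(xmllint_command("note.xml"))
print(xmllint_command("note.xml", xsd_path="note.xsd"))
```

Calling the helper with only the XML path performs the well-formedness check that should precede any validation.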
Examples
Sample XML + DTD
XML ([Link]):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note SYSTEM "[Link]">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
DTD ([Link]):
<!ELEMENT note (to, from, heading, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
Validation:
xmllint --noout --dtdvalid [Link] [Link]
- If all elements appear in order, no output (valid).
- Error example: If <body> is missing in [Link], xmllint reports a validity
error along the lines of “Element note content does not follow the DTD,
expecting (to,from,heading,body)”.
Sample XML + XSD
XML ([Link]):
<?xml version="1.0"?>
<note xmlns:xsi="[Link]"
xsi:noNamespaceSchemaLocation="[Link]">
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
XSD ([Link]):
<xs:schema xmlns:xs="[Link]">
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Validation:
xmllint --noout --schema [Link] [Link]
- If <body> were removed from [Link], validation would fail with a message
along the lines of “Element 'note': Missing child element(s). Expected: 'body'.”
- If <to> contained the number 123 and the schema said xs:string, it
would still pass (a numeric string is a valid string). But if <to> were
declared with type="xs:int" and contained abc, the error would say that
'abc' is not a valid value for 'xs:int'.
Comparison Summary
Both DTDs and XML Schema serve to define XML structure, but with different
power:
Aspect | DTD | XML Schema (XSD)
-------|-----|-----------------
Syntax Format | SGML-like (not XML) | XML-based, uses namespaces
Data Types | No (all text) | Many built-in (string, date, integer, etc.)[23]
Namespaces | None | Full support (targetNamespace, import/include)
Extensibility | Limited (no inheritance) | Extensive (extension/restriction, substitutionGroups)
Validation | Structure-only (order/occurrence) | Structure + data (patterns, value ranges)
Tool Support | Widespread, simple tools | Widespread (modern parsers, IDEs)
Use Cases | Simple/legacy, free-form text | Complex data with types (e.g. Web services)
As noted by the XML Schema spec, XSD “substantially reconstructs and
considerably extends the capabilities found in XML 1.0 DTDs”[30]. In
practice, for new designs XML Schema is preferred for its rigor, and DTDs
remain mainly for backwards compatibility.
Further Reading and References
W3C XML 1.0 (Fifth Edition) – for core syntax rules[31][2].
W3C Namespaces in XML – for namespace definitions[7].
W3C XML Schema Part 1: Structures – for schema component
definitions[30].
W3C XML Schema Part 2: Datatypes – for built-in and derived
types[23].
SitePoint, XML DTDs vs XML Schema – practical comparison[28][32].
This report draws on those authoritative sources (cited above) and various
XML tutorials/examples to ensure correctness and clarity. It is meant as a
comprehensive study guide on XML structure, DTDs, and XML Schema for
teaching or self-study.
[1] [2] [3] [4] [5] [6] [8] [10] [11] [12] [13] [14] [15] [16] [17] [19] [20] [21]
[22] [27] [31] Extensible Markup Language (XML) 1.0 (Fifth Edition)
[Link]
[7] Namespaces in XML 1.0 (Third Edition)
[Link]
[9] [18] XML DTD
[Link]
[23] [24] [25] XML Schema Part 2: Datatypes Second Edition
[Link]
[26] Introduction to XML Schemas - W3C
[Link]
[28] [29] [32] XML DTDs Vs XML Schema — SitePoint
[Link]
[30] XML Schema Part 1: Structures Second Edition
[Link]