
Understanding XML Structure and Schema

The document provides an overview of XML, including:
- XML was created to facilitate data exchange and is used to represent nested, hierarchical data structures.
- XML documents are defined with tags that provide meaning and context to the data.
- Schemas like DTDs constrain the structure and elements of XML documents but not the data types. DTDs specify allowed elements, attributes, and nesting for XML tags.
Copyright
© Attribution Non-Commercial (BY-NC)

XML I

Structure of XML Data
XML Document Schema
XPath

Introduction

XML: Extensible Markup Language
Defined by the World Wide Web Consortium (W3C)
Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML
Documents have tags giving extra information about sections of the document
  E.g. <title> XML </title> <slide> Introduction </slide>
Extensible, unlike HTML
  Users can add new tags, and separately specify how the tag should be handled for display

XML Introduction (Cont.)

The ability to specify new tags, and to create nested tag structures, makes XML a great way to exchange data, not just documents.

Much of the use of XML has been in data exchange applications, not as a replacement for HTML
  E.g.
    <bank>
      <account>
        <account_number> A-101 </account_number>
        <branch_name> Downtown </branch_name>
        <balance> 500 </balance>
      </account>
      <depositor>
        <account_number> A-101 </account_number>
        <customer_name> Johnson </customer_name>
      </depositor>
    </bank>

Tags make data (relatively) self-documenting

XML: Motivation

Data interchange is critical in today's networked world
  Examples:
    Banking: funds transfer
    Order processing (especially inter-company orders)
    Scientific data
      Chemistry: ChemML
      Genetics: BSML (Bio-Sequence Markup Language)
Paper flow of information between organizations is being replaced by electronic flow of information
Each application area has its own set of standards for representing information
XML has become the basis for all new generation data interchange formats

XML Motivation (Cont.)

Earlier generation formats were based on plain text with line headers indicating the meaning of fields
  Similar in concept to email headers
  Does not allow for nested structures, no standard type language
  Tied too closely to low level document structure (lines, spaces, etc.)
Each XML-based standard defines what are valid elements, using
  XML type specification languages to specify the syntax
    DTD (Document Type Definition)
    XML Schema
  Plus textual descriptions of the semantics
XML allows new tags to be defined as required
  However, this may be constrained by DTDs
A wide variety of tools is available for parsing, browsing and querying XML documents/data

Comparison with Relational Data

Inefficient: tags, which in effect represent schema information, are repeated
Better than relational tuples as a data-exchange format
  Unlike relational tuples, XML data is self-documenting due to presence of tags
  Non-rigid format: tags can be added
  Allows nested structures
  Wide acceptance, not only in database systems, but also in browsers, tools, and applications

Structure of XML Data


Tag: label for a section of data
Element: section of data beginning with <tagname> and ending with matching </tagname>
Elements must be properly nested
  Proper nesting
    <account> <balance> ... </balance> </account>
  Improper nesting
    <account> <balance> ... </account> </balance>
  Formally: every start tag must have a unique matching end tag that is in the context of the same parent element
Every document must have a single top-level element
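
The nesting rules above are what a parser checks as "well-formedness". A minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` and the sample strings are illustrative, not part of the slides:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


proper = "<account> <balance> 400 </balance> </account>"
improper = "<account> <balance> 400 </account> </balance>"  # tags cross
two_roots = "<account/> <account/>"  # no single top-level element

results = {
    "proper": is_well_formed(proper),
    "improper": is_well_formed(improper),
    "two_roots": is_well_formed(two_roots),
}
print(results)
```

Only the first string is accepted: the second closes `account` before `balance`, and the third has two top-level elements.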

Example of Nested Elements


<bank-1>
  <customer>
    <customer_name> Hayes </customer_name>
    <customer_street> Main </customer_street>
    <customer_city> Harrison </customer_city>
    <account>
      <account_number> A-102 </account_number>
      <branch_name> Perryridge </branch_name>
      <balance> 400 </balance>
    </account>
    <account> </account>
  </customer>
  .
  .
</bank-1>

Motivation for Nesting

Nesting of data is useful in data transfer
  Example: elements representing customer_id, customer_name, and address nested within an order element
Nesting is not supported, or discouraged, in relational databases
  With multiple orders, customer name and address are stored redundantly
  Normalization replaces nested structures in each order by a foreign key into a table storing customer name and address information
  Nesting is supported in object-relational databases
But nesting is appropriate when transferring data
  External application does not have direct access to data referenced by a foreign key

Structure of XML Data (Cont.)

Mixture of text with sub-elements is legal in XML.
  Example:
    <account>
      This account is seldom used any more.
      <account_number> A-102</account_number>
      <branch_name> Perryridge</branch_name>
      <balance>400 </balance>
    </account>
  Useful for document markup, but discouraged for data representation
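
One reason mixed content is awkward for data is that a program has to look in two places: the free text and the sub-elements. A small sketch with Python's `xml.etree.ElementTree`, which keeps the text that precedes the first child in the parent's `.text`; the variable names are illustrative:

```python
import xml.etree.ElementTree as ET

doc = """<account>
  This account is seldom used any more.
  <account_number>A-102</account_number>
  <branch_name>Perryridge</branch_name>
  <balance>400</balance>
</account>"""

account = ET.fromstring(doc)

# Text before the first sub-element lives on the parent element itself.
leading_text = (account.text or "").strip()

# The structured part is reached through the child elements.
fields = {child.tag: (child.text or "").strip() for child in account}

print(leading_text)
print(fields)
```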

Attributes

Elements can have attributes
  <account acct-type = "checking">
    <account_number> A-102 </account_number>
    <branch_name> Perryridge </branch_name>
    <balance> 400 </balance>
  </account>
Attributes are specified by name=value pairs inside the starting tag of an element
An element may have several attributes, but each attribute name can only occur once
  <account acct-type = "checking" monthly-fee = "5">
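
The attribute rules can be checked with any XML parser. A minimal sketch using Python's `xml.etree.ElementTree`; the helper `parses` is a hypothetical name, and the second and third inputs are deliberately malformed:

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if `text` is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


account = ET.fromstring(
    '<account acct-type="checking" monthly-fee="5">'
    "<balance>400</balance></account>"
)
attrs = dict(account.attrib)  # attribute values are always strings

# The same attribute name twice in one start tag is rejected.
duplicate_ok = parses('<account acct-type="checking" acct-type="savings"/>')
# Attribute values must be quoted.
unquoted_ok = parses("<account acct-type=checking/>")

print(attrs, duplicate_ok, unquoted_ok)
```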

Attributes vs. Subelements

Distinction between subelement and attribute
  In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents
  In the context of data representation, the difference is unclear and may be confusing
    Same information can be represented in two ways
      <account account_number = "A-101"> ... </account>
      <account>
        <account_number>A-101</account_number> ...
      </account>
  Suggestion: use attributes for identifiers of elements, and use subelements for contents

Namespaces

XML data has to be exchanged between organizations
Same tag name may have different meaning in different organizations, causing confusion on exchanged documents
Specifying a unique string as an element name avoids confusion
Better solution: use unique-name:element-name
Avoid using long unique names all over document by using XML Namespaces
  <bank xmlns:FB="[Link]">
    <FB:branch>
      <FB:branchname>Downtown</FB:branchname>
      <FB:branchcity> Brooklyn </FB:branchcity>
    </FB:branch>
  </bank>
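
A parser resolves the short prefix to the full unique name. The sketch below uses Python's `xml.etree.ElementTree`; the namespace URI `http://example.com/first-bank` is a made-up placeholder (the URI on the slide is not shown), and the query prefix `fb` is chosen freely:

```python
import xml.etree.ElementTree as ET

FB_URI = "http://example.com/first-bank"  # hypothetical namespace URI

doc = f"""<bank xmlns:FB="{FB_URI}">
  <FB:branch>
    <FB:branchname>Downtown</FB:branchname>
    <FB:branchcity>Brooklyn</FB:branchcity>
  </FB:branch>
</bank>"""

bank = ET.fromstring(doc)

# Internally the prefix is replaced by the full URI in braces.
expanded_tag = bank[0].tag

# Queries bind their own prefix to the same URI; it need not be "FB".
ns = {"fb": FB_URI}
branch_names = [e.text for e in bank.findall("fb:branch/fb:branchname", ns)]

print(expanded_tag, branch_names)
```

Two organizations can both use a `branch` tag without clashing as long as their namespace URIs differ.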

More on XML Syntax

Elements without subelements or text content can be abbreviated by ending the start tag with a /> and deleting the end tag
  <account number="A-101" branch="Perryridge" balance="200" />
To store string data that may contain tags, without the tags being interpreted as subelements, use CDATA as below
  <![CDATA[<account> </account>]]>
  Here, <account> and </account> are treated as just strings
  CDATA stands for "character data"
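
Both points can be observed with a parser. A small sketch using Python's `xml.etree.ElementTree`; the wrapper element `<note>` is hypothetical and only there to hold the CDATA section:

```python
import xml.etree.ElementTree as ET

short = ET.fromstring('<account number="A-101" branch="Perryridge" balance="200" />')
long_form = ET.fromstring(
    '<account number="A-101" branch="Perryridge" balance="200"></account>'
)
# The abbreviated and the full form parse to the same element.
same = (
    short.attrib == long_form.attrib
    and short.text == long_form.text
    and len(short) == len(long_form) == 0
)

note = ET.fromstring("<note><![CDATA[<account> </account>]]></note>")
cdata_text = note.text   # the markup characters arrive as plain text
child_count = len(note)  # no <account> subelement was created

print(same, repr(cdata_text), child_count)
```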

XML Document Schema

Database schemas constrain what information can be stored, and the data types of stored values
XML documents are not required to have an associated schema
However, schemas are very important for XML data exchange
  Otherwise, a site cannot automatically interpret data received from another site
Two mechanisms for specifying XML schema
  Document Type Definition (DTD)
    Widely used
  XML Schema
    Newer, increasing use

Document Type Definition (DTD)


The type of an XML document can be specified using a DTD
DTD constrains structure of XML data
  What elements can occur
  What attributes can/must an element have
  What subelements can/must occur inside each element, and how many times
DTD does not constrain data types
  All values represented as strings in XML
DTD syntax
  <!ELEMENT element (subelements-specification) >
  <!ATTLIST element (attributes) >

Element Specification in DTD

Subelements can be specified as
  names of elements, or
  #PCDATA (parsed character data), i.e., character strings
  EMPTY (no subelements) or ANY (anything can be a subelement)
Example
  <!ELEMENT depositor (customer_name, account_number)>
  <!ELEMENT customer_name (#PCDATA)>
  <!ELEMENT account_number (#PCDATA)>
Subelement specification may have regular expressions
  <!ELEMENT bank ( ( account | customer | depositor)+)>
  Notation:
    | - alternatives
    + - 1 or more occurrences
    * - 0 or more occurrences
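
To see why these specifications are called regular expressions, the sketch below translates an element content model into a Python regex over the sequence of child tag names. It is an illustration only, not a DTD validator: Python's standard library does not validate against DTDs, the helper names `model_to_regex` and `children_match` are invented here, and `#PCDATA`, `EMPTY` and `ANY` are not handled.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model: str) -> "re.Pattern[str]":
    """Turn a content model such as "(a, b)" or "((a | b)+)" into a regex.

    Each child is encoded as the token "tag,", so a sequence in the model
    becomes plain concatenation and |, +, * keep their regex meaning.
    """
    body = model.replace(",", " ")
    body = re.sub(r"[\w.-]+", lambda m: "(?:" + re.escape(m.group(0)) + ",)", body)
    return re.compile(body.replace(" ", ""))


def children_match(element: ET.Element, model: str) -> bool:
    """Check the element's children (in order) against the content model."""
    tokens = "".join(child.tag + "," for child in element)
    return model_to_regex(model).fullmatch(tokens) is not None


DEPOSITOR_MODEL = "(customer_name, account_number)"
BANK_MODEL = "((account | customer | depositor)+)"

depositor = ET.fromstring(
    "<depositor><customer_name>Johnson</customer_name>"
    "<account_number>A-101</account_number></depositor>"
)
swapped = ET.fromstring(
    "<depositor><account_number>A-101</account_number>"
    "<customer_name>Johnson</customer_name></depositor>"
)
bank = ET.fromstring("<bank><account/><depositor/><account/></bank>")
empty_bank = ET.fromstring("<bank/>")

checks = [
    children_match(depositor, DEPOSITOR_MODEL),  # right order
    children_match(swapped, DEPOSITOR_MODEL),    # wrong order
    children_match(bank, BANK_MODEL),            # any mix, at least one
    children_match(empty_bank, BANK_MODEL),      # "+" requires one child
]
print(checks)
```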

Bank DTD
<!DOCTYPE bank [
  <!ELEMENT bank ( ( account | customer | depositor)+)>
  <!ELEMENT account (account_number, branch_name, balance)>
  <!ELEMENT customer (customer_name, customer_street, customer_city)>
  <!ELEMENT depositor (customer_name, account_number)>
  <!ELEMENT account_number (#PCDATA)>
  <!ELEMENT branch_name (#PCDATA)>
  <!ELEMENT balance (#PCDATA)>
  <!ELEMENT customer_name (#PCDATA)>
  <!ELEMENT customer_street (#PCDATA)>
  <!ELEMENT customer_city (#PCDATA)>
]>

Attribute Specification in DTD

Attribute specification: for each attribute
  Name
  Type of attribute
    CDATA
    ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
      more on this later
  Whether
    mandatory (#REQUIRED)
    has a default value (value),
    or neither (#IMPLIED)
Examples
  <!ATTLIST account acct-type CDATA "checking">
  <!ATTLIST customer
      customer_id ID #REQUIRED
      accounts IDREFS #REQUIRED >

IDs and IDREFs


An element can have at most one attribute of type ID
The ID attribute value of each element in an XML document must be distinct
  Thus the ID attribute value is an object identifier
An attribute of type IDREF must contain the ID value of an element in the same document
An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must contain the ID value of an element in the same document

Bank DTD with Attributes

Bank DTD with ID and IDREF attribute types.
  <!DOCTYPE bank-2 [
    <!ELEMENT account (branch_name, balance)>
    <!ATTLIST account
        account_number ID #REQUIRED
        owners IDREFS #REQUIRED>
    <!ELEMENT customer (customer_name, customer_street, customer_city)>
    <!ATTLIST customer
        customer_id ID #REQUIRED
        accounts IDREFS #REQUIRED>
    declarations for branch_name, balance, customer_name, customer_street and customer_city
  ]>

XML data with ID and IDREF attributes


<bank-2>
  <account account_number="A-401" owners="C100 C102">
    <branch_name> Downtown </branch_name>
    <balance> 500 </balance>
  </account>
  <customer customer_id="C100" accounts="A-401">
    <customer_name>Joe </customer_name>
    <customer_street> Monroe </customer_street>
    <customer_city> Madison</customer_city>
  </customer>
  <customer customer_id="C102" accounts="A-401 A-402">
    <customer_name> Mary </customer_name>
    <customer_street> Erin </customer_street>
    <customer_city> Newark </customer_city>
  </customer>
</bank-2>
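
The ID and IDREFS constraints can be sketched by hand on this data. The code below uses Python's `xml.etree.ElementTree`, which does not read the DTD, so the dictionaries `ID_ATTRS` and `IDREFS_ATTRS` restate (as an assumption of this sketch) which attributes the DTD declares as ID and IDREFS. Because the excerpt as shown contains no account A-402, that reference is reported as unresolved.

```python
import xml.etree.ElementTree as ET

doc = """<bank-2>
  <account account_number="A-401" owners="C100 C102">
    <branch_name>Downtown</branch_name><balance>500</balance>
  </account>
  <customer customer_id="C100" accounts="A-401">
    <customer_name>Joe</customer_name>
  </customer>
  <customer customer_id="C102" accounts="A-401 A-402">
    <customer_name>Mary</customer_name>
  </customer>
</bank-2>"""

# Assumed mirror of the ATTLIST declarations: element name -> attribute name.
ID_ATTRS = {"account": "account_number", "customer": "customer_id"}
IDREFS_ATTRS = {"account": "owners", "customer": "accounts"}

root = ET.fromstring(doc)

# Pass 1: collect ID values and detect duplicates.
ids = {}
duplicates = []
for elem in root.iter():
    id_attr = ID_ATTRS.get(elem.tag)
    if id_attr and id_attr in elem.attrib:
        value = elem.get(id_attr)
        if value in ids:
            duplicates.append(value)
        ids[value] = elem

# Pass 2: every blank-separated reference must name an existing ID.
dangling = []
for elem in root.iter():
    refs_attr = IDREFS_ATTRS.get(elem.tag)
    if refs_attr:
        dangling += [r for r in elem.get(refs_attr, "").split() if r not in ids]

print(duplicates, dangling)
```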

Limitations of DTDs

No typing of text elements and attributes
  All values are strings, no integers, reals, etc.
Difficult to specify unordered sets of subelements
  Order is usually irrelevant in databases (unlike in the document-layout environment from which XML evolved)
  (A | B)* allows specification of an unordered set, but
    Cannot ensure that each of A and B occurs only once
IDs and IDREFs are untyped
  The owners attribute of an account may contain a reference to another account, which is meaningless
    owners attribute should ideally be constrained to refer to customer elements
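
The last limitation can be made concrete. In the sketch below (Python's `xml.etree.ElementTree`, with deliberately abbreviated, made-up data), an account lists another account among its owners. A check in the spirit of the DTD, which only asks that each reference resolves to some element, passes; a stricter application-level check that owners must be customers catches the problem.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: account A-401 wrongly names account A-402 as an owner.
doc = """<bank-2>
  <account account_number="A-401" owners="C100 A-402"/>
  <account account_number="A-402" owners="C100"/>
  <customer customer_id="C100" accounts="A-401 A-402"/>
</bank-2>"""

root = ET.fromstring(doc)
by_id = {}
for elem in root:
    key = elem.get("account_number") or elem.get("customer_id")
    by_id[key] = elem

owner_refs = [
    (acc.get("account_number"), ref)
    for acc in root.findall("account")
    for ref in acc.get("owners").split()
]

# DTD-style check: the reference only has to resolve to *some* element.
dtd_ok = all(ref in by_id for _, ref in owner_refs)

# Typed check the DTD cannot express: owners must be customer elements.
bad_owners = [(acc, ref) for acc, ref in owner_refs if by_id[ref].tag != "customer"]

print(dtd_ok, bad_owners)
```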

Tree Model of XML Data


Query and transformation languages are based on a tree model of XML data
An XML document is modeled as a tree, with nodes corresponding to elements and attributes
  Element nodes have child nodes, which can be attributes or subelements
  Text in an element is modeled as a text node child of the element
  Children of a node are ordered according to their order in the XML document
  Element and attribute nodes (except for the root node) have a single parent, which is an element node
  The root node has a single child, which is the root element of the document
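
A short sketch of the tree view, using Python's `xml.etree.ElementTree`. Note that this library stores attributes and text as properties of an element rather than as separate node objects; the hypothetical helper `describe` prints them as child nodes to match the model described above.

```python
import xml.etree.ElementTree as ET

doc = (
    '<account acct-type="checking">'
    "<account_number>A-102</account_number>"
    "<balance>400</balance>"
    "</account>"
)


def describe(elem, depth=0):
    """List an element, then its attribute, text and element children in order."""
    pad = "  " * depth
    lines = [f"{pad}element {elem.tag}"]
    for name, value in elem.attrib.items():
        lines.append(f"{pad}  attribute {name} = {value}")
    if elem.text and elem.text.strip():
        lines.append(f"{pad}  text {elem.text.strip()!r}")
    for child in elem:  # children come back in document order
        lines.extend(describe(child, depth + 1))
    return lines


tree_lines = describe(ET.fromstring(doc))
print("\n".join(tree_lines))
```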

XPath

XPath is used to address (select) parts of documents using path expressions
A path expression is a sequence of steps separated by /
  Think of file names in a directory hierarchy
Result of path expression: set of values that along with their containing elements/attributes match the specified path
E.g. /bank/customer/customer_name evaluated on the bank data we saw earlier returns
    <customer_name>Hayes</customer_name>
    <customer_name>Johnson</customer_name>
E.g. /bank/customer/customer_name/text() returns the same names, but without the enclosing tags
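
A rough equivalent in Python's `xml.etree.ElementTree`, which supports only a subset of XPath. Two differences are assumptions of this sketch: paths are written relative to the already-parsed `bank` element (so `/bank/customer/customer_name` becomes `customer/customer_name`), and since `text()` is not supported, the text is read from each element's `.text`. The sample data is a cut-down stand-in for the bank data.

```python
import xml.etree.ElementTree as ET

doc = """<bank>
  <customer><customer_name>Hayes</customer_name></customer>
  <customer><customer_name>Johnson</customer_name></customer>
</bank>"""

bank = ET.fromstring(doc)

# /bank/customer/customer_name, relative to the <bank> element
elements = bank.findall("customer/customer_name")
as_xml = [ET.tostring(e, encoding="unicode").strip() for e in elements]

# /bank/customer/customer_name/text()
as_text = [e.text for e in elements]

print(as_xml)
print(as_text)
```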

XPath (Cont.)

The initial "/" denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
  Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
  E.g. /bank/customer/account[balance > 400]
    returns account elements with a balance value greater than 400
  /bank/customer/account[balance] returns account elements containing a balance subelement
Attributes are accessed using "@"
  E.g. /bank/customer/account[balance > 400]/@account_number
    returns the account numbers of accounts with balance > 400
    Here we assume account_number is an attribute
    Otherwise /bank/customer/account[balance > 400]/account_number
  IDREF attributes are not dereferenced automatically (more on this later)
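
The predicates above can be approximated with Python's `xml.etree.ElementTree`. Its XPath subset understands the existence test `[balance]` but not numeric comparisons, so `balance > 400` is applied as an ordinary Python filter in this sketch. The sample data is invented, with `account_number` stored as an attribute as the slide assumes.

```python
import xml.etree.ElementTree as ET

doc = """<bank>
  <customer>
    <account account_number="A-101"><balance>500</balance></account>
    <account account_number="A-102"><balance>400</balance></account>
    <account account_number="A-103"/>
  </customer>
</bank>"""

bank = ET.fromstring(doc)

# /bank/customer/account[balance]: accounts that contain a balance subelement
with_balance = bank.findall("customer/account[balance]")
with_balance_numbers = [a.get("account_number") for a in with_balance]

# /bank/customer/account[balance > 400]/@account_number
rich_numbers = [
    a.get("account_number")
    for a in with_balance
    if float(a.findtext("balance")) > 400
]

print(with_balance_numbers, rich_numbers)
```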

Functions in XPath

XPath provides several functions
  The function count() at the end of a path counts the number of elements in the set generated by the path
    E.g. /bank/customer[count(./account) > 1]
      Returns customers with > 1 accounts
  Also function for testing position (1, 2, ...) of node w.r.t. siblings
Boolean connectives "and" and "or" and the function not() can be used in predicates
IDREFs can be referenced using function id()
  id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
  E.g. /bank/customer/account/id(@owners)
    returns all customers referred to from the owners attribute of account elements
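
Python's `xml.etree.ElementTree` has neither `count()` nor `id()`, so the sketch below emulates both in plain Python to show what they compute. The data is invented for illustration and combines nested accounts with an `owners` attribute; the dictionary lookup stands in for the ID resolution a DTD-aware XPath engine would do.

```python
import xml.etree.ElementTree as ET

doc = """<bank>
  <customer customer_id="C100">
    <customer_name>Joe</customer_name>
    <account owners="C100 C102"/>
    <account owners="C100"/>
  </customer>
  <customer customer_id="C102">
    <customer_name>Mary</customer_name>
    <account owners="C102"/>
  </customer>
</bank>"""

bank = ET.fromstring(doc)

# /bank/customer[count(./account) > 1]
many_accounts = [
    c.findtext("customer_name")
    for c in bank.findall("customer")
    if len(c.findall("account")) > 1
]

# /bank/customer/account/id(@owners): split each IDREFS value on blanks,
# then return the referenced elements once each, in document order.
owner_ids = {
    ref
    for account in bank.findall("customer/account")
    for ref in account.get("owners", "").split()
}
owners = [
    c.findtext("customer_name")
    for c in bank.findall("customer")
    if c.get("customer_id") in owner_ids
]

print(many_accounts, owners)
```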

More XPath Examples

Element AA with two ancestors
  /*/*/AA
First BB element of AA element
  /AA/BB[1]
All the CC elements of the BB elements that have a sub-element A with value 3
  /BB[A=3]/CC
Any elements AA or elements CC of elements BB
  //AA | /BB/CC
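
Some of these expressions can be tried with Python's `xml.etree.ElementTree`, under a few assumptions of this sketch: the sample document and its wrapper element `<root>` are invented, paths are written relative to that wrapper, the value in `[A='3']` is compared as text, and the union operator `|` (not supported by the library) is emulated by concatenating two result lists.

```python
import xml.etree.ElementTree as ET

doc = """<root>
  <AA><BB>b1</BB><BB>b2</BB></AA>
  <BB><A>3</A><CC>c1</CC></BB>
  <BB><A>4</A><CC>c2</CC></BB>
</root>"""

root = ET.fromstring(doc)

# /AA/BB[1]: first BB child of AA
first_bb = root.find("AA/BB[1]").text

# /BB[A=3]/CC: CC children of BB elements whose A child has value 3
cc_of_a3 = [e.text for e in root.findall("BB[A='3']/CC")]

# //AA | /BB/CC: all AA elements anywhere, plus CC children of BB
union_tags = [e.tag for e in root.findall(".//AA") + root.findall("BB/CC")]

print(first_bb, cc_of_a3, union_tags)
```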

Even More XPath Examples

Select all sub-elements of elements AA of elements BB
  /BB/AA/*
  When you do not know the sub-elements
  Different from /BB/AA
Select all attributes named aa
  //@aa
Select all CITIES elements with an attribute named aa
  //CITIES[@aa]
Select all CITIES elements with an attribute named aa with value 123
  //CITIES[@aa = 123]
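
The wildcard and attribute tests look like this in Python's `xml.etree.ElementTree`. As before, the document and its `<root>` wrapper are made up and paths are relative to the wrapper. The library cannot return attribute nodes, so `//@aa` is emulated by walking all elements and reading the attribute.

```python
import xml.etree.ElementTree as ET

doc = """<root>
  <BB><AA><x/><y aa="1"/></AA></BB>
  <CITIES aa="123"/>
  <CITIES aa="456"/>
  <CITIES/>
</root>"""

root = ET.fromstring(doc)

# /BB/AA/* selects the children of AA, whereas /BB/AA selects AA itself
children_of_aa = [e.tag for e in root.findall("BB/AA/*")]
aa_itself = [e.tag for e in root.findall("BB/AA")]

# //@aa: values of every attribute named aa, in document order
aa_values = [e.get("aa") for e in root.iter() if "aa" in e.attrib]

# //CITIES[@aa] and //CITIES[@aa = 123]
cities_with_aa = len(root.findall(".//CITIES[@aa]"))
cities_aa_123 = len(root.findall(".//CITIES[@aa='123']"))

print(children_of_aa, aa_itself, aa_values, cities_with_aa, cities_aa_123)
```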
