Chapter 10: XML
Introduction
n XML: Extensible Markup Language
n Defined by the WWW Consortium (W3C)
n Originally intended as a document markup language, not a database language
H Documents have tags giving extra information about sections of the document
4 E.g. <title> XML </title> <slide> Introduction …</slide>
H Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML
H Extensible, unlike HTML
4 Users can add new tags, and separately specify how the tag should be handled for display
H Goal was (is?) to replace HTML as the language for publishing documents on the Web
n The ability to specify new tags, and to create nested tag structures made XML a great way to exchange data, not just documents.
H Much of the use of XML has been in data exchange applications, not as a replacement for HTML
n Tags make data (relatively) self-documenting
H E.g.
<bank>
<account>
<account-number> A-101 </account-number>
<branch-name> Downtown </branch-name>
<balance> 500 </balance>
</account>
<depositor>
<account-number> A-101 </account-number>
<customer-name> Johnson </customer-name>
</depositor>
</bank>
XML: Motivation
n Data interchange is critical in today’s networked world
H Examples:
4 Banking: funds transfer
4 Order processing (especially inter-company orders)
4 Scientific data
– Chemistry: ChemML, …
– Genetics: BSML (Bio-Sequence Markup Language), …
H Paper flow of information between organizations is being replaced by electronic flow of information
n Each application area has its own set of standards for representing information
n XML has become the basis for all new generation data interchange formats
n Earlier generation formats were based on plain text with line headers indicating the meaning of fields
H Similar in concept to email headers
H Does not allow for nested structures, no standard “type” language
H Tied too closely to low level document structure (lines, spaces, etc)
n Each XML based standard defines what are valid elements, using
H XML type specification languages to specify the syntax
4 DTD (Document Type Definition)
4 XML Schema
H Plus textual descriptions of the semantics
n XML allows new tags to be defined as required
H However, this may be constrained by DTDs
n A wide variety of tools is available for parsing, browsing and querying XML documents/data
Structure of XML Data
n Tag: label for a section of data
n Element: section of data beginning with <tagname> and ending with matching </tagname>
n Elements must be properly nested
H Proper nesting
4 <account> … <balance> …. </balance> </account>
H Improper nesting
4 <account> … <balance> …. </account> </balance>
H Formally: every start tag must have a unique matching end tag, that is in the context of the same parent element.
n Every document must have a single top-level element
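The nesting and single-root rules above are what an XML parser checks when it decides whether a document is well formed. A minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed is made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

proper = "<account><balance>400</balance></account>"
improper = "<account><balance>400</account></balance>"  # tags cross
two_roots = "<account/><account/>"                       # no single top-level element

print(is_well_formed(proper))     # True
print(is_well_formed(improper))   # False
print(is_well_formed(two_roots))  # False
```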
Example of Nested Elements
<bank-1>
<customer>
<customer-name> Hayes </customer-name>
<customer-street> Main </customer-street>
<customer-city> Harrison </customer-city>
<account>
<account-number> A-102 </account-number>
<branch-name> Perryridge </branch-name>
<balance> 400 </balance>
</account>
<account>
…
</account>
</customer>
.
.
</bank-1>
Motivation for Nesting
n Nesting of data is useful in data transfer
H Example: elements representing customer-id, customer name, and address nested within an order element
n Nesting is not supported, or discouraged, in relational databases
H With multiple orders, customer name and address are stored redundantly
H Normalization replaces the nested structure in each order with a foreign key into a table storing customer name and address information
H Nesting is supported in object-relational databases
n But nesting is appropriate when transferring data
H External application does not have direct access to data referenced by a foreign key
n Mixture of text with sub-elements is legal in XML.
H Example:
<account>
This account is seldom used any more.
<account-number> A-102</account-number>
<branch-name> Perryridge</branch-name>
<balance>400 </balance>
</account>
H Useful for document markup, but discouraged for data representation
Attributes
n Elements can have attributes
H <account acct-type = “checking” >
<account-number> A-102 </account-number>
<branch-name> Perryridge </branch-name>
<balance> 400 </balance>
</account>
n Attributes are specified by name=value pairs inside the starting tag of an element
n An element may have several attributes, but each attribute name can only occur once
4 <account acct-type = “checking” monthly-fee=“5”>
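As a rough illustration of the two rules above (name=value pairs in the start tag, each name at most once), here is a sketch using Python's standard xml.etree.ElementTree. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

account = ET.fromstring(
    '<account acct-type="checking" monthly-fee="5">'
    '<account-number>A-102</account-number>'
    '</account>'
)
# Attributes are exposed as a name -> value mapping; all values are strings.
attrs = account.attrib
print(attrs["acct-type"], attrs["monthly-fee"])  # checking 5

# Repeating an attribute name in one start tag is rejected by the parser.
try:
    ET.fromstring('<account acct-type="checking" acct-type="savings"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)  # False
```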
Attributes Vs. Subelements
n Distinction between subelement and attribute
H In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents
H In the context of data representation, the difference is unclear and may be confusing
4 Same information can be represented in two ways
– <account account-number = “A-101”> …. </account>
– <account>
<account-number>A-101</account-number> …
</account>
H Suggestion: use attributes for identifiers of elements, and use subelements for contents
More on XML Syntax
n Elements without subelements or text content can be abbreviated by ending the start tag with a /> and deleting the end tag
H <account number="A-101" branch="Perryridge" balance="200" />
n To store string data that may contain tags, without the tags being interpreted as subelements, use CDATA as below
H <![CDATA[<account> … </account>]]>
4 Here, <account> and </account> are treated as just strings
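A small sketch of the CDATA behaviour, assuming Python's standard xml.etree.ElementTree. The wrapper element name note is hypothetical; it is there only to give the CDATA section a parent.

```python
import xml.etree.ElementTree as ET

note = ET.fromstring("<note><![CDATA[<account> A-101 </account>]]></note>")

# The markup inside the CDATA section arrives as plain character data ...
print(note.text)   # <account> A-101 </account>
# ... so the element has no subelements at all.
print(len(note))   # 0
```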
Namespaces
n XML data has to be exchanged between organizations
n Same tag name may have different meaning in different organizations, causing confusion on exchanged documents
n Specifying a unique string as an element name avoids confusion
n Better solution: use unique-name:element-name
n Avoid using long unique names all over document by using XML Namespaces
<bank xmlns:FB="http://www.FirstBank.com">
<FB:branch>
<FB:branchname>Downtown</FB:branchname>
<FB:branchcity> Brooklyn </FB:branchcity>
</FB:branch>
…
</bank>
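To see what the prefix buys, the sketch below parses a namespaced fragment with Python's standard xml.etree.ElementTree, which expands each prefixed name into a {namespace-URI}local-name pair. The URI bound to FB here is an assumption for illustration; any string unique to the organization would do.

```python
import xml.etree.ElementTree as ET

FB = "http://www.FirstBank.com"  # assumed namespace URI, for illustration only
doc = ET.fromstring(
    f'<bank xmlns:FB="{FB}">'
    '<FB:branch>'
    '<FB:branchname>Downtown</FB:branchname>'
    '<FB:branchcity>Brooklyn</FB:branchcity>'
    '</FB:branch>'
    '</bank>'
)

branch = doc[0]
# The short prefix is replaced by the full unique name internally.
print(branch.tag)  # {http://www.FirstBank.com}branch

# Lookups supply their own prefix -> URI mapping.
name = doc.find("FB:branch/FB:branchname", {"FB": FB})
print(name.text)   # Downtown
```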
XML Document Schema
n Database schemas constrain what information can be stored, and the data types of stored values
n XML documents are not required to have an associated schema
n However, schemas are very important for XML data exchange
H Otherwise, a site cannot automatically interpret data received from another site
n Two mechanisms for specifying XML schema
H Document Type Definition (DTD)
4 Widely used
H XML Schema
4 Newer, increasing use
Document Type Definition (DTD)
n The type of an XML document can be specified using a DTD
n A DTD constrains the structure of XML data
H What elements can occur
H What attributes can/must an element have
H What subelements can/must occur inside each element, and how many times.
n DTD does not constrain data types
H All values represented as strings in XML
n DTD syntax
H <!ELEMENT element (subelements-specification) >
H <!ATTLIST element (attributes) >
Element Specification in DTD
n Subelements can be specified as
H names of elements, or
H #PCDATA (parsed character data), i.e., character strings
H EMPTY (no subelements) or ANY (anything can be a subelement)
n Example
<!ELEMENT depositor (customer-name, account-number)>
<!ELEMENT customer-name (#PCDATA)>
<!ELEMENT account-number (#PCDATA)>
n Subelement specification may have regular expressions
<!ELEMENT bank ( ( account | customer | depositor)+)>
4 Notation:
– “|” - alternatives
– “+” - 1 or more occurrences
– “*” - 0 or more occurrences
Bank DTD
<!DOCTYPE bank [
<!ELEMENT bank ( ( account | customer | depositor)+)>
<!ELEMENT account (account-number, branch-name, balance)>
<!ELEMENT customer (customer-name, customer-street,
customer-city)>
<!ELEMENT depositor (customer-name, account-number)>
<!ELEMENT account-number (#PCDATA)>
<!ELEMENT branch-name (#PCDATA)>
<!ELEMENT balance (#PCDATA)>
<!ELEMENT customer-name (#PCDATA)>
<!ELEMENT customer-street (#PCDATA)>
<!ELEMENT customer-city (#PCDATA)>
]>
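Since DTD content models are regular expressions over subelement names, one way to get a feel for them is to check child sequences with an ordinary regex engine. The sketch below is not a DTD validator: the content models are transcribed by hand, the space-terminated encoding of child names is an assumption of this example, and any element without an entry is treated as #PCDATA.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the bank DTD, rewritten by hand as regular expressions
# over a space-terminated list of child tag names (hypothetical encoding).
CONTENT_MODELS = {
    "bank": r"((account|customer|depositor) )+",
    "account": r"account-number branch-name balance ",
    "customer": r"customer-name customer-street customer-city ",
    "depositor": r"customer-name account-number ",
}

def conforms(element):
    """Check element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:              # treated as #PCDATA: no subelements allowed
        return len(element) == 0
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in element)

good = ET.fromstring("<bank><depositor><customer-name>Johnson</customer-name>"
                     "<account-number>A-101</account-number></depositor></bank>")
bad = ET.fromstring("<bank><depositor><account-number>A-101</account-number>"
                    "<customer-name>Johnson</customer-name></depositor></bank>")
print(conforms(good), conforms(bad))  # True False
```

The second document fails only because the two subelements of depositor appear in the wrong order, which is exactly the kind of constraint a sequence in a DTD expresses.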
Attribute Specification in DTD
n Attribute specification : for each attribute
H Name
H Type of attribute
4 CDATA
4 ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
– more on this later
H Whether
4 mandatory (#REQUIRED)
4 has a default value (value),
4 or neither (#IMPLIED)
n Examples
H <!ATTLIST account acct-type CDATA "checking">
H <!ATTLIST customer
customer-id ID #REQUIRED
accounts IDREFS #REQUIRED >
IDs and IDREFs
n An element can have at most one attribute of type ID
n The ID attribute value of each element in an XML document must be distinct
H Thus the ID attribute value is an object identifier
n An attribute of type IDREF must contain the ID value of an element in the same document
n An attribute of type IDREFS contains a set of (0 or more) ID values, separated by blanks. Each of these values must be the ID value of an element in the same document
Bank DTD with Attributes
n Bank DTD with ID and IDREF attribute types.
<!DOCTYPE bank-2 [
<!ELEMENT account (branch-name, balance)>
<!ATTLIST account
account-number ID #REQUIRED
owners IDREFS #REQUIRED>
<!ELEMENT customer (customer-name, customer-street,
customer-city)>
<!ATTLIST customer
customer-id ID #REQUIRED
accounts IDREFS #REQUIRED>
… declarations for branch-name, balance, customer-name,
customer-street and customer-city
]>
XML data with ID and IDREF attributes
<bank-2>
<account account-number=“A-401” owners=“C100 C102”>
<branch-name> Downtown </branch-name>
<balance> 500 </balance>
</account>
<customer customer-id=“C100” accounts=“A-401”>
<customer-name>Joe </customer-name>
<customer-street> Monroe </customer-street>
<customer-city> Madison</customer-city>
</customer>
<customer customer-id=“C102” accounts=“A-401 A-402”>
<customer-name> Mary </customer-name>
<customer-street> Erin </customer-street>
<customer-city> Newark </customer-city>
</customer>
</bank-2>
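The ID and IDREFS rules can be checked mechanically. The sketch below does so by hand with Python's standard xml.etree.ElementTree, which does not validate against a DTD. The two tables saying which attribute is an ID and which is an IDREFS are copied from the bank-2 DTD and are assumptions of this example. Run on the fragment shown above, it reports A-402 as dangling, because that account element is not part of the fragment.

```python
import xml.etree.ElementTree as ET

BANK_2 = """
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name>Downtown</branch-name><balance>500</balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe</customer-name>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name>Mary</customer-name>
  </customer>
</bank-2>
"""

# Which attribute plays the ID / IDREFS role per element type, copied by hand
# from the DTD (a validating parser would read this from the DTD itself).
ID_ATTR = {"account": "account-number", "customer": "customer-id"}
IDREFS_ATTR = {"account": "owners", "customer": "accounts"}

root = ET.fromstring(BANK_2)
ids = [e.get(ID_ATTR[e.tag]) for e in root]

# Rule 1: ID values must be distinct across the whole document.
duplicates = {i for i in ids if ids.count(i) > 1}

# Rule 2: every value in an IDREFS attribute must be the ID of some element.
dangling = sorted(
    ref
    for e in root
    for ref in e.get(IDREFS_ATTR[e.tag], "").split()
    if ref not in ids
)
print(duplicates)  # set()
print(dangling)    # ['A-402']
```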
Limitations of DTDs
n No typing of text elements and attributes
H All values are strings, no integers, reals, etc.
n Difficult to specify unordered sets of subelements
H Order is usually irrelevant in databases
H (A | B)* allows specification of an unordered set, but
4 Cannot ensure that each of A and B occurs only once
n IDs and IDREFs are untyped
H The owners attribute of an account may contain a reference to another account, which is meaningless
4 owners attribute should ideally be constrained to refer to customer elements
XML Schema
n XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports
H Typing of values
4 E.g. integer, string, etc
4 Also, constraints on min/max values
H User defined types
H Is itself specified in XML syntax, unlike DTDs
4 More standard representation, but verbose
H Is integrated with namespaces
H Many more features
4 List types, uniqueness and foreign key constraints, inheritance ..
n BUT: significantly more complicated than DTDs, not yet widely used.
XML Schema Version of Bank DTD
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="bank" type="BankType"/>
<xsd:element name="account">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="account-number" type="xsd:string"/>
<xsd:element name="branch-name" type="xsd:string"/>
<xsd:element name="balance" type="xsd:decimal"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
….. definitions of customer and depositor ….
<xsd:complexType name="BankType">
<xsd:sequence>
<xsd:element ref="account" minOccurs="0" maxOccurs="unbounded"/>
<xsd:element ref="customer" minOccurs="0" maxOccurs="unbounded"/>
<xsd:element ref="depositor" minOccurs="0" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
Querying and Transforming XML Data
n Translation of information from one XML schema to another
n Querying on XML data
n Above two are closely related, and handled by the same tools
n Standard XML querying/translation languages
H XPath
4 Simple language consisting of path expressions
H XSLT
4 Simple language designed for translation from XML to XML and XML to HTML
H XQuery
4 An XML query language with a rich set of features
n A wide variety of other languages have been proposed, and some served as the basis for the XQuery standard
H XML-QL, Quilt, XQL, …
Tree Model of XML Data
n Query and transformation languages are based on a tree model of XML data
n An XML document is modeled as a tree, with nodes corresponding to elements and attributes
H Element nodes have children nodes, which can be attributes or subelements
H Text in an element is modeled as a text node child of the element
H Children of a node are ordered according to their order in the XML document
H Element and attribute nodes (except for the root node) have a single parent, which is an element node
H The root node has a single child, which is the root element of the document
n We use the terminology of nodes, children, parent, siblings, ancestor, descendant, etc., which should be interpreted in the above tree model of XML data.
XPath
n XPath is used to address (select) parts of documents using
path expressions
n A path expression is a sequence of steps separated by “/”
H Think of file names in a directory hierarchy
n Result of path expression: set of values that along with their containing elements/attributes match the specified path
n E.g. /bank-2/customer/customer-name evaluated on the bank-2 data we saw earlier returns
<customer-name>Joe</customer-name>
<customer-name>Mary</customer-name>
n E.g. /bank-2/customer/customer-name/text( )
returns the same names, but without the enclosing tags
n The initial “/” denotes root of the document (above the top-level tag)
n Path expressions are evaluated left to right
H Each step operates on the set of instances produced by the previous step
n Selection predicates may follow any step in a path, in [ ]
H E.g. /bank-2/account[balance > 400]
4 returns account elements with a balance value greater than 400
4 /bank-2/account[balance] returns account elements containing a balance subelement
n Attributes are accessed using “@”
H E.g. /bank-2/account[balance > 400]/@account-number
4 returns the account numbers of those accounts with balance > 400
H IDREF attributes are not dereferenced automatically (more on this later)
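The path expressions above can be tried out with Python's standard xml.etree.ElementTree, which implements only a small subset of XPath: child steps and existence predicates work, while numeric comparisons and text( ) have to be done in Python. This is a sketch under that limitation, and the second account (A-402) is added here purely so the balance predicate has something to filter out.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name>Downtown</branch-name><balance>500</balance>
  </account>
  <account account-number="A-402" owners="C102">
    <branch-name>Perryridge</branch-name><balance>400</balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe</customer-name>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name>Mary</customer-name>
  </customer>
</bank-2>
""")

# /bank-2/customer/customer-name/text(): root is already the bank-2 element,
# so the path is written relative to it and the text is read in Python.
names = [n.text for n in root.findall("customer/customer-name")]

# /bank-2/account[balance]: the existence predicate is supported directly.
with_balance = root.findall("account[balance]")

# /bank-2/account[balance > 400]/@account-number: numeric comparison is not
# part of ElementTree's XPath subset, so the filter is applied in Python.
rich = [a.get("account-number")
        for a in root.findall("account")
        if float(a.findtext("balance")) > 400]

print(names)              # ['Joe', 'Mary']
print(len(with_balance))  # 2
print(rich)               # ['A-401']
```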
Functions in XPath
n XPath provides several functions
H The function count() takes a path as its argument and counts the number of elements in the set generated by that path
4 E.g. /bank-2/account[count(./customer) > 2]
– Returns accounts with > 2 customers
H Also function for testing position (1, 2, ..) of node w.r.t. siblings
n Boolean connectives and and or and function not() can be used in predicates
n IDREFs can be referenced using function id()
H id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
H E.g. /bank-2/account/id(@owners)
4 returns all customers referred to from the owners attribute of account elements.
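What id( ) does can be imitated with a dictionary from ID values to elements. The sketch below uses Python's standard xml.etree.ElementTree, which has no id( ) function of its own. The helper deref and the third customer (C103) are made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<bank-2>'
    '<account account-number="A-401" owners="C100 C102"/>'
    '<customer customer-id="C100"><customer-name>Joe</customer-name></customer>'
    '<customer customer-id="C102"><customer-name>Mary</customer-name></customer>'
    '<customer customer-id="C103"><customer-name>Ann</customer-name></customer>'
    '</bank-2>'
)

# Index of ID value -> element, built once for the whole document.
by_id = {c.get("customer-id"): c for c in root.findall("customer")}

def deref(idrefs):
    """Imitate id(): map a blank-separated IDREFS string to elements."""
    return [by_id[ref] for ref in idrefs.split()]

# Counterpart of /bank-2/account/id(@owners)
owners = [c.findtext("customer-name")
          for a in root.findall("account")
          for c in deref(a.get("owners"))]

# Counterpart of a count() predicate: accounts with more than one owner.
multi_owner = [a.get("account-number")
               for a in root.findall("account")
               if len(deref(a.get("owners"))) > 1]

print(owners)       # ['Joe', 'Mary']
print(multi_owner)  # ['A-401']
```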
More XPath Features
n Operator “|” used to implement union
H E.g. /bank-2/account/id(@owners) | /bank-2/loan/id(@borrower)
4 gives customers with either accounts or loans
4 However, “|” cannot be nested inside other operators.
n “//” can be used to skip multiple levels of nodes
H E.g. /bank-2//customer-name
4 finds any customer-name element anywhere under the /bank-2 element, regardless of the element in which it is contained.
n A step in the path can go to:
parents, siblings, ancestors and descendants
of the nodes generated by the previous step, not just to the children
H “//”, described above, is a short form for specifying “all descendants”
H “..” specifies the parent.
H We omit further details.
XSLT
n A stylesheet stores formatting options for a document, usually separately from document
H E.g. HTML style sheet may specify font colors and sizes for headings, etc.
n The XML Stylesheet Language (XSL) was originally designed for generating HTML from XML
n XSLT is a general-purpose transformation language
H Can translate XML to XML, and XML to HTML
n XSLT transformations are expressed using rules called templates
H Templates combine selection using XPath with construction of results
XSLT Templates
n Example of XSLT template with match and select part
<xsl:template match=“/bank-2/customer”>
<xsl:value-of select=“customer-name”/>
</xsl:template>
<xsl:template match=“*”/>
n The match attribute of xsl:template specifies a pattern in XPath
n Elements in the XML document matching the pattern are processed by the actions within the xsl:template element
H xsl:value-of selects (outputs) specified values (here, customer-name)
n For elements that do not match any template
H Attributes and text contents are output as is
H Templates are recursively applied on subelements
n The <xsl:template match=“*”/> template matches all
elements that do not match any other template
H Used to ensure that their contents do not get output.
n If an element matches several templates, only one is used
H Which one depends on a complex priority scheme/user-defined priorities
H We assume only one template matches any element
Creating XML Output
n Any text or tag in the XSL stylesheet that is not in the xsl namespace is output as is
n E.g. to wrap results in new XML elements.
<xsl:template match=“/bank-2/customer”>
<customer>
<xsl:value-of select=“customer-name”/>
</customer>
</xsl:template>
<xsl:template match=“*”/>
H Example output:
<customer> Joe </customer>
<customer> Mary </customer>
n Note: Cannot directly insert a xsl:value-of tag inside another tag
H E.g. cannot create an attribute for <customer> in the previous example by directly using xsl:value-of
H XSLT provides a construct xsl:attribute to handle this situation
4 xsl:attribute adds an attribute to the enclosing output element
4 E.g. <customer>
<xsl:attribute name="customer-id">
<xsl:value-of select="@customer-id"/>
</xsl:attribute>
</customer>
results in output of the form
<customer customer-id=“….”> ….
n xsl:element is used to create output elements with computed names
Structural Recursion
Joins in XSLT
Sorting in XSLT
n Using an xsl:sort directive inside a template causes all elements matching the template to be sorted
H Sorting is done before applying other templates
n E.g.
<xsl:template match=“/bank”>
<xsl:apply-templates select=“customer”>
<xsl:sort select=“customer-name”/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match=“customer”>
<customer>
<xsl:value-of select=“customer-name”/>
<xsl:value-of select=“customer-street”/>
<xsl:value-of select=“customer-city”/>
</customer>
</xsl:template>
<xsl:template match=“*”/>
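Python's standard library has no XSLT processor, so the following is only an imitation of what the stylesheet above does, written with xml.etree.ElementTree: select the customer children, sort them by customer-name, then apply a "template" function to each. The function name customer_template and the sample data are assumptions of this sketch, not XSLT.

```python
import xml.etree.ElementTree as ET

bank = ET.fromstring(
    '<bank>'
    '<customer><customer-name>Mary</customer-name>'
    '<customer-street>Erin</customer-street>'
    '<customer-city>Newark</customer-city></customer>'
    '<customer><customer-name>Joe</customer-name>'
    '<customer-street>Monroe</customer-street>'
    '<customer-city>Madison</customer-city></customer>'
    '<account><account-number>A-401</account-number></account>'
    '</bank>'
)

def customer_template(customer):
    """Plays the role of the template with match="customer"."""
    out = ET.Element("customer")
    # Three xsl:value-of steps in a row emit their text back to back.
    out.text = "".join(customer.findtext(tag) for tag in
                       ("customer-name", "customer-street", "customer-city"))
    return out

# Plays the role of xsl:apply-templates with a nested xsl:sort: sorting
# happens first, then the template is applied to each selected element.
selected = sorted(bank.findall("customer"),
                  key=lambda c: c.findtext("customer-name"))
result = [ET.tostring(customer_template(c), encoding="unicode")
          for c in selected]
print(result)
```

The account element produces no output here, which mirrors the effect of the catch-all empty template.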
XQuery
n XQuery is a general purpose query language for XML data
n Currently being standardized by the World Wide Web Consortium (W3C)
H The textbook description is based on a March 2001 draft of the standard. The final version may differ, but the major features are likely to stay unchanged.
n Alpha version of XQuery engine available free from Microsoft
n XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL
n XQuery uses a
for … let … where … return …
syntax
for ⇔ SQL from
where ⇔ SQL where
return ⇔ SQL select
let allows temporary variables, and has no equivalent in SQL
FLWR Syntax in XQuery
n For clause uses XPath expressions, and variable in for clause ranges over values in the set returned by XPath
n Simple FLWR expression in XQuery
H find all accounts with balance > 400, with each result enclosed in an <account-number> .. </account-number> tag
for $x in /bank-2/account
let $acctno := $x/@account-number
where $x/balance > 400
return <account-number> $acctno </account-number>
n The let clause is not really needed in this query, and the selection can be done in XPath. The query can be written as:
for $x in /bank-2/account[balance>400]
return <account-number> $x/@account-number
</account-number>
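For readers more at home in a procedural language, the FLWR query above lines up clause by clause with an ordinary loop. This is a Python analogy using xml.etree.ElementTree, not XQuery, and the three sample accounts are invented for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<bank-2>'
    '<account account-number="A-401"><balance>500</balance></account>'
    '<account account-number="A-402"><balance>400</balance></account>'
    '<account account-number="A-403"><balance>900</balance></account>'
    '</bank-2>'
)

results = []
for x in root.findall("account"):                 # for $x in /bank-2/account
    acctno = x.get("account-number")              # let $acctno := $x/@account-number
    if float(x.findtext("balance")) > 400:        # where $x/balance > 400
        out = ET.Element("account-number")        # return <account-number> ...
        out.text = acctno
        results.append(ET.tostring(out, encoding="unicode"))

print(results)
```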
Path Expressions and Functions
n Path expressions are used to bind variables in the for clause, but can also be used in other places
H E.g. path expressions can be used in let clause, to bind variables to results of path expressions
n The function distinct( ) can be used to remove duplicates in path expression results
n The function document(name) returns root of named document
H E.g. document(“bank-2.xml”)/bank-2/account
n Aggregate functions such as sum( ) and count( ) can be applied to path expression results
n XQuery does not support group by, but the same effect can be obtained by nested queries, with nested FLWR expressions within a return clause
H More on nested queries later
Joins
n Joins are specified in a manner very similar to SQL
for $a in /bank/account,
$c in /bank/customer,
$d in /bank/depositor
where $a/account-number = $d/account-number
and $c/customer-name = $d/customer-name
return <cust-acct> $c $a </cust-acct>
n The same query can be expressed with the selections specified as XPath selections:
for $a in /bank/account,
$c in /bank/customer,
$d in /bank/depositor[
account-number = $a/account-number and
customer-name = $c/customer-name]
return <cust-acct> $c $a</cust-acct>
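The join semantics can be spelled out as three nested loops with the two equality conditions as a filter. The sketch below is a Python analogy over the flat bank structure, using xml.etree.ElementTree; the sample data is made up, and the result is reduced to (customer-name, account-number) pairs to keep the output short.

```python
import xml.etree.ElementTree as ET

bank = ET.fromstring(
    '<bank>'
    '<account><account-number>A-101</account-number>'
    '<branch-name>Downtown</branch-name><balance>500</balance></account>'
    '<account><account-number>A-102</account-number>'
    '<branch-name>Perryridge</branch-name><balance>400</balance></account>'
    '<customer><customer-name>Johnson</customer-name></customer>'
    '<customer><customer-name>Hayes</customer-name></customer>'
    '<depositor><account-number>A-101</account-number>'
    '<customer-name>Johnson</customer-name></depositor>'
    '<depositor><account-number>A-102</account-number>'
    '<customer-name>Hayes</customer-name></depositor>'
    '</bank>'
)

# One loop per variable of the for clause, the where clause as the filter.
pairs = [
    (c.findtext("customer-name"), a.findtext("account-number"))
    for a in bank.findall("account")
    for c in bank.findall("customer")
    for d in bank.findall("depositor")
    if a.findtext("account-number") == d.findtext("account-number")
    and c.findtext("customer-name") == d.findtext("customer-name")
]
print(pairs)  # [('Johnson', 'A-101'), ('Hayes', 'A-102')]
```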
Changing Nesting Structure
n The following query converts data from the flat structure for bank information into the nested structure used in bank-1
<bank-1>
for $c in /bank/customer
return
<customer>
$c/*
for $d in /bank/depositor[customer-name = $c/customer-name],
$a in /bank/account[account-number=$d/account-number]
return $a
</customer>
</bank-1>
n $c/* denotes all the children of the node to which $c is bound, without the enclosing top-level tag
n Exercise for reader: write a nested query to find sum of account
balances, grouped by branch.
XQuery Path Expressions
n $c/text() gives text content of an element without any
subelements/tags
n XQuery path expressions support the “–>” operator for dereferencing IDREFs
H Equivalent to the id( ) function of XPath, but simpler to use
H Can be applied to a set of IDREFs to get a set of results
H June 2001 version of standard has changed “–>” to “=>”
Sorting in XQuery
n Sortby clause can be used at the end of any expression. E.g. to return customers sorted by name
for $c in /bank/customer
return <customer> $c/* </customer> sortby(customer-name)
n Can sort at multiple levels of nesting (sort by customer-name, and by account-number within each customer)
<bank-1>
for $c in /bank/customer
return
<customer>
$c/*
for $d in /bank/depositor[customer-name=$c/customer-name],
$a in /bank/account[account-number=$d/account-number]
return <account> $a/* </account> sortby(account-number)
</customer> sortby(customer-name)
</bank-1>
Functions and Other XQuery Features
n User-defined functions with the type system of XML Schema
function balances(xsd:string $c) returns list(xsd:numeric) {
for $d in /bank/depositor[customer-name = $c],
$a in /bank/account[account-number=$d/account-number]
return $a/balance
}
n Types are optional for function parameters and return values
n Universal and existential quantification in where clause predicates
H some $e in path satisfies P
H every $e in path satisfies P
n XQuery also supports if-then-else clauses
Application Program Interface
n There are two standard application program interfaces to XML data:
H SAX (Simple API for XML)
4 Based on parser model, user provides event handlers for parsing events
– E.g. start of element, end of element
– Not suitable for database applications
H DOM (Document Object Model)
4 XML data is parsed into a tree representation
4 Variety of functions provided for traversing the DOM tree
4 E.g.: Java DOM API provides a Node interface with methods
getParentNode( ), getFirstChild( ), getNextSibling( )
getAttribute( ), getElementsByTagName( ) (on element nodes)
getData( ) (on text nodes), …
4 Also provides functions for updating DOM tree
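Both interfaces are available in Python's standard library, which makes for a compact side-by-side sketch: xml.sax for the event-handler model and xml.dom.minidom for the tree model. Python's DOM binding exposes most of the Java getters as plain properties (parentNode, firstChild, nextSibling, data). The handler class name TagLogger is invented for this example.

```python
import xml.sax
from xml.dom import minidom

XML = ('<account acct-type="checking">'
       '<account-number>A-102</account-number>'
       '<balance>400</balance>'
       '</account>')

class TagLogger(xml.sax.ContentHandler):
    """SAX: the parser calls these methods as it meets each parsing event."""
    def __init__(self):
        super().__init__()
        self.events = []
    def startElement(self, name, attrs):
        self.events.append(("start", name))
    def endElement(self, name):
        self.events.append(("end", name))

handler = TagLogger()
xml.sax.parseString(XML.encode("utf-8"), handler)
print(handler.events[:2])  # [('start', 'account'), ('start', 'account-number')]

# DOM: the whole document is parsed into a tree that can then be traversed.
dom = minidom.parseString(XML)
account = dom.documentElement
first = account.firstChild                     # the <account-number> element
text = first.firstChild.data                   # data of its text node child
sibling = first.nextSibling.tagName            # next element under <account>
parent = first.parentNode.tagName
acct_type = account.getAttribute("acct-type")
balances = dom.getElementsByTagName("balance")
print(text, sibling, parent, acct_type, len(balances))
```

The SAX handler never holds more than the current event, while the DOM version keeps the full tree in memory. That difference is the main trade-off between the two models.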
Storage of XML Data
n XML data can be stored in
H Non-relational data stores
4 Flat files
– Natural for storing XML
– But has all problems discussed in Chapter 1 (no concurrency, no recovery, …)
4 XML database
– Database built specifically for storing XML data, supporting DOM model and declarative querying
– Currently no commercial-grade systems
H Relational databases
4 Data must be translated into relational form
4 Advantage: mature database systems
4 Disadvantages: overhead of translating data and queries
Storage of XML in Relational Databases
n Alternatives:
H String Representation
H Tree Representation
H Map to relations
String Representation
n Store each top level element as a string field of a tuple in a relational database
H Use a single relation to store all elements, or
H Use a separate relation for each top-level element type
4 E.g. account, customer, depositor relations
– Each with a string-valued attribute to store the element
n Indexing:
H Store values of subelements/attributes to be indexed as extra fields of the relation, and build indices on these fields
4 E.g. customer-name or account-number
H Oracle 9 supports function indices which use the result of a function as the key value.
4 The function should return the value of the required subelement/attribute
n Benefits:
H Can store any XML data even without DTD
H As long as there are many top-level elements in a document, strings are small compared to full document
4 Allows fast access to individual elements.
n Drawback: Need to parse strings to access values inside the elements
H Parsing is slow.
Tree Representation
n Tree representation: model XML data as tree and store using relations
nodes(id, type, label, value)
child (child-id, parent-id)
n Each element/attribute is given a unique identifier
n Type indicates element/attribute
n Label specifies the tag name of the element/name of attribute
n Value is the text value of the element/attribute
n The relation child notes the parent-child relationships in the tree
H Can add an extra attribute to child to record ordering of children
n Benefit: Can store any XML data, even without DTD
n Drawbacks:
H Data is broken up into too many pieces, increasing space overheads
H Even simple queries require a large number of joins, which can be slow
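A minimal sketch of the tree representation, assuming SQLite (via Python's sqlite3 module) as the relational store. The column names use underscores instead of hyphens, and child carries the optional position column mentioned above. The helper store is hypothetical. The final query shows how many joins even "balance of account A-102" needs.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, type TEXT, label TEXT, value TEXT)")
db.execute("CREATE TABLE child (child_id INTEGER, parent_id INTEGER, position INTEGER)")
next_id = itertools.count(1)

def store(element, parent_id=None, position=0):
    """Insert element, its attributes and its subelements; return its id."""
    node_id = next(next_id)
    value = (element.text or "").strip()
    db.execute("INSERT INTO nodes VALUES (?, 'element', ?, ?)",
               (node_id, element.tag, value))
    if parent_id is not None:
        db.execute("INSERT INTO child VALUES (?, ?, ?)",
                   (node_id, parent_id, position))
    for name, attr_value in element.attrib.items():
        attr_id = next(next_id)
        db.execute("INSERT INTO nodes VALUES (?, 'attribute', ?, ?)",
                   (attr_id, name, attr_value))
        db.execute("INSERT INTO child VALUES (?, ?, NULL)", (attr_id, node_id))
    for i, sub in enumerate(element):
        store(sub, node_id, i)
    return node_id

store(ET.fromstring(
    '<account acct-type="checking">'
    '<account-number>A-102</account-number>'
    '<balance>400</balance>'
    '</account>'))

# Find the balance that shares a parent with account-number A-102.
row = db.execute("""
    SELECT b.value
    FROM nodes n JOIN child cn ON cn.child_id = n.id
                 JOIN child cb ON cb.parent_id = cn.parent_id
                 JOIN nodes b  ON b.id = cb.child_id
    WHERE n.label = 'account-number' AND n.value = 'A-102'
      AND b.label = 'balance'
""").fetchone()
print(row)  # ('400',)
```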
Mapping XML Data to Relations
n Map to relations
H If DTD of document is known, can map data to relations
H A relation is created for each element type
4 Elements (of type #PCDATA), and attributes are mapped to attributes of relations
4 More details on next slide …
n Benefits:
H Efficient storage
H Can translate XML queries into SQL, execute efficiently, and then translate SQL results back to XML
n Drawbacks: need to know DTD, translation overheads still present
n Relation created for each element type contains
H An id attribute to store a unique id for each element
H A relation attribute corresponding to each element attribute
H A parent-id attribute to keep track of parent element
4 As in the tree representation
4 Position information (ith child) can be stored too
n All subelements that occur only once can become relation attributes
H For text-valued subelements, store the text as attribute value
H For complex subelements, can store the id of the subelement
n Subelements that can occur multiple times represented in a separate table
H Similar to handling of multivalued attributes when converting ER diagrams to tables
n E.g. For bank-1 DTD with account elements nested within customer elements, create relations
H customer(id, parent-id, customer-name, customer-street, customer-city)
4 parent-id can be dropped here since parent is the sole root element
4 All other attributes were subelements of type #PCDATA, and occur only once
H account (id, parent-id, account-number, branch-name, balance)
4 parent-id keeps track of which customer an account occurs under
4 Same account may be represented many times with different parents
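The bank-1 mapping can be sketched with SQLite through Python's sqlite3 module. Column names use underscores rather than hyphens, parent_id is dropped from customer as suggested above, and the second account (A-103) is invented to show one customer owning several rows in account.

```python
import sqlite3
import xml.etree.ElementTree as ET

bank1 = ET.fromstring(
    '<bank-1>'
    '<customer><customer-name>Hayes</customer-name>'
    '<customer-street>Main</customer-street>'
    '<customer-city>Harrison</customer-city>'
    '<account><account-number>A-102</account-number>'
    '<branch-name>Perryridge</branch-name><balance>400</balance></account>'
    '<account><account-number>A-103</account-number>'
    '<branch-name>Downtown</branch-name><balance>250</balance></account>'
    '</customer>'
    '</bank-1>'
)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, "
           "customer_name TEXT, customer_street TEXT, customer_city TEXT)")
db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, parent_id INTEGER, "
           "account_number TEXT, branch_name TEXT, balance REAL)")

for customer in bank1.findall("customer"):
    # Subelements that occur once become columns of the customer row.
    cur = db.execute(
        "INSERT INTO customer (customer_name, customer_street, customer_city) "
        "VALUES (?, ?, ?)",
        (customer.findtext("customer-name"),
         customer.findtext("customer-street"),
         customer.findtext("customer-city")))
    customer_id = cur.lastrowid
    # Repeating subelements go to their own table, linked by parent_id.
    for account in customer.findall("account"):
        db.execute(
            "INSERT INTO account (parent_id, account_number, branch_name, balance) "
            "VALUES (?, ?, ?, ?)",
            (customer_id,
             account.findtext("account-number"),
             account.findtext("branch-name"),
             float(account.findtext("balance"))))

rows = db.execute("""
    SELECT c.customer_name, a.account_number, a.balance
    FROM account a JOIN customer c ON c.id = a.parent_id
    ORDER BY a.account_number
""").fetchall()
print(rows)
```

Unlike the tree representation, reassembling a customer with its accounts takes a single join here, at the price of needing the DTD to design the tables.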