What is an XML Database?
An eXtensible Markup Language (XML) database is a software system that stores
data in XML format. XML is a meta-markup language that uses user-customizable
tags to organize information. The flexibility
of the language, which allows the creation of custom data structures and
organizational systems, has led to its widespread use to exchange data in
multiple forms. XML databases are often used in applications such as
informational portals, document exchanges, and product catalogs.
Because XML is so widely used for transporting data, using an XML database is
generally considered more efficient in terms of data conversion costs. There
are two major categories of these databases: XML-enabled databases and native
XML databases (NXD). Each type is suited to storing a different kind of data.
An XML-enabled database maps XML data onto a traditional relational
database. Incoming XML is translated into the database's own format for
storage and translated back into XML on output. This type of database is used
to store data-centric documents, which contain highly structured information
such as patient records and use XML only for data transfer.
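The translate-on-input, translate-on-output behaviour described above can be sketched roughly as follows. This is a minimal illustration using Python's standard sqlite3 and xml.etree.ElementTree modules, not the mechanism of any particular product; the patient table and the element names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data-centric document; the element names are illustrative only.
SOURCE_XML = """<patients>
  <patient id="1"><name>Ann Lee</name><age>42</age></patient>
  <patient id="2"><name>Raj Patel</name><age>35</age></patient>
</patients>"""


def store(conn, xml_text):
    """Shred the XML into rows of an ordinary relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patient "
        "(id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    for p in ET.fromstring(xml_text).findall("patient"):
        conn.execute(
            "INSERT INTO patient (id, name, age) VALUES (?, ?, ?)",
            (int(p.get("id")), p.findtext("name"), int(p.findtext("age"))),
        )


def load(conn):
    """Rebuild an XML document from the stored rows."""
    root = ET.Element("patients")
    rows = conn.execute("SELECT id, name, age FROM patient ORDER BY id")
    for pid, name, age in rows:
        p = ET.SubElement(root, "patient", id=str(pid))
        ET.SubElement(p, "name").text = name
        ET.SubElement(p, "age").text = str(age)
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
store(conn, SOURCE_XML)
print(load(conn))
```

Note that the rebuilt document carries the same data but not the original whitespace or layout, which is usually acceptable for data-centric documents.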
Native XML databases store XML documents as a whole, instead of
separating out the data within them, and are designed to store semi-structured
information, such as marketing brochures or health data. XML documents that
contain semi-structured data are referred to as document-centric. A native XML
database is not tied to a particular physical storage model: it can use
relational, hierarchical, or object-oriented structures as well as custom
storage formats. It manages documents by grouping them into logical
collections, and can set up and manage multiple collections simultaneously.
This type of database permits the user to store any type of XML document,
regardless of structure, within the same collection. Queries can be constructed
across the whole collection, generally making data organization and
manipulation more flexible.
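As a rough illustration of collections that hold whole documents of any structure and can be queried as a unit, consider the toy sketch below. The Collection class is hypothetical and in-memory only; real native XML databases add persistence, indexing and a full query language.

```python
import xml.etree.ElementTree as ET


class Collection:
    """A named group of whole XML documents (toy stand-in for an NXD collection)."""

    def __init__(self, name):
        self.name = name
        self.documents = {}  # document key -> parsed document root

    def add(self, key, xml_text):
        # The document is kept whole; no schema is required or checked.
        self.documents[key] = ET.fromstring(xml_text)

    def query(self, path):
        """Run one path expression over every document in the collection."""
        hits = []
        for key, root in self.documents.items():
            for node in root.findall(path):
                hits.append((key, node.text))
        return hits


docs = Collection("marketing")
docs.add("brochure", "<brochure><title>Spring Sale</title><body>Up to 20% off.</body></brochure>")
docs.add("memo", "<memo><to>Sales</to><title>Q2 targets</title></memo>")

# Two differently structured documents answer the same query.
print(docs.query(".//title"))
```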
XML databases are typically queried with XQuery, a query language designed
specifically to extract and manipulate data in XML documents. The purpose of
XQuery is to allow the construction of flexible queries over XML documents, as
well as over other sources that can be viewed as XML. Some applications in
which XQuery can be used
include searching text documents on the Web for relevant data and compiling the
results, extracting data from databases to be used in application integration,
and generating reports on the data contained in an XML database.
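To give a feel for the shape of such a query, the sketch below shows a small XQuery expression in a comment and an approximate Python equivalent built on the standard library. It is an analogy only: the catalog document and the price threshold are made up, and the Python code emulates the query rather than running XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog used only for this illustration.
CATALOG = """<catalog>
  <product><name>Widget</name><price>25.00</price></product>
  <product><name>Gadget</name><price>12.50</price></product>
  <product><name>Doohickey</name><price>40.00</price></product>
</catalog>"""

# Roughly the same query written in XQuery would look like:
#   for $p in doc("catalog.xml")/catalog/product
#   where $p/price > 20
#   order by $p/name
#   return $p/name/text()


def expensive_products(xml_text, min_price):
    """Names of products priced above min_price, sorted by name."""
    root = ET.fromstring(xml_text)
    names = [
        p.findtext("name")
        for p in root.findall("product")            # for $p in /catalog/product
        if float(p.findtext("price")) > min_price   # where $p/price > 20
    ]
    return sorted(names)                            # order by $p/name


print(expensive_products(CATALOG, 20))
```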
XML and Databases
In the document-centric model, XML is typically used as a means of creating
semi-structured documents with irregular content that are meant for human
consumption. An example of document-centric usage of XML is XHTML, which is the
XML-based successor to HTML.
Sample XHTML document
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Sample Web Page</title>
  </head>
  <body>
    <h1>My Sample Web Page</h1>
    <p>All XHTML documents must be well-formed and valid.</p>
    <img src="http://www.example.com/sample.jpg" height="50" width="25"/>
  </body>
</html>
The other primary usage of XML is in a data-centric model. In a
data-centric model, XML is used as a storage or interchange format for data
that is structured, appears in a regular order, and is most likely to be
machine processed rather than read by a human. Here the fact that the data is
stored or transferred as XML is typically incidental, since it could be stored
or transferred in a number of other formats that may or may not be better
suited to the task, depending on the data and how it is used. An example of
data-centric usage of XML is SOAP, an XML-based protocol for exchanging
information in a decentralized, distributed environment. SOAP consists of three
parts: an envelope that defines a framework for describing what is in a message
and how to process it, a set of encoding rules for expressing instances of
application-defined data types, and a convention for representing remote
procedure calls and responses.
Sample SOAP message, taken from the W3C SOAP 1.1 specification
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetLastTradePrice xmlns:m="Some-URI">
      <symbol>DIS</symbol>
    </m:GetLastTradePrice>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
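Because a message like this is meant to be machine processed, a receiver would normally parse it rather than read it. The following is a minimal sketch of how the envelope, body and remote procedure call could be picked apart with Python's standard library; the read_request helper is hypothetical, and a real SOAP stack does considerably more (headers, faults, typed decoding).

```python
import xml.etree.ElementTree as ET

# The sample message from the text, with the namespace URI quoted.
SOAP_MESSAGE = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetLastTradePrice xmlns:m="Some-URI">
      <symbol>DIS</symbol>
    </m:GetLastTradePrice>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

# Prefix-to-namespace map used for lookups; the prefix name "soap" is arbitrary.
NS = {"soap": "http://schemas.xmlsoap.org/soap/envelope/"}


def read_request(message):
    """Return (method name, arguments) for the call carried in the SOAP body."""
    envelope = ET.fromstring(message)
    body = envelope.find("soap:Body", NS)
    call = body[0]  # the first child of Body is the remote procedure call
    # ElementTree reports tags as "{namespace}localname"; keep the local name.
    method = call.tag.split("}", 1)[-1]
    args = {child.tag: child.text for child in call}
    return method, args


print(read_request(SOAP_MESSAGE))
```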
In both models of XML usage, it is sometimes necessary to store the XML in
some sort of repository or database that allows for more sophisticated storage
and retrieval of the data, especially if the XML is to be accessed by multiple
users. Below is a description of the storage options for each model of XML
usage.
Data-centric model
In a data-centric model where data is stored in a relational
database or similar repository, one may want to extract data from the database
as XML, store XML in the database, or both. When one only needs to extract XML
from the database, one may use a middleware application or component that
retrieves data from the database and returns it as XML. Middleware
components that transform relational data to XML and back vary widely in the
functionality they provide and how they provide it. For instance, Microsoft's
ADO.NET provides XML integration to such a degree that results from queries on
XML documents or SQL databases can be accessed identically via the same API.
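A middleware component of the simplest kind, one that only turns query results into XML, might look roughly like the sketch below. It uses Python's sqlite3 and xml.etree.ElementTree modules; the query_to_xml function, the product table and the results/row element names are all hypothetical, and the sketch is unrelated to how ADO.NET is implemented.

```python
import sqlite3
import xml.etree.ElementTree as ET


def query_to_xml(conn, sql, params=(), root_tag="results", row_tag="row"):
    """Run a SELECT and return the result set as an XML string.

    Assumes every column name is also a valid XML element name.
    """
    cursor = conn.execute(sql, params)
    columns = [d[0] for d in cursor.description]
    root = ET.Element(root_tag)
    for row in cursor:
        row_el = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            cell = ET.SubElement(row_el, column)
            cell.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical table used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (sku TEXT, price REAL)")
conn.executemany("INSERT INTO product VALUES (?, ?)", [("A-1", 9.5), ("B-2", 20.0)])

print(query_to_xml(conn, "SELECT sku, price FROM product ORDER BY sku"))
```

Keeping the conversion generic, driven by the cursor's column names, means the same component can serve any query, at the cost of a flat row-and-column XML shape.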
The alternative to using middleware components to retrieve or
store XML in a database is to use an XML-enabled database that understands how
to convert relational data to XML and back. Currently, the three major
relational database products all support retrieving and storing XML in one form
or another. IBM's DB2, for example, uses the DB2 XML Extender, which gives one
the option to store an entire XML document and its DTD as a user-defined column
or to slice the document into multiple tables and columns. XML documents can
then be queried with syntax that complies with the W3C XPath recommendation.
XML data can also be updated using stored procedures.
Document-centric model
Content management systems are typically the tool of choice when
considering storing, updating and retrieving various XML documents in a shared
repository. A content management system typically consists of a repository that
stores a variety of XML documents, an editor and an engine that provides one or
more of the following features:
o version, revision and access control
o ability to reuse documents in different formats
o collaboration
o web publishing facilities
o support for a variety of text editors (e.g. Microsoft Word, Adobe FrameMaker, etc.)
o indexing and search capabilities
Content management systems have primarily benefited workflow
management in corporate environments where information sharing is vital. They
also make it possible to manage the creation of web content in a modular
fashion, allowing web developers and content creators to perform their tasks
with less interdependence than in a traditional web authoring environment.
Hybrid model
In situations where both document-centric and data-centric models of XML usage
will occur, the best data storage choice is usually a native XML database. The
most coherent definition of the term so far is one reached by consensus among
members of the XML:DB mailing list. It defines a native XML database as a
database that has an XML document as its fundamental unit of (logical) storage,
defines a (logical) model for an XML document (as opposed to the data in that
document), and stores and retrieves documents according to that model. At a
minimum, the model must include elements, attributes, PCDATA, and document
order.
Tamino is a native XML database management system developed by Software AG.
Tamino is a relatively mature application, currently at version 2.3.1, that
provides the means to store and retrieve XML documents, store and retrieve
relational data, and interface with external applications and data sources.
Schemas in Tamino are DTD-based and are used primarily to describe how the XML
data should be indexed. When storing XML documents in Tamino, one can specify a
pre-existing DTD, which is then converted to a Tamino schema; store a
well-formed XML document without a schema, in which case default indexing is
applied; or create a schema from scratch for the document being stored. A
secondary use of schemas is to specify the data types in XML documents.
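The minimum logical model named in the native XML database definition above (elements, attributes, PCDATA and document order) can be made concrete with a small sketch. The code below walks a document with Python's standard library and lists those items in the order they appear; the document_order function and the sample note are hypothetical, and the sketch says nothing about how Tamino or any other product stores documents internally.

```python
import xml.etree.ElementTree as ET


def document_order(xml_text):
    """List elements, attributes and character data in document order."""
    events = []

    def visit(element):
        events.append(("element", element.tag))
        for name, value in element.attrib.items():
            events.append(("attribute", f"{name}={value}"))
        # Text before the first child element.
        if element.text and element.text.strip():
            events.append(("pcdata", element.text.strip()))
        for child in element:
            visit(child)
            # Text that follows the child inside the current element.
            if child.tail and child.tail.strip():
                events.append(("pcdata", child.tail.strip()))

    visit(ET.fromstring(xml_text))
    return events


for kind, value in document_order('<note lang="en">Call <b>Ann</b> today</note>'):
    print(kind, value)
```

A store that preserved only the data values ("Call", "Ann", "today") without their position could not reproduce this mixed-content document, which is why document order is part of the minimum model.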