Sunday, August 17, 2014

XML XSD Introduction

XML stands for Extensible Markup Language. XML lets you define your own markup language. For example, we all know HTML, the HyperText Markup Language. HTML is a standard and its markup is well defined: if we want a line break we write <br/>, and because it is a standard, all browsers understand it in the same way. But HTML is limited to what the standard defines. XML was introduced to give people the ability to define their own markup languages. It took the world by storm, and a wide range of technologies, tools, and frameworks adopted it quickly. Before XML came into the picture, people represented data using CSV (comma-separated values) files or similar delimited formats. However, these plain text files were hard for humans to read, and there was no standard way to validate their structure. The lack of a standard format also meant a lack of tools to handle them robustly.

Let's write a simple XML document and note how much easier it is to read than the same information represented in a CSV file.

user.xml

<?xml version="1.0" encoding="UTF-8"?>
<user startDate="2008-04-04">
    <homeAddress country="India">
        <houseNo>D-1</houseNo>
        <society>Akshay Park</society>
        <locality>Thergaon</locality>
        <city>Pune</city>
        <pin>411033</pin>
    </homeAddress>
    <officeAddress country="India">
        <houseNo>10</houseNo>
        <society>Akshay Center</society>
        <locality>Thergaon</locality>
        <city>Pune</city>
        <pin>411033</pin>
    </officeAddress>
    <productBought>
        <product productNo="AAA123">
            <name>YoYo</name>
            <quantity>4</quantity>
            <price>89</price>
            <comment>Green Colour</comment>
        </product>
        <product productNo="XYZ123">
            <name>Rocking Chair</name>
            <quantity>1</quantity>
            <price>3000</price>
        </product>
    </productBought>
</user>
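
To see how a program reads such a document, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The choice of Python, the trimmed copy of user.xml, and the helper name order_total are assumptions made for illustration. Values are looked up by element and attribute name rather than by column position, as they would be in a CSV file.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of user.xml; in practice the text would be read from a file.
USER_XML = """\
<user>
    <homeAddress country="India">
        <city>Pune</city>
        <pin>411033</pin>
    </homeAddress>
    <productBought>
        <product>
            <name>YoYo</name>
            <quantity>4</quantity>
            <price>89</price>
        </product>
        <product>
            <name>Rocking Chair</name>
            <quantity>1</quantity>
            <price>3000</price>
        </product>
    </productBought>
</user>
"""


def order_total(xml_text):
    """Sum quantity * price over every <product> element in the document."""
    root = ET.fromstring(xml_text)
    total = 0
    for product in root.iter("product"):
        quantity = int(product.findtext("quantity"))
        price = int(product.findtext("price"))
        total += quantity * price
    return total


root = ET.fromstring(USER_XML)
print(root.tag)                                  # name of the root element
print(root.find("homeAddress").get("country"))   # value of an attribute
print(order_total(USER_XML))                     # 4 * 89 + 1 * 3000
```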

Let's look into some of the rules of XML syntax.

XML Naming convention

Blank spaces are not permitted in XML names. Names are case sensitive: <product> and <Product> are two different elements. A name must start with a letter or an underscore.
Prolog: The first line of an XML document, <?xml version="1.0" encoding="UTF-8" standalone="yes"?>, is known as the prolog (the XML declaration). It is optional but good practice. If it is present, version is mandatory and the other two attributes are optional. encoding identifies the character encoding of the document. standalone states whether the document depends on external markup declarations, such as an external DTD.
An XML document contains exactly one root element. In user.xml the root element is <user>.
A comment is written as <!-- this is a comment -->
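
These rules can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for this example. A parser rejects a document that breaks the syntax rules, so catching the parse error gives a quick well-formedness check.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<user><city>Pune</city></user>"))           # True: one root, balanced tags
print(is_well_formed("<user></user><user></user>"))               # False: two root elements
print(is_well_formed("<product>YoYo</Product>"))                  # False: names are case sensitive
print(is_well_formed("<home address>Pune</home address>"))        # False: blank inside a name
print(is_well_formed("<user><!-- this is a comment --></user>"))  # True: comments are allowed
```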

Elements and Attributes

An element can contain other elements nested inside it, and it can also carry attributes.

<officeAddress country="India">
    <houseNo>10</houseNo>
    <society>Akshay Center</society>
    …
</officeAddress>

In the fragment above, country is an attribute and houseNo is an element. You have to choose whether to represent a piece of data as an attribute or as an element. Some rules of thumb are:

If a child item can occur multiple times, it has to be represented as an element, because an attribute can appear only once on an element.
Usually an inherent property of an element is represented as an attribute, and its sub-parts are represented as child elements. Remember that this is just a rule of thumb. Attributes also reduce the size of the XML to some extent.

CDATA

A CDATA section marks a block of text as literal, so that the parser does not scan it for tags and entity references but treats it as a plain string of characters.

<![CDATA[ <html> <p> </html> ]]>
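
A small sketch of the effect, using Python's standard xml.etree.ElementTree module. The <note> wrapper element and the helper name element_text are assumptions added so the fragment forms a complete document. With the CDATA section the markup-like characters arrive as plain text. Without it the same characters are parsed as tags and the document is rejected.

```python
import xml.etree.ElementTree as ET


def element_text(xml_text):
    """Parse the document and return the text content of its root element."""
    return ET.fromstring(xml_text).text


WITH_CDATA = "<note><![CDATA[ <html> <p> </html> ]]></note>"
WITHOUT_CDATA = "<note> <html> <p> </html> </note>"

# The CDATA content is delivered as a literal string, not as child elements.
print(repr(element_text(WITH_CDATA)))

# The same characters outside CDATA are parsed as tags, and <p> is never closed.
try:
    element_text(WITHOUT_CDATA)
except ET.ParseError as error:
    print("not well formed:", error)
```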

Namespace

Namespaces are used to prevent naming collisions. If no namespace is declared, the elements do not belong to any namespace. Suppose we have two XML fragments:

<Book>
    <Name>The Wonder that was India</Name>
    <Price>400</Price>
</Book>

The other fragment:

<Book>
    <Train>Jhelum</Train>
    <Number>1028</Number>
</Book>

Combining the two fragments in one document creates a conflict, because <Book> means two different things. A namespace is declared using the xmlns attribute:

xmlns:lib="http://www.oyejava.com/lib"
xmlns:trn="http://www.oyejava.com/trn"

A default namespace is declared without a prefix, and unprefixed elements within its scope belong to it:

xmlns="http://www.lalit.com/lib"

Now the conflict can be resolved using namespaces:

<lib:Book>
    <lib:Name>The Wonder that was India</lib:Name>
    <lib:Price>400</lib:Price>
</lib:Book>

<trn:Book>
    <trn:Train>Jhelum</trn:Train>
    <trn:Number>1028</trn:Number>
</trn:Book>
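
Here is a hedged sketch of how a parser sees the two kinds of Book once they are qualified, using Python's standard xml.etree.ElementTree module. The <catalog> wrapper element is an assumption, added only so that both fragments and their xmlns declarations fit in one document.

```python
import xml.etree.ElementTree as ET

# <catalog> is a hypothetical wrapper that carries the namespace declarations.
COMBINED = """\
<catalog xmlns:lib="http://www.oyejava.com/lib"
         xmlns:trn="http://www.oyejava.com/trn">
    <lib:Book>
        <lib:Name>The Wonder that was India</lib:Name>
        <lib:Price>400</lib:Price>
    </lib:Book>
    <trn:Book>
        <trn:Train>Jhelum</trn:Train>
        <trn:Number>1028</trn:Number>
    </trn:Book>
</catalog>
"""

# Prefix-to-URI map used in the search paths; the prefixes here are local to the code.
NS = {
    "lib": "http://www.oyejava.com/lib",
    "trn": "http://www.oyejava.com/trn",
}

root = ET.fromstring(COMBINED)
library_book = root.find("lib:Book", NS)
train_book = root.find("trn:Book", NS)

# The parser expands each prefix to its URI, so the two names no longer collide.
print(library_book.tag)   # {http://www.oyejava.com/lib}Book
print(train_book.tag)     # {http://www.oyejava.com/trn}Book
print(library_book.findtext("lib:Name", namespaces=NS))
print(train_book.findtext("trn:Number", namespaces=NS))
```

Note that the prefix itself carries no meaning; what identifies the namespace is the URI it is bound to.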

Validation of XML

One of the big reasons for the popularity of XML is that it can be validated using tools. A valid XML document:
  • Is well formed: the tags are properly nested and closed.
  • Conforms to the XML specification.
  • Conforms to the constraints defined in a schema definition.

The grammar of an XML document is defined using a schema definition language. There are two ways to define it:
  • DTD (Document Type Definition) - This has largely gone out of fashion, but you might still see a DTD referenced in XML documents.
  • XSD (XML Schema Definition) - This is the prevalent way of defining XML grammar. An XSD is itself written in XML, so it can be processed by any tool that understands XML.

XSD

Let's write an XSD for our user.xml.

user.xsd

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:annotation>
        <xsd:documentation xml:lang="en">User schema for oyejava.com. Copyright 2008 oyejava.com. All rights reserved.</xsd:documentation>
    </xsd:annotation>

    <xsd:element name="user" type="UserType"/>

    <!-- Global element required by the ref="comment" references below; its string type is an assumption -->
    <xsd:element name="comment" type="xsd:string"/>

    <xsd:complexType name="UserType">
        <xsd:sequence>
            <xsd:element name="homeAddress" type="Address"/>
            <xsd:element name="officeAddress" type="Address"/>
            <xsd:element ref="comment" minOccurs="0"/>
            <xsd:element name="productBought" type="Products"/>
        </xsd:sequence>
        <xsd:attribute name="startDate" type="xsd:date"/>
    </xsd:complexType>

    <xsd:complexType name="Address">
        <xsd:sequence>
            <xsd:element name="houseNo" type="xsd:string"/>
            <xsd:element name="society" type="xsd:string"/>
            <xsd:element name="locality" type="xsd:string"/>
            <xsd:element name="city" type="xsd:string"/>
            <xsd:element name="pin" type="xsd:decimal"/>
        </xsd:sequence>
        <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="India"/>
    </xsd:complexType>

    <xsd:complexType name="Products">
        <xsd:sequence>
            <xsd:element name="product" minOccurs="0" maxOccurs="unbounded">
                <xsd:complexType>
                    <xsd:sequence>
                        <xsd:element name="name" type="xsd:string"/>
                        <xsd:element name="quantity">
                            <xsd:simpleType>
                                <xsd:restriction base="xsd:positiveInteger">
                                    <xsd:maxExclusive value="100"/>
                                </xsd:restriction>
                            </xsd:simpleType>
                        </xsd:element>
                        <xsd:element name="price" type="xsd:decimal"/>
                        <xsd:element ref="comment" minOccurs="0"/>
                        <xsd:element name="purchaseDate" type="xsd:date" minOccurs="0"/>
                    </xsd:sequence>
                    <xsd:attribute name="productNo" type="productCode" use="required"/>
                </xsd:complexType>
            </xsd:element>
        </xsd:sequence>
    </xsd:complexType>

    <!-- Product Code, a code for identifying products -->
    <xsd:simpleType name="productCode">
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[A-Z][A-Z][A-Z][0-9][0-9][0-9]"/>
        </xsd:restriction>
    </xsd:simpleType>
</xsd:schema>
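
The schema types startDate as xsd:date, whose lexical form is YYYY-MM-DD, so a value written day-first, such as 04-04-2008, would be rejected by a validating parser. The Python sketch below approximates that check by hand. The helper name looks_like_xsd_date is made up, and the sketch ignores the optional time zone suffix and the negative years that xsd:date also allows.

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day, separated by hyphens.
_XSD_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def looks_like_xsd_date(text):
    """Rough check for the YYYY-MM-DD form of xsd:date (no time zone handling)."""
    match = _XSD_DATE.fullmatch(text)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)   # rejects impossible dates such as month 13
    except ValueError:
        return False
    return True


print(looks_like_xsd_date("2008-04-04"))   # True
print(looks_like_xsd_date("04-04-2008"))   # False: day-first order
print(looks_like_xsd_date("2008-13-01"))   # False: there is no month 13
```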

Let's look into different definitions in XSD.

Elements

Elements define the content of an XML data document. For example:

<xsd:element name="user" type="UserType"/>

An element's type can be simple or complex. The built-in simple types are defined by the XML Schema specification and include:

  • string
  • boolean
  • decimal
  • float
  • double
  • hexBinary and base64Binary
  • anyURI
  • dateTime
  • duration

There are many more; see http://www.w3.org/TR/xmlschema-0/ for the full list. We can also define new simple types:

<xsd:simpleType name="productCode">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="[A-Z][A-Z][A-Z][0-9][0-9][0-9]"/>
    </xsd:restriction>
</xsd:simpleType>
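
To get a feel for what the pattern facet accepts, here is a small Python sketch. It assumes the intended pattern for productCode is three uppercase letters followed by three digits, and the helper name is_product_code is made up. An XSD pattern has to match the whole value, which re.fullmatch mirrors.

```python
import re

# Assumed productCode pattern: three uppercase letters, then three digits.
PRODUCT_CODE = re.compile(r"[A-Z][A-Z][A-Z][0-9][0-9][0-9]")


def is_product_code(value):
    """Return True if the whole value matches the productCode pattern."""
    return PRODUCT_CODE.fullmatch(value) is not None


print(is_product_code("AAA123"))    # True
print(is_product_code("XYYZ123"))   # False: four letters
print(is_product_code("aaa123"))    # False: lowercase letters
print(is_product_code("AAA1234"))   # False: four digits
```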

The number of occurrences of an element can be constrained.

minOccurs sets the minimum number of occurrences.
maxOccurs sets the maximum number of occurrences.
When either is unspecified, it defaults to 1.

To specify that the element must occur at least once but may occur any number of times:

<xsd:element name="product" minOccurs="1" maxOccurs="unbounded">
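
The occurrence rule itself is simple counting. The sketch below is a hand-rolled illustration in Python, not a schema validator. The helper name occurrence_ok is made up, and max_occurs=None stands in for maxOccurs="unbounded".

```python
import xml.etree.ElementTree as ET


def occurrence_ok(parent, child_name, min_occurs=1, max_occurs=1):
    """Check a child count against minOccurs/maxOccurs (None means unbounded)."""
    count = len(parent.findall(child_name))
    if count < min_occurs:
        return False
    return max_occurs is None or count <= max_occurs


two_products = ET.fromstring("<productBought><product/><product/></productBought>")
no_products = ET.fromstring("<productBought/>")

# minOccurs="1" maxOccurs="unbounded": at least one product is needed.
print(occurrence_ok(two_products, "product", 1, None))   # True
print(occurrence_ok(no_products, "product", 1, None))    # False

# With both left at the default of 1, exactly one product is allowed.
print(occurrence_ok(two_products, "product"))            # False
```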

Complex types consist of other elements and attributes.

<xsd:complexType name="UserType">
    <xsd:sequence>
        <xsd:element name="homeAddress" type="Address"/>
        <xsd:element name="officeAddress" type="Address"/>
        <xsd:element ref="comment" minOccurs="0"/>
        <xsd:element name="productBought" type="Products"/>
    </xsd:sequence>
    <xsd:attribute name="startDate" type="xsd:date"/>
</xsd:complexType>

In the hierarchical structure of elements, leaf elements that hold only text and carry no attributes are of simple type; elements with child elements or attributes are of complex type. Elements can be nested to any depth, and the schema language imposes no limit on this. When a type definition is not going to be reused, it can be defined inline as an anonymous (nameless) type.

<xsd:complexType name="Products">
    <xsd:sequence>
        <xsd:element name="product" minOccurs="0" maxOccurs="unbounded">
            <xsd:complexType>
                <xsd:sequence>
                    <xsd:element name="name" type="xsd:string"/>
                    <xsd:element name="quantity">
                    …
                </xsd:sequence>
                <xsd:attribute name="productNo" type="productCode" use="required"/>
            </xsd:complexType>
        </xsd:element>
    </xsd:sequence>
</xsd:complexType>

Attributes

Attributes provide additional information to an XML data element.

<xsd:complexType name="Address">
    <xsd:sequence>
        …
    </xsd:sequence>
    <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="India"/>
</xsd:complexType>

Attributes are optional by default. To make an attribute mandatory, set use="required" (minOccurs and maxOccurs apply to elements, not to attributes):

<xsd:attribute name="productNo" type="productCode" use="required"/>

An enumeration can be used to restrict the allowed values:

<xsd:attribute name="country" default="India">
    <xsd:simpleType>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="India"/>
            <xsd:enumeration value="Nepal"/>
        </xsd:restriction>
    </xsd:simpleType>
</xsd:attribute>

Inheritance of Complex Types

XML Schema provides two kinds of inheritance:
  • Extension
  • Restriction

Extension inherits the elements and attributes of the base type and adds new ones. (The om prefix in the examples is assumed to be bound to the target namespace of the schema that defines Address.)

<xsd:complexType name="ExtendedAddress">
    <xsd:complexContent>
        <xsd:extension base="om:Address">
            <xsd:sequence>
                <xsd:element name="state" type="xsd:string"/>
            </xsd:sequence>
        </xsd:extension>
    </xsd:complexContent>
</xsd:complexType>

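Restriction narrows the base type. The derived type restates the part of the content model it keeps, and every instance of the restricted type must still be valid against the base type, so only elements that are optional in the base type can be left out. For the example below to be accepted, houseNo, society and locality would therefore have to be declared with minOccurs="0" in Address.

<xsd:complexType name="RestrictedAddress">
    <xsd:complexContent>
        <xsd:restriction base="om:Address">
            <xsd:sequence>
                <xsd:element name="city" type="xsd:string"/>
                <xsd:element name="pin" type="xsd:decimal"/>
            </xsd:sequence>
        </xsd:restriction>
    </xsd:complexContent>
</xsd:complexType>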

Extended and restricted types can be used polymorphically in instance documents through the xsi:type attribute, where the xsi prefix is bound to http://www.w3.org/2001/XMLSchema-instance.

Extension:

<homeAddress country="India" xsi:type="om:ExtendedAddress">
    <houseNo>D-1</houseNo>
    ….
</homeAddress>

Restriction:

<officeAddress country="India" xsi:type="om:RestrictedAddress">
    <city>Pune</city>
    …
</officeAddress>
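
To a parser that is not schema-aware, xsi:type is just an ordinary attribute in the XMLSchema-instance namespace. The Python sketch below shows how a program could read it. The namespace URI bound to om is a placeholder, since it is not given here, and the helper name declared_type is made up.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# The URI bound to the om prefix is a placeholder chosen for this sketch.
DOC = """\
<user xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:om="http://www.oyejava.com/om">
    <homeAddress country="India" xsi:type="om:ExtendedAddress">
        <houseNo>D-1</houseNo>
    </homeAddress>
    <officeAddress country="India"/>
</user>
"""


def declared_type(element):
    """Return the raw xsi:type value of an element, or None if it has none."""
    return element.get("{%s}type" % XSI)


root = ET.fromstring(DOC)
print(declared_type(root.find("homeAddress")))     # om:ExtendedAddress
print(declared_type(root.find("officeAddress")))   # None
```

A validating parser goes one step further and resolves that name to the derived type in the schema.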

The parser then validates the element against the derived type named in xsi:type.

Abstract Type

An abstract type cannot be used directly in an instance document. A type derived from it must be used instead.

<xsd:complexType name="Address" abstract="true">

Final

A complex type can be declared final to restrict derivation from it:

<xsd:complexType name="Address" final="extension">

Possible values for the final attribute are:
  • restriction – prevents derivation by restriction.
  • extension – prevents derivation by extension.
  • #all – prevents any derivation from the type.

Inheritance of Simple Type

We can fine-tune validation by deriving new simple types:

<xsd:simpleType name="price">
    <xsd:restriction base="xsd:float">
        <xsd:minInclusive value="0"/>
        <xsd:maxExclusive value="10000"/>
    </xsd:restriction>
</xsd:simpleType>

Facets restrict the values that a simple type allows. The important ones for float are:
  • maxInclusive – inclusive upper bound
  • maxExclusive – exclusive upper bound
  • minInclusive – inclusive lower bound
  • minExclusive – exclusive lower bound
  • enumeration – set of allowed values
  • pattern – format of values using regular expression.

Similar facets are available for other types.
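
The bounds and enumeration facets boil down to simple comparisons. Here is a hedged Python sketch of what a validator checks for them. The function names satisfies_facets and is_valid_price are made up for illustration, and the pattern facet is left out.

```python
def satisfies_facets(value, min_inclusive=None, min_exclusive=None,
                     max_inclusive=None, max_exclusive=None, enumeration=None):
    """Return True if the value passes every facet that was supplied."""
    if min_inclusive is not None and value < min_inclusive:
        return False
    if min_exclusive is not None and value <= min_exclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    if max_exclusive is not None and value >= max_exclusive:
        return False
    if enumeration is not None and value not in enumeration:
        return False
    return True


def is_valid_price(value):
    """Mirror the price type: minInclusive 0 and maxExclusive 10000."""
    return satisfies_facets(value, min_inclusive=0, max_exclusive=10000)


print(is_valid_price(0.0))       # True: the lower bound is inclusive
print(is_valid_price(9999.99))   # True
print(is_valid_price(10000))     # False: the upper bound is exclusive
print(satisfies_facets("Nepal", enumeration={"India", "Nepal"}))   # True
```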

Importing Schema

A schema may import types from other schemas, allowing more modular schema design and type reuse.

xmlns:abc="http://www.oyejava.com/ABC"
  ....
<import namespace="http://www.oyejava.com/ABC" schemaLocation="http://www.oyejava.com/ABC/abc.xsd"/>

The import mechanism lets us combine schemas from different namespaces into a larger, more complex schema. To combine schemas that have exactly the same target namespace, use include instead.

Schema 1: (Location - http://www.oyejava.com/XYZ/xyz_1.xsd)

<schema targetNamespace="http://www.oyejava.com/XYZ" … >

Include the first schema in the second, under the same target namespace.

Schema 2: (Location - http://www.oyejava.com/XYZ/xyz.xsd)

<schema targetNamespace="http://www.oyejava.com/XYZ" … >
<include schemaLocation="http://www.oyejava.com/XYZ/xyz_1.xsd"/>


XML Parsers


Parsers help in reading and generating XML. Large XML documents are not meant to be processed by hand, so they have to be handled programmatically, and that is where parsers come into the picture. Because XML is so ubiquitous, nearly every programming language supports processing it. Some parsers also support validation, which means checking that the XML instance document contains only valid elements and values; for example, that the node which is supposed to hold a country name does not hold a state name. For validation, the parser has to be provided with an XSD (XML schema definition).
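
As a minimal sketch of both directions, the Python snippet below generates a <product> element with the standard xml.etree.ElementTree module and then reads it back. The helper name build_product is made up for this example. Note that this module only checks well-formedness; validating against an XSD needs a schema-aware library.

```python
import xml.etree.ElementTree as ET


def build_product(product_no, name, quantity, price):
    """Create a <product> element shaped like the ones in user.xml."""
    product = ET.Element("product", {"productNo": product_no})
    for tag, value in (("name", name), ("quantity", quantity), ("price", price)):
        ET.SubElement(product, tag).text = str(value)
    return product


# Generating: serialize the element tree to a string.
xml_text = ET.tostring(build_product("AAA123", "YoYo", 4, 89), encoding="unicode")
print(xml_text)
# <product productNo="AAA123"><name>YoYo</name><quantity>4</quantity><price>89</price></product>

# Reading: parse the string back and pull out values by name.
parsed = ET.fromstring(xml_text)
print(parsed.get("productNo"), parsed.findtext("price"))
```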
