Chapter 7 - DTDs and Schema - XML: A Deeper Understanding by John Shirrell

7.1 DTDs and Schema

The principle of documentation still applies to XML. Even though you might create an XML vocabulary so simple that anyone could understand what a name or address element is, someone who is newly introduced to your vocabulary needs to know which items are parents, which are children, what they contain, what attributes are available, what their default values are, and so on. Do not document your XML vocabulary on sticky notes! There is a much better solution available: the Document Type Definition (DTD). Contrast the word Definition with Declaration: the Declaration that you place in your XML document declares the Definition, which will often be its own .dtd file.

DTDs were originally a part of SGML, and are now a part of the XML specification. They are structured lists of elements and attributes, and their relationships to one another. DTDs are not written in XML; instead, they are written more like the DOCTYPE tag (the Document Type Declaration) from before. The file structure of a DTD can be unwieldy to look at, but it can be parsed by an XML editor or by utilities that draw the DTD as a tree diagram. Several XML editors have an autocomplete feature that helps you fill out tags automatically based on the DTD. They can also validate your document against the DTD before publishing. By creating a DTD, you are documenting your new XML vocabulary so others will be able to understand it without any ambiguity. However, if there are any additional notes to add, it is recommended that you add documentation to the DTD file within comments (same format as always: <!-- comment -->).
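To make the distinction between declaration and definition concrete, here is a minimal sketch of a document that carries its definition inline, in what the specification calls the internal subset. The note, to, and message element names are made up for this illustration and are not part of Fred's vocabulary.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The DOCTYPE line is the declaration; everything between [ and ] is the definition. -->
<!DOCTYPE note [
  <!-- Comments inside the DTD document the vocabulary. -->
  <!ELEMENT note (to, message)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT message (#PCDATA)>
]>
<note>
  <to>Fred</to>
  <message>The coupon system is live.</message>
</note>
```

When the definition lives in its own file, the bracketed part is replaced by the SYSTEM keyword followed by the quoted location of the .dtd file.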

7.2 Structure

Fred has finally created that website, and now he has begun a new e-mail coupon system to drive his restaurant business. He has wisely chosen to use XML for this system. Here is a sample document for a coupon in Fred's system:

<?xml version="1.1" encoding="UTF-8"?>
<!DOCTYPE coupon SYSTEM>
<coupon>
 <serial-number>1234567890</serial-number>
 <valid-at>
  <location>FREDS</location>
  <location>LITTL</location>
 </valid-at>
 <deal>
  <location>FREDS</location>
  <value>5.00</value>
  <requirement guests="8" dollars="75.00" />
 </deal>
 <deal>
  <location>LITTL</location>
  <value>7.00</value>
  <requirement guests="8" dollars="75.00" />
 </deal>
 <body>
  <text type="header">
   Save $5 at your next party at Fred's, or $7 off your next party at Little Italy!
  </text>
  <text type="regular">
   You will receive $5 off your check at Fred's Restaurant,
   or $7 off your check at Little Italy, when you bring a
   party of eight or more to visit and purchase at least
   $75 worth of food and drink.
  </text>
 </body>
 <terms>
  <boiler code="LIMIT1" />
  <boiler code="NOCOMBINE" />
  <boiler code="GRATUITY8" />
  <text>
   Coupon may not be applied toward price of alcoholic beverages.
  </text>
 </terms>
</coupon>

This might seem self-explanatory at first glance, but there are quite a few things you might be wondering. Does text occur anywhere in the document? Are requirements either-or or must they all be met? Fred now wants to create a Document Type Definition file for this document, to document the system's vocabulary for anyone who will ever need these questions answered.

DTDs can become very complicated—just look at the DTD for XHTML. The easiest way to start is with the root element and with a general comment on the vocabulary:

<!--
Fred's Restaurant Network
Coupon Document Type Definition

Defines the XML vocabulary used for defining coupons.
Coupon files are used for printing and/or e-mailing the coupons, storing a
local copy of the coupons, and validating and calculating the discounts in
the point of sale system when presented.
-->

<!ELEMENT coupon EMPTY>

This is now a DTD for an XML document that can only contain the coupon element with no contents. The EMPTY keyword is required for empty elements; you may not define an element without giving it some definition of its valid contents. It is easiest to continue defining the document type by working down to the deeper levels of the tree. You start by listing the root element's children:

...
<!ELEMENT coupon (serial-number, valid-at, deal+, body, terms?)>

<!ELEMENT serial-number EMPTY>

<!ELEMENT valid-at EMPTY>

<!ELEMENT deal EMPTY>

<!ELEMENT body EMPTY>

<!ELEMENT terms EMPTY>

Notice how all of the new elements have been added to a list, in parentheses, on the coupon element definition. This indicates that those elements appear as children, and because the names are separated by commas, they must appear in the order given. To be more specific, all elements are required to appear as children of coupon for the document to be valid, except terms. To specify that a child element is optional, you place a ? immediately after the element name. To specify that a child element may be repeated, you place either a * or a + after the element name; * denotes an optional, repeatable element (appears zero or more times), + denotes a repeatable element that is not optional (appears one or more times). An element with no operator must occur exactly once. If you prefer, you may set these restrictions on a set of elements within parentheses by placing the operator at the end, as in this example:

...
<!ELEMENT m-m-bag (red, orange, yellow, green, blue, purple, brown)*>
...

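By placing the operator after the closing parenthesis, you apply it to the whole group: here the complete sequence of colors, in the order given, may occur zero or more times. (To allow the colors in any order and in any number, separate the names with | instead of commas, which turns the group into a choice.) If you want to add another element to which a different rule applies, you may sub-group: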

...
<!ELEMENT m-m-bag ((red, orange, yellow, green, blue, purple, brown)*, size)>
...
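The separator inside the parentheses matters as much as the operator after them. The following sketch, using a hypothetical bag vocabulary with only three colors, contrasts a repeated sequence with a repeated choice; the bags, ordered-bag, and mixed-bag element names are assumptions made for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE bags [
  <!ELEMENT bags (ordered-bag, mixed-bag)>
  <!-- Comma: each repetition must be exactly red, then green, then blue. -->
  <!ELEMENT ordered-bag (red, green, blue)*>
  <!-- Vertical bar: each repetition is any one color, so any mix in any order. -->
  <!ELEMENT mixed-bag (red | green | blue)*>
  <!ELEMENT red EMPTY>
  <!ELEMENT green EMPTY>
  <!ELEMENT blue EMPTY>
]>
<bags>
  <ordered-bag><red /><green /><blue /><red /><green /><blue /></ordered-bag>
  <mixed-bag><blue /><blue /><red /></mixed-bag>
</bags>
```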

Returning to Fred's system, we know that serial-number contains only the serial number. How is this represented? XML elements may contain Parsed Character Data (PCDATA; you may remember this from the XHTML chapter). This is signified with #PCDATA:

...
<!ELEMENT serial-number (#PCDATA)>
...

This raises a question that many have about the XML specification. Why would an element's contents be Parsed Character Data when it can only contain character data, no elements? The reason is that no matter what your DTD says, the contents of serial-number are still parsed for markup by the parser (non-validating parsers are not required to read external DTD files). Just as with XHTML, if you have a field like serial-number and you need to use special characters like the less-than sign or ampersand, you must either escape them as &lt; and &amp; or wrap them in a CDATA section to prevent those characters from confusing the parser.
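As a sketch of the two ways to protect special characters, here is a hypothetical note element (not part of Fred's vocabulary) whose character data contains an ampersand and a less-than sign. Both children carry the same text once parsed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<notes>
  <!-- Option 1: entity references for the special characters -->
  <note>Fish &amp; chips for parties &lt; 8</note>
  <!-- Option 2: a CDATA section, inside which & and < are taken literally -->
  <note><![CDATA[Fish & chips for parties < 8]]></note>
</notes>
```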

The location is also PCDATA:

...
<!ELEMENT valid-at (location+)>

<!ELEMENT location (#PCDATA)>
...

One weakness of the DTD is that you cannot specify a fixed list of allowed values on an element. It is possible to limit the allowed values in the DTD on an attribute, which would involve rewriting the vocabulary to suit the change. However, it would not be wise for Fred to put the names of his restaurants into the DTD. The DTD should define the document, and only the document. By putting his restaurant names into the DTD, Fred would need to update his DTD any time he opens a new restaurant. Although that may seem like a rare occurrence, it is a sign of a bad DTD. Instead Fred has location codes as PCDATA in the content of the location element, and his system validates the location code against a database rather than using the DTD.
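For illustration only, this is roughly what the attribute-based alternative would look like. The code attribute and the empty location element are assumptions made for this sketch; Fred's actual vocabulary keeps the code as element content for the reasons given above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE valid-at [
  <!ELEMENT valid-at (location+)>
  <!ELEMENT location EMPTY>
  <!-- The restaurant codes are now frozen into the DTD. -->
  <!ATTLIST location
            code (FREDS|LITTL) #REQUIRED>
]>
<valid-at>
  <location code="FREDS" />
  <location code="LITTL" />
</valid-at>
```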

The next two elements require some explanation. It is a great opportunity to add comments to the DTD:

...
<!ELEMENT deal (location+, value, requirement*)>

<!ELEMENT value (#PCDATA)>  <!-- Value is in the format N.NN for dollars and
                                 cents. Do not use a dollar sign. -->

<!ELEMENT requirement EMPTY> <!-- Multiple requirements are treated as
                                  a meet any relationship.
                                  All attributes within one requirement must
                                  be met at the same time.
                                  A coupon is valid if all attributes within
                                  any one requirement are met. -->
...

This explains the use of requirement more thoroughly. (Note that location is not redefined, it was already defined above.) To add the attributes that are valid on this element, you add an attribute list or ATTLIST:

...
<!ELEMENT requirement EMPTY> <!-- ... -->
<!ATTLIST requirement
          guests CDATA #IMPLIED
          dollars CDATA #IMPLIED>
...

Both attributes are #IMPLIED, which means they are optional. Later on you will see a use of #REQUIRED, which means the attribute must be present. The CDATA keyword signifies that the attribute contains character data. In its place you could put a list of valid values, or one of a few other attribute types (such as ID, IDREF, or NMTOKEN) that can be found in the specification. CDATA is the one you will use the most often; it is usually easier (and necessary) to validate character data in the application program than with a DTD. The dollar amount in dollars, for example, might contain a comma instead of a decimal point. A DTD cannot validate that kind of data. A good example of an enumerated list of attribute values is a day of the week:

...
          weekday (su|mo|tu|we|th|fr|sa) #REQUIRED
...

The document will fail a DTD validation if the given attribute does not exactly match (case-sensitive) any of the values on the list. Remember that most XML parsers do not validate against the DTD, so if yours does not, you still need to validate this attribute in your application program. This may only serve to help document the vocabulary (because it can be confusing when some systems use two-letter days of the week, some use three, some use one, some use the whole word, etc.).
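Put together as a complete declaration, an enumerated attribute might look like the following sketch. The special element is a hypothetical addition used only to give the weekday attribute a home.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE special [
  <!ELEMENT special (#PCDATA)>
  <!ATTLIST special
            weekday (su|mo|tu|we|th|fr|sa) #REQUIRED>
]>
<!-- weekday="tu" is valid; weekday="Tu" or weekday="tue" would fail validation. -->
<special weekday="tu">Half-price appetizers</special>
```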

On to the next element:

...
<!ELEMENT body (text*, image*)>

<!ELEMENT text (#PCDATA)>
<!ATTLIST text
          type (header|regular) #IMPLIED>
...

This section should be fairly self-explanatory. Note that the type attribute is a good use of predefined validated values. Most coupons have only a header or regular text. However, there may be situations where the attribute makes no sense. This will become clear later with the terms element.

The image element (which was not used in the example) would simply contain a URL to an image file to be printed. This would simply be #PCDATA, but perhaps we want to make it more obvious that the element contains a URL. To do this, it is common practice to use an entity, which is replaced with predefined text. There are two kinds of entity in a DTD: the kind that you reference in a document (&lt;, for example), and the kind you reference in the DTD itself, a parameter entity. The parameter entity syntax is very similar to that of an ordinary entity, but uses a % sign instead of an ampersand. To signify that you are defining a parameter entity, you include the % as shown in the definition:

...
<!ENTITY % url "#PCDATA"> <!-- Place entity definitions at head of DTD file -->

...
<!ELEMENT image (%url;)>
...

Now it is clear that the content of an image element is a URL. Entities also allow you to change things around; some W3C standards actually place every element name in an entity to enable future conversion of all element names from English to another language. Parameter entities may be used throughout an external DTD file in place of text; within the internal subset of a document, they may appear only between declarations, not inside them.
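Parameter entities work the same way for attribute types. The sketch below is a fragment of an external .dtd file; the money entity name is an assumption introduced for this illustration.

```xml
<!-- Fragment of an external .dtd file -->
<!ENTITY % money "CDATA">

<!ELEMENT requirement EMPTY>
<!ATTLIST requirement
          guests  CDATA   #IMPLIED
          dollars %money; #IMPLIED>
<!-- After substitution the last line reads: dollars CDATA #IMPLIED -->
```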

...
<!ELEMENT terms (boiler | text)*> <!-- terms and conditions of use -->

<!ELEMENT boiler EMPTY> <!-- Boilerplate text -->
<!ATTLIST boiler
          code CDATA #REQUIRED>
...

The terms element can contain zero to many of either boiler or text elements. You do not need to define the text element twice (in fact, that would be a violation of the XML specification). However, it is important to note that when text is a child of terms, as practiced, you would never use the type attribute. There is no way to validate this rule using DTDs, so you will need to modify the application program or use XML Schema, which will be discussed later in the chapter.

You may also supply a default value for an attribute. For example, say Fred has hired a new employee who does not understand the boilerplate text that needs to appear on every coupon. Since he generates simple dollar-off coupons, he does not need access to very specific terms and conditions. To make the terms and conditions section simpler, Fred creates a new boilerplate code, DEFAULT, which contains all the boilerplate text that might be necessary on his new employee's coupons. However, it would currently still need to be coded like this:

...
<boiler code="DEFAULT" />
...

Fred adds a default value to the attribute in the DTD:

...
<!ELEMENT boiler EMPTY> <!-- Boilerplate text -->
<!ATTLIST boiler
          code CDATA "DEFAULT">
...

Note that the #REQUIRED property was replaced with the default value. Now, if the code attribute is not set, it will be set to DEFAULT. However, if the code attribute is present, the value that is supplied by the author will be used. By doing this, Fred's new employee can simply add the <boiler /> tag with no attributes.
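Here is a minimal sketch of the default in action, with the declarations placed in the internal subset so the example is self-contained. When the DTD lives in an external file instead, a non-validating parser that does not fetch that file will see no code attribute at all.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE terms [
  <!ELEMENT terms (boiler)*>
  <!ELEMENT boiler EMPTY>
  <!ATTLIST boiler
            code CDATA "DEFAULT">
]>
<terms>
  <!-- Reported to the application as code="DEFAULT" -->
  <boiler />
  <!-- An explicit value overrides the default -->
  <boiler code="LIMIT1" />
</terms>
```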

Finally, getting back to entities, Fred would like to make it easier to include the ¢ sign in coupon text. To do this, he defines a general entity, which is referenced from the XML document itself. Note the absence of a %, which is used only to define parameter entities.

...
<!ENTITY cent "&#162;">
...

The reference &#162; is a numeric character reference, which is available in every XML document without being declared. This entity definition gives the same character a name, which would be referenced from the document in this fashion: &cent;. Many similar character entities exist for HTML and XHTML.
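A short sketch of the named entity in use, with the declaration placed in the internal subset so the example stands alone. The sentence in the text element is invented for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE text [
  <!ELEMENT text (#PCDATA)>
  <!ENTITY cent "&#162;">
]>
<!-- Both references below produce the same ¢ character. -->
<text>Coffee refills are 50&cent; (or 50&#162;) with any entree.</text>
```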

By looking at the whole document carefully, you can determine the way elements nest. Some DTD processing tools will draw the elements as a tree. Here is the full DTD for Fred's coupon vocabulary:

<!--
Fred's Restaurant Network
Coupon Document Type Definition

Defines the XML vocabulary used for defining coupons. Coupon
files are used for printing and/or e-mailing the coupons,
storing a local copy of the coupons, and validating and
calculating the discounts in the point of sale system when presented.
-->

<!ENTITY % url "#PCDATA">

<!ENTITY cent "&#162;">

<!ELEMENT coupon (serial-number, valid-at, deal+, body, terms?)>

<!ELEMENT serial-number (#PCDATA)>

<!ELEMENT valid-at (location+)>

 <!ELEMENT location (#PCDATA)>

<!ELEMENT deal (location+, value, requirement*)>

 <!ELEMENT value (#PCDATA)>  <!-- Value is in the format N.NN for dollars and
                                 cents. Do not use a dollar sign. -->

 <!ELEMENT requirement EMPTY> <!-- Multiple requirements are treated as
                                  a meet any relationship.
                                  All attributes within one requirement must
                                  be met at the same time.
                                  A coupon is valid if all attributes within
                                  any one requirement are met. -->
 <!ATTLIST requirement
           guests CDATA #IMPLIED
           dollars CDATA #IMPLIED>


<!ELEMENT body (text*, image*)>

 <!ELEMENT text (#PCDATA)>
 <!ATTLIST text
           type (header|regular) #IMPLIED>

 <!ELEMENT image (%url;)>

<!ELEMENT terms (boiler | text)*> <!-- terms and conditions of use -->

 <!ELEMENT boiler EMPTY> <!-- Boilerplate text -->
 <!ATTLIST boiler
           code CDATA "DEFAULT">

Note that the child elements are indented. This makes the DTD slightly easier to understand when the nesting of elements is predictable. However, when you have elements that could be listed by an element at any level in the hierarchy, this would only make the document more confusing and it would be best to leave the element definitions flush left.

Now that we have finished the DTD, we have established the documentation of the vocabulary, and we have defined character entities that will be used. For many situations, this is good enough documentation for the vocabulary. However, there are still several weaknesses that have been spotted during the creation of this DTD:

The dollars attribute is not validated as a two-decimal-place number field without a dollar sign.
The type attribute on text does not apply when it is a child of terms.
There is no way to specify a set of acceptable values for the content of an element; this can only be done for attribute values.
There is no way to define a hard minimum or maximum number of instances of a given element; the only choices are zero or one, exactly one, zero or more, and one or more.

All four of these problems, and too many others to list, are addressed with the W3C standard XML Schema.

7.3 XML Schema

Although DTDs allow a great deal of control over the structure of an XML vocabulary, there are still holes within its structure that prevent DTDs from fully controlling a document. To improve upon this SGML crutch of XML, the W3C has come out with a recommendation for XML Schema. XML Schema is an XML vocabulary that is used to define the structure of your own XML vocabulary, much like you can do with DTDs. However, XML Schema offers many, many more controls over your document, and in fact, too many to list. You can buy an entire book on just XML Schema, or you can view the W3C standards for the normative definitions of all the functions of XML Schema.

Of course, XML Schema is still not able to validate everything. The benefit of using XML for XML Schema is that it is every bit as extensible as any other format, and you can extend XML Schema for your own applications. It is also easier to parse XML Schema; you can use the same XML parser rather than a separate DTD parser. The main problem that accompanies XML Schema's power is its complexity. I will do what few authors who cover XML Schema do; I will keep the XML Schema syntax that I cover short. We will convert Fred's coupon system from DTD to XML Schema.

Before I begin, I want to point one thing out. Strictly speaking, you can declare the XML Schema namespace as the default namespace, but unprefixed references to your own definitions (such as ref="location" or type="dollars") would then be looked up in the XML Schema namespace as well. In practice this pushes us to use qualified names (usually xsd:localpart) on every element in the schema. This can be very ugly to look at and difficult to follow, so I will leave the xsd:'s off until the end.

First comes the schema element, which is the root element of an XML Schema (but, since you will probably be embedding it, need not be the root element of your XML document).

<schema xmlns="http://www.w3.org/2001/XMLSchema">

</schema>

XML Schema has elements and attributes, but they are no longer just a flat listing. Now, elements and attributes are nested within each other, just as they appear in the actual document. For starters, the easiest element: serial-number. Notice how it is nested under the root element definition.

<schema xmlns="http://www.w3.org/2001/XMLSchema">

 <element name="coupon">
  <complexType>
   <sequence>
    <element name="serial-number" type="xsd:string" />
   </sequence>
  </complexType>
 </element>

</schema>

Even for a two-element document, this is already a very complicated Schema. Let's look at it piece by piece.

First, there is the element element. This is the same as an element definition (!ELEMENT) in a DTD. The element name is then supplied in the name attribute. An element can be defined one of three ways:

As having only character data content that fits one of the XML Schema predefined formats: xsd:string, xsd:decimal, xsd:integer, xsd:boolean, xsd:date, or xsd:time. This is done using the type attribute, as was done on serial-number.
As having only character data content that is derived from those formats with more specific rules (called restrictions), which is known as a simpleType. All of this will be covered shortly.
As containing other elements as children, or having any other rules that are not covered by simpleType, which is known as a complexType. This is how the root element, coupon, has been defined.

The complexType element is used to define the contents of the coupon element. There are three operations that appear as elements that are children of complexTypes:

all – all of the elements under this operation must appear exactly once (or they may be made optional with the minOccurs="0" attribute) and they may appear in any order.
choice – allows only one of the elements under this operation to appear. maxOccurs and minOccurs apply to repetitions of the choice itself; if the maxOccurs is set to 3, up to three elements may appear, and each one may be a different alternative.
sequence – all of the elements in a sequence must appear in the order specified, and they each appear from minOccurs to maxOccurs times (default for both is 1).

You may use operators on other operators; you could have a choice of a sequence of city, state, zip, or a sequence of city, province, and postal-code. The possibilities are endless. What would you do if you did not want any restriction on the number or order of elements? You would simply have a sequence of choices, and the maxOccurs of the sequence is unbounded (in other words, unlimited). The attributes minOccurs and maxOccurs may be set on individual elements as well as operators.
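As a sketch of nesting operators, here is the address example written out as a complete schema, with the xsd: prefix so that it stands on its own. The address element name and the use of xsd:string for every field are assumptions made for this illustration. The city element is placed ahead of the choice rather than repeated in both branches, because XML Schema expects a validator to be able to tell which branch it is in from the first element it meets.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="address">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="city" type="xsd:string" />
    <!-- Exactly one of the two inner sequences must follow city -->
    <xsd:choice>
     <xsd:sequence>
      <xsd:element name="state" type="xsd:string" />
      <xsd:element name="zip" type="xsd:string" />
     </xsd:sequence>
     <xsd:sequence>
      <xsd:element name="province" type="xsd:string" />
      <xsd:element name="postal-code" type="xsd:string" />
     </xsd:sequence>
    </xsd:choice>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
</xsd:schema>
```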

As one side note, XML Schema does not have a mechanism for named character entities like DTDs do. As a result, you will still need to create DTDs for documents that use them. In most cases, it is much simpler to use DTDs and enforce the stricter formatting rules in your application program than to create a Schema. At least you now know what is involved in making a Schema and could understand one that was already produced.

Now to add the valid-at element. Previously the DTD did not include the location codes because DTD had no mechanism for enforcing the value of an element, and also it would not be easy to update the definition when a new restaurant has been added. The latter part of the explanation still holds true, but for the sake of example, here is how one would enforce the value of the location codes under the location element:

...
   <sequence>
    <element name="serial-number" type="xsd:string" />
    <element name="valid-at">
     <complexType>
      <sequence minOccurs="1" maxOccurs="unbounded">

       <element name="location">
        <simpleType>
         <restriction base="xsd:string">
          <enumeration value="FREDS" />
          <enumeration value="CHITO" />
          <enumeration value="LITTL" />
         </restriction>
        </simpleType>
       </element>

      </sequence>
     </complexType>
    </element>
   </sequence>
...

The simpleType element defines the content of the location element. A simpleType may contain restrictions, but extensions must be placed in a simpleContent element. A restriction takes the set of all existing possible values for the element (or attribute) to which it applies, and it removes all values that are not defined under the restriction from that set; an extension, by contrast, adds to the base, for example by adding attributes to an element whose content is a simple type. An example of an extension will come later when we arrive at the text element. Also, the base is the predefined set of values that is being restricted or extended.

In the example above, the only three values that are valid for the location element are the three location codes for Fred's restaurants. They appear as enumerations. If Fred ever added a new restaurant, he would need to update the Schema. (Because it is an XML document, depending on his system, Fred might be able to add this using the Document Object Model. There are very few instances in real life where this would be practical, though.)

Also note the use of a sequence operator. Even though only one element is defined in this complexType, the Schema specification does not allow a complexType to contain an element definition as a child. Element definitions must be contained in an operator.

The Schema gets more complicated with the deal element.

...
    <element name="deal" maxOccurs="unbounded">
     <complexType>
      <sequence minOccurs="0" maxOccurs="unbounded">
       <element name="location" />
...
      </sequence>
     </complexType>
    </element>
   </sequence>
...

Hold the phone. We just defined the location element. If we define it again here, with the three enumerations, Fred will have to update his restaurant locations in two places! To redefine the location element here would be a bad idea. What should we do instead?

We can avoid duplicate definitions by writing a global definition. Any element in the Schema—be it a complexType or an element or a simpleType—can be made into a global definition as necessary. It is possible to overdo it, though, and cause your Schema to be even more confusing than it is anyway. This is an example of a necessary global definition:

<schema xmlns="http://www.w3.org/2001/XMLSchema">

 <element name="location">
  <simpleType>
   <restriction base="xsd:string">
    <enumeration value="FREDS" />
    <enumeration value="CHITO" />
    <enumeration value="LITTL" />
   </restriction>
  </simpleType>
 </element>
...

You then place a reference to this global definition wherever the location element appears:

...
    <element name="valid-at">
     <complexType>
      <sequence maxOccurs="unbounded">
       <element ref="location" />
      </sequence>
     </complexType>
    </element>
    <element name="deal" maxOccurs="unbounded">
     <complexType>
      <sequence minOccurs="0" maxOccurs="unbounded">
       <element ref="location" />

      </sequence>
     </complexType>
    </element>
   </sequence>
...

Now both instances of location are defined up at the top of the Schema, and one change affects both of them. Anytime you have repeating elements, you should strongly consider doing this. Note that you may not use both name and ref on a referenced element; use only ref.
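For reference, here is a cut-down but complete sketch of the same idea with the xsd: prefix in place, limited to valid-at and two of the location codes. Making valid-at a top-level element is a simplification for this illustration; in Fred's schema it sits inside coupon.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

 <!-- Global definition: declared once, directly under the schema element -->
 <xsd:element name="location">
  <xsd:simpleType>
   <xsd:restriction base="xsd:string">
    <xsd:enumeration value="FREDS" />
    <xsd:enumeration value="LITTL" />
   </xsd:restriction>
  </xsd:simpleType>
 </xsd:element>

 <xsd:element name="valid-at">
  <xsd:complexType>
   <xsd:sequence>
    <!-- ref points at the global definition; no name attribute here -->
    <xsd:element ref="location" maxOccurs="unbounded" />
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

</xsd:schema>
```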

The next child of a deal is the value element. Previously the only rule was that this must contain character data, but there were other rules needed (as evidenced by the comment in the DTD). Schema gives us much more control over the data:

...
       <element ref="location" />

       <element name="value" minOccurs="1" maxOccurs="1">
        <annotation>
         <appinfo>Occurs exactly one time</appinfo>
         <documentation>Each deal may have only one value.</documentation>
        </annotation>

        <simpleType>
         <restriction base="xsd:decimal">
          <fractionDigits value="2" />
         </restriction>
        </simpleType>
       </element>
...

First, note the annotation. This serves the same purpose as a comment. XML Schema has annotations because an XML parser may discard standard XML comments before the Schema reaches your program, and you may want your comments to be parsed and rendered by an application program. An annotation may contain appinfo elements (meant for applications) and documentation elements (meant for human readers); both are optional and may be repeated. How you use them is up to you.

The format of a value is a decimal number, restricted to values with no more than 2 digits past the decimal point. If a dollar sign were entered, it would not be a valid decimal number.
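To make the restriction concrete, here are a few candidate contents and how a validator would be expected to treat them under the value definition above. The samples wrapper element is only there to keep the sketch well-formed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<samples>
  <value>5.00</value>     <!-- valid: two fraction digits -->
  <value>5</value>        <!-- valid: fractionDigits is an upper limit -->
  <value>5.005</value>    <!-- invalid: three fraction digits -->
  <value>$5.00</value>    <!-- invalid: not a decimal number -->
  <value>1,000.00</value> <!-- invalid: xsd:decimal has no thousands separator -->
</samples>
```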

Next comes the requirement element. The semantics behind its use are not really enforceable, but this is a good place for another annotation:

...
       <element name="requirement">
        <annotation>
         <appinfo>Usage of requirement</appinfo>
         <documentation>Multiple requirements are treated as
         a meet any relationship. All attributes within one requirement
         must be met at the same time. A coupon is valid if all
         attributes within any one requirement are met.</documentation>
        </annotation>
        <complexType>
         ...
        </complexType>
       </element>
...

This element is going to require a complexType, because it contains attributes. Attributes cannot be contained in a simpleType. In a way they are treated the same as child elements, except that they are defined by the attribute element. As one quick note before adding attributes, you never group attributes in alls/choices/sequences as you would elements. Also, an attribute may not be repeated on the same element; a document that does so is not even well-formed, regardless of DTD or Schema. If you think about it for a moment, you are equating a name with a value; if you equate a name with one value and then equate the same name with another, you are saying the first value equals the second, which is a contradiction. Also, the order of attributes does not matter.

In Schema, attributes are defined as being optional, required, or oddly enough, prohibited.

...
        <complexType>

         <attribute name="dollars" use="optional">
          <simpleType>
           <restriction base="xsd:decimal">
            <fractionDigits value="2" />
           </restriction>
          </simpleType>
         </attribute>

         <attribute name="guests" use="optional">
          <simpleType>
           <restriction base="xsd:integer">
            <minInclusive value="0" />
           </restriction>
          </simpleType>
         </attribute>
        </complexType>
...

The first attribute, dollars, is defined the same way as value was earlier. This simpleType could have been made a global definition just like location, but for simplicity it was left as-is. One may decide to define the value element to have a maximum value, to prevent a sneaky employee from generating $100 off coupons. $100 might be a valid requirement dollar amount, so this maximum should only be set on value, and it would be more complicated to revise the Schema later to accommodate this. You may define global types and then derive more restricted types from them in this fashion. There will be an example of that later.

Speaking of maximum values, the restriction on guests is a great example of, well, a minimum value. There are four minimum/maximum elements in all. With minInclusive and maxInclusive, the value you set is itself included in the allowed range (in other words, still considered valid). In the above case, the minInclusive value includes 0, so you can have a coupon that is valid at a table with zero guests (this might mean that it applies to carry-out or delivery orders). To require at least one guest instead, you could use minExclusive with a value of 0 (the opposite end being maxExclusive): guests must then be greater than 0, so zero is excluded from the allowed values.
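The difference between the inclusive and exclusive forms can be sketched with two named types that accept the same integers. The type names guestsInclusive and guestsExclusive are assumptions made for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

 <!-- Accepts 1, 2, 3, ... because the bound itself is allowed -->
 <xsd:simpleType name="guestsInclusive">
  <xsd:restriction base="xsd:integer">
   <xsd:minInclusive value="1" />
  </xsd:restriction>
 </xsd:simpleType>

 <!-- Also accepts 1, 2, 3, ... because 0 itself is shut out -->
 <xsd:simpleType name="guestsExclusive">
  <xsd:restriction base="xsd:integer">
   <xsd:minExclusive value="0" />
  </xsd:restriction>
 </xsd:simpleType>

</xsd:schema>
```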

Moving on to body, it seems like we need to make the text element a global definition. However, what do we do to address the attributes that are defined for one context, and forbidden under another? Unfortunately, the specification states that if you use a reference to a global declaration of an element, you may not write a new complexType or simpleType or add to it. In this case it is easier to just define text twice.

...
    <element name="body">
     <complexType>
      <sequence minOccurs="0" maxOccurs="unbounded">

       <element name="text">
        <complexType>
         <simpleContent>
          <extension base="xsd:string">

           <attribute name="type">
            <simpleType>
             <restriction base="xsd:string">
              <enumeration value="header" />
              <enumeration value="regular" />
             </restriction>
            </simpleType>
           </attribute>
          </extension>
         </simpleContent>
        </complexType>
       </element>
      </sequence>
     </complexType>
    </element>
...

This is a fairly tricky definition. You will notice we are finally using the simpleContent element. This allows us to define extensions on the content of the element. Why is it necessary to extend the content? Because in XML Schema, oddly enough, an attribute is treated as part of the content of an element. The default for xsd:string is the element containing text, without any attributes set. We add the type attribute as an extension to the content. The base type of the extension defines the content of the text element, which is string. The base type of the restriction on the type attribute defines the content of the attribute's values, which are strings. Then, the value is restricted to two valid options, header and regular.

After a mess like that, the terms element definition should be easy to understand: a boiler element that carries only a code attribute, and a text element, both simple strings.

...
    <element name="terms">
     <complexType>
      <sequence minOccurs="0" maxOccurs="unbounded">
       <element name="boiler" maxOccurs="unbounded">
        <complexType>
         <attribute name="code" type="xsd:string" />
        </complexType>
       </element>

       <element name="text" type="xsd:string" />
      </sequence>
     </complexType>
    </element>
...
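
A fragment that should satisfy this definition might look like the following. The code values and the wording are hypothetical. Each pass through the sequence takes one or more boiler elements followed by exactly one text element, and boiler is empty because its complexType declares an attribute but no content.

    <terms>
     <boiler code="A1" />
     <boiler code="B2" />
     <text>Not valid with any other offer.</text>
     <boiler code="C3" />
     <text>One coupon per table.</text>
    </terms>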

As I mentioned earlier, Fred may want to limit coupon values to $50. He also wants to ensure that the dollar amounts for both value and requirement are not negative. You could modify both types individually, but it is better to define one global type:

<schema xmlns="http://www.w3.org/2001/XMLSchema">

 <simpleType name="dollars">
  <restriction base="xsd:decimal">
   <fractionDigits value="2" />
   <minInclusive value="0" />
  </restriction>
 </simpleType>
...

Then you remove the inline simpleType definitions and replace each one with type="dollars".

         <attribute name="dollars" use="optional" type="dollars" />

Next, we restrict the value of a coupon to a maximum of $50.

       <element name="value" minOccurs="1" maxOccurs="1">
...
        <simpleType>
         <restriction base="dollars">
          <maxInclusive value="50" />
         </restriction>
        </simpleType>
       </element>

The base is the type we start with, which is now our own derived dollars type. Then you restrict it as you have always restricted the W3C standard types.
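
To make the combined effect concrete, here are a few candidate values checked against the restricted type. This is only a sketch: it assumes the content of value is the dollar amount itself, and each line is a separate candidate, not a complete document.

    <value>25.00</value>   <!-- valid -->
    <value>50</value>      <!-- valid: maxInclusive includes 50 itself -->
    <value>50.01</value>   <!-- not valid: greater than the maxInclusive of 50 -->
    <value>-5.00</value>   <!-- not valid: below the minInclusive of 0 inherited from dollars -->
    <value>9.999</value>   <!-- not valid: three fraction digits, but fractionDigits is 2 -->

The derived type keeps every facet of dollars and adds the new upper bound on top.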

Now that we have our Schema set up, it is time to validate it. You first need to add the xsd: prefix to all the elements and bind that prefix to the XML Schema namespace. I use search and replace to do this (save a clean, unprefixed copy first in case you need to make changes). Be careful not to blindly replace < with <xsd: because end tags need the forward slash before the prefix, as in </xsd:element>. There is a W3C validator for XML Schema, and it is located here. The catch is that this one does not allow direct text input, so you must save and upload your Schema file. The error messages you get can be very confusing, so read them carefully. A common mistake is using an element in a location where it is not allowed, and the validator will tell you which elements are allowed to appear in that context.
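
As a sketch of the prefixing step, this is roughly what the dollars type from earlier looks like once the namespace is bound to the xsd prefix. It is a minimal, self-contained Schema for illustration only, not Fred's complete file.

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

     <xsd:simpleType name="dollars">
      <xsd:restriction base="xsd:decimal">
       <xsd:fractionDigits value="2" />
       <xsd:minInclusive value="0" />
      </xsd:restriction>
     </xsd:simpleType>

    </xsd:schema>

Note that the start tags and the end tags both carry the prefix, while the attributes (name, base, value) stay as they were.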

The W3C validator does not validate your XML document; it only validates your Schema syntax. You should also use a validator that tests a known-good document against your Schema. You can also intentionally put mistakes in your document to see whether the validator catches them. One good validator is at xmlme.com.

The final product is too large to include in the text, so you can view the file here. As you may have noticed, the attributes are not given prefixes. The W3C documentation does not use prefixes on attributes, and the validator accepted the unqualified attributes without a hitch. Leave the attributes unprefixed: in a Schema document, attributes such as name and type do not belong to the XML Schema namespace, so adding the prefix to them makes the Schema invalid.

7.4 Chapter Review & Exercises

In this chapter, you have seen the syntax for the XML DTD, which is based on the SGML DTD. You should be able to define entities (both parameter and character), elements, and attributes, and define the allowed relationships between elements. You should know the syntax for zero-to-many, one-to-many, and optional elements. You should also understand XML Schema, and know what a simpleType or complexType can do, as well as the difference between extensions and restrictions. You should also know how to define an annotation, and why this mechanism is necessary in XML Schema.

Develop a Document Type Definition for your computer lab XML vocabulary.
Develop a Document Type Definition for Fred's menu XML vocabulary.
Develop an XML Schema for your computer lab XML vocabulary.
Develop an XML Schema for Fred's menu XML vocabulary.