- XML Introduction
- XML Syntax
- XML Documents
- XML Declaration
- XML Tags
- XML Elements
- XML Attributes
- XML Comments
- XML Character Entities
- XML CDATA Sections
- XML White Spaces
- XML Processing
- XML Encoding
- XML Validation
- XML Namespaces
- XML Parser
- XML DTD
- XML Schema
- XML Vs HTML
What is Markup?
XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. That raises an obvious question: what exactly is a markup language? Markup is information added to a document that enhances its meaning in certain ways, in that it identifies the parts of the document and how they relate to one another. More specifically, a markup language is a set of symbols that can be placed in the text of a document to demarcate and label the parts of that document.
Consider the following example of XML markup embedded in a piece of text:
<message> <text>Hello, world!</text> </message>
This snippet includes the markup symbols, or tags, <message>...</message> and <text>...</text>. The tags <message> and </message> mark the start and the end of the XML code fragment. The tags <text> and </text> surround the text Hello, world!.
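To make the distinction between markup and text concrete, here is a minimal sketch that reads the same snippet with Python's standard-library xml.etree.ElementTree module. The variable names are illustrative only, and Python is simply one convenient way to inspect the result; any XML parser would serve.

```python
import xml.etree.ElementTree as ET

# The markup example from the text, held in a Python string.
document = "<message> <text>Hello, world!</text> </message>"

root = ET.fromstring(document)      # parse the string into an element tree
text_element = root.find("text")    # first child element named "text"

root_tag = root.tag                 # name taken from the <message> tag
content = text_element.text         # character data between <text> and </text>
print(root_tag, "->", content)
```

The tags end up as element names, while the characters between them end up as the element's text.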
Work on XML began in the 1990s, with the aim of allowing new text elements to be defined. XML was developed by the XML Working Group (originally known as the SGML Editorial Review Board), formed under the auspices of the World Wide Web Consortium (W3C) in 1996. The group was chaired by Jon Bosak of Sun Microsystems, with the active participation of an XML Special Interest Group (previously known as the SGML Working Group) also organized by the W3C. Dan Connolly served as the Working Group's contact with the W3C.
Extensible Markup Language (XML) is a document formatting language used by some websites. Its extensibility makes it an important and extremely useful language. More precisely, XML is a simplified form of Standard Generalized Markup Language (SGML) intended for documents that are published on the Internet. Like SGML, XML uses document type definitions (DTDs) to define document types and the meanings of the tags used in them. XML provides more kinds of hypertext links than HTML, such as bidirectional links and links relative to a document subsection. XML also adopts conventions that make text elements easy to identify and interpret. For example, elements are delimited by start and end tags, as in <BEGIN>...</BEGIN>.
The design goals for XML are:
- XML shall be straightforwardly usable over the Internet.
- XML shall support a wide variety of applications.
- XML shall be compatible with SGML.
- It shall be easy to write programs which process XML documents.
- The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
- XML documents should be human-legible and reasonably clear.
- The XML design should be prepared quickly.
- XML documents shall be easy to create.
XML is immensely important. Dr. Charles Goldfarb, who was personally involved in its invention, described it as the holy grail of computing, solving the problem of universal data interchange between dissimilar systems. It is also a handy format for almost everything from configuration files to data and documents of nearly any kind. XML is easy to use and is often generated automatically, so you do not need to know every detail of the specification in order to benefit from it. What matters is to understand the main ideas behind XML and what it does, so that you can see how to use it in your own projects.
Some features of XML
- XML stands for Extensible Markup Language.
- It is a markup language for formatting documents, like Hypertext Markup Language (HTML).
- It was designed to store and transport data.
- It was designed to be self-descriptive.
HTML and XML
- XML is extensible, while HTML is not.
- Both XML and HTML are markup languages.
- XML was designed to store and transport data, while HTML is meant for publishing and displaying data.
- HTML tags are predefined, whereas XML tags are defined by the author of the document.
The syntax rules of XML are very simple and logical. They are easy to learn and easy to use. In this part, we explore the basic syntax rules for writing a simple XML document.
Consider the following example of a complete XML document:
<?xml version = "1.0"?> <contact-info> <name>Anil Kumar</name> <company>GreatLearning</company> <phone>(91) 987-3679</phone> </contact-info>
You can see two kinds of information in the above example:
- Markup, such as <contact-info>
- Text, or character data, such as Anil Kumar, GreatLearning and (91) 987-3679.
The diagram below shows the syntax rules for the different forms of markup and text in an XML document.
Each part of the example above is examined below.
XML syntax refers to the rules that determine how an XML application can be written. The syntax is very straightforward, which makes XML easy to learn. The main points to remember when writing an XML document are:
- XML elements must have an end tag.
- XML tags are case-sensitive.
- All XML elements must be properly nested.
- All XML documents must have a root element.
- Attribute values must always be quoted.
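Each of the rules above can be checked mechanically by handing a small document to a parser. The sketch below uses Python's standard-library xml.etree.ElementTree, which raises ParseError for input that is not well-formed. The helper is_well_formed and the sample documents are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts `text`, False if it raises ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One small sample per rule, plus a document that follows all of them.
samples = {
    "missing end tag": "<note><to>Anuj</note>",
    "case mismatch": "<note><to>Anuj</To></note>",
    "bad nesting": "<note><b><i>hi</b></i></note>",
    "two roots": "<a/><b/>",
    "unquoted attribute": "<note date=2020>hi</note>",
    "well-formed": '<note date="2020"><to>Anuj</to></note>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the last sample is accepted; each of the others breaks exactly one rule from the list.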
A data object is an XML document if it is well-formed, as defined in the XML specification. A well-formed XML document may, in addition, be valid if it meets certain further constraints. Every XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated by explicit markup. The logical and physical structures must nest properly.
Every XML document must have a single tag pair that defines a root element. All other elements must be inside this root element. Any element can have sub-elements, and sub-elements must be properly nested inside their parent element.
Now, take a look at this example:
<root> <child> <subchild>.....</subchild> </child> </root>
XML Document Rules
If you have seen HTML documents, you are familiar with the basic idea of using tags to mark up the content of a document. This section examines the differences between HTML documents and XML documents. It goes over the basic rules of XML documents and discusses the terminology used to describe them.
One important point about XML documents is that the XML specification requires a parser to reject any XML document that doesn't follow the basic rules. Most HTML parsers will accept sloppy markup, making a guess as to what the author of the document intended. To avoid the loosely structured mess found in the average HTML document, the creators of XML decided to enforce document structure from the beginning.
Note: A parser is a piece of code that attempts to read a document and interpret its contents.
There are three kinds of XML documents:
1. Valid Document: Valid documents follow both the XML syntax rules and the rules defined in their DTD or schema.
2. Invalid Document: Invalid documents don't follow the syntax rules defined by the XML specification. In addition, if a developer has defined rules for what a document may contain in a DTD or schema, and the document doesn't follow those rules, that document is invalid as well.
3. Well-formed Document: Well-formed documents follow the XML syntax rules but don't have a DTD or schema.
The root element
An XML document must be contained in a single element. That single element is called the root element, and it contains all the text and any other elements in the document. In the following example, the XML document is contained in a single element, the <greeting> element. Notice that the document has a comment that is outside the root element; that is perfectly legal.
Here is the example:
<?xml version="1.0"?> <!-- A well-formed document --> <greeting> Hello, World! </greeting>
Here is a document that doesn't contain a single root element:
<?xml version="1.0"?> <!-- An invalid document --> <greeting> Hello, World! </greeting> <greeting> Namaste, Duniya! </greeting>
An XML parser is required to reject this document, regardless of the information it might contain.
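The two greeting documents can be fed to a parser to see the rule in action. This is a minimal sketch using Python's xml.etree.ElementTree; the variable names are illustrative, and the exact wording of the error message depends on the underlying parser.

```python
import xml.etree.ElementTree as ET

one_root = """<?xml version="1.0"?>
<!-- A well-formed document -->
<greeting>Hello, World!</greeting>"""

two_roots = """<?xml version="1.0"?>
<greeting>Hello, World!</greeting>
<greeting>Namaste, Duniya!</greeting>"""

root = ET.fromstring(one_root)   # the comment before the root element is accepted

try:
    ET.fromstring(two_roots)
    second_root_error = None
except ET.ParseError as err:
    second_root_error = err      # err.position holds the (line, column) of the problem

print(root.tag, "|", second_root_error)
```

The comment outside the root element causes no trouble, while the second top-level element makes the parser give up on the whole document.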
Well-Formed XML Documents
A textual object is a well-formed XML document if:
i. Taken as a whole, it matches the production labeled document.
ii. It meets all the well-formedness constraints given in the XML specification.
iii. Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.
document ::= prolog element Misc*
Matching the document production implies that:
i. It contains one or more elements.
ii. There is exactly one element, called the root, or document element, no part of which appears in the content of any other element. For all other elements, if the start tag is in the content of another element, the end tag is in the content of the same element. More simply stated, the elements, delimited by start and end tags, nest properly within each other.
As a consequence of this, for each non-root element X in the document, there is one other element Y in the document such that X is in the content of Y, but is not in the content of any other element that is in the content of Y. Y is referred to as the parent of X, and X as a child of Y.
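Because well-formed elements nest properly, every non-root element has exactly one parent. A minimal sketch of this in Python follows, reusing the root/child/subchild example from earlier. ElementTree elements do not keep a reference to their parent, so the sketch builds a lookup table; parent_of is a hypothetical name.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><child><subchild>text</subchild></child></root>")

# Map every child element to the element that directly contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}

child = root.find("child")
subchild = root.find("child/subchild")

print(subchild.tag, "has parent", parent_of[subchild].tag)
print("root has a parent:", root in parent_of)
```

Every element except the root appears as a key exactly once, which is the parent/child relationship described above.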
An XML document is the basic unit of XML information, composed of elements and other markup in an orderly package. An XML document can contain a wide variety of data, for example a database of numbers, numbers representing a molecular structure, or a mathematical equation.
An XML document has the following sections:
1. Document Prolog Section: The document prolog comes at the top of the document, before the document element (root element). It contains the XML declaration and the document type declaration.
2. Document Elements Section: Document elements are the building blocks of XML. They divide the document into a hierarchy of sections, each serving a specific purpose. You can separate a document into multiple sections so that they can be rendered differently or used by a search engine. Elements can be thought of as containers that hold a combination of text and other elements.
XML Document Example
Document Prolog: <?xml version = "1.0"?>
Document Elements: <contact-info> <name>Anil Kumar</name> <company>GreatLearning</company> <phone>(91) 987-3679</phone> </contact-info>
The XML declaration indicates that the document is written in XML and specifies which version of XML is used. If included, the XML declaration must be on the first line of the document. The declaration can also specify the character encoding of the document (optional) and whether the document refers to external entities (optional). For example, we can state that the document uses UTF-8 encoding (although we do not strictly need to, as UTF-8 is the default), and we can indicate that the document refers to external entities by using standalone="no". Such a document is not standalone, because it depends on an external resource (for example, a DTD). Although the XML declaration is optional, the W3C recommends that you include it in your XML documents, and some validation tools may expect it.
Most XML documents begin with an XML declaration that gives the parser basic information about the document. An XML declaration is recommended but not required. If it is present, it must be the first thing in the document. The declaration can contain up to three name-value pairs (many people call them attributes, although technically they are not). The version is the version of XML used; in practice this value is almost always 1.0. The encoding is the character set used in the document. The ISO-8859-1 character set referenced in the declaration below includes all of the characters used by most Western European languages. If no encoding is specified, the XML parser assumes that the characters are in the UTF-8 set, a Unicode encoding that supports virtually every character and ideograph from the world's languages.
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
Finally, standalone, which is either yes or no, defines whether this document can be processed without reading any other files. For example, if the XML document doesn't reference any other files, you would specify standalone="yes". If the XML document references other files that describe what the document can contain (more about those files in a moment), you could specify standalone="no". Because standalone="no" is the default, you rarely see standalone in XML declarations.
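To see how a parser reports the three name-value pairs, the sketch below uses the XmlDeclHandler callback of Python's standard-library xml.parsers.expat module. The helper read_declaration is a hypothetical name. The callback receives standalone as 1 for "yes", 0 for "no" and -1 when the clause is omitted.

```python
import xml.parsers.expat

def read_declaration(data):
    """Return (version, encoding, standalone) as reported by the expat parser."""
    seen = {}

    def on_declaration(version, encoding, standalone):
        # standalone is 1 for "yes", 0 for "no", and -1 when it is omitted
        seen["decl"] = (version, encoding, standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)          # True marks the end of the input
    return seen.get("decl")           # None if the document has no declaration

full = read_declaration(
    b'<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?><note/>')
short = read_declaration(b'<?xml version="1.0"?><note/>')
missing = read_declaration(b'<note/>')
print(full, short, missing)
```

A document without a declaration is still parsed; the handler is simply never called.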
Rules Governing the XML Declaration
An XML declaration should follow these rules:
i. If the XML declaration is present, it must be placed on the first line of the XML document.
ii. If the XML declaration is included, it must contain the version number attribute.
iii. The parameter names and values are case-sensitive, and the declaration must begin with "<?xml", where "xml" is written in lower case.
iv. The names are always in lower case.
v. The order of the parameters is significant. The correct order is version, encoding, and standalone.
vi. Either single or double quotes may be used.
vii. The XML declaration has no closing tag; that is, there is no </?xml>.
viii. The encoding specified in the XML declaration can be overridden by the HTTP protocol.
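Several of these rules can be observed directly by giving a parser declarations that follow or break them. The following is a sketch using Python's xml.etree.ElementTree (which is built on the expat parser); the accepted helper and the sample labels are illustrative names, and other parsers may report the failures differently.

```python
import xml.etree.ElementTree as ET

def accepted(data):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

checks = {
    "correct order": b'<?xml version="1.0" encoding="UTF-8" standalone="no"?><a/>',
    "single quotes": b"<?xml version='1.0' encoding='UTF-8'?><a/>",
    "not on first line": b' <?xml version="1.0"?><a/>',
    "version missing": b'<?xml encoding="UTF-8"?><a/>',
    "wrong order": b'<?xml encoding="UTF-8" version="1.0"?><a/>',
    "upper-case XML": b'<?XML version="1.0"?><a/>',
}

outcome = {label: accepted(data) for label, data in checks.items()}
for label, ok in outcome.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

The first two declarations are accepted; the remaining four each violate one of rules i, ii, iii or v and are rejected.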
XML Declaration Examples
1. XML declaration with no parameter:
2. XML declaration with version definition:
<?xml version="1.0"?>
3. XML declaration with all parameters defined:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
4. XML declaration with all parameters defined in single quotes:
<?xml version='1.0' encoding='iso-8859-1' standalone='no' ?>
Every XML document must contain exactly one root element. All other elements are placed inside that root element.
<root> <child>Data</child> <child>More Data</child> </root>
XML declaration parameters
As stated above, the XML declaration appears as the first line of an XML document, and its use is optional. In the declaration, encoding indicates how the individual bits correspond to a character set, version indicates the XML version, and standalone indicates whether an external type definition must be consulted in order to process the document correctly.
An XML document can optionally begin with a declaration such as the following:
<?xml version = "1.0" encoding = "UTF-8"?>
Note: In the above, version is the XML version and encoding describes the character encoding used inside the document.
XML tags are one of the most important parts of XML. Tags form the foundation of XML. They define the scope of an element. They can also be used to insert comments, to declare settings required for parsing the environment, and to insert special instructions.
XML tags can be broadly classified as follows:
- Start tag: The beginning of every non-empty XML element is marked by a start tag, for example <name>.
- End tag: Every element that has a start tag must have an end tag, for example </name>.
Note: An end tag includes a solidus ("/") right before the name of the element.
- Empty tag: The text that appears between the start tag and the end tag is called content. An element is called empty when it has no content. An empty element can be written in the following ways:
- A start tag immediately followed by an end tag: <hr></hr>
- A complete empty-element tag: <hr />
Empty-element tags may be used for any element which has no content.
Tags are delimited by < and >. As noted, element names are case-sensitive and cannot include spaces (the full set of permitted characters can be found in the specification). Attributes can be included as space-separated name/value pairs, with the values enclosed in either single or double quotes.
The structure of XML
In addition to text, elements may also contain different elements.
• A start tag begins with "<" and ends with ">".
• An end tag begins with "</" and ends with ">".
• Empty tags (that is, tags with no content, where the start tag is immediately followed by an end tag) can alternatively be represented by a single tag. These empty tags begin with "<" and end with "/>". In other words, empty tags are shorthand. For instance, <br></br> is equivalent to <br/>. This means that, when converting HTML to XHTML, every <br> tag must be written in one of the two allowed forms for empty tags.
• Every start tag must have a matching end tag, and tags must be properly nested. For instance, <b><i>text</b></i> is not well-formed, since it is not properly nested, whereas <b><i>text</i></b> is well-formed.
Most current web browsers can handle improperly nested HTML documents. Is this part of the HTML specification? Try to find out more about the similarities and differences between XML and HTML tags. End tags cannot be left out. In the example below, the markup is not legal because there are no end paragraph (</p>) tags. While this is acceptable in HTML (and, in some cases, SGML), an XML parser will reject it.
<!-- NOT legal XML markup -->
<p>Yada yada yada...
<p>Yada yada yada...
If an element contains no content at all, it is called an empty element; the HTML break (<br>) and image (<img>) elements are two examples. For empty elements in XML documents, you can put the closing slash in the start tag. The two break elements and the two image elements below mean the same thing to an XML parser:
<!-- Two equivalent break elements -->
<br></br>
<br />
<!-- Two equivalent image elements -->
<img src="../img/c.gif"></img>
<img src="../img/c.gif" />
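The claim that both spellings mean the same thing to a parser can be checked with a short sketch. It uses Python's xml.etree.ElementTree; the summary helper is a hypothetical name that collects what the parser records for an element.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<img src="../img/c.gif"></img>')
short_form = ET.fromstring('<img src="../img/c.gif" />')

def summary(element):
    # tag name, attributes, text content and number of child elements
    return (element.tag, dict(element.attrib), element.text, len(element))

same = summary(long_form) == summary(short_form)
print("equivalent:", same)
print(ET.tostring(long_form))   # the serializer picks one spelling for both
```

After parsing, nothing remains to tell the two forms apart: both are an img element with one attribute, no text and no children.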
An XML document is structured by several XML elements, also called XML nodes or XML tags. The names of XML elements are enclosed in angle brackets < >, as shown below:
<element>
Syntax Rules for Elements and Tags
Element syntax: Each XML element must be closed, either with separate start and end tags, as shown below:
<element>....</element>
or, in simple cases, just this way:
<element/>
Element nesting: An XML element may contain other XML elements as its children, but the child elements must not overlap; that is, the end tag of an element must have the same name as that of the most recent unmatched start tag.
The following example shows incorrectly nested tags:
<?xml version = "1.0"?> <contact-info> <company>GreatLearning </contact-info> </company>
The following example shows correctly nested tags:
<?xml version = "1.0"?> <contact-info> <company>GreatLearning</company> </contact-info>
An XML document can contain only one root element. For instance, the example given below is not a correct XML document, because both the a and b elements occur at the top level without a root element:
<a>...</a> <b>...</b>
The correct syntax is as follows:
<root> <a>...</a> <b>...</b> </root>
Case sensitivity: The names of XML elements are case-sensitive. That means the names of the start and end tags must be in exactly the same case.
For instance, <contact-info> is not the same as <Contact-Info>.
This is correct:
<from>Deepak</from>
This is incorrect:
<from>Deepak</From>
The first letter of the start tag is lower-case, while the first letter of the end tag is upper-case, so this is not well-formed XML.
A root element is mandatory in XML: an XML document must contain a root element. A root element can have child elements and sub-child elements.
For instance: In the accompanying XML-document, <message> is the root-element and <to>, <from>, <subject> and <text> are child elements.
<?xml version="1.0" encoding="UTF-8"?> <message> <to>Anuj</to> <from>Deepak</from> <subject>Message from teacher to Student</subject> <text>You have an exam tomorrow at 8:00 AM</text> </message>
The following XML document is not well-formed, because it has no root element.
<?xml version="1.0" encoding="UTF-8"?> <to>Anuj</to> <from>Deepak</from> <subject>Message from teacher to Student</subject> <text>You have an exam tomorrow at 8:00 AM</text>
XML elements must have an end tag.
<text classification = "message">hello</text> -> correct
<text classification = "message">hello -> wrong
It is an error to leave out the end tag when you write XML.
Invalid syntax: <body>See Spot run. <body>See Spot get the ball.
Valid syntax: <body>See Spot run.</body> <body>See Spot get the ball.</body>
Element Type Declarations
The element structure of an XML document can, for validation purposes, be constrained using element type and attribute-list declarations. An element type declaration constrains the element's content. Element type declarations often constrain which element types can appear as children of the element. At user option, an XML processor MAY issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error. An element type declaration takes the form:
- elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'
- contentspec ::= 'EMPTY' | 'ANY' | Mixed | children
Examples of element type declarations:
<!ELEMENT br EMPTY> <!ELEMENT p (#PCDATA|emph)* > <!ELEMENT %name.para; %content.para; > <!ELEMENT container ANY>
An element type must not be declared more than once.
An element type has element content when elements of that type MUST contain only child elements (no character data), optionally separated by white space (characters matching the nonterminal S). In this case, the constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they are allowed to appear.
The grammar is built on content particles (cp), which consist of names, choice lists of content particles, or sequence lists of content particles:
- children ::= (choice | seq) ('?' | '*' | '+')?
- cp ::= (Name | choice | seq) ('?' | '*' | '+')?
- choice ::= '(' S? cp ( S? '|' S? cp )+ S? ')' [VC: Proper Group/PE Nesting]
- seq ::= '(' S? cp ( S? ',' S? cp )* S? ')' [VC: Proper Group/PE Nesting]
Here each Name is the type of an element which may appear as a child. Any content particle in a choice list may appear in the element content at the location where the choice list appears in the grammar; content particles occurring in a sequence list MUST each appear in the element content in the order given in the list. The optional character following a name or list governs whether the element or the content particles in the list may occur one or more times (+), zero or more times (*), or zero or one times (?). The absence of such an operator means that the element or content particle MUST appear exactly once. This syntax and meaning are identical to those used in the productions of the XML specification. The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model.
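Since the operators of a content model have the same meaning as in regular expressions, one way to illustrate the matching rule is to translate a model into a regular expression over the sequence of child element names. The following Python sketch does that under simplifying assumptions: it handles only children content models (no #PCDATA, EMPTY or ANY), performs no determinism check, and the function names content_model_to_regex and matches are hypothetical. It is not how a validating parser is required to work.

```python
import re

def content_model_to_regex(model):
    """Translate a DTD children content model into a compiled regular expression."""
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.\-]*|[(),|?*+]", model):
        if token == "(":
            parts.append("(?:")        # group without capturing
        elif token == ",":
            continue                   # a sequence is plain concatenation
        elif token in ")|?*+":
            parts.append(token)        # same meaning in both notations
        else:
            # one child element, written as its name followed by ';'
            parts.append("(?:" + re.escape(token) + ";)")
    return re.compile("".join(parts))

def matches(model, child_names):
    """Check a list of child element names against the content model."""
    encoded = "".join(name + ";" for name in child_names)
    return content_model_to_regex(model).fullmatch(encoded) is not None

message_model = "(to, from, subject?, text+)"
print(matches(message_model, ["to", "from", "subject", "text"]))
print(matches(message_model, ["from", "to", "text"]))
```

The first call follows a path through the model (subject is optional, text appears at least once), while the second puts to and from in the wrong order and therefore fails.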
An attribute specifies a single property of an element, using a name/value pair. An XML element can have one or more attributes. For instance:
<a href = "http://www.greatlearning.com/">Greatlearning!</a>
Here href is the attribute name and http://www.greatlearning.com/ is the attribute value.
Attribute names are written without quotes, whereas attribute values must always appear in quotes. The following example shows incorrect XML syntax:
<a b = x>....</a>
In the above syntax, the attribute value is not enclosed in quotes.
Now let us look at the syntax of attributes. A start tag in XML can have attributes, and attributes are name/value pairs.
- Attribute names are case-sensitive and should not be in quotes.
- Attribute values must be in single or double quotes.
<text grouping = "message">You have a test tomorrow at 8:00 AM</text>
Here grouping is the attribute name and message is the attribute value.
Let us look at a few more examples of valid and invalid attributes.
A tag can have one or more name/value pairs, but no two attribute names in the same tag can be the same.
- <text class = message>hello</text> -> wrong
- <text "class" = message>hello</text> -> wrong
- <text class = "message">hello</text> -> correct
- <text class = "message" reason = "greet">hello</text> -> correct
- <text class = "message" class = "greet">hello</text> -> wrong
XML Attributes Syntax Rules
- Unlike HTML, attribute names in XML are case-sensitive, i.e. HREF and href are considered two different XML attributes.
- The same attribute cannot have two values in one tag. The following example shows incorrect syntax because the attribute b is specified twice.
<a b = "x" c = "y" b = "z">....</a>
- Attribute names are written without quotes, whereas attribute values must always appear in quotes. The following example shows incorrect XML syntax:
<a b = x>....</a>
In the above syntax, the attribute value is not enclosed in quotes.
Rule: Attribute values should always be quoted
It is an error to omit the quotation marks around attribute values. XML elements can have attributes in name/value pairs, and the attribute value must always be quoted.
Invalid syntax: <?xml version="1.0" encoding="ISO-8859-1"?> <note date=02/02/02> <to>Deepak</to> <from>Spoorthi</from> </note>
Valid syntax: <?xml version="1.0" encoding="ISO-8859-1"?> <note date="02/02/02"> <to>Deepak</to> <from>Spoorthi</from> </note>
The first document is in error because the date attribute of the note element is not quoted.
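The attribute rules in this section can be confirmed with a parser. This is a minimal sketch using Python's xml.etree.ElementTree; the parses helper and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts `text`, False if it raises ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Names differing only in case are two separate attributes.
link = ET.fromstring('<a href="x" HREF="y">link</a>')
distinct_names = sorted(link.attrib)

quoted = parses('<note date="02/02/02"/>')          # double quotes
single_quoted = parses("<note date='02/02/02'/>")   # single quotes
unquoted = parses('<note date=02/02/02/>')          # no quotes
duplicated = parses('<a b="x" c="y" b="z"/>')       # b specified twice

print(distinct_names, quoted, single_quoted, unquoted, duplicated)
```

Both quoting styles are accepted, while the unquoted value and the repeated attribute name are rejected.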
Declarations of Attribute-List
Attributes are used to associate name-value pairs with elements. Attribute specifications must not appear outside of start tags and empty-element tags; consequently, the productions used to recognize them are those for start tags and empty-element tags.
Attribute-list declarations may be used:
• To define the set of attributes pertaining to a given element type.
• To establish type constraints for these attributes.
• To provide default values for attributes.
Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type.
Attribute-List Declaration Grammar
- AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'
- AttDef ::= S Name S AttType S DefaultDecl
The Name in the AttlistDecl rule is the type of an element. At user option, an XML processor may issue a warning if attributes are declared for an element type that is not itself declared, but this is not an error. The Name in the AttDef rule is the name of the attribute.
When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored. For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name in an attribute-list declaration, and at least one attribute definition in each attribute-list declaration. Also for interoperability, an XML processor may at user option issue a warning when more than one attribute-list declaration is provided for a given element type, or more than one attribute definition is provided for a given attribute, but this is not an error.
Types of Attributes
XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types.
- String type: The string type may take any literal string as a value.
- Tokenized types: The tokenized types are more constrained. The validity constraints noted in the grammar are applied after the attribute value has been normalized, as described under attribute-value normalization below.
- Enumerated types: The value must be one of a list of allowed values given in the declaration.
AttType ::= StringType | TokenizedType | EnumeratedType
StringType ::= 'CDATA'
TokenizedType ::= 'ID' | 'IDREF' | 'IDREFS' | 'ENTITY' | 'ENTITIES' | 'NMTOKEN' | 'NMTOKENS'
An attribute declaration provides information on whether the attribute's presence is required and, if not, how an XML processor is to react if a declared attribute is absent in a document.
- DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)
In an attribute declaration, #REQUIRED means that the attribute must always be provided; #IMPLIED means that no default value is provided. If the declaration is neither #REQUIRED nor #IMPLIED, then the AttValue value contains the declared default value; the #FIXED keyword states that the attribute must always have the default value. When an XML processor encounters an element without a specification for an attribute for which it has read a default value declaration, it must report the attribute with the declared default value to the application.
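The reporting of declared defaults can be observed with a small document that carries its attribute-list declaration in the internal DTD subset. The sketch below uses Python's xml.etree.ElementTree, a non-validating parser; the note element and its lang, priority and version attributes are made up for illustration, and behaviour for declarations in external DTD files may differ.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE note [
  <!ATTLIST note lang     CDATA "en"
                 priority CDATA #IMPLIED
                 version  CDATA #FIXED "1">
]>
<note>Hello</note>"""

note = ET.fromstring(document)

lang = note.get("lang")            # default value taken from the declaration
priority = note.get("priority")    # #IMPLIED: no default, so nothing is reported
version = note.get("version")      # #FIXED: the fixed value is reported
print(lang, priority, version)
```

The start tag in the document specifies no attributes at all, yet the application sees the two attributes that have declared defaults.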
Attribute-Value Normalization
Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.
- All line breaks must have been normalized on input to #xA.
- Begin with a normalized value consisting of the empty string.
- For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
– For a character reference, append the referenced character to the normalized value.
– For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.
– For a white-space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
– For any other character, append the character to the normalized value.
If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters with a single space (#x20) character.
Note: If the unnormalized attribute value contains a character reference to a white-space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white-space character (not a reference), which is replaced with a space character (#x20) in the normalized value. It also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white-space character; being recursively processed, that white-space character is replaced with a space character (#x20) in the normalized value.
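The normalization steps above can be sketched in a few lines of Python. This is only an illustration, not how any particular parser implements it: the function name normalize_attribute and the ENTITIES table are hypothetical, the table assumes that the entities d, a and da stand for a carriage return, a line feed, and the two together, and the input is assumed to have had its line breaks normalized already (step 1).

```python
import re

# Assumed replacement text for the illustrative entities d, a and da.
# Character references inside an entity declaration are already expanded
# in the replacement text, so these hold the literal characters.
ENTITIES = {"d": "\r", "a": "\n", "da": "\r\n"}

# One token is a character reference, an entity reference, or a single character.
TOKEN = re.compile(
    r"&#x([0-9a-fA-F]+);"   # hexadecimal character reference
    r"|&#([0-9]+);"         # decimal character reference
    r"|&([^;&\s]+);"        # general entity reference
    r"|(.)",                # any other single character
    re.DOTALL,
)

def normalize_attribute(raw, cdata=True, entities=ENTITIES):
    """Apply steps 2 and 3 of the algorithm, then the extra non-CDATA step."""
    out = []  # step 2: start with an empty normalized value

    def step3(text):
        for hex_ref, dec_ref, name, char in TOKEN.findall(text):
            if hex_ref:
                out.append(chr(int(hex_ref, 16)))  # referenced character, kept as-is
            elif dec_ref:
                out.append(chr(int(dec_ref)))
            elif name:
                # An undeclared entity raises KeyError here; the text calls that an error.
                step3(entities[name])
            elif char in " \r\n\t":
                out.append(" ")                    # literal white space becomes #x20
            else:
                out.append(char)

    step3(raw)
    value = "".join(out)
    if not cdata:
        # Non-CDATA types: trim and collapse runs of space (#x20) characters only.
        value = re.sub(" +", " ", value.strip(" "))
    return value
```

For instance, normalize_attribute("&d;&d;A&a;&#x20;&a;B&da;", cdata=False) gives "A B", while the CDATA result keeps every space.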
All attributes for which no declaration has been read must be treated by a non-validating XML processor as if declared CDATA. It is an error if an attribute value contains a reference to an entity for which no declaration has been read. Following are examples of attribute normalization. Given the following declarations:
<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY da "&#xD;&#xA;">
The attribute specifications in the left column below would be normalized to the character sequences of the middle column if the attribute a is declared NMTOKENS, and to those of the right column if a is declared CDATA. In the first row, the attribute value begins with two literal line breaks.
|Attribute specification||a is NMTOKENS||a is CDATA|
|a="(line break)(line break)xyz"||x y z||#x20 #x20 x y z|
|a="&d;&d;A&a;&#x20;&a;B&da;"||A #x20 B||#x20 #x20 A #x20 #x20 #x20 B #x20 #x20|
|a="&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;"||#xD #xD A #xA #xA B #xD #xA||#xD #xD A #xA #xA B #xD #xA|
Note that the last example is invalid (but well-formed) if a is declared to be of type NMTOKENS.
An element tag may indicate additional properties of its content. For instance, xml:space is used to indicate whether white space is significant. In general, all white space outside of the tag structure is assumed to be significant.
Another special attribute is xml:lang, which can be used to indicate the language of the content. For instance:
<p xml:lang="en">I do not speak Hindi</p>
<p xml:lang="hi">Main Hindi nahin bolata</p>
Attributes must have quoted values
There are two rules for attributes in XML documents:
• Attributes MUST CONTAIN values.
• Those values MUST BE enclosed within quotation marks.
Compare the two examples below. The markup at the top is legal in HTML, but not in XML. To do the equivalent in XML, you have to give the attribute a value, and you have to enclose it in quotes.
Example 1:
<!-- NOT legal XML markup -->
<ol compact>
Example 2:
<!-- legal XML markup -->
<ol compact="yes">
You can use either single or double quotes, as long as you're consistent. If the attribute value contains a single or double quote, you can use the other kind of quote to surround the value (as in name="Deepak's vehicle"), or use the entities &quot; for a double quote and &apos; for a single quote. An entity is a symbol, such as &quot;, that the XML parser replaces with other text, such as ".
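As a small illustration of the quoting rules, here is a sketch using quoteattr from Python's standard library, which picks a quote style and escapes whatever the value requires. The helper make_item and the element name item are made up for this example.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

def make_item(name):
    """Build an <item> tag whose name attribute is safely quoted."""
    # quoteattr returns the value already wrapped in quotes; it prefers
    # double quotes and switches to single quotes or &quot; when needed.
    return "<item name=%s/>" % quoteattr(name)

def read_name(tag_text):
    """Parse the tag back and return the attribute value the parser sees."""
    return ET.fromstring(tag_text).get("name")
```

Whatever mix of quotes the value contains, read_name(make_item(value)) should hand back the original value, because the parser undoes the escaping.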
We have not yet covered DTDs and how they work in detail, but there is one more essential topic to cover here: defining attributes. You can define attributes for the elements that will appear in your XML document. Using a DTD, you can also:
• Define which attributes are required.
• Define default values for attributes.
• List all of the valid values for a given attribute.
Suppose that you want to change the DTD to make state an attribute of the <city> element. Here's how to do that:
<!ELEMENT city (#PCDATA)>
<!ATTLIST city state CDATA #REQUIRED>
This defines the <city> element as before, but the revised example also uses an ATTLIST declaration to list the attributes of the element. The name city inside the attribute list tells the parser that these attributes are defined for the <city> element. The name state is the name of the attribute, and the keywords CDATA and #REQUIRED tell the parser that the state attribute contains text and is required (if it's optional, CDATA #IMPLIED will do the trick).
To define multiple attributes for an element, write the ATTLIST like this:
<!ELEMENT city (#PCDATA)>
<!ATTLIST city state CDATA #REQUIRED
postal-code CDATA #REQUIRED>
This example defines both state and postal-code as attributes of the <city> element.
Finally, DTDs allow you to define default values for attributes and enumerate all of the valid values for an attribute:
<!ELEMENT city (#PCDATA)>
<!ATTLIST city state (AZ|CA|NV|OR|UT|WA) "CA">
This example indicates that it only supports addresses from the states of Arizona (AZ), California (CA), Nevada (NV), Oregon (OR), Utah (UT), and Washington (WA), and that the default state is California. An enumerated attribute type takes the place of the CDATA keyword in the declaration. In this way, you can do a limited form of data validation. While this is a useful function, it is a small subset of what you can do with XML schemas.
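To see a declared default in action, here is a minimal sketch with Python's standard-library ElementTree. It assumes the enumerated declaration is written as an internal DTD subset in front of the element; the helper state_of is hypothetical. Note that this parser is non-validating: it supplies the default, but it does not reject a state outside the list.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset with the enumerated state attribute and its default.
DTD = """<!DOCTYPE city [
<!ELEMENT city (#PCDATA)>
<!ATTLIST city state (AZ|CA|NV|OR|UT|WA) "CA">
]>
"""

def state_of(city_xml):
    """Parse one <city> element and return the value of its state attribute."""
    return ET.fromstring(DTD + city_xml).get("state")
```

Calling state_of("<city>Sacramento</city>") reports the default, while state_of('<city state="NV">Reno</city>') reports the value given in the document. Checking that a value really is one of the six states needs a validating parser.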
XML Comments
Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor may, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string "--" (double hyphen) must not occur within comments. Parameter-entity references must not be recognized within comments.
Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
This is what a comment looks like in an XML document:
<!-- This is just a comment -->
An example of a comment:
<!-- declarations for <head> & <body> -->
Note that the grammar does not allow a comment ending in --->. The following example is not well-formed:
<!-- B+, B, or B--->
Comments can appear anywhere in the document; they can even appear before or after the root element. A comment begins with <!-- and ends with -->. A comment cannot contain a double hyphen (--) except at the end; with that exception, a comment can contain anything. Most importantly, any markup inside a comment is ignored; if you want to remove a large section of an XML document, simply wrap that section in a comment. (To restore the commented-out section, simply remove the comment tags.)
Here is some markup that contains a comment:
<!-- Here's a PI for Cocoon: -->
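The two points about comments (markup inside them is ignored, and a double hyphen inside them is not allowed) can be checked with a short Python sketch. The order document and the helper names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

ORDER = """<order>
  <item>pen</item>
  <!-- <item>ink</item> is commented out, so the parser skips it -->
  <item>paper</item>
</order>"""

def item_names(xml_text):
    """Return the text of every <item> child the parser actually sees."""
    return [item.text for item in ET.fromstring(xml_text).findall("item")]

def parses(xml_text):
    """True if the text parses, False if the parser reports an error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

Here item_names(ORDER) lists only pen and paper, and parses("<r><!-- B+, B, or B---></r>") is False because of the extra hyphen before the closing delimiter.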
XML Character Entities
If an entity named dw has been declared with the replacement text developerWorks, then anywhere the XML processor finds the string &dw;, it replaces the entity with the string developerWorks. The XML specification also defines five entities you can use in place of various special characters.
An entity reference must not contain the name of an unparsed entity. Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.
The entities are:
• &lt; for the less-than sign
• &gt; for the greater-than sign
• &quot; for a double quote
• &apos; for a single quote (or apostrophe)
• &amp; for an ampersand.
Character and Entity References
A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.
CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'
Well-formedness constraint: Legal Character
Characters referred to using character references MUST match the production for Char.
If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ; provide a decimal representation of the character's code point.
Entity reference: An entity reference refers to the content of a named entity. References to parsed general entities use ampersand (&) and semicolon (;) as delimiters. Parameter-entity references use percent sign (%) and semicolon (;) as delimiters.
Entity Reference
Reference ::= EntityRef | CharRef
EntityRef ::= '&' Name ';' [WFC: Entity Declared] [VC: Entity Declared] [WFC: Parsed Entity] [WFC: No Recursion]
PEReference ::= '%' Name ';' [VC: Entity Declared] [WFC: No Recursion] [WFC: In DTD]
Case 1: Character and entity references example
Type <key>less-than</key> (&#x3C;) to save options.
This document was prepared on &docdate; and is classified &security-level;.
Case 2: Parameter-entity reference example
<!-- declare the parameter entity "ISOLat2"... -->
<!ENTITY % ISOLat2 SYSTEM "http://www.xml.com/iso/isolat2-xml.entities" >
<!-- ... now reference it. -->
%ISOLat2;
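The following Python sketch shows how a parser resolves the three kinds of reference discussed here: the five predefined entities, numeric character references, and an entity declared in the document itself. The sample strings and the helper text_of are made up for the example.

```python
import xml.etree.ElementTree as ET

# The five predefined entities.
PREDEFINED = "<p>&lt; &gt; &quot; &apos; &amp;</p>"
# Decimal and hexadecimal character references for "<" and "A".
NUMERIC = "<p>&#60; &#x3C; &#65; &#x41;</p>"
# An internal general entity declared in the DTD subset, then referenced.
DECLARED = '<!DOCTYPE p [<!ENTITY dw "developerWorks">]><p>Read &dw; today.</p>'

def text_of(xml_text):
    """Return the character data of the root element after reference expansion."""
    return ET.fromstring(xml_text).text
```

After parsing, the application sees only the characters: text_of(NUMERIC), for example, contains two less-than signs and two capital A letters, with no trace of how they were written.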
XML CDATA Sections
CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":
- CDSect ::= CDStart CData CDEnd
- CDStart ::= '<![CDATA['
- CData ::= (Char* - (Char* ']]>' Char*))
- CDEnd ::= ']]>'
Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.
In the following example of a CDATA section, "<greeting>" and "</greeting>" are recognized as character data, not markup:
<![CDATA[<greeting>Hello, world!</greeting>]]>
The CDATASection object
The CDATASection object represents a CDATA section in a document. A CDATA section contains text that will not be parsed by a parser. Tags inside a CDATA section are not treated as markup, and entities are not expanded. Its primary purpose is to include material such as XML fragments without needing to escape all the delimiters.
The only delimiter that is recognized in a CDATA section is "]]>", which indicates the end of the CDATA section. CDATA sections cannot be nested.
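Here is a short sketch of how a CDATA section looks from the application side, using Python's standard library. The wrapping element example and the helper names are assumptions for the illustration; minidom is used for the second helper because it keeps CDATA sections as their own node type.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom, Node

SOURCE = "<example><![CDATA[<greeting>Hello, world!</greeting>]]></example>"

def cdata_text(xml_text):
    """Return the root element's text; CDATA content arrives as plain characters."""
    return ET.fromstring(xml_text).text

def first_child_is_cdata(xml_text):
    """Check whether the DOM keeps the first child as a CDATASection node."""
    root = minidom.parseString(xml_text).documentElement
    return root.firstChild.nodeType == Node.CDATA_SECTION_NODE
```

The text returned by cdata_text(SOURCE) still contains the angle brackets, and the root element has no child elements, which is exactly the "character data, not markup" behaviour described above.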
XML Processing
Processing instructions begin with <? and end with ?>. Processing instructions are instructions for the XML processor. Their content is not defined by the XML Recommendation; rather, they are processor-dependent, so not all processors understand all processing instructions. A typical processing instruction that many processors understand tells the processor to use an external style sheet.
Processing Instructions: Processing instructions (PIs) allow documents to contain instructions for applications.
Processing Instructions Example
- PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
- PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
PIs are not part of the document's character data, but must be passed through to the application. A PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardization in this or future versions of the specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter-entity references must not be recognized within processing instructions.
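As an illustration of a PI being passed through to the application, this Python sketch reads the target and data of a style-sheet instruction with minidom. The file name style.css and the helper processing_instructions are placeholders for the example.

```python
from xml.dom import minidom, Node

SOURCE = ('<?xml version="1.0"?>'
          '<?xml-stylesheet href="style.css" type="text/css"?>'
          '<doc/>')

def processing_instructions(xml_text):
    """Return (target, data) pairs for the PIs that sit beside the root element."""
    doc = minidom.parseString(xml_text)
    return [(node.target, node.data)
            for node in doc.childNodes
            if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE]
```

The XML declaration on the first line looks like a PI but is not reported as one; only the xml-stylesheet instruction comes back, with everything after the target delivered as one uninterpreted string.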
XML White Spaces
White space is essentially blank space made up of carriage returns, line feeds, tabs, and/or spaces. White space between markup does not affect how the document is parsed, so you can choose to include it or not. The XML Recommendation also specifies the UNIX convention for line endings: a single linefeed character (ASCII code 10) marks the end of a line, and processors normalize other line endings to it.
Speaking of white space, there is a special attribute (xml:space) that you can use to preserve white space inside your elements (though we won't worry about that just yet).
White space is preserved in XML
Unlike HTML, which does not preserve white space, an XML document preserves white space.
White Space Handling
In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
An XML processor must always pass all characters in a document that are not markup through to the application. A validating XML processor must also inform the application which of these characters constitute white space appearing in element content. A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, must be declared if it is used. When declared, it must be given as an enumerated type whose values are one or both of "default" and "preserve".
<!ATTLIST poem xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>
The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute. This specification does not give meaning to any value of xml:space other than "default" and "preserve". It is an error for other values to be specified; the XML processor may report the error or may recover by ignoring the attribute specification or by reporting the (erroneous) value to the application. Applications may ignore or reject erroneous values.
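A quick way to observe white-space handling is to parse a small element and look at what reaches the application. In this Python sketch the poem text is invented; the namespace URI in XML_NS is the one that the reserved xml prefix is bound to.

```python
import xml.etree.ElementTree as ET

# The reserved xml: prefix is always bound to this namespace.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Note the Windows-style line ending (\r\n) and the leading spaces on line two.
POEM = '<poem xml:space="preserve">Roses are red,\r\n   Violets are blue</poem>'

def space_setting(xml_text):
    """Return the value of xml:space on the root element, or None if absent."""
    return ET.fromstring(xml_text).get("{%s}space" % XML_NS)

def poem_text(xml_text):
    """Return the character data exactly as the parser hands it over."""
    return ET.fromstring(xml_text).text
```

The parser passes the indentation through untouched and turns the \r\n pair into a single linefeed. The xml:space attribute itself is only a signal: it is up to the application reading space_setting to decide what to do with it.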
XML Encoding
Encoding is the process of converting Unicode characters into their equivalent binary representation. When the XML processor reads an XML document, it decodes the document according to the type of encoding. Hence, we need to specify the type of encoding in the XML declaration.
Types of encoding
There are mainly two types of encoding: UTF-8 and UTF-16.
UTF stands for UCS Transformation Format, and UCS itself means Universal Character Set. The number 8 or 16 refers to the number of bits in each code unit used to represent a character: UTF-8 uses 8-bit units (1 to 4 bytes per character), and UTF-16 uses 16-bit units (2 or 4 bytes per character). For documents without encoding information, UTF-8 is assumed by default.
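The byte counts quoted for UTF-8 and UTF-16 are easy to check in Python. This sketch is only an illustration: encoded_sizes is a made-up helper, utf-16-le is used so that no byte-order mark is counted, and the last function shows a parser decoding a UTF-8 document according to its declaration.

```python
import xml.etree.ElementTree as ET

def encoded_sizes(character):
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(character.encode("utf-8")), len(character.encode("utf-16-le"))

def decode_name(document_bytes):
    """Parse an encoded document and return the root element's text."""
    return ET.fromstring(document_bytes).text

# A UTF-8 encoded document whose declaration names its encoding.
DOCUMENT = '<?xml version="1.0" encoding="UTF-8"?><name>Zo\u00eb</name>'.encode("utf-8")
```

An ASCII letter takes 1 byte in UTF-8 and 2 in UTF-16, the euro sign takes 3 and 2, and a character outside the Basic Multilingual Plane, such as an emoji, takes 4 bytes in both.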
Validation in XML
Validation is the process of checking an XML document against a set of rules. An XML document is said to be valid if its contents match the elements, attributes and associated document type declaration (DTD), and if the document complies with the constraints expressed in it. An XML parser checks a document at two levels:
- Well-formed XML document
- Valid XML document
Well-formed XML Document: An XML document is said to be well-formed if it adheres to the following rules:
- Non-DTD XML files must use only the predefined character entities amp (&), apos (single quote), gt (>), lt (<) and quot (double quote); any other entity must be declared.
- Tags must be properly nested, i.e., an inner tag must be closed before its enclosing outer tag is closed.
- Every start tag must have a matching end tag or be a self-closing tag (<title>….</title> or <title/>).
- An attribute name may appear only once in a start tag, and its value must be quoted.
Following is an example of a well-formed XML document:
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
<!DOCTYPE address [
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>
<address>
<name>Deepak Kumar</name>
<company>GreatLearning</company>
<phone>91 123-4567</phone>
</address>
The above example is said to be well-formed because:
- It declares the type of the document. Here, the DOCTYPE declaration names address as the document element and declares its element types.
- It includes a root element named address.
- Each of the child elements name, company and phone is enclosed in its own self-explanatory tag.
- The order of the tags is maintained.
Valid XML Document:
If an XML document is well-formed and conforms to an associated Document Type Declaration (DTD), then it is said to be a valid XML document.
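The well-formedness rules can be tried out with any XML parser, since a parser must reject a document that breaks them. Below is a minimal Python sketch; is_well_formed is a hypothetical helper, and note that it checks well-formedness only, because the standard-library parser does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

ADDRESS = """<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
<!DOCTYPE address [
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>
<address>
<name>Deepak Kumar</name>
<company>GreatLearning</company>
<phone>91 123-4567</phone>
</address>"""
```

The address document passes, while fragments such as "<a><b></a></b>" (bad nesting), "<a x=1/>" (unquoted value), '<a x="1" x="2"/>' (repeated attribute) or "<a>&foo;</a>" (undeclared entity) are all rejected.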
XML Namespaces
In XML, the names of the tags used are defined by the developer. When mixing XML documents from different XML applications, this naming might result in conflicts. XML namespaces provide a method to avoid this issue of element name conflicts.
Name Conflict Example:
The following XML code carries information of HTML table:
<table> <tr> <td>Table</td> <td>Chair</td> </tr> </table>
The following XML code carries the information about a table (Shape):
<table> <name>Rectangle</name> <length>100</length> <width>60</width> </table>
If the above XML fragments were combined, there would be a name conflict, as both contain a <table> element, but the content and meaning of the two elements are different.
An XML application or a user will not be able to know how to handle such differences.
Using Prefix to Solve Name Conflict
Name prefix can be used in XML to avoid name conflicts.
The following code carries the data of both HTML Table and Shape Table:
<t:table> <t:tr> <t:td>Table</t:td> <t:td>Chair</t:td> </t:tr> </t:table>
<s:table> <s:name>Rectangle</s:name> <s:length>100</s:length> <s:width>60</s:width> </s:table>
The example given above will have no conflict as both the <table> elements have different names.
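For a parser to accept prefixed names like these, each prefix has to be bound to a namespace name with an xmlns attribute on the element or one of its ancestors. The sketch below wraps the two tables in a root element that declares both prefixes; the root element and the two example.com URIs are placeholders chosen for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace names; any unique URIs would do.
NAMESPACES = {"t": "http://example.com/html-table", "s": "http://example.com/shape"}

SOURCE = """<root xmlns:t="http://example.com/html-table"
      xmlns:s="http://example.com/shape">
  <t:table><t:tr><t:td>Table</t:td><t:td>Chair</t:td></t:tr></t:table>
  <s:table><s:name>Rectangle</s:name><s:length>100</s:length><s:width>60</s:width></s:table>
</root>"""

def child_tags(xml_text):
    """Return the expanded names ({namespace}local-name) of the root's children."""
    return [child.tag for child in ET.fromstring(xml_text)]

def shape_name(xml_text):
    """Look up the shape table's name using prefix-to-namespace mappings."""
    return ET.fromstring(xml_text).find("s:table/s:name", NAMESPACES).text
```

Both children have the local name table, yet child_tags(SOURCE) returns two different expanded names, so an application can tell them apart. A prefix that is never declared, as in "<t:table/>" on its own, is reported as an error by the parser.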
XML Parser
An XML parser is a package or software library that provides an interface for client applications to work with XML documents. It checks that an XML document is properly formatted and may also validate it. Programs use XML with the help of an XML parser.
Types of parsers:
- DOM Parser
- SAX Parser
- JDOM Parser
- StAX Parser
- XPath Parser
- DOM4J Parser
The Document Object Model (DOM) parser loads the complete contents of a document and creates its entire hierarchical tree in memory. The DOM is officially recommended by the World Wide Web Consortium (W3C).
Make use of a DOM parser when:
- A lot of information regarding the structure of a document is required.
- Movement of parts of an XML document is required.
- Data in an XML document is to be used more than once.
Advantages:
- The API is simple to use.
- A DOM parser supports both read and write operations.
- When random access to widely separated parts of a document is required, a DOM parser is preferred.
Disadvantages:
- As the whole XML document has to be loaded into memory, a DOM parser consumes a lot of memory; hence, it is not memory efficient.
- It is slower in comparison to other parsers.
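To make the DOM points concrete (whole tree in memory, random access, read and write), here is a small Python sketch using the standard-library minidom. The mall document and the function move_price_first are invented for the example.

```python
from xml.dom import minidom

SOURCE = ("<mall><shop><name>Everyday Items</name>"
          "<item>bucket</item><price>50</price></shop></mall>")

def move_price_first(xml_text, new_price):
    """Load the whole tree, move <price> to the front of <shop>, and update it."""
    doc = minidom.parseString(xml_text)          # the entire document is now in memory
    shop = doc.getElementsByTagName("shop")[0]
    name = shop.getElementsByTagName("name")[0]
    price = shop.getElementsByTagName("price")[0]
    shop.insertBefore(price, name)               # moving a node: a write operation
    price.firstChild.data = new_price            # changing the text inside <price>
    return doc.documentElement.toxml()           # serialize the modified tree
```

Because the tree stays in memory, the code can jump from the last child to the first and back without re-reading the document, which is the random access a streaming parser cannot offer.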
Simple API for XML (SAX) does not load the complete document into memory; instead, it parses the document using event-based triggers. A SAX parser creates no parse tree. SAX is a streaming interface for XML: as the parser encounters each element and attribute in the document, applications using SAX receive event notifications in document order, starting at the beginning of the XML document and ending with the closing of the root element.
- SAX Parser recognizes the tokens that make up a well-formed XML document by reading the XML document from top to bottom.
- The way the tokens appear in the document, they get processed in that exact order.
- An “event” handler is provided by the application program that must be registered with the parser.
- Callback methods in the handler are invoked as the tokens are identified with the relevant information.
Use a SAX Parser when:
- The XML document is not deeply nested.
- The XML document can be processed linearly from top to down.
- A massive XML document is being processed whose DOM tree would be consuming too much memory. (Ten bytes of memory is used to represent one byte of XML while implementing DOM.)
- Only a part of the XML document is involved while solving the problem.
- An XML document arrives over a stream as the data is available as soon as the parser sees it.
Advantages:
- SAX Parser is simple to use and memory efficient.
- It works well for huge documents.
- It works very fast.
Disadvantages:
- Its API is less intuitive as it is event-based.
- As the data is broken into pieces, the client never knows the complete information.
- You need to write the code and store the data on your own to keep track of data the parser has seen or change the items' order.
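The event-driven style described above can be sketched with Python's xml.sax module. The handler class ItemHandler and the sample document are hypothetical; the point is that the application registers a handler, the parser calls it back in document order, and the handler has to store whatever it wants to keep.

```python
import xml.sax

class ItemHandler(xml.sax.ContentHandler):
    """Collect the text of every <item> element as the events stream past."""

    def __init__(self):
        super().__init__()
        self.items = []        # our own storage: SAX keeps nothing for us
        self._buffer = None    # set to a list only while inside an <item>

    def startElement(self, name, attrs):
        if name == "item":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join them later.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buffer))
            self._buffer = None

def collect_items(xml_bytes):
    handler = ItemHandler()
    xml.sax.parseString(xml_bytes, handler)   # the parser drives the callbacks
    return handler.items
```

Nothing but the list of item names is ever held in memory, which is why this approach suits very large documents.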
JDOM Parser: JDOM is a Java developer-friendly API that is optimised for Java and uses Java collections such as lists and arrays. It works along with the DOM and SAX APIs, combining the best of the two. It uses less memory and is as fast as SAX.
StAX Parser: It parses a document in the same way as the SAX parser, but in a more efficient manner.
XPath Parser: It parses an XML document based on an expression. It is extensively used in conjunction with XSLT.
DOM4J Parser: It is a Java library that uses the Java Collections Framework to parse XML, with support for XPath and XSLT. The DOM4J parser also provides support for DOM, SAX and JAXP.
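The libraries above are Java libraries, but the two ideas behind StAX and XPath parsing can be sketched with Python's standard library as a rough analogue: XMLPullParser lets the application pull events when it wants them, and ElementTree supports a limited subset of XPath. This is not the Java API, and the function names and sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

MALL = ("<mall><shop><name>Everyday Items</name>"
        "<item>bucket</item><price>50</price></shop></mall>")

def pull_events(xml_text):
    """Pull-style parsing: feed the text, then ask the parser for its events."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(xml_text)
    parser.close()   # signal the end of input; pending events remain readable
    return [(event, element.tag) for event, element in parser.read_events()]

def items_in_shop(xml_text, shop_name):
    """Expression-based selection using ElementTree's XPath subset."""
    root = ET.fromstring(xml_text)
    path = "./shop[name='%s']/item" % shop_name
    return [item.text for item in root.findall(path)]
```

With a pull parser the application decides when to read the next event, whereas with SAX the parser pushes events into callbacks; the XPath helper shows selection by expression rather than by walking the tree by hand.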
Text String Parsing:
<!DOCTYPE html>
<html>
<body>
<p id="example"></p>
<script>
var text, parser, xmlDoc;
// define the text string
text = "<mall><shop>" +
  "<name>Everyday Items</name>" +
  "<item>bucket</item>" +
  "<price>50</price>" +
  "</shop></mall>";
// create an XML DOM parser
parser = new DOMParser();
// the parser creates a new XML DOM object from the text string
xmlDoc = parser.parseFromString(text, "text/xml");
// read the text of the first <name> element and show it in the page
document.getElementById("example").innerHTML =
  xmlDoc.getElementsByTagName("name")[0].childNodes[0].nodeValue;
</script>
</body>
</html>
XML DTD
A Document Type Definition (DTD) defines the legal attributes and elements along with the structure of an XML document. An XML document is well-formed if its syntax is correct, but an XML document that has been validated against a DTD is both well-formed and valid.
Valid XML Documents
A valid XML document is not only well-formed but also conforms to the rules of a DTD.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
<to>Chanchal</to>
<from>Harshit</from>
<heading>Message</heading>
<body>Hi! How are you doing?</body>
</note>
The DOCTYPE declaration above contains a reference to a DTD file. The equivalent declarations, written here as an internal DTD, are shown and explained below.
<!-- declares note as the root element of the document -->
<!DOCTYPE note [
<!-- the note element must contain the elements to, from, heading and body -->
<!ELEMENT note (to,from,heading,body)>
<!-- the to element is of type #PCDATA -->
<!ELEMENT to (#PCDATA)>
<!-- the from element is of type #PCDATA -->
<!ELEMENT from (#PCDATA)>
<!-- the heading element is of type #PCDATA -->
<!ELEMENT heading (#PCDATA)>
<!-- the body element is of type #PCDATA -->
<!ELEMENT body (#PCDATA)>
]>
XML Schema
XML Schema, also known as XML Schema Definition (XSD), is used to describe and validate the structure and content of XML data. It defines attributes, elements and data types. It is similar to DTD but provides more control over the XML structure.
Having the correct syntax makes an XML document well-formed. Being validated against a schema means that the XML document is both well-formed and valid.
XML Schema as an alternative to DTD:
<xs:element name="note"> <!--defines the element "note"-->
<xs:complexType> <!--element note is a complex type-->
<xs:sequence> <!--complex type is a sequence of elements-->
<xs:element name="to" type="xs:string"/> <!--element "to" is of type string (text)-->
<xs:element name="from" type="xs:string"/> <!--element "from" is of type string-->
<xs:element name="heading" type="xs:string"/> <!--element "heading" is of type string-->
<xs:element name="body" type="xs:string"/> <!--element "body" is of type string-->
</xs:sequence>
</xs:complexType>
</xs:element>
XML Schema Data Types
XML schemas have two kinds of data types:
- simpleType: It lets you define text-based elements. It holds only a value and cannot contain attributes or child elements.
- complexType: It can hold multiple elements and attributes. It can be left empty and can have additional sub-elements.
Why are XML Schemas more powerful than DTD?
- XML Schemas are written in XML.
- XML Schemas are extensible.
- Data types are supported by XML Schemas.
- Namespaces are supported by XML Schemas.
XML Vs HTML
|XML stands for Extensible Markup Language.||HTML stands for Hypertext Markup Language.|
|The main focus of XML is on data transfer.||The main focus of HTML is on data presentation.|
|It is content-driven.||It is format driven.|
|It provides the support of namespaces.||It does not provide support of namespaces.|
|Compulsory to add the closing tag.||Not compulsory to add the closing tag.|
|XML tags are not predefined.||HTML tags are predefined.|
|XML has extensible tags.||HTML has limited tags.|
This brings us to the end of this XML tutorial. We hope that you found it helpful and were able to learn more about the concepts. If you wish to learn more such skills, join Great Learning Academy's pool of Free Online Courses!