


XML document strategies for the Web

Michel Goossens, IT/API


Abstract

The use of XML as a central strategy for managing scientific documents on the Web is investigated. We look at the various ways of handling XML in the Web context in the framework of a general document repository. Finally we introduce the TIPS Project, which proposes a new approach to scientific information production and dissemination with XML at the core of its storage paradigm.


XML [28], a meta-language introduced at the beginning of 1998 for describing structured data on the Web, builds upon experience gained with SGML and HTML during the last decade. XML was developed with the Web in mind and also guarantees a seamless integration with modern programming languages, such as Java, Perl, and Python. XML is based on Unicode [18] (see [10] for an introduction), so that it is well suited for dealing with multi-language documents, especially those containing lots of non-ASCII characters. Thus, XML [28] is an ideal storage format for a central repository containing data in various source formats, languages, and markup schema.

XML is widely supported by all major players in the Internet world, Open Source initiatives, as well as commercial vendors. Several free and commercial tools are available for all conceivable operating systems and purposes. In the near future, most Web browsers and visual editors will support XML natively.

Handling scientific documents

Different ways of using an XML document in a repository are shown in Figure 1. At the top right we represent the XML document with its defining vocabulary (DTD or XML Schema [23], [24], and [25]). This document, which is encoded in Unicode, can be viewed, searched, indexed, edited, and validated without problems by any of a series of XML-aware applications all over the world. The XML document can be typeset using TeX (three methods, labelled A, B, and C, are discussed in [5]) or its 16-bit Unicode-aware variant Omega [12]. It can also be transformed into HTML for viewing with present-day browsers (X2H via XSL). In the (near) future, once browsers are able to handle XML directly, we can probably skip the intermediate HTML format and let CSS [20] (possibly via XSLT) style the XML file directly for display on the Web. Figure 1 also contains arrows going from left (TeX) to right (XML/HTML, browsers). They indicate programs that transform existing LaTeX source documents into XML (using one or more standard DTDs) to store the information for archiving purposes. The vertical ellipse in the centre represents other editing tools, such as Adobe's FrameMaker [1], Microsoft's Word [14], and Corel's WordPerfect [7], that allow or are expected to allow import/export of XML documents. Thus, XML genuinely becomes the central element in a global strategy for managing electronic documents by allowing information to be stored, saved, shared, and used by different applications on all computer platforms.

Figure 1. XML as the central part of a document strategy for the Web

It is not sufficient for XML documents to be available on the Web; it must also be possible to output them in a typographically optimal way. LaTeX [8] has been one of the pillars of typesetting scientific documents for many years. Therefore, it is important that tools be available to transform an electronic document between XML and LaTeX [9]. PassiveTeX [16] and xmltex [4] are two recent developments that ensure that XML sources can be typeset with TeX, either directly or with the help of XSLT [29] style sheets. As an added bonus, PassiveTeX supports mathematics marked up in MathML [21] directly, so that an XSLT style sheet can pass MathML's <math> source elements and their children through unchanged. This guarantees that mathematical material, involving even complex formulae, will be typeset perfectly. An alternative using DSSSL [13] and Jade [6] is available.
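The pass-through idea can be illustrated outside XSLT. The following is only a minimal sketch in Python: the standard library has no XSLT processor, so the rule that a style sheet template would express is hand-coded here. The element mapping `TAG_MAP`, the function `to_html`, and the sample text are assumptions made for illustration and are not part of PassiveTeX or xmltex.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI-style element names to HTML element names.
TAG_MAP = {"div1": "div", "head": "h1", "p": "p"}


def to_html(node):
    """Return an HTML-flavoured copy of node; <math> subtrees are copied verbatim."""
    if node.tag == "math":
        # Pass the formula and all its children through unchanged.
        return copy.deepcopy(node)
    # Unknown elements fall back to <span>; attributes such as id are kept.
    out = ET.Element(TAG_MAP.get(node.tag, "span"), dict(node.attrib))
    out.text, out.tail = node.text, node.tail
    for child in node:
        out.append(to_html(child))
    return out


# Numeric character references are used instead of named entities such as
# &infin;, because no DTD is loaded in this sketch.
SOURCE = """<div1 id="vavref"><head>Vavilov theory</head>
<p>The limit <math><msub><mi>E</mi><mtext>max</mtext></msub><mo>=</mo><mi>&#x221E;</mi></math> is avoided.</p></div1>"""

html = to_html(ET.fromstring(SOURCE))
result = ET.tostring(html, encoding="unicode")
```

Only the structural elements are renamed; the serialized `<math>` subtree in `result` is identical to the one in `SOURCE`, which is the property a downstream MathML-aware formatter relies on.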

In the other direction, one can translate LaTeX to HTML (or its XML version XHTML [27]) with LaTeX2HTML [15] or TeX4ht [11]. One can also choose another target language, such as DocBook [19], for computer documentation, or TEI [3], used in the humanities. For both the DocBook and TEI DTDs, XSLT style sheets exist to transform their source form to HTML or XSL Formatting Objects [30], a generic format that can be translated into PDF by PassiveTeX or FOP [2].

All documents in the XML repository should be accompanied by metadata in the form of RDF [22]. Source files (images, scanned texts or manuscripts, data files) into which it is impossible or impractical to introduce XML markup, should be accompanied by an external XML file containing similar RDF data. This ensures that all documents in the database present a uniform XML interface. This is important for indexing, search, and data mining.
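As an illustration of such an external metadata file, here is a minimal Python sketch that builds an RDF description for a resource that cannot carry markup itself. The choice of Dublin Core properties, the function `make_sidecar`, and the file name are assumptions made for this example; the text above prescribes only that the metadata be RDF.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core, assumed vocabulary
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def make_sidecar(resource_uri, properties):
    """Build an rdf:RDF tree describing one resource with Dublin Core properties."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": resource_uri}
    )
    for name, value in properties.items():
        ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    return rdf


# Hypothetical image file that accompanies a document in the repository.
sidecar = make_sidecar(
    "figures/vavilov-plot.png",
    {
        "title": "Vavilov straggling distribution",
        "format": "image/png",
        "language": "en",
    },
)
sidecar_xml = ET.tostring(sidecar, encoding="unicode")
```

Because the sidecar is itself XML, an indexer can treat it in the same way as the metadata embedded in a marked-up document.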

An example of a scientific document

Part of a source document marked up using the TEI and MathML XML languages follows.

<div1 id="vavref">
<head>Vavilov theory</head>

<p>Vavilov<ptr type="bib" target="bib-VAVI"/> derived a
more accurate straggling distribution by introducing the kinematic
limit on the maximum transferable energy in a single collision, rather
than using
<inlinemath><math><msub><mi>E</mi><mrow><mtext>max</mtext></mrow></msub>
<mo>=</mo><mi>&infin;</mi></math></inlinemath>.

Now we can write<ptr type="bib" target="bib-SCH1"/>: 

<eqnarray ><subeqn><math><mi>f</mi> <mfenced open='(' close=')'>
<mi>&epsi;</mi><mo>,</mo><mi>&delta;</mi><mi>s</mi></mfenced>
<mo>=</mo> <mfrac><mrow><mn>1</mn></mrow>
<mrow><mi>&xi;</mi></mrow>
</mfrac><msub><mi>&phi;</mi><mrow><mi>v</mi></mrow></msub>
<mfenced open='(' close=')'>
<msub><mi>&lambda;</mi><mrow><mi>v</mi></mrow></msub><mo>,</mo>
<mi>&kappa;</mi><mo>,</mo><msup><mi>&beta;</mi><mrow><mn>2</mn></mrow>
</msup></mfenced></math></subeqn></eqnarray> 
where
<eqnarray><subeqn><math><msub><mi>&phi;</mi><mrow><mi>v</mi></mrow></msub> 
<mfenced open='(' close=')'>
<msub><mi>&lambda;</mi><mrow><mi>v</mi></mrow></msub><mo>,</mo>
<mi>&kappa;</mi><mo>,</mo>
<msup><mi>&beta;</mi><mrow><mn>2</mn></mrow></msup></mfenced>  
  <mo>=</mo>   
<mfrac><mrow><mn>1</mn></mrow>
       <mrow><mn>2</mn><mi>&pi;</mi><mi>i</mi></mrow>
</mfrac>
<msubsup><mo>&int;</mo>
<mrow><mi>c</mi><mo>-</mo><mi>i</mi><mi>&infin;</mi></mrow>
<mrow><mi>c</mi><mo>+</mo><mi>i</mi><mi>&infin;</mi></mrow></msubsup>
<mi>&phi;</mi><mfenced open='(' close=')'><mi>s</mi></mfenced>
<msup><mi>e</mi><mrow><mi>&lambda;</mi><mi>s</mi></mrow></msup>
<mi>d</mi><mi>s</mi><mspace width='2cm'/><mi>c</mi><mo>&geq;</mo><mn>0</mn>
                 </math></subeqn>
                 
<subeqn><math><mi>&phi;</mi><mfenced open='(' close=')'><mi>s</mi></mfenced> 
<mo>=</mo><mo>exp</mo><mfenced open='[' close=']'><mi>&kappa;</mi>
<mrow><mo>(</mo><mn>1</mn><mo>+</mo><msup><mi>&beta;</mi>
      <mrow><mn>2</mn></mrow></msup><mi>&gamma;</mi><mo>)</mo></mrow>
</mfenced><mo>exp</mo><mfenced open='[' close=']'><mi>&psi;</mi> 
<mfenced open='(' close=')'><mi>s</mi></mfenced></mfenced>
<mo>,</mo> </math></subeqn>

The result of typesetting this document with xmltex and PassiveTeX is shown in Figure 2. Although MathML is rather verbose, it is not too difficult to recognize the code for the formulae shown. It thus becomes possible to search all parts of a document, including the mathematical formulae and, soon, the graphics, once SVG (Scalable Vector Graphics, see [26]) is more widely supported.

Figure 2. The document formatted by LaTeX
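To make the point about searching formulae concrete, the following minimal Python sketch finds the formulae in which a given identifier occurs. The shortened sample document and the function `formulas_using` are assumptions made for illustration. Numeric character references replace the named entities (such as &xi;) of the example above, since no DTD is loaded here.

```python
import xml.etree.ElementTree as ET

DOC = """<div1 id="vavref">
<p>Energy limit:
<math><msub><mi>E</mi><mtext>max</mtext></msub><mo>=</mo><mi>&#x221E;</mi></math></p>
<p>Distribution:
<math><mi>f</mi><mo>=</mo><mfrac><mn>1</mn><mi>&#x3BE;</mi></mfrac>
<msub><mi>&#x3C6;</mi><mi>v</mi></msub></math></p>
</div1>"""


def formulas_using(root, symbol):
    """Return the <math> elements that contain an <mi> identifier equal to symbol."""
    return [
        m
        for m in root.iter("math")
        if any((mi.text or "").strip() == symbol for mi in m.iter("mi"))
    ]


root = ET.fromstring(DOC)
hits_xi = formulas_using(root, "\u03be")  # Greek small letter xi
hits_E = formulas_using(root, "E")
```

Because each identifier sits in its own `<mi>` element, the search is a plain tree walk rather than a pattern match on typeset output.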

The TIPS Project

Tim Bray said that XML is the ASCII for the 21st century. XML allows documents in all major world languages to be viewed and transmitted in a standard reliable way. Many tens of XML applications and vocabularies exist for XML-encoded tree-structured documents and data.

The TIPS project [17] was initiated to leverage these XML technologies and to benefit from the interoperability and robustness of XML solutions that are widely deployed on the Web.

TIPS (Tools for Innovative Publishing in Science) is a European Union funded IST project involving five academic institutes and one commercial company from four different countries. TIPS proposes a new approach to scientific information production and dissemination. The aim is to develop a set of user-friendly and advanced tools and services that are organized in an open system to support research information production, management, access, and use in a coherent manner. As an implementation, and to make evaluation of the system possible, these tools and services will be integrated on a web-based portal for the high-energy physics community.

The proposed system will support the activities of document writing, reviewing, publishing, searching, disseminating, and reading, as well as communication among members of the research community. This approach is suitable for supporting a more productive research community, in which researchers can work in a more effective, inexpensive, and pleasant way: delays and costs due to paper documents can be considerably reduced, multimedia can be added to electronic documents, and information access can be improved (and information overload decreased) by using advanced information retrieval and filtering techniques. The possibilities of XML technologies will be exploited wherever possible.

Bibliography

  1. Adobe. FrameMaker 6.0. http://www.adobe.com/products/framemaker.
  2. Apache XML Project. FOP, XSL Formatting Object Processor in Java. http://xml.apache.org/fop/.
  3. Lou Burnard and C.M. Sperberg-McQueen. TEI Guidelines for Electronic Text Encoding and Interchange. http://etext.lib.virginia.edu/TEI.html.
  4. David Carlisle. xmltex: A non-validating (and not 100% conforming) namespace-aware XML parser implemented in TeX. Available on CTAN in the directory macros/xmltex/.
  5. David Carlisle, Michel Goossens, and Sebastian Rahtz. De XML à PDF avec xmltex, XSLT et PassiveTeX. http://www.gutenberg.eu.org/pub/GUTenberg/publicationsPDF/35-carlisle.pdf.

    An updated version, focussing on PassiveTeX will be published in the proceedings of the TUG2000 Conference (TUGBoat Vol 23, September 2000).

  6. James Clark. DSSSL implementation. http://www.jclark.com/jade.
  7. Corel. WordPerfect Office 2000. http://www.corel.com/Office2000 (Microsoft Windows) and http://linux.corel.com/products/wpo2000_linux (Linux).
  8. Michel Goossens, Frank Mittelbach, and Alexander Samarin. The LaTeX Companion. Addison-Wesley, Reading, 1994.
  9. Michel Goossens and Sebastian Rahtz. The LaTeX Web Companion. Addison-Wesley, Reading, 1999.
  10. Tony Graham. Unicode: A Primer. M and T Books (IDG), Foster City, 2000.
  11. Eitan Gurari. TeX4ht: LaTeX and TeX for Hypertext. http://www.tug.org/applications/tex4ht/mn.html.
  12. Yannis Haralambous and John Plaice. The latest developments in Omega. TUGBoat, 17 (2), pages 181-183, June 1996. (See also http://www.gutenberg.eu.org/omega/).
  13. International Organization for Standardization. Information Technology -- Processing Languages -- Document Style Semantics and Specification Language (DSSSL). First edition, 1996. International Standard ISO/IEC 10179:1996, ISO Geneva, 1996. A PDF version is at the URL ftp://ftp.ornl.gov/pub/sgml/WG8/DSSSL/dsssl96b.pdf.
  14. Microsoft. Office Tools, Word. http://www.microsoft.com/office/word.
  15. Ross Moore et al. LaTeX2HTML. http://saftsack.fs.uni-bayreuth.de/~latex2ht/.
  16. Sebastian Rahtz. Passive TeX. http://users.ox.ac.uk/~rahtz/passivetex/.
  17. Sissa, Udine, City University, Grenoble University, CERN, IOP. Tools for Innovative Publishing in Science (TIPS). http://tips.sissa.it/
  18. The Unicode Consortium. The Unicode Standard, Version 3.0. Addison-Wesley, Reading, 2000. See also http://www.unicode.org.
  19. Norman Walsh and Leonard Muellner. DocBook: The Definitive Guide. O'Reilly and Associates, Inc., Sebastopol, USA, 1999. http://nwalsh.com/docbook/index.html. The URL http://www.oasis-open.org/docbook/documentation/reference/index.html gives access to the reference documentation, while DTDs and stylesheets are at http://nwalsh.com/docbook/index.html.
  20. World Wide Web Consortium. Håkon Wium Lie, Bert Bos, Chris Lilley and Ian Jacobs (editors). Cascading Style Sheets, level 2. http://www.w3.org/TR/REC-CSS2.
  21. World Wide Web Consortium. Patrick Ion and Robert Miner (editors). Mathematical Markup Language (MathML[tm]) 1.01 Specification. http://www.w3.org/TR/REC-MathML/.
  22. World Wide Web Consortium. Resource Description Framework. http://www.w3.org/RDF/.
  23. World Wide Web Consortium, David C. Fallside (editor). XML Schema Part 0: Primer (W3C Working Draft). http://www.w3.org/TR/xmlschema-0.
  24. World Wide Web Consortium, Henry S. Thompson, David Beech, Murray Maloney, Noah Mendelsohn (editors). XML Schema Part 1: Structures (W3C Working Draft). http://www.w3.org/TR/xmlschema-1.
  25. World Wide Web Consortium, Paul V. Biron, Ashok Malhotra (editors). XML Schema Part 2: Datatypes (W3C Working Draft). http://www.w3.org/TR/xmlschema-2.
  26. World Wide Web Consortium, Jon Ferraiolo (editor). Scalable Vector Graphics (SVG) 1.0 Specification (W3C Working Draft). http://www.w3.org/TR/SVG.
  27. World Wide Web Consortium. XHTML 1.0: The Extensible HyperText Markup Language. A Reformulation of HTML 4 in XML 1.0. http://www.w3.org/TR/xhtml1/.
  28. World Wide Web Consortium. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen (editors). Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/REC-xml. An annotated version of the specification is at http://www.xml.com/axml/axml.html.
  29. World Wide Web Consortium, James Clark (editor). XSL Transformations (XSLT), Version 1.0 (W3C Recommendation 16 November 1999). http://www.w3.org/TR/xslt.
  30. World Wide Web Consortium, Stephen Deach (editor). Extensible Stylesheet Language (XSL), Version 1.0 (W3C Working Draft). http://www.w3.org/TR/WD-xsl.


For matters related to this article please contact the author.
Cnl.Editor@cern.ch


CERN-CNL-2000-002
Vol. XXXV, issue no 2


Last Updated on Fri Aug 18 19:49:22 GMT+04:30 2000.
Copyright © CERN 2000 -- European Organization for Nuclear Research