Abstract
The use of XML as a central strategy for managing scientific documents
on the Web is investigated. We look at the various ways of handling
XML in the Web context in the framework of a general document
repository. Finally we introduce the TIPS Project that proposes a new
approach to scientific information production and dissemination with
XML at the core of its storage paradigm.
XML [28], a meta-language
introduced at the beginning of 1998 for describing structured data on
the Web, builds upon experience gained with SGML and HTML during the
last decade. XML was developed with the Web in mind and also
guarantees a seamless integration with modern programming languages,
such as Java, Perl, and Python. XML is based on Unicode [18] (see [10] for an introduction), so it is well
suited to multi-language documents, especially those
containing many non-ASCII characters. Thus, XML [28] is an ideal storage format for a central repository
containing data in various source formats, languages, and markup
schemes.
XML is widely supported by all major players in the Internet world,
Open Source initiatives, as well as commercial
vendors. Several free and commercial tools are available for all
conceivable operating systems and purposes. In the near future, most
Web browsers and visual editors will support XML natively.
Handling scientific documents
Different ways of using an XML document in a repository are shown in
Figure 1. At the top right we represent
the XML document with its defining vocabulary (DTD or XML
Schema [23], [24], and [25]). This document, which is encoded in
Unicode, can be viewed, searched, indexed, edited, and validated
by any XML-aware application, anywhere in the
world. The XML document can be typeset using TeX (three methods,
labelled A, B, and C, are discussed in [5]) or its 16-bit Unicode-aware variant
Omega [12]. It can also be
transformed into HTML for viewing with present-day browsers (X2H via
XSL). In the (near) future, once browsers are able to handle XML
directly, we can probably skip the HTML intermediate format and let
CSS [20] (possibly via XSLT) style
the XML file directly for display on the Web. Figure 1 also contains arrows going from left (TeX) to
right (XML/HTML, browsers). They indicate programs to transform
existing LaTeX source documents into XML (using one or more standard
DTDs) to store the information for archiving purposes. The vertical
ellipse in the centre represents other editing tools, such as Adobe's
FrameMaker [1],
Microsoft's Word [14],
and Corel's WordPerfect [7], that allow or are expected to allow
import/export of XML documents. Thus, XML genuinely becomes the
central element in a global strategy for managing electronic documents
by allowing information to be stored, saved, shared, and used by
different applications on all computer platforms.
Figure 1. XML as the central part of a document strategy for the Web
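The XML-to-HTML path (X2H) described above can be illustrated with a minimal sketch. It assumes Python's standard xml.etree.ElementTree module rather than an XSL processor, and the input fragment and the function name tei_to_html are hypothetical; only the element names div1, head, and p are taken from the TEI example later in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like input (not taken from a real repository).
SOURCE = """<div1 id="vavref">
<head>Vavilov theory</head>
<p>Vavilov derived a more accurate straggling distribution.</p>
<p>Accents survive: Poincar\u00e9, \u03be.</p>
</div1>"""

def tei_to_html(xml_text):
    """Map div1/head/p onto div/h1/p; other elements are ignored in this sketch."""
    src = ET.fromstring(xml_text)
    div = ET.Element("div", {"id": src.get("id", "")})
    for child in src:
        if child.tag == "head":
            ET.SubElement(div, "h1").text = child.text
        elif child.tag == "p":
            ET.SubElement(div, "p").text = child.text
    # encoding="unicode" returns a str, so non-ASCII characters are kept as-is
    return ET.tostring(div, encoding="unicode", method="html")

html = tei_to_html(SOURCE)
print(html)
```

A production setup would more likely express the same mapping as an XSL stylesheet, as the text suggests; the sketch only shows that the transformation is a straightforward tree rewrite.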
It is not sufficient for XML documents to be available on the Web;
it must also be possible to output them in a typographically optimal
way. LaTeX [8] has been one of the
pillars of scientific typesetting for many years. Therefore,
it is important that tools be available to transform an electronic
document between XML and LaTeX [9].
PassiveTeX [16] and
xmltex [4] are two recent
developments that allow XML sources to be typeset with TeX, either
directly or with the help of XSLT [29]
stylesheets. As an added bonus, PassiveTeX supports mathematics
marked up in MathML [21]
directly, so that an XSLT stylesheet can pass MathML <math>
elements and their children through unchanged. This guarantees
that mathematical material, involving even complex formulae, will be
typeset perfectly. An alternative using DSSSL [13] and Jade [6] is
available.
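To make the XML-to-LaTeX direction concrete, the following sketch translates a very small subset of presentation MathML into LaTeX source. It is not how xmltex or PassiveTeX work internally; the function name to_latex, the choice of supported elements, and the sample formula are assumptions made for illustration, and the mapping of Unicode symbols to LaTeX macros is left out.

```python
import xml.etree.ElementTree as ET

def to_latex(node):
    """Recursively translate a small subset of presentation MathML to LaTeX."""
    kids = [to_latex(k) for k in node]
    tag = node.tag
    if tag in ("mi", "mn", "mo", "mtext"):
        # Token elements: emit their character data unchanged
        return (node.text or "").strip()
    if tag == "mfrac":
        return r"\frac{%s}{%s}" % (kids[0], kids[1])
    if tag == "msub":
        return "%s_{%s}" % (kids[0], kids[1])
    if tag == "msup":
        return "%s^{%s}" % (kids[0], kids[1])
    # math, mrow, and anything unknown: concatenate the children
    return " ".join(kids)

# Hypothetical sample formula using ASCII identifiers only
mathml = ("<math><mfrac><mrow><mn>1</mn></mrow><mrow><mi>x</mi></mrow></mfrac>"
          "<mo>=</mo><msup><mi>y</mi><mrow><mn>2</mn></mrow></msup></math>")
latex = to_latex(ET.fromstring(mathml))
print(latex)  # prints \frac{1}{x} = y^{2}
```

Because MathML is itself a tree, each element maps onto one LaTeX construct, which is why a stylesheet can either translate the formula or simply hand the <math> subtree to a MathML-aware formatter.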
In the other direction, one can translate LaTeX to HTML (or its XML
version XHTML [27]) with
LaTeX2html [15] or
TeX4ht [11]. One can also choose
another target language, such as DocBook [19], for computer documentation, or TEI [3], used in the humanities. For both the
DocBook and TEI DTDs, XSLT stylesheets exist to transform their
source form to HTML or XSL Formatting Objects [30], a generic format that can be translated into PDF
by PassiveTeX or FOP [2].
All documents in the XML repository should be accompanied by metadata
in the form of RDF [22]. Source
files (images, scanned texts or manuscripts, data files) into which it
is impossible or impractical to introduce XML markup should be
accompanied by an external XML file containing similar RDF data. This
ensures that all documents in the database present a uniform XML
interface. This is important for indexing, search, and data mining.
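As an illustration of such an external metadata file, the sketch below builds a small RDF/XML description for a non-XML resource. The use of Dublin Core properties (title, creator, format), the function name make_sidecar, and the example URL and values are assumptions; the paper does not prescribe a particular metadata vocabulary.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core: an assumed vocabulary
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

def make_sidecar(resource_url, title, creator, fmt):
    """Return an RDF/XML string describing one external resource."""
    root = ET.Element("{%s}RDF" % RDF)
    desc = ET.SubElement(root, "{%s}Description" % RDF,
                         {"{%s}about" % RDF: resource_url})
    for name, value in (("title", title), ("creator", creator), ("format", fmt)):
        ET.SubElement(desc, "{%s}%s" % (DC, name)).text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical scanned image stored next to the XML documents
sidecar = make_sidecar("http://example.org/repo/fig1.png",
                       "Scanned manuscript page", "A. N. Author", "image/png")
print(sidecar)
```

An indexer can then treat the sidecar exactly like the metadata embedded in a native XML document, which is the uniform interface the text asks for.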
An example of a scientific document
Part of a source document marked up using the TEI and MathML XML
languages follows.
<div1 id="vavref">
<head>Vavilov theory</head>
<p>Vavilov<ptr type="bib" target="bib-VAVI"/> derived a
more accurate straggling distribution by introducing the kinematic
limit on the maximum transferable energy in a single collision, rather
than using
<inlinemath><math><msub><mi>E</mi><mrow><mtext>max</mtext></mrow></msub>
<mo>=</mo><mi>∞</mi></math></inlinemath>.
Now we can write<ptr type="bib" target="bib-SCH1"/>:
<eqnarray ><subeqn><math><mi>f</mi> <mfenced open='(' close=')'>
<mi>ε</mi><mo>,</mo><mi>δ</mi><mi>s</mi></mfenced>
<mo>=</mo> <mfrac><mrow><mn>1</mn></mrow>
<mrow><mi>ξ</mi></mrow>
</mfrac><msub><mi>φ</mi><mrow><mi>v</mi></mrow></msub>
<mfenced open='(' close=')'>
<msub><mi>λ</mi><mrow><mi>v</mi></mrow></msub><mo>,</mo>
<mi>κ</mi><mo>,</mo><msup><mi>β</mi><mrow><mn>2</mn></mrow>
</msup></mfenced></math></subeqn></eqnarray>
where
<eqnarray><subeqn><math><msub><mi>φ</mi><mrow><mi>v</mi></mrow></msub>
<mfenced open='(' close=')'>
<msub><mi>λ</mi><mrow><mi>v</mi></mrow></msub><mo>,</mo>
<mi>κ</mi><mo>,</mo>
<msup><mi>β</mi><mrow><mn>2</mn></mrow></msup></mfenced>
<mo>=</mo>
<mfrac><mrow><mn>1</mn></mrow>
<mrow><mn>2</mn><mi>π</mi><mi>i</mi></mrow>
</mfrac>
<msubsup><mo>∫</mo>
<mrow><mi>c</mi><mo>-</mo><mi>i</mi><mi>∞</mi></mrow>
<mrow><mi>c</mi><mo>+</mo><mi>i</mi><mi>∞</mi></mrow></msubsup>
<mi>φ</mi><mfenced open='(' close=')'><mi>s</mi></mfenced>
<msup><mi>e</mi><mrow><mi>λ</mi><mi>s</mi></mrow></msup>
<mi>d</mi><mi>s</mi><mspace width='2cm'/><mi>c</mi><mo>≥</mo><mn>0</mn>
</math></subeqn>
<subeqn><math><mi>φ</mi><mfenced open='(' close=')'><mi>s</mi></mfenced>
<mo>=</mo><mo>exp</mo><mfenced open='[' close=']'><mi>κ</mi>
<mrow><mo>(</mo><mn>1</mn><mo>+</mo><msup><mi>β</mi>
<mrow><mn>2</mn></mrow></msup><mi>γ</mi><mo>)</mo></mrow>
</mfenced><mo>exp</mo><mfenced open='[' close=']'><mi>ψ</mi>
<mfenced open='(' close=')'><mi>s</mi></mfenced></mfenced>
<mo>,</mo> </math></subeqn>
<!-- remainder of the section omitted -->
</eqnarray></p>
</div1>
The result of typesetting this document with xmltex and PassiveTeX is
shown in Figure 2. Although MathML is
rather verbose, it is not too difficult to recognize the code for the
formulae shown, so that it becomes possible to search all parts of a
document, including the mathematical formulae and, soon, the
graphics, once SVG (Scalable Vector Graphics, see [26]) is more widely supported.
Figure 2. The document formatted by LaTeX
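Searching inside formulae can be sketched as follows: because every identifier is an explicit <mi> element, finding the formulae that use a given symbol is a simple tree query. The sample document, the function name formulae_using, and the convention of returning 0-based positions are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with three formulae (xi, beta^2, and kappa with beta^2)
DOC = """<div1 id="vavref">
<p>First: <math><mfrac><mrow><mn>1</mn></mrow><mrow><mi>\u03be</mi></mrow></mfrac></math></p>
<p>Second: <math><msup><mi>\u03b2</mi><mrow><mn>2</mn></mrow></msup></math></p>
<p>Third: <math><mi>\u03ba</mi><mo>,</mo><msup><mi>\u03b2</mi><mrow><mn>2</mn></mrow></msup></math></p>
</div1>"""

def formulae_using(xml_text, symbol):
    """Return the 0-based positions of <math> elements containing <mi>symbol</mi>."""
    root = ET.fromstring(xml_text)
    hits = []
    for index, math in enumerate(root.iter("math")):
        if any((mi.text or "").strip() == symbol for mi in math.iter("mi")):
            hits.append(index)
    return hits

print(formulae_using(DOC, "\u03b2"))  # prints [1, 2]
```

The same query against a typeset page image or a PDF would require heuristics, whereas here the markup itself carries the structure.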
The TIPS Project
Tim Bray said that XML is the ASCII for the 21st century.
XML allows documents in all major world languages to be viewed and
transmitted in a standard, reliable way. Dozens of XML applications
and vocabularies exist for XML-encoded tree-structured documents and
data.
In order to leverage these XML technologies and to benefit from the
interoperability and robustness of widely deployed XML solutions on
the Web, the TIPS project [17] was
initiated.
TIPS (Tools for Innovative Publishing in Science) is a
European Union-funded IST project involving five academic institutes and
one commercial company from four different countries. TIPS proposes a
new approach to scientific information production and
dissemination. The aim is to develop a set of user-friendly and
advanced tools and services that are organized in an open system to
support research information production, management, access, and use
in a coherent manner. To implement the system and make its
evaluation possible, these tools and services will be integrated
into a web-based portal for the high-energy physics community.
The proposed system will support the activities of document writing,
reviewing, publishing, searching, disseminating and reading, as well
as the communication among members of the research community. This
approach is suitable for supporting a more productive research
community, in which researchers can work in a more effective,
inexpensive, and pleasant way: delays and costs due to paper documents
can be considerably reduced, multimedia can be added to electronic
documents, and information access can be improved (and information
overload decreased) by using advanced information retrieval and
filtering techniques. XML technologies
will be exploited wherever possible.
Bibliography
[1] Adobe.
FrameMaker 6.0.
http://www.adobe.com/products/framemaker.
[2] Apache XML Project.
FOP, XSL Formatting Object Processor in Java.
http://xml.apache.org/fop/.
[3] Lou Burnard and C.M. Sperberg-McQueen.
TEI Guidelines for Electronic Text Encoding and Interchange.
http://etext.lib.virginia.edu/TEI.html.
[4] David Carlisle.
xmltex: A non-validating (and not 100% conforming)
namespace-aware XML parser implemented in TeX. Available
on CTAN in the directory macros/xmltex/.
[5] David Carlisle, Michel Goossens, and Sebastian Rahtz.
De XML à PDF avec xmltex, XSLT et PassiveTeX.
http://www.gutenberg.eu.org/pub/GUTenberg/publicationsPDF/35-carlisle.pdf
An updated version, focussing on PassiveTeX, will be published in the proceedings of the TUG2000 Conference (TUGboat Vol 23, September 2000).
[6] James Clark.
DSSSL implementation.
http://www.jclark.com/jade.
[7] Corel.
WordPerfect Office 2000.
http://www.corel.com/Office2000
(Microsoft Windows) and
http://linux.corel.com/products/wpo2000_linux
(Linux).
[8] Michel Goossens, Frank Mittelbach, and Alexander Samarin.
The LaTeX Companion.
Addison-Wesley, Reading, 1994.
[9] Michel Goossens and Sebastian Rahtz.
The LaTeX Web Companion.
Addison-Wesley, Reading, 1999.
[10] Tony Graham.
Unicode: A Primer.
M and T Books (IDG), Foster City, 2000.
[11] Eitan Gurari.
TeX4ht: LaTeX and TeX for Hypertext.
http://www.tug.org/applications/tex4ht/mn.html.
[12] Yannis Haralambous and John Plaice.
The latest developments in Omega.
TUGboat, 17 (2), pages 181-183, June 1996.
(See also
http://www.gutenberg.eu.org/omega/).
[13] International Organization for Standardization.
Information Technology -- Processing Languages --
Document Style Semantics and Specification Language (DSSSL). First
edition, 1996.
International Standard ISO/IEC 10179:1996, ISO Geneva, 1996.
A PDF version is at the URL ftp://ftp.ornl.gov/pub/sgml/WG8/DSSSL/dsssl96b.pdf.
[14] Microsoft.
Office Tools, Word.
http://www.microsoft.com/office/word.
[15] Ross Moore et al.
LaTeX2HTML.
http://saftsack.fs.uni-bayreuth.de/~latex2ht/.
[16] Sebastian Rahtz.
Passive TeX.
http://users.ox.ac.uk/~rahtz/passivetex/.
[17] Sissa, Udine, City University, Grenoble University, CERN, IOP.
Tools for Innovative Publishing in Science (TIPS).
http://tips.sissa.it/
[18] The Unicode Consortium.
The Unicode Standard, Version 3.0.
Addison-Wesley, Reading, 2000.
See also
http://www.unicode.org.
[19] Norman Walsh and Leonard Muellner.
DocBook: The Definitive Guide.
O'Reilly and Associates, Inc., Sebastopol, USA, 1999.
http://nwalsh.com/docbook/index.html.
The URL
http://www.oasis-open.org/docbook/documentation/reference/index.html
gives access to the reference documentation, while DTDs and
stylesheets are at
http://nwalsh.com/docbook/index.html.
[20] World Wide Web Consortium. Håkon Wium Lie,
Bert Bos, Chris Lilley, and Ian Jacobs (editors).
Cascading Style Sheets, level 2.
http://www.w3.org/TR/REC-CSS2.
[21] World Wide Web Consortium. Patrick Ion and Robert Miner
(editors).
Mathematical Markup Language (MathML[tm])
1.01 Specification.
http://www.w3.org/TR/REC-MathML/.
[22] World Wide Web Consortium.
Resource Description Framework.
http://www.w3.org/RDF/.
[23] World Wide Web Consortium, David C. Fallside
(editor).
XML Schema Part 0: Primer (W3C
Working Draft).
http://www.w3.org/TR/xmlschema-0.
[24] World Wide Web Consortium, Henry S. Thompson,
David Beech, Murray Maloney, Noah Mendelsohn (editors).
XML Schema Part 1: Structures (W3C
Working Draft).
http://www.w3.org/TR/xmlschema-1.
[25] World Wide Web Consortium, Paul V. Biron,
Ashok Malhotra (editors).
XML Schema Part 2: Datatypes (W3C
Working Draft).
http://www.w3.org/TR/xmlschema-2.
[26] World Wide Web Consortium, Jon Ferraiolo
(editor).
Scalable Vector Graphics (SVG) 1.0 Specification (W3C
Working Draft).
http://www.w3.org/TR/SVG.
[27] World Wide Web Consortium.
XHTML 1.0: The Extensible HyperText Markup Language.
A Reformulation of HTML 4 in XML 1.0.
http://www.w3.org/TR/xhtml1/.
[28] World Wide Web Consortium.
Tim Bray, Jean Paoli, and C. M.
Sperberg-McQueen (editors).
Extensible Markup Language (XML) 1.0.
http://www.w3.org/TR/REC-xml. An annotated version of
the specification is at
http://www.xml.com/axml/axml.html.
[29] World Wide Web Consortium,
James Clark (editor).
XSL Transformations (XSLT),
Version 1.0 (W3C Recommendation 16 November 1999).
http://www.w3.org/TR/xslt.
[30] World Wide Web Consortium, Stephen Deach (editor).
Extensible Stylesheet Language (XSL), Version 1.0 (W3C
Working Draft).
http://www.w3.org/TR/WD-xsl.