One of the tasks of the XML parser is to open entities. Entities can be external files, but also strings, or channels, or anything that can be considered as a stream of bytes. Entities are identified by ID's. PXP knows four kinds of ID's:
- SYSTEM ID's are URL's pointing to arbitrary resources. PXP only includes
  support for opening file URL's.
- PUBLIC ID's are abstract names for entities, such as the well-known
  "-//W3C//DTD HTML 4.01//EN" string. Usually, PUBLIC ID's are
  accompanied by SYSTEM ID's to provide an alternate method for
  getting the entity.
- Private and anonymous ID's do not occur in the XML text; they are
  passed down by the user program.
Resolution proceeds in three steps. First, we have either a SYSTEM or
PUBLIC identifier found in the parsed XML text, or a private or
anonymous identifier that was passed down by some user program. The
second step is to make the identifier absolute. This step is only
meaningful for SYSTEM identifiers, because they can be given as
relative URL's, which are made absolute. Finally, we run a lookup
algorithm that returns the entity to open as a stream of bytes. The
lookup algorithm is highly configurable in PXP, and this chapter of
the PXP manual explains how to configure it.
Related modules and functions: Pxp_reader, Pxp_types.from_file, Pxp_types.from_string, Pxp_types.from_channel, Pxp_types.from_obj_channel.
The simple form of an (external) entity ID is Pxp_types.ext_id: It
enumerates the four cases:
- System url
- Public(public_name, system_url)
- Private p
- Anonymous
A file name can be converted into a URL string suitable for System as follows:
let file_url = Pxp_reader.make_file_url filename
let file_url_string = Neturl.string_of_url file_url
During resolution, a different representation of the ID is preferred -
Pxp_types.resolver_id:
type resolver_id =
  { rid_private: private_id option;
    rid_public: string option;
    rid_system: string option;
    rid_system_base: string option;
  }
A value of resolver_id can be thought of as a matching criterion:
- If rid_private is set to Some p, entities with
  an ext_id of Private p match the resolver_id.
- If rid_public is set to Some public_name, entities with
  an ext_id of Public(public_name,_) match the resolver_id.
- If rid_system is set to Some url, entities match the
  resolver_id when their ext_id is System url or Public(_,url).
It is sufficient that one of these criteria matches to associate the
resolver_id with a particular entity. Note that Anonymous is
missing in this list: it simply matches any resolver_id.
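The matching rules can be illustrated with a small standalone model. This is only a sketch: the types below are defined locally to mirror the shape of ext_id and resolver_id, private_id is modelled as int purely for illustration, and the function matches is a hypothetical helper, not part of PXP. It assumes that one matching criterion is enough.

```ocaml
(* Standalone model, not the PXP definitions. private_id is an int
   here only so that the example is self-contained. *)
type private_id = int

type ext_id =
  | System of string
  | Public of string * string
  | Private of private_id
  | Anonymous

type resolver_id = {
  rid_private : private_id option;
  rid_public : string option;
  rid_system : string option;
  rid_system_base : string option;
}

(* [matches rid id] is true when at least one criterion set in [rid]
   selects the entity identified by [id] (assumption: one match suffices). *)
let matches (rid : resolver_id) (id : ext_id) : bool =
  match id with
  | Anonymous -> true
  | Private p -> rid.rid_private = Some p
  | System url -> rid.rid_system = Some url
  | Public (name, url) ->
      rid.rid_public = Some name || rid.rid_system = Some url

(* A resolver_id that only carries a PUBLIC name. *)
let rid =
  { rid_private = None;
    rid_public = Some "-//W3C//DTD HTML 4.01//EN";
    rid_system = None;
    rid_system_base = None }
```

With this model, rid matches Public("-//W3C//DTD HTML 4.01//EN", "") and Anonymous, but not System "http://sample.org/x.dtd".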
The resolver_id value can be modified during the resolution process,
for example by rewriting. One could rewrite all URL's starting with
http://sample.org to local file URL's when the contents of
this web site are locally available.
rid_system is not necessarily an absolute URL when the
resolution process starts. It is usually rewritten into an absolute
URL during this process. For that reason, we also remember
rid_system_base. This is the base URL relative to which the URL in
rid_system is to be interpreted.
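The role of the base URL can be sketched with plain string handling. This is a deliberately simplified model, not the algorithm PXP uses: the functions is_absolute and absolutize are hypothetical names, and the sketch ignores "..", absolute paths, queries, and base URL's without a path.

```ocaml
(* A URL counts as absolute here when it starts with "scheme:". *)
let is_absolute (url : string) : bool =
  match String.index_opt url ':' with
  | None -> false
  | Some i -> i > 0 && not (String.contains (String.sub url 0 i) '/')

(* Interpret [url] relative to [base] by replacing the last path
   segment of the base. Returns None when a relative URL has no
   usable base. *)
let absolutize ~(base : string option) (url : string) : string option =
  if is_absolute url then Some url
  else
    match base with
    | None -> None
    | Some b ->
        (match String.rindex_opt b '/' with
         | None -> None
         | Some i -> Some (String.sub b 0 (i + 1) ^ url))
```

For example, absolutize ~base:(Some "http://foo.org/file1.xml") "file2.xml" evaluates to Some "http://foo.org/file2.xml", while a relative URL without a base cannot be made absolute.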
The resolution algorithm is expressed as Pxp_reader.resolver.
This is an object providing a method open_rid (open by resolver ID)
that takes a resolver_id as input, and returns the opened entity.
There are a number of predefined classes in Pxp_reader for
setting up resolver objects. Some classes can even be used to
construct more complex resolvers from simpler ones, i.e. there is
resolver composition.
Besides Pxp_reader.resolver, there are also sources, type
Pxp_types.source. Sources are concrete applications of resolvers to
external ID's, i.e. they represent the task of opening an entity with a
certain algorithm, applied to a certain ID. There are several ways of
constructing sources. First, one can directly use the source values
Entity, ExtID or XExtID. Second, there are a number of functions
for creating common cases of sources, e.g. Pxp_types.from_file.
For example, to open the ext_id value e with a resolver r,
the source has to be
let source = ExtID(e,r)
There is also XExtID which allows one to set the base URL in the
resolver_id, and for very advanced cases there is Entity (which
is beyond an introduction).
We give a short summary of the functionality provided by the resolver classes.
Some classes provide quite low-level functionality, especially those
named resolve_to_*. A beginner should avoid them.
Every resolver matches the ID to open against a criterion describing the ID's it is capable of opening. If this matching is successful, we also say that the resolver accepts the ID. Once an ID is accepted, the rest of the resolution process is expected to succeed; if it does not, an error is raised, e.g. a non-existing file leads to a "file not found" error. Not accepting an ID means that, in a composed resolver, another part of the resolver may get the chance to open it.
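The difference between "not accepted" and "accepted, but failed" can be modelled in a few lines. This is a sketch under simplifying assumptions, not PXP code: a resolver is modelled as a function from an ID string to an optional text, and the names resolver, Open_failed, first_accepting, files and catalog are all hypothetical.

```ocaml
(* Model: None means "ID not accepted"; an exception means
   "accepted, but opening failed". *)
type resolver = string -> string option

exception Open_failed of string

let has_prefix ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Try the resolvers in order; the first one accepting the ID wins.
   An exception raised after acceptance is not caught, so later
   resolvers do not get a chance. *)
let rec first_accepting (rs : resolver list) (id : string) : string option =
  match rs with
  | [] -> None
  | r :: rest ->
      (match r id with
       | Some text -> Some text
       | None -> first_accepting rest id)

(* Accepts every file URL, whether or not the "file" exists. *)
let files : resolver = fun id ->
  if not (has_prefix ~prefix:"file:" id) then None
  else if id = "file:///data/foo.xml" then Some "<foo/>"
  else raise (Open_failed id)

(* Accepts only the names listed in a small catalog. *)
let catalog : resolver = fun id ->
  List.assoc_opt id [ "-//FOO//", "<bar/>" ]
```

In this model, first_accepting [files; catalog] passes "-//FOO//" on to catalog because files does not accept it, whereas "file:///missing.xml" is accepted by files and then fails with Open_failed instead of being handed to catalog.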
We especially mention whether relative URL's are specially handled
(i.e. converted to absolute URL's). If not, but you would like to
support relative URL's, it is always possible to wrap the resolver
into norm_system_id. This is generally recommended.
Some resolvers can only be used once because the entity is "consumed" after it has been opened and the XML text is read. Think of reading from a pipe.
Also note that you can combine all resolvers with the from_*
functions in Pxp_types, e.g.
let source = Pxp_types.from_file
~alt:r
filename
The resolver given in alt is tried when the resolver built-in
to from_file does not match the input ID. Here, from_file
only matches file URL's, so everything else is passed down
to alt, e.g. PUBLIC names.
These classes open certain entities. Some also allow you to pass
the resolution process over to a subresolver, but the resolver_id
is not modified.
resolve_to_this_obj_channel
- Class: Pxp_reader.resolve_to_this_obj_channel
- Opens: a Netchannels.in_obj_channel
- Accepted ID's: ext_id
- Matching: against a given ext_id or resolver_id
This example matches against the id argument, and reads from the
object channel ch when the resolver matches:
let ch = new Netchannels.string_channel "<foo></foo>"
let r = new Pxp_reader.resolve_to_this_obj_channel
~id:(Public("-//FOO//", ""))
()
ch
This is a one-time resolver because the data of ch is consumed
afterwards.
resolve_to_any_obj_channel
- Class: Pxp_reader.resolve_to_any_obj_channel
- Opens: a Netchannels.in_obj_channel that is created for every matched ID
- Accepted ID's: ext_id
resolve_to_url_obj_channel
- Class: Pxp_reader.resolve_to_url_obj_channel
- Opens: a Netchannels.in_obj_channel that is created for every matched ID
- Accepted ID's: ext_id, but this resolver is only reasonable for SYSTEM ID's
resolve_as_file
- Class: Pxp_reader.resolve_as_file
- Accepted ID's: SYSTEM or PUBLIC ID's with a URL using the file scheme
- Matching: file URL's, no matter whether the files exist or not (a missing file leads to an error later)
let r = new Pxp_reader.resolve_as_file ()
If the file "/data/foo.xml" exists, and the user wants to open
SYSTEM "file://localhost/data/foo.xml" this resolver will do it.
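lookup_id
- Class: Pxp_reader.lookup_id
- Accepted ID's: ext_id
- Matching: a catalog maps ext_id's to the subresolvers
lookup_id_as_file
- Class: Pxp_reader.lookup_id_as_file
- Accepted ID's: ext_id
- Matching: a catalog maps ext_id's to file names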
let r = new Pxp_reader.lookup_id_as_file
[ System "http://foo.org/file.xml", "/data/download/foo.org/file.xml";
Private p, "/data/private/secret.xml"
]
If the user opens SYSTEM "http://foo.org/file.xml", the file
/data/download/foo.org/file.xml is opened. Note that relative URL's
are not handled. To enable that, wrap r into a norm_system_id
resolver.
If the user opens the private ID p, the file /data/private/secret.xml
is opened.
lookup_id_as_string
- Class: Pxp_reader.lookup_id_as_string
- Accepted ID's: ext_id
- Matching: a catalog maps ext_id's to string constants
let p = alloc_private_id()
let r = new Pxp_reader.lookup_id_as_string
[ Private p, "<foo>data</foo>" ]
let source = ExtID(Private p, r)
lookup_public_id
- Class: Pxp_reader.lookup_public_id
- Accepted ID's: PUBLIC ID's, by the included public_name
- Matching: a catalog maps public_name's to the subresolvers
lookup_public_id_as_file
- Class: Pxp_reader.lookup_public_id_as_file
- Accepted ID's: PUBLIC ID's, by the included public_name
- Matching: a catalog maps public_name's to file names
lookup_public_id_as_string
- Class: Pxp_reader.lookup_public_id_as_string
- Accepted ID's: PUBLIC ID's, by the included public_name
- Matching: a catalog maps public_name's to string constants
lookup_system_id
- Class: Pxp_reader.lookup_system_id
- Accepted ID's: SYSTEM or PUBLIC ID's, by the included url
- Matching: a catalog maps url's to the subresolvers
lookup_system_id_as_file
- Class: Pxp_reader.lookup_system_id_as_file
- Accepted ID's: SYSTEM or PUBLIC ID's, by the included url
- Matching: a catalog maps url's to file names
lookup_system_id_as_string
- Class: Pxp_reader.lookup_system_id_as_string
- Accepted ID's: SYSTEM or PUBLIC ID's, by the included url
- Matching: a catalog maps url's to string constants
These classes pass the resolution process over to a subresolver, and
the resolver_id to open is rewritten before the subresolver is invoked.
Note that the rewritten ID is only visible in the subresolver, e.g. in
let r = new Pxp_reader.combine
[ new Pxp_reader.norm_system_id sub_r1;
sub_r2
]
the class norm_system_id rewrites the ID, and this is only visible in
sub_r1, but not in sub_r2.
norm_system_id
- Class: Pxp_reader.norm_system_id
- Accepted ID's: ext_id
let r = new Pxp_reader.norm_system_id
(new Pxp_reader.lookup_system_id_as_string
[ "http://foo.org/file1.xml", "<foo>&file2;</foo>";
"http://foo.org/file2.xml", "<bar>data</bar>";
]
)
We also assume here that the general entity file2 is declared
as SYSTEM "file2.xml", i.e. with a relative URL. (The declaration
should be added to the file1 XML text to make the example complete.)
The resolver norm_system_id adds the support for relative URL's
that is otherwise missing in lookup_system_id_as_string.
The XML parser would read the text "<foo><bar>data</bar></foo>".
Without norm_system_id, the user can only open the ID's when they
are exactly given as in the catalog list, e.g. as SYSTEM
"http://foo.org/file1.xml".
rewrite_system_id
- Class: Pxp_reader.rewrite_system_id
- Accepted ID's: ext_id
In this example, the contents of foo.org are locally available, and so
foo.org URL's can be rewritten to file URL's:
let r =
new Pxp_reader.rewrite_system_id
[ "http://foo.org/", "file:///usr/share/foo.org/"
]
(new Pxp_reader.resolve_as_file())
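The effect of such a rewriting rule on a URL can be sketched as a plain prefix substitution. This is a standalone model, not the implementation of rewrite_system_id: the function rewrite is hypothetical, and it assumes that the first rule whose prefix matches is applied and that URL's matching no rule are left unchanged.

```ocaml
let has_prefix ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Replace the prefix of [url] according to the first matching
   (from, to) rule; leave the URL unchanged when no rule applies. *)
let rewrite (rules : (string * string) list) (url : string) : string =
  match List.find_opt (fun (from_, _) -> has_prefix ~prefix:from_ url) rules with
  | None -> url
  | Some (from_, to_) ->
      let n = String.length from_ in
      to_ ^ String.sub url n (String.length url - n)

let rules = [ "http://foo.org/", "file:///usr/share/foo.org/" ]
```

Under these assumptions, rewrite rules "http://foo.org/dtd/a.dtd" gives "file:///usr/share/foo.org/dtd/a.dtd", which a file resolver can then open.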
combine
- Class: Pxp_reader.combine
- Accepted ID's: ext_id