Robert Tansley - Christopher Gutteridge
This document describes the workings of the Eprints archive software, and how it interacts with the other components to provide services.
What this document does not contain is an API specification. Nor does it contain detailed information about what each file in the distribution does. The code is richly commented; this information is by and large not duplicated in this document. Producing such a document would have delayed the Eprints release by weeks. The best way to learn about the system, and how to achieve something, is to look at the code, read the comments therein, and follow the code's example. The comprehensiveness of this document should improve over time.
Before I delve in, a short note about programming style. The code is designed to be viewed with a tab width of 3. Tabs are always used for indentation. Indentation of eight spaces is far too much, and using a combination of spaces and tabs just causes problems.
In your site-specific code, and especially if you're developing the core code itself, please follow the style of the existing code, and comment richly, so that the Eprints software remains easy to maintain in the future.
Technical queries (not bug reports) concerning the Eprints software should be sent to:
support@eprints.org
General information queries should be sent to:
info@eprints.org
If you suspect you have found a bug, or an error in the documentation, please check the bug reporting system at:
http://bugs.eprints.org/
If the problem has not already been reported, submit a bug report to the system.
Suggestions for new features to add to the software (wishlist items) can also be submitted to the bug report system.
Eprints works by interacting with several other components, as shown in figure 1. Most interaction occurs through a Web browser interface, though during installation and in rare circumstances some Eprints scripts are run from a command line.
This section lists and describes the directories in an installed Eprints system.
The bin directory contains various command line scripts. Some are used during installation and maintenance, some are invoked regularly via crontab or equivalent, and one is invoked whenever an email message arrives at the automatic administration account.
The cfg directory contains the plain text-based configuration files that constitute part of the site-specific portion of the Eprints archive. They contain information about what metadata to hold about each eprint and user, the initial subject hierarchy and templates for standard mails automatically sent to users.
The cgi directory contains the Perl scripts that are invoked by the Apache WWW server. These scripts create the pages with dynamic content, for example search results and the eprint depositing interface. It is strongly recommended that you do not edit the scripts in this directory, since doing so will make upgrading the Eprints software rather difficult. You may wish to add further scripts; name these carefully to avoid clashes with any scripts that may be added to the core Eprints software in the future, for example by adding your own prefix, as in my_newscript.
The html directory is the document root for the Apache WWW server. All static files are stored under this directory. Note that the contents of this directory are managed entirely by the Eprints code and should not be manually edited. To manually add static files (e.g. image files or HTML documents) to the archive, use the static directory and the generate_static script.
The html/documents directory should contain symbolic links to the areas in which eprints can be stored. It must contain at least one before the system can run. See section 2.2 for further details. Note that while it is possible to simply create a subdirectory in which to store eprint files, this is not recommended, since the contents of the html directory should be considered volatile but the eprint files themselves are not.
The perl_lib/EPrints directory contains the core Eprints Perl library files. It is strongly recommended that you do not edit these files at all, since doing so will make upgrading the Eprints software extremely difficult.
The perl_lib/EPrintSite directory contains the Perl library files that constitute part of the site-specific portion of the software. These contain information that is difficult or inefficient to store in plain text-based configuration files. If and when you change or add functionality to the archive, it is strongly recommended that you put as much of the code as possible in this directory.
Contains the old Open Archives subset of the Dienst software developed at Cornell University, with some minor modifications to work with the Eprints software. This code is invoked by the Apache WWW server to respond to Dienst requests, enabling the harvesting of metadata in the archive by Open Archives service providers. This protocol is no longer used by the Open Archives Initiative, but is included for the time being to allow Eprints archives to be harvested by service providers only supporting this old protocol.
The static directory contains the static files (apart from the eprints themselves) that make up the Eprints WWW site. In the distributed version of Eprints these include a ``home page,'' some online help pages for users, the staff page menu and a general information page. However, it is highly likely that these will end up varying greatly from site to site. You can therefore add, remove and change as much as you like in this directory. It is recommended, though, that if possible you do not change the online help files (in the help directory), since these have been designed to be site-independent and will be kept updated with new capabilities and features in new Eprints releases.
The generate_static script copies the files in static, including the directory hierarchy, to the html directory. Non-HTML files are copied verbatim. HTML files are ``filled in'' with appropriate values in place of the placeholders, and given the site ``skin'' (look and feel) if appropriate.
The contents of the html directory are controlled entirely by the software; you should not edit or add things directly. If you want to add an HTML page, graphic or other miscellaneous file to your site, add it to the static directory and run generate_static (or update_laf). The file will be added to the archive Web server.
In order to reduce processor load, and to enable search engines such as Google or AltaVista to index them, the ``Browse by Subject'' views and eprint abstract pages are generated once and then stored as ordinary .html files.
You don't generally need to worry about abstract files; the core code updates these as necessary. If you make a change to the site configuration (e.g. the HTML ``skin'') you can force Eprints to regenerate all of the abstract files by running the generate_abstracts script. (The update_laf script will also do the same job, updating all other pages on the site at the same time.)
The ``Browse by Subject'' views are generated by running generate_views (or again update_laf). This should be done at least once a day; the automatic installer can install a suitable crontab for you. If your site has a lot of traffic you may wish the views to be generated more than once a day; this is easy to achieve by setting up a suitable crontab. Note that in the crontab you should run generate_views, and not update_laf, since the latter may regenerate a large number of pages and affect your server's performance.
When an eprint object is created, it is given an ID code, and a directory under html/documents is created. The ID code is just a prefixed ordinal number; it holds no information about (say) the date it was created. The advantage of using a simple number scheme is that it gives the user a great deal of leeway when entering an ID code into the `view eprint' box; for example, they do not need to remember how many digits an ID code has. Given any unprefixed number, the system can work out which eprint the user is referring to.
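As a minimal sketch of how an unprefixed number might be expanded into a full ID code: the function name normalise_id is hypothetical and is not part of the Eprints API, and the ep prefix and eight-digit width are assumptions taken from the examples in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: expand user input such as "42" or "ep00000042"
# into a full ID code. The "ep" prefix and eight-digit width are
# assumptions based on this document's examples.
sub normalise_id
{
	my( $input ) = @_;
	# Accept an optional prefix followed by one to eight digits
	return undef unless $input =~ /^\s*(?:ep)?(\d{1,8})\s*$/i;
	return sprintf( "ep%08d", $1 );
}

print normalise_id( "42" ), "\n";
```

Anything that is not a plain number (with or without the prefix) yields undef, so a caller can show an error rather than guess.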
When the eprint object is created, the system scans the subdirectories of html/documents in alphabetical order, and the first one it finds with enough space is used to store the new eprint and its document files. The name of the directory for the eprint is worked out from its ID code. An eprint with the ID code ep12345678 will be in the directory:
12/34/56/78
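The mapping from ID code to directory can be sketched as follows. The helper name id_to_path is hypothetical, and the sketch assumes an eight-digit ID code with a non-numeric prefix, as in the example above; the core code may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper: map an ID code to its directory path by
# dropping the non-digit prefix and splitting the digits into pairs.
# Assumes an eight-digit ID code, as in the example above.
sub id_to_path
{
	my( $idcode ) = @_;
	my( $digits ) = $idcode =~ /^\D*(\d{8})$/
		or die "Unexpected ID code: $idcode\n";
	return join( "/", $digits =~ /(\d\d)/g );
}

print id_to_path( "ep12345678" ), "\n";   # prints 12/34/56/78
```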
When the abstract pages of eprints are generated, the system gives the page the site skin (html_head and html_tail in SiteInfo.pm), and calls the eprint_render_full method in SiteRoutines.pm to obtain the content of the page. The resulting HTML is stored in a file called index.html in the eprint's directory.
Document files stored with the eprint are held in further subdirectories. A collection of document files pertaining to a single storage format has a separate document ID and is held in a separate directory. This ID is just the same as the eprint ID, with -00 appended for the first document format, -01 for the second, and so on.
For example, the eprint ep12345678 might have two associated document storage formats: a collection of HTML files and a PDF file. In the file system, this might look like:
12/
  34/
    56/
      78/
        index.html
        ep12345678-00/
          article.html
          figure1.gif
          figure2.gif
        ep12345678-01/
          article.pdf
Information about the individual files stored for each document storage format is not held in the database; only the name of the file that is to be displayed first when the user views that format is stored. In the above example, that would be article.html for document ep12345678-00, and article.pdf for document ep12345678-01.
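To illustrate the document ID scheme, here is a small sketch. The helper document_id and the %main_file hash are hypothetical; the hash simply stands in for the single file name that the database records for each document.

```perl
use strict;
use warnings;

# Hypothetical helper: the document ID is the eprint ID with a
# two-digit, zero-based format counter appended.
sub document_id
{
	my( $eprint_id, $format_number ) = @_;
	return sprintf( "%s-%02d", $eprint_id, $format_number );
}

# Only the name of the first file to display is recorded for each
# document; this hash stands in for that database value.
my %main_file = (
	document_id( "ep12345678", 0 ) => "article.html",
	document_id( "ep12345678", 1 ) => "article.pdf",
);

foreach my $docid ( sort keys %main_file )
{
	print "12/34/56/78/$docid/$main_file{$docid}\n";
}
```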
This section describes how metadata is specified, stored and accessed in the Eprints system.
Metadata fields in the Eprints system will be of one of the types listed in table 1:
When you specify what metadata fields the system should store for each user or eprint, you need to decide which of the above types should be used for each field. The system will then know for each field:
There are two ways in which these metadata fields may be specified. The first is a fairly verbose method, akin to an Apache httpd.conf file, used in the configuration files. In some cases, it is necessary to specify a field in code. In this case, a second, more terse format is used. The format used in the configuration files is described first.
A field is specified in the general form:
<field fieldtag>
	arg1 = val1
	arg2 = val2
	...
</field>
Here's an example. To specify that you want an optional ``Title'' field containing no more than 40 characters:
<field title>
	required = NO
	type = "text"
	displayname = "Title"
	maxlength = 40
	editable = YES
	visible = YES
</field>
Below is a second example, describing a field in which the user can select one of three values:
<field ispublished>
	editable = YES
	type = "enum"
	visible = YES
	displayname = "Status"
	value unpub = "Unpublished"
	value inpress = "In Press"
	value pub = "Published"
	help = "Please state here whether your deposit has been "
	help = "published, is currently in the process of being "
	help = "published (<strong>in press</strong>), or has "
	help = "not been previously published."
</field>
The various arguments and possible values are listed in table 2. Note that the field tag, which must be unique, short and contain no spaces, is given in the <field> line and is used to refer to the field in all other configuration files and code.
In order to allow some fields to be specified in-line in the code, there is an alternative, less human-readable way of specifying a metadata field. These consist of a single string, in the following form:
fieldname:type:misc arguments:displayable name:required?:editable?:visible?:indexed?
Descriptions of each part are given below.
Miscellaneous arguments:
never,Never (Off);daily,Every Day;weekly,Every Week
This means the field can have the value never, daily or weekly, which are displayed as ``Never (Off)'', ``Every Day'' and ``Every Week'' respectively.
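A minimal sketch of how such an argument string could be taken apart is given here. The function name parse_enum_args is hypothetical, and the assumption that values contain neither commas nor semicolons is mine; the real parsing is done when the MetaField object is created.

```perl
use strict;
use warnings;

# Hypothetical parser for the "value,label;value,label" argument
# string. Returns the values in order plus a value-to-label map.
sub parse_enum_args
{
	my( $args ) = @_;
	my( @values, %labels );
	foreach my $pair ( split /;/, $args )
	{
		# Split on the first comma only, so labels may contain commas
		my( $value, $label ) = split /,/, $pair, 2;
		push @values, $value;
		$labels{$value} = $label;
	}
	return( \@values, \%labels );
}

my( $values, $labels ) = parse_enum_args(
	"never,Never (Off);daily,Every Day;weekly,Every Week" );
print "$_ => $labels->{$_}\n" foreach @$values;
```

Keeping the values in a separate array preserves the order in which the options were written, which a plain hash would lose.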
For example:
datestamp:date::Submission Date:1:0:1:0
This specifies a date field internally called datestamp and displayed to the user as ``Submission Date''. It is a required field, but is not editable. It is publicly visible, but not indexed by the MySQL database.
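The terse format lends itself to a simple split on colons. The following is only a sketch: parse_field_spec and the hash key names are hypothetical, and it assumes that no part (including the miscellaneous arguments) contains a colon.

```perl
use strict;
use warnings;

# Part names follow the order given in the format line above; the key
# names themselves are hypothetical.
my @PARTS = qw( name type args displayname required editable visible indexed );

sub parse_field_spec
{
	my( $spec ) = @_;
	# A limit of -1 keeps empty parts, such as a blank arguments slot
	my @parts = split /:/, $spec, -1;
	die "Expected " . scalar( @PARTS ) . " parts in: $spec\n"
		unless @parts == @PARTS;
	my %field;
	@field{@PARTS} = @parts;
	return \%field;
}

my $field = parse_field_spec( "datestamp:date::Submission Date:1:0:1:0" );
print "$field->{name} ($field->{type}): $field->{displayname}\n";
print "required\n" if $field->{required};
print "not editable\n" unless $field->{editable};
```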
Regardless of which way the metadata field was specified, it is used to instantiate a MetaField object. It is by looking at this object that the system works out how the field should be displayed, stored, searched and read from an HTML form. Thus, the verbose method is there purely to ease the process of configuring the metadata that Eprints will store.
This section describes how the MySQL database is used, and the tables created and maintained by the Eprints software.
Eprints creates and uses the following tables:
Most tables are created by reading the relevant metadata field specifications from the relevant Perl class. The users, inbox, buffer and archive tables have additional columns. These are read in from the site metadata configuration files metadata.* in eprints/cfg. The database tables must correspond to these configuration files in order for the software to be able to extract and store data in the database correctly. Thus, if a configuration file is changed, the database must be changed to match, and vice versa.
Database access is performed using the Database.pm module. This provides methods to create tables, and create, update and retrieve records.
Due to the fact that the metadata schema is completely configurable by individual sites, Eprints cannot store all of the data in third normal form. This section describes how each metadata field type is stored.
Table 3 shows how each metadata field type is stored in the database. They may all have the value NULL, provided that the particular field is not a primary key or indexed field in the database. Site-configured metadata fields may not be indexed, as the core code cannot guarantee that the value will always be non-NULL from the start.
The name, set and subject types would ideally be stored in extra tables; however this would greatly increase the complexity of the database and reduce the performance of searches. A particular advantage with storing names in this format is that alphabetical ordering by surname can be achieved by simple ordering in an SQL statement.
The methods in Database.pm automatically escape relevant characters and so on when instructing and querying the database with SQL. This means that, for example, you don't have to worry about prepending double quotes with a backslash.
Eprints doesn't actually do any JOINs. Each record in the database (whether an eprint, document storage format, subject or user) is represented by an object in Perl, and one calls appropriate methods to discover related items. For example, in order to retrieve the document storage formats that a user has deposited for a particular eprint, one calls the get_all_documents method of the EPrints::EPrint class.
Eprints also knows how to search each of the metadata field types in the database. In order to do this, for each format, it must:
This means that Eprints has very powerful search capabilities in that any metadata field can be searched with fine granularity. The main drawback is that there is no notion of ``relevance'' of results; records are either retrieved or not retrieved, and the ordering is based on (for example) primary author's surname.
It should be noted that the search capabilities provided by Eprints are intended only as a starting point, in order to make the software useful in isolation, before Open Archives compliant services become commonplace. In the very near future, more powerful search and linking engines will be able to provide far more sophisticated services, and operate over many archives at once.
SearchField.pm can render HTML input boxes for, store and query individual metadata search fields. It can actually search several fields at once; just pass in an array of MetaField objects to its constructor.
The make_meta_fields method in SearchExpression.pm can be used to make an array of MetaField objects by just passing in their names separated by slashes. So you can call make_meta_fields with:
title/abstract/keywords
and pass the resulting array into the SearchField constructor to make a search field that will search the title, abstract and keywords fields at once. Note that the fields must all use the same HTML input, internal representation and SQL; see the following sections. So, for example, you can search a text field and a multitext field as one, but not a text field and a date field.
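The compatibility rule can be illustrated with a standalone sketch. This does not call make_meta_fields or SearchField; the function split_field_names, the %field_type table and the %search_group table are all hypothetical stand-ins, and the field types listed are assumptions chosen to mirror the example.

```perl
use strict;
use warnings;

# Stand-in table of field types; a real archive would take these from
# its metadata configuration.
my %field_type = (
	title => "text", abstract => "multitext",
	keywords => "text", datestamp => "date" );

# Assumption: text and multitext share a search representation, as
# stated above; every other type is only compatible with itself.
my %search_group = ( text => "text", multitext => "text" );

sub split_field_names
{
	my( $names ) = @_;
	my @fields = split /\//, $names;
	my %groups;
	foreach my $name ( @fields )
	{
		my $type = $field_type{$name};
		die "Unknown field: $name\n" unless defined $type;
		$groups{ $search_group{$type} || $type } = 1;
	}
	die "Fields cannot be searched together: $names\n"
		if keys( %groups ) > 1;
	return @fields;
}

print join( ", ", split_field_names( "title/abstract/keywords" ) ), "\n";
```

Calling split_field_names with title/datestamp dies, since a text field and a date field do not share a search representation.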
Details of how the system performs each of the above three tasks for each metadata field type are given below:
any:mind reasoning consciousness
This will match any record containing any of the words ``mind,'' ``reasoning'' or ``consciousness.''
If no search terms have been entered the internal representation is NULL in MySQL and undef in the Perl code.
tag1:tag2:tag3:ANY
This will retrieve records where the field has any one of the values tag1, tag2 or tag3.
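The two internal representations shown above can be illustrated with a small in-memory sketch. This is not how Eprints evaluates searches (it constructs SQL queries); the helper names text_matches and set_matches are hypothetical, and the all and ALL match types are assumed counterparts to the any and ANY types in the examples.

```perl
use strict;
use warnings;

# Match a text search value of the form "type:word word ...".
sub text_matches
{
	my( $search, $text ) = @_;
	# Assumption: no search terms (undef) means no constraint
	return 1 unless defined $search;
	my( $type, $terms ) = split /:/, $search, 2;
	my @words = split ' ', $terms;
	my $hits = grep { $text =~ /\b\Q$_\E\b/i } @words;
	return $type eq "all" ? $hits == @words : $hits > 0;
}

# Match a set search value of the form "tag:tag:...:TYPE".
sub set_matches
{
	my( $search, @field_values ) = @_;
	my @tags = split /:/, $search;
	my $type = pop @tags;
	my %present = map { $_ => 1 } @field_values;
	my $hits = grep { $present{$_} } @tags;
	return $type eq "ALL" ? $hits == @tags : $hits > 0;
}

my $text_search = "any:mind reasoning consciousness";
print text_matches( $text_search, "The Mind's Eye" ) ? "match\n" : "no match\n";
print set_matches( "tag1:tag2:tag3:ANY", "tag2" ) ? "match\n" : "no match\n";
```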
In addition to the individual search fields, other parameters can be set, either in code or by the user on an HTML form. These parameters, together with the search fields themselves, are known as a search expression.
A search expression is represented by a SearchExpression object (in the SearchExpression.pm module). In addition to the search fields themselves, it holds:
The ordering is specified in terms of SQL, for example:
year DESC, authors, title
This sorts results by descending year, then authors, then title. If neither ASC nor DESC is specified for a column, the order is ascending.
SearchExpression.pm can also produce a text string which, at a later time, can be fed back into SearchExpression.pm to recreate the same search. This technique is used by the subscriptions mechanism to store and retrieve user subscriptions.
Search forms are rendered and handled by the SearchForm.pm module. This in turn uses the SearchExpression.pm module in order to construct the query.
Hence, in code, you should use SearchForm.pm if you are creating an interactive search form, and SearchExpression.pm if you are handling a search purely in code (for example, for a tailored subscription service). You should always use SearchExpression.pm even if your search is only searching one field, since SearchExpression.pm contains the code to construct the SQL query as a whole.
SearchForm.pm handles more or less everything to do with an interactive search, including rendering the form, receiving the user input, performing the search and presenting the results. Creating a customised search is straightforward; the search and adv_search scripts in the eprints/cgi directory are good examples.
You could also quite easily put a ``quick search'' button on your archive's front page using something like the following HTML:
<FORM METHOD="GET" ENCTYPE="application/x-www-form-urlencoded"
ACTION="http://foo.ac.uk/perl/search">
Enter a search term:
<INPUT TYPE="text" NAME="title_abstract_keywords">
<INPUT TYPE="hidden" NAME="title_abstract_keywords_srchtype"
VALUE="all">
<INPUT TYPE="submit" NAME="Search" VALUE="Search">
</FORM>
Note the use of hidden fields to specify defaults. The search scripts will also use defaults for any values that are omitted from the form. In the above example, there is no ordering information, so the search will just use the site default.
The Eprints software features the ability for people to deposit new versions of eprints in the archive. They do this by entering the ID code of the previous version into the relevant box in the depositing interface. This (automatically validated) ID code is then stored in the succeeds field in the eprint's metadata record. (succeeds is a core metadata field that is always present; there is no need to configure this in your site configuration.) No information is stored with the eprint being succeeded.
This usually results in a chain of versions, but may result in a `tree' of different versions. A set of linked versions of an item is referred to as a ``version thread.'' By default, only the user who deposited the eprint may deposit a later version and link it to the previous version.
This linking information is then used by the default configuration to display links to all versions of the eprint being viewed. The core code will automatically work out which abstract pages need to be updated; any eprint in the thread will have its abstract page re-rendered.
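Because only the succeeds link is stored, the full thread has to be reconstructed by following links in both directions. The sketch below shows one way to do that; the %succeeds hash is a hypothetical in-memory stand-in for the succeeds field, version_thread is not an Eprints method, and the links are assumed to contain no cycles.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the succeeds field: each key succeeds the
# eprint named by its value.
my %succeeds = (
	ep00000002 => "ep00000001",
	ep00000003 => "ep00000002",
	ep00000004 => "ep00000002",   # a second branch: the thread is a tree
);

sub version_thread
{
	my( $id ) = @_;
	# Walk back to the first version in the thread
	my $root = $id;
	$root = $succeeds{$root} while exists $succeeds{$root};
	# Then collect every eprint that descends from it
	my @thread;
	my @queue = ( $root );
	while( defined( my $current = shift @queue ) )
	{
		push @thread, $current;
		push @queue, sort grep { $succeeds{$_} eq $current } keys %succeeds;
	}
	return @thread;
}

print join( " ", version_thread( "ep00000003" ) ), "\n";
```

Every ID in the returned list is a candidate for having its abstract page re-rendered when any member of the thread changes.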
This version information is stored in the deletions table if an eprint is removed from the archive. The ``404 Not Found'' error document handler then uses this information to see if a missing document is in fact a previously removed document, and can then direct the user to a more recent version. Note: For this to work, the newer version of the eprint must be installed into the main archive before the old version is removed.
Exactly the same techniques are used to link commentaries and responses. If a user is submitting an eprint that is a commentary on another eprint, or another eprint responding to such a commentary, they enter the ID code in the relevant box on the depositing interface. This is then stored in the commentary field; again the commentary field is always present and doesn't have to be added to your site configuration.
Trees of commentaries and responses are also displayed on the abstract pages of all relevant items in the default site configuration; you can of course remove this. Unlike the versioning trees, any user can submit a commentary on (or response to) another eprint, regardless of who deposited the original eprint. Information about commentaries and responses is lost (the tree is broken) if an eprint in the tree is removed.
A vital component of the software is the interoperability component. Version 1.1 of the software is compliant with the Open Archives Protocol 1.0.
Using the protocol, an Eprints archive can export several metadata sets. By default only Dublin Core metadata is exported.
Two methods in the SiteRoutines.pm module are used to configure the metadata your instance of Eprints will export. The oai_list_metadata_formats method returns a hash, mapping short metadata format names to their XML namespaces. By default, this just returns a hash defined in SiteInfo.pm:
%EPrintSite::SiteInfo::oai_metadata_formats
Code in the oai_get_eprint_metadata method in SiteRoutines.pm is used to map the site's internal metadata set to these formats.
For version 1.1, the base URL of the Open Archives protocol will be:
http://your.eprints.server.edu/perl/oai
More information on Open Archives is available from:
http://www.openarchives.org/
More information on the Open Archives Protocol is available from:
http://www.openarchives.org/OAI/openarchivesprotocol.htm
The Eprints system must be able to send and receive e-mail automatically. E-mails received in the Eprints archive automatic account (known as the autoadmin account) must be piped to the standard input of a newly run process_mail. It's a good idea to have your mail system set up such that bounced error messages and messages sent to mailing lists (spam) are filtered out, for example using a procmail recipe.
When the software sends a mail, it should appear that it has been sent from the human-read admin account, so that replies can be directed appropriately.
Eprints understands two types of email automatically:
USERNAME username PASSWORD abcdef
Provided that the username and password are valid, the e-mail address associated with that user account will be changed to the address in the From: line of the e-mail. The reason address changes are handled this way is that if a user entered an incorrect e-mail address into a box, you have no way of contacting them to correct it.
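As a rough sketch of the parsing involved in an address change request: the function parse_change_request is hypothetical and is not the code in process_mail, the regular expressions are deliberately simple, and checking the credentials against the user records is left out.

```perl
use strict;
use warnings;

# Hypothetical parser for an address-change message. It pulls the
# sender from the From: header and the credentials from the body.
sub parse_change_request
{
	my( $message ) = @_;
	# Headers and body are separated by the first blank line
	my( $header, $body ) = split /\n\n/, $message, 2;
	return undef unless defined $body;
	my( $from ) = $header =~ /^From:.*?([^\s<>]+\@[^\s<>]+)/mi;
	my( $username ) = $body =~ /\bUSERNAME\s+(\S+)/;
	my( $password ) = $body =~ /\bPASSWORD\s+(\S+)/;
	return undef
		unless defined $from && defined $username && defined $password;
	return { from => $from, username => $username, password => $password };
}

my $request = parse_change_request(
	"From: A User <user\@foo.ac.uk>\nSubject: change\n\n" .
	"USERNAME username\nPASSWORD abcdef\n" );
print "$request->{username} -> $request->{from}\n" if $request;
```

A message that yields undef here would fall into the unrecognised category and receive the instructions reply.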
If a user wishes e-mail to be directed to an address they cannot send from, they need to contact the site administration, who can use the user search and editing facilities in the staff area to change the address for them.
If the email isn't recognised, a reply is sent with instructions on how to use the Eprints automatic mail processing correctly. You can change the wording of this text, as well as the ``introduction to the archive'' and ``email successfully changed'' messages, by changing the template.* files in the cfg directory.
You may well want to run the Eprints software as an archive for papers, or other media, produced by a department or group with a specific set of members.
Key modifications you may wish to make might be:
You should try to avoid editing the core files (those in bin, EPrints and cgi), as these will be overwritten when you next upgrade.
We are planning more support for this in the next release, including example scripts to import users and a method for building static ``view'' pages for each member. This would mean that your members could link from their homepage to a page listing all their papers (or whatever your archive archives).
It's likely that you have a large amount of data in another database or format that you want to populate your Eprints database with. The Eprints Perl modules make this importing trivial.
The best place to find information about methods is in the comments in the source files themselves. To help you get started, a simple example importing script is given below.
#!/usr/bin/perl -I/opt/eprints/perl_lib
# Note: it is important that you change the -I parameter appropriately

use EPrints::Database;
use EPrints::Document;
use EPrints::EPrint;
use EPrints::Session;
use EPrints::User;

# The 1 makes this an offline script
my $session = new EPrints::Session( 1 );

# Make a user (if any "create" type function returns undef,
# there was an error)
my $bbs_user = EPrints::User->create_user(
	$session,
	"auto",
	"Automated Import User",
	"User" );
die "Could not create user\n" unless defined $bbs_user;

# Make an eprint
my $new_eprint = EPrints::EPrint::create(
	$session,
	$EPrints::Database::table_archive,
	$bbs_user->{username} );

# Fill out the fields
$new_eprint->{title} = "This is the title";
# This is the internal name format. Or you can use EPrints::Name
# module to handle it for you
$new_eprint->{authors} = ":Tansley,Robert:Harnad,Stevan:";

# Puts the changes in the database
$new_eprint->commit();

# Make a document file
my $document = EPrints::Document::create(
	$session,
	$new_eprint,
	"HTML" );

# If you think a recursive web suck will work (make sure it
# doesn't grab the whole site!)
my $success = $document->upload_url(
	"http://www.foo.ac.uk/Archive/blah.html" );

# or for more fine tuned uploading
# (parentheses are needed so that "or die" applies to open itself)
open( INPUT, "blah.html" ) or die "Error opening blah.html: $!\n";
$document->upload( \*INPUT, "blah.html" );
close( INPUT );
$document->set_main( "blah.html" );

# Put changes in database
$document->commit();

# Terminate the session
$session->terminate();
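The internal name format used for the authors field in the script above can be built and taken apart with a few lines of Perl. The helpers below are hypothetical stand-ins written for illustration; they are not the EPrints::Name module, and they assume that surnames contain neither colons nor commas.

```perl
use strict;
use warnings;

# Hypothetical helpers for the internal name format shown above:
# ":Surname,Firstnames:Surname,Firstnames:".
sub names_to_string
{
	my( @names ) = @_;   # each name is a [ surname, firstnames ] pair
	return ":" . join( ":", map { join( ",", @$_ ) } @names ) . ":";
}

sub string_to_names
{
	my( $string ) = @_;
	# Drop the empty element produced by the leading colon
	return map { [ split /,/, $_, 2 ] }
		grep { $_ ne "" } split /:/, $string;
}

my $authors = names_to_string( [ "Tansley", "Robert" ], [ "Harnad", "Stevan" ] );
print "$authors\n";
print "$_->[1] $_->[0]\n" foreach string_to_names( $authors );
```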
In this section we will give a few examples of simple but useful modifications to the files in /opt/eprints/perl_lib/EPrintSite/ .
In this example we have a data format which has no "start file". Our example format is "Multiple Images". It is for an archive which expects you to submit multiple versions of the same image in different formats, i.e. a 10 MB TIF, a 2 MB TIF and a small JPEG. On the abstract page for this record we want to list and link to all the files in this format, not just the "start" file.
The code which creates the link to each format from the abstract page is in SiteRoutines.pm in the subroutine called eprint_render_full.
$html .= "<A HREF=\"".$_->url()."\">$description</A><BR>";
If we replace this line with the following code we should get a list of all files when displaying this format:
if ( $_->{format} eq "MULTIPIC" )
{
	# MULTIPIC: Multiple Images Format
	# Special display for this format: list every file, not just
	# the "start" file.
	$html .= "<P><B>Multiple Images</B></P><UL>";
	my %files = $_->files();
	my $file;
	foreach $file ( sort keys %files )
	{
		$html .= "<LI><A HREF=\"" . $eprint->url_stem() . $_->{docid} .
			"/" . $file . "\">" . $file . "</A>\n";
	}
	$html .= "</UL>";
}
else
{
	# Display a normal format
	$html .= "<A HREF=\"" . $_->url() . "\">$description</A><BR>";
}
You will need to run /opt/eprints/bin/update_laf to have this take effect on static pages. To make this take effect on pages generated on the fly (e.g. viewing an item in the submission buffer) you will need to restart the webserver.