Basic Properties for Biological Databases: Character Development and Support
J. DIEDERICH
Department of Mathematics, University of California-Davis
Davis, CA 95616, U.S.A.
diedereucdmath.ucdavis.edu
Mathl. Comput. Modelling, 25(10): 109-127 (1997)
(Received November 1995; accepted January 1996)
Abstract: In this paper, we examine problems and solutions for building a large set of characters for descriptive data derived from published species descriptions. The ideas presented lead in the direction of creating a kind of BioDBMS that can be used to support large integrated biological databases.
Keywords: Biological databases, Biological characters, Data modeling, Basic property, Schema design.
1. INTRODUCTION
Among
the types of taxonomic databases [1], i.e., curatorial including
geographical,
nomenclatural, bibliographical,
and morpho-anatomical
(descriptive), descriptive data presents the greatest challenges
to the database designer and software developer, with far less
supporting
software for managing the data than
for the others [2]. One
explanation for this may be the structure and nature of
biological research. Institutions such as horticultural museums and
herbaria
have large collections
of specimens. Consequently, they have the greatest need for creating formal electronic databases
to manage their collections: to locate specimens, track loans to researchers, add new specimens, and the
like. The
structure of their data is, generally, not as complex as the
structure
of descriptive data. On the other hand, descriptive data has, generally speaking, been the province of
individual researchers who, more often than not, focus on a small number of species with a limited
number of external features. There is little uniformity across these databases, and sharing
of electronic data can be difficult, though some software has been developed to help [3,4].
While the complexity and the semantics of the data can be managed for small databases used by experts
on the few species represented, this becomes much harder when the number of species expands
into the thousands and the database is to be used by nonexperts too.
Biologists
have made some strides in exploiting database technology [2,5-7] and
have set
standards to allow for
data exchange for taxonomic
databases [3,4,8]. Much of this work has focused on how to
utilize relational DBMSs, to avoid duplication of effort, and to help
avoid
problems that seem to
inevitably infect many hand tailored
taxonomic databases [2]. Still, database methods and
techniques available for handling descriptive data are inadequate.
Recent
efforts have been
aimed at exploring management of taxonomic biodiversity information
using object-oriented databases and the World Wide Web [9].
The
work discussed in this paper,
to examine problems and propose some solutions in creating
large descriptive databases, is part of the
NEMISYS Project [10-12], an effort to build an identification system
for the
approximately 4000 species of plant-parasitic nematodes. It may eventually
include an equal number of
nonplant-parasitic species. For the most part, the source of
our data will be approximately 10,000 descriptions published
in various journals over much of this century.
Two
important aspects of creating and using a
descriptive database are data modeling and data semantics.
Difficulties arise in modeling the data because the source of
the data is published descriptions. One important goal in creating a
descriptive database is to faithfully represent the
data in the descriptions, correcting for errors and omissions
whenever possible. However, standards
have not
been set over the years for describing species. With each author
placing his or
her stamp on a given description, it remains a
challenge to capture and use the data from this rich and available
source.
Characteristics that are important for some species are not important for
others, so there is no small core of
characteristics on which to focus. Since standard modeling practice
involves determining the data structures in
advance of acquiring the data, it would be necessary
to know what characteristics of species are described in
thousands of descriptions and to know how they are described, a
difficult if
not impossible requirement. The option of forcing the
data into a predetermined structure simply will not work for large
collections if the goal is
to faithfully
record the data. Thus, it is necessary, to some extent, to create the
database structures
as the data is acquired. Consequently,
building a complete data model prior to data acquisition is impossible in such cases;
modeling must remain a dynamic process, which poses significant challenges in
maintaining a consistent and uniform model. Each
new description can cause changes in the model.
Yet, it is not possible to take the properties exactly as they
come from the literature since the resulting
database structures would
then seem too chaotic to use effectively. This seems like an impossible task with contradictory
goals. If the
data is represented faithfully, the data structures would be too disorganized, while on
the other
hand, if the data is too rigidly structured, there would be a loss of information. In this
paper, we
introduce new modelling concepts that support character set creation and make these
contradictory goals tractable. A subsequent paper [13] will explore
the use of these concepts in a wide variety of practical situations and will
develop guidelines for their use.
What
we are attempting goes somewhat counter to
standard practice. Biological data is often reformulated in some
fashion prior
to its storage and use, usually in small individual databases, to
support a particular type of activity such as
identification within a particular taxon using dichotomous
keys. For example [14], in some species the manipulated data
may consist of two states,
obviously winged and
not obviously winged for identification within a certain taxon,
while
for phenetic classification it may be necessary to break these down into their
various types such as narrowly winged, widely winged, terete, and striate. However,
a problem arises when new species are
added to
the system since the structure of the data often needs to be modified
to remain
consistent
relative to its intended use.
Additionally, such manipulated data is difficult to use for purposes
other than that specifically intended and
often the data does not reflect what was in the
original source. Thus, data semantics have to be supported above the
level of the database structures
to allow
for a wide variety of uses. This is analogous to database views, where
different users
see the data structure
according to their needs. However, database views are not sufficiently powerful
to deal with the variety of uses of
biological data. Clearly, the manner of supporting the
semantics
will affect how the data is structured.
It
is true that some of the
complexity we have encountered is due to the nature of these microscopic
round worms, where internal as well as external parts are used in
descriptions
for differentiating taxa. In many
other
areas, only a few dozen characteristics are required for differentiating and describing species.
Nevertheless, if
large descriptive databases are to be constructed and maintained, many of the problems we
address
will have to be solved for these databases if they are to be
integrated.
In the
remainder of this paper, in Section 2, we
present some background. In Section 3 problems are discussed
that
arise in creating a large list of biological characters. In Section 4, we
introduce the major concept of basic property and its features to
facilitate
handling these problems, including
the representation and use
of state-based relationships in Section 5. In Section
6, we extend the idea of name extension introduced in Section 4 to
another
context, and in
Section 7, we briefly discuss schema changes.
2. BACKGROUND
In
conceptualizing descriptive data, the biologist
thinks in terms of what is commonly called the
"data matrix," i.e., a taxon by character array [2]. Within the
matrix are the states or numerical values of the species for that
character.
Some
clarification may be needed with regard to our meaning of character
since the
term does not have a standard
definition among biologists. Even
the standardization of the concept of a character
would go a long way towards integrating descriptive databases. At one
extreme,
in a biological key a
character used for differentiation
among species may be as complex as "esophagus with
valve-like expansion about one to two
head diameters
posterior to base of stoma; amphid unispiral."
At the other end of the spectrum, the practice is to treat a character
as a
single characteristic such as
Flower color or Leaflet presence,
each considered as an atomic unit and stored as data within a
single field in a data table, as shown in Figure 1 [2]. In some cases, the
character includes a state as well, as in petal pink. Also, characters
are definitely not fully decomposed in a hierarchical
fashion, though DELTA [4] allows a one-level breakdown into character
subheadings.
Taxon           | Character        | Data value
Vicia cracca    | flower colour    | blue
Vicia cracca    | flower colour    | violet
Lathyrus aphaca | leaflet presence | absent
Figure 1. Characters as a unit in a single field [2].
We take
a character to be a triple: (biological
structure, property, state/value). A biological structure,
or structure, can be a system, organ, organ part, tissue, etc. A property is
usually
an attribute of the structure, such as shape, length, color, and so
forth. A state is the
quality of the property such
as round or
pink, while value is
taken to be a numerical value. Also, structures can have
substructures as well, as seen in Figure 2. This is more consistent
with usage
in database design where entity (structure), attribute (property), and
value
are the main elements. At times we
will use the terminology list
of characters or characters to simply mean a list
of (structure name, property) tuples, ignoring the states, which in
examples
will be written without the parentheses, as in body, shape. We do not
impose the restriction that a character must differentiate
taxa, though usually it does, and a bit of nondescriptive data is
usually included, such as host, soil, and stage
(gender). Our definition of a character will become more evident when basic properties
are introduced later
in this paper. An extended discussion of the definition of a character will
be found in [13].
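
As an illustration only, the triple (biological structure, property, state/value) can be sketched as a small Python data type. The class and field names below are our own assumptions for the example and are not part of the NEMISYS system; the structure path is taken from Figure 2.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Character:
    """Hypothetical encoding of a character as (structure, property, state/value)."""
    structure: Tuple[str, ...]                   # path from a top-level structure down to a substructure
    prop: str                                    # the property, e.g. "shape" or "length"
    state: Optional[Union[str, float]] = None    # a qualitative state or a numerical value

    def name(self) -> str:
        """Short '(structure name, property)' form, written without parentheses."""
        return f"{self.structure[-1]}, {self.prop}"


# Example taken from the hierarchy in Figure 2.
median_bulb_shape = Character(
    structure=("digestive system", "oesophagus", "median bulb"),
    prop="shape",
    state="round",
)
print(median_bulb_shape.name())
```

Leaving `state` unset corresponds to the "list of characters" usage above, where only the (structure name, property) pair is of interest.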
Biologists
often talk about the "data
matrix" for a domain even though it may in fact not exist. In
nematology for instance, a data matrix has never been constructed for
large
sets of species. The
number of characters needed for small sets of species can usually be
limited to a handful of differentiating
characters. As the number of species increases, the number of
characters needed
increases as well
and the
complexity tends to frustrate development of a data matrix of any appreciable size, rendering the notion
of "data matrix" as more of a concept than reality. While it might seem conceptually possible to
combine the data from data matrices for small collections into a single large one,
in practice it can be difficult to do since the problems that occur with
small sets of characters would likely be solved in
very different ways, making integration difficult. Part
of what we present here is to help standardize this process.
Structures/Substructures      Property    States
Digestive system
  Oesophagus
    Median bulb               shape       slight swelling
                                          fusiform
                                          oval
                                          round
                                          quadrangular
                                          violin
Figure 2. Sample hierarchy for descriptive character data.
3. SCHEMA CREATION PROBLEMS
In this section, we examine some
of the problems
of creating a large schema of descriptive characters.
In the NEMISYS project, the nematologist took charge of building the
original
list of characters, and
he felt with
about 125 characters, excluding states, that the list was close to complete.
At this size, the list seemed to be quite manageable.
Unfortunately, the number
of characters continued to grow, causing significant complications. The nature of the problems was
not always apparent. Often there was a
tension between the semantics and the structure, as one can attempt to
capture
too much of the semantics via the structure.
It also became clear that a tool incorporating new concepts and
standards would
be needed to manage the
list of
characters. It may be somewhat surprising that the initial estimate of 125 characters would be so
different from the
current number of over 700, even with techniques designed to consolidate the character
set. However, this illustrates the
point discussed earlier that descriptions
reflect their authors' tastes as well as the fact that
standards have
not been set with an eye towards electronic storage and
retrieval.
Testis
  Anterior part of the testis
    Spermatocytes             shape       cuboidal
                                          circular
  Posterior part of the testis
    Spermatocytes             shape       cuboidal
                                          circular
Figure 3. A possible decomposition.
There
are some well-known,
inherent problems in creating any hierarchical decomposition. A simple
example is having to choose among
decompositions according to function: should one make digestive
muscles part of the digestive system or part of the
muscular system?
Likewise, one might wish other
groupings such as by
region. These multiple views could easily be handled in a
"schema tool" that supports alternative groupings of structures.
While views can be quite complicated in some domains, e.g., viewing an
airplane's structure, aerodynamics, or electronics [15],
this degree of complexity does not appear to be needed in descriptive
databases, though it is easy to
envision more complex ones that
include data on physiology, ecology, and the like where they might be needed.
Even
within a structural decomposition, problems can arise. For example, the testis
contains spermatocytes, which in
some cases have different shapes depending on whether they are
contained in
the anterior end or posterior end. In choosing the decomposition as
shown in
Figure 3, there is an advantage
whenever the difference is
exhibited in a species, but this is not desirable when the
shape
is the same at both ends since the result would have to be stored
twice. There
are many other plausible
and seemingly reasonable
alternative decompositions, but all tend to lead to
problems in creating and utilizing the database, and would of course
create
difficulties in integrating
even small databases as shown in [13].
73. Numero de estambres (cuando el perianto es presente)
    1. diferente del numero de petalos o de sepalos
    2. igual al numero de petalos o de sepalos
74. Posicion de estambres (cuando el numero de estambres numero de tapalos)
    1. no opuestos a los petalos
    2. opuestos a los petalos (o alternos a los sepalos)
75. Estambres - numero
    1. no mas de quince
    2. mas de quince
76. Numero de estambres
    1. uno
    2. dos
    3. tres
    ...
    10. diez
    11. mayor de once
    12. de quince en adelante
77. Anteras fertile - numero
    1. 1
    2. 2
    3. 3
    ...
    10. 10
    11. 11 o mas
Figure 4. Some characters from the flora of Veracruz database [16].
CONSISTENCY AND UNIFORMITY. As the
size of the list of characters expanded, the biologist felt a sense of
loss of
control. One of the main problems was trying to maintain a uniform and consistent
set. While one can solve this problem just by being consistent, it is
not so
easy to remember how similar
characters were expressed in
other parts of the list. For example, the property presence with
possible states present/absent is one way to represent whether a part is
present or not. However, the property visibility with states absent/faint/clear/conspicuous
is another way, with the
latter three implying the
presence of the part. As another example, consider the characters and states
found in Figure 4 from [16]. Here we see four ways of representing numerical values: as integers
(1,2,3, etc.) in
character 77, as strings (uno, dos, tres, etc.) in character
76, as ranges in character 75, and as comparators in character 73,
though the
latter two are attempts to deal
with a problem discussed below.
One also may confront mixed expressions in the
same character such as (1, 2, 3, 4, 5, or more) as in
character 77 or
(1, 2, 3, 4, 5, half a dozen, about a
dozen, many)
similar to character 76. These are all characters that appear with
one another, where consistency should be easy to
observe, yet even here
it has not been maintained.
Also
note that the expression of the character name is
sometimes "numero de" and at other times "- numero".
While this may seem a trivial concern, it can have implications for
supporting
and using the characters.
In other cases, useful information may be embedded in the character name
which makes it difficult to process the character properly. For
example, the
properties shape of the female and shape
of the male hide the fact from the system that one character is
for females
and the other is for males, yet characters found in the female genital
system
and the male genital system would
easily be detected as gender
based and could thus be used appropriately by the system as in an
identification where an unknown is known to be female.
IMPLICIT PROPERTIES. For some
biological characters the property is implicit and it may be difficult
to determine what the correct property is. For example, in "body 200 µm,
smooth" one implicit property is
the length of the
body, but it is unclear what the correct property is for the state smooth.
One
solution is to create a property out of one of the states, in this case
it
could be smoothness, with smooth one of its states.
We observed
that when property names were not naturally associated with a character
the
biologist tended to use artificial property names
such as aspect, type, situation, or nature. They were often
used
interchangeably and at times inconsistently.
For
instance, in one character the property nature may be used to
indicate whether
the part is faint or well marked while in another aspect
is
used for the same states, and in another visibility is used.
It becomes
even more problematic when there are multiple distinguishing
implicit properties for the same part. For example, "hair curly, thick,
and coarse" has three implicit
properties, but it may not be
easy to identify what the property names should be among
the choices curliness, thickness, coarseness, body, texture, etc. This
may
explain in part why in practice petal
pink is treated as a
character, since one does not have to deal with naming the property (obviously
color in this case), whether it is obvious or unknown.
PROPERTY
EXPLOSION. In a
descriptive database,
knowing the data is necessary just to correctly specify
the properties, that is, you need to know what the data is going to be
prior to
creating its storage
structure. Unfortunately, this is
difficult when there are descriptions of thousands of species,
each bearing the stamp of its author. One result is an explosion
of very similar properties as
new data is added to
the database. For example, the diameter of the body may
appear to be a reasonable property to use. But the literature may
incrementally
reveal that the diameter of
the body is measured at
the vulva, at midbody, at the stylet base, or at several other
positions on the body, or may indeed
be the maximum diameter without
specifying where the measurement is
made, though an expert on the
species will probably understand what is intended. Creating a new
property each time is clearly not the best resolution.
CHARACTER STATE SYNONYMY. In
biology, states represent an important part of the domain expertise
and thus are properly part of the schema development. It may be difficult
to
determine if a state is a separate
state
or a synonym for another state, or determine
if a group of states
should be logically
combined into one property or kept separate in two. In particular, the
language
used in published
descriptions is quite rich. For example, it is not
obvious if widely
open C, open C, very open C
are synonymous with weak C or are
distinct states. Experts could disagree.
No standards generally exist for
published descriptive
terminology and with thousands of descriptions it takes significant
effort to determine valid properties and states, to determine which are
new states
and which are
synonymous with existing states. Thus, creation of a
schema is not a short-term activity; it can only develop over time as
the
literature is reviewed, a time-consuming task for experts.
Therefore, the design must be sufficiently
flexible to change as new data
is acquired, since the
addition of new data can add to the list of states, which play a
critical role
in relationships expressed in the design, particularly with state-based
relationships discussed below.
GENERAL AND SPECIFIC STATES. Another problem, similar to synonyms, involves
general and specific states for
a property. For example, if round,
circular, and elliptical
are states, the first
is a general state
encompassing the latter two,
which are more specific. An example from our domain is the set of
states ellipsoidal,
oval, almost round, almost circular, subspherical, round, spherical, and quadrangular from the character median bulb, shape.
In some
descriptions, the author may only give the
general state while other authors give specific states. If the
general-specific state relationships are not represented,
then the queries "find all taxa with median bulb, shape = round" and
"find all taxa with median bulb, shape = almost round"
will fetch different sets of species. If
the relationships are represented, then the first query returns all
species for
those states for which round is
considered
a general state, while the latter query would return all those with the state almost
round, but should
also return
those with the state round as "maybe" results. Alternatively, if two
properties were used, one for the general states and another for the specific
states, one called median bulb, shape and
another called median bulb, general
shape, then what would be returned
would depend on the formulation of the queries.
General and specific states may
arise due to
different uses of the data. For example [14], for phenetic
classification a
stem has the finer distinctions terete, striate, narrowly winged, or widely winged, while
for identification there may be only two categories, not obviously winged and obviously winged. In essence, the latter two can be
considered general states for terete, striate and narrowly winged,
widely winged, respectively.
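
The query behavior described above for general and specific states can be sketched as follows. This is a minimal illustration under our own assumptions: the mapping of specific states to the general state round, the taxon names, and the function `query` are hypothetical and do not come from the NEMISYS data.

```python
from typing import Dict, Set, Tuple

# Assumed mapping for "median bulb, shape": specific state -> general state.
GENERAL_OF: Dict[str, str] = {
    "almost round": "round",
    "subspherical": "round",
    "spherical": "round",
}

# Made-up recorded data: taxon -> state exactly as given in its description.
RECORDED: Dict[str, str] = {
    "taxon A": "round",
    "taxon B": "almost round",
    "taxon C": "spherical",
    "taxon D": "quadrangular",
}


def query(state: str) -> Tuple[Set[str], Set[str]]:
    """Return (definite, maybe) taxa for 'median bulb, shape = state'."""
    # A general state also matches every specific state it encompasses.
    specifics = {s for s, g in GENERAL_OF.items() if g == state}
    definite = {t for t, s in RECORDED.items() if s == state or s in specifics}
    # A specific state additionally returns, as "maybe" results, the taxa
    # recorded only with its general state.
    general = GENERAL_OF.get(state)
    maybe = {t for t, s in RECORDED.items() if general is not None and s == general}
    return definite, maybe


print(query("round"))
print(query("almost round"))
```

In this sketch the relationships live outside the recorded data, so each description is stored as published and only the interpretation of a query changes.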
STATE-BASED RELATIONSHIPS. In standard database design methods
such as the Entity-Relationship
Model [17], relationships are expressed
between entities. However, relationships between attributes or between attributes and
states are not considered. We call
these relationships state-based. Some methods allow certain state-based
relationships such as value-determined classes, where the value
restricted in one class is the basis for creation of a subclass. For
example, restricting the class
SHIP to those with cargo
= 'oil' would create a
subclass
DANGEROUS SHIP [18].
Until we recognized the
existence of state-based relationships, there
were endless rounds of revisions in the character set, where the
biologist
would propose characters and the computer scientists would suggest
alternatives
or point out problems. Generally, the difficulty stemmed from the biologist's attempt to
capture these
relationships implicitly within the characters. This embedding of information in the
characters makes
the information subsequently difficult to work with. In fact, many of our criticisms
of current approaches in building
character sets stem from the fact that too much information
is
improperly embedded in the structures, properties, and states. If these state-based
relationships are not identified and
remain implicit, then it is likely that the resulting set of
characters
will be poorly designed.
Body behind the neck          shape       kidney
                                          pear
                                          irregularly swollen
                                          spheroid
    depends on
Body                          kind        vermiform
                                          intermediate
                                          nonvermiform
Figure 5. A dependent character.
Synonymy and general and specific states are simple
forms of state-based relationships. One could
classify them as intra-character
state-based relationships since they can be handled within a
property, as discussed below. Examples of inter-character state-based
relationships will be presented next.
DEPENDENT CHARACTER. It has
been observed that presence or absence of a character affects its
usage within the system [6,7]. For example, petiole hair length depends
on petiole hairiness, which depends on petiole
presence, and on leaf presence. Dependency need not be
restricted to
the presence or absence of another character, but can be based on one
or more
states in a property. In Figure 5,
the property body
behind the neck, shape is only applicable if the body, kind =
nonvermiform.
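
A dependency of this kind could be checked along the following lines. This is a sketch under our own assumptions: the table `DEPENDS_ON`, the function `is_applicable`, and the choice to treat an unrecorded controlling character as "not applicable" are illustrative and not taken from the NEMISYS implementation.

```python
from typing import Dict, Set, Tuple

Char = Tuple[str, str]  # (structure, property)

# Assumed encoding: dependent character -> (controlling character, enabling states).
DEPENDS_ON: Dict[Char, Tuple[Char, Set[str]]] = {
    ("body behind the neck", "shape"): (("body", "kind"), {"nonvermiform"}),
    ("petiole", "hairiness"): (("petiole", "presence"), {"present"}),
    ("petiole", "presence"): (("leaf", "presence"), {"present"}),
}


def is_applicable(char: Char, description: Dict[Char, str]) -> bool:
    """Follow the dependency chain upward; every controlling state must match."""
    seen: Set[Char] = set()
    while char in DEPENDS_ON:
        if char in seen:
            raise ValueError("cyclic dependency in the character set")
        seen.add(char)
        controller, enabling_states = DEPENDS_ON[char]
        # An unrecorded controlling character is treated as not enabling.
        if description.get(controller) not in enabling_states:
            return False
        char = controller
    return True


vermiform = {("body", "kind"): "vermiform"}
print(is_applicable(("body behind the neck", "shape"), vermiform))
```

Because the relationship is stored separately from the characters, the names body, kind and body behind the neck, shape stay free of embedded conditions.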
SUMMARY CHARACTER. In the
literature, one often sees characters that summarize a number of
others. A summary character is a high level abstract characterization
or
shorthand used by the experts for other
characters. For example, Stylet,
type = hoplolaimid implies that other properties have their states
as shown in Figure 6.
Stylet, type = hoplolaimid signifies
    Stylet, size = medium to long
    Stylet, kind = robust
    Cone, size    Shaft, size
    Cone, shape = conoid
    Knobs, kind = true knobs
    Knob, size = medium to large
Figure 6. A summary character.
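
One way to make a summary character usable by a system is to expand it into the states it signifies. The sketch below is only an assumption about how this might be done; the table `SUMMARY` and the function `expand` are hypothetical names, and the cone size and shaft size entry of Figure 6 is omitted for simplicity.

```python
from typing import Dict, Tuple

Char = Tuple[str, str]  # (structure, property)

# Implied states for one summary character, transcribed from Figure 6.
SUMMARY: Dict[Tuple[str, str, str], Dict[Char, str]] = {
    ("stylet", "type", "hoplolaimid"): {
        ("stylet", "size"): "medium to long",
        ("stylet", "kind"): "robust",
        ("cone", "shape"): "conoid",
        ("knobs", "kind"): "true knobs",
        ("knob", "size"): "medium to large",
    },
}


def expand(description: Dict[Char, str]) -> Dict[Char, str]:
    """Add the states implied by summary characters, keeping explicit ones."""
    expanded = dict(description)
    for (structure, prop), state in description.items():
        implied = SUMMARY.get((structure, prop, state), {})
        for char, implied_state in implied.items():
            # States recorded explicitly in the description take precedence.
            expanded.setdefault(char, implied_state)
    return expanded


print(expand({("stylet", "type"): "hoplolaimid"}))
```

Keeping explicitly recorded states ahead of implied ones is a design choice made here so that the published description is never overwritten.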
REDUNDANT CHARACTER. An
example of a redundant character is hemizonid, distance to the phasmid,
which may or may not
be represented in the phasmid
as the distance to the hemizonid. A more
delicate situation exists when the redundancy is conditional. For
example, if
knobs, shape = circular, then
the anterior and posterior parts of the knobs will
be circular
too, actually semicircular.
FUZZY STATES. In the
literature, it is not unusual to find that measurements and quantities
are given
imprecisely. Instead of stylet, length being given as a numeric
value or a range such as 20.5-25.1 µm, it is given as
stylet, length =
short. The annuli, number may be given as many rather
than as a
specific number when
it is too tedious to count. These
are intra-character relationships between qualitative and quantitative
data. There are several other types of fuzzy properties representing inter-character relationships for
comparison of
states between properties such as bigger, smaller, equal
to,
longer, shorter, etc.
Finally,
we briefly mention one other aspect of
biological data that we have observed in this project
and discussed in detail elsewhere [19]. Character semantics, what we
call
metadata, play a central role in the
underlying understanding of
the domain and its uses. For instance, with large
character sets it is important to know which characters are easy to use in an
identification and which are not, which can be relied on as input from
observers and which cannot depending on their expertise. The difficulty
arises
when the metadata changes from taxon to taxon. This is unlike
most metadata found in data models where the metadata is independent of
the
instances stored in
the database.
4. BIOLOGICAL DATABASE DESIGN
AND SUPPORT
There
are
efforts by database researchers to examine requirements in a variety of
scientific areas
[20], and the capabilities of newer-generation DBMSs, such as
user-defined data types, stored procedures, triggers, and rules, have made it possible
to address the database
requirements of scientific
research,
where file systems have been the principal means for data management in
the
past [21]. However, these new capabilities alone are insufficient since
biologists do not have the
expertise,
time, or money to effectively exploit them [22]. It is a
challenge to the database community
to create database management tools [22] that can be tailored to
typical
scientific endeavors and to construct for each domain a unifying model
[23]. In
this section, we present several concepts that
address the
problems stated above and provide a model that is easy to work with, both for the
designer and the user.
We have
discussed a variety of problems that arise in
creating a large set of characters for a descriptive
database. While it is possible to attack each of these problems
individually,
the result may be a complex
system that is difficult to support
and makes integration extremely difficult. Our effort
will be to create a framework that provides the necessary
expressiveness, while
at the same time is
reasonably uniform and simple, both for
the designers and for the users. Given the wide
variety of problems discussed above, this is a significant challenge,
but we
feel that the concept
of basic property and its features takes a major step in
achieving our goals.
Basic Property
In the course
of analyzing several early versions of the list of characters that the
biologist had created, we observed
many problems with
consistency and uniformity. While many properties had
certain features in common, the concept of a property was not
sufficiently well
defined to aid in producing a uniform
and consistent set of
characters. Subsequently, we developed the concept of basic property.
A basic property is a property satisfying the
following four general conditions.
CONDITION I. A basic
property is domain independent, that is, it is not peculiar to one
domain such
as nematology, ichthyology, or entomology. A basic property should
be useful
in multiple domains. For example, shape
is domain
independent while shape of the stylet is not; likewise, flow rate is domain independent while flow rate
of blood is not.
CONDITION II. A basic
property is specific to the type of data, i.e., descriptive,
behavioral, ecological, etc.
For example, length is specific to descriptive data while flow
rate is
specific to physiological data.
APPEARANCE                    MEASUREMENT
presence                      length
shape                         height
kind (distinctive trait)      width
color                         diameter
texture                       depth
arrangement                   weight
symmetry                      ratio of*
                              size

PLACEMENT                     QUANTITY
position relative to*         number
orientation                   quantity
angle with*
distance to*

*Properties that are generally relational.
Figure 7. Basic properties for descriptive data by semantic category.
For
descriptive data, there are four broad
semantic categories in which basic properties can be placed:
APPEARANCE, MEASUREMENTS, PLACEMENTS, and QUANTITIES, as seen in Figure
7. (While commercial DBMSs support basic business data types like date
and
money, they do not support
certain basic properties for
domains like order processing. There, one might find that order
number, dropship address, line item, etc., would
be part of a set of basic properties that could be used in a wide variety of order
processing systems.)
CONDITION
III. A basic
property is independent of structures and states. For example, shape of
the wing is not a
basic property as it
contains a structural
reference, wing, and is
circular-shaped is not a basic
property as it contains a state reference, circular.
CONDITION
IV. A basic
property is a template to be used in creating characters. When a character
is created, an instance of
a basic property is created, i.e., copied and modified, to form the character.
The
advantages of creating basic properties lie in
promoting uniformity and in ease of use in
building the set of characters. Once basic properties are created for
one area
and type of data, they do not have
to be recreated for each
of the domains with the same type of data, thus eliminating some of the
redundant effort seen in creating biological databases [2].
Additional Aspects of Basic
Properties
Given
the four general aspects of a basic property, we
now extend its definition by examining specific aspects that hold
whenever a
basic property is instantiated (used) to create a character, which
we will refer to as an "instantiated basic property." (We continue
the numbering in the definition.)
CONDITION
V. RELATIONAL AND NONRELATIONAL
PROPERTIES. A basic property that
can be
instantiated to form a relationship with another character is called
relational, otherwise it is nonrelational. Basic
properties
that are typically instantiated in this way are marked with a "*" in Figure 7. For
example, the basic properties distance to and ratio of would be meaningless when instantiated
as is in a
list of characters since they inherently relate structures and
characters. Sometimes
these relationships are referred to as landmarks
(for the definition see
the glossary of
[12]).
We
emphasize that with relational basic properties the
relationship is established when the property is
instantiated to
create a character. At that time, the system should prompt for the
related character, i.e., structure or structure and property. For
example, when
creating a character using
distance to, the
system would prompt for a
structure name to complete the name of the property
and when
creating a character using ratio of
, the system would
prompt for two
other properties such as height and
width. The
relationships can be maintained via the mechanism for
state-based relationships discussed below. Note that with nonrelational
basic properties,
such as length, the
system would not prompt
for additional information, though the names of instantiated properties
can be
modified. This discourages creating properties such as length
of the wing, which is
really a combined property and structure, a poor design choice since a
better decomposition would be wing,
length. Thus, the very
definition of basic properties aids
in uniformity and consistency within a character set, even if they are
not
implemented and supported.
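The prompting behaviour described above can be illustrated with a short sketch. It is not taken from any system described in this paper: the names `BasicProperty`, `Character`, and `instantiate` are hypothetical, and the interactive prompt is replaced by an error raised when the related items are missing. The hemizonid example is chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BasicProperty:
    """Template from which characters are created (hypothetical layout)."""
    name: str
    relational: bool = False  # True for properties marked "*" in Figure 7


@dataclass(frozen=True)
class Character:
    structure: str
    property_name: str
    related: Tuple[str, ...] = ()  # structures or properties it refers to


def instantiate(prop: BasicProperty, structure: str,
                related: Optional[Tuple[str, ...]] = None) -> Character:
    """Copy a basic property into a character for one structure."""
    related = tuple(related or ())
    if prop.relational and not related:
        # Stands in for the interactive prompt described in the text.
        raise ValueError(f"'{prop.name}' is relational: name the related item(s)")
    if not prop.relational and related:
        raise ValueError(f"'{prop.name}' is nonrelational: no related item expected")
    return Character(structure, prop.name, related)


length = BasicProperty("length")
distance_to = BasicProperty("distance to", relational=True)
ratio_of = BasicProperty("ratio of", relational=True)

wing_length = instantiate(length, "wing")                       # wing, length
pore_distance = instantiate(distance_to, "hemizonid", ("excretory pore",))
body_ratio = instantiate(ratio_of, "body", ("height", "width"))
```

Keeping the structure and the property in separate fields is what yields the decomposition wing, length rather than a combined property such as length of the wing.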
CONDITION VI. STATES IN BASIC PROPERTIES. Basic properties may have specified states as part of their definition. For instance, presence has two states {present, absent}, which are automatically included in any instantiation. Additional states and synonyms can be added in each instantiation as appropriate.
Basic properties classified as measurements and quantities also
have fuzzy states included. For example,
the basic property length includes the fuzzy states {very
short, short, intermediate, long, and very long}. Quantities have fuzzy
states
such as {a
couple,
a few, several, many, about a dozen}. Upon instantiation, changes to
the list
can be made. We have not
addressed the
complexity of issues related to the use of fuzzy states. We have only provided for fuzzy states in
the list of
characters and for storing fuzzy values in the data acquisition.
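As an illustration of how predefined states might travel with an instantiation, the following sketch copies the state list of a basic property and lets the copy be extended. The dictionary layout, the function name `instantiate_states`, and the extra state "vestigial" are assumptions made for the example; only the state lists themselves come from the text.

```python
# Default state lists carried by basic properties (lists taken from the text).
BASIC_PROPERTY_STATES = {
    "presence": ["present", "absent"],
    "length": ["very short", "short", "intermediate", "long", "very long"],
}


def instantiate_states(basic_property, extra_states=()):
    """Return an independent state list for a newly created character."""
    states = list(BASIC_PROPERTY_STATES[basic_property])  # copy, not a reference
    for state in extra_states:
        if state not in states:
            states.append(state)
    return states


# "vestigial" is a made-up additional state, used only to show the mechanism.
spine_presence = instantiate_states("presence", extra_states=["vestigial"])
tail_length = instantiate_states("length")
```

Because each character receives its own copy, changes made on instantiation do not alter the basic property itself.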
CONDITION
VII. QUALITATIVE AND QUANTITATIVE
PROPERTIES. There are essentially
two types
of data required for descriptive data: qualitative states and
quantitative
values. Other data types may be required
for ecological, physiological,
and geographical data. Storage structures are created
based on the instantiated properties and the type of storage structures
desired: records, relations, objects,
etc. In this discussion, we
will assume the descriptive data is stored in records as shown
in Figure 8, that is, each record will represent data for a given
character.
Included but not shown are the
fields to store taxon related
information obtained directly from published descriptions. We do not
address the complex area of nomenclature in this paper.
| 1. Structure | 2. Property | 3. Name extension | 4. State | 5. Qualifier | 6. Version* |
| Hemizonid | position relative to | excretory pore | anterior | slightly | q |
(a) Qualitative data record format and example.

| 1-6 as above | 7. Value | 8. Low Range | 9. High Range | 10. Stdev |
| Median bulb, length | 13.0 | 10.6 | 15.8 | 1.4 |
(b) Quantitative data record format and example.
Figure 8. Storage structures.
Field 1
in Figure 8a identifies the structure for the
property in field 2. Field 3 will be discussed in
condition IX. Qualitative data have a field for a state, and a field
for the
frequency of occurrence or for other
qualifications,
fields 4 and 5, Figure 8a, as states are often given with qualifiers
such as "always, usually, sometimes."
Using
records to represent character data presents a problem
whenever character
data needs to be linked, which
happens occasionally, but not
always predictably, in the data we are working with.
This may be different states for the same character (see VIII below) or
for
different characters. Field 6
stores a version number, with
version number 0 for characters that are not linked and a different
version
number each time there is a link for the same set of characters. Application
programs would have to handle this whenever necessary. The alternative
approach
of storing multiple
characters per record and normalizing would be more
difficult to achieve given the uncertainty of the data in the literature
or in
adding new species.
Quantitative values can be
separated into measurements (reals) and
quantities (integers). Each measurement
has a
field for a value, field 7 of Figure 8b, representing a measurement of
an individual or an
average for a population. In the
latter case, there would be data for high range, low range, standard deviation as these
are the terms most frequently
encountered in descriptions, fields 8-10, Figure 8b. There are
alternative ways
in which measurements are given such as when the variance, standard error,
confidence interval, or normal and extreme
ranges are used in place of
a
standard deviation. Often a conversion can be made to a standard
deviation
representation. A more
elaborate
representation of measurements may be necessary though to accommodate these alternatives. Quantitative data
records also
have fields for states to accommodate fuzzy states.
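The conversions mentioned above follow from standard statistical identities, sketched here under stated assumptions. The function names are hypothetical. The standard error and confidence interval conversions need the sample size n, which a published description may not report, and the confidence interval conversion further assumes a normal approximation with a 95% interval (z = 1.96).

```python
import math


def sd_from_variance(variance: float) -> float:
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance)


def sd_from_standard_error(standard_error: float, n: int) -> float:
    """Standard error of the mean is sd / sqrt(n), so sd = se * sqrt(n)."""
    return standard_error * math.sqrt(n)


def sd_from_confidence_interval(low: float, high: float, n: int,
                                z: float = 1.96) -> float:
    """Interval for the mean is mean +/- z * sd / sqrt(n) (normal approximation)."""
    return math.sqrt(n) * (high - low) / (2.0 * z)
```

When n is unavailable, the cited value would have to be kept in its original form, which is one reason a more elaborate representation of measurements may be needed.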
For
integer data, usually a single numeric value
is given or a range is given. This can be handled using
the
same fields 7-9 in Figure 8b, though the data type would be integer
instead of
real. Integer ranges with
gaps can be
handled by multiple records. Occasionally, an average value for integer
data is given, an example is average family size. In this case, the
value would
have to be a real as well,
representing an average. The
resulting ambiguity is resolved by specifying the range and scale of a property.
An instantiated basic property may be designated as having a scale, one of {nominal, ordinal, interval, ratio}, and a range, one of {binary, discrete, continuous}. Scale indicates whether the states or values are unordered, ordered, ordered with measurable differences (a – b), or ordered with measurable differences (a – b) and ratios (a/b), respectively. Range differentiates between discrete/binary, i.e., integers or states, and continuous data, i.e., reals. Basic properties classified under appearance and position have defaults of nominal or ordinal, and binary or discrete; those in measurements have defaults of continuous and ratio; and those in quantities have defaults of interval and discrete. One normally accepts the defaults, but can make changes for a given character. Thus, properties designated continuous would have all of the fields for a measurement, those with interval and discrete would have those for quantities, and those that are ordinal or nominal and discrete would have fields for qualitative states. Those that are continuous and interval would have integer ranges and a real value and stdev.
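The way scale and range select the value fields of Figure 8 can be written out as a small lookup. This is a sketch under assumptions, not a prescribed implementation: the function name and field names are hypothetical, binary range is treated like discrete for qualitative properties, and combinations the text does not discuss raise an error.

```python
def storage_fields(scale: str, range_: str):
    """Map (scale, range) to the value fields of Figure 8 as (name, type) pairs."""
    if range_ == "continuous" and scale == "interval":
        # Averaged integer data, e.g. average family size:
        # integer ranges with a real value and standard deviation.
        return [("value", float), ("low_range", int),
                ("high_range", int), ("stdev", float)]
    if range_ == "continuous":
        # Measurements: all fields are reals.
        return [("value", float), ("low_range", float),
                ("high_range", float), ("stdev", float)]
    if range_ == "discrete" and scale == "interval":
        # Quantities: fields 7-9 with integer type.
        return [("value", int), ("low_range", int), ("high_range", int)]
    if scale in ("nominal", "ordinal") and range_ in ("binary", "discrete"):
        # Qualitative states: fields 4 and 5.
        return [("state", str), ("qualifier", str)]
    raise ValueError(f"no storage rule for scale={scale!r}, range={range_!r}")


measurement_fields = storage_fields("ratio", "continuous")
quantity_fields = storage_fields("interval", "discrete")
qualitative_fields = storage_fields("nominal", "discrete")
averaged_fields = storage_fields("interval", "continuous")
```

The more specific rule (continuous and interval) is tested first, since it would otherwise be shadowed by the general rule for continuous properties.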
CONDITION
VIII. IMPLICIT PROPERTIES AND MULTIPLE
STATE LISTS. To address the problem
of implicit properties
mentioned earlier and to avoid the use of multiple artificial
properties such as
nature, aspect, situation, etc., we use the basic
property kind.
(The property type, which is not a
descriptive property, is a term that has special meaning in biology and
should
be reserved for states that
indicate a biological type as
shown in the example of the summary character in Figure 6.) This
presents a
problem though when there are multiple implicit kinds in the same structure.
Instead of using kind1, kind2, kind3, etc., we introduce the concept of multiple state lists for
instances of basic properties. That is, associated with a qualitative
property
we allow more than a single
list of states. For example, we
could have one list for general states and another
list for specific states within the same character, though we opt for a
different approach described below, or we
could have separate state
lists for the character such as hair, kind, one state
list including states {thin, thick}, another state list including
states
{coarse, smooth}, and another including
states {curly,
straight}. One advantage in using multiple state lists is that it provides
a mechanism for decomposing complex states into more atomic states. For
example, the states {low
contiguous, high contiguous, high separated}
might be decomposed into separate state lists {high,
low},
and {separated, contiguous}. Each state list is maintained in a
lexicon, defined
below, and forms the basis of supporting many of the concepts discussed
including intra- and
inter-character state-based relationships.
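The decomposition of complex states into atomic states from separate state lists can be sketched as follows. The labels "height" and "spacing" for the two lists are invented for this example (the text gives only the lists themselves), and the sketch assumes each complex state is a space-separated combination of exactly one state from each list.

```python
# Two state lists for one character; the list labels are hypothetical.
STATE_LISTS = {
    "height": ["high", "low"],
    "spacing": ["separated", "contiguous"],
}


def decompose(complex_state: str) -> dict:
    """Split a complex state into one atomic state per state list."""
    words = complex_state.split()
    atomic = {}
    for list_name, states in STATE_LISTS.items():
        matches = [w for w in words if w in states]
        if len(matches) != 1:
            raise ValueError(f"{complex_state!r} does not fit list {list_name!r}")
        atomic[list_name] = matches[0]
    return atomic


decomposed = [decompose(s) for s in
              ("low contiguous", "high contiguous", "high separated")]
```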
CONDITION
IX. NAME EXTENSIONS IN BASIC
PROPERTIES. A simple mechanism,
called a name extension, is a
modifier of a property name. As such it can be stored as a separate field
as shown in Figure 8, field 3. One example of property
explosion where
name extensions are appropriate is given when length is measured in
different
ways for the same structure such as length along the
axis,
length along the outer boundary, or length directly taken. Rather
than creating multiple
properties, we can create one
property, length, with three name extensions along the
axis, along the outer boundary, directly taken. Instances
of basic properties maintain a list of name extensions.
In the
course of a query, the user can decide which, if any, of the name extensions to
enforce.
There may appear to be a conceptual resemblance between name extensions and relational properties. Indeed, name extensions are used to represent relational properties. For instance, the example of property explosion given by diameter at the vulva, diameter at midbody, diameter at the stylet base, etc., can be represented by the single property diameter with the name extensions at the vulva, at midbody, at the stylet base, etc.; see Figure 9, where the name extensions have a "-" prefix. At the same time, each such name extension represents a relationship with another character.
There are several advantages of name extensions. The first, and most obvious, is the convenience of consolidating multiple properties into a single property. As new name extensions are encountered for a property, they can be added to an existing property rather than having to create a new one.
Body
diameter
- at stylet base = SBW
- at median bulb
- at nerve ring
- at excretory pore
- at oesophago-intestinal junction
- at beginning anterior ovary
- at midbody = MBW
- at vulva = VBW = VB
- at end posterior ovary
- at anus
- maximum = breadth
Figure 9. Name extensions in a property.
A field in the storage record of each property, whether
it initially
has name extensions or not, can be set aside for a name extension
value.
Whenever a value is stored, the name extension can be stored as well. It is a simple
matter in a
query to enforce (or ignore) a name extension via a simple "and" condition.
Operationally,
the name extension could be used like a modifier that may or may not factor in a query. For
example, in
a query with the condition body, diameter < 20.0 it would not be important to
enforce a condition like body,
diameter.name extension = at
the vulva, indeed, it
would be
most desirable not to. Needless to say, in an interactive session the
user would be able to make appropriate choices.
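A query that optionally enforces a name extension might look like the sketch below. The record layout follows the fields of Figure 8 only loosely: the `taxon` field, the function name `select`, and all taxa and values are invented for illustration.

```python
from typing import Callable, List, NamedTuple, Optional


class Record(NamedTuple):
    taxon: str                      # assumed field; not shown in Figure 8
    structure: str
    property_name: str
    name_extension: Optional[str]
    value: float


def select(records: List[Record], structure: str, property_name: str,
           predicate: Callable[[float], bool],
           name_extension: Optional[str] = None) -> List[Record]:
    """Return matching records; the name extension is enforced only if given."""
    hits = []
    for r in records:
        if r.structure != structure or r.property_name != property_name:
            continue
        # The extra "and" condition on the name extension is optional.
        if name_extension is not None and r.name_extension != name_extension:
            continue
        if predicate(r.value):
            hits.append(r)
    return hits


# Invented data, for illustration only.
records = [
    Record("taxon A", "body", "diameter", "at the vulva", 18.5),
    Record("taxon B", "body", "diameter", "at midbody", 19.0),
    Record("taxon C", "body", "diameter", "at the vulva", 24.0),
]

any_diameter = select(records, "body", "diameter", lambda v: v < 20.0)
at_vulva = select(records, "body", "diameter", lambda v: v < 20.0,
                  name_extension="at the vulva")
```

Ignoring the name extension retrieves both small-diameter taxa, while enforcing it narrows the result to measurements taken at the vulva.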
CONDITION
X. LEXICONS IN BASIC PROPERTIES. The
principal mechanism in an instance of a basic property to
handle
many of the concepts discussed including intra- and inter-character state-based
relationships is called a lexicon. A lexicon L is
a 5-tuple (S, P, CS, DS, M), where S is a
structure, P is a property including specified name
extensions, CS is
a set of cited states, DS is a set of display states, and M
is a
correspondence or mapping, not necessarily a function. M maps CS to DS.
The
basic rationale for a lexicon is straightforward. In the literature
there may be a wide variety of states for
a property. One example
set of states is {straight, weak C, C, circle, closed circle, open
circle,
widely open C, spiral, question mark, tight spiral}. This set is
designated as a
set of cited states CS, since they are as cited in the literature and
we assume
these are the values that are
stored in the database. However,
for a number of reasons, this set CS may not be the
set
of states we would choose to display as a list of states for the
character. For
example, some of the terms in CS
may be outdated, some
may be nonstandard terminology, some could even be
wrong, i.e., a bad synonym, or some might be synonyms that are less
frequently
used than other terms.
Thus, we form a set DS, display
states, that represents a set of distinct states to be displayed
whenever the
character is viewed. An example of a set DS for the example CS
above might be
{straight, weak C, C, circle, question mark, spiral, tight spiral}.
Since
the set of cited states is from the literature, the set would not
change over
time except to add new states, unless
the character were to
change in some fundamental way such as dividing a character in two. On
the
other hand, the set of display states would change. The intention is
that by
changing the set DS, the stored data would remain unaffected
and would
allow individual users to change DS
as needed. A correspondence
or mapping M between CS and DS is needed to
relate elements of CS that are synonymous with and general
states for
those in DS. An example of the correspondence is shown in
Figure 10.
On the left-hand side are elements of DS; on the right-hand side are the corresponding elements of CS, with "=" showing the state itself and its synonyms, and the last column showing general states. Note that C is a general state for weak C, and it is also a display state itself, and widely open C is a synonym for weak C.
S = body
D = habitus
DS | M: corresponding CS | general state
straight | = straight |
weak C | = weak C = widely open C | C
C | = C = open circle | circle
circle | = circle |
spiral | = spiral |
tight spiral | = tight spiral | spiral
Figure 10. A lexicon.
A
lexicon can easily be represented and implemented as a table. A
property
may have one or more lexicons, thus
multiple state-lists directly
correspond to multiple lexicons. Most importantly, the lexicons form the basis
for inter-character state-based relationships.
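A table representation of a lexicon could be sketched as follows. The class and method names are hypothetical, and the rows reflect one reading of Figure 10: each row relates a display state to a cited state either as the state itself or a synonym ("=") or as a general state. The sets DS and CS of the 5-tuple are then recoverable from the rows of M.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    display_state: str
    relation: str      # "=" (state itself or synonym) or "general"
    cited_state: str


class Lexicon(NamedTuple):
    structure: str
    prop: str
    rows: List[Row]    # the correspondence M, stored as a table

    def display_states(self) -> List[str]:
        """DS, in order of first appearance."""
        seen: List[str] = []
        for row in self.rows:
            if row.display_state not in seen:
                seen.append(row.display_state)
        return seen

    def cited_synonyms(self, display_state: str) -> List[str]:
        """Cited states stored as the display state itself or a synonym."""
        return [r.cited_state for r in self.rows
                if r.display_state == display_state and r.relation == "="]

    def specific_states(self, general_state: str) -> List[str]:
        """Display states that have the given state as a general state."""
        return [r.display_state for r in self.rows
                if r.relation == "general" and r.cited_state == general_state]


habitus = Lexicon("body", "habitus", [
    Row("straight", "=", "straight"),
    Row("weak C", "=", "weak C"),
    Row("weak C", "=", "widely open C"),
    Row("weak C", "general", "C"),
    Row("C", "=", "C"),
    Row("C", "=", "open circle"),
    Row("C", "general", "circle"),
    Row("circle", "=", "circle"),
    Row("spiral", "=", "spiral"),
    Row("tight spiral", "=", "tight spiral"),
    Row("tight spiral", "general", "spiral"),
])
```

Since C appears both as a display state and as a general state for weak C, M is a correspondence rather than a function, which the row-per-pair table accommodates directly.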
5. STATE-BASED RELATIONSHIPS
The
concepts described above, including basic
properties, name extensions, lexicons, and state-based
relationships have made it significantly easier to build a large set of
characters in the NEMISYS project in a more consistent and uniform
fashion than
would have otherwise been possible. Lexicons,
which provide
a means for representing synonymy and general states, also provide a
convenient
and straightforward way of representing inter-character state-based
relationships
such as dependent, redundant, and summary characters. Again
the basic
idea is that these
relationships are represented via correspondences between
lexicons.
Representation of State-Based Relationships
The
basic unit in representing inter-character state-based relationships
is a triple (L1, L2, C12) representing
a correspondence C12 between display states DS1
and
DS2 of two lexicons L1 and L2. The
mapping of a state in DS1 to multiple states in DS2
is interpreted disjunctively. One or more triples can be grouped into a collection G, which is interpreted conjunctively. A relationship is then defined by its type T and a collection {G1, G2, ..., Gn} of groups of triples, with the collection interpreted disjunctively.
An example of the simplest relationship is the dependent relationship shown in Figure 5, with T = 'Dependent' and one grouping consisting of a single triple, i.e., G1 = {(L1, L2, C12)}, where L1 is the lexicon for S1 = body, D1 = kind, DS1 = {vermiform, intermediate, nonvermiform}, and L2 is the lexicon for S2 = body behind the neck, D2 = shape, DS2 = {kidney, pear, irregularly swollen, spheroid}, and C12(nonvermiform) = {kidney, pear, irregularly swollen, spheroid}. The other two values, intermediate and vermiform, are mapped to the empty list { }.
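The dependent relationship of this example can be written down directly as data. The sketch below is one possible encoding, with assumptions stated plainly: lexicons are abbreviated to (structure, property) pairs, an observation is a dictionary from such pairs to a single state, and a triple is taken to hold vacuously when either character is unobserved. Groups are combined with "or", triples within a group with "and", and the mapped states with "or", as in the text.

```python
L1 = ("body", "kind")
L2 = ("body behind the neck", "shape")

# Correspondence C12 between display states of L1 and L2 (Figure 5 example).
C12 = {
    "vermiform": set(),
    "intermediate": set(),
    "nonvermiform": {"kidney", "pear", "irregularly swollen", "spheroid"},
}

dependent = {"type": "Dependent", "groups": [[(L1, L2, C12)]]}


def triple_holds(observation, triple):
    """A triple holds if the observed L2 state is among those mapped from L1."""
    first, second, mapping = triple
    if first not in observation or second not in observation:
        return True  # assumption: nothing to check when a character is missing
    return observation[second] in mapping.get(observation[first], set())


def relationship_holds(observation, relationship):
    """Groups are alternatives; every triple in a group must hold."""
    return any(all(triple_holds(observation, t) for t in group)
               for group in relationship["groups"])


consistent = relationship_holds({L1: "nonvermiform", L2: "kidney"}, dependent)
inconsistent = relationship_holds({L1: "intermediate", L2: "kidney"}, dependent)
```

Under this encoding an observation of body, kind = intermediate together with body behind the neck, shape = kidney fails the check, since intermediate maps to the empty list.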
Summary characters, using the example in Figure 6, require multiple groups of multiple triples. With L1 the lexicon for the character stylet, type, G1 includes (L1, L2, C12), where C12(hoplolaimid) = {robust} in the lexicon L2 for stylet, kind; includes (L1, L3, C13), where C13(hoplolaimid) = {conoid} in the lexicon L3 for cone, shape; includes (L1, L4, C14), where C14(hoplolaimid) = {intermediate, long} in the lexicon L4 for stylet, size; and so forth. (Note that mappings are interpreted disjunctively when mapping a state to two or more states.) The condition that cone, size = shaft, size can be handled by placing the mappings from hoplolaimid to each of shaft, size = small and cone, size = small in G1, with the other cases of equal shaft and cone sizes using additional groups G2, ..., Gn. Alternatively, one could include features like wild card elements to simplify the representation and reduce the number of groups Gi needed. The representation of state-based relationships can be thought of as query conditions that can be used to modify queries, as discussed next.
Utilization of State-Based Relationships
Proper
use of the data is very much dependent on the
accuracy in the selection of the correct characters.
Given the complexity of any moderate sized character set, independent
of the
concepts used to build the set,
the user will need assistance
in selecting the appropriate characters in order to
operate
on the data. Usually, the more concepts involved the more difficult the
task
will be in building a
mechanism to assist the user.
However, the concepts we have introduced are represented
using lexicons and name extensions in a fairly straightforward manner.
This
would simplify building
such a mechanism, which for purposes
of discussion we call a character list processor.
The types of state-based relationships we have presented, i.e., synonymy, general states, dependent, summary, redundant, and fuzzy characters, certainly do not exhaust the possibilities. The addition of new relationship types should not require specialized code or changes in existing code for a character list processor. Our approach in using state-based relationships simplifies the addition of new kinds of relationships.
To
illustrate the need for a character list processor, we consider an
identification session where an observer enters
some initial
observations C1, a set of characters with states/values that are connected by the usual logical connectives 'and' and 'or.' Without a character list processor, candidates would be retrieved based on C1. However,
the set C1 may not be the best set to use. One would not
expect
the set of observations C1 by the user to translate directly into the best
set of characters for the retrieval. For example, the user may not have
been
sufficiently general or specific.
General or specific states
may be needed to clarify, for example, that a posture
designated as C includes the specific state weak C. The
user may
specify inconsistent characters without
realizing it as
could be the case if the body, kind is given as intermediate
but
the body behind
the neck, shape is observed as
kidney, Figure 5. The user may not be aware that an
observation represented by one character may indeed be represented for
different taxa in different ways. For example, if the stylet, type = hoplolaimid is in C1, it may be necessary to retrieve also based on the summarized characters shown in Figure 6 for those taxa where the stylet, type has not been specified.
There
may be circumstances other than a retrieval, where a set of
characters C1 needs to be modified before an
action takes
place. For example, C1 might be a set of observations that need to be verified or
C1 may be used to update the database.
Clearly,
state-based relationships play a central role
in overcoming the problem of formulating or
modifying C1 caused by the structure of the data. Each context requires
proper
selection of the characters. In utilizing the state-based relationships, we can view a character list processor, Figure 11, as taking a set of characters C1 as input and producing a modified set of characters C2 as output. Other input includes the set of state-based relationships, a context, and a table of operations based on the context.
By
'context,' we mean a user
designated name representing the kind of activity taking place with the
data.
Typical names for contexts could be "Retrieval," "Query,"
or "Update," but there may be many more that
are suitable
for other contexts as well, such as verification of observations. We will discuss the
situation for retrievals since they are the most common context.
By
'operation,' we mean the modification made to
C1. Some generic operations are given in Figure 12.
For example, if C1 contains body, habitus = C, in a retrieval
C1 could
be EXPANDed using
specific states to include the condition body, habitus = C or body, habitus = weak C.
C1 --> [ Character list processor ] --> C2
Further inputs to the processor: state-based relationships, context, table of operations.
Figure 11. Character list processor.
Likewise, if C1 contains body, kind = nonvermiform, C1 could have ADDed the condition body behind the neck, shape = kidney, or pear, or irregularly swollen, or spheroid.
The ADVISE operation
is used to alert the user of an existing relationship and allows the
user to
selectively modify C1. A
SUBSTITUTE operation
replaces one or more characters by other characters. Thus, the user could EXPAND, ADD, or
SUBSTITUTE,
and within each, add or delete disjuncts and conjuncts.
EXPAND, modify C1 by disjunctively adding conditions to C1.
ADD, modify C1 by conjunctively adding conditions to C1.
SUBSTITUTE, modify C1 by substituting conditions for conditions in C1.
ADVISE, alert the user to existing relationships.
Figure 12. Operations on character sets.
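To make the operations of Figure 12 concrete, the sketch below represents a character set as a list of clauses that are combined with "and", each clause being a set of (character, state) conditions combined with "or". This conjunctive layout and the function names are assumptions made for the example; ADVISE is omitted since it only notifies the user.

```python
def expand(c1, target, alternatives):
    """EXPAND: add alternatives disjunctively to each clause containing target."""
    return [clause | frozenset(alternatives) if target in clause else clause
            for clause in c1]


def add(c1, clause):
    """ADD: append a new clause, i.e. a conjunctively added condition."""
    return c1 + [frozenset(clause)]


def substitute(c1, old_clause, new_clause):
    """SUBSTITUTE: replace one clause by another."""
    return [frozenset(new_clause) if clause == frozenset(old_clause) else clause
            for clause in c1]


# body, habitus = C expanded with the specific state weak C.
habitus_c = ("body, habitus", "C")
c1 = [frozenset({habitus_c})]
expanded = expand(c1, habitus_c, [("body, habitus", "weak C")])

# body, kind = nonvermiform with the dependent shape condition added.
shape = "body behind the neck, shape"
kind_query = [frozenset({("body, kind", "nonvermiform")})]
added = add(kind_query, [(shape, s) for s in
                         ("kidney", "pear", "irregularly swollen", "spheroid")])
```

Each function returns a new list and leaves C1 untouched, so the original observations remain available if the user rejects a proposed change.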
Whenever a new context arises it
is necessary to
specify what operations to carry out on a character set C1 relative to
each
kind of relationship in order to produce C2. The specification should not be based on instances of
state-based
relationships between two specific properties, but should only be based on the type of
relationship,
i.e., redundant, dependent, summary, etc., and on the context.
Figure 13 shows some entries in
the table of
operations. We emphasize that it is up to the user, initially the system designers, to specify the
entries in the table. A relationship can be expressed in two ways. For example,
for a state
s1 with a synonym s2 we can
express this as "s1 has
synonym s2" or alternatively "s2 is a synonym for
s1."
Relationship | Retrieve | Update
Has synonyms | EXPAND | No Op
Is a synonym | EXPAND | ADVISE
Is general state | ADVISE | ADVISE
Has general state | ADVISE | No Op
Is dependent | ADD | ADD
Has dependent | No Op | No Op
Has summary | ADVISE | ADVISE
Is summary | EXPAND | ADVISE
Is redundant | EXPAND | ADVISE
Has redundant | EXPAND | ADVISE
Name extension | ADVISE | ADVISE
Figure 13. Specifying operations on C1.
In Figure 13, we see two types of contexts and the operations taken for each relationship we have considered. If a set of characters C1 is proposed to retrieve a set of candidates, we call this a "Retrieve" in this example, and the column below it in Figure 13 indicates our choice of operations on C1 for each relationship. For example, the first entry shows that if a state "Has Synonyms," as in body, habitus = weak C, then on a "Retrieve," the query should be "EXPAND"-ed to include the conditions body, habitus = weak C or body, habitus = widely open C. Likewise, on a "Retrieve" that includes a state that "Is A Synonym," the state for which it is a synonym should be "EXPAND"-ed. For "Is General State" on a "Retrieve," the user would be "ADVISE"-d of the general state and could choose whether or not to modify conditions in the query. Note that when a character "Is Dependent" on another character, the condition is "ADD"-ed to contain the primary character as well, while a character that "Has Dependent" characters would yield no operation on C1. If a character in C1 is a summary character ("Is Summary"), then the summarized characters will be "EXPAND"-ed; however, if a character in C1 is one of several that have a summary character ("Has Summary"), then the user will be "ADVISE"-d and can choose whether and how C1 is modified.
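Because the choice of operation depends only on the type of relationship and the context, the table of Figure 13 can be held as plain data. The sketch below copies the entries of Figure 13; the dictionary layout and the function name `operation_for` are assumptions.

```python
# Entries copied from Figure 13: relationship type -> context -> operation.
OPERATIONS = {
    "Has synonyms":      {"Retrieve": "EXPAND", "Update": "No Op"},
    "Is a synonym":      {"Retrieve": "EXPAND", "Update": "ADVISE"},
    "Is general state":  {"Retrieve": "ADVISE", "Update": "ADVISE"},
    "Has general state": {"Retrieve": "ADVISE", "Update": "No Op"},
    "Is dependent":      {"Retrieve": "ADD",    "Update": "ADD"},
    "Has dependent":     {"Retrieve": "No Op",  "Update": "No Op"},
    "Has summary":       {"Retrieve": "ADVISE", "Update": "ADVISE"},
    "Is summary":        {"Retrieve": "EXPAND", "Update": "ADVISE"},
    "Is redundant":      {"Retrieve": "EXPAND", "Update": "ADVISE"},
    "Has redundant":     {"Retrieve": "EXPAND", "Update": "ADVISE"},
    "Name extension":    {"Retrieve": "ADVISE", "Update": "ADVISE"},
}


def operation_for(relationship_type: str, context: str) -> str:
    """Look up the operation for a relationship type in a given context."""
    return OPERATIONS[relationship_type][context]
```

Supporting a new context then amounts to adding one more column of entries, without touching the instances of relationships between specific properties.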
Conceptually speaking, we can consider state-based relationships as conjunctive and disjunctive conditional fragments that can be used to modify a set C1. In fact, our state-based relationships could be directly represented and stored in this manner for simplicity and efficiency.
We do not
address the implementation of character list
processors in this paper, but we do point out that
whenever
each character in C1 is modified in producing C2, additional
relationships
may arise. This would typically occur in chaining of dependent
relationships or
when states with synonyms or
general states are added. We
assume that the implementation of the processor maintains a list of
dependencies used to produce C2 in order to avoid undoing or redoing
operations.
The user should have the option of allowing the character list
processor to
continue until all
changes are made or to review the current C2 as each change is made.
6. STRUCTURAL DECOMPOSITION AND NAME EXTENSIONS
Basic
properties constitute the main focus of this
paper, however, some of these ideas can carry
over to the structural level as well. In particular, the
concept of name
extension can be used with
structures to improve the structural
decomposition. The example in Figure 3 of spermatocytes in the
anterior
and posterior end of the testis can be handled by
making these positional
modifiers name extensions of testis; see [13] for an
extended
discussion of alternative decompositions including the use of name
extensions.
Unlike the case of name extensions in properties, the semantics and implementation issues of name extensions in structures are less straightforward, though we believe there are important advantages
in using them there. A field for each structure/substructure name is
needed in
the record and a field for
a possible name extension must
be provided as well. Given that a record consists
of a single character and its state or value as assumed in Figure 8,
then the
name extension field for each
structure/substructure would
apply to that datum. However, if one opted for a record format that has
multiple properties per record, then another field would be necessary to
indicate which properties the name extensions applied to. This would
overly
complicate the storage mechanism and
the character list
processor. This is another reason we opt for the record format of Figure 8.
7. SCHEMA CHANGES
Schema
evolution has been addressed extensively. The
focus has been primarily on structural changes
such as adding and deleting attributes and classes. We will
limit our
discussion to schema changes
in the context of
state-based
relationships since changes to the hierarchy at the property level and above would be
analogous to structural changes in a schema.
If
a property participating in a state-based relationship were moved from
one
structure to another, for affected
relationships we would
change the name of the structure S in any lexicon L for that property.
Splitting a property into two properties would be straightforward if
the
property contained multiple
lexicons and the
lexicons remained unchanged. More complex changes, where a lexicon is split into two or more
lexicons, would
be similar to adding and deleting states in a lexicon as discussed
below. As one might expect, some changes can be handled automatically
by the
system, but some would require intervention. In all cases, the system
should
issue a warning if
relationships will be affected by the changes.
In the
course of using the schema, the most frequent changes involve
adding states and adding properties. In the
latter case,
no state-based relationships exist for a new property and no changes in
existing relationships are needed, though a new relationship may need
to be
established as would be the case for redundant or certain dependent
relationships such as presence.
Adding states
to a lexicon is a very common situation. In some cases, the system
would either
do nothing such as when the new
state is merely added to CS, the set
of cited states, but not to DS, the
set of display
states. In other cases, the system
could automatically extend the relationship. For instance,
if a new state is added to the DS,
and the DS
is dependent on
another property, then the
relationship could automatically extend to the new state. For example,
if
another state is added to the property shape in Figure
5, which is a
dependent property, then the new state along with the
existing states can be assumed to be dependent on the state
nonvermiform in
property kind in
the part body. The
same would hold if a state were inserted into or outside of, but not at
the boundary of, an ordered subset {ds_i, ds_(i+1), ..., ds_(i+m)} of DS. If it
were placed at the boundary,
then automatically determining whether it participated in the
relationship
would be difficult
as would be the case if the states were unordered.
Deleting
states is less likely to occur from CS since
these are taken from the
literature. Some deletions from a DS would
occur since the role of a
state may change. Generally speaking, deletions can be done
automatically, though warning of existing relationships should be given
in
case a lexicon is being divided as mentioned above. Also, a
relationship could
be rendered obsolete
by deleting the last element in its domain or range
necessitating a warning as well.
8. CONCLUSION
In this
paper, we have presented the key concept of basic
property and its features. Whether or not a system
implements
basic properties per se, the concept itself can aid in creating more
uniform
and consistent character sets. Additionally, basic properties provide a
mechanism for representing
and utilizing state-based relationships, which are
difficult to avoid in any large set of characters.
While new generations of relational and object-oriented database
systems may
make it possible to
implement these
ideas more efficiently, it would be too much to
expect each group interested
in building a biological database to do its own implementation. Thus,
if large
scale biological databases
are to exist, be used
effectively, and be integrated, the problems discussed here will
have to be addressed with the goal of creating a kind of generic
BioDBMS. We
believe our ideas will
contribute to this goal, though actual
design and implementation will require a large
scale effort. Though much remains to be investigated in what we have
presented,
we believe this is
a solid start in the right direction.
REFERENCES
1.
R.J.
Pankhurst, Database design for monographs and floras, Taxon 37,
733-746, (1988).
2.
R. Allkin, R.J.
White and P.J. Winfield,
Handling the taxonomic structure of biological data, Mathl Comput. Modelling 16 (6/7), 1-9, (1992),
3.
M.J. Dallwitz, DELTA
and INTKEY, Advances
in Computer Methods for Systematic Biology, Chapter 18, (Edited by R. Fortuner), Johns
Hopkins University Press, Baltimore,
(1993).
4.
M.J.
Dallwitz and T.A. Paine, User's guide to
the DELTA system: A general system for processing taxonomic
descriptions, 3rd
edition, CSIRO
Aust. Div.
Entomol. Rep. No. 13, pp. 1-106, (1986).
5.
R. Allkin and F.A.
Bisby, Editors, Databases in
Systematics, Systematics Association, Vol. 26, Academic Press, London, (1984).
6.
R.J. White, R. Allkin
and
J.P. Winfield, Systematic databases: The BAOBAB design and the Alice
system, In Advances
in Computer Methods for Systematic Biology,
Chapter 19, (Edited by R. Fortuner), Johns Hopkins University
Press,
Baltimore, (1993).
7. R.J. Pankhurst, Taxonomic databases: The Pandora system, In Advances in Computer Methods for Systematic Biology, Chapter 14, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
8. R.J. White and R. Allkin, A language for the definition and exchange of biological data sets, Mathl Comput. Modelling 16 (6/7), 199-223, (1992).
9. H. Saarenmaa, S. Leppäjärvi, J. Perttunen and J. Saarikko, Object-oriented taxonomic biodiversity databases on the World Wide Web, from an international workshop: Internet Applications and Electronic Information Resources in Forestry and Environmental Sciences (1-5 August 1995, European Forest Institute, Joensuu,
Finland) and available through the web at http://www.efi.joensuu.fi/~saarenma/oobdwww-nature-latest.htm.
10. J. Diederich and J. Milton, Expert workstations: A tool-based approach, In Advances in Computer Methods for Systematic Biology, Chapter 7, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
11. J. Diederich and J. Milton, NEMISYS: A computer perspective, In Advances in Computer Methods for Systematic Biology, Chapter 10, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
12. R. Fortuner, The NEMISYS solution to problems in nematode identification, In Advances in Computer Methods for Systematic Biology, Chapter 9, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).
13. J. Diederich, J. Milton and R. Fortuner, Construction and integration of large character sets for nematode morpho-anatomical data, Fundamental and Applied Nematology 20 (to appear).
14. R. Allkin and F.A. Bisby, The structure of monographic databases, Taxon 37, 756-763, (1988).
15. G. Wiederhold, Views, objects, and databases, Computer 19 (12), 37-44, (Dec. 1986).
16. R. Allkin, N.P. Moreno, L. Gama Campillo and T. Mejia, Multiple uses for computer-stored taxonomic descriptions: Keys for Veracruz, Taxon 41 (3), 413-435, (1992).
17. P. Chen, The entity-relationship model: Toward a unified view of data, ACM TODS 1 (1), 9-36, (Mar. 1976).
18. M. Hammer and D. McLeod, Data description with SDM: A semantic data model, ACM TODS 6 (3), 351-386, (Sept. 1981).
19. J. Diederich and J. Milton, Creating domain specific metadata for scientific data and knowledge bases, IEEE Trans. on Knowl. and Data Eng. 3 (4), 421-434, (Dec. 1991).
20. Special Issue on Scientific Databases, Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, Volume 16, No. 1, Washington, DC, (1993).
21. A. Shoshani, A layered approach to scientific data management projects at Lawrence Berkeley Laboratory, In Data Engineering, IEEE Computer Society, Volume 16, No. 1, pp. 4-8, Washington, DC, (1993).
22. Y. Ioannidis, Desktop experiment management, In Data Engineering, IEEE Computer Society, Volume 16, No. 1, pp. 19-23, Washington, DC, (1993).
23. J.B. Cushing, D. Hansen, D. Maier and C. Pu, Connecting scientific programs and data using object databases, In Data Engineering, IEEE Computer Society, Volume 16, No. 1, pp. 9-13, Washington, DC, (1993).