Diederich

J. DIEDERICH
Department of Mathematics, University of California-Davis
Davis, CA 95616, U.S.A.
dieder@ucdmath.ucdavis.edu

Abstract—In this paper, we examine problems and solutions for building a large set of characters for descriptive data derived from published species descriptions. The ideas presented lead in the direction of creating a kind of BioDBMS that can be used to support large integrated biological databases.

Keywords—Biological databases, Biological characters, Data modeling, Basic property, Schema design.

Among the types of taxonomic databases [1], i.e., curatorial including geographical, nomenclatural, bibliographical, and morpho-anatomical (descriptive), descriptive data presents the greatest challenges to the database designer and software developer, with far less supporting software for managing the data than for the others [2]. One explanation for this may be the structure and nature of biological research. Institutions such as horticultural museums and herbaria have large collections of specimens. Consequently, they have the greatest need for creating formal electronic databases to manage their collections: to locate specimens, track loans to researchers, add new specimens, and the like. The structure of their data is, generally, not as complex as the structure of descriptive data. On the other hand, descriptive data has, generally speaking, been the province of individual researchers who, more often than not, focus on a small number of species with a limited number of external features. There is little uniformity across these databases, and sharing of electronic data can be difficult, though some software has been developed to help [3,4]. While the complexity and the semantics of the data can be managed for small databases used by experts on the few species represented, they become much harder to manage when the number of species expands into the thousands and the database is to be used by nonexperts too.

Biologists have made some strides in exploiting database technology [2,5-7] and have set standards to allow for data exchange for taxonomic databases [3,4,8]. Much of this work has focused on how to utilize relational DBMSs, to avoid duplication of effort, and to help avoid problems that seem to inevitably infect many hand tailored taxonomic databases [2]. Still, database methods and techniques available for handling descriptive data are inadequate. Recent efforts have been aimed at exploring management of taxonomic biodiversity information using object-oriented databases and the World Wide Web [9].

The work discussed in this paper, to examine problems and propose some solutions in creating large descriptive databases, is part of the NEMISYS Project [10-12], an effort to build an identification system for the approximately 4000 species of plant-parasitic nematodes. It may eventually include an equal number of nonplant-parasitic species. For the most part, the source of our data will be approximately 10,000 descriptions published in various journals over much of this century.

Two important aspects of creating and using a descriptive database are data modeling and data semantics. Difficulties arise in modeling the data because the source of the data is published descriptions. One important goal in creating a descriptive database is to faithfully represent the data in the descriptions, correcting for errors and omissions whenever possible. However, standards have not been set over the years for describing species. With each author placing his or her stamp on a given description, it remains a challenge to capture and use the data from this rich and available source. Characteristics that are important for some species are not important for others, so there is no small core of characteristics on which to focus. Since standard modeling practice involves determining the data structures in advance of acquiring the data, it would be necessary to know what characteristics of species are described in thousands of descriptions and to know how they are described, a difficult if not impossible requirement. The option of forcing the data into a predetermined structure simply will not work for large collections if the goal is to faithfully record the data. Thus, it is necessary, to some extent, to create the database structures as the data is acquired. Consequently, a complete data model cannot be built prior to data acquisition in such cases; modeling must remain a dynamic process, which poses significant challenges for maintaining a consistent and uniform model. Each new description can cause changes in the model. Yet, it is not possible to take the properties exactly as they come from the literature since the resulting database structures would then be too chaotic to use effectively. This seems like an impossible task with contradictory goals. If the data is represented faithfully, the data structures would be too disorganized, while on the other hand, if the data is too rigidly structured, there would be a loss of information.
In this paper, we introduce new modeling concepts that support character set creation and make these contradictory goals tractable. A subsequent paper [13] will explore the use of these concepts in a wide variety of practical situations and will develop guidelines for their use.

What we are attempting goes somewhat counter to standard practice. Biological data is often reformulated in some fashion prior to its storage and use, usually in small individual databases, to support a particular type of activity such as identification within a particular taxon using dichotomous keys. For example [14], in some species the manipulated data may consist of two states, obviously winged and not obviously winged, for identification within a certain taxon, while for phenetic classification it may be necessary to break these down into their various types such as narrowly winged, widely winged, terete, and striate. However, a problem arises when new species are added to the system since the structure of the data often needs to be modified to remain consistent relative to its intended use. Additionally, such manipulated data is difficult to use for purposes other than that specifically intended, and often the data does not reflect what was in the original source. Thus, data semantics have to be supported above the level of the database structures to allow for a wide variety of uses. This is analogous to database views, where different users see the data structure according to their needs. However, database views are not sufficiently powerful to deal with the variety of uses of biological data. Clearly, the manner of supporting the semantics will affect how the data is structured.

It is true that some of the complexity we have encountered is due to the nature of these microscopic round worms, where internal as well as external parts are used in descriptions for differentiating taxa. In many other areas, only a few dozen characteristics are required for differentiating and describing species. Nevertheless, if large descriptive databases are to be constructed and maintained, many of the problems we address will have to be solved for these databases if they are to be integrated.

In the remainder of this paper, in Section 2, we present some background. In Section 3 problems are discussed that arise in creating a large list of biological characters. In Section 4, we introduce the major concept of basic property and its features to facilitate handling these problems, including the representation and use of state-based relationships in Section 5. In Section 6, we extend the idea of name extension introduced in Section 4 to another context, and in Section 7, we briefly discuss schema changes.

In conceptualizing descriptive data, the biologist thinks in terms of what is commonly called the "data matrix," i.e., a taxon by character array [2]. Within the matrix are the states or numerical values of the species for that character.

Some clarification may be needed with regard to our meaning of character since the term does not have a standard definition among biologists. Even the standardization of the concept of a character would go a long way towards integrating descriptive databases. At one extreme, in a biological key a character used for differentiation among species may be as complex as "esophagus with valve-like expansion about one to two head diameters posterior to base of stoma; amphid unispiral." At the other end of the spectrum, the practice is to treat a character as a single characteristic such as Flower color or Leaflet presence, each considered as an atomic unit and stored as data within a single field in a data table as shown in Figure 1 [2]. In some cases, the character includes a state as well, as in petal pink. Also, characters are definitely not fully decomposed in a hierarchical fashion, though DELTA [4] allows a one level breakdown into character subheadings.

Taxon	Character	Data value
Vicia cracca	flower colour	blue
Vicia cracca	flower colour	violet
Lathyrus aphaca	leaflet presence	absent
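
As an illustration only, the rows of Figure 1 can be gathered into the taxon by character array described earlier. The sketch below is not taken from any of the cited systems; the function name `build_data_matrix` and the use of a set for each cell are assumptions made for the example.

```python
# Rows copied from Figure 1: (taxon, character, data value).
ROWS = [
    ("Vicia cracca", "flower colour", "blue"),
    ("Vicia cracca", "flower colour", "violet"),
    ("Lathyrus aphaca", "leaflet presence", "absent"),
]


def build_data_matrix(rows):
    """Group (taxon, character, value) rows into a taxon-by-character mapping.

    Each cell is a set (an assumption of this sketch), because a taxon may
    have several states for one character: Vicia cracca is blue and violet.
    """
    matrix = {}
    for taxon, character, value in rows:
        cells = matrix.setdefault(taxon, {})
        cells.setdefault(character, set()).add(value)
    return matrix


matrix = build_data_matrix(ROWS)
for taxon, cells in matrix.items():
    for character, states in cells.items():
        print(taxon, "|", character, "|", ", ".join(sorted(states)))
```

Cells that are absent from the mapping correspond to characters that were never recorded for a taxon, which is the common case once the number of characters grows.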

We take a character to be a triple: (biological structure, property, state/value). A biological structure, or structure, can be a system, organ, organ part, tissue, etc. A property is usually an attribute of the structure such as shape, length, color, and so forth. A state is the quality of the property such as round or pink, while value is taken to be a numerical value. Also, structures can have substructures as well, as seen in Figure 2. This is more consistent with usage in database design where entity (structure), attribute (property), and value are the main elements. At times we will use the terminology list of characters or characters to simply mean a list of (structure name, property) tuples, ignoring the states, which in examples will be written without the parentheses as in body, shape. We do not impose the restriction that a character must differentiate taxa, though usually it does, and a bit of nondescriptive data is usually included such as host, soil, and stage (gender). Our definition of a character will become more evident when basic properties are introduced later in this paper. An extended discussion of the definition of a character will be found in [13].
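
A minimal sketch of the triple, assuming a Python representation that is not part of the NEMISYS system; the class name `Character` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Character:
    """A character as a (structure, property, state/value) triple."""

    structure: str                                       # e.g. "body"
    prop: str                                            # e.g. "shape"
    state_or_value: Optional[Union[str, float]] = None   # state (text) or numerical value

    def name(self) -> str:
        # The (structure, property) pair written without parentheses.
        return f"{self.structure}, {self.prop}"


petal = Character("petal", "color", "pink")   # "petal pink" with the property made explicit
body = Character("body", "shape")             # list-of-characters entry, state ignored

print(petal.name())
print(body.name())
```

Leaving the third element optional mirrors the convention above of using "characters" to mean (structure name, property) pairs when the states are ignored.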

Biologists often talk about the "data matrix" for a domain even though it may in fact not exist. In nematology for instance, a data matrix has never been constructed for large sets of species. The number of characters needed for small sets of species can usually be limited to a handful of differentiating characters. As the number of species increases, the number of characters needed increases as well and the complexity tends to frustrate development of a data matrix of any appreciable size, rendering the notion of "data matrix" as more of a concept than reality. While it might seem conceptually possible to combine the data from data matrices for small collections into a single large one, in practice it can be difficult to do since the problems that occur with small sets of characters would likely be solved in very different ways, making integration difficult. Part of what we present here is to help standardize this process.

In this section, we examine some of the problems of creating a large schema of descriptive characters. In the NEMISYS project, the nematologist took charge of building the original list of characters, and he felt with about 125 characters, excluding states, that the list was close to complete. At this size, the list seemed to be quite manageable.

Unfortunately, the number of characters continued to grow causing significant complications. The nature of the problems was not always apparent. Often there was a tension between the semantics and the structure, as one can attempt to capture too much of the semantics via the structure. It also became clear that a tool incorporating new concepts and standards would be needed to manage the list of characters. It may be somewhat surprising that the initial estimate of 125 characters would be so different from the current number of over 700, even with techniques designed to consolidate the character set. However, this illustrates the point discussed earlier that descriptions reflect their authors' tastes as well as the fact that standards have not been set with an eye towards electronic storage and retrieval.

There are some well-known, inherent problems in creating any hierarchical decomposition. A simple example is having to choose among decompositions according to function, i.e., should one make digestive muscles part of the digestive system or part of the muscular system? Likewise, one might wish other groupings such as by region. These multiple views could easily be handled in a "schema tool" that supports alternative groupings of structures. While views can be quite complicated in some domains, e.g., viewing an airplane's structure, aerodynamics, or electronics [15], this degree of complexity does not appear to be needed in descriptive databases, though it is easy to envision more complex ones that include data on physiology, ecology, and the like where they might be needed.

Even within a structural decomposition, problems can arise. For example, the testis contains spermatocytes, which in some cases have different shapes depending on whether they are contained in the anterior end or posterior end. In choosing the decomposition as shown in Figure 3, there is an advantage whenever the difference is exhibited in a species, but this is not desirable when the shape is the same at both ends since the result would have to be stored twice. There are many other plausible and seemingly reasonable alternative decompositions, but all tend to lead to problems in creating and utilizing the database, and would of course create difficulties in integrating even small databases as shown in [13].

CONSISTENCY AND UNIFORMITY. As the size of the list of characters expanded, the biologist felt a sense of loss of control. One of the main problems was trying to maintain a uniform and consistent set. While one can solve this problem just by being consistent, it is not so easy to remember how similar characters were expressed in other parts of the list. For example, the property presence with possible states present/absent is one way to represent whether a part is present or not. However, the property visibility with states absent/faint/clear/conspicuous is another way, with the latter three implying the presence of the part. As another example, consider the characters and states found in Figure 4 from [16]. Here we see four ways of representing numerical values: as integers (1,2,3, etc.) in character 77, as strings (uno, dos, tres, etc.) in character 76, as ranges in character 75, and as comparators in character 73, though the latter two are attempts to deal with a problem discussed below. One also may confront mixed expressions in the same character such as (1, 2, 3, 4, 5, or more) as in character 77 or (1, 2, 3, 4, 5, half a dozen, about a dozen, many) similar to character 76. These are all characters that appear with one another, where consistency should be easy to observe, yet even here it has not been maintained.

Also note that the expression of the character name is sometimes numero de and at other times numero. While this may seem a trivial concern, it can have implications for supporting and using the characters. In other cases, useful information may be embedded in the character name which makes it difficult to process the character properly. For example, the properties shape of the female and shape of the male hide the fact from the system that one character is for females and the other is for males, yet characters found in the female genital system and the male genital system would easily be detected as gender based and could thus be used appropriately by the system as in an identification where an unknown is known to be female.

IMPLICIT PROPERTIES. For some biological characters the property is implicit and it may be difficult to determine what the correct property is. For example, in "body 200µm, smooth" one implicit property is the length of the body, but it is unclear what the correct property is for the state smooth. One solution is to create a property out of one of the states; in this case it could be smoothness, with smooth one of its states. We observed that when property names were not naturally associated with a character the biologist tended to use artificial property names such as aspect, type, situation, or nature. They were often used interchangeably and at times inconsistently. For instance, in one character the property nature may be used to indicate whether the part is faint or well marked, while in another aspect is used for the same states, and in another visibility is used. It becomes even more problematic when there are multiple distinguishing implicit properties for the same part. For example, "hair curly, thick, and coarse" has three implicit properties, but it may not be easy to identify what the property names should be among the choices curliness, thickness, coarseness, body, texture, etc. This may explain in part why in practice petal pink is treated as a character, since one does not have to deal with naming the property (obviously color in this case), whether it is obvious or unknown.

PROPERTY EXPLOSION. In a descriptive database, knowing the data is necessary just to correctly specify the properties, that is, you need to know what the data is going to be prior to creating its storage structure. Unfortunately, this is difficult when there are descriptions of thousands of species, each bearing the stamp of its author. One result is an explosion of very similar properties as new data is added to the database. For example, the diameter of the body may appear to be a reasonable property to use. But the literature may incrementally reveal that the diameter of the body is measured at the vulva, at midbody, at the stylet base, or at several other positions on the body, or may indeed be the maximum diameter without specifying where the measurement is made, though an expert on the species will probably understand what is intended. Creating a new property each time is clearly not the best resolution.

CHARACTER STATE SYNONYMY. In biology, states represent an important part of the domain expertise and thus are properly part of the schema development. It may be difficult to determine if a state is a separate state or a synonym for another state, or determine if a group of states should be logically combined into one property or kept separate in two. In particular, the language used in published descriptions is quite rich. For example, it is not obvious if widely open C, open C, very open C are synonymous with weak C or are distinct states. Experts could disagree. No standards generally exist for published descriptive terminology, and with thousands of descriptions it takes significant effort to determine valid properties and states, and to determine which are new states and which are synonymous with existing states. Thus, creation of a schema is not a short term activity; it can only develop over time as the literature is reviewed, a time consuming task for experts. Therefore, the design must be sufficiently flexible to change as new data is acquired, since the addition of new data can add to the list of states, which play a critical role in relationships expressed in the design, particularly with state-based relationships discussed below.

GENERAL AND SPECIFIC STATES. Another problem, similar to synonyms, involves general and specific states for a property. For example, if round, circular, and elliptical are states, the first is a general state encompassing the latter two, which are more specific. An example from our domain is the set of states ellipsoidal, oval, almost round, almost circular, subspherical, round, spherical, and quadrangular from the character median bulb, shape. In some descriptions, the author may only give the general state while other authors give specific states. If the general-specific state relationships are not represented, then the queries find all taxa with median bulb, shape = round and find all taxa with median bulb, shape = almost round will fetch different sets of species. If the relationships are represented, then the first query returns all species for those states for which round is considered a general state, while the latter query would return all those with the state almost round, but should also return those with the state round as "maybe" results. Alternatively, if two properties were used, one for the general states and another for the specific states, one called median bulb, shape and another called median bulb, general shape, then what would be returned would depend on the formulation of the queries.
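
The query behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the NEMISYS implementation: the grouping of specific states under round in `GENERAL_OF`, the taxon names in `DATA`, and the function `query` are all hypothetical.

```python
# Hypothetical mapping for median bulb, shape: specific state -> general state.
# Which states an expert would actually place under "round" is an assumption.
GENERAL_OF = {
    "almost round": "round",
    "almost circular": "round",
    "spherical": "round",
}

# Hypothetical recorded data: taxon -> state of median bulb, shape.
DATA = {"Taxon A": "round", "Taxon B": "almost round", "Taxon C": "oval"}


def query(state):
    """Return (definite, maybe) lists of taxa for median bulb, shape = state."""
    # A general state also matches every specific state filed under it.
    specifics = {s for s, g in GENERAL_OF.items() if g == state}
    definite = sorted(t for t, s in DATA.items() if s == state or s in specifics)
    # A specific state may also describe taxa recorded only with the general state.
    general = GENERAL_OF.get(state)
    maybe = sorted(
        t for t, s in DATA.items() if general is not None and s == general
    )
    return definite, maybe


print(query("round"))
print(query("almost round"))
```

With these sample data the general query gathers the taxa recorded as round or almost round, while the specific query returns the almost round taxon as a definite result and the round taxon only as a "maybe" result.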

General and specific states may arise due to different uses of the data. For example [14], for phenetic classification a stem has finer distinctions terete, striate, narrowly winged, or widely winged while for identification there may be only two categories not obviously winged and obviously winged. In essence, the latter two can be considered general states for terete, striate and narrowly winged, widely winged, respectively.

STATE-BASED RELATIONSHIPS. In standard database design methods such as the Entity-Relationship Model [17], relationships are expressed between entities. However, relationships between attributes or between attributes and states are not considered. We call these relationships state-based. Some methods allow certain state-based relationships such as value-determined classes, where the value restricted in one class is the basis for creation of a subclass. For example, restricting the class SHIP to those with cargo = 'oil' would create a subclass DANGEROUS SHIP [18].

Until we recognized the existence of state-based relationships, there were endless rounds of revisions in the character set, where the biologist would propose characters and the computer scientists would suggest alternatives or point out problems. Generally, the difficulty stemmed from the biologist's attempt to capture these relationships implicitly within the characters. This embedding of information in the characters makes the information subsequently difficult to work with. In fact, many of our criticisms of current approaches in building character sets stem from the fact that too much information is improperly embedded in the structures, properties, and states. If these state-based relationships are not identified and remain implicit, then it is likely that the resulting set of characters will be poorly designed.

Synonymy, general, and specific states are simple forms of state-based relationships. One could classify them as intra-character state-based relationships since they can be handled within a property, as discussed below. Examples of inter-character state-based relationships will be presented next.

DEPENDENT CHARACTER. It has been observed that presence or absence of a character affects its usage within the system [6,7]. For example, petiole hair length depends on petiole hairiness, which depends on petiole presence, and on leaf presence. Dependency need not be restricted to the presence or absence of another character, but can be based on one or more states in a property. In Figure 5, the property body behind the neck, shape is only applicable if the body, kind = nonvermiform.
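
A dependency chain of this kind can be checked mechanically. The sketch below is only an illustration: the table `DEPENDS_ON`, the function `is_applicable`, and the state names hairy and vermiform are assumptions, and an unrecorded controlling character is treated as making the dependent character inapplicable.

```python
# Dependent character -> (controlling character, states that make it applicable).
# The state name "hairy" is assumed for the example.
DEPENDS_ON = {
    "petiole, hair length": ("petiole, hairiness", {"hairy"}),
    "petiole, hairiness": ("petiole, presence", {"present"}),
    "petiole, presence": ("leaf, presence", {"present"}),
    "body behind the neck, shape": ("body, kind", {"nonvermiform"}),
}


def is_applicable(character, observed):
    """Walk up the dependency chain; every controller must be in a required state.

    `observed` maps character names to recorded states. A controller with no
    recorded state blocks the dependent character (a choice of this sketch).
    """
    while character in DEPENDS_ON:
        controller, required = DEPENDS_ON[character]
        if observed.get(controller) not in required:
            return False
        character = controller
    return True


print(is_applicable("body behind the neck, shape", {"body, kind": "nonvermiform"}))
print(is_applicable("body behind the neck, shape", {"body, kind": "vermiform"}))
```

Because the dependency is stored as data rather than embedded in the character names, the same check serves both presence-based and state-based dependencies.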

SUMMARY CHARACTER. In the literature, one often sees characters that summarize a number of others. A summary character is a high level abstract characterization or shorthand used by the experts for other characters. For example, Stylet, type = hoplolaimid implies that other properties have their states as shown in Figure 6.

REDUNDANT CHARACTER. An example of a redundant character is hemizonid, distance to the phasmid, which may or may not be represented in the phasmid as the distance to the hemizonid. A more delicate situation exists when the redundancy is conditional. For example, if knobs, shape = circular, then the anterior and posterior parts of the knobs will be circular too, actually semicircular.

FUZZY STATES. In the literature, it is not unusual to find that measurements and quantities are given imprecisely. Instead of stylet, length being given as a numeric value or a range such as 20.5-25.1µm, it is given as stylet, length = short. The Annuli, number may be given as many rather than as a specific number when it is too tedious to count. These are intra-character relationships between qualitative and quantitative data. There are several other types of fuzzy properties representing inter-character relationships for comparison of states between properties such as bigger, smaller, equal to, longer, shorter, etc.

Finally, we briefly mention one other aspect of biological data that we have observed in this project and discussed in detail elsewhere [19]. Character semantics, what we call metadata, play a central role in the underlying understanding of the domain and its uses. For instance, with large character sets it is important to know which characters are easy to use in an identification and which are not, which can be relied on as input from observers and which cannot depending on their expertise. The difficulty arises when the metadata changes from taxon to taxon. This is unlike most metadata found in data models where the metadata is independent of the instances stored in the database.

There are efforts by database researchers to examine requirements in a variety of scientific areas [20] and the new capabilities in newer generation DBMSs such as user-defined data types, stored procedures, triggers, and rules, have made it possible to address the database requirements of scientific research, where file systems have been the principal means for data management in the past [21]. However, these new capabilities alone are insufficient since biologists do not have the expertise, time, or money to effectively exploit them [22]. It is a challenge to the database community to create database management tools [22] that can be tailored to typical scientific endeavors and to construct for each domain a unifying model [23]. In this section, we present several concepts that address the problems stated above and provide a model that is easy to work with, both for the designer and the user.

We have discussed a variety of problems that arise in creating a large set of characters for a descriptive database. While it is possible to attack each of these problems individually, the result may be a complex system that is difficult to support and makes integration extremely difficult. Our effort will be to create a framework that provides the necessary expressiveness, while at the same time is reasonably uniform and simple, both for the designers and for the users. Given the wide variety of problems discussed above, this is a significant challenge, but we feel that the concept of basic property and its features takes a major step in achieving our goals.

In the course of analyzing several early versions of the list of characters that the biologist had created, we observed many problems with consistency and uniformity. While many properties had certain features in common, the concept of a property was not sufficiently well defined to aid in producing a uniform and consistent set of characters. Subsequently, we developed the concept of basic property.

A basic property is a property satisfying the following four general conditions.

CONDITION I. A basic property is domain independent, that is, it is not peculiar to one domain such as nematology, ichthyology, or entomology. A basic property should be useful in multiple domains. For example, shape is domain independent while shape of the stylet is not; likewise, flow rate is domain independent while flow rate of blood is not.

CONDITION II. A basic property is specific to the type of data, i.e., descriptive, behavioral, ecological, etc. For example, length is specific for descriptive data while flow rate is specific to physiological.

For descriptive data, there are four broad semantic categories in which basic properties can be placed: APPEARANCE, MEASUREMENTS, PLACEMENTS, AND QUANTITIES as seen in Figure 7. (While commercial DBMSs support basic business data types like date and money, they do not support certain basic properties for domains like order processing. There, one might find that order number, dropship address, line item, etc., would be part of a set of basic properties that could be used in a wide variety of order processing systems.)

CONDITION III. A basic property is independent of structures and states. For example, shape of the wing is not a basic property as it contains a structural reference, wing, and is circular-shaped is not a basic property as it contains a state reference, circular.

CONDITION IV. A basic property is a template to be used in creating characters. When a character is created, an instance of a basic property is created, i.e., copied and modified, to form the character.

The advantages of creating basic properties lie in promoting uniformity and in ease of use in building the set of characters. Once basic properties are created for one area and type of data, they do not have to be recreated for each of the domains with the same type of data, thus eliminating some of the redundant effort seen in creating biological databases [2].

Given the four general aspects of a basic property, we now extend its definition by examining specific aspects that hold whenever a basic property is instantiated (used) to create a character, which we will refer to as an "instantiated basic property." (We continue the numbering in the definition.)

CONDITION V. RELATIONAL AND NONRELATIONAL PROPERTIES. A basic property that can be instantiated to form a relationship with another character is called relational, otherwise it is nonrelational. Basic properties that are typically instantiated in this way are marked with a "*" in Figure 7. For example, the basic properties distance to and ratio of would be meaningless when instantiated as is in a list of characters since they inherently relate structures and characters. Sometimes these relationships are referred to as landmarks (for the definition see the glossary of [12]).

We emphasize that with relational basic properties the relationship is established when the property is instantiated to create a character. At that time, the system should prompt for the related character, i.e., structure or structure and property. For example, when creating a character using distance to, the system would prompt for a structure name to complete the name of the property, and when creating a character using ratio of, the system would prompt for two other properties such as height and width. The relationships can be maintained via the mechanism for state-based relationships discussed below. Note that with nonrelational basic properties, such as length, the system would not prompt for additional information, though the names of instantiated properties can be modified. This discourages creating properties such as length of the wing, which is really a combined property and structure, a poor design choice since a better decomposition would be wing, length. Thus, the very definition of basic properties aids in uniformity and consistency within a character set, even if they are not implemented and supported.
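
The prompting behavior described above might look like the following sketch, in which the "prompt" is modeled as a required argument. The registry `RELATIONAL_ARITY`, the function `instantiate_relational`, the way related names are joined into the property name, and the use of body as the structure for the ratio are assumptions for illustration.

```python
# Hypothetical registry: relational basic property -> number of related
# structures or properties that must be supplied at instantiation.
RELATIONAL_ARITY = {"distance to": 1, "ratio of": 2}


def instantiate_relational(structure, basic_property, related=()):
    """Return a (structure, property name) pair for a new character.

    Nonrelational basic properties take no related items; relational ones
    must be completed with exactly the number of items they require.
    """
    required = RELATIONAL_ARITY.get(basic_property, 0)
    if len(related) != required:
        raise ValueError(
            f"'{basic_property}' needs {required} related item(s), got {len(related)}"
        )
    if required == 0:
        return (structure, basic_property)
    # Complete the property name with the related item(s).
    return (structure, f"{basic_property} {' and '.join(related)}")


print(instantiate_relational("wing", "length"))
print(instantiate_relational("hemizonid", "distance to", ["phasmid"]))
print(instantiate_relational("body", "ratio of", ["height", "width"]))
```

Rejecting an incomplete relational property at creation time is one way to keep entries such as a bare distance to out of the list of characters.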

CONDITION VI. STATES IN BASIC PROPERTIES. Basic properties may have specified states as part of their definition. For instance, presence has two states {present, absent}, which are automatically included in any instantiation. Additional states and synonyms can be added in each instantiation as appropriate. Basic properties classified as measurements and quantities also have fuzzy states included. For example, the basic property length includes the fuzzy states {very short, short, intermediate, long, and very long}. Quantities have fuzzy states such as {a couple, a few, several, many, about a dozen}. Upon instantiation, changes to the list can be made. We have not addressed the complexity of issues related to the use of fuzzy states. We have only provided for fuzzy states in the list of characters and for storing fuzzy values in the data acquisition.
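
One way to picture Conditions IV and VI together is a template whose predefined states are copied into each instantiation and then extended. The sketch is hypothetical: the names `BASIC_PROPERTY_STATES` and `instantiate_with_states`, the record layout, and the extra state vestigial are assumptions, while the predefined states are those given in the text.

```python
# Predefined states from the text; tuples keep the templates read-only.
BASIC_PROPERTY_STATES = {
    "presence": ("present", "absent"),
    "length": ("very short", "short", "intermediate", "long", "very long"),
}


def instantiate_with_states(structure, basic_property, extra_states=(), synonyms=None):
    """Copy the template's states into a new character, then apply local changes."""
    states = list(BASIC_PROPERTY_STATES[basic_property])  # copy, never share
    for state in extra_states:
        if state not in states:
            states.append(state)
    return {
        "structure": structure,
        "property": basic_property,
        "states": states,
        "synonyms": dict(synonyms or {}),  # local synonym -> existing state
    }


stylet_length = instantiate_with_states("stylet", "length")
# "vestigial" is a made-up extra state, added only to this instantiation.
petiole_presence = instantiate_with_states(
    "petiole", "presence", extra_states=["vestigial"]
)

print(stylet_length["states"])
print(petiole_presence["states"])
```

Because each instantiation works on its own copy, adding a state for one character leaves the basic property and every other character unchanged.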

CONDITION VII. QUALITATIVE AND QUANTITATIVE PROPERTIES. There are essentially two types of data required for descriptive data: qualitative states and quantitative values. Other data types may be required for ecological, physiological, and geographical data. Storage structures are created based on the instantiated properties and the type of storage structures desired: records, relations, objects, etc. In this discussion, we will assume the descriptive data is stored in records as shown in Figure 8, that is, each record will represent data for a given character. Included but not shown are the fields to store taxon related information obtained directly from published descriptions. We do not address the complex area of nomenclature in this paper.

[Figure 8a: record format for a qualitative character, with one example record]
1. Structure | 2. Property | 3. Name extension | 4. State | 5. Qualifier | 6. Version*
Hemizonid | position relative to | excretory pore | anterior | slightly |

Field 1 in Figure 8a identifies the structure for the property in field 2. Field 3 will be discussed in condition IX. Qualitative data have a field for a state, and a field for the frequency of occurrence or for other qualifications, fields 4 and 5, Figure 8a, as states are often given with qualifiers such as "always, usually, sometimes."

Using records to represent character data presents a problem whenever character data needs to be linked, which happens occasionally, but not always predictably, in the data we are working with. This may be different states for the same character (see VIII below) or for different characters. Field 6 stores a version number, with version number 0 for characters that are not linked and a different version number each time there is a link for the same set of characters. Application programs would have to handle this whenever necessary. The alternative approach of storing multiple characters per record and normalizing would be more difficult to achieve given the uncertainty of the data in the literature or in adding new species.
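To make the role of the version field concrete, the following sketch stores one character per record and recovers linked records from shared nonzero version numbers. It rests on assumptions: the class and function names are hypothetical, the `taxon` field stands in for the taxon-related fields that are "included but not shown," and the linked hair states are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QualitativeRecord:
    taxon: str            # stand-in for the taxon fields not shown in Figure 8
    structure: str        # field 1
    prop: str             # field 2
    name_extension: str   # field 3
    state: str            # field 4
    qualifier: str = ""   # field 5
    version: int = 0      # field 6: 0 means "not linked"


def linked_groups(records):
    """Group records that share a nonzero version number within a taxon."""
    groups = defaultdict(list)
    for record in records:
        if record.version != 0:
            groups[(record.taxon, record.version)].append(record)
    return dict(groups)


records = [
    QualitativeRecord("taxon A", "hemizonid", "position relative to",
                      "excretory pore", "anterior", "slightly"),
    # Hypothetical linked states: hair that is both thick and curly.
    QualitativeRecord("taxon A", "hair", "kind", "", "thick", version=1),
    QualitativeRecord("taxon A", "hair", "kind", "", "curly", version=1),
]
groups = linked_groups(records)
```

An application program would call something like `linked_groups` only when it needs to treat linked states together, which is the "whenever necessary" handling mentioned above.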

Quantitative values can be separated into measurements (reals) and quantities (integers). Each measurement has a field for a value, field 7 of Figure 8b, representing a measurement of an individual or an average for a population. In the latter case, there would be data for high range, low range, standard deviation as these are the terms most frequently encountered in descriptions, fields 8-10, Figure 8b. There are alternative ways in which measurements are given such as when the variance, standard error, confidence interval, or normal and extreme ranges are used in place of a standard deviation. Often a conversion can be made to a standard deviation representation. A more elaborate representation of measurements may be necessary though to accommodate these alternatives. Quantitative data records also have fields for states to accommodate fuzzy states.
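The conversions mentioned above follow from standard identities: the standard deviation is the square root of the variance, and the standard error of a mean is the standard deviation divided by the square root of the sample size n. The sketch below uses hypothetical function names. The confidence-interval case additionally assumes a normal-theory interval of the form mean ± z·sd/√n with z = 1.96 (a 95% interval), which a published description may or may not have used; all but the variance case require that n be reported.

```python
import math


def sd_from_variance(variance):
    """Standard deviation is the square root of the variance."""
    if variance < 0:
        raise ValueError("variance must be non-negative")
    return math.sqrt(variance)


def sd_from_standard_error(standard_error, n):
    """Standard error of the mean is sd / sqrt(n), so sd = se * sqrt(n)."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return standard_error * math.sqrt(n)


def sd_from_confidence_interval(low, high, n, z=1.96):
    """Assumes an interval of the form mean +/- z * sd / sqrt(n).

    z = 1.96 corresponds to a 95% normal-theory interval; this is an
    assumption, since a description may use another level or a t-based interval.
    """
    if n < 1:
        raise ValueError("sample size must be at least 1")
    half_width = (high - low) / 2.0
    return half_width * math.sqrt(n) / z
```

Normal and extreme ranges have no comparable closed-form conversion, which is one reason a more elaborate representation of measurements may be needed.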

For integer data, usually a single numeric value or a range is given. This can be handled using the same fields 7-9 in Figure 8b, though the data type would be integer instead of real. Integer ranges with gaps can be handled by multiple records. Occasionally, an average value for integer data is given; an example is average family size. In this case, the value would have to be a real as well, representing an average. The resulting ambiguity is resolved by specifying the range and scale of a property.

An instantiated basic property may be designated as having a scale, one of {nominal, ordinal, interval, ratio}, and a range, one of {binary, discrete, continuous}. Scale indicates whether the states or values are unordered, ordered, ordered with measurable differences (a – b), or ordered with measurable differences (a – b) and ratios (a/b), respectively. Range differentiates between discrete/binary, i.e., integers or states, and continuous data, i.e., reals. Basic properties classified under appearance and position have defaults of nominal or ordinal, and binary or discrete; those in measurements have defaults of continuous and ratio; and those in quantities have defaults of interval and discrete. One normally accepts the defaults, but can make changes for a given character. Thus, properties designated continuous would have all of the fields for measurements, those with interval and discrete would have those for quantities, and those that are ordinal or nominal and discrete would have fields for qualitative states. Those that are continuous and interval would have integer ranges and a real value and stdev.
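The mapping from scale and range to record layout can be written as a small decision function. This is a sketch with hypothetical names (`Scale`, `Range`, `DEFAULTS`, `record_layout`). Two points are assumptions rather than statements from the text: where the defaults offer a choice, the first alternative is used, and binary is treated like discrete for nominal and ordinal properties. Combinations the text does not cover are reported as "unspecified".

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


class Range(Enum):
    BINARY = "binary"
    DISCRETE = "discrete"
    CONTINUOUS = "continuous"


# Defaults per class of basic property. Where the text says "nominal or
# ordinal" and "binary or discrete", the first alternative is an assumption.
DEFAULTS = {
    "appearance": (Scale.NOMINAL, Range.BINARY),
    "position": (Scale.NOMINAL, Range.BINARY),
    "measurements": (Scale.RATIO, Range.CONTINUOUS),
    "quantities": (Scale.INTERVAL, Range.DISCRETE),
}


def record_layout(scale, value_range):
    """Return which set of storage fields a character would receive."""
    if value_range is Range.CONTINUOUS and scale is Scale.INTERVAL:
        # e.g. an average of integer data such as average family size
        return "integer ranges with real value and stdev"
    if value_range is Range.CONTINUOUS:
        return "measurement"
    if value_range is Range.DISCRETE and scale is Scale.INTERVAL:
        return "quantity"
    if scale in (Scale.NOMINAL, Scale.ORDINAL) and value_range in (
            Range.BINARY, Range.DISCRETE):
        return "qualitative"
    return "unspecified"
```

Checking the continuous-and-interval case first matters, because it would otherwise be absorbed by the general rule for continuous properties.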

CONDITION VIII. IMPLICIT PROPERTIES AND MULTIPLE STATE LISTS. To address the problem of implicit properties mentioned earlier and to avoid the use of multiple artificial properties such as nature, aspect, situation, etc., we use the basic property kind. (The property type, which is not a descriptive property, is a term that has special meaning in biology and should be reserved for states that indicate a biological type as shown in the example of the summary character in Figure 6.) This presents a problem, though, when there are multiple implicit kinds in the same structure. Instead of using kind1, kind2, kind3, etc., we introduce the concept of multiple state lists for instances of basic properties. That is, associated with a qualitative property we allow more than a single list of states. For example, we could have one list for general states and another list for specific states within the same character, though we opt for a different approach described below, or we could have separate state lists for a character such as hair, kind: one state list including states {thin, thick}, another including states {coarse, smooth}, and another including states {curly, straight}. One advantage in using multiple state lists is that it provides a mechanism for decomposing complex states into more atomic states. For example, the states {low contiguous, high contiguous, high separated} might be decomposed into separate state lists {high, low} and {separated, contiguous}. Each state list is maintained in a lexicon, defined below, and forms the basis of supporting many of the concepts discussed, including intra- and inter-character state-based relationships.
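The decomposition of complex states can be illustrated with the example just given. The sketch below uses hypothetical names and makes a simplifying assumption: every atomic state is a single word, so a complex state can be split on whitespace.

```python
# The two state lists from the example in the text.
STATE_LISTS = [
    {"high", "low"},
    {"separated", "contiguous"},
]


def decompose(complex_state, state_lists=STATE_LISTS):
    """Split a complex state into exactly one atomic state per state list."""
    words = complex_state.split()
    atoms = []
    for states in state_lists:
        matches = [w for w in words if w in states]
        if len(matches) != 1:
            raise ValueError(f"cannot decompose {complex_state!r}")
        atoms.append(matches[0])
    return tuple(atoms)


decomposed = [decompose(s) for s in
              ("low contiguous", "high contiguous", "high separated")]
```

Each resulting tuple holds one state from each list, so the three complex states become combinations of four atomic states.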

CONDITION IX. NAME EXTENSIONS IN BASIC PROPERTIES. A simple mechanism, called a name extension, is a modifier of a property name. As such it can be stored as a separate field as shown in Figure 8, field 3. One example of property explosion where name extensions are appropriate is given when length is measured in different ways for the same structure such as length along the axis, length along the outer boundary, or length directly taken. Rather than creating multiple properties, we can create one property, length, with three name extensions along the axis, along the outer boundary, directly taken. Instances of basic properties maintain a list of name extensions. In the course of a query, the user can decide which, if any, of the name extensions to enforce.

There may appear to be a conceptual resemblance between name extensions and relational properties, and indeed, name extensions are used to represent relational properties. For instance, the property explosion diameter at the vulva, diameter at midbody, diameter at the stylet base, etc., can be represented by the single property diameter with the name extensions at the vulva, at midbody, at the stylet base, etc.; see Figure 9, where the name extensions have a " – " prefix. At the same time, each name extension represents a relationship with another character.

There are several advantages of name extensions. The first, and obvious, one is the convenience of consolidating multiple properties into a single property. As new name extensions are encountered for a property, they can be added to an existing property rather than having to create a new one.

[Figure 9: name extensions, each with a " – " prefix]
– at beginning anterior ovary
– at midbody MBW
– at vulva VBW = VB
– at end posterior ovary
– at anus

A field in the storage record of each property, whether it initially has name extensions or not, can be set aside for a name extension value. Whenever a value is stored, the name extension can be stored as well. It is a simple matter in a query to enforce (or ignore) a name extension via a simple "and" condition. Operationally, the name extension could be used like a modifier that may or may not factor in a query. For example, in a query with the condition body, diameter < 20.0 it would not be important to enforce a condition like body, diameter.name extension = at the vulva, indeed, it would be most desirable not to. Needless to say, in an interactive session the user would be able to make appropriate choices.
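The optional "and" condition on the name extension can be sketched as follows. The record layout, the function `select_taxa`, and the data values are all hypothetical; only the query condition body, diameter < 20.0 and the extension at the vulva come from the text.

```python
def select_taxa(records, structure, prop, predicate, name_extension=None):
    """Return taxa with a matching value; enforce the extension only if given."""
    return sorted({
        r["taxon"]
        for r in records
        if r["structure"] == structure
        and r["property"] == prop
        and predicate(r["value"])
        and (name_extension is None or r["name_extension"] == name_extension)
    })


# Hypothetical stored measurements.
records = [
    {"taxon": "taxon A", "structure": "body", "property": "diameter",
     "name_extension": "at the vulva", "value": 18.0},
    {"taxon": "taxon B", "structure": "body", "property": "diameter",
     "name_extension": "at midbody", "value": 19.5},
    {"taxon": "taxon C", "structure": "body", "property": "diameter",
     "name_extension": "at the vulva", "value": 24.0},
]

# body, diameter < 20.0 with the name extension ignored ...
ignoring = select_taxa(records, "body", "diameter", lambda v: v < 20.0)
# ... and with the name extension enforced.
enforcing = select_taxa(records, "body", "diameter", lambda v: v < 20.0,
                        name_extension="at the vulva")
```

Ignoring the extension retrieves the larger candidate set, which is usually what is wanted for a coarse condition such as body, diameter < 20.0.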

CONDITION X. LEXICONS IN BASIC PROPERTIES. The principal mechanism in an instance of a basic property to handle many of the concepts discussed including intra- and inter-character state-based relationships is called a lexicon. A lexicon L is a 5-tuple (S, P, CS, DS, M), where S is a structure, P is a property including specified name extensions, CS is a set of cited states, DS is a set of display states, and M is a correspondence or mapping, not necessarily a function. M maps CS to DS.

The basic rationale for a lexicon is straightforward. In the literature there may be a wide variety of states for a property. One example set of states is {straight, weak C, C, circle, closed circle, open circle, widely open C, spiral, question mark, tight spiral}. This set is designated as a set of cited states CS, since they are as cited in the literature and we assume these are the values that are stored in the database. However, for a number of reasons, this set CS may not be the set of states we would choose to display as a list of states for the character. For example, some of the terms in CS may be outdated, some may be nonstandard terminology, some could even be wrong, i.e., a bad synonym, or some might be synonyms that are less frequently used than other terms. Thus, we form a set DS, display states, that represents a set of distinct states to be displayed whenever the character is viewed. An example of a set DS for the example CS above might be {straight, weak C, C, circle, question mark, spiral, tight spiral}.

Since the set of cited states is from the literature, the set would not change over time except to add new states, unless the character were to change in some fundamental way such as dividing a character in two. On the other hand, the set of display states would change. The intention is that changing the set DS leaves the stored data unaffected, so that individual users can change DS as needed. A correspondence or mapping M between CS and DS is needed to relate the elements of CS that are synonyms of, or general states for, elements of DS. An example of the correspondence is shown in Figure 10. On the left-hand side are elements of DS, on the right-hand side the corresponding elements of CS with "=" showing the state itself and synonyms and " " showing general states. Note that C is a general state for weak C, and it is also a display state itself, and widely open C is a synonym for weak C.

A lexicon can easily be represented and implemented as a table. A property may have one or more lexicons, thus multiple state-lists directly correspond to multiple lexicons. Most importantly, the lexicons form the basis for inter-character state-based relationships.
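One way to picture a lexicon as a table is to store M as rows of (cited state, display state, kind of correspondence). The sketch below is an assumption about layout, not the layout used in NEMISYS; the class name, method names, and the "same"/"synonym"/"general" labels are hypothetical. Only the correspondences stated explicitly in the text are entered; the rest of Figure 10 is not reproduced, and the character name body, habitus is taken from the retrieval example later in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Lexicon:
    structure: str   # S
    prop: str        # P, including any name extensions
    cited: set       # CS
    display: set     # DS
    # M as rows (cited state, display state, kind),
    # with kind in {"same", "synonym", "general"}.
    mapping: list = field(default_factory=list)

    def cited_equivalents(self, display_state):
        """Cited states that are the display state itself or a synonym."""
        return {c for c, d, kind in self.mapping
                if d == display_state and kind in ("same", "synonym")}

    def general_states(self, display_state):
        """Cited states that are general states for the display state."""
        return {c for c, d, kind in self.mapping
                if d == display_state and kind == "general"}

    def specific_states(self, general_state):
        """Display states for which the given state is a general state."""
        return {d for c, d, kind in self.mapping
                if c == general_state and kind == "general"}


habitus = Lexicon(
    structure="body",
    prop="habitus",
    cited={"straight", "weak C", "C", "circle", "closed circle", "open circle",
           "widely open C", "spiral", "question mark", "tight spiral"},
    display={"straight", "weak C", "C", "circle", "question mark", "spiral",
             "tight spiral"},
    mapping=[
        ("weak C", "weak C", "same"),
        ("widely open C", "weak C", "synonym"),
        ("C", "weak C", "general"),
        ("C", "C", "same"),
    ],
)
```

Because M is stored as rows and not as a function, a cited state such as C can correspond to more than one display state, as the definition allows.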

The concepts described above, including basic properties, name extensions, lexicons, and state-based relationships have made it significantly easier to build a large set of characters in the NEMISYS project in a more consistent and uniform fashion than would have otherwise been possible. Lexicons, which provide a means for representing synonymy and general states, also provide a convenient and straightforward way of representing inter-character state-based relationships such as dependent, redundant, and summary characters. Again the basic idea is that these relationships are represented via correspondences between lexicons.

The basic unit in representing inter-character state-based relationships is a triple (L₁, L₂, C₁₂) representing a correspondence C₁₂ between the display states DS₁ and DS₂ of two lexicons L₁ and L₂. The mapping of a state in DS₁ to multiple states in DS₂ is interpreted disjunctively. One or more triples can be grouped into a collection G, which is interpreted conjunctively. A relationship is then defined by its type T and a collection {G₁, G₂, ..., Gₙ} of groups of triples, with the collection interpreted disjunctively.

An example of the simplest relationship is the dependent relationship shown in Figure 5, with T = 'Dependent' and one grouping consisting of a single triple, i.e., G₁ = {(L₁, L₂, C₁₂)}, where

L₁ is the lexicon for S₁ = body, P₁ = kind, DS₁ = {vermiform, intermediate, nonvermiform}, and

L₂ is the lexicon for S₂ = body behind the neck, P₂ = shape, DS₂ = {kidney, pear, irregularly swollen, spheroid}, and C₁₂(nonvermiform) = {kidney, pear, irregularly swollen, spheroid}. The other two values, intermediate and vermiform, are mapped to the empty list { }.
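The dependent relationship of Figure 5 can be written down directly as data and used to test a pair of observations for consistency. The sketch is illustrative: `Triple` and `is_consistent` are hypothetical names, and each lexicon is identified simply by its (structure, property) pair.

```python
from dataclasses import dataclass


@dataclass
class Triple:
    source: tuple   # (structure, property) identifying lexicon L1
    target: tuple   # (structure, property) identifying lexicon L2
    mapping: dict   # C12: state in DS1 -> set of states in DS2


SHAPES = frozenset({"kidney", "pear", "irregularly swollen", "spheroid"})

dependent = Triple(
    source=("body", "kind"),
    target=("body behind the neck", "shape"),
    mapping={
        "nonvermiform": SHAPES,
        "intermediate": frozenset(),
        "vermiform": frozenset(),
    },
)


def is_consistent(triple, observation):
    """Check one observation against a dependent relationship.

    observation maps (structure, property) to an observed state. If either
    character is missing, there is nothing to check.
    """
    source_state = observation.get(triple.source)
    target_state = observation.get(triple.target)
    if source_state is None or target_state is None:
        return True
    return target_state in triple.mapping.get(source_state, frozenset())


ok = is_consistent(dependent, {("body", "kind"): "nonvermiform",
                               ("body behind the neck", "shape"): "pear"})
bad = is_consistent(dependent, {("body", "kind"): "intermediate",
                                ("body behind the neck", "shape"): "kidney"})
```

Mapping intermediate and vermiform to the empty set is what makes a shape observation inconsistent with those two kinds.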

Summary characters, using the example in Figure 6, require multiple groups of multiple triples. With L₁ the lexicon for the character stylet, type, G₁ includes (L₁, L₂, C₁₂), where C₁₂(hoplolaimid) = {robust} in the lexicon L₂ for stylet, kind; includes (L₁, L₃, C₁₃), where C₁₃(hoplolaimid) = {conoid} in the lexicon L₃ for cone, shape; includes (L₁, L₄, C₁₄), where C₁₄(hoplolaimid) = {intermediate, long} in the lexicon L₄ for stylet, size; and so forth. (Note that mappings are interpreted disjunctively when mapping a state to two or more states.) The condition that cone, size = shaft, size can be handled by placing the mappings from hoplolaimid to each of shaft, size = small and cone, size = small in G₁, with the other cases of equal shaft and cone sizes using additional groups G₂, ..., Gₙ. Alternatively, one could include features like wild card elements to simplify the representation and reduce the number of groups Gᵢ needed. The representation of state-based relationships can be thought of as query conditions that can be used to modify queries, as discussed next.
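The three levels of interpretation (disjunctive within a mapping, conjunctive within a group, disjunctive across groups) can be expressed compactly. The sketch below uses hypothetical function names and plain tuples for triples. It encodes only the three triples of G₁ spelled out above, so it is a partial rendering of Figure 6; the state "short" is invented for the negative example.

```python
def triple_holds(triple, source_state, observation):
    """A triple holds if the observed target state is among the mapped states."""
    _source, target, mapping = triple
    return observation.get(target) in mapping.get(source_state, set())


def relationship_holds(groups, source_state, observation):
    """Groups are alternatives (or); triples inside a group must all hold (and)."""
    return any(
        all(triple_holds(t, source_state, observation) for t in group)
        for group in groups
    )


L1 = ("stylet", "type")
G1 = [
    (L1, ("stylet", "kind"), {"hoplolaimid": {"robust"}}),
    (L1, ("cone", "shape"), {"hoplolaimid": {"conoid"}}),
    (L1, ("stylet", "size"), {"hoplolaimid": {"intermediate", "long"}}),
]
summary_groups = [G1]

matches = relationship_holds(summary_groups, "hoplolaimid", {
    ("stylet", "kind"): "robust",
    ("cone", "shape"): "conoid",
    ("stylet", "size"): "long",
})
# "short" is a hypothetical state outside the mapped set {intermediate, long}.
fails = relationship_holds(summary_groups, "hoplolaimid", {
    ("stylet", "kind"): "robust",
    ("cone", "shape"): "conoid",
    ("stylet", "size"): "short",
})
```

Read this way, a relationship is already a query condition in disjunctive form, which is the view taken in the discussion of the character list processor.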

Proper use of the data is very much dependent on the accuracy in the selection of the correct characters. Given the complexity of any moderate sized character set, independent of the concepts used to build the set, the user will need assistance in selecting the appropriate characters in order to operate on the data. Usually, the more concepts involved the more difficult the task will be in building a mechanism to assist the user. However, the concepts we have introduced are represented using lexicons and name extensions in a fairly straightforward manner. This would simplify building such a mechanism, which for purposes of discussion we call a character list processor.

The types of state-based relationships we have presented, i.e., synonymy, general states, dependent, summary, redundant, and fuzzy characters, certainly do not exhaust the possibilities. The addition of new relationship types should not require specialized code or changes in existing code for a character list processor. Our approach in using state-based relationships simplifies the addition of new kinds of relationships.

To illustrate the need for a character list processor, we consider an identification session where an observer enters some initial observations C1, a set of characters with states/values that are connected by the usual logical connectives 'and' and 'or.' Without a character list processor, candidates would be retrieved based on C1. However, the set C1 may not be the best set to use. One would not expect the set of observations C1 by the user to translate directly into the best set of characters for the retrieval. For example, the user may not have been sufficiently general or specific. General or specific states may be needed to clarify, for example, that a posture designated as C includes the specific state weak C. The user may specify inconsistent characters without realizing it, as could be the case if the body, kind is given as intermediate but the body behind the neck, shape is observed as kidney, Figure 5. The user may not be aware that an observation represented by one character may indeed be represented for different taxa in different ways. For example, if the stylet, type hoplolaimid is in C1, it may be necessary to retrieve also based on the summarized characters shown in Figure 6 for those taxa where the stylet, type has not been specified.

There may be circumstances other than a retrieval, where a set of characters C1 needs to be modified before an action takes place. For example, C1 might be a set of observations that need to be verified or C1 may be used to update the database.

Clearly, state-based relationships play a central role in overcoming the problem of formulating or modifying C1 caused by the structure of the data. Each context requires proper selection of the characters. In utilizing the state-based relationships, we can view a character list processor, Figure 11, as taking a set of characters C1 as input and producing a modified set of characters C2 as output. Other input includes the set of state-based relationships, a context, and a table of operations based on the context.

By 'context,' we mean a user designated name representing the kind of activity taking place with the data. Typical names for contexts could be "Retrieval," "Query," or "Update," but there may be many more that are suitable for other contexts as well, such as verification of observations. We will discuss the situation for retrievals since they are the most common context.

By 'operation,' we mean the modification made to C1. Some generic operations are given in Figure 12. For example, if C1 contains body, habitus = C, in a retrieval C1 could be EXPANDed using specific states to include the condition body, habitus = C or body, habitus = weak C.

Likewise, if C1 contains body, kind = nonvermiform, the condition body behind the neck, shape = kidney, or pear, or irregularly swollen, or spheroid could be ADDed to C1. The ADVISE operation is used to alert the user of an existing relationship and allows the user to selectively modify C1. A SUBSTITUTE operation replaces one or more characters by other characters. Thus, the user could EXPAND, ADD, or SUBSTITUTE, and within each, add or delete disjuncts and conjuncts.

EXPAND, modify C1 by disjunctively adding conditions to C1.
ADD, modify C1 by conjunctively adding conditions to C1.
SUBSTITUTE, modify C1 by substituting conditions for conditions in C1.
ADVISE, alert the user to existing relationships.
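The generic operations can be sketched on a simple representation of C1. The representation is an assumption: C1 is held as a list of clauses that are joined by 'and', each clause being a list of (structure, property, state) conditions joined by 'or'. The function names are hypothetical, and ADVISE is omitted because it only notifies the user.

```python
def expand(c1, condition, alternatives):
    """EXPAND: add alternatives as disjuncts to each clause holding condition."""
    result = []
    for clause in c1:
        new_clause = list(clause)
        if condition in clause:
            new_clause += [alt for alt in alternatives if alt not in new_clause]
        result.append(new_clause)
    return result


def add(c1, new_clause):
    """ADD: append one new clause, i.e. a conjunctively added condition."""
    return [list(clause) for clause in c1] + [list(new_clause)]


def substitute(c1, old, new):
    """SUBSTITUTE: replace every occurrence of one condition by another."""
    return [[new if cond == old else cond for cond in clause] for clause in c1]


# EXPAND example from the text: habitus = C gains the specific state weak C.
habitus_c = ("body", "habitus", "C")
habitus_weak_c = ("body", "habitus", "weak C")
expanded = expand([[habitus_c]], habitus_c, [habitus_weak_c])

# ADD example from the text: nonvermiform gains the dependent shape clause.
nonvermiform = ("body", "kind", "nonvermiform")
shape_clause = [("body behind the neck", "shape", s)
                for s in ("kidney", "pear", "irregularly swollen", "spheroid")]
added = add([[nonvermiform]], shape_clause)
```

Each function returns a new list and leaves its input unchanged, so the original C1 remains available if the user declines a proposed modification.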

Whenever a new context arises it is necessary to specify what operations to carry out on a character set C1 relative to each kind of relationship in order to produce C2. The specification should not be based on instances of state-based relationships between two specific properties, but should only be based on the type of relationship, i.e., redundant, dependent, summary, etc., and on the context.

Figure 13 shows some entries in the table of operations. We emphasize that it is up to the user, initially the system designers, to specify the entries in the table. A relationship can be expressed in two ways. For example, for a state s1 with a synonym s2 we can express this as "s1 has synonym s2" or alternatively "s2 is a synonym for s1."

Has synonyms	EXPAND	No Op
Is a synonym	EXPAND	ADVISE
Is general state	ADVISE	ADVISE
Has general state	ADVISE	No Op
Is dependent	ADD	ADD
Has dependent	No Op	No Op
Has summary	ADVISE	ADVISE
Is summary	EXPAND	ADVISE
Is redundant	EXPAND	ADVISE
Has redundant	EXPAND	ADVISE
Name extension	ADVISE	ADVISE

In Figure 13, we see two types of contexts and the operations taken for each relationship we have considered. If a set of characters C1 is proposed to retrieve a set of candidates, we call this a "Retrieve" in this example, and the column below it in Figure 13 indicates our choice of operations on C1 for each relationship. For example, the first entry shows that if a state "Has Synonyms," as in body, habitus = weak C, then on a "Retrieve," the query should be "EXPAND"-ed to include the conditions body, habitus = weak C or body, habitus = open C. Likewise, on a "Retrieve" that includes a state that "Is A Synonym," the state for which it is a synonym should be "EXPAND"-ed. For "Is General State" on a "Retrieve," the user would be "ADVISE"-d of the general state and choose to modify conditions in the query or not. Note that when a character "Is Dependent" on another character, then the condition is "ADD"-ed to contain the primary character as well, while a character that "Has Dependent" characters would yield no operation on C1. If a character in C1 is a summary character ("Is Summary"), then the summarized characters will be "EXPAND"-ed; however, if a character in C1 is one of several that have a summary character ("Has Summary"), then the user will be "ADVISE"-d and can choose whether or not C1 is modified and how it is done.
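Since the choice of operation depends only on the context and the type of relationship, the table of operations can be held as a lookup keyed on those two values. The sketch below transcribes the first ("Retrieve") column of Figure 13; the second column is left out because its context name is not recoverable from the text here. The names `TABLE` and `operation_for` are hypothetical.

```python
# First operation column of Figure 13, described in the text as "Retrieve".
RETRIEVE_OPERATIONS = {
    "has synonyms": "EXPAND",
    "is a synonym": "EXPAND",
    "is general state": "ADVISE",
    "has general state": "ADVISE",
    "is dependent": "ADD",
    "has dependent": "NO OP",
    "has summary": "ADVISE",
    "is summary": "EXPAND",
    "is redundant": "EXPAND",
    "has redundant": "EXPAND",
    "name extension": "ADVISE",
}

# Further contexts would be added as further entries, without new code.
TABLE = {"Retrieve": RETRIEVE_OPERATIONS}


def operation_for(context, relationship_type):
    """Look up the operation for a relationship type in a given context."""
    try:
        return TABLE[context][relationship_type.lower()]
    except KeyError:
        raise KeyError(
            f"no operation specified for {relationship_type!r} "
            f"in context {context!r}") from None
```

Keeping the entries as data is one way to let the user, initially the system designers, specify them, and it means a new context or relationship type is a new entry and not a code change.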

Conceptually speaking, we can consider state-based relationships as conjunctive and disjunctive conditional fragments that can be used to modify a set C1. In fact, our state-based relationships could be directly represented and stored in this manner for simplicity and efficiency.

We do not address the implementation of character list processors in this paper, but we do point out that whenever each character in C1 is modified in producing C2, additional relationships may arise. This would typically occur in chaining of dependent relationships or when states with synonyms or general states are added. We assume that the implementation of the processor maintains a list of dependencies used to produce C2 in order to avoid undoing or redoing operations. The user should have the option of allowing the character list processor to continue until all changes are made or to review the current C2 as each change is made.
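Although the implementation is left open, the chaining behaviour can be illustrated with a small worklist loop that records which conditions have already been handled. This is a sketch under assumptions, not a proposed design: it handles only the ADD operation for dependent relationships, `primary_of` is a hypothetical lookup from a dependent condition to the primary condition it requires, and the second link in the example chain is invented.

```python
from collections import deque


def add_primaries(c1, primary_of):
    """Repeatedly ADD the primary condition of each dependent condition.

    Returns the enlarged condition list and a log of (dependent, primary)
    pairs. The `seen` set prevents redoing work and guarantees termination
    even if the relationships form a cycle.
    """
    result = list(c1)
    seen = set(c1)
    log = []
    queue = deque(c1)
    while queue:
        condition = queue.popleft()
        primary = primary_of.get(condition)
        if primary is not None and primary not in seen:
            seen.add(primary)
            result.append(primary)
            log.append((condition, primary))
            queue.append(primary)  # the added condition may chain further
    return result, log


kidney = ("body behind the neck", "shape", "kidney")
nonvermiform = ("body", "kind", "nonvermiform")
present = ("body", "presence", "present")  # hypothetical second link

primary_of = {kidney: nonvermiform, nonvermiform: present}
c2, log = add_primaries([kidney], primary_of)
```

The log plays the role of the list of dependencies mentioned above, and it could also drive a step-by-step review in which the user inspects C2 after each change.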

Basic properties constitute the main focus of this paper; however, some of these ideas can carry over to the structural level as well. In particular, the concept of name extension can be used with structures to improve the structural decomposition. The example in Figure 3 of spermatocytes in the anterior and posterior end of the testis can be handled by making these positional modifiers name extensions of testis; see [13] for an extended discussion of alternative decompositions including the use of name extensions.

Unlike the case of name extensions in properties, the semantics and implementation issues of name extensions in structures are less straightforward, though we believe there are important advantages in using them there. A field for each structure/substructure name is needed in the record, and a field for a possible name extension must be provided as well. Given that a record consists of a single character and its state or value as assumed in Figure 8, the name extension field for each structure/substructure would apply to that datum. However, if one opted for a record format that has multiple properties per record, then another field would be necessary to indicate which properties the name extensions applied to. This would overly complicate the storage mechanism and the character list processor. This is another reason we opt for the record format of Figure 8.

Schema evolution has been addressed extensively. The focus has been primarily on structural changes such as adding and deleting attributes and classes. We will limit our discussion to schema changes in the context of state-based relationships since changes to the hierarchy at the property level and above would be analogous to structural changes in a schema.

If a property participating in a state-based relationship were moved from one structure to another, for affected relationships we would change the name of the structure S in any lexicon L for that property. Splitting a property into two properties would be straightforward if the property contained multiple lexicons and the lexicons remained unchanged. More complex changes, where a lexicon is split into two or more lexicons, would be similar to adding and deleting states in a lexicon as discussed below. As one might expect, some changes can be handled automatically by the system, but some would require intervention. In all cases, the system should issue a warning if relationships will be affected by the changes.

In the course of using the schema, the most frequent changes involve adding states and adding properties. In the latter case, no state-based relationships exist for a new property and no changes in existing relationships are needed, though a new relationship may need to be established, as would be the case for redundant or certain dependent relationships such as presence. Adding states to a lexicon is a very common situation. In some cases, the system would do nothing, such as when the new state is merely added to CS, the set of cited states, but not to DS, the set of display states. In other cases, the system could automatically extend the relationship. For instance, if a new state is added to the DS, and the DS is dependent on another property, then the relationship could automatically extend to the new state. For example, if another state is added to the property shape in Figure 5, which is a dependent property, then the new state along with the existing states can be assumed to be dependent on the state nonvermiform in property kind in the part body. The same would hold if a state were inserted into or outside of, but not at the boundary of, an ordered subset {dsᵢ, dsᵢ₊₁, ..., dsᵢ₊ₘ} of DS. If it were placed at the boundary, then automatically determining whether it participated in the relationship would be difficult, as would be the case if the states were unordered.
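The automatic extension for the Figure 5 case can be sketched as follows. The rule implemented here is a conservative reading and an assumption: a primary state is extended to the new state only when it already maps to every existing display state, and any other nonempty mapping is reported for manual review. The function name is hypothetical and "lemon" is an invented new state.

```python
def add_display_state(display_states, dependency, new_state):
    """Add new_state to DS and extend a dependent relationship where safe.

    dependency maps each primary state to the set of dependent display
    states it allows. Returns the new DS, the new dependency, and the
    primary states whose mapping needs manual review.
    """
    old = set(display_states)
    new_display = list(display_states) + [new_state]
    new_dependency = {}
    needs_review = []
    for primary, targets in dependency.items():
        targets = set(targets)
        if old and targets == old:
            # The whole DS was dependent on this state, so the new state is too.
            new_dependency[primary] = targets | {new_state}
        else:
            new_dependency[primary] = targets
            if targets:
                needs_review.append(primary)  # partial mapping: ask the user
    return new_display, new_dependency, needs_review


shape_states = ["kidney", "pear", "irregularly swollen", "spheroid"]
kind_to_shape = {
    "nonvermiform": set(shape_states),
    "intermediate": set(),
    "vermiform": set(),
}
# "lemon" is a hypothetical new shape state.
new_ds, new_dep, review = add_display_state(shape_states, kind_to_shape, "lemon")
```

The review list corresponds to the cases where the system should issue a warning instead of acting on its own, such as insertion at the boundary of an ordered subset.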

Deleting states is less likely to occur from CS since these are taken from the literature. Some deletions from a DS would occur since the role of a state may change. Generally speaking, deletions can be done automatically, though warning of existing relationships should be given in case a lexicon is being divided as mentioned above. Also, a relationship could be rendered obsolete by deleting the last element in its domain or range necessitating a warning as well.

In this paper, we have presented the key concept of basic property and its features. Whether or not a system implements basic properties per se, the concept itself can aid in creating more uniform and consistent character sets. Additionally, basic properties provide a mechanism for representing and utilizing state-based relationships, which are difficult to avoid in any large set of characters. While new generations of relational and object-oriented database systems may make it possible to implement these ideas more efficiently, it would be too much to expect each group interested in building a biological database to do its own implementation. Thus, if large scale biological databases are to exist, be used effectively, and be integrated, the problems discussed here will have to be addressed with the goal of creating a kind of generic BioDBMS. We believe our ideas will contribute to this goal, though actual design and implementation will require a large scale effort. Though much remains to be investigated in what we have presented, we believe this is a solid start in the right direction.

1. R.J. Pankhurst, Database design for monographs and floras, Taxon 37, 733-746, (1988).

2. R. Allkin, R.J. White and P.J. Winfield, Handling the taxonomic structure of biological data, Mathl Comput. Modelling 16 (6/7), 1-9, (1992).

3. M.J. Dallwitz, DELTA and INTKEY, Advances in Computer Methods for Systematic Biology, Chapter 18, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).

4. M.J. Dallwitz and T.A. Paine, User's guide to the DELTA system: A general system for processing taxonomic descriptions, 3rd edition, CSIRO Aust. Div. Entomol. Rep. No. 13, pp. 1-106, (1986).

5. R. Allkin and F.A. Bisby, Editors, Databases in Systematics, Systematics Association, Vol. 26, Academic Press, London, (1984).

6. R.J. White, R. Allkin and J.P. Winfield, Systematic databases: The BAOBAB design and the Alice system, In Advances in Computer Methods for Systematic Biology, Chapter 19, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).

7. R.J. Pankhurst, Taxonomic databases: The Pandora system, In Advances in Computer Methods for Systematic Biology, Chapter 14, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).

8. R.J. White and R. Allkin, A language for the definition and exchange of biological data sets, Mathl Comput. Modelling 16 (6/7), 199-223, (1992).

9. H. Saarenmaa, S. Leppäjärvi, J. Perttunen and J. Saarikko, Object-oriented taxonomic biodiversity databases on the World Wide Web, from an international workshop: Internet Applications and Electronic Information Resources in Forestry and Environmental Sciences (1-5 August 1995, European Forest Institute, Joensuu, Finland) and available through the web at http: //www. of joensuu. f i _",saaren.ma/oobdwww-nature‑

10. J. Diederich and J. Milton, Expert workstations: A tool-based approach, In Advances in Computer Methods for Systematic Biology, Chapter 7, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).

11. J. Diederich and J. Milton, NEMISYS: A computer perspective, In Advances in Computer Methods for Systematic Biology, Chapter 10, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).

12. R. Fortuner, The NEMISYS solution to problems in nematode identification, In Advances in Computer Methods for Systematic Biology, Chapter 9, (Edited by R. Fortuner), Johns Hopkins University Press, Baltimore, (1993).

13. J. Diederich, J. Milton and R. Fortuner, Construction and integration of large character sets for nematode morpho-anatomical data, Fundamental and Applied Nematology 20 (to appear).

14. R. Allkin and F.A. Bisby, The structure of monographic databases, Taxon 37, 756-763, (1988).

15. G. Wiederhold, Views, objects, and databases, Computer 19 (12), 37-44, (Dec. 1986).

16. R. Allkin, N.P. Moreno, L. Gama Campillo and T. Mejia, Multiple uses for computer-stored taxonomic descriptions: Keys for Veracruz, Taxon 41 (3), 413-435, (1992).

17. P. Chen, The entity-relationship model: Toward a unified view of data, ACM TODS 1 (1), 9-36, (Mar. 1976).

18. M. Hammer and D. McLeod, Data description with SDM: A semantic data model, ACM TODS 6 (3), 351-386, (Sept. 1981).

19. J. Diederich and J. Milton, Creating domain specific metadata for scientific data and knowledge bases, IEEE Trans. on Knowl. and Data Eng. 3 (4), 421-434, (Dec. 1991).

20. Special Issue on Scientific Databases, Bulletin of the technical committee on data engineering, IEEE Computer Society, Volume 16, No. 1, Washington, DC, (1993).

21. A. Shoshani, A layered approach to scientific data management projects at Lawrence Berkeley laboratory, In Data Engineering, IEEE Computer Society, Volume 16, No. 1, pp. 4-8, Washington, DC, (1993).

22. Y. Ioannidis, Desktop experiment management, In Data Engineering, IEEE Computer Society, Vol. 16, No. 1, pp. 19-23, Washington, DC, (1993).

23. J.B. Cushing, D. Hansen, D. Maier and C. Pu, Connecting scientific programs and data using object databases, In Data Engineering, IEEE Computer Society, Volume 16, No. 1, pp. 9-13, Washington, DC, (1993).