Originally published in : Fortuner, R., 1993a (Editor). Advances in computer methods for systematic biology: artificial intelligence, databases, computer vision. (C) 1993,  The Johns Hopkins University Press, pp. 103-123. Published on this website with the permission of The Johns Hopkins University Press

Copyright California Department of Food and Agriculture. Use of this material is limited to publication on this website. If this website is linked to or by another website, then the linking site shall attribute the material to the California Department of Food and Agriculture. Published on this website with the permission of the California Department of Food and Agriculture.
 

THE NEMISYS SOLUTION TO PROBLEMS IN NEMATODE IDENTIFICATION

 Renaud Fortuner


The present chapter gives an example of an identification problem and its solution by modern computing techniques.  This example, using NEMISYS (NEMatode Identification SYStem), illustrates and expands previous discussion on this subject (Pankhurst, Chapter 8, this volume).  NEMISYS is described from the computer science point of view by Diederich and Milton (Chapter 10, this volume).
 

THE IDENTIFICATION OF NEMATODES

There is a need to identify all plant-parasitic nematodes, not just a few common species, because even a rare species may be destroying crops.  This is also true for animal- and man-parasitic nematodes.  Predatory nematodes and insect-parasitic nematodes must be identified for possible applications in biological control.  Even the so-called free-living nematodes (mycophagous, saprophagous forms) that have no obvious economic effect must be identified because they are the most numerous animals on Earth and any study of biological diversity that does not address the identity of nematodes would be incomplete.
 Identifying nematodes, particularly the minute plant-parasitic species, is a difficult prospect and few people have the required expertise.  Even these few experts need identification aids, but the difficulty of the subject makes the design of such aids a formidable task.  Some characteristics of nematode identification will now be reviewed to define the requirements for a successful identification aid.

Circumstances of Identification

The design of a system will depend on its projected use by nematologists identifying nematodes.  From my experience, I have characterized several general situations for nematode identification.

Field Survey.  Typically, a general field survey is conducted on plants or geographical areas where the nematode fauna is unknown.  Field surveys are often conducted for the practical purpose of identifying existing nematode attacks on cultivated plants.  Here, only plant-parasitic nematodes need to be identified, and often identification is made only for the species in genera that are dangerous parasites.  More general surveys attempt to identify all species present, including forms that are not plant-parasitic.
 A priori selection of the most likely species is impossible if nothing is known of the circumstances (host/location) where the survey is conducted.  Sometimes the surveyor has some idea on the species likely to be present (based on experience with similar plants/areas) but surprises can and do occur.  Focusing the identification process on the most likely species can be used with prudence if a previous survey has been conducted under similar circumstances.
 Consultation done by a systematist who identifies slides received from colleagues is a special case within this first category.  Depending on the origin of the slide and the experience of the systematist, focusing on likely species may or may not be used.

Field Test Follow-up.  Nematological field tests are used for the study of chemical control, varietal resistance, population dynamics, etc.  The field where a test is established must be checked for the species it harbors.  After this initial survey, the nematode population is well known.  Focusing on these species can offer valuable identification short-cuts for the following checks of the nematode populations.

Regulatory.  Evaluation of plant samples sent to a regulatory agency presents special circumstances.  Typically, the samples have been treated with nematicides and they should be free of any nematodes.  Specimens are sometimes observed because of unsuccessful treatment and they must be identified.  The samples come from many different countries and plants.  Focusing on likely species is possible when the observer knows what nematodes are usually found on the particular host and origin of the sample.  The regulatory agency may have a list of species under quarantine.  If a plant-parasitic nematode shows up in the sample despite treatment, identification should make certain that it does not belong to any of the species in that list.

Nematode Identifiers

Across-the-board Experts.  There are several types of persons who identify nematodes.  Across-the-board experts are able, or claim that they are able, to identify any nematode they observe.  Such expertise may have existed fifty or a hundred years ago when only a few nematode species had been described, but nobody today can truthfully claim to belong in this category.

Experts with Limited Expertise.  Most expert identifiers have limited expertise.  They can identify species easily, rapidly, and accurately in certain groups of nematodes, for example from some genera or some families.  For example, experts in plant-parasitic nematodes can identify common species in the genera that are most damaging to cultivated plants.  They can identify on sight maybe half to two-thirds of the genera from a total of one to two hundred genera and perhaps a hundred out of the 3,700 nominal species in this category.  These experts perform most plant-parasitic nematode identifications, but they are fast disappearing, as traditional systematists are not being replaced as rapidly as they are retiring, or are being replaced by so-called molecular systematists.  Molecular identification is a promising technique for selected species, but it is questionable whether molecular probes can be defined or used for thousands of species.

General Practitioners in Nematology.  I call general practitioners in nematology people who study nematode control, nematode biology, population dynamics, etc.  They have a broad knowledge of nematodes, including their gross morphology, and they can recognize the most important characters.  Some work on only one or a few species, others (ecologists for example) are interested in more species.  Most do not do their own identifications.  They can recognize the few species they see frequently during their work, but they cannot identify a form they have never encountered before.  They rely entirely on experts for any identification problem.  With the rapid disappearance of the experts, general nematologists are forced to start doing identification.

Students.  Students are a category apart.  At the beginning they have no knowledge of nematodes, but as they work toward their PhD they learn the state-of-the-art in identification techniques.  By the time they obtain their degree, most can be classified as expert identifiers.  Yet, as they enter the professional field, they quickly become general practitioners in nematology.  They cannot maintain and update their skills and soon are unable to identify.

Users of an Identification Aid.  Nematode identifications traditionally are made by a small group of experts, knowledgeable in only a fraction of existing nematode species.  These people do not really need identification aids but just a quick reminder of the diagnostic features of particular taxa.  For decades, they have been using dichotomous keys and notebooks with copies of published species descriptions.
 As opposed to traditional identifiers, non-experts are not interested in identification per se, as they are not taxonomists.  Rather their interest is only in finding a name for the forms they observe.  They do not want to waste any more time than is absolutely necessary for an activity they regard as secondary, and they insist on getting an answer in a few seconds.
 Traditional experts also want speed from an identification aid.  Most of their identifications are done by instant recognition.  When they must identify a form out of the area of their expertise using an identification aid, they want to reach the answer as quickly as they do on their own.
 Speed is an important requirement for the identification process.  This means that the system hardware and software response must be quick, but also that the identification procedures must not require the identifier to follow laborious and detailed operations.  An important point is the amount of data required by the system to reach an answer.  An expert accustomed to using half a dozen characters to identify species will resent having to enter twenty characters to do the same job.  Casual users also want as few requirements as possible for data entry.  This means using few characters and easy data entry.  For example, while frequent users may be willing to learn codes for data entry, occasional users hate to waste time learning that tail shape is character F or that conoid tail is state 3.
 Data entry is easier with natural language than with codes, but even this can stymie occasional users.  While experts may understand the difference between tail shape more curved dorsally and tail shape dorsally convex-conoid, general practitioners may be baffled by the unfamiliar terminology and thus be unable to make a choice.  For them the easiest form of data entry will be picking the right shape out of a table of illustrations.

Nematode Domain

Biological Material.  A sample typically includes the roots of a single plant, with the attached soil (about one liter) called the rhizosphere.  Above-ground plant parts (either vegetative parts, leaves, stem, or reproductive parts, seeds) are collected separately where the presence of above-ground nematodes is suspected.  Above-ground parasites are exceptionally found in the ground; ecto-parasitic species are found only in the rhizosphere, never in the root tissues; endo-parasitic nematodes typically found in the roots may also be present in the rhizosphere in large numbers at certain points of the life cycle, but be almost absent at other times.  Often, a species is represented in a field sample by only a few specimens, occasionally by a single specimen.
 The nematodes are extracted from the sample by an appropriate method.  Then, they are examined under the dissection microscope for the identification of broad categories (typically at the family to genus levels).  Finally, examination with the compound microscope is necessary for identification at genus and species level.  This last step can use freshly heat-killed specimens in temporary mounts, or specimens killed, fixed, and mounted in glycerine in permanent mounts.  Mounted specimens can be distorted by the fixation process or old age.  Identification is made typically on one to half a dozen specimens, and rarely on more than ten specimens.  These few specimens are often in poor condition and will not allow observation of all the diagnostic characters.
 There are few expert identifiers, because it is particularly difficult to identify nematodes.  Identification characters are inconspicuous, ambiguous, variable, and are often very difficult to record.

Conspicuity.  Plant-parasitic nematodes are very small animals, typically 0.5 to 1.5 mm long, and 20 to 30 µm in diameter.  Their body envelopes are transparent and the internal organs can be seen without dissection.  Yet, most features are seen only under the highest powers of optical microscopes, with an oil immersion objective, 1,000X magnification, and phase microscopy.  Even in the best conditions of observation, most features are almost at the limit of the separating power of the optical microscope.  Some external (cuticular) features are seen only with a scanning electron microscope (SEM), under magnifications of thousands or tens of thousands.
 The inconspicuousness of many morphological structures makes it very difficult for non-experts to discover their characteristics.  For example, in the genus Pratylenchus the first character used in all published keys is the number of annuli in the lip region, two, three, or four.  Two annuli are easy to see, but three or four annuli crowded together to fit on the low lip region are inconspicuous.  Experts know that when you can see the annuli, there are two; when you cannot see them, there are three or four (Loof in Fortuner 1989).  Experts have to make good guesses for the character they are trying to capture, and there is a high risk of error attached to this guessing game, even higher for casual users.
 
Ambiguity.  An ambiguous character is understood here in both senses of the expression as 1) doubtful or uncertain, especially from obscurity or indistinctness, and 2) capable of being understood in two or more possible senses.
 Characters attached to an organ with low conspicuity will have high ambiguity.  High ambiguity also occurs with highly conspicuous organs.  For example, spicules are highly conspicuous male copulatory organs, often rose-thorn shaped.  Spicule length can be measured along the dorsal edge, along the ventral edge, or along the longitudinal axis of this organ, each with very different results.  In another example, found in the genus Trilineellus, the very conspicuous lateral fields are composed of two longitudinal ridges running along the lateral sides of the body.  In lateral view, the edges of each ridge are seen as two well marked lines.  In most forms with two ridges the lateral fields are usually seen with four lines, but in Trilineellus the ridges touch over most (but not all) of their length.  It is not clear, in spite of the name of this genus, whether it should be described with three or with four lines.

Variability.  Besides the usual normal distribution of measurements found in most biological characteristics, there is a high variability caused by the environment.  For example, the length of the body of some Ditylenchus species can vary by a factor of two with different host plants (many observations reviewed by Fortuner 1982).  Also, when the progeny of a single parthenogenetic female of Helicotylenchus dihystera was raised on ten different host plants, all body measurements varied, and the variation was statistically highly significant (Fortuner and Quénéhervé 1980).  Qualitative characters also vary, and the variation itself is variable.  For example, the shape of the fusion of the lateral field lines is constant in H. dihystera (all specimens have a Y-shaped fusion), but it is highly variable in H. pseudorobustus, a closely related species (Fortuner et al. 1984).  The type population of this species had 20% of specimens with a Y-shaped fusion and 80% with a U-shaped fusion.  These percentages varied in other populations from 0% Y-shape / 100% U-shape to 100% Y-shape / 0% U-shape (Fortuner 1986).  The character shape of fusion is diagnostic for H. dihystera but it is useless for H. pseudorobustus, although these two species are so close that they can be differentiated best by multivariate statistics (Fortuner et al. 1984).
 The extent of the variability is known only for a small fraction of described species.  A sampling of the database NOMEN revealed that almost two-thirds (61%) of the species of plant-parasitic nematodes are described from their type population alone and that another 25% of the species have been redescribed from only one other population.  Only 5.6% of the species in the database sample are known from more than four populations (including their type population).  In the genus Helicotylenchus, type populations of half the species published since 1972 have been described from ten or fewer paratypes (Fortuner 1984).  Many species are known from only a few specimens found in one or two populations.  Because of the small sample sizes, characters are often described as constant, whereas well described species often show extensive variation.  Variability exists, even when its existence cannot be revealed by limited studies.

Easy Characters.  Easy characters are characters that are easy to record.  They are attached to conspicuous organs.  They are not ambiguous and they can be recorded with little risk of error.  They do not vary, or they have limited variation in the sense that a definite gap exists between the values in related species.  There are very few easy characters in nematode morphology, particularly for identification to the species level.  This means that difficult characters, i.e., characters that do not meet all or some of these requirements, must be used for species identification and that a high risk of error exists in recording the character values, particularly by non-expert identifiers.
 Because of this, anyone attempting to identify an unfamiliar form will feel some anxiety when reaching an answer.  Is this the correct answer, or was a wrong decision made that affected subsequent steps in the process that led to the wrong species?  A good identification system must reduce the risks of reaching the wrong answer by a high built-in robustness.  Then, one wrong answer will not jeopardize the entire identification session.  An even better system will calm the fears of the user by offering several independent means to verify the correctness of the answer.
 The list of easy characters varies from taxon to taxon, and a character that is easy in one species or genus may be difficult in a related taxon.  A general system for identification of all the species in a category (e.g., all plant-parasitic nematodes, or all species in the order Tylenchida) must include all the easy characters for all the forms in the group.  In Tylenchida, the order containing 90% of the plant-parasitic nematodes, we found that more than 400 characters have been used so far in the literature.  It is not yet clear how many of these characters happen to be easy for at least one species in the order.  It is expected that their number will reach at least several dozen, if not hundreds of characters.  This is at variance with the traditional identification aids for a particular genus that use at most one to two dozen characters.
 As a result of inconspicuous features, ambiguous characters, and large intra-specific variability, there is a high risk of missing data, when the identifier cannot see an organ or cannot decide what is the true state of a character, and of errors, when the identifier disregards these difficulties and tries to provide an answer at any cost.  Of course, no system can give an accurate answer when fed totally erroneous data, but a good identification system must have what is called graceful degradation.  There are several meanings attached to this expression and here it will be taken to mean that the system should lose its ability to provide an accurate identification only gradually as the number of missing or erroneous characters increases.

Requirements

To summarize the requirements resulting from these considerations, an identification system should be:
(1)  a universal system, valid for all nematodes, or at least all species in a particular group.  For example, a system could include all plant-parasitic nematode species, or a particular order;
(2)  a flexible system, that can be used by experts and generalists alike and that is consistent with the different ways each type of user works;
(3)  an easy to use system, and thus accessible to persons who are not experts in identification or not experts in computer use;
(4)  a fast system, requiring minimum data entry and delivering a speedy answer; and
(5)  a reliable system, taking into account the reliability of the data depending on the group observed.
 In other words, the system must be quick, easy and reliable.  These requirements are more fundamental than the usual bells and whistles of user friendliness.  An identification system that does not meet them all will be unable to offer reliable identification, and it will not be used, particularly by occasional users.  As we will now see, traditional identification systems fail to meet at least one of these requirements.
 

EXISTING AIDS TO IDENTIFICATION

Dichotomous keys

Dichotomous keys are deterministic in the sense that their authors assume that occurrences in nature (here, the identification of a particular species) are completely determined by antecedent causes (here, the presence of a particular state or value of a character).  At each line of the key, the presence of a particular character state in the unknown specimen determines that the answer will be found in a particular section of the rest of the key.  On the other hand, the probabilistic approach assumes that the observation of a particular character state or value only makes the identification of a particular species more or less probable.  This statement may surprise some readers because morphological descriptions are typically given in a non-probabilistic manner.  The true nature of biological data needs to be examined in more detail.
 Most quantitative characters, measurements (real numbers) and most counts (integers), are normally distributed and they are probabilistic in nature.  For example, if the length of an organ has mean M and standard deviation s in taxon T, about 95% of the specimens in T have organ length between M - 2s and M + 2s.  In other words, each value within the range M ± 2s only has a certain probability to occur in a specimen of the species.  Even if the entire range of values is counted as one state, an individual value has only a 95% chance to belong to this state.
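
 The 95% figure can be checked with a short computation.  The following is only a minimal sketch, assuming the character is exactly normally distributed; the function name prob_within is hypothetical and is not part of NEMISYS or NEMAID.

```python
import math


def prob_within(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its mean (M - k*s to M + k*s)."""
    return math.erf(k / math.sqrt(2))


# Share of specimens expected inside M +/- 2s (slightly more than 95%).
print(f"within 2.00 s: {prob_within(2.0):.4f}")
# Exactly 95% corresponds to roughly M +/- 1.96s.
print(f"within 1.96 s: {prob_within(1.96):.4f}")
```

 Under this assumption, roughly one specimen in twenty falls outside the range M ± 2s, which is why a range treated as a single state can never be fully deterministic.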
 Some quantitative characters (integer) are constant in a given taxon.  Others can take more than one value in a given taxon, but these values are not normally distributed.  Such quantitative characters behave like qualitative characters.  In published descriptions, most qualitative characters are given as character C with state S.  This type of statement seems to support the belief that taxonomic data are deterministic.  Some will argue that deterministic statements are only an extreme of probabilistic statement with probability P = 100%.  State S always present might be read as state S has 100% chance of being present.  However, most biologists will probably look at such arguments as hair splitting.
 Still, even those biologists would agree that apparent determinism is often caused by insufficient data.  For example, note the variability of the shape of lateral field line fusion as described above for H. pseudorobustus.  This fusion is described as Y-shaped in H. inifatis, but this species was described from four specimens found in a single population (Fernandez et al. 1980).  There is no assurance that other populations of this species (or even other specimens from its type population) will never show the U-shaped fusion.
 Still unconvinced, those biologists would then argue that there exist true deterministic statements that are valid in every specimen of the taxon.  All Helicotylenchus do have four lines in their lateral field.  I prefer to say that deterministic statements are valid until specimens are found that contradict them.  For example, the genus Pratylenchus has always been described, among other characters, with lip area low, flattened anteriorly and with oesophageal glands overlapping the intestine for a medium distance.  Most Pratylenchus have tail short, cylindroid, with broadly rounded end.  The species P. morettoi was placed into this genus because it has most of its other diagnostic characters, but P. morettoi has lip area dome-shaped, oesophageal overlap elongate and tail conoidal with a somewhat pointed end and a terminal projection (Luc et al. 1986).  These characters were deterministic in the genus until the discovery of P. morettoi proved that they were probabilistic.  Now, it can only be said that 99.999% of the specimens in Pratylenchus have low lip, medium sized overlap, and short cylindroid tail.  From this perspective, no statement will ever be given as completely deterministic, i.e., with a 100% probability.  We should say that all Helicotylenchus have a 99.99999% chance to have four lines in their lateral field.
 When intra-specific variability is given for qualitative characters, it is most often reported in a very imprecise manner.  Each state observed in the taxon is said to be mostly present, sometimes present, rarely present, etc.  The data in such cases are definitely probabilistic, and it is possible to attach probabilities to them.  For example, the nematode taxonomists participating in the NEMISYS International Project were asked what percentages they would attach to such qualifiers.  It was found that their answers are normally distributed.  There were 22 respondents out of the group of 58 expert identifiers in the project.  From their answers, a mean percentage of, e.g., 76.9% can be attached to the term mostly with standard deviation of 8.5, and range of individual answers from 60 to 90%.
 Biological data are probabilistic, whether biologists accept that fact or not.  It is true that many descriptions do not describe the variability, but that does not mean it does not exist.  I have shown elsewhere (Fortuner 1986; 1987) that builders of dichotomous keys have to force intra-specifically variable data into the straitjacket of these deterministic identification devices by pretending that intra-specific variability does not occur.
 Experts have been using dichotomous keys for 200 years because their knowledge of variability allows them to take this phenomenon into account as they search for the right answer.  Also, they often use keys as a quick reminder of diagnostic characters for the species in a well known genus.  When I suspect that I have the species Pratylenchus penetrans for example, I can quickly go through the well-thumbed key of Loof (1978) and verify that the specimen does have all the right characters for this species.
 Conversely, non-experts and experts out of their field of expertise do not know the extent of intra-specific variability in the group covered by the key.  They are unable to compensate for the intrinsic limitations of dichotomous keys and in their attempts to use these devices for identification they either get stuck (no answer) or they run a great risk of reaching the wrong answer.  Dichotomous keys offer absolutely no graceful degradation as all the characters in the key must be provided by the user (no missing data) and as a single wrong answer will send the user to the wrong end of the key.
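
 The brittleness of a dichotomous key can be illustrated with a toy example.  This is only a sketch: the two-couplet key, the character names, and the placeholder species A to D are all hypothetical and do not reproduce any published key.

```python
# A toy dichotomous key stored as nested couplets (hypothetical data).
# Each couplet asks one yes/no question; leaves are species names.
KEY = {
    "question": "lip_region_has_two_annuli",
    True: {"question": "tail_is_conoid", True: "species A", False: "species B"},
    False: {"question": "tail_is_conoid", True: "species C", False: "species D"},
}


def run_key(node, observations):
    """Follow the key couplet by couplet.

    Returns the species at the end of the path, or None when a
    required character was not observed (the user is stuck).
    """
    while isinstance(node, dict):
        answer = observations.get(node["question"])
        if answer is None:
            return None  # missing data: the key cannot proceed
        node = node[answer]
    return node


correct = {"lip_region_has_two_annuli": True, "tail_is_conoid": False}
one_error = {"lip_region_has_two_annuli": False, "tail_is_conoid": False}
missing = {"tail_is_conoid": False}

print(run_key(KEY, correct))    # species B
print(run_key(KEY, one_error))  # species D: one wrong answer, wrong branch
print(run_key(KEY, missing))    # None: no answer at all
```

 In this sketch a single misread character leads to a different species with no warning, and a single unobserved character stops the identification altogether.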

Tabular Keys

Tabular keys offer a better approach to identification because they are polytomous (all characters are considered together instead of a single character at a time in dichotomous keys).  This will allow for some errors, as the rest of the data will still point to the right answer.  The tabular arrangement of the data makes it possible to note intra-specific variability.  Finally, an identification can proceed in spite of missing data, as the identifier can rely on the characters that are not missing.  However, graceful degradation in a tabular key depends entirely on the expertise of the user, who must decide how many mismatches, and how many missing characters can be overlooked in the final identification.  Intrinsically, tabular keys do not have graceful degradation.  Printed tabular keys are useful for species identification in a small genus (with no more than two or three dozen species).  They are too cumbersome for larger groups of species.  The only way to use tabular keys for identification of the 3,700 species of plant-parasitic nematodes would be to offer a set of tables, each with a few taxa, at the order, superfamily, family, genus, and species levels.  An occasional user would quickly get lost in such a maze, and when an answer is reached, there would be no way to assure the user that he or she did not make a wrong choice somewhere.
 Another disadvantage of tabular keys is that they force the user to gather data for far too many characters.  A tabular key at the genus level includes at least a dozen characters.  Experts can identify most species in a well-known genus by looking at fewer than half a dozen characters, typically three or four.  They will resent having to do twice to four times the amount of work when using a tabular key.

Coefficient of Similarity

Computation of similarity is another polytomous approach.  It offers the added advantage of considering all the taxa together, instead of one at a time in the sequential devices described above.
 Typically, a coefficient of similarity scores each couplet of characters found in the unknown and in one species as similar (score 1) or dissimilar (score 0).  The average score, with or without weighting, gives a measure of the morphological similarity between the unknown and the species.  The same computation is made for all the species in the group considered, and a list of the most similar species is presented to the user.
 There are many different ways to compute the coefficient of similarity (Sneath and Sokal 1973).  One that fits all kinds of characters is the coefficient of general similarity of Gower (1971b).  I have modified Gower's formula to account for intra-specific variability (Fortuner and Wong 1984; Fortuner 1986).  The modified algorithm is used in a custom-made computer identification application, NEMAID.  The current NEMAID program (version 3) uses different algorithms for qualitative and for quantitative characters.
 With quantitative characters, the algorithm compares the mean value Cu of the character in the unknown to the mean value Cs of this character in a species.  Also used in the algorithm are the values R and p.  The value R is the difference between the highest and the lowest specific values for this character in the genus, p is a percentage of correction that depends on the intra-specific variation observed in the genus.  The percentage p is proposed by an expert, but it can be modified by the user.  It is used to compute a correction value P for the couplet unknown/species equal to:

P = [(Cu + Cs) / 2] p

The similarity S for quantitative characters is equal to:

S = 1 - [(|Cu - Cs| - P) / (R - P)]

The difference |Cu - Cs| is taken into account only when it is larger than the threshold value P.  The similarity S is taken as equal to 1 when the difference |Cu - Cs| is smaller than P; it starts decreasing as the difference becomes larger than P.  It would be equal to 0 if the smallest and the largest species in the genus were compared (|Cu - Cs| = R).
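
The two formulas above can be sketched in a few lines of code.  This is an illustration under stated assumptions, not the NEMAID source: the function name is hypothetical, p is passed as a fraction (0.05 for 5%), and the result is clamped at 0 for an unknown lying outside the range of the genus, a case the text does not discuss.

```python
def quantitative_similarity(cu: float, cs: float, r: float, p: float) -> float:
    """Similarity S between an unknown and a species for one measurement.

    cu: mean value of the character in the unknown (Cu)
    cs: mean value of the character in the species (Cs)
    r:  highest minus lowest specific value in the genus (R)
    p:  correction for intra-specific variation, as a fraction
    """
    threshold = (cu + cs) / 2 * p      # correction value P
    diff = abs(cu - cs)
    if diff <= threshold:
        return 1.0                     # difference within the tolerance
    if r <= threshold:
        return 0.0                     # assumption: degenerate range
    s = 1 - (diff - threshold) / (r - threshold)
    return max(0.0, s)                 # assumption: never below 0


# Hypothetical measurements (e.g., an organ length in micrometres).
print(quantitative_similarity(cu=26.0, cs=27.0, r=10.0, p=0.05))  # 1.0
print(round(quantitative_similarity(25.0, 27.0, 10.0, 0.05), 3))
print(quantitative_similarity(20.0, 30.0, 10.0, 0.05))            # 0.0
```

The first call falls inside the tolerance P, the second is penalized only for the part of the difference that exceeds P, and the third compares the two extremes of the genus.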
 Qualitative characters are represented by two or several character states.  The percentages of specimens in a given species that exhibit each state of a qualitative character are recorded in the database.  For example, the fusion of lateral field lines for H. dihystera is recorded as: state Y-shape: 100%; state U-shape: 0%.  For the type population of H. pseudorobustus it is recorded as: Y-shape: 20%, U-shape: 80%.  All described populations of a species are entered individually in this manner in the database.  As explained above, actual percentages are rarely given and the imprecise terms (sometimes, often, rarely, etc.) that are found in descriptions have to be translated into percentages.
 When several populations have been described for one species, two values M (the mid range) and A (half the range) can be calculated for each character state.  This computation uses K1 and K2, the smallest and the highest percentages observed for this state in the described populations of the species.

M = (K2 + K1) / 2
and
A = (K2-K1) / 2

The values M and A are computed and stored for all the species in the database.  The percentage U of specimens that have the same character state is recorded for the unknown population and the coefficient of similarity s for this state is computed by:

s = 1 - (|U - M| - A)

The same process is followed for all the states of this character.  If the character has n states, the final similarity is computed as:

S = (Σ s) / n

Here again, the intra-specific variability of the character is taken into account as far as it has been recorded for each species.  Species described from a single population have K1 = K2, so that M is the single recorded percentage and A = 0.  The computation of the similarity does not include the true variability, but only the published variability.  This is unavoidable and this situation will be improved only if and when authors describe additional populations of existing species.
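
The qualitative computation can be sketched as follows.  Percentages are written as fractions between 0 and 1.  The function names are hypothetical, the second population in the example is invented for illustration, and capping s at 1 when |U - M| is smaller than A is an assumption made by analogy with the quantitative case; the text above does not state it.

```python
def state_similarity(u, observed):
    """Similarity s for one character state.

    u:        fraction of specimens of the unknown that show the state
    observed: fractions recorded for the described populations of the species
    """
    k1, k2 = min(observed), max(observed)
    m = (k2 + k1) / 2                      # mid range M
    a = (k2 - k1) / 2                      # half of the observed range A
    excess = abs(u - m) - a
    # Assumption: inside the published range the state counts as a full match.
    return 1.0 if excess <= 0 else 1.0 - excess


def qualitative_similarity(unknown, populations):
    """Average of s over the n states of one qualitative character."""
    scores = [
        state_similarity(u, [pop[state] for pop in populations])
        for state, u in unknown.items()
    ]
    return sum(scores) / len(scores)


# Fusion of lateral field lines: one population as recorded above (20% / 80%),
# plus a second, hypothetical population (40% / 60%).
species = [{"Y-shape": 0.2, "U-shape": 0.8}, {"Y-shape": 0.4, "U-shape": 0.6}]
unknown = {"Y-shape": 0.5, "U-shape": 0.5}
print(qualitative_similarity(unknown, species))
```

In this example each state falls 0.1 outside the published range, so both states score 0.9 and the character as a whole scores 0.9.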
 The similarities S for both types of characters are given a weight w by the experts or by the user.  The successive Sw for all the characters are averaged for the coefficient of general similarity.  Missing characters are neutralized.
 Similarity provides true graceful degradation.  When erroneous data are entered, the accuracy of the coefficients of similarity diminishes gradually, not suddenly.  For example, during an identification using ten characters, if an error is made on one character, the coefficients of similarity still are 90% accurate.  The species to which the specimen belongs will be listed as 90% similar to the specimen.  Also, missing data are neutralized, and the coefficients are computed only from the characters that are known in both the specimen to be identified and the species to which it is compared.
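
The ten-character example can be reproduced with a short sketch.  The function name and the character labels are hypothetical, and the text does not say exactly how the weighted similarities are averaged; dividing by the sum of the weights of the characters actually compared is an assumption.

```python
def general_similarity(similarities, weights):
    """Coefficient of general similarity over several characters.

    similarities: {character: S, or None when the character is missing}
    weights:      {character: weight w}
    """
    pairs = [
        (s, weights[name])
        for name, s in similarities.items()
        if s is not None                   # missing characters are neutralized
    ]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None                        # nothing to compare
    return sum(s * w for s, w in pairs) / total_weight


# Ten equally weighted characters, one of them entered wrongly (S = 0).
sims = {f"character {i}": 1.0 for i in range(1, 10)}
sims["character 10"] = 0.0
weights = {name: 1.0 for name in sims}
print(general_similarity(sims, weights))
```

One wrong character out of ten lowers the coefficient to 0.9 instead of eliminating the species.  If the tenth character were missing rather than wrong, the coefficient would be computed from the other nine only.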
 This approach is safe in the sense that no taxon is ever eliminated.  All the species are listed in order of decreasing similarity and it is up to the user to examine them and to pick one for the identification.  Obviously, this requires some expertise and the non-expert user may make a wrong choice, particularly if too few characters were used.  To work well, the NEMAID algorithm requires entering many characters, typically a dozen or more.  Hurried users may be tempted to enter only a few characters and they may end with the wrong species.

Expert Systems
 
Rule-based expert systems have limitations from the computer science aspect, but from the user point of view their main drawback is the excessive amount of work they require.  A traditional rule-based expert system looks much like a fanciful dichotomous key.  The system asks for character after character, and it takes far too much time to gather and enter all these data.  Experts like to take shortcuts, and this is difficult when the system is in charge.

Probabilistic Identification Systems

Adapted to biological identification, Bayes rule gives the probability of having a taxon T if a particular character state C is present.  This probability P(T|C) is:

 P(T|C) = P(C|T).P(T) / P(C)

or, the probability of observing this character state given the taxon, P(C|T), times P(T), the prior probability of the taxon, divided by P(C), the probability of observing the character state.  When we have several taxa, T1, T2, . . . Tk, that are exclusive and exhaustive, the probability of C in the denominator can be written as a weighted sum of the conditional probabilities P(C|Ti) where the weights are P(Ti), and the formula becomes:

 P(Ti|C) = P(C|Ti).P(Ti) / [P(C|T1).P(T1) + . . . + P(C|Tk).P(Tk)]

The probability of observing a character state C given taxon T,  P(C|Ti) for taxon Ti,  P(C|Tk) for taxon Tk, can be found in a data matrix, provided character C has been described for taxon T.  As for NEMAID, it is necessary that, besides the straight data, the database include the frequency of observation of each character state.
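
As an illustration of the formula for several taxa, the sketch below computes P(Ti|C) for every taxon at once.  The function name is hypothetical.  The conditional probabilities are the percentages quoted earlier for the Y-shaped fusion of lateral field lines; restricting the problem to two taxa with equal priors is an assumption made only to keep the example short.

```python
def posterior(priors, likelihoods):
    """Bayes rule over exclusive and exhaustive taxa.

    priors:      {taxon: P(T)}
    likelihoods: {taxon: P(C|T)} for the observed character state C
    """
    joint = {taxon: likelihoods[taxon] * priors[taxon] for taxon in priors}
    evidence = sum(joint.values())         # P(C), the denominator
    if evidence == 0:
        raise ValueError("the observed state is impossible for every taxon")
    return {taxon: value / evidence for taxon, value in joint.items()}


priors = {"H. dihystera": 0.5, "H. pseudorobustus": 0.5}       # assumed equal
likelihoods = {"H. dihystera": 1.0, "H. pseudorobustus": 0.2}  # P(Y-shape | T)
print(posterior(priors, likelihoods))
```

Under these assumptions, observing a Y-shaped fusion raises the probability of H. dihystera from 0.5 to about 0.83 without eliminating H. pseudorobustus.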
 The prior probability of each taxon, P(Ti), P(Tk), is the probability of observing the taxon Ti or Tk before any data are known to us.  Prior probabilities ultimately rely on observations made on field populations of nematodes.  These observations may have been published or not published.  If no publication is available, experts familiar with local conditions may be interviewed for their estimate of probabilities of observing the various species in these conditions.  Published observations may or may not include the following indices:
(1)  the absolute density (= abundance) of a species in one sample, or the number of specimens of this species per unit (in weight or volume) of soil or roots in this sample;
(2)  the absolute frequency (= constancy) of a species, or the percentage of samples where this species was observed (for example during a survey of the fields with a particular crop in a particular region);  or
(3)  the prominence of a species, or its absolute density multiplied by the square root of its absolute frequency.
The prominence index is an absolute value, and it can be made relative by dividing it by the sum of the prominence indexes of the species present.
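
The three indices can be combined as in the sketch below.  The function names and the survey figures are hypothetical; only the definitions come from the text above.

```python
import math


def prominence(density, frequency):
    """Absolute density multiplied by the square root of absolute frequency.

    density:   specimens per unit of soil or roots
    frequency: percentage of samples in which the species was observed
    """
    return density * math.sqrt(frequency)


def relative_prominence(prominences):
    """Divide each prominence by the sum for all species present."""
    total = sum(prominences.values())
    return {species: value / total for species, value in prominences.items()}


# Hypothetical survey: a dense but localized species and a sparse, widespread one.
survey = {
    "species A": prominence(density=100, frequency=25),
    "species B": prominence(density=50, frequency=100),
}
print(relative_prominence(survey))
```

In this hypothetical survey both species reach the same prominence (500), so each has a relative prominence of 0.5.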
 Bayes rule is fundamental to identification strategies.  All identification methods either are explicitly based on Bayes rule, or they can be formulated as variations of this rule with different assumptions.  For example, in dichotomous keys all species are presumed to have an equal chance of being present in the sample (all prior probabilities are equal), and characters are presumed to be either present or absent (probability of the evidence is either 1 or 0).  For another example, in NEMAID, based on a coefficient of general similarity, all species also are presumed to have an equal chance of being present, but the NEMAID algorithm uses the actual probability of the characters, as found in species descriptions.  The question remains whether these different assumptions are necessary or using Bayes rule in its entirety, including actual estimates of prior probabilities, is more practical and realistic.
 A prior probability expresses the analyst's opinion of a population parameter before any new data are available (Iversen 1984).  In the Bayes formula above, the probability of observing a species, P(T), is supposed to be valid for the whole universe.  Actually, prior probabilities are always conditioned by some background information and they are really just conditional probabilities (Horvitz, personal communication).  It remains to be decided where to begin the process.  The following three cases will be examined below: 1) proposing probabilities valid only in narrowly defined circumstances; 2) proposing a single probability for each species, valid for nematode identifications in all circumstances; and 3) taking all probabilities equal.  I will discuss the practical feasibility of proposing probabilities, and how realistic they are in an actual identification environment, in all three cases being considered.
 Studies of nematode populations are always made under restricted circumstances.  Samples are taken in a particular geographical region, on a particular host, and from a particular part of the plant (rhizosphere, roots, above ground vegetative parts, and above ground reproductive parts).  For example, I know from past experience that in root samples taken from flooded rice fields of northern Senegal, there is a 98.8% chance of observing the species Hirschmanniella oryzae, a 1.2% chance of observing H. spinicaudata, and almost no chance of observing any of the 3,700 other species of plant-parasitic nematodes (Fortuner and Merny 1974).  Such a precise and intimate knowledge of nematode populations is possible only because circumstances have been very narrowly defined.
 In southern Senegal, these probabilities become 20.5% for H. oryzae and 79.5% for H. spinicaudata.  In the same area, samples of upland rice instead of flooded rice would not yield these two species of Hirschmanniella.  Instead, there would be a 64.8% chance of observing Pratylenchus brachyurus and a 32.2% chance of observing P. sefaensis (Fortuner 1975).  Thus, depending on circumstances (region, type of rice), the probability for observing H. oryzae in Senegal is 98.8%, 20.5%, or 0%.  When I was based in Senegal and I was studying rice nematodes, I was aware of these different probabilities, and my expectations of observing H. oryzae varied greatly depending on the type of sample.  In any given set of circumstances there is only one realistic probability.
 Proposing probabilities valid only in narrowly defined circumstances is possible and relatively easy when species populations under these circumstances have been studied.  Hundreds of articles have been published with this type of information.  Probabilities of observing species may be extracted from these studies and stored in a database in which rows would be sets of circumstances and columns would be the 3,700 species.  However, there are four different plant parts that can be sampled, there are dozens of major crops and dozens more of wild plant associations, and there are hundreds of significantly different geographical regions.  Consequently, there are tens of thousands of possible sets of circumstances.  Only a small fraction of these possible sets of circumstances have been studied and probabilities are unknown in most cases.
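
A database of this kind could be sketched as a table keyed by circumstances.  The layout and the function name are hypothetical; the figures are the Senegal percentages quoted above.  Returning 0 for a species not listed in a studied set of circumstances is a simplification of "almost no chance".

```python
# Rows are sets of circumstances (region, type of rice); columns are species.
PRIORS = {
    ("northern Senegal", "flooded rice"): {
        "Hirschmanniella oryzae": 0.988,
        "Hirschmanniella spinicaudata": 0.012,
    },
    ("southern Senegal", "flooded rice"): {
        "Hirschmanniella oryzae": 0.205,
        "Hirschmanniella spinicaudata": 0.795,
    },
    ("southern Senegal", "upland rice"): {
        "Pratylenchus brachyurus": 0.648,
        "Pratylenchus sefaensis": 0.322,
    },
}


def prior(species, region, crop):
    """Prior probability of a species under one set of circumstances."""
    row = PRIORS.get((region, crop))
    if row is None:
        return None                        # circumstances never studied
    return row.get(species, 0.0)           # simplification: unlisted means 0


for key in PRIORS:
    print(key, prior("Hirschmanniella oryzae", *key))
```

The same species receives three different priors depending on the row consulted, and any set of circumstances absent from the table yields no estimate at all, which is the situation for most of the tens of thousands of possible rows.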
 Proposing a single prior for each species, valid for nematode identifications in all circumstances, would require estimating the probability of observing this species in each possible set of circumstances and the probability of observing each set.  Because only a few hundred sets of circumstances have been studied, all the relevant knowledge that has been accumulated since research on plant-parasitic nematodes began would still be insufficient.
 How realistic the resultant probability would be is also in question.  Making very broad estimates and generalizations, I could propose a worldwide probability for a few species.  For example, H. oryzae is widely distributed in paddy rice but is almost unknown on other plants.  Let's suppose that the probability of observing H. oryzae in root samples from paddy rice is 0.9 worldwide and let's pretend that this species has no other host at all.  There are about 140 000 hectares of paddy rice in the world, out of 1 360 million ha cultivated land, out of 14 900 million ha of total land surface on our planet, and out of a planet surface of 50 000 million ha.  Paddy rice represents only 1% of cultivated land, 0.09% of total land mass, and 0.028% of the total planetary surface.  The prior probability for H. oryzae would be 0.9 times 0.01, or 0.009, for sampling on cultivated areas only; it would be 0.9 times 0.0009, or 0.0008, for sampling anywhere on land; and it would be 0.00028 times 0.9, or 0.00025 for the whole Earth, including oceans.  All these numbers are far smaller than the probability of 0.9 that would be more appropriate if we knew that sampling was made from paddy rice roots, and than the probability of 0.988 if the sampled field happened to be in northern Senegal.  These small numbers would look unrealistic to nematologists.
 Taking all probabilities equal would reflect the fact that we have very limited prior knowledge for all species of nematodes on all possible sets of circumstances.  For the 3,700 species of plant-parasitic nematodes it is easy to calculate that all prior probabilities would be equal to 1/3700 = 0.00027.  This number is almost identical to the prior probability for H. oryzae for the whole Earth (0.00025), it is not too far from the prior estimated above for sampling anywhere on land (0.0008), but it is far too low for the more realistic assumption of sampling limited to agricultural lands (0.009).  Also, presuming that all species have an equal chance of being observed eliminates from the general Bayes rule the information contained in the relative distribution of each species.  The only probability that remains is the probability of observing each character in the various taxa, but this corresponds to the assumption for similarity systems such as NEMAID.  Taking all probabilities equal simplifies the general Bayes rule to a point where it is indistinguishable from some other identification strategies.
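
The priors compared in the last two paragraphs can be recomputed in a few lines.  The variable names are hypothetical; the fractions are those used in the multiplications above.

```python
p_on_paddy_rice = 0.9                      # assumed probability on paddy rice roots

# Fraction of each sampling universe that is paddy rice, as used in the text.
paddy_fraction = {
    "cultivated land": 0.01,
    "total land": 0.0009,
    "whole planet": 0.00028,
}

worldwide_priors = {
    universe: p_on_paddy_rice * fraction
    for universe, fraction in paddy_fraction.items()
}
uniform_prior = 1 / 3700                   # all 3,700 species taken as equally likely

for universe, value in worldwide_priors.items():
    print(f"{universe}: {value:.5f}")
print(f"uniform: {uniform_prior:.5f}")
```

Every one of these values is smaller than 0.01, against 0.9 or more once the host and region are known.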
 We are left in a quandary.  Estimating prior probabilities valid for the whole planet, or taking all prior probabilities equal, are perfectly valid operations but they yield numbers that are far removed from the everyday experience of nematode identification, which is usually performed with at least some knowledge of the origin of the specimens.  If this knowledge is taken into consideration, the probability of observing a species may change from less than one thousandth to almost one.
 Realistic prior probabilities can be proposed that are valid for specified circumstances of observation only.  Yet, nematode populations have been studied only in a relatively small fraction of all possible sets of circumstances.  In particular, nematode populations are unknown in most uncultivated areas.  This strongly limits the value of calculating Bayesian prior probabilities in an identification system for the study of biological diversity because uncultivated areas usually harbor a richer fauna than cultivated lands (Fortuner and Couturier 1983).  Conditional probabilities also cannot be established and used for a nematology laboratory newly established in an area where no sampling has ever been done, for example in a developing country (Doucet 1989).  In such cases, the identifier has very little prior knowledge, and only the simplified strategies may be used.
 Bayesian probabilities offer graceful degradation.  When erroneous data are entered, the accuracy of the probabilities diminishes only gradually.

NEMISYS

In 1987, the NEMISYS International Project (NIP) was launched, in collaboration with two computer scientists from UC Davis, Jim Diederich and Jack Milton.  They were looking for a suitable domain for the practical application of their ideas on expert workstations and recommended implementation of NEMISYS using object-oriented methodology (see Chapter 7 for a discussion of expert workstations in this context and Chapter 10 for a presentation of the computer science aspects of NEMISYS).  Over the next couple of years more than seventy participants became formally associated with NIP.  These participants include nematode taxonomists, computer scientists, and others interested in nematode identification.
 In the rest of this chapter I show the NEMISYS approach to satisfying the requirements presented above for a complete nematode identification system.  I discuss here only features that are special to NEMISYS.  Obviously this system also includes the features found in many modern systems such as help functions, windows, images, etc.
 The prototype system completed to date, NEMISYS 1.0, incorporates the basic functionality of the system.  A few important tools have been implemented: the Basic ID tool for typical identifications, the Show Me tool to gather information about the different taxa, and the Promorph Tool, a graphical tool for focusing identifications.  The attached database is very limited, with information for a few plant-parasitic nematodes at the genus level only, primarily because: (1) rapid prototyping of tools can be done on a limited set of data, and (2) the building of the NEMISYS database, or Nembase, will be a major undertaking.  We are developing computer-based tools to assist with building Nembase.

General Identification

Typically, identification aids are prepared by one author for a small group of species, often the species in one genus.  This allows complete control of the data.  The author is often able to redescribe all the species considered, making sure there are no missing characters and that the characters are described consistently from one description to the next.
 A general system is needed for all the species of plant-parasitic nematodes.  We have seen that about 3,700 species have been described in this category alone, and it would be impractical, or even impossible, to redescribe them all.  Besides the enormous amount of time and money this would require, many species have been found from localities that are no longer accessible.  For example, the type locality of Hirschmanniella belli, a species common on rice in California, is now the parking lot of a shopping mall!  As discussed above, the descriptions of several populations for each species are needed to give a realistic account of the intra-specific variability.  This would require an even greater effort.
 It is imperative that a general system obtain its data from the literature, where about 8 000 descriptions and redescriptions of plant-parasitic nematode species can be found.  Extracting data from published descriptions encounters other difficulties, such as descriptions written in foreign languages, the sheer number of characters used by the authors (more than 400 in Tylenchida alone), the size of the database needed for storing these characters, the presence of inconsistencies in the data, etc.  A tool under construction, The Terminator, will help reduce this task to a manageable one.  The process first requires scanning published descriptions with optical character recognition (OCR) software.  The tool is in the testing stage, and data entry from published sources can begin in earnest when funds become available.
 The data from the literature must be processed in the sense that redundant characters must be eliminated, fuzzy and qualitative characters must be linked (if possible) to exact measurements, etc.  Work has started on rules for doing this translation semi-automatically.  This will come in handy in later years when the time comes to incorporate image analysis and automatic data capture into NEMISYS.  It will become the first step for translating traditional morphological characters into computable characters, i.e., characters that can be automatically extracted by current computer techniques.

Flexibility

The requirement for flexibility of the system for accommodating different users working under different circumstances is met by the concept of a set of tools in an expert workstation (Diederich and Milton, Chapter 7, this volume).  The different tools will be used depending on the user's expertise, depending on what the user wants to do, from a quick routine verification to the exact identification of a quarantined species, and depending on the point reached in the identification process as different tools may be used at different times during a session.
 NEMISYS tools help experts do what they do traditionally, but faster and more comprehensively.  They help non-experts approximate the work of an expert if one had been available.
 So far, we have listed about 50 basic functions that will be available through a dozen tools and more will be defined as the NEMISYS project develops.  This may seem bewildering to a prospective user, but in the final version of NEMISYS only a few tools will initially be active, only a few functions will be accessible, and only default settings will be available.  This will still be enough for reaching a fast and reliable answer, but not enough for the user to get lost in the complexity of the system.  Then, as (and if) the user gains confidence, he or she can start more tools, access more functions, and perhaps change some settings, until the frequent user will enjoy the full functionality of the system.
 Even with the limited facilities available at first, the user will control the process, including the ability to request or reject the help offered by the system.  A fortiori, the fluent user having access to the complete set of tools and functions will enjoy complete control.  A few examples will give a better idea of NEMISYS flexibility.

Data Entry.  Data entry is one aspect of NEMISYS where flexibility is at its highest point.  The Basic ID window includes a pane where the user can enter a description of the characters seen in the specimens.  The user has full and unrestricted access to the English terms found in nematode descriptions.  No codes are necessary and the user enters, e.g., tail rounded without having to remember tail shape is, e.g., character number 246 and that rounded shape is character state number 5, or any such code.  The tool reads the entries, compares them to a dictionary of morphological terms used in nematology, and tries to identify the characters and states meant by the user.  The user can accept or reject the interpretation made by the system.
 The only restrictions when using this mode of data entry come from the limitations of the method, when the language of the user veers too far from normal usage.  There are no other restrictions such as fixed field length, order of character entry, etc.  The user can enter any character at will, or enter the characters suggested by the system.  Data entry can be made one or a few characters at a time, allowing the user to assess the situation reflected by each additional entry before looking for more characters if necessary.  This limits data entry to the minimum number of characters needed for the identification.
 The Basic ID tool allows another data entry method, using a hierarchical arrangement of systems, organs, organ parts (called features in NEMISYS), characters, and character states.  The user goes down this hierarchy to the desired character state, and highlights and accepts it.  It is expected that some users will prefer this direct path to the right entry, and that others will prefer the freedom of describing the data in their way, and having the system attempt to reconcile their entry with the character names in the database schema. Choosing one method versus the other is largely a matter of personal preference, and NEMISYS leaves this choice to each user.
 Graphical data entry is being investigated as an alternative to textual entry.  Limited graphical entry is already available in version 1.0, where some difficult characters such as tail shape are shown on screen.  The user can then compare the shape in the specimens to be identified to the basic shapes illustrated and pick the closest shape.  Later this will be offered for all characters, at the discretion of the user.
 Another option that may become available in later versions of the system is similar to police `mug shots' of crime suspects.  The witness composes an image of the suspect's face by choosing each feature separately.  Similarly, future NEMISYS users could build a composite image of the nematode to be identified by picking the right shapes for basic organs out of a graphical bank of possible shapes.
 We are also planning to connect NEMISYS with commercial data capture software.  When available, this option will allow a user to see on-screen the image of the specimen, captured by a video camera installed on the microscope.  Measurements and observations will be made on-screen and immediately and automatically made available to the system.  The final version of NEMISYS will be provided with an automatic data entry facility through full use of image analysis.

Errors.  NEMISYS treatment of the problem of data reliability offers a second example of system flexibility.  A difference between experts and non-experts in identification is that experts can assess the reliability of their data while non-experts cannot.  This limits the use of many systems that accept at face value all data entered.
 Users of NEMISYS can ask the system to assess the reliability of their data.  We have characterized the factors that influence reliability, including the nature of the character observed, the circumstances of the observation, and the observer expertise.  The factors associated with the characters include conspicuity, ambiguity, and variability of the character (see above).  The factors associated with the observation include the biological material (how many specimens were observed; how well preserved they were), and the optics (type of microscope used; magnification).  The intuition of the observer, what we call Personal Intuitive Feeling or PIF, is also considered with two components: a) how clearly the observer saw the character state, and b) how consistent this observation was in all specimens observed.  The expertise of the observer comes into play, as the PIF of an expert is given more weight than the PIF of a beginner.  All these factors are used in an algorithm that computes an endorsement percentage, from 0 to 100%, giving a measure of the reliability of each piece of data.
 This endorsement algorithm can be used in several ways.  For example, with a self-professed expert the PIF will be given the greatest weight.  The PIF allows the expert observer to tell the system that some data entered may not be correct, while he may be very confident about other data.  This imitates the way an expert usually operates, by putting more confidence in some observations than in others.  Obviously, the expertise of the observer may be more wishful than real, but the basic philosophy of NEMISYS is to respect the freedom of its users.
 For a non-expert, the algorithm will rely mostly on the other factors, and it will provide an automatic evaluation of each piece of data.  This automatic evaluation emulates evaluation by a human expert.  Of course, experts and non-experts alike can turn off the endorsement option and ask the system to accept each observation at face value.
 An example of the use of the endorsement algorithm can be found in the computation of the NEMAID similarity coefficient.  The NEMAID algorithm includes weights that are pre-set by experts or modified by the user.  In NEMISYS, the weights of the NEMAID algorithm will be the endorsement scores of each piece of data.
 The consequences of errors can be reduced by other options in NEMISYS tools, as explained below.

Identification Methods

NEMISYS flexibility is most obvious in the variety of identification tools it offers depending on who the users are and what they want to do.  A routine check of a well known species by an expert is different from the verification of the identity of a regulated pest whose verified presence will cause the destruction of 25,000 pothos or 125,000 strawberries, or from the slow plodding of a beginner unable to recognize even the most basic forms.  Each tool uses one or several of the identification methods described above, as most appropriate for the tool functionality.  During an identification session, the user will be able to select one or more tools, and one or more options within each tool, and use them at will, sequentially or simultaneously, following a hand-crafted identification strategy.  Below, the various identification methods used in NEMISYS are presented in the context of some of the tools that use them.

Recognition.  It has been said (Pankhurst, Chapter 8, this volume) that the best way to identify a specimen is to know what it is already.  Obviously, there is no need for an identification aid when the observer can fully identify a species.  Still, even the best experts may want to verify their initial identification, or they may have recognized a level higher than the species level and still need to go to species level.
 This is where most existing methods fail, because there is no mechanism for the user to provide this type of information.  In NEMISYS, the Ask Me tool allows the user to enter the name of a species or a group of species, and the system responds with a list of characters to be checked for full identification.  This is a goal-directed procedure, somewhat similar to the backward chaining procedure of an expert system that also allows verification of a tentative identification.
 NEMISYS goes beyond this process in that it allows the user to use shortcuts in the data-directed procedure.  Often the user can narrow down the possible identification to a limited area at the family or genus level.  In other systems the user still has to plod slowly through many steps to enter what was obvious at first glance.  For example if you see a fish you do not want to go through a long list of questions, `Does it have vertebrae?  Does it have scales?  Does it have gills? etc.,'  before the system reaches the obvious conclusion that it is a fish.
 In NEMISYS, we call promorph (pro before; morph morphology) a form that can be recognized at low magnification powers, before observation of detailed morphology.  For example, the promorph fish would include all true fishes but also dolphins, whales, and other cetaceans.
 When a promorph is recognized, the user enters its name into the system, without having to provide its description.  Non-experts unable to recognize promorphs can ignore the short-cut provided by this option, or they can look at drawings of common promorphs in the hope of recognizing one of them.
 A promorph is a concept, not a classification.  Promorphs are not hierarchically arranged, they are neither exclusive (the same species may belong to two promorphs) nor exhaustive (some species may not belong to any defined promorph), and their effective use depends on the level of expertise of the observer.  That is, an expert can recognize many more promorphs than a beginner.
 Promorphs are identification concepts.  They do not attempt to account for the phylogeny.  Promorphs depend on immediate recognition, not on careful consideration of phylogenetic characters.  At first glance a whale IS a fish, as any Nantucket whaler would have told you (see chapter 31 of Moby Dick).
 In NEMISYS, promorphs are used for focusing the identification process toward the area where it is most likely that the right answer will be found.  If I see a fish, I know it can be a true fish or a cetacean, but it cannot be a sea urchin.
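
The focusing role of promorphs can be sketched with plain sets, which are naturally neither exclusive nor exhaustive.  The promorph names, their members, and the function name are hypothetical and follow the fish analogy used above; they do not describe the NEMISYS Promorph Tool.

```python
# A taxon may appear in several promorphs, or in none.
PROMORPHS = {
    "fish": {"trout", "shark", "dolphin", "whale"},
    "swimming mammal": {"dolphin", "whale", "seal"},
}


def focus(candidates, promorph):
    """Keep only the candidates that belong to the recognized promorph."""
    members = PROMORPHS.get(promorph)
    if members is None:
        return set(candidates)             # promorph not recognized: no focusing
    return set(candidates) & members


print(focus({"trout", "whale", "sea urchin"}, "fish"))
```

Entering the promorph name removes the sea urchin from consideration without asking the user for a single character, while both true fishes and cetaceans remain candidates.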

Deterministic Method.  It has been argued above that most morphological data are probabilistic.  Yet, the deterministic approach is the fastest way to eliminate taxa that obviously do not fit the unknown specimens.
 NEMISYS eliminates unsuitable taxa by relying only on what I call primary identification criteria.  A primary identification character is a morphological characteristic that is both useful for the differentiation of a species or a group of species and easy to observe in this group of species.  It has high conspicuity, low ambiguity, and low variability.  A primary character is one that even a beginner can identify without great risk of error.
 In NEMISYS, the elimination process relies on the ad-hoc concept of nests of species, which are groups of species sharing the same set of primary identification characters.  As for promorphs, nests are not units of a classification.  They are heuristic concepts that rely entirely on clearly visible phenetic characters, and they do not follow established phylogenetic classifications.
 A primary character C for the nest N1 may not be primary for another nest, N2.  When an unknown specimen is compared to N1 and to N2 using this character C, the elimination process is different for N1 and N2.  During the comparison with N1 (C is primary for N1), two things can happen:
(1)  the specimen belongs to N1, and it will have the correct state; N1 is selected as it should be;
(2)  the specimen does not belong to N1; C may still have the correct state for N1, in which case N1 is kept by error on the list of possible candidates; or the specimen may differ from N1 in the value of character C, and N1 is rejected as it should be.
During the comparison with N2 no action is taken and N2 remains in the list of possible nests whether the specimen has the correct state or not for character C.  If the specimen does not belong to N2 after all, the wrong nest is kept on the list of possible candidates and it will be eliminated later, using other characters that are primary for N2.
 Using nests there is a risk of keeping the wrong answer, at least temporarily.  On the other hand, the risk of rejecting the right answer is very slight because elimination of taxa relies on the best, most reliable characters for each nest considered.
 A prudent user may reduce the risk of wrongful elimination even further by allowing one or two mismatches on primary characters before eliminating a nest.  This option provides some measure of graceful degradation.  It is found in the Basic ID tool, which is already available with version 1.0.  This tool is described in detail by Diederich and Milton (Chapter 10, this volume).
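
The elimination process, including the tolerance for mismatches, can be sketched as follows.  This is not the Basic ID implementation: the function name, the nests, and the character states are hypothetical, and ignoring characters that were not observed is an assumption.

```python
def surviving_nests(observed, nests, max_mismatches=0):
    """Eliminate nests that disagree with the specimen on primary characters.

    observed: {character: state} entered for the unknown specimen
    nests:    {nest: {primary character: expected state}}
    """
    kept = []
    for name, primary in nests.items():
        # Only characters that are primary for this nest can count against it.
        mismatches = sum(
            1
            for character, state in primary.items()
            if character in observed and observed[character] != state
        )
        if mismatches <= max_mismatches:
            kept.append(name)
    return kept


nests = {
    "N1": {"tail shape": "rounded", "stylet": "long"},
    "N2": {"lip region": "offset"},
}
specimen = {"tail shape": "pointed", "stylet": "long"}

print(surviving_nests(specimen, nests))                    # strict elimination
print(surviving_nests(specimen, nests, max_mismatches=1))  # one mismatch allowed
```

With strict elimination only N2 survives, because tail shape is not primary for N2 and no action is taken for it.  Allowing one mismatch keeps N1 as well, which protects against a single observation error at the cost of a longer candidate list.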

Rules.  There is no rule-based expert system in NEMISYS.  However, small rule-based modules will be used in several ways and in various tools in the system.  For example, after an answer is reached, rules will allow a quick verification of the accuracy of the answer by pointing to the most obvious mistakes.  For each species, rules will make sure that all the diagnostic characters are confirmed and, in particular, that the characters that differ between that species and related species have the correct values or states.  During the building of NEMISYS, the rules will be extracted primarily from the species diagnoses (the list of differentiating characters proposed by the authors of each new species).  The experts participating in the NEMISYS International Project will be asked to check these rules and, if necessary, to provide additional rules for selected species.
 When a dead end is reached in the identification process because all taxa have been eliminated and none fit the data entered, other rules will instruct the system on the best way to get back to the last reliable part of the process, assess the situation, and suggest a different path.  When every attempt has failed, more rules will be triggered for deciding that a form represents a new taxon, never described before.
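
A verification rule of the kind described above could take the following form.  This is a minimal sketch under stated assumptions: the function name, the data layout, and the character names are hypothetical, and a real diagnosis would come from the published species description.

```python
from typing import Dict, List, Tuple


def verify_answer(specimen: Dict[str, str],
                  diagnosis: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the diagnostic characters that are not confirmed.

    `diagnosis` maps each diagnostic character of the proposed species
    to its expected state (hypothetical layout).
    """
    problems = []
    for character, expected in diagnosis.items():
        observed = specimen.get(character)
        if observed is None:
            problems.append((character, "not observed"))
        elif observed != expected:
            problems.append(
                (character, f"expected {expected}, observed {observed}"))
    return problems


# Invented example with placeholder character names.
diagnosis = {"char_A": "short", "char_B": "4"}
specimen = {"char_A": "short", "char_B": "3"}
print(verify_answer(specimen, diagnosis))
```

An empty list means that every diagnostic character was confirmed.  Any other result points to the characters the user should check again before accepting the answer.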

Similarity.  At times, the user may want to rank the remaining candidates.  This can be done by calculating the coefficient of similarity.  NEMISYS uses the NEMAID algorithms described above.  A variation of the algorithm has been implemented in version 1.0 for comparing nests.  The extended version of the algorithm requires access to data at the level below the level being investigated.  Species data are needed for estimating NEMAID similarity at the genus level; population data are needed for estimating similarity at the species level.  The extended algorithm will be available when the species and population database Nembase is completed.
 As explained above, the similarity approach has built-in graceful degradation.  In addition, the NEMAID algorithm used in NEMISYS takes into account the intra-specific variability, thereby reducing the risk of error.

Probabilistic Approach.  It is possible to propose realistic Bayesian probabilities only in some well-defined circumstances of region, host, and plant part being sampled.  Still, most identifications are made in circumstances where the nematode populations are known from published studies.  This is because nematode populations are known under the circumstances that are the most commonly studied (e.g., important crops, areas located near a nematology laboratory, etc.).  Many future identifications will also be made under the same circumstances.  Consequently, an identification strategy explicitly using prior probabilities is quite justified.
 We are considering incorporating Bayesian probabilities in a future version of NEMISYS.  First, a specialized database will have to be created for storing probabilities from published descriptions of nematode populations found in given circumstances.  Experts can provide additional information and offer rules that would extend our knowledge of populations in known circumstances to circumstances that have never been studied but are somewhat similar.  For example, if the probability of observing H. oryzae in northern Senegal is 0.98 and if it is 0.205 in southern Senegal, a probability of 0.59 (the mean of the two values) could be proposed for central Senegal.  When circumstances are too far removed from all published experience to allow for extrapolations, all nematode species might be considered to be of equal likelihood.  The user of the Bayesian probability function would specify the circumstances of the current identification session.  The corresponding probabilities would be extracted from the specialized database and they would be used according to Bayes' rule.
 Some biologists will say that, if they cannot reach a firm answer in an identification, for example because some diagnostic characters cannot be observed in the specimen, they do not want any answer at all.  This has not been my experience.  I recently had to identify a Pratylenchus sp. from California.  Using the dichotomous key of Loof (1978) as a reminder of diagnostic characters for the species in this genus, I found that the form was somewhat intermediate between P. thornei, a cosmopolitan species known from California, and P. delattrei, a species described from Madagascar thirty years ago, and never reported since.  My final identification was `Pratylenchus sp., probably P. thornei,' because I implicitly took into account the probability of observing this species in California vs. the probability of observing P. delattrei.
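
The two ideas above, extending priors to a similar region and weighing morphological evidence by prior probability, can be sketched as follows.  This is an illustration only.  The function names are hypothetical, the simple averaging is one possible reading of the Senegal example, and the priors and likelihoods used for the two Pratylenchus species are invented numbers, not published values.

```python
from typing import Dict


def interpolate_prior(p_a: float, p_b: float, weight: float = 0.5) -> float:
    """Linear interpolation between the priors of two known regions.

    weight=0.5 gives the plain mean, as in the Senegal example.
    """
    return (1.0 - weight) * p_a + weight * p_b


def posterior(priors: Dict[str, float],
              likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule over a closed list of candidate species."""
    joint = {sp: priors[sp] * likelihoods[sp] for sp in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("no candidate is compatible with the data")
    return {sp: value / total for sp, value in joint.items()}


# Senegal example from the text: 0.98 (north) and 0.205 (south).
central = interpolate_prior(0.98, 0.205)
print(round(central, 2))

# Invented numbers: the specimen fits both species equally well, but one
# species is far more likely to occur in the region sampled.
priors = {"P. thornei": 0.95, "P. delattrei": 0.05}
likelihoods = {"P. thornei": 0.5, "P. delattrei": 0.5}
print(posterior(priors, likelihoods))
```

When the morphological evidence does not separate the candidates, the posterior simply reproduces the priors, which is the implicit reasoning behind an answer such as "probably P. thornei."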
 Using prior probabilities does not allow the swift elimination of the obviously wrong choices, which is a characteristic of deterministic methods.  On the other hand, once deterministic elimination using safe primary characters has reached its limits, Bayesian probabilities offer an alternative strategy for pursuing the identification to its conclusion at the species level.  Other strategies, such as deterministic elimination with unsafe characters or estimation of overall similarity, can also be used to reach the species level.  If all approaches point to the same species, the user's confidence in the answer would be greatly increased, provided the various methods are really independent, of course.
 NEMISYS users will have a choice of tools, each tool being based on one of the identification methods or on a combination of these methods.  They will probably use several tools in a single identification session.  The identification process may be quickly focused on a small group of species using a tool that allows the user to propose a promorph (recognition method).  Then the Basic ID tool can eliminate some nests (deterministic method).  This elimination process can be made less stringent by allowing one or two mismatches before eliminating a nest.  A browsing tool can show the user why a nest was eliminated.  Then the remaining nests can be assessed by calculating a NEMAID coefficient of similarity with the unknown specimens.  Once an answer is reached, rules can help with checking for possible errors.  Which tools are used, and in what order, depends on circumstances.  An expert can use a goal-directed strategy through the ASK ME tool, while a beginner may prefer step-by-step guidance by the system, which will be in full control in a HELP ME tool (to be added to future versions).  Each tool relies on the identification method (or methods) deemed most appropriate for achieving the tool's functionality.  The users have full control over tool selection, and by selecting tools they also have access to several identification methods (whether or not they know which methods are used in each tool).  This is a unique characteristic of NEMISYS.
 Most other identification systems are based on a single identification method, using probabilities, similarity, deterministic elimination, or some other method.  Each system has or does not have graceful degradation depending on the method selected.  In contrast, NEMISYS tools rely on several identification methods.  Each method can be said to have graceful degradation or not, but the system as a whole goes beyond this concept.  The rich environment provided by NEMISYS allows many options for recovering from errors.  A worst-case scenario might be using the Basic ID tool set at zero tolerance (no mismatch allowed).  Then the nest to which the unknown belongs might be eliminated if the user makes one mistake on a primary character for that nest.  However, this is unlikely to happen because primary characters are easy to record.  If it does happen, the right nest is eliminated and the user may eventually be left with an empty set of nests.  The user can still switch to the coefficient of similarity option in the Basic ID tool.  This would provide the intrinsic graceful degradation of similarity coefficients.  Or the system may arrive at the wrong species, but then rules would point to the closely related species, which would probably include the right answer.  The user could also go to a browser and look at descriptions and illustrations of possible taxa.  These examples show that the concept of graceful degradation is actually irrelevant for expert workstations such as NEMISYS.
 
 

CONCLUSIONS

Reviewing the requirements for a comprehensive identification system listed above,  NEMISYS possesses:
(1)  universality:  the system is valid for all plant-parasitic nematode species because of the reliance on published data;  it will later be extended to other categories of nematodes, and then to other biological groups;
(2)  flexibility:  the system can be used by experts and generalists alike and it fits with the different ways each type of user works;  flexibility is most visible in the concept of set of tools, in the different options available for data entry, and in the different identification strategies;
(3)  ease of use:  the system is accessible to persons who are not experts in identification or not experts in computer use;  the concept of expert workstations and the building of NEMISYS as a fully integrated interactive system promote ease of use;
(4)  speed:  this is realized mostly through shortcuts and the use of promorphs, host, and geographical origin;
(5)  reliability:  the system estimates the reliability of each piece of data entered, and uses it accordingly.
NEMISYS helps experts do what they have been doing before, but better, faster, and more comprehensively, and it also allows non-experts to do what experts do.  This philosophy will guide the future development of the system, to which we will add more tools and more functions to attain the full functionality described in this chapter.

DISCUSSION

Calabrese:  I share your uneasiness about prior probabilities, but I wonder if you could not let the system accumulate them over time, from the results of the successive identifications, or find them from your nematode museum by looking at, say, slides from California.

Fortuner:  In either case, I would still be finding priors that are valid only under certain conditions.  Slides from California obviously would give priors for California only, and as for recording my identifications, I am not identifying species from all possible plants and locations.  It is easy to find prior probabilities for species in certain conditions, as they are regularly published in the nematological literature.  What I cannot do is give a prior valid in any circumstances, because the knowledge just does not exist.  A universal system will have to deal with this fact.

Marcus:  Your priors are conditional probabilities depending on region and other exogenous characters.  I think you are talking about the dependency on the priors.  I would be open-minded and let the people get the priors out of you by interview techniques.

Fortuner:  Any prior probability depends on exogenous circumstances.  At the very least, you must assume that you are on Earth!  You can define the universe to be the whole Earth, or to be paddy rice roots in northern Senegal, it does not matter.  I agree that in theory each species has a prior valid for the whole Earth, but our knowledge is so limited for most species that it would just be impossible to offer even the wildest estimate.  No interview technique can extract a knowledge that does not exist.

Jeffrey:  I strongly agree, be empirical, see if it works, but it would be a methodological mistake to assume that these priors must be somehow buried in your subconscious and then go digging for them.