Diederich & Fortuner,1996 : endorsement

Endorsement of Observations in Identification

Jim Diederich
Department of Mathematics, University of California, Davis, CA 95616, dieder@ucdmath.ucdavis.edu.

Renaud Fortuner
Scientific consultant (current address: La Cure, 86420 Verrue, France, fortuner@wanadoo.fr).

© 1996 IEEE. Reprinted, with permission, from
Proceedings of the Fifth IEEE International Conference on Fuzzy Systems - FUZZ-IEEE '96, September 8-11, 1996, New Orleans, Louisiana, pp. 175-179.
This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by sending a blank email message to info.pub.permission@ieee.org.
By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

Abstract

In this paper, fuzzy logic is examined and compared to non-fuzzy methods as a means for endorsing observations made in the identification of biological specimens.

1. Introduction

Fuzzy logic has been applied in a wide variety of areas [1] with engineering applications, particularly control theory [2], attracting the most attention. Outside of the traditional areas of standard rule-based systems in medicine, little has been done to extend fuzzy concepts to biological applications, where it is likely that fuzzy concepts will play a larger role in the future given the inherent imprecision of biological systems.
Identification of plant-parasitic nematodes, microscopic round worms, is a difficult task. Most methods of identification are too trustful of the observations during an identification. If we call the degree of reliability attributed to each observation the "endorsement" [3], then most observations are automatically considered highly endorsed. Usually, only experts are able to cope with this because part of their expertise involves knowing which data have low reliability, and how to use a too trustful identification method with possibly erroneous data.

What is needed is a means for an identification system to assess the quality of the observations provided by the user. That is, the system needs to attach a level of endorsement to each observation, which may run from highly endorsed (there is no doubt this observation is correct) to not endorsed (this observation is not to be trusted).

By 'character' we informally mean an attribute and state/value of some organ of the specimen such as "Tail, shape = pointed" [4, 5].

2. Endorsement Factors

There are many factors that affect the level of endorsement of a character: 1. the expertise of the observer for the taxa being identified, from a student beginner to a taxonomist having studied this particular group for a number of years; 2. the degree of confidence of the observer in the character, from "this is my best guess" to "I would bet my life on this"; 3. the observational set-up, from many well preserved specimens from several populations observed at highest magnification of a high quality research microscope to a single squashed individual observed with a student microscope; and 4. the character itself, from an easy to see, unambiguous, non-variable character to a feature at the limit of visibility fleetingly seen in some specimens [3].

With respect to expertise, data entered by the expert (i.e., a nematode systematist working with "his/her" taxa) is to be trusted more than data entered by a beginner. There are others who fall in between, such as those who are experts in a related taxon or those who are nematologists, but not taxonomists. In some diagnostic systems such as MYCIN [6] the observer has the opportunity to attach a value, a certainty factor, to each observation. Often the observer will have a sense when the observation is good or not. We call this the Pif, French slang for nose, which can also be short for "Personal intuitive feeling." The degree to which the Pif is trusted is also dependent on the level of expertise. However, a beginner or non expert may also be trusted (within limits) whenever a good job has been done in the set-up.

In microscopic observations, there are several options and conditions affecting the observations. The instrument used may be a top quality research microscope, properly maintained, an older microscope in good condition or not, down to a student microscope. The microscope may be carefully set up with or without using oil-immersion and interference devices or it can be used at low magnification. The number of specimens may be high, several large samples of different origins, one good size sample of 15 or more specimens, or just a single specimen. The quality of specimens may be high, i.e., perfectly killed, fixed, and mounted, or have some degradation in terms of granularity, twisting, overheating, or being squashed in the slide.

The nature of the character itself can affect the trust placed in the observation. This is related to the concept of metadata in [7]. The character may be from a very conspicuous organ or from one that is very difficult to see. Furthermore, the character may be very ambiguous, that is, easy to see but also easy to misconstrue as the wrong character. The character may also be highly variable with many different values found in a particular taxon. This is complicated by the fact that the value of the metadata for conspicuity, ambiguity, and variability can vary from taxon to taxon.

3. Alternative approaches

Given these factors there are several ways to create a system for endorsing observations: a conventional rule-base approach, an algorithmic (formulaic) approach, and a fuzzy rule-base approach.

The advantage of a rule-base approach is that expert knowledge can be captured and used to guide the endorsement of a character. We initially attempted to create a set of rules to govern endorsements but clearly stalled as the number of rules grew and the expert nematologist [RF] stated that he was getting lost in the rules. The endorsement of a character depends on at least nine factors and each factor has roughly four to five linguistic values, certainly more if intermediate values are allowed. Thus, the number of possible combinations is very high, in the hundreds of thousands, since $4^9 \approx 0.26$ million. Some simplifications were introduced, as described below, that reduced the number of inputs into the system, but even then the number of rules needed would have been significantly larger than the number needed for one based on fuzzy rules.

For the algorithmic approach several formulas were considered that included an averaging of the factors in each of the endorsement categories: set-up, specimens, expertise, character, etc., then averaging these to get a result. The formulas were tested against some real life cases where the endorsement output would be acceptable to a practicing taxonomist. When the results were off, the formula was adjusted or modified until it gave reasonable results. While this might suggest a neural net, the number of characters involved is too large to create a viable training set. The final, and possibly the best formula, uses the factors described above, except for expertise and Pif, to compute what we call a "computed Pif," or c.Pif for short, using an arithmetic combination of the factors after assigning numerical values (0.0 - 1.0) to each entry under the factor. For example, the "ease" of the character and the "seriousness of the observer" can be computed using

$$\text{ease} = \left(\text{conspicuity} \times (1 - \text{ambiguity}) \times (1 - \text{variability})\right)^{1/3}$$

$$\text{seriousness} = \left(\text{optics score} \times \text{specimen score}\right)^{1/2}$$

with the c.Pif the average of the two,

$$\text{c.Pif} = (\text{seriousness} + \text{ease})/2.$$

The endorsement can then be computed as a weighted average of the Pif and the c.Pif, with the weights determined by the level of expertise, to give

$$\text{Endorsement} = \text{Expertise} \times \text{Pif} + (1 - \text{Expertise}) \times \text{c.Pif} \tag{1}$$

Thus the higher the level of Expertise, the more the endorsement relies on the Pif, while the lower the level of expertise, the more it relies on the c.Pif. (Note that even with only three terms, Expertise, Pif, and c.Pif, with five values each, the number of conventional rules would still be $5^3 = 125$, which is much larger than desirable though it is possible to reduce the number of rules somewhat.)
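The algorithmic approach can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function and parameter names are assumptions introduced here, and every input is assumed to have already been scored in the range 0.0 to 1.0.

```python
def ease(conspicuity, ambiguity, variability):
    """Geometric mean of conspicuity, (1 - ambiguity) and (1 - variability)."""
    return (conspicuity * (1 - ambiguity) * (1 - variability)) ** (1 / 3)


def seriousness(optics_score, specimen_score):
    """Geometric mean of the optics and specimen scores."""
    return (optics_score * specimen_score) ** 0.5


def computed_pif(seriousness_value, ease_value):
    """c.Pif: plain average of seriousness and ease."""
    return (seriousness_value + ease_value) / 2


def endorsement(expertise, pif, c_pif):
    """Formula (1): expertise-weighted average of Pif and c.Pif."""
    return expertise * pif + (1 - expertise) * c_pif


# Hypothetical scores for a single observation (illustrative values only).
c_pif = computed_pif(seriousness(0.9, 0.4), ease(0.8, 0.2, 0.5))
print(round(endorsement(0.5, 0.7, c_pif), 3))
```

Because the weights are Expertise and 1 - Expertise, an Expertise of 1.0 returns the Pif unchanged and an Expertise of 0.0 returns the c.Pif unchanged. An Expertise of 0.5 weights the two equally.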

In the fuzzy rule-based approach, the same three factors were used to formulate 15 basic rules. The first three in Table 1 essentially state "Trust the expert", while the next three rules state "Trust the c.Pif." (H = High, M = Medium, and L = Low.) The remaining 9 rules cover the intermediate cases and are shown in tabular form.

1. If Expertise is H and Pif is H,   then Endorsement is H.
2. If Expertise is H and Pif is M,   then Endorsement is M.
3. If Expertise is H and Pif is L,   then Endorsement is L.
4. If Expertise is L and c.Pif is H, then Endorsement is H.
5. If Expertise is L and c.Pif is M, then Endorsement is M.
6. If Expertise is L and c.Pif is L, then Endorsement is L.

	If Expertise is	Pif is	c.Pif is	then Endorsement is
7.	M	H	H	H
8.	M	H	M	M+
9.	M	H	L	M
10.	M	M	H	M+
11.	M	M	M	M
12.	M	M	L	M-
13.	M	L	H	M+
14.	M	L	M	M-
15.	M	L	L	L

Table 1. Fuzzy rules for endorsement
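To make the compactness of Table 1 concrete, the sketch below stores the 15 rules as plain data and checks that they cover every combination of H, M, and L for the three inputs. The tuple layout, the use of `None` for an input that a rule does not mention, and the function names are assumptions made for this illustration. The lookup treats the labels as crisp values and does no fuzzy inference.

```python
from itertools import product

# (Expertise, Pif, c.Pif, Endorsement); None marks an input the rule ignores.
RULES = [
    ("H", "H", None, "H"), ("H", "M", None, "M"), ("H", "L", None, "L"),
    ("L", None, "H", "H"), ("L", None, "M", "M"), ("L", None, "L", "L"),
    ("M", "H", "H", "H"), ("M", "H", "M", "M+"), ("M", "H", "L", "M"),
    ("M", "M", "H", "M+"), ("M", "M", "M", "M"), ("M", "M", "L", "M-"),
    ("M", "L", "H", "M+"), ("M", "L", "M", "M-"), ("M", "L", "L", "L"),
]


def matching_rules(expertise, pif, c_pif):
    """Return every rule whose stated conditions equal the observed labels."""
    observed = (expertise, pif, c_pif)
    return [
        rule for rule in RULES
        if all(cond is None or cond == value
               for cond, value in zip(rule[:3], observed))
    ]


def endorse(expertise, pif, c_pif):
    """Crisp lookup: exactly one rule applies to each H/M/L combination."""
    (rule,) = matching_rules(expertise, pif, c_pif)
    return rule[3]


# Each of the 3 x 3 x 3 = 27 combinations is covered by exactly one rule.
for combo in product("HML", repeat=3):
    assert len(matching_rules(*combo)) == 1

print(endorse("M", "L", "H"))
```

Rules 1 to 6 each stand in for three combinations because they leave one input unconstrained, which is how 15 rules cover all 27 cases.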

4. Comparison of methods

A small fuzzy logic system was easily and quickly implemented to test the fuzzy rules vs. the formula (1). Triangular membership functions were used to represent H-, M+, M, M-, and L+, with centers at .80, .65, .50, .35, and .20 respectively and support of length .4 for H- and L+, and support of .3 for M-, M, and M+. The other two values H and L were complementary trapezoidal functions, i.e., H = 1 - L, with the support of H starting at .7 and H = 1.0 between .9 and 1.0. Their centers, taken to be the center of area (COA), were .89 and .11, respectively. All of these membership functions were used for all of the linguistic variables Expertise, Pif, c.Pif, and Endorsement, that is, we did not modify or tailor the membership functions for the different linguistic variables. We used 100 discretized values to represent the membership functions.

Table 2 gives a basic comparison of the results for the fuzzy rules and for formula (1), for all possible combinations of H, M, and L for the three inputs of Expertise, Pif, and c.Pif.
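The membership functions described above can be reconstructed as follows. This sketch rests on two assumptions that are not stated in the text: L is taken to be the mirror image of H, that is L(x) = H(1 - x), and the discretization uses 101 evenly spaced points on [0, 1]. All names are introduced here for illustration.

```python
def triangle(center, support):
    """Triangular membership function with the given center and support length."""
    half = support / 2

    def mu(x):
        return max(0.0, 1.0 - abs(x - center) / half)

    return mu


def high(x):
    """Trapezoid: 0 up to .7, linear rise to 1 at .9, then 1 up to 1.0."""
    return min(1.0, max(0.0, (x - 0.7) / 0.2))


def low(x):
    """Assumed mirror image of `high` about x = .5."""
    return high(1.0 - x)


MEMBERSHIP = {
    "H": high,
    "H-": triangle(0.80, 0.4),
    "M+": triangle(0.65, 0.3),
    "M": triangle(0.50, 0.3),
    "M-": triangle(0.35, 0.3),
    "L+": triangle(0.20, 0.4),
    "L": low,
}

# Assumed discretization: 101 evenly spaced points on [0, 1].
GRID = [i / 100 for i in range(101)]


def center_of_area(mu, grid=GRID):
    """Discrete center of area (COA) of a membership function."""
    total = sum(mu(x) for x in grid)
    return sum(x * mu(x) for x in grid) / total


for label, mu in MEMBERSHIP.items():
    print(label, round(center_of_area(mu), 2))
```

Under these assumptions the centers of H and L round to .89 and .11, in agreement with the values quoted above, and each triangular function has its COA at its stated center.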

The numerical scores for the endorsement are computed using Mamdani implication [2, 8], and computation of the resulting COA for the fuzzy rules. For the formula (1), the centers for the values of the three inputs are used. The linguistic values for the endorsement, shown to the left of the scores, are determined in both methods by taking the membership function with the nearest center to the resulting score. (Unchanged input is not repeated in a column.)
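A compact sketch of this computation is given below. It is one possible reading rather than the original implementation. The text does not say whether the inputs are crisp scores or fuzzy sets, so this sketch assumes each input is itself a linguistic value and that a condition fires to the degree of the highest overlap between the input set and the condition set (max-min matching). It further assumes min for "and", clipping for the Mamdani implication, max for aggregation, L as the mirror image of H, and a 101-point grid. All identifiers are hypothetical.

```python
def triangle(center, support):
    half = support / 2
    return lambda x: max(0.0, 1.0 - abs(x - center) / half)


def high(x):
    return min(1.0, max(0.0, (x - 0.7) / 0.2))


def low(x):
    return high(1.0 - x)  # assumed mirror image of H


MU = {"H": high, "H-": triangle(0.80, 0.4), "M+": triangle(0.65, 0.3),
      "M": triangle(0.50, 0.3), "M-": triangle(0.35, 0.3),
      "L+": triangle(0.20, 0.4), "L": low}
CENTERS = {"H": 0.89, "H-": 0.80, "M+": 0.65, "M": 0.50,
           "M-": 0.35, "L+": 0.20, "L": 0.11}
GRID = [i / 100 for i in range(101)]

# Table 1 as (Expertise, Pif, c.Pif, Endorsement); None = input not used.
RULES = [
    ("H", "H", None, "H"), ("H", "M", None, "M"), ("H", "L", None, "L"),
    ("L", None, "H", "H"), ("L", None, "M", "M"), ("L", None, "L", "L"),
    ("M", "H", "H", "H"), ("M", "H", "M", "M+"), ("M", "H", "L", "M"),
    ("M", "M", "H", "M+"), ("M", "M", "M", "M"), ("M", "M", "L", "M-"),
    ("M", "L", "H", "M+"), ("M", "L", "M", "M-"), ("M", "L", "L", "L"),
]


def match(input_label, condition_label):
    """Degree of overlap between an input set and a rule condition."""
    return max(min(MU[input_label](x), MU[condition_label](x)) for x in GRID)


def infer(expertise, pif, c_pif):
    """Mamdani (min) inference over Table 1, defuzzified by center of area."""
    inputs = (expertise, pif, c_pif)
    aggregated = [0.0] * len(GRID)
    for *conditions, consequent in RULES:
        strength = min(match(value, cond)
                       for cond, value in zip(conditions, inputs)
                       if cond is not None)
        for i, x in enumerate(GRID):
            clipped = min(strength, MU[consequent](x))
            aggregated[i] = max(aggregated[i], clipped)
    area = sum(aggregated)
    return sum(x * m for x, m in zip(GRID, aggregated)) / area


def nearest_label(score):
    """Linguistic value whose center is closest to the score."""
    return min(CENTERS, key=lambda label: abs(CENTERS[label] - score))


score = infer("M", "L", "H")
print(nearest_label(score), round(score, 2))
```

Because the input labels may be any of the seven linguistic values, the same function also accepts intermediate inputs such as H- or L+, which then fire neighboring rules to a partial degree.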

Generally speaking the formula and the fuzzy results are quite similar. We note that the fuzzy endorsements and scores hold their values (H, .89 and L, .11) for the first six rules while the formula (1) scores change. There are a few instances where there are differences in the linguistic values. The greatest difference occurs when the input is (M L H), yielding M+, .65 for the fuzzy rules and M, .50 for formula (1). This arises due to Rule 13 in Table 1, where the c.Pif is high and is trusted more than the Pif when the observer is not an expert. Formula (1), on the other hand, treats the Pif and c.Pif equally in this case. In any event it is clearly easier to tune the fuzzy rules than it is to modify the formula to produce the desired result. Rule 13 is perhaps the one most likely to change in the future. When the expertise is allowed to range among the intermediate values {H-, M+, M-, L+} the fuzzy and formula (1) endorsements adjust in a manner similar to those shown in Table 2, which is an important factor in keeping the basic rule set so small and manageable.

While formula (1) does reasonably well in the test above, it does not easily allow for handling exceptions or for easy tuning. For example, if the character is an easy character, then it should be possible to endorse the data even if other factors such as the optics and specimen are not strong. However, the computation of the c.Pif, used in both methods, will not allow a proper value for the level of endorsement. One could use fuzzy rules for computation of the c.Pif as well. Here we take another approach: we add rules to the 15 basic rules in Table 1. Initially two rules were added to handle easy characters as shown in Table 3(a).

Although the rules seemed reasonable they produced undesirable effects. Rule 16 (initial) had the side effect of decreasing the endorsements to M+ whenever the c.Pif was H. It was therefore replaced by Rules 16 (final) and 17 (final). For Rule 17 (initial) the side effect was to raise the endorsement, when the Expertise was H and the Pif was L, to M- from the desired L. Even with an easy character, the expert's Pif should be trusted, since a partly damaged specimen could obscure this one character. Rule 17 (initial) would appear to have no effect when Pif is L. However, because fuzzy complementation is not exclusive (a value can belong partly to both L and NOT L), the rule does have an impact in this situation. This could have been solved alternatively by modifying the membership functions and/or setting a cut-off in calculating the value of the conditionals. Rules 18 (final) and 19 (final) were used in lieu of Rule 17 (initial). The results of adding in the rules in Table 3(b) are shown in Table 4, where the input includes the fact that the character is an easy one.

In these results, EASY is treated as HIGH. The fuzzy results reflect the appropriate level of endorsement under the conditions considered. In formulating the fuzzy rules the expert nematologist [RF] expressed a desire to have the exceptional rules modify the endorsement determined by the standard rules. However, due to the cumulative effects of the fuzzy rules, that effect can be seen in the results in Table 4. The unchanged results for formula (1) are shown for comparison.

16. If Expertise is L and Character is EASY,
then Endorsement is M.
17. If Expertise is NOT L and Pif is NOT L and Character is EASY,
then Endorsement is H.
(a) initial

16. If Expertise is L and c.Pif is L and Character is EASY, then Endorsement is M.
17. If Expertise is L and c.Pif is M and Character is EASY, then Endorsement is M+.
18. If Expertise is M and Pif is H and Character is EASY, then Endorsement is H.
19. If Expertise is M and Pif is M and Character is EASY, then Endorsement is M+.
(b) final

Table 3. Rules for some exceptions
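The side effect of Rule 17 (initial) can be illustrated numerically. The sketch below assumes, as one possible reading, that the inputs are linguistic values matched against rule conditions by their highest overlap, that NOT L is the standard complement 1 - L, that L is the mirror image of H, and that a 101-point grid is used. It considers only the two rules that fire for Expertise = H, Pif = L, and an EASY character: Rule 3 of Table 1 and Rule 17 (initial). All identifiers are hypothetical.

```python
def high(x):
    return min(1.0, max(0.0, (x - 0.7) / 0.2))


def low(x):
    return high(1.0 - x)  # assumed mirror image of H


def not_low(x):
    return 1.0 - low(x)  # standard fuzzy complement


GRID = [i / 100 for i in range(101)]
CENTERS = {"H": 0.89, "H-": 0.80, "M+": 0.65, "M": 0.50,
           "M-": 0.35, "L+": 0.20, "L": 0.11}

# Degree to which an input of L satisfies the condition "Pif is NOT L".
pif_not_low = max(min(low(x), not_low(x)) for x in GRID)

# Degree to which an input of H satisfies "Expertise is NOT L".
expertise_not_low = max(min(high(x), not_low(x)) for x in GRID)

# Rule 17 (initial) concludes H; EASY is treated as fully satisfied.
rule17_strength = min(expertise_not_low, pif_not_low, 1.0)

# Rule 3 (Expertise H, Pif L) fires fully and concludes L.
aggregated = [max(low(x), min(rule17_strength, high(x))) for x in GRID]
score = sum(x * m for x, m in zip(GRID, aggregated)) / sum(aggregated)
label = min(CENTERS, key=lambda name: abs(CENTERS[name] - score))

print(round(pif_not_low, 2), round(score, 2), label)
```

Under these assumptions an input of L still satisfies NOT L to a degree of about .5, so Rule 17 (initial) adds a clipped H to the output and pulls the defuzzified score from L up to M-, consistent with the behavior reported above.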

5. Conclusions

The verification of the fuzzy approach is that it should give results at least as good, i.e., as acceptable to a systematist, as those of the formula (1). Its validation is that it gives better results, closer to a systematist's ideal, and in a simple, straightforward manner.

The fuzzy approach has clear advantages over standard rule-based and algorithmic approaches for endorsements. The model for the fuzzy rule-based system was remarkably straightforward, simple and compact in its expression, and easy to implement using the simplest of fuzzy logic implications. Exceptions were handled fairly simply after initial testing. On the other hand, no clear cut model was available for developing a formula such as formula (1), making it a less than straightforward task in spite of its eventual simplicity. It also lacked a means for tuning the model in an intuitive and easy fashion when exceptions occurred.

The relevance in relation to systematics is that the risk of data error is great and considered to be important in the field. One of the criteria of a good identification system is that it degrades gracefully when data errors are introduced [9]. The approach so far has been to treat the symptoms in a medical manner: assuming that errors will be made, analogous to an infection, a "good" identification system attempts to handle the problem in a manner analogous to a treatment of a disease. Graceful degradation resembles a form of curative treatment. On the other hand, endorsement is a way to make sure that infection does not occur: errors are detected and handled as they occur. In a sense this is a biological approach, acting like the immune system does: antibodies (the fuzzy endorsement mechanism) recognize the antigens (errors) and prevent damage (to the identification). In particular, there are several ways the information can be used once a character is given a level of endorsement: additional information related to the character can be requested from the user; a low-endorsed character may be excluded in dichotomous identification methods; the weight the character is given in a similarity calculation can be modified; and the overall confidence in the candidates can be adjusted. This fuzzy approach to endorsement should attract the attention of biologists to the potential usefulness of fuzzy methods.

References

[1] Terano, T., Asai, K. & Sugeno, M. 1994. Applied Fuzzy Systems. Academic Press, London.

[2] Yager, R. & Filev, D. 1994. Essentials of Fuzzy Modeling And Control. John Wiley & Sons, Inc., New York.

[3] Fortuner, R. 1993. The NEMISYS Solution To Problems In Nematode Identification. In: Fortuner, R. (Ed.), Advances in computer methods for systematic biology - Artificial intelligence, databases, computer vision. The Johns Hopkins University Press, Baltimore and London: 137-163.

[4] Diederich, J. Basic Properties for Biological Databases: Character Development And Support. In press.

[5] Diederich, J., Fortuner, R. & Milton, J. Construction And Integration Of Large Character Sets For Morpho-Anatomical Data. Submitted.

[6] Buchanan, B. & Shortliffe, E. 1984. Rule-based Expert Systems: The MYCIN Experiments Of The Stanford Heuristic Programming Project. Addison-Wesley, Reading, Mass.

[7] Diederich, J. & Milton, J. 1991. Creating Domain Specific Metadata For Scientific Data And Knowledge Bases. IEEE Trans. Know. Data Eng. 3 (4):421-434.

[8] Klir, G. & Yuan, B. 1995. Fuzzy Sets And Fuzzy Logic: Theory And Applications. Prentice Hall, New Jersey.

[9] Fortuner, R., Editor, 1993. Advances in computer methods for systematic biology - Artificial intelligence, databases, computer vision. The Johns Hopkins University Press, Baltimore and London.