Development of a classification model for studying the acidification status of Colorado lakes

Material Information

Development of a classification model for studying the acidification status of Colorado lakes
Taylor, Lynne Ann
Place of Publication:
Denver, CO
University of Colorado Denver
Publication Date:
Physical Description:
x, 90 leaves : illustrations ; 29 cm

Thesis/Dissertation Information

Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Chemistry, CU Denver
Degree Disciplines:
Committee Chair:
Meglen, Robert R.
Committee Members:
Anderson, Larry
Mikita, Michael


Subjects / Keywords:
Acid pollution of rivers, lakes, etc -- Analysis -- Colorado ( lcsh )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 88-90).
Submitted in partial fulfillment of the requirements for the degree of Master of Science, Department of Chemistry
Statement of Responsibility:
by Lynne Ann Taylor.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
17998158 ( OCLC )

Full Text
Lynne Ann Taylor
B.A., University of Colorado, 1979
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science
Department of Chemistry

This Thesis for the Master of Science Degree by
Lynne Ann Taylor
has been approved for the
Department of Chemistry
Michael Mikita

Taylor, Lynne Ann (M.S., Chemistry)
Development of a Classification Model for Studying the
Acidification Status of Colorado Lakes
Thesis directed by University Research Professor, Center for
Environmental Sciences, Robert R. Meglen
The purpose of this study is to develop a chemical charac-
terization and baseline description of Colorado lakes. By examining
smaller identified groups from a comprehensive lake set, lakes that
are acid impacted or have a sensitivity to acid deposition can be
better defined and related to other acid rain studies. The charac-
teristics of the main influences on the lake waters' chemistry are
hypothesized to be reflected in the inorganic components of the
waters. Both a state-wide, random selection of lakes, and a smaller
targeted set of lakes totalling 236 different lakes have been
sampled up to this point in this continuing study. The data
gathered on these lakes include twenty-eight inorganic chemical
measurements, qualitative vegetation and soil components, and
various observational components.
Several standard multi-variate statistical or pattern recog-
nition techniques have been used in the data analysis. These in-
clude factor analysis, cluster analysis, Soft Independent Modeling
by Class Analogy (SIMCA) and K Nearest Neighbors (KNN) class-
ification analyses. In addition, a new technique combining factor
analysis and cluster analysis to assist defining categories in
classification analysis has been developed. This technique has been
used in forming a classification model based on the inorganic
chemical constituents that represent the limnological classes of
Colorado lakes in an accurate and concise form. This model
describes both chemically and limnologically the critical components
that define the lake waters' chemistry. The model is comprised of
four classes that are limnologically defined by the bedrock geology,
the surrounding vegetation, catchment basin morphology, and
anthropogenic components. This technique has also identified the
lake class and the critical chemical constituents that represent the
acid deposition sensitive lakes. Applying existing and new pattern
recognition techniques to a complex data base has facilitated the
interpretation and understanding of the underlying chemical com-
ponents and influences that define the lake water chemistry.

I. INTRODUCTION
Study Purpose
Study Site
Data Analysis
Cluster Analysis
Factor Analysis
Classification Analysis
Study Objectives
II. PROCEDURE
Data Description
Lake Selection
Sampling Technique
Chemical Analysis
Data Analysis
Initial Data Cleanup
Data Set A
Data Set B
Classification Procedure
III. RESULTS
General Factor Analysis Results
Primary Factor
Secondary Factor
Last Factor
Specific Factor Results
Data Set A
Data Set B
1984 Geology-Vegetation Classification Model
Initial Cluster Analysis
1985 Limnological Model
Cluster Analysis on Scores from Data Set B
Ultra-Oligotrophic Group
Oligotrophic Group
Mesotrophic Group
Lowlands Group
Dendrogram Characteristics
Classification Results on Data Set B
Training Set Reduction
Nine and Four Classes
SIMCA/KNN and Scores/Variables
Stability Check
Cluster Analysis on Scores on Data Set A
Classification Results on Data Set A
IV. DISCUSSION
Classification Technique
Limitations
Advantages
Lake Study Summary

1. Chemical Species, Species Symbols, Analysis
Methods, and Aliquot Preservation Treatments
for the Chemical Analysis
2. Data Analysis Variable Table
3. Non-Quantified Variables Used for Data Analysis
4. Classification Procedures
5. Factor Compositions from Factor Analysis for
Data Set A
6. Factor Compositions from Factor Analysis for
Data Set B
7. Geology/Vegetation Model
8. SIMCA Results for Geology/Vegetation Model on
1984 Data and Data Set A
9. SIMCA Results for Geology/Vegetation Model on
Data Set B for 381 and 417 Sample Training Set
10. Limnological Classification Model
11. Comparison of the Classification Techniques on
the Three Data Sets
12. Stability Tests on the Limnological Model for
Data Set B and 1984 Data

1. Map of Colorado showing the twelve different
drainage basins
2. Example of a clustering dendrogram and the
corresponding factor score plot showing robust
clusters (A,B), a stairstep cluster (D), a
loose cluster (C), and outliers (E)
3. Series of factor score plots showing outliers
and the results as the outliers are removed
4. Block diagram of the general data analysis
procedures used in this study
5. Factor score plot of repeated samples showing
the direction of the Cu analytical bias
6. Factor score plot of Data Set A showing the
separation of the alpine/subalpine lakes along
the bedrock, biological, and traces factors
7. Factor score plot of Data Set B showing 1) the
range of the 1984 spring, 1984 fall, and the
1985 samplings, and 2) the randomness of all
samplings in the repeated lakes
8. Factor score plot of Data Set B showing the
separation of the alpine/subalpine lakes along
the bedrock, biological, and traces factors
9. Factor score plot of the alpine lakes comparing
the 1984 to the 1985 lakes, and showing their
Factor one bedrock ranges and separations
10. Clustering dendrogram showing the inclusion of
the 1985 samples, Data Set A, within the main
clusters of the 1984 samples
11. Clustering dendrogram of Data Set B showing the
placement of the nine Geo/Veg model classes
12. Clustering dendrogram of the factor scores of
the Data Set B showing the robust clusters
and the corresponding main groups, subgroups,
outliers, and test set samples
13. Factor score plots of Data Set B showing the
separation of the main and subgroups along
the bedrock and biological factors
14. Factor score plots of Data Set B showing the
separation of the main and subgroups along
the bedrock and soil/morphology factors
15. Factor score plots of Data Set B showing the
separation of the main and subgroups along
the bedrock and soil/morphology factors
16. Factor score plots of Data Set B showing the
separation of the main and subgroups along
the bedrock and traces
17. Clustering dendrogram of the factor scores of
the Data Set B showing the robust clusters,
R1 and R2, and their subclasses
18. Clustering dendrogram of the factor scores of
the Data Set B showing the robust clusters,
R3 and R4, and their subclasses
19. Clustering dendrogram of the factor scores of
the Data Set A showing the main and subgroups

Study Purpose
In June 1982, the Center for Environmental Sciences at the
University of Colorado at Denver conceived a lakes classification
study in response to acid rain concerns. Many studies addressing
the acid rain problem have been done in Europe over the past thirty
years and in the eastern U.S. over the past fifteen years. These studies
range from the very specific details such as elemental transport
models (1) to broad, general acid deposition and precipitation
studies (2,3). More recently, in the past five years, there has
been an increase in the studies of the western U.S., specifically in
Colorado, that address the differences and unique characteristics of
this area of the country. The National Atmospheric Deposition
Program (NADP) has recently set up monitoring stations in Colorado
(4). Lewis and Grant have done many studies since the 1970's on
bulk deposition/acid precipitation, lake and stream chemistries in
Colorado (5,6). The Bureau of Land Management (BLM) and the United
States Geological Survey (USGS) have surveyed many lakes in
the Flat Tops Wilderness Area in northwestern Colorado, examining the
acid sensitivity of high altitude lakes (7). Several lakes in the
Mexican Cut Nature Preserve north of Gunnison have been monitored
extensively for several years, examining the acidity and seasonal
variability of the lakes (8,9).
Many Colorado studies have been site-intensive monitoring
studies. These studies have primarily focused on pH and alkalinity
measurements. While they are informative about the selected study
site, their generality and applicability to a broader lake population
are unknown. There is a need to study many chemical variables
that can well represent the lake water chemistry. Also, a broader,
more representative sampling protocol is needed to expand the
generality of a study. These are the ideas that formed the basis of
this study in 1982 for studying Colorado lakes. Similar concerns
were addressed in the Environmental Protection Agency's National
Surface Water Survey. This study, which began in 1983, sampled
several thousand lakes over the entire U.S. and examined over fifty
chemical variables.
In order to address the question, "Is Colorado being impacted
by acid precipitation?" several fundamental questions must
be addressed. The present study addresses the following
questions. What is the range of Colorado lakes both chemically and
limnologically? What are the major influences on a lake and their
importance? How are these influences expressed in the lake water
chemistry and what are the most effective ways to measure them?
Upon determining these influences, one may design effective chemical
measurements for monitoring lakes. By using these chemical measurements
to develop a lakes classification scheme that encompasses the
entire state, one may evaluate the applicability of site specific
studies within Colorado, and provide a means for selecting future
representative lake sites. Once the lake chemistries and classifications
are well defined, one can then begin to develop a widely
based and generally applicable chemical measurement of acid impact.
Study Site
Colorado is a state of extraordinary variability in terms of
climate, geology, vegetation, and human population distribution,
with over 4000 named lakes (where lakes include both man-made reservoirs
and natural lakes). The biogeographic zones range from dry
grasslands at the 4000 foot elevation, through the coniferous montane
regions, to the alpine zones up to 13000 feet. There are eleven
major geologic bedrock types and ten different major soil types. In
order to adequately represent the entire state's lake population, a
random selection of lakes was chosen for the sampling protocol. To
ensure that the entire range of lakes would be covered in the sam-
pling, and that regions with low lake population density would be
represented, twelve drainage basins were used to divide the state.
These twelve drainage basins are shown in Figure 1. Twenty lakes
were selected at random from each of the basins for sampling. This
sampling design allows the differences among Colorado lakes to be
studied, and provides a starting point for baseline characterization
of the state's lake population.
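
The basin-stratified random selection described above can be sketched as follows. This is only an illustration under stated assumptions: the lake inventory, the basin and lake names, and the function `select_lakes` are hypothetical and do not come from the study; only the design (twelve basins, twenty lakes drawn at random from each) is taken from the text.

```python
import random

def select_lakes(lakes_by_basin, per_basin=20, seed=0):
    """Draw up to `per_basin` lakes at random from every drainage basin."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    selection = {}
    for basin, lakes in lakes_by_basin.items():
        k = min(per_basin, len(lakes))  # a sparse basin may hold fewer lakes
        selection[basin] = rng.sample(lakes, k)  # sampling without replacement
    return selection

# Hypothetical inventory: 12 basins with differing numbers of named lakes.
inventory = {
    f"basin_{b}": [f"lake_{b}_{i}" for i in range(10 + 40 * b)]
    for b in range(1, 13)
}
chosen = select_lakes(inventory)
```

Stratifying by basin before drawing at random is what keeps regions with few lakes represented; a single draw from the pooled inventory would favor the lake-rich basins.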

1. Colorado Blue
2. Yampa White
3. Eagle Roaring Fork
4. Gunnison
5. Dolores San Miguel
6. San Juan Las Animas La Plata
7. Rio Grande
8. Arkansas (above 6000 feet)
9. Arkansas (below 6000 feet)
10. North Platte - Laramie
11. South Platte (above 6000 feet)
12. South Platte (below 6000 feet)
Figure 1. Map of Colorado showing the twelve different drainage
basins.

The chemistry of a lake can be influenced by several
components. The primary influence is expected to be the bedrock
with which the lake has continual contact. The bedrock mineralogy,
the solubility and weatherability of the bedrock, and the soil
development will all be major influences on the lake chemistry
(10,11,12). The surrounding physical morphology of the catchment
basin, which includes the size, the degree of steepness, the directional
orientation, and altitude, may influence a lake's chemistry
(13,14). In addition, the lake may have chemical input from the
surrounding soil, vegetation, local fauna and microbial populations
(15,16). Atmospheric influences include wet and dry deposition,
reflecting both normal (natural) atmospheric reactions and
anthropogenic atmospheric inputs (2). There are also additional
anthropogenic influences from direct human (recreational) and
agricultural use, and local habitation and land use patterns
(13,16,17). Each of these influences affects the lake chemistry and
should be reflected in the inorganic constituents of the water (18).
Data Analysis
In order to identify the specific links between the lake
chemistries, the inorganic constituents, and the host of different
influences, several multivariate statistical techniques can be
applied to the data base. The use of multivariate or pattern recognition
techniques over the past fifteen years has expanded tremendously
in a wide variety of fields. Pattern recognition techniques
have become invaluable, particularly in the field of analytical
chemistry, because instrumentation advances have increased the amount of
data easily generated and environmental studies have grown in
complexity. There are many precedents and applications of these
techniques. Pattern recognition has been applied to various types
of molecular spectra. Principal component analysis and factor
analysis have been used in the data analysis of infrared spectra and
pyrolysis mass spectrometry data (19,20,21). Exploratory data
analysis has been used in correlating chemical measurements to
subjective sensory evaluations (22), source identification of
geological materials (23), and interpretation of groundwater
monitoring data (24,25).
The first steps involved in the data analysis are exploratory
data analysis or unsupervised learning. These techniques
make no a priori assumptions about the data and underlying variable
distributions. The purpose is to uncover the "natural" associations
and behaviors among the variables and samples (18). The techniques
reduce and summarize the data in a form that is utilizable and
represents the important, distinguishing features of the data.
Cluster analysis. One of the most common techniques used in
exploratory data analysis is cluster analysis. This method sum-
marizes the data base by describing which samples are most like
which other samples. Cluster analysis can be viewed as a one-
dimensional representation of the similarities of the samples. This
technique begins by measuring the distance between each pair of data points
in P-dimensional space. (P is the number of variables or columns,
and N is the number of samples or rows in the data matrix.) Then
the samples are hierarchically linked according to the similarity,
as measured by the distance, among the objects. The objects/samples
are ranked and linked according to their similarity; the most
similar objects being linked first. The most common method of
display for this method is a clustering dendrogram. Figure 2 shows
a hypothetical dendrogram indicating different types of sample clusters,
labeled A through E. Very similar samples, such as groups A
and B, will cluster at a high similarity or low distance measurement
and will be well separated from the remaining samples; these are
called robust clusters. Samples very different from the majority of
samples will show as in group E; these are the outliers. Samples
not that similar to each other, but still separated from the remain-
ing samples, will form a characteristic loose cluster, group C, or a
stairstep cluster, group D. Cluster analysis describes what object
is most like what other object, but it gives no reasons or explanations for
this relationship. This technique is
useful for a general data analysis starting point.
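
A minimal sketch of the hierarchical linking just described is given below. The text does not name the distance measure or the linkage rule, so Euclidean distance and single linkage are assumptions made here for illustration, and the five two-variable samples are hypothetical. The merge history is the information a dendrogram displays: the most similar samples are linked first, and an outlier joins last at a large distance.

```python
import math

def euclidean(a, b):
    """Distance between two samples in P-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(samples):
    """Return the merge history [(members_a, members_b, distance), ...]."""
    clusters = [[i] for i in range(len(samples))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: closest pair of members (assumed rule)
                d = min(euclidean(samples[p], samples[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical data: two tight groups and one outlier, two variables each.
data = [(1.0, 1.0), (1.1, 0.9), (5.0, 5.0), (5.2, 5.1), (20.0, 0.0)]
history = single_linkage(data)
```

In this toy case the two tight pairs link first, as robust clusters would, and sample 4 is linked only in the final step, which is how an outlier appears in a dendrogram.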

Figure 2. Example of a clustering dendrogram and the
corresponding factor score plot showing robust
clusters (A,B), a stairstep cluster (D), a loose
cluster (C), and outliers (E).

Factor analysis. With this technique, the associations
among the variables can be both identified and quantified. Factor
analysis is also very successful in identifying any anomalous
samples or measurements, the outliers (24). This technique begins
by forming the covariance matrix of the measured variables; the
covariance matrix summarizes the covariance between all variables
(26,27). This is followed by eigenanalysis which extracts the best,
mutually independent axes (dimensions) that describe the data set
(24,25). These axes account for the greatest amount of variance in
the data set and are called the factors. The most significant part
of this mathematical process is that the information in the data
set, the variance, is concentrated into a few derived factors
(28,29,30). The factors are linear combinations of the original
measurements in a more interpretable, utilizable form (31,32). Up
to this point, the mathematical treatment of the data has been a
tool to present the data in a more humanly accessible form. At this
point, the knowledge of the scientist must be employed to interpret
the transformed data and derived factors. One method of interpreta-
tion involves examining the factor loadings. The loadings are the
contribution that each of the original measured variables makes to
each factor. A large loading on several variables in the same
factor means that these variables are associated in the system that
is being measured.
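
The covariance and eigenanalysis steps can be sketched with NumPy as shown below. This is an illustrative principal-component factoring, not necessarily the procedure used in the study: autoscaling the variables, the absence of any factor rotation, the function name `extract_factors`, and the synthetic four-variable data are all assumptions.

```python
import numpy as np

def extract_factors(X, q):
    """Principal-component factoring: loadings, scores, variance share."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscale (assumption)
    C = np.cov(Z, rowvar=False)                       # P x P covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)              # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:q]             # keep the q largest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)             # variable contributions
    scores = Z @ eigvecs                              # sample coordinates
    explained = eigvals / np.trace(C)                 # fraction of variance
    return loadings, scores, explained

# Hypothetical data: four measured variables driven by two hidden influences.
rng = np.random.default_rng(0)
bedrock = rng.normal(size=200)
biology = rng.normal(size=200)
noise = rng.normal(scale=0.1, size=(200, 4))
X = np.column_stack([bedrock, bedrock, biology, biology]) + noise
loadings, scores, explained = extract_factors(X, q=2)
```

With this synthetic input, two factors carry nearly all of the variance, and variables driven by the same hidden influence load together on the same factor, which is the kind of association the loadings are read for.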
Factor score plots or factor plots allow the depiction of
all the samples with respect to selected factors. Each sample has a
location in the Q-variable space that is described by the coor-
dinates or scores of the sample. Figure 2 shows the samples from
the hypothetical dendrogram on a factor score plot as a function of
two factors. The robust clusters form a tight data group on the
factor plot (A and B) while the loose cluster (C) forms a scattered
data group. These plots are particularly useful in identifying any
outlying samples. An outlying sample will be located at one extreme
of the plot with the majority of the remaining samples clustered at
the opposite end (Figure 3). Besides simply identifying an
anomalous sample, it is then important to remove this sample from
the factor extraction process as this sample may over-represent the
importance of a measured variable. This removal process is done in
an iterative fashion until all anomalous behavior is removed and one
is presented with the factors and the relative importance of the
variables, the loadings, that identify the characteristics of the
system being studied.
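
The iterative removal of outliers can be sketched as a loop that re-extracts the factors after each removal. The study identified outliers by inspecting factor score plots; the numerical rule used here (a score more than `limit` standard deviations from the center on any retained factor) and the names `factor_scores` and `remove_outliers` are assumptions introduced only to make the loop concrete.

```python
import numpy as np

def factor_scores(X, q):
    """Scores of autoscaled samples on the q largest-variance factors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    top = np.argsort(vals)[::-1][:q]
    return Z @ vecs[:, top]

def remove_outliers(X, q=2, limit=5.0, max_rounds=10):
    """Return the indices of samples kept after iterative outlier removal."""
    keep = np.arange(len(X))
    for _ in range(max_rounds):
        S = factor_scores(X[keep], q)           # factors from current samples
        z = np.abs(S / S.std(axis=0, ddof=1))   # scores are already centered
        flagged = (z > limit).any(axis=1)       # hypothetical outlier rule
        if not flagged.any():
            break                               # nothing anomalous remains
        keep = keep[~flagged]                   # drop outliers, then repeat
    return keep

# Hypothetical data: 100 ordinary samples plus one extreme sample (index 100).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 3)), [50.0, 50.0, 50.0]])
kept = remove_outliers(X)
```

Recomputing the factors after every removal matters because a single extreme sample can dominate the first factor and mask the structure of the remaining samples.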
Classification analysis. Classification analysis methods
can be described as supervised learning techniques. In supervised
learning methods, some previous knowledge of the system must be
applied or assumed. In classification analysis, the classes or
categories must be initially specified; there must be a designated
training set of samples that represents a class. The success of the
classification can be measured by a self-classification percent
(27); the higher the number of training set samples that classify as
their designated class, the higher the self-classification percent.
In this study, two different types of classification procedures were
tested and compared. These methods are Soft Independent Modeling by
Class Analogy (SIMCA) and K-Nearest Neighbors (KNN).
SIMCA is a covariance-based method that is conceptually
similar to factor analysis. Each category is individually treated
to first compute the mean of the category (using the P variables).

Figure 3. Series of factor score plots showing outliers (a,b)
and the results as the outliers are removed (b,c).

Then principal component analysis is applied to construct a prin-
cipal component model; the number of principal components retained
is the number that is necessary to describe the variance of the
individual category (27). KNN, on the other hand, is a distance-based
modeling technique, analogous to cluster analysis. Each data point
is initially assigned a category number as indicated by the training
set. The distance between all the points in P-dimensional space is
calculated. The classification of a sample is then evaluated by the
previously designated classification of the other samples closest to
it; the number of other samples or neighbors being equal to K (27).
K is varied, generally from one to ten, to determine the most effective
value for the training set, and that value is then applied to unknown or test
set samples. Both methods can be evaluated with the self-classification
percent and, assuming a reasonable self-classification
success, can then be applied to a test set.
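
A small sketch of the KNN procedure and the self-classification percent follows. The training samples and class names are hypothetical, Euclidean distance is assumed, and each training sample is left out of its own neighbor search when the self-classification percent is computed; that leave-one-out convention is an assumption, since the text does not say how the training set was scored.

```python
import math
from collections import Counter

def knn_predict(train, labels, point, k, skip=None):
    """Majority vote among the k nearest training samples (Euclidean)."""
    dists = sorted(
        (math.dist(x, point), labels[i])
        for i, x in enumerate(train) if i != skip
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def self_classification_percent(train, labels, k):
    """Percent of training samples assigned to their designated class."""
    hits = sum(
        knn_predict(train, labels, x, k, skip=i) == labels[i]
        for i, x in enumerate(train)
    )
    return 100.0 * hits / len(train)

# Hypothetical training set: two classes, two variables per sample.
train = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
         (3.0, 3.0), (3.1, 2.8), (2.9, 3.2)]
labels = ["alpine", "alpine", "alpine", "lowland", "lowland", "lowland"]

# Vary K (only 1 to 3 here because the toy set is tiny) and keep the best.
best_k = max(range(1, 4),
             key=lambda k: self_classification_percent(train, labels, k))
unknown_class = knn_predict(train, labels, (0.3, 0.2), best_k)
```

The same self-classification measure can be applied to any classifier that assigns training samples to classes, which is what allows SIMCA and KNN to be compared on equal terms.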
There is a weakness inherent in any of these supervised
classification methods in that they require a designated, known
training set. This method requirement is not always automatically
fulfilled in a study. In this study, a classification scheme was
hypothesized but exact classes and training sets were unknown. One
can develop a model based on hypothesis using a type of iterative,
trial training set process; if the hypothesis is correct, this
process will work, but can be very time consuming. In addition, one
may overlook a different model that was not part of the original
hypothesis that may describe the system as
well or better.
Often the combination of different classification techniques
yields more information about a system than just the single method
(21). Cluster analysis has had wide applications to general data
structure and categories; its ability to discover natural groupings
in the data makes it a potential complement to classification
analysis. However, the problems inherent in cluster analysis include
difficulties in interpretation beyond the cursory level and
obtaining a significant value for the number of clusters (33).
Additional algorithms and clustering strategies have been developed
by Massart (34) to deal with these specific problems and then apply
these techniques to classification analysis. Although they are
successful in applying the clustering algorithms with a classification
method, there remains a problem with the chemical
interpretation of the clusters and the classes: what are the chemical
reasons that these classes exist?
In this study, a technique was developed that combines the
supervised and unsupervised methods for classification analysis. In
factor analysis, the assumption is that all the important informa-
tion of a system, the covariance, lies within the first several
factors, the common factors (27,32). This number of factors is
equal to Q. The remaining factors contain the unique portion of the
variance (the noise), and the error. The unique portion does not
contribute anything of significance to the system's information, so
is disregarded in the final factor analysis compositions (29).
Hypothesizing categories solely by examining the separations and
groupings in the two-dimensional factor plots from factor analysis
has been used successfully (18,25). However, as the
classes become more complex and have increasing overlap, this method
becomes limited.
In the technique developed in this study, the samples are
examined by pattern recognition techniques in the reduced Q-space
instead of the usual P-space. The reduced Q-space is defined by the
number of factors retained during the initial factor analysis. The
natural groupings that appear in this Q-space can then be used to
define potential classes and the training sets. Cluster analysis in
the Q-space will indicate the natural groupings that may be overlooked
by a visual examination of the Q-space factor plots. Using
these as classes, the classification techniques can then be tested
in both the Q-space and P-space. The factor analysis basis of this
technique will provide the underlying, chemical descriptions and
definitions of these natural classes. The critical point is finally
to interpret and describe these groupings in a limnological context
and define the links between the chemistry and the major influences.
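
The combined technique, clustering the samples in the reduced Q-space and using the resulting groups as candidate classes, can be sketched as below. Everything specific in the sketch is an assumption: the autoscaling, the average-linkage rule, stopping at a fixed number of groups, the function names, and the synthetic two-population data. The study itself chose classes by interpreting the dendrogram, not by a fixed cut.

```python
import numpy as np

def q_space_scores(X, q):
    """Project autoscaled samples onto the q retained factors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    top = np.argsort(vals)[::-1][:q]
    return Z @ vecs[:, top]

def propose_classes(scores, n_classes):
    """Average-linkage agglomeration in Q-space, stopped at n_classes."""
    groups = [[i] for i in range(len(scores))]
    # pairwise Euclidean distances between all samples in Q-space
    D = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=2)
    while len(groups) > n_classes:
        best, pair = np.inf, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = D[np.ix_(groups[a], groups[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        groups[a] = groups[a] + groups[b]   # merge the two closest groups
        del groups[b]
    labels = np.empty(len(scores), dtype=int)
    for label, members in enumerate(groups):
        labels[members] = label             # candidate class per sample
    return labels

# Hypothetical data: two lake populations measured on P = 6 variables.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(15, 6)),
               rng.normal(6.0, 1.0, size=(15, 6))])
scores = q_space_scores(X, q=2)
labels = propose_classes(scores, n_classes=2)
```

The labels produced this way are only trial classes. They would still need to be checked with a classification method such as SIMCA or KNN, in both Q-space and P-space, and interpreted through the factor loadings before being accepted as a model.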
Study Objectives
The long-term objectives for the ongoing lakes classification
study are:
1. Establish the chemical range and baseline characteristics of
Colorado lakes.
2. Define the main influences (bedrock, soil, vegetation) in the
inorganic constituents that characterize the lakes.
3. Define an acid sensitivity indicator based upon the inorganic
4. Develop a model to classify Colorado lakes.
More specifically, the research presented here deals primarily with
the development of the classification model. The objectives of this
portion of the research are:
1. To summarize a very large and complex system in an accurate and
condensed form.
2. To represent the underlying critical components that define the
lakes' chemistries.
3. To develop a model that can be used in a predictive capacity.
One of the underlying principles of model development is that
several acceptable models or classification schemes may exist that
describe the system equally well; there is no single "best" model.
The important aspects of a "good" model are then the interpretability
of the model for the investigator along with the applicability
and generality of the model.

Data Description
Lake selection. Three different sampling sorties were used
to collect the lake water samples. The first two samplings were
based on the random selection of lakes from the twelve different
drainage basins; approximately eleven lakes were sampled from each
basin. In the spring of 1984, 153 samples were taken; in the fall
of 1984, 130 samples were taken. These totals include lakes that
were sampled twice during a single sampling sortie to give repli-
cated samples. Most of the lakes from the spring of 1984 were
repeated in the fall of 1984 in order to examine the seasonal
changes in the lakes. For the 1985 sampling, a different sampling
strategy was used. The data from the 1984 sampling had been examined
and several hypotheses and questions had been developed. The
influences and distinctions of the bedrock, soil, and vegetation had
been identified in the 1984 studies; their finer characteristics and
their relative importance were still under consideration. The
morphological features of the lakes were indicated as potentially
important. The initial 1984 classification model needed to be
tested and possibly modified. Sensitive and reliable indicators of
acid sensitivity still had to be developed and supported.
The 1985 sampling therefore deviated from the original
random selection protocol by including lakes that addressed the new
questions and tested hypotheses. Using the 1984 initial classification
model, several lakes that were expected to represent each
classification category were chosen for the 1985 sampling. Thirty-four
lakes were selected to represent the different categories. In
addition, sixty-one lakes were added to test new hypotheses. A
total of 117 samples make up the 1985 sample set. The combined
1984 and 1985 samplings brought the total to 436 samples. Both the
1985 only data set (Data Set A), and the combined 1984 and 1985 data
set (Data Set B), are considered in the data analysis procedures in
this thesis.
Sampling technique. Each lake was sampled using a helicop-
ter to hover two to three feet over the lake. A self-rinsing,
plastic/Teflon van Dorn sampling device was lowered from the
helicopter to approximately eighteen inches below the surface. The
sample was transferred to acid washed bottles immediately on board
the helicopter. Within six hours, the samples were delivered to a
mobile field laboratory where the pH, specific conductance and
fluorescence measurements were made. The samples were then preserved
with the appropriate filtering and acidification methods (18).
Finally, the samples were transported back to the central laboratory
for the completion of the analysis.
Chemical analysis. Twenty-eight different chemical species
were determined in the lake waters. Each lake sample had been split
into four aliquots and treated in accordance with a predetermined
chemical analysis protocol. The major cations were determined by
Inductively Coupled Plasma Emission Spectrometry (ICP). The trace
cations were determined by Graphite Furnace Atomic Absorption
Spectrometry (HGA). The filtered, acidified sample aliquots were used
for these dissolved cation determinations. Anions were determined
by Ion Chromatography using filtered, unacidified aliquots.
Electrochemical and colorimetric methods were used for the remaining
analyses on the filtered, unacidified and unfiltered, unacidified
aliquots. A list of the analyses is shown in Table 1. A quality
control protocol was followed with all the analyses for the monitor-
ing of analytical precision and accuracy (18).
Variables. In addition to the chemical analyses, measure-
ments of lake size (surface area), altitude, and water temperature
were added to the data base. A list of the variables and their
distribution among the data sets is given in Table 2. Several
variables were not used in all of the data sets for any of the
following reasons: 1) the variable was not determined that year, or
more than 50% of the data were missing, 2) 80% to 100% of the values
were at the detection limit, 3) an analytical bias was found through
the quality control protocols for that particular data set's variable,
4) there exists a natural bias or lake-dominating characteristic
(i.e., size, temperature, altitude) in the variable. Any of these
conditions may cause a variable or several variables to dominate the

Table 1. List of analyses.

Determination            Symbol   Method      Aliquot
Aluminum (total)         Al.t     HGA         UFA
Aluminum (dissolved)     Al.d     HGA         FA
Arsenic                  As       HGA         FA
Boron                    B        HGA         FA
Barium                   Ba       ICP         FA
Calcium                  Ca       ICP         FA
Cadmium                  Cd       HGA         FA
Chloride                 Cl       IC          FUA
Chromium                 Cr       HGA         FA
Copper                   Cu       HGA         FA
Fluoride                 F        COL         FUA
Iron                     Fe       ICP         FA
Fluorescence             Flor     EM          FUA
Alkalinity (HCO3)        Alk      COL/GRAN    UFUA
Potassium                K        AAS         FA
Lithium                  Li       ICP         FA
Magnesium                Mg       ICP         FA
Manganese                Mn       ICP         FA
Molybdenum               Mo       HGA         FA
Sodium                   Na       ICP         FA
Nitrate                  NO3      IC          FUA
Lead                     Pb       HGA         FA
Phosphate                PO4      IC          FUA
Silicon                  Si       HGA         FA
Specific Conductance     SpC      ELEC        UFUA
Sulfate                  SO4      IC          FUA
Zinc                     Zn       HGA         FA

Methods:
ICP     Inductively Coupled Plasma Atomic Emission Spectroscopy
AAS     Flame Atomic Absorption Spectrophotometry
HGA     Graphite Furnace Atomic Absorption Spectrophotometry
IC      Ion Chromatography
COL     Colorimetric
GRAN    Gran titration and Technicon
EM      Fluorescence by molecular emission
ELEC    Electrochemical/Potentiometric

Aliquot treatment:
FA      Filtered and Acidified
UFA     Un-filtered and Acidified
FUA     Filtered and Un-acidified
UFUA    Un-filtered and Un-acidified

Table 2. Variables and their distribution among the data sets.

Variable        1984    A       B
Alkalinity      x       xL      xL
Altitude        x       x       x
Al dissolved    x       xL      xL
Al total        1       1       1
As              2       2       2
B               x       2       2
Ba              x       xL      xL
Ca              x       xL      xL
Cd              3       x       3
Cl              x       xL      xL
Cr              3       2       2,3
Cu              xL      x       3
F               x       xL      xL
Fe              x       xL      xL
Fluoresc.       x       xL      xL
K               2       xL      xL
Li              2       2       2
Mg              xL      xL      xL
Mn              x       xL      xL
Mo              xL      xL      xL
Na              xL      xL      xL
NO3             1       xL      1
Pb              3       2       2,3
pH              x       x       x
Si              x       xL      xL
SO4             xL      xL      xL
Spec. Cond.     xL      xL      xL
Sr              xL      xL      xL
size            x       xL      4
Temperature     4       4       4
Zn              x       xL      xL

x   Variable retained
L   Variable LN transformed

Deleted variables, reasons:
1   Variables not analyzed; >60% missing values
2   80% to 100% of values at detection limit
3   Analytical bias
4   Natural bias

results, and obscure valid information more critical and more
directly related to the lakes' chemistry. In addition, most of the
analytical data were log-transformed (log base e) for the data
analysis. Most of the variables show logarithmic distributions, and
a range greater than two orders of magnitude. There is an advantage
in transforming many variables in that the final factor analysis and
outlier identification are arrived at more efficiently and, in the
final analysis, include a larger number of samples. In addition to
the measured variables, non-quantified, descriptive variables taken
from field notes are listed in Table 3. These descriptive variables
were used in conjunction with the quantified variables to provide
the limnological-chemical links and lakes' descriptions.
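The screening rules described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually run for the thesis: the function name `screen_variable`, the use of `None` for a missing value, and the rule that a variable is log transformed when its range exceeds two orders of magnitude are assumptions made for the example. Reasons 3 and 4 (analytical and natural bias) require judgment and are not coded.

```python
import math

def screen_variable(values, detection_limit,
                    max_missing=0.50, max_at_limit=0.80):
    """Return (keep, reason, log_transform) for one variable.

    values: measurements for one variable; None marks a missing value.
    detection_limit: positive reporting limit in the same units.
    The thresholds follow reasons 1 and 2 in the text; the
    two-orders-of-magnitude rule for the transform is an assumption.
    """
    present = [v for v in values if v is not None]
    frac_missing = 1 - len(present) / len(values)
    if frac_missing > max_missing:
        return False, "more than 50% missing", False
    at_limit = sum(1 for v in present if v <= detection_limit)
    if at_limit / len(present) >= max_at_limit:
        return False, "80% to 100% at detection limit", False
    # Range of the observed values in orders of magnitude (base 10),
    # with values below the limit raised to the limit first.
    floored = [max(v, detection_limit) for v in present]
    span = math.log10(max(floored) / min(floored))
    return True, "retained", span > 2.0
```

For example, `screen_variable([0.01, 0.5, 20.0, None, 3.0], 0.01)` keeps the variable and marks it for log transformation, while a pH-like column such as `[7.1, 6.8, 8.0]` is kept untransformed.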
Data Analysis
All data analysis procedures were run on the University of
Colorado at Denver's Prime 9950 computer. The statistical package
ARTHUR81 (27) was used for the multivariate procedures. The data
analysis follows several steps and iterative cycles as shown in
Figure 4. The first step involves the basics of inputting and
converting the data to an ARTHUR-compatible format. Any missing
data are filled using a mean value, and values below the detection
limit are standardized by using the value of the detection limit. The next
step begins the interaction of data manipulation and interpretation
with the data analysis techniques.
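A minimal sketch of the first step follows, assuming one variable is held as a Python list with `None` for missing values. The helper name `prepare_column` is hypothetical, and the order of operations (detection-limit substitution first, then mean fill, then the natural-log transform) is an assumption; the text does not state the order used within ARTHUR.

```python
import math

def prepare_column(values, detection_limit, log_transform=True):
    """Fill and transform one variable before factor analysis."""
    # Values below the detection limit are set to the detection limit.
    floored = [None if v is None else max(v, detection_limit)
               for v in values]
    # Missing values are filled with the mean of the remaining values.
    present = [v for v in floored if v is not None]
    mean = sum(present) / len(present)
    filled = [mean if v is None else v for v in floored]
    # Natural-log (base e) transform for variables marked "L".
    if log_transform:
        filled = [math.log(v) for v in filled]
    return filled
```

Variables left untransformed, such as pH and altitude, would be passed with `log_transform=False`.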

Table 3. Non-Quantified Variables Used for Data Analysis

Vegetation Characteristics:
  Grass meadow
  Dry shrub
  Wet shrub
  Ponderosa, mixed coniferous
  Lodgepole
  Spruce-fir
  Aspen, deciduous

Bedrock Types:
  MPC    Precambrian metamorphic
  IGPC   Precambrian igneous
  SP     Pennsylvanian sedimentary
  TIG    Tertiary igneous
  SKu    Upper Cretaceous sedimentary
  SKl    Lower Cretaceous sedimentary
  ST     Tertiary sedimentary

Lake Characteristics:
  Color and opacity
  Aquatic vegetation
  Ice characteristics
  Usage, recreational

Morphological Characteristics:
  Basin size
  Basin contours, shoreline
  Exposure direction
  Inlet/outlet status

Sampling Characteristics:
  Approximate distance from shore
  Sampling depth

Figure 4. Block diagram of the general data analysis procedures
used in this study.

Initial Data Cleanup
Data Set A. In addition to the variables identified in
Table 2, temperature was also deleted. Temperature combined with
size and altitude to form a dominant factor, without describing
anything of chemical or novel interest. This factor was a combina-
tion of variables that behaved in an expected geophysical manner,
without any significant chemical information. Once temperature was
deleted, altitude and size then loaded on different factors,
contributing only minor amounts of variance to the factors. All
variables were log transformed with the exception of pH, altitude, Cu,
and Cd. The pH data are already on a log scale, and the ranges of
altitude, Cu, and Cd are only a single order of magnitude while the
other variables' ranges are all greater than two orders of magnitude.
Five samples were identified as outliers by factor
analysis. McHatten and Addison Reservoirs (#407, #470) are Mo
outliers, Wrights Reservoir (#456) is a Mn and F outlier, Haviland
Reservoir (#443) is a SO4 outlier, and Antones Lake #2 (#424) is a
Cd outlier. The four reservoirs are not so much individual outliers
but as a whole, represent a loose class of larger, lowlands reser-
voirs that are quite different from the rest of the 1985 lakes. The
other outlier, Antones #2, had a newly installed large metal pipe
visible on the edge of the lake that could account for a variety of
unusual chemical behaviors. The eleven filter blanks were not used
in the actual factor analysis, but were examined on the factor plots
for unexpected behaviors in a quality control check. Nothing
unusual was apparent from the blanks, as all were found to have the
expected low or below-detection-limit values for all variables
determined. The final Data Set A consisted of 101 samples and
twenty four variables.
Data Set B. A total of twenty two variables were used as
indicated in the variable table. In addition, size was removed as
it behaved as a unique factor. As a unique factor, size accounted
for the majority of the variance on the factor due to only a few
samples with relatively large sizes. By removing size, additional
variable patterns were revealed that were previously obscured by the
size variable. In addition, Cu was found to contain an analytical
bias. Many of the Cu values in 1985 were at detection limit, while
in 1984, this was not noted. This became apparent during factor
analysis, where Cu formed a factor with dissolved Al, as shown in
Figure 5. When the replicated samples are traced from 1984 spring,
1984 fall, to 1985, a pattern emerges in which the Cu values are
decreasing at each instance. The same pattern also occurs with
dissolved Al, but to a much lesser extent; not enough to justify
removal of dissolved Al from the variable set. However, the Cu
behavior is probably due to an analytical error, and introduces an
artificial pattern into the data set; consequently, the Cu values
were removed from the data set. All variables were log transformed
with the exception of pH and altitude, due to their distribution.
Four samples, all of which were identified in the 1984 preliminary
data as outliers, were again identified as outliers and removed from

Figure 5. Factor score plot of repeated samples showing the
direction of the Cu analytical bias. Each arrow
proceeds from 1984 spring, 1984 fall, to 1985 sample.

further analysis. Prink Reservoir (#94) had the maximum Li, SO4,
and Mg values; School Section Reservoir (#95) and Taylor Reservoir
(#101, #301) had the highest Mo, Cu, and Zn values. All four samples
are located in the Eastern plains, with high accessibility and human
usage; a variety of anthropogenic influences could cause their
outlying chemical behavior. Again, all blanks were removed from the
actual factor analysis and were examined on the factor plots for any
unusual behavior; all were found to have minimal levels for the
parameters measured. Data Set B in its final form consisted of
twenty variables and 419 samples.
Classification Procedure
Once the iterative data cleanup procedures were completed,
the final Q factors were determined, and factor plots and scores
generated. (The scores are the coordinates in Q-space that describe
the location of each sample in Q-space.) At this point the
classification procedures begin. Table 4 shows the different procedures
used in the specific classification methods. The 1984 initial model
was formed by iteratively refining the hypothesized categories.
Initial classes were formed on the basis of the samples' bedrock and
vegetation characteristics. Using the SIMCA classification method,
the self-classification rates of the hypothesized classes were
checked, and samples were removed and/or reclassified repeatedly
until an acceptable model was formed. (Acceptable is defined by a
self-classification rate better than 85% and the classes being
chemically and limnologically consistent.)
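The numerical half of the acceptance criterion can be illustrated with a short sketch. The function names are hypothetical, and the classifier itself (SIMCA in the text) is not reproduced; the sketch only assumes that each training sample has an assigned class and a class predicted by the classifier. The requirement that classes be chemically and limnologically consistent cannot be coded and is left out.

```python
def self_classification(assigned, predicted):
    """Return (overall_rate, per_class_rate), both in percent.

    assigned:  class given to each training sample by the analyst.
    predicted: class returned for the same sample by the classifier.
    """
    totals, correct = {}, {}
    for a, p in zip(assigned, predicted):
        totals[a] = totals.get(a, 0) + 1
        if a == p:
            correct[a] = correct.get(a, 0) + 1
    per_class = {c: 100.0 * correct.get(c, 0) / n
                 for c, n in totals.items()}
    overall = 100.0 * sum(correct.values()) / len(assigned)
    return overall, per_class

def is_acceptable(overall_rate, threshold=85.0):
    # "Better than 85%" is read here as strictly greater than 85.
    return overall_rate > threshold
```

In an iterative cleanup, the per-class rates would point to the classes whose samples should be re-examined, removed, or reassigned before the classifier is run again.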

Table 4. Procedures used in the classification methods. [Flow diagram; only the entries below are legible in the scanned original.]

Data set    N      N-f    P     Q
1984        319    283    23    4
A           117    101    24    4
B           436    419    20    5

Geology/vegetation model: 1984, 9 classes (N-t = 220); Data Set A, 6 classes (N-t = 97); Data Set B, 6 classes (N-t = 381 and 417).
Cluster-based models: entries of 4, 7, and 9 classes with N-t = 351, 247, 95, and 401; 80% and 60% training sets; results referenced to Tables 11 and 12.

N   : total samples
N-f : samples in factor analysis
N-t : samples in training set
P   : variables
Q   : scores

The classification procedure proposed in this thesis is
based on the scores that are generated from the final factor
analysis. Cluster analysis and classification analysis can be
performed directly on the factor score values. The distance
measurement in cluster analysis was the Mahalanobis Distance of
Order N (N is equal to two, or the Euclidean distance), and average
linkage was used for the clustering. With cluster analysis, one
gets a dendrogram that represents how those samples lie in the Q
factor score plots. One commonly looks for clusters in a single
two-dimensional factor plot that will lead to insight and informa-
tion about the system. Now in Q-space, the dendrogram shows the
natural groupings of the samples along all factors. These groupings
can then be used as a basis for a training set in classification
analysis in two different ways. Both methods use the clusters found
in cluster analysis on the Q scores to determine which samples
compose the classes. Then, one method uses the original P variables
as the input data matrix for the classification algorithms. The
other method uses the Q scores for the data matrix in the class-
ification algorithms. These techniques were performed using both
the KNN and SIMCA classification methods, and on both data sets A
and B.
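The two routes described above can be sketched with a small, self-contained example. It is a simplified stand-in for the ARTHUR procedures, not a reproduction of them: the clustering is a plain average-linkage agglomeration with Euclidean distance, the classifier is a leave-one-out KNN (SIMCA is not shown), and the six samples, their Q scores, and their P variables are invented for illustration.

```python
import math
from collections import Counter

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, n_clusters):
    """Merge the closest clusters until n_clusters remain; return labels."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise distances between the clusters.
                d = sum(euclid(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels

def knn_self_classify(rows, labels, k=1):
    """Leave-one-out KNN: predict each sample from all other samples."""
    predicted = []
    for i, row in enumerate(rows):
        others = sorted((euclid(row, rows[j]), labels[j])
                        for j in range(len(rows)) if j != i)
        votes = Counter(lab for _, lab in others[:k])
        predicted.append(votes.most_common(1)[0][0])
    return predicted

# Invented data: Q factor scores and (transformed) P variables.
q_scores = [[-2.0, 0.1], [-1.8, -0.2], [-2.1, 0.0],
            [1.9, 0.3], [2.2, -0.1], [2.0, 0.2]]
p_variables = [[0.5, 6.1, 1.0], [0.7, 6.3, 1.2], [0.6, 6.0, 0.9],
               [3.1, 8.0, 2.9], [3.3, 8.2, 3.2], [3.0, 7.9, 3.0]]

# Classes always come from clustering the Q scores.
classes = average_linkage(q_scores, n_clusters=2)
# Method 1: classify with the original P variables as the data matrix.
pred_p = knn_self_classify(p_variables, classes)
# Method 2: classify with the Q scores as the data matrix.
pred_q = knn_self_classify(q_scores, classes)
```

Comparing `pred_p` and `pred_q` against `classes` gives the self-classification rate for each route; the only difference between the two methods is the data matrix handed to the classifier.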

General Factor Analysis Results
Factor analysis was run on each separate data base. The
results among each of the data bases were very similar, and were
also quite similar to the preliminary factor analysis on the 1984
only samples. Factor analysis generally yields around four to six
factors that account for 60% to 80% of the total variance (18,25).
The factors are ordered in decreasing amount of total variance. The
first factor, or primary factor, will have the largest variance and
describes the main chemical differences of the system. The next few
factors will have less variance on each and are the secondary
factors. The last factor(s) account for the least amount of
variance but are still considered important in the overall factor
analysis as they may show the finer data details.
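The ordering of factors by variance can be illustrated with simple arithmetic. The sketch below assumes each factor has an associated variance (an eigenvalue in a principal-component style extraction); that extraction method and the ten eigenvalues used are assumptions for illustration, not values from this study.

```python
def variance_summary(eigenvalues):
    """Order factors by variance; return percent and cumulative percent."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    percent = [100.0 * ev / total for ev in ordered]
    cumulative, running = [], 0.0
    for p in percent:
        running += p
        cumulative.append(running)
    return percent, cumulative

def factors_needed(cumulative, target=60.0):
    """Smallest number of leading factors reaching the target percent."""
    for count, c in enumerate(cumulative, start=1):
        if c >= target:
            return count
    return len(cumulative)

# Invented eigenvalues for ten standardized variables.
eigenvalues = [4.0, 1.5, 1.0, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
percent, cumulative = variance_summary(eigenvalues)
```

With these invented values the first factor carries about 40% of the variance, three factors pass 60%, and five are needed to pass 75%, which is the general pattern of a dominant primary factor followed by smaller secondary and last factors.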
Primary factor. The primary factor for each data base,
Factor one, is the major separator of the lakes; this represents the
bedrock types. The carbonic acid weathering cycle (35) has been
studied for years and is proposed to be the main influence of the
lake water chemistry (36). The common components of this cycle are
Ca, Mg, pH, and/or alkalinity (10,11,12). In this study, the

differences in the bedrock solubilities are the largest source of
variance in the lake waters, and are reflected in the composition of
Factor one. The variables in this factor include Ca, Ba,
alkalinity, Mg, specific conductivity, and Sr. (All variables dis-
cussed from now on can be assumed to be log transformed with the
exception of pH and altitude.) The older, less soluble bedrock
types, Precambrian metamorphic and Precambrian igneous, are at the
low end of Factor one, and progress along the factor to include
Tertiary igneous and Pennsylvanian sedimentary bedrock in the middle
range, and the more recent and soluble bedrock types, Cretaceous
sedimentary and Tertiary sedimentary at the high end (37).
Secondary factors. There are two secondary factors, a
"biological input" factor and a "morphology/input-output charac-
teristics" factor. The biological factor may be explained as
reflecting the amount of vegetation and decaying biological
materials that are in contact with the lakes (15,38). The common
variables in this factor include Fe, fluorescence, and Mn. At the
high end of this factor axis are lakes with dense, coniferous
forests, often characterized by a dark colored water; at the low end
are the alpine lakes that are at and above treeline with little
vegetation and soil development. The morphological factor is an
integrator of the catchment basin characteristics that include size,
topography, average overall vegetation and soil development and the
consequential input water contact with the basin's soil, and the
inlet/outlet characteristics (permanent, intermittent, or
non-existent). These characteristics are reflected in the variables F,
Si, and lake size. F and Si originate in the bedrock and soils
(39), and are very mobile or abundant species. They are picked up
by the overland input waters, and in lakes with a low retention
time, do not have a chance to settle or precipitate out. Larger
reservoirs, with permanent inlets and outlets, are found high along
this factor, while shallow, small lakes are at the low end of this
factor axis.
Last factor. The last factor common to the two data sets is
a trace element factor. The variables on this factor include Zn,
Cd, Mo, and -pH (pH increasing as the metals decrease). There are
two groups of lakes characterized by this factor. The first group
tends to be lakes that are high on all factor axes. They are the
lowlands lakes, located on the higher solubility bedrock, with
relatively high human use and mixed inputs. The other group is the
alpine lakes. They separate from the other low-solubility bedrock
lakes at a high level along this factor axis. The lack of vegeta-
tion and soil development are some of the components that are
hypothesized as determining this behavior. The traces can be viewed
as originating from the bedrock, but residing in a bound complex in
the soil and being taken up by the vegetation and hence not as
available in lakes with developed soil and vegetation (40). Also,
the lower levels of dissolved salts in the alpine waters result in
fewer precipitation reactions of the traces, and these elements remain
solubilized (41). In the alpine lakes, where there is little else to
contribute to the lakes' chemistry, any minor bedrock characteristics
will appear more obvious.
Specific Factor Results
Data Set A. The exact factor compositions from Data Set A
(the 1985 only samples) are listed in Table 5. Some of the finer
characteristics of the lake chemistries are exposed in this data
base due to the more limited range of samples of this set. Most of
the lakes that lie along the higher portions of all the factor axes
were not in this data base; primarily the lakes from the middle to
low factor axes range are in this set allowing a closer examination
of these sections. Factor one (bedrock) has an additional variable,
pH, which loads in the same direction as the other variables
described above in the general factor description. This is in
agreement with the lower solubility bedrock types having a low
buffering capacity and therefore the lakes tend to have lower pH
levels (11,12).
The last factor, Factor four (traces), is as described in
the above general description, with the addition of size loading in
the opposite direction to the increasing traces. Factor two
(biological) includes the additional variable, NO3, loading in the
opposite direction to Fe and fluorescence. At higher levels of Fe and
fluorescence there is a correspondingly high vegetation cover that
is expected to take up the NO3, reducing its availability to the
lake water. At low vegetation levels, low Fe and fluorescence, any

Table 5. Factor Compositions From Factor Analysis for Data Set A

Factor   Variable   % Variance Contributed
1        pH         14.0
         Ca         13.8
         alk        13.7
         Mg         13.3
         SpC         9.7
         Ba          8.1     total = 72.6%
2        Fe         18.3
         flor       18.0
         -NO3       17.1
         -alt       10.8
         -SO4       10.4
         Al.d       10.1     total = 84.7%
3        Si         18.8
         SO4        15.4
         F          11.9
         -Al.d       9.8
         size        6.5
         NO3         5.6
         K           5.0
         -pH         4.6     total = 77.6%
4        -size      28.1
         Cd         26.8
         Zn         13.4
         Mn         10.0
         -pH         5.5
         NO3         4.6     total = 88.4%

sources of NO3, atmospheric and surficial, would contribute directly
to the higher levels of NO3 in the lake water. Figure 6 shows this
separation of the alpine lakes as a function of the Factor one
(bedrock) and Factor two (biological). This plot may be viewed as a
potential indicator of lakes that would be most sensitive to the
effects of acid deposition. The most vulnerable lakes will be in
the lower right corner, where the bedrock is at its lowest
solubility, the biological influences are at a minimum, and the NO3
is already at detectable levels; primarily the alpine lakes are
located in this corner. In addition, there are four lakes in the
extreme right lower corner. All of these lakes are alpine lakes
with minimal soil and vegetation. Three of these lakes are on
metamorphic and igneous Precambrian geology, and are located along
the Front Range; all components that point toward acid deposition
sensitivity. In addition, the plot of Factor four (traces) versus
Factor two (biological/NO3), as shown in Figure 6, may be showing
regions that are also potentially sensitive. The lower left corner
is high along Factor four (traces), which is one of the hypothesized
results of acidification (18). Also, this corner has low biological
components with higher NO3 levels, which, in combination with the high
traces, points to high sensitivity. While there are no lakes in the
extreme left corner, indicating lack of obvious acidification in the
sample set, there is a group of samples in that area that may be
sensitive. These lakes are again the alpine and "sub-alpine" lakes,
with minimal soil and vegetation, but also include a wider range of

[Figure 6 axis label: Factor one (pH, Ca, alk, Mg, SpC, Ba)]
Figure 6. Factor score plot of Data Set A showing the separation
of the alpine/subalpine lakes along the bedrock
(one), biological (two), and traces (four) factors.

bedrock types: Pennsylvanian sedimentary and Tertiary igneous along
with the Precambrian metamorphic and Precambrian igneous. In addi-
tion, three out of the four lakes identified in the other poten-
tially sensitive lakes' plot, Figure 6, are in this corner.
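Reading the "corners" of the factor score plots amounts to selecting lakes with low bedrock and biological scores and high trace scores. A minimal sketch of that selection is given below. The cutoff values, the lake names, the scores, and the sign convention (higher score meaning more of the listed variables) are all assumptions for illustration; in the thesis the regions are read from the plots rather than from fixed thresholds.

```python
def flag_sensitive(scores, low_cut=-0.5, high_cut=0.5):
    """Return lakes in the potentially sensitive corner of factor space.

    scores: {lake: (bedrock, biological, traces)} factor scores.
    A lake is flagged when the bedrock and biological scores are both
    below low_cut and the traces score is above high_cut.
    """
    flagged = []
    for lake, (bedrock, biological, traces) in scores.items():
        if bedrock < low_cut and biological < low_cut and traces > high_cut:
            flagged.append(lake)
    return sorted(flagged)

# Invented factor scores for four hypothetical lakes.
example_scores = {
    "alpine lake 1": (-1.4, -1.1, 0.9),
    "alpine lake 2": (-1.2, -0.8, 0.2),
    "forest lake":   (-0.9, 1.3, -0.3),
    "reservoir":     (1.8, 0.4, 1.1),
}
flagged = flag_sensitive(example_scores)
```

Only the first hypothetical lake satisfies all three conditions; the second fails on the trace score, which mirrors the distinction drawn between lakes in the extreme corner and those merely near it.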
On Factor three, the SO4 is seen loading in the same
direction as the soil/morphological components Si and F. SO4 sources
include bedrock, soil, and atmosphere. Both Factor two and Factor
three include the SO4 variable, Factor two having SO4 as a minor
component. In general, the SO4 and NO3 load in the same direction
as Si and F, and in the opposite direction to Fe and fluorescence.
In terms of Factor three, the soil contact/morphological descriptor,
the SO4 will be picked up from the soil with the Si and F, along
with any atmospheric sources, and will increase along this factor.
Keeping in mind the minor loading that SO4 has on Factor two, the
SO4 variance may be originating primarily from an atmospheric
source, as it increases with NO3 while the biological components are
decreasing.
Data Set B. Data Set B (the combined 1984 and 1985
samples), whose five factor compositions are listed in Table 6,
includes the entire range of lake chemistries. This data set
reflects both the intra-seasonal range and the lack of any sig-
nificant intra-seasonal or annual differences in the data analysis.
In examining each of the factor plots, one sees the random direction
that the replicated samples take in each sampling session, Figure
7. Also, the first sampling, 1984 spring, is shown to have the

Table 6. Factor Compositions From Factor Analysis for Data Set B

Factor   Variable   % Variance Contributed
1        Ba         19.7
         Ca         14.1
         Alk        12.2
         -F         10.6
         Mn          8.4
         Mg          7.5
         -Al.d       6.7
         SpC         6.1
         Sr          5.6
         Si          5.1     total = 96.0%
2        F          28.0
         Na          9.5
         -Alt        8.1
         -Ba         7.8
         Mo          6.6
         SO4         6.0
         Cl          5.6
         Si          5.4     total = 77.0%
3        Al.d       36.7
         Flor       24.6
         Fe         23.2     total = 84.5%
4        Zn         41.1
         -pH        26.0
         Mn         14.5
         Mo         11.6     total = 93.2%
5        Si         57.0
         -Mo        12.9
         Mn          9.3
         -Al.d       7.2
         F           6.1     total = 92.5%

Figure 7. Factor score plot of Data Set B (top) showing the
range of the 1984 spring, 1984 fall (circled area),
and the 1985 (shaded area) samplings. The lower
plot shows the randomness of all samplings in the
repeated lakes and the shift in Rito Alto Lake (*).
Each arrow proceeds from 1984 spring, 1984 fall,
to 1985 sample.

widest variance range, as expected. The 1985 samples all lie within
the 1984 spring range, Figure 7, fulfilling one of the 1985 sampling
objectives. One lake, Rito Alto Lake (#445), shows an unusually large
shift along Factor one (bedrock). The 1984 samples are at the
moderately low end of Factor one (bedrock), while the 1985 sample is
at the high end of Factor one. In 1985 it was noted that there was
a large talus slide just on the uphill side of the lake that the
primary input water runs through; in 1984, there was no note of
this. It is very likely that the slide occurred within the last
year, and the increased contact with the gravel and bedrock shows up
as a large shift, increasing along Factor one.
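A shift of this kind can also be picked out numerically by comparing the score of each repeated lake between samplings. The sketch below is one possible way to do so and is not the method used in the thesis, where the shift was read from the factor score plot. The function names, the rule of flagging shifts larger than three times the median absolute shift, and the scores are assumptions for illustration.

```python
from statistics import median

def factor_shifts(first, last):
    """Score change per repeated lake (last sampling minus first)."""
    return {lake: last[lake] - first[lake]
            for lake in first if lake in last}

def unusual_shifts(shifts, multiple=3.0):
    """Lakes whose shift exceeds a multiple of the median absolute shift."""
    typical = median(abs(s) for s in shifts.values())
    return sorted(lake for lake, s in shifts.items()
                  if abs(s) > multiple * typical)

# Invented Factor one scores for four hypothetical repeated lakes.
scores_1984 = {"lake a": -0.5, "lake b": 0.2, "lake c": -0.4, "lake d": 0.1}
scores_1985 = {"lake a": -0.4, "lake b": 0.0, "lake c": 1.6, "lake d": 0.2}

shifts = factor_shifts(scores_1984, scores_1985)
large = unusual_shifts(shifts)
```

In this invented example only "lake c" moves far more than the other repeated lakes, which is the kind of behavior that would prompt a check of the field notes for a physical cause.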
Factor one (bedrock) and Factor four (traces), are as
described in the general factor results section, and show the same
types of separations as previously noted. Factor three (biological)
is also as described previously with the addition of dissolved Al,
an expected contribution from biological sources (13,15). As with
Data Set A, the corners of the bedrock, traces, and biological plots
can be used to indicate potentially sensitive lakes. The same type
of lakes as seen and described in Data Set A as susceptible are also
seen in this data base at low bedrock solubility, low biological
component, and high trace levels; Figure 8. The soil/morphological
factor is split into two factors, Factors two and five. Factor two
is composed primarily of F and separates the large, channeled-input
reservoirs at the high end from the small-input, shallow lakes at
the low end. Along Factor five, composed primarily of Si, there is a

[Figure 8 axis label: Factor one (Ba, Ca, alk)]
Figure 8. Factor score plot of Data Set B showing the
separation of the alpine/subalpine lakes along the
bedrock (one), biological (three), and traces (four)
factors.
similar pattern. In addition, there are large diagonal spreads
among two bedrock types, Tertiary igneous and Precambrian igneous,
on the Factor five versus Factor one plot. This reflects the large
range of lake types that are among these groups. The other bedrock
groups, Precambrian metamorphic and Pennsylvanian sedimentary, have
a more homogeneous group of lakes and consequently have less
spread along the factor axes, particularly Factor five.
1984 Geology-Vegetation Classification Model
The classification model developed from the 1984 preliminary
data was based on the hypothesis that the bedrock geology and sur-
rounding vegetation were major contributors to the lake water
chemistry. Thus, the bedrock and vegetation differences should
cause natural groupings or classes within the data. Using a
published geologic map (42) and field notes, lakes were assigned to six
different bedrock classes. This model represents the majority of the
different bedrock types in Colorado with the exception of glacial
deposits and alluvium, which did not comprise a separate class but
were assigned to the geographically neighboring bedrock classes. These
six classes are: Precambrian metamorphic (MPC), Precambrian igneous
(IGPC), Pennsylvanian sedimentary (SP), Tertiary igneous (TIG),
Lower Cretaceous sedimentary (SKL) together with Tertiary sedimentary
(ST), and Upper Cretaceous sedimentary (SKU). In addition to these
bedrock separations, the three general vegetation categories of
grasslands-shrublands, coniferous, and unvegetated/alpine were added
to the model (43).

Finally, an additional distinction in one bedrock category was found
based on the inlet/outlet status of the lake. In the SP class,
input water from lakes with only intermittent or no inlets and
outlets and small catchment basin sizes can be viewed as having
little soil contact. The main input sources must be from direct
precipitation and/or underground sources. These lakes are con-
trasted to lakes with permanent, channeled inlets and outlets,
having a large catchment basin size. These input waters have a
large contact with the soil, due to both increased surface area
contact and increased length of contact. They form a separate class
from the other low-soil contact SPs. This separation was seen only
with the Pennsylvanian sedimentary bedrock. This is probably
due to a sampling coincidence; the range of lake morphology/soil-
contact characteristics in this group is very large. The other
bedrock classes tend to have lakes that are one type of soil-contact
or the other; they do not have the mixture of lake types that this
SP category has.
The combined geology, vegetation, and inlet/outlet status
characteristics lead to a classification model comprising nine
different categories. These classes are summarized in Table 7 and
their SIMCA and KNN self-classification rates are presented in Table
8. The overall self-classification rate is 90% (SIMCA) with the
training set composed of 67% of the total sample population. Most
of the samples removed from the training set represented lakes with
very inhomogeneous, mixed bedrock and vegetation types, or obvious
signs of anthropogenic influences. This

Table 7. Summary of the nine geology/vegetation classes.

Class 1, MPC
  Bedrock:     Precambrian metamorphic
  Vegetation:  below timberline; includes Engelmann spruce, lodgepole pine, aspen
  Other:       oligotrophic; little to moderate; moderate human use

Class 2, Alpine
  Bedrock:     Precambrian metamorphic, Precambrian igneous, Tertiary igneous, Pennsylvanian sedimentary
  Vegetation:  alpine tundra; shrub, grass, meadow
  Other:       small drainage basin; isolated; exposed bedrock

Class 3, TIG
  Bedrock:     Tertiary igneous
  Vegetation:  below timberline, mixed; Engelmann spruce, lodgepole pine, aspen, shrub, meadow, grassland
  Other:       broad, mixed group

Class 4, SDKt, SDKl
  Bedrock:     Tertiary and Lower Cretaceous sedimentary
  Vegetation:  arid lowlands; includes grassland, shrub, sagebrush, aspen
  Other:       high human activity, agricultural use

Class 5, SDKu
  Bedrock:     Upper Cretaceous sedimentary
  Vegetation:  arid lowlands; includes grassland, shrub, sagebrush, aspen
  Other:       high human activity, agricultural use

Class 6, SD CON
  Bedrock:     Tertiary and Cretaceous sedimentary (includes Upper and Lower Cretaceous and Tertiary sediments)
  Vegetation:  coniferous; includes Engelmann spruce, lodgepole pine, ponderosa pine, aspen
  Other:       high bedrock, minimal source water/soil interaction

Class 7, SD PN-BR
  Bedrock:     Pennsylvanian sedimentary (includes Triassic, Permian, Paleozoic, Pennsylvanian sediments)
  Vegetation:  below timberline; Engelmann spruce
  Other:       minimal source water/soil interaction

Class 8, SD PN-SL
  Bedrock:     Pennsylvanian sedimentary (same as Class 7)
  Vegetation:  below timberline, mixed; Engelmann spruce, ponderosa pine, aspen, shrub, grassland
  Other:       large source water/soil interaction

Class 9, IGPC
  Bedrock:     Precambrian igneous (includes granitic rocks)
  Vegetation:  below timberline, mixed; Engelmann spruce, lodgepole pine, ponderosa pine, aspen, grassland
  Other:       oligotrophic; isolated, or usage

Table 8. SIMCA Results for GEO/VEG Model on 1984 Data

Class       cat. #   % correct   % classified as other classes (as listed)
M PC        1        100
ALPINE      2         93         4, 4
IG T        3         81         8, 3, 6, 3
SD Kl,T     4         90         3, 3, 3
SD Ku       5         84         2, 9, 2
SD CON      6         88         6, 6
SD PN-BR    7        100
SD PN-SL    8        100
IG PC       9         93         7
Total correct = 90%
CON = coniferous; BR = bedrock; SL = soil

SIMCA Results for GEO/VEG Model on Data Set A

Class       cat. #   % correct   % classified as other classes (as listed)
IG T        1         84         11, 5
ALPINE      2         90         3, 7
SD K,T      3        100
SD PN       4         88         6, 6
M PC        5         83         8, 8
IG PC       6        100
Total correct = 90%

classification model adequately represents the more distinctive,
homogeneous, bedrock-separated lakes.
These class descriptions were initially used as a class-
ification basis for Data Set A (1985 only samples). Each new lake
selected for sampling in 1985 was chosen on the basis of the under-
lying bedrock geology and general surrounding vegetation. Conse-
quently, each new lake was predicted to belong to one of the nine
geology/vegetation classes, and 1984 lakes repeated already had an
assigned class. For Data Set A, the eleven analytical blanks, and
five samples identified from the factor analysis as outliers were
excluded from the initial classification runs. The exclusion of
these five samples and the small number of lowlands lakes sampled
this year resulted in too few samples in each of Categories 4, 5, and 6
(SDKt, SDKl, SDKu, and SD coniferous). Consequently, these similar
categories were combined to form a single category consisting of
Tertiary and Cretaceous sedimentary bedrock and mixed vegetation
ranging from grasslands to coniferous forest. The initial
classification runs revealed a large overlap between Categories 7 and 8,
SP bedrock and SP soil. These two categories were therefore combined
to form a single category (without any distinction of the input
water sources); the category was composed of lakes on Pennsylvanian
sedimentary bedrock, with mixed vegetation. For the final test set,
an additional four samples were removed from the training set. Two
had shown atypical behavior in the 1984 sampling, one was a high use
lake utilized as a stock watering lake, and one was a borderline,

mixed bedrock and vegetation lake. The results of this class-
ification scheme are presented in Table 8. The overall correct
classification rate of 90% is good for the small test set and the
minimal effort and time invested in forming this particular test
set. The high classification rate was also expected since the
lakes/samples chosen for this set were selected on the basis of this
classification scheme.
The classification success was markedly less with Data Set
B. Again, the nine 1984 classes were used as the category designa-
tions for the training set. Initial runs uncovered a large overlap
between several different categories. Categories 4 and 5 had the
most extensive overlap so they were combined to form a single
category defined by Tertiary and Cretaceous sedimentary bedrock,
arid lowlands vegetation consisting of grasslands and shrublands,
with high human use. Again, categories 7 and 8 overlapped, so were
combined to form a single Pennsylvanian sedimentary bedrock class
without any source water distinction. Categories 1 and 3, which
showed some overlap in 1984, had enough unresolvable overlap this
year to justify combining the two into a class defined as
Precambrian metamorphic and Tertiary igneous bedrock, coniferous,
meadow, and grassland vegetation, and a range of use varying from
low to moderate. The test set initially consisted of the blanks and
samples identified in factor analysis as outliers. Then, an addi-
tional test set was hand-picked based on their misclassification and
any justification for unpredictable behavior such as unusual intra-

class use, borderline/mixed bedrock and vegetation, and any
catastrophic occurrences (landslides, local forest fires, etc.).
The result of this classification is presented in Table 9.
Without the additional test set, the overall classification rate
approaches the rate with the test set: 75% without, and 77% with the
test set. In other words, the tedious process of picking out a test
set does very little to improve the self-classification success with
this particular data set. The misclassifying samples in this data
set are not outlying samples, but are mainly borderline and mixed-
class samples. With the 1984 data set and classification
analysis, forming a test set was quite successful since it removed
the samples that were outliers, that is, samples that were on the extremes of
the classes. The additional samples from the 1985 sampling are
different from the 1984 lakes. They were not chosen randomly, but
to represent the established categories. They have expanded the
category boundaries by filling in and rounding out the gaps in the
categories caused by an insufficient number of samples in the
original nine categories. The range of the categories was well
represented by the 1984 only data set; however, undersampling within
the categories did not allow the natural overlap to be determined.
This effect may be illustrated by examining just Category 2,
the alpine lakes. As shown in Figure 9, the factor analysis on this
category demonstrates the 1984 and 1985 sampling spread. The 1984
samples established the range for this data set along Factor one.
Distinct gaps exist in the location of the samples on the factor
score plot that correspond to four

Table 9. SIMCA Results for GEO/VEG Model on Data Set B

381 Sample Training Set
% classified as:
cat. # 1 2 3 4 5 6
M PC,IG T 1 77 10 4 5 1 3
ALPINE 2 2 85 7 3 3
SD K,T 3 1 2 78 12 7 1
SD CON 4 6 6 9 76 3
SD PN 5 7 5 21 68
IG PC 6 4 11 4 82
Total correct = 77%

417 Sample Training Set
% classified as:
cat. # 1 2 3 4 5 6
M PC,IG T 1 70 8 9 5 3 5
ALPINE 2 3 90 2 5
SD K,T 3 3 1 77 4 12 3
SD CON 4 26 3 3 65 3
SD PN 5 4 8 12 76
IG PC 6 9 3 12 76
Total correct = 75%

Figure 9. Factor score plot of the alpine lakes comparing the
1984 (circled dots) to the 1985 (dots) lakes, and
showing their Factor one bedrock ranges and
[Bedrock range labels on the plot: Sd Pn, Ig T, M Pc, Ig Pc]
different bedrock types: IGPC, MPC, TIG and SP. When the 1985
samples are added, they all fall within the bedrock boundaries of
the 1984 samples, but now are in the areas where there were per-
ceived gaps in 1984. The apparent bedrock separations seen in the
1984 sampling now appear as a continuum in the combined data set.
This phenomenon occurs for all the categories. The 1985 additional
samples have reinforced the 1984 centroids, and have also filled in
the gaps between the categories. The resulting continuum of lakes
is not well represented by THIS PARTICULAR classification scheme.
These results indicate that a different classification scheme is
needed. An adequate scheme must account for the overlap and con-
tinuity of the lakes by more intricately combining the bedrock,
vegetation, soil, morphology, and use of the lakes. The bedrock
differences are insufficient as the main distinction in a class-
ification scheme. Additional vegetation, soil, morphology, and use
characteristics must be considered along with each major bedrock
type in order to adequately classify the lakes. The original as-
sumptions of variables which influence the lake chemistry are still
true. However, they need to be incorporated in a classification
model in a different manner.
Initial cluster analysis. Cluster analysis on the P vari-
ables was performed on each of the data bases, and the corresponding
dendrograms generated. The initial runs were made on the variables
without a log transform. The dendrogram was found to be too dis-
torted to be of any utility. This is due to the logarithmic dis-

tribution of the variables. Consequently, the variables were log
transformed (excluding pH and altitude) and all the samples were
used. For Data Set A, there were several robust clusters that
corresponded to some of the categories of the 1984 model. As shown
in Figure 10, the entire dendrogram for the 1985 sampling (Data Set
A) falls within the two main clusters A and B of the 1984 sampling
set. The outlying lakes and many of the lowlands lakes were not
sampled in 1985; just the main section of the 1984 state wide
dendrogram was sampled in 1985. This is due to the 1985 sampling
design where the lakes were selected to represent this section of
lakes. In examining just the Data Set A dendrogram, robust clusters
are apparent for each of the following groups: the alpine lakes; the
borderline alpine or subalpine lakes that have all types of
bedrock geology except Cretaceous sedimentary and Tertiary sedimen-
tary; the heavily forested lakes with Precambrian metamorphic and
Tertiary igneous bedrock; the non-alpine lakes on Pennsylvanian
sedimentary bedrock; the blanks and the factor analysis identified
outliers. One hypothesis based on this dendrogram is that these
clusters may indicate classes of lakes that could be used to develop
a classification scheme. This approach will be used for the
development of the 1985 model.
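The effect of the log transform described above can be illustrated with a small Python sketch. It assumes each sample is a mapping of variable name to a positive value. The variable names, the lake values, and both helper functions are hypothetical and are not drawn from the thesis data. The sketch shows how a tenfold change in a trace element is nearly invisible in a raw squared distance but becomes a large contribution once the concentrations are log transformed (pH and altitude are left untouched).

```python
import math

def log_transform(sample, skip=("pH", "altitude")):
    """Apply log10 to every variable except those named in `skip`.

    Values must be positive; zeros or negatives need separate handling.
    """
    return {name: (value if name in skip else math.log10(value))
            for name, value in sample.items()}

def share(a, b, name):
    """Fraction of the squared Euclidean distance between a and b
    that is contributed by the variable `name`."""
    total = sum((a[k] - b[k]) ** 2 for k in a)
    return (a[name] - b[name]) ** 2 / total

# Two hypothetical dilute lakes (illustrative values only).
lake_1 = {"Ca": 1.0, "Zn": 0.001, "pH": 6.5}
lake_2 = {"Ca": 2.0, "Zn": 0.010, "pH": 6.6}

raw_share = share(lake_1, lake_2, "Zn")
log_share = share(log_transform(lake_1), log_transform(lake_2), "Zn")
print(raw_share, log_share)
```

In the raw space the distance is controlled almost entirely by the variable with the largest absolute values, which is one plausible reading of why the untransformed dendrogram was reported as distorted.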
The remaining data set, Data Set B (combined 1984 and 1985
data), did not produce a dendrogram with clusters that correspond to
the nine 1984 category types as did Data Set A. Data Set B shows
some of the clustering of the SIMCA categories, but in a general

Figure 10. Clustering dendrogram showing the inclusion of just
the 1985 sample, Data Set A, within the main
clusters, A and B, of the 1984 samples.

way, as shown in Figure 11. Categories 4 and 5 form a cluster of
arid, lowlands lakes. Category 3 is separated into three different
clusters, one of which combines with other coniferous lakes on
Precambrian metamorphic, Tertiary sedimentary, and Cretaceous
sedimentary bedrock. The lower use, coniferous lakes on IgPC and
MPC from categories 1 and 9 form a cluster. The alpine lakes, the
low-input SP lakes, and the borderline sub-alpine lakes form a
large cluster. Again, all the blanks and outliers form the loose
clusters at the edges of the dendrogram. These clusters have
descriptions similar to those in the Data Set A dendrogram, and will
be the foundation of the 1985 limnological classification scheme.
1985 Limnological Model
Cluster Analysis on Scores from Data Set B
The forming of new classes was based on the dendrogram
produced from the cluster analysis of the Q factor scores. For Data
Set B, Q is equal to five; these are the five retained factors
discussed previously. A summary of this dendrogram is presented in
Figure 12. In examining this dendrogram, there are four major
clusters plus a small cluster of outlying samples. Within each of
these four clusters (R1-R4), there are further subgroups that represent
the subtle differences within the four major groups (UA, UB,
UC, OA, OB, OC, MA, MB). The classification methods were tested on
both the four major groups and the nine smaller groups. The names
of the groups are based on the idea of what the total quantity and

Figure 11. Clustering dendrogram of Data Set B showing the
placement of the nine Geo/Veg model classes.
[Dendrogram column headings: Geo/Veg Class; Class Description]

Figure 12. Clustering dendrogram of the FACTOR SCORES of the
Data Set B showing the robust clusters (R1-R4) and
the corresponding main groups (classes), subgroups,
outliers, and test set samples.
[Dendrogram headings: Robust clusters; Main groups. * = test set samples]

concentrations of all chemical species contribute to the distinctive
lake water; that is, a nutrient or trophic scale. These names are:
1) ultra-oligotrophic, to mean very few or lowest concentrations; 2)
oligotrophic, to mean few or low concentrations; 3) mesotrophic, to
mean average or intermediate; and 4) lowlands, to mean not necessarily a
eutrophic or dying lake but one having the highest concentrations of
inorganic components in this study. Each of these clusters is
examined on the factor plots to determine the chemical reasons for
their clustering and the chemical characteristics of each category.
A table summarizing these groups is presented in Table 10.
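The step of forming classes from a dendrogram of factor scores can be sketched in Python as follows. This is only an illustration under stated assumptions: the linkage rule (average linkage on Euclidean distance) is an assumption, since this section does not state which linkage was used. The two-dimensional scores are hypothetical, and the function name is illustrative.

```python
import math

def average_linkage(points, n_clusters):
    """Agglomerative clustering with average linkage (assumed rule).

    points: list of score vectors; returns clusters as lists of indices.
    """
    clusters = [[i] for i in range(len(points))]

    def mean_distance(c1, c2):
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair; the merged cluster keeps position a.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical scores on two retained factors (NOT thesis data).
scores = [(-2.0, -1.0), (-1.8, -1.2), (0.0, 1.0),
          (0.2, 0.9), (2.0, 0.0), (2.1, 0.2)]
groups = average_linkage(scores, n_clusters=3)
print(groups)
```

Stopping the merging at a chosen number of clusters plays the role of cutting the dendrogram at a similarity level. As the text stresses, the resulting groups still need to be checked against the factor plots and field descriptions before they are used as classes.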
Ultra-oligotrophic group. This group is quite similar to
the alpine group (class 2) in the 1984 preliminary model, and also
includes most of the SP low soil contact (class 7) lakes. The
bedrock types include the low-solubility Precambrian igneous,
Precambrian metamorphic, Pennsylvanian sedimentary, and Tertiary
igneous, with all samples located at the lowest end of Factor axis
one (bedrock). The vegetation is minimal with predominantly alpine
vegetation, bare rock and talus, and some sparse coniferous areas;
this is chemically shown by their location at the low end of Factor
axis three (biological), Figure 13. These high altitude lakes
generally have a small catchment basin, no permanent inlets/outlets,
and little contact with the soil. This is indicated by their
presence at the low end of Factor axes two and five
(soil/morphology), Figures 14 and 15. An interesting characteristic

FACTORS:
  Bedrock: Ba, Ca, alk
  Bio: Al.d, flor, Fe
  I/O-soil: F, Si
  Traces: Zn, Mn, Mo
NUTRIENT (trophic) SCALE:
  Ultra: very few
  Oligo: few
  Meso: intermediate
  Eu: abundant
GROUP Alpine/subalpine Ave. Mnt. Conif. Mixed, Mnt.
GROUP oligo shallow average
FACTORS: Bedrock low low-mod mod low-mod low-mod mod-high mod mod-high mod-high high high
Bio mod low mod high mod-high mod-high mod mod-high mod mod high(Fe)
subalp/ alpine subalp/ heavy mixed mixed mixed conif. mixed shrub, mixed
rocky conif. conif. conif. conif. conif. grass
I/O low-mod low-mod low mod-high mod-high mod low-mod high Si mod-high low to low
soil low F high
(shallow) (res.)
Traces high mod mod-high mod low low mod mod low low-mod high

Figure 13. Factor score plots of Data Set B showing the
separation of the main and subgroups along the
bedrock (one) and biological (three) factors.
[Factor one axis label: Ba, Ca, alk]

Figure 14. Factor score plots of Data Set B showing the
separation of the main and subgroups along the
bedrock (one) and soil/morphology (two) factors.
[Factor one axis label: Ba, Ca, alk]

Figure 15. Factor score plots of Data Set B showing the
separation of the main and subgroups along the
bedrock (one) and soil/morphology (five) factors.
[Factor one axis label: Ba, Ca, alk; plot labels include Meso and Lowlands]

of these lakes is that while they have low values for most of the
variables, they are moderately high to high in the trace elements
Zn, Mn, and Mo (Factor four). These lakes are the same group identified
in the initial factor analysis as being the most potentially sensitive
lakes. In particular, the subgroup Ultra-oligotrophic A is
located in the extreme corners of the potential-sensitivity plots, the
plots of Factor one (bedrock) versus Factors three (biological) and four
(traces), Figures 13 and 16. To a lesser extent, the subgroup
Ultra-oligotrophic B shows similar behavior. While the subgroup
Ultra-oligotrophic C tends to be at a moderate level for many components,
it is at the lowest levels for Si and is low in F. This is interpreted
as very minimal soil contact, reflecting the lakes' morphology: no
permanent inlets, intermittent outlets, and a small catchment basin.
Oligotrophic group. This group may be described as a lim-
nologically typical coniferous mountainous lake. This group in-
cludes lakes from the MPC (class 1), IgPC (class 9), IgT (class 3),
and SK and ST coniferous (class 6) classes from the 1984 model.
These bedrock types are chemically represented at the moderately low
level along Factor one (bedrock). They are heavily vegetated,
primarily with spruce-fir and lodgepole pine forests, with some
mixing of local meadows. These lakes are correspondingly located
high along Factor axis three (biological). They have low to
moderate levels of trace elements, Factor four. This chemical
behavior can be interpreted as reflecting the heavy vegetation

Figure 16. Factor score plots of Data Set B showing the
separation of the main and subgroups along the
bedrock (one) and traces (four) factors.
[Factor one axis label: Ba, Ca, alk]

characteristics. The trace elements tend to be taken up by the
vegetation and bound in the soil (15,41). This makes the traces
less available and less solubilized by the overland source water.
These lakes have significant overland water input with permanent
inlets and outlets; they are correspondingly moderately high along
the soil/morphology factor axes, two and five. Although three
subgroups are indicated, there are only minimal separations and
differences among these subgroups. Subgroup Oligotrophic A is the
lowest of the Oligotrophic subgroups along Factor axis one, Figures
14 and 15, while subgroup Oligotrophic C tends to be lower than the
other Oligotrophic samples along the soil/morphology Factor axes,
two and five.
Mesotrophic group. This group represents a more
heterogeneous type of mountainous lakes. This is a varied and
nonhomogeneous group that has three distinctive subgroups. Compared
with the 1984 model, the samples are from most of the nine
categories including MPC (class 1), IgT (class 3), ST and SK
(classes 4 and 5), ST/SK CON (class 6), SP Soil (class 8), and IgPC
(class 9). The geology is predominantly Tertiary igneous and
Cretaceous sedimentary and Tertiary sedimentary but also includes
the entire range of bedrock types. Along Factor axis one (bedrock),
these samples range from moderate to high. The vegetation is also
mixed, consisting of mixed coniferous forests, meadows, shrublands
and aspens; the scores along biological factor 3 are correspondingly
moderate. The trace levels, Factor four scores, are in the low to
moderate range, again in

agreement with the mixed vegetation and soil development. Along the
soil-morphology factor, the samples have an extensive range that
delineates the three subgroups in the class. Part of Subgroup A is
characterized by a distinctive combination of high Si, Factor five,
and low F, Factor two (Figures 14 and 15). These lakes are all
shallow since they do not have much overland source water. Thus,
they have low F and Si levels, but the shallowness of the lake leads
to a larger proportion of the lake water in direct contact with the
lake bottom sediments which can lead to elevated Si levels. The
rest of Subgroup A is more of a borderline Oligo-Mesotrophic group,
with predominantly coniferous vegetation and includes some of the
less soluble bedrock types. Subgroup B can be considered the typical
Mesotrophic lake with a moderate level of most of the components.
Lowlands group. This is the smallest of the groups and
contains the plains lakes and the remaining arid, flatlands lakes.
Most of the lakes from the two classes in the 1984 Model, ST and SK
(classes 4 and 5), are in this group. Although this group has much
intraclass variance, it is distinct from the other three main
classes. The geology consists of higher-solubility Cretaceous sedimentary
and Tertiary sedimentary bedrock; this group occupies the
highest position along the bedrock factor axis. The vegetation
consists primarily of dry shrub and grasslands; the lakes lie at an
intermediate level along Factor three (biological) and are low to
intermediate along Factor four (traces). The morphology of the
catchment basins is quite variable, ranging from large reservoirs

with permanent inlets/outlets to smaller lakes with small catchment
basins and intermediate inlets/outlets. The positions along the
morphology/soil factors, two and five, are correspondingly quite
broad (Figures 14 and 15). Although there is a large chemical
variance in this group, there are no distinct separations that lead
to any subgroups.
Dendrogram characteristics. Upon closer examination of the
factor scores dendrogram, one can see the clusters that correspond
to the subgroups that were found chemically in the scatter plots and
limnologically in the subgroups' visual descriptions. The most
distinct subgroups are in the Mesotrophic and Ultra-oligotrophic
groups where the subclusters are not only split within one robust
cluster, but are also split in other robust clusters. As shown in
Figure 12, there are four robust clusters labeled R1, R2, R3, and
R4. Two of these clusters, R1 and R4, include the Mesotrophic and
Ultra-oligotrophic classes and their subgroups scattered between
them. The other groups without distinct classes, Oligotrophic and
Lowlands, directly correspond to the robust clusters R2 and R3.
With the Oligotrophic group, the three subclasses are indicated on
the dendrogram, Figure 17, but there are not the large similarity
breaks in that cluster as seen in the Ultra-oligotrophic and
Mesotrophic subgroups. With the lowlands group, there are no sig-
nificant breaks as shown in Figure 18. The samples form more of a
stairstep, loose cluster on the dendrogram which indicates a cluster
with wide variability but without any distinguishing breaks or

Figure 17. Clustering dendrogram of the FACTOR SCORES of the
Data Set B showing the robust clusters, R1 and R2,
and their subclasses.
[* = test set samples]

Figure 18. Clustering dendrogram of the FACTOR SCORES of the
Data Set B showing the robust clusters, R3 and R4,
and their subclasses.
[Subgroup labels: average, shallow, oligo; * = test set samples]

subclusters. Looking only at the dendrogram, one can hypothesize
only some of the limnologically consistent classes. The small,
robust, subclusters are generally chemically and limnologically
consistent with the small subgroups. However, the larger clusters
must be examined by additional methods to confirm or re-group the
natural groupings. Utilizing the factor plots, the interpretation
of the corresponding factors, and the limnological and field
descriptions is required for a correct interpretation and descrip-
tion of the main lake classes.
Classification Results on Data Set B
The above described natural groups were used as the basis
for a training set. First, these groups were used for the category
designations while the actual classification algorithms were run
using the original P variables. The samples indicated as outliers
on the dendrogram, Figure 12, were excluded from the training set.
In addition, upon closer examination of the dendrogram clusters,
small individual clusters within the larger subgroups or main groups
were found to have different behaviors in terms of the dendrogram,
factor plots, and limnological descriptions. These samples, as
designated by an asterisk (*) in Figure 12, were also excluded from the
training set and run later in the test set. Second, the class-
ification model was run using all nine subgroups; these subgroups
were gradually recombined and the classification model rerun on the
four main groups. Both the KNN and SIMCA classification procedures
were run. Last, this entire procedure was repeated using the Q

scores in the classification algorithms. For all these procedures,
only the overall total percent correct will be considered. A sum-
mary of the total correct classification percent for these sixteen
models for comparison purposes is presented in Table 11.
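The recombination of subgroups into main groups can be sketched as a simple relabeling followed by a recount of correct assignments. In the Python sketch below, the subgroup codes UA through MB follow Figure 12. The code "LOW" for the Lowlands group, the function names, and the example labels are hypothetical and are not thesis results.

```python
# Subgroup code -> main group. "LOW" is an illustrative shorthand only.
MAIN_GROUP = {
    "UA": "Ultra-oligotrophic", "UB": "Ultra-oligotrophic",
    "UC": "Ultra-oligotrophic",
    "OA": "Oligotrophic", "OB": "Oligotrophic", "OC": "Oligotrophic",
    "MA": "Mesotrophic", "MB": "Mesotrophic",
    "LOW": "Lowlands",
}

def percent_correct(true_labels, predicted_labels):
    """Overall percent of samples whose predicted label matches."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * hits / len(true_labels)

def to_main_groups(labels):
    """Relabel subgroup codes with their main group names."""
    return [MAIN_GROUP[code] for code in labels]

# Hypothetical assignments for eight samples (NOT thesis data).
true_sub = ["UA", "UB", "OA", "OC", "MA", "MB", "LOW", "UC"]
pred_sub = ["UB", "UB", "OA", "OA", "MA", "OB", "LOW", "UC"]

nine_class = percent_correct(true_sub, pred_sub)
four_class = percent_correct(to_main_groups(true_sub),
                             to_main_groups(pred_sub))
print(nine_class, four_class)
```

When a fixed set of predictions is relabeled this way, a confusion between two subgroups of the same main group becomes a correct assignment, so the rate cannot fall. In the study the model was rerun on the merged classes, so this sketch illustrates only the bookkeeping, not the rerun itself.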
Training set reduction. As expected, removing the desig-
nated test set samples (* samples) from the training set improves
the overall classification rate by several percent in all cases.
The classification algorithm for the test set will first classify a
test sample, and then provide a general measure of how well that
sample belongs in the selected class. For the test set samples, the
class that they were assigned to was not a class that was closely
associated with their original cluster. They also tended to not
classify well in the assigned category.
The original outlying samples were also tested in the
models. They showed the least predictable behavior with only very
loose classification in any of the classes. This again confirms the
exclusion of these truly outlying samples from the data base. This
is 4% of the total lake population in this study that are designated
as true outliers. The other test (*) samples can be viewed not as
outliers, but as lakes that are on the borderline of the other
classes, or that are really a separate, very small additional
class. Currently, these test (*) samples account for an additional
16% of the entire lake population that are not adequately repre-
sented by this lake classification model. Overall, this training
set accounts for 84% of the total lake population sampled in this

Table 11. Comparison of the Classification Techniques
on the Three Data Sets

Total % correct:

                           P variables       Q scores
Data set       classes    SIMCA    KNN     SIMCA    KNN
1984              4         89      87       90      98
(247 train.)      6         87      84       79      96
A                 4         96      87       91      96
(95 train.)       7         94      77       90      95
B                 4         90      88       87      96
(351 train.)      9         82      79       84      92
B                 4         87      87       82      95
(401 train.)      9         80      76       89      91

study to this date.
Nine and four classes. There was a significant improvement
in the classification with the reduction of nine subgroups to four
main groups in all cases. This is consistent with the idea of the
main group differences being much greater than the subgroup
differences. The tremendous range of the entire data set allows a
fairly easy and successful separation of the main groups. But, at
the same time, the subtle differences that separate these subgroups
will be obscured by the predominating major differences. These
subgroups are expected to be more distinctive in studies that are
examining the lakes within one or two of the main groups. Data Set
A addresses this idea of studying selected classes of lakes, and one
expects to find similar and more distinctive subgroups in that data set.
SIMCA/KNN and scores/variables. A general pattern emerges
when comparing the KNN and SIMCA methods on the Q scores and the P
variables. SIMCA generally works better using the P variables with
the data sets in this study, although this is not as significant and
uniform as the KNN results. KNN uniformly works significantly
better using the Q scores. This is consistent with the different
types of algorithms and theories that these techniques use. KNN, as
a distance-based classification method, benefits from the distance-based
dendrogram of the scores because it works with a cleaner,
variance-reduced sample set with more direct distances. By working in a

data space, where excess information (the unique portion of the
variance) has been discarded, only the information that emphasizes
the separability and distances between the classes is retained. In
the data space now defined by the scores, the sample data points are
more intra-class dense and less inter-class overlapped. SIMCA on
the other hand, as a variance based technique, benefits from retain-
ing the additional variance in the P variables over the Q scores to
better separate the overlapping classes. The most widely successful
technique was the KNN method on the Q scores. This data set is both
extremely broad, and has much overlap in the natural groupings. The
KNN success may be attributed to first, the reduction of some of the
variance and hence overlap by using the scores. Second, the very
different classes tend to have an uneven variance distribution that
the KNN with the scores can address better than SIMCA. Finally, the
overlapping nature of the classes may be better addressed by the
flexibility of the distance classification procedure with KNN-
scores. The SIMCA procedure may be more restricted in terms of
class shape or dimensionality by the closed class shape that the
principal component extraction requires. For this PARTICULAR data
set, this specific classification procedure was determined to be
most effective.
Stability check. A measurement of the representativeness or
the stability of the training set can be made by randomly removing a
percentage of the training set. If the training set is not repre-
sentative of the chemical characteristics of the complete data set,
or the model has been "overfit" to the data, there will be a sig-

nificant drop in the self-classification rate as the training set is
randomly reduced. In looking at the two most successful individual
methods, SIMCA with P variables and KNN with Q scores, there is
little loss in the self-classification rates. With SIMCA, there is
almost no reduction in the classification rate when 20% and 40% of
the training set is removed; the self-classification rates are 90%
and 89% respectively as compared with the original 90%. With the
KNN method, there is some reduction from the original 96% rate: with
a 20% training set reduction, 94% self-classifies correctly and with
a 40% training set reduction, 92% correctly classifies (Table 12).
The success of both methods supports the viability of these four
classes well representing this data set.
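The random-deletion check described above can be sketched in Python. Several choices here are assumptions rather than the procedure used in the study: self-classification is taken to mean leave-one-out classification within the retained training set, the classifier is a plain KNN majority vote, and the two-class training data are hypothetical. All function names are illustrative.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (vector, label). Majority vote of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def self_classification_rate(train, k=3):
    """Leave-one-out percent correct within the training set (assumed)."""
    correct = 0
    for i, (vec, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        if knn_predict(rest, vec, k) == label:
            correct += 1
    return 100.0 * correct / len(train)

def stability_check(train, fractions_retained=(1.0, 0.8, 0.6), k=3, seed=0):
    """Recompute the rate after keeping a random fraction of the set."""
    rng = random.Random(seed)
    results = {}
    for frac in fractions_retained:
        subset = rng.sample(train, round(frac * len(train)))
        results[frac] = self_classification_rate(subset, k)
    return results

# Hypothetical, well separated two-class training set (NOT thesis data).
training = ([((0.1 * i, 0.0), "low") for i in range(10)] +
            [((5.0 + 0.1 * i, 1.0), "high") for i in range(10)])
rates = stability_check(training)
print(rates)
```

A rate that stays close to its full-set value as the retained fraction drops is the behavior reported in Table 12. A sharp fall would suggest the kind of overfitting discussed in the text.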
The stability test success with the SIMCA method confirms
its application as a general model, while the slightly less success-
ful KNN method still has viability. The SIMCA classification
stability also would support the predictive ability of this model
and technique. The classification decreases seen with the KNN
method may indicate that this classification technique slightly
overfits the data and is not as representative of the entire data
base as the SIMCA method. This overfitting problem is obvious with
the original 1984 classification model. The original model (9
geology-vegetation classes) had a good self-classification rate of
90%, but when the 1985 data was added, the rate dropped substan-
tially to 77%. The model was overfit to the 1984 data and was
unable to adequately represent the 1985 data. When the new

Table 12

Data Set B Limnological Model
4 classes, 351 sample training set (100%)
Total % correct:
% training set retained:    100    80    60
SIMCA (P variables 20)       90    90    89
KNN (Q scores 5)             96    94    92

1984 Data Limnological Model
4 classes, 247 sample training set (100%)
Total % correct:
% training set retained:    100    80    60
SIMCA (P variables 23)       89    89    89
KNN (Q scores 4)             97    95    95

classification scheme (4 limnological classes) was applied to just
the 1984 data, the self-classification rates for the SIMCA-variables
method and KNN-scores method were 89% and 97% respectively. Then,
with the 20% and 40% random deletions of the training set, the SIMCA
classification rates remained constant and the KNN results only
decreased by 2% (Table 12). With the addition of the 1985 data, the
model maintained these high classification rates, as discussed above
in Table 12.
Cluster Analysis on Scores on Data Set A
The same procedures were used on this data set as described
with Data Set B. The results were also quite similar as expected.
The dendrogram for this set is presented in Figure 19. There are
four robust clusters that contain seven subgroups that closely
correspond to seven of the nine subgroups in Data Set B.
The three Ultra-oligotrophic subgroups are present with
Ultra-oligotrophic A and Ultra-oligotrophic B closely linked in one
robust cluster, while Ultra-oligotrophic C remains a separate robust
cluster. This Ultra-oligotrophic C subgroup is somewhat different
from the Data Set B group in that it consists entirely of lakes with
little soil contact on Pennsylvanian sedimentary bedrock. The
Ultra-oligotrophic C group found in Data Set B contained both these
samples and other samples on different bedrock types. Two
Oligotrophic subgroups, Oligotrophic A and Oligotrophic B, are found
in Data Set A. As with Data Set B, they are closely linked and form
a robust cluster well separated from the remaining samples. There

Figure 19. Clustering dendrogram of the FACTOR SCORES of the
Data Set A showing the main and subgroups.
[Dendrogram labels: MB, UC, UB, UA; * = test set samples]

are no Oligotrophic C samples simply due to the samples that com-
prise the individual data bases. The remaining robust cluster is
well separated from the remaining samples and contains the
Mesotrophic samples. The average Mesotrophic is clearly present in
one subgroup, while the other subgroup is a mixture of shallow
Mesotrophic samples and test set samples. There are two small
clusters that indicate samples which belong in a test set; these
samples were in the test set in the Data Set B classification
scheme. Although the linkages are somewhat different with the
reduced Data Set A, the chemical descriptions for the classes and
relative positions on the factor plots are very similar for the two
data sets.
Classification Results on Data Set A
The above identified test set was removed from the training
set for this classification analysis. Four robust clusters as
compared to seven subgroups were tested. Both KNN and SIMCA models
on P variables and Q scores were then compared. The total
self-classification percentages are presented in Table 11. Some subtle but
significant results can be seen when these summary results on the
reduced ninety-five sample set are compared with those on the overall
351 sample Data Set B. The KNN models' success rate is approximately equal
with both sets; KNN does very well using the Q scores on both robust
clusters and subgroups, but does poorly using the P variables,
particularly with the subgroups. With the Data Set A, there is a
large overlap between two Ultra-oligotrophic subgroups and the

Mesotrophic group overlaps all the other groups. The number of
nearest neighbors selected to produce the maximum correct class-
ification is also required to be very small when using the
variables. When the scores are used, the classification improves
dramatically, and the best number of nearest neighbors is somewhat
higher around five to six. SIMCA again does better using the P
variables, but with the smaller data set, SIMCA/P did as well as the
KNN/Q score procedure. This is unlike Data Set B, where KNN/Q
scores clearly was better than SIMCA/P variables. On the smaller
data set, where there is still much chemical overlap between the
classes but the extensive range has been reduced, the SIMCA/P method
works as well as the KNN/Q method.
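The selection of the number of nearest neighbors that gives the maximum correct classification can be sketched as a small search over candidate values. The Python sketch below assumes a leave-one-out self-classification and a simple majority vote. The one-dimensional data, the candidate list, and the function names are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

def loo_percent_correct(samples, k):
    """Leave-one-out percent correct for a k-nearest-neighbor vote."""
    correct = 0
    for i, (vec, label) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        nearest = sorted(others, key=lambda s: math.dist(s[0], vec))[:k]
        winner = Counter(lab for _, lab in nearest).most_common(1)[0][0]
        correct += (winner == label)
    return 100.0 * correct / len(samples)

def best_k(samples, candidates=(1, 3, 5, 7)):
    """Return the smallest k that reaches the highest rate, plus all rates."""
    rates = {k: loo_percent_correct(samples, k) for k in candidates}
    top = max(rates.values())
    return min(k for k, r in rates.items() if r == top), rates

# Hypothetical one-dimensional scores (NOT thesis data): class "A" has
# one stray sample sitting among the class "B" samples.
samples = [((0.0,), "A"), ((1.0,), "A"), ((2.0,), "A"), ((3.0,), "A"),
           ((10.2,), "A"),
           ((8.0,), "B"), ((9.0,), "B"), ((11.0,), "B"), ((12.0,), "B")]
chosen, rates = best_k(samples)
print(chosen, rates)
```

In this toy case a single neighbor lets the stray sample mislead its neighbors, while too many neighbors swamp the smaller class, so an intermediate k does best. The appropriate value for real data depends on class sizes and overlap.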

Classification Technique
Limitations. There are two main limitations on the use of
the factor scores-classification technique. The first is the prac-
tical problem of handling large data sets. The cluster analysis,
particularly the hierarchical linkage step, can be very time consum-
ing and requires a large memory allocation on a computer. This
problem has been noted in other studies dealing with different types
of cluster analysis (34). Also, the resulting printout of the
dendrogram will be correspondingly large and cumbersome with large
data sets. These problems are not an inevitable aspect of this
technique; they simply require the practical software to handle them.
The second limitation is a word of caution on applying this
technique. This is not a "black box" method that can be routinely
and homogeneously applied to a data set. There is not a single set
of procedures that can be directly followed in applying this tech-
nique, but many decision points that first require interpretation
before continuing. There are many procedures in the technique that
will affect the final classification. The factor analysis is criti-
cal in

that it must be "correct": the factors must represent the majority
of the data (not the outliers), the variables selected and their
treatment (for example, log transform) along with the correct number
of factors used must best represent the chemical information in an
interpretable, chemically consistent manner. The decision of the
class compositions or training sets must be based not only on the
factor analysis results as presented in the scores dendrogram, but
also on the knowledge of the data base and its characteristics. The
determination of the correct number of classes is only partially
given by this technique; it must be substantiated by the charac-
teristics of the data and interpretation of the chemistry. The
small, very robust clusters generally remain intact as certain
procedures are varied. Outlier identification and removal, variable
removal and manipulations (for example, log transform), and designa-
tion of the number of factors retained do not affect the small
clusters. The large clusters and looser links are susceptible to
these procedural changes. The main classes and larger robust clus-
ters must be determined by interpreting the factors, and this re-
quires understanding the chemistry and the significant characteris-
tics of the data. This technique finds the natural associations and
clusters in a data base; it is an additional interpretative aid that
can be applied with other techniques in the analysis of a data base.
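One way to check which clusters remain intact when a procedure is varied is to compare the cluster memberships from two runs. The sketch below is a minimal illustration and not the procedure used in this study: it assumes each run is available as a mapping from a cluster label to a set of sample identifiers, and it uses the Jaccard similarity as the measure of overlap. The function names and the sample identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of sample identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_matches(reference_clusters, varied_clusters):
    """For each reference cluster, return its best Jaccard match in the varied run.

    Both arguments map a cluster label to a collection of sample identifiers.
    A value of 1.0 means the cluster survived the procedural change intact.
    """
    result = {}
    for label, members in reference_clusters.items():
        result[label] = max(
            (jaccard(members, other) for other in varied_clusters.values()),
            default=0.0,
        )
    return result


if __name__ == "__main__":
    # Hypothetical sample identifiers, for illustration only.
    run_a = {"small": {"s1", "s2", "s3"}, "large": {"s4", "s5", "s6", "s7", "s8"}}
    run_b = {"x": {"s1", "s2", "s3"}, "y": {"s4", "s5"}, "z": {"s6", "s7", "s8"}}
    for label, score in best_matches(run_a, run_b).items():
        print(label, round(score, 2))
```

In this made-up example the small cluster is recovered exactly, while the large cluster splits in the second run and scores lower.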
The selection of a particular classification technique, KNN
versus SIMCA, is not specified by this technique. First, one does
not want to broadly apply many different methods and adjust them
until the results are maximized. The problem is that the final
model may then be overfit and represent only that select group of
samples, so that the results apply only to a very limited group of
samples. Yet, the different classification methods have different
theories and algorithms that would inherently allow better applica-
tion to different types of data sets. Sjostrom and Kowalski (44)
examined five different classification techniques on six different
data bases in addressing this idea. One of their findings was to
advise using several classification methods when the classes have
extensive overlap. Their reasoning for this idea is as follows:
[This is] not with the aim of finding the "best" classification,
but with the aim of understanding the data structure and detecting
the objects for which the class-separating information is low. (44)
This study has examined several classification procedures with the
aim of developing a chemically defined and limnologically consistent
model; the actual success of the classification technique being
secondary to the understanding of the data.
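The idea of using several methods to detect poorly separated objects can be sketched as follows. This is a minimal illustration under stated assumptions: each method's predicted class labels are held in a list in the same sample order, and a sample is flagged whenever the methods do not all agree. The function name and the example labels are hypothetical and are not results from this study.

```python
def low_agreement_samples(predictions_by_method):
    """Return indices of samples on which the methods do not all agree.

    predictions_by_method: dict mapping a method name to a list of predicted
    class labels, all lists in the same sample order.
    """
    label_lists = list(predictions_by_method.values())
    if len({len(labels) for labels in label_lists}) > 1:
        raise ValueError("all methods must classify the same samples")
    return [
        i for i, labels in enumerate(zip(*label_lists))
        if len(set(labels)) > 1
    ]


if __name__ == "__main__":
    # Made-up class labels for four samples, for illustration only.
    predictions = {"KNN": [1, 2, 2, 3], "SIMCA": [1, 2, 3, 3]}
    print(low_agreement_samples(predictions))
```

Samples flagged in this way are candidates for closer chemical inspection rather than evidence that either method is wrong.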
Advantages. The technique of applying cluster analysis to
factor scores has several advantages. First, in classification
analysis, this is an efficient and direct way to hypothesize the
number and compositions of classes. Particularly with large data
bases, where class structures are unknown, this technique can assist
in developing a model that well describes the entire data base.
Used in conjunction with the factor analysis, these classes are
defined chemically through the known variables. This technique is
an additional, effective method for identifying outliers and, at the
same time, providing the chemical reason for the anomalous behavior
from the outlier's position on the dendrogram (scores) and factor
score plots. It also identifies and chemically describes test set
samples by using the same dendrogram (scores) in conjunction with
the factor plots.
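The general idea of clustering samples on their factor scores and reading candidate classes from the result can be sketched as follows. This is a minimal sketch and not the software used in this study: it assumes Euclidean distance between score vectors and single linkage, neither of which is specified here, and the function name and example scores are hypothetical.

```python
import math


def single_linkage(scores, n_clusters):
    """Agglomerative clustering of score vectors, stopped at n_clusters.

    scores: list of equal-length tuples (one tuple of factor scores per sample).
    Returns (clusters, merges). clusters is a sorted list of sample-index lists;
    merges records (cluster_a, cluster_b, distance) in the order of joining,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(scores))]
    merges = []

    def linkage_distance(a, b):
        # Single linkage (an assumption): distance between the closest members.
        return min(math.dist(scores[i], scores[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters), merges


if __name__ == "__main__":
    # Made-up scores on two factors for five samples, for illustration only.
    example = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.0, 5.5), (10.0, 0.0)]
    groups, history = single_linkage(example, n_clusters=3)
    print(groups)
```

The recorded merge distances play the role of the link heights on the dendrogram; the number of clusters at which to stop remains an interpretive decision.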
Second, in data analysis methods using a distance-based
procedure, using the factor scores as the data points in the dis-
tance procedures will provide a chemical basis in the results. If
the original variables are used, the inter-sample distances in this
variable space are known but the relationships of the variables and
the samples are not given. Using the factor scores, the distances in
the factor score space are known and the chemical composition of the
scores will provide the chemical information concerning the sample
and variable relationships. In this study, the cluster analysis
particularly benefits from obtaining a quantifiable, chemical mean-
ing for the clusters and hierarchical links. In addition, the KNN
classification method, which is distance based, benefits both in
improving the classification success and having the chemical defini-
tion of the classes.
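A distance-based classification on factor scores can be sketched as follows. This is a minimal illustration of a k-nearest-neighbor vote with a leave-one-out check of classification success. It assumes Euclidean distance and a simple majority vote, and the leave-one-out check is an assumed way of measuring success rather than the test used in this study. The function names, the value of k, and the example scores are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_scores, train_labels, query, k=3):
    """Majority vote among the k training samples nearest to query.

    Ties are resolved in favor of the label met first among the neighbors,
    which are visited from nearest to farthest.
    """
    order = sorted(
        range(len(train_scores)),
        key=lambda i: math.dist(train_scores[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_success(scores, labels, k=3):
    """Fraction of samples classified correctly when each is held out in turn."""
    hits = 0
    for i in range(len(scores)):
        rest_scores = scores[:i] + scores[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_scores, rest_labels, scores[i], k) == labels[i]:
            hits += 1
    return hits / len(scores)


if __name__ == "__main__":
    # Made-up factor scores and class labels, for illustration only.
    scores = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(knn_predict(scores, labels, (0.2, 0.2)))
    print(leave_one_out_success(scores, labels))
```

Because the distances are computed in the factor score space, each neighbor vote can be traced back to the chemical meaning of the factors.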
Third, the dendrogram on the scores is an additional visual
aid in the various data interpretation techniques. This is useful
for cluster analysis in that the large, original P-space has been
compressed to a more manageable Q-space. This works in conjunction
with the factor analysis by viewing the dendrogram (scores) as the
representation of the samples in all Q factors simultaneously.
Since classification methods lack effective visual representations,
the dendrogram (scores) provides a visual aid for class descriptions.
Lake Study Summary
This study has characterized the lakes and developed a model
that specifically defines Colorado lakes while covering the entire
range of Colorado lakes. In 1984, the lakes were specifically
chosen to represent the entire range of geology and vegetation in
Colorado, while the 1985 lakes have increased the extensiveness of
the measured lake chemistries. The range and extensiveness of the
lake sampling, and the data analysis procedures, particularly the
classification methods and tests, have produced a model that ac-
curately describes the large and complex lake data base. This model
well represents the lakes of Colorado and should be applicable to
other Colorado lake studies. In summary, this model comprises
four main categories of lakes: 1) alpine and sub-alpine lakes on
low solubility bedrock (mountainous) with minimal vegetation; 2)
oligotrophic, a typical mountainous lake, on low solubility bedrock
with heavier coniferous vegetation; 3) mesotrophic, a mountainous
lake, on moderate solubility bedrock with mixed coniferous/
deciduous/meadow vegetation; and 4) lowland lakes with high
solubility bedrock and shrub/grassland vegetation. Each class is
defined limnologically and chemically, with the data analysis
techniques bringing to view the links between the measured chemical
variables and the limnological characteristics.
The chemistry of a lake's water has been found to be influenced
primarily by four major components: the bedrock geology,
the surrounding vegetation, the inlet-outlet status and general
catchment basin morphology, and anthropogenic influences. These
four components can be evaluated on a superficial level using pub-
lished topography and geology maps, allowing a prediction of a
lake's chemical behavior. Any deviations from these predictions can
act as a signal of unusual influences, specifically potential
anthropogenic ones. These influences are expressed in the inorganic
constituents of the lake's water, and can be summarized and
represented by four factors: 1) the bedrock factor, composed of Ca,
Mg, Ba, Sr, alkalinity, and specific conductance; 2) the
vegetation/biological factor, composed of Fe, fluorescence, and Mn;
3) the soil/morphology factor, composed of F, Si, and lake size; and
4) the traces factor, composed of Zn, Cd, Mo, and -pH. The relative
contributions of these factors to the variance in the water
chemistry are as follows:
Bedrock > Vegetation = Soil/Morphology > Traces
In terms of application to the lakes' acidification status
study, a reliable, sensitive indicator of acid deposition is
still under investigation. Currently, the lakes that
can be described as sensitive to acid insult are those in the
alpine and sub-alpine category. They are chemically defined by the
three factors bedrock, traces, and biological. These lakes are
located at low factor scores along the bedrock factor axis indicat-
ing low bedrock solubility, a low position along the biological
factor axis reflecting the minimally developed vegetation, and a
moderate to high position along the traces factor axis reflecting
the higher levels of trace elements in these lakes. Future soil
data and additional lake water data are expected to improve the
description of what constitutes an acid sensitive lake.
The application of multivariate statistical techniques has
enhanced the interpretation and understanding of a complex environ-
mental system. Each technique contributes to the understanding of
the data in a unique manner. The use of the factor analysis tech-
niques has defined the important chemical components of the lakes'
waters and defined the main influences of a lake water's chemistry.
With the classification techniques, there were two equally effective
methods for developing a model. Combining the
classification/cluster analyses with the factor analysis provides
the chemical descriptions for the model's classes. The use of
pattern recognition techniques in this lake study has substantially
enhanced the chemical interpretation and understanding of the

1. Cronan, C.S.; Schofield, C.L.; Science, 1979, 204, 304.
2. "Acid Deposition. Atmospheric Processes in Eastern North
America", National Research Council, National Academy Press,
Washington D.C., 1983.
3. Munger, J.W.; Eisenreich, J.J.; Envir. Sci. Tech., 1983, 17, 32A.
4. Woodling, J.; Ninth Annual Water Workshop Proceedings, Gunnison,
5. Lewis, W.M. Jr.; Grant, M.C.; Water Resources Research (in
6. Lewis, W.M.; Limnol. Oceanography, 1982, 27, 167.
7. Turk, J.; Adams, D.B.; "Sensitivity to Acidification of Lakes in
the Flat Tops Wilderness Area, Colorado", Water Resources
Research, 1983, 19, 346.
8. Dodson, S.I.; The American Midland Naturalist, 1982, 107, 173.
9. El-Ashry, "The American West's Acid Rain Test", World
Resource Institute, Washington D.C.
10. Kramer, J.; Tessier, A.; Envir. Sci. Tech., 1982, 16, 606A.
11. Miller, J.P.; U.S.G.S. Water Supply Paper, 1535-F, F1-F23, 1961.
12. Henriksen, A.; Proc. Int. Conf. Ecol. Impact Acid Precip.,
Norway, 1980, pp. 68-74.

13. Goldman, C.R.; Horne, A.J.; "Limnology", McGraw Hill Co., N.Y.,
14. Cole, G.A.; "Textbook of Limnology", 2nd ed., C.V. Mosby Co.,
St. Louis, 1979.
15. Swain, F.M.; "Organic Geochemistry"; Breger, I.A.; Macmillan Co.,
N.Y., 1963, pp. 87-147.
16. Pennak, R.W.; "Limnology in North America"; Frey, D.G.; University
of Wisconsin Press, Madison, Wisconsin, 1963, pp. 349-370.
17. Neel, J.K.; "Limnology in North America"; Frey, D.G.; University
of Wisconsin Press, Madison, Wisconsin, 1963, pp. 575-594.
18. Chappell, W.R.; Meglen, R.R.; Swanson, G.A.; Taylor, L.A.;
Sistko, R.J.; McNelly, R.B.; Klusman, R.W.; "Acidification Status
of Colorado Lakes", Part I, Center for Environmental Sciences,
Univ. of Colo., Denver, 1985.
19. Meuzelaar; Windig, W.; Harper, A.M.; Huff, S.M.; McClennen, W.H.;
Richards, J.M.; Science, 1984, 226, 268.
20. Rasmussen, G.T.; Isenhour, T.L.; Lowery, S.R.; Ritter, G.L.; Anal.
Chim. Acta, 1978, 103, 213.
21. McGill, J.R.; Kowalski, B.R.; Appl. Spectroscopy, 1977, 31, 87.
22. Kwan, W.; Kowalski, B.R.; Anal. Chim. Acta, 1980, 2, 215.
23. MacCarthy et al.; Geochimica et Cosmochimica Acta, 1985, 49, 2091.
24. Sistko, R.J.; "Pattern Recognition Assisted Chemical
Interpretation of Groundwater Quality Data from Oil Shale Lease
Tract C-b", Master's Thesis, Univ. of Colo., Denver, 1983.
25. Meglen, R.R.; Sistko, R.J. in "Environmental Applications of
Chemometrics", Breen, J.J.; Robinson, P.E., American Chemical
Society, Washington D.C., 1985.
Nie, N.H.; Hull, C.H.; Jenkins, L.G.; Steinbrenner, K.; Bent, D.H.;
"Statistical Package for the Social Sciences", 2nd ed., McGraw
Hill, N.Y., 1975, pp. 468-514.
"Arthur81 User's Manual", Infometrix, Seattle, 1981.
Davis, J.C.; "Statistics and Data Analysis in Geology", John Wiley
and Sons, New York, 1973.
Wolff, D.D.; Parsons, M.L.; "Pattern Recognition Approach to Data
Interpretation", Plenum Press, New York, 1983.
Gorsuch, R.L.; "Factor Analysis", W.B. Saunders
Malinowski, E.R.; Howery, D.G.; "Factor Analysis in Chemistry",
John Wiley and Sons, N.Y., 1980.
Harman, H.H.; "Modern Factor Analysis", 3rd ed., University of
Chicago Press, Chicago, 1976.
Anderberg, M.R.; "Cluster Analysis for Applications", Academic
Press, New York, 1973.
Massart, D.L.; Kaufman, L
Esbensen, K.H.; Anal.
Press, F.; Siever, R.; "Earth", W.H. Freeman and Co., San
Francisco, 1974.
Kahl, J.S.; Norton, S.A.; Williams, J.S.; "Geological Aspects of
Acid Deposition"; Bricker, O.P.; Butterworth Publishers, Boston,
1984, pp. 23-75.
Levin, H.L.; "The Earth Through Time", W.B. Saunders Co.,
Philadelphia, 1978.