Citation
Assessment of the representational accuracy of Globeland30 classification of the temperate and tropical forest of Mexico

Material Information

Title:
Assessment of the representational accuracy of Globeland30 classification of the temperate and tropical forest of Mexico
Creator:
Carver, Daniel Peter ( author )
Place of Publication:
Denver, Colo.
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
1 electronic file (105 pages)

Thesis/Dissertation Information

Degree:
Master's ( Master of Arts )
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Geography and Environmental Sciences, CU Denver
Degree Disciplines:
Applied geography and geospatial sciences

Subjects

Subjects / Keywords:
Land use -- Mexico ( lcsh )
Forest conservation ( lcsh )
Forest biodiversity ( lcsh )
Forest biodiversity ( fast )
Forest conservation ( fast )
Land use ( fast )
Mexico ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
This study performed an assessment of the representational accuracy of the forest class of the GlobeLand30 (GL30) global land cover data sets for the country of Mexico using a robust, geographically distributed forest inventory survey of the forests in Mexico. The representational accuracy assessment was carried out for both the 2000 and 2010 GL30 data sets. The detailed attribute data associated with the validation set demonstrates how GL30 classifies specific forest types and how canopy coverage and number of trees per site influence the likelihood of GL30 identifying the sites correctly as forests. The results indicate that producer's accuracies range from 72.3% to 97.3%. The tropical forests (89.1%) were better represented by the GL30 forest class than the temperate forests (73.9%). The most poorly represented classes from the temperate (oak: 72.3%) and tropical (low dry deciduous jungle: 74.9%) groups were deciduous. Receiver Operator Curve and Area Under the Curve analyses show that canopy coverage of a site is a better predictor of GL30 correctly identifying the site as forest for temperate forests, and that the number of trees per site is a better predictor of GL30 correctly identifying a site as forest for tropical forests. The results also indicate a distinct spatial variability in the location of the sample sites that are misidentified as forests by GL30. The results of this thesis will help researchers and professionals better understand the representational accuracy of the GL30 data sets for the forests in Mexico.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: Adobe Reader.
Statement of Responsibility:
by Daniel Peter Carver.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
on10121 ( NOTIS )
1012124910 ( OCLC )
on1012124910
Classification:
LD1193.L68 2017m C37 ( lcc )

Full Text
ASSESSMENT OF THE REPRESENTATIONAL ACCURACY OF GLOBELAND30 CLASSIFICATION OF
THE TEMPERATE AND TROPICAL FOREST OF MEXICO
by
DANIEL PETER CARVER
B.S., Adams State College, 2012
B.A., Adams State College, 2012
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Arts
Applied Geography and Geospatial Science 2017


2017
DANIEL PETER CARVER ALL RIGHTS RESERVED


This thesis for the Master of Arts degree by
Daniel Peter Carver has been approved for the Applied Geography and Geospatial Science Program
by
Rafael Moreno Sanchez, Chair
Peter Anthamatten
Galen Maclaurin
Juan Manuel Torres Rojo
Date: May 13, 2017


Carver, Daniel Peter (M.A., Applied Geography and Geospatial Science Program)
Assessment of the Representational Accuracy of Globeland30 Classification of the Temperate and Tropical Forest of Mexico
Thesis directed by Associate Professor Rafael Moreno Sanchez
ABSTRACT
This study performed an assessment of the representational accuracy of the forest class of the GlobeLand30 (GL30) global land cover data sets for the country of Mexico using a robust, geographically distributed forest inventory survey of the forests in Mexico. The representational accuracy assessment was carried out for both the 2000 and 2010 GL30 data sets. The detailed attribute data associated with the validation set demonstrates how GL30 classifies specific forest types and how canopy coverage and number of trees per site influence the likelihood of GL30 identifying the sites correctly as forests. The results indicate that producer's accuracies range from 72.3% to 97.3%. The tropical forests (89.1%) were better represented by the GL30 forest class than the temperate forests (73.9%). The most poorly represented classes from the temperate (oak: 72.3%) and tropical (low dry deciduous jungle: 74.9%) groups were deciduous. Receiver Operator Curve and Area Under the Curve analyses show that canopy coverage of a site is a better predictor of GL30 correctly identifying the site as forest for temperate forests, and that the number of trees per site is a better predictor of GL30 correctly identifying a site as forest for tropical forests. The results also indicate a distinct spatial variability in the location of the sample sites that are misidentified as forests by GL30. The results of this thesis will help researchers and professionals better understand the representational accuracy of the GL30 data sets for the forests in Mexico.
The form and content of this abstract are approved. I recommend its publication.
Approved: Rafael Moreno Sanchez


I dedicate this work to those who, knowing only of me and nothing of the project, expressed sincere confidence that I would be successful in this endeavor.
Kailee
Ashley
Courtney
Liz
Dan
Kathy


ACKNOWLEDGEMENTS
As with any geospatial project, the quality of the product is directly related to the quality of the data. I want to personally thank Lily Niknami for her efforts in acquiring the GlobeLand30 data during a time when the public website in the U.S. was not functioning. I thank Juan Manuel Torres Rojo for obtaining, sharing, and interpreting the Inventario Nacional Forestal y de Suelos data set, without which this analysis would not have been possible. I also thank Galen Maclaurin for his positive encouragement and valuable input in suggesting the ROC and AUC methodologies when I felt I had hit a dead end, Peter Anthamatten for a very detailed and impactful review of the written work, and Rafael Moreno for suggesting and guiding this research, many hours of brainstorming, support and positive motivation, and quick and insightful feedback. Finally, I thank Kailee Potter for an invaluable series of edits and her continuous support through and through.


Table of Contents
I. INTRODUCTION
1.1 Literature Review
1.1.1 Role and development of global land cover data sets
1.1.2 Development of GlobeLand30
1.1.3 Previous studies assessing the representational accuracy of GL30
1.2 Challenges in performing and conveying accuracy assessments
II. METHODOLOGY
2.1 Data sets
2.1.1 National Forest Inventory Data Set. Comision Nacional Forestal (Inventario Nacional Forestal y de Suelos)
2.1.2 GlobeLand30 from National Geomatics Center of China
2.2 Analytical Processes
2.2.1 Forest Type Groups
2.2.2 Rationale for Accuracy Assessment Methods
2.3 Assessment Method 1: Percent Correct by Intersect
2.4 Assessment Method 2: Accounting for Sampling Sites Area and Positional Error in Percent Correct Calculations made in Method 1
2.5 Assessment Method 3: Predictive quality of the INFyS site attributes "Canopy Cover" and "Number of Trees"
2.2.5.1 Mann-Whitney U Test
2.2.5.2 Receiver Operator Curves (ROC) and Area Under the Curve (AUC)
III. RESULTS
3.1 Assessment Method 1: Percent Correct by Intersect
3.1.1 Percent Correct of All Sites
3.1.2 Temperate Forests
3.1.3 Tropical Forests
3.1.4 Primary Temperate Forest Classes: Pino, Encino, and Pino-Encino Mix
3.1.5 Primary Tropical Forest Classes: Selva Baja, Mediana, and Alta
3.1.6 Erosion presence
3.2 Assessment Method 2: Accounting for Sampling Sites Area and Positional Error in Percent Correct Calculations made in Method 1
3.2.1 Percent Correct of All Sites
3.2.2 Temperate forests
3.2.3 Tropical Forests
3.2.4 Primary Temperate Forest Classes: Pino, Encino, and Pino-Encino Mix
3.2.5 Primary Tropical Forest Classes: Selva Baja, Mediana, and Alta
3.2.6 Erosion presence
3.3 Assessment Method 3: Predictive quality of the INFyS site attributes "Canopy Cover" and "Number of Trees"
3.3.1 Mann-Whitney U Test
3.3.2 Results of the INFyS0 data set
3.3.2.1 Canopy coverage
3.3.2.1 Number of trees
3.3.3 Results of the INFyS1 Data Set
3.3.3.1 Canopy coverage
3.3.3.2 Number of trees
3.4.4 Receiver Operator Curves (ROC) and Area Under the Curve (AUC)
3.4.5 INFyS0
3.4.5.1 Canopy coverage
3.4.5.2 Number of trees
3.4.6 INFyS1
3.4.6.1 Canopy coverage
3.4.6.2 Number of trees
3.5 Comparison of AUC values between the INFyS0 and INFyS1 data sets
3.5.1 Canopy coverage
3.5.2 Number of Trees
IV. DISCUSSION
4.1 Summary of Findings
4.1.1 Percent Correct by Intersect
4.1.2 Percent Correct by Area
4.1.3 Receiver Operator Curve (ROC) and Area Under the Curve (AUC)
4.2 Comparison of Producers Accuracies with Existing Studies
4.3 Evaluation of the Study
4.3.1 Limitations of the Areal Representation of the Tropical Forest Class
4.3.3 Image Capture Date and Leaf on/Leaf Off Character of Deciduous Forest
4.3.4 Variation within the Cluster Groups
4.3.5 Semantic Difference in the Definition of Forest between the Two Data Sets
4.3.6 Importance of the Validation Techniques
4.3.7 Ground verified sites versus remotely sensed sites
V. CONCLUSION
5.1 Critical evaluation of research project
5.2 Summary of Project
REFERENCES
APPENDIX
A. APPENDIX A
B. APPENDIX B


TABLE
Table 3.1: Percent Correct Values For The INFyS0 And INFyS1
Table 3.2: Ranked p-values From Mann-Whitney; INFyS0
Table 3.3: Ranked p-values From Mann-Whitney; INFyS1
Table 3.4: AUC Values From INFyS0 And INFyS1


FIGURE
Figure 2.1: INFyS Sampling Site Distribution Maps
Figure 2.4: Attribute Data Of INFyS Shapefiles
Figure 2.2: Cluster Distribution
Figure 2.3: Site Sampling Diagrams
Figure 2.5: Map Of GL30_2000 For Mexico
Figure 2.6: Workflow Of GL30 Classification Process
Figure 2.7: Example Of Priori Probability Process
Figure 2.8: Data Provided By GL30
Figure 2.9: Image Capture Dates From 1999
Figure 2.10: Generalized Location Of Temperate Forests
Figure 2.11: Generalized Locations Of Tropical Forest
Figure 2.12: Intersect Relationship Between INFyS And GL30
Figure 2.13: Percent Correct By Area Process
Figure 2.14: Examples Of Percent Correct By Area Values
Figure 2.15: Distributions Of Number Of Trees And Canopy Coverage
Figure 2.16: Example Of ROC Curve
Figure 3.1: Percent Correct By Intersect Values
Figure 3.2: Percent Correct By Area: INFyS0 And INFyS1
Figure 3.3: AUC Values For INFyS0
Figure 3.4: AUC Values For INFyS1
Figure 3.5: AUC Values For Canopy Coverage
Figure 3.6: AUC Values For Number Of Trees
Figure 4.1: Change In Percent Correct Between Intersect And Area
Figure 4.2: Relative Density Of Partially Covered By Forest Sites From The GL30_2000 Data Set
Figure 4.3: Tropical Sampling Sites Error In Representation
Figure 4.4: Image Capture Date For The INFyS0 Data Set


ABBREVIATION
GL30 GlobeLand30
GL30_2000 GlobeLand30 2000 data set
GL30_2010 GlobeLand30 2010 data set
INFyS Inventario Nacional Forestal y de Suelos data sets
INFyS0 Inventario Nacional Forestal y de Suelos 2004-2009 data set
INFyS1 Inventario Nacional Forestal y de Suelos 2009-2013 data set


CHAPTER I
INTRODUCTION
Land cover/use data is an essential tool for ecological monitoring and economic planning at the national level (Townshend 1992). For forested regions, forest cover change can be used to assess the trends in land use and its impacts on the environment and forest conservation (Moreno-Sanchez, Torres-Rojo et al. 2012; Roth, Moreno-Sanchez et al. 2016). Alterations to the physical area and connectivity of the forest cover can directly affect the forest's ecological diversity and capacity to perform ecological functions (Foley, DeFries et al. 2005; Fischer and Lindenmayer 2007). Mapping the process of forest fragmentation in Mexico using existing data sets has been completed in previous studies (Moreno-Sanchez, Torres-Rojo et al. 2012; Clay, Moreno-Sanchez et al. 2016). Clay (2015) suggests that finer resolution imagery may provide an improved means of mapping the interior changes occurring within forest areas. Changes within the interior of the forest areas were shown to be occurring at higher rates than in other regions of the forest.
The GlobeLand30 (GL30) data sets produced by the National Geomatics Center of China (NGCC) provide global land cover data comprising ten classes at 30m spatial resolution for 2000 and 2010 (Chen et al. 2015). Previous validations of the GL30 data set at the regional scale have produced overall accuracy values of 80% for sites in Western Europe (Brovelli, Molinari et al. 2015, Arsanjani, Tayyebi et al. 2016) and 46% over five countries in Central Asia (Sun, Chen et al. 2016). A robust National Forest Inventory data set produced by the National Forestry Commission of Mexico (Comision Nacional Forestal, 2016) was used as the primary validation resource for this study. This data set contains thousands of ground sample sites collected over two time periods, each corresponding to a pass of the national forest inventory: 2004-2009 and 2009-2013. The availability of these data sets offers an opportunity to assess the representational accuracy of the GL30 Forest land cover class over a region of the world with very high diversity of forest types and ecological conditions.
The National Institute of Statistics and Geography of Mexico (Instituto Nacional de Estadistica y Geografia, INEGI) (INEGI, 2014) has produced land cover/use digital cartography at a scale of 1:250,000 for 2003 (Series III), 2008 (Series IV), and 2013 (Series V). These data sets classify the forest cover into twelve types of temperate forest and twelve types of tropical forest (Clay, 2015). In contrast, the GL30 data sets contain one forest class.
Mexico is an ideal country for the assessment of the representational accuracy of the GL30 Forest class. The environmental and ecological conditions in the Mexican forest areas are similar to conditions found in many other temperate, tropical, and transition areas around the world. The country has complex ecological, cultural, and economic systems that define the forested landscape. Considered "megadiverse", Mexico is one of the top five most biologically rich countries (Groombridge and Jenkins 2002). Roughly a third of the country is covered by forest (Food and Agriculture Organization (FAO) 2010). Temperate and tropical forests exist within a complex physical terrain, and many unique forest types have developed due to the presence of microclimates. The complexity of these forests provides a challenging test case for the assessment of the representational accuracy of the GL30 Forest land cover class.

The goal of this study is to assess how well GL30 represents the specific forest types of Mexico and how canopy coverage and the number of trees per site affect GL30's representation of the various forest types. In doing so, it will be determined if GL30 is an effective resource for forest and land cover change studies in Mexico and other ecologically similar environments.
1.1 Literature Review
Global land cover data sets play important roles in modelling global and regional scale processes. Due to the cost and time associated with land cover mapping, global land cover data can often be the best source of national level land cover data for a country. Representational accuracy is a measurement of how well the features defined by the classification match what is actually present on the ground. The confidence researchers have in using the data for quantitative inquiries is tied to its representational accuracy. Common methods to test the representational accuracy of a land cover data set include comparison against existing, validated land cover data sets and the use of user-validated high resolution imagery or user-verified ground truth data. Within this study, the representational accuracy of the GlobeLand30 data set will be validated against a robust ground truth forest inventory data set. This will allow for the interpretation of how GL30 represents various forest types across Mexico.
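As an illustration of validating one class against ground truth, the sketch below computes a producer's accuracy: the share of ground-verified forest sites that the map also labels as forest. The function name and the five site labels are hypothetical and are not values from this study.

```python
def producers_accuracy(mapped_labels, target_class="forest"):
    """Share of ground-verified sites of one class that the map labels as that class."""
    if not mapped_labels:
        raise ValueError("at least one ground-verified site is required")
    hits = sum(1 for label in mapped_labels if label == target_class)
    return hits / len(mapped_labels)


# Hypothetical map labels read at five ground-verified forest sites.
mapped_at_forest_sites = ["forest", "forest", "shrubland", "forest", "grassland"]
accuracy = producers_accuracy(mapped_at_forest_sites)
print(f"Producer's accuracy: {accuracy:.1%}")
```

Because every site in such a validation set is known to be forest on the ground, the measure captures omission errors only; it does not indicate how often the map labels non-forest land as forest.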
1.1.1 Role and development of global land cover data sets
Global land cover data sets are important resources in assessing and evaluating anthropogenic, climatic, and ecological processes at the global scale. These data sets have been used since the 1980s for climate modeling and carbon cycling research (Matthews 1983). As technology and processing methods developed, the resolution improved. Initial attempts mapped features at the one degree or 0.5 degree scale (Los, Justice et al. 1994), and multiple global land cover data sets were later produced at the one km scale (Hansen, DeFries et al. 2000; Loveland, Reed et al. 2000). Herold et al. (2008) performed a comprehensive analysis of the representational accuracy of four of these data sets. Accuracy varied from 30 to 90 percent depending on the class. More importantly, variance in classification across the four data sets was prevalent in zones of ecological transition, implying that specific areas represent a significant challenge for remote sensing based land cover mapping.
Finer resolution products became available at 500m and 300m (Arino, Gross et al. 2007; Friedl, Sulla-Menashe et al. 2010). While many earth systems can be monitored and assessed at this scale, the minimum mapping unit imposed limits on the application of these data sets to more regional analyses (Fritz, McCallum et al. 2012; Gong, Wang et al. 2013). In 2013, the United States Geological Survey opened the Landsat legacy data to the public and this has become a driving force for the development of 30m global land cover data sets (Giri, Pengra et al. 2013). This development has resulted in the creation of 30m spatial resolution forest cover products (Townshend, Masek et al. 2012; Hansen, Potapov et al. 2013), and ultimately served as a driving factor in the development of the Finer Resolution Observation and Monitoring of Global Land Cover project, which led to the creation of the GlobeLand30 (GL30) global land cover data set (Ran and Li 2015). These finer resolution data sets enable regional-scale analyses that can be performed consistently throughout the world.


1.1.2 Development of GlobeLand30
To understand the representational accuracy of the GlobeLand30 data set, it is important to understand the processes on which the data set was built. The first 30 meter resolution global land cover data set was called Finer Resolution Observation and Monitoring-Global Land Cover (FROM-GLC) (Gong, Wang et al. 2013). This product contained two levels of classification, with ten classes within level one and 29 within level two. Land cover was identified using an automated classification algorithm applied to Landsat TM/ETM+ imagery. Accuracy improvements were made by developing the product FROM-GLCseg. This product incorporated time series of MODIS 250m EVI, digital elevation models, and soil-water information (Yu, Wang et al. 2013). FROM-GLCagg was developed by upscaling and aggregating FROM-GLC and FROM-GLCseg to Nighttime Light Impervious Surface Area and MODIS urban extent (Yu, Wang et al. 2013). The 30m product produced by this method was assessed to have a 69.50% overall accuracy. This is comparable to the overall accuracy reported for the 2000 1km global land cover classification produced by the University of Maryland (Hansen, DeFries et al. 2000). All of these products are available for download at http://data.ess.tsinghua.edu.cn/.
The second 30 meter resolution global land cover data set was GlobeLand30, a fine scale comprehensive global land cover data set developed by the National Geomatics Center of China (NGCC). The data set was produced for two time stamps: GlobeLand30-2000 and GlobeLand30-2010 (Ran and Li 2015). Over 10,000 30m spatial resolution remotely sensed images were used. Ten classes were defined using an integrated Pixel and Object-based method with Knowledge (POK-based) (Chen et al. 2015). The POK-based method integrates well-known spectral classification methods with an object oriented approach that better illustrates the extent of larger land cover features (Myint, Gober et al. 2011; Costa, Carrao et al. 2014). Check steps within the classification process included verification against ancillary data sets and derived parameters for specific land cover classes (Comber, Fisher et al. 2004; Verburg, Neumann et al. 2011). A hierarchical classification technique was used throughout the classification process. For example, water was classified first, then masked out of future feature classifications. This was followed by wetlands, permanent snow/ice, artificial surfaces, and cultivated land. Forest, shrubland, grassland, and bareland were all mapped at the same time, and tundra was mapped last (Chen, Chen et al. 2015). This method was found to limit the amount of spectral confusion between the classes (Frazier and Page 2000; Smith 2013). The classification process was performed on five degree by six degree map sheets. Inconsistencies between neighboring map sheets were manually checked, and a total of 847 map sheets comprise the final product.
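The classify-then-mask idea behind the hierarchical technique can be sketched in a few lines. This is a simplified illustration and not the POK-based method itself: the single-band index values, the threshold rules, and the three class names are hypothetical, and each class is handled in its own pass, whereas GL30 mapped some classes together.

```python
def hierarchical_classify(pixel_values, ordered_rules):
    """Assign each pixel to the first class in the hierarchy whose rule it satisfies."""
    labels = [None] * len(pixel_values)
    for class_name, rule in ordered_rules:
        for index, value in enumerate(pixel_values):
            # Pixels labeled by an earlier class are masked out of later passes.
            if labels[index] is None and rule(value):
                labels[index] = class_name
    return labels


# Hypothetical index values and threshold rules, for illustration only.
pixel_values = [0.05, 0.40, 0.80, 0.15]
ordered_rules = [
    ("water", lambda v: v < 0.10),
    ("wetland", lambda v: v < 0.20),  # would also match 0.05 if water were not masked
    ("vegetation", lambda v: v >= 0.20),
]
labels = hierarchical_classify(pixel_values, ordered_rules)
print(labels)
```

The pixel with value 0.05 satisfies both the water and the wetland rule, but it is claimed by water in the first pass and never reconsidered, which is how an ordered hierarchy can reduce confusion between spectrally similar classes.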
1.1.3 Previous studies assessing the representational accuracy of GL30
Accuracy assessment of the GL30-2010 data set was reported by Chen et al. (2015) and was performed by a third party. A two-rank sampling strategy was used to sample 80 out of 847 map sheets (rank 1), and a sample of pixels within each map sheet was chosen for each class in proportion to the total area of that class held within the map sheet (rank 2) (Tong, Wang et al. 2011, Tong and Wang 2012). This resulted in 159,874 pixel samples, of which 154,586 pixels were definitively identifiable and used. Trained users identified the actual land cover using high resolution imagery, largely from the Google Earth platform. As a result of the temporal component of this validation method, only the 2010 data set was validated. The user accuracy for the land cover categories ranged between 72.16% (grasslands) and 86.76% (artificial surfaces). The overall accuracy of the data set was determined by summing the user accuracies weighted by the relative ratio of land cover for each class. The resulting value was 80.33%.
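The area-weighted calculation described above can be sketched as follows. The three classes, their user accuracies, and their area shares are hypothetical placeholders and are not the values reported by Chen et al. (2015).

```python
def area_weighted_overall_accuracy(user_accuracy, area_share):
    """Sum of per-class user accuracies, each weighted by that class's share of mapped area."""
    total_share = sum(area_share[name] for name in user_accuracy)
    weighted_sum = sum(user_accuracy[name] * area_share[name] for name in user_accuracy)
    # Dividing by the total share keeps the result valid if shares do not sum exactly to 1.
    return weighted_sum / total_share


# Hypothetical per-class values, for illustration only.
user_accuracy = {"forest": 0.90, "grassland": 0.70, "water": 0.80}
area_share = {"forest": 0.50, "grassland": 0.30, "water": 0.20}
overall = area_weighted_overall_accuracy(user_accuracy, area_share)
print(f"Overall accuracy: {overall:.2%}")
```

Under this weighting, classes that cover little area contribute little to the overall figure, even when their own user accuracy is low.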
There are presently only three other publicly available accuracy assessments of the GL30 data. Two of these assessments took place in European countries (Brovelli, Molinari et al. 2015, Arsanjani, Tayyebi et al. 2016) and the third examined multiple countries in central Asia (Sun, Chen et al. 2016). While much of the process and methodology is similar across the works, each study is reviewed in detail below.
Brovelli et al. (2015) is the first published work assessing all land cover classes in GL30 for a single country, Italy. They performed a cell by cell comparison between GL30 and existing land cover data sets. The CORINE data set has coverage for the whole country, and eight of the twenty regions of the country had data sets with a finer than 30m minimum mapping unit available. Results were compiled within a confusion matrix and the following statistics were derived: overall accuracy, allocation and quantity disagreements, and user and producer accuracy. The regional data sets did not match in resolution or thematic characterization, so all data sets were reclassified to five thematic categories (artificial surfaces, agricultural areas, forest and semi natural areas, wetlands, and waterbodies). Raster sets were produced at both 5 and 30-meter cell size to match the original cell size of the regional and GL30 data sets, and a comparison was performed to see if this difference in cell size changed the outcome of the analysis. A separate comparison was performed due to the similarity of many of the thematic classes between CORINE and GL30. In this second comparison scheme the category of forest and semi natural areas was broken down to include forest, grass and shrubs, open spaces with little to no vegetation, and glaciers and permanent snow. The influence of co-location tolerance was addressed by creating a 70m buffer along the border of cells with different classifications. All cells falling within this buffer were excluded from accuracy assessment methods. The application and results of these methods were explained in detail using a single region as a case study, and the results of the other seven regions were displayed graphically.
Concluding statements from the Brovelli paper emphasize the influence of the number of categories on the overall accuracy of the classification. Using the five categories, overall accuracy ranged between 81 and 92 percent. Breaking up the forest and semi-natural areas resulted in a total of nine classes rather than five, and a reduction in the overall accuracy to a range of 62 to 81 percent. The co-location tolerance was shown to have a substantial effect on the overall accuracy, with increases from 84 to 96 percent and 65 to 86 percent. This implies that many of the classification errors occur at borders between differing land cover types. Lastly, a note is made that much of the error could be a result of variation in the temporal relationship between the images used to create the data sets as well as errors present in the validation data sets themselves.
Arsanjani et al. (2016) employed a method similar to that of Brovelli et al. (2015) in the assessment of GL30 in Germany. Four distinct data sets were used to evaluate GL30's representational accuracy. All the data sets differed in number of thematic classes, minimum mapping unit, temporal coverage, spatial coverage, and positional accuracy. Three of the four validation sets (CORINE, Urban Atlas, and ATKIS) represent professionally derived data sets, and Open Street Map was used as the fourth data set. The CORINE data set was the only validation data set that had full spatial coverage of Germany. All data sets were resampled to match the 30m spatial resolution of GL30 and reclassified to match the CORINE level 1 nomenclature, which includes artificial surfaces, agricultural areas, forest and semi-natural areas, wetlands, and waterbodies. Confusion matrices were created and overall agreement, user and producer accuracy, and Kappa coefficients were calculated.
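The statistics named above all derive from the same confusion matrix. The sketch below assumes rows hold the mapped class and columns hold the reference class, that every class appears at least once in both, and that the two-class counts are hypothetical; it is not the code used in any of the cited studies.

```python
def confusion_metrics(matrix):
    """Overall, user's, and producer's accuracy plus Cohen's kappa from a square matrix.

    Rows are the mapped class and columns are the reference class (an assumed layout).
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = [matrix[i][i] for i in range(size)]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]

    overall = sum(diagonal) / total
    users = [diagonal[i] / row_totals[i] for i in range(size)]      # commission view
    producers = [diagonal[i] / col_totals[i] for i in range(size)]  # omission view
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(size)) / total**2
    kappa = (overall - expected) / (1 - expected)
    return {"overall": overall, "users": users, "producers": producers, "kappa": kappa}


# Hypothetical two-class counts (forest, non-forest), for illustration only.
matrix = [
    [40, 10],  # mapped forest: 40 reference forest, 10 reference non-forest
    [5, 45],   # mapped non-forest: 5 reference forest, 45 reference non-forest
]
metrics = confusion_metrics(matrix)
print(metrics)
```

With these counts the overall agreement is 0.85 while kappa is 0.70, since half of the agreement would be expected by chance given the marginal totals.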
Overall accuracy ranged between 74% for Open Street Map and 92% for CORINE. The ATKIS and UA data sets produced overall agreements of 85%. Given that these two data sets have the finest spatial resolution available, it is promising that such a good agreement was reached. Considerable disagreement was seen within the wetlands class, where the highest user's accuracy value was recorded at 27%. Waterbodies were not classified well compared to the other three classes. Due to the low area of these two land cover classes, the effect on the overall accuracy is minimal. While the use of community derived data via Open Street Map offers a continuously updated resource, much difficulty was found in using the data. The biggest challenge of working with Open Street Map was the inconsistent feature nomenclature. Overall, the researchers concluded that GL30 would be an effective and useful land cover data set for Germany.
The work produced by Sun et al. (2016) most closely matches the validation methodology employed by the GL30 team in Chen et al. (2015). This work represents the first evaluation done on a regional scale. The arid to semi-arid region was defined by the boundaries of five countries: Kazakhstan, Tajikistan, Turkmenistan, Uzbekistan, and Kyrgyzstan. The GL30 data set was reclassified to 7 classes to help reduce errors associated
with the visual interpretation of validation set images. In the new classification, shrubland is
included in the forest class and wetlands are included in the waterbody class. A two-tiered sampling method was employed. Tier one consisted of points randomly selected based on land cover class. Tier two represents a high density sampling method in areas with complex land cover features. The resulting 27,000 points were independently validated using high resolution imagery by three testers using a developed software package. Approximately 25,000 points showed agreement between two or more testers. These pixels were used to develop a confusion matrix. User and producer accuracy, overall accuracy, and kappa coefficients were calculated based on the matrix. This process was only performed for the 2010 data set due to the temporal limitations of the high resolution imagery.
The overall accuracy found by Sun et al.'s study was 46%. Producer's and user's accuracies varied greatly between classes and within the same class. For instance, cultivated land had the highest producer's accuracy at 92% but had the third lowest user's accuracy of 48%. Grassland was the most mapped feature by GL30 but had a user's accuracy of only 22%. The total area of each class was compared against two other commonly used data sets and agreement between all three data sets was found to exist only within the cultivated land class. The overall accuracy of 46% brings up questions about the validity of using GL30 within Central Asia. The primary source of error was the difficulty of distinguishing between bare land and grassland.
1.2 Challenges in performing and conveying accuracy assessments
It is important to note the two very distinct validation methods employed in existing publications that have been used to perform this type of analysis. One method relies on
random sampling and the manual interpretation of higher resolution imagery. The second
relies on the comparison of the new data set to existing validated data sets. There are problems with both methods (Foody 2002). The use of image interpreters introduces human error and the use of existing data sets incorporates all the existing errors associated with the validation data set. The validation data set used within this study is closest in character to the use of human-validated image interpretation.
The use of such extensive and robust ground-verified data sets is in many ways an improvement over the two previously defined methods but also contains some limitations. For one, the data set is only available for the forest class. This means we cannot speak to the accuracy of the other 9 land cover classes. Importantly, this inherently limits our interpretation of the representational accuracy to the producer's accuracy and we cannot assess the user's accuracy. What this means is that we can evaluate where GL30 misrepresented locations that we know are truly forest (omission). We cannot say where GL30 defined a location as forest and it is not truly forest (commission).
The evaluation of the GL30 representational accuracy of Mexico's forests is an important step in understanding the applicability of this data set in the studies of forest around the world. Due to its recent release, limited work has been done to test the accuracy of this data set across a variety of environments. The unique nature of the validation data set requires particular methodologies to ensure that this work matches the expectations of the field. This work will contribute to the understanding of how GL30 could be employed to answer regional and global questions.
CHAPTER II
METHODOLOGY
2.1 Data sets
2.1.1 National Forest Inventory Data Set. Comision Nacional Forestal (Inventario Nacional Forestal y de Suelos)
The Inventario Nacional Forestal y de Suelos (INFyS) is a national level forest inventory project undertaken by the Comision Nacional Forestal de Mexico. For the remainder of this thesis, the Inventario Nacional Forestal y de Suelos will be referred to as the INFyS data set(s). The INFyS data were used as a validation data set for the GlobeLand30 Forest land cover class. The INFyS was developed to collect a systematic set of precise and accurate statistical indicators of the character and health of the forests and soils of Mexico. The project serves as the baseline for a series of continuous assessment and monitoring programs. These data are used to understand the state of the forests across the country and the challenges they face (INFyS Manual, 2011).
Figure 2.1: INFyS sampling site distribution maps Illustration of the sampling site distribution for INFySO (red) and INFySl (green). The INFySO is a complete survey whereas INFySl is still being produced.


The assessment of forest resources is a legal requirement under the Ley Forestal of Mexico (Mexican Forestry Law) (INFyS Manual, 2011). The two most recent iterations of these forest inventories, which were carried out during the 2004-2009 and 2009-2013 periods, are used within this study. The 2004-2009 INFyS (INFySO) contains 60,580 individual sampling sites based at different geographical locations across the country (Figure 2.1). The 2009-2013 INFyS (INFySl) contains 11,476 individual sampling sites and is still being compiled (Figure 2.1).
A structured assessment methodology was performed within each INFyS site. This included recording values for site ID, conglomerado ID, vegetation type, number of trees, number of damaged trees, basal area, canopy coverage, wood volume, and latitude and longitude. These data were provided in a file in comma separated values (csv) format (an example of the data structure is provided in Appendix A). The attribute data of vegetation type, number of trees, and canopy coverage are the most important for the analysis. The latitude and longitude values were converted to decimal degrees. These coordinate data indicate the centroid of the sampling sites. Site ID, conglomerado ID, the decimal degree values, and the three descriptive data (vegetation type, number of trees, and canopy cover) were saved in csv format and then converted into a shapefile using ArcGIS version 10.2 (ESRI, Redlands, CA, USA) (Figure 2.2 ). In this process, the INFyS sample sites were converted to points instead of their circular (for temperate forests) or rectangular (for tropical forests) shapes.
FID  Shape *  ID_CONGLOM  ID_SITIO  TIPO_VEGET      F_ARBOLES  COBERTURA_  Longitude    Latitude
0    Point    197         12945     Bosque de pino  11         0.007287    -116.018611  32.408306
1    Point    197         12946     Bosque de pino  18         0.023824    -116.018583  32.408722
2    Point    197         12947     Bosque de pino  8          0.013861    -116.01825   32.408
3    Point    197         12948     Bosque de pino  7          0.006636    -116.019139  32.408167
4    Point    241         12        Bosque de pino  9          0.009744    -115.908611  32.372611
5    Point    241         13        Bosque de pino  8          0.011137    -115.908528  32.373
6    Point    241         14        Bosque de pino  1          0.00159     -115.90825   32.372306
Figure 2.2: Attribute data of INFyS shapefiles
Site selection for both of the INFyS data sets followed a distributed clustered sampling methodology. Using a spatial software program, a 5km by 5km grid was set over the whole country. The intersection of any two lines was deemed a candidate location for a group of sampling sites known as a 'conglomerado'. Conglomerado will be referred to as cluster for the remainder of the thesis. Multiple steps were taken to determine if a location on the grid would be sampled. First, site locations must have been mapped as forest in either the Instituto Nacional de Estadistica y Geografia (INEGI) Series III (2002) or Series IV (2008) Land Use/Land Cover maps (INEGI, 2009 and 2012). The INEGI land cover maps were produced at the 1:250,000 scale. This implies that all INFyS sites were originally deemed to be forested at this smaller geographical scale. Additional factors, such as accessibility to location, significance of forest type, and number of total sampling sites for each field crew, were used to determine which locations would be sampled. The subjective component of the decision was made by the field crews, who were selected in part for their local knowledge of the region (INFyS Manual, 2011).
The process described above was used to determine the location of the cluster. A cluster is composed of 1 to 4 individual sampling sites (Table 2.1). The INFySO contains
17,130 unique clusters and the INFySl contains 3,196 clusters. The centroid of the central sampling site is the location where two lines of the 5km by 5km grid intersect. The three other sites are located relative to the first site at a distance of 45.14m between centroids. Site two is located due north, site three is at 120 degrees, and site four is at 240 degrees (Figure 2.3). All sampling sites cover an area of 400 square meters. Due to extreme terrain, land ownership, and permission constraints, several of the clusters do not have all 4 sites. Another possibility is that the area of the site simply contains no trees and was therefore not sampled.
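The cluster geometry described above can be sketched as follows. This is an illustration only: it assumes the azimuths are measured clockwise from north and that the cluster center is given in a projected coordinate system with units of meters. The function and constant names are hypothetical.

```python
import math

SITE_SPACING_M = 45.14  # distance between centroids, as stated in the text
AZIMUTHS_DEG = {2: 0.0, 3: 120.0, 4: 240.0}  # sites two to four, as stated in the text


def cluster_site_centroids(x0, y0):
    """Return {site number: (x, y)} for the four sites of one cluster.

    Assumptions: (x0, y0) is the grid intersection in projected meters,
    and azimuths are measured clockwise from north.
    """
    sites = {1: (x0, y0)}
    for site, azimuth in AZIMUTHS_DEG.items():
        theta = math.radians(azimuth)
        # For a clockwise-from-north azimuth, east offset = sin, north offset = cos.
        sites[site] = (x0 + SITE_SPACING_M * math.sin(theta),
                       y0 + SITE_SPACING_M * math.cos(theta))
    return sites


centroids = cluster_site_centroids(0.0, 0.0)
```

With the cluster center at the origin, site two lands 45.14m due north and sites three and four land to the southeast and southwest, each 45.14m from the center.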
Number of sampling sites per cluster    INFySO    INFySl
1                                          763       120
2                                        1,474       232
3                                        2,703       484
4                                       12,190     2,360
Table 2.1: Frequency of clusters with specific number of sites. The number of clusters that have a given number of sampling sites for each of the INFyS data sets. The large majority of clusters have four sampling sites.
The shape of the sampling site depends on the forest type being surveyed: 'Bosque' (temperate forest) or 'Selva' (tropical forest). The survey sites or 'sitios' for the temperate forest are circular with a radius of 11.28m originating at the centroid of each site (Figure
2.4). The tropical forest sites are rectangular with a major axis of 40m and a minor axis of
10m (Figure 2.4). The orientation of the major axis is the same as the orientation of the relationship between site one and the three other sites with azimuths of 0 degrees, 135 degrees, and 225 degrees. Site one is oriented in an east-west direction. Each of the four sites has a unique ID and they all share the same group (cluster) ID.
Figure 2.3: Conglomerado Distribution Example of how the sample sites that constitute a conglomerado are spatially distributed. Each orange dot represents a sampling site. The central sample is at the intersection of the 5km by 5km national grid
The ID_CONGLOM attribute in the shapefiles (Appendix A) is the unique ID for the cluster and was used for sorting the data. The ID_SITIO is the unique identifier for all sites. The TIPO_VEGET defines the forest type at a given site. The INFySO data has 58 unique forest types and the INFySl has 35. A list of these classes and the total number of sampling sites associated with each forest type is included in Appendix A. F_ARBOLES is a count of the total number of individual trees with a diameter of more than 7.5cm at 1.3m above the ground. COBERTURA_ARBOLES is the canopy cover in square meters. The original values
reported in the Appendix A are reported in square meters/hectare. This value must be
multiplied by 25 to obtain the canopy coverage for the 400 square meter site.
[Diagram: left panel "Bosques y zonas aridas", right panel "Selvas"; labeled dimensions include 20 m, 36.42 m, 45.14 m, and 56.42 m.]
Figure 2.4: Site Sampling Diagrams
Examples of the sampling site structures for temperate forest (left) and tropical (right) forest. The group is what is referred to as the cluster and the 'sitios' are the individual sampling sites. All the data within this study comes from the 400 square meter area represented by the green circles in the temperate forests and the green rectangles in the tropical forests. The smaller orange and yellow features within each site refer to survey data that was collected on lower vegetation and the soil and is not included in this study.
2.1.2 GlobeLand30 from National Geomatics Center of China
GlobeLand30 (GL30) is a remote sensing derived global land cover data set produced at a spatial resolution of 30m for two time periods, 2000 and 2010 (Figure 2.5). It was released to the United Nations at the 2014 Climate Change Summit as a resource for the scientific and policy communities (Ran and Li 2015). Produced by the National Geomatics Center of China, GL30 represents a significant increase in spatial resolution over existing global land cover data sets (Chen et al. 2015). Part of what makes this data set valuable is the process used to create it.
The two GlobeLand30 data sets are a product of a hierarchical Pixel-Object-Knowledge based (POK) classification system (Figure 2.6). This is an iterative process in which raster cells are classified into ten classes in a specific order. After a class is classified, all pixels within that class are removed from subsequent classifications. The "Pixel" portion of this method refers to the spectral reflectance of the land cover. "Object" refers to the grouping of similar reflective signatures. "Knowledge" refers to the inclusion of limiting physical landscape parameters as well as expert opinion on the potential for the existence of any given land cover class at a specific location. For example, water bodies are the first class classified. A supervised classification based on Landsat imagery is performed to identify potential water bodies (pixel). These pixels are grouped into objects based on the shared characteristics of the individual reflectance signature relative to neighboring pixels (object). Limiting parameters, such as slope, are applied to ensure that no feature is classified where it cannot exist (knowledge). All the pixels that were classified as water are then masked out and the altered image is used as the base for the next classification, wetlands. This process is repeated in the following order: water, wetlands, snow and ice, artificial surfaces, cultivated land, the group of forest, shrubland, grassland, and bare land (classified at the same time), and tundra.
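The classify-then-mask ordering described above can be illustrated with a small sketch. It is not the GL30 implementation: the detector functions stand in for the combined pixel, object, and knowledge steps, and every name and feature value here is hypothetical.

```python
def classify_hierarchically(pixels, detectors, order):
    """Assign each pixel the first class, in the given order, that claims it.

    pixels:    dict of pixel id -> feature dict (hypothetical features)
    detectors: dict of class name -> function(features) -> bool
    order:     class names in classification order
    Pixels claimed by an earlier class are masked out of later passes.
    """
    labels = {}
    remaining = set(pixels)
    for class_name in order:
        detect = detectors[class_name]
        hits = {pid for pid in remaining if detect(pixels[pid])}
        for pid in hits:
            labels[pid] = class_name
        remaining -= hits  # mask out everything just classified
    for pid in remaining:
        labels[pid] = None  # left for later classes not modeled here
    return labels


# Hypothetical example: pixel "a" looks like both water and wetland.
pixels = {
    "a": {"open_water": True, "saturated": True},
    "b": {"open_water": False, "saturated": True},
    "c": {"open_water": False, "saturated": False},
}
detectors = {
    "water": lambda f: f["open_water"],
    "wetland": lambda f: f["saturated"],
}
labels = classify_hierarchically(pixels, detectors, ["water", "wetland"])
```

Because water is classified first, pixel "a" is labeled water and is never offered to the wetland pass, which is the practical consequence of the masking order.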
[Map legend: No data, Cultivated Land, Forest, Grasslands, Shrublands, Wetlands, Water Bodies, Artificial Surface, Bare Lands, Ocean]
Figure 2.5: Map of GL30_2000 for Mexico
An image showing the GL30 2010 land cover raster mosaicked and masked to the borders and coasts of Mexico. Nine of the ten total land cover classes are present within Mexico. The Tundra class and the Permanent Ice and Snow class are not present in Mexico.
To distinguish between forest, shrubland, grassland, and bare land, a total of 6 Landsat bands and 23 MODIS-based NDVI bands were used as inputs to a maximum likelihood classification model. The Landsat bands are used for the specific spectral reflectance and the NDVI bands are used to identify seasonal trends within the region. Limited information is
[Workflow diagram. Inputs: 30m remotely sensed data, temporal data (250m NDVI MODIS), and reference data (place names, topography, existing global land cover). Processing: pixel-based supervised classification, object-based segmentation, knowledge-based verification, then masking of the classified area. Classes shown: Snow and Ice, Artificial Surfaces, Cultivated Land, Forest, Shrubland, Grassland, Bareland.]
Figure 2.6: Workflow of GL30 classification process
This diagram is a representation of the GL30 classification process. Input features are represented by the green boxes. Land cover classes by the orange boxes. The three primary processing methods are in the white boxes and the masking process which occurs at the end of the classification is in the blue box.
provided regarding the specific spectral thresholds used to define which class a pixel belongs to. General statements regarding the percentage of land surface covered by a given feature were provided. Forest and shrublands must have more than 30% ground coverage, grasslands must have at least 10%, and bare lands must have less than 10% vegetative coverage. An a priori probability was developed based on two existing global land cover data sets that were mapped at resolutions of 500m and 300m. These data sets were reclassified and resampled to match the cell size and categorical classification of GL30. A five by five moving window was used to calculate the probability of a given cell belonging to a specific
class (Figure 2.7). The result of this was included as a parameter within the maximum likelihood classifier.
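A minimal sketch of the five by five moving window follows. It assumes the prior for a cell is simply the fraction of cells in its window that belong to the class in the resampled reference map, and that windows are truncated at the raster edge. Neither detail is stated in the GL30 documentation summarized here, and the function name is hypothetical.

```python
def window_prior(grid, window=5):
    """Fraction of class cells in a square window around each cell.

    grid: 2D list of 0/1 values (1 = the class of interest, e.g. forest,
          in the resampled reference land cover map).
    Assumption: windows are truncated at the raster edge rather than padded.
    """
    half = window // 2
    rows, cols = len(grid), len(grid[0])
    prior = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cells = [grid[i][j]
                     for i in range(max(0, r - half), min(rows, r + half + 1))
                     for j in range(max(0, c - half), min(cols, c + half + 1))]
            prior[r][c] = sum(cells) / len(cells)
    return prior


# Hypothetical 5 x 5 reference patch with five forest cells on the diagonal.
patch = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
prior = window_prior(patch)
```

For the center cell, the window covers all 25 cells, five of which are forest, so the sketch yields a prior of 0.2 for that cell.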
Figure 2.7: Example of a priori probability process
This is a visualization of the moving window used in the a priori probability process of GL30. The forest cells represented by 'F' are based on the two resampled existing land cover data sets. The percent forest value of the 25 cells is used to calculate the probability that the cell X is actually forest. This is repeated for all cells for the forest, shrubland, grasslands, and bare lands classes.
A total of 191 Landsat images were used to produce the land cover maps of Mexico. Along with the raster based classification a series of shapefiles is associated with the data sets that define what image was used for the classification and what date the image was captured on (Figure 2.8). The majority of the images were collected within a year of the set date of 2000 or 2010 (Figure 2.9).
To prepare these data for analysis, the individual rasters were re-projected to the North America Albers Equal Area Conic projection. A mosaic was created using all the raster images. This new raster was masked to a buffered boundary of Mexico based on a Natural
FID  Shape *  Filename               Path  Row  Date
0    Polygon  p020_r046_l7_20001106  020   046  20001106
1    Polygon  p020_r047_l7_20000327  020   047  20000327
2    Polygon  p020_r048_l7_20000327  020   048  20000327
3    Polygon  p020_r049_l7_20000123  020   049  20000123
4    Polygon  p020_r050_l7_20000123  020   050  20000123
5    Polygon  p021_r046_l7_20010406  021   046  20010406
6    Polygon  p021_r047_l7_20010116  021   047  20010116
Figure 2.8: Data provided by GL30
Upper left: A GL30 sheet represented by a land cover raster. Upper right: The shapefile that accompanies the raster. Each polygon refers to a Landsat image. This shapefile has been clipped to the country of Mexico. The straight line on the bottom is the border with Guatemala.
Earth (naturalearthdata.com, 2016) large scale resolution shapefile of the administrative boundary of Mexico. This feature was reprojected to North America Albers Equal Area Conic. To account for the different resolutions at which the data sets were created, the administrative boundary of Mexico was buffered by 5km to ensure that no information along the coasts was lost. This shapefile was clipped to the administrative boundaries of the neighboring countries Belize, Guatemala, and the United States of America to exclude any areas from those countries.
Figure 2.9: Image capture dates from 1999
Graphical example of the collection dates of Landsat images from the GL30-2000 data set. All values are images that were captured in 1999. The majority of the images were captured between October and December.
2.2 Analytical Processes
2.2.1 Forest Type Groups
To fully utilize the detail of the INFyS data sets, the representational accuracy of the GL30 Forest land cover class was assessed for various forest type groups that represent the major temperate and tropical forest types in Mexico, as follows: 1. All sites, 2. Temperate forest, 3. Tropical forest, 4. Pine, 5. Oak, 6. Pine-Oak Mix, 7. Low dry deciduous jungle, 8. Medium semi-deciduous jungle, 9. High evergreen jungle, 10. Erosion presence. To best account for the temporal differences between the GL30 and INFyS data sets, the following structure is followed throughout the methods. The GL30-2000 land cover data is compared against the INFySO data set. The GL30-2010 is compared against the INFySl data set.
The temperate forest type group is defined by sites containing "Bosque" in the INFyS TIPO_VEGET (vegetation type) field (Figure 2.10). This included 21 forest types in the INFySO and seventeen in the INFySl data sets (see Appendix A). Of these unique temperate forest types, the three most common within the "temperate forest" group are Encino (5. Oak), Pino (4. Pine), and a mixture of the two. In this study, the forest types Pino-Encino and Encino-Pino were combined to form the single forest type group (6. Pine-Oak Mix). These forest types are dominant in the cooler and mountainous regions of the country. The pine forest is evergreen, whereas the oak forest is deciduous and loses its leaves annually.
The tropical forest group is defined by all forest types that contained the word "Selva" in the TIPO_VEGET (vegetation type) field (Figure 2.11). This included eighteen forest classes in the INFySO data set and thirteen classes in the INFySl data set. The tropical forests are split into three groups based on the average height of the dominant trees and the percent of the trees that drop their leaves during the dry season. Selva Baja is low dry jungle where almost 100% of trees drop their leaves during the dry season. Selva Mediana is medium semi-deciduous jungle where about 50% of trees drop their leaves during the dry season, and Selva Alta is high evergreen jungle where almost no trees drop their leaves. Forest height is due in large part to environmental conditions and as a result the forests appear in distinct geographical locations. The low dry deciduous jungle is found mostly along the low elevations of the western coast, the medium semi-deciduous jungle is found primarily on the Yucatan Peninsula, and the high evergreen jungle is mostly found in the states of Veracruz and Chiapas.
Figure 2.10: Generalized location of Temperate Forests
The forest group type Erosion presence includes all forest types where major soil damage is present. Overall, this represents a small portion of the total number of sites for both forest inventories. It was included because soil degradation is an environmental indicator which may be associated with marginal landscape conditions.
Figure 2.11: Generalized locations of Tropical Forest
This map shows the relative density of the three tropical forest type groups: low dry deciduous jungle, medium semi-deciduous jungle, and high evergreen jungle.
2.2.2 Rationale for Accuracy Assessment Methods
The strength of the INFyS data as a validation set is in the sheer number and distribution of the sites and the attribute data collected at the individual sites. The three attributes which are most important to the validation methodology are forest type, number of trees, and canopy coverage. Both the number of trees and canopy coverage values are viewed as relative measures of forest density and are continuous data sets which allow for a more robust receiver operator curve analysis, explained below. To account for the temporal
variance between the land cover and INFyS data sets, the GL30-2000 data was evaluated against the INFySO data and the GL30-2010 data was evaluated against the INFySl data.
Based on the nature of the INFyS data sets, a unique methodology was used to best assess the representational accuracy of GL30's classification of the Mexican forests. Because an error matrix cannot be developed, the common reporting parameters for a land cover data set assessment cannot be calculated. This all hinges on the fact that the INFyS data sets only provide information on where forest is known to be; they do not define where forest is not present. As a result, it is not possible to speak to the user's accuracy of GL30's representation of Mexico's forest. While this limits the ability of this assessment to be directly related back to other published assessments, it does not mean the results of the methodology are uninformative. It should also be noted that in assessing a single class of the GL30 data, the scope of analysis is significantly more precise than that of all previous assessment efforts.
2.3 Assessment Method 1: Percent Correct by Intersect
Location data for the INFyS data sets is stored as a set of coordinates which represents the centroid of the sampling site. This positional data was used to create a point feature class in ArcMap. A simple intersect was performed between this centroid and GlobeLand30 to determine which land cover class each point fell into (Figure 2.12). A new binary column was added to the INFyS data based on whether the site fell into a forest class or not, 1 = forest and 0 = not forest. This value was stored with the INFyS data sets and was used as a basic and general indicator of the percent correct relationship between the two data sets. This method is referred to as the percent correct by intersect.
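The intersect itself was run in ArcMap. As an illustration of the same idea, the sketch below looks up the raster cell under a point and derives the binary value. It assumes a north-up raster stored as a list of rows, coordinates in the raster's projected units, and a forest class code of 20. The code value and all function names are assumptions made for this example.

```python
FOREST_CODE = 20  # assumed raster value for the GL30 forest class


def class_at_point(raster, x, y, x_min, y_max, cell_size=30.0):
    """Return the raster value under (x, y), or None outside the raster.

    Assumptions: raster is a list of rows with row 0 at the top (north),
    (x_min, y_max) is the upper-left corner, and cells are square.
    """
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    if 0 <= row < len(raster) and 0 <= col < len(raster[0]):
        return raster[row][col]
    return None


def forest_flag(raster, x, y, x_min, y_max, cell_size=30.0):
    """1 if the cell under the site centroid is forest, otherwise 0."""
    value = class_at_point(raster, x, y, x_min, y_max, cell_size)
    return 1 if value == FOREST_CODE else 0


# Hypothetical 2 x 2 raster: forest (20), grassland (30), shrubland (40), forest (20).
raster = [[20, 30],
          [40, 20]]
flags = [forest_flag(raster, x, y, x_min=0.0, y_max=60.0)
         for x, y in [(10.0, 50.0), (40.0, 50.0), (45.0, 10.0)]]
```

In this sketch a centroid that falls outside the raster is flagged 0, which treats missing coverage as a miss; a real workflow might prefer to record such sites as no data.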
Figure 2.12: Intersect Relationship between INFyS and GL30
In this case three sampling sites from this conglomerado would be classified as forest (dark green) and one would be classified as shrubland (light green).
2.4 Assessment Method 2: Accounting for Sampling Sites Area and Positional Error in Percent Correct Calculations made in Method 1
To provide a more realistic representation of sampling sites, a method was developed to represent the sites as an area rather than a point. The sampling sites have a known area of 400 square meters and a known shape. For computational purposes, all sites were represented by the shape associated with temperate forests, a circle with a radius of 11.28m. To avoid assuming perfect positional accuracy between the two data sets, a positional error value of 50m associated with Landsat TM and ETM was used (Tucker et al. 2004). This implies that any given Landsat image pixel could be within 50 meters of its actual position on the map. No supported positional accuracy data for the GPS units used was identified, but it is assumed to be less than the error associated with the remotely sensed imagery. In knowing the size of the sampling site and potential
positional accuracy error associated with the underlying raster, we can say that any given sampling site is very likely to be represented by a circle with an area of 400 square meters within 61.28 meters of the centroid of that site. Therefore, if all the area within the 61.28 meters surrounding the site centroid is represented as forest by GL30, it is assumed that this is a correct representation. Under the same logic, a misrepresentation is where none of the area is classified as forest (Figure 2.13). Finally, sites with a mixed value (e.g. 30% forest and 70% grassland) are considered partially covered by forest and were excluded from specific analyses due to the inherent uncertainty in their classification. This method is referred to as the percent correct by area.
The percent correct calculation was performed using a Python script. The calculation itself is derived by simply dividing the number of correct sites by the total number of sites sampled. However, the selection of the forest groups is more cumbersome. The INFyS data with the new binary column was exported as a csv and imported as a pandas data frame. Two different selection methods were used. The first method selected all sites where the vegetation type included a forest group keyword, such as "bosque" or "selva". The second method was used when a specific forest type was selected, such as "bosque de pino" or "bosque de encino". Once selected, the total number of true values from the forest binary column was counted. The total number of sites within the selection was calculated and the percent correct value was recorded and saved in a list. This process was repeated for all ten forest groups. These lists of values were converted into a data frame and then exported as a csv.
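The two selection methods and the percent correct calculation can be sketched as below. The thesis script used pandas; this illustration uses plain Python so that it stays self-contained. The column name "forest" for the binary value, the function name, and the sample records are all hypothetical.

```python
def percent_correct(sites, keyword=None, exact=None):
    """Percent of selected sites flagged as forest.

    sites:   list of dicts with "TIPO_VEGET" and a binary "forest" value
             (1, 0, or None); "forest" is a hypothetical column name.
    keyword: select sites whose vegetation type contains this text
             (first selection method, e.g. "bosque" or "selva").
    exact:   select sites whose vegetation type equals this text
             (second selection method, e.g. "bosque de pino").
    """
    if exact is not None:
        selected = [s for s in sites if s["TIPO_VEGET"].lower() == exact.lower()]
    elif keyword is not None:
        selected = [s for s in sites if keyword.lower() in s["TIPO_VEGET"].lower()]
    else:
        selected = list(sites)
    # Sites without a binary value (partially covered sites) are left out.
    selected = [s for s in selected if s["forest"] is not None]
    if not selected:
        return None
    return 100.0 * sum(s["forest"] for s in selected) / len(selected)


# Hypothetical records for illustration only.
sample_sites = [
    {"TIPO_VEGET": "Bosque de pino", "forest": 1},
    {"TIPO_VEGET": "Bosque de pino", "forest": 0},
    {"TIPO_VEGET": "Bosque de encino", "forest": 1},
    {"TIPO_VEGET": "Selva baja caducifolia", "forest": 1},
    {"TIPO_VEGET": "Selva baja caducifolia", "forest": None},
]
temperate = percent_correct(sample_sites, keyword="bosque")
pine = percent_correct(sample_sites, exact="bosque de pino")
```

Matching on lowercase text is a design choice made here so that "Bosque" and "bosque" select the same sites; the original script may have handled case differently.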
Location and Area based on GPS coordinates of the site centroid taken by field crews who sampled the location
Possible locations due to potential variance in the positional accuracy of the Landsat image used to represent this location
Forest Class of GL30
Grassland Class of GL30
Figure 2.13: Percent Correct by Area process
In the first method, this location would have been classified as Forest class by GL30 because the centroid of the site falls within the GL30 forest class. In the second method, this site would have been defined as partially covered by forest. The largest radius circle represents all the area in which the 400m2 area of the INFyS site could be, given the known potential positional error of the Landsat ETM+ of 50m (Tucker et al. 2004). This can be visualized by imagining the centroid staying in the same position and the image beneath it moving in any direction for up to 50 meters. If the image were to move to the left by 30m, the centroid and sampling area would be fully within the grassland class of GL30. This means that, due to the inherent uncertainty in the position of the Landsat image used to create GL30, it is not possible to say with complete certainty that this site is actually in the forest class.
In order to obtain the percent area value, an individual sampling site was selected and buffered to a circle with a radius of 61.28m. This buffer feature maintained the site ID for linking back to the INFyS data. This polygon was converted to a raster with 1m cells. An area calculation was performed using the GL30 data sets as the reference data. The area
calculation determines how much area of the feature being tested falls into the unique classes defined by the reference data set. In this case, the sampling site with an area of 11,791m2 is compared against the GL30 land cover data. The output identified the unique sampling site ID and how many of the 11,791 one meter cells fell into each of the ten land cover classes associated with GL30. This output table was added to a pandas data frame and the percent forest value was calculated by determining how many of the 11,791 cells were classified as forest (Figure 2.14). This process was repeated for all the sampling sites for the two INFyS data sets. Once complete, the percent area value was joined back to the original shapefile based on the site ID. A binary results column was created on the pandas data frame where 1 = 100% forest, 0 = 0% forest, and 'None' (equivalent to "no data") was applied to all sites partially covered by forest.
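The step from per-class cell counts to the three-valued result can be sketched as follows. The class keys and function names are hypothetical, and the cell counts in the example are invented for illustration. The buffer radius is the 11.28m site radius plus the 50m positional error given in the text.

```python
SITE_RADIUS_M = 11.28        # radius of the circular site, from the text
POSITIONAL_ERROR_M = 50.0    # Landsat positional error, from the text
BUFFER_RADIUS_M = SITE_RADIUS_M + POSITIONAL_ERROR_M  # 61.28 m


def percent_forest(cell_counts, forest_key="forest"):
    """Percent of the buffered cells that fall in the forest class.

    cell_counts: dict of class name -> number of 1 m cells in the buffer
                 (hypothetical keys).
    """
    total = sum(cell_counts.values())
    return 100.0 * cell_counts.get(forest_key, 0) / total


def forest_binary(cell_counts, forest_key="forest"):
    """1 if every cell is forest, 0 if none is, None if partially covered."""
    total = sum(cell_counts.values())
    forest = cell_counts.get(forest_key, 0)
    if forest == total:
        return 1
    if forest == 0:
        return 0
    return None  # partially covered by forest: treated as no data


# Hypothetical cell counts for three buffered sites.
all_forest = {"forest": 11791}
mixed = {"forest": 3537, "grassland": 8254}
no_forest = {"grassland": 11791}
results = [forest_binary(c) for c in (all_forest, mixed, no_forest)]
```

Comparing integer cell counts rather than the rounded percentage avoids a site with a handful of non-forest cells being reported as 100% forest.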
Figure 2.14: Examples of percent correct by area values
The black circle represents the buffered area with a radius of 61.28m. All area surrounding the site on the left is classified as forest, the center has a portion classified as forest which is considered partially covered by forest and the right has no forest.
The percent correct value was calculated using the same base structure as the script referenced in section 3.2. To exclude the sites partially covered by forest, once the selection was made based on the forest group, all sites with "None" in the forest binary column were removed. After that, the percent correct values were calculated and exported as a csv.
2.5 Assessment Method 3: Predictive quality of the INFyS site attributes "Canopy Cover" and "Number of Trees"
Given the methods through which remotely sensed images are derived, the number of trees and canopy coverage at a given site can serve as key indicators of whether the location was correctly represented by GL30. These values can both be viewed as indicators of forest density. Therefore, the distribution of the values associated with correctly represented sites is likely different than the distribution of values associated with misrepresented sites. This relies on the assumption that the greater the forest density at a site, the more likely it is to be identified as forest in remotely sensed data. The distribution of these values is non-normal and considerably skewed to the right (Figure 2.15). Therefore, the Mann-Whitney U test was used to determine whether the distributions of the correctly represented and misrepresented sites are indeed different.
2.2.5.1 Mann-Whitney U Test.
The distribution of values for number of trees and canopy coverage between the two classes was compared using a Mann-Whitney U test. This test ordinally ranks the values from both groups. From this ranking a sum of the ranks is calculated from a single group.
The U value for both groups is determined based on the number of samples within each
group and the sum rank value. The smaller of the two values is then compared against a critical U value. If the calculated U value is smaller than the critical U, the null hypothesis is rejected. The null hypothesis of the Mann-Whitney test is that the distributions are the same. The p-value associated with each test was used as an indicator of whether the null hypothesis could be rejected and, therefore, whether the distributions of the data are indeed different.
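The ranking procedure described above can be illustrated with a short Python sketch. It computes both U values from the rank sum of one group and compares the smaller one with the statistic returned by scipy.stats.mannwhitneyu. The sample values and the function name mann_whitney_u are hypothetical and are not taken from the INFyS data.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) computed from the rank sum of group_a."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Rank all values together; tied values receive the average rank.
    ranks = rankdata(np.concatenate([a, b]))
    rank_sum_a = ranks[:n_a].sum()
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    u_b = n_a * n_b - u_a  # the two U values always sum to n_a * n_b
    return u_a, u_b


# Hypothetical number-of-trees values for correctly and incorrectly represented sites.
correct = [12, 18, 25, 30, 22, 27]
missed = [3, 8, 12, 5, 9]

u_correct, u_missed = mann_whitney_u(correct, missed)
u_small = min(u_correct, u_missed)

# Cross-check against SciPy; depending on the SciPy version the statistic is
# either the U of the first sample or the smaller U, so both are reduced to the minimum.
result = mannwhitneyu(correct, missed, alternative="two-sided")
u_scipy_small = min(result.statistic, len(correct) * len(missed) - result.statistic)
print(u_small, u_scipy_small, result.pvalue)
```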
Figure 2.12: Distributions of Number of Trees and Canopy Coverage
This figure shows the distribution of correctly identified and misidentified sites based on the number of trees per site (left) and canopy coverage (right). The AUC value for canopy coverage was higher than that of number of trees.
By adapting the original selection script used in the percent correct by intersect method, a Mann-Whitney test was performed on the ten forest groups, all of which had over 100 sampling sites. Testing the number of sites in a group was done to ensure the validity of the statistical methods. Sites were selected based on forest group and then stored in one of four arrays based on being correctly (100% forest) or incorrectly (0% forest) represented as forest by the percent correct by area method. Partially covered by forest
sites were excluded from the analysis. The four arrays were: 1. correct canopy coverage values, 2. misrepresented canopy coverage values, 3. correct number of trees values, and 4. misrepresented number of trees values. Two Mann-Whitney U tests were performed for each forest group: one for canopy coverage and one for number of trees per site. These values were stored in a pandas data frame. The process was completed for all ten forest groups and the data frame was exported as a csv.
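A minimal sketch of this per-group workflow is given below. It is not the original script: the column names (forest_group, forest_binary, canopy_cover, n_trees) and the values are hypothetical, and the toy groups are far smaller than the 100-site minimum used in the study.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical site table; forest_binary follows the percent correct by area
# coding (1 = 100% forest, 0 = 0% forest, None = partially covered).
sites = pd.DataFrame(
    {
        "forest_group": ["Pine"] * 6 + ["Oak"] * 6,
        "forest_binary": [1, 1, 1, 0, 0, None, 1, 1, 0, 0, 1, 0],
        "canopy_cover": [80, 75, 90, 20, 35, 50, 70, 65, 30, 25, 85, 40],
        "n_trees": [25, 30, 28, 6, 9, 12, 18, 22, 7, 5, 20, 10],
    }
)

rows = []
# Drop the partially covered sites, then process one forest group at a time.
for group, sub in sites.dropna(subset=["forest_binary"]).groupby("forest_group"):
    correct = sub[sub["forest_binary"] == 1]
    missed = sub[sub["forest_binary"] == 0]
    row = {"forest_group": group}
    for attr in ("canopy_cover", "n_trees"):
        # The four arrays: correct/missed values for each of the two attributes.
        u_value, p_value = mannwhitneyu(
            correct[attr], missed[attr], alternative="two-sided"
        )
        row[f"U_{attr}"] = u_value
        row[f"p_{attr}"] = p_value
    rows.append(row)

mw_results = pd.DataFrame(rows)
print(mw_results)
```

The resulting data frame could then be written out with mw_results.to_csv and a file name of the analyst's choosing.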
2.2.5.2 Receiver Operator Curves (ROC) and Area Under the Curve (AUC).
Given the results of the Mann-Whitney test, the distributions of values for number of trees and canopy coverage between the correctly represented and misrepresented sites were different. Knowing this allows for the use of the Receiver Operator Curve (ROC) and Area Under the Curve (AUC) statistical methods. These methods provide a means of evaluating how effective the number of trees and canopy coverage values are at predicting the classification of a given forest type by the GL30 data. A unique characteristic of the AUC is that the output of the analysis is invariant against class skew and evaluated score (Fawcett, 2005). Therefore, regardless of the number of features within a distribution or the range at which values are measured, the AUC can be compared directly against other AUC values.
The ROC test is created by plotting the true positive rate against the false positive rate of a binary distribution at multiple thresholds. The true positive rate is the number of values that are known to be true at or above a threshold divided by the total number of true values. The false positive rate is the number of false values at or above a given threshold divided by the total number of false values. The true positive and false positive rates are
represented as percentages and form the y and x axes of the plot, respectively. The AUC is then determined by calculating the area under the curve created by the ROC test. ROCs are very useful for visually representing the data and the AUC is useful as a single value representation of the dataset.
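The construction described above can be written out directly. The sketch below computes the true positive and false positive rates at every observed threshold and integrates the curve with the trapezoidal rule. The function names and sample values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def roc_points(values, is_correct):
    """True/false positive rates for 'value >= threshold' at each threshold."""
    values = np.asarray(values, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    n_true = is_correct.sum()
    n_false = (~is_correct).sum()
    # Sweep thresholds from high to low, starting above every value so the
    # curve begins at the point (0, 0).
    thresholds = np.concatenate([[np.inf], np.unique(values)[::-1]])
    tpr = np.array([(values[is_correct] >= t).sum() / n_true for t in thresholds])
    fpr = np.array([(values[~is_correct] >= t).sum() / n_false for t in thresholds])
    return fpr, tpr, thresholds


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve (fpr on x, tpr on y)."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Hypothetical number-of-trees values and whether GL30 mapped each site as forest.
values = [5, 10, 15, 20, 25, 30]
is_correct = [False, False, True, False, True, True]

fpr, tpr, thresholds = roc_points(values, is_correct)
auc = area_under_curve(fpr, tpr)
print(auc)
```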
Keeping consistent with the methodology deployed for the Mann-Whitney test, the ROC and AUC analyses were performed on all forest groups, ensuring that each group contained 100 or more sample sites. The process for the ROC and AUC was built on top of the script for the Mann-Whitney. The four arrays created for each forest group served as the inputs to the roc_auc_score and roc_curve functions that are part of the sklearn.metrics library. The output of these functions is an area under the curve value, false positive rates, true positive rates, and threshold values. These values were stored in a data frame and the process was repeated for the number of trees per site attribute data as well. While the AUC value can be compared directly, the three other values are used to construct the ROC curve for visual interpretation (Figure 2.16). This process was completed for all forest groups, and the final data frame of the AUC values was exported as a csv. The true positive rate is the proportion of all correctly identified sites that are found at a given threshold or above. The false positive rate is the proportion of misidentified sites that are found at a given threshold or above. A detailed example is provided below.
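As a hedged illustration of how two of the arrays could be passed to the sklearn.metrics functions named above, the sketch below joins the correct and misrepresented values into one score array with matching 1/0 labels. How the original script assembled its inputs is not documented here, so this arrangement and the sample values are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical number-of-trees arrays for one forest group.
correct_trees = np.array([25, 30, 28, 18, 22, 15])
missed_trees = np.array([6, 9, 12, 20, 5])

# Assumption: the attribute value is used as the score, with label 1 for
# correctly represented sites and label 0 for misrepresented sites.
scores = np.concatenate([correct_trees, missed_trees])
labels = np.concatenate(
    [np.ones(len(correct_trees), dtype=int), np.zeros(len(missed_trees), dtype=int)]
)

auc = roc_auc_score(labels, scores)
fpr, tpr, thresholds = roc_curve(labels, scores)
print(auc)
```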
Figure 2.13: Example of ROC curve
ROC curve plot for the coverage values and number of trees for the pine forest in the INFyS0 data set.
An ROC and AUC test is being performed on the number of trees values for a forest type group that has 1000 total samples, of which 800 were identified by GL30 as the forest class and 200 were not. Therefore, the percentage correct or producer's accuracy is 80%. This is a general value for all the sites, and it may be that the density of trees in a particular section of forest is well known, such as 15 per 400m2. This value of 15 trees per 400m2 represents a threshold in the ROC test. The true positive rate is determined by identifying how many sites that have at least 15 trees per 400m2 have been correctly identified and dividing that by the total number of correctly identified sites. The false positive rate is determined by the same method, but in this case it is the number of misidentified sites found at or above the threshold over the total number of misidentified sites. A powerful option of this process is the ability to calculate a percent correct value based on a given threshold. In this case, it was found that 500 of the 800 total correctly identified sites and 50 of the 200 misidentified sites had at least 15 trees in a 400m2 area. This would produce a true positive rate of 0.625 and a false positive rate of 0.25. The percent correct value at this threshold is 91% (500 of the 550 sites at or above the threshold). This is 11 percentage points higher than the percent correct value found for all sites. This provides another means of understanding how well GL30 has represented the forest in that region.
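The arithmetic of this worked example can be checked with a few lines of Python. The counts are the ones given in the example; the variable names are illustrative.

```python
# Counts from the worked example in the text.
n_correct_total = 800        # sites GL30 identified as forest
n_missed_total = 200         # sites GL30 did not identify as forest
n_correct_at_threshold = 500 # correct sites with at least 15 trees per 400m2
n_missed_at_threshold = 50   # missed sites with at least 15 trees per 400m2

overall_pct = 100.0 * n_correct_total / (n_correct_total + n_missed_total)
true_positive_rate = n_correct_at_threshold / n_correct_total
false_positive_rate = n_missed_at_threshold / n_missed_total

# Percent correct among only the sites at or above the threshold.
pct_at_threshold = 100.0 * n_correct_at_threshold / (
    n_correct_at_threshold + n_missed_at_threshold
)

print(overall_pct)                  # 80.0
print(true_positive_rate)           # 0.625
print(false_positive_rate)          # 0.25
print(round(pct_at_threshold, 1))   # 90.9
```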
CHAPTER III
RESULTS
The results of this project are presented in three sections based on the three distinct methods used to assess the accuracy of GL30's representation of the forests of Mexico. The first two methods, percent correct by intersect and percent correct by area, involve the direct spatial comparison between the GlobeLand30 and the INFyS data sets through an intersect and percent area calculation. The product of these methods is a percent correct value for the ten forest type groups examined here (see Appendix A for definition of each group). The third method uses attribute data associated with the INFyS sites to determine how well forest inventory sample site attributes can be used to assess how well the GL30 Forest class correctly identifies the forests in Mexico. This is done by applying the receiver operator curve and area under the curve processes (subsection 3.4). By using multiple assessment methods, this study produces a more complete assessment of the relationship between the GL30 and INFyS data sets and of how well the GL30 forest class represents the forests of Mexico.
3.1 Assessment Method 1: Percent Correct by Intersect
The first process undertaken provides a well-known and understood baseline assessment to compare against the other methods. The values reported represent the spatial relationship between the centroid of the sampling site and the GL30 30 x 30m cell where that centroid falls. Two sets of results were produced. One compares GL30_2000 with the 60,580 sites of the INFyS 2004-2009 (INFyS0) inventory data. The second compares GL30_2010 with the 11,476 sites of the INFyS 2009-2013 (INFyS1) data.
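The centroid-to-cell comparison can be sketched with NumPy as a simple grid lookup. This is only an illustration of the idea: the toy grid, its origin, the centroid coordinates, and the use of 20 as the forest class code are assumptions, and the study itself worked with GIS layers rather than an in-memory array.

```python
import numpy as np

FOREST_CODE = 20     # assumption: code of the forest class in the land cover grid
CELL_SIZE_M = 30.0   # GL30 cell size

# Toy 3 x 3 land cover grid; row 0 is the northernmost row.
gl30 = np.array(
    [
        [20, 20, 30],
        [20, 30, 30],
        [10, 20, 20],
    ]
)
x_min, y_max = 0.0, 90.0  # hypothetical coordinates of the grid's upper-left corner

# Hypothetical site centroids as (x, y) pairs in the same projected units.
centroids = np.array([[15.0, 75.0], [75.0, 75.0], [45.0, 10.0], [10.0, 40.0]])

# Convert each coordinate to the column/row of the cell that contains it.
cols = np.floor((centroids[:, 0] - x_min) / CELL_SIZE_M).astype(int)
rows = np.floor((y_max - centroids[:, 1]) / CELL_SIZE_M).astype(int)

is_forest = gl30[rows, cols] == FOREST_CODE
percent_correct = 100.0 * is_forest.mean()
print(percent_correct)  # 75.0
```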
3.1.1 Percent Correct of All Sites
The first assessment was performed on the forest type group for all sites (see Appendix A). The INFyS0 data set contains 58 unique forest types and the INFyS1 data set contains 35 unique forest types (see Appendix A). For the GL30_2000-INFyS0 relationship, an overall percent correct value was found to be 77.2%. The GL30_2010-INFyS1 relationship had an overall agreement of 79.6% (Figure 3.1, Table 3.1).
[Figure: bar chart titled "Percent Correct by Intersect", with one set of bars per forest type group. Horizontal axis: Percent Correct, 70.0 to 100.0.]
Figure 3.1: Percent Correct By Intersect Values
This graph summarizes the results of the percent correct by intersect values of the GL30_2000-INFyS0 data and the GL30_2010-INFyS1 data sets.
3.1.2 Temperate Forests
Temperate forest was defined by all forest types which contained the word "Bosque" in the INFyS TIPO-VEG vegetation class field. This included 21 forest types in the INFyS0 and 17 in the INFyS1 data sets (see Appendix A). The percent correct value was 73.9% for the GL30_2000-INFyS0 relationship and 74.6% for the GL30_2010-INFyS1 relationship.
3.1.3 Tropical Forests
Tropical forest was defined by all forest types which contained the word "Selva" in the INFyS TIPO-VEG vegetation class field. This included 18 forest types in the INFyS0 data set and 13 forest types in the INFyS1 data set (see Appendix A). The percent correct value was 89.3% for the GL30_2000-INFyS0 relationship and 89.1% for the GL30_2010-INFyS1 relationship.
3.1.4 Primary Temperate Forest Classes: Pino, Encino, and Pino-Encino Mix
The temperate forest class contains the most sampling sites in both of the INFyS data sets. The temperate forests are primarily pino (pine), encino (oak), or a mixture of the two. The percent correct values for the Pine, Oak, and Pine-Oak mix forests for the INFyS0 data are: 76.2%, 72.3%, and 73.8% respectively. These values were similar in the INFyS1 data: 74.6%, 72.3%, and 75.4% respectively.
3.1.5 Primary Tropical Forest Classes: Selva Baja, Mediana, and Alta
There is a great diversity of tropical forest types in Mexico. Within the INFyS data the tropical forest types are distinguished based on the height and percent of trees that drop their leaves during the dry season. The percent correct values for the Selva Baja (Low dry deciduous jungle), Mediana (Medium semi-deciduous jungle), and Alta (High evergreen jungle) forests for the INFyS0 data set are: 77.2%, 95.0%, and 91.4% respectively. The values for the INFyS1 data are: 74.9%, 95.5%, and 90.9% respectively.
3.1.6 Erosion presence
The Erosion presence class is a mixture of temperate and tropical forest types that contain the term "Erosion" within the INFyS TIPO-VEG vegetation class field name. The INFyS0 data set contains 18 unique forest types and the INFyS1 data set contains 9 unique forest types that include the "Erosion" term in their description (see Appendix A). For the GL30_2000-INFyS0 relationship, an overall percent correct value was found to be 78.7%. The GL30_2010-INFyS1 relationship has an overall agreement of 76.2%.
The results reported here represent the most basic relationship possible between GL30 and INFyS. The methods presented next attempt to account for some of the limitations of these results. However, these results serve as a baseline against which the more detailed evaluations presented in the next subsections can be compared.
3.2 Assessment Method 2: Accounting for Sampling Sites Area and Positional Error in Percent Correct Calculations made in Method 1
The results in this subsection offer an interpretation of the percent correct relationship that better accounts for the spatial character of the data. This was done by addressing two major assumptions inherent in the percent correct by intersect method presented above. First, the sampling sites are assumed to be points instead of polygons with an area. Second, perfect positional accuracy between the GL30 and INFyS data sets is assumed. Perfect positional accuracy implies that every cell of the GL30 data is found on the exact position of the Earth where it is supposed to be. The method presented in this subsection accounts for those assumptions by defining all sites as either 100% forest, 0% forest, or some percent value in between these, referred to as "partially covered by forest". The percentage refers to the proportion of a defined area around the centroid which is classified as forest by the GL30 Forest class. All percent correct results reported below were calculated using only sample sites that have 100% forest cover and sample sites that have 0% forest cover. The partially covered by forest sites were excluded because there is uncertainty about whether they were truly represented as forest by the GL30 data. These partially covered by forest sites are found in the boundaries between the forest and other land cover/use classes. By addressing the inherent uncertainty in the data, the percent correct by area method provides a more realistic interpretation of the spatial relationship between the GL30 and INFyS data sets.
The method described in this subsection limits the number of sites that are included in the analysis by excluding all sites that are not either 100% or 0% covered by forest over their entire area. Of the 60,580 sites of the INFyS0 data, 10,801 sites were identified as partially covered by forest. These sites are 17.8% of the total sites in the INFyS0 data. Of the 11,476 sites of the INFyS1, 352 sites were identified as partially covered by forest. These sites are 3.0% of the total sites in the INFyS1 data.
3.2.1 Percent Correct of All Sites
This assessment used all INFyS sites that were either 0% or 100% covered by the forest class of the GL30. The INFyS0 data set contains 58 unique forest types and the INFyS1 data set contains 35 unique forest types (see Appendix A). For the GL30_2000-INFyS0 relationship, an overall percent correct value was found to be 84.4%. The GL30_2010-INFyS1 relationship has an overall agreement of 80.4% (Figure 3.2, Table 3.1).
[Figure: bar chart titled "Percent Correct by Area INFyS0 and INFyS1". Horizontal axis: Percent Correct, 70.0 to 100.0.]
Figure 3.2: Percent Correct by Area INFyS0 and INFyS1
This graph summarizes the results of the percent correct by area values of the GL30_2000-INFyS0 data and the GL30_2010-INFyS1 data sets.
Forest type group               Intersect   Intersect   Area      Area
                                INFyS0      INFyS1      INFyS0    INFyS1
All Sites                       79.2        79.6        84.4      80.4
Temperate Forest                73.9        74.5        79.5      75.4
Tropical Forest                 89.3        89.1        93.1      89.8
Pine                            76.3        74.6        82.3      75.7
Oak                             72.3        72.3        77.9      73.1
Pine-Oak mix                    73.8        75.5        79.3      76.4
Low dry deciduous jungle        77.2        74.9        82.8      75.9
Medium semi-deciduous jungle    95.0        95.5        97.3      95.9
High evergreen jungle           91.4        90.9        93.9      91.5
Erosion presence                78.7        76.2        85.8      75.9
Table 3.1: Percent correct values for the INFyS0 and INFyS1
The percent correct values are reported for the INFyS0 and INFyS1 data sets for the percent correct by intersect and percent correct by area evaluations.
3.2.2 Temperate forests
Temperate forest was defined by all forest types which contained the word "Bosque" in the vegetation class field, which includes a wide range of temperate forest types. This included 21 forest classes in the INFyS0 and 17 in the INFyS1 data set (see Appendix A). The percent correct value was 79.5% for the GL30_2000-INFyS0 relationship and 75.4% for the GL30_2010-INFyS1 relationship.
3.2.3 Tropical Forests
Tropical forest was defined by all forest types which contained "Selva" in the vegetation class field. This included 18 forest classes in the INFyS0 and 13 classes in the INFyS1 data sets. The percent correct value was 93.1% for the GL30_2000-INFyS0 relationship and 89.8% for the GL30_2010-INFyS1 relationship.
3.2.4 Primary Temperate Forest Classes: Pino, Encino, and Pino-Encino Mix
The percent correct values for the pino (Pine), encino (Oak), and pino-encino mix (Pine-Oak mix) forests for the INFyS0 data are: 82.3%, 77.9%, and 79.3% respectively. These values were substantially higher than those from the INFyS1 data: 75.7%, 73.1%, and 76.4% respectively.
3.2.5 Primary Tropical Forest Classes: Selva Baja, Mediana, and Alta
The percent correct values for the Selva Baja (Low dry deciduous jungle), Mediana (Medium semi-deciduous jungle), and Alta (High evergreen jungle) forests for the INFyS0 data are: 82.8%, 97.3%, and 93.9% respectively. The values for the INFyS1 data are: 75.9%, 95.9%, and 91.5% respectively.
3.2.6 Erosion presence
The Erosion presence class is a blend of temperate and tropical forest types that contain the term "Erosion" within the INFyS TIPO-VEG vegetation class field name. The INFyS0 data set contains eighteen unique forest types and the INFyS1 data set contains nine unique forest types (Appendix A). For the GL30_2000-INFyS0 relationship, an overall percent correct value was found to be 85.8%. The GL30_2010-INFyS1 relationship has an overall agreement of 75.9%.
3.3 Assessment Method 3: Predictive quality of the INFyS site attributes "Canopy Cover" and "Number of Trees"
The first two methods are direct measurements of the representational accuracy of the Forest class in the GL30 data sets. The third method, presented here, is different because it uses attribute data contained in the INFyS data sets to assess how sensitive the representational accuracy of the GL30 Forest class is to differences in forest canopy coverage ("COBERTURA_ARBOREA") and number of trees ("UARBOLES") reported for the site. The main hypothesis of this assessment method is that sites that have lower canopy coverage and/or fewer trees are less likely to be correctly identified as forests by the GL30 Forest class than sites with high values for those two attributes. Two statistical measures are used to examine this relationship. The Mann-Whitney U test provides an initial indication of whether the values between the correctly represented and misrepresented sites are indeed different. The Receiver Operator Curve and Area Under the Curve measurements enable the comparison of the predictive quality of these two attributes (canopy cover and number of trees). By using the canopy cover and number of trees contained in the INFyS data sets we are able to produce a fuller perspective of how well the GL30 Forest class represents the forest of Mexico.
3.3.1 Mann-Whitney U Test
The first question is whether the values of canopy coverage and number of trees per site differ between sites correctly classified as forest and sites incorrectly classified as forest. The Mann-Whitney U test was chosen because it is a non-parametric test that can be used to determine whether the two data distributions are indeed different. There are limitations on the sample size needed to ensure the inferences of the test are correct. The documentation for the Python function used to perform the Mann-Whitney U test suggests a minimum sample size of 20 for each of the two groups being tested (docs.scipy.org, 2016). In this study, we performed this analysis on forest type groups (see Appendix B) that contained more than 100 sites total. This ensured that all forest type groups tested meet the minimum sample size required in the test. Both correctly identified and misidentified sites have more than 20 sites each for each forest type group analyzed. The output of a Mann-Whitney test is a U-value and p-value. The p-value is the result of a two-sided test. If the p-value is less than 0.001, it can be said with confidence that the distributions of the canopy coverage and number of trees associated with correctly identified and misidentified sites are different. The results of the Mann-Whitney U test provide confidence that the data is well suited for the ROC and Area Under the Curve analysis.
3.3.2 Results of the INFyS0 data set
A listing of the U-values and p-values from the test is included in Appendix B. For clarity, the p-values are reported here by relative significance, where 1 refers to the smallest p-value (Table 3.2).
INFyS0
Forest type group               P_Coverage   P_Trees
All Sites                       N/A          N/A
Temperate Forest                N/A          5
Tropical Forest                 4            2
Pine                            8            12
Oak                             3            11
Pine-Oak mix                    1            10
Low dry deciduous jungle        7            9
Medium semi-deciduous jungle    13           6
High evergreen jungle           15           14
Erosion presence                16           17
Table 3.2: Ranked p-values from Mann-Whitney; INFyS0
The p-values from the Mann-Whitney U test of the canopy coverage (P_Coverage) and number of trees per site (P_Trees) are reported in ranked order with 1 being the smallest value.
3.3.2.1 Canopy coverage.
All p-values were less than 0.001. The Python function did not produce a p-value for 3 of the twenty categories. These forest type groups represent the most numerous forest type groups and it is believed that some limitation in the function may have occurred as a result. The function may have been overpowered by the number of features being tested. The Mann-Whitney U test is dependent on the number of samples within the group. It is more likely that a variance be present within a group when there is a large number of samples. However, the exact cause of this is not known. The results do imply that the distribution of canopy coverage values is different between the correctly identified and misidentified sites.
3.3.2.2 Number of trees.
All p-values were well below 0.001. The p-value for the forest type group "All Sites" was not reported by the Python function. The results imply that the distribution of number of trees is different between the correctly identified and misidentified sites.
3.3.3 Results of the INFyS1 Data Set
A listing of the U-values and p-values from the test is included within Appendix B. For clarity, the p-values are reported here by relative significance, where 1 refers to the smallest p-value (Table 3.3).
INFyS1
Forest type group               P_Coverage   P_Trees
All Sites                       2            1
Temperate Forest                3            5
Tropical Forest                 8            4
Pine                            13           14
Oak                             7            10
Pine-Oak mix                    6            11
Low dry deciduous jungle        15           12
Medium semi-deciduous jungle    16           9
High evergreen jungle           18           17
Erosion presence                20           19
Table 3.3: Ranked p-values from Mann-Whitney; INFyS1
The p-values from the Mann-Whitney U test of the canopy coverage (P_Coverage) and number of trees per site (P_Trees) are reported in ranked order with 1 being the smallest value.
3.3.3.1 Canopy coverage.
All p-values were well below 0.001. The results imply that the distribution of canopy coverage is different between the correctly identified and misidentified sites.
3.3.3.2 Number of trees.
All p-values were well below 0.001. This implies that the distribution of number of trees is different between the correctly identified and misidentified sites.
3.4.4 Receiver Operator Curves (ROC) and Area Under the Curve (AUC)
In order to capture the potential of the canopy coverage and number of trees values associated with the INFyS data sets, the Receiver Operator Curve (ROC) and Area Under the Curve (AUC) calculations were applied. These analyses were performed on the measured attributes of canopy coverage and number of trees. These attribute values are representative of forest density in the sampling site locations. Since GL30 is based on remotely sensed data, these measurements are believed to be good indicators of the likelihood of a correct classification at a given location by the GL30 Forest class. The logic is that areas with higher forest density will be more likely to be correctly identified as forest by GL30. The ROC is a visual representation of the data whereas the AUC is a numerical representation of the predictive power of the relationship. AUC values are comparable across different data sets and are the primary reporting tool for this analysis. The AUC is best interpreted as the probability that a randomly chosen correctly identified site will have a higher coverage and/or number of trees value than a randomly chosen misidentified site for that given forest type group (Fawcett, 2006). An AUC of 0.50 is the equivalent of a random guess or no predictive power. As the AUC approaches 1 the predictive power increases. See Fawcett (2006) for a more complete description of the process. It should be noted that for all forest type groups from both INFyS data sets, the AUC values were always above 0.50. This implies that, across the board, canopy coverage and number of trees at a location have a positive predictive value on whether or not GL30 will classify that site as forest.
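The probability interpretation of the AUC can be illustrated numerically. The sketch below counts, over every pairing of a correctly identified site with a misidentified site, how often the correct site has the higher value (ties count as one half), and shows that the same number is obtained from the Mann-Whitney U value divided by the number of pairs. The function names and canopy coverage values are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def auc_pairwise(correct_values, missed_values):
    """Share of (correct, missed) pairs where the correct site has the higher value."""
    c = np.asarray(correct_values, dtype=float)[:, None]
    m = np.asarray(missed_values, dtype=float)[None, :]
    wins = (c > m).sum()
    ties = (c == m).sum()
    return (wins + 0.5 * ties) / (c.size * m.size)


def auc_from_ranks(correct_values, missed_values):
    """Same quantity via the Mann-Whitney U of the correct group: U / (n1 * n2)."""
    c = np.asarray(correct_values, dtype=float)
    m = np.asarray(missed_values, dtype=float)
    ranks = rankdata(np.concatenate([c, m]))  # average ranks for ties
    u_correct = ranks[: len(c)].sum() - len(c) * (len(c) + 1) / 2.0
    return u_correct / (len(c) * len(m))


# Hypothetical canopy coverage values (percent).
canopy_correct = [80, 75, 90, 60, 40]
canopy_missed = [20, 35, 60, 50]

print(auc_pairwise(canopy_correct, canopy_missed))    # 0.875
print(auc_from_ranks(canopy_correct, canopy_missed))  # 0.875
```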
                                INFyS0         INFyS0      INFyS1         INFyS1
Forest type group               AUC_Coverage   AUC_Trees   AUC_Coverage   AUC_Trees
All Sites                       0.71           0.67        0.68           0.69
Temperate Forest                0.71           0.61        0.69           0.64
Tropical Forest                 0.71           0.76        0.66           0.76
Pine                            0.71           0.67        0.69           0.68
Oak                             0.70           0.59        0.68           0.63
Pine-Oak mix                    0.72           0.61        0.69           0.63
Low dry deciduous jungle        0.69           0.68        0.65           0.70
Medium semi-deciduous jungle    0.71           0.80        0.70           0.80
High evergreen jungle           0.75           0.77        0.72           0.78
Erosion presence                0.78           0.76        0.77           0.76
Table 3.4: AUC Values from INFyS0 and INFyS1
Identifies the area under the curve value for both the INFyS0 and INFyS1 data sets for the canopy coverage and number of trees per site values.
3.4.5 INFyS0
3.4.5.1 Canopy coverage. The AUC value for coverage varied little between forest groups (Table 3.4, Figure 3.3); the range is only 0.09. The best predictor was the Erosion presence class and the worst predictor was the Low dry deciduous jungle.
3.4.5.2 Number of trees. Compared to the canopy coverage values, the AUC for number of trees is generally lower and has more variability between forest groups. The values range from 0.59 in the Oak to 0.80 in the Medium semi-deciduous jungle. This produces a range of 0.21, which is slightly more than two times that found in the canopy coverage values (Table 3.4).
[Figure: bar chart titled "AUC for INFyS0 Canopy Coverage and Number of Trees", with one set of bars per forest type group. Horizontal axis: Area Under the Curve.]
Figure 3.3: AUC Values for INFyS0
Area under the curve values for canopy coverage and number of trees for a given forest group.
3.4.6 INFyS1
3.4.6.1 Canopy coverage. The AUC values for canopy coverage varied little outside of the Erosion presence class. The lowest value of 0.65 was found in the Low dry deciduous jungle group and the highest was found in the Erosion presence group at 0.77. The range in value between the two is 0.12 (Table 3.4).
3.4.6.2 Number of trees. The AUC values for the number of trees at a site varied across the forest groups. The highest value of 0.80 was found in the Medium semi-deciduous jungle and the lowest value of 0.63 was shared between the Oak and Pine-Oak mix. The range is 0.17 (Table 3.4).
[Figure: bar chart titled "AUC for INFyS1 Canopy Coverage and Number of Trees", with one set of bars per forest type group. Horizontal axis: Area Under the Curve, 0.55 to 0.85.]
Figure 3.4: AUC Values for INFyS1
Area under the curve values for canopy coverage and number of trees for a given forest group.
3.5 Comparison of AUC values between the INFyS0 and INFyS1 data sets
Both the GL30 data sets, 2000 and 2010, used the same classification methodology. Therefore, differences between the AUC results may be more reflective of the differences in the INFyS data sets.
3.5.1 Canopy coverage
Interestingly, all the AUC values for canopy coverage decreased between the INFyS0 and INFyS1 data sets. While these differences are minor, never more than 0.05, the ubiquity of the decrease is worth noting (Figure 3.5).
[Figure: bar chart titled "AUC for Canopy Coverage", with INFyS1 and INFyS0 bars for each forest type group. Horizontal axis: 0.55 to 0.85.]
Figure 3.5: AUC values for canopy coverage
AUC values for canopy coverage for the INFyS0 and INFyS1 data sets.
3.5.2 Number of Trees
The AUC values for the number of trees either stayed the same or increased between the INFyS0 and INFyS1 data sets. The largest increase was found in the Oak forest group at 0.04 (Figure 3.6).
[Figure: bar chart titled "AUC for Number of trees", with INFyS1 and INFyS0 bars for each forest type group. Horizontal axis: 0.55 to 0.85.]
Figure 3.6: AUC values for Number of Trees
AUC values for the number of trees for both the INFyS0 and INFyS1 data sets.
CHAPTER IV
DISCUSSION
The release of the GL30 global land cover data marked a dramatic increase in the resolving power of global land cover data sets (Chen et al. 2015). The ten land cover classes of the GL30 were mapped at 30m resolution. The extra detail that is offered by this finer resolution data set has important implications for future land cover/use studies. Because the data were released in 2015, only a limited number of studies have assessed the accuracy of the GL30. The goal of this study was to perform an assessment of the representational accuracy of the forest class of the GL30 for the temperate and tropical forest of Mexico.
This work will contribute to the overall understanding of GL30's success as a land cover product because it is the first independent study to evaluate GL30 in environments south of 30 degrees north latitude. This study is different than other assessments of the GL30 in that it assesses only a single land cover class, and it will add a significant level of understanding as to how well the GL30 Forest class identifies various forest types. The INFyS data sets enable a robust and detailed assessment of ten distinct forest type groups. These groups are representative of specific ecological conditions and the results of this study can be extended to serve as a proxy for how well GL30 will represent similar forest types around the world. Through the application of the Receiver Operator Curve (ROC) and Area Under the Curve (AUC) assessments, this study was able to incorporate the forest density metrics of canopy coverage and number of trees to assess how those values affect the ability of the GL30 Forest class to identify forest.
4.1 Summary of Findings
4.1.1 Percent Correct by Intersect
Many important findings come from the percent correct by intersect method. Most significantly, all forest types have a producer's accuracy of at least 72.3% (Figure 3.1). The second factor worth noting is that there is very little variance, less than 3%, when comparing the percent correct values between the INFyS0 and INFyS1 data sets. This consistency across multiple tests adds weight to the accuracy of the results. The tropical forests were better represented than the temperate forests by the GL30 forest class, with the medium semi-deciduous jungle and high evergreen jungle obtaining percent correct values of over 90%. Forest classes that have a deciduous character, such as oak and low dry deciduous jungles, were the worst represented of the forest groups, suggesting that the seasonal variability of these forests may not be well accounted for in the GL30 process. Lastly, the erosion presence forest type class performed as well as the all sites class, which suggests that the presence of marginal land conditions does little to affect GL30 representation of the forests of Mexico.
4.1.2 Percent Correct by Area
The reasoning for applying the percent correct by area method was to reduce the uncertainty within the relationship between the GL30 Forest class and the INFyS data. The results show that reducing the uncertainty increased the percent correct values. Among the ten forest type groups, only one showed a decrease in percent correct from the percent correct by intersect method to the percent correct by area method (Figure 4.1). This was the Erosion presence class of the GL30_2010-INFyS1 relationship, which decreased by 0.3%. This highlights two well understood components of remote sensing based land cover assessment: 1) internal core areas that represent a homogeneous land cover type are well classified by remote sensing based land cover classification methods, and 2) areas of ecological transition that relate to changes in land cover classes are difficult to classify (McCallum et al. 2005). The percent correct by area method clearly illustrates this relationship with the GL30 data because all transitional sites that were only partially covered by forest were excluded. This left only core areas of the GL30 Forest class or core areas of another GL30 class.
[Chart. Forest groups: Erosion presence, High evergreen jungle, Medium semi-deciduous jungle, Low dry deciduous jungle, Pine-Oak Mix, Oak, Pine, Tropical Jungle, Temperate forest, All Sites. Value axis: Percent Change from Intersect to Area, -2.0 to 8.0. Series: percent change INFyS1, percent change INFyS0.]
Figure 4.1: Change in percent correct between intersect and area
The difference between the producer's accuracy of the percent correct by intersect and percent correct by area methods for both the INFyS0 and INFyS1 data sets.
The increase in percent correct was found in both data sets, but the increases within the INFyS1 data set were relatively modest (Figure 3.2). This is because only 352 of the 14,451 sampling sites were found to be partially covered by forest. Therefore, the impact that these sites could have on the overall change in the percent correct value for the various forest type groups is limited. Percent correct values for the forest type groups increased between 0.4% and 1.1%, excluding the Erosion presence forest type group, which decreased by 0.3%.
The GL30_2000-INFyS0 relationship was highly affected by this method. Among the 60,580 sampling sites, 10,801 were classified as partially covered by forest. Excluding them resulted in a dramatic increase in the percent correct values of the ten forest type groups. The increase ranged from 2.3% for the medium semi-deciduous jungle forest type group to 7.2% for the Erosion presence forest type group. In general, the forest type groups with the lower percent correct values, such as oak and low dry deciduous jungle, were most strongly affected by the elimination of partially covered by forest sites. This implies that these forest type groups are more likely to be found in transition environments where land cover changes frequently over space (Figure 4.2).
[Map legend: partially covered by forest site locations; GL30 forested area.]
Figure 4.2: Relative Density of Partially Covered by Forest Sites from the GL30_2000 Data Set
This map shows the relative density of partially covered by forest sites within the GL30_2000-INFyS0 relationship. The light green area represents the location where GL30 has mapped forest for 2000. A 40% transparency is applied to the 'partially covered by forest' layer to allow the underlying forested areas to show through.
4.1.3 Receiver Operator Curve (ROC) and Area Under the Curve (AUC)
For the purpose of comparison and interpretation, the AUC values are more generally useful than the graphical representations of the ROC curves because the AUC test provides a single value that is comparable across other AUC analyses. The AUC value can be viewed as the likelihood that a randomly chosen correctly identified value will have a higher score than a randomly chosen misidentified value. An AUC value of 1 indicates a perfect predictor and an AUC of 0.5 indicates a predictive quality that is the same as random chance. An AUC value of 0.7 implies that a correctly identified site will have a higher value for canopy coverage or number of trees than a misidentified site 70% of the time. Within this study, all forest type groups had an AUC value of over 0.5. This was an expected result, based on the results of the Mann-Whitney test, which showed that the distributions of the correctly identified and misidentified sites were indeed different. All ROC and AUC tests were performed on the subsets of the INFyS0 and INFyS1 data sets that excluded the partially covered by forest sites.
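The probabilistic reading of the AUC described above can be illustrated with a minimal sketch. It counts, over all pairs of one correctly identified and one misidentified site, how often the correctly identified site has the higher value, with ties counted as one half. The function name and the sample canopy coverage values are hypothetical and are not taken from the INFyS data; this is an illustration of the interpretation, not the procedure used in this study.

```python
def pairwise_auc(correct_values, misidentified_values):
    """Share of (correct, misidentified) pairs in which the correctly
    identified site has the higher value; ties count as one half."""
    wins = 0.0
    for c in correct_values:
        for m in misidentified_values:
            if c > m:
                wins += 1.0
            elif c == m:
                wins += 0.5
    return wins / (len(correct_values) * len(misidentified_values))


# Hypothetical canopy coverage values (percent), for illustration only.
correct = [80, 65, 70, 90]
misidentified = [30, 65, 40]
auc = pairwise_auc(correct, misidentified)
print(round(auc, 3))
```

Because this pairwise count is the same quantity that underlies the Mann-Whitney statistic, an AUC above 0.5 is consistent with the Mann-Whitney result that the two distributions differ.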
The AUC value for canopy coverage varied distinctly between the INFyS0 and INFyS1 data (Figure 3.5). In all ten forest type groups, the INFyS0 data had a higher AUC value than the INFyS1 data. This implies that, in general, the canopy coverage of correctly identified and misidentified sites is more distinct within the INFyS0 data set than within the INFyS1 data set. The AUC values ranged between 0.65 and 0.78. While the forest type groups erosion presence and high evergreen jungle had distinctively higher AUC values, the variability among the other forest type groups is rather low. This implies that the GL30 Forest class's ability to distinguish between forested and non-forested lands based on canopy coverage is consistent across the majority of the forest type groups.
The AUC values for the number of trees varied between the INFyS0 and INFyS1 data sets, with the INFyS1 generally having a higher AUC value than the INFyS0 (Figure 3.6). This implies that there is a greater distinction in the number of trees between the correctly identified and misidentified sites within the INFyS1 data set than within the INFyS0 data set. Four of the ten forest type groups had AUC values above 0.75, implying a very high degree of predictive quality relative to other forest type groups within this study. Tropical forests were distinguished well by the total number of trees at a given location.
4.2 Comparison of Producer's Accuracies with Existing Studies
The accuracy assessment performed in this study is contrasted here against information contained in four published research studies: Chen et al. 2015, Brovelli et al. 2015, Arsanjani et al. 2016, and Sun et al. 2016. The values from the percent correct by intersect method can be compared against the results from all four studies, the values from the percent correct by area method can only be compared against those from the Brovelli study, and the AUC and ROC values cannot be directly compared to any of the four studies. The values taken from the four studies represent the producer's accuracy metric. The calculation of the producer's accuracy is equivalent to the percent correct calculation, so the terms are interchangeable.
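The equivalence between percent correct and producer's accuracy can be made concrete with a short sketch. Because every validation site is known to be forest, the producer's accuracy for a forest type group reduces to the share of its sites that GL30 maps as forest. The function name, the record layout, and the sample records below are hypothetical and are used only for illustration.

```python
from collections import defaultdict


def producers_accuracy(sites):
    """sites: iterable of (forest_type_group, mapped_as_forest) pairs,
    where every site is a ground-verified forest site.
    Returns the percent of sites mapped as forest for each group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, mapped_as_forest in sites:
        totals[group] += 1
        if mapped_as_forest:
            hits[group] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Hypothetical records, not INFyS values.
sample = [
    ("oak", True), ("oak", False), ("oak", True), ("oak", True),
    ("pine", True), ("pine", False),
]
result = producers_accuracy(sample)
print(result)
```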
The producer's accuracies from the percent correct by intersect method in this study were generally lower than the values reported in the four established studies. Chen et al. 2015 report a single producer's accuracy for the forest class of 92.40%. Considering that this value was taken from a region above 30 degrees north, it represents a much higher level of accuracy than what was seen in the temperate forest classes of this study. Brovelli et al. 2015 reported producer's accuracies of 85% for a reclassified class that includes forest and four other GL30 land cover types, 90% for the same reclassified class buffered to account for co-location effects, and 81% for the forest class by itself. The increase in accuracy seen when accounting for potential positional error of the data set matches the results seen in this study very well. The value for the forest class by itself is approximately 7% higher than the value reported for temperate forest within this study. Arsanjani et al. 2016 reported four producer's accuracies from four unique land cover validation sets. These values (23.65%, 82.89%, 84.63%, and 93.97%) were all reported for a reclassified class that includes forest and four other GL30 land cover types. Excluding the lowest value, the remaining three were all significantly higher than the percent correct value found for the temperate forest class in this study. Sun et al. 2016 combined the forest and shrubland classes of the GL30 and reported a producer's accuracy of 34%. This value is extremely low and was not seen in any of the ten forest type groups within this study.
Due to the various assessment methods employed by each study, making a direct comparison between them is challenging. It is important to note that all these studies relied on existing land cover data sets or user-interpreted high-resolution imagery and took place in temperate forests north of Mexico. Overall, the reported values were higher than what was assessed in this study. The variance among all the studies suggests that regional-level assessments provide an important perspective on how the GL30 Forest class identifies forest in different geographic locations.
4.3 Evaluation of the Study
The introduction of GlobeLand30 to the world represents a significant increase in the resolving power of global land cover data sets. As GL30 is validated across different locations, the significance and importance of the data will become clear to the land cover research community.
This project provides a detailed perspective on how GL30 represents various types of forest. The INFyS data were collected to provide a spatially continuous survey that accounted for the various forest types of Mexico. As a result, there is good coverage across the forested areas of Mexico. Because all the sampling sites contain spatial information, it is very clear where misrepresentation occurs. This is an important advantage compared to raster-based assessments. GlobeLand30 contains a single land cover class for forest. This simplification can mask details about the accuracy of specific forest types. For instance, in this study it was found that the medium semi-deciduous jungles were very well represented by the GL30 Forest class, but the low dry deciduous jungle was not well represented. The level of detail of this analysis may be the most valuable aspect of this study. All the forest types within the study are tied to certain ecological conditions. In other regions around the world where similar forest types are found in similar ecological conditions, the results of this study could be used as a general proxy for how the GL30 Forest class will represent that class of forest.
The primary challenge in comparing the results of this study to other evaluations of the GL30 data set is that the INFyS data set only contains information on where forest is present and nothing about where it is not present. This means that only the producer's accuracy can be addressed. The other common evaluation methods, which include user's accuracy, overall accuracy, and derivatives of those, cannot be addressed as part of this study.
This work is unique with respect to the sheer number of validation sites present. There are a total of 72,056 ground-verified sites against which the Forest class of the GL30 was tested. The initial accuracy assessment of the GL30 reported in Chen et al. 2015 was done using 159,874 pixels for all ten land cover classes. These pixels were distributed across a selection of the 847 map sheets that were used to create the GL30 (Chen et al. 2015). As a result, the density of validation sites used in this study is significantly greater than that of any existing study that employs user-interpreted or ground-verified sites for validation.
The ROC and AUC process shows the significance of the forest density metrics of number of trees and canopy coverage in understanding how the GL30 Forest class represents a location as forest or not. For instance, the GL30 Forest class is more likely to represent a tropical forest as forest if there is a high number of trees at the site. In the same sense, a temperate forest with a high value for canopy coverage is more likely to be represented as forest by the GL30 Forest class.
The most distinctive aspect of this study is the use of the ROC and AUC methodology. These methods capture the quality of the forest inventory data sets. One of the most powerful aspects of the ROC test is that it can provide a means for predicting the likelihood that the GL30 Forest class will identify an area as forest based on the forest type and either the number of trees or the canopy coverage at the site. If a forest density metric of canopy coverage or number of trees per 400 m² area is known, it can then be determined how many sites belonging to that forest type group were correctly identified above that threshold and how many sites were misidentified below that threshold. From these two values, the probability that a location is accurately represented in the GL30 Forest class could be estimated. The methodology for this process is still being developed. In concept, it offers a unique means of predicting how well an un-validated region may be represented by the GL30.
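Since the methodology is described as still under development, the following is only a rough sketch of how the two counts named above might be obtained for one forest type group. The final ratio (the share of sites at or above the threshold that were correctly identified) is one possible way to turn the counts into a probability and is an assumption of this sketch, not a method stated in this study. All names and values are hypothetical.

```python
def threshold_summary(values, correct_flags, threshold):
    """values: forest density metric per site (canopy coverage or tree count).
    correct_flags: True where GL30 mapped the site as forest.
    Returns the counts around the threshold and an assumed probability."""
    pairs = list(zip(values, correct_flags))
    correct_above = sum(1 for v, ok in pairs if ok and v >= threshold)
    misidentified_above = sum(1 for v, ok in pairs if not ok and v >= threshold)
    misidentified_below = sum(1 for v, ok in pairs if not ok and v < threshold)
    n_above = correct_above + misidentified_above
    # Assumption: probability = correct sites at or above the threshold
    # divided by all sites at or above the threshold.
    share_correct_above = correct_above / n_above if n_above else None
    return {
        "correct_above": correct_above,
        "misidentified_below": misidentified_below,
        "share_correct_above": share_correct_above,
    }


# Hypothetical canopy coverage values (percent) and GL30 outcomes.
values = [10, 20, 35, 50, 60, 80]
flags = [False, False, True, False, True, True]
summary = threshold_summary(values, flags, threshold=30)
print(summary)
```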
4.3.1 Limitations of the Areal Representation of the Tropical Forest Class
The INFyS data set defined a specific site survey area based on whether the forest type was temperate or tropical. Temperate sites were represented as a 400 m² circle centered on the sampling site centroid. Tropical sites were represented as a 400 m² rectangle with a primary axis of 40 m and a secondary axis of 10 m. An orientation was also applied based on the sampling site's position within the cluster. Due to the computational challenges of defining the orientation of the sampling site, it was determined that the circular area would be used to represent both temperate and tropical sites. Because the centroid is known, a circle will always include approximately 212 m² of the 400 m² sampling area regardless of the orientation of the tropical sampling site. Alternatives such as using a rectangular form and selecting a known orientation could at most account for 400 m² of the sampling area once per cluster and between 100 m² and approximately 135 m² for the other sites. Summing these options for the circle (212 × 4 = 848) and the rectangle (400 + 100 + 135 + 135 = 770) indicates that on average the circular area will capture more of the true sampling site area than a rectangular form (Figure 4.3). Accounting for the orientation of the sampling sites is a major opportunity for improvement in the accuracy of the interpretation of the tropical forest types.
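The comparison above is simple arithmetic and can be restated as a short sketch. The overlap values (212, 400, 100, and 135 m²) are taken directly from the text; the radius of a 400 m² circle and the two shares of the true sampling area are derived here for illustration, and the variable names are hypothetical.

```python
import math

plot_area_m2 = 400.0  # area of one sampling site, from the text

# Radius of a circle with the same 400 m² area (derived, about 11.28 m).
circle_radius_m = math.sqrt(plot_area_m2 / math.pi)

# Overlap with the true tropical sampling area per cluster of four sites,
# using the values stated in the text.
circle_total_m2 = 4 * 212.0
rectangle_total_m2 = sum([400.0, 100.0, 135.0, 135.0])

# Share of the full 1,600 m² cluster sampling area captured by each option.
circle_share = circle_total_m2 / (4 * plot_area_m2)
rectangle_share = rectangle_total_m2 / (4 * plot_area_m2)

print(round(circle_radius_m, 2), circle_total_m2, rectangle_total_m2)
print(circle_share, rectangle_share)
```

Under these figures the circle captures about 53% of the true sampling area per cluster and the fixed-orientation rectangle about 48%.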
Figure 4.3: Tropical Sampling Sites Error in representation
Due to the limitation in the ability to represent the correct orientation of the tropical forest sampling site areas for the INFyS data sets, a circle was used to capture the most possible area of the sampling sites. The circular site will capture 212 m² (purple area in the left image) of the 400 m² area of all four rectangular sampling site areas. The purple area of the two rectangles represents approximately 135 m². The circle is the optimal shape for including the most possible area.
4.3.2 Limitation of Temporal Mismatch of Data sets
One source of error in the data is the temporal difference between the collection date of the images used in the GL30 classification and the time that the INFyS data were collected. The collection date of the GL30 images is known, but the INFyS data are only loosely defined by the date ranges of 2004 to 2009 and 2009 to 2013. This temporal variance will result in errors in representation because the forest does change over time due to natural and anthropogenic processes. An important factor to consider is that sampling site selection for the INFyS0 data set was determined based on the presence of a forest class in the INEGI Series III land cover data. The INEGI Series III data were mapped and published in 2002. While not explored within this study, multiple methods are outlined below that could be used to understand the potential error associated with the temporal mismatch of the two data sets.
The first resource for attempting to understand the potential for temporal change in forest cover over a five to ten-year period in Mexico is through an examination of the existing scientific literature. Studies that summarize the overall change in forest or forest fragmentation will be the primary sources for gathering quantitative measures of how much change is occurring and where it is occurring. This will provide the most distinctive assessment of forest change.
A second resource is a cross tabulation of the GL30_2000 and GL30_2010 data sets. This will provide distinct values of how much forest was mapped in both time periods, how much new forest was classified, and how much forest is no longer classified as forest. This work is currently being published by Moreno et al. 2017.
Lastly, some cluster locations are sampled in both the INFyS0 and INFyS1 data sets. This means that data regarding the number of trees and canopy cover are present over the two temporal periods. Because of the detailed attribute data associated with each site, this overlap of sampling sites provides information about where forest has remained and how much it has changed. While it is unclear how valuable this process may be, there is potential that it can add to the understanding of how the forest has changed over time.
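The cross tabulation of the GL30_2000 and GL30_2010 data sets mentioned above can be sketched as follows. The sketch assumes that each location has already been reduced to a forest or not forest flag for each year; the function name and the sample flags are hypothetical and do not represent GL30 values.

```python
from collections import Counter


def forest_cross_tab(forest_2000, forest_2010):
    """Cross tabulate two equal-length sequences of forest flags,
    one entry per location, for the two mapping dates."""
    counts = Counter(zip(forest_2000, forest_2010))
    return {
        "forest in both": counts[(True, True)],
        "new forest": counts[(False, True)],
        "forest lost": counts[(True, False)],
        "forest in neither": counts[(False, False)],
    }


# Hypothetical flags for five locations.
flags_2000 = [True, True, False, False, True]
flags_2010 = [True, False, True, False, True]
table = forest_cross_tab(flags_2000, flags_2010)
print(table)
```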
While close consideration must be given to addressing the temporal variance between the validation set and the GL30 data, recent work has shown that rates of forest change across Mexico have lowered into the 2000s (Moreno et al. 2014). This implies that the temporal mismatch may not represent a significant source of error within the validation of the GL30 data set.
4.3.3 Image Capture Date and Leaf on/Leaf Off Character of Deciduous Forest
The spectral reflectance of many forests changes seasonally. This is especially pronounced in deciduous forests that lose their leaves. Given that the image capture date is known for all images used in the GL30 process, a simple test could be performed to see whether the leaf on or leaf off character of specific forest types was captured by the image. Research into the ecological character of the forest types defined within the INFyS data will show which forests lose leaves seasonally. The general date ranges for when the forests have no leaves will be recorded. Using this information, the image capture date for forest types that seasonally lose leaves will be compared to the dates of the leaf off season. One can then predict whether the forest had leaves when the image was captured, a condition that greatly changes the spectral reflectance of the landscape.
This represents an important first step in any future line of research. In all assessments, the low dry deciduous jungle forest type group, which contains multiple deciduous forest types, and oak, which is a deciduous forest, have had the lowest representational accuracies of any of the ten forest type groups. An initial examination of the image capture dates from the GL30_2000 data set shows that the majority of images were captured from September to December during the 1999 and 2001 time periods (Figure 4.4). This coincides with the drier and cooler period of the year when the forests drop their leaves. Further analysis of this step will add to the understanding of how GL30 classifies deciduous forest.
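The leaf on and leaf off comparison proposed above could be sketched as a simple date test. The leaf off window used below (1 November to 30 April) is a placeholder assumption for illustration only; actual windows would need to come from the ecological literature for each forest type. The function name is likewise hypothetical.

```python
from datetime import date


def captured_leaf_off(capture_date, leaf_off_start, leaf_off_end):
    """Return True if the image capture date falls inside the leaf off
    season. Season bounds are (month, day) tuples, and the season may
    wrap over the new year (for example November to April)."""
    month_day = (capture_date.month, capture_date.day)
    if leaf_off_start <= leaf_off_end:
        return leaf_off_start <= month_day <= leaf_off_end
    # Season wraps over 31 December.
    return month_day >= leaf_off_start or month_day <= leaf_off_end


# Placeholder leaf off window; not a value reported in this study.
start, end = (11, 1), (4, 30)
print(captured_leaf_off(date(1999, 11, 15), start, end))
print(captured_leaf_off(date(2001, 7, 1), start, end))
```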
Figure 4.4: Image capture date for the INFyS0 data set
This plot illustrates the collection time for the 191 Landsat images that were used to create the GL30_2000 land cover classification. Each point represents a single Landsat image. There is a high concentration of images during the time around November in both 1999 and 2001.
4.3.4 Variation within the Cluster Groups
Within the INFyS data set, the cluster is a rigidly defined structure. There is, however, variation within this sampling structure due to accessibility limitations from terrain and private land and/or there being no trees present within a specific site. As a result, the majority of clusters have four sites, yet clusters do exist with three, two, and even one site. Since the lack of a sampling site could suggest no forest, it is possible that the clusters with fewer than four sites are more likely to be representative of transitional areas within the forest and hence more likely to be misclassified.
To address this query, clusters could be grouped based on the number of sites within them. For these groups, a similar set of tests (percent correct by intersect, percent correct by area, and ROC-AUC analysis) could be performed. Testing how many sites fall within clusters that have fewer than four sites in total, or how many of those sites are partially covered by forest sites, may demonstrate the forest variability of the location. An important caveat to these results would be that it is not possible to tease out whether any of the missing sites within a cluster are directly attributable to the lack of trees. Terrain and political boundaries could be the limiting factor. Due in part to the planning associated with the site selection, it is the thought of the author that it is far more likely that the clusters with fewer than four sites represent transitional zones in the forest where no trees were found.
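The grouping of clusters by the number of sites they contain could be sketched as follows. The record layout (a cluster identifier paired with a flag for whether GL30 mapped the site as forest), the function name, and the sample records are all hypothetical; the sketch shows only the percent correct step, not the area or ROC-AUC tests.

```python
from collections import defaultdict


def accuracy_by_cluster_size(sites):
    """sites: iterable of (cluster_id, mapped_as_forest) pairs.
    Returns {number of sites in cluster: percent correct} across all
    sites that belong to clusters of that size."""
    clusters = defaultdict(list)
    for cluster_id, mapped_as_forest in sites:
        clusters[cluster_id].append(mapped_as_forest)

    totals = defaultdict(int)
    hits = defaultdict(int)
    for flags in clusters.values():
        size = len(flags)
        totals[size] += size
        hits[size] += sum(flags)
    return {size: 100.0 * hits[size] / totals[size] for size in sorted(totals)}


# Hypothetical records: two full clusters, one with two sites, one with one.
records = (
    [("A", True)] * 4
    + [("B", True), ("B", True), ("B", True), ("B", False)]
    + [("C", True), ("C", False)]
    + [("D", False)]
)
by_size = accuracy_by_cluster_size(records)
print(by_size)
```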
4.3.5 Semantic Difference in the Definition of Forest between the Two Data Sets
These data sets are produced by different processes and different organizations. It must be noted that some of the error in representation is likely due to semantic variation in what qualifies as forest between the two data sets. The exact spectral threshold by which GL30 defines forest is unknown. The classification process does provide some framework, and it is known that the forest type groups must have at least 30% ground cover. Sampling sites for the INFyS data are defined by the national level 5 by 5 km grid and the location of each grid point within an INEGI-defined forest type group. Beyond this, sites must have at least one tree. It is unclear how often a single tree in a 400 square meter area can match the 30% ground cover in a 900 square meter cell required to meet the GL30 definition of forest. To account for this, it may be possible to remove sites that have a cobertura (canopy cover) value that does not equate to 30% of the sampling site. This was avoided within this study because there is too much uncertainty in attempting to take an average value for a 400 square meter area and compare it to one or more 900 square meter cells. It may be possible to treat all four sites in the cluster as one, which gives a 1600 square meter area. Accepting that this semantic difference exists was deemed more appropriate than applying this method. Future evaluations may change this interpretation.
4.3.6 Importance of the Validation Techniques
The use of the INFyS data set provides many advantages over the validation methods that use existing remote sensing derived data sets. The primary advantage is that all sites were verified as forest by a person who physically visited the location. In addition, it is known where GL30 correctly identified sites and where it did not. The inclusion of specific forest types allows for the extension of these results to other areas with similar forest types. These specific qualities greatly increase the importance of the validation.
4.3.7 Ground Verified Sites versus Remotely Sensed Sites
The standard practice for validating a land cover data set is to either use an existing land cover data set or create validation sites through user interpretation of high-resolution remotely sensed data. Both of these methods offer advantages and disadvantages. The most important aspect of the chosen validation set is that it should be highly accurate to ensure the assessment of the land cover data set is accurate as well (Stehman and Czaplewski 1998). While the semantic and temporal variance between the GL30 and the INFyS data does incorporate an inherent level of error, the data represent a very accurate validation set.
CHAPTER V
CONCLUSION
The goal of this project was to evaluate how well GL30's single Forest class represented the various forest types of Mexico. Two geographically distributed and ground-verified forest survey data sets were used to validate the GL30. The producer's accuracy of the GL30 was reported for ten major forest type groups based on a simple intersect and on the percent area of forest between the GL30 and INFyS sites. How canopy coverage and number of trees per site affect GL30's representation of the forest was tested using the ROC and AUC methods, which give a unique insight into why GL30 may be misrepresenting certain areas. This study is exclusive to the country of Mexico, but due to the level of detail of the analysis, the results and conclusions can be applied to other geographic locations with similar forest classes.
The GlobeLand30 global land cover data set offers a dramatic improvement in the minimum mapping unit over other currently available global land cover data sets. This increase in resolution provides the potential for improved precision for regional and smaller scale land cover studies. Created at two time stamps, 2000 and 2010, the GL30 could also be useful for temporal studies such as land use change. Currently there have been no targeted evaluations of the representational accuracy of the GL30 in the subtropical to tropical regions of the world. This study is the first external review of this product specifically for a land mass south of 36 degrees north.
A detailed forest inventory survey produced by the Comisión Nacional Forestal is used to validate the representational accuracy of the GL30 Forest class. This ground-verified data contains 72,056 sampling sites collected between 2004 and 2013, split into two data sets, INFyS0 and INFyS1. Each sampling location contains information on the forest type, canopy coverage, and the number of trees, which allows for a very detailed assessment of the GL30. A group of ten specific forest combinations was chosen based on their overall representation of the diversity of Mexico's forest. All analysis was performed on these ten classes.
The first method used was the percent correct by intersect. An intersect was performed between the centroid of the sampling site and the GL30 data. From this, the percent correct value was determined for the ten forest classes. Due to the nature of the data set, this value is equivalent to the more commonly reported assessment value of producer's accuracy. The values found within this study were generally lower than those reported in previous assessments of the GL30 Forest class. Medium semi-deciduous and high evergreen jungles had a producer's accuracy of 90% or higher. These values were on par with or above the values reported in the other assessments of the GL30 Forest class. The temperate forest had a consistent producer's accuracy across the three primary groups, which was in the mid-seventies. This method provides a good base level evaluation but rests on two major assumptions: perfect positional accuracy between the two data sets, and the representation of each sampling site as a point rather than an area. The percent correct by area methodology was developed to account for those assumptions and provide a more accurate measurement of the producer's accuracy of the GL30.
The second methodology used was the percent correct by area. This was done to account for the two primary assumptions of the percent correct by intersect method. This method accounted for both the area of the sampling sites and the potential positional error associated with the Landsat imagery from which the GL30 was developed. Individual sampling site centroids were buffered to a circle with a radius of 61.28 m. This buffer was rasterized and a cross tabulation was performed to determine how much of each GL30 land cover class fell within the given area. A percent forest value was calculated and joined back to the forest inventory data. Three groups were defined based on the percent forest value: Forest = 100% forest, Not forest = 0% forest, and partially covered by forest = any value between 0% and 100% forest. The partially covered by forest sites are locations for which there is a degree of uncertainty as to whether the sites are truly represented as forest or not. As a result, they were excluded from further analysis. A percent correct was calculated based on the forested and not forested sites for all ten forest type groups. This resulted in an overall increase in the percent correct values of the data set. A significant increase in the percent correct values was seen within the INFyS0 data set. This is because just over 10,000 sites were identified as partially covered by forest within this data set. The 17% reduction in the number of sampling sites greatly altered the results of the percent correct calculations. By better accounting for the spatial character of the data, the values from this analysis are a more accurate representation of the producer's accuracy of the GL30.
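The grouping and exclusion step of the percent correct by area method can be sketched as follows. The sketch assumes the percent forest value for each buffered site has already been computed from the cross tabulation; the function names and the sample values are hypothetical.

```python
def classify_site(percent_forest):
    """Assign a buffered site to one of the three groups defined above."""
    if percent_forest == 100:
        return "forest"
    if percent_forest == 0:
        return "not forest"
    return "partially covered by forest"


def percent_correct_by_area(percent_forest_values):
    """Percent correct after excluding partially covered by forest sites."""
    labels = [classify_site(p) for p in percent_forest_values]
    forest = labels.count("forest")
    not_forest = labels.count("not forest")
    kept = forest + not_forest
    return 100.0 * forest / kept if kept else None


# Hypothetical percent forest values for seven buffered sites.
sample_values = [100, 100, 0, 45, 100, 80, 0]
pc_area = percent_correct_by_area(sample_values)
print(pc_area)
```

In this hypothetical sample, two of the seven sites are excluded as partially covered by forest, and three of the remaining five are counted as correct.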
The third and final assessment methodology used the continuous data of canopy cover and number of trees per site to test how well those values could be used to distinguish between correctly represented and misrepresented sites. The ROC curve is developed by moving through various threshold values (canopy coverage or number of trees) of the continuous data and plotting the proportion of correctly represented points (True Positive) against the proportion of incorrectly represented points (False Positive). From this plot, an area under the curve value can be calculated. The AUC can be viewed as the likelihood that a randomly selected correctly identified site will have a higher value than a randomly selected misidentified site. The AUC value is directly comparable across data sets. All AUC values within this study were over 0.6. This means that for all tested cases the number of trees and canopy coverage at a site can be used to predict whether GL30 will represent that location as forest or not. For tropical forest, the number of trees at a site was a more effective distinguisher between correctly represented and misrepresented sites. For the temperate forest, the canopy coverage value performed better.
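The threshold sweep described above can be sketched in a few lines. At each threshold, sites with a value at or above the threshold are treated as predicted forest, and the true positive and false positive proportions are recorded; the area is then taken with the trapezoid rule. The function names and the sample values are hypothetical, and the software used in this study may construct the curve differently.

```python
def roc_points(values, correct_flags):
    """Sweep thresholds from high to low and return (false positive
    proportion, true positive proportion) points, starting at (0, 0)."""
    n_pos = sum(correct_flags)
    n_neg = len(correct_flags) - n_pos
    pairs = list(zip(values, correct_flags))
    points = [(0.0, 0.0)]
    for threshold in sorted(set(values), reverse=True):
        tp = sum(1 for v, ok in pairs if ok and v >= threshold)
        fp = sum(1 for v, ok in pairs if not ok and v >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under the ROC points using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical canopy coverage values and GL30 outcomes for seven sites.
site_values = [90, 80, 70, 65, 65, 40, 30]
site_correct = [True, True, True, True, False, False, False]
curve = roc_points(site_values, site_correct)
area_under_curve = trapezoid_auc(curve)
print(round(area_under_curve, 3))
```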
Based on the results of this study, GL30 would be an effective land cover data set for forest studies within Mexico. Percent correct values from both the percent correct by intersect and percent correct by area methodologies were above 70%. The increase in percent correct when the area was accounted for implies that, like most land cover data sets, GL30 does not classify the transitional zones as well as the core areas. This applies specifically to regions with rapid ecological change, such as the temperate forest regions of north central Mexico in the states of Durango and Chihuahua. Evergreen tropical forests performed extremely well, with producer's accuracies consistently above 90%. More modest returns were found in the temperate forest classes. The ROC and AUC analysis shows that number of trees and canopy coverage of a site are good indicators of the likelihood of GL30 representing that location as forest.
5.1 Critical Evaluation of Research Project
While the number of sampling sites and the consistency of the INFyS data sets provide a solid framework for understanding the accuracy of the GL30's Forest class, this work is not without limitations. The primary limitation of this work is that, due to the nature of the INFyS data set, it is only known where forest is and nothing is known about where forest is not. This limits the evaluation potential to the Forest class of the GL30 and also eliminates this study's ability to use other evaluation metrics such as user's accuracy, overall accuracy, and similar derivatives. The producer's accuracy represents the only direct comparison that can be made between the results of this study and the results of existing evaluations of the GL30.
This study represents a focused analysis of the representational accuracy of the GL30 Forest class. The Forest class is one of ten GL30 land cover classes. Of these ten classes, only eight were represented in Mexico by the GL30; the tundra and permanent ice and snow classes were not included in the classification. Since this study only speaks to the Forest class, few if any conclusions can be drawn about the success of GL30's identification of the seven other land cover classes that it classified within Mexico.
The INFyS data sets are unique as validation sets because every site was verified as forest by a person who physically visited it. Yet it is likely that the definition of what was considered a forest differs between the INFyS data set and the GL30 data set. This is in part due to the collection methods of the two data sets: the INFyS data set is ground verified and the GL30 data set is derived from remotely sensed imagery. Add to this the temporal variance in collection time between the GL30 and INFyS data sets, and it must be understood that errors in this evaluation process are present and cannot be accounted for at this time.
Any study will contain limitations and this one is no different. Much time was spent attempting to account for these limitations and to address what effect they have on the overall interpretation of the accuracy with which GL30 has classified the forests of Mexico. Methods to deepen this assessment are suggested as future work.
Possibly the most intriguing of the proposed processes for future work is the assessment of leaf-on and leaf-off characteristics of the deciduous forest in relation to the image capture date. In general, the deciduous forest types were poorly represented compared to the evergreen and semi-deciduous forest types. This could be a product of the environment in which these forests are found, or it could be a limitation of GL30's ability to determine the appropriate time frame for image selection. The methods for this process could be applied to any forested area classified by GL30, because GL30 provides the image capture dates within the land cover data.
One significant source of error in the current methodology is the representation of tropical forest survey sites as circles rather than the directionally oriented rectangles that they are. To determine the orientation of a sampling site, information about the cluster that it is a part of must be referenced. The process would require advanced data analysis work that is currently beyond the capability of the primary researcher. The current assessment method only accounts for just over half of the actual area of the tropical forest sampling sites.
All the values in this report are generalized statements about the character of the GL30 Forest class as a whole. To expand upon the potential of the ROC analysis, it may be possible to develop a process that predicts whether GL30 would classify a given location as forest based on the specific values for canopy coverage or number of trees at a site. The usefulness of such an application is unknown, and so it was not pursued as part of this analysis. Given the right motivation it may provide a useful tool for further on-the-ground assessments of the GL30 Forest class. The groundwork for the assessment has been laid, and the future of this analysis depends in large part on the needs of the users involved with the project.
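One way such a predictive process might be sketched is shown here. This is an assumption about how the idea could be realized, not something carried out in this study: the cut-off is chosen by maximizing Youden's J statistic (true positive rate minus false positive rate), which is a common but swapped-in technique, and the function names and tree counts are hypothetical.

```python
def best_threshold(correct, missed):
    """Return the cut-off that maximizes TPR - FPR (Youden's J) and the J value."""
    best_t, best_j = None, -1.0
    for t in sorted(set(correct) | set(missed)):
        tpr = sum(v >= t for v in correct) / len(correct)
        fpr = sum(v >= t for v in missed) / len(missed)
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j


def predict_forest(value, threshold):
    """Predict that GL30 maps a site as forest when its value reaches the cut-off."""
    return value >= threshold


# Hypothetical numbers of trees per site; not taken from the INFyS data.
correct_sites = [12, 9, 15, 7, 11]  # sites GL30 mapped as forest
missed_sites = [3, 8, 5, 7]         # sites GL30 did not map as forest

threshold, j = best_threshold(correct_sites, missed_sites)
print(threshold, round(j, 2), predict_forest(10, threshold))
```

A cut-off chosen this way would need to be checked against sites that were not used to select it before being relied on in the field.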
5.2 Summary of Project
Going back to the introduction, the importance of any land cover data set is due in part to how well it represents the actual features on the ground. The most important aspect of this assessment is the inclusion of the specific forest type groups within the encompassing GL30 Forest class. This precision allows the representational accuracy of the GL30 Forest class to be interpreted at different spatial and ecological scales, because the selected forest type groups are generally tied to specific geographic locations. This specificity also allows the results to be interpreted at more regional scales of analysis. For example, a study looking at forest fragmentation could use this product with the understanding that the results of the analysis from the Yucatan Peninsula will be much more accurate than the results from the eastern slope of the Sierra Madre. Being able to adjust the interpretation of a study's results based on the forest class of interest adds assessment value to the GL30 as a whole.
The GL30 Forest class has very successfully represented the medium semi-deciduous jungles and the high evergreen jungles of Mexico. These tropical forest types are characterized by dense canopies and complex understories. These forest type groups are common throughout Central and South America, and it is likely the GL30 Forest class will represent those forests well.
The GL30 is the first comprehensive global land cover data set to map land cover at 30 m resolution. The increased resolving power of this data set makes it an important resource for regional scale analysis in land cover / land use change studies. This study represents the first independent assessment of the GL30 data set at latitudes south of 30 degrees north and is an important step in determining how well GL30 represents subtropical and tropical environments. The unique and detailed assessment of the GL30 Forest class provides an evaluation of multiple forest types. The results indicate a producer's accuracy ranging from 72.3% to 97.3% depending on the forest type group. The ROC and AUC analyses illustrated how sensitive the representational accuracy of the GL30 Forest class is to the measures of canopy coverage and the number of trees at a location. As a whole, tropical forests were better represented than temperate forests. Deciduous forest types were the most poorly represented classes. Compared to existing validations of the GL30 Forest class, these results are generally lower than those found in 3 of the 4 validation studies. Overall, the GL30 did capture the diversity of the forests of Mexico well, and the increased spatial resolution it offers makes it an appealing data set for future forest cover change studies in Mexico.

REFERENCES
Arino, O., et al. (2007). GlobCover: ESA service for global land cover from MERIS. Geoscience and Remote Sensing Symposium, 2007. IGARSS 2007. IEEE International, IEEE.
Arsanjani, J. J., et al. (2016). "GlobeLand30 as an alternative fine-scale global land cover map: challenges, possibilities, and implications for developing countries." Habitat International 55: 25-31.
Brovelli, M. A., et al. (2015). "The first comprehensive accuracy assessment of GlobeLand30 at a national level: Methodology and results." Remote Sensing 7(4): 4191-4212.
Chen, J., et al. (2015). "Global land cover mapping at 30m resolution: A POK-based operational approach." ISPRS Journal of Photogrammetry and Remote Sensing 103: 7-27.
Clay, E., et al. (2016). "National Assessment of the Fragmentation Levels and Fragmentation-Class Transitions of the Forests in Mexico for 2002, 2008 and 2013." Forests 7(3): 48.
Comber, A., et al. (2004). "Integrating land-cover data with different ontologies: identifying change from inconsistency." International Journal of Geographical Information Science 18(7): 691-708.
Costa, H., et al. (2014). "Combining per-pixel and object-based classifications for mapping land cover over large areas." International Journal of Remote Sensing 35(2): 738-753.
Fischer, J. and D. B. Lindenmayer (2007). "Landscape modification and habitat fragmentation: a synthesis." Global Ecology and Biogeography 16(3): 265-280.
Foley, J. A., et al. (2005). "Global consequences of land use." Science 309(5734): 570-574.
Foody, G. M. (2002). "Status of land cover classification accuracy assessment." Remote Sensing of Environment 80(1): 185-201.
Frazier, P. S. and K. J. Page (2000). "Water body detection and delineation with Landsat TM data." Photogrammetric Engineering and Remote Sensing 66(12): 1461-1468.
Friedl, M. A., et al. (2010). "MODIS Collection 5 global land cover: Algorithm refinements and characterization of new data sets." Remote Sensing of Environment 114(1): 168-182.
Fritz, S., et al. (2012). "Geo-Wiki: An online platform for improving global land cover." Environmental Modelling & Software 31: 110-123.
Giri, C., et al. (2013). "Next generation of global land cover characterization, mapping, and monitoring." International Journal of Applied Earth Observation and Geoinformation 25: 30-37.
Gong, P., et al. (2013). "Finer resolution observation and monitoring of global land cover: First mapping results with Landsat TM and ETM+ data." International Journal of Remote Sensing 34(7): 2607-2654.
Groombridge, B. and M. Jenkins (2002). World atlas of biodiversity: earth's living resources in the 21st century, Univ of California Press.
Hansen, M., et al. (2000). "Global land cover classification at 1 km spatial resolution using a classification tree approach." International Journal of Remote Sensing 21(6-7): 1331-1364.
Hansen, M. C., et al. (2013). "High-resolution global maps of 21st-century forest cover change." Science 342(6160): 850-853.
Instituto Nacional de Geografía e Informática. Guía Para la Interpretación de la Cartografía Uso del Suelo y Vegetación, Escala 1:250,000 Serie III; INEGI: Aguascalientes, Mexico, 2009; p. 77. Available online: http://www.inegi.org.mx/prod_serv/contenidos/espanol/bvinegi/productos/geografia/publicaciones/guias-carto/sueloyveg/l_250_lll/Suelo_Vegeta.pdf (accessed on 1 August 2015).
Instituto Nacional de Geografía e Informática. Guía Para la Interpretación de la Cartografía Uso del Suelo y Vegetación, Escala 1:250,000 Serie IV; INEGI: Aguascalientes, Mexico, 2012; p. 132. Available online: http://www.inegi.org.mx/prod_serv/contenidos/espanol/bvinegi/productos/geografia/publicaciones/guias-carto/sueloyveg/l_250_IV/l_250_IV.pdf (accessed on 1 August 2015).
Los, S. O., et al. (1994). "A global 1° by 1° NDVI data set for climate studies derived from the GIMMS continental NDVI data." International Journal of Remote Sensing 15(17): 3493-3518.
Loveland, T. R., et al. (2000). "Development of a global land cover characteristics database and IGBP DISCover from 1 km AVHRR data." International Journal of Remote Sensing 21(6-7): 1303-1330.
Matthews, E. (1983). "Global vegetation and land use: New high-resolution data bases for climate studies." Journal of Climate and Applied Meteorology 22(3): 474-487.
Moreno-Sanchez, R., et al. (2012). "National assessment of the fragmentation, accessibility and anthropogenic pressure on the forests in Mexico." Journal of Forestry Research 23(4): 529.
Myint, S. W., et al. (2011). "Per-pixel vs. object-based classification of urban land cover extraction using high spatial resolution imagery." Remote Sensing of Environment 115(5): 1145-1161.
Ran, Y. and X. Li (2015). "First comprehensive fine-resolution global land cover map in the world from China: Comments on global land cover map at 30-m resolution." Science China Earth Sciences 58(9): 1677-1678.
Roth, D., et al. (2016). "Estimation of human induced disturbance of the environment associated with 2002, 2008 and 2013 land use/cover patterns in Mexico." Applied Geography 66: 22-34.
Smith, G. (2013). "Hybrid pixel-and object-based approach to habitat condition monitoring."
Sun, B., et al. (2016). "Uncertainty Assessment of GLOBELAND30 Land Cover Data Set Over Central Asia." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences: 1313-1317.
Tong, X. and Z. Wang (2012). "Fuzzy acceptance sampling plans for inspection of geospatial data with ambiguity in quality characteristics." Computers & Geosciences 48: 256-266.
Tong, X., et al. (2011). "Designing a two-rank acceptance sampling plan for quality inspection of geospatial data products." Computers & Geosciences 37(10): 1570-1583.
Townshend, J. G. (1992). "Land cover." International Journal of Remote Sensing 13(6-7): 1319-1328.
Townshend, J. R., et al. (2012). "Global characterization and monitoring of forest cover using Landsat data: opportunities and challenges." International Journal of Digital Earth 5(5): 373-397.
Verburg, P. H., et al. (2011). "Challenges in using land use and land cover data for global change studies." Global Change Biology 17(2): 974-989.
Yu, L., et al. (2013). "FROM-GC: 30 m global cropland extent derived through multisource data integration." International Journal of Digital Earth 6(6): 521-533.
Yu, L., et al. (2013). "Improving 30 m global land-cover map FROM-GLC with time series MODIS and auxiliary data sets: a segmentation-based approach." International Journal of Remote Sensing 34(16): 5851-5867.

APPENDIX A
Appendix Figures
This appendix contains detailed information on the structure of the original and created data that are part of this work. Frequency diagrams display the number of sampling sites associated with a given forest type. Forest type groups are defined in this section as well.
[Figure: sample rows of the INFyS0 data. Columns: ID_CONGLOMERADO, ID_SITIO, TIPO_VEGETACION, #ARBOLES, ARBOLES_DANADOS, AREA_BASAL, COBERTURA_ARBOREA, VOLUMEN, GRADOSLONG, MINUTOSLONG, SEGUNDOSLONG, GRADOSLAT, MINUTOSLAT, SEGUNDOSLAT]
Example of the INFyS0 data, originally in CSV format.

FID  ID_CONGLOM  ID_SITIO  TIPO_VEGET      F_ARBOLES  COBERTURA_   Longitude    Latitude
0    154         44,302    Bosque de pino  2          0.00019635   116.0238333  32.45275
1    154         44,303    Bosque de pino  4          0.003378005  116.0238056  32.45316667
2    154         44,305    Bosque de pino  2          0.000106814  116.0243056  32.45263889
3    155         120,879   Bosque de pino  5          0.000607114  115.9712778  32.45763889
Example of the values from the original data that were stored in the shapefile for the INFyS0 data set.
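The original CSV stores coordinates as separate degree, minute, and second columns, while the shapefile stores decimal Longitude and Latitude values. A minimal sketch of the usual conversion is given here, assuming the sign is carried on the degrees column; the function name and the sample inputs are illustrative and are not a record of the processing steps actually used.

```python
def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees, minutes, seconds to decimal degrees.

    Assumes the hemisphere sign is carried on the degrees value, so a
    coordinate between 0 and -1 degrees cannot be expressed this way.
    """
    sign = -1.0 if degrees < 0 else 1.0
    return sign * (abs(degrees) + minutes / 60.0 + seconds / 3600.0)


# Illustrative inputs in the style of the GRADOS/MINUTOS/SEGUNDOS columns.
lon = dms_to_decimal(-116, 1, 25.8)
lat = dms_to_decimal(32, 27, 9.9)
print(round(lon, 7), round(lat, 5))
```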

Full Text

ASSESSMENT OF THE REPRESENTATIONAL ACCURACY OF GLOBELAND30 CLASSIFICATION OF THE TEMPERATE AND TROPICAL FOREST OF MEXICO
by
DANIEL PETER CARVER
B.S., Adams State College, 2012
B.A., Adams State College, 2012
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Arts
Applied Geography and Geospatial Science
2017

2017
DANIEL PETER CARVER
ALL RIGHTS RESERVED

This thesis for the Master of Arts degree by Daniel Peter Carver has been approved for the Applied Geography and Geospatial Science Program by
Rafael Moreno Sanchez, Chair
Peter Anthamatten
Galen Maclaurin
Juan Manuel Torres Rojo
Date: May 13, 2017

Carver, Daniel Peter (M.A., Applied Geography and Geospatial Science Program)
Assessment of the Representational Accuracy of Globeland30 Classification of the Temperate and Tropical Forest of Mexico
Thesis directed by Associate Professor Rafael Moreno Sanchez

ABSTRACT
This study performed an assessment of the representational accuracy of the forest class of the GlobeLand30 (GL30) global land cover data sets for the country of Mexico using a robust, geographically distributed forest inventory survey of the forests in Mexico. The representational accuracy assessment was carried out for both the 2000 and 2010 GL30 data sets. The detailed attribute data associated with the validation set demonstrate how GL30 classifies specific forest types and how canopy coverage and number of trees per site influence the likelihood of GL30 identifying the sites correctly as forests. The results indicate that producer's accuracies range from 72.3% to 97.3%. The tropical forests (89.1%) were better represented by the GL30 forest class than the temperate forests (73.9%). The most poorly represented classes from the temperate (oak: 72.3%) and tropical (low dry deciduous jungle: 74.9%) groups were deciduous. Receiver Operator Curve and Area Under the Curve analyses show that canopy coverage of a site is a better predictor of GL30 correctly identifying the site as forest for temperate forests, and that the number of trees per site is a better predictor of GL30 correctly identifying a site as forest for tropical forests. The results also indicate a distinct spatial variability in the location of the sample sites that are misidentified as forests by GL30. The results of this thesis will help researchers and professionals better understand the representational accuracy of the GL30 data sets for the forests in Mexico.

The form and content of this abstract are approved. I recommend its publication.
Approved: Rafael Moreno Sanchez

I dedicate this work to those who, knowing only of me and nothing of the project, expressed sincere confidence that I would be successful in this endeavor.
Kailee Ashley Courtney Liz Dan Kathy

ACKNOWLEDGEMENTS
As with any geospatial project, the quality of the product is directly related to the quality of the data. I want to personally thank Lily Niknami for her efforts in acquiring the GlobeLand30 data during a time when the public website in the U.S. was not functioning, and Juan Manuel Torres Rojo for obtaining, sharing, and interpreting the Inventario Nacional Forestal y de Suelos data set, without which this analysis would not be possible. I also thank Galen Maclaurin for his positive encouragement and valuable input in suggesting the ROC and AUC methodologies when I felt I had hit a dead end, Peter Anthamatten for a very detailed and impactful review of the written work, and Rafael Moreno for suggesting and guiding this research, many hours of brainstorming, support and positive motivation, and quick and insightful feedback. Finally, I thank Kailee Potter for an invaluable series of edits and her continuous support through and through.

TABLE OF CONTENTS
I. INTRODUCTION
1.1 Literature Review
1.1.1 Role and development of global land cover data sets
1.1.2 Development of GlobeLand30
1.1.3 Previous studies assessing the representational accuracy of GL30
1.2 Challenges in performing and conveying accuracy assessments
II. METHODOLOGY
2.1 Data sets
2.1.1 National Forest Inventory Data Set. Comisión Nacional Forestal (Inventario Nacional Forestal y de Suelos)
2.1.2 GlobeLand30 from National Geomatics Center of China
2.2 Analytical Processes
2.2.1 Forest Type Groups
2.2.2 Rationale for Accuracy Assessment Methods
2.3 Assessment Method 1: Percent Correct by Intersect
2.4 Assessment Method 2: Accounting for Sampling Sites Area and Positional Error in Percent Correct Calculations made in Method 1
2.5 Assessment Method 3: Predictive quality of the INFyS site attributes Canopy Cover and Number of Trees
2.2.5.1 Mann-Whitney U Test
2.2.5.2 Receiver Operator Curves (ROC) and Area Under the Curve (AUC)
III. RESULTS
3.1 Assessment Method 1: Percent Correct by Intersect
3.1.1 Percent Correct of All Sites
3.1.2 Temperate Forests
3.1.3 Tropical Forests
3.1.4 Primary Temperate Forest Classes: Pino, Encino, and Pino-Encino Mix
3.1.5 Primary Tropical Forest Classes: Selva Baja, Mediana, and Alta
3.1.6 Erosion presence
3.2 Assessment Method 2: Accounting for Sampling Sites Area and Positional Error in Percent Correct Calculations made in Method 1
3.2.1 Percent Correct of All Sites
3.2.2 Temperate Forests
3.2.3 Tropical Forests
3.2.4 Primary Temperate Forest Classes: Pino, Encino, and Pino-Encino Mix
3.2.5 Primary Tropical Forest Classes: Selva Baja, Mediana, and Alta
3.2.6 Erosion presence
3.3 Assessment Method 3: Predictive quality of the INFyS site attributes Canopy Cover and Number of Trees
3.3.1 Mann-Whitney U Test
3.3.2 Results of the INFyS0 data set
3.3.2.1 Canopy coverage
3.3.2.1 Number of trees
3.3.3 Results of the INFyS1 Data Set
3.3.3.1 Canopy coverage
3.3.3.2 Number of trees
3.4.4 Receiver Operator Curves (ROC) and Area Under the Curve (AUC)
3.4.5 INFyS0
3.4.5.1 Canopy coverage
3.4.5.2 Number of trees
3.4.6 INFyS1
3.4.6.1 Canopy coverage
3.4.6.2 Number of trees
3.5 Comparison of AUC values between the INFyS0 and INFyS1 data sets
3.5.1 Canopy coverage
3.5.2 Number of Trees
IV. DISCUSSION
4.1 Summary of Findings
4.1.1 Percent Correct by Intersect
4.1.2 Percent Correct by Area
4.1.3 Receiver Operator Curve (ROC) and Area Under the Curve (AUC)
4.2 Comparison of Producer's Accuracies with Existing Studies
4.3 Evaluation of the Study
4.3.1 Limitations of the Areal Representation of the Tropical Forest Class
4.3.3 Image Capture Date and Leaf On/Leaf Off Character of Deciduous Forest
4.3.4 Variation within the Cluster Groups
4.3.5 Semantic Difference in the Definition of Forest between the Two Data Sets
4.3.6 Importance of the Validation Techniques
4.3.7 Ground verified sites versus remotely sensed sites
V. CONCLUSION
5.1 Critical evaluation of research project
5.2 Summary of Project
REFERENCES
APPENDIX
A. APPENDIX A
B. APPENDIX B

TABLES
Table 3.1: Percent Correct Values for the INFYS0 and INFYS
Table 3.2: Ranked p-values from Mann-Whitney; INFYS0
Table 3.3: Ranked p-values from Mann-Whitney; INFYS1
Table 3.4: AUC Values from INFYS0 and INFYS1

FIGURES
Figure 2.1: INFYS Sampling Site Distribution Maps
Figure 2.4: Attribute Data of INFYS Shapefiles
Figure 2.2: Cluster Distribution
Figure 2.3: Site Sampling Diagrams
Figure 2.5: Map of GL30_2000 for Mexico
Figure 2.6: Workflow of GL30 Classification Process
Figure 2.7: Example of Priori Probability Process
Figure 2.8: Data Provided by GL30
Figure 2.9: Image Capture Dates from 1999
Figure 2.10: Generalized Location of Temperate Forests
Figure 2.11: Generalized Locations of Tropical Forest
Figure 2.12: Intersect Relationship Between INFYS and GL30
Figure 2.13: Percent Correct by Area Process
Figure 2.14: Examples of Percent Correct by Area Values
Figure 2.15: Distributions of Number of Tree and Canopy Coverage
Figure 2.16: Example of ROC Curve
Figure 3.1: Percent Correct by Intersect Values
Figure 3.2: Percent Correct by Area: INFYS0 and INFYS1
Figure 3.3: AUC Values for INFYS0
Figure 3.4: AUC Values for INFYS1
Figure 3.5: AUC Values for Canopy Coverage
Figure 3.6: AUC Values for Number of Trees
Figure 4.1: Change in Percent Correct Between Intersect and Area
Figure 4.2: Relative Density of Partially Covered by Forest Sites from the GL30_2000 Data Set
Figure 4.3: Tropical Sampling Sites Error in Representation
Figure 4.4: Image Capture Date for the INFYS0 Data Set

ABBREVIATIONS
GL30: GlobeLand30
GL30_2000: GlobeLand30 2000 data set
GL30_2010: GlobeLand30 2010 data set
INFyS: Inventario Nacional Forestal y de Suelos data sets
INFyS0: Inventario Nacional Forestal y de Suelos 2004-2009 data set
INFyS1: Inventario Nacional Forestal y de Suelos 2009-2013 data set

PAGE 16

1 CHAPTER I INTRODUCTION Land cover/use data is an essential tool for ecological monitoring and economic planning at the national level ( Townshend 1992) For forested regions, forest cover change can be used to assess the trends in land use and its impacts on the envir onment and forest conservation ( Moreno Sanchez, Torres Rojo et al. 2012; Roth, Moreno Sanchez et al. 2016 ) Alterations to the physical area and connectivity of the forest cover can directly affect the forests ecological diversity and capacity t o perform ecological functions ( Foley, DeFries et al. 2005; Fischer and Lin denmayer 2007 ) Mapping the process of forest fragmentation in Mexico using existing data sets has been completed in previous studies ( Moreno Sanchez, Torres Rojo et al. 2012 ; Clay, Moreno Sanchez et al. 2016 ) Clay ( 2015) suggests that finer resolution imagery may provide an improved means of mapping the interior changes occurring within forest areas. Changes within the interior of the forest areas were shown to be occurring at higher rates than in other regions of the for est. The GlobeLand30 (GL30) data sets produced by the National Geomatics C enter of China (NGCC) provides global land cover data comprising ten classes at 30m spatial resolution for 2000, and 2010( Chen et al. 2015 ) Previous validations of the GL30 data set at the regional scale have produce overall accuracy values of 8 0% for sites in Western Europe ( Brovelli, Molinari et al. 2015 Arsanjani, Tayyebi et al. 2016 ) and 46% over four countries in Central Asia ( Sun, Chen et al. 2016 ) A robust National Forest Inventory data set produced by the National Forestry Commission of Me xico ( Comisin Nacional Forestal, 2016 ) was used as


the primary validation resource for this study. This data set contains thousands of ground sample sites collected over two time periods, each corresponding to a pass of the national forest inventory: 2004-2009 and 2009-2013. The availability of these data sets offers an opportunity to assess the representational accuracy of the GL30 Forest land cover class over a region of the world with a very high diversity of forest types and ecological conditions.

The National Institute of Statistics and Geography of Mexico (Instituto Nacional de Estadística y Geografía, INEGI) (INEGI 2014) has produced land cover/use digital cartography at a scale of 1:250,000 for 2003 (Series III), 2008 (Series IV), and 2013 (Series V). These data sets classify the forest cover into twelve types of temperate forest and twelve types of tropical forest (Clay 2015). In contrast, the GL30 data sets contain one forest class.

Mexico is an ideal country for the assessment of the representational accuracy of the GL30 Forest class. The environmental and ecological conditions in the Mexican forest areas are similar to conditions found in many other temperate, tropical, and transition areas around the world. The country has complex ecological, cultural, and economic systems that define the forested landscape. Considered megadiverse, Mexico is one of the top five most biologically rich countries (Groombridge and Jenkins 2002). Roughly a third of the country is covered by forest (Food and Agriculture Organization (FAO) 2010). Temperate and tropical forests exist within a complex physical terrain, and many unique forest types have developed due to the presence of microclimates. The complexity of these forests provides a challenging test case for the assessment of the representational accuracy of the GL30 Forest land cover class.


The goal of this study is to assess how well GL30 represents the specific forest types of Mexico and how canopy coverage and the number of trees per site affect GL30's representation of the various forest types. In doing so, it will be determined whether GL30 is an effective resource for forest and land cover change studies in Mexico and other ecologically similar environments.

1.1 Literature Review

Global land cover data sets play important roles in modelling global and regional scale processes. Due to the cost and time associated with land cover mapping, global land cover data can often be the best source of national level land cover data for a country. Representational accuracy is a measurement of how well the features defined by the classification match what is actually present on the ground. The confidence researchers have in using the data for quantitative inquiries is tied to its representational accuracy. Common methods to test the representational accuracy of a land cover data set include comparison against existing validated land cover data sets, the use of user validated high resolution imagery, and the use of user verified ground truth data. Within this study, the representational accuracy of the GlobeLand30 data set will be validated against a robust ground truth forest inventory data set. This will allow for the interpretation of GL30's representation of various forest types across Mexico.

1.1.1 Role and development of global land cover data sets

Global land cover data sets are important resources in assessing and evaluating anthropogenic, climatic, and ecological processes at the global scale. These data sets have


been used since the 1980s for climate modeling and carbon cycling research (Matthews 1983). As technology and processing methods developed, the resolution improved. Initial attempts mapped features at the one degree or 0.5 degree scale (Los, Justice et al. 1994), and multiple global land cover data sets were produced at the 1 km scale (Hansen, DeFries et al. 2000; Loveland, Reed et al. 2000). Herold et al. (2008) performed a comprehensive analysis of the representational accuracy of four of these data sets. Accuracy varied from 30% to 90% depending on the class. More importantly, variance in classification across the four data sets was prevalent in zones of ecological transition, implying that specific areas represent a significant challenge for remote sensing based land cover mapping.

Finer resolution products became available at 500 m and 300 m (Arino, Gross et al. 2007; Friedl, Sulla Menashe et al. 2010). While many earth systems can be monitored and assessed at this scale, the minimum mapping unit imposed limits on the application of these data sets to more regional analyses (Fritz, McCallum et al. 2012; Gong, Wang et al. 2013). In 2013, the United States Geological Survey opened the Landsat legacy data to the public, and this has become a driving force for the development of 30 m global land cover data sets (Giri, Pengra et al. 2013). This development has resulted in the creation of 30 m spatial resolution forest cover products (Townshend, Masek et al. 2012; Hansen, Potapov et al. 2013) and ultimately served as a driving factor in the development of the Finer Resolution Observation and Monitoring of Global Land Cover project, which led to the creation of the GlobeLand30 (GL30) global land cover data set (Ran and Li 2015). These finer resolution data sets enable regional scale analyses that can be performed consistently throughout the world.


1.1.2 Development of GlobeLand30

To understand the representational accuracy of the GlobeLand30 data set, it is important to understand the processes on which the data set was built. The first 30 meter resolution global land cover data set was called Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC) (Gong, Wang et al. 2013). This product contained two levels of classification, with ten classes in level one and 29 in level two. Land cover was identified using an automated classification algorithm applied to Landsat TM/ETM+ imagery. Accuracy improvements were made by developing the product FROM-GLC-seg, which incorporated time series of MODIS 250 m EVI, digital elevation models, and soil water information (Yu, Wang et al. 2013). FROM-GLC-agg was developed by upscaling and aggregating FROM-GLC and FROM-GLC-seg with Nighttime Light Impervious Surface Area and MODIS urban extent data (Yu, Wang et al. 2013). The 30 m product produced by this method was assessed to have a 69.50% overall accuracy, which is comparable to the overall accuracy reported for the 2000 1 km global land cover product produced by the University of Maryland (Hansen, DeFries et al. 2000). All of these products are available for download at http://data.ess.tsinghua.edu.cn/

The second 30 meter resolution global land cover data set was GlobeLand30, a fine scale comprehensive global land cover data set developed by the National Geomatics Center of China (NGCC). The data set was produced for two time stamps: GlobeLand30 2000 and GlobeLand30 2010 (Ran and Li 2015). Over 10,000 remotely sensed images at 30 m spatial resolution were used. Ten classes were defined using an integrated Pixel and Object based method with Knowledge (POK based) (Chen et al. 2015). The POK based method integrates


well known spectral classification methods with an object oriented approach that better illustrates the extent of larger land cover features (Myint, Gober et al. 2011; Costa, Carrão et al. 2014). Check steps within the classification process included verification against ancillary data sets and derived parameters for specific land cover classes (Comber, Fisher et al. 2004; Verburg, Neumann et al. 2011). A hierarchical classification technique was used throughout the classification process. For example, water was classified first and then masked out of subsequent classifications. This was followed by wetlands, permanent snow/ice, artificial surfaces, and cultivated land. Forest, shrubland, grassland, and bare land were all mapped at the same time, and tundra was mapped last (Chen, Chen et al. 2015). This method was found to limit the amount of spectral confusion between the classes (Frazier and Page 2000; Smith 2013). The classification process was performed on five degree by six degree map sheets. Inconsistencies between neighboring map sheets were manually checked, and a total of 847 map sheets comprise the final product.

1.1.3 Previous studies assessing the representational accuracy of GL30

The accuracy assessment of the GL30 2010 data set was reported by Chen et al. (2015) and was performed by a third party. A two rank sampling strategy was used: 80 of the 847 map sheets were sampled (rank 1), and pixels within each sampled map sheet were chosen in proportion to the total area of each class held within the map sheet (rank 2) (Tong, Wang et al. 2011; Tong and Wang 2012). This resulted in 159,874 pixel samples, of which 154,586 pixels were definitively identifiable and used. Trained users identified the actual land cover using high resolution imagery, largely from the Google Earth platform. As a result of the temporal component of this validation method, only the 2010 data set was validated.
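The rank 2 step, in which pixel samples are drawn in proportion to class area within a map sheet, can be illustrated with a minimal sketch. This is not the procedure used by the GL30 validation team; the function name, the largest-remainder rounding, and the class areas below are all assumptions made for illustration.

```python
def allocate_samples(class_areas, total_samples):
    """Split a sample budget across classes in proportion to mapped area.

    class_areas: dict mapping class name -> area (any consistent unit).
    Returns a dict mapping class name -> number of pixel samples.
    """
    total_area = sum(class_areas.values())
    if total_area <= 0:
        raise ValueError("total class area must be positive")

    # Exact, possibly fractional, share of the budget for each class.
    shares = {name: total_samples * area / total_area
              for name, area in class_areas.items()}

    # Round every share down, then give the samples that are left over
    # to the classes with the largest fractional remainders (an assumed
    # rounding rule, chosen so the counts always add up to the budget).
    allocation = {name: int(share) for name, share in shares.items()}
    leftover = total_samples - sum(allocation.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - allocation[n],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical class areas (km^2) for one map sheet; not values from GL30.
sheet_areas = {"forest": 600.0, "grassland": 300.0, "water": 100.0}
sheet_allocation = allocate_samples(sheet_areas, total_samples=50)
```

With these invented areas, the class covering 60% of the sheet receives 60% of the samples, which is the property the proportional design is meant to guarantee.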


The user accuracy for the land cover categories ranged between 72.16% (grasslands) and 86.76% (artificial surfaces). The overall accuracy of the data set was determined by summing the user accuracies weighted by the relative share of land cover for each class. The resulting value was 80.33%.

There are presently only three other publicly available accuracy assessments of the GL30 data. Two of these assessments took place in European countries (Brovelli, Molinari et al. 2015; Arsanjani, Tayyebi et al. 2016) and the third examined multiple countries in Central Asia (Sun, Chen et al. 2016). While many of the processes and methods are similar across the works, each study is assessed in detail below.

Brovelli et al. (2015) is the first published work assessing all land cover classes in GL30 for a single country. They performed a cell by cell comparison between GL30 and existing land cover data sets. The CORINE data set has coverage for the whole country, and eight of the twenty regions of the country had data sets available with a minimum mapping unit finer than 30 m. Results were compiled within a confusion matrix and the following statistics were derived: overall accuracy, allocation and quantity disagreements, and user and producer accuracy. The regional data sets did not match in resolution or thematic characterization, so all data sets were reclassified to five thematic categories (artificial surfaces, agricultural areas, forest and semi natural areas, wetlands, and waterbodies). Raster sets were produced at both 5 and 30 meter cell sizes to match the original cell sizes of the regional and GL30 data sets, and a comparison was performed to see if this difference in cell size changed the outcome of the analysis. A separate comparison was performed due to the similarity of many of the thematic classes between CORINE and


GL30. In this second comparison scheme, the category of forest and semi natural areas was broken down to include forest, grass and shrubs, open spaces with little to no vegetation, and glaciers and permanent snow. The influence of co-location tolerance was addressed by creating a 70 m buffer along the border of cells with different classifications. All cells falling within this buffer were excluded from the accuracy assessment. The application and results of these methods were explained in detail using a single region as a case study, and the results of the other seven regions were displayed graphically.

The concluding statements of Brovelli et al. (2015) emphasize the influence of the number of categories on the overall accuracy of the classification. Using the five categories, overall accuracy ranged between 81 and 92 percent. Breaking up the forest and semi natural areas resulted in a total of nine classes rather than five, and a reduction in the overall accuracy to a range of 62 to 81 percent. The co-location tolerance was shown to have a substantial effect on the overall accuracy, with increases from 84 to 96 percent and from 65 to 86 percent. This implies that many of the classification errors occur at borders between differing land cover types. Lastly, a note is made that much of the error could be a result of differences in the dates of the images used to create the data sets, as well as errors present in the validation data sets themselves.

Arsanjani et al. (2016) employed a method similar to that of Brovelli et al. (2015) in the assessment of GL30 in Germany. Four distinct data sets were used to evaluate GL30's representational accuracy. The data sets differed in number of thematic classes, minimum mapping unit, temporal coverage, spatial coverage, and positional accuracy. Three of the four validation sets (CORINE, Urban Atlas, and ATKIS) represent professionally derived


data sets and Open Street Map was used as the fourth data set. The CORINE data set was the only validation data set that had full spatial coverage of Germany. All data sets were resampled to match the 30 m spatial resolution of GL30 and reclassified to match the CORINE level 1 nomenclature, which includes artificial surfaces, agricultural areas, forest and semi natural areas, wetlands, and waterbodies. Confusion matrices were created and overall agreement, user and producer accuracy, and Kappa coefficients were calculated. Overall accuracy ranged between 74% for Open Street Map and 92% for CORINE. The ATKIS and Urban Atlas data sets produced overall agreements of 85%. Given that these two data sets have the finest spatial resolution available, it is promising that such good agreement was reached. Considerable disagreement was seen within the wetlands class, where the highest user accuracy value recorded was 27%. Waterbodies were not classified well compared to the other three classes. Due to the small area of these two land cover classes, the effect on the overall accuracy is minimal. While the use of community derived data via Open Street Map offers a continuously updated resource, much difficulty was found in using the data. The biggest challenge of working with Open Street Map was the inconsistent feature nomenclature. Overall, the researchers concluded that GL30 would be an effective and useful land cover data set for Germany.

The work produced by Sun et al. (2016) most closely matches the validation methodology employed by the GL30 team in Chen et al. (2015). This work represents the first evaluation done on a regional scale. The arid to semi arid region was defined by the boundaries of five countries: Kazakhstan, Tajikistan, Turkmenistan, Uzbekistan, and Kyrgyzstan. The GL30 data set was reclassified to seven classes to help reduce errors associated


with the visual interpretation of validation set images. In the new classification, shrubland is included in the forest class and wetlands are included in the waterbody class. A two tiered sampling method was employed. Tier one consisted of points randomly selected based on land cover class. Tier two represents a high density sampling method in areas with complex land cover features. The resulting 27,000 points were independently validated against high resolution imagery by three testers using a purpose-built software package. Approximately 25,000 points showed agreement between two or more testers. These pixels were used to develop a confusion matrix. User and producer accuracy, overall accuracy, and kappa coefficients were calculated based on the matrix. This process was only performed for the 2010 data set due to the temporal limitations of the high resolution imagery.

The overall accuracy found by Sun et al.'s study was 46%. Producer and user accuracies varied greatly between classes and within the same class. For instance, cultivated land had the highest producer's accuracy at 92% but had the third lowest user's accuracy at 48%. Grassland was the most mapped feature in GL30 but had a user's accuracy of only 22%. The total area of each class was compared against two other commonly used data sets, and agreement between all three data sets was found to exist only within the cultivated land class. The overall accuracy of 46% raises questions about the validity of using GL30 within Central Asia. The primary source of error was the difficulty of distinguishing between bare land and grassland.

1.2 Challenges in performing and conveying accuracy assessments

It is important to note the two very distinct validation methods that have been employed in existing publications to perform this type of analysis. One method relies on


random sampling and the manual interpretation of higher resolution imagery. The second relies on the comparison of the new data set to existing validated data sets. There are problems with both methods (Foody 2002). The use of image interpreters introduces human error, and the use of existing data sets incorporates all the existing errors associated with the validation data set. The validation data set used within this study is closest in character to the use of human validated image interpretation.

The use of such extensive and robust ground verified data sets is in many ways an improvement over the two previously defined methods, but it also has some limitations. For one, the data set is only available for the forest class. This means we cannot speak to the accuracy of the other nine land cover classes. Importantly, this inherently limits our interpretation of the representational accuracy to the producer's accuracy; we cannot assess the user's accuracy. This means that we can evaluate where GL30 misrepresented locations that we know are truly forest (omission). We cannot say where GL30 defined a location as forest when it is not truly forest (commission).

The evaluation of the GL30 representational accuracy of Mexico's forests is an important step in understanding the applicability of this data set to studies of forests around the world. Due to its recent release, limited work has been done to test the accuracy of this data set across a variety of environments. The unique nature of the validation data set requires particular methodologies to ensure that this work matches the expectations of the field. This work will contribute to the understanding of how GL30 could be employed to answer regional and global questions.
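The distinction between omission and commission can be made concrete with a small sketch. Because every reference site is known to be forest, the only statistic that can be computed is the share of those sites that the map also labels as forest (the producer's accuracy) and its complement (the omission error). The function name and the example labels below are hypothetical and are not results from this study.

```python
from collections import Counter


def producers_accuracy(mapped_labels, reference_class="forest"):
    """Producer's accuracy and omission error for one reference class.

    mapped_labels: the map class found at each ground site, where every
    site is known from the field survey to belong to reference_class.
    """
    counts = Counter(mapped_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reference sites supplied")
    accuracy = counts[reference_class] / total
    omission_error = 1.0 - accuracy
    # counts also shows which classes the omitted sites were assigned to.
    return accuracy, omission_error, counts


# Made-up labels for eight forest sites; not results from this study.
labels_at_sites = ["forest", "forest", "shrubland", "forest",
                   "cultivated land", "forest", "forest", "forest"]
accuracy, omission, breakdown = producers_accuracy(labels_at_sites)
```

A user's accuracy would additionally require reference sites that are known not to be forest, which is exactly what a forest-only inventory lacks.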


CHAPTER II

METHODOLOGY

2.1 Data sets

2.1.1 National Forest Inventory Data Set. Comisión Nacional Forestal (Inventario Nacional Forestal y de Suelos)

The Inventario Nacional Forestal y de Suelos (INFyS) is a national level forest inventory project undertaken by the Comisión Nacional Forestal de México. For the remainder of this thesis, the Inventario Nacional Forestal y de Suelos will be referred to as the INFyS data set(s). The INFyS data were used as a validation data set for the GlobeLand30 Forest land cover class. The INFyS was developed to collect a systematic set of precise and accurate statistical indicators of the character and health of the forests and soils of Mexico. The project serves as the baseline for a series of continuous assessment and monitoring programs. These data are used to understand the state of the forests across the country and the challenges they face (INFyS Manual, 2011).

Figure 2.1: INFyS sampling site distribution maps. Illustration of the sampling site distribution for INFyS0 (red) and INFyS1 (green). The INFyS0 is a complete survey whereas INFyS1 is still being produced.


The assessment of forest resources is a legal requirement under the Ley Forestal of México (Mexican Forestry Law) (INFyS Manual, 2011). The two most recent iterations of these forest inventories, which were carried out during the 2004-2009 and 2009-2013 periods, are used within this study. The 2004-2009 INFyS (INFyS0) contains 60,580 individual sampling sites at different geographical locations across the country (Figure 2.1). The 2009-2013 INFyS (INFyS1) contains 11,476 individual sampling sites and is still being compiled (Figure 2.1).

A structured assessment methodology was performed within each INFyS site. This included recording values for site ID, conglomerado ID, vegetation type, number of trees, number of damaged trees, basal area, canopy coverage, wood volume, and latitude and longitude. These data were provided in a file in comma separated values (csv) format (an example of the data structure is provided in Appendix A). The attribute data of vegetation type, number of trees, and canopy coverage are the most important for the analysis. The latitude and longitude values were converted to decimal degrees. These coordinate data indicate the centroid of the sampling sites. Site ID, conglomerado ID, the decimal degree values, and the three descriptive attributes (vegetation type, number of trees, and canopy cover) were saved in csv format and then converted into a shapefile using ArcGIS version 10.2 (ESRI, Redlands, CA, USA) (Figure 2.2). In this process, the INFyS sample sites were converted to points instead of their circular (for temperate forests) or rectangular (for tropical forests) shapes.
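The conversion of coordinates to decimal degrees can be sketched as follows. The text does not state the original coordinate format, so this example assumes degrees, minutes, and seconds with a hemisphere letter; the function name and the sample coordinate are hypothetical.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds and a hemisphere letter to decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in [0, 60)")
    value = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    # South latitudes and west longitudes are negative in decimal degrees.
    return -value if hemisphere in ("S", "W") else value


# Hypothetical site coordinate, not taken from the INFyS files.
latitude = dms_to_decimal(19, 30, 0, "N")
longitude = dms_to_decimal(99, 15, 0, "W")
```

All of Mexico lies west of the prime meridian, so under this sign convention every longitude in the converted table would be negative.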


Figure 2.2: Attribute data of INFyS shapefiles

Site selection for both of the INFyS data sets followed a distributed clustered sampling methodology. Using a spatial software program, a 5 km by 5 km grid was set over the whole country. The intersection of any two lines was deemed a candidate location for a group of sampling sites known as a 'conglomerado'. Conglomerado will be referred to as cluster for the remainder of this thesis. Multiple steps were taken to determine if a location on the grid would be sampled. First, site locations must have been mapped as forest in either the Instituto Nacional de Estadística y Geografía (INEGI) Series III (2002) or Series IV (2008) Land Use/Land Cover maps (INEGI, 2009 and 2012). The INEGI land cover maps were produced at the 1:250,000 scale. This implies that all INFyS sites were originally deemed to be forested at this smaller geographical scale. Additional factors, such as accessibility of the location, significance of forest type, and number of total sampling sites for each field crew, were used to determine which locations would be sampled. The subjective component of the decision was made by the field crews, who were selected in part for their local knowledge of the region (INFyS Manual, 2011).

The process described above was used to determine the location of the cluster. A cluster is composed of 1 to 4 individual sampling sites (Table 2.1). The INFyS0 contains


17,130 unique clusters and the INFyS1 contains 3,196 clusters. The centroid of the central sampling site is located where two lines of the 5 km by 5 km grid intersect. The three other sites are located relative to the first site at a distance of 45.14 m between centroids. Site two is located due north, site three is at 120 degrees, and site four is at 240 degrees (Figure 2.3). All sampling sites cover an area of 400 square meters. Due to extreme terrain, land ownership, and permission constraints, several of the clusters do not have all four sites. Another possibility is that the area of the site simply contained no trees and was therefore not sampled.

Table 2.1: Frequency of clusters with specific number of sites. The number of clusters that have a given number of sampling sites for each of the INFyS data sets. The large majority of clusters have four sampling sites.

Number of sampling sites per cluster    INFyS0    INFyS1
1                                          763       120
2                                        1,474       232
3                                        2,703       484
4                                       12,190     2,360

The shape of the sampling site depends on the forest type being surveyed: 'Bosque' (temperate forest) or 'Selva' (tropical forest). The survey sites, or 'sitios', for the temperate forest are circular with a radius of 11.28 m originating at the centroid of each site (Figure


2.4). The tropical forest sites are rectangular with a major axis of 40 m and a minor axis of 10 m (Figure 2.4). The orientation of the major axis is the same as the orientation of the relationship between site one and the three other sites, with azimuths of 0 degrees, 135 degrees, and 225 degrees. Site one is oriented in an east-west direction.

Figure 2.3: Conglomerado distribution. Example of how the sample sites that constitute a conglomerado are spatially distributed. Each orange dot represents a sampling site. The central sample is at the intersection of the 5 km by 5 km national grid.

Each of the four sites has a unique ID and they all share the same group (cluster) ID. The ID_CONGLOM attribute in the shapefiles (Appendix A) is the unique ID for the cluster and was used for sorting the data. The ID_SITIO is the unique identifier for all sites. The TIPO_VEGET defines the forest type at a given site. The INFyS0 data has 58 unique forest types and the INFyS1 has 35. A list of these classes and the total number of sampling sites associated with each forest type is included in Appendix A. F_ARBOLES is a count of the total number of individual trees with a diameter of more than 7.5 cm at 1.3 m above the ground. COBERTURA_ARBOLES is the canopy cover in square meters. The original values


reported in Appendix A are in square meters per hectare. This value must be multiplied by 25 to obtain the canopy coverage for the 400 square meter site.

Figure 2.4: Site sampling diagrams. Examples of the sampling site structures for temperate forest (left) and tropical forest (right). The group is what is referred to as the cluster and the 'sitios' are the individual sampling sites. All the data within this study come from the 400 square meter area represented by the green circles in the temperate forests and the green rectangles in the tropical forests. The smaller orange and yellow features within each site refer to survey data that were collected on lower vegetation and the soil and are not included in this study.

2.1.2 GlobeLand30 from National Geomatics Center of China

GlobeLand30 (GL30) is a remote sensing derived global land cover data set produced at a spatial resolution of 30 meters for two time periods, 2000 and 2010 (Figure 2.5). It was released to the United Nations at the 2014 Climate Change Summit as a resource for the scientific and policy communities (Ran and Li 2015). Produced by the National Geomatics Center of China, GL30 represents a significant increase in resolution over existing global land cover data sets (Chen et al. 2015). Part of what makes this data set valuable is the process used to create it.


The two GlobeLand30 data sets are a product of a hierarchical Pixel-Object-Knowledge based (POK) classification system (Figure 2.6). This is an iterative process in which raster cells are classified into ten classes. The classification of the ten classes occurs in a specific order. After a class has been classified, all pixels assigned to that class are removed from subsequent classifications. The "Pixel" portion of this method refers to the spectral reflectance of the land cover. "Object" refers to the grouping of similar reflectance signatures. "Knowledge" refers to the inclusion of limiting physical landscape parameters as well as expert opinion on the potential for the existence of any given land cover class at a specific location.

For example, water bodies are the first class classified. A supervised classification based on Landsat imagery is performed to identify potential water bodies (pixel). These pixels are grouped into objects based on the shared characteristics of the individual reflectance signature relative to neighboring pixels (object). Limiting parameters, such as slope, are applied to ensure that no feature is classified where it cannot exist (knowledge). All the pixels that were classified as water are then masked out and the altered image is used as the base for the next classification, wetlands. This process is repeated in the following order: water, wetlands, snow and ice, artificial surfaces, cultivated land, forest, shrubland, grassland, and bare land (these four are classified at the same time), and tundra.
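The ordered classify-then-mask logic described above can be sketched in a few lines. This is only an illustration of the masking order, not the GL30 implementation: the per-pixel "wetness" attribute, the thresholds, and the class rules are invented placeholders, and the real process uses supervised classification, object grouping, and knowledge rules at each step.

```python
def hierarchical_classify(pixels, ordered_rules):
    """Assign each pixel the first class, in order, whose rule accepts it.

    pixels: list of per-pixel attribute dicts.
    ordered_rules: list of (class_name, rule) pairs, highest priority first.
    """
    labels = [None] * len(pixels)
    for class_name, rule in ordered_rules:
        for index, pixel in enumerate(pixels):
            # Pixels labelled by an earlier class are masked out and are
            # never reconsidered by later classes.
            if labels[index] is None and rule(pixel):
                labels[index] = class_name
    return labels


# Placeholder rules on an invented attribute; these are NOT GL30 thresholds.
toy_rules = [
    ("water", lambda p: p["wetness"] > 0.8),
    ("wetland", lambda p: p["wetness"] > 0.5),
    ("remaining classes", lambda p: True),
]
toy_pixels = [{"wetness": 0.9}, {"wetness": 0.6}, {"wetness": 0.1}]
toy_labels = hierarchical_classify(toy_pixels, toy_rules)
```

The first toy pixel satisfies both the water and the wetland rule but is labelled water, because it is masked before the wetland step runs. This is how the ordering limits confusion between spectrally similar classes.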


Figure 2.5: Map of GL30_2000 for Mexico. An image showing the GL30 2010 land cover raster mosaicked and masked to the borders and coasts of Mexico. Nine of the ten total land cover classes are present within Mexico. The Tundra class and the Permanent Ice and Snow class are not present in Mexico.

To distinguish between forest, shrubland, grassland, and bare land, a total of 6 Landsat bands and 23 MODIS based NDVI bands were used as inputs to a maximum likelihood classification model. The Landsat bands are used for the specific spectral reflectance and the NDVI bands are used to identify seasonal trends within the region. Limited information is


provided regarding the specific spectral thresholds used to define which class a pixel belongs to. General statements regarding the percentage of land surface covered by a given feature were provided. Forest and shrublands must have more than 30% ground coverage, grasslands must have at least 10%, and bare lands must have less than 10% vegetative coverage.

Figure 2.6: Workflow of GL30 classification process. This diagram is a representation of the GL30 classification process. Input features are represented by the green boxes and land cover classes by the orange boxes. The three primary processing methods are in the white boxes, and the masking process, which occurs at the end of the classification, is in the blue box.

A priori probability was developed based on two existing global land cover data sets that were mapped at resolutions of 500 m and 300 m. These data sets were reclassified and resampled to match the cell size and categorical classification of the GL30. A five by five moving window was used to calculate the probability of a given cell belonging to a specific


class (Figure 2.7). The result was included as a parameter within the maximum likelihood classifier.

Figure 2.7: Example of the a priori probability process. This is a visualization of the moving window process in the a priori probability process used by GL30. The forest cells, represented by 'F', are based on the two resampled existing land cover data sets. The percentage of forest cells among the 25 cells is used to calculate the probability that the cell X is actually forest. This is repeated for all cells for the forest, shrubland, grassland, and bare land classes.

A total of 191 Landsat images were used to produce the land cover maps of Mexico. Along with the raster based classification, a series of shapefiles is associated with the data sets that define which image was used for the classification and the date on which the image was captured (Figure 2.8). The majority of the images were collected within a year of the set date of 2000 or 2010 (Figure 2.9).

To prepare these data for analysis, the individual rasters were reprojected to the North America Albers Equal Area Conic projection. A mosaic was created using all the raster images. This new raster was masked to a buffered boundary of Mexico based on a Natural


Earth (naturalearthdata.com, 2016) large scale shapefile of the administrative boundary of Mexico. This feature was reprojected to the North America Albers Equal Area Conic projection. To account for the different resolutions at which the data sets were created, the administrative boundary of Mexico was buffered by 5 km to ensure that no information along the coasts was lost. This buffered shapefile was clipped to the administrative boundaries of the neighboring countries (Belize, Guatemala, and the United States of America) to exclude any areas from those countries.

Figure 2.8: Data provided by GL30. Upper left: A GL30 sheet represented by a land cover raster. Upper right: The shapefile that accompanies the raster. Each polygon refers to a Landsat image. This shapefile has been clipped to the country of Mexico. The straight line on the bottom is the border with Guatemala.
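Returning to the five by five moving window used for the a priori probability, the calculation can be illustrated as below. This is a sketch under stated assumptions: a single reference grid is used rather than the two resampled data sets, the prior is taken to be the plain fraction of forest cells in the window, and windows that extend past the grid edge are simply truncated. None of these details are specified in the GL30 documentation summarized here.

```python
def forest_prior(grid, row, col, window=5):
    """Fraction of cells labelled 'F' in a window centred on (row, col)."""
    half = window // 2
    forest = 0
    total = 0
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            # Assumed edge handling: ignore cells that fall off the grid.
            if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
                total += 1
                if grid[r][c] == "F":
                    forest += 1
    return forest / total


# Invented 5 x 5 reference grid: 'F' marks forest, '.' marks anything else.
example_grid = [
    "FFFFF",
    "FFFFF",
    ".....",
    ".....",
    ".....",
]
centre_prior = forest_prior(example_grid, 2, 2)
```

Here 10 of the 25 cells around the centre cell are forest, so the sketch assigns that cell a prior forest probability of 0.4.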


Figure 2.6: Image capture dates from 1999. Graphical example of the collection dates of Landsat images from the GL30 2000 data set. All values are images that were captured in 1999. The majority of the images were captured between October and December.

2.2 Analytical Processes

2.2.1 Forest Type Groups

To fully utilize the detail of the INFyS data sets, the representational accuracy of the GL30 Forest land cover class was assessed for various forest type groups that represent the major temperate and tropical forest types in Mexico, as follows:
1. All sites
2. Temperate forest
3. Tropical forest
4. Pine
5. Oak
6. Pine Oak Mix
7. Low dry deciduous jungle
8. Medium semi deciduous jungle
9. High evergreen jungle
10. Erosion presence

To best account for the temporal differences between the GL30 and INFyS data sets, the following structure is followed throughout the methods. The GL30 2000 land cover data is compared against the INFyS0 data set. The GL30 2010 data is compared against the INFyS1 data set.


The temperate forest type group is defined by sites containing "Bosque" in the INFyS TIPO VEGET (vegetation type) field (Figure 2.10). This included 21 forest types in the INFyS0 and seventeen in the INFyS1 data sets (see Appendix A). Of these unique temperate forest types, the three most common within the temperate forest group are Encino (5. Oak), Pino (4. Pine), and a mixture of the two. In this study the forest types Pino Encino and Encino Pino were combined to form a single forest type group (6. Pine Oak Mix). These forest types are dominant in the cooler and mountainous regions of the country. The pine forest is evergreen, whereas the oak forest is deciduous and loses its leaves annually.

The tropical forest group is defined by all forest types that contained the word "Selva" in the TIPO VEGET (vegetation type) field (Figure 2.11). This included eighteen forest classes in the INFyS0 data set and thirteen classes in the INFyS1 data set. The tropical forests are split into three groups based on the average height of the dominant trees and the percent of the trees that drop their leaves during the dry season. Selva Baja is low dry jungle where almost 100% of trees drop their leaves during the dry season, Selva Mediana is medium semi deciduous jungle where about 50% of trees drop their leaves during the dry season, and Selva Alta is high evergreen jungle where almost no trees drop their leaves. Forest height is due in large part to environmental conditions and, as a result, the forests appear in distinct geographical locations. The Low dry deciduous jungle is found mostly along the low elevations of the western coast, the Medium semi deciduous jungle is found primarily on the Yucatan Peninsula, and the High evergreen jungle is mostly found in the states of Veracruz and Chiapas.
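The group definitions above amount to simple text rules applied to the TIPO VEGET value. The following is a minimal sketch of those rules and not the script used in this study. The function name `forest_groups` and the exact mixed-forest strings ("bosque de pino encino", "bosque de encino pino") are assumptions made for illustration; the jungle and Erosion presence groups would follow the same pattern.

```python
def forest_groups(tipo_veget):
    """Return the forest type groups that one TIPO VEGET value belongs to."""
    name = tipo_veget.strip().lower()
    groups = {"All sites"}

    # Broad groups: any vegetation type containing the keyword.
    if "bosque" in name:
        groups.add("Temperate forest")
    if "selva" in name:
        groups.add("Tropical forest")

    # Specific groups: exact vegetation type names (assumed spellings).
    if name in ("bosque de pino encino", "bosque de encino pino"):
        groups.add("Pine Oak Mix")  # both mixed types are merged into one group
    elif name == "bosque de pino":
        groups.add("Pine")
    elif name == "bosque de encino":
        groups.add("Oak")
    return groups


print(sorted(forest_groups("Bosque de Encino Pino")))
```

A site can therefore belong to several groups at once, which matches the nested structure of the ten groups (for example, every Pine site is also a Temperate forest site).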


The forest type group Erosion presence includes all forest types where major soil damage is present. Overall, this represents a small portion of the total number of sites for both forest inventories. It was included because soil degradation is an environmental indicator which may be associated with marginal landscape conditions.

Figure 2.7: Generalized location of Temperate Forests


Figure 2.8: Generalized locations of Tropical Forest. This map shows the relative density of the three tropical forest type groups: low dry deciduous jungle, medium semi deciduous jungle, and high evergreen jungle.

2.2.2 Rationale for Accuracy Assessment Methods

The strength of the INFyS data as a validation set is in the sheer number and distribution of the sites and the attribute data collected at the individual sites. The three attributes which are most important to the validation methodology are forest type, number of trees, and canopy coverage. Both the number of trees and canopy coverage values are viewed as relative measures of forest density and are continuous data sets, which allows for a more robust receiver operator curve analysis, explained below. To account for the temporal


variance between the land cover and INFyS data sets, the GL30 2000 data was evaluated against the INFyS0 data and the GL30 2010 data was evaluated against the INFyS1 data.

Based on the nature of the INFyS data sets, a unique methodology was used to best assess the representational accuracy of GL30's classification of the Mexican forests. The INFyS data sets only provide information on where forest is known to be; they do not define where forest is not present. Because of this, an error matrix cannot be developed, and so the common reporting parameters for a land cover data set assessment cannot be calculated. As a result, it is not possible to speak to the user's accuracy of GL30's representation of Mexico's forest. While this does limit the ability of this assessment to be directly related back to other published assessments, it does not mean the results of the methodology are uninformative. It should also be noted that, in assessing a single class of the GL30 data, the scope of analysis is significantly more precise than that of all previous assessment efforts.

2.3 Assessment Method 1: Percent Correct by Intersect

Location data for the INFyS data sets is stored as a set of coordinates which represents the centroid of the sampling site. This positional data was used to create a point feature class in ArcMap. A simple intersect was performed between this centroid and GlobeLand30 to determine which land cover class each point fell into (Figure 2.12). A new binary column was added to the INFyS data based on whether the site fell into a forest class or not (1 = forest, 0 = not forest). This value was stored with the INFyS data sets and was used as a basic and general indicator of the percent correct relationship between the two data sets. This method is referred to as the percent correct by intersect.
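The intersect itself was carried out in ArcMap. Purely as an illustration of the same logic, the sketch below looks up the raster cell under each centroid and derives the binary column and the percent correct value. The toy 2 x 2 grid, the coordinates, and the helper names are hypothetical, and the class codes (20 for forest, 40 for shrubland) are assumptions about the GL30 legend.

```python
FOREST = 20     # assumed GL30 code for the forest class
SHRUBLAND = 40  # assumed GL30 code for the shrubland class


def class_at(raster, x_min, y_max, cell_size, x, y):
    """Return the land cover code of the cell that contains point (x, y)."""
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)  # row 0 is the northern edge
    return raster[row][col]


def percent_correct(flags):
    """Percent of sites whose binary forest flag equals 1."""
    return 100.0 * sum(flags) / len(flags)


# Hypothetical 2 x 2 raster of 30 m cells, upper-left corner at (0, 60).
raster = [[FOREST, SHRUBLAND],
          [FOREST, FOREST]]
centroids = [(10, 50), (40, 50), (10, 10), (40, 10)]

# Binary column: 1 = centroid falls in the forest class, 0 = it does not.
flags = [1 if class_at(raster, 0, 60, 30, x, y) == FOREST else 0
         for x, y in centroids]
print(flags, percent_correct(flags))
```

In this toy case three of the four centroids fall in forest cells, so the percent correct by intersect is 75%.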


Figure 2.9: Intersect Relationship between INFyS and GL30. In this case three sampling sites from this conglomerado would be classified as forest (dark green) and one would be classified as shrubland (light green).

2.4 Assessment Method 2: Accounting for Sampling Sites Area and Positional Error in Percent Correct Calculations made in Method 1

To provide a more realistic representation of sampling sites, a method was developed to represent the sites as an area rather than a point. The sampling sites have a known area of 400 square meters and a known shape. For computational purposes, all sites were represented by the shape associated with temperate forests, a circle with a radius of 11.28 m. Because perfect positional accuracy between the two data sets cannot be assumed, a positional error value of 50 m associated with Landsat TM and ETM+ was used (Tucker et al. 2004). This implies that any given Landsat image pixel could be within 50 meters of its actual position on the map. No supported positional accuracy data for the GPS units used was identified, but it is assumed to be less than the error associated with the remotely sensed imagery. In knowing the size of the sampling site and potential


positional accuracy error associated with the underlying raster, we can say that any given sampling site is very likely to be represented by a circle with an area of 400 square meters within 61.28 meters of the centroid of that site. Therefore, if all the area within the 61.28 meters surrounding the site centroid is represented as forest by the GL30, it is assumed that this is a correct representation. Under the same logic, a misrepresentation is where none of the area is classified as forest (Figure 2.13). Finally, sites with a mixed value (e.g. 30% forest and 70% grassland) are considered partially covered by forest and were excluded from specific analyses due to the inherent uncertainty in their classification. This method is referred to as the percent correct by area.

The percent correct calculation was performed using a Python script. The calculation itself is simple: the number of correct sites is divided by the total number of sites sampled. However, the selection of the forest groups is more cumbersome. The INFyS data with the new binary column was exported as a csv and imported as a pandas data frame. Two different selection methods were used. The first method selected all sites where the vegetation type included a forest group keyword, such as 'bosque' or 'selva'. The second method was used when a specific forest type was selected, such as 'bosque de pino' or 'bosque de encino'. Once the sites were selected, the total number of true values in the forest binary column was counted. The total number of sites within the selection was calculated, and the percent correct value was recorded and saved in a list. This process was repeated for all ten forest groups. These lists of values were converted into a data frame and then exported as a csv.
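As a rough illustration of the steps above, the sketch below first reproduces the 61.28 m radius from the 400 square meter site area and the 50 m positional error, and then applies the two selection methods to a small list of records. It is a plain Python stand-in for the pandas script, not the script itself; the record layout (`veg`, `forest`) and the function name are hypothetical.

```python
import math

# Radius of a circle with an area of 400 square meters, plus 50 m of error.
site_radius = math.sqrt(400 / math.pi)
buffer_radius = site_radius + 50


def percent_correct_by_group(sites, keyword=None, exact=None):
    """Percent correct for one forest group.

    keyword: select sites whose vegetation type contains this word.
    exact:   select sites whose vegetation type equals one of these names.
    Sites with forest = None (partially covered) are left out of the score.
    """
    if exact is not None:
        wanted = {name.lower() for name in exact}
        selected = [s for s in sites if s["veg"].lower() in wanted]
    else:
        selected = [s for s in sites if keyword.lower() in s["veg"].lower()]
    scored = [s for s in selected if s["forest"] is not None]
    if not scored:
        return None
    return 100.0 * sum(s["forest"] for s in scored) / len(scored)


# Hypothetical records: vegetation type and binary forest column.
sites = [
    {"veg": "Bosque de Pino", "forest": 1},
    {"veg": "Bosque de Pino", "forest": 0},
    {"veg": "Bosque de Encino", "forest": 1},
    {"veg": "Selva Baja", "forest": None},
    {"veg": "Selva Alta", "forest": 1},
]

print(round(site_radius, 2), round(buffer_radius, 2))
print(percent_correct_by_group(sites, keyword="bosque"))
print(percent_correct_by_group(sites, exact=["bosque de pino"]))
```

Dropping the `None` records inside the function means the same routine can serve both the intersect method (no `None` values) and the area method (partially covered sites excluded).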


Figure 2.10: Percent Correct by Area process. In the first method, this location would have been classified as Forest by the GL30 because the centroid of the site falls within the GL30 forest class. In the second method, this site would have been defined as partially covered by forest. The largest-radius circle represents all the area in which the 400 m2 area of the INFyS site could be, given the known potential positional error of the Landsat ETM+ of 50 m (Tucker et al. 2004). This can be visualized by imagining the centroid staying in the same position and the image beneath it moving in any direction by up to 50 meters. If the image were to move to the left by 30 m, the centroid and sampling area would be fully within the grassland class of the GL30. This means that, due to the inherent uncertainty in the position of the Landsat image used to create the GL30, it is not possible to say with complete certainty that this site is actually in the forest class.

In order to obtain the percent area value, an individual sampling site was selected and buffered to a circle with a radius of 61.28 m. This buffer feature maintained the site ID for linking back to the INFyS data. This polygon was converted to a raster with 1 m cells. An area calculation was performed using the GL30 data sets as the reference data. The area


calculation determines how much area of the feature being tested falls into the unique classes defined by the reference data set. In this case, the sampling site with an area of 11,791 m2 is compared against the GL30 land cover data. The output identified the unique sampling site ID and how many of the 11,791 one-meter cells fell into each of the ten land cover classes associated with GL30. This output table was added to a pandas data frame and the percent forest value was calculated by determining how many of the 11,791 cells were classified as forest (Figure 2.14). This process was repeated for all the sampling sites for the two INFyS data sets. Once complete, the percent area value was joined back to the original shapefile based on the site ID. A binary results column was created on the pandas data frame where 1 = 100% forest, 0 = 0% forest, and 'None' (equivalent to no data) was applied to all partially covered by forest sites.

Figure 2.11: Examples of percent correct by area values. The black circle represents the buffered area with a radius of 61.28 m. All area surrounding the site on the left is classified as forest, the center site has a portion classified as forest and is considered partially covered by forest, and the site on the right has no forest.
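The binary results column described above can be sketched from the per-class cell counts of a single buffered site. This is an illustrative stand-in, not the study's script: the dictionary layout, the function names, and the class codes (20 for forest, 30 for grassland) are assumptions, and the mixed example simply mirrors the 30% forest and 70% grassland case mentioned in the text.

```python
FOREST = 20     # assumed GL30 code for the forest class
GRASSLAND = 30  # assumed GL30 code for the grassland class


def percent_forest(cell_counts):
    """Percent of a site's 1 m cells that fall in the forest class."""
    total = sum(cell_counts.values())
    if total == 0:
        return None
    return 100.0 * cell_counts.get(FOREST, 0) / total


def forest_binary(cell_counts):
    """1 = all cells forest, 0 = no cells forest, None = partially covered."""
    total = sum(cell_counts.values())
    forest = cell_counts.get(FOREST, 0)
    if total == 0:
        return None
    if forest == total:
        return 1
    if forest == 0:
        return 0
    return None


# Hypothetical area tabulations for three buffered sites of 11,791 cells each.
all_forest = {FOREST: 11791}
no_forest = {GRASSLAND: 11791}
mixed = {FOREST: 3537, GRASSLAND: 8254}

for counts in (all_forest, no_forest, mixed):
    print(percent_forest(counts), forest_binary(counts))
```

Comparing integer cell counts rather than the floating-point percentage avoids any rounding question when deciding whether a site is exactly 100% or 0% forest.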


The percent correct value was calculated using the same base structure as the script referenced in section 3.2. To exclude the partially covered by forest sites, once the selection was made based on the forest group, all sites with None in the forest binary column were removed. After that, the percent correct values were calculated and exported as a csv.

2.5 Assessment Method 3: Predictive quality of the INFyS site attributes Canopy Cover and Number of Trees

Given the methods through which remotely sensed images are derived, the number of trees and canopy coverage at a given site can serve as key indicators of whether the location was correctly represented by GL30. These values can both be viewed as indicators of forest density. Therefore, the distribution of the values associated with correctly represented sites is likely different from the distribution of values associated with misrepresented sites. This relies on the assumption that the greater the forest density at a site, the more likely it is to be identified as forest in remotely sensed data. The distribution of these values is non-normal and considerably skewed to the right (Figure 2.15). Therefore, the Mann-Whitney U test was used to determine whether the distributions of the correctly represented and misrepresented sites are indeed different.

2.2.5.1 Mann-Whitney U Test

The distribution of values for number of trees and canopy coverage between the two classes was compared using a Mann-Whitney U test. This test ordinally ranks the values from both groups. From this ranking, a sum of the ranks is calculated for a single group. The U value for both groups is determined based on the number of samples within each


group and the sum rank value. The smaller of the two values is then compared against a critical U value. If the calculated U value is smaller than the critical U, the null hypothesis is rejected. The null hypothesis of the Mann-Whitney test is that the distributions are the same. The p value associated with each test was used as an indicator of whether the null hypothesis could be rejected, and therefore whether the distributions of the data are indeed different.

Figure 2.12: Distributions of Number of Trees and Canopy Coverage. This figure shows the distribution of correctly identified and misidentified sites based on the number of trees per site (left) and canopy coverage (right). The AUC value for canopy coverage was higher than that of number of trees.

By adapting the original selection script used in the percent correct by intersect method, a Mann-Whitney test was performed on the ten forest groups, all of which had over 100 sampling sites. Testing the number of sites in a group was done to ensure the validity of the statistical methods. Sites were selected based on forest group and then stored in one of four arrays based on being correctly (100% forest) or incorrectly (0% forest) represented as forest by the percent correct by area method. Partially covered by forest


sites were excluded from the analysis. The four arrays were:
1. correct canopy coverage values
2. misrepresented canopy coverage values
3. correct number of trees values
4. misrepresented number of trees values

Two Mann-Whitney U tests were performed for each forest group, one for canopy coverage and one for number of trees per site. These values were stored in a pandas data frame. The process was completed for all ten forest groups and the data frame was exported as a csv.

2.2.5.2 Receiver Operator Curves (ROC) and Area Under the Curve (AUC)

The results of the Mann-Whitney test indicated that the distributions of values for number of trees and canopy coverage differ between the correctly represented and misrepresented sites. Knowing this allows for the use of the Receiver Operator Curve (ROC) and Area Under the Curve (AUC) statistical methods. These methods provide a means of evaluating how effective the number of trees and canopy coverage values are at predicting the classification of a given forest type by the GL30 data. A unique characteristic of the AUC is that the output of the analysis is invariant against class skew and evaluated score (Fawcett, 2005). Therefore, regardless of the number of features within a distribution or the range over which values are measured, the AUC can be compared directly against other AUC values.

The ROC curve is created by plotting the true positive rate against the false positive rate of a binary distribution at multiple thresholds. The true positive rate is the number of values that are known to be true at or above a threshold divided by the total number of true values. The false positive rate is the number of false values at or above a given threshold divided by the total number of false values. The true positive and false positive rates are


represented as percentages and form the y and x axes of the plot, respectively. The AUC is then determined by calculating the area under the curve created by the ROC test. ROCs are very useful for visually representing the data, and the AUC is useful as a single-value representation of the data set.

Keeping consistent with the methodology deployed for the Mann-Whitney test, the ROC and AUC analyses were performed on all forest groups, ensuring that each group contained 100 or more sample sites. The process for the ROC and AUC was built on top of the script for the Mann-Whitney test. The four arrays created for each forest group served as the inputs to the roc_auc_score and roc_curve functions that are part of the sklearn.metrics library. The output of these functions is an area under the curve value, false positive rates, true positive rates, and threshold values. These values were stored in a data frame, and the process was repeated for the number of trees per site attribute data as well. While the AUC value can be compared directly, the three other values are used to construct the ROC curve for visual interpretation (Figure 2.16). This process was completed for all forest groups, and the final data frame of the AUC values was exported as a csv.

The true positive rate is the proportion of all correctly identified sites that are found at a given threshold or above. The false positive rate is the proportion of misidentified sites that are found at a given threshold or above. A detailed example is provided below.
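Before the worked example, the construction of the ROC curve and AUC described above can be sketched in plain Python. This is a standard-library stand-in written for illustration and is not the sklearn.metrics code used in this study; the function names and the small input lists are hypothetical.

```python
def roc_points(correct_values, missed_values):
    """Return (false positive rate, true positive rate) pairs.

    correct_values: attribute values (e.g. number of trees) at sites that
                    GL30 identified as forest.
    missed_values:  attribute values at sites that GL30 misidentified.
    """
    thresholds = sorted(set(correct_values) | set(missed_values), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # Share of each group found at or above the threshold.
        tpr = sum(v >= t for v in correct_values) / len(correct_values)
        fpr = sum(v >= t for v in missed_values) / len(missed_values)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical number-of-trees values for one forest group.
correct = [10, 20, 30, 40]
missed = [5, 15, 25]

curve = roc_points(correct, missed)
print(curve[-1], auc(curve))
```

An AUC of 0.5 would mean the attribute does not separate the two groups at all, while a value of 1.0 would mean every correctly identified site has a higher value than every misidentified site.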


Figure 2.13: Example of ROC curve. ROC curve plot for the coverage values and number of trees for the pine forest in the INFyS0 data set.

Consider an ROC and AUC test performed on the number of trees values for a forest type group that has 1000 total samples, of which 800 were identified by GL30 as the forest class and 200 were not. The percentage correct, or producer's accuracy, is therefore 80%. This is a general value for all the sites, and it may be that the density of trees in a particular section of forest is well known, such as 15 per 400 m2. This value of 15 trees per 400 m2 represents a threshold in the ROC test. The true positive rate is determined by identifying how many sites that have at least 15 trees per 400 m2 have been correctly identified and dividing that by the total number of correctly identified sites. The false positive rate is determined by the same method, but in this case it is the number of misidentified sites found above the


threshold over the total number of misidentified sites. A powerful option of this process is the ability to calculate a percent correct value based on a given threshold. In this case it was found that 500 of the 800 correctly identified sites and 50 of the 200 misidentified sites had at least 15 trees in a 400 m2 area. This would produce a true positive rate of 0.625 and a false positive rate of 0.25. The percent correct value at this threshold is 91%, which is 11 percentage points higher than the percent correct value found for all sites. This provides another means of understanding how well GL30 has represented the forest in that region.
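The arithmetic of this worked example can be checked with a few lines of Python. The counts are those given in the text; the variable names are chosen here for illustration only.

```python
# Counts from the worked example.
correct_total = 800   # sites GL30 identified as forest
missed_total = 200    # sites GL30 did not identify as forest
correct_at_15 = 500   # correctly identified sites with at least 15 trees
missed_at_15 = 50     # misidentified sites with at least 15 trees

# Percent correct over all sites (producer's accuracy).
overall = 100.0 * correct_total / (correct_total + missed_total)

# Rates at the threshold of 15 trees per 400 m2.
true_positive_rate = correct_at_15 / correct_total
false_positive_rate = missed_at_15 / missed_total

# Percent correct among only the sites at or above the threshold.
percent_correct_at_15 = 100.0 * correct_at_15 / (correct_at_15 + missed_at_15)

print(overall, true_positive_rate, false_positive_rate)
print(round(percent_correct_at_15, 1))
```

The last value is 500 out of 550 sites, which rounds to the 91% quoted above.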


CHAPTER III

RESULTS

The results of this project are presented in three sections based on the three distinct methods used to assess the accuracy of GL30's representation of the forests of Mexico. The first two methods, percent correct by intersect and percent correct by area, involve the direct spatial comparison between the GlobeLand30 and the INFyS data sets through an intersect and a percent area calculation. The product of these methods is a percent correct value for the ten forest type groups examined here (see Appendix A for the definition of each group). The third method uses attribute data associated with the INFyS sites to determine how well forest inventory sample site attributes can be used to assess how well the GL30 Forest class identifies the forests in Mexico. This is done by applying the receiver operator curve and area under the curve processes (subsection 3.4). By using multiple assessment methods, this study produces a more complete assessment of the relationship between the GL30 and INFyS data sets and of how well the GL30 forest class represents the forests of Mexico.

3.1 Assessment Method 1: Percent Correct by Intersect

The first process undertaken provides a well-known and understood baseline assessment to compare against the other methods. The values reported represent the spatial relationship between the centroid of the sampling site and the GL30 30 x 30 m cell where that centroid falls. Two sets of results were produced. One compares GL30_2000 with the 60,580 sites of the INFyS 2004-2009 (INFyS0) inventory data. The second compares GL30_2010 with the 11,476 sites of the INFyS 2009-2013 (INFyS1) data.


3.1.1 Percent Correct of All Sites

The first assessment was performed on the forest type group for all sites (see Appendix A). The INFyS0 data set contains 58 unique forest types and the INFyS1 data set contains 35 unique forest types (see Appendix A). For the GL30_2000 INFyS0 relationship, the overall percent correct value was found to be 77.2%. The GL30_2010 INFyS1 relationship had an overall agreement of 79.6% (Figure 3.1, Table 3.1).

Figure 3.1: Percent Correct By Intersect Values. This graph summarizes the results of the percent correct by intersect values of the GL30_2000 INFyS0 data and the GL30_2010 INFyS1 data sets. [Bar chart. Axes: Forest Class Grouping; Percent Correct (70.0 to 100.0). Series: INFyS0, INFyS1.]

3.1.2 Temperate Forests

Temperate forest was defined by all forest types which contained the word "Bosque" in the INFyS TIPO VEGET vegetation class field. This included 21 forest types in the INFyS0 and 17 in the INFyS1 data sets (see Appendix A). The percent correct value was 73.9% for


the GL30_2000 INFyS0 relationship and 74.6% for the GL30_2010 INFyS1 relationship.

3.1.3 Tropical Forests

Tropical forest was defined by all forest types which contained the word "Selva" in the INFyS TIPO VEGET vegetation class field. This included 18 forest types in the INFyS0 data set and 13 forest types in the INFyS1 data set (see Appendix A). The percent correct value was 89.3% for the GL30_2000 INFyS0 relationship and 89.07% for the GL30_2010 INFyS1 relationship.

3.1.4 Primary Temperate Forest Classes: Pino, Encino, and Pino Encino Mix

The temperate forest class contains the most sampling sites in both of the INFyS data sets. The temperate forests are primarily pino (Pine), encino (Oak), or a mixture of the two. The percent correct values for the Pine, Oak, and Pine Oak mix forests for the INFyS0 data are 76.2%, 72.3%, and 73.8% respectively. These values were similar in the INFyS1 data: 74.6%, 72.3%, and 75.4% respectively.

3.1.5 Primary Tropical Forest Classes: Selva Baja, Mediana, and Alta

There is a great diversity of tropical forest types in Mexico. Within the INFyS data the tropical forest types are distinguished based on the height of the trees and the percent of trees that drop their leaves during the dry season. The percent correct values for the Selva Baja (Low dry deciduous jungle), Mediana (Medium semi deciduous jungle), and Alta (High evergreen jungle) forests for the INFyS0 data set are 77.2%, 95.0%, and 91.4% respectively. The values for the INFyS1 data are 74.9%, 95.5%, and 90.9% respectively.

3.1.6 Erosion presence

The Erosion presence class is a mixture of temperate and tropical forest types that contain the term "Erosion" within the INFyS TIPO VEGET vegetation class field name. The INFyS0 data


set contains 18 unique forest types and the INFyS1 data set contains 9 unique forest types that include the Erosion term in their description (see Appendix A). For the GL30_2000 INFyS0 relationship, the overall percent correct value was found to be 78.7%. The GL30_2010 INFyS1 relationship has an overall agreement of 76.2%.

The results reported here represent the most basic relationship possible between GL30 and INFyS. The methods presented next attempt to account for some of the limitations of these results. However, these results serve as the baseline against which the more detailed evaluations presented in the next subsections are compared.

3.2 Assessment Method 2: Accounting for Sampling Sites Area and Positional Error in Percent Correct Calculations made in Method 1

The results in this subsection offer an interpretation of the percent correct relationship that better accounts for the spatial character of the data. This was done by addressing two major assumptions inherent in the percent correct by intersect method presented above. First, sampling sites are assumed to be points instead of polygons with an area. Second, perfect positional accuracy between the GL30 and INFyS data sets is assumed. Perfect positional accuracy implies that every cell of the GL30 data is found at the exact position on the Earth where it is supposed to be. The method presented in this subsection accounts for those assumptions by defining all sites as either 100% forest, 0% forest, or some percent value in between, referred to as partially covered by forest. The percentage refers to the proportion of a defined area around the centroid which is classified as forest by the GL30 Forest class. All percent correct results reported below were calculated using only sample sites that have 100% forest cover and sample sites that have 0% forest cover. The


partially covered by forest sites were excluded because there is uncertainty about whether they were truly represented as forest by the GL30 data. These partially covered by forest sites are found in the boundaries between the forest and other land cover/use classes. By addressing the inherent uncertainty in the data, the percent correct by area method provides a more realistic interpretation of the spatial relationship between the GL30 and INFyS data sets.

The method described in this subsection limits the number of sites that are included in the analysis by excluding all sites that are not either 100% or 0% covered by forest over their entire area. Of the 60,580 sites of the INFyS0 data, 10,801 sites were identified as partially covered by forest. These sites are 17.8% of the total sites in the INFyS0 data. Of the 11,476 sites of the INFyS1 data, 352 sites were identified as partially covered by forest. These sites are 3.0% of the total sites in the INFyS1 data.

3.2.1 Percent Correct of All Sites

This assessment used all INFyS sites that were either 0% or 100% covered by the forest class of the GL30. The INFyS0 data set contains 58 unique forest types and the INFyS1 data set contains 35 unique forest types (see Appendix A). For the GL30_2000 INFyS0 relationship, the overall percent correct value was found to be 84.4%. The GL30_2010 INFyS1 relationship has an overall agreement of 80.4% (Figure 3.2, Table 3.1).


Figure 3.2: Percent Correct by Area INFyS0 and INFyS1. This graph summarizes the results of the percent correct by area values of the GL30_2000 INFyS0 data and the GL30_2010 INFyS1 data sets. [Bar chart. Axes: Forest Type Groups; Percent Correct (70.0 to 100.0). Series: INFyS0, INFyS1.]

Table 3.1: Percent correct values for the INFyS0 and INFyS1. The percent correct values are reported for the INFyS0 and INFyS1 data sets for the percent correct by intersect and percent correct by area evaluations.

                               Intersect          Area
Forest type group              INFyS0   INFyS1    INFyS0   INFyS1
All Sites                      79.2     79.6      84.4     80.4
Temperate Forest               73.9     74.5      79.5     75.4
Tropical Forest                89.3     89.1      93.1     89.8
Pine                           76.3     74.6      82.3     75.7
Oak                            72.3     72.3      77.9     73.1
Pine Oak mix                   73.8     75.5      79.3     76.4
Low dry deciduous jungle       77.2     74.9      82.8     75.9
Medium semi deciduous jungle   95.0     95.5      97.3     95.9
High evergreen jungle          91.4     90.9      93.9     91.5
Erosion presence               78.7     76.2      85.8     75.9


3.2.2 Temperate Forests

Temperate forest was defined by all forest types which contained the word "Bosque" in the vegetation class field, which includes a wide range of temperate forest types. This included 21 forest classes in the INFyS0 and 17 in the INFyS1 data set (see Appendix A). The percent correct value was 79.5% for the GL30_2000 INFyS0 relationship and 75.4% for the GL30_2010 INFyS1 relationship.

3.2.3 Tropical Forests

Tropical forest was defined by all forest types which contained "Selva" in the vegetation class field. This included 18 forest classes in the INFyS0 and 13 classes in the INFyS1 data sets. The percent correct value was 93.1% for the GL30_2000 INFyS0 relationship and 89.8% for the GL30_2010 INFyS1 relationship.

3.2.4 Primary Temperate Forest Classes: Pino, Encino, and Pino Encino Mix

The percent correct values for the pino (Pine), encino (Oak), and pino encino mix (Pine Oak mix) forests for the INFyS0 data are 82.3%, 77.9%, and 79.3% respectively. These values were substantially higher than those from the INFyS1 data: 75.7%, 73.1%, and 76.4% respectively.

3.2.5 Primary Tropical Forest Classes: Selva Baja, Mediana, and Alta

The percent correct values for the Selva Baja (Low dry deciduous jungle), Mediana (Medium semi deciduous jungle), and Alta (High evergreen jungle) forests for the INFyS0 data are 82.8%, 97.3%, and 93.9% respectively. The values for the INFyS1 data are 75.9%, 95.9%, and 91.5% respectively.


3.2.6 Erosion presence

The Erosion presence class is a blend of temperate and tropical forest types that contain the term "Erosion" within the INFyS TIPO VEGET vegetation class field name. The INFyS0 data set contains eighteen unique forest types and the INFyS1 data set contains nine unique forest types (Appendix A). For the GL30_2000 INFyS0 relationship, the overall percent correct value was found to be 85.8%. The GL30_2010 INFyS1 relationship has an overall agreement of 75.9%.

3.3 Assessment Method 3: Predictive quality of the INFyS site attributes Canopy Cover and Number of Trees

The first two methods are direct measurements of the representational accuracy of the Forest class in the GL30 data sets. The third method, presented here, is different because it uses attribute data contained in the INFyS data sets to assess how sensitive the representational accuracy of the GL30 Forest class is to differences in forest canopy coverage (COBERTURA_ARBOREA) and number of trees (# ARBOLES) reported for the site. The main hypothesis of this assessment method is that sites that have lower canopy coverage and/or fewer trees are less likely to be correctly identified as forests by the GL30 Forest class than sites with high values for those two attributes.

Two statistical measures are used to examine this relationship. The Mann-Whitney U test provides an initial indication of whether the values of the correctly represented and misrepresented sites are indeed different. The Receiver Operator Curve and Area Under the Curve measurements enable the comparison of the predictive quality of these two attributes (canopy cover and number of trees). By using the canopy cover and


number of trees contained in the INFyS data sets, we are able to produce a fuller perspective of how well the GL30 Forest class represents the forest of Mexico.

3.3.1 Mann-Whitney U Test

The first question is whether the values of canopy coverage and number of trees per site differ between sites correctly classified as forest and sites incorrectly classified. The Mann-Whitney U test was chosen because it is a non-parametric test that can be used to determine whether two data distributions are indeed different. There are limitations on the sample size needed to ensure the inferences of the test are correct. The documentation for the Python function used to perform the Mann-Whitney U test suggests a minimum sample size of 20 for each of the two groups being tested (docs.scipy.org, 2016). In this study we performed this analysis on forest type groups (see Appendix B) that contained more than 100 sites in total. This ensured that all forest type groups tested met the minimum sample size required by the test. The correctly identified and misidentified groups each contain more than 20 sites for every forest type group analyzed.

The output of a Mann-Whitney test is a U value and a p value. The p value is the result of a two-sided test. If the p value is less than 0.001, it can be said with confidence that the distributions of the canopy coverage and number of trees associated with correctly identified and misidentified sites are different. The results of the Mann-Whitney U test provide confidence that the data is well suited for the ROC and Area Under the Curve analysis.
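To make the U value and the two-sided p value concrete, the sketch below computes both in plain Python. It is a simplified stand-in for the scipy function cited above, not a reproduction of it: the p value uses a normal approximation without tie or continuity corrections, which is only reasonable for groups of roughly 20 or more values, and the function name and the small input lists are hypothetical.

```python
import math


def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p) using a normal approximation for p."""
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])

    # Assign ranks 1..N, giving tied values the average of their ranks.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = average_rank
        i = j

    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    u_b = n_a * n_b - u_a
    u = min(u_a, u_b)  # the smaller U is the one compared to the critical value

    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p


# Hypothetical canopy coverage values for misidentified and correct sites.
u_value, p_value = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u_value, round(p_value, 3))
```

With fully separated groups, as in this toy input, U is zero; the small p value here should not be read as meaningful because the groups are far below the sample sizes used in the study.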


3.3.2 Results of the INFyS0 data set

A listing of the U values and p-values from the test is included in Appendix B. For clarity, the p-values are reported here by relative significance, where 1 refers to the smallest p-value (Table 3.2).

Table 3.2: Ranked p-values from Mann-Whitney; INFyS0
The p-values from the Mann-Whitney U test of the canopy coverage (P_Coverage) and number of trees per site (P_Trees) are reported in ranked order with 1 being the smallest value.

Forest type group               P_Coverage   P_Trees
All Sites                       N/A          N/A
Temperate Forest                N/A          5
Tropical Forest                 4            2
Pine                            8            12
Oak                             3            11
Pine-Oak mix                    1            10
Low dry deciduous jungle        7            9
Medium semi-deciduous jungle    13           6
High evergreen jungle           15           14
Erosion presence                16           17

3.3.2.1 Canopy coverage. All reported p-values were less than 0.001. The Python function did not produce a p-value for three of the twenty categories. These are the forest type groups with the most sites, and it is believed that some limitation in the function was reached as a result. The function may have been overwhelmed by the number of features being tested. The Mann-Whitney U test is dependent on the number of samples within the group, and variance is more likely to be present within a group when there is a large number of samples. However, the exact cause is not known. The results do imply that the distribution of canopy coverage values is different between the correctly identified and misidentified sites.


3.3.2.2 Number of trees. All reported p-values were well below 0.001. The p-value for the All Sites forest type group was not reported by the Python function. The results imply that the distribution of number of trees is different between the correctly identified and misidentified sites.

3.3.3 Results of the INFyS1 Data Set

A listing of the U values and p-values from the test is included in Appendix B. For clarity, the p-values are reported here by relative significance, where 1 refers to the smallest p-value (Table 3.3).

Table 3.3: Ranked p-values from Mann-Whitney; INFyS1
The p-values from the Mann-Whitney U test of the canopy coverage (P_Coverage) and number of trees per site (P_Trees) are reported in ranked order with 1 being the smallest value.

Forest type group               P_Coverage   P_Trees
All Sites                       2            1
Temperate Forest                3            5
Tropical Forest                 8            4
Pine                            13           14
Oak                             7            10
Pine-Oak mix                    6            11
Low dry deciduous jungle        15           12
Medium semi-deciduous jungle    16           9
High evergreen jungle           18           17
Erosion presence                20           19

3.3.3.1 Canopy coverage. All p-values were well below 0.001. The results imply that the distribution of canopy coverage is different between the correctly identified and misidentified sites.

3.3.3.2 Number of trees. All p-values were well below 0.001. This implies that the distribution of number of trees is different between the correctly identified and misidentified sites.


3.4.4 Receiver Operator Curves (ROC) and Area Under the Curve (AUC)

In order to capture the potential of the canopy coverage and number of trees values associated with the INFyS data sets, the Receiver Operator Curve (ROC) and Area Under the Curve (AUC) calculations were applied. These analyses were performed on the measured attributes of canopy coverage and number of trees. These attribute values are representative of forest density at the sampling site locations. Since GL30 is based on remotely sensed data, these measurements are believed to be good indicators of the likelihood of a correct classification at a given location by the GL30 Forest class. The logic is that areas with higher forest density will be more likely to be correctly identified as forest by GL30.

The ROC is a visual representation of the data whereas the AUC is a numerical representation of the predictive power of the relationship. AUC values are comparable across different data sets and are the primary reporting tool for this analysis. The AUC is best interpreted as the probability that a randomly chosen correctly identified site will have a higher coverage or number of trees value than a randomly chosen misidentified site for that given forest type group (Fawcett, 2006). An AUC of 0.50 is the equivalent of a random guess, or no predictive power. As the AUC approaches 1 the predictive power increases. See Fawcett (2006) for a more complete description of the process.

It should be noted that for all forest type groups from both INFyS data sets, the AUC values were always above 0.50. This implies that, across the board, canopy coverage and number of trees at a location have a positive predictive value for whether or not GL30 will classify that site as forest.
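The pairwise interpretation of the AUC given above can be illustrated directly. The sketch below counts, over every pairing of a correctly identified site with a misidentified site, how often the correctly identified site has the higher attribute value, with ties counted as half. The function name `auc_pairwise` and the canopy coverage values are hypothetical and are not drawn from the INFyS data.

```python
def auc_pairwise(correct_values, missed_values):
    """Share of (correct, missed) pairs in which the correct site has the higher value.

    Ties count as half a win, which matches the usual AUC convention.
    """
    wins = 0.0
    for c in correct_values:
        for m in missed_values:
            if c > m:
                wins += 1.0
            elif c == m:
                wins += 0.5
    return wins / (len(correct_values) * len(missed_values))

# Hypothetical canopy coverage (percent) at correctly identified and misidentified sites.
correct_cover = [80, 60, 70, 40]
missed_cover = [30, 50, 65]
auc = auc_pairwise(correct_cover, missed_cover)
print(auc)
```

This quantity equals the Mann-Whitney U statistic of the correctly identified group divided by the product of the two group sizes, which is why distributions that the Mann-Whitney test finds to be different are expected to yield AUC values away from 0.50. The nested loop is quadratic in the number of sites, so a rank-based calculation would be preferable for data sets of the size used in this study.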


3.4.5 INFyS0

3.4.5.1 Canopy coverage. The AUC value for coverage varied little between forest groups (Table 3.4, Figure 3.3); the range is only 0.09. The best predictor was the Erosion presence class and the worst predictor was the Low dry deciduous jungle.

3.4.5.2 Number of trees. Compared to the canopy coverage values, the AUC for number of trees is generally lower and has more variability between forest groups. The values range from 0.59 in the Oak group to 0.80 in the Medium semi-deciduous jungle. This produces a range of 0.21, which is slightly more than two times that found in the canopy coverage values (Table 3.4).

Table 3.4: AUC Values from INFyS0 and INFyS1
Identifies the area under the curve value for both the INFyS0 and INFyS1 data sets for the canopy coverage and number of trees per site values.

                                INFyS0         INFyS0      INFyS1         INFyS1
Forest type group               AUC_Coverage   AUC_Trees   AUC_Coverage   AUC_Trees
All Sites                       0.71           0.67        0.68           0.69
Temperate Forest                0.71           0.61        0.69           0.64
Tropical Forest                 0.71           0.76        0.66           0.76
Pine                            0.71           0.67        0.69           0.68
Oak                             0.70           0.59        0.68           0.63
Pine-Oak mix                    0.72           0.61        0.69           0.63
Low dry deciduous jungle        0.69           0.68        0.65           0.70
Medium semi-deciduous jungle    0.71           0.80        0.70           0.80
High evergreen jungle           0.75           0.77        0.72           0.78
Erosion presence                0.78           0.76        0.77           0.76
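As a quick arithmetic check, the ranges quoted in the text can be re-derived from the values in Table 3.4. The sketch below transcribes the table in its forest group order; the dictionary keys and the helper name `auc_range` are illustrative choices.

```python
# AUC values transcribed from Table 3.4, in the table's forest group order
# (All Sites ... Erosion presence).
auc_values = {
    "INFyS0 coverage": [0.71, 0.71, 0.71, 0.71, 0.70, 0.72, 0.69, 0.71, 0.75, 0.78],
    "INFyS0 trees":    [0.67, 0.61, 0.76, 0.67, 0.59, 0.61, 0.68, 0.80, 0.77, 0.76],
    "INFyS1 coverage": [0.68, 0.69, 0.66, 0.69, 0.68, 0.69, 0.65, 0.70, 0.72, 0.77],
    "INFyS1 trees":    [0.69, 0.64, 0.76, 0.68, 0.63, 0.63, 0.70, 0.80, 0.78, 0.76],
}

def auc_range(values):
    """Spread between the highest and lowest AUC, rounded to two decimals."""
    return round(max(values) - min(values), 2)

ranges = {name: auc_range(values) for name, values in auc_values.items()}
for name, value in ranges.items():
    print(name, value)
```

The INFyS0 results reproduce the 0.09 and 0.21 ranges reported for canopy coverage and number of trees.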


Figure 3.3: AUC Values for INFyS0
Area under the curve values for canopy coverage and number of trees for a given forest group.
[Chart: AUC for INFyS0 Canopy Coverage and Number of Trees. Axis: Area Under the Curve, 0.55 to 0.85, by forest group. Series: Number of Trees, Canopy Coverage.]

3.4.6 INFyS1

3.4.6.1 Canopy coverage. The AUC values for canopy coverage varied little outside of the Erosion presence class. The lowest value of 0.65 was found in the Low dry deciduous jungle group and the highest was found in the Erosion presence group at 0.77. The range in value between the two is 0.12 (Table 3.4).

3.4.6.2 Number of trees. The AUC values for the number of trees at a site varied across the forest groups. The highest value of 0.80 was found in the Medium semi-deciduous jungle and the lowest value of 0.63 was shared between the Oak and Pine-Oak


mix. The range is 0.17 (Table 3.4).

Figure 3.4: AUC Values for INFyS1
Area under the curve values for canopy coverage and number of trees for a given forest group.
[Chart: AUC for INFyS1 Canopy Coverage and Number of Trees. Axis: Area Under the Curve, 0.55 to 0.85, by forest group.]

3.5 Comparison of AUC values between the INFyS0 and INFyS1 data sets

Both GL30 data sets, 2000 and 2010, used the same classification methodology. Therefore, differences between the AUC results may be more reflective of differences in the INFyS data sets.

3.5.1 Canopy coverage

Interestingly, all the AUC values for canopy coverage decreased between the INFyS0 and INFyS1 data sets. While these differences are minor, never more than 0.05, the ubiquity


of the decrease is worth noting (Figure 3.5).

Figure 3.5: AUC values for canopy coverage
AUC values for canopy coverage for the INFyS0 and INFyS1 data sets.
[Chart: AUC for Canopy Coverage. Axis: 0.55 to 0.85, by forest group. Series: INFyS1, INFyS0.]

3.5.2 Number of Trees

The AUC values for the number of trees either stayed the same or increased between the INFyS0 and INFyS1 data sets. The largest increase was found in the Oak forest group at 0.04 (Figure 3.6).


Figure 3.6: AUC values for Number of Trees
AUC values for the number of trees for both the INFyS0 and INFyS1 data sets.
[Chart: AUC for Number of Trees. Axis: 0.55 to 0.85, by forest group. Series: INFyS1, INFyS0.]


CHAPTER IV

DISCUSSION

The release of the GL30 global land cover data marked a dramatic increase in the resolving power of global land cover data sets (Chen et al. 2015). The ten land cover classes of the GL30 were mapped at 30 m resolution. The extra detail offered by this finer resolution data set has important implications for future land cover and land use studies. Because the data were released in 2015, only a limited number of studies have assessed the accuracy of the GL30.

The goal of this study was to perform an assessment of the representational accuracy of the forest class of the GL30 for the temperate and tropical forests of Mexico. This work contributes to the overall understanding of the success of GL30 as a land cover product because it is the first independent study to evaluate GL30 in environments south of 30 degrees north latitude. This study differs from other assessments of the GL30 in that it assesses only a single land cover class, and it adds a significant level of understanding as to how well the GL30 Forest class identifies various forest types.

The INFyS data sets enable a robust and detailed assessment of ten distinct forest type groups. These groups are representative of specific ecological conditions, and the results of this study can be extended to serve as a proxy for how well GL30 will represent similar forest types around the world. Through the application of the Receiver Operator Curve (ROC) and Area Under the Curve (AUC) assessments, this study was able to incorporate the forest density metrics of canopy coverage and number of trees to assess how those values affect the ability of the GL30 Forest class to identify forest.


4.1 Summary of Findings

4.1.1 Percent Correct by Intersect

Many important findings come from the percent correct by intersect method. Most significantly, all forest types have a producer's accuracy of 72.3% or higher (Figure 3.1). The second factor worth noting is that there is very little difference, less than 3%, when comparing the percent correct values between the INFyS0 and INFyS1 data sets. This consistency across multiple tests adds weight to the accuracy of the results. The tropical forests were better represented than the temperate forests by the GL30 Forest class, with the medium semi-deciduous jungle and high evergreen jungle obtaining percent correct values of over 90%. Forest classes that have a deciduous character, such as oak and low dry deciduous jungle, were the worst represented of the forest groups, suggesting that the seasonal variability of these forests may not be well accounted for in the GL30 process. Lastly, the Erosion presence forest type class performed as well as the All Sites class, which suggests that the presence of marginal land conditions does little to affect the GL30 representation of the forests of Mexico.

4.1.2 Percent Correct by Area

The reasoning for applying the percent correct by area method was to reduce the uncertainty within the relationship between the GL30 Forest class and the INFyS data. The results show that by reducing the uncertainty the percent correct values increased. Among the ten forest type groups, only one showed a decrease in the percent correct from the percent correct by intersect method to the percent correct by area method (Figure 4.1). This was the Erosion presence class of the GL30_2010 INFyS1 relationship, which


decreased by 0.3%. This highlights two well understood components of remote sensing based land cover assessment: 1) internal core areas that represent a homogeneous land cover type are well classified by remote sensing based land cover classification methods, and 2) areas of ecological transition that relate to changes in land cover classes are difficult to classify (McCallum et al. 2005). The percent correct by area method clearly illustrates this relationship with the GL30 data because all transitional sites that were only partially covered by forest were excluded. This left only core areas of the GL30 Forest class or core areas of another GL30 class.

Figure 4.1: Change in percent correct between intersect and area
The difference between the producer's accuracy of the percent correct by intersect and percent correct by area methods for both the INFyS0 and INFyS1 data sets.
[Chart: Percent Change from Intersect to Area, by forest group. Series: percent change INFyS1, percent change INFyS0.]


The increase in percent correct was found in both data sets, but the increases within the INFyS1 data set were relatively modest (Figure 3.2). This is because out of the 14,451 sampling sites only 352 were found to be partially covered by forest. Therefore, the impact that these sites could have on the overall change in the percent correct value for the various forest type groups is limited. Percent correct values for the forest type groups increased between 0.4 and 1.1%, excluding the Erosion presence forest type group, which decreased 0.3%.

The GL30_2000 INFyS0 relationship was highly affected by this method. Among the 60,580 sampling sites, 10,801 were classified as partially covered by forest. This resulted in a dramatic increase in the percent correct values of the ten forest type groups. The increase ranged from 2.3% for the medium semi-deciduous jungle forest type group to 7.2% for the Erosion presence forest type group. In general, the forest type groups with the lower percent correct values, such as oak and low dry deciduous jungle, were most strongly affected by the elimination of partially covered by forest sites. This implies that these forest type groups are more likely to be found in transition environments where land cover changes frequently over space (Figure 4.2).
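The difference between the two percent correct methods can be sketched as follows. This is an illustration under stated assumptions rather than the workflow used in the study: it assumes each site carries a flag for whether its location intersects a GL30 Forest cell and a fraction giving how much of its area is mapped as forest. The `Site` record, both field names, and the sample values are hypothetical.

```python
from typing import List, NamedTuple

class Site(NamedTuple):
    centroid_in_forest: bool  # hypothetical: site location intersects a GL30 Forest cell
    forest_fraction: float    # hypothetical: share of the site area mapped as GL30 Forest

def percent_correct_by_intersect(sites: List[Site]) -> float:
    """Percent of all sites whose location intersects the Forest class."""
    hits = sum(1 for s in sites if s.centroid_in_forest)
    return 100.0 * hits / len(sites)

def percent_correct_by_area(sites: List[Site]) -> float:
    """Percent correct after dropping sites only partially covered by forest."""
    core = [s for s in sites if s.forest_fraction in (0.0, 1.0)]
    hits = sum(1 for s in core if s.forest_fraction == 1.0)
    return 100.0 * hits / len(core)

# Eight made-up sites; two of them are partially covered by forest.
sites = [
    Site(True, 1.0), Site(True, 1.0), Site(True, 0.6), Site(False, 0.4),
    Site(False, 0.0), Site(True, 1.0), Site(False, 0.0), Site(True, 1.0),
]
by_intersect = percent_correct_by_intersect(sites)
by_area = percent_correct_by_area(sites)
print(by_intersect, round(by_area, 1))
```

Because the partially covered sites are removed from both the numerator and the denominator, the size of the change depends on how many such sites there are, which is consistent with the modest shift reported for INFyS1 (352 partial sites) and the larger shift for INFyS0 (10,801 partial sites).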


Figure 4.2: Relative Density of Partially Covered by Forest Sites from the GL30_2000 Data Set
This map shows the relative density of partially covered by forest sites within the GL30_2000 INFyS0 relationship. The light green area represents the locations where GL30 mapped forest for 2000. A 40% transparency is applied to the partially covered by forest layer to allow the underlying forested areas to show through.
[Map legend: Partially covered by forest Site Locations; GL30 Forested Area.]

4.1.3 Receiver Operator Curve (ROC) and Area Under the Curve (AUC)

For the purpose of comparison and interpretation, the AUC values are more generally useful than the graphical representations of the ROC curves, because the AUC provides a single value that is comparable across other AUC analyses. The AUC value can be viewed as the likelihood that a randomly chosen correctly identified value will have a higher score


than a randomly chosen misidentified value. An AUC value of 1 indicates a perfect predictor and an AUC of 0.5 indicates a predictive quality that is the same as random chance. An AUC value of 0.7 implies that a correctly identified site will have a higher value for canopy coverage or number of trees than a misidentified site 70% of the time. Within this study all forest type groups had an AUC value of over 0.5. This was an expected result, based on the results of the Mann-Whitney test, which showed that the distributions of the correctly identified and misidentified sites were indeed different. All ROC and AUC tests were performed on the subsets of the INFyS0 and INFyS1 data sets that excluded the partially covered by forest sites.

The AUC value for canopy coverage varied distinctly between the INFyS0 and INFyS1 data (Figure 3.5). In all of the ten forest type groups the INFyS0 data had a higher AUC value than the INFyS1 data. This implies that, in general, the canopy coverage of correctly identified and misidentified sites is more distinct within the INFyS0 data set than within the INFyS1 data set. The AUC values ranged between 0.65 and 0.78. While the Erosion presence and high evergreen jungle forest type groups had distinctively higher AUC values, the variability among the other forest type groups is rather low. This implies that the ability of the GL30 Forest class to distinguish between forested and non-forested lands based on canopy coverage is consistent across the majority of the forest type groups.

The AUC values for the number of trees varied between the INFyS0 and INFyS1 data sets, with the INFyS1 generally having a higher AUC value than the INFyS0 (Figure 3.6). This implies that there is a greater distinction in the number of trees between the correctly identified and misidentified sites within the INFyS1 data set than within the INFyS0 data set. Four of the ten forest


type groups had AUC values above 0.75, implying a very high degree of predictive quality relative to other forest type groups within this study. Tropical forests were distinguished well by the total number of trees at a given location.

4.2 Comparison of Producer's Accuracies with Existing Studies

The accuracy assessment performed in this study is contrasted here against information contained in four published research studies: Chen et al. 2015, Brovelli et al. 2015, Arsanjani et al. 2016, and Sun et al. 2016. The values from the percent correct by intersect method can be compared against the results from all four studies, the values from the percent correct by area method can only be compared against those from the Brovelli study, and the AUC and ROC values cannot be directly compared to any of the four studies. The values taken from the four studies represent the producer's accuracy metric. The calculation of the producer's accuracy is equivalent to the percent correct calculations, so the terms are interchangeable.

The producer's accuracies from the percent correct by intersect method in this study were generally lower than the values reported in the four established studies. The Chen et al. 2015 study reports a single producer's accuracy for the forest class of 92.40%. Considering that this value was taken from a region above 30 degrees north, it represents a much higher level of accuracy than what was seen in the temperate forest classes of this study. Brovelli et al. 2015 reported producer's accuracies of 85% for a reclassified class that includes forest and four other GL30 land cover types, 90% for the same reclassified class buffered to account for the co-location effect, and 81% for the forest class by itself. The increase in accuracy seen when accounting for potential positional error of the data set matches the


results seen in this study very well. The assessment of the forest class by itself is approximately 7% higher than the value reported for temperate forest within this study. Arsanjani et al. 2016 reported four producer's accuracies from four unique land cover validation sets. These values (23.65%, 82.89%, 84.63% and 93.97%) were all reported for a reclassified class that includes forest and four other GL30 land cover types. Excluding the lowest value, the remaining three were all significantly higher than the percent correct value found for the temperate forest class in this study. Sun et al. 2016 combined the forest and shrubland classes of the GL30 and reported a producer's accuracy of 34%. This value is extremely low and was not seen in any of the ten forest type groups within this study.

Due to the various assessment methods employed by each study, making a direct comparison between them is challenging. It is important to note that all these studies relied on existing land cover data sets or user interpreted high resolution imagery and took place in temperate forests north of Mexico. Overall, the reported values were higher than what we found in this study. The variance among all the studies suggests that regional level assessments provide an important perspective on how the GL30 Forest class identifies forest in different geographic locations.

4.3 Evaluation of the Study

The introduction of GlobeLand30 to the world represents a significant increase in the resolving power of global land cover data sets. As GL30 is validated across different locations, the significance and importance of the data will become clear to the land cover research community.


This project provides a detailed perspective on how GL30 represents various types of forest. The INFyS data were collected to provide a spatially continuous survey that accounted for the various forest types of Mexico. As a result, there is good coverage across the forested areas of Mexico. Because all of the sampling sites contain spatial information, it is very clear where misrepresentation occurs. This is an important advantage compared to raster based assessments.

GlobeLand30 contains a single land cover class for forest. This simplification can mask details about the accuracy of specific forest types. For instance, in this study it was found that the medium semi-deciduous jungles were very well represented by the GL30 Forest class but the low dry deciduous jungle was not well represented. The level of detail of this analysis may be the most valuable aspect of this study. All the forest types within the study are tied to certain ecological conditions. In other regions around the world, where similar forest types are found in similar ecological conditions, the results of this study could be used as a general proxy for how the GL30 Forest class will represent that class of forest.

The primary challenge in comparing the results of this study to other evaluations of the GL30 data set is that the INFyS data set only contains information on where forest is present and nothing about where it is not present. This means that only the producer's accuracy can be addressed. The other common evaluation methods, which include user's accuracy, overall accuracy and derivatives of those, cannot be addressed as part of this study.

This work is unique with respect to the sheer number of validation sites present. There are a total of 72,056 ground verified sites against which the Forest class of the GL30 was tested. The initial accuracy assessment of the GL30 reported in Chen et al. 2015


was done using 159,874 pixels for all ten land cover classes. These pixels were distributed across a selection of the 847 map sheets that were used to create the GL30 (Chen et al. 2015). As a result, the density of validation sites used in this study is significantly greater than in any existing study that employs user interpreted or ground verified sites for validation.

The ROC and AUC process shows the significance of the forest density metrics of number of trees and canopy coverage in understanding how the GL30 Forest class represents a location as forest or not. For instance, the GL30 Forest class is more likely to represent a tropical forest as forest if there is a high number of trees at the site. In the same sense, a temperate forest with a high value for canopy coverage is more likely to be represented as forest by the GL30 Forest class.

The most unique aspect of this study is the use of the ROC and AUC methodology. These methods capture the quality of the forest inventory data sets. One of the most powerful aspects of the ROC test is that it can provide a means for predicting the likelihood that the GL30 Forest class will identify an area as forest based on the forest type and either the number of trees or the canopy coverage at the site. If a forest density metric of canopy coverage or number of trees per 400 m² area is known, it can then be determined how many sites belonging to that forest type group were correctly identified above that threshold and how many sites were


misidentified below that threshold. From these two values, the probability of that location being accurately represented in the GL30 Forest class could be estimated. The methodology for this process is still being developed. In concept, it offers a unique means of predicting how well an unvalidated region may be represented by the GL30.

4.3.1 Limitations of the Areal Representation of the Tropical Forest Class

The INFyS data set defined a specific site survey area based on whether the forest type was temperate or tropical. Temperate sites were represented as a 400 m² circle centered on the sampling site centroid. Tropical sites were represented as a 400 m² rectangle with a primary axis of 40 m and a secondary axis of 10 m. An orientation was also applied based on the position of the sampling site within the cluster. Due to the computational challenges of defining the orientation of the sampling site, it was determined that the circular area would be used to represent both temperate and tropical sites. Because the centroid is known, a circle will always include approximately 212 m² of the 400 m² sampling area regardless of the orientation of the tropical sampling site. Alternatives such as using a rectangular form and selecting a known orientation could at most account for 400 m² of the sampling area once per cluster and between 100 m² and approximately 135 m² for the other sites. Summing these options over the four sites of a cluster gives 848 m² for the circle (212 × 4) and 770 m² for the rectangle (400 + 100 + 135 + 135), which indicates that on average the circular area will capture more of the true sampling site area than a rectangular form (Figure 4.3). Accounting for the orientation of the sampling sites is a major opportunity for improvement in the accuracy of the interpretation of the tropical forest types.


Figure 4.3: Tropical Sampling Sites Error in representation
Due to the limitation in the ability to represent the correct orientation of the tropical forest sampling site areas for the INFyS data sets, a circle was used to capture the most possible area of the sampling sites. The circular site will capture 212 m² (purple area on the left image) of the 400 m² area of all four rectangular sampling site areas. The purple area of the two rectangles represents approximately 135 m². The circle is the optimal shape for including the most possible area.

4.3.2 Limitation of Temporal Mismatch of Data Sets

One factor that represents a source of error is the temporal difference between the collection date of the images used within the GL30 classification and the time that the INFyS data were collected. The collection date of the GL30 images is known, but the INFyS data are only loosely defined by the date ranges of 2004 to 2009 and 2009 to 2013. This temporal variance will result in errors in representation because the forest does change over time due to natural and anthropogenic processes. An important factor to consider is that sampling site selection for the INFyS0 data set was determined based on the presence of a forest class in the INEGI Series III land cover data. The INEGI Series III data were mapped and published in 2002. While not explored within this study, multiple methods are outlined


below that could be used to understand the potential error associated with the temporal mismatch of the two data sets.

The first resource for attempting to understand the potential for temporal change in forest cover over a five to ten year period in Mexico is an examination of the existing scientific literature. Studies that summarize the overall change in forest or forest fragmentation will be the primary sources for gathering quantitative measures of how much change is occurring and where it is occurring. This will provide the most distinctive assessment of forest change.

A second resource is a cross tabulation of the GL30_2000 and GL30_2010 data sets. This will provide distinct values of how much forest was mapped in both time periods, how much new forest was classified, and how much forest is no longer classified as forest. This work is currently being published by Moreno et al. 2017.

Lastly, some cluster locations are sampled in both the INFyS0 and INFyS1 data sets. This means that data regarding the number of trees and canopy cover are present for the two temporal periods. Because of the detailed attribute data associated with each site, this overlap of sampling sites provides information about where forest has remained and how much it has changed. While it is unclear how valuable this process may be, there is potential that it can add to the understanding of how the forest has changed over time.

While close consideration must be given to addressing the temporal variance between the validation set and the GL30 data, recent work has shown that rates of forest change across Mexico have lowered into the 2000s (Moreno et al. 2014). This implies that


the temporal mismatch may not represent a significant source of error within the validation of the GL30 data set.

4.3.3 Image Capture Date and Leaf on/Leaf Off Character of Deciduous Forest

The spectral reflectance of many forests changes seasonally. This is especially pronounced in deciduous forests that lose their leaves. Given that the image capture date is known for all images used in the GL30 process, a simple test could be performed to see if the leaf on or leaf off character of specific forest types was captured by the image. Research into the ecological character of the forest types defined within the INFyS data will show which forests lose leaves seasonally. The general date ranges for when the forests have no leaves will be recorded. Using this information, the image capture date for forest types that seasonally lose leaves will be compared to the dates of the leaf off season. One can then predict whether the forest had leaves or not when the image was captured, a condition that greatly changes the spectral reflectance of the landscape. This represents an important first step in any future line of research.

In all assessments the low dry deciduous jungle forest type group, which contains multiple deciduous forest types, and oak, which is a deciduous forest, have had the lowest representational accuracies of any of the ten forest type groups. An initial examination of the image capture dates from the GL30_2000 data set shows that the majority of images were captured from September to December during the 1999 and 2001 time periods (Figure 4.4). This coincides with the drier and cooler period of the year when the forests drop their leaves. Further analysis of this step will add to the understanding of how GL30 classifies deciduous forest.
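The proposed leaf on/leaf off comparison could be sketched as a simple date check. The example below is hypothetical throughout: the function name `in_leaf_off_season`, the November to April window, and the capture dates are placeholders chosen for illustration, and real leaf off windows would need to come from the ecological literature for each forest type.

```python
from datetime import date

def in_leaf_off_season(capture, start_month, end_month):
    """True if the capture month falls inside the leaf off window (inclusive).

    The window may wrap over the new year, e.g. November (11) to April (4).
    """
    month = capture.month
    if start_month <= end_month:
        return start_month <= month <= end_month
    return month >= start_month or month <= end_month

# Hypothetical leaf off window for a deciduous forest type: November to April.
LEAF_OFF_START, LEAF_OFF_END = 11, 4

# Made-up image capture dates for illustration.
captures = [date(1999, 11, 15), date(2001, 7, 1), date(2000, 2, 1)]
flags = [in_leaf_off_season(c, LEAF_OFF_START, LEAF_OFF_END) for c in captures]
print(flags)
```

Applied to each image and forest type pairing, a flag of this kind would indicate which sites were probably imaged without leaves, and those sites could then be compared against the misidentified sites.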


Figure 4.4: Image capture date for the INFyS0 data set
This plot illustrates the collection time for the 191 Landsat images that were used to create the GL30_2000 land cover classification. Each point represents a single Landsat image. There is a high concentration of images around November in both 1999 and 2001.

4.3.4 Variation within the Cluster Groups

Within the INFyS data set the cluster is a rigidly defined structure. There is, however, variation within this sampling structure due to accessibility limitations from terrain and private land and/or there being no trees present within a specific site. As a result, the majority of clusters have four sites, yet clusters do exist with three, two, and even one site. Since the lack of a sampling site could suggest no forest, it is possible that the clusters with fewer than four sites are more likely to be representative of transitional areas within the forest and hence more likely to be misclassified.

To address this question, clusters could be grouped based on the number of sites within them. On these groups a similar set of tests (percent correct by intersect, percent correct by area, and ROC/AUC analysis) could be performed. Testing to see how many of the sites fall within clusters that have fewer than four sites total, or how many of the sites are partially covered by


forest sites may demonstrate the forest variability of the location. An important limitation of these results would be that it is not possible to determine whether any of the missing sites within a cluster are directly attributable to the lack of trees. Terrain and political boundaries could be the limiting factor. Due in part to the planning associated with the site selection, it is the view of the author that it is far more likely that the clusters with fewer than four sites represent transitional zones in the forest where no trees were found.

4.3.5 Semantic Difference in the Definition of Forest between the Two Data Sets

These data sets are produced by different processes and different organizations. It must be noted that some of the error in representation is likely due to semantic variation in what qualifies as forest between the two data sets. The exact spectral threshold by which GL30 defines forest is unknown. The classification process does provide some framework, and it is known that the forest type groups must have at least 30% ground cover. Sampling sites for the INFyS data are defined by the national level 5 by 5 km grid and the location of that grid point within an INEGI defined forest type group. Beyond this, sites must have at least one tree. It is unclear how often a single tree in a 400 square meter area can match the 30% ground cover in a 900 square meter cell required to meet the GL30 definition of forest. To account for this, it may be possible to remove sites that have a cobertura value that does not equate to 30% of the sampling site. This was avoided within this study because there is too much uncertainty in attempting to take an average value for a 400 square meter site and compare it to one or more 900 square meter cells. It may be possible to treat all four sites in the cluster as one, which gives a 1600 square meter area. Accepting this semantic


difference exists was deemed more appropriate than applying this method. Future evaluations may change this interpretation.

4.3.6 Importance of the Validation Techniques

The use of the INFyS data set provides many advantages over validation methods that use existing remote sensing derived data sets. The primary advantage is that all sites were verified as forest by a person who physically visited the location. In addition, it is known where GL30 correctly identified sites and where it did not. The inclusion of specific forest types allows for the extension of these results to other areas with similar forest types. These specific qualities greatly increase the importance of the validation.

4.3.7 Ground Verified Sites versus Remotely Sensed Sites

The standard practice for validating a land cover data set is to either use an existing land cover data set or create validation sites through user interpretation of high resolution remotely sensed data. Both of these methods offer advantages and disadvantages. The most important aspect of the chosen validation set is that it should be highly accurate to ensure that the assessment of the land cover data set is accurate as well (Stehman, S.V. and Czaplewski, R.L., 1998). While the semantic and temporal variance between the GL30 and the INFyS data does introduce an inherent level of error, the data represent a very accurate validation set.
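
The grouping of clusters by number of sites proposed in Section 4.3.4 could be sketched as follows. This is only an illustration under stated assumptions: the function name, the input layout (one cluster identifier and one true/false GL30 forest flag per site), and the sample records are all hypothetical and are not taken from the INFyS data.

```python
from collections import defaultdict

def percent_correct_by_cluster_size(sites):
    """Percent of sites mapped as forest by GL30, grouped by cluster size.

    sites: iterable of (cluster_id, gl30_is_forest) pairs, one per site.
    Returns {number_of_sites_in_cluster: percent_correct}.
    """
    by_cluster = defaultdict(list)
    for cluster_id, gl30_is_forest in sites:
        by_cluster[cluster_id].append(bool(gl30_is_forest))

    totals = defaultdict(lambda: [0, 0])  # size -> [correct, total]
    for flags in by_cluster.values():
        size = len(flags)
        totals[size][0] += sum(flags)
        totals[size][1] += size

    return {size: 100.0 * correct / total
            for size, (correct, total) in sorted(totals.items())}

# Made-up records: one cluster each with three, four, and one site.
sample_sites = [
    (1, True), (1, True), (1, False),
    (2, True), (2, True), (2, True), (2, True),
    (3, False),
]
result = percent_correct_by_cluster_size(sample_sites)
```

If clusters with fewer than four sites do represent transitional zones, their percent correct values would be expected to fall below those of the complete clusters.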


CHAPTER V

CONCLUSION

The goal of this project was to evaluate how well GL30's single Forest class represented the various forest types of Mexico. Two geographically distributed and ground verified forest survey data sets were used to validate the GL30. The producer's accuracy of the GL30 was reported for ten major forest type groups based on a simple intersect and a percent area of forest between the GL30 and INFyS sites. How canopy coverage and number of trees per site affect GL30's representation of the forest was tested using the ROC and AUC methods, which give a unique insight into why GL30 may be misrepresenting certain areas. This study is exclusive to the country of Mexico, but due to the level of detail of the analysis the results and conclusions can be applied to other geographic locations with similar forest classes.

The GlobeLand30 global land cover data set offers a dramatically finer minimum mapping unit than other currently available global land cover data sets. This increase in resolution provides the potential for improved precision for regional and smaller scale land cover studies. Created for two time stamps, 2000 and 2010, the GL30 could also be useful for temporal studies such as land use change. Currently there have been no targeted evaluations of the representational accuracy of the GL30 in the subtropical to tropical regions of the world. This study is the first external review of this product specifically for a land mass south of 36 degrees north. A detailed forest inventory survey produced by the Comisión Nacional Forestal is


used to validate the representational accuracy of the GL30 Forest class. This ground verified data contains 72,056 sampling sites collected between the years 2004 and 2013, split into two data sets, INFyS0 and INFyS1. Each sampling location contains information on the forest type, canopy coverage, and the number of trees, which allows for a very detailed assessment of the GL30. A group of ten specific forest combinations was chosen based on their overall representation of the diversity of Mexico's forest. All analysis was performed on these ten classes.

The first method used was the percent correct by intersect. An intersect was performed between the centroid of the sampling site and the GL30 data. From this the percent correct value was determined for the ten forest classes. Due to the nature of the data set this value is equivalent to the more commonly reported assessment value of producer's accuracy. The values found within this study were generally lower than those reported in previous assessments of the GL30 Forest class. Medium semi deciduous and high evergreen jungles had a producer's accuracy of 90% or higher. These values were on par with or above the values reported in the other assessments of the GL30 Forest class. The temperate forest had a consistent producer's accuracy across the three primary groups, which was in the mid seventies.

This method provides a good base level evaluation but rests on two major assumptions: first, perfect positional accuracy between the two data sets, and second, that each sampling site is represented as a point rather than an area. The percent correct by area methodology was developed to account for those assumptions and provide a more accurate measurement of the producer's accuracy of the GL30.
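
The percent correct by intersect calculation can be sketched as follows. Because every INFyS site is known forest, the share of sites whose centroid falls on the GL30 forest class is the producer's accuracy for that forest type group. The function name, the class label "Forest", and the sample records are assumptions made for illustration; the spatial intersect itself is assumed to have been done already.

```python
from collections import Counter

FOREST = "Forest"  # assumed label for the GL30 forest class

def producers_accuracy(records):
    """Percent correct by intersect for each forest type group.

    records: iterable of (forest_type_group, gl30_class_at_centroid) pairs.
    Returns {forest_type_group: percent of sites mapped as forest by GL30}.
    """
    total = Counter()
    correct = Counter()
    for group, gl30_class in records:
        total[group] += 1
        if gl30_class == FOREST:
            correct[group] += 1
    return {group: 100.0 * correct[group] / total[group] for group in total}

# Made-up records, not values from the study.
sample_records = [
    ("Oak", "Forest"), ("Oak", "Grassland"),
    ("Pine", "Forest"), ("Pine", "Forest"),
]
accuracy = producers_accuracy(sample_records)
```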


The second methodology used was the percent correct by area. This was done to account for the two primary assumptions of the percent correct by intersect method. This method accounted for both the area of the sampling sites and the potential positional error associated with the Landsat imagery from which the GL30 was developed. Individual sampling site centroids were buffered to a circle with a radius of 61.28 m. This buffer was rasterized and a cross tabulation was performed to determine how much of each GL30 land cover class fell within the given area. A percent forest value was calculated and joined back to the forest inventory data. Three groups were defined based on the percent forest value: Forest = 100% forest, Not forest = 0% forest, and partially covered by forest = any value between 0 and 100 percent forest. The partially covered by forest sites are locations for which there is a degree of uncertainty as to whether the sites are truly represented as forest. As a result they were excluded from further analysis. A percent correct was calculated based on the forested and not forested sites for all ten forest type groups. This resulted in an overall increase in the percent correct values of the data set. A significant increase in the percent correct values was seen within the INFyS0 data set. This is because just over 10,000 sites were identified as partially covered by forest within this data set. The 17% reduction in the number of sampling sites greatly altered the results of the percent correct calculations. By better accounting for the spatial character of the data, the values from this analysis are a more accurate representation of the producer's accuracy of the GL30.

The third and final assessment methodology uses the continuous data of canopy cover and number of trees per site to test how well those values could be used to


distinguish between correctly represented and misrepresented sites. The ROC curve is developed by moving through various threshold values (canopy coverage or number of trees) of the continuous data and plotting the proportion of correctly represented points (True Positive) against the proportion of incorrectly represented points (False Positive). From this plot an area under the curve value can be calculated. The AUC can be viewed as the likelihood that a randomly selected correctly identified site will have a higher value than a randomly selected misidentified site. The AUC value is directly comparable across data sets. All AUC values within this study were approximately 60% or higher (Appendix B). This means that for all tested cases the number of trees and canopy coverage at a site can be used to predict whether GL30 will represent that location as forest. For tropical forest the number of trees at a site was a more effective distinguisher between correctly represented and misrepresented sites. For the temperate forest the canopy coverage value performed better.

Based on the results of this study, GL30 would be an effective land cover data set for forest studies within Mexico. Percent correct values from both the percent correct by intersect and percent correct by area methodologies were above 70%. The increase in percent correct when the area was accounted for implies that, like most land cover data sets, GL30 does not classify transitional zones as well as core areas. This is especially true of regions with rapid ecological change, such as the temperate forest regions of north central Mexico in the states of Durango and Chihuahua. Evergreen tropical forests performed extremely well, with producer's accuracies consistently above 90%. More modest results were found in the temperate forest classes. The ROC and AUC analyses show that the number of trees and canopy coverage of a site are good indicators of GL30's likelihood of representing that


location as forest.

5.1 Critical Evaluation of Research Project

While the number of sampling sites and the consistency of the INFyS data sets provide a solid framework for understanding the accuracy of the GL30 Forest class, this work is not without limitations. The primary limitation is that, due to the nature of the INFyS data set, it is only known where forest is and nothing is known about where forest is not. This limits the evaluation potential to the Forest class of the GL30 and also eliminates this study's ability to use other evaluation metrics such as user's accuracy, overall accuracy, and similar derivatives. The producer's accuracy represents the only direct comparison that can be made between the results of this study and the results of existing evaluations of the GL30.

This study represents a focused analysis of the representational accuracy of the GL30 Forest class. The Forest class is one of ten GL30 land cover classes. Of these ten classes only eight were represented in Mexico by the GL30. The GL30 land cover classes of tundra and permanent ice and snow were not included in the classification. Since this study only speaks to the Forest class, limited to no conclusions could be drawn about the success of GL30's identification of the seven other land cover classes that it classified within Mexico.

The INFyS data sets are unique as validation sets because they were all verified as forest by a person who physically visited the site. Yet it is likely that the definition of what was considered a forest differs between the INFyS data set


and the GL30 data set. This is in part due to the nature of the collection methods of the INFyS and GL30 data sets. The INFyS data set is ground verified and the GL30 data set is derived from remotely sensed imagery. Add to this the temporal variance in the collection time between the GL30 and INFyS data sets, and it must be understood that errors in this evaluation process are present and cannot be accounted for at this time.

Any study will contain limitations and this one is no different. Much time was spent attempting to account for these limitations and addressing just what effect they will have on the overall interpretation of the accuracy with which GL30 has classified the forest of Mexico. Methods to deepen this assessment are suggested as future work. Possibly the most intriguing of the proposed processes for future work is the assessment of leaf on and leaf off characteristics of the deciduous forest in relation to the image capture date. In general the deciduous forest types were poorly represented compared to the evergreen and semi deciduous forest types. This could be a product of the environment in which these forests are found, or it could be a limitation of GL30's ability to determine the appropriate time frame for image selection. The methods for this process could be applied to any forested area classified by GL30, because GL30 provides the image capture dates within the land cover data.

One significant source of error in the current methodology is the representation of tropical forest survey sites as circles rather than the directionally oriented rectangles which they are. To determine the orientation of a sampling site, information about the cluster that it is a part of must be referenced. The process would require advanced data


analysis work that is currently beyond the capability of the primary researcher. The current assessment method only accounts for just over half of the actual area of the tropical forest sampling sites.

All the values of this report are generalized statements about the character of the GL30 Forest class as a whole. To expand upon the potential of the ROC analysis, it could be possible to develop a process that predicts whether GL30 would classify a given location as forest based on the specific values for canopy coverage or number of trees at a site. The use of such an application is unknown and so it was not pursued as part of this analysis. Given the right motivation it may provide a useful tool for further on the ground assessments of the GL30 Forest class. The groundwork of the assessment has been laid, and the future of this analysis is dependent in large part on the needs of the users involved with the project.

5.2 Summary of Project

Going back to the introduction, the importance of any land cover data set is due in part to how well it represents the actual features on the ground. The most important aspect of this assessment is the inclusion of the specific forest type groups within the encompassing GL30 Forest class. This precision allows for the interpretation of the representational accuracy of the GL30 Forest class at different spatial and ecological scales. This is because the selected forest type groups are generally tied to specific geographic locations. This specificity allows for the interpretation of the results at more regional scales of analysis. For example, a study looking at forest fragmentation could use this product with the understanding that the results of the analysis from the Yucatan


Peninsula will be much more accurate than the results from the eastern slope of the Sierra Madres. Being able to alter the interpretation of the results of a study based on the forest class of interest adds assessment value to the GL30 as a whole.

The GL30 Forest class has very successfully represented the medium semi deciduous jungles and the high evergreen jungles of Mexico. These tropical forest types are characterized by dense canopies and complex understory. These forest type groups are common throughout Central and South America and it is likely the GL30 Forest class will represent those forests well.

The GL30 is the first comprehensive global land cover data set to map land cover at the 30 m resolution. The increased resolving power of this data set makes it an important resource for regional scale analysis of land cover and land use change. This study represents the first independent assessment of the GL30 data set at latitudes south of 30 degrees north and is an important step in determining how well GL30 represents subtropical and tropical environments. The unique and detailed assessment of the GL30 Forest class provides an evaluation of multiple different forest types. The results indicate a producer's accuracy ranging from 72.3% to 97.3% depending on the forest type group. The ROC and AUC analysis illustrated how sensitive the representational accuracy of the GL30 Forest class is to the measures of canopy coverage and the number of trees at the location. As a whole, tropical forests were better represented than temperate forests. Deciduous forest types were the most poorly represented classes. Compared to existing validations of the GL30 Forest class, these results are generally lower than those found in 3 of the 4 validation studies. Overall the GL30 did capture the diversity of the forest of Mexico well and the


increased spatial resolution it offers makes it an appealing data set for future forest cover change studies in Mexico.
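
The three-way grouping used in the percent correct by area method described earlier in this chapter could be sketched as follows. This is a simplified illustration under stated assumptions: the buffering, rasterization, and cross tabulation are assumed to have been done already, each site is reduced to a hypothetical pair of cell counts (forest cells, total cells within the buffer), and the function names are not from the original workflow.

```python
def classify_site(forest_cells, total_cells):
    """Assign a buffered site to one of the three percent forest groups."""
    if forest_cells == total_cells:
        return "forest"          # 100% of the buffer is GL30 forest
    if forest_cells == 0:
        return "not forest"      # 0% of the buffer is GL30 forest
    return "partial"             # anything in between is excluded later

def percent_correct_by_area(sites):
    """Percent correct after dropping partially covered sites.

    sites: iterable of (forest_cells, total_cells) pairs, one per site.
    Returns None when every site is only partially covered.
    """
    labels = [classify_site(f, t) for f, t in sites]
    kept = [label for label in labels if label != "partial"]
    if not kept:
        return None
    return 100.0 * kept.count("forest") / len(kept)

# Made-up cell counts, not values from the study.
sample = [(12, 12), (0, 12), (5, 12), (12, 12)]
pct = percent_correct_by_area(sample)
```

Excluding the partially covered sites shrinks the denominator, which is why the percent correct values rose most in the data set with the largest share of partial sites.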


REFERENCES

Arino, O., et al. (2007). GlobCover: ESA service for global land cover from MERIS. Geoscience and Remote Sensing Symposium, 2007. IGARSS 2007. IEEE International, IEEE.
Arsanjani, J. J., et al. (2016). "GlobeLand30 as an alternative fine scale global land cover map: challenges, possibilities, and implications for developing countries." Habitat International 55: 25-31.
Brovelli, M. A., et al. (2015). "The first comprehensive accuracy assessment of GlobeLand30 at a national level: Methodology and results." Remote Sensing 7(4): 4191-4212.
Chen, J., et al. (2015). "Global land cover mapping at 30m resolution: A POK based operational approach." ISPRS Journal of Photogrammetry and Remote Sensing 103: 7-27.
Clay, E., et al. (2016). "National Assessment of the Fragmentation Levels and Fragmentation Class Transitions of the Forests in Mexico for 2002, 2008 and 2013." Forests 7(3): 48.
Comber, A., et al. (2004). "Integrating land cover data with different ontologies: identifying change from inconsistency." International Journal of Geographical Information Science 18(7): 691-708.
Costa, H., et al. (2014). "Combining per pixel and object based classifications for mapping land cover over large areas." International Journal of Remote Sensing 35(2): 738-753.
Fischer, J. and D. B. Lindenmayer (2007). "Landscape modification and habitat fragmentation: a synthesis." Global Ecology and Biogeography 16(3): 265-280.
Foley, J. A., et al. (2005). "Global consequences of land use." Science 309(5734): 570-574.
Foody, G. M. (2002). "Status of land cover classification accuracy assessment." Remote Sensing of Environment 80(1): 185-201.
Frazier, P. S. and K. J. Page (2000). "Water body detection and delineation with Landsat TM data." Photogrammetric Engineering and Remote Sensing 66(12): 1461-1468.
Friedl, M. A., et al. (2010). "MODIS Collection 5 global land cover: Algorithm refinements and characterization of new data sets." Remote Sensing of Environment 114(1): 168-182.
Fritz, S., et al. (2012). "Geo Wiki: An online platform for improving global land cover." Environmental Modelling & Software 31: 110-123.
Giri, C., et al. (2013). "Next generation of global land cover characterization, mapping, and monitoring." International Journal of Applied Earth Observation and Geoinformation 25: 30-37.
Gong, P., et al. (2013). "Finer resolution observation and monitoring of global land cover: First mapping results with Landsat TM and ETM+ data." International Journal of Remote Sensing 34(7): 2607-2654.
Groombridge, B. and M. Jenkins (2002). World atlas of biodiversity: earth's living resources in the 21st century. Univ of California Press.
Hansen, M., et al. (2000). "Global land cover classification at 1 km spatial resolution using a classification tree approach." International Journal of Remote Sensing 21(6-7): 1331-1364.
Hansen, M. C., et al. (2013). "High resolution global maps of 21st century forest cover change." Science 342(6160): 850-853.
Instituto Nacional de Geografía e Informática. Guía Para la Interpretación de la Cartografía uso del Suelo y Vegetación, Escala 1:250,000 Serie III; INEGI: Aguascalientes, Mexico, 2009; p. 77. Available online: http://www.inegi.org.mx/prod_serv/contenidos/espanol/bvinegi/productos/geografia/publicaciones/guias-carto/sueloyveg/1_250_III/Suelo_Vegeta.pdf (accessed on 1 August 2015).
Instituto Nacional de Geografía e Informática. Guía Para la Interpretación de la Cartografía uso del Suelo y Vegetación, Escala 1:250,000 Serie IV; INEGI: Aguascalientes, México, 2012; p. 132. Available online: http://www.inegi.org.mx/prod_serv/contenidos/espanol/bvinegi/productos/geografia/publicaciones/guias-carto/sueloyveg/1_250_IV/1_250_IV.pdf (accessed on 1 August 2015).
Los, S. O., et al. (1994). "A global 1° by 1° NDVI data set for climate studies derived from the GIMMS continental NDVI data." International Journal of Remote Sensing 15(17): 3493-3518.
Loveland, T. R., et al. (2000). "Development of a global land cover characteristics database and IGBP DISCover from 1 km AVHRR data." International Journal of Remote Sensing 21(6-7): 1303-1330.
Matthews, E. (1983). "Global vegetation and land use: New high resolution data bases for climate studies." Journal of Climate and Applied Meteorology 22(3): 474-487.
Moreno Sanchez, R., et al. (2012). "National assessment of the fragmentation, accessibility and anthropogenic pressure on the forests in Mexico." Journal of Forestry Research 23(4): 529.
Myint, S. W., et al. (2011). "Per pixel vs. object based classification of urban land cover extraction using high spatial resolution imagery." Remote Sensing of Environment 115(5): 1145-1161.
Ran, Y. and X. Li (2015). "First comprehensive fine resolution global land cover map in the world from China - Comments on global land cover map at 30 m resolution." Science China Earth Sciences 58(9): 1677-1678.
Roth, D., et al. (2016). "Estimation of human induced disturbance of the environment associated with 2002, 2008 and 2013 land use/cover patterns in Mexico." Applied Geography 66: 22-34.
Smith, G. (2013). "Hybrid pixel and object based approach to habitat condition monitoring."
Sun, B., et al. (2016). "Uncertainty Assessment of GLOBELAND30 Land Cover Data Set Over Central Asia." ISPRS International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences: 1313-1317.
Tong, X. and Z. Wang (2012). "Fuzzy acceptance sampling plans for inspection of geospatial data with ambiguity in quality characteristics." Computers & Geosciences 48: 256-266.
Tong, X., et al. (2011). "Designing a two rank acceptance sampling plan for quality inspection of geospatial data products." Computers & Geosciences 37(10): 1570-1583.
Townshend, J. G. (1992). "Land cover." International Journal of Remote Sensing 13(6-7): 1319-1328.
Townshend, J. R., et al. (2012). "Global characterization and monitoring of forest cover using Landsat data: opportunities and challenges." International Journal of Digital Earth 5(5): 373-397.


Verburg, P. H., et al. (2011). "Challenges in using land use and land cover data for global change studies." Global Change Biology 17(2): 974-989.
Yu, L., et al. (2013). "FROM GC: 30 m global cropland extent derived through multisource data integration." International Journal of Digital Earth 6(6): 521-533.
Yu, L., et al. (2013). "Improving 30 m global land cover map FROM GLC with time series MODIS and auxiliary data sets: a segmentation based approach." International Journal of Remote Sensing 34(16): 5851-5867.


APPENDIX A

Appendix Figures

This appendix contains detailed information on the structure of the original and created data that are a part of this work. Frequency diagrams display the number of sampling sites associated with a given forest type. Forest type groups are defined in this section as well.

ID_CONGLOMERADO,ID_SITIO,TIPO_VEGETACIÓN,#ARBOLES,ARBOLES_DAÑADOS,AREA_BASAL,COBERTURA_ARBOREA,VOLUMEN,GRADOSLONG,MINUTOSLONG,SEGUNDOSLONG,GRADOSLAT,MINUTOSLAT,SEGUNDOSLAT
154,44302,Bosque de pino,2,0,0.04021248,0.00019635,0.056462592,116,1,25.8,32,27,9.9
154,44303,Bosque de pino,4,0,0.025048762,0.003378005,0.074316145,116,1,25.7,32,27,11.4
154,44305,Bosque de pino,2,0,0.0380919,0.000106814,0.050897766,116,1,27.5,32,27,9.5
155,120879,Bosque de pino,5,4,0.059508973,0.000607114,0.131387254,115,58,16.6,32,27,27.5
155,120880,Bosque bajo y ab,3,0,0.044680623,0.001140401,0.10122579,115,58,15.5,32,27,25.1
156,238250,Bosque de pino,3,0,0.018668172,0.002800736,0.055104506,115,55,23,32,27,51.1

Example of the INFyS0 data, originally in a csv format.


FID  ID_CONGLOM  ID_SITIO  TIPO_VEGET      F_ARBOLES  COBERTURA_   Longitude    Latitude
0    154         44,302    Bosque de pino  2          0.00019635   116.0238333  32.45275
1    154         44,303    Bosque de pino  4          0.003378005  116.0238056  32.45316667
2    154         44,305    Bosque de pino  2          0.000106814  116.0243056  32.45263889
3    155         120,879   Bosque de pino  5          0.000607114  115.9712778  32.45763889

Example of the values from the original data that were stored in the shapefile for the INFyS0 data set.
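
The decimal Longitude and Latitude values in the shapefile example are consistent with a standard degrees, minutes, seconds conversion of the GRADOS, MINUTOS, and SEGUNDOS fields in the csv example. A minimal sketch of that conversion follows. The function name is hypothetical, and the function returns an unsigned magnitude only, as the values appear in the tables; applying a negative sign for western longitudes is left to the caller and is an assumption about how the coordinates would be used in a GIS.

```python
def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees, minutes, seconds to unsigned decimal degrees."""
    return degrees + minutes / 60.0 + seconds / 3600.0

# First row of the csv example: site 44302 in cluster 154.
longitude = dms_to_decimal(116, 1, 25.8)
latitude = dms_to_decimal(32, 27, 9.9)
```

For this row the conversion gives approximately 116.0238333 and 32.45275, which agrees with the values stored for FID 0.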


Frequency diagram showing all the unique vegetation classes and the number of sites belonging to each class for the INFyS0 data set. There are 58 unique vegetation classes.


Frequency diagram showing all the unique vegetation classes and the number of sites belonging to each class for the INFyS1 data set. There are 35 unique vegetation classes.


Definition of Forest Type Groups

1. All sites
   a. All sampling sites were included in the group. 58 unique vegetation classes for INFyS0 and 35 for INFyS1.
2. Temperate forest
   a. All sites which contained the word Bosque in the TIPO_VEGET field. 21 for INFyS0 and 17 for INFyS1.
3. Tropical forest
   a. All sites which contained the word Selva in the TIPO_VEGET field. 19 for INFyS0 and 13 for INFyS1.
4. Pine forest
   a. The unique forest type Bosque de pino.
5. Oak forest
   a. The unique forest type Bosque de encino.
6. Pine oak mix
   a. The two unique forest types Bosque de pino encino and Bosque de encino pino.
7. Low dry deciduous jungle
   a. All sites which contained the word Baja in the TIPO_VEGET field. 9 for INFyS0 and 6 for INFyS1.
8. Medium semi deciduous jungle
   a. All sites which contained the word Mediana in the TIPO_VEGET field. 6 for INFyS0 and 4 for INFyS1.
9. High evergreen jungle
   a. All sites which contained the word Alta in the TIPO_VEGET field. 3 for INFyS0 and 3 for INFyS1.
10. Erosion presence
   a. All sites which contained the word Erosion in the TIPO_VEGET field. 18 for INFyS0 and 9 for INFyS1.
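
The keyword rules above could be applied to the TIPO_VEGET field as sketched below. This is an illustration rather than the procedure used to build the groups: the function names are hypothetical, matching is done on whole words after removing accents, case, and hyphens (an assumption about how the field values are spelled), and a site may fall into several groups at once.

```python
import unicodedata

# Keyword in TIPO_VEGET -> forest type group, following the list above.
KEYWORD_GROUPS = {
    "bosque": "Temperate forest",
    "selva": "Tropical forest",
    "baja": "Low dry deciduous jungle",
    "mediana": "Medium semi deciduous jungle",
    "alta": "High evergreen jungle",
    "erosion": "Erosion presence",
}

# Groups defined by one or two complete vegetation type names.
EXACT_GROUPS = {
    "bosque de pino": "Pine forest",
    "bosque de encino": "Oak forest",
    "bosque de pino encino": "Pine oak mix",
    "bosque de encino pino": "Pine oak mix",
}

def normalize(text):
    """Lower-case, strip accents, and treat hyphens as spaces."""
    decomposed = unicodedata.normalize("NFD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(plain.lower().replace("-", " ").split())

def forest_type_groups(tipo_veget):
    """Return the set of forest type groups a vegetation type belongs to."""
    name = normalize(tipo_veget)
    words = set(name.split())
    groups = {"All sites"}
    for keyword, group in KEYWORD_GROUPS.items():
        if keyword in words:
            groups.add(group)
    if name in EXACT_GROUPS:
        groups.add(EXACT_GROUPS[name])
    return groups

pine_groups = forest_type_groups("Bosque de pino")
```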


APPENDIX B

This appendix contains the raw results of the Mann Whitney U test and the AUC test for both the INFyS0 and INFyS1 data sets.

INFyS0
Forest type group             U_Coverage  P_Coverage  U_Trees  P_Trees   AUC_coverage  AUC_numTrees  total_sites
All Sites                     9.4E+07     0.0E+00     1.1E+08  0.0E+00   0.71          0.67          49779
Temperate Forest              3.9E+07     0.0E+00     5.3E+07  3.2E-151  0.71          0.61          28968
Tropical Forest               7.6E+06     3.3E-152    6.3E+06  3.5E-235  0.71          0.76          20271
Pine                          5.6E+05     1.7E-66     6.5E+05  1.8E-41   0.71          0.67          3673
Oak                           5.9E+06     4.3E-196    8.0E+06  2.6E-45   0.70          0.59          10681
Pine Oak mix                  6.7E+06     4.3E-248    9.4E+06  2.3E-61   0.72          0.61          12030
Low dry deciduous jungle      1.2E+06     9.3E-72     1.2E+06  3.1E-68   0.69          0.68          5266
Medium semi deciduous jungle  1.2E+06     2.5E-38     8.1E+05  1.3E-75   0.71          0.80          12353
High evergreen jungle         1.0E+05     1.0E-26     9.3E+04  2.6E-30   0.75          0.77          2652
Erosion presence              1.6E+04     7.6E-21     1.8E+04  1.0E-18   0.78          0.76          777

Results of the Mann Whitney test and AUC for the canopy coverage and number of trees for the INFyS0 data set. U = U value from Mann Whitney test. P = p value from Mann Whitney test. AUC = Area under the curve.


INFyS1
Forest type group             U_Coverage  P_Coverage  U_Trees  P_Trees   AUC_coverage  AUC_Trees  total_sites
All Sites                     6.3E+06     3.8E-146    5.9E+06  3.8E-176  0.68          0.69       11124
Temperate Forest              2.6E+06     7.4E-126    3.0E+06  2.6E-68   0.69          0.64       6733
Tropical Forest               5.7E+05     9.0E-29     4.1E+05  1.7E-70   0.66          0.76       4283
Pine                          4.0E+04     3.6E-16     4.1E+04  1.7E-15   0.69          0.68       834
Oak                           4.1E+05     4.0E-45     4.7E+05  2.0E-25   0.68          0.63       2551
Pine Oak mix                  4.3E+05     1.3E-50     5.1E+05  2.4E-25   0.69          0.63       2789
Low dry deciduous jungle      9.4E+04     3.5E-14     8.1E+04  5.1E-24   0.65          0.70       1206
Medium semi deciduous jungle  8.1E+04     2.2E-12     5.4E+04  4.6E-26   0.70          0.80       2616
High evergreen jungle         4.6E+03     3.4E-06     3.6E+03  2.5E-09   0.72          0.78       461
Erosion presence              8.3E+02     1.0E-06     8.7E+02  2.2E-06   0.77          0.76       141

Results of the Mann Whitney test and AUC for the canopy coverage and number of trees for the INFyS1 data set. U = U value from Mann Whitney test. P = p value from Mann Whitney test. AUC = Area under the curve.
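
The Mann Whitney U statistic and the AUC reported in this appendix are closely related: when U counts the pairs in which a correctly identified site has the higher value, dividing U by the number of possible pairs gives the AUC, that is, the likelihood that a randomly selected correctly identified site has a higher value than a randomly selected misidentified site. The sketch below illustrates that identity with made-up canopy coverage values. The function names are hypothetical, ties are counted as one half, and whether the U values in the tables follow this pairing convention is not stated in this work, so the sketch should not be used to recompute the tabulated figures.

```python
def mann_whitney_u(correct_values, missed_values):
    """Count pairs where the correctly identified site has the higher value.

    Ties contribute one half. This brute-force form is meant for clarity,
    not for data sets with tens of thousands of sites.
    """
    u = 0.0
    for c in correct_values:
        for m in missed_values:
            if c > m:
                u += 1.0
            elif c == m:
                u += 0.5
    return u

def auc_from_u(u, n_correct, n_missed):
    """AUC as the share of all (correct, missed) pairs counted in U."""
    return u / (n_correct * n_missed)

# Made-up canopy coverage values, for illustration only.
correct = [0.8, 0.6, 0.9]
missed = [0.3, 0.7]
u_value = mann_whitney_u(correct, missed)
auc = auc_from_u(u_value, len(correct), len(missed))
```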