Use of crowdsourcing in evaluating post-classification accuracy

ABSTRACT “Crowdsourcing” uses masses of people to solve a specific problem, usually focusing on strategies that reduce the time, cost and effort needed to create data. Crowdsourcing rests on the claim that groups can make smarter and better decisions than the most intelligent individual among them. We investigated whether crowdsourcing could be used to collect control points for post-classification accuracy assessments. For this purpose, a test was conducted using the class values of 1000 randomly generated control points. Its goal was to explore the accuracy of the class values entered by three different users, combined through a majority voting method. Examination of the three data sets, each containing 1000 points, showed that the class values of only 4 points were entered incorrectly. When the support vector machine classification results were evaluated using the same 1000 control points generated by the experts and by crowdsourcing (the latter containing 4 faulty points), the classification accuracies were found to be 85.0487% and 84.6154%, respectively. The results show that crowdsourcing offers a quicker and more reliable post-classification accuracy assessment for high-spatial-resolution multispectral images.


Introduction
Crowdsourcing is commonly described as "the practice whereby an organization enlists a variety of people, paid or unpaid, to work on a specific task or problem". The term "crowdsourcing" itself was first used by Jeff Howe in his article "The Rise of Crowdsourcing", published in WIRED (Howe, 2006), but the practice of using masses of participants to solve a problem goes back hundreds of years (Thenkabail, 2015). For example, in 1714 in Britain, a newly created commission offered a 20,000 pound sterling prize to anyone who could propose a method to accurately calculate longitude, allowing ships to pinpoint their positions at sea for the first time in history (Howe, 2008). A lesser-known example of successful crowdsourcing involves Toyota. In 1936, the Japanese auto manufacturer organized a contest to design a new corporate logo. From the 27,000 entries, a winner was selected and the company's name was permanently changed from the family name, Toyoda, to its present corporate identifier, TOYOTA.
So far, crowdsourcing has been used to solve problems in a wide variety of disciplines. As a result, many new terms have emerged in the scientific literature. Among these, "volunteered geographic information" (Goodchild, 2007) and "neo-geography" (Turner, 2006) focus on the spatial nature of the data, while "crowdsourcing" (Howe, 2006), "citizen science" (Bonney et al., 2009) and "user-generated content" (Krumm, Davies, & Narayanaswami, 2008) have a much wider range of applications. Albeit very different in nature, these terms are often used interchangeably for various activities related to geographic information science with the participation of citizens (See et al., 2016).
The biggest and fastest development in crowdsourcing came with the emergence of the World Wide Web, and especially Web 2.0 technologies. No matter where you live on earth, the Internet allows you to make contributions, share resources easily and deliver results quickly. Today, the "Amazon Mechanical Turk" website is one of the most well-known crowdsourcing sites. It emerged from the idea that there are many tasks that people can do better than computers, such as describing and listing objects in a photo. On Amazon.com's Mechanical Turk forum, individuals or businesses specify the work to be done, give explanations on how to do it, and declare the fee to be paid per piece/hour. Persons who want the work must apply and, if they land the job, complete it within the specified time frame to get paid (Behrend, Sharek, Meade, & Wiebe, 2011). On the other hand, one of the first examples of crowdsourcing being used to solve geospatial problems was the "Did You Feel It" project, initiated in the early 1990s by the United States Geological Survey (Wald, Quitoriano, Worden, Hopper, & Dewey, 2012). This project was the first significant example of crowdsourcing applied to natural disaster management (Thenkabail, 2015). The ClickWorkers project and the Global Earthquake Model, an international forum where organizations and individuals came together to develop, use and share tools and resources for the unbiased assessment of earthquake risk (Thenkabail, 2015), are further notable examples of crowdsourcing.
Since 2009, the ongoing Geo-Wiki project has focused on solving geospatial problems with contributions from "crowds". Owing to large discrepancies in certain places between the MODIS, GlobCover and GLC-2000 land cover maps, the Geo-Wiki project was initiated as an online volunteer network to increase the accuracy of global land cover maps. Another good example of crowdsourcing is the Global Earth Observation System of Systems (GEOSS), built by the Group on Earth Observations (GEO). GEOSS links coordinated but independent earth observation, information and processing systems that would otherwise work in isolation from each other, providing access to diverse information resources for a broad range of users from both the public and private sectors. As a crowdsourcing application, the "GEOSS Portal" offers a single Internet access point for users seeking data, imagery and analytical software packages relevant to any part of the globe, connecting users to existing databases and portals to provide reliable, up-to-date and user-friendly information for decision-makers, planners and emergency managers. The commercial remote sensing satellite company DigitalGlobe initiated another successful crowdsourcing application: through its Tomnod platform, more than two million Internet users tried to help search-and-rescue teams find the Malaysia Airlines aircraft lost in 2014 (Karaman et al., 2015).
The spatial resolution of remote sensing images, which started at 80 m with the launch of the first Landsat satellite, has reached 30 cm with the WV-3 and WV-4 satellites, and a few centimeters or better with the emergence of unmanned aerial vehicles (UAVs). This increase in spatial resolution means that image sizes and data processing times have also increased, creating unexpected difficulties in analyzing the data. Although most data collection and processing can be completed automatically, the analysis and interpretation of remote sensing data is a complex task that still may not be entirely fulfilled by machine-driven algorithms.
One of the most common uses of remotely sensed images is the creation of thematic maps, such as land cover maps based on image classification. This is done by assigning pixels to predefined classes, either by detecting groups of pixels (clusters) with similar spectral characteristics in the image data (unsupervised classification) or by selecting representative sample sites of known cover types and using them to classify the entire image (supervised classification). In either case, the resulting classified image can be considered a thematic map showing the land cover types found in the region. These thematic maps are used to create and update land cover maps for different applications related to monitoring the earth on regional and global scales (Pasolli, Melgani, Alajlan, & Conci, 2013). There are two approaches for estimating the thematic correctness of these maps. The first collects control points directly on the ground, while the second establishes control points on aerial photography and/or high-resolution satellite imagery (Carlotto, 2009). The number of control points to be selected depends critically on the spatial resolution of the imagery. For 30-m spatial resolution Landsat imagery, more than 250 control points are needed to estimate the average accuracy of a class to within ±5% (Congalton, 1991). Akar and Güngör (2015) used a WorldView-2 (WV-2) image with a size of 1000 × 1000 pixels and a spatial resolution of 2 m, and calculated the number of control points necessary for accuracy assessment as 735 using the multinomial distribution equation given by Congalton and Green (1999). This number would be much higher for WV-3 and WV-4 images because of their higher spatial resolutions, making it difficult and time-consuming to enter the class values of the large numbers of control points needed for accuracy assessment.
Therefore, it may be appropriate to use collective intelligence, employing human perception and interpretation abilities, to cope with such large amounts of data (Hu et al., 2012).
This study investigates the usability of the crowdsourcing concept in the post-classification accuracy assessment of satellite images with high spatial resolutions (1 m or better). For this purpose, 1000 randomly generated points scattered across a pan-sharpened WV-2 image were provided to volunteers through a web interface so that they could determine and enter the real class values of these points. Users could, if they wished, consult the help document presented in the web interface to learn how to enter the class value of each point. The class value of each of the 1000 points was entered three times by three different users, and the class value of each point was determined by a majority voting method. For final verification, the resulting data set was also compared to a reference data set created by one expert and controlled by another. This verification process revealed that only 4 of the 1000 points were incorrectly entered by the users. These control points were then used to determine the classification accuracy of a WV-2 image classified with the support vector machine (SVM) algorithm. The results show that the crowdsourcing approach can be used for the post-classification accuracy assessment of high-resolution satellite images.

Data set
A WV-2 multispectral (MS) image of Sürmene, a small town in Trabzon province (located on the Black Sea coast in northeastern Turkey), was used. The WV-2 satellite provides a panchromatic band covering the spectrum from 450 to 800 nm with 0.50 m spatial resolution. It also offers eight MS bands (i.e. four standard bands, namely red, green, blue and near-infrared-1, and four additional bands, namely coastal, yellow, red edge and near-infrared-2) covering the spectral range from 400 to 1050 nm at a spatial resolution of 1.84 m (Padwick, Deskevich, Pacifici, & Smallwood, 2010). WV-2 images are used for enhanced spectral analysis, mapping and monitoring applications, land-use planning, disaster relief, exploration, defense and intelligence, and visualization and simulation environments.
Using the QGIS program, 1000 randomly selected points were created in GeoJSON format on the WV-2 image of the study area. A "class value" field was added to the attribute table of each point so that the real class value of each control point could be entered.
The verification data set, containing the class values of the 1000 control points as created by one expert and controlled by another, was used for the accuracy assessment. The experts could not identify the class values of 77 of the 1000 points; hence, no class value was entered for these points.
A world map served via OpenLayers was provided as the base map. Cadastral boundary data and the image classified with the SVM algorithm were presented as additional layers on the web interface.

Processing steps
Since the aim was to implement post-classification accuracy assessment with a crowdsourcing approach rather than with a single person, this study needed a classified image. For this purpose, instead of the original WV-2 MS data, WV-2 MS data pan-sharpened with the WV-2 panchromatic band were classified with the SVM algorithm, so that the class values of the control points could be detected more clearly by the users. The same pan-sharpened WV-2 MS data were used in the accuracy assessment part of the study. To evaluate post-classification accuracy conventionally, 1000 random sample points were selected on the fused WV-2 image. As an alternative to the conventional post-classification accuracy assessment approach, a crowdsourcing interface was created, designed to be used by multiple online users to identify the class value of each sample point on the fused MS image.

Gram-Schmidt pan-sharpening method
The Gram-Schmidt pan-sharpening method is similar to principal component analysis (Karathanassi, Kolokousis, & Ioannidou, 2007). It combines a high-spatial-resolution panchromatic image with a low-spatial-resolution MS image to increase the spatial resolution of the MS image (Yuhendra & Kuze, 2011). The first step is to compute a simulated panchromatic band from a combination of the lower-resolution MS bands; in general, this is achieved by averaging the MS bands. This simulated panchromatic band is employed as the first band of the MS image. Second, a Gram-Schmidt orthogonalization transformation is performed on the simulated panchromatic band and the MS bands to obtain the Gram-Schmidt bands. Then, the first Gram-Schmidt band is replaced with the high-spatial-resolution panchromatic band. Finally, the inverse Gram-Schmidt transform is applied to create the pan-sharpened MS bands (Laben et al., 2000).
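The steps above can be sketched in code. The following is a simplified component-substitution illustration, not the exact Gram-Schmidt transform used by commercial software: the band-averaged simulation and detail-injection steps are kept, while the resampling of the MS bands to the panchromatic grid and the full orthogonalization are omitted.

```python
import numpy as np

def gram_schmidt_pansharpen(ms, pan):
    """Simplified Gram-Schmidt-style pan-sharpening sketch.

    ms  : (bands, H, W) multispectral cube, assumed already resampled
          to the panchromatic grid (the resampling step is omitted).
    pan : (H, W) high-resolution panchromatic band.
    """
    # Step 1: simulate a low-resolution panchromatic band by
    # averaging the MS bands.
    sim_pan = ms.mean(axis=0)

    # Match the real pan band to the simulated one (gain/bias) so the
    # injected detail has comparable statistics.
    pan_adj = (pan - pan.mean()) / pan.std() * sim_pan.std() + sim_pan.mean()

    # Steps 2-4 of the transform collapse, in this sketch, to injecting
    # the spatial detail (pan - simulated pan) into each band with a
    # per-band gain proportional to its covariance with sim_pan.
    detail = pan_adj - sim_pan
    fused = np.empty_like(ms, dtype=float)
    for b in range(ms.shape[0]):
        gain = np.cov(ms[b].ravel(), sim_pan.ravel())[0, 1] / sim_pan.var()
        fused[b] = ms[b] + gain * detail
    return fused
```

Note that when the panchromatic band carries no extra detail (i.e. it equals the simulated band), the sketch correctly returns the MS cube unchanged.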
The Gram-Schmidt method is an effective pan-sharpening method (Karathanassi et al., 2007; Klonus & Ehlers, 2009; Saralioglu et al., 2016; Yuhendra & Kuze, 2011). It has two main advantages. First, there is no limit on the number of bands that can be processed in one shot with this technique. Second, the spectral properties of the original MS data are preserved satisfactorily in the fused image (Li, Liu, Wang, Zhao, & Wang, 2004). As shown in Figure 1, the MS bands of the WV-2 image were fused with the panchromatic band, resulting in a pan-sharpened high-resolution image.
Since the resulting pan-sharpened image has a spatial resolution of 0.50 m, the class values are more easily distinguishable by the users.
The SVM method was applied to the fused WV-2 MS image with a 0.50 m spatial resolution. The kernel functions commonly used in SVMs can generally be grouped into four types, namely linear, polynomial, radial basis function (RBF) and sigmoid kernels (Huang et al., 2002; Kavzoglu & Colkesen, 2009; Keerthi & Lin, 2003; Pal & Mather, 2005). The RBF kernel has some advantages over the others. In contrast to the linear kernel, the RBF kernel maps samples non-linearly into a higher-dimensional space and so can deal with cases in which the relationship between class labels and attributes is not linear. Compared to the polynomial kernel, the RBF kernel has fewer hyperparameters affecting the complexity of model selection. The sigmoid kernel is not valid (i.e. it is not the inner product of two vectors) under some parameters (Hsu, Chang, & Lin, 2003; Vapnik, 1995). The RBF kernel also requires only a small number of parameters to be defined and is widely known to generally produce good results (Keerthi & Lin, 2003; Pal & Mather, 2005; Petropoulos, Kalaitzidis, & Vadrevu, 2012). Therefore, the RBF kernel was chosen for this study.
In order to run the SVM classifier in ENVI (Environment for Visualizing Images), it is necessary to enter the penalty parameter, the number of pyramid levels to be used, the classification probability threshold and the "gamma (g)" value of the kernel function. In general, very little guidance exists in the literature concerning the criteria to be used in selecting the kernel-specific parameters (Carrao et al., 2008; Li & Liu, 2010). In our case, the parameterization of the RBF kernel function was based on a number of permutations and combinations, using classification accuracy as the measure of quality. A similar approach has been followed by other researchers (Fauvel, Chanussot, & Benediktsson, 2006; Kuemmerle, Chaskovskyy, Knorn, Radeloff, & Kruhlov, 2009; Pal & Mather, 2006). In addition, the suggestions provided in the ENVI User's Guide (2008) were taken into account during kernel parameterization. As a result, the gamma (g) parameter was set to the inverse of the number of spectral bands of WV-2 (0.125), whereas the penalty parameter was set to its maximum value (100), forcing no misclassification during the training process. The pyramid parameter was set to zero, forcing the WV-2 imagery to be processed at full resolution, and a classification probability threshold of zero was used, requiring all image pixels to be sorted into one of the classes.
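These parameter choices can be reproduced with any SVM library. The following minimal sketch uses scikit-learn (an assumption for illustration; the study used ENVI's SVM implementation) with gamma = 1/8 and C = 100 on made-up pixel samples.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up training samples: 1755 pixels with 8 spectral band values
# each, labelled with one of 7 classes (1..7), as in the study.
X_train = rng.random((1755, 8))
y_train = rng.integers(1, 8, size=1755)

# gamma = 1 / number of WV-2 bands = 0.125 and penalty C = 100,
# mirroring the ENVI parameter choices described above.
clf = SVC(kernel="rbf", gamma=1 / 8, C=100)
clf.fit(X_train, y_train)

# Classify every pixel of a hypothetical (H, W, 8) image cube.
image = rng.random((50, 50, 8))
labels = clf.predict(image.reshape(-1, 8)).reshape(50, 50)
```

Pixels are flattened to a (pixels, bands) matrix for prediction and the resulting labels are reshaped back to the image grid, which is the usual way to apply a per-pixel classifier to an image cube.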
To initiate the training process in the ENVI program, at least 240 sample pixels were collected for each class by selecting more than one polygon per class, resulting in a total of 1755 pixels used to train the 7 classes, namely "forest", "hazelnut", "shade", "soil", "tea", "building" and "road" (see Figure 2).

Generating reference points
The Accuracy Assessment utility allows analysts to compare certain pixels in the thematic raster layer with reference pixels whose class values are already known. To determine the post-classification accuracy in an unbiased way, a sufficient number of reference pixels must be randomly selected in the image to be classified (Congalton, 1991). Congalton and Green (1999) offered the following multinomial distribution equation for calculating the number of reference pixels required to assess the accuracy of the classification (Akar & Güngör, 2015):

n = B × Π_i (1 − Π_i) / b_i²

where "n" is the required number of reference pixels, "B" is the upper (α/k) × 100th percentile of the chi-square distribution with one degree of freedom, "α" is the desired confidence level, "k" is the number of classes, "Π_i" is the proportion of the entire area covered by the ith class, and "b_i" is the required precision. According to this equation, a minimum of 735 reference pixels is needed to estimate the mean accuracy of a class to within a ±5% confidence interval.
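For illustration, the Congalton and Green sample-size formula can be evaluated in its worst case (class proportion Π_i = 0.5) using only the standard library, since the required chi-square percentile with one degree of freedom equals a squared normal quantile.

```python
from math import ceil
from statistics import NormalDist

def min_reference_pixels(alpha=0.05, k=7, b=0.05, pi=0.5):
    """Sample size n = B * pi * (1 - pi) / b**2, with B the upper
    (alpha/k) * 100th percentile of the chi-square distribution with
    one degree of freedom (computed as a squared normal quantile)."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    B = z * z
    return ceil(B * pi * (1 - pi) / b ** 2)

n = min_reference_pixels()  # alpha = 0.05, k = 7 classes, b_i = 0.05
```

With these inputs the worst-case result lands in the low 700s; the figure of 735 reported in the study will additionally depend on the exact class proportions Π_i used by Akar and Güngör (2015).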
One thousand points were randomly generated with the QGIS program to exceed the required minimum, and a column labeled "class" was added to the attribute table, allowing users to enter the correct class value for each reference point. The result was recorded in GeoJSON format.
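Such a point layer can also be produced programmatically; the study used QGIS, but a minimal sketch of the same idea follows (the bounding-box coordinates are made up for illustration).

```python
import json
import random

random.seed(42)

# Hypothetical WGS84 bounding box for the study area (made-up values).
XMIN, YMIN, XMAX, YMAX = 40.05, 40.88, 40.17, 40.95

features = []
for i in range(1000):
    features.append({
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [random.uniform(XMIN, XMAX),
                            random.uniform(YMIN, YMAX)],
        },
        # Empty "class" field, to be filled in by the volunteers.
        "properties": {"id": i, "class": None},
    })

geojson = {"type": "FeatureCollection", "features": features}
geojson_text = json.dumps(geojson)  # ready to save as a .geojson file
```

Each feature carries an empty "class" property, mirroring the attribute-table field added in QGIS.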

Crowdsourcing web application
The crowdsourcing web interface uses a MySQL database for database operations, and its vector layers are generated in GeoJSON format. In the interface, users are asked to fill an empty "class" field for each reference point. The domain www.crremotesensing.app was purchased and the web site was designed in a Visual Studio environment, with IIS used as the web server. On the front end, HTML, CSS, Bootstrap and JavaScript are used, while C#, jQuery and various libraries are used on the back end. The architecture of the crowdsourcing web interface is illustrated in Figure 3.
When a user visits the web page www.crremotesensingapp.com, registration is mandatory before the user can participate in the crowdsourcing application. Names, surnames and contact information are requested on the registration form. Once the administrator approves the request, users can access the interface and carry out their assigned task. The layers available in the interface are a world map created with OpenLayers, the fused WV-2 MS image, the 1000 randomly selected reference points, cadastral boundaries and the classified image.
The following rules and restrictions apply to the web application:

• Multiple users can access the web page simultaneously.
• The 1000 randomly selected points generated at the beginning are used throughout the whole process.
• When a user logs into the system, 20 of the 1000 randomly selected points are displayed on the screen. By clicking on any of the 20 points on the WV-2 image, the user enters the class value corresponding to that point. If they wish, users can also consult the help document provided in the interface.
• A user sometimes may not be able to decide what the class value of a reference point should be; in that case, the user can leave the point blank and save. When the same or another user makes a new request, the system displays 20 new points from the remaining, still-unanswered points.
• The system continues to display 20 points for each new request until all 1000 reference points are completed.
• Suppose there are 14 points left to complete the total of 1000 points but more than one visitor is requesting work. In that case, the system displays the same 14 points to each requester. Note: only the results of the person who finishes the work first are stored in the database.
• Only after all 1000 points are processed is the first data set complete. The same 1000 reference points are used for the second data set, following the same process. These steps are repeated until the desired number of data sets has been created.
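The dispensing rules above can be sketched as a small in-memory routine. This is a simplified, single-process illustration; the actual application implements this logic in C# against a MySQL database.

```python
import random

class PointDispenser:
    """Hands out batches of up to 20 unanswered reference points and
    keeps only the first completed answer for each point (a simplified,
    not thread-safe sketch of the web application's rules)."""

    BATCH = 20

    def __init__(self, point_ids):
        self.pending = set(point_ids)   # points not yet answered
        self.answers = {}               # point id -> class value

    def request_batch(self):
        # With fewer than 20 points left, every requester receives
        # the same remainder (the race is resolved in submit()).
        n = min(self.BATCH, len(self.pending))
        return random.sample(sorted(self.pending), n)

    def submit(self, results):
        # First finisher wins: later answers for a point are ignored.
        for pid, value in results.items():
            if pid in self.pending:
                self.answers[pid] = value
                self.pending.discard(pid)

    def complete(self):
        # True once all points have a stored answer -> one data set done.
        return not self.pending
```

A complete data set is assembled by repeatedly calling `request_batch()` and `submit()` until `complete()` returns True, after which the same point set is reused for the next data set.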
In this study, a total of three data sets were created by the users, i.e. three class values were entered for each reference point (Figure 4 shows the crowdsourcing interface).

Help document
A document provided in the HELP section explains clearly how users can enter the reference point values correctly. It also shows exactly how each class appears in the WV-2 image and which value should be entered for each class. The help document is shown in Table 1.

Results and discussions
As explained above, three data sets were generated for the 1000 reference points. The reason for entering more than one value per reference point is to detect incorrectly entered values and eliminate them. If at least two of the three users enter the same class value for a reference point, that majority value is taken as the class value given by the users. If all three users enter different values for a reference point, the point is considered not fully distinguishable and is left unclassified. In this study, 73 points were labeled as "unclassified". (Figure 5 illustrates a distinguishable reference point and an indistinguishable one labeled as an "unclassified reference point".)

To see how accurate the resulting reference data sets are, a fourth data set was created for verification by determining the class values of the same 1000 reference points. During this process, it proved difficult to determine the class values of 77 points, because the image became too degraded after zooming in or because the points were located on a line separating two classes. The verification set was then compared to the resulting set generated by the individual users. This comparison revealed that the class values of two points had been entered incorrectly in the verification data set, possibly due to user fatigue (combined with examining too many points), carelessness, mistyping or not zooming in enough to see the class values clearly. When a single user has to deal with too many reference points, such errors occur more often. This result also supports the claim that decisions derived from a crowd are more accurate than the decisions of a single expert.
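The majority-voting rule described above can be sketched as follows; the class codes follow the help document, while the point ids are hypothetical.

```python
from collections import Counter

def vote(labels):
    """Return the majority class among the three user entries, or
    None ("unclassified") when all three entries differ."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

# Three data sets: one class value per user for each reference point.
entries = {101: (1, 1, 2),   # two users agree -> forest (1)
           102: (3, 3, 3),   # unanimous -> shadow (3)
           103: (1, 2, 5)}   # all differ -> unclassified

resolved = {pid: vote(labels) for pid, labels in entries.items()}
```

Points resolved to None correspond to the 73 "unclassified" reference points reported in this study.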
In addition, when the validation data set was compared with the user-generated data set, the class values of four reference points were found to have been entered incorrectly by the users. To further reduce such errors, more than three data sets should be created with different users so that the most appropriate results can be obtained statistically.
In this study, the verification data set was generated by one expert and controlled by another. The resulting data set generated by crowdsourcing was used for the accuracy assessment of the SVM classification product, as shown in Table 2.
As shown in Table 2, the four points misinterpreted by the crowd reduced the classification accuracy by less than 1% compared to the accuracy assessment based on the expert-generated data set.
Additionally, double and triple combinations of the class values were examined to see which reference points in the WV-2 image users could not distinguish. As explained in the help document, the class values are: forest = 1, hazelnut = 2, shadow = 3, soil = 4, tea = 5, building = 6 and road = 7. Reference points whose class values were entered differently in all three user-generated data sets could thus be evaluated more easily. The triple combinations of the class values are given in Figure 6.
In Figure 6, the horizontal axis represents the three different class values given by three different users for each reference point, and the vertical axis represents the total number of points. Here, "125" means that, for 9 of the 1000 reference points, the three users entered 1 (forest), 2 (hazelnut) and 5 (tea) for the same point. It is also clear that users could not differentiate between the forest, hazelnut and shadow classes 19 times. The figure indicates that, when determining the class values of reference points, users had the most difficulty with the forest-hazelnut-shadow combination, which is not surprising since hazelnut and other trees have similar spectral characteristics. Moreover, it is even more difficult to distinguish dark green or blackish forest areas when they are close to black or shaded areas. The double combinations of the class values were also examined to determine which two classes were the most difficult for users to distinguish (see Figure 7).
In fact, Figure 7 is derived from Figure 6. The value "46" (classes 4 (soil) and 6 (building)) on the x-axis means that 4 and 6 appear together once in each of the combinations 146, 346, 456 and 467 in Figure 6, so this double combination appears four times. As seen in Figure 7, users mostly had difficulty (34 times) distinguishing between class 1 (forest) and class 2 (hazelnut). The second biggest problem, occurring 26 times, was between class 2 (hazelnut) and class 3 (shadow). The third biggest confusion, appearing 24 times, lay between class 1 (forest) and class 3 (shadow). The results show that the least confusion occurred between class 4 (soil) and class 7 (road), only 2 times.
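Deriving the pair counts from the triple combinations is mechanical; the following sketch uses made-up triple counts that include the "46" example above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical counts of triple combinations (each digit is a class
# value), mimicking the structure of Figure 6.
triples = {"146": 1, "346": 1, "456": 1, "467": 1, "125": 9}

# Every triple contributes its three unordered pairs, weighted by how
# often the triple occurred.
pairs = Counter()
for combo, count in triples.items():
    for a, b in combinations(sorted(combo), 2):
        pairs[a + b] += count

# "46" appears once in each of 146, 346, 456 and 467 -> four times.
```

This is how the Figure 7 histogram follows directly from the Figure 6 data.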

Conclusion
Fifty untrained users participated in this study, which concluded that non-professionals can assign class values to reference points on high-resolution satellite images with the aid of a help document. The class value of each reference point was entered three times so that incorrectly entered values could be detected, and the resulting data set was created by taking the majority of the users' votes. Each user spent about 10 min on the task, for a total of about 9 h. When post-classification accuracy analysis is performed by a single person, the whole process becomes prolonged and rife with errors due to human factors such as fatigue and distraction. Nevertheless, this study indicates that post-classification accuracy analysis can be successfully implemented using a crowdsourcing approach; in this application, the class value of each reference point was assigned by three different users. It should be noted that increasing the number of users not only increases accuracy but also makes it easier, even with non-professional participants, to determine the most appropriate class values statistically.

Disclosure statement
No potential conflict of interest was reported by the authors.