Understanding satellite images: a data mining module for Sentinel images

The increased number of free and open Sentinel satellite images has led to new applications of these data. Among them is the systematic classification of land cover/use types based on patterns of settlements or agriculture recorded by these images, in particular, the identification and quantification of their temporal changes. In this paper, we will present guidelines and practical examples of how to obtain rapid and reliable image patch labelling results and their validation based on data mining techniques for detecting these temporal changes, and presenting these as classification maps and/or statistical analytics. This represents a new systematic validation approach for semantic image content verification. We will focus on a number of different scenarios proposed by the user community using Sentinel data. From a large number of potential use cases, we selected three main cases, namely forest monitoring, flood monitoring, and macro-economics/urban monitoring.


Introduction
The Copernicus Access Platform Intermediate Layers Small Scale Demonstrator (CANDELA) project is a European Horizon 2020 research and innovation project for easy interactive analysis of satellite images on a web platform. Among its objectives are the development of efficient data retrieval and image mining methods augmented with machine learning techniques as well as interoperability capabilities in order to fully benefit from the available assets, the creation of additional value, and subsequently economic growth and development in the European member states (Candela, 2019).
The potential target groups of users of the CANDELA platform are: space industries and data professionals, data scientists, end users (e.g., governmental and local authorities), and researchers in the areas covered by the project use cases (e.g., urban expansion and agriculture, forest and vineyard monitoring, and assessment of natural disasters) (Candela, 2019).
When it comes to image analysis and interpretation, the main objectives of our application-oriented data mining CANDELA platform can be grouped into five activities (Candela, 2019):
• Activity 1: A Big Data analytics building block allowing the analysis of large volumes of Earth observation (EO) data. In our case, this activity will generate a large geographical and temporal volume of EO data to be ingested into the data analytics building blocks.
• Activity 2: Tools for the fusion of various multi-sensor Earth observation satellite data (comprising, besides Sentinel, also several other contributing missions) with in-situ data and additional information from the web such as social networks or Open Data, in order to pave the way for new applications and services. Our achievements will be measured by the capability to ingest data from various and heterogeneous sources (EO data and non-EO data).
• Activity 3: Compatibility of the analytics building blocks with any cloud computing back-office layers in order to run our applications on a distributed architecture with complete scalability and elasticity, and eventually to be deployed on top of our Sentinel data interface (DIAS) (2020). Our goal is the compatibility between the CANDELA platform, other existing European assets, and future DIAS developments.
• Activity 4: Analytics tools developed for the platform that will have state-of-the-art performance, and allow us to obtain optimal veracity. This can be verified by checking the attainable accuracy of our results.
• Activity 5: Development of realistic reference scenarios that demonstrate the platform capabilities and use cases, and their functionality to new external users. This can be checked by validation of the given use cases.
The focus of this paper is the definition of real scenarios/use cases (cf. Activity 5), using as much of the Earth observation data available for each use case as possible (cf. Activity 1). Our project uses the data provided by Sentinel-1 and Sentinel-2.
By the date of submission of this paper [June 2020], the Copernicus Sentinels had generated more than 27 million Earth observation (EO) products. More than 300,000 users have downloaded this Big EO data. Due to their high spatial resolution, Sentinel-1 and Sentinel-2 data represent ca. 90% of the total Copernicus EO data volume. The data are free and the access is open. Systems such as the European Space Agency (ESA) Copernicus Open Access Hub (Candela, 2019; Sentinel-1, 2019), the Thematic Exploitation Platforms (TEPs) (Thematic Exploitation Platforms, 2020) or the Copernicus Data and Information Access Services (DIAS) (DIAS platform, 2020) provide access to the data.
The recent advent of Machine Learning as a general-purpose methodology is transforming the technology landscape in every field. In this context, we based the Data Mining component in CANDELA on Active Machine Learning for EO, in a hybrid paradigm with parameter estimation, information retrieval, and specific aspects of EO image semantics, including elements of ontology focused on Sentinel-1 and Sentinel-2 observations. Data Mining turns "data access" into "information and knowledge" extraction. The fast and adaptive operation of the Data Mining component is one of the assets that increase the valorisation of the Sentinel-1 and Sentinel-2 data and broaden their application areas.
The paper is organized as follows: Section 2 presents the proposed use cases and the targeted data together with their characteristics. Section 3 explains the scientific background of our approach and the CANDELA platform. Typical validation results are presented in Section 4, followed by some conclusions and future work in Section 5.

Presentation of the use cases
In this section, we start by explaining the characteristics of the data set, and by introducing the selected use cases.

Characteristics of Sentinel data
The Sentinel-1 mission comprises a constellation of two satellites (called A and B), operating in C-band for synthetic aperture radar (SAR) imaging. Sentinel-1A was launched on April 3rd, 2014, while Sentinel-1B was launched two years later on April 25th, 2016 (Sentinel-1, 2019). The repeat period of each Sentinel-1 satellite is 12 days, which means that every 6 days, there may be an image acquisition of the same site by one of the two satellites. As SAR has the advantage of operating at wavelengths not impeded by thin cloud cover or a lack of solar illumination, one can acquire data over large areas during day or night time with almost no restrictions due to weather conditions.
From the multitude of product options that exist, we selected Level-1 Ground Range Detected (GRD) products with high resolution (HR) taken routinely in Interferometric Wide swath (IW) mode (Sentinel-1, 2019). These data are produced (prior to geo-coding) with a pixel spacing of 10 × 10 m and correspond to about 5 looks and a resolution (range × azimuth) of 20 × 22 m. For these products, the data are provided in dual polarization, namely VV and VH, in WGS 84 geometry. For rapid and efficient high-resolution feature extraction with a good signal-to-noise ratio, we used only the VV polarization data.
The Sentinel-2 mission likewise comprises a constellation of two satellites (also called A and B), but in contrast it collects multispectral (optical) data that are affected by the actual weather conditions (e.g., cloud cover). The Sentinel-2A satellite was launched on June 23rd, 2015, while Sentinel-2B was launched on March 7th, 2017 (Sentinel-2, 2019). Both Sentinel-2 instruments have 13 spectral channels (in the visible/near infrared, and in the short wave infrared spectral range). The repeat period of each Sentinel-2 satellite is 10 days. That means every 5 days there can be an image acquisition of the same site by one of the two satellites.
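The revisit figures quoted for both missions follow from simple arithmetic. The short sketch below makes this explicit under the assumption that the satellites of a constellation are evenly phased in the same orbit; the function name is hypothetical and only serves as an illustration.

```python
def constellation_revisit_days(repeat_cycle_days, n_satellites):
    """Nominal revisit interval for evenly phased satellites in the same orbit."""
    if n_satellites < 1:
        raise ValueError("at least one satellite is required")
    return repeat_cycle_days / n_satellites

# Repeat cycles as stated in the text: 12 days (Sentinel-1), 10 days (Sentinel-2).
sentinel1_revisit = constellation_revisit_days(12, 2)
sentinel2_revisit = constellation_revisit_days(10, 2)
```

These are nominal values; whether an image is actually acquired on a given pass depends on the acquisition plan.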
In this case, we selected Level-1C products, which are radiometrically and geometrically corrected WGS 84 images with ortho-rectification and spatial registration on a global reference system with sub-pixel accuracy (Sentinel-2, 2019). For visualization, the RGB bands of Sentinel-2 (B04, B03, and B02) were used to generate quick-look quadrant images. For feature extraction, the user can choose different band combinations. In this paper, we selected the four high-resolution 10 m bands for Sentinel-2 images with man-made infrastructure content, and all 13 bands (at 10 m, 20 m, and 60 m) for Sentinel-2 images with natural vegetation content. This selection was based on our experience/validation and on observations made during the analysis of Sentinel-2 data.
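The band selection rule described above can be written down compactly. The following is a minimal sketch, not the platform's implementation: the band resolutions are the nominal Sentinel-2 values, while the function name and the "man-made"/"natural" flag are assumptions introduced for illustration.

```python
# Nominal ground sampling distance (metres) of the 13 Sentinel-2 bands.
SENTINEL2_BAND_RESOLUTION_M = {
    "B01": 60, "B02": 10, "B03": 10, "B04": 10, "B05": 20, "B06": 20, "B07": 20,
    "B08": 10, "B8A": 20, "B09": 60, "B10": 60, "B11": 20, "B12": 20,
}

# Bands used for the RGB quick-look images (red, green, blue).
RGB_QUICKLOOK_BANDS = ["B04", "B03", "B02"]

def select_bands(content_type):
    """Return the bands used for feature extraction.

    content_type is a hypothetical flag: "man-made" selects the four
    10 m bands, "natural" selects all 13 bands.
    """
    if content_type == "man-made":
        return [b for b, res in SENTINEL2_BAND_RESOLUTION_M.items() if res == 10]
    if content_type == "natural":
        return list(SENTINEL2_BAND_RESOLUTION_M)
    raise ValueError(f"unknown content type: {content_type!r}")
```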
This selection of Sentinel products will mostly result in a larger number of semantic labels for Sentinel-2 data than for Sentinel-1 data. This is due to the higher resolution of Sentinel-2 and the resulting capability to identify more classes from the image content.

Selected use cases
Based on different European user workshops, we selected half a dozen use cases for Sentinel images. The use cases presented in this paper are grouped into three main categories and are linked to Activity 5 of the project. Each category is divided into subcategories in order to demonstrate the complexity of the problem and the diversity of cases that we may encounter. These use cases are: monitoring of forests (fires, windstorms, and deforestation), monitoring of floods (river, sea, and ocean), and monitoring of urban areas.
For each use case (see Table 1), the user communities provided us with objectives and their requirements, consolidated solution approaches, and typical image examples.

Forest monitoring
The forest use case aims to show how collections of Earth observation satellite data can be used to monitor forests under different conditions such as fires, windstorms, and deforestation.

Fires in the Amazon rainforest area.
In August 2019, many fires affected the Amazon rainforest. Based on a report by the European Space Agency (ESA), the number of fires was four times higher than in 2018. The fires, together with some (legal or illegal) deforestation, cleared land for future agricultural use but may contribute to rising global temperatures (ESA: Fires ravage the Amazon, 2019).
In this use case, we focused on the area between Brazil, Bolivia, and Paraguay. For the fire period in August 2019, we were able to acquire both Sentinel-1 and Sentinel-2 satellite images. For Sentinel-1, the selected images were acquired on August 2nd, August 26th, and September 7th, 2019, while for Sentinel-2, the selected images were acquired on August 5th, August 20th, August 25th, and September 9th, 2019.
The location of the affected area is shown in Figure 1.

Windstorms in Poland.
In mid-August 2017, a large area of forest near the Bory Tucholskie National Park in Poland was affected by windthrow. The location of the affected area is outlined in Figure 2. The park was created in July 1996 and now covers an area of 46 km² of forests, lakes, meadows, and peatlands. It is located in the northern part of Poland in the heart of the Tuchola Forest, the largest woodland in Poland. In 2010, the park was included in the UNESCO Tuchola Forest Biosphere Reserve.
For this area, too, both Sentinel-1 and Sentinel-2 images were available during several intervals that could be used for investigation. Based on the available data, we selected Sentinel-1 images that were acquired (prior to the windstorm) on July 30th, 2017, and (after the windstorm) on August 29th, 2017, while the Sentinel-2 images were acquired (prior to the windstorm) on July 30th, 2017, and (after the windstorm) on September 28th, 2017.

Deforestation in Romania.
From 2005 to 2009, more than 1,000 hectares of forest around Tarnita Balasanii were illegally cut down, although the whole area is a fully protected part of the Maramures Mountains Natural Park. In 2016, 220 hectares of forest were cut in this area. As a consequence, a number of natural disasters took place (e.g., landslides and floods), and many houses and agricultural crops were destroyed.
For this area, only Sentinel-1 images were available. Thus, we chose an image taken on June 27th, 2015, before the deforestation, and an image acquired on September 1st, 2016, after the deforestation was discovered.
The location of the deforested area is shown in Figure 3.

Flood monitoring
The second use case aims to show how Earth observation satellite data can be used to monitor areas affected by floods and how these areas evolve over time.

Floods in Omaha, Nebraska.
In March 2019, floods occurred in and around the city of Omaha, Nebraska (United States of America) near the Missouri river. A large area of the city and of its surroundings was affected by the floods.
From the available Sentinel images, we selected only some multispectral Sentinel-2 images, as they provide the only complete coverage of the affected area. These products were acquired as a pre-disaster image, an image recorded during the floods, and a post-disaster image.
Due to the required cloud-free imaging, the selected pre-disaster image had already been acquired on March 1st, 2018. The image during the floods was acquired on March 21st, 2019. Due to the spring season, this image has a different appearance than the post-disaster image taken some weeks later. The post-disaster image was acquired on June 24th, 2019, and represents a summer image.
The location of the affected area is depicted in Figure 4. The selection of Sentinel data was more difficult in this case. In the case of Sentinel-1, we were only able to find a single image acquired on March 19th, 2019 that covers the entire area of the flooding. As for Sentinel-2, the only available image was acquired on March 22nd, 2019; however, the image is covered by clouds and could not be used for land cover classification.

Floods in
The extent of the flooded area is shown in Figure 5.

Macro-economics
The objective of the "macro-economics" use case is to show how remote sensing can extract adequate information from urban images, which will then be used to feed economic models.

Monitoring of urban areas over the world.
We selected a number of cities and their surrounding areas from different countries, with different architectures, and recorded by Sentinel-1 and/or Sentinel-2. This use case demonstrates the impact of the definition and selection of semantic categories for different geographical locations and architectures of the cities, combined with the influence of the type of instrument being used for the image acquisition. The cities were grouped per continent and by imaging technique. The list of the selected cities is further detailed together with their locations marked in Figure 6.

Description of the CANDELA Platform
CANDELA's main objective is the creation of additional value from Sentinel images through the provisioning of modelling and analytics tools, assuming that the tasks of data collection, processing, storage, and access will be carried out by the Copernicus Data and Information Access Service (DIAS) (DIAS platform, 2020). After the integration of all components, CANDELA will be deployed on top of CreoDIAS (CreoDIAS, 2019). CreoDIAS is an environment that brings the algorithms to the EO data. This platform provides online access to almost all Sentinel satellite data (Sentinel-1, Sentinel-2, Sentinel-3, and Sentinel-5P), and to other EO data (e.g., Landsat-5, Landsat-7, Landsat-8, and Envisat).
The CANDELA platform allows the prototyping of EO applications by applying efficient data retrieval and data mining tools augmented with machine learning techniques, as well as the interoperability between Sentinel-1 and Sentinel-2, in order to fully benefit from their potential content-related data and thus to add more value to the satellite data. It also helps us to interactively detect many objects or structures, and to classify land cover categories (Candela, 2019).
The design, implementation, or operation of high-complexity systems requires an analysis from different perspectives. For CANDELA, we proposed a View Model, which is a standardized systems-engineering approach (e.g., IEEE Standard 1471-2000). This model has three perspectives (see Figure 7):
• Information Processing: dealing with the basic information content transformation by algorithms, and their use and interoperation;
• Software Architecture: performing the computational and functional decomposition of the system architecture;
• Operations to be Performed: monitoring the sequences of operations and running the use cases.

Information processing
In CANDELA, the EO data are analyzed by two processing chains (see the data flow in Figure 8): by "Data Mining" together with "Data Fusion", and by "Change Detection". The Data Mining and Data Fusion module extracts textual content descriptors (i.e., "semantic land cover labels" (Dumitru, Schwarz, & Datcu, 2016)) in the actual EO product, whereas Change Detection extracts information for "change indicators" to be provided to the users (see "Thematic Applications" in Figure 8). Some non-EO data (e.g., cadastral maps, weather parameters) can be objects of search and semantic indexing that, if necessary, can be combined with EO data semantics. The resulting data taxonomy is transferred to use case-dependent Thematic Applications (e.g., outlines of affected areas).
In this paper, we focus only on the EO Data Mining assets and on how they add more value to the satellite data.
In the Data Model Generation-Data Mining (DMG-DM) container, the data model generation processes for data mining are run for each selected product. After the completion of DMG-DM, the extracted metadata and features are ingested into the Database Management System (DBMS) on the platform, which can be used for querying product metadata, features, and semantic labels (Datcu et al., 2019a). The information is available on the platform and can be downloaded by local users via a Representational State Transfer (RESTful) service (REST API, 2019).
The users, accessing the GUI, have to perform numerical classifications (i.e., feature grouping) via active learning methods. Later, during an annotation step, these classification results can be converted into semantic labels (sometimes called categories).
The annotated semantics (Dumitru, Schwarz, & Datcu, 2018; Dumitru et al., 2016) are ingested via the Internet into the remote database on the platform.

Operations to be performed
The EO standard (i.e., pre-processed) products are decomposed into features and metadata within the Data Model Generation (DMG) module.Then, the extracted actionable information and metadata are ingested into the database management system (via a MonetDB database (MonetDB, 2019)).
The DMG module transforms the original EO products into smaller and more compact product representations that include features, metadata, image patches, etc. The database management system module is used for storing all the generated information, and allows for querying and retrieval within the available feature and metadata space. In contrast, the Data Mining module is in charge of finding user-defined patterns of interest via machine learning algorithms within the processed data and presenting the results to the users for final semantic annotation. The semantic annotation (label/category) chosen for a patch is based on the majority of the content of the selected patch (burnt forest areas, flooded areas, etc.).
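The majority rule for patch annotation can be illustrated with a few lines of code. This is only a sketch under the assumption that the content of a patch is available as a list of per-region labels; on the platform, this judgement is made interactively by the user. The function name and the toy patch are hypothetical.

```python
from collections import Counter

def majority_label(content_labels):
    """Return the label that covers the largest share of a patch.

    content_labels: iterable of hypothetical per-region labels inside one patch.
    """
    counts = Counter(content_labels)
    if not counts:
        raise ValueError("patch has no content labels")
    label, _ = counts.most_common(1)[0]
    return label

# Toy patch: mostly burnt forest, with some intact forest and a river.
patch = ["Burnt areas"] * 9 + ["Mixed forest"] * 5 + ["Rivers"] * 2
patch_label = majority_label(patch)
```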
Data Mining (see the general overview in Figure 10) is operated in two modes: EO Image Mining and EO Data Mining. The outputs of Data Mining are common semantic maps.
• EO Image Mining: Here, the users run a machine learning tool/component via its interactive GUI (see Figure 11) based on an Active Learning module (Blanchart, Ferecatu, Cui, & Datcu, 2014) (a form of supervised machine learning) which uses all actionable information. The learning algorithm is able to interactively interrogate a user (information source) to label new data points with the desired outputs.
During Active Learning, two goals are pursued: 1) learn the targeted image category as accurately and as exhaustively as possible and 2) minimize the number of iterations in the relevance feedback loop.
Active Learning has important advantages when compared with Shallow Machine Learning or Deep Learning methods, as presented in Table 2.
Particularly for EO image applications, Active Learning with very small training sets makes a detailed verification of the samples possible; thus, the results are trustworthy and avoid the problem of training database biases. Another important asset is its adaptability to the user conjecture. EO image semantics is very different from other definitions in geoscience, such as those of cartography. The EO image captures the actual reality on the ground, and the user can discover and understand it freely, extracting the best meaning, thus enriching the EO semantic catalogue.
The same interactive interface also lets the users verify the selected training samples by checking their surroundings, as there is a link between the patches in the upper left half and the right half. Here, the magenta color on the big quick-look panel marks the retrieved patches that are similar to the ones provided by the user. The users can also see the patches they selected as relevant and irrelevant (bottom part of the GUI).
• EO Data Mining: This is performed via SQL searches (see Figure 12(a-c)), queries, and browsing, extracting the data analytics information. Data Mining uses image features, image semantics, and selected EO product metadata.
The state diagram of the user operations is depicted in Figure 13 and comprises the following sequence of steps:
(1) Identify the imaging instrument: Here, the user decides which Sentinel products shall be selected.
(2) Identify the Sentinel products and transfer them to Data Mining: Choose the area to be processed via the CreoDIAS platform.
(3) Process the Sentinel products in Data Model Generation: Use the DMG module in order to extract the metadata and select the algorithm appropriate to the Sentinel data for feature and descriptor extraction. Select also the number of grids/levels.
(4) Extract the descriptors and transfer them to Data Mining: Compute the features and ingest the results into the database for further use.
(5) Run the Image Mining function: Users can search and mine for selected content based on their requirements.
(6) Extract Sentinel semantics: Ingest the semantically annotated content (i.e., the labelled patches) into the database. The taxonomy used for annotation is a list of labels from which the user can choose, or to which the user can add new labels.
(7) Query and combine the Sentinel semantics with the metadata: The user can now run queries based on metadata of the Sentinel products, based on the semantics (annotated by the user or available via the database), or by combining both query types.
(8) Generate analytics results: The output of the results will be in the form of statistical results, semantic classification maps, etc.
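Step (7), the combination of semantic and metadata queries, can be sketched with a small SQL example. The platform stores its data in MonetDB; the sketch below uses Python's built-in sqlite3 module purely as a stand-in, and the table names, column names, and rows are made-up toy values, not the platform's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, sensor TEXT, acquired TEXT);
    CREATE TABLE patches  (patch_id INTEGER PRIMARY KEY, product_id INTEGER, label TEXT);
""")
# Hypothetical product metadata and patch annotations.
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    (1, "Sentinel-1", "2019-08-02"),
    (2, "Sentinel-2", "2019-08-25"),
    (3, "Sentinel-2", "2019-09-09"),
])
conn.executemany("INSERT INTO patches (product_id, label) VALUES (?, ?)", [
    (1, "Mixed forest"), (2, "Burnt areas"), (2, "Mixed forest"),
    (3, "Burnt areas"), (3, "Burnt areas"), (3, "Mixed forest"),
])

def count_labels(conn, sensor, label):
    """Count patches with a given semantic label per acquisition date of one sensor."""
    rows = conn.execute(
        """
        SELECT p.acquired, COUNT(*)
        FROM patches AS t JOIN products AS p ON p.product_id = t.product_id
        WHERE p.sensor = ? AND t.label = ?
        GROUP BY p.acquired ORDER BY p.acquired
        """,
        (sensor, label),
    )
    return rows.fetchall()

burnt_per_date = count_labels(conn, "Sentinel-2", "Burnt areas")
```

The query filters on product metadata (the sensor) and on semantics (the label) at the same time, which is the kind of combined query described in step (7); the per-date counts are the raw material for the analytics of step (8).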

Testbed approach
We assume that we can rely on high-quality validation data. Therefore, when we use the CANDELA web platform, after the selection of the data for each use case, the images are processed directly by Data Model Generation and are ready for the Data Mining module. Each Sentinel image is cut into patches with a pre-selected size depending on the actual image ground sampling distance in order to cover an area of about 200 × 200 m on the ground. Based on the characteristics of the data, we selected for Sentinel-1 a patch size of 128 × 128 pixels, while for Sentinel-2 the patch size was 120 × 120 pixels (EOLib project, 2019).
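Cutting an image into patches of a fixed size can be sketched as follows. This is an illustrative tiling routine, not the Data Model Generation code: the function name and the image dimensions are hypothetical, and the choice to skip incomplete patches at the image borders is an assumption of the sketch.

```python
def tile_origins(height, width, patch_size):
    """Yield (row, col) pixel offsets of non-overlapping square patches.

    Incomplete patches at the right and bottom borders are skipped
    (an assumption of this sketch).
    """
    for row in range(0, height - patch_size + 1, patch_size):
        for col in range(0, width - patch_size + 1, patch_size):
            yield row, col

# Hypothetical 1000 x 520 pixel image cut into Sentinel-1 style 128 x 128 patches.
origins = list(tile_origins(height=1000, width=520, patch_size=128))
```

Each offset identifies one patch, for example `image[row:row + 128, col:col + 128]` if the image is held as a two-dimensional array.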
Active Learning in the Data Mining module is also supported by our spatial multigrid strategy. The EO image patches are partitioned in a pyramid, e.g., at a first scale in a 120 × 120 pixel grid, at a second, finer scale in a 60 × 60 pixel grid, and at a third, even finer scale in a 30 × 30 pixel grid (in the case of Sentinel-2). As for Sentinel-1, the grids are 128 × 128 pixels, 64 × 64 pixels, and 32 × 32 pixels.
Active Learning has a mechanism to make semantic annotations hierarchically, from coarse to fine grids. The mechanism is supported by a statistical decision, which discards irrelevant patches when going to a finer grid. This is a specific Big Data solution. It is possible to enlarge the labelled data by up to three orders of magnitude using a very small training data set, typically tens of samples.
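The coarse-to-fine strategy can be sketched as a quad-tree refinement in which only relevant patches are split further. The code below is an illustration under stated assumptions: `is_relevant` is a hypothetical callable standing in for the statistical decision, and the function names are not taken from the platform.

```python
def split_patch(row, col, size):
    """Split a square patch into its four quadrants (quad-tree children)."""
    half = size // 2
    return [(row + dr, col + dc, half) for dr in (0, half) for dc in (0, half)]

def coarse_to_fine(patches, is_relevant, levels):
    """Keep the relevant patches at each grid level and refine only those.

    patches: list of (row, col, size) tuples at the coarsest grid.
    is_relevant: hypothetical callable standing in for the statistical
    decision; it receives one (row, col, size) tuple and returns a bool.
    """
    current = list(patches)
    for level in range(levels):
        current = [p for p in current if is_relevant(p)]
        if level < levels - 1:
            current = [child for p in current for child in split_patch(*p)]
    return current

# One 120 x 120 Sentinel-2 patch refined over three grids (120, 60, 30 pixels).
# The toy relevance rule keeps only patches in the left half of the patch.
fine_patches = coarse_to_fine([(0, 0, 120)], lambda p: p[1] < 60, levels=3)
```

Because irrelevant patches are dropped before each split, the number of patches to be inspected at the finer grids stays small.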
For example, in (Datcu et al., 2020), for the "Water bodies" category, we use about 12% of all patches (at the first grid/level), while the rest of the patches are assigned to other categories and discarded from the classification. The retained patches are split again (in the second grid) and classified, and the residues that do not belong to the desired category are removed (we keep 65% of all patches). On the third grid/level, we repeated this procedure and finally annotated 94% of the patches with the category we were looking for.
The quick-look views of the patches are stored in a database for further use via the GUI of the Data Mining module (Dumitru et al., 2016).
The features describing each original patch can then be extracted. The available libraries of algorithms implemented in the platform are Gabor filters with linear moments or logarithmic cumulants (MPEG7, 2019), Weber local descriptors (Chen et al., 2010), and multispectral histograms (Georgescu, Vaduva, Raducanu, & Datcu, 2016). The experiments show that for SAR images the best feature extraction method is Gabor Linear Moments (e.g., with five scales and six orientations) (MPEG7, 2019) for man-made infrastructure categories (e.g., Urban and Industrial areas, Transportation), while for natural categories (e.g., Agriculture, Forest, Natural vegetation) the Adaptive Weber Local Descriptor is the best performer (e.g., with 8 orientations and 18 excitation levels) (Chen et al., 2010). A comparison between different feature extraction methods is already described in (Dumitru & Datcu, 2013) for high-resolution TerraSAR-X images. For multispectral images, the best feature extraction method is the Weber Local Descriptor (e.g., with 8 orientations and 18 excitation levels) (Georgescu et al., 2016). The extracted features of each patch are then stored in our database.
These features are then routinely classified (i.e., in an unsupervised approach) using Data Mining and grouped into clusters using machine learning based on Cascaded Active Learning (Blanchart et al., 2014). In our case, we used a Support Vector Machine (SVM) classifier with a χ² kernel and a one-against-all approach.
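To make the relevance feedback loop more concrete, the following sketch combines a χ² kernel with a very simple active learning loop. Several assumptions apply: the kernel is written in one common exponential form (the text does not specify which variant is used), the SVM is replaced by a much simpler kernel-mean scorer to keep the example free of dependencies, and the `oracle` callable is a hypothetical stand-in for the user who marks patches as relevant or irrelevant.

```python
import math

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel for non-negative feature histograms."""
    distance = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * distance)

def relevance_score(sample, positives, negatives):
    """Mean similarity to relevant examples minus mean similarity to irrelevant ones."""
    pos = sum(chi2_kernel(sample, p) for p in positives) / len(positives)
    neg = sum(chi2_kernel(sample, n) for n in negatives) / len(negatives)
    return pos - neg

def active_learning(pool, oracle, positives, negatives, iterations):
    """In each round, ask the user (oracle) about the most ambiguous patch."""
    pool, positives, negatives = list(pool), list(positives), list(negatives)
    for _ in range(iterations):
        if not pool:
            break
        # The most ambiguous patch is the one whose score is closest to zero.
        query = min(pool, key=lambda s: abs(relevance_score(s, positives, negatives)))
        pool.remove(query)
        (positives if oracle(query) else negatives).append(query)
    return positives, negatives, pool

# Toy two-bin histograms; the hypothetical user accepts patches dominated by bin 0.
pool = [(0.9, 0.1), (0.6, 0.4), (0.4, 0.6), (0.1, 0.9)]
pos, neg, rest = active_learning(pool, lambda s: s[0] > 0.5,
                                 [(1.0, 0.0)], [(0.0, 1.0)], iterations=2)
```

Querying the most ambiguous patches first is one way to pursue the second goal named earlier, a small number of feedback iterations. In a one-against-all setting, one such scorer would be trained per semantic category.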
The proposed approach also implements a second important function, the hierarchical labelling of the EO images (Dumitru et al., 2016). Firstly, the multi-grids generate a finer localization of the semantic class; this is a quad-tree-like spatial multiscale structure. Secondly, since the semantics change with the scale of the image patches, a semantic tree is generated. This is an explainable method, i.e., an image patch at the coarsest scale is indexed with more detailed meaning at finer scales.
The entire information is stored in the database and can be further queried or used to generate additional analytics (e.g., semantic classification maps, statistical analytics, etc.). Table 3 presents some statistics on the volume of data analysed using the CANDELA platform (more precisely, the Data Mining module) and on the diversity of their locations. From the available data, we selected the appropriate ones for our use cases. The semantic labels were selected from (Dumitru et al., 2016) and represent individual labels (if the same label appears several times, it is counted only once).

Experimental results
In this paper, we illustrate the usefulness of the platform by six examples, outline the classification results, and demonstrate different statistics obtained from the data. For all use cases, we exploited appropriate image data. For ease of use, we kept the same color coding for each semantic category/label. Any remaining differences between the labelling results can be due to the actual resolution and the patch size of the data.

Experimental results for the fires in the Amazon rainforest use case.
For this use case, we selected a multi-sensor and multi-temporal data set, acquired by Sentinel-2 and Sentinel-1. Based on the availability of both instruments, we were able to select more than 10 images for each instrument for a period from the beginning of August to the beginning of September 2019. As the highest intensity of the fires occurred around August 25th, 2019, we aimed at obtaining images acquired before, during, and after the fire. Using the Data Mining module, we were able to classify and to semantically annotate all selected images, and to extract several analytics from which we could then extract other statistics. The resulting quick-look views of the investigated areas together with their semantic classification maps are depicted in Figure 14 for Sentinel-1 data, and in Figure 15 for Sentinel-2 data.
From these two figures, we can see that the difference in resolution between the instruments also has an implication for the number of extracted categories.
Figure 16 shows the diversity of the discernible categories, and the changes between the three acquisition dates using Sentinel-1 data (left-hand side of the figure), and Sentinel-2 data (right-hand side of the figure).
In the case of Sentinel-2, by counting the number of patches semantically annotated as Burnt areas, we can easily compute the affected area. Knowing the resolution of 10 m (using the Sentinel-2 bands B02, B03, and B04 with a resolution of 10 m), and a patch size of 120 × 120 pixels, we obtain an area of 12 km² for the image acquired during the fire on August 25th, 2019, and an area of 23 km² for the image acquired after the fire expired on September 9th, 2019.
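The conversion from a patch count to an area is straightforward and can be sketched as follows. The function names are hypothetical, the patch count used in the example is an invented value, and the sketch assumes that all counted patches have the same size (here the coarsest 120 × 120 pixel grid); patches annotated at finer grids would contribute correspondingly smaller areas.

```python
def patch_area_km2(patch_size_px, pixel_spacing_m):
    """Ground area of one square patch in square kilometres."""
    side_km = patch_size_px * pixel_spacing_m / 1000.0
    return side_km * side_km

def labelled_area_km2(n_patches, patch_size_px, pixel_spacing_m):
    """Total area covered by n_patches carrying the same semantic label."""
    return n_patches * patch_area_km2(patch_size_px, pixel_spacing_m)

# One Sentinel-2 patch of 120 x 120 pixels at 10 m covers about 1.44 km².
area_per_patch = patch_area_km2(120, 10)
# Hypothetical example: 8 patches annotated as Burnt areas.
example_area = labelled_area_km2(8, 120, 10)
```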
Similar results are obtained using the Sentinel-1 data. The largest area affected by fires is a Mixed forest area, and very small percentages are Agricultural areas. We can see that the Burnt areas double between August 2nd and August 25th, 2019. From the same figure, we can observe how some categories that had a larger area before the event are reduced, how categories with a smaller area are increased, and how even new categories appear (e.g., Burnt areas).

Experimental results for the windstorms in Poland use case.
For this use case, we selected a multi-sensor and multi-temporal data set, acquired by Sentinel-1 and Sentinel-2 (see Figures 17 and 18). This helps evaluate the area affected by the windstorms. In both cases, the first image was acquired on July 30th, 2017, as the pre-event image, while the second image is the post-event image.
Due to the cloud coverage compromising the Sentinel-2 images, it is difficult to evaluate the affected forest area. By analyzing the Sentinel-1 data of the same area on the ground, however, we were able to compute the area in km² simply from the percentage of patches that were previously Forest and are now annotated as Agriculture areas (we do not have a Wind-damage label category).
From Figure 19, we can extract the percentage of the affected area. Knowing the patch size of the Sentinel-1 data of 128 × 128 pixels together with the given pixel spacing and resolution, we could compute the affected forest area as 42 km².

Experimental results for the deforestation in Romania use case.
For this use case, only very few Sentinel-1 images were available. From them, we chose an image recorded in 2015 as a pre-event image to be sure that no big deforestation had already taken place, and we selected another image after the deforestation had been discovered (as a post-event image). See the results shown in Figures 20 and 21.
After the classification, we were able to generate Figure 22, from which we can see that the percentage of deforestation amounts to 12%. Knowing the patch size of the Sentinel-1 data of 128 × 128 pixels together with their given pixel spacing and resolution, we computed the deforested area as comprising 46 km².

Experimental results for the floods in the Omaha use case.
For this use case, three images were selected: a pre-event image, an image taken during the flooding, and a post-event image. Each image was processed using the platform tools, and each patch of the images was semantically annotated using the hierarchical annotation scheme described in Dumitru et al. (2016). From the content of these images, we were able to extract five semantic categories, namely Agricultural land (which includes prairies and grasslands), Mixed forest, Rivers, Mixed urban areas, and Flooded areas (a category that appears in the second and third images).
The results of the annotation are shown in Figure 23, where each image is shown as an RGB quick-look image (bands B4, B3, and B2 at 10 m resolution of Sentinel-2), alongside the classification map generated after the annotation.
By querying the database for each semantic category, we were able to generate statistical analytics. An example is Figure 24, which shows the changes that appear among the three images.
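A per-category query of this kind can be sketched with an in-memory SQLite database. The table layout (one row per annotated patch with an image identifier and a label) and the function name are assumptions made for illustration; the actual schema of the platform's DBMS is not described here.

```python
import sqlite3


def category_shares(rows):
    """Return {(image_id, label): percentage of that image's patches}.

    rows: iterable of (image_id, label) tuples, one per annotated patch.
    The 'patch' table is a hypothetical stand-in for the real catalogue.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE patch (image_id TEXT, label TEXT)")
    con.executemany("INSERT INTO patch VALUES (?, ?)", rows)
    counts = con.execute(
        "SELECT image_id, label, COUNT(*) FROM patch "
        "GROUP BY image_id, label"
    ).fetchall()
    con.close()

    # Number of annotated patches per image, used as the denominator.
    totals = {}
    for image_id, _label, n in counts:
        totals[image_id] = totals.get(image_id, 0) + n
    return {(image_id, label): 100.0 * n / totals[image_id]
            for image_id, label, n in counts}


if __name__ == "__main__":
    # Toy annotations, invented for illustration only.
    demo = [("pre", "Rivers"), ("pre", "Agricultural land"),
            ("pre", "Agricultural land"), ("pre", "Mixed forest"),
            ("event", "Flooded areas"), ("event", "Flooded areas"),
            ("event", "Agricultural land"), ("event", "Mixed forest")]
    for key, pct in sorted(category_shares(demo).items()):
        print(key, pct)
```

Comparing the shares of one label across the images of a time series gives the kind of change statistics plotted per category.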
In Figure 24, it can be seen that, after the floods, the share of the category Mixed urban areas increased implausibly across the images. This is related to the visibility of buildings within the scenes when the annotation was made. One explanation is that some buildings were not visible and were therefore included in Agricultural land: the first two images were taken in winter, when the area was covered by snow, whereas the last image was taken in summer.
Another category for which we noticed changes is Rivers. This category appears only in the pre-event image, because afterwards it is merged into a new category, namely Flooded areas, which is found in the image taken during the event and in the post-event image.
Extracting the percentage of the Flooded areas from Figure 24, and knowing the patch size for the classification of 120 × 120 pixels and the 10 m resolution of the Sentinel-2 image, we can compute the affected area in km² for the event image and the post-event image. For the event image, the affected area covers about 1000 km², which shrank, after three months, to 445 km².

Experimental results for the floods in the Beira use case.
In this use case, we initially tried to retrieve images from Sentinel-1 (2019) and Sentinel-2 (2019) that cover the area of interest (as in use case 3), but these images were not available or were affected by clouds (see Figure 25 bottom-left). Finally, an image of Sentinel-1 was available after the [...] We also chose one image of Sentinel-2 in order to demonstrate the influence of the clouds.
Both images were semantically annotated and we retrieved the following categories: Sea, Small vessels, Brush/Rangeland, Mixed urban areas, Mountains, Clouds, and Flooded areas.
In the case of Sentinel-1, which is not affected by clouds, we were able to classify the area affected by the floods. An evaluation of the annotated Sentinel-2 image led us to the conclusion that only a small part of the flooded surface was visible through the clouds. The results of both classifications are presented in Figure 25.
The distribution of the retrieved and classified categories is illustrated in Figure 26 (only for the Sentinel-1 data).
When considering the percentage obtained after classification for the category Flooded areas and knowing the patch size of the Sentinel-1 data (128 × 128 pixels), their resolution of 20 m, and the pixel spacing of 10 m, we could compute the affected area. In this case, the total affected area was 330 km².
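The conversion from a category percentage to an area can be sketched as below. The sketch assumes that the ground footprint of a patch is set by the pixel spacing (10 m) rather than by the 20 m resolution, since the spacing determines how much ground each pixel step covers; the function name, the total number of patches, and the example share are hypothetical.

```python
def affected_area_km2(share: float, n_patches_total: int,
                      patch_size_px: int = 128,
                      pixel_spacing_m: float = 10.0) -> float:
    """Area covered by a category given its share of all patches.

    share: fraction (0..1) of patches annotated with the category.
    n_patches_total: number of patches in the scene (assumed known).
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be a fraction between 0 and 1")
    # One patch spans patch_size_px * pixel_spacing_m metres per side.
    patch_km2 = (patch_size_px * pixel_spacing_m / 1000.0) ** 2
    return share * n_patches_total * patch_km2


if __name__ == "__main__":
    # Hypothetical scene: 400 patches, 25% labelled with the category.
    print(round(affected_area_km2(0.25, 400), 2), "km^2")
```

A 128 × 128 pixel patch at 10 m spacing covers 1.28 km × 1.28 km; using the 20 m resolution instead would overestimate the area by a factor of four.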

Experimental results for the monitoring of urban areas use case.
The results of the last use case were ordered alphabetically (first by continent, and then by city). The full list of analyzed cities is shown in Section 2.2.6. Because of space limitations, we picked four cities from the full list to show in this paper.

Tokyo and surrounding areas
From the classification results (see Figure 27), we noticed that for Sentinel-2 with 10 m resolution it was possible to retrieve more categories than for Sentinel-1 with 20 m resolution. The diversity of the retrieved categories and the percentage of each category are depicted in Figure 28.
For the category Mountains retrieved from the Sentinel-1 image, it was not possible to separate Mountains from Hills, but it was possible to separate the Volcano from the category Mountains. The classification map also shows that, for Sentinel-1, it was possible to extract the category Boats because of their higher reflectance when compared to Water bodies.

Amsterdam and surrounding areas
Similar to Tokyo, we selected one image from Sentinel-1 and one image from Sentinel-2.
Comparing the classification results in Figure 29, the number of categories retrieved for Sentinel-2 is once more higher than the number discernible for Sentinel-1. Using Sentinel-2, it was possible to separately identify Ijssel Lake, Marker Lake, and the categories Tidal flats/Deltas, which for Sentinel-1 are classified as Sea. The category Agricultural land from Sentinel-1 was split into two categories in the Sentinel-2 data. The diversity of each category is shown in Figure 30.

Saint Petersburg and surrounding areas
For this city, we selected a single Sentinel-2 image, as no Sentinel-1 data were available for this period. When analyzing this image, we identified an interesting category, namely Frozen water/ground (see Figures 31 and 32). When using the Sentinel-2 data, it was not possible to extract more individual categories or to split this category into other categories. However, we expect that Sentinel-1 data would allow us to separate or split the categories Ice or Frozen water (Dumitru, Andrei, Schwarz, & Datcu, 2019).

Cairo and surrounding areas
Here too, only Sentinel-2 data were available. During classification, we encountered a problem with the Desert category, which has a high reflectance and, in some areas of the image, covers other objects. The results of the classification and the diversity of the retrieved categories are shown in Figures 33 and 34.

Observations about the use cases
For all the use cases (but especially for the urban one), we observed that the number of semantic labels retrieved for Sentinel-2 is higher than that obtained for Sentinel-1. For example, in Figure 27, 11 semantic labels are retrieved for Sentinel-2, while for Sentinel-1 there are only 6. This means that the sensor resolution influences the number of semantic labels that can be extracted and classified. We observed this earlier in Dumitru et al. (2018), when we compared high-resolution SAR images at 2.9 m resolution provided by TerraSAR-X with medium-resolution SAR images at 20 m resolution provided by Sentinel-1.
For the forest use case, if the area is not covered by clouds, we recommend using Sentinel-2 images: their higher resolution reveals more details and can lead to a better separation between categories (e.g., Smoke, Clouds), some of which do not appear in Sentinel-1 at all (e.g., Smoke). For more details, see Figure 15 vs. Figure 14.
For the floods use case, both sensors can be used, but Sentinel-1 is more appropriate for better accuracy. In the case of Omaha, no Sentinel-1 images were available for the area, so we used Sentinel-2 images (which were not covered by clouds), and the results are very satisfactory. In the case of Beira, the area was covered by clouds and very few Sentinel-1 images were available. This is the reason why we could not make an assessment of the affected area without an image before or after the event.
For the urban use case, we noticed that the definitions and the number of retrieved categories are influenced by the geographical location and the architecture of the city (including its size and density) (Dumitru, Cui, Schwarz, & Datcu, 2015). In another study on the simultaneous processing of several images using the Data Mining module, we noticed that, for a better grouping of categories, the images need to come from the same geographical location or show the same architecture.

Data mining validation
The validation of the Data Mining module prior to its integration into the CANDELA platform was carried out in (EOLib project, 2019), where, for the first time, a data set of multispectral images (e.g., WorldView) and a large data set of SAR images (e.g., TerraSAR-X) were classified and semantically annotated (Dumitru et al., 2018) with an accuracy of about 95%.
As part of Activity 5, during the EO Big Data Hackathon (Joint Hackathon 2019), we conducted a large-scale validation and testing of our Data Mining module. In two days, five European H2020 projects (including CANDELA) funded by the same EO-2-2017 EO Big Data Shift Call (CORDIS, 2019) were tested and evaluated by a large number of expert users in the field (including the reviewers of the European Commission (European Commission, 2019)) with respect to the maturity of the algorithms and the usability of the platforms.
A more detailed analysis of the Data Mining module was made within the project by two partners that had the role of users, namely SmallGIS and Terranis (see the deliverables in Candela (2019)).
Finally, a quantitative evaluation of the module was performed in order to assess how well the retrieved results satisfied the user's query intent. The following metrics were used for evaluating the performance of the Data Mining module: Precision/Recall, Accuracy, F-measure, Fall-Out, Specificity, and ROC-Curve. The definitions of these metrics can be found in (Manning, Raghavan, & Schütze, 2008; Powers, 2011). From this list, Accuracy and Fall-Out are selected for a number of categories.
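For reference, the scalar metrics in this list can be computed from the four confusion counts of a single category, as in the sketch below. It uses the standard textbook definitions; treating the F-measure as the balanced F1 score and returning 0.0 for an empty denominator are assumptions of this sketch, and the function name is hypothetical.

```python
def retrieval_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Per-category metrics from true/false positives and negatives."""

    def ratio(num: float, den: float) -> float:
        # Convention assumed here: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        # Balanced F-measure (harmonic mean of precision and recall).
        "f_measure": ratio(2 * precision * recall, precision + recall),
        # Fall-out is the false positive rate; specificity is 1 - fall-out.
        "fall_out": ratio(fp, fp + tn),
        "specificity": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts for one category.
    for name, value in retrieval_metrics(90, 10, 10, 890).items():
        print(f"{name}: {value:.4f}")
```

The ROC curve is not a single number: it is obtained by plotting recall against fall-out while the decision threshold of the classifier is varied.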
• Input data: Both process Sentinel-2 data, but the Data Mining module also processes Sentinel-1 data and data from other satellite missions. Both use patches of 120 × 120 pixels for Sentinel-2.
• Extracted features: The component of EOpen uses deep learning features, while the Data Mining module uses standard classification features.
• Supervised/unsupervised learning: The component of EOpen is unsupervised, while the Data Mining module, based on active learning, is a supervised/semi-supervised learning tool.
Charfuelan, Demir, & Markl, 2019)). The Data Mining module uses an open number of up to 100 classes (depending on the sensor resolution).
• However, our multispectral images are sometimes affected by clouds and often no reliable reference data are available.
As short-term future work, the fusion (Datcu et al., 2019b) of data coming from SAR (e.g., Sentinel-1) and multispectral (e.g., Sentinel-2) instruments is under validation with the CANDELA platform. The data fusion module is also based on the Data Mining module, adding a component for the fusion of radar and multispectral data and features/descriptors with different patch sizes and for the fusion of different semantic labels. This will help circumvent the Sentinel-2 problems with cloud cover. As long-term future work, we plan to combine the two polarizations of the Sentinel-1 instrument and to analyze their influence on the number of additional categories that can be retrieved, and on the quality of these categories. As for Sentinel-2, we will continue to analyze the results already obtained, to see how different combinations of the bands available for this instrument influence the results, whether they can provide a better separation of the different categories (e.g., smoke from clouds), and whether they can increase the number of retrieved categories and their accuracy. As a first example, we show the impact of different band combinations of Sentinel-2 channels. Figure 44 illustrates the band-dependent appearance of Clouds, Smoke, and Fires in the Amazon rainforest (for more details, see also (ESA: Fires ravage the Amazon, 2019)). This study is currently in progress and will be published in a future paper.

Figure 1. Location of our Amazonian target area marked on Google Maps (Google Maps, 2019).

Figure 2. Location of our target area in Poland marked on Google Maps (Google Maps, 2019).
Beira, Mozambique. In March 2019, in the same period as the flooding in Omaha, another flood took place, this time in Beira, Mozambique, caused by Cyclone Idai.

Figure 3. Location of our target area in Romania marked on Google Maps (Google Maps, 2019).

Figure 4. Location of our target area in Nebraska marked on Google Maps (Google Maps, 2019).

Figure 6. Locations of the selected cities marked on Google Maps (Google Maps, 2019).

Figure 5. Our target area in Mozambique marked on Google Maps (Google Maps, 2019).

Figure 7. The View Model of CANDELA seen from three perspectives.

Figure 8. Block diagram of the CANDELA platform modules as information processing flow (Candela, 2019).

Figure 9. Architecture of the Data Mining module on the platform and front end (Datcu et al., 2019a).
The key idea behind Active Learning is that a machine learning algorithm can achieve greater accuracy with fewer training examples if it is allowed to choose the data from which it learns. The input is the training data set obtained interactively from the GUI, that is, a list of images marked as positive or negative examples. The output is the verification of the Active Learning loop sent to the GUI and the semantic annotation written in the DBMS catalogue. In summary, the functions are search, browse, and query for image patches of interest to the user. The discovered relevant structures are semantically annotated and stored in the DBMS. The tool uses only image features. The actual EO image semantics are learned and adapted to the user conjectures and applications. Active Learning methods include Relevance Feedback, which supports users in searching for images of interest in a large repository. The GUI automatically ranks the suggested images, which are expected to be grouped in the class of relevance. Visually supported ranking enhances the quality of search results through positive and negative examples.
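The loop described in this paragraph can be illustrated with a small, self-contained sketch. It is not the classifier used by the Data Mining module: a nearest-centroid score stands in for the real learner, the `oracle` callback stands in for the user marking a patch as relevant or irrelevant in the GUI, and all names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def relevance_scores(features, positives, negatives):
    """Positive score: patch lies closer to the positive examples."""
    c_pos = centroid([features[i] for i in positives])
    c_neg = centroid([features[i] for i in negatives])
    return [math.dist(f, c_neg) - math.dist(f, c_pos) for f in features]


def active_learning_loop(features, oracle, positives, negatives, rounds=5):
    """Ask the oracle about the most ambiguous patch in each round."""
    positives, negatives = list(positives), list(negatives)
    for _ in range(rounds):
        scores = relevance_scores(features, positives, negatives)
        labelled = set(positives) | set(negatives)
        candidates = [i for i in range(len(features)) if i not in labelled]
        if not candidates:
            break
        # The patch whose score is nearest to zero is the least certain.
        query = min(candidates, key=lambda i: abs(scores[i]))
        (positives if oracle(query) else negatives).append(query)
    scores = relevance_scores(features, positives, negatives)
    # Final ranking: most relevant patches first.
    ranking = sorted(range(len(features)), key=lambda i: scores[i],
                     reverse=True)
    return ranking, positives, negatives


if __name__ == "__main__":
    # Toy 2-D features: patches 0-4 are "relevant", 5-9 are not.
    feats = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2),
             (8, 8), (9, 9), (10, 9), (9, 10), (10, 10)]
    ranking, pos, neg = active_learning_loop(feats, lambda i: i < 5, [0], [9])
    print("ranking:", ranking)
```

Starting from one positive and one negative example, each round adds exactly one user decision, which is why the training set stays small enough to be verified by hand.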

Figure 11. GUI of Image Mining. (Top): Interactive interface to retrieve images belonging to the categories that exist in a collection (e.g., Smoke). The upper left half shows relevant retrieved patches, while the lower left half shows irrelevant retrieved patches. The large GUI panel on the right shows the image that is being worked on, which can be zoomed. (Bottom): The same interactive interface, but in this case the users can verify the selected training samples by checking their surroundings, as there is a link between the patches in the upper left half and the right half. Here, the magenta color on the big quick-look panel shows the retrieved patches that are similar to the ones provided by the user. The user can also see the patches that he/she selected as relevant and irrelevant (bottom part of the GUI).

Figure 12. (a) Query interface of Data Mining. (Left): Select the query parameters, e.g., "metadata_id", "mission", and "sensor", and enter the desired query values in the query-expression table below. (Right): Display the products that match the query criteria ("mission" = S2), including their metadata and image files. (b) Query interface of Data Mining. (Left): Select the metadata parameters (e.g., "mission = S2A" and "metadata_id = 39") and (Right) the semantics parameters (e.g., "name = Smoke"). Depending on the needs of the EO user, it is possible to use only one of these two queries or to combine them. The combined query returns the number of results from the database (in this case 543). (c) Query interface of Data Mining. This figure presents a list of images matching the query criteria (from Figure 12b). The upper part of the figure shows a table composed of several metadata columns corresponding to the patch information, while the lower part of the figure presents a list of the quick-looks of the patches that correspond to the query.

Figure 13. Operation state diagram of Data Mining.

Figure 14. A multi-temporal data set for the first use case. (From left to right and from top to bottom): Quick-look views of the first Sentinel-1 image from August 2nd, 2019, of the second image from August 26th, 2019, and of the last image from September 7th, 2019, followed by the classification map of each of the three images.

Figure 16. Diversity of categories, and the change of categories identified from three Sentinel-1 images (top) and three Sentinel-2 images (bottom) that cover the area of interest of the first use case. The Sentinel-1 images were acquired on August 2nd, 2019, August 26th, 2019, and on September 7th, 2019, while the Sentinel-2 images were acquired on August 5th, 2019, August 25th, 2019, and on September 9th, 2019.

Figure 17. A multi-sensor and multi-temporal data set for the second use case. (Top, from left to right): A quick-look view of a first Sentinel-2 image from July 30th, 2017, and its classification map, and a quick-look view of a second Sentinel-2 image from September 28th, 2017, and its classification map. (Bottom, from left to right): A quick-look view of a first Sentinel-1 image from July 30th, 2017, and its classification map, and a quick-look view of a second Sentinel-1 image from August 29th, 2017, and its classification map.

Figure 18. Diversity of categories identified from two Sentinel-2 and two Sentinel-1 images that cover the area of interest of the second use case. (From left to right and from top to bottom): The distribution of the retrieved and annotated categories of the Sentinel-2 images acquired on July 30th, 2017 and on September 28th, 2017, and the categories of the Sentinel-1 images acquired on July 30th, 2017 and on August 29th, 2017. The differences between the Sentinel-2 and Sentinel-1 results can be explained by clouds being only visible in Sentinel-2 images. For the different labels, see Section 4.2.

Figure 19. Semantic label changes between two Sentinel-1 images (right) and two Sentinel-2 images (left) acquired for the second use case. The change values should be multiplied by 100 in order to obtain the percentage of the change. The results are given for the windstorms in Poland.

Figure 21. Diversity of categories identified from two Sentinel-1 images that cover the area of interest of the fifth use case. (From left to right): Distribution of the retrieved and annotated categories of the images acquired on June 27th, 2015, and on September 1st, 2016.

Figure 22. Semantic label changes between two Sentinel-1 images acquired for the fifth use case.

Figure 23. A multi-temporal data set for the third use case. (From left to right and from top to bottom): RGB quick-look views of the first Sentinel-2 image from March 1st, 2018, of the second image from March 21st, 2018, and of the last image from June 24th, 2018, followed by the classification maps of each of the three images.

Figure 24. Distribution of retrieved semantic categories for the three images of the third use case, the floods in Omaha, Nebraska, USA.

Figure 26. Diversity of categories identified from a Sentinel-1 image that covers the area of interest of the fourth use case.

Figure 27. A multi-temporal data set for Tokyo and its surrounding areas. (Top, from left to right): A quick-look view of a first Sentinel-1 image from July 26th, 2019, and its classification map. (Bottom, from left to right): A quick-look view of a second Sentinel-2 image from May 8th, 2019, and its classification map.

Figure 28. Diversity of categories extracted from a Sentinel-1 image and from a Sentinel-2 image that cover the area of interest of Tokyo and its surrounding areas. (From left to right): The distribution of the retrieved and semantically annotated categories of the images acquired on July 26th, 2019, and on May 8th, 2019. The differences between Sentinel-1 and Sentinel-2 results are mainly due to the higher resolution of the Sentinel-2 data.

Figure 29. A multi-temporal data set for Amsterdam and its surrounding areas. (Top, from left to right): A quick-look view of a first Sentinel-1 image from March 22nd, 2016, and its classification map. (Bottom, from left to right): An RGB quick-look view of a second Sentinel-2 image from April 21st, 2016, and its classification map.

Figure 30. Diversity of categories identified from a Sentinel-1 image and from a Sentinel-2 image that cover the area of interest of Amsterdam and its surrounding areas. (From left to right): The distribution of the retrieved and semantically annotated categories of the images acquired on March 22nd, 2016, and on April 21st, 2016. The differences between Sentinel-1 and Sentinel-2 results are mainly due to the higher resolution of the Sentinel-2 data.

Figure 31. A data set for Saint Petersburg and its surrounding areas. (From left to right): An RGB quick-look view of a Sentinel-2 image from April 4th, 2019, and its classification map.

Figure 32. Diversity of categories identified from a Sentinel-2 image that covers Saint Petersburg, Russia. This image was acquired on April 4th, 2019.

Figure 34. Diversity of categories identified from a Sentinel-2 image that covers Cairo, Egypt. This image was acquired on July 8th, 2019.

Figure 36. A data set of Aquila, Italy (acquired by the QuickBird sensor during an earthquake) and its surrounding areas. (From left to right): An RGB quick-look view of a QuickBird image from April 6th, 2009, and its classification map. The sensor parameters are described in QuickBird sensor parameter description and data access (QuickBird, 2020).

Figure 37. A data set of the French Riviera and its surrounding areas. (From left to right): An RGB quick-look view of a Spot-5 image from April 23rd, 2001, and its classification map. For classification, three bands (bands 1, 2, and 3) were selected. The sensor parameters are described in Spot sensor parameter description (Spot, 2020).

Figure 38. A data set of Venice, Italy and its surrounding areas. (From left to right): An RGB quick-look view of a WorldView-2 image from September 9th, 2012, and its classification map. From the eight available bands of the sensor, we used three bands for classification (bands 1, 2, and 3). The sensor parameters are described in WorldView sensor parameter description (WorldView, 2020).

Figure 39. A data set of Calcutta, India (acquired by the Sentinel-3 sensor) and its surrounding areas. (From left to right): An RGB quick-look view of a Sentinel-3 image from January 8th, 2017, and its classification map. From the 21 available bands of the sensor, we used eight bands for classification (bands 1, 2, 3, 6, 12, 16, 19, and 21). The sensor parameters are described in Sentinel-3 sensor parameter description and data access (Sentinel-3, 2020).

Figure 42. A data set of Vancouver, Canada and its surrounding areas. (From left to right): A quick-look view of a RADARSAT-2 image from April 16th, 2008, and its classification map. The sensor parameters are described in RADARSAT sensor parameter description (RADARSAT, 2020).

Table 1. Selected use cases and their parameters.

Table 2. Comparison between Shallow machine learning (ML), Deep Learning, and Active Learning.

Table 3. The amount of data processed by the Data Mining module.

• Evaluation metrics: The component of EOpen uses Mean Average Precision (mAP), while six metrics are implemented in the Data Mining module (see Section 4.3.2). The component of EOpen is used as a web application, while the Data Mining module is a GUI linked to the platform.

Table 4. Demonstration of the Big data achievements with the Data Mining module.
Sentinel-2 and 10 other multispectral and SAR EO image types
Velocity: Fast operation (minutes)
Veracity: Training data selected in an active learning loop, very small volume, thus verifiable
Value: Extraction of users/applications adapted to EO image semantics

Table 5. The achievements of the Data Mining module for each activity.
Demonstration of the Data Mining on 10 multispectral and SAR images from contributing missions.
Activity 3: Two Data Mining sub-modules have been encapsulated into Dockers and deployed on the CANDELA cloud platform. Users use Jupyter notebooks to launch these processes.
Activity 4: Data Mining achieves state-of-the-art (SoA) accuracy, with few training samples (beyond SoA) and very fast (beyond SoA).
Activity 5: Data Mining is processing data, operating with Sentinel-1 and Sentinel-2 products, which was demonstrated in the urban expansion and agriculture use case, forest monitoring use case, and a Big data demonstration.