Mawrth Vallis, Mars, classified using the NOAH-H deep-learning terrain classification system

ABSTRACT A deep learning (DL) terrain classification system, the Novelty and Anomaly Hunter – HiRISE (NOAH-H), was used to produce a terrain map of Mawrth Vallis, Mars. With it, we digitised the extent and distribution of transverse aeolian ridges (TARs), a common type of martian aeolian bedform. We present maps of the site, classifying terrain into descriptive classes and interpretive groups. TAR density maps are calculated, and the network output is compared to a manually produced map of TAR density, highlighting the differences in approach and results between these methods. Even when mapping on a small scale, humans must divide the terrain into coherent patches in order to map a large area in a reasonable time frame. Conversely, the speed of DL systems enables mapping on the pixel scale, producing a more detailed product, but one which is also “noisier”, and less immediately informative. There are pros and cons to both approaches.

Highlights
• A morphological map of Mawrth Vallis, Mars, has been created, classifying variations in surface texture into 14 descriptive classes.
• A deep learning (DL) convolutional neural network was trained to predict these classes in further HiRISE images, which had not been used for training.
• The resulting classified rasters were orthorectified and mosaicked using ArcGIS.
• Appropriate classes from the resulting map were compared with manual digitisation of the spatial densities of Transverse Aeolian Ridges (TARs).
• This comparison highlights the different scales at which human and DL mapping takes place, and that the two datasets have different strengths and weaknesses.
• The speed at which the network can complete its task allows it to attempt a higher level of fidelity than is possible for a human.
• Derived maps of the density of boulders and TARs were also produced using both the DL and manual datasets.


Introduction
We present a terrain map of Mawrth Vallis, Mars, showing the distribution of aeolian bedforms and classifying surface texture type (Main Map sheets 1 and 2). Our classification ontology was designed with rover traversability in mind, as part of the landing site selection process for the European Space Agency (ESA) Rosalind Franklin Rover mission (ExoMars, Loizeau et al., 2019; Vago et al., 2017). This is the third in a series of maps presenting the results of the Novelty and Anomaly Hunter – HiRISE (NOAH-H) study (Barrett et al., 2022b; Wright et al., 2022). NOAH-H is a deep learning (DL) convolutional neural network designed to classify the surface of Mars into 14 textural classes (Barrett et al., 2022a; Woods et al., 2020). The network classifies satellite images from the High-Resolution Imaging Science Experiment (HiRISE) instrument. We used red-band images with 25 cm/pixel resolution (McEwen et al., 2007). The NOAH-H network was trained on data from Oxia Planum (e.g. Fawdon et al., 2021; Quantin-Nataf et al., 2021) and Mawrth Vallis (e.g. Loizeau et al., 2019; Poulet et al., 2020), the final two candidate ExoMars landing sites under consideration at the time the DL system was devised (Vago et al., 2017).
In Barrett et al. (2022b), we published our map of the selected landing site in Oxia Planum, Mars. For Oxia, no large-scale human-made map is yet available for comparison. However, the DL classified masks were a good qualitative match for landscape-level trends visible in HiRISE images. The network demonstrated a high mean Intersection over Union (IoU) when comparing the portion of the training data reserved for validation (Barrett et al., 2022a), giving us confidence in the results of the classification.
In Wright et al. (2022), we presented results for Jezero Crater, Mars, the landing site of NASA's Perseverance Rover (Farley et al., 2020). Here we compared our results to manual mapping of the site conducted by the Perseverance team (Stack et al., 2020) as well as to in situ observations from both the rover and the Ingenuity helicopter (Balaram et al., 2021). While the mapping of Stack et al. (2020) was conducted by a different team, using very different categorisations, our comparison at Jezero Crater was broadly favourable. We were able to match up their work with ours in broad terms, although we were hampered by the difference in mapping scale. This suggests that our model is fit for purpose when predicting broad, landscape-level trends. Since Jezero Crater was not a source of training data, this also demonstrates transferability of the model to similar regions of Mars.
Here we present our work characterising Mawrth Vallis. This area was studied intensively as part of the ExoMars landing site selection process. Part of that analysis included identifying and mapping transverse aeolian ridges (TARs); TARs represent a substantial rover traversability hazard (Balme et al., 2018a), and their mapping constituted an important consideration for ensuring the success of the rover mission. These TARs were mapped over a large area at a map scale of between 1:5000 and 1:10,000, i.e. the HiRISE image was zoomed out by a factor of 3 relative to native resolution, and the mapping aimed for fidelity on a scale of ∼10 m. We compared this manual mapping to the DL classification, over an entire HiRISE image (ESP_046459_2025), to see what differences and similarities there are between the two products.
The comparison is still not entirely like-for-like, since the manually mapped dataset was completed prior to the NOAH-H training and did not specifically use our classification scheme. However, it did break down the bedform patterns into continuous and discontinuous areas. This makes it more comparable to the DL-produced product than other available datasets and yields valuable insights into the pros and cons of the two mapping approaches.

Study area
Mawrth Vallis is 'the landing site that wasn't'; it has nearly been selected as the destination of multiple Mars rover missions. It was a finalist for the ESA Rosalind Franklin rover, and was considered for both NASA's Perseverance and Curiosity missions (Michalski et al., 2010; Poulet et al., 2020). It remains of great scientific interest, despite being a challenging landing site from the perspective of geological context and traversability constraints. The region is shown in Figure 1.
The proposed landing site for the ExoMars mission was located on the planetary dichotomy boundary, where the highlands of Arabia Terra give way to the low, flat terrain of Chryse Planitia. The region of interest is a series of plateaus along the southern side of the Mawrth Vallis channel. These have been heavily eroded, exposing layered rocks which contain a record of several ancient aqueous environments (Loizeau et al., 2012; Michalski et al., 2010). Their context and good state of preservation make them a useful site for investigating the habitability of early Mars (Poulet et al., 2020). The rocks at Mawrth Vallis probably date from the early Noachian to the early Hesperian epochs, and so record periods of relatively warm and wet conditions on early Mars. Consequently, the hydrated silicates in the stratigraphy at Mawrth Vallis represent the environment from a time when Mars was most habitable.

NOAH-H classification
The procedure by which the NOAH-H system was trained to classify martian surfaces has been described in detail in Barrett et al. (2022a). In brief, 1500 small sections of HiRISE images ('framelets') were annotated with expert labels. Terrains were classified on a pixel scale, by drawing over the image in a labelling tool to assign areas of the image to one of the 14 classes of the ontology. Although a vector tool was used to classify blocks of pixels, the classification was raster based. These masks were then used to train and validate the network, which was based on a modified version of the algorithm 'deeplab' (Chen, 2017). Half of the framelets used to train the network were from Mawrth Vallis, with the rest from Oxia Planum. The classification scheme is described in Table 1 and in the legends of the two main map sheets, where type examples are shown. For more detail on how the classification scheme was developed and applied see Barrett et al. (2022a).
Qualitative evaluation of the images was conducted to assess how well they conformed to the terrain seen by experts in the HiRISE data (Barrett et al., 2022a); the model was found to perform well on a landscape level. Unlike our analysis of Jezero Crater, this study does not involve transferring the model to a new site, and so many of the misclassifications seen there are avoided. The data for Mawrth was thus expected to be of a comparable quality to that used for the Oxia Planum map (Barrett et al., 2022b).
The method by which images were georeferenced and mosaicked to produce our final classification products is detailed in Barrett et al. (2022b) and Wright et al. (2022). An identical procedure was conducted here. In brief, the original HiRISE images were loaded into ArcGIS Pro, and georeferenced to a basemap (HRSC image HMC_11E10_co5 (Gwinner et al., 2016)). Several hundred control points were placed on each HiRISE image to tie it precisely to the HRSC data. The image was then orthorectified using a spline transformation. The control points were saved and applied to the classified mask produced by NOAH-H. This was orthorectified using the same transformation as the source HiRISE image. The rectified NOAH-H masks were then mosaicked in ArcGIS, using the procedure outlined in Wright et al. (2022). In regions of overlap the same pixel priorities shown in Table 1 of that paper were used to determine which pixel would be adopted in the final map.
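The priority rule used in regions of overlap can be illustrated with a minimal NumPy sketch. This is not the ArcGIS workflow used in the study: the function name, the NODATA value and the priority ranks below are all hypothetical, and the actual ranks are those given in Table 1 of Wright et al. (2022). The sketch assumes the classified masks have already been rectified onto a common grid.

```python
import numpy as np

NODATA = 0  # assumed label for pixels outside an image footprint


def mosaic_by_priority(masks, priority):
    """Merge co-registered class rasters, keeping the highest-priority class per pixel.

    masks    : list of 2-D integer arrays on the same grid (NODATA where empty)
    priority : dict mapping class id -> rank (higher rank wins); hypothetical values
    """
    stack = np.stack(masks)                 # shape (n_images, rows, cols)
    lut = np.full(max(priority) + 1, -1)    # NODATA and unlisted classes rank lowest
    for cls, rank in priority.items():
        lut[cls] = rank
    best = lut[stack].argmax(axis=0)        # index of the winning image at each pixel
    return np.take_along_axis(stack, best[None], axis=0)[0]


# Two overlapping toy masks; the ranks here are illustrative only.
a = np.array([[1, 0], [2, 2]])
b = np.array([[3, 3], [0, 1]])
mosaic = mosaic_by_priority([a, b], {1: 1, 2: 3, 3: 2})
```

Where only one image covers a pixel, that image's class is kept, because NODATA always ranks below any real class.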

Preparing the comparison product
In order to characterise the suitability of Mawrth Vallis for investigation by rover, an estimate of the coverage of the site by aeolian bedforms was required. These features, primarily TARs (Balme et al., 2008), could impede the progress of a rover since windblown deposits are generally made of loose, unconsolidated material which provides less grip for rover wheels than solid rock or more coherent regolith (e.g. Balme et al., 2018).
TARs were manually digitised using ArcGIS across the central part of the proposed ExoMars landing site on data available in 2016.Areas covered by aeolian bedforms were digitised using a GIS polygon layer.Polygon feature classes were established indicating whether an area exhibited 100% aeolian cover, 50%, or 20%.
• In a '100%' region TARs are essentially continuous, with no gaps in between.
• Areas recorded as '50% coverage' are densely covered by TARs, which must cover in excess of 50% of the region by area. Gaps must be present between some TARs, but are not required between all of them.
• '20%' areas are sparsely covered with an average of 20% coverage. TARs are present, but are rarely contiguous, and the areal extent of aeolian bedforms is substantially less than 50%.
For the purposes of calculating overall aeolian cover, areas can be considered to have at least the coverage with which they are labelled, but no more than the next highest category.
These feature classes were then used to draw around regions of 20% cover, then 50%, and 100% respectively before clipping to preserve the highest digitised density polygon in any given location. This process took approximately 10 working days for an experienced GIS user to cover ∼840 km² (Figure 2).
Notes: The ontology has two levels: (i) descriptive classes and (ii) interpretive groups. The RGB values used to classify these classes and groups throughout this study are given.

Results
Terrain classification by NOAH-H at Mawrth Vallis matches well with manual interpretations (Barrett et al., 2022a). A 500 m diameter crater fills the upper right-hand side of Figure 3. Its rim and the rocky ground that surrounds it are marked 'rugged bedrock' by the network. The crater floor is filled with non-bedrock material and a large patch of TARs, as is the area to the south of the rim. The crater interior and a north-south trending band to the east contain TAR patches, which have been correctly identified as ripple classes. Areas of small and large continuous ripples are well-defined, and these are interleaved with examples of large, isolated ripples. A few patches of bedrock are detected, but these are primarily visible beneath areas of textured non-bedrock material. In insets c and d, NOAH-H has classified TARs in generally similar areas to the human mapper; however, the NOAH-H output segments individual bedforms, while human mapping groups areas of TARs in 100-1000 m scale patches. This means that while the trends are the same, there are many gaps in the classification, where a non-ripple class has correctly been identified in a space which the human mapper included in the discontinuous unit.

Comparison of DL classification to manual mapping
It is important to validate the results of machine learning classification against preexisting human-made products. However, such comparisons are rarely completely like for like. This can limit the applicability of certain traditional metrics such as precision and recall. In the following section we provide details of the comparison between our two datasets, and highlight some of the important considerations when conducting analyses of this sort.
HiRISE image ESP_046459_2025 was selected to compare NOAH-H and manual classifications. It exhibits a variety of aeolian bedforms and is central to both human and NOAH-H classified areas. All analysis was conducted using ArcGIS Pro software. The manually digitised TAR patterns were converted to a raster dataset with a resolution of 10 m/pixel as this best reflected the fidelity of the manual mapping. 'Snap raster' was used to align the pixels of this new raster to those of the NOAH-H mosaic, allowing for a direct comparison of the two. The NOAH-H mosaic was then down-sampled to the same resolution, using a 'majority' approach so that each new 10 m pixel represented the mode of the 1,600 25 cm pixels which composed it. Both datasets were clipped to the extent of the HiRISE image. Down-sampling smooths out some of the pixel-scale 'noise' in the NOAH-H data, which does not reflect real, landscape-scale, variations in surface texture but would affect the comparison. The manually mapped TARs consist of three classes: 100% coverage, 50% coverage, and 20% coverage. The 100% class is considered to correspond to NOAH-H classes 8, 10 and 11, and these are termed 'Continuous TARs'. The distinction between the 20% and 50% classes is not reflected in the NOAH-H classification system, so these two classes were aggregated into a single 'Discontinuous TARs' class. This corresponds to NOAH-H classes 9, 12, and 13.
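The 'majority' down-sampling step can be sketched outside ArcGIS as a block-wise mode. The snippet below is an illustrative NumPy version, not the tool used in the study; the function name and the toy array are assumptions, and ties are broken in favour of the lowest class id, which may differ from the ArcGIS behaviour.

```python
import numpy as np


def majority_downsample(labels, factor, n_classes):
    """Replace each factor x factor block of class labels with its most common class.

    labels    : 2-D integer array of class ids in the range [0, n_classes)
    factor    : number of fine pixels along one side of a coarse pixel
    n_classes : number of distinct class ids (ties go to the lowest id)
    """
    rows, cols = labels.shape
    assert rows % factor == 0 and cols % factor == 0, "trim the raster to whole blocks first"
    # Rearrange so that the last axis holds all fine pixels of one coarse pixel.
    blocks = labels.reshape(rows // factor, factor, cols // factor, factor)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(rows // factor, cols // factor, -1)
    # Count the votes for every class, then keep the class with the most votes.
    counts = np.stack([(blocks == c).sum(axis=-1) for c in range(n_classes)], axis=-1)
    return counts.argmax(axis=-1)


# Going from 25 cm to 10 m pixels gives a factor of 40, i.e. 40 * 40 = 1600 votes per pixel.
pixels_per_block = int(10 / 0.25) ** 2

toy = np.array([[1, 1, 2, 2],
                [1, 0, 2, 3],
                [0, 0, 3, 3],
                [0, 1, 3, 2]])
coarse = majority_downsample(toy, factor=2, n_classes=4)
```

For a full HiRISE mosaic the per-class counting would need to be done in tiles to limit memory use; the sketch ignores this.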
Using the 'reclassify' tool, the two datasets were converted into Boolean masks showing the presence or absence of a class. Three masks were produced for each dataset: all TARs, Continuous TARs, and Discontinuous TARs. Difference plots were computed using the 'compute change raster' tool, with the manual observations set as the 'from raster', and the DL classification as the 'to raster'. This allows us to differentiate between type I and type II errors.

\[ \text{Difference} = (\text{to raster}) - (\text{from raster}) \tag{1} \]

Therefore:
• False positives (type I errors) are objects which are found in the classification, but not in the manual map. These appear as 'positive changes' or +1 in the difference map.
• False negatives (type II errors) are objects which are present in the manual map, but not in the classification. These appear as 'negative changes' or −1 in the difference map.
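As an illustration of the reclassification and differencing steps, the following NumPy sketch builds a Boolean mask from a class raster and applies Equation (1). The class groupings follow the text; the function names and the toy arrays are assumptions, and the study itself used the ArcGIS Pro tools named above.

```python
import numpy as np

CONTINUOUS = (8, 10, 11)     # NOAH-H classes matched to the manual 100% class
DISCONTINUOUS = (9, 12, 13)  # NOAH-H classes matched to the manual 20% and 50% classes


def boolean_mask(labels, classes):
    """Return 1 where the class raster belongs to any of `classes`, else 0."""
    return np.isin(labels, classes).astype(np.int8)


def difference(manual_mask, dl_mask):
    """Equation (1): 'to raster' (DL) minus 'from raster' (manual).

    +1 marks a false positive, -1 a false negative, 0 agreement.
    """
    return dl_mask.astype(np.int8) - manual_mask.astype(np.int8)


manual = np.array([1, 1, 0, 0], dtype=np.int8)   # toy manual continuous-TAR mask
noah_h = np.array([8, 1, 10, 2])                 # toy NOAH-H class ids
diff = difference(manual, boolean_mask(noah_h, CONTINUOUS))
```

Note that a value of 0 covers both true positives and true negatives, so the masks themselves are still needed to separate the two.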
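The graphical results are shown in Figure 6. Confusion matrices (Table 2) summarise the number of true positives, false positives, and false negatives. Three derived metrics can also be calculated from the true positive (TP), false positive (FP) and false negative (FN) pixel counts:

\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{IoU} = \frac{TP}{TP + FP + FN} \]

Thus, precision is a measure of how many of the DL predictions are correct, recall of how many of the 'ground truth' features were detected, and IoU is a summary of the two.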
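For readers wishing to reproduce these metrics from a pair of Boolean masks, a minimal sketch follows. It uses the standard definitions of precision, recall and IoU; the function name and the toy masks are assumptions, and the manual map is treated as 'ground truth' as in the text.

```python
import numpy as np


def tar_metrics(manual_mask, dl_mask):
    """Precision, recall and IoU of a DL mask scored against a manual mask."""
    manual = np.asarray(manual_mask, dtype=bool)
    dl = np.asarray(dl_mask, dtype=bool)
    tp = int(np.sum(dl & manual))    # predicted and mapped
    fp = int(np.sum(dl & ~manual))   # predicted but not mapped (type I)
    fn = int(np.sum(~dl & manual))   # mapped but not predicted (type II)
    nan = float("nan")
    return {
        "precision": tp / (tp + fp) if tp + fp else nan,
        "recall": tp / (tp + fn) if tp + fn else nan,
        "iou": tp / (tp + fp + fn) if tp + fp + fn else nan,
    }


# Toy example: 2 true positives, 1 false positive, 1 false negative.
scores = tar_metrics([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

True negatives do not enter any of the three metrics, which is why they remain informative even when the background class dominates the scene.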
Table 2 shows that the model does not reliably replicate the human map. Precision and recall are high for the background class but vary greatly for the TAR class. Continuous TARs show high recall, but low precision, while the reverse is true for discontinuous features. When all TARs are considered, both metrics are middling. When both classes are considered, most of the confusion is between each class and the background, rather than between continuous and discontinuous.
If this test is taken as a blind measure of fitness for purpose, then this is not a good result. However, using the difference plots, we can clearly pinpoint how and why the two diverge and see that this poor result reflects a systemic difference in classification approach rather than inaccuracy in the model.
In general, the continuous class performs better than the discontinuous. This should be unsurprising, since in the continuous case the network classifies the terrain into large coherent patches. Where discontinuous terrains occur we had to digitise patches of discontinuous cover, rather than individual features, in order to complete the mapping in a reasonable period. The speed at which NOAH-H can complete its task allows it to attempt a higher level of fidelity than is possible for a human. However, this complicates assessing whether the model's actual performance is good enough to warrant these ambitions. In many ways, the human- and AI-produced datasets will never be entirely comparable.
When all TARs are considered, the overall trend appears similar at first glance. However, NOAH-H has captured much more detail than the human mapper. The result is many small-scale variations, which reduce the accuracy of the classification relative to 'ground truth'. In the case of continuous and discontinuous TARs, there are major differences between the two products.
The DL map has captured far more continuous TARs than the human one, whereas the human map captures more discontinuous TARs. In the human map, TARs were only digitised at all where we deemed their spatial density to be >20%. This is a subjective distinction, and leaves isolated TARs unmapped. The DL system, however, can classify every pixel and as such captures many more small and isolated pixel clusters overlooked by the humans. The reverse is true when considering discontinuous TARs. The coherent areas classified as 20% or 50% coverage include many gaps between the bedforms. The DL system has attempted to exclude each of these gaps from its digitisation, classifying them as underlying bedrock or non-bedrock material.
The key questions thus become: are these small-scale variations 'real', making for a reliable product, and are they useful to capture? The human product arguably has less fidelity at very small spatial scales; however, it is much clearer and more intuitive to use. Which is better for a given task will vary depending on the precise science question being addressed. The NOAH-H product, taken on its own, arguably has too much small-scale variability to be useful when considering the whole study area and, as Figure 6 and Table 2 show, this is not entirely smoothed out by down-sampling the data to the same scale as the comparison product.
However, if being used for a small area (e.g. the few 100 m surrounding the location of a putative rover), having this level of granularity in assessment of the aeolian hazard is very useful. Note also that the 'noisy' output from NOAH-H is a function of using a direct pixel-based classification approach. It is possible that less noisy outputs could be produced by altering the way we model the classification problem via the ontology and labelling approach or via suitable post-processing. We will consider this in any future work. What we have learned from this study will allow us to reframe our classification approach to achieve more human-usable outputs in future.
Consequently, we believe that the NOAH-H results are fit for purpose, but that its use should not be unchecked by human input. Rather, it provides a rich and detailed starting point, from which a product more akin to the manually prepared map can be derived. A human user can start from the NOAH-H classification, removing false positives, adding in false negatives, and smoothing out the polygons into more useful, coherent, blocks. We find this process much less time-consuming than starting from scratch, and manually digitising every landform. It becomes much easier to estimate TAR coverage within a coherent bedform unit with the DL classified raster to work from, and this will make the estimation of coverage percentages such as those used in the manual map more consistent, especially in cases where the size of a study area necessitates multiple scientists contributing to the mapping. This is how we intend to use the network, and it is already being employed as a starting point for the bedform classes in ongoing mapping work at Oxia Planum soon to be submitted to this journal.

Ripple density statistics
We computed percentage surface cover by TARs using the method described in Barrett et al. (2022b). The NOAH-H and manual mapping rasters were resampled and reclassified into 1 m/pixel Boolean masks, showing the presence or absence of each bedform class; each NOAH-H class was computed separately (Figure S1), and they were combined into groups of large, small, continuous, and discontinuous TARs (Figure 7). The manual map was grouped into masks for continuous, discontinuous, and all TARs (Figure 8). The 'add surface information' tool was then used to compute ripple density for each square in a 1 km grid, producing a vector output which summarises the density and distribution of aeolian cover for the region. The same procedure was also used on the NOAH-H detections of boulders. Similar to ripples, boulders are a dispersed feature of the landscape, the cover of which has implications for safety of landing and traversing a potential landing site.
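The gridded density calculation amounts to taking the mean of a Boolean mask within each grid square. The sketch below is a raster-only NumPy stand-in for the vector-grid workflow described above, not the 'add surface information' tool itself; the function name and toy values are assumptions, and partial cells at the raster edge are simply discarded.

```python
import numpy as np


def density_per_cell(mask, pixel_size_m, cell_size_m):
    """Percentage of each square grid cell covered by the masked class.

    mask         : 2-D Boolean (or 0/1) raster of presence/absence
    pixel_size_m : raster resolution in metres (1 m in the text)
    cell_size_m  : grid spacing in metres (1000 m in the text)
    """
    n = int(round(cell_size_m / pixel_size_m))   # pixels along one side of a cell
    rows, cols = mask.shape
    rows_c, cols_c = rows // n, cols // n        # whole cells only; edges are trimmed
    trimmed = mask[: rows_c * n, : cols_c * n].astype(float)
    return trimmed.reshape(rows_c, n, cols_c, n).mean(axis=(1, 3)) * 100.0


# Toy 4 m x 4 m mask summarised on a 2 m grid.
toy_mask = np.array([[1, 1, 0, 0],
                     [1, 0, 0, 0],
                     [0, 0, 1, 1],
                     [0, 0, 1, 1]])
density = density_per_cell(toy_mask, pixel_size_m=1, cell_size_m=2)
```

With 1 m pixels and a 1 km grid, each cell value is the mean of one million pixels, so pixel-scale noise has little effect on the gridded product.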
In the manually mapped area, 100% TAR coverage is found over a total area of 5.2 × 10⁶ m². Discontinuous TARs cover a much larger area, of 2.4 × 10⁷ m² for 50% coverage, and 1.5 × 10⁸ m² for 20% coverage, respectively. When these areas are adjusted for TAR coverage, manually mapped TARs are found to cover a total area of 4.8 × 10⁷ m². When the NOAH-H classified area is clipped to the same extent we find a total of 7.8 × 10⁷ m² of continuous TARs, and 5.2 × 10⁷ m² of discontinuous features. If discontinuous features are weighted at 20% this gives a total of 8.9 × 10⁷ m². This means that the DL approach produces a percentage cover of 10.6%, while the human mapping yields 5.8% coverage. Despite larger blocks being digitised, and more background material being included, the human map still provides a lower estimate of aeolian cover, due to the vast number of small areas which NOAH-H digitises, which were not significant enough to be included in the human map.
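The coverage adjustment can be checked with a short calculation. The sketch below assumes that each manual class is weighted by its nominal coverage (1.0, 0.5 and 0.2) and uses the rounded areas quoted in the text, so it reproduces the reported totals only approximately (about 4.7 × 10⁷ m² and 8.8 × 10⁷ m², against the reported 4.8 × 10⁷ m² and 8.9 × 10⁷ m²); the small differences may reflect rounding of the quoted inputs. The variable names are illustrative.

```python
# Areas in m^2 as quoted in the text (rounded), keyed by assumed coverage weight.
manual_area_m2 = {1.0: 5.2e6, 0.5: 2.4e7, 0.2: 1.5e8}

# Weighted manual total: each mapped area scaled by its nominal TAR coverage.
manual_total_m2 = sum(weight * area for weight, area in manual_area_m2.items())

# NOAH-H total: continuous TARs counted in full, discontinuous weighted at 20%.
dl_continuous_m2 = 7.8e7
dl_discontinuous_m2 = 5.2e7
dl_total_m2 = dl_continuous_m2 + 0.2 * dl_discontinuous_m2

print(f"manual: {manual_total_m2:.3g} m^2, NOAH-H: {dl_total_m2:.3g} m^2")
```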
The first thing to note when comparing the ripple density statistics derived from the two methods is that it has been possible to classify a much larger area using the machine learning approach. While the distribution of 'All TARs' is similar, this is not the case for the continuous and discontinuous datasets. Relatively few continuous TARs were mapped in the human dataset, and they occur in small patches which do not result in significant coverage at the 1 km grid scale. The reverse is true of the discontinuous features, which were somewhat scarce in the DL-derived dataset, but are found much more readily in the human map.
As shown in section 5.1 the differences between the two classifications are on the metre to decimetre scale, and so aggregating them into 1 km bins can suppress or enhance minor differences in the distributions. The distributions are similar, albeit with very different absolute density.

Conclusion
This study has produced DL-derived maps of Mawrth Vallis. We demonstrated how these were produced, and how ripple density statistics can be calculated using them. We compared them to a manual TAR map, and found that although the two products appear broadly similar, there are major pixel-scale differences between them. These are largely due to how the NOAH-H network classifies images on a pixel-by-pixel basis, producing 'noisy' classifications, where it attempts to reproduce every metre-scale variation. These small-scale variations are interesting, and upon inspection, seem fairly reliable, but are not useful when classifying the terrain on a landscape level. The model does not produce a one-to-one match to the human data. However, it shouldn't be expected to, as the human mapping aggregated information at the ∼10 m scale while summarising this information into large areas. What NOAH-H does do, however, is capture the same landscape-level trends, and this would provide a valuable first step for producing a simplified, human-verified product. In essence, with the NOAH-H classification as a starting point, it would be much easier and faster to achieve this high level of fidelity in a human-produced map than starting from scratch. This is how the NOAH-H tool is being used henceforth. The TAR classes are already being incorporated into manually produced mapping efforts, where, with some manual verification and modification, they can form a valuable component of regional scale mapping efforts.

Software
We used ESRI ArcGIS Pro 2.8 for co-registration of HiRISE and NOAH-H rasters, NOAH-H pixel prioritisation, and generation of the NOAH-H terrain classification map. NOAH-H is a deep learning semantic segmentation software developed by SciSys Ltd (now part of CGI) for the European Space Agency to aid preparation for the ExoMars rover mission.

Figure 1. Map of the Mawrth Vallis study area, showing the region on the southern bank of the valley where classification was conducted. The area classified using the machine learning approach is outlined in black, while the area mapped by a human is outlined in red. Basemaps: HRSC HMC_11E10_co5 (Gwinner et al., 2016), one of the instruments which first detected clays at Mawrth; Mars Global MOLA digital elevation model (Smith et al., 2001).

Figure 2. Manual map of ripple distribution across the central Mawrth area. Patches of bedforms are digitised. Patches with 100% TAR coverage are shown in dark red, while 50% coverage is shown in mid-toned orange, and 20% coverage in pale yellow. Areas with discontinuous coverage (20% and 50%) are much more common than areas with 100% cover. Patches with 100% cover are only seen in a few places, such as the interiors of valleys and craters. 20% coverage areas are more common, but are still not evenly distributed across the area, being generally denser to the northwest of the region. The edges of some HiRISE images correspond to changes in TAR density. This is likely due to the illumination conditions in those images affecting how well the TARs could be resolved. We recommend downloading the online version of this figure, to better see the small details.

Figure 3. Example of a classified area of HiRISE ESP_046459_2025_RED, classified into descriptive classes: (a) the original HiRISE, (b) the classified mask produced by the NOAH-H network, (c) manually mapped TAR density, (d) NOAH-H TAR classes, symbolised in the same way as the manual map. The manual map consisted of three density classes, represented by red, orange and yellow for 100%, 50%, and 20% respectively. The DL network only predicts 100% and 20% classes, so only red and yellow appear in inset d.

True Positive (TP) = Number of pixels correctly classified, False Positive (FP) = Number of pixels incorrectly classified, and False Negative (FN) = Number of pixels incorrectly not classified.

Figure 4. Main map of Mawrth Vallis, showing descriptive classes. GIS-ready files are available as supporting material and we encourage readers to download these in order to view the product at full resolution and symbolise it as needed to better highlight the more subtle variations.

Figure 5. Main map of Mawrth Vallis showing interpretive groups. GIS-ready files are available as supporting material and we encourage readers to download these in order to view the product at full resolution and symbolise it as needed to better highlight the more subtle variations.

Figure 6. Map showing comparison of TAR features to the manual mapping. (a) Original HiRISE image ESP_046459_2025. (b) All TARs; manually mapped distribution (yellow), (c) All TARs; deep learning prediction (green), (d) All TARs; difference map (false negatives blue, false positives red). (e) Continuous TARs; manually mapped distribution (yellow), (f) Continuous TARs; deep learning prediction (green), (g) Continuous TARs; difference map (false negatives blue, false positives red). (h) Discontinuous TARs; manually mapped distribution (yellow), (i) Discontinuous TARs; deep learning prediction (green), (j) Discontinuous TARs; difference map (false negatives blue, false positives red). This figure contains very small details; we recommend consulting the online version of the figure in order to examine it at full resolution.

Figure 7. Ripple and boulder density statistics derived from the machine learning classification. Coloured grids are overlain on a slope map derived from a CTX stereo mosaic. (a) All TAR classes, (b) all boulders, (c) all large TARs (classes 8-10), (d) all small TARs (classes 11-13), (e) all continuous TARs (classes 8-11) and (f) all discontinuous TARs (classes 12-13). Note that for this analysis class 9 has been considered a 'continuous' class rather than a discrete one as in the comparison to the manual map. While large, isolated TARs are indeed discontinuous, for the purposes of calculating ripple statistics every pixel classified as class 9 is covered by aeolian material; they thus count towards the total budget of aeolian material at the site, rather than the area covered by dispersed features, where each pixel is only ∼20% covered by aeolian material.

Figure 8. Ripple density statistics derived from the manual mapping. Coloured grids are overlain on a slope map derived from a CTX stereo mosaic. (a) All TARs, (b) Continuous TARs, (c) Discontinuous TARs.

Table 2. Confusion matrices showing the precision, recall and IoU metrics for each comparison pair.