A hybrid system of data-driven approaches for simulating residential energy demand profiles

This paper presents a novel system of data-driven approaches for simulating the dynamics of electricity demand profiles. Demand profiles of individual dwellings are decomposed into deterministic (e.g. ‘Trend’ and ‘Seasonal’) and stochastic (‘Remainder’) components using the STL (Seasonal-Trend decomposition procedure based on Loess) approach. Stochastic components are modelled using a Hidden Markov Model (HMM) and combined with deterministic components to generate synthetic demand profiles. To simulate extreme (peak) demand, the synthetic profiles are post-processed using a Generalised Pareto (GP) distribution and a percentile-based bias-correction scheme. All the techniques are systematically coupled into a hybrid system, referred to as ‘STL_HMM_GP’. The STL_HMM_GP system is thoroughly assessed and validated by comparing a range of statistical characteristics of observed and simulated profiles for three case-study communities. The potential of the STL_HMM_GP system is demonstrated for simulating aggregated demand profiles, generated using an accessible small sample of observed individual demand profiles.


Introduction
Energy is one of the key areas that could contribute significantly towards the global challenges of reducing greenhouse gas emissions and tackling climate change. In the broad context of the whole energy system, energy demand management plays a pivotal role, specifically contributing to the development of sustainable plans, policies and strategies for optimised allocation of resources and infrastructure design. This paper aims to investigate whether a small accessible sample of high-resolution electricity demand profiles (available at individual building level) can be synthetically simulated to replicate their statistical characteristics, so that independent and identically distributed synthetic profiles can then be statistically organised (e.g. using a k-means approach) and aggregated to project the dynamics of community-level energy demand series. This is a challenging task because: (a) multiple energy-use processes can occur concurrently in the household and, when superimposed, they can display highly complex dynamics; (b) a single modelling technique (due to its intrinsic structural design) can only capture certain dynamical features of the process and could fail to apprehend other important characteristics; for example, most data-centric modelling techniques often fail to simulate the dynamics of extreme/rare events.
Extreme events such as peak load (over the 95th percentile) are of utmost importance, as these events, when occurring simultaneously at a specific time of day and across a large number of dwellings, could result in overall system failure. Examples include the use of air conditioning systems on a typically hot day; some interesting behavioural aspects (specifically in the context of the UK), like thousands of households across the nation switching on a kettle during the break of a popular national television programme; or, in future, a large volume of households across the community/nation charging Electric Vehicles (EVs) at a certain period of the day. Other well-known challenges concern the accessibility of a suitable sample at a preferred: (i) fine temporal scale (required to account for seasonal variations of various periodically repeating processes); and (ii) wide spatial range (required to capture diversity in demand due to varied household characteristics, e.g. occupancy, lifestyle, energy use behaviour, variability in electrical appliances, floor area, space heating/cooling technology, etc.).
To address some of these long-standing challenges, this paper presents a novel system of data-driven approaches (STL_HMM_GP) that couples a selection of statistical/computational techniques within a single framework for effectively pre-processing, modelling and post-processing 'high-resolution' electricity demand profiles to generate synthetic energy demand series (both at the individual and aggregated levels). The STL_HMM_GP model is suitable for simulating time series at a range of temporal resolutions, given that the original series is available at a sufficient length to capture the underlying statistical dynamics of the series. This paper demonstrates the application of the STL_HMM_GP model in generating synthetic electricity demand series at three different temporal resolutions (5, 15 and 30 min), selected from three different data portfolios, respectively. The observed electricity demand series available in all three data portfolios are four weeks in length. The model is shown to perform equally efficiently across all the case studies in simulating the dynamics of both individual and aggregated demand profiles using a small sample of individual dwellings (i.e. 10% and 20% of the total number of dwellings in the case-study community). The methodological framework of the STL_HMM_GP modelling schematic is presented in Figure 1 and described in Section 3.

Background
Energy demand modelling is a widely studied area involving demand modelling at the individual dwelling and/or at the community/national level. To complement physical-based energy modelling approaches, a range of data-driven statistical and/or computational techniques have been developed recently. Unlike physically-based models, which can be calibrated using building-specific physical parameters and rely on expert knowledge, statistical/computational models are purely based on historically observed records. A thorough review of some widely applied data-centric approaches (e.g. time series models, regression models, decomposition models, ARIMA-type models, hybrid ANN models, including support vector regression, fuzzy logic models, genetic algorithms, etc.), and of conventional physical-based energy system models (e.g. MARKAL/TIMES/LEAP), can be found elsewhere (Suganthi and Samuel 2012; Timilsina 2009, 2010). A detailed literature overview of HMM models, along with the potential of the GP distribution, can be found in some preliminary work by the authors (Patidar et al. 2018; Patidar, Jenkins, and Simpson 2016). This section provides a brief literature review of (i) some widely applied community-scale energy demand modelling techniques, (ii) the application of time series modelling techniques in the area of energy demand modelling, and (iii) whether time series decomposition approaches can be applied to improve overall model efficiency.

Community-level energy demand modelling techniques
Physical-based models utilise the physical characteristics of buildings (such as construction type, technology variants, occupancy details, etc.) to investigate their energy performance. Empirical calibration of these physical models can be resource-intensive and requires detailed information on several building-specific parameters. Whilst purely physical models can provide a very approximate estimation of building energy use, which can also (through, for example, stock modelling (Hughes et al. 2016; Cambridge Housing Model and user guide 2010)) be further upscaled to larger geographical regions, they are not suitable for estimating high-resolution demand characteristics of buildings (such as the load factor and the magnitude/timing of peak demands).
In the broader context, energy system models such as MARKAL (MARKet ALlocation (Taylor et al. 2014)) and UKTM (the UK TIMES Model) (Broad and Dodds n.d.; Daly, Dodds, and Fais 2014) have traditionally been used in the UK, and across more than 70 countries, by researchers and policymakers to perform large-scale community-level energy demand modelling. These models are usually applied to understand the long-term (usually over decades) time evolution of the future energy system and its sensitivity to new and existing technologies/resources. Specifically, with application to energy demand evolution in the domestic sector, the highRES model (Price and Zeyringer n.d.), utilising existing demand data (at hourly resolution), facilitates greater spatio-temporal detail for exploring potential impacts of future energy scenarios on the sustainability of whole energy systems. However, the highRES model only processes current demand data, making issues such as future electrification of heat/transport, demand management/response and building-integrated storage more difficult to quantify. Recently, the CREST demand model (McKenna and Thomson 2016), utilising occupancy/daily activity patterns within a bottom-up modelling approach, has been shown to simulate a high-resolution (one-minute) demand dataset. Other interesting work in this area demonstrated a Stochastic Residential Occupant Behaviour (StROBe) model (Baetens and Saelens 2016) for district-level energy simulation. While addressing uncertainty issues, the StROBe model also accounts for detailed information collected on a range of occupancy/activity factors such as internal heat gains, thermostat settings and hot water usage.
Individual electricity demand profiles at high resolution (e.g. 5 min and below) can be statistically analysed to communicate energy behaviour characteristics that would not be discernible through purely physical modelling. In this context, the availability of an efficient data-driven modelling schematic that can account for the instantaneous high-demand activities occurring at a specific time across a large number of buildings, and the impact such activities can have on the community-level demand curve, is of particular importance. The STL_HMM_GP model proposed in this paper is intended to provide a purely data-driven modelling system for generating aggregated community-level energy use to support the operation of a robust and reliable energy system. The next subsection gives a brief literature review of the data-driven approaches related to the development of the STL_HMM_GP model.

Time series decomposition techniques
The complex dynamics and patterns manifested by an electricity demand series emerge as a consequence of complex interactions and influences imparted by a range of socio-economic and environmental factors. However, the degree of influence can vary significantly across these multiple factors: some factors may be realised in the form of a constant element that varies gradually over the long term (referred to as the 'Trend'), whereas others may emerge in the form of several overlapping repetitive seasonal, monthly, daily or hourly patterns (referred to as 'Seasonal'). These complex intrinsic features embedded in a time series can be explored and simplified in the form of determinism and stochasticity through the application of time series decomposition methods (Rios and De Mello 2012). Some of the most commonly known mathematical approaches include the Fourier Transform (FT) (Stoica and Moses 2004), the Wavelet Transform (WT) (Qian 2002), and Empirical Mode Decomposition (EMD) (Huang et al. 1998; Rios and De Mello 2016).
The FT transforms a time series into the frequency domain, comprising multiple constituent frequencies in the form of waves. These waves can be defined using basis functions (consisting of sine and cosine functions) along with a series of spectral information. While the amplitude and phase of the waves can be processed to segregate stochastic and deterministic components, the inverse FT can reconstruct the original series from the components retrieved in the frequency-domain representation. Thus, the FT is a powerful technique but is limited in application to linear and strictly stationary time series. To address some of these limitations, the WT approach is widely applied to obtain mathematical functions based on ordered coefficients and wavelets for decomposing a non-stationary time series at different scales and resolutions. Unlike the FT approach, the WT can extract waves that are not necessarily sinusoidal in form and can inform the time component associated with the different constituent frequencies in the signal. More recently, the EMD approach has been proposed for decomposing natural signals, which are often nonlinear and non-stationary. The EMD approach decomposes a time series into a set of components, referred to as Intrinsic Mode Functions (IMFs), all of which are in the time domain, like the original series. These components can therefore be analysed in terms of instantaneous frequencies and amplitudes to understand multiple events/causes occurring at specific time intervals.
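To make the FT round-trip concrete, the following minimal Python sketch (an illustration, not part of the original study; the signal, power threshold and frequencies are assumed) separates a deterministic sinusoid from noise using np.fft.rfft and reconstructs the series with the inverse transform:

```python
import numpy as np

# Illustrative sketch: decompose a signal in the frequency domain, keep only
# high-power frequencies as the "deterministic" part, and reconstruct with
# the inverse FT. The 64-sample period and 10% threshold are assumptions.
rng = np.random.default_rng(0)
t = np.arange(1024)
signal = 2.0 * np.sin(2 * np.pi * t / 64) + 0.5 * rng.standard_normal(t.size)

spectrum = np.fft.rfft(signal)
power = np.abs(spectrum) ** 2

# Keep only the strongest frequencies as the deterministic component.
mask = power > 0.1 * power.max()
deterministic = np.fft.irfft(spectrum * mask, n=t.size)
stochastic = signal - deterministic

# The inverse FT reconstructs the original series exactly when no
# frequencies are removed.
roundtrip = np.fft.irfft(spectrum, n=t.size)
```

Note that this clean separation relies on the series being stationary with a fixed periodicity, which is precisely the limitation of the FT discussed above.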
In addition to these widely applied mathematical approaches, a range of statistical methods have been developed for de-constituting time series into a set of non-observable (latent) components (Dagum 1980). Time series are often considered to be composed of four components: (i) long-term trends; (ii) cyclic trends (often considered as long-term trends); (iii) seasonal movements; and (iv) residual/random variations (Persons 1919). These four components are assumed to be mutually independent in an additive decomposition model (Equation (1)); if there is dependence among the latent components, it can be defined through a multiplicative decomposition model (Equation (2)):

X_t = T_t + C_t + S_t + R_t (1)

X_t = T_t × C_t × S_t × R_t (2)

where X_t is the observed time series, T_t is the long-term trend, C_t is the cyclic trend, S_t is the seasonality and R_t is the irregular/random component. One of the most widely applied approaches for extracting these latent components from a time series is Classical Decomposition (CD). The CD approach utilises a moving average procedure to extract trend cycles, followed by the extraction of various seasonal indices (a constant seasonal factor corresponding to a specified period). The CD approach has several limitations, and a range of subsequent versions was therefore proposed. These include the X-11 seasonal adjustment (Shiskin, Young, and Musgrave 1967; Ladiray and Quenneville 2001), followed by X-11-ARIMA (Dagum 1980; Dagum 1999), X-12-ARIMA (Findley et al. 1998) and the Seasonal Extraction in ARIMA Time Series (SEATS) approach (Dagum and Bianconcini 2016). The SEATS decomposition approach is mainly suitable for monthly and quarterly series only. A detailed overview of these various CD-based approaches can be found elsewhere (Dagum and Bianconcini 2016).
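The classical decomposition procedure (moving-average trend extraction followed by seasonal indices) can be sketched as follows. This is a minimal Python illustration, assuming hourly data with a daily period of m = 24; it is not the implementation used in any of the cited studies:

```python
import numpy as np

# Minimal sketch of classical additive decomposition: a centred moving
# average extracts the trend-cycle, seasonal indices are the averages of
# the detrended values at each position in the period, and the remainder
# is whatever is left over. Boundary effects are ignored for simplicity.
def classical_decompose(x, m):
    # Centred moving average of window m (for even m, average two windows).
    kernel = np.ones(m) / m
    if m % 2 == 0:
        kernel = np.convolve(kernel, [0.5, 0.5])
    trend = np.convolve(x, kernel, mode="same")
    detrended = x - trend
    # Seasonal index: mean of detrended values at each phase, re-centred.
    seasonal_idx = np.array([detrended[i::m].mean() for i in range(m)])
    seasonal_idx -= seasonal_idx.mean()
    seasonal = np.tile(seasonal_idx, len(x) // m + 1)[: len(x)]
    remainder = x - trend - seasonal
    return trend, seasonal, remainder

rng = np.random.default_rng(1)
t = np.arange(24 * 28)                      # four weeks of hourly samples
x = 0.01 * t + np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)
trend, seasonal, remainder = classical_decompose(x, 24)
```

By construction the three components sum exactly back to the series, mirroring the additive model of Equation (1).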
Recently, a robust STL (Seasonal-Trend decomposition procedure based on Loess) technique, which has several advantages over the previously discussed CD-based methods, has been proposed. The STL approach can handle multiple forms of seasonality (such as monthly, weekly, daily, hourly or even minutely), and also allows these various seasonal components to vary over time. Moreover, the STL approach is robust to outliers and provides a considerable amount of flexibility in the selection of smoothing parameters. Considering these various advantages of the STL approach, the present paper investigates the efficiency of STL-based time series decomposition in developing the hybrid STL_HMM_GP system.

Application of time-series decomposition approach in energy demand modelling
A few studies have recently investigated the potential of time-series decomposition techniques for optimising model performance; some related examples are briefly discussed here. In (Imanishi et al. 2017), a time-series decomposition approach is applied to power demand data collected from multiple houses in Japan for a month at a temporal resolution of 30 min. This study investigated the relationships between decomposed components and customer information to evaluate whether the approach is meaningful for feature extraction. In another study, smart meter data collected from six commercial buildings in the US at a temporal resolution of 15 min were analysed using the CD approach. These datasets were evaluated along with other details, such as HVAC schedules, daily operational variations, temperature and solar radiation data (Pickering et al. 2018). The CD was applied to understand patterns in building operation routines and building characteristics, achieving a considerable amount of building energy efficiency (over approximately 700 MWh for the six commercial case studies). These savings were achieved simply by rescheduling certain activities occurring in the buildings. Other examples include the application of an exponential smoothing approach for seasonal decomposition of electricity consumption data (Ahmad 2017), and in (Chujai, Kerdprasop, and Kerdprasop 2013) the CD method applied to one-minute electricity power consumption data (collected throughout 2006-2010) is shown to improve forecasting accuracy with ARIMA and ARMA-type models.

The theoretical framework of the STL_HMM_GP system
The methodological framework of the hybrid STL_HMM_GP system is developed in four stages; a flow-diagram is shown in Figure 1. The colour scheme used in the flow-diagram distinguishes the discrete model development/processing stages: boxes in Orange, Blue, Green and Yellow are used for Stages 1, 2, 3 and 4, respectively. Grey boxes highlight steps taken for data transformation and inverse-transformation procedures, whereas purple oval shapes mark the start and end points of the process. A brief overview of the four key stages of the model development is given below and demonstrated using a randomly selected dwelling from Portfolio 1 of the Findhorn dataset (details of the case study are presented later in Section 4).

Stage I time series decomposition (STL)
To improve the overall modelling accuracy, the STL-based time-series decomposition technique is applied as a data pre-processing/filtering procedure. As the STL procedure is mainly suitable for additive decomposition, observed energy demand series (which could be multiplicative) are transformed into additive series by performing a log-transformation. The log-transformation often stabilises the variance in non-stationary series, de-emphasises the impact of outliers and helps restore symmetry in the data towards a Gaussian distribution (Metcalf and Casey 2016). The STL procedure consists of a sequence of Loess (locally weighted regression) (Cleveland and Grosse 1990; Cleveland, Devlin, and Grosse 1988) smoothing operations and is associated with a computer implementation developed in R using the function 'stl'.
The application of the STL procedure in de-constituting one week of observed energy demand data into deterministic ('Trend' and 'Seasonality') and stochastic ('Random') components is illustrated in Figure 2 (Left panel) for a randomly selected dwelling from Portfolio 1 of the Findhorn dataset. The patterns of the individual components, specifically the seasonal component, reflect various regular and irregular activities occurring in the household. To further explore these regular activities, a wavelet power spectrum of the seasonal component over the entire duration of the available dataset (i.e. six weeks, ∼42 days) is shown in Figure 2 (Right panel, top). A plot of the time-averaged wavelet power, which is useful for assessing the overall strength of the various significant periodic activities, is shown in Figure 2 (Right panel, bottom). The analytic wavelet transformation of the seasonal component is performed using 'WaveletComp' (R package) (Roesch and Schmidbauer 2018). Theoretical and mathematical details of the wavelet power spectrum analysis underpinning the 'WaveletComp' package can be found in Appendix 1 of the paper. Analogous plots demonstrating the STL decomposition procedure and wavelet spectrum analysis for randomly selected dwellings from Portfolios 2 and 3 are provided as supplementary material (Figures S1 and S2, respectively). Please note, details of the Portfolio 2 and 3 datasets are presented later in Section 4. In Figure 2 (Right panel, top), each unit on the y-axis represents a period of 5 min. The plot extracts several distinct periodic activities. It is interesting to identify some very strong periodic activities occurring at periods of 2.5 min (0.5 units on the y-axis) and 5 min (1 unit on the y-axis). These activities can be noted as a band of solid red areas across the entire temporal axis.
Relatively weak periodic activities at periods of 10 min (2 units on the y-axis), 20 min (4 units on the y-axis) and 40 min (8 units on the y-axis) are also noted as a band of yellow-green areas across the entire temporal axis. In addition, a clear structure/pattern of periodic activities (displaying a weekday/weekend type of phenomenon) can be noted in the region of periods 5-10 min (1-2 units on the y-axis) and also slightly below the 5-min period (1 unit on the y-axis). White contour lines in the graph outline the areas of high significance.
The time-averaged power spectrum plot (Right panel, bottom) provides the average power of these various periodic signals (at a significance level of 0.05) over the entire time period. For each of the periodic activities noted above, the average power of the activity can be assessed in this plot. For example, the most significant periodic activities correspond to periods of 2.5 and 5 min, with average wavelet power exceeding 80. The wavelet decomposition of the seasonal component could be further analysed to identify highly significant periodic activities, which could be processed with contextual information to extract useful associations (e.g. times of use of the various electrical appliances available in the household) and to reconstruct separate temporal signals for each of the identified appliances/activities. However, considering the focus of the paper, these have not been analysed further at this stage.
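A simple FFT periodogram can surface the same kind of dominant periodicities highlighted by the wavelet analysis. The Python sketch below is purely illustrative (the paper uses the R package 'WaveletComp'; the 10- and 40-min periodic components here are assumed for demonstration only):

```python
import numpy as np

# Illustrative periodogram: one day of a 1-minutely "seasonal" component
# containing assumed 10-min and 40-min periodic activities. The strongest
# peak in the power spectrum identifies the dominant period.
t = np.arange(24 * 60)                       # one day at 1-min resolution
seasonal = np.sin(2 * np.pi * t / 10) + 0.5 * np.sin(2 * np.pi * t / 40)

spectrum = np.abs(np.fft.rfft(seasonal)) ** 2
freqs = np.fft.rfftfreq(t.size, d=1.0)       # cycles per minute

# Skip the zero-frequency (DC) bin when locating the dominant peak.
dominant_period = 1.0 / freqs[np.argmax(spectrum[1:]) + 1]
```

Unlike the wavelet power spectrum, a periodogram averages over the whole record and cannot localise when in time an activity occurs, which is why the wavelet approach is preferred in the paper.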

Stage II Fitting Hidden Markov Model (HMM)
The HMM is core to the structure of the STL_HMM_GP system. Referring to Figure 1, the HMM is applied only to the random component (obtained in Stage 1) to generate n (a user-specified integer) scenarios, representing uncertainty in the system. The trend and seasonal components, which can be attributed to deterministic/physical processes and are derived from the observed series using the STL process (described above), are combined with the n HMM-simulated random components to generate n synthetic series. This allows the synthetic series to have the same deterministic structure as the original series, with variation coming from the HMM-simulated random component. As the random component is modelled using the HMM technique, each simulated random component represents a realistically possible alternative scenario (or the uncertainty of the process). The theoretical framework of the HMM is illustrated in Figure 3. The HMM consists of five modules. Each module (M) is carefully defined to organise a suitable fit for the simulation of the energy demand series. Mathematical details underpinning these modules can be found in Appendix II of the paper.
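As a simplified stand-in for the full five-module HMM (detailed in Appendix II), the Python sketch below fits a first-order Markov chain to a discretised random component and samples a synthetic remainder from it. The binning into K = 8 quantile states, the Laplace smoothing and the Gaussian stand-in for the STL remainder are all assumptions made for illustration:

```python
import numpy as np

# Simplified Markov-chain sketch of Stage II: hidden states and emission
# distributions are collapsed into observed, discretised states here.
rng = np.random.default_rng(3)

def fit_transition_matrix(states, K):
    P = np.ones((K, K))                       # Laplace smoothing
    for a, b in zip(states[:-1], states[1:]):
        P[a, b] += 1
    return P / P.sum(axis=1, keepdims=True)

def sample_chain(P, bin_values, length, rng):
    K = P.shape[0]
    s = rng.integers(K)
    out = np.empty(length)
    for i in range(length):
        out[i] = rng.choice(bin_values[s])    # emit a value seen in state s
        s = rng.choice(K, p=P[s])             # transition to the next state
    return out

remainder = rng.standard_normal(2000)         # stand-in for the STL remainder
K = 8
edges = np.quantile(remainder, np.linspace(0, 1, K + 1))
states = np.clip(np.searchsorted(edges, remainder, side="right") - 1, 0, K - 1)
bin_values = [remainder[states == k] for k in range(K)]

P = fit_transition_matrix(states, K)
synthetic = sample_chain(P, bin_values, len(remainder), rng)
```

Each call to sample_chain yields one alternative "scenario"; repeating it n times and adding back the trend and seasonal components would give the n synthetic series described above.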

Stage III Fitting Generalised Pareto (GP) distribution
As stated earlier, peak demand values, specifically those above the 95th percentile, are of high importance as they constitute visible features in a typical electrical demand profile of a dwelling (e.g. kettles, cooking equipment, heating elements, electric showers, etc.). When aggregated with other dwellings, they contribute towards longer periods of energy use; however, due to the low frequency at which these features occur for an individual dwelling, data-driven models often struggle to simulate these values correctly. To achieve higher accuracy in the simulation of peak demand, the STL_HMM_GP framework fits a Generalised Pareto (GP) distribution to a set of observed peak demand values, specifically in the range of the 95th-99.9th percentiles (e.g. 2.2-4.8, respectively, for an arbitrarily selected dwelling in Portfolio 1 of Findhorn used for the demonstration in Appendix III). Synthetically simulated series (obtained in Stage II) are then post-processed to resample peak demand values from the fitted GP distribution. The maximum-likelihood (ML) estimation method, using the R function gpd.fit ('ismev' package) (Gilleland and Katz 2016; Gilleland, Ribatet, and Stephenson 2013; Stephenson 2018), is applied to estimate the asymptotically optimum parameters of the GP distribution (Grimshaw 1993). Further details on the ML estimation plot and diagnostic plots for assessing the overall quality of the GP fitted at the 97th percentile value can be found in Appendix III.
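A minimal illustration of this stage in Python (the paper's implementation uses R's gpd.fit from the 'ismev' package; scipy.stats.genpareto and the simulated demand series below are stand-ins) fits a GP distribution to exceedances over the 95th percentile and resamples new peak values:

```python
import numpy as np
from scipy.stats import genpareto

# Sketch of Stage III: fit a GP distribution to the exceedances over a
# high threshold and draw replacement peak values from the fit.
rng = np.random.default_rng(4)
demand = rng.gamma(shape=2.0, scale=0.5, size=5000)   # stand-in demand series

u = np.percentile(demand, 95)                 # threshold at 95th percentile
exceedances = demand[demand > u] - u

# ML fit with the location fixed at zero (exceedances start at the threshold).
c, loc, scale = genpareto.fit(exceedances, floc=0)

# Resample new peak values above the threshold.
new_peaks = u + genpareto.rvs(c, loc=0, scale=scale, size=100, random_state=5)
```

In the full framework, the values of the synthetic series above the threshold would then be replaced by draws such as new_peaks, sharpening the simulated upper tail.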

Stage IV Bias-Correction
Recall that the application of the STL decomposition follows a log-transformation of the data, which, with further processing in Stages 2 and 3 (fitting of the HMM and GP models), can produce biased predictions. Some earlier studies discussed this bias in the context of simple regression-based models and presented some general approaches for minimising its influence (Koch and Smillie 1986; Beauchamp and Olson 1973; Sprugel 1983; Newman 1993). This type of bias is mostly ignored by researchers but is often seen to substantially impact the predicted values in specific ranges, i.e. the bias is larger in certain ranges of values than in others. Considering the complexity of the STL_HMM_GP modelling procedure, the authors developed a novel percentile-based bias-correction method (described below): (1) Estimate all percentiles from the 0th to the 100th at unit steps for both the observed and synthetic series, i.e. P_Observed(x) and P_Synthetic(x), respectively, for x ∈ {0, . . . , 100}. (2) Compute the percentile-based difference term, i.e. Difference(x) = P_Observed(x) − P_Synthetic(x), for each percentile x.
(3) Each predicted demand value E(t), depending on the percentile range it falls in, i.e. satisfying P_Synthetic(x) ≤ E(t) < P_Synthetic(x + 1) for some x ∈ {0, . . . , 99}, is bias-corrected using the respective percentile-based difference term, i.e. as E(t) + Difference(x).
The proposed approach is specifically designed to address the size effect of the bias; the overall effectiveness of the approach in optimising model outcomes is demonstrated later in Section 6.
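The percentile-based bias-correction steps above can be sketched as follows (a minimal Python illustration with simulated data; the deliberately biased 'synthetic' series is an assumption made purely for demonstration):

```python
import numpy as np

# Sketch of Stage IV: compute the observed-minus-synthetic difference at
# each integer percentile and shift every synthetic value by the difference
# of the percentile bin it falls in.
def bias_correct(observed, synthetic):
    x = np.arange(101)
    p_obs = np.percentile(observed, x)
    p_syn = np.percentile(synthetic, x)
    diff = p_obs - p_syn                     # Difference(x)
    # Locate, for each synthetic value, the percentile bin it falls in.
    bins = np.clip(np.searchsorted(p_syn, synthetic, side="right") - 1, 0, 100)
    return synthetic + diff[bins]

rng = np.random.default_rng(6)
observed = rng.gamma(2.0, 0.5, size=2000)
synthetic = observed * 0.8 + 0.1             # deliberately biased copy

corrected = bias_correct(observed, synthetic)
```

Because a separate difference term is applied per percentile bin, the correction can be large in some value ranges and small in others, which is exactly the size-effect the scheme is designed to address.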

Case study and data organisation
To develop and investigate the potential of the proposed STL_HMM_GP methodology, two case-study communities were selected: (i) the Findhorn Foundation Eco-Village (FFEV) (Foundation, F 2020); and (ii) the Fintry community (Smith 2018, April 13). The selection of case-study communities was mainly based on the availability of a sufficient electricity demand dataset that simultaneously meets the criteria of being: (i) high-resolution; (ii) continuous for a sufficient length of time to capture temporal and seasonal variability; and (iii) accounting for a reasonable amount of spatial diversity in energy demand patterns across the different dwellings in the community.
The FFEV community is located in the North East of Scotland near the small town of Forres and has approximately 211 buildings. Most of the buildings in the FFEV community utilise modern insulation, triple-glazed windows and environment-friendly recycled materials (e.g. straw bales, whiskey barrels, grass roofing, etc.), and are supported by intelligent design processes for a sustainable lifestyle that maximises passive solar gains. Based on energy system type, buildings in the FFEV community can be grouped into four categories: (i) Air Source Heat Pump (ASHP); (ii) Gas Boiler (GB); (iii) Electric Boiler (EB); and (iv) District Heating using Biomass (DHB). A further description of the variability of construction type across the sample dwellings can be found elsewhere (Patidar et al. 2019). As part of the European FP7-ICT programme, the ORIGIN project (Grant agreement ID: 314742, Nov 2012-Oct 2015) installed smart meters in approximately 55 buildings in the FFEV community and collected high-resolution (5-minutely) electricity demand data for two years.
Fintry is a small village community in Stirlingshire, Central Scotland, and is off the gas grid. Fintry comprises approximately 350 households (c. 700 inhabitants) that mainly utilise electric, oil and LPG-based heating. In 2007, the Fintry Development Trust (FDT, ∼250 members), with the aspiration of transforming Fintry into a carbon-neutral and sustainable community, installed various forms of renewable energy generation (e.g. solar PV, biomass and wind) (Howell 2020). As part of the SMART Fintry project, funded by the Scottish Government Local Energy Challenge Fund (LECF) from May 2016 to April 2018, electricity demand data were collected across 115 participating households in the community. Demand data were collected at a temporal resolution of 30 min for around one year and at a temporal resolution of 15 min for approximately 3 months (Smith 2018, April 13).
Due to various practical real-life situations, it is not always possible to acquire a good-quality continuous dataset. For the research conducted in the present study, the criterion for a good-quality dataset is defined as the availability of a continuous dataset with less than 5% missing values. Following a thorough analysis of the available datasets across both communities, we developed three data portfolios (detailed below) that meet the criterion of good-quality data and have suitable spatio-temporal variability among them.
• Portfolio 1 comprises a sample of 14 dwellings, drawn from the 55 available in FFEV, providing continuous electricity demand profiles at a temporal resolution of 5 min for a maximum period of six weeks, from 15 February to 28 March 2015.
The other two portfolios were defined for the Fintry community, where:
• Portfolio 2 consists of 74 dwellings, providing a 30-minutely electricity demand dataset for July 2017; and
• Portfolio 3 consists of 54 dwellings, providing a 15-minutely electricity demand dataset for November 2017.
Missing data for all three portfolios were infilled using a logical algorithm developed by the authors, detailed elsewhere (Debnath et al. 2020).

K-means clustering for classification and organisation of data portfolio
It is often challenging to model transient community-level energy demand, as this requires systematic analysis and modelling of energy demand data for a large number of buildings in the community. Considering the practical challenges involved in acquiring a good-quality high-resolution electricity demand dataset, the potential of a simple k-means clustering approach as a systematic data sampling technique has been explored. K-means is applied to easily accessible macro-level contextual information (specifically, monthly mean and median demand values) to group dwellings for sample selection and to define appropriate weighting schemes for upscaling samples for aggregation purposes. Thus, the classification of data is an essential step in obtaining an optimum sample for designing an efficient demand aggregation scheme.
In this section, a k-means clustering approach is presented to organise the large volume of data accumulated in the three data portfolios. K-means clustering is a widely applied unsupervised machine learning algorithm that aims to organise a dataset into an optimum number of clusters/groups, based on selected features/characteristics, such that the items within a cluster are coherent with respect to the selected features and are considerably distinct between different clusters. Figure 4 shows the clustering of dwellings across the three portfolios, with the statistical mean and median used as grouping features/characteristics. For each portfolio, the statistical mean and median were estimated for each dwelling over the entire period. An optimum number of clusters in each portfolio was identified using the Elbow method. The Elbow method plots the total within-cluster variation (i.e. the sum of squared errors within the clusters) against the number of clusters. An optimum number of clusters is chosen at the point where the within-cluster sum no longer changes significantly with an increase in the number of clusters. Please note, a coherent colour scheme is adopted across all the graphs and tables presented in the paper:
• Portfolio 1 (Findhorn, 15 February-28 March 2015) datasets are illustrated in shades of blue;
• Portfolio 2 (Fintry, July 2017) datasets are illustrated in shades of green; and
• Portfolio 3 (Fintry, November 2017) datasets are illustrated in shades of orange.
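The clustering and Elbow procedure described above can be sketched as follows. This is a minimal numpy implementation of k-means, written for illustration only (the paper uses 'fviz_cluster' from the R package 'factoextra'); the synthetic (mean, median) features and the four underlying groups are assumptions:

```python
import numpy as np

# Minimal k-means (Lloyd's algorithm) plus the Elbow curve: dwellings are
# represented by (mean, median) demand features; WCSS is computed for
# k = 1..6 and flattens once k reaches the true number of groups.
rng = np.random.default_rng(7)

def kmeans(X, k, rng, n_iter=50):
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centre, then update the centres.
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(axis=0)
    wcss = ((X - centres[labels]) ** 2).sum()
    return labels, centres, wcss

# Synthetic (mean, median) features for dwellings in four distinct groups.
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(20, 2))
               for c in [(0.2, 0.15), (0.5, 0.45), (0.9, 0.8), (1.4, 1.3)]])

wcss_curve = [kmeans(X, k, rng)[2] for k in range(1, 7)]
```

Plotting wcss_curve against k reproduces the Elbow plots of Figure 4's left panel: the curve drops steeply up to the optimum cluster number and is nearly flat beyond it.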
In Figure 4, the top panel presents the clustering analysis for the sample of 14 dwellings in the FFEV community (Portfolio 1). These 14 dwellings were clustered across the mean and the median electricity consumption, estimated for each dwelling from the 5-minutely electricity demand data measured during the period 15 February-28 March 2015. Similarly, the middle panel presents the clustering analysis for the 30-minutely electricity demand data for 74 dwellings in the Fintry community (Portfolio 2, July 2017), and the bottom panel the clustering of the 15-minutely electricity demand dataset for 54 dwellings in the Fintry community (Portfolio 3, November 2017). Corresponding Elbow-method plots for Portfolios 1, 2 and 3 are also presented (left panel). In all cases, the Elbow plot consistently shows no significant change in the within-cluster sum of squares beyond 4 clusters; thus 4 is chosen as the optimum cluster number for all the portfolios. The K-means clustering presented here is applied using the function 'fviz_cluster', available in the R package 'factoextra'. Further details on the theory and application of the k-means procedure, including its implementation in R, can be found elsewhere (Kassambara 2017). Figure 4 illustrates that a clear and coherent clustering is achieved in all three cases, with no overlapping clusters observed; moreover, a considerable separation based on the measured statistics is achieved across all three cases. To further assess the overall performance of the clustering analysis, key cluster performance indicators (listed below) were estimated and are presented in Table 1. These include: (i) the Within Cluster Sum of Squares (WCSS), which measures the squared distances of all points from the centroid within a cluster (the 'withinness' of the clusters); and (ii) the Between Cluster Sum of Squares (BCSS), which measures the separation between the cluster centroids (the 'betweenness' of the clusters).

The indicators are defined as:

WCSS = Σ_{k=1..K} Σ_{i=1..n} z_ik ||x_i − x̄_k||²,

BCSS = Σ_{k=1..K} n_k ||x̄_k − x̄||²,

and (iii) the Total Sum of Squares (TSS) is effectively the sum of WCSS and BCSS, i.e. TSS = WCSS + BCSS, where n is the total number of elements, K is the number of clusters, n_k is the number of elements in the k-th cluster, x̄_k is the mean of the k-th cluster, x̄ is the grand mean, and z_ik = 1 if x_i ∈ cluster k and 0 otherwise. To contextualise the analysis, the total number of dwellings in each cluster, along with the cluster mean and median values, is also presented in Table 1. A small WCSS indicates a compact cluster with little variation within the cluster, while a large BCSS indicates a large variation (i.e. good separation) between the different clusters. Table 1 presents the key performance indicators corresponding to the k-means clustering shown in Figure 4. For all three cases, WCSS is much smaller than BCSS, confirming that a robust clustering has been achieved. All three performance indicators scored lowest for Findhorn (Portfolio 1); this could be due to its having the fewest data points (14 dwellings) in comparison with the Fintry-based Portfolio 2 and 3 datasets of 74 and 54 dwellings, respectively. Interestingly, for Fintry, all three indicators were lower for the July dataset (Portfolio 2, 74 dwellings) than for the November dataset (Portfolio 3, 54 dwellings). These observations are in line with the higher mean and median demand values observed for November in comparison with July. Further, as expected, the Within Cluster Sum of Squares increases with the number of data points in a cluster, across all four clusters and all three case studies. To demonstrate whether sufficient diversity is captured across the different groups/clusters, Figure 5 illustrates box plots of five key summary statistics (minimum, 25th percentile, 50th percentile, 75th percentile and maximum), estimated for the total energy demand of all the clusters across the three data portfolios.
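The identity behind these indicators, TSS = WCSS + BCSS, can be verified numerically. Below is a minimal Python sketch with illustrative data and arbitrary cluster labels:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))          # 30 items, 2 features (e.g. mean, median)
labels = np.arange(30) % 3            # three arbitrary, non-empty clusters
K = 3

grand_mean = X.mean(axis=0)
tss = ((X - grand_mean) ** 2).sum()   # Total Sum of Squares

# Within-cluster: squared distances of points to their own cluster mean
wcss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
           for k in range(K))
# Between-cluster: size-weighted squared distances of cluster means to grand mean
bcss = sum((labels == k).sum()
           * ((X[labels == k].mean(axis=0) - grand_mean) ** 2).sum()
           for k in range(K))
```

The decomposition holds exactly for any hard clustering, which is why a small WCSS necessarily implies a large BCSS for fixed data.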
The x-axis represents the cluster numbers, ranging from 1 to 4, while the value 0 represents the distribution of summary statistics for all the dwellings in the corresponding portfolio. A wide variation can be noticed across the box plots of total demand in the different clusters across the three portfolios. For example, total demand in Fintry in July 2017 (ranging between 50 and 1200 kWh) is considerably lower than in November 2017 (ranging between 100 and 2750 kWh). One of the key reasons is the seasonal effect, though it is interesting to notice that the total demand for clusters 2-4 varies considerably less in the Findhorn community (which contains non-residential buildings) than in cluster 1, and also in comparison with the Fintry community. Thus, it is clear that the 12 clusters (across the three portfolios) comprise considerably different ranges of energy demand distribution, and are therefore suitable for investigating the robustness and potential application of the novel STL_HMM_GP system.
In addition to the above analysis, a range of further cluster evaluation indicators, such as Silhouette (Si) analysis, the partition coefficient index, the classification entropy, the Davies-Bouldin Index (DBI) and the Xie-Beni (XB) indicator, could be computed. However, given the limited scope and focus of the present study, these additional analyses are not presented here. Nevertheless, the graphical illustrations presented in Figures 4 and 5, together with Table 1, show that a considerably robust clustering of the three data portfolios has been achieved (Table 2).

Model performance, simulation analysis and validation
A rigorous model performance and validation procedure is essential to ensure the reliability of the simulated results, i.e. to establish whether the model is capable of representing reality as reflected by the observed dataset. To demonstrate the effectiveness of the proposed data-driven STL_HMM_GP system in simulating the dynamics of observed electricity demand, three key statistical properties are analysed: (i) the Probability Density Distribution (PDD), (ii) percentile analysis and (iii) the Auto-Correlation Function (ACF). A thorough model validation is performed to assess the capabilities of the STL_HMM_GP system in simulating the diverse portfolio of demand profiles and in capturing the degree of uncertainty associated with the empirical dataset. The simulation analysis is illustrated for 12 dwellings, one drawn at random from each of the 12 clusters defined across the three data portfolios. This selection ensures that validation is performed across the diversity of demand patterns, seasonality and timescales (5, 15 and 30 min) represented in the two communities.

Probability Density Distribution (PDD)
The area under the PDD curve graphically illustrates the likelihood of occurrence of demand values within specified ranges. The PDD of the empirical demand series is compared with the PDD of the corresponding synthetic demand series generated by the STL_HMM_GP system for all 12 randomly selected dwellings.
To illustrate the impact of the bias-correction procedure, comparisons are made between synthetic series that have been post-processed using the bias-correction module and those that have not (Figure 6). In Figure 6, a considerable variation in distribution patterns can be noted across the different clusters and the three data portfolios. The Findhorn dataset (Figure 6, top panel) shows a high probability associated with demand in the range 0-2 kW, except for cluster 1, which shows two further peaks in the ranges 2-4 and 4-6 kW. The Fintry dataset shows more widely varying demand patterns during November 2017 (Figure 6, bottom panel) than during July 2017 (Figure 6, middle panel), with more than two peaks in all cases. Recall that Fintry is an off-gas-grid community, so high electricity consumption is expected in the winter months. Interestingly, the bias-correction procedure has a considerable impact on correcting the shapes of the distribution patterns in almost all cases, most noticeably for Fintry in November 2017, where a large variation in distribution patterns was noted.
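The effect of a percentile-based correction on distribution shape can be illustrated with a generic quantile-mapping sketch. This is a Python illustration with synthetic data, not the paper's exact scheme: each model value is mapped through its percentile rank onto the empirical quantiles.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.gamma(shape=2.0, scale=0.5, size=5000)  # "empirical" demand, kW
biased = observed * 1.3 + 0.2                          # systematically biased model output

percentiles = np.arange(0, 101)                        # 0th..100th, unit step
obs_q = np.percentile(observed, percentiles)
mod_q = np.percentile(biased, percentiles)

# Replace each biased value by the empirical value at the same percentile rank
corrected = np.interp(biased, mod_q, obs_q)
```

Because the mapping is monotone, it reshapes the distribution towards the empirical one without reordering the series in time, which is consistent with the observation below that bias correction leaves the autocorrelation structure intact.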

Percentile analysis
The x-th statistical percentile is a cut-off value such that x% of the data in the group falls below, and (100 − x)% of the data falls above, that specific value. A thorough percentile analysis (ranging from the 0th to the 100th percentile at unit steps) is conducted to facilitate a relative percentile comparison of the empirical, simulated and bias-corrected synthetic demand profiles (Figure 7, which presents this comparison for the 12 dwellings selected from the 4 clusters across the three data portfolios). The percentile comparison in Figure 7 shows a good match between the percentiles of the empirical and STL_HMM_GP-simulated demand across all 12 diverse scenarios presented. It is interesting to notice the influence of the bias-correction procedure in minimising the model bias, most noticeable for demand values over roughly 1 kW, occurring mostly in the range above the 80th percentile.
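A unit-step percentile comparison of the kind plotted in Figure 7 can be sketched as follows; the two synthetic series here stand in for empirical and simulated demand.

```python
import numpy as np

rng = np.random.default_rng(42)
empirical = rng.gamma(2.0, 0.5, size=10000)   # stand-in for observed demand, kW
simulated = rng.gamma(2.0, 0.5, size=10000)   # stand-in for synthetic demand, kW

steps = np.arange(0, 101)                     # 0th..100th percentile, unit step
p_emp = np.percentile(empirical, steps)
p_sim = np.percentile(simulated, steps)

# Worst mismatch over the interior percentiles (the extremes are noisier)
max_gap = np.abs(p_emp[5:96] - p_sim[5:96]).max()
```

Plotting `p_emp` against `p_sim` gives exactly the kind of percentile-vs-percentile comparison shown in Figure 7; a perfect model traces the diagonal.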

Auto-Correlation Function (ACF)
The ACF is a signature property of a time series that quantifies the correlation of the series with itself over a range of successive delayed (lagged) observations. The ACF is widely used to assess the degree of randomness in a time series: for a perfectly uncorrelated series (referred to as a white noise process), the ACF has a unit value at zero lag and a value of zero at all other lags. Figure 8 compares the ACF of the empirical demand series against the synthetic and bias-corrected synthetic series for all 12 diverse scenarios considered in this section. The ACF of the synthetic series appears to closely follow the patterns and dynamics of the empirical series in all cases, with some minor variation (of order 0.1) in correlation strength observed in some instances. Another interesting observation is that the bias-correction procedure has almost no influence on the auto-correlation properties of the synthetic series, confirming that post-processing the series with bias correction does not alter the dynamical features (intrinsic properties) of the demand series, but efficiently removes the model bias introduced by the data transformation steps.
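A minimal ACF computation matching this description can be sketched in Python; the white-noise property quoted above (ACF = 1 at lag 0, approximately 0 elsewhere) serves as a built-in check.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / var
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
white = rng.normal(size=20000)        # white noise: uncorrelated by construction
rho = acf(white, max_lag=10)
```

Applied to an observed demand series and its synthetic counterpart, comparing the two `rho` vectors lag by lag reproduces the comparison made in Figure 8.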

Application of STL_HMM_GP system
The potential of the STL_HMM_GP system in simulating the dynamics of aggregated demand profiles is investigated using three different sample sizes, applied to all three data portfolios. A detailed discussion of the analysis and findings is presented in Subsection 7.1 using the Portfolio 2 (Fintry, July 2017) dataset, which contains the largest number of dwellings (74). However, to provide a robust assessment of model performance, an analogous analysis is also conducted for the Portfolio 1 and 3 datasets; a brief discussion, along with the analogous Tables 3 and 4 and graphs, is provided in Subsections 7.2 and 7.3, respectively.

Simulating aggregated demand profiles (Portfolio 2)
One of the major tasks in generating an aggregated demand profile is the selection of an appropriate sample of individual dwellings of suitable size. To capture the dynamics of aggregated demand, the sample of individual demand profiles must adequately account for the various socio-economic-environmental factors affecting demand patterns in the community. Three sample sizes, accounting for 10%, 20% and 100% of the total number of dwellings (e.g. 74 in Portfolio 2) and relating to Aggregation schemes 1, 2 and 3, respectively, are investigated. Since no socio-economic (demographic) information is available, samples from the different clusters are selected using a logical weighting factor scheme (Table 3). The weighting factor W_c for a cluster c gives the number of dwellings to be selected from that cluster and is estimated as:

W_c = s × (P_E + P_s)/2 (3)

where s is the total size of the sample required, P_E is the proportion of the total demand accounted for by cluster c, and P_s is the proportion of the total number of dwellings accounted for by the cluster (the result is rounded to the nearest integer). It is anticipated that with the availability of additional demographic information, the proposed aggregation schematic could be further improved.
The selection of multiple dwellings from a cluster is further guided by five summary statistics of demand within the cluster, thus ensuring that a varied dynamical composition of sample demand profiles is used in the aggregated demand synthesis. For example, the Aggregation 1 schematic uses 10% of the total number of dwellings as the sample size, i.e. 8 dwellings are selected as the sample. The sample is selected using Equation (3), i.e. W_c = 8 × average(0.28, 0.46) = 3 for Cluster 1. Similarly, using Equation (3), for Aggregation 1 W_c is estimated as 1, 1 and 3 for Clusters 2, 3 and 4, respectively.
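The weighting-factor calculation can be sketched directly from the worked example above; rounding to the nearest integer is an assumption consistent with W_c = 8 × average(0.28, 0.46) = 3.

```python
def weighting_factor(sample_size, p_energy, p_size):
    """Dwellings to draw from a cluster: round(s * average(P_E, P_s)).

    p_energy (P_E) and p_size (P_s) are the cluster's shares of total
    demand and of the total number of dwellings, respectively.
    """
    return round(sample_size * (p_energy + p_size) / 2)

# Aggregation 1 (10% of 74 dwellings -> sample of 8), Cluster 1:
w_cluster1 = weighting_factor(8, 0.28, 0.46)   # 8 * 0.37 = 2.96 -> 3
```

Averaging the demand share and the size share lets a cluster that consumes disproportionately much (or little) energy relative to its size still be represented proportionately in the sample.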
Generating synthetic aggregated demand using a sample of 10% of dwellings involves generating multiple synthetic profiles in proportion to cluster size. Thus, for Aggregation 1, 11 synthetic profiles are generated using the STL_HMM_GP system for each of the three sample dwellings in Cluster 1 (i.e. 3 × 11 = 33). Similarly, 3 synthetic profiles are generated for the one sample dwelling selected from Cluster 2; 9 synthetic profiles for the one sample dwelling selected from Cluster 3; and 9 synthetic profiles for each of the three sample dwellings in Cluster 4 (i.e. 3 × 9 = 27). Further, one additional synthetic profile is generated from each of Clusters 2 and 3, making a total aggregation of 33 + 4 + 10 + 27 = 74 synthetic demand profiles. Figure 9 compares the realisation of the aggregated demand profile generated from the 74 observed demands in Portfolio 2 with 74 synthetic profiles generated by the STL_HMM_GP system, using the three samples of 10%, 20% and 100% of the total, selected within Aggregation schematics 1, 2 and 3, respectively. It can be noted that, with a small sample size (10% and 20%), the synthetic aggregation can successfully capture the key dynamical patterns of the observed aggregation; in particular, at a 20% sample size the match appears visually good. To further investigate the abilities of the STL_HMM_GP system as a potential demand-synthesising tool, an intensive investigation of the aggregation error (E_Aggregation) distribution for all three aggregation schemes is performed and presented in Figure 10.
The aggregation error E_Aggregation(t) is estimated at each time step t as the absolute difference between the observed and synthetic aggregated demand:

E_Aggregation(t) = |D_observed(t) − D_synthetic(t)|

The aggregation error analysis includes: (i) a discrete probability distribution (Figure 10, top panel), (ii) a cumulative probability distribution (Figure 10, middle panel) and (iii) a 3D error distribution relative to the demand values (Figure 10, bottom panel). It is interesting to notice that the discrete probability distribution of E_Aggregation(t) shows a decreasing trend towards the higher error ranges for all aggregation schemes, with Aggregation 2 closely following the dynamics of Aggregation 3. The cumulative distribution plots can be analysed to investigate specific details; for example, for Aggregation 2, E_Aggregation(t) falls within 10 kW at a cumulative probability of 75% and within 20 kW at a cumulative probability of 90%.
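The error analysis can be sketched as follows. This Python illustration uses synthetic stand-ins for the observed and synthetic aggregations; taking the absolute difference as the per-step error is an assumption consistent with the kW error ranges quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2016)                                # one week of 5-minute steps (7 * 288)
daily = 40 + 10 * np.sin(2 * np.pi * t / 288)      # shared daily demand pattern, kW
observed_agg = daily + rng.normal(0, 3, t.size)    # illustrative observed aggregation
synthetic_agg = daily + rng.normal(0, 3, t.size)   # illustrative synthetic aggregation

e_agg = np.abs(observed_agg - synthetic_agg)       # E_Aggregation(t), kW

# Error bounds covering 75% / 90% of time steps, cf. the cumulative plots
e75, e90 = np.percentile(e_agg, [75, 90])
```

Reading the empirical CDF of `e_agg` at chosen probabilities is exactly how the "within 10 kW at 75%" style of statement in the text is obtained from Figure 10.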
The application of the Generalised Pareto distribution is effective in improving the prediction of peak demand at the individual dwelling level. In aggregated demand profiles, and specifically in Aggregation 1, the amplitude of peak demand instances appears not to be estimated as accurately as the frequency and time of their occurrence. The discrepancy in the amplitude of peak demand events can be attributed to the aggregation schematic and could potentially be reduced by designing new aggregation schematics that allow optimum sampling of dwellings using key influential factors such as occupancy, floor area and lifestyle. This additional building/user-related information could be utilised in developing a more efficient aggregation scheme through the application of k-means or similar clustering approaches. Considering the focus of the paper, a detailed investigation of suitable aggregation schematics is proposed as future work.
Nevertheless, the 3D distribution of the joint probability of E_Aggregation(t) ranges relative to demand confirms that, even for high demand values, the error ranges are relatively small, confirming that the STL_HMM_GP system is effective in simulating all values consistently, including peak demand (even in aggregated profiles). High peak values occurring at relatively low frequency are often a key characteristic for understanding changes in peak demand; the performance of the model in this demand range is therefore of great importance.

Simulating aggregated demand profiles (Portfolio 1)
The performance of the STL_HMM_GP system in simulating aggregated demand profiles for Portfolio 1, analogous to Subsection 7.1, is presented in Figures 11 and 12. Since Findhorn has only 14 dwellings, selection at 10% and 20% sample sizes is not practically possible; in this case, Aggregation schemes 1, 2 and 3 comprise 25%, 50% and 100% of the dataset, respectively. In Figure 11, the aggregated demand profile of the 14 dwellings at 5-minute resolution is comparatively much spikier than that in Figure 9 (Portfolio 2). This observation is naturally expected, but it is interesting to notice that the STL_HMM_GP performance is consistent: the synthetic aggregation can still capture the dynamics of the observed aggregation as effectively as in the case of Portfolio 2.

Simulating aggregated demand profiles (Portfolio 3)
The performance of the STL_HMM_GP system in simulating aggregated demand profiles for Portfolio 3, analogous to Subsections 7.1 and 7.2, is presented in Figures 13 and 14. As in the earlier cases, the overall performance of the STL_HMM_GP system is consistent and robust.

Conclusion
This paper presented a novel approach for constructing a hybrid system of data-driven approaches, referred to as the STL_HMM_GP model, that can be utilised for analysing, modelling and synthetically simulating electricity demand profiles. The key underpinning idea is: 'if a small accessible sample of high-resolution electricity demand profiles (available at individual building level) can be synthetically simulated to replicate their statistical characteristics, then independent and identically distributed synthetic profiles can be statistically organised (e.g. using a k-means approach) and aggregated to project the dynamics of community-level energy demand series'.
The STL_HMM_GP system can be seen as a pipeline of carefully selected statistical/computational approaches for systematically processing a large volume of electricity demand data, capturing the impacts of the various instantaneous activities occurring in individual demand profiles on the overall dynamics of aggregated demand at the community level.
The architecture of the STL_HMM_GP framework integrates multiple components/approaches, systematically organised to serve a range of applications. The comprehensive system includes: (i) data organisation and sample selection using a K-means clustering approach; (ii) data pre-processing (simplifying complex dynamics) using STL-based time series decomposition; (iii) rigorous modelling using HMM techniques; (iv) extreme event modelling using the GP distribution; and (v) data post-processing using a novel percentile-based bias-correction scheme. The STL_HMM_GP system has been intensively validated for its ability to simulate the key statistical characteristics and dynamical features of individual demand profiles at three time resolutions (5, 15 and 30 minutes) and across various scenarios selected from two distinct community case studies. The potential of the STL_HMM_GP system has been demonstrated for generating aggregated demand profiles using a considerably small sample of individual demand profiles (10%-20% of the total) for three considerably distinct data portfolios. The results suggest that the STL_HMM_GP system works effectively and consistently for all three data portfolios used here.
The paper also demonstrates the potential of using a simple k-means-based clustering approach for optimising the sample selection procedure. The key underpinning idea is to select a considerably small sample of individual demand profiles (informed by k-means clustering that utilises simple monthly-level statistics as characterising features), which can be processed with highly efficient synthetic demand simulation models, such as the STL_HMM_GP, to generate aggregated demand profiles with high accuracy. In this paper, monthly mean and median values are used as the demand-characterising features in the k-means model; however, other useful features, such as floor area, occupancy and the proportion of demand based on time of use, could be investigated as part of future work to examine different sampling schemes and optimise model efficiency.
One of the key limitations of the proposed modelling work is the ability of the scheme to reproduce the amplitudes of peak demand in the aggregation scheme. As clarified earlier, this effect is mainly attributed to the selection/design of the aggregation schematics. In the proposed approach, simple monthly mean and median values are used for the classification of dwellings; however, more advanced aggregation schemes could be developed to include other building/user-related information in the characterisation and classification of buildings, allowing a more robust aggregation schematic.
Moreover, the proposed modelling scheme is demonstrated using only a one-month dataset. Considering the previous applications of the HMM_GP model in the area of streamflow simulation (Patidar et al. 2018), it can be expected that the proposed advanced form of the model, the STL_HMM_GP, which also includes an STL-based time-series deseasonalisation component, should be applicable to time series of any length over a month. The previous model, built on the underpinning HMM_GP scheme, has been demonstrated to generate promising results for flow time series of up to 50 years in length at 15-minute resolution, and has been applied to four morphologically and hydrologically different rivers located in Scotland. However, the authors suggest using a time series of at least 4 weeks in length to capture weekly seasonal patterns effectively.
In its current form, the STL_HMM_GP model is purely data-driven. Considering the demonstrated performance of the STL_HMM_GP system, there is considerable potential to further extend its structure to include dedicated modules. These dedicated modules would have the ability to conduct an in-depth causation analysis of the key factors (such as future climate change, technology change and socio-demographic vectors) impacting electricity demand profiles at both the individual and aggregated levels. The deterministic components of the demand profiles, specifically the 'Trend' and 'Seasonal' components, can provide a key to deriving the inter-relationships between these components and a change vector. For example, a correlation-based relationship can be defined between the trend component of an influencing climatic variable and the trend component of the electricity demand profile. A climatic module defined using historic data can then be used to project the updated trend component of the electricity demand profile using a projected trend component of temperature. The authors are developing such a climatic module for the existing STL_HMM_GP model. Similar modules can be specifically and individually designed to simulate the impacts of other influencing factors and can be integrated into the hybrid system to deliver a practical and impact-focused research/industry application tool.

Note
1. For non-stationary models two plots are produced: a residual probability plot and a residual quantile plot.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
both the local amplitude and the instantaneous phase of periodic processes across time (Haddad and Serdijn 2009). Wavelet analysis is a widely applied frequency-analysis technique and is suitable for non-stationary time series. Wavelet transformation can be used for a range of applications, such as extracting/detecting features, reconstructing/compressing data, multiscale analysis, investigating the strength (power) of periodic activities in different frequency bands, and filtering (Bishop et al. 2020).

Mathematical details underpinning 'WaveletComp' (R package)
The 'mother' Morlet wavelet forms the basis of the 'WaveletComp' package:

ψ(t) = π^(−1/4) e^(iωt) e^(−t²/2),

where ω is the angular frequency. The Morlet wavelet transform of a time series x_t is a convolution of the series with a set of complex-valued 'wavelet daughters', generated through translation in time by τ and scaling by s:

Wave(τ, s) = Σ_t x_t (1/√s) ψ*((t − τ)/s),

where * denotes the complex conjugate. The daughter wavelet (Eq. 4) is estimated using a Fast Fourier Transform algorithm (Torrence and Compo 1998). The time evolution of the local amplitude of any periodic component can be estimated as:

A(τ, s) = |Wave(τ, s)|,

and the phase is given by:

Φ(τ, s) = Arg(Wave(τ, s)).

The square of the amplitude, referred to as the time-frequency wavelet energy density, is called the wavelet power spectrum, i.e.

Power(τ, s) = |Wave(τ, s)|².
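A numerical sketch of the wavelet power spectrum described above, written in Python rather than reproducing the 'WaveletComp' internals: a daughter wavelet is sampled at one scale, the series is cross-correlated with its complex conjugate, and the squared modulus is taken. The choice ω = 6 follows the common Torrence and Compo convention and is an assumption here.

```python
import numpy as np

def morlet_power(x, scale, omega=6.0, dt=1.0):
    """Wavelet power |Wave(tau, s)|^2 at a single scale s."""
    x = np.asarray(x, dtype=float)
    u = np.arange(-4 * scale, 4 * scale + dt, dt) / scale   # (t - tau) / s grid
    psi = np.pi ** -0.25 * np.exp(1j * omega * u - u ** 2 / 2)
    # cross-correlation with psi* realises the convolution sum above
    wave = np.convolve(x, np.conj(psi)[::-1], mode="same") * dt / np.sqrt(scale)
    return np.abs(wave) ** 2

# A sine of period 64 samples: power is large near the matching scale
# (Fourier period ~ 1.03 * scale for omega = 6) and small far away from it
series = np.sin(2 * np.pi * np.arange(1024) / 64)
p_match = morlet_power(series, scale=62)[256:-256].mean()
p_off = morlet_power(series, scale=16)[256:-256].mean()
```

Scanning over many scales and stacking the resulting power rows yields the time-frequency power image that 'WaveletComp' plots.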
The original series can be reconstructed for the selected periods whose average power was found significant over the entire time.

Table 5. The selection of percentiles for splitting the O_s set is methodical, obtained following a thorough analysis of several options, and allows the allocation of a sufficient number of data points to each state. Considering the possibility of high variation in values above the 90th percentile, States J and K are divided using a gap of 5 percentile points.

Appendix II: Fitting HMM to energy demand series
M2. State transition probability matrix [T] is an 11 × 11 matrix of transition probabilities, mathematically defined as T_ij = P(S_{t+1} = U_j | S_t = U_i), for i, j ∈ {A, B, . . . , K}. If State A is defined as a value between 0 (0th percentile) and 0.7 (10th percentile), then the hidden state U_A will represent the set of values (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7). The values 0 and 0.7 are arbitrary, chosen purely for the purpose of demonstration. For the present work, a resolution of one decimal place is selected, which is practically appropriate for modelling energy demand series. Depending on the scale/application, finer resolutions of two or more decimal places could be selected; however, an optimal balance is needed to avoid a large number of zero probabilities in the emission probability matrix (as explained later).
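Estimating the transition matrix from an observed state sequence reduces to a counting exercise, sketched below in Python; a toy 3-state sequence stands in for the paper's 11 states A-K.

```python
import numpy as np

states = [0, 0, 1, 2, 2, 1, 0, 1, 1, 2, 0, 0, 1]   # toy 3-state sequence
n_states = 3

# Count observed transitions i -> j over consecutive pairs
counts = np.zeros((n_states, n_states))
for a, b in zip(states[:-1], states[1:]):
    counts[a, b] += 1

# Normalise each row: T[i, j] = P(S_{t+1} = j | S_t = i)
T = counts / counts.sum(axis=1, keepdims=True)
```

Each row of `T` sums to one, as required for a stochastic matrix; with the paper's 11 percentile-defined states the same construction yields the 11 × 11 matrix described above.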
Figure A1. Maximum likelihood estimates and confidence intervals of the shape and modified scale parameters of the Generalised Pareto (GP) distribution over threshold values in the range of the 95th-99.9th percentile, for Node 1 in Findhorn.

M4. Emission probability matrix [E] is defined for each of the hidden states in U_s. Following the example above, corresponding to the hidden state U_i ∈ U_s, an emission probability matrix [E_i] is a row matrix of length E_i^Length = (range of U_i × 10) + 1. The initial probability I_i of the i-th state is defined as

I_i = (total number of instances the system is in state i) / (length of the entire series)
Appendix III: ML estimates and diagnostic plots for the GP distribution

The gpd.fitrange function is used to plot the ML estimates (along with 95% confidence intervals) of the modified scale and shape parameters for the GP fit over a range of threshold values from the 95th to the 99.9th percentile (Figure A1). The GP model is fitted using the gpd.fit function in R for a threshold value at the 97th percentile and 2016 observations per block (a week). Diagnostic plots to assess the quality of the GP fit are produced using the R function gpd.diag ('ismev' package) and are presented in Figure A2. The gpd.diag function takes an object returned by gpd.fit as an argument and, for a stationary model, produces four plots (see Note 1): (a) a probability plot, (b) a quantile plot, (c) a return level plot and (d) a histogram of the observed extreme demand values with the fitted density.

Figure A2. Diagnostic plots for the GP fit with the threshold selected at the 97th percentile, for Node 1 in Findhorn.

The probability plot presents a scatter plot comparing the density distribution of the empirical data (extreme load values) against the theoretical distribution (in the present case, the GP distribution). As most of the data points lie on the straight line, this plot demonstrates that the density distribution of the empirical data is similar to the fitted GP distribution. By the same analogy, the quantile (Q-Q) plot assesses how well the theoretical quantiles match the quantiles of the empirical data. The empirical quantiles appear to be in good agreement with the theoretically estimated quantiles for load values within the range 2-6 kW. The return level plot, along with its associated confidence envelope, shows the likelihood of an extreme demand value (selected on the y-axis, the return level) being exceeded once in a given return period (for the present data, 1 week represents a year/block) (Devineni 2018). The data and the return level plot are reasonably in agreement.
Finally, the histogram of observed extreme demand values is shown to fit well with the GP density curve. Further theoretical details on the statistical modelling of extreme values can be found elsewhere (Coles 2001; Coles n.d.).
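The peaks-over-threshold step can be sketched in Python with numpy only. Note that the paper fits the GP by maximum likelihood via the 'ismev' functions; the method-of-moments estimators below are a simpler stand-in, and the demand series is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
demand = rng.gamma(2.0, 0.5, size=20000)   # synthetic demand series, kW

u = np.percentile(demand, 97)              # threshold at the 97th percentile
exceed = demand[demand > u] - u            # peaks over threshold

# Method-of-moments GP estimators (valid for xi < 1/2):
#   xi    = (1 - m^2 / v) / 2
#   sigma = m * (1 + m^2 / v) / 2
m, v = exceed.mean(), exceed.var()
shape = 0.5 * (1 - m ** 2 / v)
scale = 0.5 * m * (1 + m ** 2 / v)
```

For a gamma-tailed series the exceedances are approximately exponential, so the fitted shape parameter sits near zero; sweeping `u` over the 95th-99.9th percentiles and re-estimating mirrors the threshold-stability check performed by gpd.fitrange.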