Energy-efficient control of thermal comfort in multi-zone residential HVAC via reinforcement learning

Energy-efficient control of thermal comfort has long been an important part of residential heating, ventilation, and air conditioning (HVAC) systems. However, optimising energy-saving control for thermal comfort is not an easy task, owing to the complex dynamics of HVAC systems, the dynamics of thermal comfort itself, and the trade-off between energy saving and thermal comfort. To solve this problem, we propose a deep reinforcement learning-based thermal comfort control method for multi-zone residential HVAC. We first design an SVR-DNN model, consisting of Support Vector Regression and a Deep Neural Network, to predict the thermal comfort value. We then apply the Deep Deterministic Policy Gradient (DDPG) algorithm, driven by the output of the SVR-DNN model, to obtain an optimal HVAC thermal comfort control strategy. This method minimises energy consumption while satisfying occupants' thermal comfort. The experimental results show that our method improves thermal comfort prediction performance by 20.5% compared with a DNN; compared with a deep Q-network (DQN), energy consumption and thermal comfort violation are reduced by 3.52% and 64.37% respectively.


Introduction
Building energy consumption accounts for about 40-50% of global energy consumption (Pérez-Lombard et al., 2008) and 30% of all CO2 emissions (Costa et al., 2013). The growth of building density and urban population leads to an inevitable increase in building energy consumption. HVAC systems are the main means of indoor thermal comfort control, and their proper regulation has a significant impact on occupants' satisfaction with thermal comfort. The energy consumption of HVAC systems accounts for a significant proportion of total building energy consumption. It is therefore necessary to promote energy- and comfort-aware control strategies for smart building energy management (Moreno et al., 2017).
Currently, rule-based control (RBC) is widely adopted to solve control problems in HVAC systems, but it relies on engineers' experience and thus cannot learn from historical data to save energy effectively or satisfy occupants' thermal comfort requirements. Model predictive control (MPC) (Zeng & Barooah, 2020), a model-based control method, usually solves HVAC control problems better than RBC. However, MPC requires a large amount of historical and real-time monitoring data to establish a model accurate enough to save energy while meeting occupants' thermal comfort requirements. In the single-zone HVAC control problem, a low-order model can be established for MPC. In the multi-zone problem, however, it is necessary to consider not only indoor-outdoor heat exchange but also the heat exchange between zones, which makes the building thermal model considerably more complex. It is therefore difficult to establish a model accurate enough for MPC to solve the multi-zone HVAC control problem.
To overcome the above problems, data-driven machine learning methods such as deep learning and reinforcement learning have received extensive attention. Chen et al. (2022) survey machine learning approaches developed to optimise industrial loads. Kheyrinataj and Nazemi (2020) propose a neural network algorithm to solve delay fractional optimal control problems. In Nazemi et al. (2019), a neural network framework-based method is proposed to solve the optimal control problem. In smart grids, Fatema et al. (2021) propose a deep learning-based method to meet grid requirements, and Rocchetta et al. (2019) propose an RL framework to optimise the power grid. A Q-learning based method has also been proposed to optimise HVAC systems for energy savings during the cooling season. Model-free optimal control methods, especially those based on deep reinforcement learning (DRL) (Zhou et al., 2022), show good adaptability and robustness in control problems. In Han et al. (2019), a DRL method is used to optimise energy management. Model-free control methods do not need an accurate model to solve control problems.
There have also been some pioneering works utilising DRL to optimise HVAC system control. K. R. Kurte et al. (2021) use a Deep Q-Network (DQN) to satisfy residential demand response and compare it with a model-based HVAC method. That work mainly focuses on the effect of temperature on thermal comfort; moreover, the method is evaluated only on a discrete action space and is not compared with methods that can handle continuous action spaces. Wei et al. (2021) use a DQN-based approach for HVAC control. Brandi et al. (2020) adopt a DRL method for indoor temperature control. All of the above works demonstrate the effectiveness of DRL for HVAC thermal control; however, they solve optimal HVAC control on a discrete action space, which can limit the performance of thermal control. In Du et al. (2021) and Yu et al. (2019), the Deep Deterministic Policy Gradient (DDPG) method is used to achieve energy efficiency and comfort satisfaction without discretisation. However, these studies only consider the effect of temperature on comfort and are more concerned with energy efficiency.
Motivated by the above concerns, we apply a method combining a thermal comfort prediction model and DRL to optimise residential multi-zone HVAC control. Our objective is to minimise energy consumption while satisfying occupants' thermal comfort requirements. We evaluate RBC and three RL algorithms in a multi-zone residential HVAC model: Q-learning (a traditional RL method with discrete control), Deep Q-Network (DQN, a DRL method with a discrete action space) and Deep Deterministic Policy Gradient (DDPG, a DRL method with continuous control). Our method also provides new ideas and techniques for maintaining a given level of comfort in commercial and residential buildings while saving energy and reducing emissions. The main contributions of this paper are summarised as follows: (1) We design a hybrid model based on Support Vector Regression (SVR) and a Deep Neural Network (DNN), called SVR-DNN, to predict the thermal comfort value, which is used as part of the state and reward in reinforcement learning.
(2) The multi-zone residential HVAC problem, considering occupancy and heat exchange between zones, is formulated as a reinforcement learning problem. (3) We apply the Q-learning, DQN and DDPG methods to optimise HVAC control in a multi-zone residential HVAC model and compare the performance of the three algorithms. We show that these algorithms reduce thermal comfort violation compared with rule-based control in multi-zone HVAC control, and that DDPG shows the best control performance. (4) We verify the adaptability of the proposed method under different regional weather conditions, and we also verify that it shows the best performance under different thermal comfort models.
The rest of the paper is organised as follows. Related work is described in Section 2. Section 3 introduces the theoretical background of the RL and DRL algorithms. The problem formulation of multi-zone residential HVAC control is presented in Section 4. The simulation environment is introduced in Section 5, and the simulation results are analysed in Section 6. Finally, Section 7 concludes the paper.

Thermal comfort
Thermal comfort is the level of satisfaction with the environment experienced by the occupants. A number of models have been developed to evaluate thermal comfort quantitatively. Shaw (1972) proposed the Predicted Mean Vote-Predicted Percentage Dissatisfied (PMV-PPD) model, which is based on a heat balance model and is designed to quantify the extent to which occupants perceive the environment as hot or cold. With the rapid development of ML, a range of thermal comfort models based on ML algorithms has appeared. Zhou et al. (2020) used the support vector machine (SVM) algorithm to develop a thermal comfort model with self-learning and self-correction ability. In Liu et al. (2007), a model based on the Back Propagation (BP) neural network for individual thermal comfort was proposed. Baldi et al. (2018) proposed a switched self-tuning method to reduce energy use and improve thermal comfort. Korkas et al. (2018) proposed an EMS strategy to change the energy demand considering occupancy information. In Korkas et al. (2015), a distributed demand management system is proposed that adapts to different changes (weather or occupancy). The above studies obtain their control strategies through adaptive optimisation. In Wu et al. (2018), a hierarchical control strategy is used to provide primary frequency regulation in residential HVAC systems. Watari et al. (2021) adopted an MPC-based method for energy management and thermal comfort. Zeng and Barooah (2021) proposed an adaptive MPC scheme in HVAC systems for energy saving.

Thermal comfort HVAC control
The above methods can all be classified as model-based methods, in which the thermal dynamics of the HVAC environment need to be modelled. However, the thermal environment is influenced by a variety of factors and is difficult to model precisely. Model-free RL has developed greatly in recent years, and many researchers have applied RL to HVAC control problems. One line of work implemented Q-learning alongside a model-based controller to optimise building HVAC systems for energy savings. In Brandi et al. (2020), deep reinforcement learning (DRL) is applied to optimise the supply water temperature setpoint in a heating system, and the well-trained agent saves between 5% and 12% of energy. Energy savings from optimised HVAC control translate directly into cost savings. Jiang et al. (2021) proposed a DQN with an action processor, saving close to 6% of total cost with demand charges and close to 8% without. Wei et al. (2021) applied a DRL-based algorithm to minimise total energy cost while maintaining the desired room temperature and meeting data centre workload deadline constraints. A deep Q-network (DQN) was applied to optimise four air-handling units (AHUs), two electric chillers, a cooling tower, and two pumps to minimise energy consumption while maintaining the indoor CO2 concentration (Ahn & Park, 2019). Zenger et al. (2013) implemented an RL algorithm to maintain thermal comfort while saving energy. In K. R. Kurte et al. (2021), the authors implemented the DQN algorithm to achieve energy savings while meeting (temperature-based) comfort requirements. Fu et al. (2022) proposed a distributed multi-agent DQN to optimise HVAC systems. Cicirelli et al. (2021) used DQN to balance energy consumption and thermal comfort. Other works have applied DRL in residential HVAC control to save costs and maintain comfort. However, Q-learning and DQN can still only handle discrete action spaces.
In Yu et al. (2019), a DRL method was applied to smart home energy management. Du et al. (2021) implemented the DDPG method for 2-zone residential HVAC control strategies that respect a lower bound on user comfort (temperature) while saving energy, but did not establish a thermal comfort prediction model. McKee et al. (2020) used DRL to optimise residential HVAC considering human occupancy to achieve energy savings. Sang and Sun (2021) applied DDPG to generate an HVAC cooling-heating-power strategy to solve the demand response problem. In Yu et al. (2021), a multi-agent DRL method is proposed for building HVAC control to minimise energy consumption. A concern with multi-agent algorithms is that as the number of zones increases, the number of neural networks increases, resulting in excessive computational cost. Based on these considerations, we propose a method combining DDPG and a thermal comfort model for energy-efficient thermal comfort control in multi-zone residential HVAC. Our method is able to reduce energy consumption while meeting thermal comfort requirements.
Table 1 summarises the related work. Model-based methods (Korkas et al., 2018, 2015; Wu et al., 2018; Watari et al., 2021; Zeng & Barooah, 2021) learn strategies from the system's own dynamic model and are more robust; however, they require an accurate model, and the dynamic model of an HVAC system is difficult to build accurately. DDPG-based methods (DDPG handles continuous action spaces and, again, does not require an accurate model) achieve energy savings and comfort requirements, but focus more on energy consumption and consider only temperature-based comfort. The multi-agent DDPG of Yu et al. (2021) uses an attention mechanism to minimise energy consumption, assigning an agent to each zone to deal with the high-dimensional space, but the ensuing computational cost becomes high.

Theoretical background of algorithms
In this section, the theoretical background of RL and DRL is presented, concentrating on the DDPG, DQN and Q-learning methods. Reinforcement learning (RL) is a kind of trial-and-error learning through interaction with the environment (Montague, 1999). Its goal is to make the agent obtain the largest cumulative reward from its interaction with the environment. The RL problem can be modelled as a Markov Decision Process (MDP), described by a quintuple $(S, A, r, p_1, P)$. The MDP is shown in Figure 1.
(1) $S$ is the state space; $s_t \in S$ denotes the state of the agent at time $t$.
(2) $A$ is the action space; $a_t \in A$ denotes the action taken by the agent at time $t$.
(3) $r : S \times A \to \mathbb{R}$ is the reward function; $r_t \sim r(s_t, a_t)$ denotes the immediate reward obtained by the agent executing action $a_t$ in state $s_t$.
(4) $p_1$ is the initial state distribution with density $p_1(s_1)$.
(5) $P$ is the state transition probability; $P(s_{t+1}|s_t, a_t)$ denotes the probability of reaching state $s_{t+1}$ after taking action $a_t$ in state $s_t$.
A policy, denoted by $\pi : S \to P(A)$, is used to select actions in the MDP; $\pi(a_t|s_t)$ represents the probability of selecting $a_t$ in $s_t$. Depending on the requirements, the policy can be stochastic or deterministic. The agent uses its policy to interact with the environment, generating a trajectory of states, actions and rewards $z_{1:T} = s_1, a_1, r_1, \ldots, s_T, a_T, r_T$ over $S \times A \times \mathbb{R}$. From time $t$ to the end of the episode at time $T$, with the immediate reward at each future step multiplied by a discount factor $\gamma$, the return $G_t$ is defined as

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k, \quad (1)$$

where $\gamma$ weighs the impact of future rewards on the return $G_t$. The value functions are the state value function and the state-action value function, defined as expectations of the return $G_t$:

$$V^{\pi}(s) = \mathbb{E}[G_t \mid s_t = s; \pi], \qquad Q^{\pi}(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a; \pi]. \quad (2)$$

Similarly, $V^{\pi}(s)$ and $Q^{\pi}(s, a)$ are related by $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s, a)]$. The agent seeks a policy that maximises the performance objective (Montague, 1999):

$$J(\pi) = \mathbb{E}[G_1 \mid \pi]. \quad (3)$$

In dynamic HVAC systems, the state transition probability distribution is unknown; the agent learns the critical information by trial and error. With model-free reinforcement learning methods, the agent can learn an optimal policy that accounts for both energy consumption and thermal comfort.
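As a concrete illustration of the return $G_t$ defined above, here is a minimal Python sketch; the function name and sample values are ours, not from the paper.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k=t}^{T} gamma^(k-t) * r_k, accumulated backwards from step T."""
    g = 0.0
    for r in reversed(rewards):
        # Each step back in time multiplies the accumulated future return by gamma.
        g = r + gamma * g
    return g

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

Iterating backwards avoids recomputing powers of $\gamma$ at every step.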

Q-learning
Q-learning is a value-based algorithm. Its goal is to find an optimal policy $\pi^*$ that maximises Eq. (3). Given an optimal policy $\pi^*$, the optimal value functions are $V^*(s) = \max_{\pi} V^{\pi}(s)$ and $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$. They satisfy the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}\big[r(s, a) + \gamma \max_{a'} Q^*(s', a')\big].$$

In general, $Q^*$ is found by iterating the Bellman equation. But since the transition probability $P$ is usually unknown in practical problems, the Bellman equation cannot be solved directly. Q-learning uses the temporal difference (TD) method, which combines Monte Carlo sampling with the bootstrapping of dynamic programming, to update the Q value:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big].$$

Q-learning is a simple, intuitive and effective algorithm, but it can only deal with small discrete state and action spaces. If the space is too large, the Q table grows accordingly and efficiency is greatly reduced.
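A single tabular Q-learning update can be sketched as follows; this is an illustrative toy, not the paper's implementation, and the state/action labels are made up.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One TD update: move Q(s,a) toward the target r + gamma * max_a' Q(s',a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)   # the Q table, all entries initialised to 0
actions = [0, 1]
q_update(Q, "s0", 0, 1.0, "s1", actions)   # 0 + 0.1 * (1 + 0.9*0 - 0) = 0.1
```

The `defaultdict` plays the role of the Q table: unseen state-action pairs implicitly start at zero, matching the initialisation in Algorithm 1.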

Deep Q-network (DQN)
To handle high-dimensional sensory inputs and generalise past experience to new situations, Mnih et al. (2015) combined a convolutional neural network with traditional Q-learning and proposed the deep Q-network model, a pioneering work in DRL. DQN parameterises the state-action value function $Q^{\pi}(s, a)$ by a nonlinear neural network and updates the network parameters to approximate the optimal state-action value function $Q^*(s, a)$. We write the parameterised value function as $Q(s, a; \omega)$, where $\omega$ denotes the estimated parameters. DQN is also based on the Bellman optimality equation; rewriting Eq. (2) in iterative form gives

$$Q(s, a; \omega) = \mathbb{E}\big[r + \gamma \max_{a'} Q(s', a'; \omega)\big].$$

RL is not guaranteed to converge when the value function is represented by a nonlinear function. To address this, DQN introduces a target network and an experience replay mechanism to stabilise training. In this paper, the target network is denoted $Q(s, a; \omega^{-})$, and $Y_j = r + \gamma \max_{a'} Q(s', a'; \omega^{-})$ approximates the optimisation objective of the value function. The network is trained by minimising the mean square error loss

$$L(\omega) = \mathbb{E}\big[(Y_j - Q(s, a; \omega))^2\big]. \quad (5)$$

The parameter $\omega$ of network $Q(s, a; \omega)$ is updated continuously; after every $N$ iterations, the parameters of $Q(s, a; \omega)$ are copied to the target network $Q(s, a; \omega^{-})$. Differentiating Eq. (5) with respect to $\omega$ gives the gradient

$$\nabla_{\omega} L(\omega) = \mathbb{E}\big[(Y_j - Q(s, a; \omega)) \nabla_{\omega} Q(s, a; \omega)\big].$$

DQN solves most problems effectively. Its disadvantage is that it cannot deal with continuous action spaces, and even on a discrete action space its performance deteriorates when the space is large.
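The target and loss computation can be sketched with NumPy; the batch shapes, sample values and the terminal-state mask are our illustrative choices, not taken from the paper.

```python
import numpy as np

def dqn_targets(rewards, next_q, gamma, done):
    """Y_j = r + gamma * max_a' Q(s', a'; w-), with no bootstrap on terminal steps."""
    return rewards + gamma * next_q.max(axis=1) * (1.0 - done)

def mse_loss(targets, q_taken):
    """Mean square error between the targets and the online network's estimates."""
    return float(np.mean((targets - q_taken) ** 2))

rewards = np.array([1.0, 0.0])
next_q = np.array([[0.5, 1.0], [0.2, 0.1]])  # target-network Q values for s'
done = np.array([0.0, 1.0])                  # second transition ends the episode
targets = dqn_targets(rewards, next_q, 0.9, done)   # [1.9, 0.0]
```

Freezing `next_q` at the target network's (periodically copied) parameters is what keeps the regression target stable between updates.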

Deep deterministic policy gradient (DDPG)
Inspired by DQN, Lillicrap et al. (2015) proposed a DDPG algorithm combining DQN and the deterministic policy gradient (DPG). DDPG is designed to solve problems with continuous action space in DRL.

Deterministic policy gradient (DPG)
In reinforcement learning, the policy gradient (PG) method is used to deal with continuous action spaces. Unlike the value-based DQN and Q-learning, PG directly parameterises the policy $\pi_{\theta}(a|s)$. The performance objective is $J(\pi_{\theta}) = \mathbb{E}[G_1 \mid \pi_{\theta}]$, and we use $\rho^{\pi}(s)$ to denote the state distribution. In PG, the policy stochastically selects action $a$ in state $s$ according to its parameters, so what the PG algorithm learns is a stochastic policy. Sutton et al. (1999) proposed the policy gradient theorem:

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi}(s, a)\big].$$

In stochastic problems, PG may need more samples because it integrates over both the state space and the action space, which also increases the computational cost. Silver et al. (2014) proposed the deterministic policy gradient (DPG) algorithm, which integrates only over the state space. Using a deterministic policy $\mu_{\theta} : S \to A$ with parameter vector $\theta \in \mathbb{R}^n$, the performance objective is $J(\mu_{\theta}) = \mathbb{E}[G_1 \mid \mu_{\theta}]$ and the state distribution is $\rho^{\mu}$. The deterministic policy gradient theorem is then

$$\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s \sim \rho^{\mu}}\big[\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)}\big].$$

Actor-critic network
In DDPG, the actor network selects the action $a$ in state $s$ and ultimately approximates the optimal deterministic policy; the critic network evaluates the state-action value $Q(s, a)$ and approximates $Q^*(s, a)$. Like DQN, DDPG maintains target copies of both networks. The online and target actor networks are denoted $\mu(s|\theta^{\mu})$ and $\mu'(s|\theta^{\mu'})$, with $\theta^{\mu}, \theta^{\mu'} \in \mathbb{R}^n$; the online and target critic networks are denoted $Q(s, a|\theta^{Q})$ and $Q'(s, a|\theta^{Q'})$, with $\theta^{Q}, \theta^{Q'} \in \mathbb{R}^n$. The actor network updates its parameters using the deterministic policy gradient theorem, steadily approaching the optimal policy; differentiating its objective $J(\theta^{\mu})$ gives the gradient

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a|\theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s = s_i}.$$

The critic network is similar to the Q network of DQN. With the TD target $y_i = r(s_i, a_i) + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$, the critic is updated by minimising the mean square error between $y_i$ and the online network's estimate:

$$L(\theta^{Q}) = \frac{1}{N} \sum_{i} \big(y_i - Q(s_i, a_i|\theta^{Q})\big)^2.$$

Both the target actor network and the target critic network adopt soft updates to ensure the stability of the algorithm:

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}. \quad (11)$$
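The soft-update rule is simple enough to sketch directly; the parameter lists and $\tau$ value below are illustrative.

```python
def soft_update(target, online, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

# One step with tau = 0.1 moves each target weight 10% of the way
# toward the corresponding online weight.
print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

With a small $\tau$ (e.g. 0.001), the target networks change slowly, which is exactly what keeps the TD targets stable during training.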

Multi-zone residential HVAC control problem formulation
In this section, we first briefly introduce the HVAC optimal control problem, followed by the SVR-DNN thermal comfort prediction model and the MDP modelling of the HVAC control problem. Finally, we present RL methods based on SVR-DNN for thermal comfort control.

Optimization control problem
In this study, we consider a multi-zone residential apartment. When the indoor conditions do not meet the occupants' thermal comfort requirements, the HVAC system is turned on to regulate thermal comfort. We establish a thermal comfort model to predict the thermal comfort value. The goal of HVAC system control is to save energy while meeting thermal comfort requirements.

Thermal comfort prediction
The thermal comfort scale typically ranges over [−3, 3]: −3 for the coldest, 3 for the hottest, and 0 for neither cold nor hot. The ASHRAE database provides a large sample of thermal comfort records.
Many factors affect thermal comfort, such as air speed, thermal radiation, temperature, humidity, clothing and metabolism. Thermal radiation and air speed are influenced by the structure of the building; clothing and metabolism are determined by the individual. In general, it is not possible to measure all of these factors, whereas temperature and humidity can be measured in real-time. We therefore choose temperature and humidity as the main inputs of the thermal comfort model. First, we train an SVR model; SVR is a branch of the Support Vector Machine (SVM). The SVR model predicts the occupants' thermal comfort value at time slot $t$ as

$$M^{SVR}_t = f_{SVR}(T^{in}_t, H^{in}_t),$$

where $T^{in}_t$ and $H^{in}_t$ are the indoor temperature and humidity. The SVR training problem can be written as

$$\min_{w, b}\ \frac{1}{2}\|w\|^2 + C \sum_{i}(\xi_i + \xi_i^*) \quad \text{s.t.}\quad f(x_i) - M_i \le \varepsilon + \xi_i, \quad M_i - f(x_i) \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0,$$
where $f(x_i) = w x_i + b$, $w$ and $b$ are the parameters of the SVR, and $C$ is the regularisation parameter. $\xi_i$ and $\xi_i^*$ are relaxation (slack) factors, $x_i$ is the $i$th sample, $M_i$ is the $i$th label, and $\varepsilon$ is the maximum tolerated deviation between $f(x_i)$ and the label. The optimal parameters $w$ and $b$ are obtained by solving this constrained problem. We then train the SVR-DNN model to predict thermal comfort; its structure is shown in Figure 2. The inputs of the SVR-DNN are the SVR prediction, the indoor temperature and the indoor humidity, and its output is the predicted thermal comfort value. Thus, we predict the occupants' thermal comfort value at time slot $t$ as

$$M^{SVR\text{-}DNN}_t = f_{SVR\text{-}DNN}(M^{SVR}_t, T^{in}_t, H^{in}_t),$$

where $M^{SVR\text{-}DNN}_t$ is the predicted thermal comfort value at time slot $t$ and $f_{SVR\text{-}DNN}$ is the thermal comfort prediction model. We take the SVR prediction as an input because it supplies the deep neural network with an approximate label value in advance, so the network can learn more information; the SVR prediction has guiding significance for the learning of the deep neural network. A plain deep neural network has only two feature inputs, while the SVR-DNN has three, one of which already resembles the true label. The SVR-DNN therefore achieves better performance.
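A minimal sketch of the two-stage SVR-DNN idea, using scikit-learn's `SVR` and `MLPRegressor` as stand-ins for the paper's models and synthetic temperature/humidity data in place of the ASHRAE samples; all names, ranges and the synthetic comfort function are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for labelled comfort data: (temperature °C, relative humidity %)
X = rng.uniform([18.0, 30.0], [32.0, 80.0], size=(200, 2))
y = np.clip(0.4 * (X[:, 0] - 25.0) + 0.02 * (X[:, 1] - 50.0), -3.0, 3.0)

# Stage 1: SVR gives a rough comfort estimate from temperature and humidity.
svr = SVR(C=1.0, epsilon=0.1).fit(X, y)

# Stage 2: the DNN takes [SVR prediction, temperature, humidity] as its three inputs.
X_aug = np.column_stack([svr.predict(X), X])
dnn = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                   random_state=0).fit(X_aug, y)

def predict_comfort(temp, humidity):
    x = np.array([[temp, humidity]])
    return float(dnn.predict(np.column_stack([svr.predict(x), x]))[0])
```

The key design point is visible in `X_aug`: the DNN never sees the raw features alone; its third input is already an approximate label, which is what the paper argues gives the hybrid its edge over a plain DNN.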

Mapping multi-zone residential HVAC control problem into MDP
In this paper, we consider a multi-zone residential apartment during the cooling season. The indoor temperature of the apartment varies with the setpoint of the HVAC. If the indoor temperature is higher than the set temperature, the HVAC system operates to push the indoor temperature towards the setpoint; otherwise, the HVAC does not operate. Time is denoted as $t = 0, 1, 2, \ldots$, with each time slot lasting one hour. We formulate the multi-zone residential HVAC control problem as an MDP, including the state, action and reward function.
(1) State space. The state space is shown in Table 2; it includes the outdoor temperature, outdoor humidity, and the thermal comfort and ideal thermal comfort in the three zones. Note that the state space includes the ideal thermal comfort, which can change with time: in reality, a user's desired thermal comfort differs across time periods. In this study, satisfying thermal comfort is considered first, and energy saving second.
(2) Action space. The action space is shown in Table 3; it comprises the temperature setpoints in the three zones. The HVAC system takes actions according to different demands. In DQN and Q-learning, the action space is discrete, so we discretise the range of setpoints with a step size of 0.5 °C. In this study, the relative humidity setpoint is fixed at 60%.
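As an illustration of this discretisation, the joint discrete action space can be enumerated as follows. The setpoint range 22-26 °C is an assumption chosen so that a 0.5 °C step yields 9 setpoints per zone (and hence 729 combinations, as reported later in the implementation details); the paper does not state the range at this point.

```python
from itertools import product

# 9 candidate setpoints per zone at a 0.5 °C step (range assumed for illustration)
setpoints = [22.0 + 0.5 * i for i in range(9)]     # 22.0, 22.5, ..., 26.0

# Joint discrete action space over the three conditioned zones (Room1, Room3, Room5)
actions = list(product(setpoints, repeat=3))
print(len(setpoints), len(actions))                # 9 setpoints, 9**3 = 729 actions
```

The cubic growth of `actions` with the number of zones is exactly why the discrete methods (Q-learning, DQN) struggle as zones are added, while DDPG sidesteps enumeration entirely.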

(3) Reward function
Because energy saving must be pursued under the condition that thermal comfort requirements are satisfied, we define the reward function as

$$r_t = -\Big(\beta \sum_{k} \big|M^k_t - M^{ideal,k}_t\big| + \sum_{k} Q^k_t\Big), \quad (14)$$

where $Q^k_t$ represents the energy consumption of Room $k$ at time $t$, and $M^k_t$ and $M^{ideal,k}_t$ are the predicted and ideal thermal comfort of Room $k$. Since thermal comfort is the first consideration, the first term is multiplied by the weight $\beta$ to increase its impact on the reward function.
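A sketch of this reward in code, with the comfort-violation term weighted by $\beta$; the exact functional form and the per-zone sample values are our reconstruction from the description above, not the paper's published equation.

```python
def reward(comfort, ideal, energy, beta=10.0):
    """Penalise weighted comfort violation plus total HVAC energy use over the zones."""
    violation = sum(abs(m - m_star) for m, m_star in zip(comfort, ideal))
    return -(beta * violation + sum(energy))

# One time step: per-zone predicted comfort, ideal comfort and energy consumption
r = reward([0.5, -0.2, 0.0], [0.0, 0.0, 0.0], [1.2, 0.8, 1.0])
# -(10 * 0.7 + 3.0) = -10.0
```

With `beta=10.0` (the value used in the experiments), a comfort deviation of a single scale unit costs as much as ten units of energy, which encodes the "comfort first, then energy" priority.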

Q-learning, DQN and DDPG algorithms for HVAC thermal comfort control strategies
We implement three algorithms: Q-learning, DQN and DDPG. See Algorithms 1-3 for the specific algorithm flow.
Algorithm 1 is Q-learning for HVAC control. In Algorithm 1, the Q values are first initialised; the Q-learning agent observes the state $s_t$, selects the action $a_t$ by the $\epsilon$-greedy policy, updates the Q value, stores the updated Q value in the Q table, and loops in this way until the terminal state is reached and the next episode begins.

Algorithm 1: Q-learning for HVAC
Require: learning rate $\alpha \in [0, 1]$, small $\epsilon > 0$
Require: for all $s \in S$, $a \in A(s)$, initialise $Q(s, a) = 0$, with $Q(\text{terminal}, \cdot) = 0$; a well-trained thermal comfort model
for episode = 1, M do
    Obtain the initial state $s_0$
    for t = 1, T do
        Select action $a_t$ by the $\epsilon$-greedy policy at $s_t$
        Obtain $r_t$ according to Eq. (14); observe $s_{t+1}$ from the environment
        Update $Q(s_t, a_t)$
        if $s_t$ is the terminal state then break
        $s_t \leftarrow s_{t+1}$
    end for
end for

Algorithms 2 and 3 are DQN and DDPG for HVAC control respectively. In Algorithm 2, the Q-network is initialised and the associated target Q-network is initialised with the same parameters. For each iteration, the state is first initialised and the Q-network selects the action, i.e. the current setpoint, based on the current state via the $\epsilon$-greedy policy. Next, the reward and the next state are observed. If Room1 and Room3 are occupied, the transition $(s_t, a_t, r_t, s_{t+1})$ is stored in replay buffer D1; if Room5 is occupied, it is stored in replay buffer D2. Once enough samples have been collected in the replay buffer, a mini-batch of transitions is randomly selected to update the Q-network parameters, and the Q-network parameters are copied to the target Q-network every U time steps. In Algorithm 3, the actor network and critic network are initialised, and the target actor and target critic networks are initialised with the same parameters.
Similar to DQN, the state is first initialised; the agent selects $a_t$ through the actor network based on the current state $s_t$ and adds exploration noise to the selected action. The action $a_t$ is executed in the environment, and the reward and next state are observed. The transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the same way as in DQN. Again, once enough samples have been collected in the replay buffer, a mini-batch of transitions is randomly selected to update the network parameters. The critic network updates its parameters via the mean square error between the target Q value and the current Q value, and the actor network updates its parameters according to the deterministic policy gradient theorem. To ensure stability, the target network parameters are then updated softly according to Eq. (11).
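The two-buffer bookkeeping shared by Algorithms 2 and 3 can be sketched as follows; the buffer capacity and the transition format are illustrative choices, not values from the paper.

```python
import random
from collections import deque

# D1 holds transitions from steps when the bedrooms (Room1/Room3) are occupied,
# D2 holds transitions from steps when the sitting room (Room5) is occupied.
D1, D2 = deque(maxlen=10000), deque(maxlen=10000)

def store(transition, bedrooms_occupied):
    (D1 if bedrooms_occupied else D2).append(transition)

def sample(buffer, batch_size):
    """Draw a mini-batch once the buffer holds enough transitions."""
    return random.sample(list(buffer), min(batch_size, len(buffer)))

store(("s0", "a0", -1.0, "s1"), bedrooms_occupied=True)
store(("s1", "a1", -0.5, "s2"), bedrooms_occupied=False)
```

Keeping separate buffers for the two occupancy regimes prevents mini-batches from mixing experience gathered under different comfort schedules, which is the motivation behind the two-buffer design in the algorithms.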

Simulation environment
The plan layout of the five-zone, three-occupant residential HVAC model (Deng et al., 2019) is shown in Figure 3. There are five rooms in the apartment, of which three are functional rooms with HVAC systems: Room1, Room3 and Room5. The layout of the residential apartment is identified from multi-level residential buildings in Chongqing, China. We use real-world weather data from Bureau (2005). The HVAC model considered in this paper is used only for cooling, and we consider the cooling period from May to September. The occupancy schedule is shown in Table 4. As the toilet and kitchen are occupied only under specific circumstances, these two rooms are not considered here. Room1 and Room3 are both bedrooms: Room1 houses two occupants when occupied and Room3 one occupant. Room5 is the sitting room, with three occupants when occupied. We assume that the three occupants have the same attributes, and further assume that the occupants' ideal thermal comfort cycles through the day according to the number of occupants and the occupation time; the specific schedule is shown in Table 5.

Algorithm 2: DQN for HVAC with two replay buffers
Require: initialise action-value function $Q$ with random weights $\omega$
Require: initialise target action-value function $Q'$ with weights $\omega^{-} = \omega$
Require: initialise replay buffers D1 and D2; a well-trained thermal comfort model
for episode = 1, M do
    Obtain the initial state $s_0$
    for t = 1, T do
        Select action $a_t$ by the $\epsilon$-greedy policy at $s_t$
        Obtain $r_t$ according to Eq. (14); observe $s_{t+1}$ from the environment
        if Room1 and Room3 are occupied then store $(s_t, a_t, r_t, s_{t+1})$ in D1
        else store $(s_t, a_t, r_t, s_{t+1})$ in D2
        Sample a mini-batch from the corresponding buffer and perform a gradient descent step on $(Y_j - Q(s, a; \omega))^2$ with respect to the network parameters $\omega$
        Every U steps, reset $Q' = Q$
    end for
end for
The specific simulation process is shown in Figure 4. Indoor temperature and humidity data are collected and fed into the trained SVR-DNN thermal comfort model to predict the current thermal comfort value. The predicted thermal comfort value is used as part of the state, and the current reward is obtained through the reward function we designed. Through continuous interaction with the environment, the RL agents learn the optimal thermal comfort control strategies for multi-zone HVAC control.

Algorithm 3: DDPG for HVAC with two replay buffers
Require: randomly initialise critic network $Q(s, a|\theta^{Q})$ and actor network $\mu(s|\theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$
Require: initialise target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
Require: initialise replay buffers D1 and D2; a well-trained thermal comfort model
for episode = 1, M do
    Obtain the initial state $s_0$
    for t = 1, T do
        Select action $a_t = \mu(s_t|\theta^{\mu}) + \mathcal{N}_t$ according to the current policy and exploration noise
        Obtain $r_t$ according to Eq. (14); observe $s_{t+1}$ from the environment
        if Room1 and Room3 are occupied then store $(s_t, a_t, r_t, s_{t+1})$ in D1
        else store $(s_t, a_t, r_t, s_{t+1})$ in D2
        Sample a mini-batch from the corresponding buffer
        Update the critic by minimising the loss $L = \frac{1}{N}\sum_i (y_i - Q(s_i, a_i|\theta^{Q}))^2$
        Update the actor policy using the sampled policy gradient
        Softly update the target networks according to Eq. (11)
    end for
end for

Experiment
In this section, a multi-zone HVAC model is used to demonstrate the effectiveness of the RL methods by comparison with an RBC baseline, and the effectiveness of DDPG for thermal comfort control is demonstrated by comparison with the discrete control methods based on DQN and Q-learning. We also compare performance under different thermal comfort models on the test set to validate the advantages of our method.

Implementation details
In Q-learning and DQN, the action space is discrete: we discretise the range of setpoints with a step size of 0.5 °C, giving 9 actions for each zone and 729 action combinations for the 3-zone HVAC. The deep neural network designs and hyperparameters of DDPG and DQN are shown in Tables 8 and 7 respectively, and the hyperparameters of Q-learning in Table 6. As a comparison without an RL agent, we design an RBC baseline, presented in Table 10: when a room is unoccupied, the setpoint is 27 °C; in other cases, the setpoint is set according to Table 10. The implementation details of SVR-DNN are shown in Table 9. We choose ReLU as the activation function of the hidden layers to prevent vanishing gradients. Since the range of thermal comfort is [−3, 3], we use the tanh function as the activation of the output layer, so that the predicted value is $M = 3\,y_{tanh}$, where $y_{tanh}$ is the output of the tanh unit.

Performance of SVR-DNN
We select 899 samples under the same conditions from the ASHRAE database (Földváry Ličina et al., 2018), 80% for training and 20% for testing. These samples were collected in summer under indoor air-conditioned conditions. We first train an SVR model. Then we take the predicted value of the SVR, the indoor temperature and the indoor humidity as the input of the deep neural network, which outputs the occupants' thermal comfort value at time slot t. The prediction error is shown in Table 11. These datasets are labelled by the target subjects evaluating their thermal comfort in different thermal states. Because of individual physiological differences, regional differences and other factors, the labelled data may be subjective and noisy. To address this, we add L2 regularisation to SVR-DNN to mitigate overfitting and improve its generalisation ability. The cost function of SVR-DNN with L2 regularisation is as follows:

C = (1/n) Σ_{i=1}^{n} (M_i − M̂_i)² + λ Σ_{j=1}^{m} ω_j²    (16)

where n is the number of training samples, M_i is the ith label value, and M̂_i is the corresponding predicted value of SVR-DNN. In Eq. (16), the second term is the regularisation term and λ is its weight: the greater λ is, the stronger the penalty. m is the number of weights in the neural network, and ω_j is the jth weight. We set λ = 0.01. The prediction errors of DNN and SVR-DNN are shown in Figure 5. The prediction performance improvement of SVR-DNN over other methods is shown in Table 12; SVR-DNN shows the best performance.
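The SVR-DNN cascade and its regularised cost can be sketched as below. This is a minimal illustration under our assumptions: `svr_predict` stands in for the trained SVR, the DNN is reduced to a single ReLU hidden layer, and the output is 3·tanh(·) so that predictions are bounded to the [−3, 3] comfort scale, as described in the implementation details.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def svr_dnn_forward(temp, humidity, svr_predict, W1, b1, w2, b2):
    # Input vector: SVR prediction plus raw indoor temperature and humidity.
    x = np.array([svr_predict(temp, humidity), temp, humidity])
    h = relu(W1 @ x + b1)              # ReLU hidden layer (avoids vanishing gradients)
    return 3.0 * np.tanh(w2 @ h + b2)  # M = 3 * y_tanh, bounded to [-3, 3]

def l2_cost(preds, labels, weights, lam=0.01):
    # Eq. (16): mean squared error plus an L2 penalty over all m weights.
    mse = np.mean((labels - preds) ** 2)
    return float(mse + lam * sum(np.sum(w ** 2) for w in weights))
```

The tanh output scaling guarantees the predicted comfort never leaves the valid scale, while the λ = 0.01 penalty discourages large weights that would fit the subjective label noise.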

Performance of DDPG based on SVR-DNN
(1) Convergence. Figure 6 presents the reward of Q-learning, DQN and DDPG during training. We take May to September as one episode, for a total of 50 training episodes, and set the weight β in the reward function to 10. From Figure 6, we note that the rewards of DQN and DDPG are lower than that of Q-learning in the first few episodes: DQN and DDPG must first store transitions in their replay buffers before learning begins, whereas Q-learning starts learning immediately. After about 15 episodes, Q-learning tends to converge; its discretisation of the state and action spaces greatly reduces the scope of exploration and speeds convergence, but this reduced exploration space leads to insufficient exploration and therefore low reward. The reward of Q-learning is the lowest of the three methods. The reward of DQN converges after about 24 episodes; because of the discretised action space and incomplete exploration of action combinations, its reward is lower than that of DDPG. Because DDPG handles continuous control problems directly, it can explore the action space sufficiently to find the optimal action and obtain a higher reward; its reward is the highest of the three methods.
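The reward trade-off described above can be sketched as follows. Eq. (14) is not reproduced in this excerpt, so this is an illustrative form only, not the paper's exact reward: the agent is penalised for energy use plus β times the thermal comfort violation, with β = 10 as stated in the text.

```python
def reward(energy_kwh, comfort_violation, beta=10.0):
    # Illustrative reward: jointly penalise energy use and comfort violation.
    # beta = 10 weights comfort against energy, as set in the experiments.
    return -(energy_kwh + beta * comfort_violation)

# e.g. a step consuming 1.2 kWh with a comfort deviation of 0.3
print(reward(1.2, 0.3))  # -4.2
```

A larger β pushes the learned policy towards tighter comfort tracking at the cost of energy, which is the trade-off the weight controls.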
The violation of thermal comfort in each episode is presented in Figure 7. Thermal comfort violation and reward show opposite trends: the lower the violation, the higher the reward, indicating a better thermal comfort control strategy learned by the RL agent. The violation of thermal comfort has the same convergence trend as the reward. Q-learning has the highest thermal comfort violation, followed by DQN, with DDPG the lowest. It can be seen from Figures 6 and 7 that DDPG has clear advantages in handling HVAC control problems: both Q-learning and DQN perform worse than DDPG, which better meets the thermal comfort needs of occupants. The indoor temperature in Changsha under Q-learning, DQN and DDPG is presented in Figures 8(a), 9(a) and 10(a) respectively. Notice the periods of relatively low temperature between hours 0-1000 and 3000-4000: the HVAC is off, so the indoor temperature is relatively low and does not reach the setpoint temperature. The indoor temperature controlled by DDPG mostly stays within a stable range; from Figures 9(a) and 10(a), part of the indoor temperature controlled by DQN is biased towards 24.5 °C and 27 °C, and part of that controlled by Q-learning is biased towards 24 °C. We take the indoor temperature and thermal comfort on August 1 for detailed analysis. From Figures 11(a) and 12, the indoor temperature and thermal comfort controlled by DDPG are more regular, and the thermal comfort deviates little from the ideal value we set. In Figure 11(b), the temperature controlled by DQN is lower than that controlled by DDPG for most of the day, which increases energy consumption. From Figure 11(c), the indoor temperature controlled by Q-learning is worse than that of DQN and DDPG.
Figure 12 compares the thermal comfort of the three rooms on August 1. The DDPG method maintains thermal comfort best in all three rooms. DQN learns some of the underlying patterns but not complete knowledge, so it maintains thermal comfort only partially on this day. Q-learning maintains thermal comfort for only a small part of the day because it can explore only a limited space. The maintenance of thermal comfort from May to September under DDPG, DQN and Q-learning is presented in Figures 8(b), 9(b) and 10(b). The well-trained RL agents are then applied to generate HVAC control strategies for the 21 test days from July 20 to August 9 in Chongqing. The RL control strategies, the associated indoor temperature and thermal comfort are shown in Figures 13-15. In Figure 13, the DDPG agent's control strategy captures the key knowledge of the environment well: the indoor temperature and thermal comfort show regularity, and the agent controls the thermal comfort to fluctuate around the ideal comfort level we set while achieving energy savings. As shown in Figures 14 and 15, DQN learns certain patterns, but both DQN and Q-learning can only handle discrete action spaces, and both ultimately learn worse strategies than DDPG. We examine the indoor temperature and thermal comfort under the four methods on August 1 in Chongqing. In Figures 16 and 17, DDPG again shows the best performance and is able to learn and take optimal actions in a complex environment. Because DQN and Q-learning cannot handle a continuous action space or explore all actions, their performance is worse than that of DDPG; in particular, the strategy learned by Q-learning shows poor regularity. RBC is set based on experience and cannot learn from historical data to self-regulate, so it has a high degree of violation.
In particular, when the environment is too complex for Q-learning to handle, the strategy it eventually learns is not necessarily better than RBC. The thermal comfort violation and energy consumption of the four methods on the test set are shown in Table 13: the DDPG control method has the lowest energy consumption and the least thermal comfort violation. Although the energy consumption of DQN and Q-learning is slightly higher than that of RBC, their thermal comfort violation is smaller than that of RBC.

(4) Performance comparison under different thermal comfort models. We use different thermal comfort models to compare their effects on energy consumption and thermal comfort. The test results show that the temperature setpoint learned under the SVR and DNN models is significantly lower, which increases energy consumption: the thermal comfort predicted by the SVR and DNN models is higher than that predicted by SVR-DNN, which leads the RL agents to learn a lower temperature setpoint. The more accurately thermal comfort is predicted, the better the optimal control strategy learned by RL. Table 17 shows the thermal comfort violation in each room for each time period based on the DNN and SVR models, and Table 18 shows the average thermal comfort in each room for each time period. As seen in Tables 17 and 18, the DDPG method again demonstrates the best control performance. The comparison between Tables 14 and 17 shows that RL-based thermal comfort control strategies reduce thermal comfort violation regardless of the thermal comfort model, and DDPG outperforms the discrete methods DQN and Q-learning under every model. Our method better meets the thermal comfort requirements of occupants and achieves energy savings with less thermal comfort violation.

Conclusion
In this paper, we proposed a method combining a thermal comfort prediction model and deep reinforcement learning to optimise residential multi-zone HVAC control. We first trained an SVR-DNN model to predict thermal comfort, then used DDPG based on SVR-DNN to optimise indoor thermal comfort to meet occupants' requirements and achieve energy savings. A multi-zone residential HVAC model was used to evaluate the performance of the proposed method. The results show that SVR-DNN improves thermal comfort prediction performance by 20.5% compared with the deep neural network (DNN); compared with Q-learning and DQN, DDPG reduces energy consumption cost by 4.69% and 3.52% and comfort violation by 68.11% and 64.37%, respectively; and, based on SVR-DNN, DDPG, DQN and Q-learning reduce thermal comfort violation by 69.27%, 13.76% and 3.63% respectively compared with a rule-based control strategy. Optimal control based on the SVR-DNN thermal comfort model shows the best performance. The comparative experiments show that Q-learning has severe limitations in handling complex HVAC environments; DQN performs better than Q-learning; and DDPG performs best. In future work, we will consider multi-agent RL algorithms for the multi-zone HVAC control problem, and rather than limiting energy-saving comfort optimisation to the cooling season, we will extend energy-saving thermal comfort control to year-round operation.