Patent application title:

OPTIMIZED CONTROL METHOD AND SYSTEM OF BUILDING AIR CONDITIONINGS TAKING PERSONALIZED COMFORT OF REGIONAL USERS INTO CONSIDERATION

Publication number:

US20260071771A1

Publication date:
Application number:

19/306,155

Filed date:

2025-08-21

Smart Summary: An improved method for controlling building air conditioning focuses on the comfort of users in different regions. It uses a system called a Markov decision process to define how the air conditioning should respond to various situations. The method learns from user feedback to create a personalized comfort model for each region. By combining this comfort model with energy consumption data, it develops a reward system for the air conditioning units. Finally, trained networks help make smart control decisions to optimize both comfort and energy use.

Abstract:

An optimized control method of building air conditionings taking personalized comfort of regional users into consideration, comprising: describing a heating, ventilation and air conditioning system control problem as a Markov decision process, and defining a state, an action and a reward of each agent; learning by adopting a meta-learning method to obtain a user comfort reward function initial model; performing differentiated training on the user comfort reward function initial model by respectively using comfort feedback data from different regional users to obtain a differentiated comfort reward function model of each agent; and obtaining a reward function of each agent based on an energy consumption reward model and the differentiated comfort reward function model of each agent, training an actor network and a critic network of each agent, and making control decisions for a building heating, ventilation and air conditioning system by using the trained actor network.

Inventors:

Applicant:


Classification:

F24F11/46 »  CPC main

Control or safety arrangements for purposes related to the operation of the system, e.g. for safety or monitoring Improving electric energy efficiency or saving

F24F11/64 »  CPC further

Control or safety arrangements characterised by the type of control or by internal processing, e.g. using fuzzy logic, adaptive control or estimation of values; Electronic processing using pre-stored data

F24F2120/20 »  CPC further

Control inputs relating to users or occupants Feedback from users

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present disclosure claims priority to Chinese Patent Application No. 202411266211.X filed to China National Intellectual Property Administration on Sep. 11, 2024, and entitled “OPTIMIZED CONTROL METHOD AND SYSTEM OF BUILDING AIR CONDITIONINGS TAKING PERSONALIZED COMFORT OF REGIONAL USERS INTO CONSIDERATION”, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure belongs to the technical field of building energy saving and intelligent control, and in particular relates to an optimized control method and system of building air conditionings taking personalized comfort of regional users into consideration.

BACKGROUND

In China, building energy consumption accounts for a relatively large proportion of the total energy consumption in the country, among which heating, ventilation and air conditioning (HVAC) systems are the main energy consuming devices. With the acceleration of urbanization and the improvement of people's living standards, the comfort and energy saving needs of the internal environment of buildings continue to rise. Due to individual differences, different regional users may have different thermal preferences, which will affect the actual comfort of the users. In addition, due to the significant climate differences between winter and summer, seasonal energy demand peaks are evident, further exacerbating energy consumption issues. To address these challenges, improving the energy efficiency of HVAC systems and optimizing their control policies have become urgent issues that need to be addressed.

The existing HVAC system control methods mainly rely on traditional rule-based control methods. There are several issues with these methods in practical applications:

Firstly, traditional control methods generally use standardized comfort evaluation indicators (such as the Predicted Mean Vote (PMV) model and American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc. (ASHRAE) Standard 55), making it difficult to effectively consider the personalized needs of different regional users, resulting in low actual comfort of the users.

Secondly, traditional model-based control methods require high accuracy and complex calculations for the system model, while building HVAC systems are complex dynamic systems that are nonlinear, multi-variable, and strongly coupled. Traditional methods often struggle to achieve ideal control effects in practical applications.

SUMMARY

To overcome the shortcomings of the existing technologies, the present disclosure provides an optimized control method and system of building air conditionings taking personalized comfort of regional users into consideration. An optimized control method based on meta-learning is adopted, and a personalized comfort reward function model for regional users is introduced to ensure the personalized comfort needs of different regional users, so that the HVAC system can achieve efficient energy saving and emission reduction goals while ensuring the personalized comfort of regional users, the operation and maintenance costs are reduced, and the application and promotion in various building environments are facilitated.

In order to realize the above purpose, one or more examples of the present disclosure adopt the following technical solution.

According to a first aspect, the present disclosure provides an optimized control method of building air conditionings taking personalized comfort of regional users into consideration.

The optimized control method of building air conditionings taking personalized comfort of regional users into consideration, including:

    • acquiring comfort feedback data from different regional users of a building and historical operation data of a building air conditioning system, and performing data preprocessing;
    • describing a heating, ventilation and air conditioning system control problem as a Markov decision process (MDP), and defining each agent, and a state, an action and a reward of each agent;
    • performing initial parameter learning of a comfort reward function for a plurality of regions based on the comfort feedback data from a plurality of regional users by using a meta-learning method to obtain a user comfort reward function initial model;
    • performing differentiated training and parameter updating on the user comfort reward function initial model by respectively using the comfort feedback data from the different regional users to obtain a differentiated comfort reward function model of each agent; and
    • obtaining a reward function of each agent based on an energy consumption reward model and the differentiated comfort reward function model of each agent, training an actor network and a critic network of each agent, and outputting a control policy of the building heating, ventilation and air conditioning system by using the trained actor network; wherein:
    • the output control policy includes multiple optimized control parameters with physical values; controlling and adjusting the opening degree of a chilled water valve, a fresh air valve, a return air valve, and an exhaust air valve of the air handling unit in the building heating, ventilation and air conditioning system, and the opening degree of an end air valve of each variable air volume system (VAV) in the building heating, ventilation and air conditioning system to be consistent with the physical values of the optimized control parameters.

According to a second aspect, the present disclosure provides an optimized control system of building air conditionings taking personalized comfort of regional users into consideration.

The optimized control system of building air conditionings taking personalized comfort of regional users into consideration, including:

    • a data acquisition module, configured to acquire comfort feedback data from different regional users of a building and historical operation data of a building air conditioning system, and perform data preprocessing;
    • an MDP construction module, configured to describe a heating, ventilation and air conditioning system control problem as the MDP, and define each agent, and a state, an action and a reward of each agent;
    • a meta-learning training module, configured to perform initial parameter learning of a comfort reward function for a plurality of regions based on the comfort feedback data from a plurality of regional users by using a meta-learning method to obtain a user comfort reward function initial model;
    • a meta-learning testing module, configured to perform differentiated training and parameter updating on the user comfort reward function initial model by respectively using the comfort feedback data from the different regional users to obtain a differentiated comfort reward function model of each agent; and
    • an optimized control policy training module, configured to obtain a reward function of each agent based on an energy consumption reward model and the differentiated comfort reward function model of each agent, train an actor network and a critic network of each agent, and make control decisions for the building heating, ventilation and air conditioning system by using the trained actor network.

According to a third aspect, the present disclosure provides a computer-readable storage medium having a program stored therein, wherein the program, when executed by a processor, implements the steps of the optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to the first aspect.

According to a fourth aspect, the present disclosure provides an electronic device, including a memory, a processor, and a program stored in the memory and runnable on the processor, wherein the program, when executed by the processor, implements the steps of the optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to the first aspect.

The above one or more technical solutions have the following beneficial effects:

The present disclosure provides an optimized control method and system of building air conditionings taking personalized comfort of regional users into consideration. An optimized control method based on meta-learning is adopted, and a personalized comfort reward function model for regional users is introduced to ensure the personalized comfort needs of different regional users, so that the HVAC system can achieve efficient energy management, energy saving and emission reduction goals while ensuring the personalized comfort of regional users, the operating and maintenance costs are reduced, and the application and promotion in various building environments are facilitated.

The present disclosure has the advantage of high efficiency, and can quickly adapt to the comfort needs of new regional users and improve the convergence speed of control policies by adopting the method based on meta-learning.

The present disclosure has the advantage of personalization, which can achieve personalized comfort reward function model learning according to feedbacks from different regional users, and improve the user comfort.

The present disclosure has the advantage of energy saving, and the optimized control policy can minimize energy consumption while ensuring the comfort.

The advantages of the additional aspects of the present disclosure will be set forth in part in the description below, which will become apparent from the description below, or will be understood by the practice of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings of the description, which form a part of the present disclosure, are intended to provide a further understanding of the present disclosure. The exemplary examples of the present disclosure and their descriptions are intended to describe the present disclosure, instead of constituting any improper limitation on the present disclosure.

FIG. 1 is a flowchart of a method according to Example 1.

DETAILED DESCRIPTION

It should be noted that, the following detailed descriptions are all exemplary, and are intended to provide further descriptions of the present disclosure. Unless otherwise specified, all technical and scientific terms used herein have the same meanings as those usually understood by a person of ordinary skill in the art to which the present disclosure belongs.

It should be noted that the terms used herein are merely used for describing specific implementations, and are not intended to limit exemplary implementations of the present disclosure.

Without causing any conflict, the examples of the present disclosure and the features in the examples may be combined with each other.

Example 1

In order to solve the problems mentioned in the background, the present example discloses an optimized control method of building air conditionings taking personalized comfort of regional users into consideration, which adopts an optimized control method based on meta-learning to ensure the personalized comfort needs of different regional users. The specific solution includes:

    • performing training by using a meta-learning algorithm to obtain a user comfort reward function initial model, wherein the model is used as an initial model for the learning of a comfort reward function of different regional users;
    • aiming at differentiated comfort requirements of the different regional users, respectively collecting comfort feedback data from the different regional users, and performing differentiated training and parameter updating on the user comfort reward function initial model to obtain a comfort reward function model separately corresponding to each agent; and
    • based on the reward function considering energy consumption and regional differences in comfort, performing HVAC optimized control neural network training, and making control decisions by using the trained actor network.

In the present disclosure, a personalized comfort reward function model for regional users is introduced, so that the HVAC system can achieve efficient energy saving and emission reduction goals while ensuring the personalized comfort of regional users, the operation and maintenance costs are reduced, and the application and promotion in various building environments are facilitated.

As shown in FIG. 1, the optimized control method of building air conditionings taking personalized comfort of regional users into consideration may include:

    • acquiring comfort feedback data from different regional users of a building and historical operation data of a building air conditioning system, and performing data preprocessing;
    • describing a heating, ventilation and air conditioning system control problem as an MDP, and defining each agent, and a state, an action and a reward of each agent;
    • performing initial parameter learning of a comfort reward function for a plurality of regions based on the comfort feedback data from a plurality of regional users by using a meta-learning method to obtain a user comfort reward function initial model;
    • performing differentiated training and parameter updating on the user comfort reward function initial model by respectively using the comfort feedback data from the different regional users to obtain a differentiated comfort reward function model of each agent; and
    • obtaining a reward function of each agent based on an energy consumption reward model and the differentiated comfort reward function model of each agent, training an actor network and a critic network of each agent, and making control decisions for the building heating, ventilation and air conditioning system by using the trained actor network.

Further, the detailed technical solution of the present example is specifically as follows:

1. Data Collection and Preprocessing

Comfort feedback data from different regional users and historical operation data of a building air conditioning system, including indoor and outdoor temperature and humidity, CO2 concentration, energy consumption and the like, are collected. The collected data are preprocessed through methods such as data cleaning and normalization to ensure the quality of the data used for training.

The processed data will be used as a dataset in a meta-learning process to learn a comfort reward function model that meets personalized comfort needs of different regional users. A meta-learning model learns the comfort needs of the different regional users through these data, and constructs a personalized comfort model.

Further, it includes the following steps:

1.1 Collection and Preprocessing of Comfort Feedback Data

There are many factors affecting the comfort of users, such as air temperature, relative humidity, CO2 concentration, metabolic rate, and clothing insulation. However, in practice, due to user privacy issues, it is difficult to monitor factors such as metabolic rate and clothing, and the changes in these factors are not significant. Therefore, these factors are fixed in numerical values without further feedback. In order to adapt a controller to different regional user comfort preferences, different comforts of users in different regions will be included as part of the reward item when designing an optimized control method.

In this application, $\hat{r}^{com}_{i,t}$ represents the comfort reward of region $i$ ($1 \le i \le N$) of the system at time $t$, which is related to the user feedback regarding the indoor environment feeling. To collect feedback from residents, we may provide them with feedback apparatuses that have five options: “very uncomfortable”, “uncomfortable”, “moderate”, “comfortable”, and “very comfortable”.

After feedback is received from the different regional users, the five options are quantified as −2, −1, 0, +1, and +2, respectively. These values may be considered as the real rewards $\hat{r}^{com}_{i,t}$ of the regions $i$ ($1 \le i \le N$) under different indoor environments at time $t$, namely:

$$\hat{r}^{com}_{i,t} = f(T_{i,t}, C_{i,t}, H_{i,t});$$

    • wherein, $\hat{r}^{com}_{i,t}$ represents the comfort reward of region $i$ ($1 \le i \le N$) of the system at time $t$; $f(\cdot)$ is an unknown function; $T_{i,t}$ represents the indoor temperature of region $i$; $C_{i,t}$ represents the indoor CO2 concentration; and $H_{i,t}$ represents the indoor humidity. In the present disclosure, a meta-learning method is used to approximate $f(\cdot)$ with a θ-parameterized neural network.
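The quantification of the five feedback options can be sketched as follows. This is only an illustrative sketch: the names `FEEDBACK_SCORES`, `quantify_feedback`, and `build_sample` are hypothetical and do not appear in the disclosure, and the case-insensitive matching of option labels is an assumption.

```python
# Hypothetical mapping from the five feedback options to the values -2..+2.
FEEDBACK_SCORES = {
    "very uncomfortable": -2,
    "uncomfortable": -1,
    "moderate": 0,
    "comfortable": 1,
    "very comfortable": 2,
}


def quantify_feedback(label: str) -> int:
    """Return the numeric comfort reward for one feedback option."""
    # Normalizing case and whitespace is an assumption, not part of the source.
    return FEEDBACK_SCORES[label.strip().lower()]


def build_sample(temperature: float, co2: float, humidity: float, label: str):
    """Pair an indoor state (T, C, H) with its quantified comfort feedback."""
    return (temperature, co2, humidity), quantify_feedback(label)
```

Each pair produced this way could serve as one training example for the θ-parameterized approximation of the unknown function ƒ(⋅).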

1.2. Collection and Preprocessing of Historical Operation Data of Building

Building environment data and HVAC system operation data are collected. Environment data include indoor temperature and humidity, CO2 concentration, outdoor temperature and humidity, wind speed, solar radiation, etc. HVAC system data include an operating state of the HVAC system, operation parameters (such as fan speed, chilled water temperature, and heating water temperature), energy consumption, etc.

The collected data are preprocessed to ensure the data quality and training effect. Firstly, the data are cleaned to remove missing values and outliers, ensuring the integrity and accuracy of the data. Further, the data are standardized so that all variables are on the same scale, with a mean of 0 and a standard deviation of 1.
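A minimal sketch of this cleaning and standardization step is given below, assuming the data are arranged as a NumPy array with one row per time step and one column per variable. The function name `clean_and_standardize` and the outlier rule (dropping rows more than `z_max` standard deviations from the column mean) are assumptions; the disclosure does not specify how outliers are detected.

```python
import numpy as np


def clean_and_standardize(data, z_max=4.0):
    """Drop incomplete rows and outliers, then standardize each column."""
    data = np.asarray(data, dtype=float)
    # Remove rows that contain missing values.
    data = data[~np.isnan(data).any(axis=1)]

    # Assumed outlier rule: drop rows with any |z-score| above z_max.
    mean = data.mean(axis=0)
    std = np.where(data.std(axis=0) == 0, 1.0, data.std(axis=0))
    keep = (np.abs((data - mean) / std) <= z_max).all(axis=1)
    data = data[keep]

    # Standardize the remaining rows to mean 0 and standard deviation 1.
    mean = data.mean(axis=0)
    std = np.where(data.std(axis=0) == 0, 1.0, data.std(axis=0))
    return (data - mean) / std, mean, std
```

Returning the mean and standard deviation allows the same scaling to be applied later to newly measured data.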

2. Construction of MDP

The present example considers a building air conditioning system with N regions. The controllable part of the HVAC system consists of an air handling unit (AHU) for the entire building and a variable air volume (VAV) box for each region. Therefore, N+1 agents are considered in the MDP. We assume that the HVAC system only needs to consider its own policy to solve the MDP and make sequential decisions, which only requires a sample sequence of states, actions, and rewards.

An HVAC system control problem is described as the MDP. A state including indoor temperature, indoor humidity, indoor CO2 concentration and the like, an action consisting of regional mass flow rate, fresh air volume, and chilled water flow rate, and a reward consisting of energy consumption and comfort are defined.

2.1 State

This HVAC system totally includes N regions. A state si,t of each region i(1≤i≤N) at time t is defined as:

$$s_{i,t} = (T^{out}_{t}, T_{i,t}, T_{-i,t}, \lambda_t, k_{i,t}, C_{i,t}, H_{i,t});$$

    • wherein, $T^{out}_{t}$ represents outdoor temperature, $T_{i,t}$ represents indoor temperature of the region $i$, $T_{-i,t}$ represents indoor temperature of an adjacent region of the region $i$, $\lambda_t$ represents electricity price, $k_{i,t}$ represents the number of residents in the region $i$, $C_{i,t}$ represents indoor CO2 concentration of the region $i$, and $H_{i,t}$ represents indoor humidity of the region $i$; $s_{i,t}$ is the state of the $i$th regional agent at time $t$ ($1 \le i \le N$).

In addition to the N regions, AHU is defined as an (N+1)th agent, and its state sN+1,t at time t is expressed as:

$$s_{N+1,t} = (\lambda_t, k_{1,t}, \ldots, k_{N,t}, C_{1,t}, \ldots, C_{N,t}, H_{1,t}, \ldots, H_{N,t});$$

    • wherein, $s_{N+1,t}$ is the state of the (N+1)th agent AHU at time $t$.
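The two state definitions can be illustrated with a short sketch that assembles the state tuples from per-region measurements. The `ZoneReading` container and the function names are hypothetical, and representing the adjacent-region temperature as a single value is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ZoneReading:
    """Hypothetical container for the measurements of one region at time t."""
    temperature: float           # indoor temperature of the region
    adjacent_temperature: float  # indoor temperature of an adjacent region
    occupants: int               # number of residents in the region
    co2: float                   # indoor CO2 concentration
    humidity: float              # indoor humidity


def zone_state(t_out: float, price: float, z: ZoneReading) -> Tuple:
    """State of a regional agent: (T_out, T_i, T_-i, price, k_i, C_i, H_i)."""
    return (t_out, z.temperature, z.adjacent_temperature, price,
            z.occupants, z.co2, z.humidity)


def ahu_state(price: float, zones: Sequence[ZoneReading]) -> Tuple:
    """State of the AHU agent: (price, k_1..k_N, C_1..C_N, H_1..H_N)."""
    occupants = tuple(z.occupants for z in zones)
    co2 = tuple(z.co2 for z in zones)
    humidity = tuple(z.humidity for z in zones)
    return (price, *occupants, *co2, *humidity)
```

For N regions, the AHU state built this way has 3N+1 entries, while each regional state has 7 entries.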

2.2 Action

An action of each regional agent $i$ ($1 \le i \le N$) at time $t$ is defined as:

$$a_{i,t} = (M_{i,t});$$

    • wherein, $a_{i,t}$ is the action of the agent $i$ ($1 \le i \le N$) at time $t$, and $M_{i,t}$ represents regional mass flow rate of the agent $i$ ($1 \le i \le N$).

An action of the (N+1)th agent at time t is defined as:

$$a_{N+1,t} = (d_t, W_t);$$

    • wherein, $a_{N+1,t}$ is the action of the (N+1)th agent at time $t$, $d_t$ represents fresh air volume, and $W_t$ represents chilled water flow rate.

2.3 Reward

The control goal of the present disclosure is to minimize the total energy consumption and maximize the comfort. For N regional agents, the reward may be divided into two parts:

$$r_{i,t} = \lambda_1 r^{com}_{i,t}(s_{i,t}, a_{i,t}) + \lambda_2 r^{en}_{i,t}(s_{i,t}, a_{i,t});$$

    • wherein, $r_{i,t}$ represents a reward of the regional agent $i$ ($1 \le i \le N$) at time $t$; $r^{com}_{i,t}(s_{i,t}, a_{i,t})$ represents a comfort penalty of the regional agent $i$ ($1 \le i \le N$); $r^{en}_{i,t}(s_{i,t}, a_{i,t})$ represents an energy consumption penalty of the regional agent $i$ ($1 \le i \le N$); $\lambda_1$ and $\lambda_2$ are positive weight coefficients; $s_{i,t}$ is a state of the regional agent $i$ ($1 \le i \le N$) at time $t$, and $a_{i,t}$ is an action of the regional agent $i$ ($1 \le i \le N$) at time $t$.

The comfort reward function is obtained through training by using a meta-learning method. Each agent corresponds to a distinct comfort reward model. The specific method will be elaborated below.

A key goal of the AHU agent is to minimize the energy consumption of the entire system, while the AHU needs to consider the air quality standard of the entire building to ensure environmental comfort throughout the entire building. Therefore, for the (N+1)th agent AHU, a reward function is expressed as:

$$r_{N+1,t} = \lambda_3 \sum_{i=1}^{N} r^{com}_{i,t}(s_{i,t}, a_{i,t}) + \lambda_4 r^{en}_{N+1,t}(s_{N+1,t}, a_{N+1,t});$$

    • wherein, $r_{N+1,t}$ represents a reward of the (N+1)th agent AHU at time $t$; $\lambda_3$ and $\lambda_4$ are positive weight coefficients; $r^{en}_{N+1,t}(s_{N+1,t}, a_{N+1,t})$ represents an energy consumption penalty of the (N+1)th agent AHU; $s_{N+1,t}$ is a state of the (N+1)th agent AHU at time $t$; $a_{N+1,t}$ is an action of the (N+1)th agent AHU at time $t$.
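As a small worked sketch, the two reward formulas reduce to weighted sums. The function names are hypothetical, and passing the energy consumption penalty as a negative number is an assumption about the sign convention, which the text does not state explicitly.

```python
from typing import Sequence


def zone_reward(r_com: float, r_en: float, lam1: float, lam2: float) -> float:
    """Reward of a regional agent: weighted comfort term plus energy term."""
    return lam1 * r_com + lam2 * r_en


def ahu_reward(zone_r_com: Sequence[float], r_en_ahu: float,
               lam3: float, lam4: float) -> float:
    """Reward of the AHU agent: summed regional comfort plus AHU energy term."""
    return lam3 * sum(zone_r_com) + lam4 * r_en_ahu
```

For example, with comfort terms 1.0, −1.0 and 2.0 in three regions, an AHU energy penalty of −3.0, λ3 = 0.5 and λ4 = 1.0, the AHU reward is 0.5 × 2.0 − 3.0 = −2.0.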

3. Meta-Learning Training Task: Learning of Comfort Reward Function Initial Model

The meta-learning training task is to train on multiple tasks to find an initial parameter that can quickly adapt to a new task. Then, at the meta-learning testing task stage, this initial parameter is used so that the model can quickly adapt to a new task through minimal data training and few gradient updates.

The specific steps of the meta-learning training task include task definition, parameter initialization for N different regions, task sampling, intra-task training, and meta parameter updating. Finally, the loss functions of N regional users are averaged, and the initial parameter is updated.

For N different regions, each task is defined as learning the comfort reward function of each regional user. Firstly, the sampled data are divided into a Support set and a Query set. Each task is separately trained by using the data in the Support set, and parameters of each task are updated through gradient descent. Testing is performed in the Query set by using the trained parameters, and then the loss function under each task is calculated. The loss functions of these tasks are averaged, and the initial parameter of the meta-learning training task is updated through gradient descent.

The specific process of the meta-learning algorithm is as follows:

(1) Task Definition

The comfort reward function of each regional user is defined as a subtask. By sampling data from different regional users, the meta-learning model learns how to quickly adapt to the needs of different regional users.

(2) Model Initialization

A weight parameter θ of the meta-learning network is initialized.

Learning rates α and β, and a data batch size B are set.

(3) Task Sampling

B regional users are randomly sampled from a dataset F to form B training subtasks.

For the regional user that each subtask targets, $K$ sets of indoor environment state and user comfort feedback data pairs $(S_{in,k}, f(S_{in,k}))$ ($1 \le k \le K$) are sampled as the Support set, and $K'$ sets of indoor environment state and user comfort feedback data pairs $(S_{in,k'}, f(S_{in,k'}))$ ($1 \le k' \le K'$) are sampled as the Query set, wherein $S_{in,k}$ and $S_{in,k'}$ are both indoor environment states consisting of indoor temperature, humidity, and CO2 concentration, and $k$ and $k'$ are both counting parameters; $f(S_{in,k})$ and $f(S_{in,k'})$ represent comfort feedback data of indoor users regarding the indoor environment state.

(4) Intra-Task Training

Firstly, the B subtasks are trained in the Support set to separately train model parameters for each subtask.

The loss function is calculated:

$$L_{\theta,b} = \frac{1}{K} \sum_{k=1}^{K} \left( f(S_{in,k}) - \hat{f}_{\theta}(S_{in,k}) \right)^2;$$

    • wherein, $f(S_{in,k})$ represents actual comfort feedback data from indoor users regarding indoor environment states obtained through sampling; $\hat{f}_{\theta}(S_{in,k})$ represents a predicted output of a model represented by a weight parameter $\theta$ for the same input data $S_{in,k}$; $K$ is the total number of sets of data in the Support set, and $k$ ($1 \le k \le K$) indexes the $k$th set of data; $S_{in,k}$ represents the $k$th set of indoor environment data in the Support set; $b$ represents the $b$th training subtask ($1 \le b \le B$).

Parameters are updated through gradient descent:

$$\theta' = \theta - \alpha \nabla_{\theta} L_{\theta,b};$$

    • wherein, $\theta'$ is an updated weight parameter of a meta-learning network.

Testing is performed in the Query set by using the parameter obtained through training in the previous step.

The loss function under each subtask is calculated:

$$L_{\theta',b} = \frac{1}{K'} \sum_{k'=1}^{K'} \left( f(S_{in,k'}) - \hat{f}_{\theta''}(S_{in,k'}) \right)^2;$$

    • wherein, $S_{in,k'}$ is the $k'$th ($1 \le k' \le K'$) set of indoor environment data in the Query set; $\theta''$ represents a model weight parameter obtained during testing by using the Query set after training with the Support set; $K'$ is the total number of sets of data in the Query set; $k'$ ($1 \le k' \le K'$) indexes the $k'$th set of data.

(5) Meta Updating

The loss functions of all users are averaged, and the initial parameter is updated through gradient descent:

$$\theta = \theta - \beta \cdot \frac{1}{B} \sum_{b=1}^{B} \nabla_{\theta} L_{\theta',b};$$

    • wherein, b represents a bth training subtask (1≤b≤B).
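Steps (3) to (5) can be sketched as one meta-update on a batch of subtasks. This is a simplified illustration, not the disclosed implementation: it assumes a linear comfort model ƒ̂θ(S) = S·θ in place of the θ-parameterized neural network, so that the gradient of the Query loss with respect to the initial parameter can be written in closed form. The function names and the task tuple layout `(X_support, y_support, X_query, y_query)` are hypothetical.

```python
import numpy as np


def mse_loss_and_grad(theta, X, y):
    """Mean squared error of the linear model X @ theta and its gradient."""
    err = X @ theta - y
    return float(np.mean(err ** 2)), 2.0 * X.T @ err / len(y)


def maml_step(theta, tasks, alpha, beta):
    """One meta-update over B subtasks; returns new theta and mean Query loss."""
    meta_grad = np.zeros_like(theta)
    query_losses = []
    for X_s, y_s, X_q, y_q in tasks:
        # Intra-task training: one gradient step on the Support set.
        _, grad_s = mse_loss_and_grad(theta, X_s, y_s)
        theta_prime = theta - alpha * grad_s
        # Testing: loss of the adapted parameter on the Query set.
        loss_q, grad_q = mse_loss_and_grad(theta_prime, X_q, y_q)
        # Chain rule through the inner step: d(theta')/d(theta) = I - alpha * H,
        # where H is the Hessian of the Support loss (exact for a linear model).
        hessian_s = 2.0 * X_s.T @ X_s / len(y_s)
        meta_grad += (np.eye(len(theta)) - alpha * hessian_s) @ grad_q
        query_losses.append(loss_q)
    # Meta updating: average over the subtasks, then gradient descent on theta.
    theta_new = theta - beta * meta_grad / len(tasks)
    return theta_new, float(np.mean(query_losses))
```

With a neural network, the same chain-rule term would normally be obtained by automatic differentiation rather than from an explicit Hessian.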

Meta-learning provides real-time personalized comfort feedback by training a comfort model that can quickly adapt to personalized needs of different regional users. In the meta-learning process, the model learns the comfort preference of each regional user by sampling data from multiple regional user tasks. This process enables the model to quickly adjust parameters through a small amount of regional user feedback data, thus adapting quickly to new regional user tasks. This model serves as a part of the reward function in the training process of the agent, and helps to optimize the control policy of the HVAC system to achieve personalized comfort adjustment and efficient energy management. This method effectively solves the problem that traditional comfort evaluation models cannot meet individual differentiated needs, thus improving the overall user satisfaction and system response speed.

By adopting meta-learning and training on multiple tasks, an initial parameter that can quickly adapt to new tasks is learned, so that it can quickly adapt to the needs of new tasks or new users with a small amount of training data and few gradient updates. When facing new users or new environments, there is no need to retrain the entire model, and only a small amount of data are needed to adjust the model to adapt to new users and improve the system response speed.

4. Meta-Learning Testing Task: Learning of Differentiated Comfort Reward Function Models of Different Regional Users

In order to obtain the comfort reward function model of each regional user, model training is performed by using the comfort feedback data from different users in each region based on the comfort reward function initial model. For each task, the loss function is calculated, the parameters are updated through a gradient descent method, and finally a comfort reward function model $r^{com}_{i,t}(s_{i,t}, a_{i,t})$ of this region is obtained.

Specifically,

    • for different regional agent systems, their differentiated comfort feedback data are used to respectively train the comfort reward function models.

For each comfort reward function task,

    • (1) The loss function is calculated. A loss function under current parameters is calculated by using the feedback data from the region.
    • (2) Calculation parameters are updated. The current parameters are updated through a gradient descent method.

The comfort reward function model of each agent can quickly adapt to the unique comfort requirements of the agent through a small amount of training data and few gradient updates, thus obtaining the comfort reward function model $r^{com}_{i,t}(s_{i,t}, a_{i,t})$ corresponding to the agent.
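The per-region adaptation can be sketched as a few gradient steps that start from the meta-learned initial parameter. As a simplification, this sketch assumes a linear comfort model in place of the neural network; `adapt_to_region` and `comfort_reward` are hypothetical names, and the number of steps and the learning rate are free choices rather than values from the disclosure.

```python
import numpy as np


def adapt_to_region(theta_init, X, y, alpha, steps):
    """Fine-tune the initial parameter on one region's feedback data."""
    theta = np.array(theta_init, dtype=float)
    for _ in range(steps):
        # (1) Gradient of the mean squared error under the current parameters.
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)
        # (2) Gradient descent update of the current parameters.
        theta = theta - alpha * grad
    return theta


def comfort_reward(theta, indoor_state):
    """Predicted comfort reward of a region for one indoor state."""
    return float(np.asarray(indoor_state, dtype=float) @ theta)
```

Calling `adapt_to_region` once per region with that region's own data yields one parameter vector, and hence one comfort reward model, per regional agent.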

5. Training of Building HVAC System Optimized Control Policy

The energy consumption reward $r^{en}_{i,t}(s_{i,t}, a_{i,t})$ and the differentiated comfort reward $r^{com}_{i,t}(s_{i,t}, a_{i,t})$ of different regional users obtained from the meta-learning testing tasks are jointly used as a reward function for each agent $i$ ($1 \le i \le N+1$). An actor network and a critic network are updated and trained, and finally an optimal control policy is obtained, thus reducing energy consumption while meeting the personalized comfort needs of regional users.

The present disclosure adopts a centralized training with decentralized decision-making training method. Each agent obtains an execution action under a current state based on its own policy, and after interacting with the environment, stores the obtained experience in its own experience replay buffer. After all agents interact with the environment, each agent randomly selects experience from the experience replay buffer to train its own neural network.

An HVAC system control neural network is trained based on the reward function consisting of the energy consumption reward and the differentiated comfort reward. Each agent corresponds to a set of actor network and critic network. The trained actor network is used for making differentiated optimized control decisions for each agent. The specific process of training is as follows:

(1) Initialization

An actor network $\mu_i$, a target actor network $\mu_i'$, a critic network $Q_i$, and a target critic network $Q_i'$ of each agent are initialized.

An experience replay buffer D of each agent is initialized.

Parameters of all networks are randomly initialized.

(2) Interaction with Environment

At each time $t$, each agent $i$ receives a state $s_{i,t}$ from the environment.

Each agent $i$ ($1 \le i \le N+1$) uses its actor network $\mu_i$ to select an action $a_{i,t} = \mu_i(s_{i,t}) + \mathcal{N}_t$, wherein $\mathcal{N}_t$ is noise used for exploration.
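A minimal sketch of this action-selection step is given below. The disclosure does not specify the noise distribution or any clipping, so the Gaussian noise, the action bounds, and the toy actor are assumptions made purely for illustration.

```python
import numpy as np

def select_action(actor, state, noise_std, low, high, rng):
    """Return the actor output plus exploration noise, clipped to valid bounds."""
    action = np.asarray(actor(state), dtype=float)
    noise = rng.normal(0.0, noise_std, size=action.shape)  # assumed Gaussian noise
    return np.clip(action + noise, low, high)

def toy_actor(state):
    """Hypothetical stand-in for a trained actor network (normalized flow rate)."""
    return np.array([0.5 * state[1]])

rng = np.random.default_rng(0)
state = np.array([30.0, 1.0])  # hypothetical state vector
action = select_action(toy_actor, state, noise_std=0.1, low=0.0, high=1.0, rng=rng)
```

During deployment the noise term would normally be dropped so that the trained actor acts deterministically.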

(3) Execution of Action

The agents execute a joint action $a_t = (a_{1,t}, a_{2,t}, a_{3,t}, \dots, a_{N+1,t})$ in the environment. The environment returns a new state $s_{i,t+1}$ and a reward $r_{i,t}$ for each agent, wherein $a_t$ represents the set of actions of all agents at time $t$.

(4) Storage of Experience

Experience $(s_{i,t}, a_t, r_t, s_{i,t+1})$ is stored into the experience replay buffer D, wherein $r_t$ represents the set of rewards of all agents at time $t$.

(5) Experience Replay

A batch of experience samples $(s_i^j, a^j, r^j, s_i'^j)$ is randomly extracted from the experience replay buffer D, wherein the superscript $j$ denotes the $j$th extracted experience sample, and $s_i'^j$ represents the next state to which the current state of the $j$th experience sample transitions after the action $a^j$.
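The storage and replay steps can be illustrated with a small buffer class. This is a sketch under assumptions: the fixed capacity, the uniform sampling, and the class and method names are not taken from the disclosure.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, joint_action, rewards, next_state) tuples."""

    def __init__(self, capacity):
        # The oldest experience is discarded once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def store(self, state, joint_action, rewards, next_state):
        self.storage.append((state, joint_action, rewards, next_state))

    def sample(self, batch_size):
        """Draw a batch uniformly at random, without replacement."""
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(state=t, joint_action=(0.1, 0.2), rewards=(-1.0, -0.5), next_state=t + 1)
batch = buffer.sample(2)
```

Random sampling breaks the temporal correlation between consecutive control steps, which is the usual motivation for experience replay.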

(6) Updating of Critic Network

For each agent $i$ ($1 \le i \le N+1$), a target value $y_i^j$ is calculated:

$$y_i^j = r_i^j + \gamma\, Q_i'\left(s_i'^j, a_1', a_2', \dots, a_{N+1}'\right);$$

    • wherein, the superscript $j$ denotes the $j$th extracted experience sample; $s_i'^j$ represents the next state of the $j$th sample of the agent $i$; $\mu_i'$ represents the target actor network of the agent $i$; $Q_i'$ represents the target critic network of the agent $i$; $a_i' = \mu_i'(s_i'^j)$ represents the action generated by the target actor network $\mu_i'$ of each agent $i$ based on the next state $s_i'^j$; and $\gamma$ is a discount factor.
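The target-value calculation can be sketched for a single sample as follows. The networks are replaced by plain functions, and it is assumed that the next states of all agents are available during centralized training; both choices, and the name `td_target`, are illustrative.

```python
def td_target(i, rewards, next_states, target_actors, target_critic, gamma=0.95):
    """Compute y_i = r_i + gamma * Q_i'(s_i', a_1', ..., a_{N+1}') for one sample."""
    # Every target actor proposes its next action from its own next state.
    next_actions = [mu(s) for mu, s in zip(target_actors, next_states)]
    return rewards[i] + gamma * target_critic(next_states[i], next_actions)

# Hypothetical two-agent toy networks written as plain functions.
target_actors = [lambda s: 0.5 * s, lambda s: -s]
target_critic = lambda s, actions: s + sum(actions)

y = td_target(0, rewards=[-1.0, -0.5], next_states=[2.0, 1.0],
              target_actors=target_actors, target_critic=target_critic, gamma=0.9)
```

Here the target actions are 1.0 and -1.0, the target critic returns 2.0, and the target value is -1.0 + 0.9 × 2.0 = 0.8.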

The loss function is minimized, and the parameters of the critic network are updated:

$$L(\varepsilon_i) = \mathbb{E}\left[\left(y_i^j - Q_i\left(s_i^j, a_1^j, a_2^j, \dots, a_{N+1}^j\right)\right)^2\right];$$

    • wherein, $L(\varepsilon_i)$ represents the loss function of the agent $i$; $\varepsilon_i$ represents an updated weight parameter of the critic network of the agent $i$; $y_i^j$ represents the target value; $a_i^j$ represents the action of the agent $i$ in the $j$th experience sample; $Q_i$ represents the critic network of the agent $i$; and $\mathbb{E}$ represents the expected value operator, which is used for calculating the average loss over multiple samples.
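The critic update can be illustrated with a linear critic, for which the gradient of the mean squared error has a closed form. The linear form, the feature layout (state concatenated with all agents' actions), and the learning rate are assumptions for illustration only.

```python
import numpy as np

def critic_loss(w, features, targets):
    """Mean squared error between target values and Q = features @ w."""
    return float(np.mean((targets - features @ w) ** 2))

def critic_step(w, features, targets, lr=0.05):
    """One gradient-descent step on the mean squared error."""
    td_error = targets - features @ w
    grad = -2.0 * features.T @ td_error / len(targets)
    return w - lr * grad

# Hypothetical batch: each row holds a state and joint action, 4 features in total.
rng = np.random.default_rng(1)
features = rng.normal(size=(16, 4))
targets = features @ np.array([0.5, -1.0, 0.2, 0.0])

w = np.zeros(4)
loss_before = critic_loss(w, features, targets)
w = critic_step(w, features, targets)
loss_after = critic_loss(w, features, targets)
```

With a neural-network critic the same loss would be minimized by an automatic-differentiation framework rather than by a hand-written gradient.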

(7) Updating of Actor Network

For each agent $i$, parameters of the actor network are updated through policy gradient:

$$\nabla_{\varepsilon_{\mu_i}} J = \mathbb{E}\left[\nabla_{a_i} Q_i\left(s_i^j, a_1^j, a_2^j, \dots, a_{N+1}^j\right) \cdot \nabla_{\varepsilon_{\mu_i}} \mu_i\left(s_i^j\right)\right];$$

    • wherein, $\varepsilon_{\mu_i}$ represents an updated weight parameter of the actor network; $J$ represents the objective function of the actor network.
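The chain rule in the policy gradient can be made concrete with a one-parameter example. Everything here is a toy assumption: the actor is linear, $\mu_i(s) = \theta s$, the critic is the hypothetical $Q(s, a) = -(a - s)^2$, and the other agents' actions are omitted for brevity.

```python
import numpy as np

def toy_dq_da(states, actions):
    """Gradient of the toy critic Q(s, a) = -(a - s)^2 with respect to the action."""
    return -2.0 * (actions - states)

def actor_gradient(theta, states, dq_da):
    """Batch average of dQ/da * dmu/dtheta for the linear actor mu(s) = theta * s."""
    actions = theta * states
    return float(np.mean(dq_da(states, actions) * states))

states = np.array([0.5, 1.0, 1.5, 2.0])
theta = 0.0
for _ in range(50):
    # Gradient ascent on J: move theta in the direction that increases Q.
    theta += 0.1 * actor_gradient(theta, states, toy_dq_da)
```

For this toy critic the best action equals the state, so the actor parameter converges toward 1.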

(8) Updating of Target Network

For each agent $i$, parameters of the target network are softly updated:

$$\varepsilon_{\mu_i'} \leftarrow \tau \varepsilon_{\mu_i} + (1 - \tau)\, \varepsilon_{\mu_i'}; \quad \varepsilon_{Q_i'} \leftarrow \tau \varepsilon_{Q_i} + (1 - \tau)\, \varepsilon_{Q_i'};$$

    • wherein, $\tau$ is a soft update coefficient; $\varepsilon$ represents a network weight parameter; $\varepsilon_{\mu_i'}$ represents an updated weight parameter of the target actor network; $\varepsilon_{Q_i}$ represents an updated weight parameter of the critic network; $\varepsilon_{Q_i'}$ represents an updated weight parameter of the target critic network.
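The soft update is a per-parameter interpolation and can be sketched in a few lines. Representing the network weights as a list of NumPy arrays, and the value of $\tau$, are assumptions for illustration.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.01):
    """Blend online weights into target weights: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

target = [np.zeros(2)]
online = [np.ones(2)]
target = soft_update(target, online, tau=0.1)
```

A small $\tau$ makes the target networks change slowly, which keeps the target values used by the critic stable during training.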

Through the above method, the present example can achieve efficient, personalized, and energy-saving control of the HVAC system while ensuring the personalized comfort of the regional users, and is applicable to optimizing HVAC systems in various building environments.

The specific examples are as follows:

1. Control of Opening Degree of Chilled Water Valve in Air Handling Unit

The temperature, cooling load, and supply air temperature set values in each region are collected in real time. Based on the specific physical values of the optimized control parameters in the output control policy, the opening degree of the chilled water valve in the air handling unit is adjusted to achieve precise matching between the actual outlet water flow rate of a cooling coil and the required cooling capacity. By dynamically optimizing the outlet water flow rate, the outlet air temperature of the cooling coil can reach the actual requirement of each region, avoiding excessive cooling and energy waste while meeting the cooling need of the room.

2. Control of Opening Degree of Fresh Air Valve, Return Air Valve and Exhaust Air Valve

Based on the values collected by CO2 concentration, humidity, and other sensors in each region in real time, the opening degree of the fresh air valve, the return air valve, and the exhaust air valve is adjusted according to the specific physical values of the optimized control parameters in the output control policy, so as to dynamically adjust the fresh air volume, the return air volume, and the exhaust air volume, thus ensuring that the indoor air quality meets health standards, avoiding additional energy consumption caused by excessive fresh air, and achieving a dynamic balance between fresh air volume and energy saving.

3. Control of Opening Degree of End Air Valve of Each VAV

Based on inputs such as current temperature, set temperature, and personnel activity level in each region, and the specific physical values of the optimized control parameters in the output control policy, the opening degree of the end air valve of the VAV in each region is independently adjusted to determine the air supply volume to the region, thus achieving the dynamic distribution of the air supply volume to each region according to demands, accurately controlling the temperature of each region to meet user-defined comfort settings, and ensuring the personalized comfort experience.
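How a physical value from the control policy might be turned into a valve opening degree is sketched below. The disclosure does not give this mapping: the linear valve characteristic, the clamping to the range 0 to 100 percent, and the function name are all assumptions.

```python
def valve_opening_percent(target_flow, max_flow):
    """Map a commanded flow rate to a valve opening, assuming a linear valve."""
    if max_flow <= 0:
        raise ValueError("max_flow must be positive")
    fraction = target_flow / max_flow
    # Clamp so that infeasible commands saturate at fully closed or fully open.
    return 100.0 * min(1.0, max(0.0, fraction))

opening = valve_opening_percent(target_flow=0.5, max_flow=2.0)
```

A real valve typically has a nonlinear flow characteristic, so a deployed system would replace the linear mapping with the valve's own curve.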

Example 2

The present example discloses an optimized control system of building air conditionings taking personalized comfort of regional users into consideration.

The optimized control system of building air conditionings taking personalized comfort of regional users into consideration may include:

    • a data acquisition module, configured to acquire comfort feedback data from different regional users of a building and historical operation data of a building air conditioning system, and perform data preprocessing;
    • a MDP construction module, configured to describe a heating, ventilation and air conditioning system control problem as the MDP, and define each agent, and a state, an action and a reward of each agent;
    • a meta-learning training module, configured to perform initial parameter learning of a comfort reward function for a plurality of regions based on the comfort feedback data from a plurality of regional users by using a meta-learning method to obtain a user comfort reward function initial model;
    • a meta-learning testing module, configured to perform differentiated training and parameter updating on the user comfort reward function initial model by respectively using the comfort feedback data from the different regional users to obtain a differentiated comfort reward function model of each agent; and
    • an optimized control policy training module, configured to obtain a reward function of each agent based on an energy consumption reward model and the differentiated comfort reward function model of each agent, train an actor network and a critic network of each agent, and make control decisions for the building heating, ventilation and air conditioning system by using the trained actor network.

Example 3

The objective of the present example is to provide a computer-readable storage medium.

The computer-readable storage medium has a program stored therein, wherein the program, when executed by a processor, implements the steps of the optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to Example 1 of the present disclosure.

Example 4

The objective of the present example is to provide an electronic device.

The electronic device includes a memory, a processor, and a program stored in the memory and runnable on the processor, wherein the program, when executed by the processor, causes the processor to implement the steps of the optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to Example 1 of the present disclosure.

The steps involved in Examples 2, 3 and 4 above correspond to the steps of the method in Example 1. For the specific implementation, reference may be made to the relevant description in Example 1. The term “computer-readable storage medium” should be understood as a single medium or multiple media that include one or more instruction sets; and it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set executed by a processor and causing the processor to implement any method in the present disclosure.

Those skilled in the art should understand that the various modules or steps of the present disclosure may be implemented by using general-purpose computer apparatuses. Alternatively, they may be implemented by using program codes executable by computing apparatuses, so that they can be stored in storage apparatuses for execution by computing apparatuses, or may be implemented by separately making them into various integrated circuit modules, or making multiple modules or steps thereof into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.

Although the specific implementations of the present disclosure have been described with reference to the drawings, they do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without any creative work based on the technical solution of the present disclosure still fall within the scope of protection of the present disclosure.

Claims

1. An optimized control method of building air conditionings taking personalized comfort of regional users into consideration, comprising:

acquiring comfort feedback data from different regional users of a building and historical operation data of a building air conditioning system, and performing data preprocessing;

describing a heating, ventilation and air conditioning system control problem as a Markov decision process (MDP), and defining each agent, and a state, an action and a reward of each agent;

performing initial parameter learning of a comfort reward function for a plurality of regions based on the comfort feedback data from a plurality of regional users by using a meta-learning method to obtain a user comfort reward function initial model;

performing differentiated training and parameter updating on the user comfort reward function initial model by respectively using the comfort feedback data from the different regional users to obtain a differentiated comfort reward function model of each agent; and

obtaining a reward function of each agent based on an energy consumption reward model and the differentiated comfort reward function model of each agent, training an actor network and a critic network of each agent, and making control decisions for the building heating, ventilation and air conditioning system by using the trained actor network; wherein:

using a meta-learning method to obtain a user comfort reward function initial model specifically comprises:

defining the comfort reward function of each regional user as a subtask;

acquiring a plurality of sets of indoor environment states and comfort feedback data for the regional user that each subtask targets, and dividing into a Support set and a Query set;

respectively training each subtask by using data in the Support set, and updating parameters of each subtask through gradient descent;

performing testing in the Query set by using the parameters obtained through training, and then calculating a loss function under each subtask;

averaging the loss functions of a plurality of subtasks, and updating an initial parameter of a meta-learning training task through gradient descent;

calculating the loss function:

$$L_{\theta,b} = \frac{1}{K} \sum_{k=1}^{K} \left( f(S_{in,k}) - \hat{f}_{\theta}(S_{in,k}) \right)^2;$$

wherein, $f(S_{in,k})$ represents actual comfort feedback data from indoor users regarding indoor environment states obtained through sampling; $\hat{f}_{\theta}(S_{in,k})$ represents a predicted output of a model represented by a weight parameter $\theta$ for the same input data $S_{in,k}$; $K$ represents that the Support set totally contains $K$ sets of data, and $k$ represents a $k$th ($1 \le k \le K$) set of data; $S_{in,k}$ represents a $k$th set of indoor environment data in the Support set; $b$ represents a $b$th training subtask, $1 \le b \le B$;

updating parameters through gradient descent:

$$\theta' = \theta - \alpha \nabla_{\theta} L_{\theta,b};$$

wherein, $\theta'$ is an updated weight parameter of a meta-learning network;

calculating the loss function under each subtask:

$$L_{\theta',b} = \frac{1}{K'} \sum_{k'=1}^{K'} \left( f(S_{in,k'}) - \hat{f}_{\theta''}(S_{in,k'}) \right)^2;$$

wherein, $S_{in,k'}$ is a $k'$th set of indoor environment data in the Query set; $\theta''$ represents a model weight parameter obtained during testing by using the Query set after training with the Support set; $K'$ represents that the Query set totally contains $K'$ sets of data; $k'$ represents a $k'$th ($1 \le k' \le K'$) set of data; and

averaging the loss functions of all users, and updating an initial parameter through gradient descent:

$$\theta = \theta - \beta \cdot \frac{1}{B} \sum_{b=1}^{B} \nabla_{\theta} L_{\theta',b};$$

wherein, $b$ represents a $b$th training subtask, $1 \le b \le B$.

2. The optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to claim 1, wherein the comfort feedback data refer to different feeling feedbacks of the users regarding an indoor environment, the indoor environment comprises air temperature, relative humidity, and CO2 concentration; the historical operation data of the building air conditioning system comprises building environment data and HVAC system operation data, and the building environment data comprises indoor temperature, indoor humidity, indoor CO2 concentration, outdoor temperature, outdoor humidity, wind speed, and solar radiation; the HVAC system operation data comprise fan speed, chilled water temperature, and heating water temperature.

3. The optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to claim 1, wherein considering a building with N regions, the N regions are defined as N regional agents, and an air handling unit AHU for the entire building is defined as an (N+1)th agent AHU;

a state comprising indoor temperature, indoor humidity and indoor CO2 concentration is defined;

an action comprising regional mass flow rate, fresh air volume and chilled water flow rate is defined; and

a reward comprising energy consumption and comfort is defined.

4. The optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to claim 3, wherein the defined state of each agent comprises:

for N regional agents, a state $s_{i,t}$ of each region $i$ at time $t$ is:

$$s_{i,t} = \left(T_t^{out}, T_{i,t}, T_{-i,t}, \lambda_t, k_{i,t}, C_{i,t}, H_{i,t}\right);$$

wherein, $1 \le i \le N$; $T_t^{out}$ represents outdoor temperature, $T_{i,t}$ represents indoor temperature of the region $i$, $T_{-i,t}$ represents indoor temperature of an adjacent region of the region $i$, $\lambda_t$ represents electricity price, $k_{i,t}$ represents the number of residents in the region $i$, $C_{i,t}$ represents indoor CO2 concentration of the region $i$, and $H_{i,t}$ represents indoor humidity of the region $i$;

a state $s_{N+1,t}$ of the (N+1)th agent AHU at time $t$ is expressed as:

$$s_{N+1,t} = \left(\lambda_t, k_{1,t}, \dots, k_{N,t}, C_{1,t}, \dots, C_{N,t}, H_{1,t}, \dots, H_{N,t}\right);$$

an action $a_{i,t}$ of each regional agent $i$ ($1 \le i \le N$) at time $t$ is defined:

$$a_{i,t} = \left(M_{i,t}\right);$$

wherein, $M_{i,t}$ represents regional mass flow rate of the agent $i$ ($1 \le i \le N$);

an action $a_{N+1,t}$ of the (N+1)th agent AHU at time $t$ is defined:

$$a_{N+1,t} = \left(d_t, W_t\right);$$

wherein $d_t$ represents fresh air volume, and $W_t$ represents chilled water flow rate;

for N regional agents, a reward function $r_{i,t}$ is:

$$r_{i,t} = \lambda_1 r^{com}_{i,t}(s_{i,t}, a_{i,t}) + \lambda_2 r^{en}_{i,t}(s_{i,t}, a_{i,t});$$

wherein, $r^{com}_{i,t}(s_{i,t}, a_{i,t})$ represents a comfort penalty of a regional agent $i$, $r^{en}_{i,t}(s_{i,t}, a_{i,t})$ represents an energy consumption penalty of the regional agent $i$, and $\lambda_1$ and $\lambda_2$ are positive weight coefficients; $s_{i,t}$ is a state of the regional agent $i$ at time $t$, and $a_{i,t}$ is an action of the regional agent $i$ at time $t$;

for the (N+1)th agent AHU, a reward function $r_{N+1,t}$ is:

$$r_{N+1,t} = \lambda_3 \sum_{i=1}^{N} r^{com}_{i,t}(s_{i,t}, a_{i,t}) + \lambda_4 r^{en}_{N+1,t}(s_{N+1,t}, a_{N+1,t});$$

wherein $r^{en}_{N+1,t}(s_{N+1,t}, a_{N+1,t})$ represents an energy consumption penalty of the (N+1)th agent AHU; $\lambda_3$ and $\lambda_4$ are positive weight coefficients; $s_{N+1,t}$ is a state of the (N+1)th agent AHU at time $t$; $a_{N+1,t}$ is an action of the (N+1)th agent AHU at time $t$.

5. The optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to claim 1, wherein training an actor network and a critic network of each agent specifically comprises:

allowing each agent $i$ to interact with the environment at time $t$ to obtain a state $s_{i,t}$, and use the actor network to select and execute an action $a_{i,t}$ to obtain a new state $s_{i,t+1}$ returned by the environment and a reward $r_{i,t}$ of each agent;

storing experience $(s_{i,t}, a_t, r_t, s_{i,t+1})$ in an experience replay buffer, and randomly extracting a batch of experience samples from the experience replay buffer;

calculating a target value for each agent $i$, minimizing the loss function, and updating parameters of the critic network; and

updating parameters of the actor network through policy gradient for each agent $i$.

6. The optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to claim 5, wherein for each agent $i$, parameters of a target network are softly updated:

$$\varepsilon_{\mu_i'} \leftarrow \tau \varepsilon_{\mu_i} + (1 - \tau)\, \varepsilon_{\mu_i'}; \quad \varepsilon_{Q_i'} \leftarrow \tau \varepsilon_{Q_i} + (1 - \tau)\, \varepsilon_{Q_i'};$$

wherein, $\tau$ is a soft update coefficient; $\varepsilon$ represents a network weight parameter; $\mu_i$ is an actor network of an $i$th agent, $\mu_i'$ is a target actor network of the $i$th agent, $Q_i$ is a critic network of the $i$th agent, and $Q_i'$ is a target critic network of the $i$th agent; $\varepsilon_{\mu_i}$ represents an updated weight parameter of the actor network; $\varepsilon_{\mu_i'}$ represents an updated weight parameter of the target actor network; $\varepsilon_{Q_i}$ represents an updated weight parameter of the critic network; $\varepsilon_{Q_i'}$ represents an updated weight parameter of the target critic network.

7. An optimized control system of building air conditionings taking personalized comfort of regional users into consideration, comprising:

a data acquisition module, being configured to: acquire comfort feedback data from different regional users of a building and historical operation data of a building air conditioning system, and perform data preprocessing;

a Markov-decision-process construction module, being configured to: describe a heating, ventilation and air conditioning system control problem as a Markov decision process (MDP), and define each agent, and a state, an action and a reward of each agent;

a meta-learning training module, being configured to: perform initial parameter learning of a comfort reward function for a plurality of regions based on the comfort feedback data from a plurality of regional users by using a meta-learning method to obtain a user comfort reward function initial model;

a meta-learning testing module, being configured to: perform differentiated training and parameter updating on the user comfort reward function initial model by respectively using the comfort feedback data from the different regional users to obtain a differentiated comfort reward function model of each agent; and

an optimized control policy training module, being configured to: obtain a reward function of each agent based on an energy consumption reward model and the differentiated comfort reward function model of each agent, train an actor network and a critic network of each agent, and make control decisions for the building heating, ventilation and air conditioning system by using the trained actor network; wherein:

using the meta-learning method to obtain the user comfort reward function initial model specifically comprises:

defining the comfort reward function of each regional user as a subtask;

acquiring a plurality of sets of indoor environment states and comfort feedback data for the regional user that each subtask targets, and dividing into a Support set and a Query set;

respectively training each subtask by using data in the Support set, and updating parameters of each subtask through gradient descent;

performing testing in the Query set by using the parameters obtained through training, and then calculating a loss function under each subtask;

averaging the loss functions of a plurality of subtasks, and updating an initial parameter of a meta-learning training task through gradient descent;

calculating the loss function:

$$L_{\theta,b} = \frac{1}{K} \sum_{k=1}^{K} \left( f(S_{in,k}) - \hat{f}_{\theta}(S_{in,k}) \right)^2;$$

wherein, $f(S_{in,k})$ represents actual comfort feedback data from indoor users regarding indoor environment states obtained through sampling; $\hat{f}_{\theta}(S_{in,k})$ represents a predicted output of a model represented by a weight parameter $\theta$ for the same input data $S_{in,k}$; $K$ represents that the Support set totally contains $K$ sets of data, and $k$ represents a $k$th ($1 \le k \le K$) set of data; $S_{in,k}$ represents a $k$th set of indoor environment data in the Support set; $b$ represents a $b$th training subtask, $1 \le b \le B$;

updating parameters through gradient descent:

$$\theta' = \theta - \alpha \nabla_{\theta} L_{\theta,b};$$

wherein, $\theta'$ is an updated weight parameter of a meta-learning network;

calculating the loss function under each subtask:

$$L_{\theta',b} = \frac{1}{K'} \sum_{k'=1}^{K'} \left( f(S_{in,k'}) - \hat{f}_{\theta''}(S_{in,k'}) \right)^2;$$

wherein, $S_{in,k'}$ is a $k'$th set of indoor environment data in the Query set; $\theta''$ represents a model weight parameter obtained during testing by using the Query set after training with the Support set; $K'$ represents that the Query set totally contains $K'$ sets of data; $k'$ represents a $k'$th ($1 \le k' \le K'$) set of data; and

averaging the loss functions of all users, and updating an initial parameter through gradient descent:

$$\theta = \theta - \beta \cdot \frac{1}{B} \sum_{b=1}^{B} \nabla_{\theta} L_{\theta',b};$$

wherein, $b$ represents a $b$th training subtask, $1 \le b \le B$.

8. A computer-readable storage medium, having a program stored thereon, wherein the program, when executed by a processor, implements the steps of the optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to claim 1.

9. An electronic device, comprising a memory, a processor, and a program stored on the memory and runnable by the processor, wherein the program, when executed by the processor, implements the steps of the optimized control method of building air conditionings taking personalized comfort of regional users into consideration according to claim 1.
