It takes two, not one: context-aware nonverbal behaviour generation in dyadic interactions

Nonverbal behaviours are an integral part of human social interaction. Equipping social robots with human nonverbal communication skills has been an active research area for decades, where data-driven, end-to-end learning approaches have become predominant in recent years, offering scalability and generalisability. However, most current works only consider the social signals of a single character to model co-speech gestures in non-interactive settings. To address this shortcoming, this paper introduces a context-aware Generative Adversarial Network intended to produce social cues for robots. The approach captures both the intra- and interpersonal social signals of two interlocutors to model body gestures in dyadic interaction. We conducted a series of experiments to validate the proposed solution under different interaction settings. First, experimental results on the JESTKOD dataset demonstrate the contribution of encoding context, namely the behaviours of the interaction partner, to the prediction of the target person's gestures in agreement situations. Second, experiments on the new LISI-HHI dataset show that combining the Discriminator and the Context Encoder results in a gesture generation framework that is effective across various social communication contexts.


Introduction
Social robots are envisioned to have a profound impact on many sectors, including education, healthcare, the workplace, and the home. All such practical applications require that humans and robots interact and collaborate with each other seamlessly. Along with verbal communication, successful social interaction is closely coupled with the exchange of nonverbal cues, such as gaze, facial expressions, body movements, and hand gestures. Humans tend to use a wide range of nonverbal cues to signal their emotions, intentions, or the verbal content of their speech to their interaction partners. Motivated by this, imitating nonverbal communication has been an active area of research to enhance the clarity of human-robot interaction (HRI) interfaces and the sense of rapport, and hence to maximize user trust and acceptance.
A considerable effort has gone into designing nonverbal interaction skills for social robots. For humanoid robot platforms, nonverbal cues are commonly inspired by human behaviours. One of the main reasons is to ensure that communicative messages, encoded in robots' body movements, are interpretable by humans [1].
CONTACT Nguyen Tan Viet Tuyen tan_viet_tuyen.nguyen@kcl.ac.uk; Oya Celiktutan oya.celiktutan@kcl.ac.uk

Previous work on nonverbal behaviour generation can be briefly categorized into two groups: (1) the rule-based approach and (2) the data-driven approach. Early methods focussed on rule-based approaches [2,3], which require interaction logic to be designed manually and are therefore limited, not transferable to unforeseen interaction contexts, and not robust to unpredicted inputs from the robot's environment (e.g. sensor noise). Data-driven, end-to-end learning approaches [4][5][6] have been a promising solution to address these shortcomings. However, so far, only a handful of works [7][8][9][10][11][12] aim to model behaviours by taking into account the interaction context, namely, the nonverbal signals of the interaction partner. Although social interaction is an open-ended concept, it can be formalized through two main processes: (i) Perception: the perception process involves receiving visual stimuli about the behaviours of others, or the state of the interaction; and (ii) Action: the action process is the generation of a behaviour by taking into account all aspects of the interaction, including the currently perceived state and its history. Therefore, it is necessary to consider the interaction partner's way of speaking and acting in order to create socially suitable behaviours for robots.
This paper introduces a context-aware Generative Adversarial Network (GAN) towards modelling robots' nonverbal behaviours in dyadic interactions. The approach takes the speech features of a target person together with the nonverbal signals of their interaction partner, modelled by a novel Context Encoder, to produce appropriate body gestures supporting social interaction. We comprehensively validated the proposed framework on two datasets, namely JESTKOD and LISI-HHI, covering human dyadic interactions in affective contexts and social communication contexts, respectively. The main contributions of this paper are: (1) a novel co-speech gesture generation framework that captures both intra- and interpersonal social signals to model the body gestures of the target person in dyadic interactions; (2) a series of experiments carried out in different scenarios to examine the impact of interaction context on the generated cues; and (3) the newly created LISI-HHI dataset, which aims to serve as a high-accuracy multimodal database for the HRI community and related research domains. The experimental results on the LISI-HHI dataset aim to serve as a benchmark for the context-aware nonverbal behaviour synthesis task.
The rest of this paper is organized as follows. In Section 2, we review previous studies on nonverbal behaviour generation based on data-driven approaches. Section 3 describes the proposed end-to-end learning framework in detail. It is followed by a series of experiments conducted to verify the proposed network. We validate the model performance on an affective interaction dataset in Section 5 and a social communication database in Section 6. As a proof of concept, we demonstrate the proposed framework on the Pepper humanoid robot in Section 7. Finally, the experimental results and future work are summarized in Section 8.

Nonverbal behaviour synthesis from intrapersonal social signals
The data-driven approach provides a solution to transfer human nonverbal communication skills to robots in an end-to-end manner using large-scale datasets of human behaviours [13,14]. The approach receives the social signals (e.g. speech audio, speech text) of a target person to model their co-speech gestures conveying their emotions or intentions. Different learning frameworks have been introduced to capture the relationship between human audio [6,15], speech text [4,5,16] and human co-speech gestures. The network architecture can be constructed in various ways, ranging from autoregressive models [17] and encoder-decoders [6] to Long Short-Term Memory (LSTM) networks [15] and generative adversarial networks (GANs) [16]. Although these approaches are promising solutions to address the shortcomings of the rule-based approach, they only consider the social signals of a single character to model co-speech gestures in non-interactive settings.

Nonverbal behaviour synthesis from intra-and interpersonal social signals
In small-group social interaction, an essential aspect of communication is the dynamic exchange of nonverbal signals among interlocutors, with the aim of adapting to interaction social norms [18] and of building or breaking a common ground [19][20][21]. This suggests that when modelling human or social robots' nonverbal behaviours in small-group interaction settings, particularly dyadic interaction, both intra- and interpersonal nonverbal signals should be taken into consideration. However, only a few studies [7][8][9][10][11][12] aim to generate behaviours by taking the interaction context, namely, the nonverbal signals of the interaction partner, into consideration.
The problem of modelling human facial expressions in an interaction between an interviewee and an interviewer has been addressed by a conditional GAN framework [7] and by a variational autoencoder (VAE) [8]. The idea of forecasting nonverbal cues has been demonstrated with a residual attention network [10] to forecast human upper-body motions and with a GAN [11] to predict interlocutors' upper-body gestures and facial landmarks. In the scenario of triadic interaction, the authors of [9] introduced a generative framework that observes the nonverbal signals of all interlocutors to forecast the nonverbal signals of a target person. However, none of these approaches has investigated the problem of co-speech gesture synthesis in dyadic interaction and, importantly, the effect of interaction contexts on the generated actions. Motivated by this, our early work [12] introduced a context-aware co-speech gesture generation framework and verified the impact of affective context on the synthesized gestures. In this paper, we further extend the work in [12] by incorporating a new loss function, a modified network architecture, and an updated audio feature extraction to enhance the model performance. In addition to the experiment conducted in affective interaction contexts, we further demonstrate the approach in social communication contexts using our newly created LISI-HHI dataset [22]. By demonstrating the idea on two databases representing two different settings, this paper aims to comprehensively understand the impact of interaction contexts on the context-aware GAN approach.

Problem statement
We define the problem of speech-driven gesture generation with context awareness as follows: in a dyadic interaction between a target person $S_{fo}$ and an interaction partner $S_{ob}$, $A_{fo}^{0:T}$ denotes the speech audio of $S_{fo}$ in a temporal window $t \in [0, T]$. $P_{ob}^{0:T}$ and $A_{ob}^{0:T}$ are the co-speech gesture and the speech audio simultaneously observed from $S_{ob}$ within the same spatial and temporal window. This research aims to find a mapping function $F$ that receives $A_{fo}^{0:T}$, $P_{ob}^{0:T}$, and $A_{ob}^{0:T}$ as inputs, and predicts the output co-speech gesture of $S_{fo}$, namely $\hat{P}_{fo}^{0:T}$.
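The mapping above can be written compactly as follows (this is a restatement of the problem definition given in the text, not an equation reproduced from the original paper):

```latex
\hat{P}_{fo}^{0:T} \;=\; F\!\left(A_{fo}^{0:T},\; P_{ob}^{0:T},\; A_{ob}^{0:T}\right),
\qquad t \in [0, T]
```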

Model overview
To address the research question stated in the aforementioned section, this paper introduces a co-speech gesture generative framework with context awareness, as shown in Figure 1. At each time stamp $t$, the Context Encoder encodes the interaction partner's signals into a contextual vector, and the Generator combines this vector with the target person's speech features and the previously generated pose to produce the next pose $\hat{P}_{fo}^{t}$. This process is repeated until $t = T$. Finally, the generated co-speech gesture $\hat{P}_{fo}^{0:T}$ and the corresponding speech feature vector $s_{fo}^{0:T}$ and contextual vector $c_{ob}^{0:T}$ are injected into $D$, which identifies samples as either fake or real. In the sequel, the proposed network architecture is described in detail.

Context encoder E
Context Encoder is designed to encode the social signals simultaneously collected from the interaction partner in a dyadic interaction into a contextual vector. Context Encoder consists of Motion Encoder and Speech Encoder.
Here, $c_P^t$ encoded by Motion Encoder and $c_A^t$ encoded by Speech Encoder are combined into $c_{ob}^t$, which represents the contextual information extracted from the interaction partner $P_{ob}$ at the current time stamp $t$.

Motion encoder E M
The network receives the motion sequence $P_{ob}^{0:T}$ of the interaction partner $P_{ob}$ as input and delivers the output feature vector $c_P^{0:T}$. Motion Encoder is constructed with a sequence of fully connected (FC) layers and Long Short-Term Memory (LSTM) layers, and iteratively encodes $P_{ob}^{0:T}$ into $c_P^{0:T}$ frame by frame.

Speech encoder E S
The network takes the speech audio $A_{ob}^t$ as input and produces the audio feature vector $c_A^{0:T}$. From the raw speech audio, we first extract the MFCCs and related low-level speech features. MFCCs are well known to encode signal frequencies according to how humans perceive sounds, and such low-level features are widely utilized in speech recognition and speaker identification tasks [23]. In addition to MFCCs, prosodic features representing the energy of speech are utilized, as they encompass intonation, rhythm, and other information about the speech beyond the specific words spoken (e.g. semantics and syntax). Speech prosody is a common candidate for modelling human beat gestures [24]. Similar to the Motion Encoder, Speech Encoder processes the input speech features frame by frame. Speech Encoder is constructed with 4 convolutional (CONV) layers, 1 LSTM layer, and 1 FC layer.
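The layer counts above (FC + LSTM for the Motion Encoder; 4 CONV, 1 LSTM, 1 FC for the Speech Encoder) can be sketched in PyTorch as follows. The paper does not specify hidden sizes, kernel sizes, or activations, so `POSE_DIM`, `AUDIO_DIM`, `HID`, and all layer hyperparameters below are hypothetical choices for illustration only:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the paper does not specify layer sizes.
POSE_DIM, AUDIO_DIM, HID = 63, 48, 64

class MotionEncoder(nn.Module):
    """E_M: FC layers followed by LSTM, run frame by frame over P_ob."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(POSE_DIM, HID), nn.ReLU(),
                                nn.Linear(HID, HID), nn.ReLU())
        self.lstm = nn.LSTM(HID, HID, batch_first=True)

    def forward(self, p):             # p: (B, T, POSE_DIM)
        h, _ = self.lstm(self.fc(p))  # h: (B, T, HID)
        return h

class SpeechEncoder(nn.Module):
    """E_S: 4 CONV layers, 1 LSTM layer, 1 FC layer over per-frame features."""
    def __init__(self):
        super().__init__()
        convs, ch = [], AUDIO_DIM
        for _ in range(4):
            convs += [nn.Conv1d(ch, HID, kernel_size=3, padding=1), nn.ReLU()]
            ch = HID
        self.conv = nn.Sequential(*convs)
        self.lstm = nn.LSTM(HID, HID, batch_first=True)
        self.fc = nn.Linear(HID, HID)

    def forward(self, a):             # a: (B, T, AUDIO_DIM)
        x = self.conv(a.transpose(1, 2)).transpose(1, 2)  # conv over time axis
        h, _ = self.lstm(x)
        return self.fc(h)

class ContextEncoder(nn.Module):
    """E: concatenates the partner's motion and speech encodings into c_ob."""
    def __init__(self):
        super().__init__()
        self.em, self.es = MotionEncoder(), SpeechEncoder()

    def forward(self, p_ob, a_ob):
        return torch.cat([self.em(p_ob), self.es(a_ob)], dim=-1)  # (B, T, 2*HID)

c = ContextEncoder()(torch.randn(2, 20, POSE_DIM), torch.randn(2, 20, AUDIO_DIM))
print(c.shape)  # torch.Size([2, 20, 128])
```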

Generator G
Generator $G$ consists of Speech Encoder, $G_{Encoder}$, and $G_{Decoder}$. The Speech Encoder implemented in $G$ inherits the same network architecture as the one implemented in $E$, and they share the same weight parameters. At a time stamp $t$, Speech Encoder receives the speech audio $A_{fo}^t$ as input and encodes it into $s_{fo}^t$. Then $s_{fo}^t$, $c_{ob}^t$, and the previously generated pose $\hat{P}_{fo}^{t-1}$ are fed into $G_{Encoder}$. At the initial time stamp ($t = 0$), a seed pose $P_{fo}^{init}$ is injected into $G_{Encoder}$ instead of the previously generated pose. $G_{Encoder}$ is designed with a sequence of FC layers that encode the input vector into an internal representation $h_e^t$. Finally, $h_e^t$ is fed to $G_{Decoder}$ to generate the next motion frame $\hat{P}_{fo}^t$. We designed $G_{Decoder}$ with a sequence of FC layers and LSTM layers. As illustrated in Figure 1, to better model the velocity of the generated motion, a residual connection is added between the previously generated pose and the new output pose produced by $G_{Decoder}$. This allows $G_{Decoder}$ to model the difference between $\hat{P}_{fo}^{t-1}$ and $\hat{P}_{fo}^t$, which encourages the continuity of the generated motions. Note that Generator can also be used independently, without integrating Context Encoder and Discriminator. In this case, $G$ receives only $A_{fo}^{0:T}$ to predict the co-speech gesture $\hat{P}_{fo}^{0:T}$. Further details are presented in Sections 5 and 6.
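The autoregressive decoding step with the residual connection described above can be sketched as a single PyTorch module. The layer sizes, the use of a single `LSTMCell`, and the batch size are hypothetical; the point illustrated is only the data flow: (speech feature, contextual vector, previous pose) in, previous pose plus a predicted delta out:

```python
import torch
import torch.nn as nn

HID, POSE_DIM = 64, 63  # hypothetical sizes, not from the paper

class GeneratorStep(nn.Module):
    """One decoding step: (s_fo^t, c_ob^t, previous pose) -> next pose.
    The residual connection lets the decoder model pose differences."""
    def __init__(self):
        super().__init__()
        self.g_enc = nn.Sequential(nn.Linear(HID + 2 * HID + POSE_DIM, HID),
                                   nn.ReLU())
        self.lstm = nn.LSTMCell(HID, HID)
        self.g_dec = nn.Linear(HID, POSE_DIM)

    def forward(self, s_t, c_t, p_prev, state):
        h_e = self.g_enc(torch.cat([s_t, c_t, p_prev], dim=-1))
        h, cell = self.lstm(h_e, state)
        return p_prev + self.g_dec(h), (h, cell)  # residual: p_prev + delta

step = GeneratorStep()
p = torch.zeros(2, POSE_DIM)                       # seed pose P_init at t = 0
state = (torch.zeros(2, HID), torch.zeros(2, HID))
poses = []
for t in range(10):                                # unroll until t = T
    p, state = step(torch.randn(2, HID), torch.randn(2, 2 * HID), p, state)
    poses.append(p)
print(len(poses), poses[0].shape)
```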

Discriminator D
During the training phase, both the real co-speech gestures $P_{fo}^{0:T}$ and the fake ones $\hat{P}_{fo}^{0:T}$ are injected into the Discriminator $D$. Additionally, $D$ also takes the speech features $s_{fo}^{0:T}$ of the target person $P_{fo}$ and the contextual vector $c_{ob}^{0:T}$ of the interaction partner $P_{ob}$ into consideration to produce the adversarial loss $y$. Here, $D$ works as a smart adaptive loss function, where $s_{fo}^{0:T}$ delivers information allowing $D$ to validate the synchrony with speech, while $c_{ob}^{0:T}$ contains information for verifying the context synchrony. $D$ is designed with 2 LSTM layers followed by a sequence of FC layers. The output values of the last FC layer are passed through a sigmoid function to produce a probability indicating whether the input motion is real or fake.
Overall, the framework shown in Figure 1 is trained with the loss functions $L_G$ and $L_D$ defined in Equations (1) and (2), respectively. The training procedure is summarized in Algorithm 1. $\dot{P}_{fo}^{1:T}$ and $\dot{\hat{P}}_{fo}^{1:T}$ represent the velocities of the ground-truth motion $P_{fo}^{1:T}$ and the generated one $\hat{P}_{fo}^{1:T}$, respectively. $\alpha$, $\beta$, and $\gamma$ are weight parameters that balance the corresponding loss terms. Note that the newly implemented velocity loss can be considered an improvement of the loss function $L_G$ introduced in [5]. By incorporating the velocity loss into the total loss $L_G$, along with the adversarial loss and the position loss, the new approach enhances the smoothness of the generated motions.
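Equations (1) and (2) themselves did not survive into this text. A plausible form, reconstructed here under standard GAN conventions from the surrounding description (an adversarial term, a position term, and the new velocity term, weighted by $\alpha$, $\beta$, $\gamma$), would be:

```latex
L_G = \alpha\,\mathbb{E}\!\left[-\log D\!\left(\hat{P}_{fo}^{0:T}\right)\right]
    + \beta\,\left\lVert P_{fo}^{0:T} - \hat{P}_{fo}^{0:T} \right\rVert_1
    + \gamma\,\left\lVert \dot{P}_{fo}^{1:T} - \dot{\hat{P}}_{fo}^{1:T} \right\rVert_1
\qquad
L_D = -\,\mathbb{E}\!\left[\log D\!\left(P_{fo}^{0:T}\right)\right]
      -\,\mathbb{E}\!\left[\log\!\left(1 - D\!\left(\hat{P}_{fo}^{0:T}\right)\right)\right]
```

Here $D(\cdot)$ is additionally conditioned on $s_{fo}^{0:T}$ and $c_{ob}^{0:T}$ (omitted for brevity); this is a sketch consistent with the text, not the paper's exact equations.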

Evaluation metrics
The following metrics are used to validate the accuracy and the quality of the generated actions, based on the related literature [6,15,25]. In short, Average Position Error is used to measure the difference between the ground truth and the predicted motions, while Acceleration and Jerk are used to assess the smoothness of the actions.
Average Position Error (APE): APE measures the average distance between the predicted joint angles and the ground-truth ones, as given in Equation (3), where $T$ denotes the length of the motion sequence and $D$ is the total number of joints. The closer the APE score is to 0, the more similar the generated motions are to the ground truth.
Acceleration and Jerk: Acceleration is calculated as the rate of change of joint velocity, while Jerk is defined as the rate of change of Acceleration. The two metrics are commonly used to verify the smoothness of motion; the lower the values, the smoother the motions [26].
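A minimal numpy sketch of the three metrics follows. The exact distance used in the paper's Equation (3) is not reproduced here, so the APE below assumes a mean absolute difference over frames and joints; Acceleration and Jerk are taken as the mean magnitudes of the second and third temporal differences, scaled by the frame rate:

```python
import numpy as np

def ape(p_true, p_pred):
    """Average Position Error: mean distance between predicted and
    ground-truth joint values over all T frames and D joints."""
    return float(np.mean(np.abs(p_true - p_pred)))

def acceleration(p, fps=20.0):
    """Mean magnitude of the second temporal derivative of the motion."""
    return float(np.mean(np.abs(np.diff(p, n=2, axis=0))) * fps ** 2)

def jerk(p, fps=20.0):
    """Mean magnitude of the third temporal derivative
    (the rate of change of acceleration)."""
    return float(np.mean(np.abs(np.diff(p, n=3, axis=0))) * fps ** 3)

# Sanity check: a constant-velocity motion has (numerically) zero
# acceleration and jerk, and zero APE against itself.
t = np.linspace(0.0, 1.0, 21)[:, None]
motion = np.tile(t, (1, 63))          # (T=21 frames, D=63 joint features)
print(ape(motion, motion), acceleration(motion), jerk(motion))
```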

JESTKOD -a dataset of dyadic interactions in affective contexts
The proposed approach was validated on the JESTKOD dataset [27], a time-synchronized speech and gesture dataset of affective dyadic interactions. The body data was collected with a motion capture system and is defined by Euler angles. This dataset allows us to model the full-body gestures of a target person from speech while taking into consideration the contextual information simultaneously acquired from the interaction partner. The JESTKOD dataset covers a wide range of agreement and disagreement discussions on different topics (e.g. movies, sport, music) with 10 participants (4 females, 6 males). The dataset was collected in such a way that the participants' profiles were considered in assigning them to suitable conversational topics to elicit agreement and disagreement discussions.

Dataset preprocessing
We divided the dataset into training and testing sets.
To better understand the contribution of affective contexts to the generated motions, we trained and evaluated the approach on two separate interaction tasks, namely, agreement and disagreement scenarios. Specifically, for the agreement scenarios, 41 sessions were used for training and 15 sessions for testing. For the disagreement scenarios, the training set includes 30 sessions, while the testing set consists of 12 sessions. The recordings of motion and speech were down-sampled to a common frame rate of 20 frames per second (fps). For each interaction session, we extracted low-level features from the audio recordings, as illustrated in Table 1, with a total dimension of 48. In terms of motion data, on each motion frame, 63 features representing 21 joints of the human body in roll, pitch, and yaw were selected ($P^{0:T} \in \mathbb{R}^{63 \times T}$). Speech features and motion features were normalized using their corresponding min-max values over the whole time sequence. Finally, the data was split into a set of training instances using a time window of $T = 6$ s and a sliding window of 2 s. For each motion instance, we stored the initial pose $P^{init}$ of the motion sequence $P^{0:T}$ and used it as a seed pose as discussed in Section 3.4.
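The preprocessing steps above (min-max normalization over the whole sequence, then a 6 s window slid every 2 s at 20 fps, storing the first frame as seed pose) can be sketched as follows; the dictionary keys and function names are illustrative, not from the paper:

```python
import numpy as np

def minmax_normalize(x):
    """Min-max normalization per feature over the whole time sequence."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    return (x - lo) / np.where(hi > lo, hi - lo, 1.0)

def make_instances(pose, audio, fps=20, win_s=6, stride_s=2):
    """Split synchronized pose/audio streams into fixed-length training
    instances with a sliding window, storing the first frame as seed pose."""
    win, stride = win_s * fps, stride_s * fps
    out = []
    for start in range(0, pose.shape[1] - win + 1, stride):
        p = pose[:, start:start + win]
        a = audio[:, start:start + win]
        out.append({"P": p, "A": a, "P_init": p[:, 0]})
    return out

pose = np.random.rand(63, 20 * 30)    # 30 s of motion features at 20 fps
audio = np.random.rand(48, 20 * 30)   # synchronized 48-dim audio features
inst = make_instances(minmax_normalize(pose), minmax_normalize(audio))
print(len(inst), inst[0]["P"].shape)  # 13 instances of shape (63, 120)
```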

Ablation studies
The network was first trained on the training set of the agreement scenarios, as mentioned in Section 5.2. The training data was fed to the network with a batch size of 1024. We used the Adam optimizer with a learning rate of $\alpha = 0.0001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$. The learning rate was kept constant for the first 700 training epochs and then reduced with a decay factor of 0.9 for every subsequent 20 epochs. In the loss function $L_G$, we set $\alpha = 5$, $\beta = 5$, and $\gamma = 1$. All of these parameter values were chosen empirically. The network was trained for 1000 epochs. In the first 50 warm-up epochs, the adversarial loss was not included in $L_G$. This training pipeline was repeated for the JESTKOD training set of the disagreement scenarios.
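Under one reading of the schedule above (no decay for 700 epochs, then one multiplication by 0.9 per further 20 epochs; the exact boundary handling is an assumption), the learning rate at a given epoch can be computed as:

```python
def learning_rate(epoch, base_lr=1e-4, warmup_epochs=700,
                  decay_every=20, factor=0.9):
    """Learning-rate schedule: constant for the first `warmup_epochs`,
    then multiplied by `factor` once per further `decay_every` epochs."""
    if epoch < warmup_epochs:
        return base_lr
    n_decays = (epoch - warmup_epochs) // decay_every + 1
    return base_lr * factor ** n_decays

print(learning_rate(0), learning_rate(699), learning_rate(700),
      learning_rate(760))
```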
In addition to the full model consisting of Generator, Context Encoder, and Discriminator, ablation experiments were conducted to verify the impact of the individual model components. Table 2 summarizes the key components of the 5 implemented models. (1) The full model is composed of $G$, $E_M$, $E_S$, and $D$ as introduced in Section 3; compared to [12], the network architectures of $E$, $G$, and $D$ of the full model were improved by updating several hidden layers to better represent the output features, the audio inputs were described by a higher number of relevant low-level features, as shown in Table 1, and $L_G$ was extended with the velocity loss to better encourage the smoothness of the generated motions. (2) The second model is comprised of $G$, $E_M$, $E_S$, and $D$ as introduced in [12]. (3) The without D approach was implemented by removing $D$ from the proposed framework; in other words, the adversarial loss did not contribute to the loss function $L_G$. (4) The without E and D model was designed by removing both $D$ and $E$. (5) The Speech to Gesture network was introduced in [15]; similar to the without E and D framework, Speech to Gesture receives $A_{fo}^{0:T}$ as input for modelling the co-speech gestures $P_{fo}^{0:T}$. All 5 models were trained on the JESTKOD training sets of the agreement and disagreement scenarios using the same training pipeline mentioned above.

The impact of affective context on body gestures in dyadic interaction
The results shown in Table 3 indicate that the full model and the network of [5] demonstrate similar performance in terms of APE scores in the agreement and disagreement scenarios. However, the motions produced by the full model have lower Acceleration and Jerk values. This result can be interpreted in light of the improved loss function $L_G$ of the full model, which aims to enhance the smoothness of the generated actions.
A closer look at the APE scores reported in Table 3(a,b) shows that, except for the full model and the approach of [12], for which the difference in APE values is negligible, the models implemented in the agreement scenarios always showed better performance with respect to all the metrics defined in Section 4 than the same network architectures employed in the disagreement scenarios. In other words, the models in the agreement scenarios were able to produce co-speech gestures $\hat{P}_{fo}^{0:T}$ more similar to the ground-truth motions $P_{fo}^{0:T}$; moreover, the generated motions were smoother, as indicated by the smaller Acceleration and Jerk values obtained. The differences in APE values were even more obvious in the case of the Speech to Gesture and without E and D networks, in which Context Encoder was not implemented. This result suggests that in affective conversations it is more difficult to model the co-speech gestures of the target person $P_{fo}$, since their speech features $s_{fo}^t$ are not the only factor driving their body gestures $\hat{P}_{fo}^{0:T}$. In other words, the impact of the interaction context on the prediction of co-speech gestures is unavoidable. Thus, Context Encoder should be employed to better model the dynamic exchange of social signals in dyadic interaction.
From an interpersonal perspective, there are several moderating variables (e.g. mimicry, synchrony) that have a high impact on the way humans behave, in particular their body gestures, during affective interactions [20,21,28]. For instance, non-conscious behavioural mimicry can be detected when interlocutors have affiliative motivations during an interaction [28], and synchrony of movements in dyadic interactions is established between people who have a pre-existing friendship [21]. Conversely, the synchrony of behaviours has been observed to decrease in situations in which the relationship between the interlocutors is not well established [20]. These studies provide empirical evidence that the impact of moderating variables on the interlocutors' nonverbal behaviours is unavoidable in affective dyadic interactions. Specifically, considering the agreement and disagreement scenarios presented in this work, the synchrony and mimicry of nonverbal signals between two interlocutors tend to decrease when they are involved in a controversial communication. Contrarily, when two partners share convergent opinions for building a common ground, this process encourages the dynamic exchange of nonverbal signals during the interaction. As a result, the information encoded by our proposed Context Encoder can better contribute to the prediction of co-speech gestures. Figures 2 and 3 present examples of generated co-speech body gestures derived from the test sets of the agreement and disagreement scenarios, respectively. Here, human motions represented by 3D angle rotations were converted into joint coordinates and presented in 3D coordinate space. It can be seen that in this dataset, interlocutors tend to use hand gestures to communicate their messages to their interaction partner, while the lower body remains relatively static. In particular, one of the frequently occurring cues was the 'head tilting' motion related to the disagreement scenario, as illustrated in Figure 3. As also highlighted in
[29], this is a common behaviour used to communicate a disagreement or confusion to the interaction partner in controversial conversations.

LISI-HHI -a multimodal dataset of dyadic interactions in social communication contexts
The LISI-HHI (Learning to Imitate Social Interaction - Human-Human Interaction) dataset [22] consists of multimodal signals, including multiple RGB-D views, eye gaze, audio, and motion data. Figure 4 illustrates an example of the multimodal data in the dataset. The recordings were conducted in a motion capture room, where all sensors were synchronized in the time domain. The LISI-HHI dataset complements previous databases by incorporating a multi-sensory setup with a novel design of social communication scenarios. Rather than being limited to a specific context, for instance, agreement and disagreement discussions [27] or theatrical narratives [30], LISI-HHI covers a wider range of daily human communication scenarios, which are practical to transfer to social HRI. To the best of our knowledge, LISI-HHI is among the few available databases that cover a high number of modalities, camera views, participants, and social interaction sessions. Taken together, LISI-HHI aims to serve as a high-accuracy multimodal dataset for many different research domains, especially HRI.
With the aim of collecting a diverse set of verbal and nonverbal behaviours in different communication contexts, the participants were not given any narrations, and no constraints were put on their way of speaking and acting. LISI-HHI comprises 5 designed scenarios: (1) small talk; (2) meal planning; (3) tangram game; (4) role playing; and (5) way finding. The LISI-HHI dataset covers a total of 160 interaction sessions performed by 64 participants (38 females, 26 males). Each pair of participants was instructed to conduct 5 interaction sessions under the 5 different scenarios mentioned above. The dataset comprises a total of 8.3 hours of recordings.
In addition to the experiment conducted in affective contexts discussed in Section 5, the LISI-HHI dataset was utilized to verify the model performance in social communication contexts. Regarding the motion data, body gestures are represented by sequences of joint angles in JESTKOD, while they are defined as joint coordinates in LISI-HHI. Taken together, conducting experiments with two datasets, recorded in two different dyadic contexts and defined by two different motion representations, allowed us to validate the proposed approach comprehensively.

Dataset preprocessing
From the audio recordings, we first extracted low-level audio features resulting in a total dimension of 48 ($A^{0:T} \in \mathbb{R}^{48 \times T}$), as explained in Table 1. The audio vectors were then normalized based on their min-max values over the whole time sequence, similar to the preprocessing of the JESTKOD dataset. Concerning the motion data, in the LISI-HHI dataset human gestures are defined by 39 joints in 3D Cartesian space. To eliminate differences in body size, we reconstructed the human joints with respect to their top-chest joint coordinates. The motion values were then normalized, taking into consideration their min-max values over the whole time sequence. From the 39 joints of the dataset, 30 main joints were selected to represent the whole human body pose, which results in a 90-dimensional motion vector ($P^{0:T} \in \mathbb{R}^{90 \times T}$).
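The body-size normalization step above, expressing every joint relative to the top-chest joint, can be sketched in a few lines of numpy. The index of the top-chest joint in the skeleton is not given in the text, so `ref_idx` below is a hypothetical placeholder:

```python
import numpy as np

def recenter(joints, ref_idx):
    """Express all joints relative to a reference joint (the top-chest
    joint in the paper) to remove global position and body-size offsets."""
    # joints: (T, J, 3) sequence of 3D joint coordinates
    return joints - joints[:, ref_idx:ref_idx + 1, :]

seq = np.random.rand(100, 39, 3)   # 100 frames, 39 joints, xyz
rel = recenter(seq, ref_idx=3)     # index 3 is a hypothetical top-chest joint
print(rel.shape, np.allclose(rel[:, 3], 0))  # reference joint becomes origin
```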
The LISI-HHI dataset consists of 5 social communication scenarios, each composed of 32 sessions. To better examine the effects of interaction context on the generated motions, the network was trained and evaluated on the individual scenarios. For each scenario, 25 sessions were used for training and 7 sessions for testing.

The effects of social communication context on motion synthesis
An ablation study was conducted with the 5 implemented frameworks; their model components are illustrated in Table 2. We sequentially trained them on each of the 5 scenarios. Due to the higher joint dimension ($P^{0:T} \in \mathbb{R}^{90 \times T}$), the training process was conducted with a batch size of 512. Apart from the batch size, we used the same training parameters and strategies as introduced in Section 5.3. Table 4 summarizes the performance of the 5 implemented frameworks across the 5 different social communication scenarios. Figures 5-9 present examples of generated co-speech body gestures derived from the testing set of the LISI-HHI dataset.
The results depicted in Table 4 indicate that the differences between the full model and the approach introduced in [12] are not noticeable in terms of APE. However, the Acceleration and Jerk of the motions produced by the full model tend to be lower than those of the motions generated by the approach of [12]. The differences in Acceleration and Jerk values can be explained by the differences in their loss functions $L_G$: in the full model, the addition of the velocity loss to $L_G$ enhances the smoothness of the generated motion, resulting in lower Acceleration and Jerk values.
Compared to the full model and the network of [12], the performance of without D, without E and D, and Speech to Gesture was significantly reduced, even though without D demonstrated comparable performance to the full model and the network of [12] in some scenarios. It is interesting to observe that the APE values significantly increased when $D$ was removed from the framework. Considering the setting of the LISI-HHI dataset, in which a wider range of body and hand gestures was exhibited to support the communicators' messages during dyadic conversations, the adversarial loss from $D$ can provide $G$ with informative feedback to better imitate the distribution of human communicative gestures, with the final goal of producing fake gestures $\hat{P}_{fo}^{0:T}$ as similar as possible to the real ones $P_{fo}^{0:T}$. In the Tangram game scenario, gestures were frequently performed by the participants to better explain the semantic contents of their speech, for instance, the shapes of the tangram cards. Such nonverbal behaviours tend to produce a higher range of movements compared to beat gestures [31], as displayed in Figures 5 and 8; consequently, a higher position error is predictable in that scenario. Similarly, in the example of Way finding shown in Figure 9, iconic and pointing gestures were commonly performed by the participants for navigating and localizing purposes. This context encourages the participants to perform more energetic hand movements, resulting in higher Acceleration and Jerk values. In general, the variations in motion accuracy across the 5 scenarios highlight the influence of communication contexts on the performance of co-speech generative networks. To some extent, the experimental results in Table 4 demonstrate that the combination of $E$ and $D$, as in the full model, can mitigate the effect of interaction contexts on the accuracy of the actions produced by $G$. This also suggests that discussions about the performance of co-speech gesture frameworks should consider the nature of the interaction context settings.

Transferring human gestures to social robots
The motions generated for dyadic human-human interaction can be transferred to social robots as the robots' nonverbal gestures supporting social human-robot interaction. As a proof of concept, we implemented the generated motions $\hat{P}_{fo}^{0:T}$ of the target person $P_{fo}$ in affective contexts on the Pepper robot. The process starts by converting $\hat{P}_{fo}^{0:T}$ into a set of 3D human joint coordinates. The motion defined in the human motion space is then transferred into the Pepper robot's motion space using the transformation model introduced in [32]. Consequently, the robot's motion is represented by a list of the robot's joint angles over the time sequence. Figure 10 presents generated actions collected from the test sets of the agreement and disagreement interactions.

Conclusion
This paper introduced a context-aware GAN approach towards modelling robots' nonverbal behaviours in dyadic interactions. The framework consists of Context Encoder, Generator, and Discriminator. The approach receives the speech features of a target person together with the nonverbal signals of their interaction partner, modelled by a novel Context Encoder, to generate appropriate co-speech gestures supporting dyadic interaction. A series of experiments was conducted to validate the proposed framework comprehensively. We first evaluated our method in agreement and disagreement situations using the JESTKOD dataset. The experimental results show that Context Encoder contributes more to the prediction of co-speech gestures in agreement situations, implying the importance of the interaction context. To verify the model performance in social communication settings, we conducted an experiment using our new LISI-HHI dataset. The experimental results confirm the contribution of Context Encoder to the accuracy of the generated gestures. The results also highlight that the combination of Discriminator and Context Encoder forms an efficient co-speech generation network that can work across different social communication settings. As a proof of concept, we demonstrated the idea of modelling body gestures with context awareness on the Pepper robot.
In small group interaction, especially dyadic interaction, an essential aspect of communication is the dynamic exchange of nonverbal signals among interlocutors. The interaction context can influence how interlocutors speak and act, whether to adapt to the social norms of the interaction or to build or break common ground. Consequently, this social factor should be considered when modelling nonverbal signals in dyadic interactions, particularly for generating appropriate robot gestures in social HRI settings.
Studies of human behaviour show that interpersonal coordination in dyadic interactions can be observed either within the same time window or with a lag of several seconds [33][34][35]. In other words, the contribution of interaction context to generated gestures should be investigated not only in the nonverbal behaviour synthesis task, as discussed in this paper, but also in the forecasting task. Hence, potential avenues for future research include demonstrating gesture synthesis and gesture forecasting within HRI scenarios, as well as investigating their effectiveness using both objective and subjective evaluation techniques.

Notes on contributors
Nguyen Tan Viet Tuyen is a Research Associate in the Centre for Robotics Research, Department of Engineering, King's College London, United Kingdom. He received the M.Sc. and Ph.D. degrees in Information Science from the Japan Advanced Institute of Science and Technology (JAIST) in 2018 and 2021, respectively. His research interests include social and cognitive robotics, human-robot interaction, machine learning, and mechatronics.
Oya Celiktutan received her Ph.D. degree in electrical and electronic engineering from Bogazici University, Turkey, in 2013. In 2018, she joined King's College London, United Kingdom, where she is an Associate Professor in the Centre for Robotics Research, Department of Engineering, and the Head of the Social AI & Robotics Laboratory. Her primary research interest is machine learning applied to computer vision, human behaviour understanding and generation, and human-robot interaction.

Figure 1 .
Figure 1. The proposed framework, based on a conditional GAN, generates body gestures for a target person by taking into consideration the target person's speech (or audio) and their interaction partner's nonverbal signals encoded by the Context Encoder.

Algorithm 1.
The proposed algorithm for the training phase. Input: P0:T ob, A0:T ob, P0:T fo, A0:T fo; training steps s = 0, ..., S.

Figure 2 .
Figure 2. Sample generated body gestures (coloured in red, right side) from the agreement scenario: (a) ground truth; (b) full model. The human skeleton coloured in black represents the body motion of the interaction partner P ob.

Figure 3 .
Figure 3. Sample generated body gestures (coloured in red, left side) from the disagreement scenario: (a) ground truth; (b) full model. The human skeleton coloured in black represents the body motion of the interaction partner P ob.

Figure 5 .
Figure 5. Sample generated body gestures (coloured in red, left side) from the scenario Small Talk.

Figure 6 .
Figure 6. Sample generated body gestures (coloured in red, left side) from the scenario Meal Planning.

Figure 7 .
Figure 7. Sample generated body gestures (coloured in red, left side) from the scenario Tangram game.

Figure 8 .
Figure 8. Sample generated body gestures (coloured in red, left side) from the scenario Role playing.

Figure 9 .
Figure 9. Sample generated body gestures (coloured in red, left side) from the scenario Way finding.

Figure 10 .
Figure 10. Transferring the generated motion of the target person P fo onto the Pepper social robot. The human skeleton coloured in black represents the body motion of the interaction partner P ob.

Table 1 .
Low-level features extracted from audio input.

Table 2 .
Key components of ablation models.

Table 3 .
Performance of the five implemented models in terms of APE, Acceleration, Jerk, and FGD.
Note: The results are reported on the test sets of (a) the agreement scenarios and (b) the disagreement scenarios of the JESTKOD dataset.

Table 4 .
Performance of the five implemented models in terms of APE, Acceleration, Jerk, and FGD.