
Guài  (2023)

Project Summary

Guài means strange, or monster, in Mandarin. The project started with the question – if AI could really decipher our personality, what sort of inner monster would we find inside?

 

Commissioned for Melbourne Fringe 2023, the work was presented at Footscray Community Arts from 5 to 15 October 2023. Although the work was targeted toward young people aged 10+, the audience spanned a wide range of ages and cultural backgrounds, largely because Footscray Community Arts sits within a large immigrant and multicultural community. This was particularly pertinent in the context of well-documented biases against cultural minorities in the field of biometric analysis (Michael et al. 2022).

 

As participants individually entered our "portal", they were confronted with an AR mirror (a translucent two-way mirror sitting atop an LED screen) that allowed them to see a reflection of their own body overlaid on a virtually-constructed reality. An Azure Kinect computer vision camera positioned above the mirror allowed us to track the participant's body and adjust the virtual view according to their head position. The camera was also used to capture a photo of the participant for biometric facial analysis.
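
As a rough illustration (not the production Unity code), the sketch below shows the general idea of head-coupled perspective: the virtual camera follows the tracked head position so the scene behind the mirror shifts with the viewer. The function name, smoothing value and coordinates are hypothetical.

# Minimal sketch of head-coupled perspective. `head_position` (metres,
# mirror-centred coordinates) stands in for the joint data returned by the
# Azure Kinect body tracker; the smoothing constant is illustrative only.
def update_virtual_camera(head_position, previous=(0.0, 0.0, 1.5), smoothing=0.2):
    """Return a smoothed virtual camera position that follows the head."""
    return tuple(p + smoothing * (h - p) for h, p in zip(head_position, previous))

# Example: participant leans slightly to one side, about 1.4 m from the mirror.
print(update_virtual_camera((0.15, -0.05, 1.4)))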

 

Facial recognition AI provided by Amazon Rekognition ("Guidelines on face attributes" 2023) was used to obtain general information such as age and gender. 

 

During the development stage of this work, we used Microsoft’s Azure Face for emotion recognition. However, Microsoft discontinued public access to the API in June 2023 due to concerns about the ‘lack of scientific consensus on the definition of emotions, the challenges in how inferences generalise over use cases, regions and demographics and the heightened privacy concerns around this type of capability’ (Crampton 2022).

Amazon Rekognition also provided the ability to extrapolate eight different emotional states (happy, sad, angry, surprised, disgusted, calm, confused and fear), as well as a confidence rating for each metric.
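
For readers unfamiliar with the API, the sketch below shows a minimal Rekognition face-attribute request in Python using boto3. The installation itself called the API from Unity; the image path and AWS region here are placeholders.

# Minimal sketch of a Rekognition face-attribute request (illustrative only).
import boto3

client = boto3.client("rekognition", region_name="ap-southeast-2")

with open("participant.jpg", "rb") as f:
    response = client.detect_faces(Image={"Bytes": f.read()}, Attributes=["ALL"])

face = response["FaceDetails"][0]
age = face["AgeRange"]                  # e.g. {'Low': 22, 'High': 30}
gender = face["Gender"]                 # e.g. {'Value': 'Female', 'Confidence': 99.1}
emotions = {e["Type"]: e["Confidence"]  # eight emotions, each with a confidence score
            for e in face["Emotions"]}
print(age, gender, emotions)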

 

Amazon Rekognition provides a binary prediction of gender (male/female). Its documentation states that the prediction is based on the physical appearance of a face in a particular image and does not indicate gender identity. In Guài, we chose to retain rather than remove this prediction, despite its non-inclusive nature, to make transparent the types of biometric analysis and categorisation that we are potentially subject to without our knowledge or consent. Similarly, predictions of emotional expression are based on the physical appearance of a person's face; the documentation states that this does not indicate a person's actual internal emotional state. For example, a person pretending to have a happy face in a picture might look happy without actually experiencing happiness.

 

For Guài, we took creative license to extrapolate eight personality traits – happiness, boringness, intelligence, attractiveness, emotional stability, weirdness, trustworthiness and egoism – from these emotional states, using an algorithm provided by ChatGPT. In reality, the algorithm had no scientific basis, and we had no way to interrogate how ChatGPT derived it, due to the black-box nature of complex neural networks (Bathaee 2018). However, we used this AI construction to challenge participants to think critically about the way society tends to accept AI as omniscient and accurate despite many documented biases, particularly against bodies which are transgressive, coloured, or non-normative (Singh et al. 2021).
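
The sketch below shows the general shape of such a trait mapping – a weighted combination of emotion confidences per trait. All weights shown are entirely hypothetical; the actual algorithm came from ChatGPT and was not interpretable.

# Illustrative sketch only: hypothetical weighted mapping from emotion
# confidences (0-100) to personality trait scores.
TRAIT_WEIGHTS = {
    # hypothetical example weights, one row per personality trait
    "happiness":           {"HAPPY": 0.8, "CALM": 0.2, "SAD": -0.5},
    "emotional stability": {"CALM": 0.7, "FEAR": -0.4, "ANGRY": -0.3},
    # ... remaining six traits (intelligence, attractiveness, etc.) omitted
}

def personality_traits(emotions: dict[str, float]) -> dict[str, float]:
    """Combine emotion confidences into trait scores."""
    return {trait: sum(w * emotions.get(e, 0.0) for e, w in weights.items())
            for trait, weights in TRAIT_WEIGHTS.items()}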

 

Based on the detected personality traits, we assigned participants to one of eight virtual avatars derived from monsters described in the Chinese classic Shan Hai Jing (Classic of Mountains and Seas), a compilation of mythic geography and myth (Chen Cheng jin et al. 2010). For example, participants with a very high "attractiveness" score would be assigned the Tian Hu, described in the literature as a shape-shifting tempter, while participants with a very high "egoism" score would be assigned the Tao Wu, described as a vicious and stubborn beast. Participants were allowed 3 minutes to "embody" the monster, as the tracking camera overlaid the virtual avatar over their reflection in the AR mirror. In addition, a soundscape was created for each individual participant based on their personality traits. The soundscape was interactive, encouraging participants to become the monster through a combination of kinaesthetic, visual and sonic senses.
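
Assuming assignment follows the highest-scoring trait, the sketch below illustrates the avatar lookup. Only the two pairings described above come from the work; the rest of the mapping is left as a placeholder.

# Sketch of avatar assignment by dominant trait (illustrative only).
TRAIT_TO_MONSTER = {
    "attractiveness": "Tian Hu",   # shape-shifting tempter
    "egoism": "Tao Wu",            # vicious and stubborn beast
    # ... one Shan Hai Jing monster for each of the remaining six traits
}

def assign_avatar(traits: dict[str, float]) -> str:
    dominant = max(traits, key=traits.get)       # highest-scoring trait
    return TRAIT_TO_MONSTER.get(dominant, "Unknown monster")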


Avatar designs by Henry Lai-Pyne inspired by the Shan Hai Jing

Although the participatory experience was individual, allowing only one person in the portal at any one time, the virtual avatar embodied by the participant was projected onto a large screen, allowing all visitors in the space to see and hear the virtual monster. This not only served a user interface function (people could see and hear other participants' virtual embodiment and were therefore aware of the interactivity) but also provided a social experience where visitors could discover other virtual avatars, listen to different soundscapes and compare their virtually-constructed self to others.

System Design

The main interface for the biometric facial analysis, body tracking and virtual avatar was built in Unity, a gaming engine, which interfaced with Amazon Rekognition’s API as well as the Azure Kinect camera.  

 

Body presence and movement information were processed in Unity and output as OSC messages using a LAN Wi-Fi network. The OSC messages were received by a separate computer running MaxMSP, which controlled the digital signal processing for the soundscape. The OSC messages were also received by an ETC Element lighting console to automate different lighting states based on the stages of the participatory design as well as the avatar assigned. All user interface, projector, sound and lighting changes were designed to be fully automated, negating any requirement for manual handling of the installation. Each individual participatory experience was designed to run for approximately 5 minutes or until no body was detected in the portal.
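
The sketch below illustrates this messaging pattern using python-osc as a stand-in for the Unity sender and the MaxMSP/lighting receivers; the addresses, ports and message layout are placeholders rather than the installation's exact scheme.

# Minimal sketch of the OSC messaging pattern over the LAN (illustrative only).
from pythonosc.udp_client import SimpleUDPClient
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

# Sender side (in Guài this role was played by Unity):
client = SimpleUDPClient("192.168.1.20", 9000)            # MaxMSP machine on the LAN
client.send_message("/guai/presence", 1)                  # body detected in the portal
client.send_message("/guai/avatar", 3)                    # which of the 8 avatars
client.send_message("/guai/hand/velocity", [0.42, 0.37])  # left/right hand speed

# Receiver side (in Guài this role was played by MaxMSP and the lighting console):
def on_avatar(address, avatar_id):
    print(f"{address}: switch soundscape/lighting state to avatar {avatar_id}")

dispatcher = Dispatcher()
dispatcher.map("/guai/avatar", on_avatar)
server = BlockingOSCUDPServer(("0.0.0.0", 9000), dispatcher)
# server.serve_forever()  # uncomment to listen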


System diagram for Guài. Note this diagram does not include the interactive lighting design system

Sound and Interactive Design

The sound design consisted of two elements – a narrative element and an interactive element.

 

Narrative Element

 

The narrative element had the purpose of welcoming the participant and setting up the narrative context for the work. A neural voice from Microsoft Azure Cognitive Services was used, which was then put through a vocoder effect to achieve a clichéd "robotic" voice.
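
As a rough sketch of this step, the snippet below renders a line of neural speech with the Azure Speech SDK in Python. The key, region, voice name and script text are placeholders; the vocoder effect was applied separately in the sound design.

# Sketch of generating a neural voice line with the Azure Speech SDK (illustrative only).
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="australiaeast")
speech_config.speech_synthesis_voice_name = "en-AU-NatashaNeural"  # placeholder voice

audio_config = speechsdk.audio.AudioOutputConfig(filename="welcome.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                          audio_config=audio_config)

# Placeholder script, not the installation's actual text.
synthesizer.speak_text_async("Welcome. Please stand still while we analyse you.").get()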

 

In addition to the neural voice, an underscore was inserted to create a 'scanning' effect over the body and a sonic build-up prior to the avatar reveal. Although the biometric analysis was almost instantaneous once the participant entered the portal, we decided to prolong the analysis sequence on the AR mirror to create suspense and to give the participant time to absorb and reflect on their analysis.

 

At the end of the individual experience, the neural voice farewelled the participant and signalled that it was time to exit the portal.

 

Interactive Element

 

The interactive element commenced as soon as the avatar was revealed, consisting of 3 layers:

  • an ambient drone or rhythmic layer;

  • a textural layer that was highly responsive to movement; and

  • a ‘whoosh’ layer that was triggered by intense movement.

 

In designing the work, our original intention was to create a unique sound design for each participant based on their biometric analysis. The sound design had to be semantically related to the analysis – for example, someone who scored very high on the "happiness" metric would have an upbeat sound design, and someone who scored very low on the "emotional stability" metric would have a chaotic sound. Further, the sound had to change with the participant's movement to support the embodied virtual experience.

 

To achieve some control over the semantic relationship of the sound to the biometric analysis, I created a database of sound samples to form the basis of the sound design. These samples ranged from acoustic elements such as recorded instruments and environmental sounds, to rhythmic beats and synthesised textures. Multiple samples were layered over each other for any given participant, providing a greater number of possible combinations. Given the complexity of mapping eight different personality metrics to sound, I implemented a machine learning algorithm using Wekinator (Fiebrink and Cook 2010) to map the metrics to various parameters of the sound sampler. Using a "mapping through listening" approach (Caramiaux et al. 2014), I randomised sound parameters, listened to their musical characteristics and adjusted the personality metrics to 'match' the resulting sound. A K-nearest-neighbour (KNN, K=3) classification algorithm was then used to predict future mappings from participants' metrics. The diagram below shows the input and output connections between the biometric analysis (8 inputs) and sound design (26 outputs).


Screenshot from Wekinator’s input/output connection editor for Guài

Overall, the outputs controlled the sound samples chosen for each layer, the number of samples, whether or not the samples were looped, the range of possible MIDI notes to play each sample (equating to the amount of pitchshift on the sample) and the probability that a pitchshift would occur.
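
Conceptually, the Wekinator model behaves like the sketch below: a K-nearest-neighbour classifier (K=3) trained on hand-matched examples, taking the eight personality metrics as input and predicting the 26 sound-parameter outputs. The training rows shown are invented placeholders; the real examples were built through the listening process described above.

# Conceptual sketch of the KNN mapping (illustrative placeholder data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each training example: 8 trait scores -> 26 discrete sampler settings.
X_train = np.random.rand(20, 8)                   # placeholder trait vectors
y_train = np.random.randint(0, 4, size=(20, 26))  # placeholder parameter classes

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

participant_traits = np.random.rand(1, 8)                 # from the biometric analysis
sound_parameters = model.predict(participant_traits)[0]   # 26 values sent to the sampler
print(sound_parameters)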


Section of MaxMSP patch showing the ML outputs from Wekinator being used to set various sound parameters based on each participant's biometric analysis.

For the interactivity with participant movement, I implemented the following mappings:

 

  • The velocity of the left and right hands respectively controlled the (MIDI) velocity of the textural layer; the faster the hand movement, the louder the textural sound. The rate at which the textures were triggered was also mapped to the x-y angle and x-z angle, to provide greater variation in the texture;

  • Overall movement intensity above a certain threshold triggered a large "whoosh" sound to emphasise the movement. This would also randomly silence the ambient drone/rhythmic layer or textural layer, to provide variations to the thickness of the sound over time; and

  • The distance between the hands controlled the horizontal (azimuth) spatialisation of the ambient drone/rhythmic layer. The horizontal and vertical (azimuth and elevation) spatialisation of the other layers were controlled by the x-y angle and x-z angle.

We decided to utilise dynamical metrics such as velocity and intensity rather than simple coordinates so that we could cater for different heights and participant body types (particularly as we were expecting children in our target audience), and to ensure consistency when participants moved around the portal (which had a diameter of approximately 2 metres for movement). The x-y angle and x-z angle were also used as they could generalise over a wider range of movement.
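
The sketch below illustrates how these movement features can be derived from successive joint positions; the names and thresholds are illustrative rather than the values used in the installation.

# Sketch of movement features for the interactive layer (illustrative only).
import math

def velocity(prev, curr, dt):
    """Speed of a joint (m/s) between two frames."""
    return math.dist(prev, curr) / dt

def hand_features(left_prev, left, right_prev, right, dt, whoosh_threshold=1.5):
    left_v = velocity(left_prev, left, dt)     # drives loudness of textural layer
    right_v = velocity(right_prev, right, dt)
    intensity = left_v + right_v               # overall movement intensity
    hand_distance = math.dist(left, right)     # drives azimuth of drone/rhythmic layer
    return {
        "texture_velocity": (left_v, right_v),
        "whoosh": intensity > whoosh_threshold,  # trigger the 'whoosh' layer
        "azimuth": hand_distance,
    }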

The following are videos of 3 participant experiences.

Evaluation

Guài was a complex work that attempted to relate music generation to biometric facial analysis, embodied interaction and AI ethics in a playful, immersive environment. The interactive sound design helped participants to embody their virtual avatar, as well as providing an immersive shared experience for the audience in the space. The projection of the virtual environment onto the large screen, together with the sound design, allowed the participant's embodiment to be shared with the audience, expanding the interactive experience from a one-to-one experience to a collective one. The networked connections between the gaming engine, sound engine, AR mirror, and lighting system reflected the collaborative networks between the participating artists, each individually contributing skills to create a larger, more complex work than would have been individually possible.

Review of Guài in The Age

Significance to Research

This project allowed me to build on research questions brought up in previous projects regarding digital ethics, surveillance technology and virtual embodiment. It also enabled further exploration of modularity in sound design commenced in Scrape Elegy, allowing for customisation of sound to each individual participant while maintaining an overall consistent structure. It further cemented an undisciplinary approach to my practice involving many different fields of expertise to create a multi-modal work.

Credits

Creative concept: Monica Lim, Mindy Meng Wang

Sound design: Monica Lim

Coding: Qiushi Zhou

Avatar design: Henry Lai-Pyne

Lighting design: Giovanna Yate Gonzalez

Set design: Savanna Wegman

Supported by: Melbourne Fringe, Footscray Community Arts, School of Computing and Information Systems, University of Melbourne, Creative Australia, City of Melbourne
