
Echo Chamber (2023)

Project Summary

Echo Chamber was part of a broader exhibition entitled Simulacra at the Grainger Museum from 8 to 11 November 2023. The exhibition was the culmination of a creative research residency in which I worked with students from the Faculty of Fine Arts and Music, the School of Dance and the School of Computing and Information Systems to speculate about the role of AI in the Creative Arts and its intersection with the museum collection. Jean Baudrillard’s philosophical conception of simulacra (1994), concerning the relationship between reality, symbols and society, was particularly resonant and was adopted as the name of our public exhibition. Although generally applied to media culture, Baudrillard’s claim that contemporary society has replaced all reality with symbols and signs seemed even more relevant to a world in which AI is used to generate fake news, fake identities and fake art. These simulacra do not merely imitate or replace reality but precede it – Baudrillard’s "precession of simulacra".

 

Echo Chamber was a participatory experience in which visitors played a 10-second melody on an acoustic piano. MusicGen, a generative music AI developed by Meta, was used to generate multiple solo piano or orchestral versions of the melody, which were layered, looped and played back through a 16-channel speaker configuration.

23185_0609.jpg

Part of the 16-channel speaker configuration at the Grainger Museum.

Percy Grainger’s idiosyncratic musical instructions from his various scores were randomly chosen as a text prompt for the AI (e.g. "impulsively and very feelingly", "flowingly, and rather wayward in time").

 

We chose the acoustic piano as the main interface and source of sound because it is an instrument that many people can play, albeit at different levels of proficiency. Regardless of proficiency, its percussive nature meant that most people could produce sound, with some control over dynamics and rhythm, simply by depressing the keys. Further, the piano had a direct link to Percy Grainger. Lastly, its role as a symbol and tool of colonisation (Moffat 2009) added another nuance to the work and its questioning of generative AI as a new form of colonisation of culture.

 

Percy Grainger was one of the most famous and successful pianists of his time. The museum collection houses some of his pianos, including pianos he experimented with, such as the Butterfly Piano.

In conceptualising this work, we were inspired by Grainger’s pioneering Free Music experiments to investigate the cutting edge of contemporary technological development in sound – generative AI. We began with the question of whether generative AI could assist in the creation of "free music", or music unbound by the conventions of its time.
 

We found that in the case of Meta’s MusicGen API, the music that was generated tended to fall into conventional note pitches, melodies, harmonies and genres – perhaps due to the prevalence of such conventions in the training data, which consisted of 20,000 hours of audio, including 10,000 hours from a proprietary dataset and the rest from the Shutterstock and Pond5 music datasets (Copet et al. 2023). Indeed, even where atonal or microtonal input was fed into the AI, the resulting music tended to be more melodically and harmonically conventional and to stay within the Western equal-tempered scale.

 

We used this limitation of MusicGen to create an evolving soundscape where the participant prompt was used to generate a melody, which was then used to generate the next melody and so on (up to 6 generations due to a 4-minute time limit for each participant). Because of the tendency of the AI to stay within the same harmonic and melodic structure from generation to generation, layering multiple generations over each other was possible while still sounding musically coherent.

 

The experience allowed visitors to hear multiple simulated versions of themselves, alluding to the way in which generative AI is a simulation of human artistic output. Using a participatory approach, visitors were encouraged to think about generative AI’s risks, not only in the protection of copyright and intellectual property, but to a world where artistic conventions and popular genres become ever more amplified at the expense of less dominant cultures and practices – a network of simulations and simulacra.

System Design

Our main interface was built in Python. It was used to capture the 10-second participant melody through an Ishcell piano contact mic attached to an acoustic grand piano; the recording was then sent to the pre-trained MusicGen model.
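As an illustration of this capture step, a minimal sketch is given below, assuming the sounddevice and soundfile libraries and the system’s default input device; the exhibited code ran through the installation’s audio interface and may have differed in detail.

```python
# Minimal sketch of the 10-second capture step, assuming the sounddevice and
# soundfile libraries and the system's default input device (an assumption;
# the installation used an Ishcell contact mic via an audio interface).
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 44100   # assumed capture rate
DURATION = 10         # seconds, as used in Echo Chamber

def record_participant(path="participant.wav"):
    """Record a 10-second mono clip and save it as a WAV file."""
    audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()                       # block until recording has finished
    sf.write(path, audio, SAMPLE_RATE)
    return path
```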

 

MusicGen is a single-stage transformer language model. It can generate music from text prompts alone or conditioned on an uploaded audio clip, and its API allows developers to interface with the model for custom applications. In Echo Chamber, we passed the participant melody into MusicGen together with a text prompt which specified the instrumentation and musical instructions derived from Percy Grainger’s scores. Below are the possible prompts, which were randomised for each AI generation. While we experimented with using only solo piano instrumentation for each generation, we found that randomly requesting orchestral or other forms of instrumentation gave more variety and interest to the audio journey across different participants.

Figure32.png

Text prompts in the Python code for Echo Chamber
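The full prompt list is shown in the figure above; as an illustration only, the sketch below shows how such randomised prompts could be assembled, using just the two Grainger instructions quoted earlier and the instrumentation options mentioned in the text (the prompt template and variable names are assumptions, not the exhibited code).

```python
import random

# Illustrative subset only: the exhibited code drew on a longer list of
# Percy Grainger's score instructions.
GRAINGER_INSTRUCTIONS = [
    "impulsively and very feelingly",
    "flowingly, and rather wayward in time",
]

# Instrumentation options mentioned in the text, randomised per generation.
INSTRUMENTATIONS = ["solo piano", "orchestral"]

def random_prompt():
    """Combine a random instrumentation with a random Grainger instruction."""
    return f"{random.choice(INSTRUMENTATIONS)} music, played {random.choice(GRAINGER_INSTRUCTIONS)}"
```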

We requested an output duration of 20 seconds (the default is 15 seconds) with a temperature of 1. The code function for the AI generation is set out below.

 

The temperature is a parameter which controls the level of uncertainty or randomness in the probability distribution of the machine learning output. A higher temperature results in more diversity but also a greater probability of error, whilst a lower temperature is more conservative. We experimented with various temperature settings and found that while temperatures below 1 produced much more predictable output with greater similarity to the original melody, there were fewer creative results that could generate an element of surprise. Temperatures above 1, on the other hand, tended to produce results that often bore very little recognisable resemblance to the original melody.

Figure33.png

Python code function to send participant melody to the MusicGen model
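The exhibited function is shown in the figure above; as a point of comparison, a call with the same parameters using Meta’s open-source audiocraft interface to the melody-conditioned MusicGen model might look like the sketch below (the function name, model checkpoint and file handling are assumptions, not the exhibited code).

```python
# Sketch of a melody-conditioned MusicGen call via Meta's audiocraft library,
# using the parameters described in the text (20-second output, temperature 1).
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=20, temperature=1.0)

def generate_from_melody(melody_path, prompt, out_stem="generation"):
    """Generate a 20-second sample conditioned on a melody file and a text prompt."""
    melody, sr = torchaudio.load(melody_path)      # [channels, samples]
    wav = model.generate_with_chroma(
        descriptions=[prompt],
        melody_wavs=melody,
        melody_sample_rate=sr,
    )
    # Writes <out_stem>.wav at the model's 32 kHz sample rate, loudness-normalised.
    audio_write(out_stem, wav[0].cpu(), model.sample_rate, strategy="loudness")
    return out_stem + ".wav"
```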

Once the system produced a generated audio sample, that sample was fed back into the model with another random text prompt for the next generation, creating new output from synthetic input. We limited each participant experience to 4 minutes, comprising the original participant sample plus 5 generations, each evolving from the previous one.
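Using the hypothetical helpers from the sketches above, the feedback chain for a single participant could be expressed roughly as follows.

```python
# Sketch of the generation chain: each new sample is conditioned on the previous
# one with a fresh random prompt (helper names come from the sketches above).
def run_generation_chain(participant_wav, n_generations=5):
    current = participant_wav
    for _ in range(n_generations):
        current = generate_from_melody(current, random_prompt(),
                                       out_stem="latest_generation")
        # In the installation, MaxMSP was cued here to pick up the new file.
    return current
```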

​

One of the challenges we faced in using MusicGen for real-time applications was the time required to generate each audio sample. Using cloud-based GPUs, the time required for each generation was too long to be practicable (approximately 40 seconds to generate a 20-second sample), and relying on the cloud also raised concerns about internet dependence and reliability. We therefore decided to run the model on a local computer with an RTX 3090 GPU, which reduced the processing time for each generation to approximately 20-25 seconds. Given that it still took longer to generate an audio sample than the length of the sample itself, we had to use strategies such as looping and layering to provide a continuous 4-minute audio experience for each participant.

 

The audio samples generated by MusicGen were stored in a local file, which was accessed by MaxMSP. A MaxMSP patch was used to loop and layer the generated files and spatialise them across 16 speakers. The main Python file communicated with MaxMSP via OSC messages to cue start and stop times for each relevant event (i.e. when a participant completed a recorded input, when each generation was completed and when the 4-minute duration had elapsed).
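A minimal sketch of this cueing, assuming the python-osc library; the OSC addresses and port shown are illustrative assumptions, not those of the exhibited patch.

```python
# Sketch of cue messages from Python to the MaxMSP patch via OSC, assuming the
# python-osc library. The port and OSC addresses are illustrative assumptions.
from pythonosc.udp_client import SimpleUDPClient

max_client = SimpleUDPClient("127.0.0.1", 7400)   # e.g. a [udpreceive 7400] in Max

def cue_recording_done():
    max_client.send_message("/echo/recording_done", 1)

def cue_generation_done(generation_index):
    max_client.send_message("/echo/generation_done", generation_index)

def cue_session_end():
    max_client.send_message("/echo/session_end", 1)
```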

EchoChamber.png

Systems diagram for Echo Chamber

The user interface was built in Processing, which communicated with Python via a User Datagram Protocol (UDP) socket; a sketch of the Python side of this link follows the list below. The user interface consisted of five sections:

​

  • Ready – with simple instructions to the participant regarding the participatory recording process;

  • Prepare – a 3-second countdown to the recording;

  • Recording – a 10-second countdown;

  • Generating – showing the relevant text prompt used for the generation; and

  • Performing – a 4-minute countdown till the system reset.
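Below is a minimal sketch of the Python side of the UDP link to the Processing interface; the port number and message format are illustrative assumptions.

```python
# Sketch of state messages from Python to the Processing UI over a UDP socket.
# The port number and message format are illustrative assumptions.
import socket

UI_ADDRESS = ("127.0.0.1", 6000)
ui_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

STATES = ("ready", "prepare", "recording", "generating", "performing")

def set_ui_state(state, payload=""):
    """Send the current section name (plus optional text, e.g. the prompt) to Processing."""
    assert state in STATES
    ui_socket.sendto(f"{state}|{payload}".encode("utf-8"), UI_ADDRESS)

# Example: display the text prompt during the Generating section.
# set_ui_state("generating", "solo piano music, played impulsively and very feelingly")
```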

Transcorporeality

In designing the physical layout of the installation, we deliberately placed the speakers in a different room to the piano, requiring the participant to walk through a hallway approximately 10 metres long to be immersed in the sound. The physical distance created a sense of rupture, or displacement, between real and unreal, simulated and simulation.

 

The number of speakers used was, strictly speaking, not required by the sound design; we could easily have created a similar sonic effect with 4 speakers. However, the physical form of the speakers was used as a metaphor to give the effect of a chorus of people/agents – an echo chamber created by algorithmic curation, feedback networks and social media "like" culture.

Sound Design

In designing the way the AI-generated samples and the original recording were played back across the 4-minute audio journey, we built in sufficient flexibility to account for variable AI generation times (usually between 20 and 25 seconds) as well as unpredictability in the AI output. For example, some AI samples were generated with a very abrupt cut-off at the 20-second mark, and some with silence towards the end. Therefore, we used extensive looping and layering of audio samples to mask unexpected silences. We also implemented a fade-out at the end of every loop.

 

As Python generated each AI sample, it would overwrite the existing file on a local drive with the new sample. This ensured that we never ran out of local storage and, for ethical reasons, that no participant’s output was retained. Two buffers were therefore created in MaxMSP – one for the original recording of the participant input, and one for the latest AI-generated sample. Throughout the 4-minute audio journey, these files were looped and played back at variable amplitudes. In addition, we implemented a randomised playback speed for the original recording so that it would play back at between 70% and 90% of the original speed, like a recurrent sonic memory reminding the participant of their original input. However, the original pitch was maintained to preserve its harmonic compatibility with the other layers.

 

In addition to the layering of the latest AI-generated sample and the recording, we implemented a delay line of 20 seconds, which added another layer: the previous AI-generated sample. We implemented a further delay of approximately 23 seconds on this layer (i.e. a total delay of 43 seconds), which played a fourth layer: the AI-generated sample from two generations earlier. This fourth layer was pitched down one octave to give a larger frequency range to the total sonic output.

​

The amplitudes of all 4 layers were controlled by line functions, which ensured that different layers faded in and out in prominence over the 4 minutes. However, the overall amplitude of all layers increased over the last 30 seconds, building to a crescendo before fading out and resetting for the next participant.

Figure35.png

Line functions to control amplitude of different sound layers over 4 minutes
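As an illustration of the idea behind these line functions (the actual envelopes lived in the MaxMSP patch), a single layer’s amplitude trajectory could be expressed as target/ramp-time breakpoints of the kind a line~ object accepts; the values below are assumptions, not those used in the exhibition.

```python
# Illustrative amplitude trajectory for one layer over the 4-minute journey,
# expressed as (target amplitude, ramp time in ms) pairs such as a line~ object
# accepts. The actual envelope values lived in the MaxMSP patch; these are assumptions.
LAYER_AMPLITUDE_BREAKPOINTS = [
    (0.0, 0),         # start silent
    (0.6, 10_000),    # fade in over the first 10 s
    (0.3, 120_000),   # recede in prominence through the middle section
    (0.9, 80_000),    # build towards the crescendo over the final stretch
    (0.0, 30_000),    # fade out over the last 30 s and reset for the next participant
]
```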

In addition, a simple circular pan spatialisation was implemented for all 4 layers so that they would independently move around the 16-speaker configuration. The rate of pan was controlled by a line function and increased towards the end of the 4-minute journey, creating a whirling vortex of sound.

Figure36.png

Line function to control rate of circular pan of all sound layers over 4 minutes
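To illustrate the technique (the exhibited spatialisation was implemented in MaxMSP, not Python), a simple equal-power circular pan across a ring of 16 speakers can be computed as follows; the gain law and ring layout are assumptions about the general approach, not the patch itself.

```python
# Illustrative equal-power circular pan across a ring of 16 speakers. The
# installation's spatialisation was done in the MaxMSP patch; this sketch only
# shows the general technique of crossfading between adjacent speakers.
import math

NUM_SPEAKERS = 16

def circular_pan_gains(angle):
    """Return per-speaker gains for a source at `angle` radians around the ring."""
    gains = [0.0] * NUM_SPEAKERS
    position = (angle % (2 * math.pi)) / (2 * math.pi) * NUM_SPEAKERS
    lower = int(position) % NUM_SPEAKERS          # speaker just behind the source
    upper = (lower + 1) % NUM_SPEAKERS            # next speaker around the ring
    frac = position - int(position)               # fraction of the way between them
    gains[lower] = math.cos(frac * math.pi / 2)   # equal-power crossfade
    gains[upper] = math.sin(frac * math.pi / 2)
    return gains

# Sweeping `angle` faster and faster over the 4 minutes (as the line function in
# the patch did for the pan rate) produces the whirling effect described above.
```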

Evaluation

Given the recent release of text-to-music AIs (MusicGen was released in June 2023), there was very little literature or prior work that we could refer to in designing an installation with this technology. As a community, we are still grappling with its impact, both positive and negative, on the music industry and society in general. In this regard, Echo Chamber is a pioneering work in the field of real-time participatory sound art using text-to-music generative AI.

 

A number of participants commented that the generative AI made their piano playing “sound better”. For others, it made them wonder if the generated music “belonged” to them. Through this embodied experience of playing, listening and being in the space, Echo Chamber allowed participants to self-discover their own opinions on AI-generated music, to evaluate its potential for creativity and for democratisation of music against the risks of homogenisation, artistic cliché and intellectual property plundering.

 

By placing the participant body as a central focus of the experience and utilising the physical space, acoustic instruments and physical objects, we elevated the work from an intellectual exercise to an immersive experience that participants could share with others as they listened and conversed about their own output and that of other participants.

Significance to Research

This project was part of a broader research residency and exhibition that required me to combine creative practice with teaching, supervising and leading a group of students from diverse technical backgrounds and academic levels. It helped to inform my discoveries about the critical need for undisciplinarity and institutional approaches to post-disciplinarity. It also sparked reflection on the potential dangers of a network culture, and how art can either serve to proselytise or problematise technology. Often, as with Echo Chamber, it sits on a slippery slide between both positions. Together with Scrape Elegy and Guài, it cemented the potential for participation to create ambiguous (and fun) experiences overlaying serious issues and, in turn, multi-layered meanings accessible at different levels.

Credits

Creative concept and sound design: Monica Lim

Coding: Bingqing Chen and Ying Sima

User interface design: Melanie Huang

Video documentation: Celeste de Clario and Patrick Telfer

Supported by: Grainger Museum, Fine Arts and Music Faculty Research Grant, University of Melbourne
