
Chapter 1

INTRODUCTION TO AURALIZATION

1.0 What is Auralization?
1.1 A bit of history
1.2 The current state of the art
1.3 Variables in an auralization system


1.0 What is Auralization?

Virtual Reality is one of the hottest buzzwords today in the electronics industry. Products of all varieties are claiming to be a part of this emerging technology. Unfortunately, much of the hype is just that - hype. While a great deal of time and money has been spent in the virtual arena, the fact remains that the enabling technologies are still in their infancy. We have yet to fully understand the underlying human perceptual systems, much less develop our own devices to fool them beyond simple effects such as conventional stereo audio.

Until recently, the focus of these efforts was in providing stereoscopic three-dimensional graphics to stimulate our sense of vision. This is understandable, considering the extent to which human perception relies on visual information. We use our eyes as the primary tool for exploring our world, and when presented with contradictory information it is our visual perceptions that take precedence in our mental processing [1].

After much effort by researchers, stereoscopic displays are now available which can provide some degree of representation of a virtual world. However, the early pioneers found that though visual cues are a predominant part of our perceptions, they alone are not sufficient to create believable worlds. Coupling a three-dimensional visual display with conventional stereo sounds presents a very unnatural experience for the user. As a result, the field of auralization - three-dimensional sound - was drawn to the forefront of audio.

There are a number of motivations for developing auralization systems in addition to the advent of virtual reality. True three-dimensional processing adds an auxiliary creative element to be manipulated by the commercial musician or record producer, providing a new realm of entertainment to explore. Auralization also allows end-users to benefit from the "cocktail party" effect, the brain's ability to use localization cues to isolate a single conversation from a multitude of similar sounds. This is very useful in designing systems where a user must monitor several communication channels at once. Applications include air traffic control, NASA mission control, and fighter pilot communications. Virtual audio displays built on auralization algorithms also show promise for use in general office environments as well as in applications for the visually impaired.

The newfound importance of this field is evidenced by the recent explosion of auralization publications, both technical and pedestrian. While only a few years ago pioneers in virtual audio had difficulty getting papers published [2], the past few years have seen numerous articles in the Journal of the Audio Engineering Society (JAES) and the Journal of the Acoustical Society of America (JASA), as well as various virtual reality, telepresence, and human factors publications. The October 1993 AES convention in New York was titled "Audio in the Age of Multimedia" and featured numerous workshops and panel discussions on auralization. In addition, the popular press has latched on to the terms "3-D" and "Virtual", producing a plethora of articles on various degrees of auralization processing available to the consumer and semi-professional musician. Unfortunately, many of these systems are more accurately classified as surround sound or enhanced stereo, rather than true auralization systems [3][4][5]. It is in the AES journal's recent special auralization issue [6] that Kleiner proposed a definition for the term at the center of this activity:

"Auralization is the process of rendering audible, by physical or mathematical modeling, the sound field of a source in space, in such a way as to simulate the binaural listening experience at a given position in the modeled space."

1.1 A bit of history

Some of the earliest research into the spatial perception of sounds was performed by Mills during the 1950s, determining the minimum audible angle by which a change in source position can be detected [7]. Investigation into the mechanisms and limits of human spatial auditory perception continued through the next several decades. Researchers such as Perrott explored minimum resolution angles for sources of different velocities and spectral content [8][9][10]. More recently, Makous and Middlebrooks have studied sources varying in two dimensions [11].

Through the years, a variety of means for recording three-dimensional soundfields have been developed. The Ambisonics system [12] attempted to record the three orthogonal velocity vectors and one absolute pressure at a given point in space, thus completely defining the soundfield at that point. A matrix network and equalizers converted the four-channel direct recording (A-format) to a different four-channel form (B-format), which could then be processed to produce one of the following: a steerable mono output, a stereo pair whose effective angle, vertical tilt, and rotation can be manipulated, or a quadraphonic set of outputs whose effective direction and tilt can be manipulated. While it found some success by providing additional control over a conventional stereo pair in multitrack recording, the need for a four-channel recording medium and a specialized playback environment limited the usefulness and commercial viability of this system for general use.

Dummy-head recording has demonstrated great success within the limitations of a two-channel, stereo-compatible format which requires no unusual playback apparatus. Microphones in a carefully constructed artificial head record sound from the location of the "eardrums" onto a standard two-track medium for playback through headphones. These systems can provide convincing reconstructions of a number of auditory environments, though results vary based on the accuracy of the artificial ear and the correlation to an individual listener's head and ear shape. They are also limited to playback through headphones. The primary failing of these systems is that they are only capable of recollection, not synthesis. It is not possible for the user to position a pre-recorded sound at an arbitrary position in space; items are locked into their positions as they were recorded.


1.2 The current state of the art

It is only recently that computer processing power has reached a level enabling us to even consider synthesis of three-dimensional soundfields. Unfortunately, the auditory cues used by the brain are fairly fragile with respect to listening environment. Because loudspeaker playback setups vary widely both in physical arrangement and design criteria such as time-alignment, most auralization systems are designed for headphone playback. With headphones, the transducer location is fixed and alignment between multiple drivers is usually no longer a concern. In general, headphones exhibit distortions an order of magnitude smaller than loudspeakers and avoid the difficulty of interaction with widely varying environments.

Wightman and Kistler provided groundbreaking data and validation for free-field simulation over headphones [13][14]. Through extensive experimentation, they recorded the head-related transfer functions (HRTFs) of numerous subjects for an array of sound source locations. The HRTF is one of three cues used by the brain to decipher location information, and is the primary cue used to extract sound source elevation. It is modeled as a filter which accounts for the effect of the reflections off the pinnae (outer ear) and shoulder, as well as the shadowing effect of the head itself. The HRTFs acquired during their research are still widely used today. Wightman and Kistler's "SDO" HRTF provided a basis for the set utilized in the programs accompanying this research [15].

Begault recently identified a number of challenges to the successful implementation of three-dimensional audio systems [16]. Externalization (distinguishing a source as outside the listener's head) is a problem Begault himself has pursued [17][18]. Often, though the listener is able to perceive azimuth and elevation differences, the sound appears to be very close to, or inside, the head. The perception of distance is a difficult illusion to construct. One key to successful externalization is an understanding of the role of room reflections, as noted by Hartmann [19][20]. The human brain utilizes information contained in the first few reflections from a reverberant environment to contribute to a perception of space.

Another obstacle to auralization is user-dependence of the HRTFs. There is a great deal of debate on whether better results are obtained with a "good listener's" HRTF, a "composite average" HRTF, or a "generalized" HRTF such as those developed by Wightman and Kistler [21]. A "good listener" HRTF is the measured HRTF of an individual who exhibits above-average localization ability in free-field conditions. Some listeners actually perform better with such a set than with their own measured HRTF [14]. A "composite average" is simply the HRTFs of many individuals averaged to create a single, representative HRTF. The problem with this approach is that it averages not only the common but the unique filter traits as well. In an effort to extract only the common characteristics across HRTF variation, Wightman and Kistler have performed a principal component analysis of a large number of measured HRTFs, resulting in a small set of basis functions from which HRTFs can be constructed with a high degree of accuracy. Because these functions encompass the shared features, any deviation is a result of the individual's own variation - typically less than 5%. Despite these efforts, the fact remains that HRTF compatibility varies widely in the general population [1].
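As an illustration of the basis-function idea (not the actual procedure of [21]), a principal component analysis of a matrix of measured HRTF responses can be sketched in Python (using numpy) as follows; the matrix layout and the number of retained components are assumptions made only for this sketch.

    import numpy as np

    def hrtf_basis(hrtf_matrix, n_components=5):
        # hrtf_matrix: one measured HRTF response per row, frequency bins as
        # columns (a hypothetical layout chosen for illustration).
        mean_hrtf = hrtf_matrix.mean(axis=0)
        centered = hrtf_matrix - mean_hrtf
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:n_components]        # the shared basis functions
        weights = centered @ basis.T     # individual weights on each function
        return mean_hrtf, basis, weights

    # An individual HRTF i is then approximated as
    #     mean_hrtf + weights[i] @ basis
    # so individual variation is carried entirely by the small weight vector.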

Currently, most applied research depends on the Convolvotron, developed by Foster and Wenzel at Crystal River Engineering [2]. It is the principal commercial product available for serious auralization work. It combines a control CPU and a DSP convolution engine to process audio signals using a library of measured HRTFs. Up to four sources may be auralized at once. A number of smaller auralization products are also under development at CRE, with development goals similar to those of this project: to extract more functionality from limited computing resources (constrained primarily by the inter-sample time period) through the application of superior algorithms.



Figure 1.1: An auralization system

1.3 Variables in an auralization system

In an auralization system, there are many parameters which may affect the listener's perception of the sound source. In order to address these issues and provide compensation, it is first necessary to identify them. In its simplest form, an auralization system may be represented as four steps: source recording, HRTF recording (or synthesis), convolution, and playback, as shown in Figure 1.1. Each of these has a set of parameters which must be controlled and defined to achieve accurate results.
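To make the chain concrete, the following is a minimal Python sketch (using numpy) of the four-step path of Figure 1.1 for a single source; the function and gain names are hypothetical placeholders, not the implementation developed in this work.

    import numpy as np

    def auralize(source, hrtf_left, hrtf_right, record_gain, playback_gain):
        # Minimal single-source auralization path: a recorded (or synthesized)
        # source is convolved with the left- and right-ear HRTF impulse
        # responses, then scaled for playback.  record_gain and playback_gain
        # stand in for the SPL-to-digital and digital-to-SPL conversions
        # discussed below.
        x = source * record_gain                    # source recording / level
        left = np.convolve(x, hrtf_left)            # convolution stage
        right = np.convolve(x, hrtf_right)
        return left * playback_gain, right * playback_gain   # playback stage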

Source recording:

Distance
The distance at which the source was recorded influences the relative SPL as well as other parameters. Sources recorded at different distances present a difficulty during playback because they retain that discrepancy in depth. A reference distance of one meter is recommended as a standard. For directly synthesized sources (i.e., from a drum machine), the level should be set to correspond to the SPL of a similar natural source at one meter.
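As an illustration, a recording made at some other distance can be rescaled to the one-meter standard under the assumption of free-field point-source (1/r) attenuation; this simple model is an assumption made for the sketch below, not a general prescription.

    def normalize_to_one_meter(samples, recorded_distance_m):
        # Scale a recording made at recorded_distance_m so its level matches
        # the same source captured at the 1 m reference distance, assuming
        # free-field point-source (1/r) attenuation.
        # Equivalent gain in dB: 20*log10(recorded_distance_m / 1.0)
        return samples * recorded_distance_m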

Level
Here, level is defined as the conversion from sound pressure level (SPL) to digitally represented values (-32768...+32767 for a 16-bit system). It is necessary to know the conversion factor so that it may be reversed precisely for playback. Without this knowledge, a sound may be perceived as "too close" if the playback level is higher, or "too far away" if it is lower.
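A sketch of such a conversion follows; the calibration constant (counts per pascal) is an assumed example value, chosen only to show that the playback stage must apply the exact inverse of the recording stage.

    import numpy as np

    # Assumed example calibration: the recording chain maps 1 Pa (94 dB SPL)
    # to 10000 counts on the 16-bit scale (-32768...+32767).
    COUNTS_PER_PASCAL = 10000.0

    def pressure_to_counts(pressure_pa):
        # SPL-to-digital conversion applied at recording time.
        return np.clip(np.round(pressure_pa * COUNTS_PER_PASCAL), -32768, 32767)

    def counts_to_pressure(counts):
        # The exact inverse, applied at playback so perceived level is preserved.
        return counts / COUNTS_PER_PASCAL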

Microphone response
The microphone used for recording the source will exhibit a characteristic frequency and phase response, which may be directional as well.

Room response
The environment in which the recording is made will affect the recorded sound. Room reflections and resonant modes will force an undesired sense of environment onto the source recording. This can be avoided with direct recording of synthesized signals, or by performing recording in an anechoic chamber.

A/D converter response
The converters necessary to move the analog signal into the digital domain for processing will also have a characteristic frequency and phase response. Fortunately, most modern converters utilize oversampling to reduce the filter constraints, resulting in passbands which are very flat across the spectrum and also exhibit linear phase. Consequently, the error added here is negligible.

HRTF recording/synthesis:

Distance
The concerns presented for source recording apply here as well, with a twist: because the HRTFs are almost always recorded in pairs, the distance reference is to the center of the head. This is a problem, since the distance computation for Interaural Level Differences (ILDs) and Interaural Time Differences (ITDs) is referenced to the appropriate ear rather than the center of the head. This can be compensated for in software, though a more accurate result would be accomplished by measuring the HRTFs for each ear individually, with the source located at a one-meter radius from the ear. This would generate HRTFs consistent with the ILDs and ITDs, and is discussed in more detail in Chapter 3.
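The geometry involved can be sketched as follows: given a source position specified by its distance from the head center and its azimuth, the path length to each ear is computed directly, the ITD follows from the path-length difference, and a first-order ILD from inverse-distance attenuation. The straight-line (no head-shadow) model and the 8.75 cm head radius are assumptions made for illustration only.

    import numpy as np

    HEAD_RADIUS_M = 0.0875        # typical value, assumed for illustration
    SPEED_OF_SOUND = 343.0        # m/s

    def ear_referenced_cues(distance_m, azimuth_deg):
        # Source position relative to the head center (x = right, y = front).
        az = np.radians(azimuth_deg)
        src = np.array([distance_m * np.sin(az), distance_m * np.cos(az)])
        left_ear = np.array([-HEAD_RADIUS_M, 0.0])
        right_ear = np.array([HEAD_RADIUS_M, 0.0])
        # Distances referenced to each ear, not to the head center.
        d_left = np.linalg.norm(src - left_ear)
        d_right = np.linalg.norm(src - right_ear)
        itd_s = (d_left - d_right) / SPEED_OF_SOUND    # path-length difference
        ild_db = 20.0 * np.log10(d_left / d_right)     # distance term only
        return d_left, d_right, itd_s, ild_db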

Level
Again, the level of the impulse response must be specified. A sound source recorded and replayed through systems with identical gain should have the same perceived level as a natural source at the specified position. For frequency domain filtering, the "zero dB" level must be established.

Impulse non-idealities (recording)
When HRTFs are recorded by playing an "impulse" through a speaker at various positions [13], the fact that this stimulus is non-ideal must be acknowledged. The spectral content of this impulse signal will be reflected in the HRTFs generated by it. If the impulse source is significantly non-ideal, a compensation filter should be applied to the HRTFs to remove some of the effects of the stimulus characteristics. Provided the deviations in the stimulus are consistent and well-defined, this does not present a significant obstacle.
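One possible form of such a compensation is sketched below: the measured response is deconvolved by the stimulus spectrum using a regularized spectral division. The regularization constant is chosen arbitrarily for illustration.

    import numpy as np

    def compensate_hrtf(measured_ir, stimulus_ir, reg=1e-3):
        # Divide the measured HRTF spectrum by the (non-ideal) stimulus
        # spectrum; the small regularization term keeps near-zero stimulus
        # bins from producing excessive boost.
        n = len(measured_ir) + len(stimulus_ir) - 1
        H = np.fft.rfft(measured_ir, n)
        S = np.fft.rfft(stimulus_ir, n)
        compensated = H * np.conj(S) / (np.abs(S) ** 2 + reg)
        return np.fft.irfft(compensated, n)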

Microphone response (recording), room response (recording), A/D converter response
These are identical to corresponding concerns for source recording.

Individuality of HRTFs
The unique nature of each individual's HRTF presents a considerable stumbling block to the implementation of systems for general use. The frequency and phase characteristics of the HRTF vary from person to person, much like fingerprints. Since measuring each listener's HRTFs is not currently practical, alternate solutions must be found, based on a common set of HRTFs. Methods for realizing this are discussed in Chapter 4.

Convolution:

Windowing and phase effects (fast convolution)
When HRTF filtering is performed in the frequency domain, there are two main concerns. First, it is erroneous to implement the HRTF as a magnitude-only filter, as phase effects are critical for successful implementation [22]. The other matter is windowing effects resulting from the segmentation of the audio stream. Naive implementations using a rectangular (no) window will result in added clicks, pops, and discontinuities in the output stream. The use of a Hamming or Blackman-Harris window and the proper amount of overlap can greatly reduce these artifacts [23].
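The sketch below illustrates one such approach, a windowed overlap-add implementation: blocks of the input are windowed, zero-padded, multiplied by the complex (magnitude and phase) HRTF spectrum, and overlap-added. A periodic Hann window at 50% overlap is used here because its shifted copies sum to a constant; this particular window choice and the block size are assumptions for illustration, not the parameters used in this work.

    import numpy as np

    def wola_hrtf_filter(x, hrtf_ir, block=512):
        # Windowed overlap-add convolution of a mono signal with one HRTF
        # impulse response.  The same structure allows the HRTF to be swapped
        # between blocks without abrupt discontinuities in the output.
        L = len(hrtf_ir)
        nfft = 1 << int(np.ceil(np.log2(block + L - 1)))  # room for linear convolution
        H = np.fft.rfft(hrtf_ir, nfft)                    # complex: keeps the phase
        win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(block) / block)  # periodic Hann
        hop = block // 2                                  # 50% overlap
        y = np.zeros(len(x) + nfft)
        for start in range(0, len(x) - block + 1, hop):
            seg = x[start:start + block] * win
            Y = np.fft.rfft(seg, nfft) * H                # magnitude AND phase applied
            y[start:start + nfft] += np.fft.irfft(Y, nfft)
        return y[:len(x) + L - 1]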

Linear phase response (direct convolution)
When utilizing FIRs for direct convolution, phase response is often overlooked since most classical design methodologies assume linear phase is desired. Using directly measured impulse responses instead of synthesizing an FIR from a frequency magnitude plot eliminates the problem of attempting to reconstruct phase variations during synthesis.

Playback:

Level
Symmetric to the need for recording level specification is the output gain; this should be a fixed gain that exactly reverses the conversion used in recording. The user should not have a volume control! If the system is engineered for accurate reproduction, the digital-level-to-SPL conversion must complement the SPL-to-digital-level conversion executed earlier.

D/A converter response
Similar to the concern for A/D response, this too has become less of an issue with the advent of high-quality parts which use oversampling and digital filtering.

Amplifier response
Non-linearities and frequency dependence in the amplifier must be accounted for and compensated. Again, problems can be drastically reduced by using high quality components.

Headphone response
The single-driver construction of most headphones leads to some frequency dependence, as the 10-octave range of human hearing is beyond the flat-response capabilities of most speaker elements. There will be some rolloff at low frequencies (typically below 40 Hz) as well as at high frequencies (typically above 15 kHz). While professional quality headphones greatly reduce this effect and generally approximate the full bandwidth of human hearing, the fluctuation in response across the full spectrum of headphones available to the listener is too significant to ignore, and should be countered by an inverse filter prior to playback (Chapter 4).
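One possible form of such an inverse filter is sketched below: a linear-phase FIR designed by frequency sampling from a measured headphone magnitude response, with the boost capped so the filter does not attempt to undo steep rolloff at the band edges. The measurement grid, tap count, and boost limit are assumptions made for illustration; the compensation actually used is described in Chapter 4.

    import numpy as np

    def inverse_headphone_fir(freqs_hz, response_db, fs=44100, ntaps=255,
                              max_boost_db=12.0):
        # Frequency-sampling design of a linear-phase FIR that approximately
        # inverts a measured headphone magnitude response (freqs_hz ascending).
        nfft = 1024
        grid = np.fft.rfftfreq(nfft, 1.0 / fs)
        measured = np.interp(grid, freqs_hz, response_db)   # response on FFT grid
        inverse_db = np.clip(-measured, -max_boost_db, max_boost_db)
        target = 10.0 ** (inverse_db / 20.0)                # desired magnitude, zero phase
        ir = np.fft.irfft(target, nfft)
        ir = np.roll(ir, ntaps // 2)[:ntaps]                # center the main lobe
        ir *= np.hanning(ntaps)                             # taper to reduce ripple
        return ir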

Head position
There must be feedback regarding the listener's head position; if the head is moved even slightly, all of the auralization information (HRTFs, ITDs, and ILDs) must be updated to reflect the new geometry. Head tracking is discussed in detail in Chapter 8.

Expectation
Expectation plays a very large role in human sensory perception in general; as it applies to auralization, listeners will rely on visual cues, their knowledge of the present environment, and memory of recent events to assist in localization. Augmenting auditory stimulation with stereoscopic visual representations of the environment provides enhanced results. If a listener sees a saxophone in front of her and to her right, she will not confuse the position of a saxophone sound as coming from behind. The problem of front/back reversals disappears as the listener can associate the sounds with items in view (in front) or not in view (in back).

Providing a means for tracking and compensating for head movement also contributes to the resolution of these reversals. By combining movement and memory of recent stimuli, the listener can "triangulate" to determine the true location of the source. Since head motion changes the interaural differences of frontal sources in the opposite sense from rear sources, the direction of change is sufficient to indicate the hemisphere in which the source is located. This method also has the advantage of relying on the fairly robust ITDs and ILDs, rather than the delicate HRTFs, to provide front/back differentiation.

For more discussion of integration with visual displays and head tracking, see Chapter 8.


Frank Filipanits Jr. - franko@alumni.caltech.edu