8.0 Further code refinements
8.1 Indirect reflections
8.2 Head tracking
8.3 Virtual reality
8.4 Custom transforms
8.5 Customization of the HRTF
While the algorithm utilized in this project incorporates several improvements over conventional programs, there are additional enhancements which could increase performance but were not implemented because of time limitations. These include a provision for specifying and processing moving sources, and interpolation at intermediate HRTF positions.
One method for supporting moving sources would involve a separate file providing a "script" of the object's motion, given as a description of the object's state at various points in time. The state includes the x, y and z positions of the source as well as the time derivatives dx, dy and dz (the directional velocities). The file need only contain entries at the times when the velocity vector changes or when the source makes a discontinuous jump in its path.
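Such a script could be realized in many ways; the sketch below, in Python, assumes a hypothetical entry format of (t, x, y, z, dx, dy, dz) and extrapolates the position at any time from the most recent entry. The particular entries shown are illustrative only.

```python
import bisect

# Hypothetical script format: each entry is (t, x, y, z, dx, dy, dz),
# giving the source state at time t; between entries the source moves
# with the constant velocity stored in the preceding entry.
script = [
    (0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0),    # start 2 m ahead, moving right at 1 m/s
    (4.0, 4.0, 2.0, 0.0, 0.0, -0.5, 0.0),   # velocity change at t = 4 s
]

def source_position(script, t):
    """Return (x, y, z) of the source at time t by extrapolating
    from the most recent script entry with its stored velocity."""
    times = [entry[0] for entry in script]
    i = max(bisect.bisect_right(times, t) - 1, 0)
    t0, x, y, z, dx, dy, dz = script[i]
    dt = t - t0
    return (x + dx * dt, y + dy * dt, z + dz * dt)
```

Because the state is only stored at velocity changes, the file stays small regardless of how finely the renderer samples the trajectory.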
The coarse sampling of the HRTFs used (15° azimuth, 18° elevation) is a cause for some concern. An algorithm to perform even a simple linear interpolation for intermediate angles would increase the accuracy of the HRTFs used and prevent any sense of discrete change with moving sources.
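A minimal sketch of such an interpolation, assuming a hypothetical lookup table hrtf[(az, el)] of impulse responses measured on the 15°/18° grid, would weight the four surrounding measured responses bilinearly:

```python
# Bilinear interpolation between measured HRTF positions.  The table
# name `hrtf` and its (azimuth, elevation) keying are assumptions for
# illustration, not the actual data layout of this project.
AZ_STEP, EL_STEP = 15, 18

def interpolate_hrtf(hrtf, az, el):
    """Blend the four surrounding measured impulse responses,
    weighted by proximity to the requested (az, el) direction."""
    az0 = (az // AZ_STEP) * AZ_STEP
    el0 = (el // EL_STEP) * EL_STEP
    fa = (az - az0) / AZ_STEP          # fractional position within the cell
    fe = (el - el0) / EL_STEP
    az1 = (az0 + AZ_STEP) % 360        # azimuth wraps around
    el1 = el0 + EL_STEP
    corners = [
        (hrtf[(az0, el0)], (1 - fa) * (1 - fe)),
        (hrtf[(az1, el0)], fa * (1 - fe)),
        (hrtf[(az0, el1)], (1 - fa) * fe),
        (hrtf[(az1, el1)], fa * fe),
    ]
    n = len(corners[0][0])
    return [sum(h[i] * w for h, w in corners) for i in range(n)]
```

Note that naively blending raw impulse responses smears the interaural delay; a more careful implementation might interpolate minimum-phase responses and the time delays separately.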
The linear interpolation used during sample-rate conversion of the HRTFs is also a place for improvement. Though the symptoms of this simplification are relatively minor (they scale with the degree of rate change, which is only about 12% from 50 kHz to 44.1 kHz), a first-order interpolator does add some distortion, most notably aliased components in the high frequency ranges. Since the sample-rate conversion of the HRTFs is performed only once and has no real speed limitation, a true oversampling/decimation conversion is feasible and will be implemented in the future.
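The oversampling/decimation approach can be sketched as follows. Converting 50 kHz to 44.1 kHz is a rational rate change of L/M = 441/500: conceptually, stuff L-1 zeros between samples, lowpass filter, and keep every Mth result. The polyphase form below computes only the needed outputs so the stuffed zeros are never multiplied. The filter length and Hamming window are arbitrary illustrative choices, and the filter's group delay is not compensated.

```python
import math

def rational_resample(x, L, M, taps_per_phase=12):
    """Resample x by the rational factor L/M: zero-stuff by L,
    windowed-sinc lowpass filter, decimate by M -- evaluated in
    polyphase form (one stride-L slice of the filter per output)."""
    N = taps_per_phase * L                 # total lowpass filter length
    fc = min(0.5 / L, 0.5 / M)             # cutoff, cycles/sample at rate fs*L
    c = (N - 1) / 2.0
    h = []
    for n in range(N):
        t = n - c
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))  # Hamming
        h.append(L * ideal * window)       # gain L makes up for the stuffed zeros
    out_len = (len(x) * L) // M
    y = []
    for m in range(out_len):
        j = m * M                          # position in the zero-stuffed stream
        acc = 0.0
        for k in range(j % L, N, L):       # only taps landing on real samples
            i = (j - k) // L
            if 0 <= i < len(x):
                acc += h[k] * x[i]
        y.append(acc)
    return y
```

Since the HRTF conversion is a one-time offline step, even a long, steep filter carries no run-time cost.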
The assumption of planar wavefronts (free-field conditions) has been made throughout, as it is in most auralization systems. Unfortunately, this is not necessarily the case: a point source located near the listener actually presents a spherical wavefront, which may not reflect off the pinnae in the same manner as a planar one. The effect is, however, very slight at reasonable source distances from the ear, owing to the small solid angle subtended by the surface area of the ear. It is mentioned here only for completeness; the computation required for accurate modeling of spherical wavefronts is prohibitive, particularly in light of the small magnitude of this effect.
8.1 Indirect reflections
Externalization is a considerable problem in auralization; it is often difficult to generate sounds which appear to originate at any large distance from the head. Much of this difficulty may be attributed to the brain's reliance on indirect reflections to determine distance. When a sound source is in a physical room, the human localization system extracts some information from the initial (direct) sound, but also uses the reflected (indirect) sound from the walls to gather information about the environment and the relation of the source position to that environment.
The use of artificial reverberation with ray-traced early reflections has been shown to increase the perception of auditory distance. In these systems, the specifications of the room were entered into a ray-tracing software package, which computed the positions of "phantom sources" to represent the first few early reflections. Ray-tracing works by assuming sound travels in a straight line and exhibits specular reflection. While this is only a first approximation for sound waves (ray-tracing originated in the light domain, where it is a more realistic representation), it can produce a fairly accurate characterization of the direction of sound reflected off the room boundaries. These reflections were represented by "virtual" or "phantom" sources located along the angle of incidence at a distance equal to the total path length of that reflection. The phantom sources were then processed using conventional auralization techniques, resulting in signals composed of a combination of direct and indirect sound, all of which contained location and environmental cues similar to those found in a physical environment.
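For a rectangular room, the first-order phantom sources described above can be found by mirroring the source across each boundary plane; the straight-line distance to a mirror image then equals the total reflected path length. The sketch below assumes a room with one corner at the origin, which is an illustrative simplification.

```python
import math

def first_order_images(src, room):
    """First-order phantom sources for a rectangular room with
    corners at the origin and at room = (Lx, Ly, Lz): the source
    is mirrored across each of the six boundary planes."""
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = list(src)
            img[axis] = 2 * wall - src[axis]   # reflect across the plane
            images.append(tuple(img))
    return images

def path_length(listener, image):
    """Distance to a phantom source = total length of the
    corresponding specular reflection path."""
    return math.dist(listener, image)
```

Each image is then auralized as an ordinary source at its mirrored position, with the extra path length supplying the appropriate delay and attenuation.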
8.2 Head tracking
It is unreasonable to expect auralization systems to perform on par with nature when one of the primary localization cues is removed. Yet without a head-tracking system to compensate for head movements, this is the result. The innate human response to an unexpected sound is to turn towards it; it is in the movement of the head that critical and incontrovertible information about location is discerned. While HRTFs do contribute somewhat to front/back differentiation, these cues are highly frequency-dependent. If the source does not contain the particular frequencies that differ between the front and rear HRTFs, the cues simply do not exist; instead the brain uses memory and comparison to discern hemispherical location.
For a source in front of the head, clockwise movement (positive azimuthal deflection) will result in a decrease in distance to the left ear and an increase to the right (along with the appropriate ITD and ILD changes). A source in the rear will exhibit the opposite behavior; clockwise rotation will result in an increase in distance to the left ear and a decrease to the right. These distance-related cues are robust and rely solely on the laws of physics, not on the characteristics of the source. The addition of a head-tracking system to constantly monitor head position and update the relative source position would greatly increase the success of any auralization system, particularly with regard to front/back reversals. There are numerous options available for head tracking, ranging from highly accurate multi-thousand-dollar systems to low-budget home-brew devices. As the technology matures, high-quality systems will become reasonably affordable.
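The geometry behind this front/back disambiguation is easy to verify. In the sketch below (a 2-D illustration, with assumed ear positions and a hypothetical yaw angle supplied by a tracker), a clockwise head turn moves a frontal source closer to the left ear and farther from the right, while a rear source does the opposite:

```python
import math

# Assumed head-frame geometry: +y is straight ahead, ears 18 cm apart.
LEFT_EAR, RIGHT_EAR = (-0.09, 0.0), (0.09, 0.0)

def to_head_frame(source_xy, head_yaw_deg):
    """Rotate a world-frame source position into the head frame,
    given the head's clockwise yaw in degrees (the positive
    azimuthal deflection reported by a tracker).  The inverse of a
    clockwise head turn is a counterclockwise rotation of the world."""
    a = math.radians(head_yaw_deg)
    x, y = source_xy
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def ear_distances(source_xy, head_yaw_deg):
    """Distances from each ear to the source for a given head yaw."""
    s = to_head_frame(source_xy, head_yaw_deg)
    return (math.dist(s, LEFT_EAR), math.dist(s, RIGHT_EAR))
```

A head-tracked auralizer would re-run this coordinate change every frame and re-select the HRTF for the updated relative direction.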
8.3 Virtual reality
A logical extension of providing a simulated aural environment is to provide a simulated visual environment as well. The generalization of this is the generation of a total simulated sensory environment, or "Virtual Reality". The inclusion of artificially generated visual stimuli with a simulated aural environment allows the participant to become fully immersed in the generated world, and contributes to the principle of "willing suspension of disbelief." This principle states simply that a priori knowledge of the true physical surroundings creates an expectation which greatly inhibits acceptance of the artificially generated sensory cues, since they contradict this expectation. A preponderance of artificial cues - such as combining aural and visual stimuli - can override the natural tendency to remain "in the real world".
Without a combined sensory input, good results are difficult at best. Often the success of such a system depends a great deal on the imagination of the listener. As an example, take a listener seated in a small classroom. If presented with a simulated aural environment of a fighter jet, he must first overcome the visual stimuli which are in overwhelming conflict. In a joint system, presenting visual and aural stimuli and incorporating head-tracking "removes" the user from the room. If the visual and aural senses of the real world are replaced with synthetic substitutes, the only significant obstacles to a true feeling of immersion are tactile sensory input and the knowledge - the belief - that the room entered just moments ago is still there. Without visual and aural reinforcement, the degree of certainty for that belief decreases rapidly. It is much easier to gain an auditory perception of being in a fighter jet when that is not in conflict with your visual and rational senses.
8.4 Custom transforms
A great deal of work remains to be done to optimize auralization processing, both for accuracy and for speed. One area worthy of investigation is the use of alternative transforms. The band-dependent structure of the HRTF and the logarithmic nature of human hearing make the conventional FFT an inefficient choice for frequency analysis or fast convolution. Much of the processing time and power is wasted extracting information from frequencies that are of little interest. Wavelet transforms appear to have properties which would make them ideal for addressing individual directional bands.
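As a toy illustration of the logarithmic band structure wavelets offer (using the Haar wavelet purely for simplicity, not as a proposed design), each pass of the decomposition below splits the signal into a half-rate smooth part and an octave-wide detail band, giving fine resolution at low frequencies and coarse at high, roughly matching hearing:

```python
import math

def haar_octave_bands(x):
    """Haar wavelet decomposition of x (length a power of two).
    Returns the detail bands, highest octave first, followed by the
    residual low-frequency average.  The 1/sqrt(2) scaling makes
    the transform orthonormal, so signal energy is preserved."""
    bands = []
    a = list(x)
    r2 = math.sqrt(2.0)
    while len(a) > 1:
        detail = [(a[2*i] - a[2*i+1]) / r2 for i in range(len(a) // 2)]
        a = [(a[2*i] + a[2*i+1]) / r2 for i in range(len(a) // 2)]
        bands.append(detail)
    bands.append(a)
    return bands
```

An N-point signal yields log2(N) octave bands rather than the N/2 uniformly spaced bins of an FFT, concentrating effort where hearing is most discriminating.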
Another possibility is the construction of a new transform utilizing basis functions derived from the HRTFs themselves. The principal component analysis of HRTFs performed by Wightman and Kistler is a first step in this direction.
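The core of that first step can be sketched in a few lines. The code below finds the first principal component of a set of measurement vectors (e.g. log-magnitude responses, one per direction) by power iteration on the covariance; the input data shown is a toy stand-in, not actual HRTF data:

```python
import math

def first_principal_component(rows, iters=200):
    """Power iteration for the dominant eigenvector of the
    covariance of `rows` -- the first basis function a
    PCA-derived transform would use."""
    n = len(rows[0])
    mean = [sum(r[i] for r in rows) / len(rows) for i in range(n)]
    centered = [[r[i] - mean[i] for i in range(n)] for r in rows]
    v = [1.0] * n
    for _ in range(iters):
        # apply C = X^T X as X^T (X v), avoiding forming C explicitly
        proj = [sum(c[i] * v[i] for i in range(n)) for c in centered]
        w = [sum(proj[k] * centered[k][i] for k in range(len(rows))) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    return v
```

Deflating and repeating yields further components; a small number of them could serve as the basis functions of an HRTF-derived transform.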
8.5 Customization of the HRTF
As discussed in Chapter 4, the use of a general set of HRTFs is a non-optimal compromise necessitated by the difficulty of measuring individual HRTFs. At the present time, the HRTF recording process is cumbersome, requiring the use of an anechoic chamber and other specialized equipment which preclude incorporation into end-user auralization systems. One approach to increased HRTF compatibility is to provide a selection of several general HRTF sets, with a means for the user to select the set which generates the most effective results for him or her. The principal disadvantage of this method is the increased storage requirement for the additional HRTF data.
Another alternative is the creation or modification of HRTFs based on pinna structural information extracted by computer imaging techniques. Although presently prohibitively expensive, it is conceivable that a system could be developed to scan each individual's ear and determine its major physical structures; an HRTF could then be constructed from the physical dimensions of the ear, as in the work by Han. This would allow true individualization of the HRTF, and potentially higher accuracy in spatial localization.