A Room Scale Virtual Reality System with Spatial Audio Generation from Object Collisions

The Unity package produced for this thesis is available here (880MB).

The Windows executable version is available here (143MB), playable with any SteamVR headset or with mouse and keyboard.

Introduction

The purpose of this project is to add sound generation to the Virtual Reality system installed in RIMLAB 2. In a simple rectangular room, measuring 7.5 m by 5.4 m, an OptiTrack array of infrared cameras is set up to track both an Oculus Rift and several arbitrary objects. The system is used for object manipulation tasks, to evaluate the precision and accuracy of movements in Virtual Reality, and as a source of ground truth for robotics projects. To make the user interaction more intuitive, the Oculus Rift had also been fitted with a Leap Motion tracker, capable of creating a real-time 3D model of the user's hand, with knuckle-level precision.

A complete model of the room, already integrated with the tracking system, had been built in Unity for [Tesi Orlandi].

In order to make these manipulation tasks more realistic, and increase user immersion, it was decided to add sound. Realistic sound should be motivated by the physical interaction causing it. This includes temporal synchronicity, loudness modulation based on the impact velocity, and frequency modulation based on mass, volume and direction.

Just as importantly, the sounds should come from the same location as the visually perceived event.

First of all, the various tracking systems for Virtual Reality were compared, and it was decided to use SteamVR, as it is compatible with the HTC Vive, with the OptiTrack system, and with all tethered Oculus headsets. The Vive was used for most of the development process, relying on its very precise Lighthouse system.

The SteamVR Input system was chosen as the main interaction mechanism. It provides an intuitive way to program interactions between the hand and virtual objects. Unlike other systems, it does not rely on explicit button presses but directly binds natural actions such as pinch or grab. These are then mapped to whatever controller the user has; the default bindings determined by the developer can be changed by the user at runtime, and even exported and shared with others.

A brief catalogue was compiled to identify the kinds of sounds that would have to be generated for the purposes of this project. As a first step, and for simplicity, the focus was put on collision sounds, generated by contacts between objects that are limited in duration but during which an exchange of momentum happens.

The initial idea was to integrate the system with software for the synthesis of collision sounds, or ideally even with a system that would generate the sounds based on finite element physical modelling of the objects and their collisions. However, all mainstream efforts in this direction seem to have been abandoned years ago, and they therefore lack the integration layer required for their use with modern game engines.

The effort then shifted to software systems that natively integrate with Unity and produce sound events for each collision, using recorded sounds and modulating them as needed. The NewtonVR Collision Sound framework was chosen for this task; however, the sound files it provided were hard to distinguish, which made all collisions sound similar. A new set of sounds was therefore recorded from real-life collisions.

The two sound sets were briefly compared based on their waveforms and spectra, highlighting the greater variety of frequency components and harmonics in the new sounds, especially between different materials.

The next step was spatializing the sounds. Several software libraries were considered and compared. In particular the difference between direct binaural rendering and Ambisonics rendering was explored and explained. In the end SteamAudio was chosen, both for its full range of advanced features and for its simplicity of use.

Some free textures were added to the objects in the Unity scene to distinguish the different materials visually. Then a mirrored table was added to the room, and on each table the same set of objects was set up, using a different set of material sounds for each table. Twenty-one users were asked to compare the collision sounds generated by the two sound sets, and a small but measurable preference was recorded for the new sounds.

2 – Description of the Virtual Reality System

In order to accomplish the objectives of this project, I was given access to a robotics laboratory at the University of Parma, designated RIMLAB 2.

This is a simple rectangular room, 7.5m by 5.4m, used for Virtual Reality and robotics research. It has desks on three sides and cupboards on the other, but more importantly it has a large empty area in the middle, free from obstacles and visual obstructions, of around 25 square meters, depending on the current configuration. For this project, the computer in the North-West corner of the room is used, connected to the HMD either directly, which limits the usable area to about half of the total, or through a system of extension cables held by a pulley system.


Figure 2.1 - RIMLAB 2, as shown on the building plan

It is fitted with several tracking and display systems. I employed an off-the-shelf HTC Vive and an original Oculus Rift CV1 fitted with a 3D-printed mount, which allows for the mounting of OptiTrack reference markers and of a Leap Motion unit. The software system allows either headset to be used transparently, and both allow for room-scale tracking.

2.1 – HTC Vive and Steam VR

The HTC Vive is a Steam VR device, which employs the Lighthouse tracking system, composed of two base stations producing pulsed infra-red flashes and laser beams.

Figure 2.2 - HTC Vive with Wand, both showing the divots

These are detected by the infra-red sensors on the Vive HMD (head-mounted display) and Vive Wands (the controllers). The sensors are located in the divots covering the headset, as shown in Figure 2.2. The base stations need to be synchronized: the master needs to be in ‘b’ mode, the slave in either ‘C’ or ‘A’ mode.

If the stations are in each other’s field of view, which is roughly 120 degrees, they can synchronize optically, setting the slave base station in ‘C’ mode. If the mounting angles or the distance between the stations don’t allow for the use of optical sync, a 10m cable can be connected between the stations, and setting the slave station to ‘A’ mode allows for accurate synchronization.

Once the base stations are synchronized, they work as follows (see [Gizmodo explanation] for an accessible reference).

Figure 2.3 - Base station and main parts

As seen in Figure 2.3, the main parts of the Lighthouse are a grid of 27 infrared LEDs, marked in red, a horizontal laser motor, marked in green, and a vertical laser motor, marked in blue. Much like in an actual lighthouse, the laser source itself is stationary, and the rotor spins a mirror at 60 Hz, with timing governed by the 48 MHz clock on its controller chip.

The same sequence of operations is repeated in cycles: the grid of synchronization LEDs lights up and stays on for a known amount of time. Then a laser beam sweeps horizontally over the whole field of view of the base station, from left to right, at a known angular velocity.

The LED grid flashes again, and then a laser beam sweeps vertically over the whole field of view, rotating at the same known velocity. The photodiodes in the headset and on the controllers detect both the initial flash and the laser beams, and calculate the angle of arrival of each beam from the time elapsed between the timing flash and the moment the diode detects the beam.

The photodiodes on the HMD and on the controllers are carefully arranged in a way that minimizes potential geometric ambiguities.

A SteamVR tracked object can have anywhere between 20 and 32 sensors. Given the timing information from several of the photodiodes, the system calculates a pose, with positional and angular information, for the headset and controllers, relative to the positions of the two base stations. It is an entirely timing-based method, using only the known relative positions between sensors, accurate time keeping, and simple trigonometry. Each tracked object is therefore able to aggregate the light detection pulses, integrate them with data from the IMU, and send it all back to the computer wirelessly, where the position and orientation are calculated with sub-millimeter accuracy.
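As a rough illustration of this timing-based principle, the sketch below (a hypothetical helper, not part of SteamVR) converts the time elapsed between the synchronization flash and a photodiode hit into a sweep angle, assuming a 60 Hz rotor and a 48 MHz timing clock.

```csharp
// Hypothetical helper, not part of SteamVR: converts the clock ticks counted
// between the sync flash and a photodiode hit into the laser sweep angle.
public static class LighthouseTiming
{
    const double RotorHz = 60.0;          // sweep frequency of the Lighthouse rotor
    const double ClockHz = 48_000_000.0;  // timing clock used to count elapsed time

    // Angle (in degrees) swept by the laser since the sync flash.
    public static double SweepAngleDegrees(long ticksSinceSync)
    {
        double elapsedSeconds = ticksSinceSync / ClockHz;
        return elapsedSeconds * RotorHz * 360.0;
    }
}
```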

NASA did an interesting analysis of the tracking accuracy and of ways to improve it [HTC Vive Analysis and accuracy improvements] in order to use the system as a source of ground truth pose data for the Astrobee project.

Vive Trackers are available for around €120, and allow for tracking arbitrary objects, as long as they are not subject to deformations.

Several parameters cannot be automatically detected by the system and have to be set up by the user during the calibration procedure. In particular, the height of the base stations relative to the floor has to be set by putting both controllers on the ground, and a safe area has to be manually delimited. SteamVR displays the borders of this safe area, called the Chaperone, as a blue grid whenever the user gets too close.

Furthermore, the HMD is fitted with an infrared camera, visible as a rather large circle in the center of the headset in Figure 2.2, allowing for video passthrough, which can be used for limited Mixed Reality applications, or more commonly for navigating the room without removing the headset.

The HTC Vive has since been discontinued, replaced by a plethora of compatible devices. On the HTC side there are the Vive Pro and the Vive Cosmos. The Pro uses the upgraded Steam VR 2 Lighthouses, featuring a single two-axis motor rather than two one-axis motors. The Cosmos uses a camera based inside-out tracking system, but both devices are still compatible with Steam VR software.

On the Valve side, the Index is the new gold standard, featuring Steam VR 2 tracking, improved field of view, resolution and refresh rate. Perhaps more importantly, it features the Knuckles controllers, which allow for knuckle-level finger tracking, as the name implies. The software developed to handle this level of finger tracking, named Steam VR Input, is what allowed for the deployment of a single user input solution across all devices, as explained in the Unity section of this text.

2.2 – Oculus Rift

The Oculus Rift CV1 was the first major commercial HMD of the modern era. It was released in early 2016, some months before the HTC Vive, but it did not initially come with a hand tracking system. It shipped with a single Oculus Sensor and two input devices: a simple, untracked remote control designed for media operations, and an Xbox One controller for playing games. Many early Oculus experiences also relied on mouse and keyboard operation, and were limited in mobility.

After a few months Oculus Touch was released, containing a second Oculus Sensor and two tracked controllers, and pretty much matching the capabilities of the Vive for end-user applications. The tracking area and accuracy are still slightly worse, but for media consumption and videogames few are even able to notice a difference.

Figure 2.4 - Oculus Rift with OptiTrack markers and LeapMotion

Its tracking works in the opposite way, and can cause USB bandwidth problems. Both the HMD and the controllers are covered in a pattern of infra-red LEDs, collectively called a Constellation. The Oculus Sensor, connected to the computer with a USB 3 cable, is an infra-red camera: it sees the Constellation dots and can therefore compute a pose for the headset and the two controllers.

If the computer is fast enough to avoid bandwidth problems, the tracking is solid, but it is more prone to occlusion problems, and the tracking area is smaller, compared to the HTC Vive. Furthermore, given that each Oculus Sensor needs to be connected to the computer, it is harder to position them in a way that thoroughly covers the tracking area.

In terms of visual fidelity and comfort, the Oculus Rift and the HTC Vive are roughly equivalent: they have exactly the same resolution, field of view and refresh rate. The Vive allows for more precise focus adjustments, but it is a little heavier, its controllers are not as ergonomic, and it relies on Fresnel lenses, which lead to more noticeable color artifacts at the edges of the field of view. Both devices also come with built-in headphones, which are not of very good quality, but allow for binaural sound.

Much like the HTC Vive, the Oculus Rift has been discontinued and replaced by the Oculus Rift S and the Oculus Quest, both of which employ camera-based inside-out tracking. Both new devices also have a lower screen refresh rate and lack interpupillary distance adjustment, but feature 6-degrees-of-freedom tracking at a much lower price.

2.3 – Optitrack

In order to use the Oculus Rift over the area of the whole room, and to achieve true room-scale tracking, another tracking system was set up. It is composed of an array of 12 IP cameras, a base station, and a proprietary piece of computer software.


Figure 2.5 - OptiTrack Prime

Each camera is connected by an Ethernet cable, which provides power and carries data to the base station. The cameras are fitted with a circular array of infrared LEDs, plus a row of visible-light LEDs used to signal status information to the end user. The infrared light produced by the camera is reflected by the tracking markers, which have to be applied to the tracked object, and the reflection is then seen by the camera itself.

The signal is sent back to the base station, which assumes a static IP address and provides the video streams to the computer. The Motive software, running on the computer, automatically reconstructs the relative positions of the cameras. In order to track an object, it needs to be fitted with several reflective spherical markers. The main advantages of this system are that it allows for the tracking of arbitrary objects, including human bodies, with automatic skeleton reconstruction, and deformable objects.

Furthermore, it doesn’t require any electronics to be present on the tracked objects, and with a large enough camera array it can be used to track several objects, or several HMDs. It is therefore theoretically possible to use it for multi-user virtual reality experiences, and it is sold as a solution for professional arcades and arenas, although in those cases an active marker version is employed.

The problems with this system are the considerable cost (just one of the cameras costs more than a full Oculus and a full Vive system put together) and the difficulties in the setup and calibration process.

2.3.1 - Optitrack Volume Setup and Calibration

After clearing all existing masks, the user has to mask all objects in the scene that produce infrared reflections. These can include any shiny objects, such as metal table legs, keys, or reflective inserts in sneakers or backpacks. It is best to remove any moveable reflective items, and to mask out the permanent ones. Even direct sunlight can be a significant cause of interference.

The user can either manually paint over every visible spot in the image, or simply use the Mask Visible option. However, attention must be paid not to block movable objects, as that will affect tracking. In ideal circumstances, the masks should simply cover the direct infrared lights produced by the other cameras.

Once the static scene has been set up, the calibration itself must be performed. The software interface shows the video feed from all cameras. While holding the Wand, the user must walk around the tracking area in a grid, waving the Wand in a figure-eight pattern. On each video feed the Wand will leave a covered trail; the calibration is completed once the trails entirely cover each video frame, ensuring that every point in each camera's field of view has been accounted for. Figure 2.6 shows an in-progress wanding.

Figure 2.6 - Wanding in progress

Then, the software automatically computes a solution to the room geometry. The accuracy of the calibration thus obtained depends on the number of samples acquired by the software, which is roughly proportional to the time spent by the user waving the wand. Usually, two to three minutes are more than enough.

The final step in the calibration itself is to determine the ground plane, and then to save the resulting file.

2.3.2 - OptiTrack RigidBody Tracking

The next step is to add the tracked objects. First, the object is covered in reflective markers; then, ensuring that no other reflective objects are in the scene, the user selects the dots corresponding to the markers in the Perspective View and turns them into a RigidBody by pressing Ctrl+T.

OptiTrack recommends using between 4 and 12 markers per RigidBody, and no more than 20 are supported. It is important that the markers are placed asymmetrically over the object, to minimize geometrical ambiguities.

When tracking multiple objects, it is also important to place the markers in a different pattern on each object, in order to allow for object identification. It is recommended to vary both the shapes formed by the markers and the distances between them. There are only so many unique shapes the markers can be arranged in before the system starts perceiving them as just a bunch of similar triangles.

Ambiguity between objects and between orientations can be very jarring for the end user, as objects will tend to suddenly flip, or jump around the scene. This is merely annoying if the affected objects are being handled by the user, but it can be physically discomforting, and even nausea-inducing, if the tracked object is the user’s HMD and their perspective is altered suddenly.


Figure 2.7 - Perspective view in the Motive software, showing the rays connecting a selected marker to the cameras


Figure 2.8 - Perspective view in the Motive software, showing the same markers as in figure 2.7, with the resulting rigid body

As made evident in Figure 2.7, the reference point of the RigidBody created by the Motive software is located at the geometric center of the tracking markers. For HMDs that is not necessarily an acceptable approximation: the ideal point of reference is the middle point between the eyes of the user. Luckily, Motive allows for the arbitrary modification of the center point of a RigidBody, and even with interpupillary distance adjustments the middle point does not move.

Having selected the RigidBody, it is sufficient to click on the pivot point from the Builder pane, and use the Gizmo to translate and rotate the pivot point to the desired location.
However, there is also a special procedure for calibrating the pivot point of HMDs, instead of a generic RigidBody [https://v21.wiki.optitrack.com/index.php?title=Template:HMD_Setup#Create_an_HMD_Rigid_Body]. Tracking the HMD with its own native system, an OptiTrack RigidBody is created from the markers affixed to the HMD itself. Then, opening the Builder pane, the HMD RigidBody type is selected, and it should automatically detect its make and model. A sampling process is then set up and started, during which the headset must be slowly rotated. By sampling the orientation of the headset both with the markers and with the native tracking system, an offset is calculated for the pivot point, moving it exactly to the base of the nose of the user.

Then, from the Data Streaming pane the IP address and data types need to be set up, in accordance with the Unity settings. This will, in practice, set up a data stream of the rigid body position and orientation, in local coordinates, that Unity will kinematically assign to a corresponding RigidBody object.

It is therefore possible to track real, arbitrary objects in a virtual environment, and give them a graphical representation and the ability to act upon the virtual world. This can be used simply for tracking real objects, like a robot moving around, or as a means of user interaction.
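As an illustration of this kinematic assignment, the sketch below (not the OptiTrack plugin's own code; GetLatestPose is a hypothetical placeholder for whatever the streaming client provides) applies an externally streamed pose to a kinematic Rigidbody in FixedUpdate, so the tracked object can still push other physics objects around.

```csharp
using UnityEngine;

// Minimal sketch (not the OptiTrack plugin's code): apply an externally streamed
// pose to a kinematic Rigidbody, so it can still push non-kinematic objects.
public class StreamedPoseFollower : MonoBehaviour
{
    public Rigidbody body;   // kinematic Rigidbody representing the tracked object

    void FixedUpdate()
    {
        // Hypothetical: fetch the most recent position/rotation from the tracking stream.
        (Vector3 position, Quaternion rotation) = GetLatestPose();

        // MovePosition/MoveRotation let a kinematic body sweep through space,
        // so collisions with non-kinematic objects are still resolved.
        body.MovePosition(position);
        body.MoveRotation(rotation);
    }

    (Vector3, Quaternion) GetLatestPose()
    {
        // Placeholder: in the real setup this data comes from Motive over the local network.
        return (transform.position, transform.rotation);
    }
}
```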

Motive and Unity have to be on the same local network, but not necessarily on the same computer. For this project, a single computer was used, fitted with two network cards and two Ethernet cables. Therefore, the IP address set in both Motive and Unity is simply 127.0.0.1, and only the data and command ports need to be set up with care. Each tracked RigidBody also has an ID number that must be matched between the two interfaces.

2.4 - Leap Motion

The objects tracked with OptiTrack do not allow for very natural interactions with simulated objects, because they do not allow for redirected positioning, a key technique for making collisions less jarring. To avoid this limitation, and at the same time to allow for controller-free interaction, unlike with SteamVR or with Oculus Touch, a Leap Motion input device was also used.

The Leap Motion Controller, as shown in Figure 2.9, is a small box containing two cameras and three infrared LEDs. The LEDs illuminate the scene, and the cameras capture the resulting grayscale images, which are streamed to the Leap Motion Service running on the computer via the USB cable, which also provides power to the Controller.

Figure 2.9 - Leap Motion controller with labeled components

This is a limiting factor for room-scale applications, as the provided cable is only 1.5 m long. A system has been set up in the middle of the room, comprised of two 10 m USB extension cables (one for the HMD, one for the Leap Motion Controller), one 10 m HDMI cable (for the HMD), and a pulley system that keeps the cables high above the ground, preventing the user from tripping over them or getting tangled. This system, shown in Figure 2.10, allows for the operation of the Leap Motion Controller mounted in front of the Oculus Rift, but it does not allow for its use with the HTC Vive, because the latter also requires a DC-IN cable, for which an extension is not available.

The Leap Motion Service constructs a depth map of the scene using the stereoscopic images it receives, and estimates the position and orientation of all joints comprising the hands. Based on these joints, the system automatically rigs the hand models, recognizes which hand is which, and streams all of this information to any compatible software.

The tracking area of the Leap Motion is much smaller than the range of positions a human hand can take, so it is quite easy to lose tracking. The hands appear only once they are in front of the camera and tracking is established. If tracking is lost during the VR experience, the hands disappear from sight until tracking is reestablished.


Figure 2.10 - Oculus Rift S with a Leap Motion Controller, OptiTrack Markers and the cable extension system

The way in which the tracking data from the Leap Motion will be handled by the Virtual Reality software will be described in the next section.

2.5 - Unity

In order to handle tracking data from different systems, with different input methods, and at the same time to easily produce a model of the room, the Unity game engine was chosen [Tesi Orlandi], both for its ease of use and for the wide variety of integrations available. The only other common alternative is the Unreal engine, which is generally considered to be slightly harder to use, but also more performant.

It should be noted that the purpose of [Tesi Orlandi] was significantly different: there, the idea was to integrate the different tracking systems and evaluate the accuracy that users could achieve in a series of specific handling tasks. In particular, the stacking of boxes and their insertion into different containers was thoroughly evaluated.

The real-life laboratory was reconstructed with accurate measurements and off-the-shelf 3D models, producing the scene shown in Figure 2.11. The real furnishings correspond with the virtual ones, making it possible, for example, to use a real tracked object to move a virtual object, and to put both the real object and the virtual object on the same corresponding table. It was, in short, a sort of Mixed Reality system achieved with a Virtual Reality headset.

Figure 2.11 - Virtual Laboratory, from the corner and from a top-down perspective

2.5.1 - Unity basics and terminology

Every file used by a Unity project is an Asset; this includes models, environments, scripts and so on.

Unity calls the levels of a game Scenes. All Assets making up a level are contained in its Scene. By default, the Scene is empty except for a Camera and a Directional Light. The whole graphics rendering pipeline is centered on the Camera, from whose point of view the scene will be seen. Unity handles both 2D and 3D games; for Virtual Reality everything is obviously going to be 3D. The Camera is attached to the actual head of the player, using one of the tracking methods described above. Luckily, each one of them came with a ready-made Unity integration to do just that.

Every object in a Unity Scene is a GameObject, each of which can have several Components, like a Collider, a Mesh Renderer, or simply a Script. Every GameObject, even if it is empty, has a Transform module, describing its Position, Rotation and Scale along the three Cartesian axes, as shown in Figure 2.12. Conveniently, Unity uses meters as its basic unit of distance, while angles default to degrees. It is also possible to use radians in scripts.

Figure 2.12 - Empty GameObject with Transform module

Unity Transforms are hierarchical: GameObjects in the root of the Scene hierarchy are defined in global coordinates, while child objects are defined relative to their parent GameObject.

Scripts are pretty self-explanatory: they are C# files whose scope is determined by the GameObject they are attached to.

RigidBody modules make GameObjects interact with the physics simulation system. By default, this enables gravity, and more generally forces, to act upon the object. If the Is Kinematic checkbox is ticked, the RigidBody can exert forces on other objects, but it is not moved by forces itself.
For natural object handling, as required by this project, collisions also need to be handled. These happen when two GameObjects that were not previously touching come into contact. However, to save on computation, simplified shapes are used to check whether there is any overlap. These are wrapped into Collider modules, and they have to be carefully chosen based on the shape of the object.

For this project the most common objects are boxes and spheres, so Box Collider and Sphere Collider are simple and efficient choices. However, it is not so simple for the tables in the scene: with a Box Collider it would not be possible for objects to go under the table. For most objects more complex than a box, a Mesh Collider is a good option, but for very complex objects it might be better to use a simplified mesh, in order to reduce the computational effort with little visible difference in the end result.

Another important property of Colliders is their use of PhysicMaterials. While Materials handle the graphic properties of the objects they are applied to, PhysicMaterials interact with the physics simulation, providing important information for the handling of dynamic and static friction and, even more importantly for the purposes of this project, bounciness. With an appropriate PhysicMaterial it is possible to have a realistic bouncing ball without having to do any animation or keyframing.
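As a minimal sketch of this idea (the specific friction and bounciness values are illustrative, not taken from the project), a bouncy PhysicMaterial can also be created and assigned at runtime:

```csharp
using UnityEngine;

// Minimal sketch: create a bouncy PhysicMaterial at runtime and assign it to the
// object's Collider, so it rebounds without any animation or keyframing.
public class MakeBouncy : MonoBehaviour
{
    void Start()
    {
        var bouncy = new PhysicMaterial("Bouncy")
        {
            bounciness = 0.9f,                       // 0 = no bounce, 1 = full rebound
            dynamicFriction = 0.4f,
            staticFriction = 0.4f,
            bounceCombine = PhysicMaterialCombine.Maximum
        };
        GetComponent<Collider>().material = bouncy;
    }
}
```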

In order to make Assets easy to share it is possible to arrange them in Unity Packages and Prefabs, which can be exported and distributed on the Unity Asset Store. For every third party system employed in this project the corresponding Unity Package was imported, adding all the needed Assets to the Unity Project in just one click, including all scripts.

Unity Packages contain an Asset hierarchy with models, scripts, scenes and prefabs. Prefabs contain a collection of GameObjects and their modules, typically hierarchically ordered.

Unity supports several different Rendering Paths, but for simplicity realtime lighting was used, with no pre-baking of any kind. It is not the most efficient way of doing things, but the scene was simple enough that it does not seem to affect performance. The Update function is called once per rendered frame, while the FixedUpdate function is called at each step of the physics simulation.

2.5.2 - Integration of tracking systems

The setup used in [Tesi Orlandi] relied on the OptiTrack system for tracking the Oculus Rift over the whole room, integrated with the Leap Motion for fine hand tracking. For setting up the OptiTrack, it was sufficient to import its Unity plugin and add a “Client - OptiTrack” GameObject to the Scene. The Client object contains the C# script handling communications with Motive, and an HMD GameObject which handles the position and orientation of the player camera. The Client has coordinates {0, 0, 0}; the HMD assumes a position and orientation relative to its parent, which is the Client itself. By default there should not be any rotation between the Client and the HMD, while a vertical offset allows the user to rapidly adjust the floor height as needed.

It is also possible to use OptiTrack to track and render the user’s body, using more markers and the Skeleton module, but it was not considered for this project.

The OptiTrack HMD GameObject is OpenVR compatible, so it works natively with any headset. In practice, it was only used with the Oculus Rift, as the Vive uses SteamVR tracking.

The next step was integrating the Leap Motion. This required importing its Unity plugin and adding an Interaction Manager GameObject to the scene. During play, the FixedUpdate function is called for every frame of the physics simulation, and the Interaction Manager receives updated information about all objects involved in interactions, including the positions of the hands themselves.

Leap Motion can dynamically switch between using hands and using controllers, based on the detected hardware, using a model for the hands or for the controllers as appropriate. Both are parented to the Camera, so that they are always visible to the player. However, it is possible to give an offset to this spatial relationship, simulating longer or shorter arms, which can be useful for some, very different, projects. [Reference to San Diego study on VR play with multiple arms]

When not using the Leap Motion, what was done in [Tesi Orlandi] was to employ arbitrary objects tracked by OptiTrack, as shown in Figure 2.13, to manipulate GameObjects, using the Oculus Remote as a clicker: any virtual object in contact with the tracked object at the moment of the click was parented to the tracked object itself. This allowed for various interaction modes, useful for the handling tests that had to be performed, but it is certainly not a very intuitive way of moving objects for the average user.

Figure 2.13 - Arbitrary objects fitted with OptiTrack markers

Therefore, for this project a more conventional approach was chosen. SteamVR Input was imported and its interaction model was adopted, with the addition of the HandShakeVR library to also handle Leap Motion input. Right now the HTC Vive and the Oculus Rift can both be used with their off-the-shelf trackers, simply by starting the Unity program. SteamVR handles both headsets and tracking systems through OpenVR, without needing any adjustments. SteamVR Input computes a model of the position of the user’s hands, based on Valve’s Skeleton system. It is possible to choose whether to render the hands with or without the controllers.

Attention must be paid to the handedness of the controllers, especially with the Vive Wands, which are physically identical. Also, it is not possible to use the Oculus Touch controllers with the HTC Vive, while it is technically possible to use the Vive Wands with the Oculus Rift, whether it is tracked by the Oculus Sensors or by the OptiTrack system.

In order to use the OptiTrack, one must first disable the SteamVR Camera GameObject, and enable the OptiTrack HMD GameObject. Everything else is handled automatically.
Whether the user is employing physical controllers or the Leap Motion, a model of their hands is rendered, and interaction is handled and scripted based on redefinable actions, not on button presses.

2.5.3 - Interaction system

The SteamVR Interaction System works by sending messages to the objects that are targeted by user action. The SteamVR Input plugin must be imported into the Unity Project, and the InputModule GameObject must be added to the scene as a child of the SteamVR Player Prefab, which also contains the main Camera.

An Interactable script, which is part of the SteamVR Input Unity Package, must be attached to each object that the user must be able to manipulate; it handles the sending and receiving of interaction-related messages, as well as the rendering changes derived from the interaction. In particular, the hand and the controller can be shown or hidden while the Interactable object is attached to the hand, and the object itself can be highlighted for the duration of the Hover.

The main interactions are Hover, Touch and Grab.

A Hover happens when the hand is very near to, or in contact with, an Interactable object. The object is not physically perturbed by the Hover, but the model of the hand can be made to change, for example by pointing at the hovered object. It is also common to highlight the hovered object with a thick yellow line, and to change its Material to make it even more evident, as shown in Figure 2.14. It is also possible to show tooltips, which are particularly useful for making the interactions clear to the user.

Figure 2.14 - Left, not hovering, Right, hovering

A Touch happens whenever the physical model of the hand touches, intersects or collides with an Interactable object, or indeed with any RigidBody fitted with a Collider. In practice it is possible to punch or slap a virtual object and have it react accordingly, without any scripting.

A Grab is a scripted interaction, whose activation depends on the specific controller used, but which approximates the intent of closing one’s hand over the grabbed object, and which results in the object being attached to the user’s hand. Technically SteamVR Input also distinguishes between a pinch grip and a grab grip, but it should not affect the performance of the sound system at all.

The easiest way of making an object grabbable is to attach a Throwable script to it, which will also automatically attach an Interactable script. Throwable has several options, and the defaults are more appropriate for games than for the manipulation tasks executed in this project. The default Attachment Flags are DetachFromOtherHand (which allows the user to pass the object from one hand to the other, but does not allow for a two-handed grasp), ParentToHand (which makes the grabbed object a child of the hand, following it around) and TurnOnKinematic (which makes the grabbed object Kinematic, and thus immune to collisions while being held).

As the focus of the project is on collision sounds, this is obviously unacceptable: the TurnOnKinematic flag has to be removed and replaced by TurnOffGravity, which prevents the object from falling while being gripped. Furthermore, the VelocityMovement flag was turned on, so that the object will attempt to move towards the position and rotation of the hand. This results in very natural throws, but also in the rotation of the whole combination of hand and object when attempting to drag the object over a surface.

The final flag that can be useful for the kind of interaction used in this project is SnapOnAttach. If left unchecked, the object is grabbed from the point where the hand touches it. If checked, the object teleports so that its center corresponds with the center of the volume enclosed by the hand.
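A sketch of this configuration, assuming the SteamVR Interaction System's Hand.AttachmentFlags enum and the public attachmentFlags field on Throwable, might look as follows; the flag combination mirrors the choices described above.

```csharp
using UnityEngine;
using Valve.VR.InteractionSystem;

// Sketch, assuming the SteamVR Interaction System's Hand.AttachmentFlags enum:
// configure a Throwable so a held object keeps colliding with the scene.
[RequireComponent(typeof(Throwable))]
public class CollisionFriendlyGrab : MonoBehaviour
{
    void Awake()
    {
        var throwable = GetComponent<Throwable>();
        // Keep the object non-kinematic while held, disable gravity instead,
        // and let it chase the hand's pose through velocity movement.
        throwable.attachmentFlags =
            Hand.AttachmentFlags.DetachFromOtherHand |
            Hand.AttachmentFlags.ParentToHand |
            Hand.AttachmentFlags.TurnOffGravity |
            Hand.AttachmentFlags.VelocityMovement;
    }
}
```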

In order to make the interaction even more natural, it is possible to use the Pose system, which can be used to define one or more positions of the hand over the grabbed object. Each pose needs to be adapted to the shape of the held object, which can be tricky with objects of different sizes, while maintaining a naturalistic look for the position assumed by the hand, as shown in Figure 2.15. Also, this system is not really needed for the Leap Motion, where the actual finger position can be detected.

Figure 2.15 - Cube with customised Skeleton Pose

While the system allows for room-scale tracking over a virtual room that is sized identically to the real room, it is more convenient to allow the user to teleport around the virtual space, so they don’t have to deal with the extension cable system. The SteamVR Interactions_Examples scene, located in Assets>SteamVR>InteractionSystem, contains a ready-made TeleportArea, which is simply a Plane with a Teleport Area script, and which acts as a target for the Teleporting script. It was sufficient to translate it to the center of the room and scale it along the three axes in order to cover most of the area. Particular attention must be paid to its height: it must be just slightly higher than the floor, otherwise the rays cast from the controller to the TeleportArea would collide with the floor plane.

The Teleporting Prefab, located in Assets>SteamVR>InteractionSystem>Teleport>Prefabs was then added to the Player object, providing pre-made scripts for casting rays between the controller and the target point on the TeleportArea, and highlights and tooltips explaining how to teleport.

Figure 2.16 - Highlighted TeleportArea in the Scene view

3 - Sound Synthesis

The first aspect to cover is what sounds to play, and how to obtain them. Sounds can be either generated from scratch (which is what is generally called synthesis), using a physical model, or they can be recordings of real sounds.

Several kinds of sound events will be covered, but collision sounds are the most relevant to the objectives of this project, as they are by far the most noticeable and easy to localize when manipulating an object.

In real life the sound generated by a collision depends on several physical factors, including the materials, shapes and velocities of the objects involved. In order to handle the physics of collisions, Unity provides Colliders, and the OnCollisionEnter() and OnCollisionExit() messages, which are sent when two RigidBody objects start and stop intersecting, respectively. There is also a Collision class, whose instances contain, among other fields, impulse and relativeVelocity [https://docs.unity3d.com/ScriptReference/Collision.html].
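The simplest use of this information is sketched below: play a clip on impact and scale its volume with the relative collision speed. The clip, threshold and scaling constants are illustrative, not taken from any of the libraries discussed later.

```csharp
using UnityEngine;

// Minimal sketch: play a pre-recorded clip on impact, with the volume
// scaled by the relative collision speed reported by Unity.
public class SimpleCollisionSound : MonoBehaviour
{
    public AudioClip impactClip;
    public float minImpactSpeed = 0.2f;   // ignore very gentle contacts
    public float maxImpactSpeed = 5.0f;   // speed at which the volume saturates

    void OnCollisionEnter(Collision collision)
    {
        float speed = collision.relativeVelocity.magnitude;
        if (speed < minImpactSpeed)
            return;

        float volume = Mathf.Clamp01(speed / maxImpactSpeed);
        // Play at the first contact point so the sound is roughly co-located with the impact.
        AudioSource.PlayClipAtPoint(impactClip, collision.contacts[0].point, volume);
    }
}
```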

Procedural sound synthesis

Sound synthesis is a whole academic field, and it is therefore an immense task just to summarize the main approaches. The following is a summary of the most common synthesis methodologies, and their possible use to reproduce sound events of different categories.

These are techniques frequently used in music production; they are not specifically designed for creating the kind of sounds needed for this project, which would mostly fall under the purview of Foley artists.

Common synthesis methodologies

Additive synthesis

[https://www.soundonsound.com/techniques/introduction-additive-synthesis]

Any waveform can be represented as a sum of sine waves. In a harmonic oscillator, the frequency of each sine wave is a harmonic of the fundamental frequency. A waveform can therefore be described by its harmonic content, and it can be reconstructed from an appropriate harmonic series.

So, to synthesize an arbitrary waveform, the appropriate sine waves are generated, at appropriate frequencies and amplitudes, and added together, hence the name of the methodology. Figure 3.1 shows the basic operating principle of additive synthesis.

This technique has been in use since before the transistor was even invented; for example, it is the working principle of the Hammond Organ, in which 91 tonewheels, each positioned next to a pickup, generate something really close to a sine wave. Each extended drawbar adds the sine wave generated by another tonewheel to the output, creating a complex output waveform.

Figure 3.1 - Additive synthesis
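As a minimal sketch of the principle (the harmonic amplitudes passed in are arbitrary placeholders), additive synthesis can be implemented by summing sine waves into a sample buffer:

```csharp
using System;

// Minimal sketch of additive synthesis: sum a handful of harmonics of a
// fundamental frequency into a single waveform buffer.
public static class AdditiveSynthesis
{
    // amplitudes[k] is the amplitude of the (k+1)-th harmonic of 'fundamental'.
    public static float[] Synthesize(double fundamental, double[] amplitudes,
                                     int sampleRate, double durationSeconds)
    {
        int sampleCount = (int)(sampleRate * durationSeconds);
        var buffer = new float[sampleCount];
        for (int n = 0; n < sampleCount; n++)
        {
            double t = (double)n / sampleRate;
            double sample = 0.0;
            for (int k = 0; k < amplitudes.Length; k++)
                sample += amplitudes[k] * Math.Sin(2.0 * Math.PI * fundamental * (k + 1) * t);
            buffer[n] = (float)sample;
        }
        return buffer;
    }
}
```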

Subtractive synthesis

Here too, waveforms are represented by their harmonic content. Starting from standard waveforms, like square waves or sawtooth waves, their harmonics are attenuated in order to obtain the desired waveform. It is also common to employ envelopes, filters, and Low Frequency Oscillators. This is the method employed by most software synthesizers, and in a way by the human voice. It is therefore what people think of when talking about synths, but it is not the most obvious approach for generating naturalistic sounds.

Figure 3.2 - Subtractive synthesis
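A minimal sketch of the subtractive idea (the cutoff and waveform choices are illustrative): generate a harmonically rich sawtooth wave and then attenuate its upper harmonics with a simple one-pole low-pass filter.

```csharp
using System;

// Sketch of subtractive synthesis: a naive sawtooth (rich in harmonics)
// passed through a one-pole low-pass filter that attenuates the upper harmonics.
public static class SubtractiveSynthesis
{
    public static float[] FilteredSaw(double frequency, double cutoffHz,
                                      int sampleRate, double durationSeconds)
    {
        int sampleCount = (int)(sampleRate * durationSeconds);
        var buffer = new float[sampleCount];

        // One-pole low-pass coefficient derived from the cutoff frequency.
        double alpha = 1.0 - Math.Exp(-2.0 * Math.PI * cutoffHz / sampleRate);
        double state = 0.0;

        for (int n = 0; n < sampleCount; n++)
        {
            double phase = (frequency * n / sampleRate) % 1.0;
            double saw = 2.0 * phase - 1.0;            // naive sawtooth in [-1, 1]
            state += alpha * (saw - state);            // low-pass filtering
            buffer[n] = (float)state;
        }
        return buffer;
    }
}
```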

Granular synthesis

First, a sound is sampled. The samples are then cut into very short pieces, of up to 50 ms, called grains. These grains are played back at different speeds, overlapping each other, while their phase, volume, frequency and so on are modulated.

At slow speeds this results in a cloud of grains, which can still be heard as individual sound events. Over a certain speed threshold they merge into a single sound, whose timbre is determined by the characteristics of the grains of which it is composed.
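A rough sketch of the technique (grain length, count and envelope are illustrative, and the source buffer is assumed to be longer than one grain): grains are copied out of a source buffer and overlap-added into the output at random positions, with a triangular fade to avoid clicks.

```csharp
using System;

// Rough sketch of granular synthesis: short grains are copied out of a source
// buffer and overlap-added into the output at random positions.
public static class GranularSynth
{
    public static float[] Synthesize(float[] source, int sampleRate,
                                     double grainSeconds, int grainCount, double outputSeconds)
    {
        var rng = new Random();
        int grainLength = (int)(grainSeconds * sampleRate);   // assumes source.Length > grainLength
        var output = new float[(int)(outputSeconds * sampleRate)];

        for (int g = 0; g < grainCount; g++)
        {
            int sourceStart = rng.Next(0, source.Length - grainLength);
            int outputStart = rng.Next(0, output.Length - grainLength);
            for (int i = 0; i < grainLength; i++)
            {
                // Triangular envelope: fade in over the first half, out over the second.
                float env = 1.0f - Math.Abs(2.0f * i / grainLength - 1.0f);
                output[outputStart + i] += source[sourceStart + i] * env;
            }
        }
        return output;
    }
}
```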

Classification of sound events

Impact sounds

An impact is defined as a brief collision with an exchange of momentum between the bodies, as described in an engine-neutral way in [https://github.com/dylanmenzies/phya-code/blob/master/papers/menzies09_PhyaAndVFoleyPhysicallyMotivatedAudioForVirtualEnvironments.pdf].

Informally, impact sounds are generated when one object hits another. In order to trigger this kind of event, we will have to define a maximum contact duration and a minimum collision speed.

For most simple materials the impact sound can be approximated by filtering some white noise and applying an appropriate decay. Some more complex materials can be approximated with modal resonators. This is typical of metal, glass and ceramic, where the impact sound can be convincingly approximated by identifying a few modal frequencies and adding them to the collision sound itself, together with some decay formula, usually exponential. Some implementations [the Microsoft paper] create a synthesis model by sampling a real object to extract its modal frequencies and damping curve.
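The modal-resonator idea can be sketched as a sum of a few decaying sinusoids; the mode frequencies, decay rates and amplitudes below are placeholders, not measurements of any real material.

```csharp
using System;

// Sketch of a modal impact sound: a sum of a few modal frequencies,
// each with its own exponential decay.
public static class ModalImpact
{
    public static float[] Synthesize(double[] modeFrequencies, double[] decayRates,
                                     double[] amplitudes, int sampleRate, double durationSeconds)
    {
        int sampleCount = (int)(sampleRate * durationSeconds);
        var buffer = new float[sampleCount];
        for (int n = 0; n < sampleCount; n++)
        {
            double t = (double)n / sampleRate;
            double sample = 0.0;
            for (int m = 0; m < modeFrequencies.Length; m++)
                sample += amplitudes[m] * Math.Exp(-decayRates[m] * t)
                                        * Math.Sin(2.0 * Math.PI * modeFrequencies[m] * t);
            buffer[n] = (float)sample;
        }
        return buffer;
    }
}
```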

Other researchers [https://www.mpihome.com/files/pdf/Article_impact_noise.pdf] study the real sounds produced by collisions and attempt to emulate them with Finite Elements Modeling.

These techniques are used mostly in industrial settings, where knowing the characteristics of the noise is more important than creating a realistic simulated sound. They are concerned with noise pollution, not with user interaction.

For a more formal and comprehensive review of the Phenomenology of Sound Events, see [http://www.soundobject.org/papers/deliv4.pdf], which includes an extensive bibliography of the relevant literature.

It should also be noted that the damping of a resonating object is often variable, as holding the object or filling it will change its damping properties. A simple glass holding water is an extremely complex sound source, and there are only a few physics based attempts at modeling it. On the other hand the effects of striking an empty glass can be simulated quite convincingly.

Continuous Contact

A continuous contact event is defined as one in which two bodies enter in contact and stay in contact for a prolonged amount of time, while still exchanging momentum. The latter condition sets it apart from objects simply being static.

In practice these are going to be sliding and scraping sounds.

The slip speed can be a useful parameter for discriminating between scraping and rolling. If there is a prolonged contact with zero slip speed, it is going to be considered pure rolling. If it is a pure slip, the contact point will remain exactly the same. It is of course also possible to encounter hybrid events, in which slipping and rolling take place at the same time.
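A sketch of how this discrimination could be computed in Unity (not taken from any of the libraries discussed here): estimate the slip speed at a contact point from the point velocities of the two bodies.

```csharp
using UnityEngine;

// Sketch: estimate the slip speed at a contact point from the velocities of the
// two bodies, to distinguish sliding from rolling.
public static class ContactClassifier
{
    public static float SlipSpeed(Rigidbody a, Rigidbody b, Vector3 contactPoint)
    {
        // GetPointVelocity accounts for both linear and angular velocity of the body.
        Vector3 va = a.GetPointVelocity(contactPoint);
        Vector3 vb = b != null ? b.GetPointVelocity(contactPoint) : Vector3.zero;
        return (va - vb).magnitude;   // near zero => rolling, large => scraping
    }
}
```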

Rolling sound

These tend to have lower energy than sliding sounds, as the relative velocities between the colliding points are quite low, and they also tend to be periodic and lower in frequency. A naive approach could be to apply a low-pass filter to the sliding sounds.

[http://www8.cs.umu.se/kurser/TDBD12/HT01/papers/foleyautomatic.pdf]

Harmonic Fluids

These are used for simulating liquids, their swooshing and flowing, and their bubbling.

According to [Harmonic Fluids Changxi Zheng Doug L. James - Cornell University], most of the sound energy produced by fluids is caused by the interaction between small bubbles, modeled as harmonic resonators, and surface tensions.

This is the opposite of the intuition according to which the sound is generated by the impact between the liquid surface and its surroundings.

Air and wind

These are sounds made by objects moving through the air at speed, or by air moving through the objects (wind). In particular, Wwise provides the Soundseed Air plugin. In games they are frequently used in a non-naturalistic way, to give users feedback, in particular to represent the swinging motion of weapons and tools handled by the user. As the attempt in this project is to create a naturalistic soundscape, they are not being considered. It is unlikely that the kind of precise movement required by handling tasks would produce an audible swooshing sound, and there definitely is no audible wind in the room.

It is generally possible to produce air sounds using a recorded swoosh, which is quite constant in its middle part. This makes it possible to arbitrarily change the length of the sound event. All the perceived information about the size and speed of the handled object is encoded in the beginning and end of the sound.

Fracture

These are sounds made by objects breaking apart. They can vary widely based on the material and on the kind of event causing the fracture. They are going to be completely ignored in this project, as the kind of simulation required would be very difficult, and it would not match the kind of handling task under consideration. In most games they are pre-recorded sounds representing the whole event; in more advanced cases they are composed of an initial fracture sound, after which the collisions between the fragments are treated as a series of individual collisions.

Software Packages

The most basic approach to producing a sound is to fit the objects to be handled with an AudioSource and play a pre-recorded audio sample when a collision is detected. This will always be the same sound, so its repetition will sound unnatural over subsequent collisions, but it is better than nothing.

The next improvement is to reduce the repetitiveness of the sound; a naive approach is to record several collision sounds and play one at random. This is what the PlaySound method in SteamAudio does, and it has been used to great effect in several games. The sound can then be modified based on the velocity and direction of the contact between the objects, as shown in Figure 3.3.

Figure 3.3 - Physical parameters at contact point

PhysSound

Luckily, there already is a Unity plugin that does just this, and even handles slide sounds. It is called PhysSound, and was developed by Kevin Somers to be sold on the Unity Asset Store. It has since been discontinued and removed from distribution, but old copies started being used by the VRChat creator community, and the author very graciously re-released the software as an open-source (MIT license) package.

In order to use PhysSound to produce handling noises, one has to add a PhysSound Object to the GameObject being handled, and give it an appropriate PhysSound Material. The sound can come from an AudioSource module added to the GameObject, or PhysSound can create it autonomously, even taking into account the dimensions of the sound source and the coordinates of the collision.

PhysSound materials are classified based on how hard or soft they are: they can be Hard, Soft or Other. The library changes the volume and pitch based on the material properties, partly based on the characteristics of the collision (velocity, direction, scale of the objects involved), and partly based on the material of the other object involved in the collision.

The audio files to be played come from the Audio Sets associated with each PhysSound Material. They are made up of several audio samples for each material type the object can collide with, ordered by the force of the impact.

For example the basketball material has 4 samples of collisions with hard materials, called basketball_hard_[0-4].wav, where basketball_hard_0.wav is the one with the smallest impact force, and basketball_hard_4.wav is the one with the greatest impact force. Similarly, the collision sounds with soft materials are named basketball_soft_[0-4].wav.

PhysSound comes with just 8 predefined materials, which already cover most of the objects present in the virtual room. It is possible to add more sounds, covering more materials, and simply add them to the Audio Sets.

It also supports rolling and sliding sounds, between which it does not distinguish. A special sliding sound can be provided, and it too is modulated according to the velocity. In practice, the range of motion performed by the user when dragging an object around can result in audible artefacts. Sometimes no sound is produced at all; at other times the system interprets the sliding motion not as a continuous contact, but as a series of short contacts.

The problem with PhysSound is that it tends to play sounds at a very low volume. It is possible to use a global gain, but that also affects every other sound in the scene. It is also difficult to set up the minimum and maximum velocity thresholds, which leads to sounds either not starting at all, starting with some delay, or starting several times for repeated collisions. It is, therefore, a simple solution for starting to add sound to a scene, but not an easy one to manage.

Newton VR

This library can act as a general-purpose multi-headset input and interaction layer. However, the team has dissolved, and the code base has not been kept up to date with the developments in Unity and in the various VR libraries. As a consequence, attempting to use it as an input and interaction library does not work out of the box in 2019. This is a known problem, but nobody has gone through the effort of fixing it, despite the open-source nature of the product.

The important part of this library, for the purposes of this project, is the collision sound generation. Much like PhysSound, it works by picking an appropriate sound out of a library and modulating it based on physical parameters. The difference is that in this case a sound source must be added to both colliding objects. While simply adding a PhysSound source to a ball and banging it against a table will produce a sound in PhysSound, adding an NVR Collision Sound module to the ball and not to the table will not result in a sound when banging the ball against the table.

Therefore, an NVR Collision Sound module is added to each user-accessible surface, and a material is assigned. The volume and time-correlation for collision events are quite good, but the version of the library distributed on the Unity Asset Store has a rather important bug: it does not actually select the audio files of the individual materials, it just plays back the default set of sounds for each and every material.

After going through the code, and recording special debugging sound files that would make the process easier, it was possible to fix the bug. The debugging sound files were recordings of the name of the material, clearly enunciated, so that it would be possible to easily understand which sound file had been loaded.

An NVRCollisionSoundController must be added to a GameObject in the scene.

In NVRCollisionSoundController the sound pool size (the number of concurrent audio threads, which determines how many simultaneous sounds can happen in the scene), the pitch modulation (whether it is on or off, and its range), the minimum collision volume and the maximum collision velocity are set. The Play method accepts the material, position and impact volume of a sound event as arguments, and then passes them to the play method of the sound provider. The Provider can be set to None, Unity, or FMOD. In this project the Unity audio engine was used.

NVRCollisionSoundMaterials builds an enum, in which all the audio samples for the various materials are loaded and ordered. In order to add a new material its name must be added manually to the enum, in lowercase, and its audio files must be put in Assets>NewtonVR>Resources>CollisionSounds in a subfolder named “surface - materialname”, where materialname is the name of the new material. The files must be called materialname__1.wav, materialname__2.wav, and so on, with an arbitrary number of sound samples. It is important to remember to use two underscores, and not just one.
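A sketch of how clips following this convention could be loaded through Unity's Resources API (this is not NewtonVR's actual code, just an illustration of the folder layout described above):

```csharp
using System.Collections.Generic;
using UnityEngine;

// Sketch (not NewtonVR's actual code): load the audio clips for a material using the
// "surface - materialname" subfolder convention under a Resources folder.
public static class CollisionSoundLoader
{
    public static List<AudioClip> LoadClips(string materialName)
    {
        var clips = new List<AudioClip>();
        // Resources.LoadAll returns every asset found under the given Resources subfolder.
        foreach (var clip in Resources.LoadAll<AudioClip>(
                     $"CollisionSounds/surface - {materialName}"))
        {
            clips.Add(clip);
        }
        return clips;
    }
}
```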

Then, an NVRCollisionSoundObject component is added to each object that will have to produce a sound. A Material and a Collider are assigned to each NVRCollisionSoundObject. For each collision the OnCollisionEnter method is called, during whose execution the impact volume is calculated using a cubic easing equation. If the resulting volume is larger than the minimum collision volume set in the NVRCollisionSoundController, the Play method is called twice, once with the Material of the object spawning the sound event, once with the Material of the object it is colliding with.

Figure 3.4 - NewtonVR Collision Framework structure

FMOD

FMOD is a dedicated gaming sound design software, which allows for the creation of dynamic sounds and their interaction in a user interface that closely resembles a Digital Audio Workstation. It can interface directly with all major game engines, including Unity, and it supports all major deployment platforms. It is not free software, but it is free for non-profit use and for small projects.

It can also interface with Resonance Audio for Ambisonics spatialization.

The fundamental realization for understanding FMOD is that its user interface is based on that of DAWs, but with one important difference. Every time a sound has to be produced in the game engine, an event is created in FMOD. Each event has a master track, which is broadly similar to a timeline, and which groups together all the tracks, instruments and parameters. Parameters can be used as inputs for instruments, which put sounds on tracks. All tracks are mixed on the master track.

When looking at an event in FMOD Studio one sees a timeline, just like in a DAW, and several tabbed Parameters. These look just like other timelines, but their horizontal axis is not time, but whatever parameter the user has decided to use; for example, in Figure 3.5 the impact speed is used. Once the parameter is set up, it is possible to insert several sounds and automate their modulation, as well as which one to actually play, based on several factors, including the parameter itself and random picking, with the usual affordances of audio workstations.


Figure 3.5 - FMOD collision event, with speed parameter and multiple instruments

Wwise

Wwise is a video game audio middleware, comprising several plugins, which allow developers to handle some of the sound design tasks not directly handled by game engines. Over the years some of the features have been added to the game engines themselves, particularly to Unity, but Wwise still provides enough quality of life features that it remains an industry standard.

In particular, they provide Wwise Soundseed Air for air and wind sounds, with built-in variation and with the possibility of tying the sound simulation parameters to game variables.
Wwise Soundseed Grain is a granular synthesizer, supporting most derived techniques, like wavetables. This is particularly useful for particle effects.

Finally, Wwise Soundseed Impact is dedicated to the interactive generation of collision sounds, and it is based on cross-synthesis, a technique that imposes the envelope of one sound onto another, similar to a vocoder.

There are also basic audio processing facilities, such as compressors, reverbs and gains.
Of particular importance are the capability of convolving with arbitrary impulse responses, and compatibility with Resonance Audio, Oculus and Steam. There are also community plugins called AudioRain and AudioWind, for the synthesis of rain and wind noises respectively.

Phya

Phya was written by Dylan Menzies in C, and a Java reimplementation called JPhya was later released. It is a complete solution for synthesizing physically motivated sounds for games and 3D simulations. It does not rely on a simulation of the actual physics involved, but it uses some physical parameters as the inputs of its sound synthesis.

Several materials and impact types are supported, but it is not possible to add more materials without coding their desired response. Neither the C nor the Java version has been updated in years, but both are well documented and are distributed with the relevant scientific papers.

There is no Unity package available, but there is an example integration with the Bullet physics engine. A C# reimplementation of Bullet is available for Unity, so with some effort it should be possible to integrate Phya into Unity. That would be quite an involved process, however, as it would mean replacing PhysX, the physics engine used by Unity and on which the VR system relies.

Sounding Object Project

The Sounding Object project was a collaboration between the universities of Verona, Udine, Limerick and the Kungliga Tekniska högskolan. Its purpose was very similar to that of this project: to develop sound models for objects and their interactions during user manipulation. It resulted in a book [http://www.soundobject.org/SObBook/SObBook_JUL03.pdf], several papers [http://www.soundobject.org/papers/sob.pdf, http://www.soundobject.org/papers/fonrocapo.pdf] and some software examples. These are mostly written in Matlab, but some are also available as PureData modules written in C. These modules are able to physically simulate modal resonators, which are particularly useful for metal and glass virtual objects.

The difficulty lies in integrating these scripts with Unity. One possible approach would be running a full instance of PureData, adding the UnityOSC library to the project, and communicating the relevant physical information through the OSC protocol. In that case the Unity audio engine could be disabled entirely, and all sounds, including their spatialisation, would have to be produced in PureData.
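As a sketch of what such a bridge could look like, the snippet below sends the impact speed of a collision to a PureData patch listening on a UDP port, encoding the message according to the OSC 1.0 specification. The address "/collision/impact", the port number and the single-float payload are arbitrary choices; in practice a library such as UnityOSC would take care of the encoding.

    using System;
    using System.Collections.Generic;
    using System.Net.Sockets;
    using System.Text;

    // Minimal OSC 1.0 sender, sketching how collision data could be pushed to a
    // PureData patch. Address and port are arbitrary examples.
    public static class OscBridgeSketch
    {
        private static readonly UdpClient Udp = new UdpClient();

        public static void SendImpact(float impactSpeed, string host = "127.0.0.1", int port = 9000)
        {
            byte[] packet = BuildMessage("/collision/impact", impactSpeed);
            Udp.Send(packet, packet.Length, host, port);
        }

        private static byte[] BuildMessage(string address, float value)
        {
            var bytes = new List<byte>();
            bytes.AddRange(PadString(address)); // OSC address pattern
            bytes.AddRange(PadString(",f"));    // type tag string: one float argument

            byte[] arg = BitConverter.GetBytes(value);
            if (BitConverter.IsLittleEndian) Array.Reverse(arg); // OSC floats are big-endian
            bytes.AddRange(arg);
            return bytes.ToArray();
        }

        // OSC strings are null-terminated and padded to a multiple of four bytes.
        private static byte[] PadString(string s)
        {
            byte[] raw = Encoding.ASCII.GetBytes(s);
            int padded = ((raw.Length / 4) + 1) * 4;
            Array.Resize(ref raw, padded); // new elements are zero-filled
            return raw;
        }
    }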

This approach would introduce an external dependency, which could make operating the system harder. There is, however, a library version of PureData, called libPD, which can be integrated directly into Unity. The problem is that both the PureData modules provided by the Sounding Object Project and the libPD builds available in a Unity-compatible format are 32-bit, while the VR runtimes all require the 64-bit version of Unity. All the source code is available, so this hurdle is not insurmountable, but getting past it would require a concerted development effort.

4 - Sound Spatialisation

In the real world sound is always spatialised: it comes from a specific source at a specific point in space. Historically, spatial information has not been a high priority for reproduced sound. Early technological limits led to decades of mono recordings, with stereo only becoming the default in the late 1960s. While it is perfectly possible to encode spatial information in a stereo recording, using binaural techniques or simple microphone configurations such as ORTF, this has not become the standard in the music industry.

Most music arbitrarily distributes voices and instruments between the two channels without any spatialisation, and as headphones have become the most common form of consumption, millions of listeners have grown used to hearing sounds "in their heads".

Most people do not really know what "stereo" means; it has become just another branding term, much like "turbo" in the late 1980s.

Just like noise cancelling, sound spatialisation is not easily noticed until it is taken away, because it simply sounds right and natural.

The film industry is the most prominent promoter of spatial audio techniques to the general public, even if it mostly relies on panning mono files to the desired perceived location, without any attempt at reconstructing natural soundscapes. It is an artistic and narrative use of spatial audio.

In the gaming industry, instead, spatial information has been considered crucial since the early 1990s, with the advent of 3D games: it is used to locate off-screen enemies in first-person shooters, to detect and prevent overtaking attempts in racing games, and so on. For this reason there are several ready-made software packages that handle sound spatialisation in a way that is easily compatible with Unity. Unity itself already handles the spatialisation of direct sound in a basic way, but with some simple plugins it is possible to obtain better results on the direct sound and to include the simulation of reflected sound, including reverberation, occlusion and diffraction.

4.1 - Acoustics and terminology

The field of acoustics is as wide as it is old, and it tends to be subdivided into endless branches, each with its own jargon and conventions. For the purposes of this project only a few notions of spatial audio and of room acoustics will be needed.

A proper introduction to the basic concepts of acoustics is beyond the scope of this document, but is readily available [Beranek, etc].

In a room, sound can either be direct or reflected, as clarified by Figure 4.1. The combined effect of all reflected sounds is called reverberation. The reverberation time is one of the primary measurements used to characterize the acoustic response of a room: it is a measure of how long sound keeps being reflected within it. Historically [Sabine] the threshold used was simply that of human hearing, which was later standardized as a 60 dB decay; the reverberation time is thus frequently called T60.
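For reference, the classic Sabine estimate relates the reverberation time to the volume of the room and to its total absorption:

    T_{60} \approx 0.161 \, \frac{V}{\sum_i S_i \alpha_i}

where V is the volume of the room in cubic metres, S_i is the area of each surface in square metres, and \alpha_i is its (frequency-dependent) absorption coefficient.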

Of course the frequency response of the transmission also depends on the room. Reverberation and frequency response can be combined into an impulse response, a complete linear time-invariant model (although non-linear versions also exist [http://www.angelofarina.it/Public/Papers/241-AES123.pdf]) of sound transmission from a source position to a receiver position.

As seen in section 4.2, it is possible to use several impulse responses for a room to better characterize its response in different locations.

Figure 4.1 - Direct and reflected sound

Occlusion is conceptually very simple: it is the phenomenon by which objects occupying the path between source and receiver affect the arrival of the direct sound.

Finally, directivity is a simple concept, though little known to laymen, and it applies to both sources and receivers. Sound coming from a source does not propagate equally in every direction: it is characterised by different attenuations at different angles and at different frequencies. It is often represented as a polar pattern, as in Figure 4.2.

Figure 4.2 - Cardioid polar pattern, Wikimedia commons
https://commons.wikimedia.org/wiki/File:Polar_pattern_cardioid.png

This is most commonly seen for microphones, which are often referred to by their polar patterns, like cardioid or figure eight.

4.1.1 - Angular sound localisation

Humans perceive lateral sound direction based on inter-aural time differences (ITD), which allow for the discrimination between right and left source positions, and inter-aural level differences (ILD), which allow for the localization of higher-frequency sounds, with significant overlap between the two mechanisms. This has been well attested in the literature since the work of Lord Rayleigh [https://www.tandfonline.com/doi/abs/10.1080/14786440709463595].

Figure 4.3 - Showing paths of different lengths going from Source to Listener, as an example of ITD
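A common first-order model of the ITD, for a spherical head of radius r and a source at azimuth \theta in the frontal hemisphere, is Woodworth's formula:

    \mathrm{ITD}(\theta) \approx \frac{r}{c} \left( \theta + \sin\theta \right)

With r \approx 8.75 cm and c \approx 343 m/s, the maximum delay, for a source at 90 degrees, is on the order of 0.65 ms.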

For discriminating between front and back, and between top and bottom, the interference patterns generated by the pinnae are very important, but anteroposterior ambiguity is still a commonly experienced phenomenon. Turning the head shifts the interaural axis, thus disambiguating the sound source location.

Figure 4.4 - Anteroposterior ambiguity, the ITD for the green and red object is the same

It is possible to encode the way an individual's head and ears affect sound reception in a linear time-invariant function called the Head-Related Transfer Function (HRTF). Given an HRTF for each ear, it is possible to reconstruct a personalised binaural signal, which gives that individual listener a stronger spatial localisation effect.

4.1.2 - Sound source distance

Naturally sound propagation is also affected by distance. Closer sound sources have a higher share of direct sound.

Of course the closer the sound source is to the listener, the louder it will be. This is only really helpful for known sounds, but people will naturally also pay more attention to louder sounds.

Frequency response was already mentioned; in general, low frequencies travel further than high frequencies, so the same sound originating further away from the listener will be perceived as having a duller, more low-frequency-heavy character.

Similarly to the ITD, there is also the ITDG (initial time delay gap), the time difference between the arrival of the direct sound and the arrival of the first reflection. For close sources the ITDG is large; for farther sources it is smaller.

4.2 - A comparison of sound spatialisation libraries

All of the libraries covered in this section are employed to give pre-existing sound assets a spatial component, not to create or synthesize the sounds themselves. The only exception is Steam Audio, which is capable of picking a random sound from a table, a very common technique for avoiding noticeable repetition.

The other important distinction is between libraries that auralize each sound individually and libraries that employ an Ambisonics intermediate step. The latter approach requires just one HRTF convolution for the whole soundscape, instead of one for each spatialized sound source.

In direct binaural rendering each sound source is processed based on the relative position and orientation between source and receiver. From this relative position, the Azimuth and Elevation of the incoming sound are obtained. Usually they do not exactly match any sampled impulse response, so an interpolation is performed between the three closest ones. The available HRTFs are stored in a SOFA file, which is essentially a matrix ordered by Azimuth and Elevation, as shown in Figure 4.5.

Figure 4.5 - Structure of a SOFA file
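A sketch of the first step, deriving Azimuth and Elevation from the source position expressed in the listener's reference frame, is shown below. The angle convention used here (azimuth measured on the horizontal plane from the forward axis, elevation measured from the horizontal plane) is an assumption for this example; SOFA files may use different conventions.

    using UnityEngine;

    // Sketch: derive the incoming sound direction (azimuth, elevation) used to look up
    // and interpolate HRTFs. The angle convention is an assumption for this example.
    public static class HrtfDirectionSketch
    {
        public static void GetAzimuthElevation(Transform listener, Vector3 sourcePosition,
                                               out float azimuthDeg, out float elevationDeg)
        {
            // Express the source position in the listener's local reference frame,
            // so that head rotation is automatically taken into account.
            Vector3 local = listener.InverseTransformPoint(sourcePosition).normalized;

            // Azimuth: angle on the horizontal plane, measured from the forward (+z) axis.
            azimuthDeg = Mathf.Atan2(local.x, local.z) * Mathf.Rad2Deg;

            // Elevation: angle above (positive) or below (negative) the horizontal plane.
            elevationDeg = Mathf.Asin(Mathf.Clamp(local.y, -1f, 1f)) * Mathf.Rad2Deg;
        }
    }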

The points from the SOFA files are projected on a sphere surrounding the sound receiver, and the surface of the sphere is divided into triangles whose vertices correspond to the points in the SOFA file, as shown in Figure 4.6.

Figure 4.6 - Sphere of incoming sound directions

There are several methods for computing the interpolation between the three sampled points, and a variety of ways of measuring binaural impulse responses, as seen, for example in [http://www.angelofarina.it/Public/Papers/143-IIAV00.PDF].

As the number of directive sources grows, the computational load of direct binaural rendering grows very quickly, and the handling of moving objects requires quickly changing the convolution matrix, with algorithms designed to reduce audible artifacts and "jumps" as much as possible. Several such algorithms are established in the literature, along with rendering techniques based either on headphones or on loudspeakers, as described for example in [https://ieeexplore.ieee.org/abstract/document/5346545], with near-field rendering, or in [https://asa.scitation.org/doi/full/10.1121/1.5040489], with diffuse rendering.

Conversely, Ambisonics rendering, shown in Figure 4.7, projects all incoming sounds onto a soundfield, which is a mathematically simple operation consisting only of appropriate gains. The number of channels employed depends on the selected Ambisonics order, with a minimum of four channels for first order and a maximum of sixteen channels for third order. The signal is then rotated based on the orientation of the listener's head, expressed as Azimuth and Elevation, and convolved with a static HRTF.


Figure 4.7 - Ambisonics rendering pathway
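The number of channels N for an Ambisonics signal of order M is N = (M+1)^2, hence the four channels of first order and the sixteen of third order. As an example of how simple the encoding gains are, in the traditional first-order B-format (Furse-Malham) convention a mono signal S arriving from azimuth \theta and elevation \varphi is encoded as:

    W = \frac{S}{\sqrt{2}}, \quad X = S \cos\theta \cos\varphi, \quad Y = S \sin\theta \cos\varphi, \quad Z = S \sin\varphi

Other conventions, such as SN3D/AmbiX, differ only in the channel ordering and normalisation.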

4.2.1 - Unity Audio

The built-in Unity audio engine is based on attaching Audio Source components to the GameObjects that produce sounds and Audio Listener components to those that need to receive them. In traditional video games there is just one stereo Audio Listener, attached to the main camera. Being attached to GameObjects, these components have a Transform containing their position; the Unity Audio engine thus simulates the interaural time difference and performs a basic distance-based attenuation.

It does not, however, take into account scene geometry, materials, occlusion, and so on. These effects can be approximated using Audio Filters and Reverb Zones. Basically, Unity has components for performing any operation a Digital Audio Workstation could do, which is great for achieving the artistic vision of a video game, but not so great for creating something that sounds as natural as possible.

The directivity of an Audio Source can be set, in degrees, but the underlying conceptual framework is not really the directivity of an audio event: it is the emulation of different loudspeaker configurations.
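A minimal sketch of this workflow is shown below: an AudioSource configured for fully 3D playback with logarithmic distance attenuation, playing a clip when a collision occurs. The clip and the numeric values are placeholders.

    using UnityEngine;

    // Sketch of the built-in Unity audio workflow: a fully 3D AudioSource with
    // distance-based attenuation. The clip assigned in the inspector is a placeholder.
    [RequireComponent(typeof(AudioSource))]
    public class BasicSpatialSource : MonoBehaviour
    {
        public AudioClip impactClip;

        private AudioSource source;

        private void Awake()
        {
            source = GetComponent<AudioSource>();
            source.spatialBlend = 1f;                          // fully 3D (0 would be plain 2D stereo)
            source.rolloffMode = AudioRolloffMode.Logarithmic; // natural-sounding distance attenuation
            source.minDistance = 0.5f;                         // full volume within half a metre
            source.maxDistance = 20f;                          // effectively inaudible beyond this
        }

        private void OnCollisionEnter(Collision collision)
        {
            // The single AudioListener on the camera receives the sound, panned and
            // attenuated according to the relative Transforms.
            source.PlayOneShot(impactClip, Mathf.Clamp01(collision.relativeVelocity.magnitude / 5f));
        }
    }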

4.2.2 - Resonance Audio

Developed by Google, Resonance Audio is a complete replacement audio engine. Its distinguishing feature is that it is entirely based on Ambisonics, a technique for encoding the spatial information of a soundscape using spherical harmonics. As this technique was invented in the 1970s [Gerzon], when computational resources were scarce, it is very computationally efficient.

Compared to the Unity Audio engine it adds support for several features. The directivity patterns for sound sources can be customised. Reverb can be automatically computed based on the geometry of the scene. It can also handle occlusion, and it adds a special kind of Audio Listener capable of receiving and recording in Ambisonics.

Resonance projects all sounds into a high-order Ambisonics soundfield, so that the HRTFs can be applied only once instead of once for each sound source.

It comes with a ready-made database of HRTFs, but being open source it is also theoretically possible to use custom ones, even though no standard file format is supported.

Resonance Audio has great cross-platform support and, as covered in the Sound Synthesis chapter, it can even be integrated with FMOD.

The kind of sound source handled in this project tends to be punctiform and to move in space. It is certainly possible to use Resonance Audio for such sources, but it is not the most conceptually clean approach, as it means going from sound objects to an Ambisonics soundfield, and from that to a binaural rendering.

In order to integrate it into a Unity project, Resonance-specific components must be substituted for the default ones; in particular, Resonance Audio Sources and Listeners replace the Audio Source and Audio Listener components.

4.2.3 - Steam Audio

Developed by Valve, it builds on top of the Unity Audio engine, to which it adds extra features. Sound diffusion, which is already handled by Unity, gets an added frequency-dependent attenuation, making farther sources lose part of their high-frequency content, just as in real life.

Occlusion is also supported, and reverberation can keep into account both the geometry and the materials of the scene, using several simulated impulse responses.

Steam Audio works on three threads, interacting with the Unity audio engine. The GameThread is synchronized with Unity's MonoBehaviour.Update method (which is in turn synchronized with Unity's physics simulation) and passes the source and listener information (position and orientation) to the SimulationThread.

The SimulationThread handles the propagation simulation, and its speed is limited by the Rendering Thread, which handles direct occlusion, 3D audio and environmental effects for each sound source. The Rendering Thread, as the name implies, is synchronized with Unity's graphics rendering thread.

It should be noted that Unity supports several different rendering pipelines, which could affect the rates at which Steam Audio operates, if the configurations are particularly unusual. For this project the configuration was as standard as possible, so everything worked as expected.

For 3D audio, it employs customizable HRTFs contained in SOFA files, a well-known industry standard. It is therefore possible to find several pre-made compatible HRTFs, and even to pick one based on the shape of the listener's head.

For occlusion and reverb it performs a raytracing of the paths between the sound source and the listener, with options for handling dynamic geometries. For example, the geometry of the room changes when a door is opened and closed, which is going to affect reverb.

For occlusion, of course, dynamic handling is much more important, as the user is likely to move their position, or hold objects close to their head.

The directivity of sound sources can be customized, changing between omnidirectional, cardioid and figure eight. However, in this project the focus is mainly on impact sounds, which should be adequately approximated with omnidirectional sources and object occlusion.

4.2.4 - Project Acoustics

Developed by Microsoft Research with the internal name Project Triton [https://www.microsoft.com/en-us/research/project/project-triton/#!publications], it is a physics based approach to game sound design.

For each object in the scene, the developer provides information on what material it is composed of, based on a library of previously measured materials. Then, a navigation mesh is created, delimiting the parts of the scene in which the user is allowed to move.

The system then computes impulse responses for the sets of possible source and listener positions, by performing a voxelization of the environment and automatically positioning virtual probes.

The actual room acoustics calculations are performed in the cloud, and saved in a look-up table where the game engine accesses them for auralization. The resulting game can therefore use all the power and conceptual elegance of impulse responses without having to perform costly real-time convolution.

It is an open source project, with plugins available for Unity and Unreal, using Wwise in the latter case. In Unity, it simply uses standard Unity components, adding its own scripts and DLLs, and acting as a Unity Spatializer Plugin. It is important to note that in order to perform the pre-baking of a scene, it is necessary to have an Azure account.

It also supports portaling, the complex way in which sounds passing from one room to another interact with the acoustic response of both environments, and with the occlusion effects caused by the wall separating the two environments.

Being primarily intended for game design, it also allows the developer to customize sound sources in ways that would not be physically possible, excluding them from occlusion or even from distance-based decay, if it is deemed important that they be easy to localize. Similarly, the virtual rooms are not limited to natural behaviours: it is possible to arbitrarily alter their reflectiveness as desired.

4.2.5 - Wwise Spatial Audio

Developed by AudioKinetic, Wwise Spatial Audio is a series of spatial audio features for their Wwise audio middleware. It supports both the simulation and design of the acoustics of virtual rooms, with the Wwise reverb plugin, and the simulation of real, measured room acoustics, using Wwise Convolution.

It is possible to simulate  dynamic early reflections using their Wwise Reflect plug-in, with support for customized acoustic textures for simulating different materials. This also allows full customization of the reflective properties, for artistic sound design.

It also supports Ambisonics playback and recording up to the third order. By default Wwise replaces the Unity audio engine with its own output system. However, it can also be integrated with several third-party spatializers, such as Steam Audio, Oculus Audio and Resonance Audio. In the latter case it is possible both to use the Resonance Audio renderer as a spatializer for the Wwise Ambisonics pipeline, and to use the Resonance Audio Room Effects plugin to simulate the room acoustics. In other words, both Resonance and Wwise provide both a spatializer and a room acoustics package, and they can be freely mixed and matched.

Wwise has the widest cross-platform support of all the systems under consideration; it is free for non-commercial projects, but it is not open source.

4.2.6 - Oculus Audio SDK

Developed by Oculus, it supports sound spatialisation and head tracking, real-time reverb, volumetric (non-punctiform) sources and near-field acoustic diffraction rendering.

As a conceptual framework, Oculus Audio expects the developer to use Ambisonics tracks to provide an ambience, and sound-objects for dynamic sounds. It allows for the use of several spatializers, including a native Unity one, which works as a third party spatializer for the Unity Audio Engine. Similar plugins can act as spatializers for Wwise, FMOD, VST or AAX compatible digital audio workstations, and there is even a C++ SDK.

After importing the package, adding it to the project and setting it as the sound spatializer in the project settings, any Unity AudioSource can be fitted with an ONSPAudioSource module. ONSP stands for Oculus Native SPatializer.

Several attenuation, reverberation and reflection parameters can be customized. In order to use reflections an OculusSpatializerReflection plugin must be added to the Unity AudioMixer. While it is possible to set the radius of an AudioSource, it is not possible to determine its directivity. For collisions, the simulation of occlusion and reflections should be enough to simulate the directivity of the sound event adequately.

It is important to note that Oculus Audio performs the calculations necessary for the binaural rendering separately for each spatialized AudioSource. For complex scenes, with several sound sources moving around in space, and several different reflecting surfaces, the computational load might become noticeable. Oculus recommends only spatializing those sounds whose direction is semantically important for the Virtual Reality experience at hand. In the case of this project, it is unlikely that more than a dozen or so sound events could take place in the same time frame.

ONSPAudioSource does not natively support the simulation of occlusion; it only uses the reflections of objects within a short distance of the sound origin.

In the Oculus parlance, reflections are those within a short distance of the sound source, and everything else is considered reverb.

However, by adding the OculusSpatializerUnity.cs script to a static, empty GameObject [https://developer.oculus.com/documentation/audiosdk/latest/concepts/ospnative-unity-dynroom/], it is possible to override the geometry engine and substitute it with a new, ray-tracing-based one, whose ray tracing can then be configured. There is also experimental support for Audio Propagation, which takes both materials and geometry into account, but it is still in beta [https://developer.oculus.com/documentation/audiosdk/latest/concepts/ospnative-unity-audio-propagation/].

4.2.7 - Two Big Ears - Audio360 SDK

It was developed by a small Scottish startup, since acquired by Facebook, and it consists of three main components: the Spatial Workstation allows for the design of 360 audio and video experiences, leveraging existing DAWs; the Encoder packages the mix in standard formats suitable for distribution; finally, the Audio360 AudioEngine decodes the mix and renders it binaurally using head tracking.

The AudioObject class, part of the AudioEngine, allows for the spatialisation of mono or stereo files. For the purpose of this project, this would be the important functionality, as this SDK could be used for playback of appropriate collision sounds coming from the right angle and distance. However, there are no affordances for dealing with the acoustics of the virtual room or for occlusion.

It is mainly used for building 360 video players with high order Ambisonics support.

Package | Sound Object | Ambisonics | Reflections | Occlusion | Acoustic Materials | Frequency attenuation | Room Geometry | Sound Generation
Unity Audio | Yes | No | No | No | No | No | No | No
Resonance Audio | Yes | Yes | Yes | Yes | Yes | Yes | Box | No
Steam Audio | Yes | Yes | Yes | Yes | Yes | Yes | Dynamic | Table
Project Acoustics | Yes | No | Yes | Yes | Yes | Yes | Dynamic | No
Wwise Spatial Audio | Yes | Yes | Yes | No | Yes | No | Dynamic | No
Oculus Audio | Yes | Yes | Yes | Yes | Beta | Beta | Dynamic in Beta, or Box | No
Two Big Ears | No | Yes | No | No | No | No | No | No

Figure 4.6 - Spatializer plugin comparison table

4.3 - Implementation

Steam Audio was employed in this project, due to its extreme simplicity of use, low computational overhead, and compatibility both with all available HMDs and with the rest of the software stack. Having already chosen SteamVR as the development target platform, and SteamVR Input for user input and object handling, it made sense to stick with a solution from the same company.

Other solutions would have been just as valid, and some (particularly Project Acoustics) would even have provided a better simulation of the room acoustics. However, the real-life acoustics of the room are not particularly noticeable: like any good office space, the lab features little reverberation. Being a single room, there is no real use for advanced handling of portaling or changing geometries. To a first approximation, this project takes place in an empty box, for which a simple simulation is entirely sufficient.

Setting up the project to use Steam Audio was straightforward. After importing the Steam Audio Unity package, clicking Window > Steam Audio adds a Steam Audio Manager to the scene.

The Steam Audio Manager GameObject is added automatically, and from its inspector it is possible to select the audio engine and the HRTFs to employ.


Figure 4.7 - Steam Audio Manager Settings

Then, Steam Audio Source components are added to every audio source, and Steam Audio Geometry and Steam Audio Material components to all GameObjects that contribute to the acoustics of the room, configuring them appropriately. No configuration is needed for Geometry components: they simply act as tags, letting Steam Audio know that the tagged GameObject needs to be taken into account. For each Steam Audio Material it is possible to choose from a list of pre-made materials, or to specify the properties of a custom one, as shown in Figure 4.8.


Figure 4.8 - Custom Steam Audio Material

For Steam Audio Sources it is possible to turn on Direct Sound Occlusion, Physics Based Attenuation and Air Absorption, as shown in Figure 4.9, and to customize directivity, as shown in Figure 4.10.

The resulting polar pattern will be shown both in 2D in the inspector, and in 3D in the Scene itself.


Figure 4.9 - Steam Audio Source settings


Figure 4.10 - Steam Audio Source directivity settings

5 - Experimental evaluation

In order to evaluate the effectiveness of the work done it was necessary to perform some user testing. For each material sound set provided by NewtonVR a directly recorded replacement was created, and users were asked to compare them. Carpet was excluded, as there is no carpet in the test space, and it is generally not particularly relevant to object manipulation tasks.

To recap what was written in the previous chapters, the system was assembled using SteamVR as an API and tracking system, SteamVR Input for user interaction, and Steam Audio for sound spatialisation. The sounds were all pre-recorded, and modulated by the NewtonVR Collision Sound framework based on the impact velocity, on the mass of the objects, and on the direction of the collision.

As the sounds provided by the NewtonVR Collision Sound framework did not sound distinct from each other, a new set of sounds was produced, and the two sets of sounds were compared experimentally.

5.1 - Recording the sounds

A Zoom H1 was used for recording the sound samples, as shown in Figure 5.1, while various household items were used to produce the sounds themselves. Specifically, a simple drinking glass was hit with a teaspoon to produce the glass samples, and a screw bottle cap was hit with a fingernail for the plastic sounds. For the wood sounds, the convex side of a teaspoon was hit against a wooden table covered with a tablecloth.


Figure 5.1 - Zoom H1

For the cardboard sounds a literal cardboard box was repeatedly dropped onto the same wooden table used in the previous step, making sure to use differently sized sides to obtain different sounds.

Finally the metal was sampled by lightly hitting the sides of a computer tower, letting the metal panels reverberate as long as possible.

The gain on the Zoom recorder was manually set to 30, and the files produced were stereo, sampled 48000 times per second at 24 bits per sample. As the files provided by NewtonVR were instead mono, sampled 44100 times per second at 16 bits per sample, the new recordings were downmixed and resampled using Audacity. The loudness of each sample was also normalized, so that the loudness of the sound in the Virtual Reality environment would depend on the impact caused by the user, and not on the strength of the impact used during recording.

The audio files were then added as new materials to the NewtonVR Collision Sound framework, as detailed in chapter 3.

While recording the sounds a lot of attention was paid to avoiding any background and handling noise. As the laboratory is in a very quiet building, this was not much of a problem.

Ideally these recordings would have been performed in an anechoic chamber, as recording them this way also captured the reverberation of the room. However, the room itself is quite dry, and any captured reverberation is in any case appropriate to the virtual environment, which simulates that same room. Furthermore, by keeping the microphones in close proximity to the sound source, the recordings are dominated by the direct sound.

Even more importantly, the samples had to be trimmed to remove any excess audio before the impact itself. Any amount of time between the start of the file and the impact would result in the perceived sound event being delayed: the feedback loop between visual and acoustic stimuli would be weakened, and the feeling of immersion would decrease.

As a comparison, one of the old and one of the new glass samples were analyzed. From the waveforms, shown in Figure 5.2, it is plain to see that the old sample has a slightly smaller gap between the beginning of the file and the beginning of the audio itself, a difference of around one hundred samples. The general shape is similar: an initial event is quite evident, followed by harmonics and reverberation.


Figure 5.2 - Old glass sample on top, new glass sample on the bottom

What really sets these two samples apart is not the waveform but the spectrum, shown in Figures 5.3 and 5.4. It is quite apparent that the new sample has both far more high-frequency content and far more harmonics. This should allow for a more realistic depiction of a glass object that is allowed to vibrate without damping.


Figure 5.3 - New glass sample with waveform and spectral frequency


Figure 5.4 - Old glass sample with waveform and spectral frequency

5.2 - Setting the scene

A Unity scene was set up with two mirrored tables, labeled A and B, upon which the same set of objects was arranged. On table A the objects were tagged with the default sounds of NewtonVR; on table B the NewtonVR collision system was also used, but with the set of self-recorded sounds. This scene can be seen from above in Figure 5.2. It should be noted that there is a roof, implemented as a Plane, which is a one-sided 3D primitive and as such cannot be seen from above.

Figure 5.2 - Test scene

In order to allow the users to visually distinguish between the materials, free textures were imported from the Unity Asset Store and applied to the objects, making sure each texture corresponded to the sound set applied to each object.

Four metal spheres were employed, shown in Figure 5.3, one with a diameter of 35cm and three with a diameter of 5cm. Four plastic cubes of different sizes were also employed, two red and two white, as shown in Figure 5.4, without a texture. There was also a wooden cube, rendered with a mossy planks texture, as shown in Figure 5.5. The cardboard box, shown in Figure 5.6, was imported as a ready-made asset.

Finally, a glass box was constructed out of semi-transparent parallelepipeds, without using a texture. In Figure 5.7 it is shown with the three smaller metal balls, a combination that allowed the users to employ the box as a sort of tennis racquet, producing interesting sounds.

Figure 5.3 - Metal sphere


Figure 5.4 - Plastic cubes


Figure 5.5 - Wooden cube


Figure 5.6 - Cardboard box

Figure 5.7 - Glass box with small metal spheres

5.3 - Performing user tests

University students are generally unwilling to come into the office block parts of the building, so a mobile virtual reality rig was set up in the main corridor.

A powerful laptop was connected to an Oculus Quest using the just-released Oculus Link beta, so that the Quest could be used as a SteamVR headset. This system was placed on a table, and an area of roughly 2m by 2m was cleared and designated for the Oculus Guardian system, which prevents users from walking too far or into obstacles.

Test subjects were recruited by stopping passersby and asking students sitting at the nearby tables. This resulted in twenty-one test subjects, which is not a lot, but more than in many published studies. As usual, the demographics were highly skewed: only two subjects were over the age of thirty, and only three were women.

The users were instructed on how to use the locomotion and object handling systems, and they were told to grab the objects on the two tables and compare the sounds they produced during collisions. They were specifically instructed to bang the objects together, to bang them against the tables, and to drop them on the floor, but they were not told anything about what sound qualities to look for. The subjects were also given closed-back headphones to insulate them from any external noise.

Many of the subjects were clearly first-time users of virtual reality: they could be seen attempting to grab objects with both hands and not using the full capabilities of the six-degrees-of-freedom tracking system, relying instead entirely on the software controls.

The average length of a test session was around five minutes, but the subjects were not given a time limit: they were simply left alone to experiment for as long as they desired. As the scene has no gamification elements, that usually did not last very long.

As soon as they wanted to stop, the headphones and headset were removed, and the controllers were set down. The subjects were then handed a phone showing a Google Form with a short questionnaire, which they immediately filled out.

Alternatively, they were provided with a link and a QR code to fill out the questionnaire on their own device, but nobody chose to do that.

5.4 - Questionnaire

To accommodate the demographics of the test subjects, the questionnaire was written in Italian. Four psychoacoustic parameters were chosen, along with a general preference rating. For each parameter the subjects were asked whether they preferred sound set A or sound set B, on a linear scale ranging from 1 to 7: 1 denoted full preference for set A, 7 full preference for set B, and 4 equivalence between the two sets.


Figure 5.8 - Sample of the linear scale used in the questionnaire

Here is a translation of the parameters and descriptions used in the questionnaire.

Accuracy of spatial localization is a measure of whether the sound caused by a collision was perceived in the same place where the collision was perceived visually.

Naturalness of the reproduced sound is a measure of whether the sound had audible artifacts or distortions.

Responsiveness to mechanical actions denotes whether the sounds were synchronous to the impacts causing them and whether the sounds changed with impact velocity.

Material characterization denotes whether the sound matched the visual material of the objects involved in the collision and whether it was possible to tell the materials apart just by sound.

General preference is simply asking the subjects which sound set they preferred.

In hindsight, it might have been better to show the questions to the users before they tried the Virtual Reality experience, so that they could specifically test for the needed parameters. However, this was avoided in order to minimize the possibility of influencing the subjects.

5.5 - Questionnaire results

An Analysis of Variance (ANOVA) test was performed on the data, checking whether individual users responded in a self-coherent manner. A simple chart of the variance, shown in Figure 5.9, made it apparent that five subjects gave the same answer to every question. Looking at the individual responses, it turns out that three of these users simply picked one extreme of the linear scale and applied it throughout the questionnaire.

Luckily the remaining flat responders mostly sat in the middle of the scale, which should only slightly affect the overall results.
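The flat-response check itself is trivial; a sketch of it, on illustrative data rather than on the actual questionnaire responses, is the following:

    using System;
    using System.Linq;

    // Sketch: flag questionnaire respondents whose answers have (near-)zero variance,
    // i.e. who gave the same rating to every question. The data here is illustrative.
    public static class FlatResponseCheck
    {
        public static void Main()
        {
            int[][] responses =
            {
                new[] { 4, 5, 6, 5, 5 }, // a plausible respondent
                new[] { 7, 7, 7, 7, 7 }, // a "flat" respondent who always answered 7
            };

            for (int i = 0; i < responses.Length; i++)
            {
                double mean = responses[i].Average();
                double variance = responses[i].Select(x => (x - mean) * (x - mean)).Average();
                Console.WriteLine($"Subject {i + 1}: variance = {variance:F2}" +
                                  (variance < 0.01 ? " (flat response, candidate outlier)" : ""));
            }
        }
    }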


Figure 5.9 - Variance chart

In Figure 5.10, the difference between the data with and without the outliers is shown. Here the verdict is the same either way: the accuracy of spatial localization is unaffected by the change in sound set. This is also the expected outcome, as the sound spatialization system is exactly the same.



Figure 5.10 - Accuracy of Spatial Localization over number of responses

Similarly, for Naturalness the outliers do not change the outcome: the two systems are largely equivalent, and the average only moves from 4.19 to 4.18 when the outliers are removed. Given the large variance, there is clearly no winner.


Figure 5.11 - Naturalness of the reproduced sound over number of responses

For Responsiveness there was a very slight preference for sound set B, as shown in Figure 5.12. This is probably partly caused by the correspondence between visual materials and sound materials. Perhaps more importantly, as there is more timbral variation in the sounds of set B, the difference between a soft collision and a strong one should be easier to hear.


Figure 5.12 - Responsiveness of the reproduced sound over number of responses

Material characterization showed a clear preference for set B, just as expected, with an average score of 5. It is interesting that some subjects seemed to prefer the lack of characterization.

Figure 5.13 - Material characterization

The overall preference went slightly in favour of set B, with an average of 4.47.

Figure 5.14 - Overall preference

Overall, it became clear that the questionnaire could have been improved. The first order of business would be to make sure that the test subjects had correctly understood how to use the system and what to look for. However, the purpose of the project was to simulate the normal use of a Virtual Reality system, and paying attention to sound quality is not something most actual users would do. On the other hand, regular users could be expected to be more familiar with the controls.

Another confounding factor could be that the system always plays the sounds of both materials entering a collision, which is not a realistic behavior. For example, hitting a glass with a metal spoon and hitting it with a piece of cardboard will not produce the same kind of response from the glass itself.

Conclusions

The aim of this project was to fit the existing Virtual Reality system with the generation of sounds caused by the manipulation of virtual objects. These sounds had to be easy to generate, as sound design is not part of the expertise of those normally using the system, and naturalistic, so that they would not distract the users from their tasks but provide useful auditory feedback on the positions and materials of the objects.

Several solutions were explored, and the final choice was heavily influenced by compliance with still-developing industry standards. During the course of the project a new headset, the Oculus Quest, was made available, and it was effortlessly integrated due to its standards compliance.

In particular, the solution chosen for generating sounds, the NewtonVR Collision Sound framework, allows for the later addition of further materials simply by adding appropriately named audio files to a folder and their names to the materials enum.

The validity of the resulting system was evaluated psychoacoustically on twenty-one test subjects. As is sadly common, the results were not very decisive, but they were at least universally positive.

It is easy, at this point, to think of possible ways to improve the system in the future. In order to improve the realism of the sound produced by collisions between objects of different materials, it would be possible to assign an impedance to each material. When materials with the same impedance collide, the resulting sound would simply be the sum of the two material sounds, just as in the current system. If objects of different impedances collided, the relative weights of the two component sounds would be determined by their impedances.

As an example, if glass (low impedance) collides with cardboard (high impedance), the resulting sound is mostly caused by the cardboard.
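A sketch of this idea is shown below. The impedance values, the weighting rule and the helper names are illustrative assumptions; in particular, the weights here are normalised to sum to one, and other normalisations (for example one that reproduces the current full-sum behaviour when the impedances are equal) are equally possible.

    using UnityEngine;

    // Sketch of the proposed impedance-based weighting between the two material
    // sounds of a collision. Impedance values and the weighting rule are
    // illustrative assumptions, not an existing feature of the system.
    public static class ImpedanceMixSketch
    {
        // Weight of material A's sound; the higher-impedance material dominates.
        public static float GetWeightForMaterialA(float impedanceA, float impedanceB)
        {
            return impedanceA / (impedanceA + impedanceB);
        }

        public static void PlayWeightedCollision(AudioClip clipA, float impedanceA,
                                                 AudioClip clipB, float impedanceB,
                                                 Vector3 position, float impactVolume)
        {
            float weightA = GetWeightForMaterialA(impedanceA, impedanceB);

            // E.g. glass (low impedance) against cardboard (high impedance) is
            // dominated by the cardboard sound, as described above.
            AudioSource.PlayClipAtPoint(clipA, position, impactVolume * weightA);
            AudioSource.PlayClipAtPoint(clipB, position, impactVolume * (1f - weightA));
        }
    }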

Another possible improvement would be to add support for the native Oculus interaction system, which would allow the project to be compiled for the Oculus Quest. On the Quest it would be possible to track the entire area of the lab with inside-out tracking, and therefore to move around the room without wires restricting the user motion in any way.

Finally, the system currently supports both OptiTrack and SteamVR, as well as tethered Oculus headsets through SteamVR, but this requires manually disabling the GameObject for the tracking system not in use. It should be possible to write a script that automatically detects which VR system is in use and dynamically picks the correct tracking system and assets.
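A sketch of such a script, using Unity's legacy XR settings to check whether a headset runtime is active and toggling the corresponding rigs, could look as follows; the GameObject references are placeholders to be assigned in the Inspector.

    using UnityEngine;
    using UnityEngine.XR;

    // Sketch: enable the SteamVR rig when a headset runtime is active, and fall
    // back to the OptiTrack-based rig otherwise. The rig references are placeholders.
    public class TrackingSystemSelector : MonoBehaviour
    {
        public GameObject steamVrRig;
        public GameObject optiTrackRig;

        private void Awake()
        {
            // XRSettings reports whether an XR device is loaded and which one it is.
            bool headsetPresent = XRSettings.isDeviceActive;
            Debug.Log("Loaded XR device: " + XRSettings.loadedDeviceName);

            steamVrRig.SetActive(headsetPresent);
            optiTrackRig.SetActive(!headsetPresent);
        }
    }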