Research | August 2016 Hearing Review

The hearing industry can apply some audio industry tests to hearing aids. Here’s how.

There is no doubt that hearing aid users will encounter more streamed sounds as the technology spreads in both devices and use cases, and streaming technologies will potentially reduce the sound quality experience. Until now, little interest has been shown in measuring the perceived sound quality of different hearing aid streaming technologies. However, as this article shows, there are multiple suitable methods available from the audio industry, enabling robust benchmark results for the hearing industry and optimal technology choices from the end-user perspective.

Wireless streaming has been an important feature in hearing aids for decades. From the well-known telecoil solution to more recent 2.4 GHz technologies, wireless streaming is an integral part of hearing aid usage, and provides support in everyday situations such as receiving phone calls, taking part in education, and listening to an audio stream from TV or mobile devices. Because wireless is now so central to hearing aid performance, it becomes relevant to explore to what extent various devices and technologies fulfill the sound quality expectations of the user. One wireless device will not necessarily exhibit the same perceptual characteristics as another wireless system, and associated sound quality differences should be expected.

Sending an audio stream wirelessly does not come without challenges, and various technologies balance important issues like battery usage, latency, drop-out compensation, and audio encoding in different ways, resulting in different sound quality characteristics. However, as with all consumer-focused audio products, at the end of the day, what matters is how the end-user perceives the sound quality.

This article outlines the concept of perceived sound quality in wireless streaming and suggests a number of methods for measuring this in a stable and robust manner.

Sound Quality

Streaming sound using wireless technologies may degrade the perceived sound quality of the signal. This is a result of compromises made within the technology regarding battery usage, bit rate, range of the device, connectivity speed, storage capacity, etc. To put the expected sound quality into perspective, streaming in the 2.4 GHz band to hearing aids happens at a bit rate around 64 kilobits per second (kbit/s), whereas full compact disc (CD) quality requires around 1,411 kbit/s for a stereo signal. This drastic reduction is achieved using audio coding technologies, with a lossy (highly compressed) audio signal as a consequence.
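As a quick check on these numbers, the CD figure follows directly from the pulse-code modulation (PCM) parameters. The short sketch below (Python; it assumes the standard CD parameters of 44.1 kHz sampling, 16 bits per sample, and 2 channels) reproduces both bit rates and the implied compression ratio.

```python
# Back-of-the-envelope comparison: CD-quality PCM vs 2.4 GHz hearing aid streaming.
# Assumed CD parameters: 44.1 kHz sampling, 16 bits per sample, 2 channels.
sample_rate_hz = 44_100
bits_per_sample = 16
channels = 2

cd_bitrate_kbps = sample_rate_hz * bits_per_sample * channels / 1000  # 1411.2 kbit/s
streaming_bitrate_kbps = 64  # typical 2.4 GHz hearing aid stream, per the text

print(f"CD quality: {cd_bitrate_kbps:.1f} kbit/s")    # 1411.2 kbit/s
print(f"Streaming:  {streaming_bitrate_kbps} kbit/s")
print(f"Reduction:  ~{cd_bitrate_kbps / streaming_bitrate_kbps:.0f}:1")  # ~22:1
```

In other words, the audio codec must discard roughly 95% of the original data, which is why lossy coding artifacts are to be expected.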

Regardless of the specific technical implementation, the degradation in sound quality may result in limited bandwidth, distortion artifacts, and drop-outs in the signal,1 all of which are important aspects of the overall perceived sound quality. This means that the hearing aid user experiences a reduced sound quality while streaming, even though the technology, in an ideal setting, is intended to expand hearing aid use cases, increase overall satisfaction, and integrate seamlessly with external devices.

Table 1. Overview of potential artifacts in hearing aid streaming technologies (based on a preliminary technology review of available devices).1,9,16

In a preliminary technology review, based on devices available on the market, we found obvious challenges with streaming sound quality. Table 1 provides an overview of the artifacts and potential sound quality problems encountered by the user in current (2.4 GHz) streaming solutions. This list can be used as a dialogue tool when consulting users facing sound quality issues, or as an introduction to “what to listen for” when evaluating the devices at hand.

When looking at Table 1, it should be noted that not all users will be equally sensitive to the artifacts introduced in the audio coding process.2 This means that we cannot assume that one listener will notice the same artifacts as another, which makes it difficult to make generalizable statements about sound quality without applying further measurements.

Listen! Measuring Sound Quality

From a more general perspective, sound quality measurements have long been an integral part of both the hearing aid and audio industries. Some methods are based on physical measurements and perceptual models, others on structured listening tests. This article focuses on measurement methods based on structured listening tests. Even though objective models like Perceptual Evaluation of Audio Quality (PEAQ, ITU-R BS.1387)3 could potentially be of interest, they have not yet been proven efficient for this specific application.

Measuring sound quality with structured listening tests has been approached in many ways, using a large selection of varying methodologies. The literature on sound quality measurement is typically divided into two areas of interest2,4-8:

1) Descriptive measurements. Measuring relevant perceptual characteristics or specific qualities of a signal (perceived distortion, speech intelligibility, timbre, loudness, clarity, etc), as seen in Table 1, and

2) Affective measurements. Measuring to what degree a signal is liked/preferred, or meets the listener's overall quality expectations. These typically involve applying liking scales ranging from “dislike extremely” to “like extremely,” or overall quality scales ranging from “bad” to “excellent.”

Not all applied methods or experimental setups make a distinct separation between descriptive and affective measurements. Frequently, data from both perspectives are collected in more ambitious sound quality studies. As with any type of measurement, it of course makes sense to clarify to what extent you are measuring one thing or the other; are you trying to describe what the listeners are perceiving, or are you trying to find out what they prefer/like the most?

We now turn our attention to a more thorough introduction to the concepts of descriptive and affective measurements. In addition, a selection of potentially applicable affective measurement methods for wireless streaming technologies is described.

Descriptive Measurements

Especially for R&D purposes, it may be necessary to apply descriptive measures of sound quality. This could be motivated by questions like: Do the streaming technologies under test have different timbre characteristics? What are the dominating perceptual characteristics that can be used in describing differences between the technologies? The exact implementation of descriptive measurements varies widely, but it generally involves qualifying a selection of attributes for the technology under test, and scaling these attributes according to the presented stimuli.

In the case of streaming, it is of interest to examine the extent to which various types of noticeable artifacts (eg, those listed in Table 1) arise from the streaming technology. A descriptive measure in this case would be a scaling by the listener of how much of each individual artifact is present in the technology under test (see Marins et al9).

Even though data from descriptive measurements may have inherent good/bad connotations (who doesn’t think that a low amount of distortion is desirable?), it is important to refrain from concluding anything about the perceived sound quality if this data is the only source of information. This especially applies when the sound reproduction is multidimensional, meaning that multiple perceptual characteristics contribute to the overall impression. In this case, complicated tradeoffs are to be weighed by the consumer in deciding what is preferred (eg, level of distortion vs bandwidth limitation vs timbre differences). This means that, when measuring which technology is superior from a sound quality perspective, it is not necessarily a requirement to collect descriptive measurements, as these are mainly of interest when trying to understand how the various characteristics contribute to the overall acceptance/liking of a product.

Thus, we will not go into further detail on recommended methods for descriptive measurements. The interested reader can refer to Bech and Zacharov2 or Lorho8 for a more extensive review.

Affective Measurements

Measuring affective response in sound quality is typically an attempt at answering the simple, but probably most important, question in product development: Which technology is preferred by a specific user group for a specific use situation?

In affective measurements of sound quality, listeners are normally asked to integrate their impression and give one rating based on their overall evaluation. This assumes that the individual listeners are able to take their experience of the presented sound stimuli, previous experiences, expectations, and context into account, and form one impression that allows for a response.2 This is commonly seen in the telecommunications and broadcast/codec industry where standard test methodologies like the ITU-T P.800,10 ITU-R BS.1534-3,11 and ITU-R BS.1116-312 are applied. Unfortunately, these test methods have inherently confusing names that refer to the specific documents/recommendations written by the standardization organizations (similar to the ANSI standards with which dispensing professionals are more familiar).

Sound Quality, the ITU-R BS.1534-3, and the MUSHRA Test

One of the more frequently encountered methods used for measuring sound quality is the ITU-R BS.1534-3.11 This is a recommendation from the International Telecommunication Union (Radiocommunication Sector) intended for the subjective assessment of the intermediate quality level of coding systems. “Intermediate quality level” is not defined strictly in the recommendation, but references are made to distribution of audio over the Internet or digital radio—technologies limited by bit rate, similar to the general challenges for wireless streaming in hearing aids.

Figure 1. Screenshot from a commercially available implementation of the MUSHRA test.17 On the left-hand side is the shown reference, which is the non-coded version of the sample. Hidden under the rating scale is a sound sample identical to the reference, as well as two anchor systems (3.5 and 7 kHz low-pass filtered sound files). Additionally, the systems under test are presented and rated. With multiple systems presented on one screen, the user can directly compare the quality and provide ratings relative to the reference and anchor systems.

The recommendation is commonly referred to as the MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) due to its test paradigm. Over the last two decades, the method has successfully been used in assessing sound quality in standardization organizations, academia, and industry.13 In a MUSHRA test, a selection of systems under test (typically audio coding technologies) is rated against a shown reference and hidden low-pass filtered anchor systems. The typical test question/parameter applied is “Basic Audio Quality,” which addresses any differences from the shown reference; the rating scale is a 100-point scale with associated verbal labels ranging from “Bad” to “Excellent.” Figure 1 shows a screenshot from software used for data collection with the “Basic Audio Quality” test question.

According to the recommendation, the MUSHRA test should be conducted using experienced listeners. This is to ensure good data and critical evaluation of the systems under test. In fact, the higher the quality of the systems under test, the more this experience requirement is emphasized.11 This is further stressed in the recommended post-screening of listener data: only listeners who give consistent data and discriminate among the systems under test are included in the final data analysis.14
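To illustrate what such post-screening can look like in practice, here is a minimal sketch in Python. The rule shown, excluding a listener who rates the hidden reference below 90 on more than 15% of trials, follows the hidden-reference criterion commonly cited from BS.1534-3; the function name and data layout are our own, and the actual thresholds should always be taken from the recommendation itself.

```python
import numpy as np

def keep_listener(hidden_ref_scores, threshold=90, max_fail_fraction=0.15):
    """Hidden-reference post-screening in the spirit of ITU-R BS.1534-3:
    a listener is excluded if they rate the hidden reference below
    `threshold` (on the 100-point scale) on more than `max_fail_fraction`
    of the test items."""
    scores = np.asarray(hidden_ref_scores, dtype=float)
    fail_fraction = float(np.mean(scores < threshold))
    return fail_fraction <= max_fail_fraction  # True = retain in the analysis

# Two hypothetical listeners' hidden-reference ratings across 10 trials:
print(keep_listener([100, 95, 98, 100, 92, 88, 100, 97, 100, 99]))  # True (1/10 below 90)
print(keep_listener([80, 85, 100, 70, 95, 60, 100, 90, 75, 88]))    # False (6/10 below 90)
```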

The MUSHRA test applies a shown reference as an example of the highest quality achievable in the test. The reference is typically the uncompressed .wav (PCM) version of the source material, and it is important that none of the technologies under test could plausibly be rated “better” than the reference, as this may lead to wrong conclusions about the sound quality of the technologies under test.

If the intended purpose of testing is to explore how a system is perceived by a relevant user group, and not a selection of highly critical listeners, it can be argued that the method may be extended to a more representative group of listeners. An example of this adjusted application is found in an evaluation of music sound quality in cochlear implants15 where an adapted method named CI_MUSHRA is used. In this adaptation of the MUSHRA test paradigm, the test logic is preserved, but domain-relevant (CI) listeners and modified anchor systems are used, making it more appropriate in evaluating the specific technology domain.

The MUSHRA is a well-documented method for measuring the quality of codec and streaming technologies. We find it highly relevant for assessing the quality of streaming technologies in the hearing aid domain if the expected quality falls within the appropriate range (intermediate quality, not close to transparent technologies).

The Future and ITU-R BS.1116-3

Figure 2. Screenshot from a commercially available implementation of the ITU-R BS.1116-3 test.17 One of the samples with its associated scale is a hidden reference. The method should only be applied when testing very high quality technologies.

The ITU-R BS.1116-3 is to be applied when evaluating small-impairment technologies. “Small impairments” in this context means that the technologies under test introduce such a small degradation of the signal that the effects cannot be detected without rigorous control of the experimental conditions, including the selection of critical listeners.

The ITU-R BS.1116-3 applies a double-blind, triple-stimulus hidden reference paradigm. This means that a sample encoded with the technology under test is presented alongside a shown reference and a hidden reference. The listener’s task is to detect and rate the technology under test on a scale ranging from “Not Noticeable” to “Very Annoying” (see Figure 2). When interpreting results from the ITU-R BS.1116-3, one assumes that the lack of artifacts, or alternatively the shortest perceived distance from a reference, is a measure of quality. The goal is that the technologies under test should be completely transparent, such that no degradation is detectable by the listeners.

The included hidden reference allows for a thorough post-hoc screening of listener expertise. If a listener rates the hidden reference as degraded, it indicates that the listener is guessing, which introduces false-positive errors for technologies scoring at the upper part of the scale.
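In the analysis, this logic is often expressed as the Subjective Difference Grade (SDG): the grade given to the coded item minus the grade given to the hidden reference in the same trial, where values near 0 suggest transparency and values toward -4.0 indicate increasingly audible degradation. Below is a minimal sketch in Python; the function names and the illustrative reliability tolerance are our own, not part of the recommendation.

```python
def subjective_difference_grade(grade_coded: float, grade_hidden_ref: float) -> float:
    """SDG = grade for the coded item minus grade for the hidden reference,
    both on the continuous 5.0-1.0 impairment scale."""
    return grade_coded - grade_hidden_ref

def trial_is_reliable(grade_hidden_ref: float, tolerance: float = 0.5) -> bool:
    """Illustrative screening check (the tolerance value is our own): a listener
    who grades the hidden reference itself as impaired has likely failed to
    identify it, inflating the false-positive risk discussed above."""
    return grade_hidden_ref >= 5.0 - tolerance

# Example trial: coded signal rated 3.8, hidden reference rated 5.0
print(subjective_difference_grade(3.8, 5.0))  # -1.2: clearly audible impairment
print(trial_is_reliable(5.0))                 # True: hidden reference identified
```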

Due to the critical nature of the test and its strict post-hoc screening, it is recommended to use expert listeners within the tested technology domain. This could potentially pose a problem for the hearing aid domain, since training listeners for a specific domain and method is a time- and resource-intensive task.

Applying the ITU-R BS.1116-3 test methodology to evaluating streaming technologies in hearing aids might become relevant as the technology evolves. However, looking at the current band-limited streaming implementations in hearing aids, it is not yet relevant.

Discussion

In this article, the concept of sound quality was placed in the context of wireless streaming technologies in hearing aids, and a number of potential methods were described for measuring perceived sound quality.

Based on the artifacts listed in Table 1, it is expected that the end user will experience artifacts that can influence the overall perceived sound quality of the streamed sounds. Currently, no benchmark results exist for the perceived sound quality of 2.4 GHz streaming to hearing aids. In order to properly test and document these technologies, we propose applying structured listening tests as used in the general audio industry.

As discussed, it is important to distinguish the purpose of a sound quality study: either measuring descriptive characteristics or the affective response. Both perspectives may be included in a study, but only measurements from the affective domain allow for a conclusion on liking/overall user impression of the tested technology. Having an interest in both aspects of perceived sound quality will often lead to large and relatively complicated experiments, and is typically only of relevance when exploring the relationship between the descriptive characteristics and the overall preference for the signal.

When selecting an appropriate methodology for measuring the sound quality of wireless devices, one of the main points of interest is whether a relevant reference is available. If a reference can be considered available for the study, the MUSHRA and ITU-R BS.1116 methodologies are well-proven and tested, with available test software and statistics that allow for easy benchmarking. The MUSHRA methodology is relevant for testing low to intermediate quality technologies, and the ITU-R BS.1116 is relevant for testing close-to-transparent technologies. If no reference can be considered available, a modified MUSHRA test with no shown reference could be considered appropriate. This approach will maintain some of the benefits of the MUSHRA, such as direct comparison between the technologies under test, as well as the statistics that allow for easy communication of obtained results.
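This selection logic can be condensed into a small helper (a sketch only; both inputs are expert judgments about the study at hand, not quantities a program can measure):

```python
def choose_method(reference_available: bool, near_transparent: bool) -> str:
    """Condenses the guidance above into a single decision. The inputs are
    expert judgments about the technologies under test, not measurements."""
    if not reference_available:
        return "Modified MUSHRA (no shown reference; direct comparison retained)"
    if near_transparent:
        return "ITU-R BS.1116-3 (small impairments; expert listeners required)"
    return "MUSHRA / ITU-R BS.1534-3 (low to intermediate quality)"

print(choose_method(reference_available=True, near_transparent=False))
# -> MUSHRA / ITU-R BS.1534-3 (low to intermediate quality)
```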

There is no doubt that hearing aid users will encounter more streamed sounds as the technology spreads in both devices and use cases, and streaming technologies will potentially reduce the sound quality experience. Up until now, little interest has been shown in measuring the perceived sound quality of different hearing aid streaming technologies. However, as we have shown, there are multiple suitable methods available, enabling robust benchmark results for the industry and optimal technology choices from the end-user perspective.

References

  1. Hoel R, Motos T. Challenges in 2.4 GHz wireless audio streaming. Proceedings of the 131st Audio Engineering Society Convention. New York; October 2011. Available at: https://secure.aes.org/forum/pubs/conventions/?elib=16075

  2. Bech S, Zacharov N. Perceptual Audio Evaluation–Theory, Method and Application. Indianapolis: John Wiley & Sons, Ltd; 2006.

  3. International Telecommunications Union (ITU) Radiocommunication Assembly. ITU-R BS.1387-1 (2001): Method for objective measurements of perceived audio quality. Available at: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1387-1-200111-I!!PDF-E.pdf

  4. Rothauser E, Urbanke G, Pachl W. A comparison of preference measurement methods. J Acoust Soc Am. 1970;49(4):1297-1308.

  5. Gabrielsson A, Hagerman B, Berg C, Ovegård A, Änggård L. Clinical assessment of perceived sound quality in hearing aids. Technical Audiology Report TA 98. Stockholm, Sweden: Karolinska Institutet; 1980.

  6. Letowski T. Sound quality assessment: Concepts and criteria. Proceedings of the 87th Audio Engineering Society Convention. New York City; October 1989. Available at: https://secure.aes.org/forum/pubs/conventions/?elib=5869

  7. Blauert J. Concepts behind sound quality: Some basic considerations. Internoise. 2003;9:72-79.

  8. Lorho G. Perceived Quality Evaluation. An Application to Sound Reproduction Over Headphones. PhD thesis, Helsinki, Finland: Helsinki University of Technology, June 2010.

  9. Marins P, Rumsey F, Zielinski SK. The relationship between selected artifacts and basic audio quality in perceptual audio codecs. Proceedings of the 120th Audio Engineering Society Convention. Paris; May 2006. Available at: http://epubs.surrey.ac.uk/544

  10. International Telecommunication Union (ITU) Telecommunication Standardization Sector. ITU-T P.800: Methods for subjective determination of transmission quality. August 1996. Available at: https://www.itu.int/rec/T-REC-P.800-199608-I/en

  11. ITU Radiocommunication Assembly. ITU-R BS.1534-3: Method for the subjective assessment of intermediate quality level of audio systems. October 2015. Available at: https://www.itu.int/rec/R-REC-BS.1534/en

  12. ITU Radiocommunication Assembly. ITU-R BS.1116-3: Methods for the subjective assessment of small impairments in audio systems. February 2015. Available at: http://www.itu.int/rec/R-REC-BS.1116

  13. Stoll G, Kozamernik F. EBU listening tests on Internet audio codecs. EBU Technical Review. June 2000:24.

  14. ITU Radiocommunication Assembly. ITU-R BS.2300-0: Methods for assessor screening. 2014. Available at: https://www.itu.int/dms_pub/itu-r/opb/rep/R-REP-BS.2300-2014-PDF-E.pdf

  15. Roy AT, Jiradejvong P, Carver C, Limb CJ. Musical sound quality impairments in cochlear implant (CI) users as a function of limited high-frequency perception. Trends in Amplif. 2012;16(4):191-200.

  16. Liu C-M, Hsu H-W, Lee W-C. Compression Artifacts in Perceptual Audio Coding. IEEE Transactions on Audio, Speech, and Language Processing. 2008;16(4):681-695.

  17. DELTA. MUSHRA test. SenseLabOnline, 2014. Available at: http://www.senselabonline.com

Jesper Ramsgaard

Jesper Ramsgaard is an Audiological Affairs Specialist at Widex A/S in Lynge, Denmark.

Correspondence can be addressed to Jesper Ramsgaard at: [email protected]

Original citation for this article: Ramsgaard J. Sound Quality in Hearing Aid Wireless Streaming Technologies. Hearing Review. 2016;23(8):24.