Audio compression in RSI demystified by audio engineer Richard Schiller

By Dora Murgu on August 16, 2022

Topics: Remote Simultaneous Interpretation

This article is also available as an episode of our Interprefied podcast, which you can find in your preferred podcast directory.

Sound quality is something that accompanies us all day. From putting on the radio in the morning to binge-watching that new TV series late at night, good audio is something we often take for granted. Whilst it's usually quite easy to spot bad audio, good audio quality is in fact quite a complex matter. A good example is the decades-old debate between vinyl lovers and other audiophiles over which format provides superior sound quality: CD or vinyl.

In remote interpreting, being able to both receive and send quality audio is key: it ensures that information is accurately processed and secures an enjoyable audio experience while protecting people's hearing.

As an audio-first platform, we're constantly working on new ways to influence speaker behavior and implementing innovative audio solutions that secure superior audio quality. An often-discussed topic, sound compression can have a real positive impact on the sound experience, if applied correctly.

We sat down with Richard Schiller, Audio Engineer and Senior Product Manager at Interprefy, to understand what sound compression is, how it is used in RSI, and what influences sound quality.

Hello Richard, tell us a bit about your background and what you do.

Hi Dora, great to speak with you again. My role at Interprefy is Senior Product Manager. I look after the direction and detail of the product. I also happen to be a trained sound engineer. I originally worked in the world’s largest broadcast speech radio organisation, the BBC World Service. That background makes you obsess about clarity and consistency.

Consistency was the key to making radio work on a large scale, and clarity was the very essence of what we delivered. I have also worked in music recording and television. I have done most jobs in that profession including being a producer, a director, a presenter, and a scriptwriter.

Then you’re the right person to answer the million-dollar question: what is compression?

There are two different and unrelated things that get called compression in sound. Originally there was dynamic compression, which is a circuit, or nowadays an algorithm, that controls sound level automatically. This is primarily used to reduce dynamic range, the span between the softest and loudest sounds. Then came bit-rate reduction, a system for reducing the amount of audio data that must be stored or transported.
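
To give a feel for what bit-rate reduction saves, here is a minimal arithmetic sketch. The sample rate, bit depth and codec target are illustrative assumptions chosen for round numbers; they are not figures from the interview, from Interprefy, or from any particular codec.

```python
# Illustrative arithmetic only: every figure below is an assumption.

def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bit_depth * channels

uncompressed_bps = pcm_bit_rate(48_000, 16, 1)  # assumed: 48 kHz, 16-bit, mono
codec_target_bps = 64_000                        # assumed, hypothetical codec target
reduction_factor = uncompressed_bps / codec_target_bps

print(f"Uncompressed: {uncompressed_bps / 1000:.0f} kbit/s")
print(f"Codec target: {codec_target_bps / 1000:.0f} kbit/s")
print(f"Reduction:    {reduction_factor:.0f}:1")
```

Note that nothing in this calculation touches the loudness of the audio, which is why bit-rate reduction and dynamic compression are unrelated.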

So, are they good or bad?

Neither. Like almost everything, these two techniques can be used well or badly. Used badly, no, they are not good, but there is nothing about either form of compression that makes it inherently bad.

Dynamic compression is essentially like having a device that watches the sound level and turns down the volume knob when the audio gets too loud. It then turns it up again as the audio gets quieter. It helps people hear both loud and quiet passages equally well. It is essentially no different from a human turning down a volume control, and I emphasize that dynamic compression is about reducing volume, hence the name.

So, where does the concern about compression come from?

Dynamic compression makes sound quieter and that is often undesirable, so it is followed by a pre-set volume control to make it louder again. Because compression equalizes the level of the signal, you can go one of two ways. It can be set to be quieter but easier to hear, or louder and more attention-grabbing. If I can just step out of the discussion here for a moment and make an important point: if you think the sound is too loud, turn it down. Always take control of your own listening level.
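
The behaviour described in the last two answers can be sketched as a static compressor curve followed by a fixed make-up gain. This is a minimal sketch under stated assumptions: the threshold, ratio, make-up gain and signal levels are invented for illustration, and a real compressor also has time constants, which are ignored here.

```python
def compress_db(level_db: float, threshold_db: float = -20.0,
                ratio: float = 3.0, makeup_db: float = 0.0) -> float:
    """Static compressor curve: output level in dB for a given input level.

    All default settings are illustrative assumptions.
    """
    if level_db > threshold_db:
        # Above the threshold, every `ratio` dB of input yields 1 dB of output,
        # so the compressor itself only ever turns the level down.
        level_db = threshold_db + (level_db - threshold_db) / ratio
    # The pre-set make-up gain raises everything by the same fixed amount.
    return level_db + makeup_db

quiet_db, loud_db = -30.0, -5.0  # assumed levels of a quiet and a loud passage

range_in = loud_db - quiet_db
range_out = compress_db(loud_db) - compress_db(quiet_db)
range_with_makeup = (compress_db(loud_db, makeup_db=10.0)
                     - compress_db(quiet_db, makeup_db=10.0))

print(f"Dynamic range in:           {range_in:.1f} dB")
print(f"Dynamic range out:          {range_out:.1f} dB")
print(f"Dynamic range with make-up: {range_with_makeup:.1f} dB")
```

The make-up gain does not change the reduced dynamic range; it only decides whether the result ends up quieter or louder overall, which is the choice between "easier to hear" and "more attention-grabbing".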

It’s not just the level that can be wrong; what are called the time constants of a compressor matter too. Finally, there is the ratio. This is often set too aggressively, and that’s the most common cause of compression making speech incomprehensible.

One of the most annoying applications is poorly designed Automatic Gain Control (AGC) circuits, both in old consumer equipment and in algorithms used by some PCs. AGCs and noise gates are often switched on by default in laptops and other devices. So, dynamics are ever-present in our lives. Badly set compression can clip the plosive and sibilant sounds, making speech hard to comprehend. You can hear this as a dull quality to the hard consonants at the start of words, particularly for the first word of a sentence. Another sign of a poorly set AGC is when someone says a loud word followed by a quiet word and you hear the end of the quiet word but struggle to hear its start.
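
The effect of time constants, and the loud-word-then-quiet-word symptom in particular, can be sketched with a simple gain smoother. This is a simplified model under stated assumptions: it works on a level envelope in dB rather than on audio samples, smooths the gain in the dB domain (one of several possible designs), and uses invented settings and levels. It does not represent any specific AGC.

```python
import math

def smoothed_gain_db(levels_db, rate_hz, attack_s, release_s,
                     threshold_db=-20.0, ratio=4.0):
    """Gain in dB (always <= 0) applied at each point of a level envelope.

    Assumed, illustrative design: one-pole smoothing of the gain in dB.
    """
    attack = math.exp(-1.0 / (attack_s * rate_hz))
    release = math.exp(-1.0 / (release_s * rate_hz))
    gain_db = 0.0
    gains = []
    for level in levels_db:
        over = max(level - threshold_db, 0.0)
        target = -(over - over / ratio)  # static gain reduction wanted now
        # Use the attack constant when turning down, the release when recovering.
        coeff = attack if target < gain_db else release
        gain_db = coeff * gain_db + (1.0 - coeff) * target
        gains.append(gain_db)
    return gains

rate = 1000  # envelope points per second (assumed)
# A 0.3 s loud word at -5 dB followed by a 0.3 s quiet word at -30 dB (assumed).
envelope = [-5.0] * 300 + [-30.0] * 300

fast = smoothed_gain_db(envelope, rate, attack_s=0.005, release_s=0.05)
slow = smoothed_gain_db(envelope, rate, attack_s=0.005, release_s=1.0)

early = 350  # 50 ms into the quiet word
print(f"Gain 50 ms into quiet word, 50 ms release: {fast[early]:.1f} dB")
print(f"Gain 50 ms into quiet word, 1 s release:   {slow[early]:.1f} dB")
```

With the long release, the gain is still held well down when the quiet word begins, so its start is attenuated even though it never exceeded the threshold.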

Let’s turn to RSI. How does sound quality in music differ from sound quality in speech?

There’s a lot that’s common, but in each case you must be careful to understand what good is. People take numbers from classical music recordings made in acoustically treated studios and apply them to speech. In some ways speech is easier than an orchestra, and in some ways it’s harder.

The upper reaches of the bandwidth, for example, are not as important for speech as for some instruments. There is a good argument to say that with some percussion, bandwidth is king, while for speech, smoothness should reign. It’s why a recording engineer will use a different microphone for a person than they would for a snare drum or cymbal.

I know that some people will be shouting back at me that the frequencies between 18kHz and 20kHz are vital for speech, but they simply are not. In general, the very best and most expensive microphones recording engineers use for speech aren’t any good at those frequencies because they just don’t need to be.

And this is not just accidental. Say you were in a forest, listening to a person a few metres away with their mouth directly facing your ear (and you were young enough that you could still hear at 20kHz). If you then turned your face so that you could see the speaker and they turned sideways, you would no longer hear the 20kHz component, or at least it would be much reduced. These very high frequencies are not preserved well in the natural world and so are not important to us, because life would be impossible if they were.

So, for the purpose of simultaneous interpretation, is it not essential to have access to frequencies up to 15,000 Hz?

The challenge here is that I can sound like I am saying second best is good enough, but the truth of all this is that achieving clarity is more nuanced than people like to represent it. Like-for-like, a bandwidth of 15kHz is better than 10kHz for speech, which is better than 6kHz and so on.

However, a flatter (smoother) response through to 10kHz can be better for comprehension than a lumpy response to 15kHz. Similarly, speech that has not been badly dynamically compressed with a 6kHz bandwidth can be easier to comprehend than 15kHz of bandwidth with terrible compression.

What all this means is that preserving the frequency response is important, of course, but so are other factors, and none will make things perfect on their own. The particular issue with response is that as you go up the scale, returns diminish significantly. So, our tendency to obsess over the higher registers speaks to it being something we understand and can easily describe, rather than reflecting its real position in the value chain.

A bandwidth of 15kHz or more needs to be part of a whole program of good performance, but in a literal sense it is neither essential to good, easy comprehension nor does it guarantee it.

There are claims that RSI platforms apply a dynamic range compression which leads to bad sound. Is this true for Interprefy?

No. There is no need for dynamic range compression in general operation. That’s not to say we don’t ever use it. We have something in the lab at the moment which applies compression that’s really exciting. It’s designed for listeners, be they audience, delegates, or interpreters. It can be turned on by each person if they wish or left off if they don’t. 

Excellence comes from applying technology in the right place and in the right way. It’s about tuning, seeking perfection at each step, and applying small incremental changes across the whole system.

Let’s talk about the delegates for a moment, because we all have had that experience where a speaker sounds just awful.

Yes. Absolutely Dora. And I am really passionate about eliminating that. The really big issues are the very poor equipment used by a lot of speakers and their lack of understanding around what they need to do to ensure sound quality.

How do we solve that?

Like just about everything, the solution is in tackling lots of different factors. We need speakers to be using better microphones, we need them to be more knowledgeable about microphone techniques and for them to pay more attention to background noise and echo. There's a lot of education to do here, something we also started with our speaker housekeeping video campaign.

We can also use technology to assist here. In the future, you and I can come back to this topic and talk about how technology can assist people to improve their own quality and compensate for the problems when they can't.

So, if we were to compare the sound received via hardware, such as a hard console, and that received via Interprefy, there wouldn’t be much difference as long as the speaker uses appropriate equipment?

Yes, that’s right Dora. The big difference here is not between local and remote working, it’s between better equipment well configured and poor equipment, badly configured. There is no inherent difference to a hardware-based local system in terms of audio quality. Many meeting and event participants using RSI systems have microphones better than their equivalents on site. Some wish to take part using devices that are worse. Just like everything else in business, it needs to be appropriately managed.

So, what is the difference between RSI and a hardware-based solution?

What RSI delivers is choice. Choice through flexibility. When my wife first became pregnant, her employer, a man, simply told her that she no longer had a job. Thankfully that’s illegal now. I like to think that RSI means that those interpreters who don’t want to or can’t travel can work more flexibly. I didn’t like the poor attitude my wife suffered, and just as I think employers should do everything they can to allow people to work, no matter their condition or lifestyle needs, I think it is incumbent on us, the system suppliers, to build in that flexibility too.

RSI solutions are flexible for organisations too. You can hold a conference or meeting anywhere and set up or change the configuration instantly. We recently helped an astronaut talk to the world while on the International Space Station. Insisting that a spaceman attend in person would of course have been ridiculous.

Returning to compression, what would you say to those that are asking to eliminate compression altogether?

Getting rid of compression, of either form of compression, is not a magic bullet. Can I emphasize again here: there is no magic bullet. Part of the holistic solution is to eliminate the bad use of compression, both bad dynamic compression and poor bit-rate compression. That means having engineers working in the industry who understand the technology and understand it in detail.

What about using more than one compression function, one after the other? Is that automatically bad?

This is known as cascading compression. No, it's not automatically bad either for dynamic or bit-rate compression.

There are specific problems with cascading compression and when you engineer solutions, you have to work hard. It's very reasonable to be worried about cascaded compression because it takes a lot of effort to make it work, but if you're competent, it can be done. And done really well. Taking dynamic compression, for example, two of the greatest audio innovations ever came from using cascaded dynamic compression.
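
One reason cascading needs care can be shown with static curves alone: ratios compound. The following is a minimal sketch, assuming two dynamic compressors that share the same threshold and ignoring time constants entirely. The threshold, ratios and input level are invented for illustration, and this is not a description of the innovations mentioned above.

```python
def compress_db(level_db: float, threshold_db: float, ratio: float) -> float:
    """Static compressor curve: output level in dB for a given input level."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

threshold_db = -20.0  # assumed shared threshold
level_in_db = -4.0    # assumed input level, 16 dB above the threshold

stage1 = compress_db(level_in_db, threshold_db, ratio=2.0)  # first 2:1 stage
stage2 = compress_db(stage1, threshold_db, ratio=2.0)       # second 2:1 stage
single = compress_db(level_in_db, threshold_db, ratio=4.0)  # one 4:1 stage

print(f"After stage 1:         {stage1:.1f} dB")
print(f"After stage 2:         {stage2:.1f} dB")
print(f"Single 4:1 compressor: {single:.1f} dB")
```

Under these assumptions, two gentle 2:1 stages behave like one 4:1 stage above the threshold, so each stage has to be set with the whole chain in mind.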

Some people seem to be particularly good at assessing factors like compression. Should you be using them to help you?

There is only one way of assessing audio and that is what we call blind testing, ideally double-blind testing. If anyone tells you that they are particularly good at hearing audio issues, ask whether that was established in blind testing, that is, testing in a program where they do not know which is which and which is led by someone unconnected with the assessment. All testing should use a range of listeners too.
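
A blind test only tells you something if the score is unlikely to come from guessing. Here is a minimal sketch of that check, assuming a hypothetical test in which each trial is a choice between two options, so a pure guess is right half the time. The trial count and score are invented for illustration.

```python
from math import comb

def chance_probability(correct: int, trials: int) -> float:
    """Probability of scoring at least `correct` out of `trials` by guessing.

    Assumes two equally likely options per trial (an illustrative assumption).
    """
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# Hypothetical result: a listener identifies 12 of 16 samples correctly.
p = chance_probability(12, 16)
print(f"Chance of 12 or more correct out of 16 by guessing: {p:.3f}")
```

The smaller this probability, the less plausible it is that the listener was guessing; repeating the test with a range of listeners, as suggested above, guards against drawing conclusions from one lucky run.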

Many people, probably most, think they have exceptional hearing, but only about one in twenty does. It's like we all think we are great drivers.

Some people seem to be very opinionated about sound quality and how to achieve it. What is your answer to them?

People who speak in binary terms, who talk in ‘musts’ and ‘must nots’, are, experience has shown me, mistaken. I don’t like seeing compression or any other audio tool getting an undeserved bad name. Not because I am particularly fond of it, or an advocate for compression in particular, but because good sound is something you achieve by taking a lot of care and working holistically. True perfectionists are non-binary, use the whole toolkit, and are not given to simplistic reductions.

All sound processing can be done badly or done well. Done well means that the right configuration is used and applied where it is beneficial. Dynamic compression can be awful if it is badly applied, but that does not mean it’s universally wrong. Applied correctly, it's an incredible asset.

Learn about the latest developments at Interprefy from Dora Murgu, Head of Training and Engagement at Interprefy.