Supported SoCs

| SoC       | RTL8721Dx | RTL8720E | RTL8726E | RTL8730E |
|-----------|-----------|----------|----------|----------|
| Supported | N         | N        | Y        | Y        |

Overview

AIVoice is an offline AI solution developed by Realtek. It includes local algorithm modules such as Audio Front End (signal processing), Keyword Spotting, Voice Activity Detection, and Speech Recognition, and can be used to build voice-related applications on Realtek Ameba SoCs.

AIVoice can be used as a purely offline solution on its own, or combined with cloud services such as speech recognition and LLMs to create a hybrid online/offline voice interaction solution.

Applications

The smart voice system is widely applied across fields and products, enhancing the efficiency and convenience of human-computer interaction. Applications include:

  • Smart Home: Smart speakers like Amazon Echo and Google Nest, or home appliances with built-in voice control, allow users to control lighting, temperature, and other smart devices through voice commands, enhancing convenience and comfort at home.

  • Smart Toys: Intelligent voice systems are being integrated into interactive toys (such as AI story machines, voice-enabled educational robots, and companion robots). These toys can engage in natural conversations with users, answer endless questions, tell personalized stories, or provide bilingual education.

  • In-Car Systems: Many modern vehicles are equipped with voice recognition systems that enable drivers to navigate, make calls, and play music using voice commands, improving driving safety and making the driving experience more enjoyable.

  • Wearable Products: Many products, including smartwatches, smart headphones, and health monitoring devices, come equipped with voice assistants. Users can check and send messages, control music playback, answer calls, etc. by voice, enhancing the user experience and interaction methods.

  • Meeting Scenarios: Voice recognition technology can transcribe meeting content in real-time, helping participants better record and review discussion points.

File Path

| Chip     | OS       | aivoice_lib_dir             | aivoice_example_dir                 |
|----------|----------|-----------------------------|-------------------------------------|
| RTL8730E | Linux    | {LINUXSDK}/apps/aivoice     | {LINUXSDK}/apps/aivoice/example     |
| RTL8730E | FreeRTOS | {RTOSSDK}/component/aivoice | {RTOSSDK}/component/example/aivoice |
| RTL8726E | FreeRTOS | {DSPSDK}/lib/aivoice        | {DSPSDK}/example/aivoice            |

Modules

| Modules                            | Functions                                                                                |
|------------------------------------|------------------------------------------------------------------------------------------|
| AFE (Audio Front End)              | Enhancing speech signals and reducing noise                                              |
| KWS (Keyword Spotting)             | Detecting specific wakeup words to trigger voice assistants, such as "Hey Siri", "Alexa" |
| VAD (Voice Activity Detection)     | Detecting speech segments or noise segments                                              |
| ASR (Automatic Speech Recognition) | Detecting offline voice control commands                                                 |

Flows

Some algorithm flows have been implemented to facilitate user development.

- Full Flow:

An offline full flow including AFE, KWS, and ASR. AFE and KWS are always on; ASR turns on when KWS detects the keyword and supports continuous recognition, exiting after a timeout.

- AFE+KWS:

An offline flow including AFE and KWS, always on.

AFE (Audio Front End)

Introduction

AFE is an audio signal processing module for enhancing speech signals. It can improve the robustness of speech recognition systems or the signal quality of communication systems.

In AIVoice, AFE includes the following submodules:

  • AEC (Acoustic Echo Cancellation)

  • BF (Beamforming)

  • NS (Noise Suppression)

  • AGC (Automatic Gain Control)

  • SSL (Sound Source Localization)

Currently the SDK provides libraries for four microphone arrays:

  • 1mic

  • 2mic_30mm

  • 2mic_50mm

  • 2mic_70mm

Other microphone arrays or performance optimizations can be provided through customized services.

Algorithm Flow

  • Single Mic

    [Figure: afe_flow_1mic]

  • Dual Mic

    [Figure: afe_flow_2mic]

Input Audio Data

  • Single Mic

    • Input audio data format: 16 kHz, 16-bit, two channels (one is mic data, the other is ref data). If AEC is not required, the input is a single channel of mic data.

    • The frame length of input audio data is fixed at 256 samples.

    • The input data is arranged as follows:

[Image: afe_input_mod_0.png]
  • Dual Mic

    • Input audio data format: 16 kHz, 16-bit, three channels (two are mic data, the other is ref data). If AEC is not required, the input is two channels of mic data.

    • The frame length of input audio data is fixed at 256 samples.

    • The input data is arranged as follows:

[Image: afe_input_mod_1.png]

Note

If AEC is not required, set the related parameters as follows: enable_aec = false, ref_num = 0.
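
For illustration, the size in bytes of one input chunk follows directly from the channel count and the fixed 256-sample frame length. A minimal sketch (the helper name is illustrative, not an SDK API):

#include <stddef.h>

/* One AFE input chunk: 256 samples per channel, 16-bit samples.
 * ref_num is 0 when AEC is not used. */
static size_t afe_input_chunk_bytes(int mic_num, int ref_num)
{
    const int frame_size = 256;   /* samples per channel, fixed */
    return (size_t)(mic_num + ref_num) * frame_size * sizeof(short);
}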

Configurations

Definition of Configuration Parameters

AFE configuration includes microphone array, working mode, submodule switches, etc.

typedef struct afe_config{

    // AFE common parameter
    afe_mic_geometry_e  mic_array;          // microphone array. Make sure to choose the matched resource library
    int ref_num;                            // reference channel number, must be 0 or 1. AEC will be disabled if ref_num=0.
    int sample_rate;                        // sampling rate(Hz), must be 16000
    int frame_size;                         // frame length(samples), must be 256

    afe_mode_e afe_mode;                    // AFE mode, for ASR or voice communication. Only AFE for ASR is supported in the current version.
    bool enable_aec;                        // AEC(Acoustic Echo Cancellation) module switch
    bool enable_ns;                         // NS(Noise Suppression) module switch
    bool enable_agc;                        // AGC(Automatic Gain Control) module switch
    bool enable_ssl;                        // SSL(Sound Source Localization) module switch. SSL is not supported in current version.

    // AEC module parameter
    afe_aec_mode_e aec_mode;                // AEC mode, signal process or NN method. NN method is not supported in current version.
    int aec_enable_threshold;               // ref signal amplitude threshold for AEC, the value should be in [0, 100].
                                            // larger value means the minimum echo to be cancelled will be larger.
    bool enable_res;                        // AEC residual echo suppression module switch
    afe_aec_filter_tap_e aec_cost;          // higher cost means longer filter length and more echo reduction
    afe_aec_res_aggressive_mode_e res_aggressive_mode;  // higher mode means more residual echo suppression but more distortion

    // NS module parameter
    afe_ns_mode_e ns_mode;                  // NS mode, signal process or NN method. NN method is not supported in current version.
    afe_ns_cost_mode_e ns_cost_mode;        // low cost mode means 1-channel NR and poorer noise reduction
    afe_ns_aggressive_mode_e ns_aggressive_mode;        // higher mode means more stationary noise suppression but more distortion

    // AGC module parameter
    int agc_fixed_gain;                     // AGC fixed gain(dB) applied on AFE output, the value should be in [0, 18].
    bool enable_adaptive_agc;               // adaptive AGC switch. Not supported in current version.

    // SSL module parameter
    float ssl_resolution;                   // SSL resolution(degree)
    int ssl_min_hz;                         // minimum frequency(Hz) of SSL module.
    int ssl_max_hz;                         // maximum frequency(Hz) of SSL module.
} afe_config_t;

If you need to change mic_array, both the configuration and the AFE resource library should change accordingly.

Please refer to ${aivoice_lib_dir}/include/aivoice_afe_config.h for details.
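
For example, an AFE configuration for a 2-mic 50 mm array can be built from the ASR defaults as in the sketch below (based on the full-flow example later in this document; AFE_CONFIG_ASR_DEFAULT() and AFE_LINEAR_2MIC_50MM are defined in the AFE headers):

struct afe_config afe_param = AFE_CONFIG_ASR_DEFAULT();
afe_param.mic_array = AFE_LINEAR_2MIC_50MM;  /* must match the linked afe resource library */
afe_param.ref_num = 1;                       /* one loopback reference channel */
afe_param.enable_aec = true;                 /* device has a speaker, echo present */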

Attention

Please make sure the mic_array and ref_num in the configuration match the AFE input audio.

Module Switch Configuration

  • AEC(Acoustic Echo Cancellation)

    • Module Description: To cancel the sound played by the device itself and picked up by the microphone, known as echo.

    • Enable Condition: When the device is equipped with a speaker and echo scenarios exist, the AEC module should be enabled.

    • Spectrum when AEC module is off and on

[Image: afe_aec_spectrogram.png]
  • RES(Residual Echo Suppression)

    • Module Description: Nonlinear processing submodule of AEC for further suppressing the residual echo. When this module is enabled, echo suppression is enhanced, but speech distortion also increases.

    • Enable Condition: When the speaker playback has strong nonlinearity and high residual echo, the RES module can be enabled. The RES module cannot be enabled on its own when AEC is disabled.

    • Spectrum when RES module is off and on

[Image: afe_aec_res_spectrogram.png]
  • NS(Noise Suppression)

    • Module Description: To suppress ambient noise, especially for stationary noise.

    • Enable Condition: When the environmental noise is high or the device generates strong stationary noise, enabling the NS module is advised.

    • Spectrum when NS module is off and on

[Image: afe_ns_spectrogram.png]
  • AGC (Automatic Gain Control)

    • Module Description: To adjust the amplitude of the output audio. When adjusting the agc_fixed_gain parameter, ensure that the processed signal is not clipped at maximum speech volume.

    • Enable Condition: When the output signal amplitude obviously affects the KWS or ASR effect, the AGC module can be enabled to apply the appropriate gain.

  • SSL (Sound Source Localization)

    • Module Description: To calculate the direction of the speaker. Only dual-microphone arrays are supported, and the output angle ranges from 0° to 180°.

    • Enable Condition: Enable SSL module when the direction information of speaker is needed.

Hardware Design Requirements

Microphone performance requirements

An omnidirectional MEMS microphone is recommended for its better consistency.

  • Sensitivity: analog microphones ≥ -38 dBV, digital microphones ≥ -26 dBFS, ±1.5 dB

  • Signal-to-noise ratio (SNR): ≥ 60 dB

  • Total harmonic distortion (THD): ≤ 1% (1 kHz)

  • Acoustic overload point (AOP): ≥ 120 dB SPL

Speaker performance requirements

  • Total harmonic distortion (THD): under rated power, 100~200 Hz THD ≤ 5%; 200 Hz~8 kHz THD ≤ 3%

Microphone Array Design Recommendations

  • The distance between two microphones should be 3.0~7.0 cm, preferably 5 cm.

  • All microphone pickup holes should lie on the same straight line, parallel to the horizontal plane.

  • The microphone orientation can be at any angle between upward and forward (towards the speaker).

  • Use the same microphone model from the same manufacturer across the array. Mixing different microphone models in one array is not recommended.

  • It is recommended to use the same structural design for all microphones in the same array to ensure consistency.

Receive Path Performance Requirements

  1. Consistency

    • Frequency response consistency: free-field spectrum (100 Hz~7 kHz) response fluctuation < 3 dB.

    • Phase consistency: phase difference between microphones (1 kHz) < 10°.

  2. Leakproofness

    • With an external speaker playing, the overall volume attenuation (100 Hz~8 kHz) between a blocked and an unblocked microphone pickup hole should be > 15 dB.

  3. No Abnormality in the Spectrum

    • There should be no abnormal electrical noise.

    • There should be no data loss.

  4. Spectrum Attenuation

    • There should be no significant attenuation below 7.5 kHz.

  5. Frequency Aliasing

    • When playing a sweep signal (0~20 kHz), the recorded signal should show no significant frequency aliasing.

Echo Path Performance Requirements

  1. Loopback mode for echo reference

    • Only hardware loopback is supported for the echo reference.

  2. Echo reference signal position

    • It is recommended that the echo reference signal be taken as close to the speaker side as possible, and after EQ, to avoid nonlinearity caused by sound effects.

  3. Reference signal gain

    • When the speaker plays at maximum volume, the echo reference signal should not clip; the recommended signal peak is -3 to -6 dB.

  4. Latency

    • There should be no latency in the echo reference path.

  5. Total harmonic distortion

    • When the speaker plays at maximum volume: 100 Hz, THD ≤ 10%; 200~500 Hz, THD ≤ 6%; 500 Hz~8 kHz, THD ≤ 3%.

  6. Leakproofness

    • With the device speaker playing, the overall volume attenuation (100 Hz~8 kHz) between a blocked and an unblocked microphone pickup hole should be > 15 dB.

KWS (Keyword Spotting)

Introduction

KWS is the module that detects specific wakeup words in audio. It is usually the first step in a voice interaction system; the device enters the state of waiting for voice commands after detecting the keyword.

In AIVoice, two solutions of KWS are available:

| Solution           | Training Data     | Available Keywords                                        | Feature                           |
|--------------------|-------------------|-----------------------------------------------------------|-----------------------------------|
| Fixed Keyword      | specific keywords | keywords same as training data                            | better performance; smaller model |
| Customized Keyword | common data       | customized keywords of the same language as training data | more flexible                     |

Currently the SDK provides a fixed-keyword model library for the Chinese keywords "xiao-qiang-xiao-qiang" and "ni-hao-xiao-qiang", and a Chinese customized-keyword library. Other keywords or performance optimizations can be provided through customized services.

Configurations

KWS configurable parameters:

- keywords:

Keywords for wakeup; available keywords depend on the KWS model. If the KWS model is a fixed-keyword solution, keywords can only be chosen from the trained words. For the customized solution, keywords can be any combination of units of the same language as the training data (such as pinyin for Chinese). Example: "xiao-qiang-xiao-qiang".

- thresholds:

Threshold for wakeup, in the range [0, 1]. The higher the threshold, the fewer false alarms, but the harder to wake up. Set to 0 to use sensitivity with predefined thresholds.

- sensitivity:

Three levels of sensitivity are provided with predefined thresholds. The higher the sensitivity, the easier to wake up, but also the more false alarms. ONLY works when thresholds is set to 0.

Please refer to ${aivoice_lib_dir}/include/aivoice_kws_config.h for details.
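
A typical configuration starts from the defaults and adjusts only what is needed, as in this minimal sketch (KWS_CONFIG_DEFAULT() comes from the bundled example; the exact field names for keywords, thresholds, and sensitivity are defined in aivoice_kws_config.h):

struct kws_config kws_param = KWS_CONFIG_DEFAULT();
/* adjust keywords/thresholds/sensitivity here if the defaults do not fit;
 * see aivoice_kws_config.h for the exact field names and value ranges */
config.kws = &kws_param;   /* attach to the aivoice_config passed to create() */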

KWS Mode

Two KWS modes are provided for different use cases. Multi mode improves KWS and ASR accuracy compared with single mode, but also increases computation and heap usage.

| KWS Mode    | Function                                   | Description                                     |
|-------------|--------------------------------------------|-------------------------------------------------|
| single mode | void rtk_aivoice_set_single_kws_mode(void) | less computation consumption and less heap used |
| multi mode  | void rtk_aivoice_set_multi_kws_mode(void)  | better KWS and ASR accuracy                     |

Attention

The KWS mode MUST be set before creating the instance in these flows (see the sketch after this list):

  • aivoice_iface_full_flow_v1

  • aivoice_iface_afe_kws_v1
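
For instance, in the full-flow example the mode is set right after selecting the interface and before create():

const struct rtk_aivoice_iface *aivoice = &aivoice_iface_full_flow_v1;
rtk_aivoice_set_multi_kws_mode();   /* MUST be called before aivoice->create() */
void *handle = aivoice->create(&config);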

VAD (Voice Activity Detection)

Introduction

VAD is the module to detect the presence of human speech in audio.

In AIVoice, a neural network based VAD is provided and can be used in speech enhancement, ASR systems, etc.

Configurations

VAD configurable parameters:

- sensitivity:

Three levels of sensitivity are provided with predefined thresholds. The higher the sensitivity, the easier to detect speech, but also the more false alarms.

- left_margin:

Time margin added to the start of a speech segment, making the start offset earlier than the raw prediction. It only affects offset_ms of the VAD output; it does not affect the event trigger time of status 1.

- right_margin:

Time margin added to the end of a speech segment, making the end offset later than the raw prediction. It affects both offset_ms of the VAD output and the event time of status 0.

Please refer to ${aivoice_lib_dir}/include/aivoice_vad_config.h for details.

Note

left_margin only affects offset_ms returned by VAD; it does not affect the VAD event trigger time. If you need the audio within left_margin, implement a buffer to keep recent audio.
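
One simple way to keep that audio available is a small ring buffer of the most recent samples, updated with every frame fed to aivoice. A generic sketch (not an SDK API; sizes and names are illustrative):

#include <stddef.h>

/* 300 ms of 16 kHz, 16-bit mono audio = 4800 samples of history */
#define HISTORY_SAMPLES 4800

static short history[HISTORY_SAMPLES];
static size_t history_pos;

static void history_push(const short *frame, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        history[history_pos] = frame[i];
        history_pos = (history_pos + 1) % HISTORY_SAMPLES;
    }
}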

ASR (Automatic Speech Recognition)

Introduction

ASR is the module that recognizes speech and converts it to text.

In AIVoice, ASR supports offline recognition of Chinese speech command words.

Currently the SDK provides libraries for 40 air-conditioning related command words, including "打开空调" (turn on the air conditioner) and "关闭空调" (turn off the air conditioner). Other command words or performance optimizations can be provided through customized services.

Configurations

ASR configurable parameters:

- sensitivity:

Three levels of sensitivity are provided with predefined internal parameters. The higher the sensitivity, the easier to detect commands, but also the more false alarms.

Please refer to ${aivoice_lib_dir}/include/aivoice_asr_config.h for details.

Interfaces

Flow and Module Interfaces

| Interface                  | flow/module |
|----------------------------|-------------|
| aivoice_iface_full_flow_v1 | AFE+KWS+ASR |
| aivoice_iface_afe_kws_v1   | AFE+KWS     |
| aivoice_iface_afe_v1       | AFE         |
| aivoice_iface_vad_v1       | VAD         |
| aivoice_iface_kws_v1       | KWS         |
| aivoice_iface_asr_v1       | ASR         |

All interfaces support the following functions:

  • create()

  • destroy()

  • reset()

  • feed()

Please refer to ${aivoice_lib_dir}/include/aivoice_interface.h for details.
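
Putting these together, a typical lifecycle looks like the sketch below (config, my_callback, audio_chunk, and chunk_bytes are placeholders; a complete version appears in the Examples section):

const struct rtk_aivoice_iface *aivoice = &aivoice_iface_afe_v1;  /* pick a flow/module */

void *handle = aivoice->create(&config);                   /* create and initialize */
rtk_aivoice_register_callback(handle, my_callback, NULL);  /* receive output events */
aivoice->feed(handle, audio_chunk, chunk_bytes);           /* feed audio, usually in a loop */
aivoice->reset(handle);                                    /* optional: reset internal status */
aivoice->destroy(handle);                                  /* release the instance */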

Event and Callback Message

| aivoice_out_event_type        | Event Trigger Time                                            | Callback Message                                                                                                |
|-------------------------------|---------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| AIVOICE_EVOUT_VAD             | when VAD detects the start or end point of a speech segment  | struct including VAD status and offset                                                                          |
| AIVOICE_EVOUT_WAKEUP          | when KWS detects a keyword                                   | JSON string including id, keyword, and score. Example: {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.9}       |
| AIVOICE_EVOUT_ASR_RESULT      | when ASR detects a command word                              | JSON string including fst type, commands, and id. Example: {"type":0,"commands":[{"rec":"play music","id":14}]} |
| AIVOICE_EVOUT_AFE             | every frame when AFE receives input                          | struct including AFE output data, channel number, etc.                                                          |
| AIVOICE_EVOUT_ASR_REC_TIMEOUT | when no command word is detected within the timeout duration | NULL                                                                                                            |

AFE Event Definition

struct aivoice_evout_afe {
    int     ch_num;                       /* channel number of output audio signal, default: 1 */
    short*  data;                         /* enhanced audio signal samples */
    char*   out_others_json;              /* reserved for other output data, like flags, key: value */
};

VAD Event Definition

struct aivoice_evout_vad {
    int status;                     /*  0: vad is changed from speech to silence,
                                           indicating the end point of a speech segment
                                        1: vad is changed from silence to speech,
                                           indicating the start point of a speech segment */
    unsigned int offset_ms;         /* time offset relative to reset point. */
};

Common Configurations

AIVoice configurable parameters:

- no_cmd_timeout:

ASR exits when no command word detected during this duration. ONLY used in full flow.

- memory_alloc_mode:

Default mode uses the SDK default heap. SRAM mode uses the SDK default heap while also allocating space from SRAM for memory-critical data. SRAM mode is ONLY available on RTL8713EC and RTL8726EA DSP now.

Please refer to ${aivoice_lib_dir}/include/aivoice_sdk_config.h for details.
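
For example, the full-flow example later in this document builds the common configuration like this:

struct aivoice_sdk_config aivoice_param = AIVOICE_SDK_CONFIG_DEFAULT();
aivoice_param.no_cmd_timeout = 10;   /* see aivoice_sdk_config.h for units and range */
config.common = &aivoice_param;      /* can be NULL to keep all defaults */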

Examples

AIVoice Full Flow Offline Example

This example shows how to use the AIVoice full flow with pre-recorded 3-channel audio; it runs only once after the EVB resets. Audio functions such as recording and playback are not integrated.

Steps of Using AIVoice

  1. Select the aivoice flow or modules needed. Set the KWS mode to multi or single if using the full flow or AFE+KWS flow.

    /* step 1:
     * Select the aivoice flow you want to use.
     * Refer to the end of aivoice_interface.h to see which flows are supported.
     */
    const struct rtk_aivoice_iface *aivoice = &aivoice_iface_full_flow_v1;
    rtk_aivoice_set_multi_kws_mode();
    
  2. Build configuration.

    /* step 2:
     * Modify the default configuration if needed.
     * You can modify zero or more configurations of afe/vad/kws/...
     */
    struct aivoice_config config;
    memset(&config, 0, sizeof(config));
    
    /*
     * here we use afe_res_2mic50mm for example.
     * you can change these configurations according to the afe resource you use.
     * refer to aivoice_afe_config.h for details.
     *
     * afe_config.mic_array MUST match the afe resource you linked.
     */
    struct afe_config afe_param = AFE_CONFIG_ASR_DEFAULT();
    afe_param.mic_array = AFE_LINEAR_2MIC_50MM;  // change this according to the linked afe resource.
    config.afe = &afe_param;
    
    /*
     * ONLY turn on these settings when you are sure about what you are doing.
     * It is recommended to use the default configuration
     * if you do not know the meaning of these configuration parameters.
     */
    struct vad_config vad_param = VAD_CONFIG_DEFAULT();
    vad_param.left_margin = 300; // you can change the configure if needed
    config.vad = &vad_param;    // can be NULL
    
    struct kws_config kws_param = KWS_CONFIG_DEFAULT();
    config.kws = &kws_param;    // can be NULL
    
    struct asr_config asr_param = ASR_CONFIG_DEFAULT();
    config.asr = &asr_param;    // can be NULL
    
    struct aivoice_sdk_config aivoice_param = AIVOICE_SDK_CONFIG_DEFAULT();
    aivoice_param.no_cmd_timeout = 10;
    config.common = &aivoice_param; // can be NULL
    
  3. Use create() to create and initialize the aivoice instance with the given configuration.

    /* step 3:
     * Create the aivoice instance.
     */
    void *handle = aivoice->create(&config);
    if (!handle) {
        return;
    }
    
  4. Register a callback function.

    /* step 4:
     * Register a callback function.
     * You may only receive some of the aivoice_out_event_type in this example,
     * depending on the flow you use.
     * */
    
    rtk_aivoice_register_callback(handle, aivoice_callback_process, NULL);
    

    The callback function can be modified according to the use case:

    static int aivoice_callback_process(void *userdata,
                                        enum aivoice_out_event_type event_type,
                                        const void *msg, int len)
    {
    
        (void)userdata;
        struct aivoice_evout_vad *vad_out;
        struct aivoice_evout_afe *afe_out;
    
        switch (event_type) {
        case AIVOICE_EVOUT_VAD:
                vad_out = (struct aivoice_evout_vad *)msg;
                printf("[user] vad. status = %d, offset = %d\n", vad_out->status, vad_out->offset_ms);
                break;
    
        case AIVOICE_EVOUT_WAKEUP:
                printf("[user] wakeup. %.*s\n", len, (char *)msg);
                break;
    
        case AIVOICE_EVOUT_ASR_RESULT:
                printf("[user] asr. %.*s\n", len, (char *)msg);
                break;
    
        case AIVOICE_EVOUT_ASR_REC_TIMEOUT:
                printf("[user] asr timeout\n");
                break;
    
        case AIVOICE_EVOUT_AFE:
                afe_out = (struct aivoice_evout_afe *)msg;
    
                // afe will output audio each frame.
                // in this example, we only print it once to keep the log clear
                static int afe_out_printed = false;
                if (!afe_out_printed) {
                        afe_out_printed = true;
                        printf("[user] afe output %d channels raw audio, others: %s\n",
                                   afe_out->ch_num, afe_out->out_others_json ? afe_out->out_others_json : "null");
                }
    
                // process afe output raw audio as needed
                break;
    
        default:
                break;
        }
    
        return 0;
    }
    
  5. Use feed() to input audio data to aivoice.

    /* when running on chips, we get an online audio stream;
     * here we use fixed audio.
     */
    
    const char *audio = (const char *)get_test_wav();
    int len = get_test_wav_len();
    int audio_offset = 44;
    int mics_num = 2;
    int afe_frame_bytes = (mics_num + afe_param.ref_num) * afe_param.frame_size * sizeof(short);
    while (audio_offset <= len - afe_frame_bytes) {
            /* step 5:
             * Feed the audio to the aivoice instance.
             * */
    
            aivoice->feed(handle,
                          (char *)audio + audio_offset,
                          afe_frame_bytes);
    
            audio_offset += afe_frame_bytes;
    }
    
  6. (Optional) If you need to reset the status, use reset().
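
    For example:

    aivoice->reset(handle);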

  7. If aivoice is no longer needed, use destroy() to destroy the instance.

    /* step 6:
    * Destroy the aivoice instance */
    
    aivoice->destroy(handle);
    

Please refer to ${aivoice_example_dir}/full_flow_offline for more details.

Build Example

  1. Build Tensorflow Lite Micro Library for DSP, refer to Build Tensorflow Lite Micro Library.

    Or use the prebuilt Tensorflow Lite Micro Library in {DSPSDK}/lib/aivoice/prebuilts.

  2. Import {DSPSDK}/example/aivoice/full_flow_offline source in Xtensa Xplorer.

  3. Set software configurations and modify libraries such as AFE resource, KWS resource if needed.

    • add include path (-I)

      ${workspace_loc}/../lib/aivoice/include

    • add library search path (-L)

      ${workspace_loc}/../lib/aivoice/prebuilts/$(TARGET_CONFIG)
      ${workspace_loc}/../lib/xa_nnlib/v1.8.1/bin/$(TARGET_CONFIG)/Release
      ${workspace_loc}/../lib/lib_hifi5/project/hifi5_library/bin/$(TARGET_CONFIG)/Release
      ${workspace_loc}/../lib/tflite_micro/project/bin/$(TARGET_CONFIG)/Release

    • add libraries (-l)

      -laivoice -lafe_kernel -lafe_res_2mic50mm -lkernel -lvad -lkws -lasr -lfst -lcJSON -ltomlc99 -ltflite_micro -lxa_nnlib -lhifi5_dsp

  4. Build image, please follow steps in dsp_build.

Glossary

AEC

Acoustic Echo Cancellation, or echo cancellation, refers to removing the echo signal from the input signal. The echo signal is generated by sound played through the device's speaker and then captured by the microphone.

AFE

Audio Front End, refers to a combination of modules for preprocessing raw audio signals. It is usually performed before voice interaction to improve speech signal quality, and includes several speech enhancement algorithms.

AGC

Automatic Gain Control, an algorithm that dynamically controls the gain of a signal, automatically adjusting the amplitude to maintain optimal signal strength.

ASR

Automatic Speech Recognition, or Speech-to-Text, refers to recognition of spoken language from audio into text. It can be used to build voice-user interface to enable spoken human interaction with AI devices.

BF

BeamForming, refers to a spatial filter designed for a microphone array to enhance the signal from a specific direction and attenuate signals from other directions.

KWS

Keyword Spotting, or wakeup word detection, refers to identifying specific keywords in audio. It is usually the first step in a voice interaction system. The device enters the state of waiting for voice commands after detecting the keyword.

NN

Neural Network, a machine learning model used for various tasks in artificial intelligence. Neural networks rely on training data to learn and improve their accuracy.

NS

Noise Suppression, or noise reduction, refers to suppressing ambient noises in the signal to enhance the speech signal, especially stationary noises.

RES

Residual Echo Suppression, refers to suppressing the echo signal remaining after AEC processing. It is a postfilter for AEC.

SSL

Sound Source Localization, or direction of arrival (DOA) estimation, refers to estimating the spatial location of a sound source using a microphone array.

TTS

Text-To-Speech, or speech synthesis, is a technology that converts text into spoken audio. It can be used in any speech-enabled application that requires converting text to speech that imitates a human voice.

VAD

Voice Activity Detection, or speech activity detection, is a binary classifier that detects the presence or absence of human speech. It is widely used in speech enhancement, ASR systems, etc., and can also be used to deactivate some processes during non-speech sections of an audio session, saving computation or bandwidth.