SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound



Haohe Liu¹ 📮, Xuenan Xu², Yi Yuan¹, Mengyue Wu², Wenwu Wang¹, Mark D. Plumbley¹

¹ CVSSP, University of Surrey, Guildford, UK

² Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

📮 Corresponding author


Abstract

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling language modelling techniques to be applied to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech, and they lack the semantic clues required for efficient language modelling. To address these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into a hundred tokens per second or fewer across diverse audio types, including speech, sound effects, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder based on the self-supervised AudioMAE, discretized using k-means clustering on extensive audio data, and an acoustic encoder that captures the remaining acoustic details. The outputs of the semantic and acoustic encoders are reconstructed into audio by a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of $25$, $50$, and $100$ per second, offering a balance between compression and quality. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs, even at much lower bitrates.

Highlights

  • Ultra-low bitrate: We focus on bitrates between 0.31 kbps and 1.43 kbps, with token rates of 25, 50, or 100 per second.
  • Strong semantics in the audio tokens: Indicated by classification accuracy.
  • Support for variable vocabulary sizes: One model supports four different vocabulary sizes.


    Figure 1: Overview of the SemantiCodec architecture. For an input audio clip, the quantized semantic representation $E_{s}$ is obtained from a codebook pre-computed by k-means clustering on AudioMAE embeddings. $\mathbf{Y}$ and $E_{s}$ are then concatenated and fed to a residual encoder that captures the remaining acoustic details; its output is discretized to $E_{a}$ by a vector quantization module. The SemantiCodec embedding $E$ is obtained by concatenating $E_{s}$ and $E_{a}$. A latent diffusion model is trained to generate the original audio clip conditioned on $E$.
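The semantic quantization step in Figure 1 can be illustrated with a minimal sketch: each embedding frame is replaced by its nearest centroid in a pre-computed k-means codebook, and the centroid index becomes the semantic token. Everything concrete below is an assumption made for illustration, not the released implementation: the function name `quantize_semantic`, the tiny 4-entry codebook, the zero-valued stand-in for $E_{a}$ (the residual encoder and its vector quantizer are not modelled), and concatenation along the feature axis.

```python
import numpy as np


def quantize_semantic(features, codebook):
    """Return (token indices, quantized vectors) by nearest-centroid lookup.

    features: array of shape (T, D), one embedding per frame.
    codebook: array of shape (K, D), e.g. centroids from k-means.
    """
    # Squared Euclidean distance between every frame and every centroid: (T, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


# Hypothetical toy codebook: 4 centroids in a 4-dimensional space.
codebook = np.eye(4)

# Three frames lying close to centroids 3, 0 and 2 respectively.
features = codebook[[3, 0, 2]] + 0.1

semantic_tokens, e_s = quantize_semantic(features, codebook)

# Stand-in for the quantized acoustic embedding E_a (assumption: same shape).
e_a = np.zeros_like(e_s)

# Assumed concatenation along the feature axis to form E.
e = np.concatenate([e_s, e_a], axis=-1)

print(semantic_tokens.tolist(), e.shape)
```

In practice the codebook would hold thousands of centroids fitted on AudioMAE embeddings, and a decoder would be conditioned on `e`; the lookup itself stays the same.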

    Figure 2: Comparison with state-of-the-art neural audio codecs



    Waveform Reconstruction

    ✅ We provide the original and reconstructed audio samples for Encodec (3.0, 1.5 kbps), HiFi-Codec (2.0 kbps), the Descript codec (DAC; 1.41, 0.78, 0.47 kbps, reproduced using the open-source code), and the proposed SemantiCodec (1.43, 0.71, 0.35 kbps).

    ✅ We report ViSQOL, word error rate (WER), and classification accuracy for each system.
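For readers unfamiliar with WER, the metric is the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below is a generic implementation of that definition; the function name `word_error_rate` and the example sentences are illustrative assumptions, and this page does not state which speech recognizer or scoring tool produced the reported numbers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one word of six is missing from the hypothesis.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {100 * wer:.1f}%")
```

A lower WER means that speech remains more intelligible to the recognizer after being encoded and decoded by the codec.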


    Samples from MUSDB18 (Music)


    System        Bit rate   Token rate  ViSQOL-Avg ↑  WER (%) ↓  Accuracy (%) ↑
    Original      /          /           /             /          /
    Encodec       3.0 kbps   300/sec     3.82          3.7        37.0
    HiFi-Codec    2.0 kbps   200/sec     3.57          3.6        40.3
    Encodec       1.5 kbps   150/sec     3.33          5.0        35.5
    DAC           1.41 kbps  141/sec     3.13          5.0        43.5
    SemantiCodec  1.43 kbps  100/sec     3.81          3.4        52.5
    DAC           0.78 kbps  78/sec      2.82          11.6       43.0
    SemantiCodec  0.71 kbps  50/sec      3.55          5.1        50.3
    DAC           0.47 kbps  47/sec      2.39          28.2       41.3
    SemantiCodec  0.35 kbps  25/sec      3.17          19.6       46.1

    Sample IDs: 1 to 11
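The bit rates and token rates listed for each system imply how many bits a single token carries on average. The short calculation below makes that explicit. It assumes that 1 kbps equals 1000 bits per second and that the bit rate is simply the token rate times the bits per token, with no additional entropy coding; the helper name `bits_per_token` is illustrative.

```python
# (system, bit rate in kbps, token rate per second), copied from the table.
SYSTEMS = [
    ("Encodec", 3.0, 300),
    ("HiFi-Codec", 2.0, 200),
    ("Encodec", 1.5, 150),
    ("DAC", 1.41, 141),
    ("SemantiCodec", 1.43, 100),
    ("DAC", 0.78, 78),
    ("SemantiCodec", 0.71, 50),
    ("DAC", 0.47, 47),
    ("SemantiCodec", 0.35, 25),
]


def bits_per_token(kbps, tokens_per_sec):
    """Average bits carried by one token, assuming 1 kbps = 1000 bits/s."""
    return kbps * 1000.0 / tokens_per_sec


for name, kbps, rate in SYSTEMS:
    bpt = bits_per_token(kbps, rate)
    print(f"{name:<12} {kbps:>5.2f} kbps  {rate:>3}/sec  {bpt:5.1f} bits/token")
```

Under these assumptions, every baseline configuration works out to about 10 bits per token, while the SemantiCodec configurations carry about 14 bits per token. SemantiCodec therefore reaches a similar bit rate with fewer but larger-vocabulary tokens, which shortens the sequence a language model has to process.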

    Samples from AudioSet (General Audio)


    System        Bit rate   Token rate  ViSQOL-Avg ↑  WER (%) ↓  Accuracy (%) ↑
    Original      /          /           /             /          /
    Encodec       3.0 kbps   300/sec     3.82          3.7        37.0
    HiFi-Codec    2.0 kbps   200/sec     3.57          3.6        40.3
    Encodec       1.5 kbps   150/sec     3.33          5.0        35.5
    DAC           1.41 kbps  141/sec     3.13          5.0        43.5
    SemantiCodec  1.43 kbps  100/sec     3.81          3.4        52.5
    DAC           0.78 kbps  78/sec      2.82          11.6       43.0
    SemantiCodec  0.71 kbps  50/sec      3.55          5.1        50.3
    DAC           0.47 kbps  47/sec      2.39          28.2       41.3
    SemantiCodec  0.35 kbps  25/sec      3.17          19.6       46.1

    Sample IDs: 1 to 13


    Samples from LibriTTS (Speech)


    System        Bit rate   Token rate  ViSQOL-Avg ↑  WER (%) ↓  Accuracy (%) ↑
    Original      /          /           /             /          /
    Encodec       3.0 kbps   300/sec     3.82          3.7        37.0
    HiFi-Codec    2.0 kbps   200/sec     3.57          3.6        40.3
    Encodec       1.5 kbps   150/sec     3.33          5.0        35.5
    DAC           1.41 kbps  141/sec     3.13          5.0        43.5
    SemantiCodec  1.43 kbps  100/sec     3.81          3.4        52.5
    DAC           0.78 kbps  78/sec      2.82          11.6       43.0
    SemantiCodec  0.71 kbps  50/sec      3.55          5.1        50.3
    DAC           0.47 kbps  47/sec      2.39          28.2       41.3
    SemantiCodec  0.35 kbps  25/sec      3.17          19.6       46.1

    Sample IDs: 1 to 8

    For more audio samples, please refer to the GitHub repository.