#llm/audio

Public notes from activescott tagged with #llm/audio

Thursday, June 4, 2026

Gemma 4 12B: Run Locally, Fine-Tune, Benchmark Performance

www.labellerr.com/blog/gemma-4-12b-run-locally-and-fine-tune/

Because vision, audio, and text inputs share the exact same weights, you no longer need to co-tune separate frozen encoders. Downstream adapter fine-tuning (such as LoRA) or full fine-tuning naturally updates the entire multimodal token loop in a single pass.
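A minimal sketch of the adapter idea in the excerpt above, assuming the standard LoRA formulation (a frozen weight matrix plus a scaled low-rank update). The dimensions, `alpha`, and helper names are illustrative assumptions, not Gemma's actual configuration. Because every modality's tokens pass through the same weights, one adapter like this would apply to all of them.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, a, b, alpha, rank):
    """Compute x @ W + (alpha / rank) * (x @ A) @ B without merging the weights."""
    base = matmul(x, w)                 # frozen path
    update = matmul(matmul(x, a), b)    # low-rank trainable path
    scale = alpha / rank
    return [[p + scale * q for p, q in zip(br, ur)] for br, ur in zip(base, update)]

random.seed(0)
d_in, d_out, rank = 8, 8, 2  # hypothetical toy sizes

w = [[random.gauss(0, 0.02) for _ in range(d_out)] for _ in range(d_in)]  # frozen base weight
a = [[random.gauss(0, 0.02) for _ in range(rank)] for _ in range(d_in)]   # trainable, random init
b = [[0.0] * d_out for _ in range(rank)]                                  # trainable, zero init

x = [[1.0] * d_in]  # one token embedding, from any modality
y = lora_forward(x, w, a, b, alpha=16, rank=rank)

trainable = d_in * rank + rank * d_out  # adapter parameters
full = d_in * d_out                     # parameters in the frozen matrix
print(f"trainable adapter params: {trainable} of {full}")
```

With `b` initialized to zero, the adapter starts as a no-op, so the output initially matches the frozen model.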

#5:14 PM

open-source fine-tuning goog llm/training llm/audio

Google's new open source Gemma 4 12B analyzes audio, video — and runs entirely locally on a typical 16GB enterprise laptop | VentureBeat

venturebeat.com/technology/googles-new-open-source-gemma-4-12b-analyzes-audio-video-and-runs-entirely-locally-on-a-typical-16gb-enterprise-laptop

While many AI open source model providers are pursuing larger and more powerful models, Google is still giving attention to the smaller, more local side of the market. Today, the tech giant released Gemma 4 12B, an 11.95-billion-parameter open-weights model with a permissive Apache 2.0 license, optimized to execute locally on a standard enterprise laptop using just 16GB of VRAM or unified memory.
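A rough back-of-the-envelope check on the 16GB figure, using only the parameter count quoted above. It counts weights only (no KV cache, activations, or runtime overhead), and the precisions listed are my assumptions; the excerpt does not say which one the 16GB claim relies on.

```python
PARAMS = 11.95e9  # parameter count quoted in the excerpt

def weight_gib(params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return params * bits_per_param / 8 / 2**30

# Hypothetical storage precisions for comparison
for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_gib(PARAMS, bits):.1f} GiB")
```

At 16 bits per parameter the weights alone come to roughly 22 GiB, so fitting in 16GB presumably depends on 8-bit or lower quantization. That is an inference from the arithmetic, not something the article states.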

Traditional multimodal systems typically utilize discrete, separate encoders to translate audio waveforms and visual data into representations that the core language model can process.

This conventional approach inherently increases both inference latency and total memory consumption.

Gemma 4 12B radically alters this pipeline by functioning entirely without these secondary encoders. Instead, visual patches and raw audio waveforms are projected directly into the core large language model's embedding space through lightweight linear layers.

The vision encoder is replaced by a 35-million-parameter module utilizing a single matrix multiplication, while the audio encoder is eliminated entirely.
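To make "a single matrix multiplication" concrete, here is a toy sketch of projecting flattened image patches straight into a language model's embedding space. The patch size, embedding width, and the `project` helper are hypothetical; the excerpt does not give the real layer shapes, and the actual module may include details not shown here.

```python
import random

def project(patches, weight, bias):
    """Map each flattened patch to an embedding with one matrix multiply plus a bias."""
    columns = list(zip(*weight))  # one column per output dimension
    return [
        [sum(p * w for p, w in zip(patch, col)) + b for col, b in zip(columns, bias)]
        for patch in patches
    ]

random.seed(0)
PATCH_DIM = 4 * 4 * 3  # hypothetical 4x4 RGB patch, flattened
EMBED_DIM = 16         # hypothetical; real model widths are far larger

weight = [[random.gauss(0, 0.02) for _ in range(EMBED_DIM)] for _ in range(PATCH_DIM)]
bias = [0.0] * EMBED_DIM

patches = [[random.random() for _ in range(PATCH_DIM)] for _ in range(5)]
image_tokens = project(patches, weight, bias)

# Stand-in text embeddings; in a real model these come from the token embedding table
text_tokens = [[0.0] * EMBED_DIM for _ in range(3)]

# Both modalities now live in the same space and form one input sequence
sequence = text_tokens + image_tokens
projector_params = PATCH_DIM * EMBED_DIM + EMBED_DIM
print(len(sequence), "tokens,", projector_params, "projector parameters")
```

The point of the design is that the projector is tiny relative to the language model, so almost all multimodal capacity sits in the shared weights rather than in a separate encoder.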

For enterprise engineering teams, this unified architecture delivers distinct operational advantages: lower latency for multimodal tasks, reduced VRAM requirements (down to 16GB — typical for laptops), and the ability to fine-tune the entire multimodal system in a single, cohesive pass.

Google has ensured that Gemma 4 12B is not an isolated experiment; it is ready for production. Weights are available on Hugging Face and Kaggle, and the model integrates seamlessly with industry-standard deployment frameworks such as vLLM, SGLang, MLX, and llama.cpp.

#5:04 PM

open-source fine-tuning goog llm/training llm/audio

Monday, May 11, 2026

Real Time Voice AI, Built to Scale

smallest.ai/

#11:17 PM

llm/audio llm

Wednesday, February 25, 2026

Free AI Voice Generator & Voice Agents Platform | ElevenLabs

elevenlabs.io/

Powering the best enterprises, creators, and developers. From ElevenAgents for customer experience, ElevenCreative for content creation, to the leading AI voice generator.

#2:21 AM

llm/audio text-to-speech

Wednesday, February 4, 2026

Voice emerges as AI’s next frontier as Deepgram raises $130M | The Deep View

www.thedeepview.com/articles/voice-emerges-as-ai-s-next-frontier-as-deepgram-raises-usd130m

One of Deepgram’s goals for the upcoming year is to pass the Audio Turing Test, which assesses how realistic and human-like AI-generated audio sounds.

#3:34 PM

speech-to-text llm/audio

Mistral surcharges voice AI with new models | The Deep View

www.thedeepview.com/articles/mistral-surcharges-voice-ai-with-new-models

Voxtral Realtime: a 4-billion-parameter model aimed at live transcription, achieving “state of the art” transcription with 480ms latency across 13 languages. It can be configured down to sub-200ms latency.

Performance on the FLEURS benchmark shows that Voxtral Mini Transcribe V2 performs competitively against models from Gemini and OpenAI, with the lowest diarization error rate.

#3:33 PM

speech-to-text llm/audio mistral

Thursday, January 22, 2026

Simon Willison on text-to-speech

simonwillison.net/tags/text-to-speech/

#8:47 PM

llm/audio

Wednesday, January 7, 2026

Build voice, video, and physical AI | LiveKit

livekit.io/

#6:01 PM

llm/audio llm

Vapi - Build Advanced Voice AI Agents

vapi.ai/

#6:00 PM

llm/audio llm

AI police report turns Heber City officer into a frog

www.police1.com/artificial-intelligence/utah-pd-testing-ai-report-writing-software-shares-comical-error-caused-by-the-princess-and-the-frog

“The body cam software and the AI report writing software picked up on the movie that was playing in the background, which happened to be ‘The Princess and the Frog,’” a Heber City sergeant told FOX 13 News. “That’s when we learned the importance of correcting these AI-generated reports.”

#2:49 AM

ai llm/audio emergency-services