
Whisper. A few unintelligible words. For those who suffer from dysarthria, a motor speech disorder, basic communication is a challenge that profoundly affects their professional and personal lives. But an innovation based on artificial intelligence (AI) and developed in India could now be life-changing.
A team led by Associate Professor Vineet Gandhi from the International Institute of Information Technology (IIIT) in Hyderabad has developed a simple app that helps people speak by transforming the speaker’s voice in near real-time. The app can either convert slurred speech into clear, natural-sounding speech, or use a camera to analyze lip movements and subtle throat vibrations to generate intelligible speech.
While the current project runs in English, the team’s next goal is to bring these technologies to regional languages, including Hindi, Telugu and Tamil, as many across the country lack the means to benefit from accessibility-focused AI models. For this work, Mr. Gandhi received the Anusandhan National Research Foundation (ANRF) Award in 2026.
Excerpts from the interview:
What inspired you to start working on this AI humanitarian project?
My research has always been guided by a simple question: what real problem can technology help solve?
While my academic background is primarily in computer vision, about four years ago I began to see the exciting possibilities emerging in speech research and decided to explore the field further. I became increasingly aware of the challenges faced by many individuals who lose the ability to speak due to health problems: the impact of this loss goes far beyond communication – it affects independence, identity and connection.
Recognizing this need inspired me to focus my work on accessibility-based technologies designed to restore or enable speech to help people regain their voice.
Could you describe how the app works for people with speech impairments?
The application is designed to convert garbled or distorted speech into clear, natural-sounding speech with only a few hundred milliseconds of delay. The user simply speaks in their own voice and the system processes it to produce intelligible speech for the listener.
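Conceptually, near real-time conversion of this kind is typically implemented as a chunked streaming loop: audio is captured in small chunks, each chunk is passed through the conversion model, and the output is played back immediately. A minimal Python sketch is below, assuming a 16 kHz microphone stream and using the sounddevice library; the convert() function is a hypothetical placeholder for the conversion model, not the team’s actual implementation.

```python
# Minimal sketch of a chunked streaming loop for low-latency speech
# conversion. convert() is a hypothetical stand-in for the real model.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000
CHUNK_MS = 200                              # process ~200 ms at a time
CHUNK = SAMPLE_RATE * CHUNK_MS // 1000      # samples per chunk

def convert(chunk: np.ndarray) -> np.ndarray:
    """Placeholder for the model mapping dysarthric speech to clear speech."""
    return chunk                            # identity pass-through in this sketch

with sd.Stream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    while True:                             # run until interrupted
        audio, _ = stream.read(CHUNK)       # block until one chunk is captured
        cleaned = convert(audio[:, 0])      # run the conversion model
        stream.write(cleaned.reshape(-1, 1))  # play the converted chunk
```

The chunk size is the main knob here: smaller chunks lower the end-to-end delay but give the model less context per step.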
We are also developing an additional lip-to-speech capability, where a person can move their lips silently and the system generates the corresponding speech.
A key aspect we focus on is personalization: users can calibrate the app to their own voice by reading a few minutes of text aloud within it.
Our goal is for these technologies to be integrated into common communication platforms, such as web-based calling applications, making everyday communication easier for people with speech impairments.
You also aim to extend this technology to regional Indian languages. How do you plan to achieve this?
Currently, much of the global speech technology ecosystem is designed primarily for English, and our initial experiments naturally followed the same trajectory. However, the main focus of our research is to extend these capabilities to regional Indian languages, where such speech technologies are just as important but far less available.
To achieve this, we plan to collect speech data in Indian languages and develop data-efficient models suitable for low-resource scenarios. Our approach involves data augmentation and efficient fine-tuning of pre-trained models.
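For illustration, efficient fine-tuning of a pretrained model in a low-resource setting is often done with parameter-efficient methods such as LoRA adapters. A hedged sketch using the Hugging Face transformers and peft libraries is below; the checkpoint name, vocabulary size, and LoRA settings are illustrative assumptions, not details of the team’s models.

```python
# Sketch of parameter-efficient fine-tuning of a multilingual speech model
# on low-resource Indian-language data. All hyperparameters are illustrative.
from transformers import Wav2Vec2ForCTC
from peft import LoraConfig, get_peft_model

# Start from a multilingual pretrained encoder; the CTC head is sized for a
# hypothetical Hindi character vocabulary built from collected transcripts.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=64,
    ctc_loss_reduction="mean",
)

# Train only small low-rank adapters inserted into the attention projections,
# keeping the trainable parameter count tiny relative to the full model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Because only the adapters are updated, a few hours of transcribed speech can be enough to adapt the model without overfitting, which is the point of data-efficient approaches in low-resource scenarios.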
We have already conducted preliminary experiments in Hindi with promising results, and with the support of the Anusandhan National Research Foundation, we are looking to further improve and extend this work to other Indian languages.
You believe that “accessibility and linguistic diversity” are crucial for AI research in India. Could you elaborate on that?
Accessibility and linguistic diversity are essential factors for AI research in India. After spending several years in Europe, I noticed that accessibility is much more systematically integrated into public infrastructure and digital services there.
In contrast, India still has significant gaps, even in public spaces such as railway stations, where basic accessibility provisions are often limited. This highlights the wider need to design technologies that consciously include people with disabilities.
At the same time, India’s linguistic diversity represents another important dimension. In many parts of the country, especially in rural areas, speech remains the most natural and primary mode of interaction. Text-heavy or typing-based interfaces may not always be practical or inclusive in such contexts. Therefore, AI systems designed for India must favor speech-based interaction and support multiple regional languages.
Together, meaningful accessibility and strong support for linguistic diversity are essential if digital technologies are to be truly inclusive and widely applicable across the country.
WHO has declared that “the future of healthcare is digital”…
The World Health Organization has emphasized that the future of healthcare will be increasingly digital. In a country like India, telemedicine can play a transformative role, especially when supported by a basic diagnostic infrastructure at the local level that enables more accurate remote consultations.
Another important direction is artificial intelligence-assisted diagnostics, where machine learning systems analyze medical images, speech or health records to support early disease detection and prediction.
Practical solutions are already emerging. For example, ‘Shishu Maapan’, developed by Wadhwani AI, helps measure the weight and height of a newborn baby simply from mobile phone photos, and has been adopted by frontline health workers such as ASHAs.
Digital tools also enable assistive health technologies, including speech restoration systems for individuals who have lost the ability to speak, and wearable devices that continuously monitor health parameters and alert physicians to potential anomalies. These developments show how digital innovation can make healthcare more affordable and scalable.
A common criticism of AI-generated speech is that, while it is intelligible, it often fails to capture the unique cadence of the speaker. When restoring voice to someone with dysarthria, how do you balance the need for clear communication with the need to preserve the user’s individual vocal identity?
This is an important concern. If recordings of the speaker’s original voice from before the onset of dysarthria are available, modern voice cloning techniques can recreate that voice from as little as 10 seconds of speech. Preserving an individual’s vocal identity is therefore technically feasible today, and there is extensive research demonstrating this ability. However, our current application focuses primarily on restoring the intelligibility of content and ensuring that what the user intends to say is communicated clearly. For now, the generated speech uses a common voice, not a personalized one.
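For illustration, open-source zero-shot voice cloning along these lines is available today. A minimal sketch with the Coqui TTS library, one such option rather than the system described here; the reference file name is hypothetical:

```python
# Zero-shot voice cloning with the open-source Coqui TTS (XTTS v2) model.
# "my_voice_10s.wav" is a hypothetical short recording of the target speaker.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello, this should sound like the reference speaker.",
    speaker_wav="my_voice_10s.wav",  # ~10 s of the speaker's original voice
    language="en",
    file_path="cloned_output.wav",
)
```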
More broadly, text-to-speech systems are becoming so natural that they are now being integrated into conversational bots, which are replacing many traditional customer service applications. Emotional nuance remains more challenging, as we discussed in our earlier work on generating empathetic speech, but progress is rapid.
How does the model distinguish between distorted speech and background noise as the user walks down, say, a busy Indian street?
This is a significant challenge in India, where real-world environments can be extremely chaotic. Anyone who has thought about deploying self-driving cars here will quickly realize how unpredictable our roads can be: traffic patterns, honking, pedestrians and vehicles all interact dynamically. Speech technology faces a similar level of complexity.
In our experiments, we improve robustness using noise augmentation, where we simulate different noisy environments during training so that the model learns to handle background sounds. Ultimately, the most effective solution is to collect more real data from noisy settings and train on it. Even then, some performance degradation is inevitable, as separating distorted speech from strong background noise is a fundamentally difficult problem.
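For illustration, noise augmentation of this kind can be as simple as mixing recorded background noise into clean training clips at a controlled signal-to-noise ratio (SNR). A minimal sketch in Python, illustrative rather than the team’s training code:

```python
# Mix background noise into a speech clip at a target SNR (in dB),
# a common form of noise augmentation for training robust speech models.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# During training, each clean clip might be mixed with a randomly chosen
# street-noise recording at a random SNR, e.g. between 0 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16_000)  # stand-in for a 1 s clip at 16 kHz
noise = rng.standard_normal(8_000)    # stand-in for a noise recording
augmented = mix_at_snr(speech, noise, snr_db=rng.uniform(0, 20))
```

Sampling the SNR randomly per clip exposes the model to everything from mildly to heavily degraded audio, which is what lets it generalize to chaotic real-world settings.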
divya.gandhi@thehindu.co.in