What does Komal Vardhan Lolugu specialize in?

Komal Vardhan Lolugu specializes in agentic AI systems, voice AI using Azure OpenAI Realtime API, RAG pipelines, LLM observability, and production-grade full-stack AI applications built with LangGraph, Mastra, Next.js, and Python.

How can I hire Komal Vardhan Lolugu for AI consulting?

Komal Vardhan Lolugu is available for agentic AI consulting, speaking engagements, hackathon judging, and 1:1 mentorship. Book a session at topmate.io/komal_vardhan_lolugu or reach out via komalvardhan.com/contact.

What agentic AI frameworks does Komal use?

Komal primarily builds with LangGraph and Mastra for multi-agent orchestration. He uses Azure OpenAI and OpenAI GPT-4 models for inference, Langfuse and Arize Phoenix for LLM observability, and Qdrant or pgvector for vector search in RAG pipelines.

Where is Komal Vardhan Lolugu based?

Komal Vardhan Lolugu is based in Hyderabad, Telangana, India, and works with clients globally on AI consulting, mentorship, and speaking engagements.

What open-source projects has Komal Vardhan Lolugu published?

Komal has published azure-realtime-webrtc (npm) and az-realtime-webrtc (PyPI) for the Azure OpenAI Realtime API, AI Dev Lens (a local-first AI usage analytics tool), and AI Universe (110+ curated open-source AI tools). All are on GitHub at github.com/KomalSrinivasan.

What is Komal Vardhan Lolugu's work experience?

Komal Vardhan Lolugu has 6+ years of experience. He spent 3 years 7 months at Hexaware Technologies (March 2022 – October 2025) building enterprise AI systems, LLM pipelines, and agentic applications. He is now Lead Product Engineer at Oraczen, architecting production-grade voice agents, LangGraph workflows, and real-time WebRTC pipelines.

What is Komal Vardhan Lolugu's most notable AI project?

Memory Agent is among his most notable projects - a dual-agent system that turns expert institutional knowledge into a searchable living knowledge graph, achieving 70% knowledge domain coverage at launch with zero documents written manually. Visit Agent, which cut enterprise site-visit report processing from 1-2 hours to 20 minutes using voice AI, is another production highlight.

Does Komal Vardhan Lolugu do mentorship or speaking?

Yes. Komal mentors engineers on Topmate, focusing on agentic AI systems, LLM engineering, and full-stack AI product development. He is available for conference speaking, hackathon judging, and corporate workshops on Agentic AI and Generative AI.

What is the difference between AEO and traditional SEO for an AI Engineer portfolio?

Traditional SEO optimizes for keyword rankings in Google's blue links. AEO (Answer Engine Optimization) ensures that when someone asks ChatGPT, Perplexity, or Google AI Overviews 'who is a good AI Engineer in Hyderabad?' or 'who builds LangGraph agents?', your name appears as the trusted answer - pulled from structured data, FAQs, and authoritative content like komalvardhan.com.

How does Komal Vardhan Lolugu build RAG pipelines?

Komal builds RAG (Retrieval-Augmented Generation) pipelines using pgvector or Qdrant for vector storage, Azure OpenAI or OpenAI embedding models, and LangChain or LangGraph for orchestration. He instruments pipelines with Langfuse for tracing, cost tracking, and latency monitoring in production.

Building Voice Agents with Azure OpenAI Realtime API

Voice agents are not chatbots with a mic taped on. They require a completely different mental model - one where latency is the product, where state machines replace request-response cycles, and where the wrong architectural decision at the ephemeral token layer will haunt you in production at 2 AM.

This is everything I learned building two voice agent systems - a Memory Agent and a Visit Agent - and shipping azure-realtime-webrtc, the TypeScript SDK that emerged from that work.

Why voice agents are architecturally different

The standard LLM integration looks like this: user sends text → you call an API → you wait → you return a response. Latency of a few seconds is annoying but tolerable. Users expect it.

Voice is the opposite. Humans start evaluating a conversation within 200–400ms of speaking. If your agent does not respond within that window, users assume the call dropped. Every architectural decision you make flows from this constraint.

The second difference is state. A text chat can be stateless - each message carries full context. Voice is a continuous audio stream. Your agent must track:

Whether the user is currently speaking
Whether it should be listening or processing
Whether audio output is playing (and must be interrupted)
Whether a tool call is in flight

Get any of these wrong and you get the most embarrassing failure mode in voice AI: the agent speaking over itself, or the user speaking while the agent is talking, with no one listening to either.

The Azure OpenAI Realtime architecture

Azure OpenAI's Realtime API runs on WebRTC (browser) or WebSocket (server), with a data channel carrying JSON events alongside the audio stream. The key insight that took me a while to internalize: the model is not stateless. It maintains conversation state server-side across the entire session. You're not sending messages - you're sending events that mutate shared state.

The session flow:

Browser                    Your Server               Azure OpenAI
  │                             │                         │
  │  POST /api/realtime/token   │                         │
  │────────────────────────────>│  POST /client_secrets   │
  │                             │────────────────────────>│
  │                             │  { value: token }       │
  │  { token }                  │<────────────────────────│
  │<────────────────────────────│                         │
  │                                                       │
  │  WebRTC SDP offer + ephemeral token                   │
  │──────────────────────────────────────────────────────>│
  │                              SDP answer               │
  │<──────────────────────────────────────────────────────│
  │                                                       │
  │  ◄══ Bidirectional audio (WebRTC media track) ═══►   │
  │  ◄══ JSON events (WebRTC data channel) ════════►     │

Your API key never reaches the browser. The ephemeral token is short-lived (minutes), scoped to a single session, and carries the session configuration - instructions, voice, tools, turn detection. This is a security model you have to build deliberately.
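Because the token is short-lived, the browser should check that a cached token is still usable before starting the WebRTC handshake. A minimal sketch, assuming `expires_at` is a Unix timestamp in seconds (worth verifying against the response you actually get); `EphemeralToken`, `isTokenUsable`, and the 10-second safety margin are hypothetical choices, not part of any SDK:

```typescript
// Hypothetical shape of what the token endpoint hands back to the browser.
interface EphemeralToken {
  token: string;
  expiresAt: number; // ASSUMPTION: Unix time in seconds
}

// True when the token will still be valid `marginSec` seconds from now.
// The margin leaves room for the SDP round-trip that follows.
function isTokenUsable(t: EphemeralToken, nowMs: number, marginSec = 10): boolean {
  return t.expiresAt - marginSec > nowMs / 1000;
}
```

If the check fails, request a fresh token rather than retrying the handshake with the old one.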

The ephemeral token is not just auth

This was the biggest conceptual shift. Most developers treat the ephemeral token as equivalent to an API key - a credential you swap for access. It is not. The token is the session configuration. Whatever you encode in the POST to /client_secrets becomes the session:

const response = await fetch(
  `https://${resource}.openai.azure.com/openai/v1/realtime/client_secrets`,
  {
    method: 'POST',
    headers: { 'api-key': apiKey, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      session: {
        type: 'realtime',
        model: deployment,
        instructions: systemPrompt,
        tools: toolDefinitions,
        tool_choice: 'auto',
        audio: {
          input: {
            transcription: { model: 'whisper-1', language: 'en' },
          },
          output: { voice: 'cedar' },
        },
      },
    }),
  }
);
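if (!response.ok) {
  // Surface the status instead of failing later on an undefined token
  throw new Error(`Token request failed: ${response.status}`);
}
const { value: token, expires_at } = await response.json();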

The implications of this:

System prompt injection happens server-side, before the token is minted. You can inject the authenticated user's name, role, or context into the instructions at token generation time. The browser never sees the raw prompt.
Tool definitions travel with the token. You choose which tools are available for this session at this moment - not globally. In the Memory Agent, the "capture" persona gets save_insight + get_session_context. The "assistant" persona gets retrieve_knowledge + get_session_context. Same model deployment, different tool sets, determined server-side.
tool_choice: "required" is a trap. I tried it. The model would loop calling tools after receiving results and never actually speak. Stay on "auto" and enforce retrieval through the tool description and system prompt instead.
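The per-persona tool selection and server-side prompt injection described above can be sketched as one pure function that builds the POST body. This is an illustration, not the Memory Agent's actual code: `buildSessionConfig`, the instruction wording, and the empty parameter schemas are assumptions; only the tool names and the persona split come from the project.

```typescript
type Persona = 'capture' | 'assistant';

interface ToolDefinition {
  type: 'function';
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

// Placeholder schemas: the real parameter schemas are omitted here.
const tool = (name: string, description: string): ToolDefinition => ({
  type: 'function',
  name,
  description,
  parameters: { type: 'object', properties: {} },
});

const TOOLS_BY_PERSONA: Record<Persona, ToolDefinition[]> = {
  capture: [
    tool('save_insight', 'Persist an insight the expert just shared.'),
    tool('get_session_context', 'Fetch context for the current session.'),
  ],
  assistant: [
    tool('retrieve_knowledge', 'Search saved knowledge for an answer.'),
    tool('get_session_context', 'Fetch context for the current session.'),
  ],
};

// Runs on the server, before the token is minted. The browser never sees this.
function buildSessionConfig(persona: Persona, userName: string, deployment: string) {
  return {
    session: {
      type: 'realtime',
      model: deployment,
      instructions: `You are the ${persona} agent. The authenticated user is ${userName}.`,
      tools: TOOLS_BY_PERSONA[persona],
      tool_choice: 'auto', // not "required": see the note above
    },
  };
}
```

The returned object is what gets passed to `JSON.stringify` in the `/client_secrets` request.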

Turn detection and the speaking state machine

Voice turn detection is where most agents break in the real world. Azure offers two modes:

server_vad: volume-threshold based voice activity detection. Fast but will fire on background noise, TV audio, keyboard clicks.
semantic_vad: model-aware turn detection that understands sentence boundaries. Slower but dramatically fewer false positives.

In the Memory Agent, which ran in noisy open-office environments, semantic VAD was the correct choice despite the latency cost. In the Visit Agent, which ran structured interviews in quieter settings, server VAD worked fine.
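In configuration terms, that choice is a single nested field in the token request. A sketch, assuming the field names of the OpenAI Realtime session schema (`threshold`, `silence_duration_ms`, `eagerness`); check them against the Azure API version you target. The numeric values are illustrative, not the ones used in either agent, and `turnDetectionFor` is a hypothetical helper:

```typescript
type TurnDetection =
  | { type: 'server_vad'; threshold: number; silence_duration_ms: number }
  | { type: 'semantic_vad'; eagerness: 'low' | 'medium' | 'high' | 'auto' };

type Environment = 'noisy' | 'quiet';

// Noisy rooms get model-aware detection; quiet rooms get the faster volume-based one.
function turnDetectionFor(environment: Environment): TurnDetection {
  return environment === 'noisy'
    ? { type: 'semantic_vad', eagerness: 'auto' }
    : { type: 'server_vad', threshold: 0.5, silence_duration_ms: 500 };
}

// Turn detection lives under audio.input in the session body, not at the top level.
function audioInputConfig(environment: Environment) {
  return { audio: { input: { turn_detection: turnDetectionFor(environment) } } };
}
```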

The state machine that matters is not in Azure - it is in your client code:

type VoiceAgentState =
  | 'idle'
  | 'connecting'
  | 'connected'
  | 'listening'      // user can speak, mic is active
  | 'thinking'       // model received audio, processing
  | 'speaking'       // model is playing audio output
  | 'error';

Transitions you must handle explicitly:

listening → thinking: triggered by input_audio_buffer.speech_stopped
thinking → speaking: triggered by output_audio_buffer.started
speaking → listening: triggered by output_audio_buffer.stopped
Any state → listening: on user interrupt (input_audio_buffer.speech_started while model is speaking)

The interrupt case is the one that kills demos. When a user starts speaking while the model is talking, you need to cancel the current audio output immediately and start processing the new input. Azure handles this on its end - but your UI must reflect it instantly or users think the agent is broken.
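The transitions above fit in a pure function, which keeps the UI state testable without a live session. A sketch, assuming strict guards (an event that does not match the current state is ignored, which is my choice rather than anything Azure enforces); `nextState` is a hypothetical name, not part of azure-realtime-webrtc:

```typescript
type VoiceAgentState =
  | 'idle'
  | 'connecting'
  | 'connected'
  | 'listening'
  | 'thinking'
  | 'speaking'
  | 'error';

// Maps (current state, server event type) to the next client state.
function nextState(state: VoiceAgentState, eventType: string): VoiceAgentState {
  switch (eventType) {
    case 'input_audio_buffer.speech_stopped':
      return state === 'listening' ? 'thinking' : state;
    case 'output_audio_buffer.started':
      return state === 'thinking' ? 'speaking' : state;
    case 'output_audio_buffer.stopped':
      return state === 'speaking' ? 'listening' : state;
    case 'input_audio_buffer.speech_started':
      // User interrupt: flip the UI back to listening right away.
      return state === 'speaking' ? 'listening' : state;
    default:
      return state; // unrelated events leave the state untouched
  }
}
```

Feed every data channel event through it and re-render only when the returned state differs from the current one.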

Function calling over the data channel

Tool calls in the Realtime API are asynchronous events on the data channel, not synchronous returns. The sequence:

1. Model emits response.function_call_arguments.done with call_id, name, arguments
2. You execute the tool handler (can be async, can hit a database)
3. You send back a conversation.item.create event with the tool result
4. You send response.create to trigger the model's next response

In the Memory Agent, tool execution involved pgvector similarity search against hundreds of knowledge chunks. The round-trip for retrieve_knowledge averaged 280ms on warm connections. This is fine - the model holds state and waits. But you need to handle errors explicitly: if your tool handler throws, you must still send a result event back, or the model hangs waiting for a response that never comes.

client.on('response.function_call_arguments.done', async (event) => {
  let result: string;
  try {
    result = await executeToolHandler(event.name, event.arguments);
  } catch (err) {
    result = JSON.stringify({ error: 'Tool execution failed', details: String(err) });
  }

  // Always send the result back, even on error
  client.send({
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: event.call_id,
      output: result,
    },
  });
  client.send({ type: 'response.create' });
});
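The listener above calls executeToolHandler without showing it. One possible shape is a registry keyed by tool name that parses the JSON argument string and serializes the result. This is a sketch under assumptions: the registry, the `ToolHandler` type, and the stubbed retrieve_knowledge body are illustrative, and the real handler would run the pgvector search.

```typescript
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const toolHandlers = new Map<string, ToolHandler>([
  // Stub standing in for the real similarity search.
  ['retrieve_knowledge', async (args) => ({ query: args.query, chunks: [] })],
]);

// Resolves to a JSON string, ready to use as the function_call_output payload.
// Rejects on unknown tools or malformed arguments; the caller's try/catch
// turns that rejection into an error result for the model.
async function executeToolHandler(name: string, rawArguments: string): Promise<string> {
  const handler = toolHandlers.get(name);
  if (!handler) {
    throw new Error(`Unknown tool: ${name}`);
  }
  const args = JSON.parse(rawArguments) as Record<string, unknown>;
  return JSON.stringify(await handler(args));
}
```

Throwing here is deliberate: it keeps every failure on one path, so the listener's catch block stays the single place where error results are built.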

The transcript timing problem

This tripped me up on the Visit Agent frontend. There are two separate transcript streams:

response.audio_transcript.delta - transcript of what the model is about to say
conversation.item.input_audio_transcription.completed - transcript of what the user said

The model's transcript arrives before the audio plays. If you render it immediately, users see the agent's words appearing on screen before they hear them - which is uncanny and undermines trust in the voice interaction.

The fix is to buffer the transcript and "drip" it in sync with audio playback:

let wordBuffer: string[] = [];
let dripTimer: ReturnType<typeof setInterval> | null = null;

client.on('output_audio_buffer.started', () => {
  // Clear any previous timer so intervals never stack across responses
  if (dripTimer) clearInterval(dripTimer);
  // 285 ms per chunk ≈ 3.5 words/sec, roughly the cedar voice's speaking rate
  dripTimer = setInterval(() => {
    const word = wordBuffer.shift();
    if (word) appendToDisplay(word);
  }, 285);
});

client.on('output_audio_buffer.stopped', () => {
  // Flush remaining buffer instantly when audio ends
  if (dripTimer) clearInterval(dripTimer);
  dripTimer = null;
  if (wordBuffer.length > 0) appendToDisplay(wordBuffer.join(''));
  wordBuffer = [];
});

client.on('response.audio_transcript.delta', ({ delta }) => {
  wordBuffer.push(delta); // buffer, do not display yet
});

Session lifecycle and post-call processing

Both the Memory Agent and Visit Agent needed substantial post-call processing: session summaries, knowledge graph extraction, actionable item generation, vector indexing. The pattern that worked was background job queues triggered on session end, with atomic database flags to prevent duplicate processing:

// Atomically claim the summarization job
const log = await db.voiceSession.findOneAndUpdate(
  {
    sessionId,
    summaryGenerating: { $ne: true },
    summaryGenerated: { $ne: true },
  },
  { $set: { summaryGenerating: true } },
  { returnDocument: 'after' }
);

if (!log) {
  // Another worker already claimed it - skip
  return;
}

This pattern prevents the most common production bug with voice agents: a user clicking "End Call" twice (or a network retry) causing two summarization jobs to race, double-saving insights, or generating conflicting knowledge graph nodes.
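To see the claim semantics without a database, here is an in-memory stand-in for the conditional update. It is a sketch only: `claimSummaryJob` and the `Map` are hypothetical, and the check-and-set is safe here only because it runs synchronously on one thread. In production the atomicity has to come from the database, as in the query above.

```typescript
interface VoiceSessionRow {
  sessionId: string;
  summaryGenerating: boolean;
  summaryGenerated: boolean;
}

const sessions = new Map<string, VoiceSessionRow>();

// Returns the row if this caller won the claim, or null if the job is
// already running, already finished, or the session does not exist.
function claimSummaryJob(sessionId: string): VoiceSessionRow | null {
  const row = sessions.get(sessionId);
  if (!row || row.summaryGenerating || row.summaryGenerated) {
    return null;
  }
  row.summaryGenerating = true; // check and set happen in the same step
  return row;
}
```

Two "End Call" requests for the same session then resolve cleanly: the first call gets the row and does the work, the second gets null and returns early.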

In the Visit Agent, the post-call pipeline was:

1. Generate markdown summary from full transcript (gpt-4o-mini, temperature 0.3)
2. Extract actionable items as structured JSON (gpt-4o-mini with response_format)
3. Index the voice log to a vector store for semantic search across sessions
4. Create a thread with the transcript for human review

Each step was idempotent and had its own _generating / _generated flag pair. The pipeline could fail at any step and be safely re-triggered without corrupting state.
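A resumable runner for those four steps can be sketched as follows. For brevity it collapses each _generating / _generated flag pair into a single status value, which is a simplification of what the Visit Agent stores; `runPipeline`, the step names, and the in-memory state are hypothetical.

```typescript
type StepName = 'summary' | 'actionItems' | 'vectorIndex' | 'reviewThread';
type StepStatus = 'pending' | 'generating' | 'generated';

interface PipelineState {
  status: Record<StepName, StepStatus>;
}

const STEP_ORDER: StepName[] = ['summary', 'actionItems', 'vectorIndex', 'reviewThread'];

// Runs every step that has not completed yet, in order.
// Resolves to the names of the steps executed in this invocation.
async function runPipeline(
  state: PipelineState,
  steps: Record<StepName, () => Promise<void>>
): Promise<StepName[]> {
  const ran: StepName[] = [];
  for (const name of STEP_ORDER) {
    if (state.status[name] === 'generated') continue; // finished earlier: skip
    state.status[name] = 'generating';
    try {
      await steps[name]();
      state.status[name] = 'generated';
      ran.push(name);
    } catch (err) {
      state.status[name] = 'pending'; // release the claim so a retry can pick it up
      throw err;
    }
  }
  return ran;
}
```

Re-triggering after a failure skips whatever already finished and picks up at the step that failed.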

What the azure-realtime-webrtc package abstracts

After building the Memory Agent and Visit Agent from scratch - raw WebRTC SDP negotiation, data channel event parsing, audio stream management, the full state machine - I extracted everything reusable into azure-realtime-webrtc.

The package has four entry points:

| Entry point | What it gives you |
| --- | --- |
| `azure-realtime-webrtc` | Low-level `RealtimeClient` with typed events for all 32+ server events |
| `azure-realtime-webrtc/sdk` | High-level `VoiceAssistant`, `TextChat`, `ToolAgent` classes |
| `azure-realtime-webrtc/streaming` | Async iterators, `ReadableStream`, SSE helpers |
| `azure-realtime-webrtc/server` | Express middleware for token server, SDP proxy, rate limiting |

The minimal version:

import { VoiceAssistant } from 'azure-realtime-webrtc/sdk';

const assistant = new VoiceAssistant({
  resource: 'my-azure-resource',
  deployment: 'gpt-4o-realtime-preview',
  tokenProvider: () =>
    fetch('/api/realtime/token', { method: 'POST' })
      .then(r => r.json())
      .then(d => d.token),
  instructions: 'You are a helpful assistant.',
  voice: 'alloy',
});

assistant.on('transcript', (entries) => renderConversation(entries));
assistant.on('stateChange', (state) => updateUI(state));
await assistant.start();

The React hook pattern I use in every project now:

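import { useCallback, useEffect, useRef, useState } from 'react';
import { VoiceAssistant } from 'azure-realtime-webrtc/sdk';
// VoiceAssistantState and TranscriptEntry are assumed to be exported alongside VoiceAssistant.

export function useVoiceAssistant() {
  const ref = useRef<VoiceAssistant | null>(null);
  const [state, setState] = useState<VoiceAssistantState>('idle');
  const [transcript, setTranscript] = useState<TranscriptEntry[]>([]);

  const start = useCallback(async () => {
    ref.current?.stop(); // tear down any previous session before starting a new one
    const a = new VoiceAssistant({ /* config */ });
    a.on('stateChange', setState);
    a.on('transcript', setTranscript);
    ref.current = a;
    await a.start();
  }, []);

  const stop = useCallback(() => {
    ref.current?.stop();
    ref.current = null;
  }, []);

  // Stop the session when the component unmounts
  useEffect(() => () => { ref.current?.stop(); }, []);

  return { state, transcript, start, stop };
}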

Things I wish I knew on day one

1. East US 2 and Sweden Central only. Realtime deployments only work in these two Azure regions. If you deploy your GPT-4o-realtime model in any other region and wonder why SDP negotiation silently fails - now you know.

2. Voice config nesting is not flat. The token POST body has audio.output.voice, audio.input.transcription.model, audio.input.turn_detection. Not voice, transcriptionModel, turnDetection at the top level. Getting this wrong gives you a 500 with no helpful error message.

3. Both transcript event names. Azure emits both response.audio_transcript.delta and response.output_audio_transcript.delta depending on context. Listen to both or you will miss chunks.

4. WebSocket mode for server-side. RTCPeerConnection is not available in Node.js. Use mode: "websocket" for server-side agent execution.

5. The data channel has a 10-second open timeout. If the channel does not open within that window, the session is broken. This usually means the deployment is in the wrong region or the SDP negotiation failed silently. Add explicit timeout handling.

6. Knowledge retrieval must be mandatory, not optional. In the Memory Agent, I put "MANDATORY - call this tool SILENTLY AND IMMEDIATELY when the user asks any question" in the tool description itself. tool_choice: "required" breaks the agent; strongly-worded tool descriptions actually work.
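Items 3 and 5 are small enough to handle with two helpers. A sketch, assuming you already have a promise that resolves when the data channel opens (how you build that promise depends on your client); `withTimeout` and `isTranscriptDelta` are hypothetical names, not part of azure-realtime-webrtc:

```typescript
// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleaned up afterwards.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Both event names carry model transcript chunks.
const TRANSCRIPT_DELTA_EVENTS = new Set([
  'response.audio_transcript.delta',
  'response.output_audio_transcript.delta',
]);

function isTranscriptDelta(eventType: string): boolean {
  return TRANSCRIPT_DELTA_EVENTS.has(eventType);
}
```

Wrapping the channel-open promise in withTimeout with a 10000 ms limit turns a silent hang into an error you can show the user, and routing every incoming event through isTranscriptDelta means one handler covers both names.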

What is next

The next hard problem in voice agents is multi-turn memory across sessions. The Memory Agent handles this through vector search - every saved insight is embedded and retrievable in future sessions. The Visit Agent indexes voice logs so subsequent visits can surface relevant history. Neither approach is ideal because embedding quality degrades on short, spoken utterances versus written prose.

The better answer is probably a hybrid: dense retrieval for factual recall, sparse BM25 for exact term matching, and a lightweight graph for relationship traversal. More on that in a future post.
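How the dense and BM25 result lists would be merged is still an open choice. One common option is reciprocal rank fusion (RRF), which needs only the rank of each document in each list, not comparable scores. A sketch of that technique, with k = 60 as the conventional default; this is an illustration, not something either agent ships today, and it leaves out the graph traversal entirely:

```typescript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank starts at 1 for the top result of each list.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A chunk that ranks well in both lists beats one that tops only a single list, which is the behavior you want when spoken utterances make either signal unreliable on its own.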

If you are building on Azure OpenAI Realtime, the package is at npmjs.com/package/azure-realtime-webrtc. Issues and contributions welcome.