Beyond the Hype: Real-World Applications of Generative AI in Healthcare
By Ganesh Raut, Lead Machine Learning Engineer, Mount Sinai Health System
From Predictive to Generative: A Strategic Engineering Shift
For much of the last decade, AI in healthcare has revolved around building predictive models: tools that helped anticipate hospital readmissions, flag abnormal lab results, or support early diagnosis. As a Lead Machine Learning Engineer, I was part of this wave. These models were essential, but inherently reactive: they worked within the boundaries of existing data to detect what was already happening.
Today, we stand at a new frontier: Generative AI. Unlike predictive models, which identify patterns in historical data, generative models can draft clinical summaries, create synthetic patient records, extract structured data from unstructured notes, and even simulate scenarios for testing workflows, in addition to handling classification, regression, and other NLP tasks. But the true challenge isn’t the capability; it’s operationalizing it in complex, regulated healthcare environments while adhering to compliance requirements.
At the heart of this transformation lies a core engineering mission: to move from prototypes and research papers to scalable, production-grade solutions that clinicians can trust and adopt.
AI in healthcare is no longer mere hype. With advanced LLMs and the rise of Agentic AI, we must responsibly build trust in these applications, designing systems clinicians can truly rely on and unlocking new possibilities for delivering better, safer care for everyone.
Designing AI Systems with Purpose: Picking Models for Real-World Constraints
Choosing the right Generative AI model requires more than comparing accuracy scores or leaderboard rankings. With new architectures emerging frequently, organizations are overwhelmed with options: some offer marginal performance gains, others prioritize smaller footprints and faster inference times, and still others carry billions of parameters. Real-world deployment demands that technical sophistication be balanced against operational feasibility.
Instead of chasing the latest breakthrough, model selection should be guided by practical constraints (a minimal screening sketch follows this list):
- Compatibility with domain-specific context
- Scalability for high-demand environments
- Response time under load
- Compute and memory efficiency
- Ease of ongoing fine-tuning and continuous iteration
- Seamless integration with monitoring and infrastructure tools, all within a defined budget
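To make this concrete, here is a minimal sketch of how such constraints can be encoded as a hard screening step that runs before any accuracy comparison. All field names and threshold values are illustrative assumptions, not figures from our production systems.

```python
from dataclasses import dataclass

@dataclass
class DeploymentConstraints:
    """Operational limits a candidate model must satisfy (illustrative values)."""
    max_p95_latency_ms: float   # acceptable response time under load
    max_gpu_memory_gb: float    # compute and memory budget per replica
    require_fine_tuning: bool   # must support ongoing fine-tuning
    monthly_budget_usd: float

@dataclass
class CandidateModel:
    name: str
    p95_latency_ms: float
    gpu_memory_gb: float
    supports_fine_tuning: bool
    est_monthly_cost_usd: float

def is_deployable(model: CandidateModel, limits: DeploymentConstraints) -> bool:
    """Screen out models that fail hard operational constraints
    before any leaderboard comparison takes place."""
    return (
        model.p95_latency_ms <= limits.max_p95_latency_ms
        and model.gpu_memory_gb <= limits.max_gpu_memory_gb
        and (model.supports_fine_tuning or not limits.require_fine_tuning)
        and model.est_monthly_cost_usd <= limits.monthly_budget_usd
    )

limits = DeploymentConstraints(800, 24, True, 5000)
candidates = [
    CandidateModel("model-a", 450, 16, True, 3200),
    CandidateModel("model-b", 1200, 80, True, 9000),
]
print([m.name for m in candidates if is_deployable(m, limits)])  # ['model-a']
```

The point of the pattern is the ordering: operational feasibility filters first, accuracy comparisons second.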
Open-source large language models (LLMs) like Llama, Mistral, and Phi present promising alternatives to commercial APIs. They offer greater transparency, cost efficiency at scale, guardrails, and complete control over customization. However, deploying these models requires thoughtful engineering. Key considerations include:
- Ensuring tokenizer and vocabulary alignment with specialized language (e.g., medical terminology)
- Applying model compression techniques (e.g., quantization, distillation) to reduce memory and compute needs, or using retrieval techniques like RAG and GraphRAG (a quantization sketch follows this list)
- Managing model versions and checkpoints across development and production environments
- Implementing strong guardrails to monitor, validate, and constrain model outputs for safety and compliance
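As one illustration of the compression point above, the snippet below loads an open-source model with 4-bit quantization via Hugging Face Transformers and its bitsandbytes integration. The model ID is just an example, and this sketch assumes a CUDA GPU with the relevant packages installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example model ID; substitute whichever open-source LLM your license
# and infrastructure support.
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit quantization roughly quarters the weight memory footprint,
# trading a small amount of quality for much cheaper inference.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs automatically
)

prompt = "Summarize: patient presents with chest pain and shortness of breath."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```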
To support real-time AI services in production, it’s crucial to build tailored inference pipelines. These pipelines should emphasize:
- Fast and consistent performance
- Dedicated endpoints for each service
- Hardware-aware scheduling, particularly in containerized or cloud-native ecosystems
- Load-balanced endpoints for better inference handling
Using modern tooling such as Hugging Face Transformers or vLLM, teams can create robust pipelines that power diverse applications, from summarizing clinical notes to triaging patient data in hospital systems.
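A minimal vLLM sketch of such a pipeline might look like the following; the model ID and prompts are placeholders, and a production service would sit behind a dedicated, load-balanced endpoint as described above.

```python
from vllm import LLM, SamplingParams

# One dedicated engine per service; vLLM batches concurrent requests
# internally (continuous batching), which keeps latency consistent under load.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")  # example model ID

params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [
    "Summarize the following clinical note: ...",
    "Extract the medication list from this discharge summary: ...",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```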
Ultimately, the “best” model is rarely the most powerful on paper; it’s the one that balances performance with maintainability, safety, and deployment-readiness. Models that can’t operate under real-world pressure won’t deliver value, no matter how impressive they look in the research lab. And Generative AI applications go far beyond mere chat functionality, powering both batch and real-time inference across a wide range of other use cases.
Architectural decisions should therefore always be made with deployment in mind. Sometimes the technically superior model is not the one we choose; instead, we favor maintainability, ease of fine-tuning, and operational resilience, the factors that determine whether a model will survive in production. Just as importantly, a model chosen this way is safer and less prone to hallucination, which is essential in healthcare settings.
Building Infrastructure for Scale: More Than Just a Model
Behind every generative output is a well-orchestrated system of tools, pipelines, and safeguards. Unlike traditional models that sit behind static APIs, generative applications demand dynamic inputs, contextual memory, and prompt sensitivity. This adds layers of complexity to infrastructure.
In our stack, we’ve modularized core services:
- Prompt templating engines to handle clinical context (a minimal sketch follows this list)
- Observability and versioning systems to track every output
- Validator services to apply post-generation rules
- Async job orchestration for long-running summarization or synthesis tasks; this is where Agentic AI workflows play a major role today, building intelligent automations that can reason and take action
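As a minimal sketch of the prompt-templating idea, the snippet below keys templates by task and version so every output is traceable to the exact prompt that produced it. The template text, task names, and field names are hypothetical.

```python
from string import Template

# Versioned templates: (task, version) -> template. Changing a prompt means
# adding a new version, never silently editing an old one.
PROMPT_TEMPLATES = {
    ("discharge_summary", "v2"): Template(
        "You are assisting a clinician. Using only the note below, "
        "write a concise discharge summary.\n"
        "Patient context: $context\n"
        "Note:\n$note"
    ),
}

def render_prompt(task: str, version: str, **fields: str) -> str:
    """Look up a template by (task, version) so generated outputs can be
    audited against the exact prompt that produced them."""
    template = PROMPT_TEMPLATES[(task, version)]
    return template.substitute(**fields)

prompt = render_prompt(
    "discharge_summary", "v2",
    context="68-year-old admitted for community-acquired pneumonia",
    note="Day 3: afebrile, oxygen weaned to room air...",
)
print(prompt)
```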
At scale, even minor inefficiencies get amplified. Token counts, GPU memory use, and thread-pool exhaustion can all introduce bottlenecks. We’ve had to optimize our inference servers for multi-user concurrency, load spikes, and real-time feedback collection, all while ensuring that PHI (Protected Health Information) never leaks into logs or outputs.
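One concrete safeguard on the logging side is a redaction filter that scrubs PHI-like patterns before records are written. The sketch below is illustrative only: the two regexes stand in for what, in production, should be a vetted de-identification service.

```python
import logging
import re

# Illustrative patterns only; real PHI detection needs far broader coverage.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    (re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE), "[REDACTED-MRN]"),
]

class PHIRedactionFilter(logging.Filter):
    """Scrub PHI-like patterns from log records before they are emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for pattern, replacement in PHI_PATTERNS:
            message = pattern.sub(replacement, message)
        record.msg, record.args = message, None
        return True

logger = logging.getLogger("inference")
handler = logging.StreamHandler()
handler.addFilter(PHIRedactionFilter())
logger.addHandler(handler)

logger.warning("Slow response for MRN: 4821930")  # emits '[REDACTED-MRN]'
```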
One of our recent efforts involved containerizing multiple versions of a lightweight LLM and auto-scaling them in response to prompt complexity. We employed a multi-tenant architecture with per-department namespaces so different clinical teams could experiment without stepping on each other’s outputs.
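A simplified version of the complexity-based routing looks like the sketch below, where a cheap token estimate decides whether a request goes to the small or large model tier. The service URLs and threshold are hypothetical stand-ins for real per-namespace endpoints.

```python
import requests

# Hypothetical per-namespace service URLs and an illustrative threshold.
SMALL_MODEL_URL = "http://llm-small.cardiology.svc/v1/generate"
LARGE_MODEL_URL = "http://llm-large.cardiology.svc/v1/generate"
COMPLEXITY_THRESHOLD = 512  # approximate prompt tokens

def route(prompt: str) -> str:
    """Send simple prompts to the cheap tier, complex ones to the large tier."""
    approx_tokens = len(prompt.split())  # cheap proxy for token count
    url = SMALL_MODEL_URL if approx_tokens < COMPLEXITY_THRESHOLD else LARGE_MODEL_URL
    response = requests.post(url, json={"prompt": prompt}, timeout=30)
    response.raise_for_status()
    return response.json()["text"]
```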
Evaluating Output: Beyond Accuracy, Toward Clinical Usability
Evaluating Generative AI in healthcare requires a new mindset. Metrics like BLEU, ROUGE, and perplexity don’t reflect the nuances of medical content. Instead, it is essential to build a domain-specific evaluator that checks for completeness, accuracy, and alignment with clinical expectations.
We should conduct structured testing of models using synthetic inputs based on real workflows: simulate clinician queries, validate outputs with rule-based checkers, and track feedback to refine prompts or adjust model configurations. Human-in-the-loop evaluation should gather feedback on each inference output.
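A rule-based checker can be as simple as the sketch below; the required sections and the synthetic output are assumptions chosen for illustration, and a real evaluator would encode clinically validated expectations.

```python
# Illustrative required sections; a production evaluator would derive these
# from clinical documentation standards, not a hard-coded list.
REQUIRED_SECTIONS = ["diagnosis", "medications", "follow-up"]

def check_summary(summary: str) -> dict:
    """Flag summaries that omit sections clinicians expect to see."""
    lowered = summary.lower()
    missing = [s for s in REQUIRED_SECTIONS if s not in lowered]
    return {"complete": not missing, "missing_sections": missing}

# Synthetic model output modeled on a real workflow.
synthetic_output = (
    "Diagnosis: community-acquired pneumonia. "
    "Medications: amoxicillin 875 mg BID for 7 days."
)
print(check_summary(synthetic_output))
# {'complete': False, 'missing_sections': ['follow-up']}
```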
Create internal tooling for rapid model iteration:
- Prompt playgrounds with versioning
- Human-in-the-loop feedback systems in UIs
- Shadow deployments to compare LLM-generated content against existing manual workflows without going live (a minimal sketch follows this list)
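The shadow-deployment pattern, sketched minimally below, serves the existing manual output unchanged while logging how the LLM’s version differs for offline review. The function names and similarity metric are illustrative choices.

```python
import difflib
import json
import logging

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("shadow_audit")

def shadow_compare(note: str, manual_summary: str, llm_summarize) -> str:
    """Serve the manual summary; record how the LLM's version differs."""
    llm_summary = llm_summarize(note)
    similarity = difflib.SequenceMatcher(
        None, manual_summary, llm_summary
    ).ratio()
    audit_log.info(json.dumps({
        "similarity": round(similarity, 3),
        "llm_summary": llm_summary,  # assumes PHI-safe logging upstream
    }))
    return manual_summary  # the live response is unchanged
```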
Such tooling accelerates the feedback loop between engineers and clinicians, a critical factor in reducing deployment risk and increasing model trustworthiness.
Leading Engineering Teams Through AI Productization
Driving Generative AI adoption in healthcare isn’t just about coding; it’s about leadership. As engineers, we face three overlapping responsibilities: technical execution, cross-functional coordination, and strategic prioritization.
First, we lead by architecting systems that can evolve. LLMs are changing rapidly; what worked six months ago may now be obsolete. Our teams must be equipped to swap models, fine-tune on custom data, and retrain pipelines without re-architecting everything, and, just as importantly, to maintain an evaluation framework throughout.
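One way to keep model swaps cheap is to hide every backend behind a one-method interface, as in the hypothetical sketch below; application code depends only on the protocol, so replacing one backend with another never forces a rewrite.

```python
from typing import Protocol

class TextGenerator(Protocol):
    """The only surface application code is allowed to depend on."""
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class EchoBackend:
    """Stand-in backend for tests; a vLLM or Transformers wrapper would
    implement the same one-method interface."""
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return prompt[:max_tokens]

def summarize(note: str, model: TextGenerator) -> str:
    # Swapping or fine-tuning a model never touches this code path.
    return model.generate(f"Summarize this clinical note:\n{note}")

print(summarize("Day 2: patient stable, afebrile.", EchoBackend()))
```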
Second, we serve as translators, bridging the language of machine learning with the needs of product managers, clinicians, and compliance teams.
Finally, we must think long-term. Not every LLM use case is ready for production, but the infrastructure we build now must anticipate future growth: multilingual support, audit requirements, fine-tuning interfaces, and more. Engineering leadership in this space is as much about future-proofing as it is about delivery.
Challenges persist, but the true hurdle lies in keeping pace with the speed of evolution while developing more reliable and resilient workflows.
Operationalizing a Generative AI application is not only a technical milestone; it must also align with enterprise healthcare goals. Workflows must be designed to scale across departments, support governance, ensure compliance, and integrate with clinical operations.
Final Thoughts: Scaling the Right Way
Generative AI in healthcare holds immense potential, but potential doesn’t translate to clinical value without a strong engineering foundation. With the advent of the Agentic AI era, we are closing the gap on how clinicians and AI will collaborate in the future. We must scale responsibly, design systems that are resilient and observable, and embrace open-source tools that give us flexibility and control.
This is not a race to build the biggest model. It’s a race to build the most trusted, clinically aware, maintainable, and useful system: one that integrates with real clinical workflows, adapts to clinician feedback, and evolves with the pace of AI innovation.
As engineers, we have the opportunity and the responsibility to drive that transformation from inside the infrastructure. Beyond the hype, that’s where real impact happens.
