This content originally appeared on DEV Community and was authored by SciForce
Client Profile
The client is a professional healthcare technology provider whose platform is used by multiple medical institutions to support clinical data workflows. The project focused on enabling fast, reliable semantic search across standardized medical terminologies, allowing healthcare teams to extract structured meaning from free-text input.
To achieve this, the platform integrates large language models for real-time query normalization and a locally deployed Qdrant vector database for high-performance concept retrieval. The solution was designed to deliver accurate concept mapping at scale, while aligning with MLOps and DevOps best practices to ensure reproducibility, modularity, and operational stability across environments.
Challenge
1) Manual embedding pipeline
The lack of automated tools for embedding transfer creates a bottleneck: embeddings are generated locally, versioned manually, and moved to production via Google Drive and manual file placement. This slows down updates, introduces room for human error, and prevents reproducible, CI-integrated deployment.
2) Fragmented infrastructure
Limited ownership beyond the ML service and vector DB prevents the ML team from controlling or debugging full-system behavior. Integration issues, environment mismatches, and deployment delays often require cross-team coordination, slowing iteration and complicating root-cause analysis.
3) Unstructured input/output
The lack of structure in clinical input, ranging from shorthand notes to long-form symptom descriptions, makes reliable parsing and interpretation difficult. Without standardized formatting or annotated data, conventional models can't be trained, and rule-based systems are brittle. Relying on LLMs introduces additional complexity: output may vary, integration becomes fragile, and downstream systems must be tightly controlled. These challenges require not just flexible language models, but a resilient MLOps/DevOps foundation to manage variability, enforce schema constraints, and ensure consistent behavior across evolving environments.
4) Semantic QA bottlenecks
Ensuring clinical accuracy is a core challenge: even minor semantic mismatches can have significant implications. Because model output must align with subtle, context-specific interpretations, traditional automated tests fall short. Historically, every embedding or model update required manual validation by domain experts — creating a critical path delay and slowing iteration. To reduce reliance on manual expert review, the team introduced a comprehensive benchmark dataset with ground truth mappings, establishing a reliable evaluation set for further automation.
5) Scalability shift
The system initially relied on RAM-based vector search for speed, but as embedding volumes grew into multiple gigabytes, it exceeded available memory and caused instability. This forced a migration to disk-based storage, adding slight latency and requiring adjustments to indexing and retrieval under tight resource limits.
Solution
To enable reliable semantic matching from unstructured clinical input, we implemented a lightweight, modular ML integration pipeline using:
Azure-hosted LLMs for flexible parsing
The project uses pre-deployed Azure OpenAI models (e.g., GPT-4) to transform unstructured clinical input into structured representations, eliminating the need for custom training or labeled datasets. This allows the system to handle highly variable user queries while keeping model maintenance centralized and scalable through endpoint-based access.
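As a rough illustration, a minimal sketch of this normalization call is shown below, assuming the openai Python SDK (v1+) pointed at an Azure OpenAI deployment; the deployment name, API version, and prompt wording are illustrative placeholders, not the project's actual configuration.

```python
# Minimal sketch of LLM-based query normalization against an Azure OpenAI deployment.
# Deployment name, API version, and prompt text below are assumptions for illustration.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",  # assumed API version
)

def normalize_query(raw_text: str) -> str:
    """Ask the hosted GPT-4 deployment to rewrite free text as a normalized clinical phrase."""
    response = client.chat.completions.create(
        model=os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4"),  # deployment name (assumed)
        messages=[
            {"role": "system",
             "content": "Rewrite the clinical text as a concise, standardized phrase. "
                        "Expand shorthand, keep the clinical meaning, drop irrelevant modifiers."},
            {"role": "user", "content": raw_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(normalize_query("pt c/o SOB + wheeze x2 days"))
```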
Embedding generation handled offline
We used an OpenAI embedding model to generate vector embeddings locally for each medical terminology (e.g., SNOMED, Nometrics), ensuring full control over preprocessing and versioning. Collections are packaged as versioned archives (e.g., snomed_v1.0.zip) for easy transfer, enabling reproducible deployments and consistent behavior across environments without relying on real-time API calls.
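A minimal sketch of the offline generation step might look like the following, assuming the openai Python SDK and a CSV export with code, label, and synonyms columns; the column names, batch size, and output file layout are assumptions, not the project's documented format.

```python
# Sketch of offline embedding generation for a terminology export (assumed CSV layout).
import csv, json, os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
)

def embed_terminology(csv_path: str, out_path: str, batch_size: int = 100) -> None:
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    with open(out_path, "w", encoding="utf-8") as out:
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i + batch_size]
            # One string per concept: preferred label plus synonyms.
            texts = [f'{r["label"]} | {r.get("synonyms", "")}' for r in batch]
            resp = client.embeddings.create(
                model="text-embedding-ada-002",  # assumed deployment name
                input=texts,
            )
            for row, item in zip(batch, resp.data):
                out.write(json.dumps({
                    "code": row["code"],
                    "label": row["label"],
                    "vector": item.embedding,
                }) + "\n")

embed_terminology("snomed_concepts.csv", "snomed_v1.0.jsonl")
```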
Qdrant vector DB deployed in a container
The system uses a containerized Qdrant instance to serve vector similarity search via REST API, enabling fast, local retrieval of medical concepts. Embeddings are organized into named collections by terminology, allowing modular updates and scalable indexing. The container-based deployment ensures consistent behavior across environments and simplifies maintenance without relying on external vector services.
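For illustration, the sketch below shows how a terminology collection could be created and queried with the qdrant-client library; the article only states that Qdrant is accessed over REST, so the client library, collection name, vector size, and example payload are assumptions.

```python
# Sketch: one named collection per terminology/version, queried with cosine similarity.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)  # default Qdrant port

# Create (or reset) the collection for one terminology version.
client.recreate_collection(
    collection_name="snomed_v1_0",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # ada-002 dimension
)

# Upsert a concept vector with its metadata as payload (illustrative values).
client.upsert(
    collection_name="snomed_v1_0",
    points=[PointStruct(id=1, vector=[0.1] * 1536,
                        payload={"code": "267036007", "label": "Dyspnea"})],
)

# Cosine-similarity search against the collection.
hits = client.search(collection_name="snomed_v1_0", query_vector=[0.1] * 1536, limit=5)
for hit in hits:
    print(hit.payload["code"], hit.payload["label"], hit.score)
```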
ML services via REST API
Project’s embedding search is delivered through a standalone, containerized Flask API that cleanly separates ML logic from the core application stack. This design allows backend systems to send free-text queries and receive structured concept matches via simple JSON calls, making integration lightweight and enabling independent updates or scaling of the ML component without affecting other services.
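A minimal sketch of such a standalone Flask endpoint is shown below; the request schema, default collection name, and the normalize_query/embed_text placeholders are assumptions standing in for the LLM and embedding calls sketched earlier.

```python
# Sketch of the standalone Flask search service; placeholder helpers keep it runnable.
from flask import Flask, jsonify, request
from qdrant_client import QdrantClient

app = Flask(__name__)
qdrant = QdrantClient(host="qdrant", port=6333)  # container alias (assumed)

def normalize_query(text: str) -> str:
    return text  # placeholder for the Azure OpenAI normalization step

def embed_text(text: str) -> list[float]:
    return [0.1] * 1536  # placeholder for the local ada-002 embedding call

@app.post("/search")
def search():
    body = request.get_json(force=True)
    text = body["text"]
    collection = body.get("collection", "snomed_v1_0")  # assumed request field
    normalized = normalize_query(text)
    vector = embed_text(normalized)
    hits = qdrant.search(collection_name=collection, query_vector=vector, limit=10)
    return jsonify([
        {"code": h.payload["code"], "label": h.payload["label"], "score": h.score}
        for h in hits
    ])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```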
Version-Controlled Deployment
Embedding sets are packaged as versioned .zip archives and deployed across environments using containerized services with environment-scoped configuration. Each version is explicitly tracked and loaded into the local Qdrant vector database, ensuring consistent indexing and retrieval behavior across DEV and PROD. This setup enables clean separation of concerns, modular updates to specific terminology collections, and reliable promotion of embedding versions—aligning with key DevOps and MLOps principles such as reproducibility, observability, and version governance.
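A deploy-time loader along these lines could unpack a versioned archive and index it into Qdrant, as in the sketch below; the archive layout (a JSON-lines file inside the zip), environment variable names, and batch size are assumptions.

```python
# Sketch: unpack a versioned embedding archive and index it into a Qdrant collection.
import json, os, zipfile
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

ARCHIVE = os.environ.get("EMBEDDING_ARCHIVE", "/mnt/data/collections/snomed_v1.0.zip")
COLLECTION = os.environ.get("COLLECTION_NAME", "snomed_v1_0")  # assumed variable names

client = QdrantClient(host=os.environ.get("QDRANT_HOST", "localhost"), port=6333)
client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

with zipfile.ZipFile(ARCHIVE) as zf:
    with zf.open("embeddings.jsonl") as f:  # assumed file name inside the archive
        points = []
        for idx, line in enumerate(f):
            rec = json.loads(line)
            points.append(PointStruct(id=idx, vector=rec["vector"],
                                      payload={"code": rec["code"], "label": rec["label"]}))
            if len(points) == 500:
                client.upsert(collection_name=COLLECTION, points=points)
                points = []
        if points:
            client.upsert(collection_name=COLLECTION, points=points)
```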
Features
Real-time semantic normalization
A single API endpoint accepts free-text clinical input, transforms it into embeddings via LLM-based preprocessing, and performs vector similarity search against curated, versioned medical terminologies in the Qdrant database. The API returns ranked, standardized concepts with codes, labels, and confidence scores—enabling fast, consistent mapping of unstructured input to structured clinical concepts.
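From a consuming backend's perspective, a call to this endpoint might look like the following sketch; the host, port, payload fields, and response shape are assumptions based on the description above.

```python
# Example client call to the search endpoint (assumed host, port, and payload fields).
import requests

resp = requests.post(
    "http://ml-service:5000/search",
    json={"text": "severe shortness of breath and wheezing", "collection": "snomed_v1_2"},
    timeout=10,
)
for match in resp.json():
    print(match["code"], match["label"], round(match["score"], 3))
```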
Support for multiple terminologies
Users can query against specific medical dictionaries by selecting from versioned embedding collections, such as SNOMED CT (English and Spanish), RxNorm, and LOINC. Each collection is independently maintained, versioned, and accessible via API, enabling domain-specific semantic search without retraining or reconfiguration.
Operational Version Introspection
A dedicated API endpoint exposes the currently loaded versions of terminology collections, embedding sets, and LLM configurations. This enables engineers and QA teams to quickly verify deployment state, troubleshoot inconsistencies across environments, and ensure alignment during testing and release validation—without accessing infrastructure directly.
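A minimal version of such an introspection endpoint is sketched below; the response fields and the environment variables they are read from are assumptions based on the description.

```python
# Sketch of a version-introspection endpoint exposing loaded collection and model versions.
import os
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/version")
def version():
    return jsonify({
        "terminology_collections": os.environ.get("COLLECTION_VERSIONS",
                                                  "snomed_v1.2,loinc_v1.0").split(","),
        "embedding_model": os.environ.get("EMBEDDING_MODEL", "text-embedding-ada-002"),
        "llm_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4"),
    })
```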
Human-in-the-loop validation
To ensure semantic accuracy, domain experts created a benchmark dataset of 200 mapped clinical concepts with validated ground truth. This allows consistent testing of different models and prompt configurations without requiring repeated expert review. While expert oversight remains critical before release, the benchmark enables partial automation of QA and supports more efficient, comparative evaluation within the MLOps workflow.
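One way to use that benchmark for comparative evaluation is sketched below; the file format and the top-k accuracy metric are assumptions, since the article only states that 200 mapped concepts with validated ground truth are used.

```python
# Sketch of benchmark-driven evaluation against expert-curated ground truth mappings.
import json

def top_k_accuracy(benchmark_path: str, search_fn, k: int = 5) -> float:
    """search_fn(text) should return a ranked list of concept codes (e.g., via the /search API)."""
    with open(benchmark_path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]  # assumed: {"text": ..., "expected_code": ...}
    hits = sum(1 for c in cases if c["expected_code"] in search_fn(c["text"])[:k])
    return hits / len(cases)

# Usage: compare two prompt or model configurations on the same benchmark.
# acc_a = top_k_accuracy("benchmark.jsonl", search_with_prompt_a)
# acc_b = top_k_accuracy("benchmark.jsonl", search_with_prompt_b)
```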
Development Process
1) Embedding Preparation
Each medical terminology (SNOMED CT, RxNorm, and LOINC) is processed locally from CSV or JSON files containing concept codes, preferred labels, and synonyms. Each concept, together with its labels and synonyms, is converted into a single dense vector embedding to support semantic search.
Embeddings are generated in batches via local scripts that call the OpenAI API, respecting rate limits and handling errors with exponential backoff. Each resulting vector is associated with its original concept metadata and stored in a structured format. The full collection is archived by terminology and version (e.g., snomed_v1.0.zip), enabling deterministic re-use and alignment with specific QA snapshots.
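Packaging the generated set into a versioned archive could look like the sketch below, following the snomed_v1.0.zip naming convention mentioned above; the archive contents and manifest are assumptions.

```python
# Sketch: bundle an embedding set into a versioned archive for transfer between environments.
import zipfile
from pathlib import Path

def package_collection(jsonl_path: str, terminology: str, version: str, out_dir: str = ".") -> Path:
    archive = Path(out_dir) / f"{terminology}_v{version}.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(jsonl_path, arcname="embeddings.jsonl")  # vectors + concept metadata
        zf.writestr("MANIFEST.json",
                    f'{{"terminology": "{terminology}", "version": "{version}"}}')
    return archive

package_collection("snomed_v1.0.jsonl", "snomed", "1.0")  # -> snomed_v1.0.zip
```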
2) Versioning & Transfer
Each environment (DEV/PROD) runs a containerized stack (Flask API + Qdrant DB) with embedding archives pulled from a shared repository. Versioning and configuration are handled via .env files and Jenkins deployments. Graylog provides centralized logging and monitoring.
3) Environment Setup
The Flask-based API and Qdrant vector database are containerized and deployed via Docker Compose on dedicated virtual machines for DEV and PROD. These environments are configured independently, with no orchestration layer such as Kubernetes. Each VM is configured as follows (a minimal configuration-loading sketch follows the list):
Local volume for embedding storage
- Mounted path: /mnt/data/collections
- Used to store and access versioned embedding sets
Docker Compose stack
- Defines and runs the Flask API and Qdrant DB containers
- Internal communication uses the default Docker bridge network
Environment-specific configuration via .env file
- Embedding collection version (e.g., SNOMED_v1.2)
- Azure OpenAI credentials and endpoint names
- Port bindings and container aliases
Deployment method
- Jenkins pipeline pulls the versioned embedding archive, updates configs, and restarts the Docker Compose stack (Flask API + Qdrant DB)
- Can also be deployed manually via docker-compose up -d on pre-configured VMs
- Configuration via .env file (embedding version, Azure credentials, port mappings)
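As referenced above, a minimal configuration-loading sketch is shown here, assuming python-dotenv and illustrative variable names; the article does not list the exact keys used in the .env files.

```python
# Sketch: load the environment-scoped settings described in the list above.
import os
from dotenv import load_dotenv

load_dotenv("/app/.env")  # assumed location of the environment-specific file

CONFIG = {
    "collection_version": os.environ["EMBEDDING_COLLECTION"],   # e.g. SNOMED_v1.2
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "azure_api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
    "qdrant_host": os.environ.get("QDRANT_HOST", "qdrant"),     # container alias
    "api_port": int(os.environ.get("API_PORT", "5000")),
}
```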
4) LLM Configuration
LLMs are provisioned as named deployment endpoints per environment using automated scripts integrated into the CI/CD process. The Flask-based ML service accesses these endpoints via HTTPS using the Azure OpenAI SDK or direct REST calls; authentication is handled through API keys stored in environment variables, and each request specifies both the deployment name and model version.
To handle Azure rate limits (e.g., 60 RPM or token-based limits), the service includes exponential backoff with jitter and logs each retry event to Graylog. All prompts are constructed dynamically based on free-text input; no prompt templates are hardcoded.
The system performs no fine-tuning or embedding generation at runtime — all LLM usage is stateless and on-demand, with inference only.
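The retry behavior described above could be implemented along these lines; the helper below is a sketch, with the wrapped call, retry counts, and logger setup as placeholders rather than the project's actual code.

```python
# Sketch: exponential backoff with jitter around rate-limited LLM calls, logging each retry.
import logging, random, time

logger = logging.getLogger("ml-service")

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry fn() on transient/rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # in practice, catch the SDK's rate-limit/timeout errors
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            logger.warning("retry %d after error: %s (sleeping %.1fs)", attempt + 1, exc, delay)
            time.sleep(delay)

# Usage: result = call_with_backoff(lambda: client.chat.completions.create(...))
```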
5) Query Processing
When a user submits a free-text clinical phrase, the system processes it in real time through LLM-driven normalization and vector-based concept retrieval. This ensures accurate semantic mapping even from unstructured, informal input.
The end-to-end process involves:
Receiving the query
The user sends a POST request to the /search endpoint of the Flask API, providing a JSON payload containing the raw input text (e.g., “severe shortness of breath and wheezing”).
LLM-based normalization
The API forwards the text to a pre-configured Azure OpenAI endpoint (such as GPT-4). The model returns a normalized or clarified version of the input — for example, standardizing synonyms or removing irrelevant modifiers.
Embedding generation and similarity search
The normalized text is embedded using a local embedding function based on OpenAI’s Ada-002 model. This vector is then compared against a selected embedding collection (e.g., snomed_v1.2) in the Qdrant vector database using cosine similarity.
Returning matched concepts
The API responds with a ranked list of the most semantically relevant medical concepts, typically returning 5 to 10 matches. Each result includes the concept code, preferred label, and similarity score.
This setup allows the system to handle noisy clinical input and return consistent, terminology-aligned results — without requiring pre-structured queries or labeled training data.
6) Logging & Monitoring
Each query triggers structured logging from input receipt to result delivery. The Flask API logs request metadata, LLM prompts, responses, and any errors such as timeouts or Azure rate limits. Retry attempts are handled via exponential backoff in the service code and recorded with full context. Embedding generation and vector search operations are also logged with duration and status for traceability.
All logs are sent to Graylog, which is used for centralized monitoring, error tracking, and system observability.
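A minimal sketch of shipping structured logs to Graylog is shown below, assuming the graypy library and a GELF UDP input on the Graylog side; the host, port, and field names are assumptions.

```python
# Sketch: send structured log records to Graylog over GELF UDP.
import logging
import graypy

logger = logging.getLogger("ml-service")
logger.setLevel(logging.INFO)
logger.addHandler(graypy.GELFUDPHandler("graylog", 12201))  # assumed Graylog input

# Structured fields travel to Graylog as extra GELF attributes.
logger.info(
    "search completed",
    extra={"query_id": "abc123", "collection": "snomed_v1_2",
           "llm_latency_ms": 640, "search_latency_ms": 85, "matches": 10},
)
```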
7) QA & Validation
Each embedding update or logic change is validated in DEV using a combination of automated tests and curated clinical test cases. Automated checks verify embedding integrity, index loading, and similarity scoring behavior, while benchmark-based evaluations assess semantic accuracy across known inputs. Targeted expert review is applied when needed to ensure clinical alignment before promoting changes to PROD.
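The automated checks could be expressed as pytest-style tests along the following lines; the collection name, vector size, and assertions are illustrative assumptions.

```python
# Sketch: pytest-style checks for index loading and similarity scoring behavior in DEV.
import random
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

def test_collection_loaded():
    # Index loading: the expected terminology collection exists and is non-empty.
    info = client.get_collection("snomed_v1_2")
    assert (info.points_count or 0) > 0

def test_similarity_scoring_ranked():
    # Similarity scoring: a query returns results ranked by descending score.
    vec = [random.random() for _ in range(1536)]
    hits = client.search(collection_name="snomed_v1_2", query_vector=vec, limit=5)
    assert len(hits) > 0
    assert all(hits[i].score >= hits[i + 1].score for i in range(len(hits) - 1))
```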
Impact
Scalable semantic search
Enabled flexible semantic search across four medical embedding collections (e.g., SNOMED, Nometrics), each containing hundreds of thousands to several million terms.
Low-latency responses
Maintained <1 second search latency, even after shifting from in-memory to disk-based vector indexing due to RAM constraints.
Lightweight infrastructure
The entire system runs in approximately 11 Docker containers, with the ML component isolated in 2 dedicated containers (Flask API and Qdrant DB).
Human-in-the-loop validation
Each release validated using 100+ curated test prompts, with clinical experts reviewing semantic accuracy and QA engineers confirming technical correctness.
LLM integration via Azure OpenAI
Used GPT-4 via Azure endpoints, handling up to 120,000 tokens/minute. Retry logic with exponential backoff ensured resilience to rate limits.
Zero training, high adaptability
No local model fine-tuning was required. Instead, runtime prompts enabled adaptable free-text normalization across domains.