Retrieval-augmented systems for technical domains often struggle with the terminological gap between user queries and domain-specific documents. The gap is especially pronounced in pharmacopoeia repositories, where scientific nomenclature differs markedly from natural-language expressions. This research addresses low recall in pharmacopoeia document retrieval, where a significant portion of relevant documents remains inaccessible to conventional retrieval methods. Through analysis of query-document relationships, we show that the semantic disconnect between general user language and specialized chemical terminology calls for a multi-faceted approach. Conventional methods, whether dense semantic encoders or sparse lexical matchers, each fail on their own to capture the full range of relevance patterns in these specialized scientific documents.
We present a multi-query retrieval architecture designed for in-domain datasets that combines several components to address the terminology gap. It uses two query expansion strategies: keyword-based expansion, which generates domain-specific terms, and passage-based expansion, which uses LoRA-fine-tuned language models to generate pharmacopoeia-style contextual passages. These expansions feed a hybrid retrieval system that pairs fine-tuned dense encoders (optimized with Multiple Negatives Ranking Loss) with sparse BM25 retrievers, and the results are combined through a two-stage Reciprocal Rank Fusion (RRF) method. This architecture lets the system exploit semantic understanding and exact terminology matching at the same time while keeping the retrieval streams balanced. We also develop a document processing pipeline for pharmacopoeia content that includes chemical-specific segmentation, section-level chunking, and synthetic query generation.
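To make the fusion step concrete, the following is a minimal Python sketch of one plausible reading of two-stage RRF. The staging is an assumption, since the text above does not specify it: stage one fuses the ranked lists produced by the different query formulations (original query, keyword expansion, passage expansion) within each retriever, and stage two fuses the resulting dense and BM25 lists. The function names, the toy document IDs, and the constant k = 60 (a commonly used default for RRF) are illustrative choices, not details taken from the system described here.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def two_stage_rrf(dense_runs, sparse_runs, k=60):
    """Hypothetical two-stage fusion (an assumed staging, see lead-in).

    Stage 1: fuse the per-query-variant lists within each retriever.
    Stage 2: fuse the single dense list with the single sparse list, so
    each retriever contributes one ranking regardless of how many query
    variants it was run on.
    """
    dense_fused = rrf(dense_runs, k)
    sparse_fused = rrf(sparse_runs, k)
    return rrf([dense_fused, sparse_fused], k)


# Toy rankings: one list per query variant, for each retriever.
dense_runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d2", "d3", "d1"]]
sparse_runs = [["d3", "d2", "d5"], ["d3", "d5", "d2"], ["d2", "d3", "d5"]]

fused = two_stage_rrf(dense_runs, sparse_runs)
print(fused)  # ['d2', 'd3', 'd1', 'd5', 'd4']
```

Under this assumed staging, the second stage sees exactly one list per retriever, which is one way to keep the dense and sparse streams balanced even when they are run on different numbers of expansions. A single-stage RRF over all six lists would instead weight each retriever by the number of lists it contributes.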
Experimental evaluation across multiple metrics shows that the complete architecture substantially outperforms baseline approaches. Component analysis shows that hybrid retrieval alone improves on single-method approaches, that domain-specific encoder fine-tuning contributes substantial gains, and that the two-stage RRF adds further improvement over conventional fusion techniques. These findings indicate that bridging the terminology gap in specialized domains requires integrating domain-adapted representations, diverse query formulations, and careful result fusion. The architecture and its underlying principles are relevant to information retrieval in other specialized scientific and technical domains where similar terminological barriers separate users from document collections.