Extending the Observational Medical Outcomes Partnership Common Data Model to Harmonize Dietary Exposure Survey Questions Across Studies
Abstract Body: Introduction: Numerous survey-based dietary assessments and scores have been used in cardiovascular disease (CVD) research alongside study participants’ electronic health record (EHR) data. While harmonization of EHR data across institutions has been facilitated by established Common Data Models (CDMs), equivalent methods for harmonizing dietary exposure survey questions remain underdeveloped. Currently, investigators must undertake labor-intensive, study-specific remapping efforts. Moreover, treating each food item as a distinct question generates hundreds of variables, a number that increases exponentially when qualifiers such as portion size, food frequency, or preparation method are included.
Methods: We compiled four dietary survey instruments used in CVD studies (Mediterranean Diet Score, Alternative Healthy Eating Index [AHEI], Dietary Approaches to Stop Hypertension [DASH], and a food frequency questionnaire [FFQ]) and applied fuzzy matching, sentence embeddings, and large language model (LLM)-assisted inference to map survey questions to the NIH Common Data Elements (CDEs). To address low direct mapping rates against NIH CDEs, we developed a new dietary exposure table within the Observational Medical Outcomes Partnership (OMOP) CDM, a widely adopted community standard for observational health data. The proposed relational database table includes fields such as dietary concept, dietary exposure duration, exposure unit, frequency, frequency per day, quantity concept, quantity per day, dietary preparation method, and family relation. Mapped dietary questions were then represented as rows in this table.
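As a rough illustration of the table described above, the following sketch builds an in-memory SQLite table whose columns mirror the fields listed in the Methods. The column names, data types, the person_id link, and the placeholder concept ID are all assumptions made for illustration; the abstract does not give the actual table definition.

```python
import sqlite3

# Hypothetical DDL: column names and types are guesses based on the field
# names listed in the abstract, not the authors' actual schema.
DDL = """
CREATE TABLE dietary_exposure (
    dietary_exposure_id            INTEGER PRIMARY KEY,
    person_id                      INTEGER NOT NULL,  -- assumed link to an OMOP person
    dietary_concept_id             INTEGER NOT NULL,  -- e.g. a SNOMED-derived concept
    exposure_duration              REAL,
    exposure_unit_concept_id       INTEGER,
    frequency_concept_id           INTEGER,
    frequency_per_day              REAL,
    quantity_concept_id            INTEGER,
    quantity_per_day               REAL,
    preparation_method_concept_id  INTEGER,
    family_relation_concept_id     INTEGER
)
"""


def build_demo_db():
    """Create the sketch table and insert one illustrative row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    # One survey answer ("eats this food 3 times per week, 1 serving each
    # time") stored as a row. The concept ID 1001 is a placeholder, not a
    # real vocabulary code.
    conn.execute(
        "INSERT INTO dietary_exposure "
        "(person_id, dietary_concept_id, frequency_per_day, quantity_per_day) "
        "VALUES (?, ?, ?, ?)",
        (1, 1001, 3 / 7, 3 / 7),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for row in conn.execute(
        "SELECT person_id, dietary_concept_id, frequency_per_day "
        "FROM dietary_exposure"
    ):
        print(row)
```

Under this layout, each answered survey item becomes one row, so adding a new food or qualifier adds rows or fills an existing column rather than creating a new variable.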
Results: Direct mapping of dietary survey questions to NIH CDEs achieved less than 10% coverage. Incorporating the new dietary exposure table increased mapping coverage to over 80%. All introduced columns were standardized using SNOMED vocabulary concepts via embedding and LLM-based mapping. Approximately 99% of item fields conformed to a frequency–quantity composite schema, enabling normalization into a single occurrence table and reducing more than 1,500 raw survey variables into a concise, query-ready dataset. The mapping took a total of 4 hours to run on GPUs.
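The normalization step described in the Results can be pictured with a small sketch that folds per-food frequency and quantity variables into one row per food. The variable naming convention (`<food>_freq`, `<food>_qty`), the response labels, and the frequency-to-per-day conversion are hypothetical; the abstract does not specify how the raw survey variables were encoded.

```python
# Hypothetical mapping from categorical frequency answers to occurrences per day.
FREQ_PER_DAY = {"never": 0.0, "weekly": 1 / 7, "daily": 1.0}


def normalize(responses):
    """Fold wide survey variables into one frequency-quantity row per food.

    `responses` maps variable names such as "apple_freq" or "apple_qty"
    (an assumed naming scheme) to raw answers.
    """
    # Group the raw variables by food item.
    items = {}
    for var, value in responses.items():
        food, _, kind = var.rpartition("_")
        items.setdefault(food, {})[kind] = value

    rows = []
    for food, parts in items.items():
        freq = FREQ_PER_DAY.get(parts.get("freq"))
        portions = parts.get("qty")
        # Quantity per day = portions per occasion x occasions per day,
        # left empty when either piece is missing.
        if freq is not None and portions is not None:
            qty_per_day = freq * float(portions)
        else:
            qty_per_day = None
        rows.append(
            {
                "dietary_concept": food,
                "frequency_per_day": freq,
                "quantity_per_day": qty_per_day,
            }
        )
    return rows


if __name__ == "__main__":
    wide = {"apple_freq": "daily", "apple_qty": "2", "red_meat_freq": "weekly"}
    for row in normalize(wide):
        print(row)
```

In this toy form, three wide variables collapse into two rows with a fixed set of columns, which is the same shape of reduction the Results describe at a much larger scale.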
Conclusion: We developed a dietary exposure table in OMOP CDM format to represent, store, and harmonize dietary exposure survey questions and responses using embeddings, an LLM, and SNOMED. This extension enables both syntactic and semantic interoperability of dietary data, substantially simplifying cross-study integration and facilitating large-scale, comparative nutrition research across diverse data sources.
Kim, Jihoon (Department of Biomedical Informatics & Data Science, Yale School of Medicine, New Haven, Connecticut, United States)
Liu, Youwen (Department of Biomedical Informatics & Data Science, Yale School of Medicine, New Haven, Connecticut, United States)
Willmott, Heather (Center for American Indian Health Research, the University of Oklahoma Health Sciences, Oklahoma City, Oklahoma, United States)
Reese, Jessica (Center for American Indian Health Research, the University of Oklahoma Health Sciences, Oklahoma City, Oklahoma, United States)
Xu, Chao (Department of Biostatistics and Epidemiology, the University of Oklahoma Health Sciences, Oklahoma City, Oklahoma, United States)
Kota, Pravina (Center for American Indian Health Research, the University of Oklahoma Health Sciences, Oklahoma City, Oklahoma, United States)
Hong, Na (Department of Biomedical Informatics & Data Science, Yale School of Medicine, New Haven, Connecticut, United States)
Zhang, Vincent (Department of Biomedical Informatics & Data Science, Yale School of Medicine, New Haven, Connecticut, United States)
Zhang, Ying (Center for American Indian Health Research, the University of Oklahoma Health Sciences, Oklahoma City, Oklahoma, United States)