Senior Scholars Program

UMMS Affiliation

Department of Surgery, Surgical Outcomes Analysis and Research; Senior Scholars Program

Faculty Mentor

Jennifer F. Tseng

Publication Date


Document Type



Health Information Technology | Health Services Research | Surgery


Background: The importance of an electronic medical record has been highlighted for both clinical care and research. In the current era, data warehouses and repositories have been established to serve the dual function of patient care and investigation.

Purpose: The aim of this study was to compare a newly developed institutional clinical data warehouse, linked with the hospital information system (HIS), to a prospectively maintained departmental database.

Methods: A novel HIS-linked institutional clinical data warehouse was queried for 9 primary and secondary ICD-9-CM discharge diagnosis codes for pancreatic cancer. The database captured inpatient and outpatient clinical and billing information from a pool of over 2 million patients evaluated at an academic medical institution and its affiliates since 1995. A cohort was identified; following Institutional Review Board approval, demographic and clinical data were obtained. These data were compared to a manually entered and prospectively maintained surgical oncology database of the same institution, tracking 394 patients since 1999. Duplicated patients, and those unique to either dataset, were flagged. Patients with diagnosis dates prior to 1999 were excluded to allow comparison over the same time period. For validation purposes, a 10% random sample of remaining patients unique to each dataset underwent manual review of medical records including clinic notes, admission/discharge notes, diagnostic imaging, and pathology reports.
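
The flagging and sampling steps described in the Methods can be illustrated with a minimal Python sketch. It is not the study's actual tooling: the function names, the use of plain patient identifiers, and the fixed random seed are all assumptions made for illustration.

```python
import random


def compare_cohorts(warehouse_ids, registry_ids):
    """Split patient identifiers into overlap and dataset-unique groups.

    Hypothetical helper: assumes both sources share a common patient identifier.
    """
    warehouse, registry = set(warehouse_ids), set(registry_ids)
    return {
        "both": warehouse & registry,            # captured in both datasets
        "warehouse_only": warehouse - registry,  # unique to the HIS-linked warehouse
        "registry_only": registry - warehouse,   # unique to the departmental database
    }


def sample_for_review(ids, fraction=0.10, seed=0):
    """Draw a reproducible random sample of roughly `fraction` of the ids."""
    ordered = sorted(ids)  # sort first so the same seed gives the same sample
    if not ordered:
        return []
    k = max(1, round(len(ordered) * fraction))
    return random.Random(seed).sample(ordered, k)


# Toy identifiers only; these are not study data.
groups = compare_cohorts(range(1, 11), range(8, 13))
review_list = sample_for_review(range(100))
```

Each sampled identifier would then go to manual chart review, which is the step that cannot be automated in this sketch.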

Results: A total of 1107 patients with pancreatic neoplasm-associated diagnosis codes dating from 1999 to 2009 were identified from the HIS-linked dataset. Of these, 254 (22.9%) were captured in both datasets, while 853 (77.1%) appeared only in the HIS-linked dataset. Manual review of the 10% subset of the HIS-only group demonstrated that 55.6% of patients had no identifiable pancreatic pathology, suggesting miscoding, while 31.7% had diagnoses consistent with pancreatic neoplasm, and 12.7% had pseudocyst or pancreatitis. Of the 394 patients tracked by surgical oncology, 254 (64.5%) were captured in both datasets, while 140 (35.5%) had not been captured in the HIS-linked dataset. Manual review of the 10% subset of the non-captured patients demonstrated that 93.3% had pancreatic neoplasm and 6.7% had pseudocyst or pancreatitis. Lastly, a review of the 10% subset of the 254-patient overlap demonstrated that 87.5% of patients had pancreatic neoplasm, 8.3% had pseudocyst or pancreatitis, and 4.2% had no pancreatic pathology.
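
The capture percentages reported in the Results follow directly from the cohort counts. The short Python check below recomputes them; the variable and function names are illustrative assumptions, and only the counts come from the abstract.

```python
def share(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts taken from the Results paragraph.
his_total = 1107      # patients identified in the HIS-linked dataset
registry_total = 394  # patients tracked by surgical oncology
overlap = 254         # patients captured in both datasets

his_only = his_total - overlap            # 853
registry_only = registry_total - overlap  # 140

capture = {
    "overlap_of_his": share(overlap, his_total),
    "his_only_of_his": share(his_only, his_total),
    "overlap_of_registry": share(overlap, registry_total),
    "registry_only_of_registry": share(registry_only, registry_total),
}
```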

Conclusions: While technological advances provide a powerful means to automate institutional-level cohort identification and data collection, a high degree of misclassification may be present if queries are based solely on ICD-9-CM discharge codes. For that reason, careful validation and data cleaning are critical steps prior to research use. These results also suggest cautious interpretation of national-level administrative data utilizing ICD-9-CM diagnosis codes. Our findings suggest that the current state-of-the-art data warehouses continue to require clinical correlation and validation through traditional retrospective mechanisms.


Electronic Health Records, Medical Records Systems, Computerized, Biomedical Research, Pancreatic Neoplasms, Data Collection, International Classification of Diseases

Rights and Permissions

Copyright is held by the author(s), with all rights reserved.

DOI of Published Version


Journal/Book/Conference Title

Senior Scholars Program


Medical student Edward Arous participated in this study as part of the Senior Scholars research program at the University of Massachusetts Medical School.