MEDICAL
INFORMATICS
Knowledge Management
and Data Mining in
Biomedicine
INTEGRATED SERIES IN INFORMATION SYSTEMS
Series Editors
Professor Ramesh Sharda, Oklahoma State University
Prof. Dr. Stefan Voß, Universität Hamburg
Other published titles in the series:
E-BUSINESS MANAGEMENT:
Integration of Web Technologies with Business Models / Michael J. Shaw
VIRTUAL CORPORATE UNIVERSITIES:
A Matrix of Knowledge and Learning for the New Digital Dawn / Walter R.J. Baets & Gert Van der Linden
SCALABLE ENTERPRISE SYSTEMS:
An Introduction to Recent Advances / edited by Vittal Prabhu, Soundar Kumara, Manjunath Kamath
LEGAL PROGRAMMING:
Legal Compliance for RFID and Software Agent Ecosystems in Retail Processes and Beyond / Brian Subirana and Malcolm Bain
LOGICAL DATA MODELING:
What It Is and How To Do It
Carol Friedman, Columbia University, USA
William Hersh, Oregon Health & Science Univ., USA
Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available
from the Library of Congress.
ISBN-10: 0-387-24381-X (HB)
ISBN-10: 0-387-25739-X (e-book)
ISBN-13: 978-0387-24381-8 (HB)
ISBN-13: 978-0387-25739-6 (e-book)
© 2005 by Springer Science+Business Media, Inc.
All rights reserved. This work may not be translated or copied in whole or in
part without the written permission of the publisher (Springer Science+Business
Media, Inc., 233 Spring Street, New York, NY 10013, USA), except
for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and
similar terms, even if they are not identified as such, is not to be taken as an
expression of opinion as to whether or not they are subject to proprietary rights.
Printed in the United States of America.
9 8 7 6 5 4 3 2 1    SPIN 11055556
Knowledge Management, Data Mining, and Text Mining
Applications in Biomedicine
3.1 Ontologies
3.2 Knowledge Management
3.3 Data Mining and Text Mining
3.4 Ethical and Legal Issues for Data Mining
Summary
References
Suggested Readings
Online Resources
4. Data Description
5. Results
5.1 Basic Analysis
5.2 Content Map Analysis
5.3 Citation Network Analysis
6. Conclusion and Discussion
7. Acknowledgement
References
Biological Perspective
3. Case Study
3.1 Informatics Perspective - The BIOINFOMED Study and Genomic Medicine
3.2 Biological Perspective - The BioResearch Liaison Program at the University of Washington
4. Conclusions and Discussion
5. Acknowledgements
3. Review of the Literature: Data Mining and Privacy and Security
3.1 General Approaches to Assuring Appropriate Use
3.2 Specific Approaches to Achieving Data Anonymity
3.3 Other Issues in Emerging "Privacy Technology"
3.4 "Value Sensitive Design": A Synthetic Approach to Technological Development
3.5 Responsibility of Medical Investigators
4. Case Study: The Terrorist Information Awareness Program (TIA)
Online Resources
Questions for Discussion
Chapter 5: Ethical and Social Challenges of Electronic Health Information
1. Introduction
2. Overview of the Field
2.1 Electronic Health Records
2.2 Clinical Alerts and Decision Support
2.3 Internet-based Consumer Health Information
Introduction
1.1 Use-cases
2. Context
2.1 Concept Characteristics
2.2 Domains
2.3 Structure
3. Biomedical Concept Collections
3.1 Ontologies
3.2 Vocabularies and Terminologies
3.3 Aggregation and Classification
3.4 Thesauri and Mappings
4
Background and Overview: The Use of Concept Relationships for Knowledge Creation
2.1 Indexing Strategies and Vocabulary Systems
2.2 Integrating Document Structure in Systems
2.3 Text Mining Approaches
2.4 Literature-based Discovery IR Systems
2.5 Summary
3. Case Examples
3.1 Genescene
3.2 Telemakus
3.3 How Can a Concept Relationship System Help with the Researcher's Problem and Questions?
3.4 Summary
4
.
Ontologies
2.1 OpenCyc
2.2 WordNet
3. Examples of Medical Ontologies
3.1 GALEN
3.2 Unified Medical Language System
3.3 The Systematized Nomenclature of Medicine
3.4 Foundational Model of Anatomy
3.5 MENELAS ontology
4
Acknowledgments
References
Suggested Readings
Online Resources
Questions for Discussion
Appendix: Table showing characteristics of selected ontologies
Chapter 9: Information Retrieval and Digital Libraries
Overview of Fields
Information Retrieval
2.1 Content
4.2 User-oriented Evaluation
4.3 Changes in Publishing
Acknowledgements
References
Suggested Readings
Online Resources
Questions for Discussion
Chapter 10: Modeling Text Retrieval in Biomedicine
1. Introduction
5.2 XplorMed
5.3 AI3View:HivResist
5.4 The Future
References
Suggested Readings
Online Resources
Questions for Discussion
Chapter 11: Public Access to Anatomic Images
Introduction
Background
Summary
Acknowledgements
References
Suggested Readings
Online Resources
Questions for Discussion
Chapter 12: 3D Medical Informatics: Information Science in Multiple Dimensions
Introduction
Overview
3D Medical Informatics
4.1 Background and Related work
4.2 Design and Software Tools for Template Planning Workstation
4.3 Results and Discussion
5. Grand Challenges in 3D Medical Informatics
6. Conclusion
7. Acknowledgements
References
2.4 Infectious Disease Data Analysis and Outbreak Detection
Infectious Disease Information Infrastructure and Outbreak Detection: Case Studies
3.1 New York State's Health Information Network System
3.2 The BioPortal System
3.3 West Nile Virus Outbreak Analysis
Conclusions and Discussion
Acknowledgements
References
2.1 Overview
2.2 Levels of Linguistic Structure
Domain Knowledge: The UMLS
3.1 SPECIALIST Lexicon
3.2 Metathesaurus
Semantic Network
Semantic Interpretation for the Biomedical Literature
4.1 Overview
4.2 AQUA
4.3 PROTEUS-BIO
Chapter 15: Semantic Text Parsing for Patient Records
1. Introduction
2. Overview
2.1 Challenges of Processing Clinical Reports
2.2 Components of an NLP System
2.3 Clinical Applications
3
from Text Documents
Introduction
Overview of the Field
2.1 Background
2.2 Biological Information Extraction
2.3 Bioinformatics Tools
Case Studies
3.1 Identification of Flat Relationships from Text Documents
3.2 TransMiner: Formulating Novel, Implicit Associations through Transitive Closure
3.3 Identification of Directional and Hierarchical Relationships
BioMap: A Knowledge Base of Biological Literature
1. Introduction
2. Overview
2.1 Metabolic Pathway Databases
2.2 Network Modeling and Reconstruction
2.3 Extracting Biological Interactions from Text
3. Metnet
3.1
Acknowledgements
References
Suggested Readings
Online Resources
Questions for Discussion
Chapter 18: Gene Pathway Text Mining and Visualization
1. Introduction
2. Literature Review/Overview
Suggested Readings
Online Resources
Questions for Discussion
Chapter 19: The Genomic Data Mine
1. Introduction
2. Overview
2.1 Genomic Text Data
2.2 Genomic Map Data
Exploratory Genomic Data Analysis
1. Introduction
2. Overview
2.1 Gene Expression Data
2.2 Mixed Populations
2.3 Methods for Mixed Populations
2.4 Distance
2.5 Hypothesis Selection
3. Case Studies
2. Overview of the Field
2.1 Large-scale Biological Data and Knowledge Resources
2.2 Joint Learning Using Multiple Types of Data
2.3 Joint Learning Using Data and Knowledge
3. Kernel-based Data Fusion of Multiple Types of Data
3.1 Protein Function Prediction
3.2 Kernel-based Protein Function Prediction
4. Learning Regulatory Networks Using Microarray and Existing Knowledge
Questions for Discussion
Author Index
Subject Index
EDITORS' BIOGRAPHIES
Hsinchun Chen is the McClelland Professor of
Management Information Systems (MIS) at the Eller
College of the University of Arizona. He received his
Ph.D. degree in Information Systems from New York
University. He is the author of more than nine books
and 200 articles covering medical informatics,
knowledge management, homeland security, semantic
retrieval, and Web computing in leading information
technology publications. He serves on the editorial
boards of Journal of the American Society for Information Science and
Technology, ACM Transactions on Information Systems, IEEE Transactions
on Systems, Man, and Cybernetics, IEEE Transactions on Intelligent
Transportation Systems, and Decision Support Systems. He is a scientific
counselor/advisor of the Lister Hill Center of the National Library of
Medicine (NLM/USA) and the National Library of China. Dr. Chen is the
director of the University of Arizona's Artificial Intelligence Lab (40+
researchers). Since 1990, Dr. Chen has received more than $17M in research
funding from various government agencies and major corporations. He has
Department of Medical Education and Biomedical
of Medicine; Professor, Information School and
Adjunct Professor, Department of Health Services, School of Public Health
and Community Medicine. Dr. Fuller has a BA degree in Biology, a Master's
in Library Science from Indiana University, and a Ph.D. in Library and
Information Science from the University of Southern California. Dr. Fuller's
areas of research include: developing new approaches to represent and map
the results
of scientific research; design and evaluation of information
systems to support decision making at the place and time of need; and
integrated health sciences information systems design.
Dr. Fuller serves as Principal Investigator of the Health Sciences
Libraries and Information Center contract from the NLM to serve as the
Regional Medical Library for the Pacific Northwest (Alaska, Idaho,
Montana, Washington and Oregon);
Principal Investigator, Telemakus:
Mining and Mapping Research Findings to Promote Knowledge Discovery
in Aging funded by the Ellison Medical Foundation; Co-Investigator of
Biomedical Applications of the Next Generation Internet (NGI): Patient-
centric Tools for Regional Collaborative Cancer Care Using the NGI funded
by the National Library of Medicine; Co-Investigator of an International
Health and Biomedical Research and Training grant from the Fogarty
International Center; and advisor to a Health Resources and Services
Administration (HRSA) grant to explore models of Faculty Leadership in
Interprofessional Education to Promote Patient Safety.
Dr. Fuller has served as a member of the President's (White House)
Information Technology Advisory Committee and the Board of Regents of
the National Library of Medicine and on the Boards of the American
Medical Informatics Association and the Medical Library Association. She
is an elected fellow of the American College of Medical Informatics.
York. Dr. Friedman is involved in advancing biomedical informatics, text
mining, and knowledge management research and education. Dr. Friedman
was a conference co-chair for the Natural Language Track of the 2002
Pacific Symposium in Bioinformatics, the Workshop in Biomedicine in the
2002 and 2003 Association for Computational Linguistics Conferences, and
the 2004 BioLink Workshop in the Human Language Technology Conference
of the North American Chapter of the Association for Computational
Linguistics. Dr. Friedman is a member of the Board of Scientific Counselors
of the National Library of Medicine, is a member-at-large of the Executive
Board of the American College of Medical Informatics, is on the Editorial
Boards of the Journal of Biomedical Informatics and the Journal of the
Association of Medical Informatics, and is a reviewer for numerous journals
associated with bioinformatics. Dr. Friedman has been a guest editor of
special issues of the Journal of Biomedical Informatics, has published over
100 articles on NLP, has co-authored a book on natural language processing
(NLP), and is the author of various chapters on NLP.
William Hersh, M.D. is Professor and Chair of the Department of Medical
Informatics & Clinical Epidemiology in the School of Medicine at Oregon
Health &
Editorial Board of five scientific journals. He is also a member of the
program committee of the Text Retrieval Conference (TREC) and currently
chairs TREC's Genomics Track. Dr. Hersh also serves as Associate Director
of the OHSU Evidence-Based Practice Center funded by the Agency for
Healthcare Research and Quality. Dr. Hersh's work in medical informatics
education is equally well-known. He serves as Director of OHSU's
educational programs in biomedical informatics. He also teaches medical
informatics to medical students, nursing students, and internal medicine
residents.
AUTHORS' BIOGRAPHIES
Daniel Berleant, PhD, received the B.S. degree in
1982. After practicing in the software engineering field,
he received the MS (1990) and PhD (1991) degrees from
the University of Texas at Austin. He then developed a
research program in text mining and interaction and in
inference under severe uncertainty. In 1999 he accepted a
position at Iowa State University where he continues to
pursue research on text mining and text interaction, as
well as uncertainty quantification and software
engineering. He has advised or co-advised 24 master's theses and six PhD
students who have either graduated or are in progress. He has authored over
50 refereed papers and book chapters.
Hong Kong. He received his PhD degree in
Management Information Systems from the University
of Arizona and a Bachelor's degree in Computer Science
(Information Systems) from the University of Hong
Kong. He was an active researcher in the Artificial
Intelligence Lab at the University of Arizona, where he
participated in several research projects funded by NSF,
NIH, NIJ, and DARPA.
Christopher G. Chute, MD, DrPH, received his
undergraduate and medical training at Brown University,
internal medicine residency at Dartmouth, and doctoral
training in Epidemiology at Harvard. He is Board
Certified in Internal Medicine, and a Fellow of the
American College of Physicians, the American College
of Epidemiology, and the American College of Medical
Informatics. He became Head of the Section of Medical
Information Resources at Mayo Foundation in 1988 and
is now Professor and Chair of Biomedical Informatics. As a career scientist
at Mayo, Dr. Chute's NIH and AHCPR/AHRQ funded research in medical
concept representation, clinical information retrieval, and patient data
repositories has been widely published. He is Vice-chair of the ANSI
Health Information Standards Board, Convener of Healthcare Concept
Representation WG3 within the ISO Health Informatics Technical
Committee, chair-elect of the US delegation to ISO TC215 for Health
Ted Cooper, MD, Clinical Associate Professor,
Department of Ophthalmology, Stanford University,
received his MD and completed a residency in
ophthalmology at the George Washington University.
He is a fellow of the American College of Medical
Informatics and the American Academy of
Ophthalmology. As National Director for
Confidentiality and Security he helped guide Kaiser
Permanente's response to HIPAA. He has lectured
widely on information assurance. He has participated in a number of health
informatics activities including director and chairperson of the
Computer-based Patient Record Institute. He is currently the chairperson of
the Health Information and Systems Society Privacy and Security Steering
Committee, a member of the Health Information and Systems Society Electronic
Health Record Steering Committee, and chairperson of The CPRI Toolkit:
Managing Information Security in Health Care Work Group.
Julie Dickerson, PhD, received her B.S. degree
from the University of California, San Diego and her
MS and PhD degrees from the University of Southern
California. She is currently an Associate Professor of
University. His research interests include systems biology, genetic network
modeling and inference, microarray data analysis, signal processing and
pattern recognition.
Shauna Eggers is a computer programmer at the
University of Arizona Artificial Intelligence Lab. She
earned a B.S. in Computer Science and a B.A. in
Linguistics and German Studies from the University of
Arizona in May 2004. Her research interests include
natural language processing for biomedical applications
and knowledge visualization.
Millicent Eidson, MA, DVM, DACVPM
(Epidemiology) is State Public Health Veterinarian and
Director of the Zoonoses Program, New York State
Department of Health. She is also an Associate
Professor in the Department of Epidemiology,
University at Albany School of Public Health.
Dr. Eidson previously served as an Epidemic Intelligence
Service (EIS) Officer with the Centers for Disease
Control and Prevention based at the National Cancer