Bioinformatics: Managing Scientific Data (Hardcover)
Description
Life science data integration and interoperability is one of the most
challenging problems facing bioinformatics today. In the current age of the
life sciences, investigators have to interpret many types of information from
a variety of sources: lab instruments, public databases, gene expression
profiles, raw sequence traces, single nucleotide polymorphisms, chemical
screening data, proteomic data, putative metabolic pathway models, and many
others. Unfortunately, scientists cannot currently identify and access this
information easily because of the variety of semantics, interfaces, and data
formats used by the underlying data sources. Bioinformatics: Managing
Scientific Data tackles this challenge head-on by discussing the current
approaches and the variety of systems available to help bioinformaticians
with this increasingly complex issue. The heart of the book lies in the
collaborative efforts of eight bioinformatics teams, each describing its own
approach to data integration and interoperability. Each system receives its
own chapter, in which the lead contributors provide insight into the specific
problems the system addresses, why the particular architecture was chosen,
and the system's strengths and weaknesses. In closing, the editors provide
criteria for evaluating these systems that bioinformatics professionals will
find valuable.
Contents

1 Introduction
Zoe Lacroix and Terence Critchlow
  1.1 Overview
  1.2 Problem and Scope
  1.3 Biological Data Integration
  1.4 Developing a Biological Data Integration System
    1.4.1 Specifications
    1.4.2 Translating Specifications into a Technical Approach
    1.4.3 Development Process
    1.4.4 Evaluation of the System
  References

2 Challenges Faced in the Integration of Biological Information
Su Yun Chung and John C. Wooley
  2.1 The Life Science Discovery Process
  2.2 An Information Integration Environment for Life Science Discovery
  2.3 The Nature of Biological Data
    2.3.1 Diversity
    2.3.2 Variability
  2.4 Data Sources in Life Science
    2.4.1 Biological Databases Are Autonomous
    2.4.2 Biological Databases Are Heterogeneous in Data Formats
    2.4.3 Biological Data Sources Are Dynamic
    2.4.4 Computational Analysis Tools Require Specific Input/Output Formats and Broad Domain Knowledge
  2.5 Challenges in Information Integration
    2.5.1 Data Integration
    2.5.2 Meta-Data Specification
    2.5.3 Data Provenance and Data Accuracy
    2.5.4 Ontology
    2.5.5 Web Presentations
  Conclusion
  References

3 A Practitioner's Guide to Data Management and Data Integration in Bioinformatics
Barbara A. Eckman
  3.1 Introduction
  3.2 Data Management in Bioinformatics
    3.2.1 Data Management Basics
    3.2.2 Two Popular Data Management Strategies and Their Limitations
    3.2.3 Traditional Database Management
  3.3 Dimensions Describing the Space of Integration Solutions
    3.3.1 A Motivating Use Case for Integration
    3.3.2 Browsing vs. Querying
    3.3.3 Syntactic vs. Semantic Integration
    3.3.4 Warehouse vs. Federation
    3.3.5 Declarative vs. Procedural Access
    3.3.6 Generic vs. Hard-Coded
    3.3.7 Relational vs. Non-Relational Data Model
  3.4 Use Cases of Integration Solutions
    3.4.1 Browsing-Driven Solutions
    3.4.2 Data Warehousing Solutions
    3.4.3 Federated Database Systems Approach
    3.4.4 Semantic Data Integration
  3.5 Strengths and Weaknesses of the Various Approaches to Integration
    3.5.1 Browsing and Querying: Strengths and Weaknesses
    3.5.2 Warehousing and Federation: Strengths and Weaknesses
    3.5.3 Procedural Code and Declarative Query Language: Strengths and Weaknesses
    3.5.4 Generic and Hard-Coded Approaches: Strengths and Weaknesses
    3.5.5 Relational and Non-Relational Data Models: Strengths and Weaknesses
    3.5.6 Conclusion: A Hybrid Approach to Integration Is Ideal
  3.6 Tough Problems in Bioinformatics Integration
    3.6.1 Semantic Query Planning Over Web Data Sources
    3.6.2 Schema Management
  3.7 Summary
  Acknowledgments
  References

4 Issues to Address While Designing a Biological Information System
Zoe Lacroix
  4.1 Legacy
    4.1.1 Biological Data
    4.1.2 Biological Tools and Workflows
  4.2 A Domain in Constant Evolution
    4.2.1 Traditional Database Management and Changes
    4.2.2 Data Fusion
    4.2.3 Fully Structured vs. Semi-Structured
    4.2.4 Scientific Object Identity
    4.2.5 Concepts and Ontologies
  4.3 Biological Queries
    4.3.1 Searching and Mining
    4.3.2 Browsing
    4.3.3 Semantics of Queries
    4.3.4 Tool-Driven vs. Data-Driven Integration
  4.4 Query Processing
    4.4.1 Biological Resources
    4.4.2 Query Planning
    4.4.3 Query Optimization
  4.5 Visualization
    4.5.1 Multimedia Data
    4.5.2 Browsing Scientific Objects
  4.6 Conclusion
  Acknowledgments
  References

5 SRS: An Integration Platform for Databanks and Analysis Tools in Bioinformatics
Thure Etzold, Howard Harris, and Simon Beaulah
  5.1 Integrating Flat File Databanks
    5.1.1 The SRS Token Server
    5.1.2 Subentry Libraries
  5.2 Integration of XML Databases
    5.2.1 What Makes XML Unique?
    5.2.2 How Are XML Databanks Integrated into SRS?
    5.2.3 Overview of XML Support Features
    5.2.4 How Does SRS Meet the Challenges of XML?
  5.3 Integrating Relational Databases
    5.3.1 Whole Schema Integration
    5.3.2 Capturing the Relational Schema
    5.3.3 Selecting a Hub Table
    5.3.4 Generation of SQL
    5.3.5 Restricting Access to Parts of the Schema
    5.3.6 Query Performance to Relational Databases
    5.3.7 Viewing Entries from a Relational Databank
    5.3.8 Summary
  5.4 The SRS Query Language
    5.4.1 SRS Fields
  5.5 Linking Databanks
    5.5.1 Constructing Links
    5.5.2 The Link Operators
  5.6 The Object Loader
    5.6.1 Creating Complex and Nested Objects
    5.6.2 Support for Loading from XML Databanks
    5.6.3 Using Links to Create Composite Structures
    5.6.4 Exporting Objects to XML
  5.7 Scientific Analysis Tools
    5.7.1 Processing of Input and Output
    5.7.2 Batch Queues
  5.8 Interfaces to SRS
    5.8.1 The Web Interface
    5.8.2 SRS Objects
    5.8.3 SOAP and Web Services
  5.9 Automated Server Maintenance with SRS Prisma
  5.10 Conclusion
  References

6 The Kleisli Query System as a Backbone for Bioinformatics Data Integration and Analysis
Jing Chen, Su Yun Chung, and Limsoon Wong
  6.1 Motivating Example
  6.2 Approach
  6.3 Data Model and Representation
  6.4 Query Capability
  6.5 Warehousing Capability
  6.6 Data Sources
  6.7 Optimizations
    6.7.1 Monadic Optimizations
    6.7.2 Context-Sensitive Optimizations
    6.7.3 Relational Optimizations
  6.8 User Interfaces
    6.8.1 Programming Language Interface
    6.8.2 Graphical Interface
  6.9 Other Data Integration Technologies
    6.9.1 SRS
    6.9.2 DiscoveryLink
    6.9.3 Object-Protocol Model (OPM)
  6.10 Conclusions
  References

7 Complex Query Formulation Over Diverse Information Sources in TAMBIS
Robert Stevens, Carole Goble, Norman W. Paton, Sean Bechhofer, Gary Ng, Patricia Baker, and Andy Brass
  7.1 The Ontology
  7.2 The User Interface
    7.2.1 Exploring the Ontology
    7.2.2 Constructing Queries
    7.2.3 The Role of Reasoning in Query Formulation
  7.3 The Query Processor
    7.3.1 The Sources and Services Model
    7.3.2 The Query Planner
    7.3.3 The Wrappers
  7.4 Related Work
    7.4.1 Information Integration in Bioinformatics
    7.4.2 Knowledge Based Information Integration
    7.4.3 Biological Ontologies
  7.5 Current and Future Developments in TAMBIS
    7.5.1 Summary
  Acknowledgments
  References

8 The Information Integration System K2
Val Tannen, Susan B. Davidson, and Scott Harker
  8.1 Approach
  8.2 Data Model and Languages
  8.3 An Example
  8.4 Internal Language
  8.5 Data Sources
  8.6 Query Optimization
  8.7 User Interfaces
  8.8 Scalability
  8.9 Impact
  8.10 Summary
  Acknowledgments
  References

9 P/FDM Mediator for a Bioinformatics Database Federation
Graham J. L. Kemp and Peter M. D. Gray
  9.1 Approach
    9.1.1 Alternative Architectures for Integrating Databases
    9.1.2 The Functional Data Model
    9.1.3 Schemas in the Federation
    9.1.4 Mediator Architecture
    9.1.5 Example
    9.1.6 Query Capabilities
    9.1.7 Data Sources
  9.2 Analysis
    9.2.1 Optimization
    9.2.2 User Interfaces
    9.2.3 Scalability
  9.3 Conclusions
  Acknowledgment
  References

10 Integration Challenges in Gene Expression Data Management
Victor M. Markowitz, John Campbell, I-Min A. Chen, Anthony Kosky, Krishna Palaniappan, and Thodoros Topaloglou
  10.1 Gene Expression Data Management: Background
    10.1.1 Gene Expression Data Spaces
    10.1.2 Standards: Benefits and Limitations
  10.2 The GeneExpress System
    10.2.1 GeneExpress System Components
    10.2.2 GeneExpress Deployment and Update Issues
  10.3 Managing Gene Expression Data: Integration Challenges
    10.3.1 Gene Expression Data: Array Versions
    10.3.2 Gene Expression Data: Algorithms and Normalization
    10.3.3 Gene Expression Data: Variability
    10.3.4 Sample Data
    10.3.5 Gene Annotations
  10.4 Integrating Third-Party Gene Expression Data in GeneExpress
    10.4.1 Data Exchange Formats
    10.4.2 Structural Data Transformation Issues
    10.4.3 Semantic Data Mapping Issues
    10.4.4 Data Loading Issues
    10.4.5 Update Issues
  10.5 Summary
  Acknowledgments
  Trademarks
  References

11 DiscoveryLink
Laura M. Haas, Barbara A. Eckman, Prasad Kodali, Eileen T. Lin, Julia E. Rice, and Peter M. Schwarz
  11.1 Approach
    11.1.1 Architecture
    11.1.2 Registration
  11.2 Query Processing Overview
    11.2.1 Query Optimization
    11.2.2 An Example
    11.2.3 Determining Costs
  11.3 Ease of Use, Scalability, and Performance
  11.4 Conclusions
  References

12 A Model-Based Mediator System for Scientific Data Management
Bertram Ludascher, Amarnath Gupta, and Maryann E. Martone
  12.1 Background
  12.2 Scientific Data Integration Across Multiple Worlds: Examples and Challenges from the Neurosciences
    12.2.1 From Terminology and Static Knowledge to Process Context
  12.3 Model-Based Mediation
    12.3.1 Model-Based Mediation: The Protagonists
    12.3.2 Conceptual Models and Registration of Sources at the Mediator
    12.3.3 Interplay Between Mediator and Sources
  12.4 Knowledge Representation for Model-Based Mediation
    12.4.1 Domain Maps
    12.4.2 Process Maps
  12.5 Model-Based Mediator System and Tools
    12.5.1 The KIND Mediator Prototype
    12.5.2 The Cell-Centered Database and SMART Atlas: Retrieval and Navigation Through Multi-Scale Data
  12.6 Related Work and Conclusion
    12.6.1 Related Work
    12.6.2 Summary: Model-Based Mediation and Reason-Able Meta-Data
  Acknowledgments
  References

13 Compared Evaluation of Scientific Data Management Systems
Zoe Lacroix and Terence Critchlow
  13.1 Performance Model
    13.1.1 Evaluation Matrix
    13.1.2 Cost Model
    13.1.3 Benchmarks
    13.1.4 User Survey
  13.2 Evaluation Criteria
    13.2.1 The Implementation Perspective
    13.2.2 The User Perspective
  13.3 Tradeoffs
    13.3.1 Materialized vs. Non-Materialized
    13.3.2 Data Distribution and Heterogeneity
    13.3.3 Semi-Structured Data vs. Fully Structured Data
    13.3.4 Text Retrieval
    13.3.5 Integrating Applications
  13.4 Summary
  References

Concluding Remarks
  Summary
  Looking Toward the Future

Appendix: Biological Resources

Glossary

System Information
  SRS
  Kleisli
  TAMBIS
  K2
  P/FDM Mediator
  GeneExpress
  DiscoveryLink
  KIND

Index