Data mining
A variety of software products and services are being
introduced to analyze complex biological and chemical data in an intuitive and
efficient manner.
Introduction
Data mining has been defined as "the nontrivial
extraction of implicit, previously unknown, and potentially useful information
from data"1. In areas other than the life sciences and healthcare, data
mining is a huge industry, with more than a hundred companies providing a vast
array of software products and services to clients that obtain, generate, and
rely on large quantities of data. The industries that rely daily on data mining
for a number of their functions include marketing, manufacturing, database
providers, government, the travel industry, banking and the financial industry,
telecommunications, and engineering, among others. The common theme is that
these industries all have truly massive amounts of information—about their
operations and also about their clients—collected in a variety of ways. In
order to maximize the usefulness of this information, they rely on software
that helps glean specific patterns and trends from the data, in addition to
making predictions and offering simulations of future events.
It should come as no surprise that the biopharmaceutical
industry is increasingly employing a variety of data-mining methodologies to
help it deal with the enormous amounts of biological information of various
forms that the industry collects. With information ranging from annotated databases of disease profiles and molecular pathways to sequences, structure–activity relationships (SAR), chemical structures of combinatorial compound libraries, and individual and population clinical trial results, the industry is inundated with data, and data mining is the centerpiece of advanced methodologies to help it cope with this information overload2 (see Competitive business intelligence, pp. 5–6).
The technology
Data mining uses so-called machine learning and also
statistical and visualization methodologies to discover and represent knowledge
in a form that is easily understood by humans. The objective is to reduce
complexity and extract, or mine, as much relevant and useful information from a
large data set as possible. It is important not to confuse data mining in the biopharmaceutical industry with bioinformatics, which typically focuses more on sequence-based extraction of specific patterns or motifs and on specific pattern matching (see Bioinformatics, pp. 31–34).
The biopharmaceutical industry is generating more chemical
and biological screening data than it knows what to do with or how best to
handle. As a result, deciding which target and lead compound to develop further
is often a long and arduous task. Any technology that reduces the
"noise" in the system and makes better use of the vast reams of
information collected would represent a significant competitive advantage. One contributor to this inefficiency is the existing software for analyzing and interpreting chemical and biological information, which by all accounts has not kept pace with the development of new discovery methodologies. For
example, software currently used by medicinal chemists to analyze screening
results presents data for individual compounds either in the form of tables, or
by showing SAR correlations in tables of structures that are not user-friendly
in terms of helping design compounds for further testing. Enter data mining,
which aims at nothing less than helping to make sense of these complex data
sets in an intuitive and efficient manner.
Current state
Table 1 lists selected companies that offer specific
data-mining products and services tailored to the biopharmaceutical industry.
The specific examples illustrate the breadth of data-mining applications, and
also how they differ from more traditional bioinformatics ones. For example,
Chiron Informatics focuses on healthcare delivery systems and is developing
decision-support, data mining–based products and services to facilitate the
implementation of comprehensive medical management.
Table 1: Selected companies with data-mining products and
services
Another example, Lexical Technology, founded in 1984,
specializes in the development of lexically based products and services for
healthcare vendors and enterprises. Lexical's Metaphrase software product
family was developed with contributions from collaborators that include the
National Library of Medicine, the National Cancer Institute, Kaiser Permanente, the Mayo Clinic, and the American College of Physicians.
Bioreason is a new company with proprietary chemoinformatic
knowledge discovery and data-mining software to help identify
structure–activity relationships in large quantities of relevant data.
Finally, Columbus Molecular Software develops and markets
its LeadScope software for visualizing, browsing, and interpreting chemical and
biological screening data, thus accelerating the extraction of information that
helps validate targets and leads for further preclinical or clinical
development.
It is interesting to note that although data-mining
companies that specialize in the biopharmaceutical sector are relatively few
(again, excluded here are traditional bioinformatics companies), more than 100
general data-mining companies serve other industries, with very significant
revenues2.
Methodologies and applications
Data-mining applications are being developed using
essentially six major approaches, which lend themselves to different types of
biological data analysis. The first approach is generically known as
influence-based mining. Here, complex and granular (as opposed to linear) data
in large databases are scanned for influences between specific data sets, and
this is done along many dimensions and in multi-table formats. These systems
find applications wherever there are significant cause-and-effect relationships
between data sets—as occurs, for example, in large multivariate gene expression studies, which underpin areas such as pharmacogenomics.
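As a minimal illustration of the idea (not any vendor's actual product), influence-based mining can be sketched as scanning every pair of data sets for strong pairwise relationships; the gene names and expression values below are hypothetical:

```python
from itertools import combinations

def pearson(xs, ys):
    # Pearson correlation between two equal-length numeric series
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical expression levels for three genes across five samples
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated
}

# Scan every pair of data sets for strong influences
influences = [
    (a, b, round(pearson(expression[a], expression[b]), 2))
    for a, b in combinations(expression, 2)
]
strong = [(a, b, r) for a, b, r in influences if abs(r) > 0.9]
```

A real system would do this across many dimensions and tables at once, but the core operation is the same pairwise scan.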
A variant of influence-based mining is the method
generically referred to as affinity-based mining. Again, large and complex data
sets are analyzed across multiple dimensions, and the data-mining system
identifies data points or sets that tend to be grouped together. These systems
differentiate themselves by providing hierarchies of associations and showing
any underlying logical conditions or rules that account for the specific
groupings of data. This approach is particularly useful in biological motif analysis, where it is important to distinguish "accidental" or incidental motifs from those with biological significance.
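The affinity-based approach can be sketched, under the simplifying assumption that each observation is a set of co-occurring motifs, as counting which items tend to be grouped together; the motif names here are hypothetical:

```python
from itertools import combinations
from collections import Counter

# Hypothetical sets of motifs observed together in individual sequences
observations = [
    {"zinc_finger", "nls"},
    {"zinc_finger", "nls", "coiled_coil"},
    {"coiled_coil"},
    {"zinc_finger", "nls"},
]

# Count how often each pair of motifs co-occurs across observations
pair_counts = Counter(
    pair
    for obs in observations
    for pair in combinations(sorted(obs), 2)
)

# A simple affinity rule: pairs appearing together in at least
# half the observations are unlikely to be accidental groupings
threshold = len(observations) / 2
rules = {pair for pair, n in pair_counts.items() if n >= threshold}
```

Production systems go further by organizing such rules into hierarchies and exposing the logical conditions behind each grouping.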
Yet another approach is generically referred to as
time-delay data mining. Here, the data set is not available immediately and in
complete form, but is collected over time. The systems designed to handle such
data look for patterns that are confirmed or rejected as the data set increases
and becomes more robust. This approach is geared toward long-term clinical
trial analysis and multicomponent mode of action studies, for example.
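A minimal sketch of the time-delay idea, assuming hypothetical trial batches arriving over time, is to update a pattern's support incrementally and watch the estimate stabilize as the data set grows:

```python
def update_support(counts, batch, pattern):
    """Incrementally track how often `pattern` holds as data arrives."""
    hits, total = counts
    hits += sum(1 for record in batch if pattern(record))
    total += len(batch)
    return hits, total

# Hypothetical trial batches collected over time: 1 = responder, 0 = not
batches = [[1, 1, 0], [1, 1, 1, 0], [1, 1, 1]]

counts = (0, 0)
history = []
for batch in batches:
    counts = update_support(counts, batch, lambda r: r == 1)
    hits, total = counts
    history.append(round(hits / total, 2))
# The estimated response rate is revised as the data set
# becomes more robust, confirming or rejecting the pattern
```
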
In the fourth approach, trends-based data mining, the
software analyzes large and complex data sets in terms of any changes that
occur in specific data sets over time. The data sets can be user-defined, or
the system can uncover them itself. Essentially, the system reports on anything
that is changing over time. This is especially important in cause-and-effect
biological experiments. Screening is a good example, where responses over time
to particular drugs or other stimuli are being collected for analysis. The
software is designed specifically for this purpose, and can identify multiple
trends very efficiently.
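The trends-based approach can be sketched as classifying each data set by its net change over time; the compound names and response values below are hypothetical:

```python
def trend(series):
    """Classify a series as 'rising', 'falling', or 'flat' by net change."""
    net = sum(b - a for a, b in zip(series, series[1:]))
    if net > 0:
        return "rising"
    if net < 0:
        return "falling"
    return "flat"

# Hypothetical screening responses measured at successive time points
responses = {
    "compound_1": [0.1, 0.3, 0.6, 0.9],
    "compound_2": [0.8, 0.6, 0.5, 0.2],
    "compound_3": [0.4, 0.4, 0.4, 0.4],
}

# Report anything that is changing over time, across all data sets
report = {name: trend(vals) for name, vals in responses.items()}
```
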
The fifth approach is generically known as comparative data
mining, and it focuses on overlaying large and complex data sets that are
similar to each other and comparing them. This is particularly useful in all
forms of clinical trial meta-analyses, where data collected at different sites
over different time periods, and perhaps under similar but not always identical
conditions, need to be compared. Here, the emphasis is on finding
dissimilarities, not similarities.
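A minimal sketch of comparative mining, with hypothetical endpoint measurements from two trial sites, overlays the two data sets and reports only the dissimilarities:

```python
def dissimilarities(site_a, site_b, tolerance=0.1):
    """Overlay two similar result sets and report measurements
    that disagree by more than `tolerance`."""
    shared = site_a.keys() & site_b.keys()
    return sorted(
        key for key in shared
        if abs(site_a[key] - site_b[key]) > tolerance
    )

# Hypothetical endpoint measurements from two trial sites,
# collected under similar but not identical conditions
site_a = {"endpoint_1": 0.52, "endpoint_2": 0.31, "endpoint_3": 0.77}
site_b = {"endpoint_1": 0.50, "endpoint_2": 0.45, "endpoint_3": 0.75}

flagged = dissimilarities(site_a, site_b)
```

The tolerance parameter stands in for the judgment a real system must encode about which between-site differences are meaningful.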
Finally, data mining falls somewhat short if it cannot also offer a framework for making simulations, predictions, and forecasts based on the data sets it has analyzed. So-called predictive data mining combines pattern matching, influence relationships, time-set correlations, and dissimilarity analysis to offer simulations of future data sets. One advantage here is that these systems are capable of incorporating entire data sets into their workings, not just samples, which makes their accuracy significantly higher. Predictive data mining is often used in clinical trial analysis and in structure–function correlations.
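As a toy stand-in for predictive data mining (real systems combine several of the methods above), a least-squares line fitted to an entire data set, rather than a sample, can simulate the next data point; the dose and response values are hypothetical:

```python
def fit_line(xs, ys):
    # Least-squares fit over the entire data set, not a sample
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical dose levels and measured responses
doses = [1, 2, 3, 4]
responses = [2.0, 4.1, 5.9, 8.0]

slope, intercept = fit_line(doses, responses)
forecast = slope * 5 + intercept  # simulate the response at the next dose
```
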
The future
A key application of data mining is the protein-folding process and the derivation of structure–function relationships. Here, related fields, such as machine learning developed in engineering, are making significant contributions that are producing software better able to handle this complex task, and we will see a lot more of these approaches in the future3.
In addition, data mining methodologies will be increasingly
applied to the extraction of information not just from biological data, such as
sequences, but also from the scientific literature itself. With the increase in
electronic publications, there is an opportunity and a need to develop
automated ways of searching and summarizing the literature. A recent report
describes the use of automated keyword extraction to produce up-to-date entries
on human inherited diseases from the OMIM (Online Mendelian Inheritance in Man)
database4.
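The automated keyword extraction described in that report can be sketched, in much-simplified form, as ranking the most frequent non-stopword terms in a document; the stopword list and abstract text here are invented for illustration:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real systems use curated vocabularies
STOPWORDS = {"the", "of", "and", "in", "a", "is", "to", "with", "are"}

def keywords(text, k=3):
    """Return the k most frequent non-stopword terms in a document."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

# A hypothetical abstract on an inherited disease
abstract = (
    "Mutations in the gene are associated with the inherited disease. "
    "The disease phenotype varies with the mutations observed."
)
top_terms = keywords(abstract)
```

Systems such as the one applied to OMIM layer controlled medical vocabularies and context rules on top of this basic frequency idea.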
Another major development for the future is the application
of data mining to clinical information databases, such as heart disease
databases. The methodology here can help identify patients at higher risk for heart disease and therefore promises significant preventive potential5.
Finally, data mining methods are used to improve
computer-assisted drug design, by using techniques such as genetic algorithms
and others to detect chemical entity features that occur in clusters within the
high-dimensional analytical data of drug design experiments6. This type of
cluster analysis helps optimize the search for relevant new drug structures and
therefore has major importance for the industry.
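A much-simplified sketch of the cluster-analysis idea (real systems apply genetic algorithms and related techniques over high-dimensional descriptors) groups compounds by a single hypothetical descriptor value using a greedy distance threshold:

```python
def cluster(points, radius=1.0):
    """Greedy single-pass clustering: assign each point to the first
    cluster whose seed lies within `radius`, else start a new cluster."""
    clusters = []
    for p in sorted(points):
        for c in clusters:
            if abs(p - c[0]) <= radius:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Hypothetical one-dimensional descriptor values for screened compounds;
# features occurring in clusters suggest related chemical entities
descriptors = [0.2, 0.3, 0.25, 5.1, 5.0, 9.8]
groups = cluster(descriptors)
```
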
Conclusions
The explosive growth of biological data generation and
availability has shifted bottlenecks in drug development from the discovery
phase to the high-throughput analysis phase. Here, humans alone cannot sift through the vast data sets that are being generated. Data mining is emerging within the biopharmaceutical industry as a significant ally in this effort, in a way that complements and expands traditional bioinformatics. Eventually, data
mining and bioinformatics will be indistinguishable, but for the time being
they are distinct. It is important to remember that this is an example of a
technology that has been successfully deployed in many other industries whose data
requirements are similar to those of the biopharmaceutical industry. Data
mining has met with very significant success in these other industries, and it
is expected that in the next few years it will contribute significantly in the
optimization of the data analysis process of the biopharmaceutical industry as
well.