Data mining
A variety of software products and services are being
introduced to analyze complex biological and chemical data in an intuitive and
efficient manner.
Introduction
Data mining has been defined as "the nontrivial
extraction of implicit, previously unknown, and potentially useful information
from data"1. In areas other than the life sciences and healthcare, data
mining is a huge industry, with more than a hundred companies providing a vast
array of software products and services to clients that obtain, generate, and
rely on large quantities of data. The industries that rely daily on data mining
for a number of their functions include marketing, manufacturing, database
providers, government, the travel industry, banking and the financial industry,
telecommunications, and engineering, among others. The common theme is that
these industries all have truly massive amounts of information—about their
operations and also about their clients—collected in a variety of ways. In
order to maximize the usefulness of this information, they rely on software
that helps glean specific patterns and trends from the data, in addition to
making predictions and offering simulations of future events.
It should come as no surprise that the biopharmaceutical
industry is increasingly employing a variety of data-mining methodologies to
help it deal with the enormous amounts of biological information of various
forms that the industry collects. With information ranging from annotated databases of disease profiles and molecular pathways to sequences, structure–activity relationships (SAR), chemical structures of combinatorial compound libraries, and individual and population clinical trial results, the industry is inundated with data, and data mining is the centerpiece of advanced methodologies to help it cope with this information overload2 (see Competitive business intelligence, pp. 5–6).
The technology
Data mining uses so-called machine learning and also
statistical and visualization methodologies to discover and represent knowledge
in a form that is easily understood by humans. The objective is to reduce
complexity and extract, or mine, as much relevant and useful information from a
large data set as possible. It is important not to confuse data mining in the biopharmaceutical industry with bioinformatics, which typically focuses more on sequence-based extraction of specific patterns or motifs and on specific pattern matching (see Bioinformatics, pp. 31–34).
The biopharmaceutical industry is generating more chemical
and biological screening data than it knows what to do with or how best to
handle. As a result, deciding which target and lead compound to develop further
is often a long and arduous task. Any technology that reduces the
"noise" in the system and makes better use of the vast reams of
information collected would represent a significant competitive advantage. One contributor to this inefficiency is the existing software for analyzing and interpreting chemical and biological information, which by all accounts has not kept pace with the development of new discovery methodologies. For
example, software currently used by medicinal chemists to analyze screening
results presents data for individual compounds either in the form of tables, or
by showing SAR correlations in tables of structures that are not user-friendly
in terms of helping design compounds for further testing. Enter data mining,
which aims at nothing less than helping to make sense of these complex data
sets in an intuitive and efficient manner.
Current state
Table 1 lists selected companies that offer specific
data-mining products and services tailored to the biopharmaceutical industry.
The specific examples illustrate the breadth of data-mining applications, and
also how they differ from more traditional bioinformatics ones. For example,
Chiron Informatics focuses on healthcare delivery systems and is developing
decision-support, data mining–based products and services to facilitate the
implementation of comprehensive medical management.
Table 1: Selected companies with data-mining products and
services
Another example, Lexical Technology, founded in 1984,
specializes in the development of lexically based products and services for
healthcare vendors and enterprises. Lexical's Metaphrase software product
family was developed with contributions from collaborators that include the
National Library of Medicine, the National Cancer Institute, Kaiser Permanente, the Mayo Clinic, and the American College of Physicians.
Bioreason is a new company with proprietary chemoinformatic
knowledge discovery and data-mining software to help identify
structure–activity relationships in large quantities of relevant data.
Finally, Columbus Molecular Software develops and markets
its LeadScope software for visualizing, browsing, and interpreting chemical and
biological screening data, thus accelerating the extraction of information that
helps validate targets and leads for further preclinical or clinical
development.
It is interesting to note that although data-mining
companies that specialize in the biopharmaceutical sector are relatively few
(again, excluded here are traditional bioinformatics companies), more than 100
general data-mining companies serve other industries, with very significant
revenues2.
Methodologies and applications
Data-mining applications are being developed using
essentially six major approaches, which lend themselves to different types of
biological data analysis. The first approach is generically known as
influence-based mining. Here, complex and granular (as opposed to linear) data
in large databases are scanned for influences between specific data sets, and
this is done along many dimensions and in multi-table formats. These systems
find applications wherever there are significant cause-and-effect relationships
between data sets—as occurs, for example, in large multivariate gene expression studies, which underpin areas such as pharmacogenomics.
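As a minimal illustration of the idea (not any vendor's actual product), influence-based mining can be sketched as scanning every pair of data sets for strong pairwise relationships; the gene names and expression values below are hypothetical:

```python
from itertools import combinations

def pearson(xs, ys):
    # Pearson correlation between two equal-length numeric series
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical expression levels for three genes across five samples
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated
}

# Scan every pair of data sets for strong influences
influences = [
    (a, b, round(pearson(expression[a], expression[b]), 2))
    for a, b in combinations(expression, 2)
]
strong = [(a, b, r) for a, b, r in influences if abs(r) > 0.9]
```

A real system would do this across many dimensions and tables at once, but the core operation is the same pairwise scan.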
A variant of influence-based mining is the method
generically referred to as affinity-based mining. Again, large and complex data
sets are analyzed across multiple dimensions, and the data-mining system
identifies data points or sets that tend to be grouped together. These systems
differentiate themselves by providing hierarchies of associations and showing
any underlying logical conditions or rules that account for the specific
groupings of data. This approach is particularly useful in biological motif analysis, where it is important to distinguish "accidental" or incidental motifs from those with biological significance.
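The affinity-based approach can be sketched, under the simplifying assumption that each observation is a set of co-occurring motifs, as counting which items tend to be grouped together; the motif names here are hypothetical:

```python
from itertools import combinations
from collections import Counter

# Hypothetical sets of motifs observed together in individual sequences
observations = [
    {"zinc_finger", "nls"},
    {"zinc_finger", "nls", "coiled_coil"},
    {"coiled_coil"},
    {"zinc_finger", "nls"},
]

# Count how often each pair of motifs co-occurs across observations
pair_counts = Counter(
    pair
    for obs in observations
    for pair in combinations(sorted(obs), 2)
)

# A simple affinity rule: pairs appearing together in at least
# half the observations are unlikely to be accidental groupings
threshold = len(observations) / 2
rules = {pair for pair, n in pair_counts.items() if n >= threshold}
```

Production systems go further by organizing such rules into hierarchies and exposing the logical conditions behind each grouping.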
Yet another approach is generically referred to as
time-delay data mining. Here, the data set is not available immediately and in
complete form, but is collected over time. The systems designed to handle such
data look for patterns that are confirmed or rejected as the data set increases
and becomes more robust. This approach is geared toward long-term clinical
trial analysis and multicomponent mode of action studies, for example.
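A minimal sketch of the time-delay idea, assuming hypothetical trial batches arriving over time, is to update a pattern's support incrementally and watch the estimate stabilize as the data set grows:

```python
def update_support(counts, batch, pattern):
    """Incrementally track how often `pattern` holds as data arrives."""
    hits, total = counts
    hits += sum(1 for record in batch if pattern(record))
    total += len(batch)
    return hits, total

# Hypothetical trial batches collected over time: 1 = responder, 0 = not
batches = [[1, 1, 0], [1, 1, 1, 0], [1, 1, 1]]

counts = (0, 0)
history = []
for batch in batches:
    counts = update_support(counts, batch, lambda r: r == 1)
    hits, total = counts
    history.append(round(hits / total, 2))
# The estimated response rate is revised as the data set
# becomes more robust, confirming or rejecting the pattern
```
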
In the fourth approach, trends-based data mining, the
software analyzes large and complex data sets in terms of any changes that
occur in specific data sets over time. The data sets can be user-defined, or
the system can uncover them itself. Essentially, the system reports on anything
that is changing over time. This is especially important in cause-and-effect
biological experiments. Screening is a good example, where responses over time
to particular drugs or other stimuli are being collected for analysis. The
software is designed specifically for this purpose, and can identify multiple
trends very efficiently.
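The trends-based approach can be sketched as classifying each data set by its net change over time; the compound names and response values below are hypothetical:

```python
def trend(series):
    """Classify a series as 'rising', 'falling', or 'flat' by net change."""
    net = sum(b - a for a, b in zip(series, series[1:]))
    if net > 0:
        return "rising"
    if net < 0:
        return "falling"
    return "flat"

# Hypothetical screening responses measured at successive time points
responses = {
    "compound_1": [0.1, 0.3, 0.6, 0.9],
    "compound_2": [0.8, 0.6, 0.5, 0.2],
    "compound_3": [0.4, 0.4, 0.4, 0.4],
}

# Report anything that is changing over time, across all data sets
report = {name: trend(vals) for name, vals in responses.items()}
```
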
The fifth approach is generically known as comparative data
mining, and it focuses on overlaying large and complex data sets that are
similar to each other and comparing them. This is particularly useful in all
forms of clinical trial meta-analyses, where data collected at different sites
over different time periods, and perhaps under similar but not always identical
conditions, need to be compared. Here, the emphasis is on finding
dissimilarities, not similarities.
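A minimal sketch of comparative mining, with hypothetical endpoint measurements from two trial sites, overlays the two data sets and reports only the dissimilarities:

```python
def dissimilarities(site_a, site_b, tolerance=0.1):
    """Overlay two similar result sets and report measurements
    that disagree by more than `tolerance`."""
    shared = site_a.keys() & site_b.keys()
    return sorted(
        key for key in shared
        if abs(site_a[key] - site_b[key]) > tolerance
    )

# Hypothetical endpoint measurements from two trial sites,
# collected under similar but not identical conditions
site_a = {"endpoint_1": 0.52, "endpoint_2": 0.31, "endpoint_3": 0.77}
site_b = {"endpoint_1": 0.50, "endpoint_2": 0.45, "endpoint_3": 0.75}

flagged = dissimilarities(site_a, site_b)
```

The tolerance parameter stands in for the judgment a real system must encode about which between-site differences are meaningful.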
Finally, data mining falls somewhat short if it cannot also offer a framework for making simulations, predictions, and forecasts based on the data sets it has analyzed. So-called predictive data mining combines pattern matching, influence relationships, time-set correlations, and dissimilarity analysis to offer simulations of future data sets. One advantage here is that these systems are capable of incorporating entire data sets into their workings, not just samples, which makes their accuracy significantly higher. Predictive data mining is often used in clinical trial analysis and in structure–function correlations.
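As a toy stand-in for predictive data mining (real systems combine several of the methods above), a least-squares line fitted to an entire data set, rather than a sample, can simulate the next data point; the dose and response values are hypothetical:

```python
def fit_line(xs, ys):
    # Least-squares fit over the entire data set, not a sample
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical dose levels and measured responses
doses = [1, 2, 3, 4]
responses = [2.0, 4.1, 5.9, 8.0]

slope, intercept = fit_line(doses, responses)
forecast = slope * 5 + intercept  # simulate the response at the next dose
```
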
The future
A key application of data mining is the protein-folding process and the derivation of structure–function relationships. Here, related fields, such as machine learning developed in engineering, are making significant contributions that are producing software better able to handle this complex task, and we will see a lot more of these approaches in the future3.
In addition, data mining methodologies will be increasingly
applied to the extraction of information not just from biological data, such as
sequences, but also from the scientific literature itself. With the increase in
electronic publications, there is an opportunity and a need to develop
automated ways of searching and summarizing the literature. A recent report
describes the use of automated keyword extraction to produce up-to-date entries
on human inherited diseases from the OMIM (Online Mendelian Inheritance in Man)
database4.
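The automated keyword extraction described in that report can be sketched, in much-simplified form, as ranking the most frequent non-stopword terms in a document; the stopword list and abstract text here are invented for illustration:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real systems use curated vocabularies
STOPWORDS = {"the", "of", "and", "in", "a", "is", "to", "with", "are"}

def keywords(text, k=3):
    """Return the k most frequent non-stopword terms in a document."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

# A hypothetical abstract on an inherited disease
abstract = (
    "Mutations in the gene are associated with the inherited disease. "
    "The disease phenotype varies with the mutations observed."
)
top_terms = keywords(abstract)
```

Systems such as the one applied to OMIM layer controlled medical vocabularies and context rules on top of this basic frequency idea.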
Another major development for the future is the application
of data mining to clinical information databases, such as heart disease
databases. The methodology here can help identify patients at higher risk for heart disease and therefore promises significant preventive potential5.
Finally, data mining methods are used to improve
computer-assisted drug design, by using techniques such as genetic algorithms
and others to detect chemical entity features that occur in clusters within the
high-dimensional analytical data of drug design experiments6. This type of
cluster analysis helps optimize the search for relevant new drug structures and
therefore has major importance for the industry.
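A much-simplified sketch of the cluster-analysis idea (real systems apply genetic algorithms and related techniques over high-dimensional descriptors) groups compounds by a single hypothetical descriptor value using a greedy distance threshold:

```python
def cluster(points, radius=1.0):
    """Greedy single-pass clustering: assign each point to the first
    cluster whose seed lies within `radius`, else start a new cluster."""
    clusters = []
    for p in sorted(points):
        for c in clusters:
            if abs(p - c[0]) <= radius:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Hypothetical one-dimensional descriptor values for screened compounds;
# features occurring in clusters suggest related chemical entities
descriptors = [0.2, 0.3, 0.25, 5.1, 5.0, 9.8]
groups = cluster(descriptors)
```
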
Conclusions
The explosive growth of biological data generation and
availability has shifted bottlenecks in drug development from the discovery
phase to the high-throughput analysis phase. Here, humans alone cannot sift through the vast data sets that are being generated. Data mining is emerging within the biopharmaceutical industry as a significant ally in this effort, in a way that complements and expands traditional bioinformatics. Eventually, data
mining and bioinformatics will be indistinguishable, but for the time being
they are distinct. It is important to remember that this is an example of a
technology that has been successfully deployed in many other industries whose data
requirements are similar to those of the biopharmaceutical industry. Data
mining has met with very significant success in these other industries, and it
is expected that in the next few years it will contribute significantly in the
optimization of the data analysis process of the biopharmaceutical industry as
well.