5 Steps for Advanced Data Analysis using Visualization
In most scientific research, the large volume of experimental data means that statistical analysis is typically done by technical experts in computing and statistics. Unfortunately, these experts are rarely experts in the underlying research, which may cause gaps in the analysis. Giving the researchers themselves easy-to-use tools and methods for handling and analysing data is likely to enrich the research outcome.
By Carl Johan Ivarsson, Qlucore.
A common challenge affecting many scientists, especially those working in molecular biology, is the vast amount of data created by their experiments. With such a large volume of data to consider, they need software tools to interpret it effectively.
Until now, computer software designed for this purpose has focused on handling ever larger amounts of data and, to a large extent, on applying standard statistical methods through a user interface aimed at technical specialists. As a result, the scientist's or researcher's own ability to approach and interpret the data has partly been set aside, and much data analysis can only be performed by specialist bioinformaticians and biostatisticians. In most cases, however, this model has several drawbacks, since it is typically the scientist who knows the most about the specific area being studied.
Visualization in combination with well-selected algorithms and methods can overcome some of the described challenges and allow a broader range of users to explore and analyze data. The active use of Visualization techniques provides a powerful way of identifying important structures and patterns very quickly, and gives the user feedback that is easy to understand. Visualization is also an important tool from an organizational point of view, since it stimulates innovation by enabling more scientists to analyze and discuss data and results.
We recommend a five-step method to ensure repeatable and significant results when using Visualization to identify new subgroups and patterns in data. This kind of analysis can serve several purposes. The most common is to try to identify completely new groups or patterns in data. Another is to explore the data and confirm that only expected patterns are present, which is a good form of quality control. By applying this five-step method, it is possible to investigate large and complex data sets without being an expert in statistics. The method is described below in more detail, but some basics need to be in place at the start. The method can be applied to any type of high-dimensional data; examples from the life-science industry include RNA-seq, gene expression arrays, proteomics, DNA methylation and metabolomics.
Step 1: Reduce to lower dimensions
First of all, the high dimensional data needs to be reduced to lower dimensions so that it can be plotted in 3D. We recommend the use of Principal Component Analysis (PCA) for this purpose. Tools to color data to enhance the information are also required, as well as filters and tools to select and deselect parts of the data set.
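As a rough illustration of the reduction described above, the following minimal sketch projects a data matrix onto its first three principal components using NumPy. The function name `pca_project`, the synthetic data and the two sample groups are assumptions made for this example; they are not part of any Qlucore product.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project samples (rows) onto the first principal components.

    Returns the projected coordinates and the fraction of the total
    variance captured by each retained component.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)            # center each variable
    # The rows of vt are the principal axes of the centered data
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T        # sample coordinates
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:n_components]

# Synthetic stand-in for a data set: 40 samples x 500 variables, with two
# hypothetical groups that differ in the first 50 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
groups = np.repeat([0, 1], 20)
X[groups == 1, :50] += 3.0

coords, explained = pca_project(X)
print(coords.shape)   # (40, 3): one 3D point per sample
```

The three columns of `coords` can be passed to any 3D scatter plot, with `groups` (or any other annotation) used to color the points.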
At this stage, researchers can begin the five-step Visualization process by detecting and removing the strongest signal present in the active dataset. Once this signal is identified, it can be removed in order to see whether any other obscured (but still detectable) signals are present. Removing a strong signal will usually reduce the number of active samples, the number of active variables (features), or both.
Step 2: Assess signal to noise ratio
Step two of the process is to assess the signal-to-noise ratio in the data by using PCA, the Projection Score and randomization. The Projection Score indicates the strength of the visually detected signal or pattern.
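The idea of combining PCA with randomization can be sketched as follows. This is a simplified stand-in and not the exact Projection Score: it compares the fraction of variance captured by the first three components with the average fraction obtained after shuffling each variable independently. The function names and the synthetic data are assumptions made for this example.

```python
import numpy as np

def variance_fraction(data, n_components=3):
    """Fraction of total variance captured by the first principal components."""
    centered = data - data.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return np.sum(s[:n_components] ** 2) / np.sum(s ** 2)

def randomization_score(data, n_components=3, n_permutations=20, seed=0):
    """Observed variance fraction minus its average under randomization.

    Each variable (column) is shuffled independently across samples. This
    keeps every variable's own distribution but destroys any structure
    shared between variables.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    observed = variance_fraction(data, n_components)
    null = []
    for _ in range(n_permutations):
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        null.append(variance_fraction(shuffled, n_components))
    return observed - float(np.mean(null))

# Hypothetical data: pure noise, and the same noise with a group effect added
rng = np.random.default_rng(1)
noise = rng.normal(size=(40, 500))
structured = noise.copy()
structured[20:, :50] += 3.0

noise_score = randomization_score(noise)
structured_score = randomization_score(structured)
print(f"noise: {noise_score:.3f}  structured: {structured_score:.3f}")
```

A score close to zero suggests that the 3D picture shows little more than what random data would produce, while a clearly positive score suggests a real pattern.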
Step 3: Remove noise by variance filtering
Step three is to remove “noise” by variance filtering. If researchers can see a significant signal-to-noise ratio in their active dataset, they should try to remove some of the active variables that are most likely contributing to the noise. To find the right amount of variance filtering, the user can let the PCA visualization and the Projection Score guide the choice. Testing many different variance settings makes it easier to find clear patterns.
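A minimal sketch of variance filtering, assuming that "filtering" means keeping a chosen fraction of the variables with the highest variance. The function `variance_filter`, the helper `top3_fraction` and the list of settings are hypothetical names and values chosen for this example.

```python
import numpy as np

def variance_filter(data, keep_fraction):
    """Keep the given fraction of variables (columns) with the highest variance."""
    data = np.asarray(data, dtype=float)
    variances = data.var(axis=0, ddof=1)
    n_keep = max(1, int(round(keep_fraction * data.shape[1])))
    # Indices of the most variable columns, restored to original column order
    keep = np.sort(np.argsort(variances)[::-1][:n_keep])
    return data[:, keep], keep

def top3_fraction(data):
    """Fraction of total variance captured by the first three components."""
    s = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
    return np.sum(s[:3] ** 2) / np.sum(s ** 2)

# Hypothetical data: 40 samples x 500 variables, group effect in 50 variables
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 500))
X[20:, :50] += 3.0

# Try several filter settings and watch how the 3D picture sharpens
fractions = {}
for keep_fraction in (1.0, 0.5, 0.25, 0.1):
    filtered, kept = variance_filter(X, keep_fraction)
    fractions[keep_fraction] = top3_fraction(filtered)
    print(keep_fraction, filtered.shape[1], round(fractions[keep_fraction], 3))
```

In practice each setting would be judged by looking at the resulting PCA plot as well as at a summary number such as the one printed here.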
Step 4: Statistical tests
Step four offers the option of performing statistical tests, which can be applied at any or all of the other stages of the five-step process: during the initial analysis, when a step is repeated, at the end of a step, or not at all. The groups to test can be either predefined or a selection of those that were identified during the iterative process. (It is recommended to verify any discovered structure and groups in a second data set.)
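The article does not prescribe a particular test. As one possible illustration, the sketch below runs a two-group permutation test on the difference in group means for every variable. The function name, the group labels and the synthetic data are assumptions made for this example.

```python
import numpy as np

def permutation_test(data, labels, n_permutations=1000, seed=0):
    """Two-sided permutation p-value per variable for a difference in group means."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    def mean_diff(mask):
        return data[mask].mean(axis=0) - data[~mask].mean(axis=0)

    observed = np.abs(mean_diff(labels))
    exceed = np.zeros(data.shape[1])
    for _ in range(n_permutations):
        # Shuffle the group labels and count differences at least as extreme
        exceed += np.abs(mean_diff(rng.permutation(labels))) >= observed
    # Add one to numerator and denominator so p-values are never exactly zero
    return (exceed + 1) / (n_permutations + 1)

# Hypothetical data: 40 samples x 100 variables, real effect in the first 5
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 100))
labels = np.repeat([False, True], 20)
X[labels, :5] += 3.0

p_values = permutation_test(X, labels)
print(np.flatnonzero(p_values < 0.01))   # indices of variables with small p-values
```

When many variables are tested at once, some small p-values are expected by chance, so a multiple-testing correction and the second data set mentioned above both remain important.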
Step 5: Search for subgroups or clusters
The final step uses graphs to refine the search for subgroups or clusters. Connecting samples in networks or graphs, for example, makes it possible to move into higher dimensions (i.e. more than the three that can be represented in a 3D PCA plot), since the graph created in a sample plot is based on the distances in the space of all active variables, and can therefore provide more insight into the structure of the data.
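One simple way to build such a graph is to connect each sample to its nearest neighbours, with distances measured over all active variables rather than in the 3D projection. The sketch below assumes a k-nearest-neighbour graph and Euclidean distance; both choices, and the function names, are assumptions made for this example.

```python
import numpy as np

def knn_graph(data, k=3):
    """Connect each sample to its k nearest neighbours.

    Distances are Euclidean and computed over all variables.
    Returns a symmetric boolean adjacency matrix.
    """
    data = np.asarray(data, dtype=float)
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                 # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros(dist.shape, dtype=bool)
    rows = np.repeat(np.arange(len(data)), k)
    adj[rows, neighbours.ravel()] = True
    return adj | adj.T                             # treat edges as undirected

def connected_components(adj):
    """Label samples by connected component using a depth-first search."""
    labels = np.full(len(adj), -1)
    current = 0
    for start in range(len(adj)):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] != -1:
                continue
            labels[node] = current
            stack.extend(np.flatnonzero(adj[node] & (labels == -1)))
        current += 1
    return labels

# Hypothetical data: two well-separated groups of 15 samples, 50 variables
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 50))
groups = np.repeat([0, 1], 15)
X[groups == 1] += 5.0

adj = knn_graph(X, k=3)
components = connected_components(adj)
print(len(np.unique(components)), "connected component(s)")
```

Samples that end up in the same connected component are candidates for a subgroup; drawing the edges on top of the 3D PCA plot shows which apparent neighbours in the plot are also close in the full variable space.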
These five steps are then repeated until there are no more structures to be found.
Used in this way, Visualization becomes a powerful tool for researchers. Data can be visualized clearly, and scientists can easily identify interesting or significant results by themselves, without having to rely on specialist bioinformaticians and biostatisticians. Instead, the scientist can co-operate with the bioinformaticians to achieve even more interesting results.
Bio: Carl Johan Ivarsson is president at Qlucore. Qlucore started as a collaborative research project at Lund University, Sweden, supported by researchers at the Departments of Mathematics and Clinical Genetics, in order to address the vast amount of high-dimensional data generated with microarray gene expression analysis. As a result, it was recognised that an interactive scientific software tool was needed to conceptualise the ideas evolving from the research collaboration.
The basic concept behind the software is to provide a tool that can take full advantage of the most powerful pattern recogniser that exists – the human brain. The result is a core software engine that lets the user handle and filter data and at the same time instantly visualise it in 3D. This will aid the user in identifying hidden structures and patterns. Over the last four years major efforts have been made to optimise the early ideas and to develop a core software engine that is extremely fast, allowing the user to explore and analyse high-dimensional data sets with the use of a normal PC, interactively and in real time.
Qlucore was founded in early 2007 and the first product released was the “Qlucore Gene Expression Explorer 1.0”. The latest version of this software, now called Qlucore Omics Explorer, represents a major step forward with advanced statistics support, streamlined workflows for multiple data types, and a wide selection of presentation methods to aid the user. The presentation methods range from an innovative use of principal component analysis (PCA) to interactive heat maps and flexible scatter plots. All user action is at most two mouse clicks away. The company’s early customers are mainly from the Life-science and Biotech industries, but solutions for other industries are currently under development.