From Research to Riches: Data Wrangling Lessons from Physical and Life Science
With a background in bioinformatics, Christian discusses his recent transition to the world of data science and the learning curve associated with this dynamic field.
The Diary of a Data Scientist blog documents the journey into the world of predictive analytics where Christian and his colleagues write about machine learning in an up-close and personal look into data science.
By Christian Kendall, Salford Systems.
Employing analogies to relate data science problems to experiments in physical and life science gives you access to the intuitive understanding and critical thinking you would apply when measuring concrete attributes with physical equipment. Comparing day-to-day operations in data science with research in more traditional scientific fields yields a few lessons on how to collect and use data:
- Data is everywhere and you don’t need a laboratory to get it
- Big data makes it harder to design a good experiment
- “Garbage in, garbage out” does not apply when data can be cleaned and tools can be fixed (but you may still have to throw them out)
- Sources of data can be treated as black boxes in order to apply problem solving strategies for managing physical measurements and equipment
- You can solve problems more effectively with a mental inventory of your black boxes
Coming to data science from a research background, I was impressed by how the diverse ecosystem of problems and solutions can evoke pure scientific thinking to frame questions, to measure aspects of real scenarios, and to develop actionable analyses. On the other hand, I was surprised by the lack of standardized tools and approaches to problems.
In day-to-day practice, heady technical concepts associated with “data” and “science” take a backseat to operational and practical questions like, “Where is the data? Is this data accessible and useful? Can I process the data in a reasonable time frame?”
These questions seem straightforward enough. In fact, they are obvious and necessary starting points for any analysis. However, data scientists spend a lot more time on these questions than many people realize. It’s not all about writing backwards on glass, artificial intelligence, or even the final construction of models.
We all encounter data in our jobs and day-to-day lives, but we don’t always think about where we get data or how we perceive and interact with it. Most of us find it easy to picture data in the form of a plot and to think of the information on the plot as some number of tangible things versus time or distance (perhaps a science-y, but still physical, thing like x-ray intensity).
I find that comparing and contrasting data science with more conventional scientific research helps to introduce a concrete understanding of data through physical examples.
While an x-ray crystallographer must set up a complex experiment to get a sense of molecular shapes and arrangements with radiation, we record tons of information on radiation from visible light in pictures every day. You may or may not think of pictures as data or a collection of pixel intensity values, but entire fields of image processing and intelligent computer vision have been built around using pictures as data. We are surrounded by data and often miss the presence and value of information readily available to us outside of the laboratory. While we can often get data through less sophisticated means than a highly technical experiment, we do not have the defined equations and approaches that an x-ray crystallographer has to follow. A data scientist must use tools like machine learning to find patterns, functions, or meaning. We have to build several models to find the best approach for the current problem and data at hand. Drawing some analogies between physical experiments and the questions and information that many professionals encounter on a daily basis helps to develop an appreciation of what our data is and how much we have.
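To make the "pictures as data" idea concrete, here is a minimal sketch in Python. The tiny grayscale image below is made up for illustration, and the helper names `mean_intensity` and `bright_pixel_count` are hypothetical, not taken from any image library. A real photo is the same kind of grid of intensity values, only much larger.

```python
# A made-up 4x4 grayscale "picture": each number is a pixel intensity
# from 0 (black) to 255 (white). This is illustrative data, not a real photo.
image = [
    [0,  0,   10,  20],
    [0,  200, 220, 20],
    [10, 210, 230, 10],
    [0,  10,  0,   0],
]


def mean_intensity(pixels):
    """Average brightness over every pixel in the grid."""
    values = [value for row in pixels for value in row]
    return sum(values) / len(values)


def bright_pixel_count(pixels, threshold=128):
    """Count pixels strictly brighter than the (assumed) threshold."""
    return sum(1 for row in pixels for value in row if value > threshold)


if __name__ == "__main__":
    print("mean intensity:", mean_intensity(image))
    print("bright pixels:", bright_pixel_count(image))
```

Even these two summary numbers are measurements of the kind a scientist might record from an instrument, which is the sense in which an ordinary picture is already a dataset.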
In physical or life science the question of “where is the data?” or “what does the data tell us?” really lies in more technical and field-specific questions. You have a question and need to probe information from a physical system. To this end you must learn about (or design) an experimental method or instrument to measure a particular phenomenon. The data format, type, volume, and error are all defined by the methods and standards employed in your field of study or dictated by the design of your experiment.
Simply put, big data is defined by uncertainty in every aspect that a more traditional scientist controls when designing an experiment. A data scientist must apply similar scientific reasoning to draw conclusions from data, but they usually cannot choose what measurements are available, minimize error, or eliminate variations to inspect the direct impacts of particular variables. When data scientists deal with big data problems, their “big” data is commonly defined by the “three V’s”: volume, variety, and velocity. Depending on who you ask, there may be a fourth V, “veracity,” referring to quality or uncertainty in the data (this infographic from IBM gives some good examples of the “V’s”).
Big data can span a huge range in size and dimensions (volume) and can comprise video, text, whole files, and measurements of diverse abstract and concrete attributes (variety). In today’s world, tons of new data is generated every day and streams at staggering rates from sensors, cameras, or web and media platforms (velocity). The different types and sources alone can cause problems when defining error measurements (veracity), and difficulties with quantifying sources of uncertainty like clerical errors, falsified reports, and even sarcastic tweets can confound the cleverest attempts at error analysis.
A scientist generally has a precise idea of what data type they are dealing with (changes in voltage, a number of cells, etc.) and will usually incorporate only a few different measures in any experiment or publication. An experienced scientist will understand how much data they need to study a particular phenomenon, for example, a number of experimental replicates or a large enough time frame to observe the event of interest. A good scientist constantly assesses error, whether instrumental or experimental, and will quantify it, plot it, and describe its impact on interpretations if the report is ever to see the light of day. Scientists with more sensitive problems pay close attention to the resolution of their data, which can relate to space (e.g. nanometer scale), information (e.g. base pair resolution of a genetic sequence), or time (e.g. having a sufficient sampling rate or frequency to measure extremely fast molecular changes). Someone with these kinds of problems and experiments may not only understand the volume and velocity of their data but may also try to optimize the resolution and error of their data.
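As a rough illustration of what "quantifying error" from experimental replicates can look like, the sketch below computes the mean, sample standard deviation, and standard error of the mean using Python's standard library. The function name `summarize_replicates` and the example readings are assumptions made up for this illustration, and the standard error is only one of many ways a scientist might report uncertainty.

```python
import math
import statistics


def summarize_replicates(measurements):
    """Return (mean, sample standard deviation, standard error of the mean).

    Assumes the values are repeated measurements of the same quantity.
    """
    n = len(measurements)
    if n < 2:
        raise ValueError("need at least two replicates to estimate error")
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)  # sample standard deviation (n - 1)
    sem = sd / math.sqrt(n)              # standard error of the mean
    return mean, sd, sem


if __name__ == "__main__":
    # Made-up replicate readings, e.g. three repeats of a voltage measurement.
    readings = [10.0, 12.0, 14.0]
    mean, sd, sem = summarize_replicates(readings)
    print(f"mean={mean:.2f}  sd={sd:.2f}  sem={sem:.2f}")
```

The standard error shrinks as the number of replicates grows, which is one reason an experienced scientist plans how many replicates to collect before starting. With big data of uncertain origin, it is often unclear which records count as replicates of the same measurement at all.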
Bio: Christian Kendall is a Data Scientist at Salford Systems. He brings more than 4 years of research expertise, with a background in physical and life science emphasizing informatics and software development.