6 Steps to Effective Data Preparation for Quality Conclusions
Data preparation is usually the most time-consuming part of a data analysis project. To get good results, follow the six steps here, starting with Understand the Business Needs, Get to Know the Data, and Wrangle, Munge, and Mash Up.
Garbage in, garbage out. In this age of big and unstructured data analytics, good data preparation is a must: without it, you risk invalid results or find yourself unable to run analyses that would benefit your business. Preparing data properly may take up to 80% of the time of a data analysis initiative. So, to optimize results, follow the six steps below.
Step 1: Understand the Business Needs
Ask! Get the ultimate beneficiaries of your data preparation to tell you what business insights or knowledge they want from the data available. Check that enterprise goals translate into appropriate business questions and key performance indicators (KPIs), which can then be mapped onto the data and analytics to be used. Don’t get sucked into a “proof of concept” project without a valid, useful business benefit.
Step 2: Get to Know the Data
Find out where the data lives and how it will be accessed, and whether it falls into the category of simple, diversified, big or complex data. These categories are determined by the overall volume of data and the number of tables. The data you need may be in Excel files, in a data warehouse, or in a CRM system. You’ll need the right credentials to access the data, and the right software and hardware resources to process it.
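A quick profile of a source is one way to start getting to know it. The sketch below is only an illustration: the sample CSV, its column names, and the `profile_rows` helper are all hypothetical, standing in for whatever export you pull from Excel, a warehouse, or a CRM system.

```python
import csv
import io

def profile_rows(rows):
    """Return the row count and, per column, how many values are blank."""
    rows = list(rows)
    columns = list(rows[0].keys()) if rows else []
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                missing[col] += 1
    return {"row_count": len(rows), "columns": columns, "missing": missing}

# Hypothetical sample standing in for a real export.
SAMPLE_CSV = """customer_id,region,revenue
1,North,1200
2,,950
3,South,
"""

profile = profile_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(profile)
```

Row counts and gaps per column give an early feel for volume and completeness before any heavier tooling is chosen.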
Step 3: Wrangle, Munge, and Mash Up
Time to take out the garbage. Identify or amend your data sources to ensure they are complete, accurate, and current. Determine if you must change the data, whether in terms of formats or statistically (handling outliers, dealing with non-standard distributions). If manual transformation is impractical (as in “Death by a Thousand Spreadsheets”), automated transformation via a specialized application is another option.
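To make the two kinds of change concrete, here is a minimal sketch of one format fix and one statistical check. The date formats, the sample values, and the 1.5 x IQR rule are assumptions chosen for illustration; your data may call for different formats or a different outlier test.

```python
from datetime import datetime
from statistics import quantiles

# Assumed input formats; extend to match what your sources actually contain.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def normalize_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

dates = ["2023-01-15", "15/02/2023", "20230303", "unknown"]
clean_dates = [normalize_date(d) for d in dates]
print(clean_dates)

amounts = [100, 102, 98, 101, 99, 103, 97, 5000]
flagged = iqr_outliers(amounts)
print(flagged)
```

Values that cannot be parsed come back as `None` rather than being guessed at, so they can be reviewed instead of silently corrupting the dataset.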
Step 4: Build Good Relationships
Business users constantly need to make new queries to help them react to changes in business markets and strategies. Suitable relationships between data sources must therefore be defined. This may mean joining tables in different ways, or making summary tables to maintain flexibility for ad hoc queries from users, while keeping dataset relationships manageable.
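As a rough illustration of joining tables and building a summary table, the sketch below uses two small in-memory tables. The table names, keys, and the `left_join` and `summarize` helpers are hypothetical; in practice this work is usually done in SQL or a data preparation tool.

```python
from collections import defaultdict

# Hypothetical tables: orders reference customers by customer_id.
customers = [
    {"customer_id": 1, "region": "North"},
    {"customer_id": 2, "region": "South"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 250.0},
    {"order_id": 11, "customer_id": 2, "amount": 100.0},
    {"order_id": 12, "customer_id": 1, "amount": 50.0},
    {"order_id": 13, "customer_id": 9, "amount": 75.0},  # no matching customer
]

def left_join(left, right, key):
    """Add the matching right-hand columns to each left-hand row, if any."""
    index = {row[key]: row for row in right}
    return [{**index.get(row[key], {}), **row} for row in left]

def summarize(rows, group_key, value_key):
    """Build a summary table: total of value_key per group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row.get(group_key, "UNKNOWN")] += row[value_key]
    return dict(totals)

joined = left_join(orders, customers, "customer_id")
revenue_by_region = summarize(joined, "region", "amount")
print(revenue_by_region)
```

Keeping unmatched orders under an explicit `UNKNOWN` group, rather than dropping them, makes broken relationships visible in the summary.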
Step 5: Load and Reload Data
Depending on your needs and choices, the target for loading may range from a single flat file to a data warehouse with fact and dimension tables. You may also choose to overwrite existing loaded data, or maintain a fixed window (the last 12 months, for instance) of loaded data. These choices will also depend on available hardware and software resources for analyzing large datasets.
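A fixed window can be sketched as a simple filter applied at load time. The version below makes one assumption worth stating: "the last 12 months" is taken to mean the 12 full calendar months before the load date plus the current partial month. The `order_date` field and the sample rows are hypothetical.

```python
from datetime import date

def months_back(day, months):
    """Return the first day of the month that is `months` months before day's month."""
    total = day.year * 12 + (day.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)

def keep_window(rows, as_of, months=12, date_key="order_date"):
    """Keep rows dated from the window start up to and including as_of."""
    cutoff = months_back(as_of, months)
    return [row for row in rows if cutoff <= row[date_key] <= as_of]

rows = [
    {"order_id": 1, "order_date": date(2023, 2, 28)},  # too old
    {"order_id": 2, "order_date": date(2023, 3, 1)},   # first day of window
    {"order_id": 3, "order_date": date(2024, 3, 10)},
    {"order_id": 4, "order_date": date(2024, 4, 1)},   # after the load date
]

window = keep_window(rows, as_of=date(2024, 3, 15))
print([row["order_id"] for row in window])
```

Whether the window should be anchored to calendar months or to an exact day count is a business decision, so it is worth confirming with the people who will query the data.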
Step 6: Check
Check your data preparation. After loading, ensure your data preparation activities are leading to sensible results: run test calculations to make sure the results are consistent with existing metrics. And always check that you are keeping close to the business requirements driving the project.
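One way to run such test calculations is to recompute a few figures from the loaded data and compare them with numbers the business already trusts. This is a minimal sketch: the metric names, the reference values, and the 0.1% tolerance are assumptions made up for the example.

```python
import math

def reconcile(computed, expected, rel_tol=0.001):
    """Return the metrics whose computed value differs from the expected one."""
    mismatches = {}
    for name, target in expected.items():
        actual = computed.get(name)
        if actual is None or not math.isclose(actual, target, rel_tol=rel_tol):
            mismatches[name] = (actual, target)
    return mismatches

# Hypothetical loaded data and reference figures from an existing report.
loaded_amounts = [250.0, 100.0, 50.0, 75.0]
computed = {
    "total_revenue": sum(loaded_amounts),
    "order_count": len(loaded_amounts),
}
expected = {"total_revenue": 475.0, "order_count": 5}

mismatches = reconcile(computed, expected)
print(mismatches)
```

Here the revenue total agrees but the order count does not, which is the kind of discrepancy that should send you back to an earlier step before anyone relies on the results.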
All Done? Not Quite…
The six steps above form a cycle you are likely to go through more than once, as markets, needs, and data sources change. So, “set it and forget it” is unlikely to apply here. But looking on the bright side, data preparation requirements could also mean you’ll never be out of a job!