A Layman’s Guide to Data Science. Part 2: How to Build a Data Project

In Part 2 of our Guide to Data Science, we outline the steps to build your first data science project: how to ask good questions to understand the data, how to prepare the data, how to develop an MVP, how to iterate toward a good product, and, finally, how to present your project.

By Sciforce, software solutions based on science-driven information technologies.

In our blog, we often explore intricate connections between state-of-the-art technologies or the mesmerizing depth of a new technique. However, AI and data science are not only about bragging about exciting new methods that boost accuracy by 2% (which is a big gain), but about making data and technology work for you. They can help you increase sales, understand your customers, predict future faults in process lines, make an insightful presentation, submit a term project, or have a good time with your friends working on a new idea that will change the world. In this sense, everyone can, and to some extent should, become a data scientist.

In Part 1 of this guide, we discussed what makes a good data scientist and what you should learn before you set out on a real project. In this post, we’ll walk you through the process of building the backbone of a data project in simple steps.

Find a story behind an idea.

You have an excellent idea in your head: perhaps one you have cherished since childhood, such as a toy-cleaning robot, or one that just came to mind, such as reaching the customers in your shop by sending them fortune cookies with predictions based on their purchase preferences. However, to make your idea work, you need the attention of others. Find a compelling narrative for it; make sure that it has a hook or a captivating purpose, and that it is up-to-date and relevant. Finding the narrative structure will help you decide whether you actually have a story to tell.

Such a narrative will be the basis for your business model. Ask yourself: What is it that you develop, what resources do you need, and what value do you provide to the customer? For what values are customers going to pay?

A nice way to do this is the business model canvas. It’s simple and cheap: you can create it on a sheet of paper.

Prepare the data.

The first practical step is collecting data to fuel your project. Depending on your field and goals, you can search for ready-made datasets available on the Internet, such as, for example, this collection. You can also scrape data from websites or access data from social networks through public APIs. For the latter option, you need to write a small program, in the programming language you feel most comfortable with, that downloads data from social networks. If you prefer to run it in the cloud, you can spin up a simple AWS EC2 Linux instance (nano or micro) and run your software on it.
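As a minimal sketch of such a download program in Python, the snippet below fetches one page of posts and keeps only the fields of interest. The endpoint URL and the JSON field names (`posts`, `user`, `created_at`, and so on) are hypothetical; a real social network API will use its own names and will usually require authentication.

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint: replace it with the API of the network you use.
API_URL = "https://api.example.com/posts?query=datascience"


def parse_posts(raw_json):
    """Turn a JSON response body into a list of flat records."""
    payload = json.loads(raw_json)
    records = []
    for post in payload.get("posts", []):  # "posts" is an assumed key
        records.append({
            "person": post.get("user", ""),
            "timestamp": post.get("created_at", ""),
            "text": post.get("text", ""),
            "replies": post.get("replies", 0),
            "likes": post.get("likes", 0),
        })
    return records


def download_posts(url=API_URL, timeout=10):
    """Fetch one page of posts (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_posts(response.read().decode("utf-8"))
```

Keeping the parsing separate from the download makes it possible to check the parsing logic on a saved response without touching the network.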

The best way to store the data is to use a simple .csv format, with each line including the text and meta information, such as the person, timestamp, replies, and likes.
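One possible way to write such a file uses Python’s standard `csv` module. The column names below simply mirror the fields listed above; the function name `append_posts` is our own choice for illustration.

```python
import csv
import os

FIELDS = ["person", "timestamp", "text", "replies", "likes"]


def append_posts(path, posts):
    """Append posts to a .csv file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        # The csv module quotes text that contains commas or line breaks.
        writer.writerows(posts)
```

Appending instead of overwriting lets a long-running collector call the function after every download without losing earlier rows.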

As to the amount of data needed, the rule of thumb is to get as much data as possible in a reasonable time, for example, a few days of running your program. Another important consideration is to collect only as much data as the machine you are using for analytics can handle. How much data to get is not an exact science; it depends on the technical limitations and the question you have.

Finally, in collecting and managing data, it is crucial to avoid bias and not to be selective about the inclusion or exclusion of data. Such selectivity includes using discrete values when the data is continuous; how you deal with missing, outlier, and out-of-range values; arbitrary temporal ranges; and capped values, volumes, ranges, and intervals. Even if you are arguing to influence, your argument should be based on what the data says, not on what you want it to say.
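One way to keep such decisions visible is to count the problematic values before deciding what to do with them, instead of dropping them silently. The sketch below is only an illustration: the valid range of 0 to 100 and the sample values are assumptions made up for the example.

```python
def quality_report(values, low, high):
    """Count missing and out-of-range entries without dropping anything."""
    report = {"total": 0, "missing": 0, "out_of_range": 0, "kept": 0}
    for value in values:
        report["total"] += 1
        if value is None:
            report["missing"] += 1
        elif not (low <= value <= high):
            report["out_of_range"] += 1
        else:
            report["kept"] += 1
    return report


# Made-up sample: None marks a missing value.
likes = [3, None, 12, -1, 250, 7]
print(quality_report(likes, low=0, high=100))
# {'total': 6, 'missing': 1, 'out_of_range': 2, 'kept': 3}
```

Reporting these counts alongside your results shows the audience how much data was excluded and why.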

Choose the right tools.

To perform a valid analysis, you need the proper tools to explore the data you have collected. To make a choice, write down a list of the analytics features you think you need and compare the available tools. Sometimes you can use user-friendly graphical tools like Orange, RapidMiner, or KNIME. In other cases, you’ll have to write the analysis yourself in a language such as Python or R.

Prove your theory.

With the data and tools available, you can prove your theory. In data science, theories are statements of how the world is or should be, derived from axioms (assumptions about the world) or from preceding theories (Das, 2013). Models are implementations of the theory; in data science, they are often algorithms based on theories that are run on data. The results of running a model lead to a deeper understanding of the world based on theory, model, and data.

To assess your theory at an initial step, in line with the more general and conventional content analysis, you can pinpoint trends present in the data. One approach we use quite often is to select significant events that have been reported and then try to create an analytics process that finds these trends. If the analytics can find the trends you specified, you are on the right track. Next, look for instances where the analytics finds new trends, and confirm them, for instance, by searching the Internet. The results are not going to be reliable 100% of the time, so you’ll need to decide how many falsely reported trends (the error rate) you want to tolerate.
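The check described above can be sketched in a few lines of Python. Here the error rate is taken to be the share of reported trends that could not be verified; the function name, the trend labels, and the tolerance of 0.25 are all made up for illustration.

```python
def evaluate_trends(reported, known_events, confirmed_new=()):
    """Compare the trends found by the analytics with verified ones."""
    reported = set(reported)
    known_events = set(known_events)
    # Verified trends: events chosen in advance plus new ones confirmed by hand.
    verified = known_events | set(confirmed_new)
    false_reports = reported - verified
    missed_events = known_events - reported
    error_rate = len(false_reports) / len(reported) if reported else 0.0
    return {
        "error_rate": error_rate,
        "false_reports": sorted(false_reports),
        "missed_events": sorted(missed_events),
    }


result = evaluate_trends(
    reported=["election", "storm", "cup final", "celebrity rumor", "new phone"],
    known_events=["election", "storm", "strike"],
    confirmed_new=["cup final"],
)
MAX_ERROR_RATE = 0.25  # assumed tolerance; choose your own
print(result["error_rate"], result["error_rate"] <= MAX_ERROR_RATE)
# 0.4 False
```

In this made-up run, two of the five reported trends could not be confirmed, so the error rate of 0.4 exceeds the chosen tolerance, and one known event ("strike") was missed entirely.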

Build a minimum viable product.

When you have your business model and a proven theory, it is time to build the first version of your product, the so-called minimum viable product (MVP). Basically, this can be the first version that you offer to customers. As an MVP is a product with just enough features to satisfy early customers and to provide feedback for future development, it should focus only on the core functionality without any fancy solutions. Stick to simple functions that work from the beginning and expand your system later. At this stage, the system could look something like this:

Automate your system.

In principle, your focus should be on the future development of your product, not on system operation. For this, you need to automate as much as possible: uploading to S3, starting the analysis, or storing data. In this article, we discussed automation in more detail.

The other face of automation is logging. When everything is automated, you can feel that you are losing control over your system and do not know how it performs. Besides, you need to know what to develop next, both in terms of new features and fixing problems. For this, you need to set up a system for logging, monitoring, and measuring all meaningful data. For instance, you should log statistics for downloading your data and uploading it to S3, the duration of the analytics process, and the users’ behavior.
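A minimal sketch of an automated pipeline that logs how long each step takes, using only Python’s standard library, might look as follows. The step names and the `run_pipeline` helper are hypothetical; in a real system, each step would call your own download, analysis, and storage code.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")


@contextmanager
def timed(step, stats):
    """Log the duration of a step and record it in the stats dict."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        log.exception("%s failed", step)
        raise
    finally:
        stats[step] = time.perf_counter() - start
        log.info("%s took %.3f s", step, stats[step])


def run_pipeline(steps):
    """Run (name, function) pairs in order, passing each result onward."""
    stats = {}
    data = None
    for name, func in steps:
        with timed(name, stats):
            data = func(data)
    return data, stats


# Placeholder steps standing in for real download, analysis, and storage.
steps = [
    ("download", lambda _: [1, 2, 3]),
    ("analyze", lambda rows: sum(rows)),
    ("store", lambda total: total),
]
result, stats = run_pipeline(steps)
```

The collected `stats` can later be written to a file or sent to a monitoring tool, so that slow or failing steps stand out.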

There are multiple tools that help you monitor server statistics such as CPU, RAM, and network usage, as well as code-level performance and errors; many of them have a user-friendly interface.

Reiterate and expand.

You probably know that AI, Machine Learning, Data Science, and other new developments are all about iteration and fine-tuning. So, when you have your MVP running and automation and monitoring in place, you can start enhancing your system. It is time to get rid of weaknesses, optimize the overall performance and stability, and add new functions. Implementing new features will also allow you to offer new services or products.

Present your product.

Finally, when your product is ready, you need to present it to the customers. This is where the story behind your data and your business model come in handy.

First of all, think about your target audience. Who are your customers, and how are you going to sell your product to them? What does the audience you are going to present your product to know about the topic? The story needs to be framed around the level of information the audience already has, correct and incorrect:

  • Novice: first exposure to the subject, but doesn’t want oversimplification
  • Generalist: aware of the topic, but looking for an overview understanding and major themes
  • Managerial: in-depth, actionable understanding of intricacies and interrelationships with access to detail
  • Expert: more exploration and discovery and less storytelling with great detail
  • Executive: only has time to glean the significance and conclusions of weighted probabilities

Afterward, visualize your data and weave the trends, significance, and proportions on which you built your project into a narrative. Your story about the product should never end with a fixed event, but rather with a set of options or questions that trigger an action from the audience. Never forget that the goal of data storytelling is to encourage and energize critical thinking for business decisions or to motivate the audience to purchase your product.

Original. Reposted with permission.


Bio: SciForce is a Ukraine-based IT company specializing in the development of software solutions based on science-driven information technologies. We have wide-ranging expertise in many key AI technologies, including Data Mining, Digital Signal Processing, Natural Language Processing, Machine Learning, Image Processing, and Computer Vision.