5 Concepts Every Data Scientist Should Know
As a Data Scientist, you will apply certain skills every day of your career. Some of these might be common techniques you learned during your education, while others may develop fully only after you become more established in your organization. Continuing to hone these skills will provide you with valuable professional benefits.
By Matthew Przybyla, Senior Data Scientist at Favor Delivery.
Photo by Romson Preechawit on Unsplash.
I have written about common skills that Data Scientists can expect to use in their professional careers, so now I want to highlight some key concepts of Data Science that are beneficial to know and later employ. Some of these you may know already and some you may not; my goal is to explain, from a professional perspective, why these concepts are beneficial regardless of what you know now. Multicollinearity, one-hot encoding, undersampling and oversampling, error metrics, and, lastly, storytelling are the key concepts I think of first when I picture a professional Data Scientist in their day-to-day work. The last one is perhaps a combination of a skill and a concept, but I still wanted to highlight its importance in your everyday work life as a Data Scientist. I will expound upon all of these concepts below.
Multicollinearity
Photo by The Creative Exchange on Unsplash.
Although the word is somewhat long and hard to say, when you break it down, multicollinearity is simple. Multi means many, and collinearity means linearly related. Multicollinearity can be described as the situation in which two or more explanatory variables in a regression model explain similar information or are highly related. There are a few reasons this concept can raise a concern.
 For some modeling techniques, it can cause overfitting and, ultimately, a decline in model performance.
 The data becomes redundant, and not every feature or attribute is needed in your model.
Therefore, there are some ways to find out which features contribute to multicollinearity and should be removed:
 variance inflation factor (VIF)
 correlation matrices
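Both techniques are commonly used among Data Scientists. Correlation matrices and plots, usually visualized with a heatmap of some sort, are especially popular, while VIF is lesser-known.
The higher the VIF value, the more that feature is explained by the other features, and the less usable it is for your regression model. A VIF of 1 means the feature is uncorrelated with the others; a common rule of thumb treats values above roughly 5 or 10 as a sign of problematic multicollinearity.
A great, simple resource for VIF is "Variance Inflation Factor" on Statistics How To.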
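To make the two checks concrete, here is a minimal sketch that assumes NumPy is available and uses synthetic data. The function name `correlation_and_vif` and the features `x1`, `x2`, `x3` are made up for illustration. It relies on the standard identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix.

```python
import numpy as np

def correlation_and_vif(X):
    """Return (correlation matrix, VIF per column) for a 2-D feature array."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    # VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
    # of the inverse of the feature correlation matrix.
    vif = np.diag(np.linalg.inv(corr))
    return corr, vif

# Synthetic data (an assumption for this sketch, not from the article).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)             # unrelated feature
X = np.column_stack([x1, x2, x3])

corr, vif = correlation_and_vif(X)
print(np.round(corr, 2))
for name, value in zip(["x1", "x2", "x3"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

In this setup, x1 and x2 should show a correlation close to 1 and large VIF values, while x3 should stay near a VIF of 1, which suggests dropping either x1 or x2.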
One-Hot Encoding
One-hot encoding is a form of feature transformation that represents your categorical features numerically. Whereas the categorical features hold text values, one-hot encoding pivots that information so that each distinct value becomes its own feature, and each observation (row) is marked with either a 0 or a 1 for it. For example, if we have the categorical variable gender, the numerical representation after one-hot encoding would look like this (gender before, and male/female after):
Before and after one-hot encoding. Screenshot by Author.
This transformation is useful when your data is not purely numerical and you need a numerical representation of text/categorical features.
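The gender example can be sketched in a few lines of plain Python. The helper name `one_hot_encode` is hypothetical and exists only for this illustration; in practice, `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would typically handle this.

```python
def one_hot_encode(values):
    """One-hot encode a list of category labels into rows of 0/1 columns."""
    categories = sorted(set(values))  # one output column per distinct value
    return [
        {category: int(value == category) for category in categories}
        for value in values
    ]

gender = ["male", "female", "female", "male"]
for before, after in zip(gender, one_hot_encode(gender)):
    print(before, "->", after)
```

Each row ends up with exactly one column set to 1, the column matching its original text value.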
Sampling
When you do not have enough data for one of your classes, oversampling may be suggested as a form of compensation. Say you are working on a classification problem and you have a minority class like the example below:
class_1 = 100 rows
class_2 = 1000 rows
class_3 = 1100 rows
As you can see, class_1 has far fewer rows than the other classes, which means your dataset is imbalanced; class_1 is referred to as the minority class. There are several oversampling techniques. One of them is called SMOTE, which stands for Synthetic Minority Over-sampling Technique. SMOTE works by using a k-nearest-neighbors method: for a minority sample, it finds the nearest minority-class neighbors and creates synthetic samples between them. There are similar techniques that take the reverse approach and undersample the larger classes instead.
These techniques are beneficial when you have outliers in your classification, or even regression, data and you want to ensure that your sample is the best representation of the data that your model will run on in the future.
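The interpolation idea behind SMOTE can be sketched with the standard library alone. This is a simplified illustration, not the reference algorithm: the function name `smote_like`, its parameters, and the toy points are all assumptions, and a maintained implementation such as `SMOTE` in the imbalanced-learn package is the better choice for real work.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base within the minority class
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy minority class with two numeric features (illustrative values).
class_1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_rows = smote_like(class_1, n_new=6, k=2, seed=1)
print(len(class_1) + len(new_rows), "rows after oversampling")
```

With the counts from the example above, bringing class_1 from 100 rows up to the 1100 rows of class_3 would mean generating 1000 synthetic rows.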
Error Metrics
There are plenty of error metrics used for both classification and regression models in Data Science. From the scikit-learn library, here are some that you can use specifically for regression models:
 metrics.explained_variance_score
 metrics.max_error
 metrics.mean_absolute_error
 metrics.mean_squared_error
 metrics.mean_squared_log_error
 metrics.median_absolute_error
 metrics.r2_score
 metrics.mean_poisson_deviance
 metrics.mean_gamma_deviance
Two of the most popular error metrics for regression are MSE and RMSE, both based on metrics.mean_squared_error from the list above:
MSE: the concept is → mean squared error regression loss (sklearn)
RMSE: the concept is → the square root of the mean squared error, expressed in the same units as the target
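To show how these regression metrics relate, here is a small hand-rolled sketch of MAE, MSE, and RMSE using only the standard library. The function name `regression_errors` and the sample values are assumptions for illustration; scikit-learn's `metrics.mean_absolute_error` and `metrics.mean_squared_error` compute the first two for you.

```python
import math

def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equal-length lists of numbers."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n  # mean absolute error
    mse = sum(e * e for e in errors) / n   # mean squared error
    rmse = math.sqrt(mse)                  # root mean squared error
    return {"MAE": mae, "MSE": mse, "RMSE": rmse}

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
result = regression_errors(y_true, y_pred)
print(result)  # MAE is 0.5 and MSE is 0.375 for these values
```

Because MSE squares each error, the single prediction that is off by 1.0 dominates it, while RMSE brings the result back to the units of the target.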
For classification, you can expect to evaluate your model’s performance with accuracy and AUC (Area Under the Curve).
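As a rough illustration of the two classification metrics, the sketch below computes accuracy and AUC by hand for a binary problem. The function names and sample scores are assumptions for this example. AUC is computed here through its ranking interpretation: the share of positive/negative pairs in which the positive example gets the higher score, with ties counted as half. In practice, scikit-learn's `metrics.accuracy_score` and `metrics.roc_auc_score` are the usual tools.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]               # predicted probability of class 1
y_pred = [int(s >= 0.5) for s in scores]     # threshold of 0.5 is a choice
print(accuracy(y_true, y_pred))  # 0.75
print(roc_auc(y_true, scores))   # 0.75
```

Accuracy depends on the chosen threshold, whereas AUC only looks at how well the scores rank positives above negatives.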
Storytelling
Photo by Nong Vang on Unsplash.
I wanted to add one unique concept of Data Science: storytelling. I cannot stress enough how important this concept is. It can be seen as a concept or a skill, but the label is not important here. What matters is how well you articulate your problem-solving techniques in a business setting. A lot of Data Scientists will focus solely on model accuracy, but will then fail to understand the entire business process. That process includes:
 what is the business?
 what is the problem?
 why do we need Data Science?
 what is the goal of Data Science here?
 when will we get usable results?
 how can we apply our results?
 what is the impact of our results?
 how do we share our results and overall process?
As you can see, none of these points is the model itself or corresponds to an improvement in accuracy. The focus here is how you will use data to solve your company's problems. It is beneficial to become acquainted with the stakeholders and non-technical coworkers whom you will ultimately be working with. You will also work with Product Managers, who will assess the problem alongside you, and with Data Engineers, who will collect the data before you even run a base model. At the end of your model process, you will share your results with key individuals, who will usually want to see the impact in some type of visual representation (Tableau, a Google Slides deck, etc.), so being able to present and communicate is beneficial as well.
Original. Reposted with permission.