KDnuggets Top Blog Winner

The Not-so-Sexy SQL Concepts to Make You Stand Out

Databases are the houses of our data and data scientists HAVE TO HAVE A KEY! In this article, I discuss some lesser known concepts of SQL that data scientists do not familiarize themselves with.



SQL (Structured Query Language) is the programming language for managing data in a relational database. Most data scientists will use relational data and there is a somewhat small misconception that the job of database manipulation falls on the hands of the great and all powerful “DATA ENGINEER”. Practically speaking, as a data scientist, do you want to go to someone else every time you need to extract a piece of data?

In reality there are so many jobs for data scientists asking for SQL experience/skills but either we do not learn it in our academics or don’t bother with it to learn ourselves so we have more time for R and Python learning. Yes, it's true there's a finite amount of time in a day, week, month etc. but databases are the houses of our data and data scientists HAVE TO HAVE A KEY!

Sure, this is not gonna be groundbreaking but the fact that this is not as mainstream in the data science toolkit is holding people back from improving their skills and taking their data science game to the next level.

The Not-so-Sexy SQL Concepts to Make You Stand Out
 

In this article, I discuss some lesser known concepts of SQL that data scientists do not familiarize themselves with. I will use SQL Server as my choice of SQL. Please note that the syntax will change if you use MySQL, PostgreSQL etc.:

 

1. SQL View

 
Now, to give a quick summary of why you want this technique, Views will reduce your need to run the same script over-and-over again to access the same data. We are guilty of creating an SQL script and once we get it right, we just save it in a folder and then run it again. Why do this when we can just save it as a View and just use it as you would a SQL table.

The following SQL script shows a normal SELECT and JOIN statement that the average new SQL user or beginner SQL academic class would use. 

SELECT
    p.customer_name, 
    p.customer_address, 
    b.order_item
    b.order_qty
FROM
    customers.purchases p
INNER JOIN purchases.orders b 
        ON b.order_id = p.order_id;


Now, let's say you need to reference this table multiple times for your work and every time you find yourself rerunning the script. If this is a large table, you know it is more than likely that it will take a very long time. A View will save you time and let you reference this anytime you want and manipulate it almost like a normal table.

CREATE VIEW customers.orders
AS
SELECT
    p.customer_name, 
    p.customer_address, 
    b.order_item
    b.order_qty
FROM
    customers.purchases p
INNER JOIN purchases.orders b 
        ON b.order_id = p.order_id;


It is as simple as that to create a view. This is not storing or creating a new table exactly but it is enabling the user to create the previous table with a simple command as ‘SELECT * FROM customers.orders’. That’s all you have to do instead of having to save a file and rerun the script. Just open your SQL application and just run the relevant select statement.

 

2. Stored Procedures

 
A Stored Procedure is a piece of SQL code that you can save and execute when you want. It is another method of creating efficiencies in your SQL data manipulation workflow.

CREATE PROCEDURE spCustomerOrders
AS
BEGIN
SELECT
    p.customer_name, 
    p.customer_address, 
    b.order_item
    b.order_qty
FROM
    customers.purchases p
INNER JOIN purchases.orders b 
        ON b.order_id = p.order_id;
END;


To execute the above procedure, just type

EXEC spCustomerOrders;


Again such a simple method to create a little efficiency. This does look similar to a view but a different execution method. Which in essence is true to an extent, but the difference comes when you can specify parameters to your stored procedure. This works like a function or filter in SQL. I would say this is mostly used as a quick filter method than anything else. Of course there are many other possibilities now that you can add parameters. 

CREATE PROCEDURE spCustomerOrders(@ord_qty AS Int)
AS
BEGIN
SELECT
    p.customer_name, 
    p.customer_address, 
    b.order_item
    b.order_qty
FROM
    customers.purchases p
INNER JOIN purchases.orders b 
        ON b.order_id = p.order_id;
WHERE b.order_qty >= @ord_qty
END;


In the above example, we added the ‘@ord_qty’  parameter to the spCustomerOrders procedure. It always starts with ‘@’ in SQL Server and the ‘AS int’ is to specify the data type expected. In this circumstance, we use it in the WHERE clause to filter the customer and order details by the number of purchases. 

To execute the stored procedure, just simply type

EXEC spCustomerOrders 125;


The above command will execute the stored procedure to return the customer and order details for the number of orders above 125.

 

3. Scalar functions

  
As a statistician, I think there are a lot of people that can piece together the definition of a scalar function. But let me give you a definition anyway. In terms of SQL, a scalar function is a piece of code that takes one or more parameters to calculate and return a single value. We regularly use SQL to create data aggregations.

In my personal opinion, think of this as GROUP BY 2.0. Might be a bit of stretch to say that but essentially allows you to do some more complex calculations on data that may not be directly available to you by default on SQL.

CREATE FUNCTION customers.GrossProfit_per_Cust(
    @ord_qty INT,
    @price DEC(10,2),
    @cost_of_sales DEC(10,2)
)
RETURNS DEC(10,2)
AS 
BEGIN
    RETURN (@ord_qty * @price)-@cost_of_sales;
END;


The above function is used to find the gross profit per customer by taking the order quantity multiplied by the price and subtracting the cost of sales (This is not an accounting lesson so sorry if this is not correct!). 

In the most simplest way, you can use the function like this:

SELECT customers.GrossProfit_per_cust(12,50,200)


That seems fine but the real value comes from using it inside a typical SELECT statement to take your data aggregation to another level. 

SELECT 
p.customer_name, SUM(customers.GrossProfit_per_cust(b.ord_qty, b.price, b.cost_of_sales) gross_profit
FROM customers.purchases p
INNER JOIN purchases.orders b 
ON b.order_id = p.order_id;
GROUP BY p.customer_name
ORDER BY gross_profit DESC;


 

Concluding Remarks

 
These are 3 of many underutilized SQL concepts that are available to data scientist’s but not thought enough at an academic level and not in the priority list for many novice data scientists. Databases are now moving to the cloud and the data generated around the world is getting bigger and bigger. 

Database technology needs to keep up with the data to be as useful as possible and the data scientists of today and tomorrow need to be able to manipulate a database at a higher level than just having the most complicated piece of code as a mere GROUP BY. We need to improve processes, be faster and more agile.

 
 
Asel Mendis (@aselmendis) is a Data Scientist in Australia. He is originally from Sri Lanka and has been in the data space since 2019. As a Data Scientist, his objective is to use the relevant techniques and technology to create insights and value. He is a Contributing Editor at KDnuggets. He has interests in Location Intelligence, Spatial Analysis, Demographics, Machine Learning, Data Visualization and Statistics using R and Python. He has a Master of Analytics from the Royal Melbourne Institute of Technology specializing in Applied Statistics and as of 2021 is currently undertaking a PhD in Applied Statistics focusing on the topic of Black Spots on roads. He is also a member of the Statistical Society of Australia and holds the designation of Graduate Statistician (GStat) and is looking forward to becoming an Accredited Statistician (AStat.) in the future.