5 Tricky SQL Queries Solved
Explaining the approach to solving a few complex SQL queries.
By Saiteja Kura, Web Development, Data Science, and NLP
SQL(Structured Query Language) is a very important tool in a data scientist’s toolbox. Mastering SQL is not only essential in an interview point of view, but a good understanding of SQL by being able to solve complex queries will keep us above everyone in the race.
In this article, I will talk about 5 tricky questions I found and my approaches to solve them.
Note — Each query can be written in different ways. Try to think about the approach before moving on to my solutions. You can also suggest different approaches in the response section.
We are given a table consisting of two columns, Name, and Profession. We need to query all the names immediately followed by the first letter in the profession column enclosed in parenthesis.
SELECT CONCAT(Name, ’(‘, SUBSTR(Profession, 1, 1), ’)’) FROM table;
Since we need to combine the name and profession we can use
CONCAT. We also need to have only one letter inside the parenthesis. Hence we will use
SUBSTRand pass the column name, start index, end index. Since we need only the first letter we will pass 1,1(start index is inclusive and the end index is not inclusive)
Tina was asked to compute the average salary of all employees from the EMPLOYEES table she created but realized that the zero key in her keyboard is not working after the result showed a very less average. She wants our help in finding out the difference between miscalculated average and actual average.
We must write a query finding the error( Actual AVG — Calculated AVG).
SELECT AVG(Salary) - AVG(REPLACE(Salary, 0, ’’)) FROM table;
A point to note here is that we have only one table that consists of actual salary values. To create the error scenario we use
REPLACE to replace 0’s. We will pass the column name, value to be replaced, and the value with which we will replace the
REPLACE method. Then we find the difference in averages using the aggregate function
We are given a table, which is a Binary Search Tree consisting of two columns Node and Parent. We must write a query that returns the node type ordered by the value of nodes in ascending order. There are 3 types.
- Root — if the node is a root
- Leaf — if the node is a leaf
- Inner — if the node is neither root nor leaf.
Upon initial analysis, we can conclude that if a given node N has its corresponding P-value as NULL it is the root. And for a given Node N if it exists in the P column it is not an inner node. Based on this idea let us write a query.
SELECT CASE WHEN P IS NULL THEN CONCAT(N, ' Root') WHEN N IN (SELECT DISTINCT P from BST) THEN CONCAT(N, ' Inner') ELSE CONCAT(N, ' Leaf') END FROM BST ORDER BY N asc;
We can use
CASE which acts as a switch function. As I mentioned if P is null for a given node N then N is the root. Hence we used
CONCAT for combining the node value and label. Similarly, if a given node N is in column P it is an inner node. To get all nodes from column P we wrote a subquery which returns all the distinct nodes in column P. Since we were asked to order the output by node values in ascending order we used the
ORDER BY Clause.
We are given a transaction table that consists of transaction_id, user_id, transaction_date, product_id, and quantity. We need to query the number of users who purchased products on multiple days(Note that a given user can purchase multiple products on a single day).
To solve this query, we cannot directly count the occurrence of user_id’s and if it is more than one return that user_id because a given user can have more than one transaction on a single day. Hence if a given user_id has more than one distinct date associated with it means he purchased products on multiple days. Following the same approach, I wrote a query. (Inner query)
SELECT COUNT(user_id) FROM ( SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(DISTINCT DATE(date)) > 1 ) t1
Since the question asked for the number of user_ids and not the user_id’s itself we use
COUNT in the outer query.
We are given a subscription table which consists of subscription start and end date for each user. We need to write a query that returns true/false for each user based on the overlapping of dates with other users. For instance, If user1's subscription period overlaps with any other user the query must return True for user1.
Upon initial analysis, we understand that we must compare every subscription against every other one. Let us consider start and end dates of userA as
endA, similarly for userB,
startB then we can say the two date ranges overlap. Let us take two examples. Let us compare U1 AND U3 first.
startA = 2020–01–01
endA = 2020–01–31
startB = 2020–01–16
endB = 2020–01–26
Here we can see
startA(2020–01–01) is less than
endB(2020–01–26) and similarly, endA(2020–01–31) is greater than
startB(2020–01–16) and hence can conclude that the dates overlap. Similarly, if you compare U1 and U4 the above condition fails and will return false.
We must also ensure that a user is not compared to his own subscription. We also want to run a left join on itself to match a user with each other user that satisfies our condition. We will create two replicas s1 and s2 of the same table now.
SELECT * FROM subscriptions AS s1 LEFT JOIN subscriptions AS s2 ON s1.user_id != s2.user_id AND s1.start_date <= s2.end_date AND s1.end_date >= s2.start_date
Given the conditional join, a user_id from s2 should exist for each user_id in s1 on the condition where there exists an overlap between the dates.
We can see there exists another user for each user in case the dates overlap. For user1 there are 2 rows indicating that he matches with 2 users. For user 4 the corresponding id is null indicating that he does not match with any other user.
Wrapping it all together now, we can group by the s1.user_id field and just check if any value exists true for a user where s2.user_id IS NOT NULL.
SELECT s1.user_id , (CASE WHEN s2.user_id IS NOT NULL THEN 1 ELSE 0 END) AS overlap FROM subscriptions AS s1 LEFT JOIN subscriptions AS s2 ON s1.user_id != s2.user_id AND s1.start_date <= s2.end_date AND s1.end_date >= s2.start_date GROUP BY s1.user_id
We used the
CASE clause to label 1 and 0 depending on the s2.user_id value for a given user. The final output looks like this -
Before concluding, I would like to suggest a good book on SQL which I thoroughly enjoyed and found very useful.
Mastering SQL requires lots of practice. In this article, I took 5 tricky questions and explained the approaches to solve them. The specialty of SQL is that each query can be written in many different ways. Do feel free to share your approaches in the responses. I hope you learned something new today!
If you would like to get in touch, connect with me on LinkedIn.
Bio: Saiteja Kura is sincere, friendly, and ambitious, and is interested in Web Development, Data Science, and NLP.
Original. Reposted with permission.
- Feature Engineering in SQL and Python: A Hybrid Approach
- Modern Data Science Skills: 8 Categories, Core Skills, and Hot Skills
- Introduction to Geographical Time Series Prediction with Crime Data in R, SQL, and Tableau