SQL Case Study: Helping a Startup CEO Manage His Data
In this tutorial, you will learn how to create a table, insert values into it, use and understand some data types, use SELECT statements, UPDATE records, use some aggregate functions, and more.
How many rows do exist in a table?
We would like to know how many persons in this startup. We do that by counting the number of rows using count(*):
SELECT COUNT(*) FROM startup;
How many rows with minimum values in a column?
We would like to know which persons in the startup have the lowest salary and how many of them. Let’s try this query:
SELECT * FROM startup WHERE salary = MIN(salary);
This will introduce an error which says:
ERROR: aggregate functions are not allowed in WHERE
Aggregate functions are like COUNT(), MIN(), MAX(), AVG(), SUM(), etc. which take values in a column as an input and returns a single value (or NULL) . Here MIN() is used after WHERE clause, so we can do that by checking the salary if it equals the minimum value (or not) which we can get from another query, not by aggregating it, like this:
SELECT MIN(salary) FROM startup;
But this will produce illogical value which is 1000$, so what happened?!
This is done because of the collation sequence, if you are interested in the ASCII collating orders you can see this link.
So, what should we do now?
Actually, I did something not efficient at the beginning of creating the table which is storing the column salary as varchar. The reason exists at the RidFilter‘s answer of this stackoverflow question.
But believe it or not, I dealt with some data like that. It includes dollar sign stored as varachar so let’s fix it to be able to do some operations on it.
Removing a character from a string:
This problem can be solved by first removing the dollar sign and then converting this varchar to numeric value which can be integer.
SELECT REPLACE(salary, '$', '') FROM startup;
This REPLACE() function will result in the values in salary column without the dollar sign (i.e. replacing the $ by nothing), but beware that this is not edited into the table. So, we need to use it next time we need to operate on it. Also, beware that this column is still string not numeric, so we need to cast it into decimal.
Now, we can apply the MIN() on the casted replaced values of the salaries:
SELECT MIN( CAST( REPLACE(salary, '$', '') as DECIMAL)) FROM startup;
The CAST() function converts the new salary column into decimal values.
Remember, we still see the salary with the dollar sign in the startup table.
To make it available in the table without the dollar sign, we use the UPDATE() function:
UPDATE startup SET salary = REPLACE(salary, '$', '');
We, now, return to our problem which is finding out the persons inside the startup who have the lowest salary.
SELECT * FROM startup WHERE CAST(salary AS DECIMAL) = 800;
We could use just one query with a subquery instead of the last two separate queries:
SELECT * FROM startup WHERE CAST( salary AS DECIMAL) = ( SELECT MIN( CAST(salary AS DECIMAL) ) FROM startup );
We can also count them using COUNT(*):
SELECT COUNT(*) FROM startup WHERE CAST(salary AS DECIMAL) = 800;
The three lowest paid:
We would like to get the three lowest paid engineers in the startup. We can do that by first querying all the rows using ORDER BY clause followed by ASC or just ORDER BY which will order the output ascendingly by default.
SELECT * FROM startup ORDER BY CAST( salary AS DECIMAL );
Adding to ‘LIMIT 3’ will bring the first three corresponding to the rows with the lowest salaries.
The three highest paid:
Notice what we changed!
SELECT * FROM startup ORDER BY CAST( salary AS DECIMAL ) DESC LIMIT 3;
Let’s say the CEO wants to know the total cost of the salaries, so we can do that using the aggregate function SUM():
SELECT SUM( CAST( salary AS DECIMAL ) ) FROM startup;
Adding a column to a table:
Let’s say, he wants now to hire females, so he would add another column named sex. This can be done using ALTER clause:
ALTER TABLE startup ADD sex char(1);
We should update each row like that:
UPDATE startup SET sex = 'M' WHERE id = 1;
Using for loop:
Of course, this is a cumbersome if we do it manually. That’s why we should use loops instead.
SQL does not have loops, but it can only be used inside a procedural language function or a ‘Do’ statement as answered here:
DO $do$ BEGIN FOR i IN 2..8 LOOP UPDATE startup2 set sex = 'M' WHERE id = i; END LOOP; END $do$;
(Kind of) histogram:
One of the possible requirements is to know the frequency of something occurrence. We can get the frequency of the city among the engineers in the startup by counting each row occurrence for each city value. That’s why we use GROUPY BY clause:
SELECT city, COUNT(*) FROM startup GROUP BY city;
This is like histogram; it shows us how frequent value occurs.
We can name any column if we follow it by AS:
SELECT city, COUNT(*) AS frequency FROM startup GROUP BY city;
I hope you find this article useful. If you want to know about my story before getting the first internship in data science, I wrote an article at medium about it. If you want to see more updates and challenges, please follow me on:
- Linkedin: https://www.linkedin.com/in/ezzeddinabdullah/
- Medium: https://medium.com/@ezzeddinabdullah
- Twitter: https://twitter.com/EzzEddinAbdulah
- Quora: https://www.quora.com/profile/Ezz-El-Din-Abdullah
Bio: Ezz El Din Abdullah is a former Data Scientist Intern & Programming Tutor.