7 Steps to Mastering SQL for Data Science

SQL is a standard language for storing, manipulating and retrieving data in databases. Our SQL tutorial will teach you how to use SQL in: MySQL, SQL Server, MS Access, Oracle, Sybase, Informix, Postgres, and other database systems.

SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as update data on a database, or retrieve data from a database.

Step 1: SQL Basics

As a data scientist, you will be reading from databases and analyzing data to fit your use-case. You generally don’t need to create or manipulate existing databases — companies have a separate team to do this.

If you have no prior SQL knowledge whatsoever, start with this tutorial to understand what an RDBMS is.

An ERD is a structural diagram used to visualize the tables in a database and the relationship between them. As a data scientist, when extracting data from different tables, you’d often need to refer to an ER Diagram to understand how the tables interact with each other.

After that, you can immediately start learning how to query data in SQL. I highly recommend following along to these tutorials by W3Schools to learn the following commands — SELECT, IN, WHERE, BETWEEN, AND, OR, NOT, LIKE.

WHERE

The WHERE clause can be combined with AND, OR, and NOT operators.

AND and OR

The AND and OR operators are used to filter records based on more than one condition:

The AND operator displays a record if all the conditions separated by AND are TRUE.
The OR operator displays a record if any of the conditions separated by OR is TRUE.

NOT

The NOT operator displays a record if the condition(s) is NOT TRUE.

SELECT

The SELECT statement is used to select data from a database. The data returned is stored in a result table, called the result-set.

LIKE:

The LIKE operator is used in a WHERE clause to search for a specified pattern in a column. There are two wildcards often used in conjunction with the LIKE operator:

The percent sign (%) represents zero, one, or multiple characters
The underscore sign (_) represents one, single character

Syntax:

SELECT

SELECT column1, column2, ...
FROM table_name;

WHERE

SELECT column1, column2, ...
FROM table_name;e
WHERE condition;

AND

SELECT column1, column2, ...
FROM table_name
WHERE condition1 AND condition2 AND condition3 ...;

SELECT column1, column2, ...
FROM table_name
WHERE condition1 OR condition2 OR condition3 ...;

NOT

SELECT column1, column2, ...
FROM table_name
WHERE NOT condition;

SELECT column1, column2, ...
FROM table_name
WHERE columnN LIKE pattern;

These are some of the simplest SQL commands used to query and filter database tables. Once you’re familiar with them, start learning CASE statements. They’re very similar to if-else commands in any programming language.

CASE

The CASE statement goes through conditions and returns a value when the first condition is met (like an if-then-else statement). So, once a condition is true, it will stop reading and return the result. If no conditions are true, it returns the value in the ELSE clause. If there is no ELSE part and no conditions are true, it returns NULL.

Syntax:

CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    WHEN conditionN THEN resultN
    ELSE result
END;

Step 2: Aggregations

SQL aggregate functions are used to perform calculations on multiple table values and return a single result. SQL has 5 aggregate functions — SUM, COUNT, AVG, MIN, MAX.

COUNT

SELECT COUNT(column_name)
FROM table_name
WHERE condition;

AVERAGE

SELECT AVG(column_name)
FROM table_name
WHERE condition;

SUM

SELECT SUM(column_name)
FROM table_name
WHERE condition;

MIN

SELECT MIN(column_name)
FROM table_name
WHERE condition;

MAX

SELECT MAX(column_name)
FROM table_name
WHERE condition;

Step 3: Grouping and Sorting

Learn about the GROUPBY and ORDERBY commands. These are especially useful when you need to view your data in different groups or sort rows in a specific order.

ORDERBY

The ORDER BY keyword is used to sort the result-set in ascending or descending order. The ORDER BY keyword sorts the records in ascending order by default. To sort the records in descending order, use the DESC keyword.

GROUPBY

The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of customers in each country".

It is also useful to learn the HAVING clause, as it’s used frequently with the above commands.

HAVING

The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.

Syntax:

GROUPBY

SELECT column_name(s)
FROM table_name
WHERE condition
GROUPBY column_name(s)
ORDERBY column_name(s);

ORDERBY

SELECT column1, column2, ...
FROM table_name
ORDERBY column1, column2, ... ASC|DESC;

HAVING

SELECT column_name(s)
FROM table_name
WHERE condition
GROUPBY column_name(s)
HAVING condition
ORDERBY column_name(s);

Step 4: Joins

All the queries above can only be used to extract data from a single table. If you’d like to combine data in multiple tables, you need to learn the JOIN command.

Here is a visual representation of SQL joins:

Step 5: Subqueries

Subqueries are also called nested queries in SQL, and are used when the result you want requires more than one query. In a nested query, the result of the inner command is used as input in the main query.

This might seem confusing at first, but is actually a fairly intuitive concept once you get used to it.

Step 6: SQL to Solve Business Problems

As a data scientist, the value you bring to an organization lies in your ability to use data to solve a business problem. When given a use-case by a stakeholder, you need to be able to translate this requirement into a technical analysis.

For example, your manager requests for a list of customers that should be targeted for different industries based on their online browsing behaviour. As a data scientist, you will need to break this task down into the following steps:

Look into websites that these customers visited, and segregate them by industry based on their website visits. This can be done with some basic filtering and grouping in SQL.
Then, you can look into recency and frequency of website visits to identify high-potential customers to be targeted in these industries. This might require some additional data pre-processing, filtering, and possibly ranking.
Finally, you can handover the filtered output data grouped by sector to your manager. If you’d like to enrich these customer categories, you can even build a clustering algorithm on top of this data to identify high-potential individuals.

The example above is simple, but captures the thought process of a data scientist when provided with a business problem statement. This is a skill that is developed over time, with practice.

Udemy has a course on SQL Business Intelligence designed to help students used SQL to support better decision making. The first part of this program covers the fundamentals of SQL (joins, operators, subqueries, aggregations, etc), and the second half is focused on applying the knowledge learnt to solve business problems.

Step 7: Window Functions

Window functions are a slightly more advanced SQL topic. They enable users to perform calculations against partitions of a result set.

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

ROW NUMBER

ROW_NUMBER() does just what it sounds like—displays the number of a given row. It starts are 1 and numbers the rows according to the ORDER BY part of the window statement. ROW_NUMBER() does not require you to specify a variable within the parentheses:

SELECT start_terminal,        
    start_time,        
    duration_seconds,        
    ROW_NUMBER() OVER (ORDERBY start_time)                     
        AS row_number   

FROM tutorial.dc_bikeshare_q1_2012  
WHERE start_time < '2012-01-08'

RANK

RANK() is slightly different from ROW_NUMBER(). If you order by start_time, for example, it might be the case that some terminals have rides with two identical start times. In this case, they are given the same rank, whereas ROW_NUMBER() gives them different numbers. In the following query, you notice the 4th and 5th observations for start_terminal 31000—they are both given a rank of 4, and the following result receives a rank of 6:

SELECT start_terminal,        
    duration_seconds,        
    RANK() OVER (PARTITIONBY start_terminal                     
    ORDERBY start_time)               
    AS rank FROM tutorial.dc_bikeshare_q1_2012 

WHERE start_time < '2012-01-08'

NTILE

You can use window functions to identify what percentile (or quartile, or any other subdivision) a given row falls into. The syntax is NTILE(*# of buckets*). In this case, ORDER BY determines which column to use to determine the quartiles (or whatever number of 'tiles you specify). For example:

SELECT start_terminal,        
    duration_seconds,        
    NTILE(4) OVER          
        (PARTITIONBY start_terminal ORDERBY duration_seconds)           
    
    AS quartile,        
        NTILE(5) OVER          
        (PARTITIONBY start_terminal ORDERBY duration_seconds)          
    
    AS quintile,        
        NTILE(100) OVER          
        (PARTITIONBY start_terminal ORDERBY duration_seconds)          
    
    AS percentile   
FROM tutorial.dc_bikeshare_q1_2012  
WHERE start_time < '2012-01-08'

LAG and LEAD

It can often be useful to compare rows to preceding or following rows, especially if you've got the data in an order that makes sense. You can use LAG or LEAD to create columns that pull values from other rows—all you need to do is enter which column to pull from and how many rows away you'd like to do the pull. LAG pulls from previous rows and LEAD pulls from following rows:

 SELECT start_terminal,        
   duration_seconds,        
   LAG(duration_seconds, 1) OVER          
     (PARTITIONBY start_terminal ORDERBY duration_seconds) AS lag,        
   
   LEAD(duration_seconds, 1) OVER          
     (PARTITIONBY start_terminal ORDERBY duration_seconds) AS lead   
 
 FROM tutorial.dc_bikeshare_q1_2012  
 WHERE start_time < '2012-01-08'ORDERBY start_terminal, duration_seconds

The Tech Platform