SQL

FAQ

Quote Usage

Double quotes are for names of tables or fields. Sometimes You can omit them. The single quotes are for string constants. This is the SQL standard. In the verbose form, your query looks like this:

select * from "employee" where "employee_name"='elina';

Document Data Model (NoSQL)

History?

Berkeley, for Ingress had QUEL
- Ingress roots PostgreSQL
IBM, for System R first came up w SQUARE
- System R roots System/38, DB2
- SQUARE was complicated, so needed something else.
- Came up with SEQUEL, wordplay on QUEL (sequel to QUEL)
- Later renamed SEQUEL to SQL

Why is COUNT(*) slow?

SELECT COUNT(*) FROM students;
This is same as SELECT COUNT(1) FROM students;
So we can say “why is SELECT COUNT(1) FROM students; slow?”
COUNT(*) counts rows as long as any one of their columns is non-NULL
In PostgreSQL
- It’s because of MVCC
- Multiple transactions see different states
- PG needs to know which rows are “visible”, and “visibility” depends on the Isolation level and MVCC implementation.
- Eg. If some concurrent write access is happening on some row that will not be visible in certain isolation level. So we need to row by row.

smol hack: We can instead get the estimate using json output of EXPLAIN

Faster PostgreSQL Counting

Another way

SELECT
    relname AS table_name,
    to_char(reltuples::bigint, '999,999,999,999') AS estimated_row_count
FROM
    pg_class
WHERE
    relname IN ('table_name1, 'table_name2')
ORDER BY
    relname;

Why is SELECT * slow?

SELECT * from students;
It’s even worse in column store
A Deep Dive in How Slow SELECT * is - YouTube

SQL and Relational Algebra

SQL is based on bags (duplicates) not sets (no duplicates).

So this is where SQL differs from relation algebra

Queries

DML

Methods to store and retrieve information from a database.
Procedural
- Query specifies a high-level “How” based on sets.
- This is Relational Algebra
Non-Procedural (Declarative)
- Query specifies only what data is needed.
- This is Relational Calculus
Eg. SELECT, INSERT, UPDATE, DELETE, TRUNCATE

DDL

Index creation, table creation deletion etc.
Eg. CREATE, DROP, ALTER

DCL (Data control language)

Access control etc.
Eg. GRANT, REVOKE

Other modifiers

Basic: WHERE, ORDER BY, JOIN
Merge: DISTINCT
- It operates on the entire row(eg. if you SELECT multiple cols)
Aggregate: GROUP BY, HAVING
Limit: LIMIT, OFFSET
Set operations: UNION, INTERSECT, EXCEPT
- EXCEPT is the difference from Set Theory

Transaction control

Eg. BEGIN, COMMIT, ROLLBACK, SAVEPOINT, RELEASE SAVEPOINT

Operators

relational algebra is a logical definition of what a plan should do

Ordering

Ordering is important, do we want to join 1bnx1bn and then pick columns or do we want to join 1bnx2cols. The latter is better.
This translates to execution strategy/query plan
Modern optimizers can sometime help you do this automatically but sometimes it gets things wrong.
TIP: Don’t prematurely optimize, write your query normally, see what the optimizer does, if not what you need, change your query/provide hints/provide index etc.
Also when you write a SQL query, the order that the SQL query is written in, is not the order that it actually happens in. But irl db(s) have other optimizations aswell

Practicality

SQL has a standard but almost every dbms vendor offends the standard

101 Basics

Query

Parameterized Query

Query that uses placeholders for parameters instead of embedding the actual values directly in the query string. Replacement happens at execution/run time and not at query planning time.
Prevents SQL injection. Makes sure to treat parameters as data and not as executable parts of the SQL statements.
Eg. LIKE : % (any substring), _ (one character)

Prepared statement

Uses PREPARE, PostgreSQL pre-compiles the SQL query. Improves performance.
Creates a server side object. But only for the DB session.
EXPLAIN EXECUTE name(parameter_values); to see generic or custom plan
Features
- Can use EXECUTE to execute the statement
- You can also specify the types of the parameters
- Useful when a single session is being used to execute a large number of similar statements.
- Refer to parameters by position, using $1,$ 2, etc.

Output

String operations

Operations: Can be used in output or in predicate
- Eg. SUBSTRING, UPPER, ||/+/CONCAT, DBMSes add their own shit

Date/Time

Can be used in output or in predicate

Output redirection

Output control

ORDER BY
- The database will store the tuples any way it wants
  - No guarantee that it’ll store data in the way you insert them
- Can use the ORDER BY <column*/offset> [ASC|DSC] clause to re-order the results
  - Default order is ASC
  - We can use offset number in cases where we run orderby in cases the column doesn’t have a name
- Interesting thing is, I can ORDER BY some column even if that column is not mentioned in the SELECT statement.
- We can also ORDER BY on a column alias created by SELECT using AS

LIMIT
- LIMIT <count> [offset]

Views

Regular views

A non-materialized VIEW is a name for a SELECT statement
- Query that pretends to be a table
- In PostgreSQL you use VIEW to name a SELECT statement
- Building views upon other views is not uncommon.
It directly works with actual data
Uses
- Abstraction: Diff tables into one
- Security/Privacy: Expose only the shit you want to expose, don’t show too much cock

Materialized views

Work best for expensive views on data that does not change frequently.
If a view is cheap/fast or changes to the underlying data occur frequently, you probably do not want to materialize it.
update all rows and set read locks on all affected data during refreshment by default.
We can materializing a view into a temp table
- The temp table cannot be directly updated, but can be refreshed using REFRESH MATERIALIZED VIEW <table_name>;
- See PostgreSQL: Documentation: 16: 41.3. Materialized Views
See How PostgreSQL Views and Materialized Views Work and How They Influenced TimescaleDB Continuous Aggregates

`Views` v/s `CTE` v/s Materialized view

CTE are sometimes called inline view

X	View	CTE	Materialized View
Lifetime	Q is object in DB	Exists only for duration of Q	Caches data and Q
USP	Can be used in multiple Q	Good w tree hierarchy i.e. recursive
Index	Index can be created	No statistics, No indexes
Performance	Does nothing	Does nothing	Improves Performance

Procedures

Nested Queries / Sub Queries / Inner Queries

Embed one query inside another
Inner query can reference outer query table
Difficult to optimize, the optimizer will try to optimize it into a JOIN or do something else etc.

Common Table Expression (CTE)

Mostly about WITH
- WITH are the “private methods” of SQL
- WITH is a prefix to SELECT
- WITH queries are only visible in the SELECT they precede
Separated by , and returns to main query if no , (comma)
Better alternative to nested queries and views
Statement scoped views
Provides a way to write auxiliary statements for use in a larger query.
Think of it like a temp table just for one query.

Has a RECURSIVE version. Makes SQL Turing Complete (See Automata Theory) because it allows Recursion

WITH RECURSIVE source (
    counter
) AS ((
        SELECT
            1)
    UNION (
        SELECT
            counter + 1
        FROM
            source
        WHERE
            counter < 10))
SELECT
    *
FROM
    source;

JOINs

See SQL JOINs

NULL in SQL

NULL is never equal or not equal to “anything” in SQL. Not even to itself.
- = and != operators don’t work with NULL

What `NULL` means depends on the data

It can mean missing data
It can mean the data has the NULL value, whatever the application then interprets
Where they come from (some examples)
- Table/Data can have it in itself
- Functions (eg. LAG) can return NULL
- Result of some OUTER JOIN
- Any operations such as +, *, CONCAT with NULL will result in NULL

Handling `NULL`

Handle them in application
Filter them
If you have IS NULL sprinkled across every query, that tells me that you have a lot of nullable columns in your schema that really ought to be not-nullable. You should be enforcing non-nullableness on your database table, not your SELECT.
COALESCE
- COALESCE returns the first item that is not NULL
- items in the list should be of same type
- items can be literal or other columns too
CASE
- This is how you’d write if statements in SQL

`NULL` and Unique indexes

NULL in not equal to NULL, so if you try to add a unique constraint or unique index on a column, it’s allowed to have multiple NULL values. unless something like NULLS NOT DISTINCT is used. (PG)

Aggregations

Both GROUP BY and OVER may follow any aggregation

Aggregation Functions

COUNT(col): Return # of values for col.
SUM(col): Return sum of values in col.
AVG(col): Return the average col value.
MIN(col): Return minimum col value.
MAX(col): Return maximum col value.

GROUP BY

Functions that return a single value from a bag of tuples
Aggregate functions can (almost) only be used in the SELECT output list.
Support/Rules
- COUNT, SUM, AVG support DISTINCT
- Multiple aggregates in a SELECT query is allowed
- Output of other columns when doing aggregates is undefined. (instead use GROUP BY)
- Non-aggregated values in SELECT clause must appear in GROUP BY clause.
- Filtering on the aggregated value is done using HAVING, syntax looks like we’re computing AVG twice, but the database is smart enough not to do that. We use HAVING because WHERE is not aware of the computed value.
- IMPORTANT: HAVING filters after the GROUP BY, WHERE filters before GROUP BY

Window Functions

See Window Functions in SQL

Data Modeling

See DB Data Modeling

Random topics

Rivers and Axis | ProGramin’g thoughts
“say you have a dataset in a SQL table, and you find there are duplicates in it that shouldn’t be there. walk me through how you’d resolve this issue”
- I’ve also encountered the same thing, but with constraints added every time you come up with a solution.
- Whay if the table has a billion rows, would that affect your solution?
- What if you need to ensure it completes before every full hour?
- What if new data arrives continuously?

🐏 mogoz

Table of Contents

SQL

FAQ §

Quote Usage §

Document Data Model (NoSQL) §

History? §

Why is COUNT(*) slow? §

Why is SELECT * slow? §

SQL and Relational Algebra §

Queries §

DML §

DDL §

DCL (Data control language) §

Other modifiers §

Transaction control §

Operators §

Ordering §

Practicality §

101 Basics §

Query §

Parameterized Query §

Prepared statement §

Output §

String operations §

Date/Time §

Output redirection §

Output control §

Views §

Regular views §

Materialized views §

Views v/s CTE v/s Materialized view §

Procedures §

Nested Queries / Sub Queries / Inner Queries §

Common Table Expression (CTE) §

JOINs §

NULL in SQL §

What NULL means depends on the data §

Handling NULL §

NULL and Unique indexes §

Aggregations §

Aggregation Functions §

GROUP BY §

Window Functions §

Data Modeling §

Random topics §

Graph View

Backlinks

FAQ