tags : Database, PostgreSQL

FAQ

Quote Usage

Double quotes are for names of tables or fields. Sometimes You can omit them. The single quotes are for string constants. This is the SQL standard. In the verbose form, your query looks like this:

select * from "employee" where "employee_name"='elina';

Document Data Model (NoSQL)

History?

  • Berkeley, for Ingress had QUEL
  • IBM, for System R first came up w SQUARE
    • System R roots System/38, DB2
    • SQUARE was complicated, so needed something else.
    • Came up with SEQUEL, wordplay on QUEL (sequel to QUEL)
    • Later renamed SEQUEL to SQL

Why is COUNT(*) slow?

  • SELECT COUNT(*) FROM students;
  • This is same as SELECT COUNT(1) FROM students;
  • So we can say “why is SELECT COUNT(1) FROM students; slow?”
  • COUNT(*) counts rows as long as any one of their columns is non-NULL
  • In PostgreSQL
    • It’s because of MVCC
    • Multiple transactions see different states
    • PG needs to know which rows are “visible”, and “visibility” depends on the Isolation level and MVCC implementation.
    • Eg. If some concurrent write access is happening on some row that will not be visible in certain isolation level. So we need to row by row.
  • smol hack: We can instead get the estimate using json output of EXPLAIN
    • Faster PostgreSQL Counting
    • Another way
      SELECT
          relname AS table_name,
          to_char(reltuples::bigint, '999,999,999,999') AS estimated_row_count
      FROM
          pg_class
      WHERE
          relname IN ('table_name1, 'table_name2')
      ORDER BY
          relname;

Why is SELECT * slow?

SQL and Relational Algebra

  • SQL is based on bags (duplicates) not sets (no duplicates).
  • So this is where SQL differs from relation algebra

Queries

DML

  • Methods to store and retrieve information from a database.
  • Procedural
    • Query specifies a high-level “How” based on sets.
    • This is Relational Algebra
  • Non-Procedural (Declarative)
    • Query specifies only what data is needed.
    • This is Relational Calculus
  • Eg. SELECT, INSERT, UPDATE, DELETE, TRUNCATE

DDL

  • Index creation, table creation deletion etc.
  • Eg. CREATE, DROP, ALTER

DCL (Data control language)

  • Access control etc.
  • Eg. GRANT, REVOKE

Other modifiers

  • Basic: WHERE, ORDER BY, JOIN
  • Merge: DISTINCT
    • It operates on the entire row(eg. if you SELECT multiple cols)
  • Aggregate: GROUP BY, HAVING
  • Limit: LIMIT, OFFSET
  • Set operations: UNION, INTERSECT, EXCEPT

Transaction control

  • Eg. BEGIN, COMMIT, ROLLBACK, SAVEPOINT, RELEASE SAVEPOINT

Operators

  • relational algebra is a logical definition of what a plan should do

Ordering

  • Ordering is important, do we want to join 1bnx1bn and then pick columns or do we want to join 1bnx2cols. The latter is better.
  • This translates to execution strategy/query plan
  • Modern optimizers can sometime help you do this automatically but sometimes it gets things wrong.
  • TIP: Don’t prematurely optimize, write your query normally, see what the optimizer does, if not what you need, change your query/provide hints/provide index etc.
  • Also when you write a SQL query, the order that the SQL query is written in, is not the order that it actually happens in. But irl db(s) have other optimizations aswell

Practicality

  • SQL has a standard but almost every dbms vendor offends the standard

101 Basics

Query

Parameterized Query

  • Query that uses placeholders for parameters instead of embedding the actual values directly in the query string. Replacement happens at execution/run time and not at query planning time.
  • Prevents SQL injection. Makes sure to treat parameters as data and not as executable parts of the SQL statements.
  • Eg. LIKE : % (any substring), _ (one character)

Prepared statement

  • Uses PREPARE, PostgreSQL pre-compiles the SQL query. Improves performance.
  • Creates a server side object. But only for the DB session.
  • EXPLAIN EXECUTE name(parameter_values); to see generic or custom plan
  • Features
    • Can use EXECUTE to execute the statement
    • You can also specify the types of the parameters
    • Useful when a single session is being used to execute a large number of similar statements.
    • Refer to parameters by position, using 2, etc.

Output

String operations

  • Operations: Can be used in output or in predicate
    • Eg. SUBSTRING, UPPER, ||/+/CONCAT, DBMSes add their own shit

Date/Time

  • Can be used in output or in predicate

Output redirection

Output control

  • ORDER BY

    • The database will store the tuples any way it wants
      • No guarantee that it’ll store data in the way you insert them
    • Can use the ORDER BY <column*/offset> [ASC|DSC] clause to re-order the results
      • Default order is ASC
      • We can use offset number in cases where we run orderby in cases the column doesn’t have a name
    • Interesting thing is, I can ORDER BY some column even if that column is not mentioned in the SELECT statement.
    • We can also ORDER BY on a column alias created by SELECT using AS
  • LIMIT

    • LIMIT <count> [offset]

Views

Regular views

  • A non-materialized VIEW is a name for a SELECT statement
    • Query that pretends to be a table
    • In PostgreSQL you use VIEW to name a SELECT statement
    • Building views upon other views is not uncommon.
  • It directly works with actual data
  • Uses

Materialized views

Views v/s CTE v/s Materialized view

  • CTE are sometimes called inline view
XViewCTEMaterialized View
LifetimeQ is object in DBExists only for duration of QCaches data and Q
USPCan be used in multiple QGood w tree hierarchy i.e. recursive
IndexIndex can be createdNo statistics, No indexes
PerformanceDoes nothingDoes nothingImproves Performance

Procedures

Nested Queries / Sub Queries / Inner Queries

  • Embed one query inside another
  • Inner query can reference outer query table
  • Difficult to optimize, the optimizer will try to optimize it into a JOIN or do something else etc.

Common Table Expression (CTE)

  • Mostly about WITH
    • WITH are the “private methods” of SQL
    • WITH is a prefix to SELECT
    • WITH queries are only visible in the SELECT they precede
  • Separated by , and returns to main query if no , (comma)
  • Better alternative to nested queries and views
  • Statement scoped views
  • Provides a way to write auxiliary statements for use in a larger query.
  • Think of it like a temp table just for one query.
  • Has a RECURSIVE version. Makes SQL Turing Complete (See Automata Theory) because it allows Recursion
    WITH RECURSIVE source (
        counter
    ) AS ((
            SELECT
                1)
        UNION (
            SELECT
                counter + 1
            FROM
                source
            WHERE
                counter < 10))
    SELECT
        *
    FROM
        source;

JOINs

See SQL JOINs

NULL in SQL

  • NULL is never equal or not equal to “anything” in SQL. Not even to itself.
    • = and != operators don’t work with NULL

What NULL means depends on the data

  • It can mean missing data
  • It can mean the data has the NULL value, whatever the application then interprets
  • Where they come from (some examples)
    • Table/Data can have it in itself
    • Functions (eg. LAG) can return NULL
    • Result of some OUTER JOIN
    • Any operations such as +, *, CONCAT with NULL will result in NULL

Handling NULL

  • Handle them in application
  • Filter them
  • If you have IS NULL sprinkled across every query, that tells me that you have a lot of nullable columns in your schema that really ought to be not-nullable. You should be enforcing non-nullableness on your database table, not your SELECT.
  • COALESCE
    • COALESCE returns the first item that is not NULL
    • items in the list should be of same type
    • items can be literal or other columns too
  • CASE
    • This is how you’d write if statements in SQL

NULL and Unique indexes

NULL in not equal to NULL, so if you try to add a unique constraint or unique index on a column, it’s allowed to have multiple NULL values. unless something like NULLS NOT DISTINCT is used. (PG)

Aggregations

  • Both GROUP BY and OVER may follow any aggregation

Aggregation Functions

  • COUNT(col): Return # of values for col.
  • SUM(col): Return sum of values in col.
  • AVG(col): Return the average col value.
  • MIN(col): Return minimum col value.
  • MAX(col): Return maximum col value.

GROUP BY

  • Functions that return a single value from a bag of tuples
  • Aggregate functions can (almost) only be used in the SELECT output list.
  • Support/Rules
    • COUNT, SUM, AVG support DISTINCT
    • Multiple aggregates in a SELECT query is allowed
    • Output of other columns when doing aggregates is undefined. (instead use GROUP BY)
    • Non-aggregated values in SELECT clause must appear in GROUP BY clause.
    • Filtering on the aggregated value is done using HAVING, syntax looks like we’re computing AVG twice, but the database is smart enough not to do that. We use HAVING because WHERE is not aware of the computed value.
    • IMPORTANT: HAVING filters after the GROUP BY, WHERE filters before GROUP BY

Window Functions

See Window Functions in SQL

Data Modeling

See DB Data Modeling

Random topics

  • Rivers and Axis | ProGramin’g thoughts
  • “say you have a dataset in a SQL table, and you find there are duplicates in it that shouldn’t be there. walk me through how you’d resolve this issue”
    • I’ve also encountered the same thing, but with constraints added every time you come up with a solution.
    • Whay if the table has a billion rows, would that affect your solution?
    • What if you need to ensure it completes before every full hour?
    • What if new data arrives continuously?