tags : Database

FAQ

How does Supabase work?

What’s PostgREST?

  • Also see REpresentational State Transfer; PostgREST is not strictly REST.
  • PostgREST is based on the idea that if you access the database as a particular user, Row Level Security (RLS) will ensure you only see the data you’re allowed to.
    • Users need to be authenticated and authorized.
    • Since PostgREST effectively becomes the backend, some kind of connection pool is needed.
  • Its API looks like a mixture of RESTful and GraphQL.

How to use EXPLAIN?

  • Gives you the planner’s estimated cost beforehand
  • Estimates how many rows it’ll return, etc.
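  A minimal sketch of the two forms (the users table and filter value are made up):

      EXPLAIN SELECT * FROM users WHERE email = 'a@b.com';
      -- shows the chosen plan with estimated cost, rows and width, without running the query
      EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email = 'a@b.com';
      -- actually executes the query and prints real timings/row counts next to the estimates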

How to use ANALYZE?

  • It runs the machinery that collects the statistics the query planner uses (the same statistics behind the estimates we see when we run EXPLAIN)
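  A small sketch (table name hypothetical): refresh the statistics, then peek at the planner’s row estimate.

      ANALYZE users;   -- recompute statistics for one table (plain ANALYZE does the whole database)
      SELECT reltuples::bigint AS estimated_rows
      FROM pg_class WHERE relname = 'users';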

Postgres by default runs in read committed, so transactions are not serializable. How is this acceptable?

  • With Read Committed, a query in a transaction
    • Sees only data committed before the query began;
    • Never sees uncommitted data
    • Never sees changes committed by concurrent transactions during the query’s execution
  • This is slightly different for writes (see MVCC)
  • The partial transaction isolation provided by Read Committed mode is adequate for many applications
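  A quick way to see, and override per transaction, the isolation level (sketch):

      SHOW default_transaction_isolation;            -- normally "read committed"
      BEGIN;
      SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- opt in to stricter isolation when a transaction needs it
      -- ... queries ...
      COMMIT;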

Postgres Schema and Database Objects

Meaning of schema

  • The concept of a schema is not very consistent between DBMSs.
    • MySQL uses “schema” as a synonym for “database”; there is no extra namespace level inside a database.
    • Schemas are the implementation in postgres of the namespace concept. It’s a collection of logical structures of data.
  • This is different from the general idea of database schema
  • SHOW SEARCH_PATH
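  A small sketch of namespacing with schemas (schema/table names follow the ecommerce example below):

      CREATE SCHEMA ecommerce;
      CREATE TABLE ecommerce.orders (id bigint PRIMARY KEY);  -- “qualified” name: schema.table
      SHOW search_path;                                       -- "$user", public by default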

Hierarchy of “things” in postgres

  • A single machine/host can have multiple clusters; each cluster listens on its own port
  • A cluster can have many databases, roles/users and other cluster-level objects. i.e. roles and databases are couples, but a role keeps cheating: a single role can be the owner of multiple databases.
  • Then we have database level objects
    • Belonging to some schema (namespaced)
      • Eg. Tables, views, sequences, functions, procedures, types, domains, etc.
      • Eg. Indexes and triggers are always created in the same schema as the table they are for.
    • Not belonging to some schema (no namespace)
      • Eg. Extensions
  • There can be schemas with the same name in different databases (but not in the same database).
  • Objects from one database cannot interact with objects from another one (catalogs are kept separate). Eg. we can’t join across two databases; use schemas instead.
  • A database can have multiple schemas

Understanding usecase of schema and search_path in postgres

  • Organization
    • Think of schema like directories and database objects as files. You can do without directories but it helps organize.
    • Eg. In a single database, you can have schemas like ecommerce, auth etc.
    • Helpful in dumping, selecting all tables in a schema etc.
  • Access control
    • This has roots in the use case where a database has multiple users. In our web-application world we usually have one user accessing it, so we just set the search path in the connection string.
    • Grant privileges to roles at the schema level (USAGE/CREATE); combine with GRANT ... ON ALL TABLES IN SCHEMA or ALTER DEFAULT PRIVILEGES so the objects inside the schema are covered too. See the sketch below.
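  A hedged sketch of schema-level access control (role name made up, schema from the example above):

      CREATE ROLE app_ro NOLOGIN;
      GRANT USAGE ON SCHEMA ecommerce TO app_ro;                 -- may reference objects in the schema
      GRANT SELECT ON ALL TABLES IN SCHEMA ecommerce TO app_ro;  -- existing tables
      ALTER DEFAULT PRIVILEGES IN SCHEMA ecommerce
        GRANT SELECT ON TABLES TO app_ro;                        -- tables created later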

When to create schema, when to create database?

  • If projects/users should be separate and be unaware of each other: databases
  • If projects/users are interrelated, should be able to use each other’s resources: schemas

search path and schema

search path

  • When using schemas you need to use “qualified” names; alternatively you can rely on the search path
  • If an object is referenced without schema qualification, the search path is traversed until a matching object is found.
  • The search path is a list of schemas; whichever schema contains the object first wins.
    • The first schema in the search path that “exists” is the default location for creating new objects.
    • That is the reason that by default objects are created in the public schema.
  • Eg. the following shows tables from all schemas
      \dt *.*
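  A small sketch of steering unqualified names via the search path (schema name from the example above):

      SET search_path TO ecommerce, public;  -- unqualified names now resolve against ecommerce first
      CREATE TABLE order_items (id bigint);  -- lands in ecommerce, the first existing schema in the path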

Builtin schema

  • public schema: There’s nothing special about the public schema except that it exists by default. It can be dropped, too. It’s just there when you create a new database, so if you create objects without specifying any schema, they’ll end up in the public schema of that database.
  • pg_catalog schema: Contains the system tables and all the built-in data types, functions, and operators. pg_catalog is always effectively and implicitly part of the search path.

PUBLIC role / keyword

  • It’s not exactly a role but a keyword that acts as a placeholder for roles
  • This keyword usually means “everyone”
  • This is different from the public schema. eg. REVOKE CREATE ON SCHEMA public FROM PUBLIC; (the first “public” is the schema, the second “PUBLIC” means “every user”)
  • It used to give CREATE permission on the public schema to “all” users regardless of what grants were at play.
    • USAGE privilege granted to PUBLIC: Allows anyone to refer to objects in the public schema.
    • CREATE privilege granted to PUBLIC: Allows anyone to create objects in the public schema.
      • This is problematic!
  • Postgres 15 onwards there is a fix for this issue (the pg_database_owner role itself arrived in 14)
    • new role pg_database_owner, which implicitly has the actual owner of the current database as a member.
    • public schema is owned by pg_database_owner now.
    • public schema no longer has CREATE granted to PUBLIC.

Postgres Syntax notes

JSONB

  • Prefer jsonb over json if possible

Operators & Functions

  • Checks
    • exists: ? , ?| (any), ?& (all)
      • eg. SELECT '["foo", "bar", "baz"]'::jsonb ? 'bar';
      • Not nested (top level only)
      • Works with array items (string elements)
      • Works only with keys when the array items are objects
    • containment
      • @> (contains)
        • Can operate on nested elements
        • Eg. SELECT doc->'site_name' FROM websites WHERE doc @> '{"tags":[{"term":"paris"}, {"term":"food"}]}';
      • <@ (is contained by)
  • Access
    • subscripting [x] (Extract & Modify jsonb)
      • The result of a subscript is always jsonb
      • Array indexes start at 0, negative indexes supported
      • Can access nested element by successive subscripting
    • ->
      • Array(index) and object(key) access
    • ->>
      • Array(index) and object(key) access as text
    • #> (path)
      • Array(index) and object(key) access
    • #>> (path)
      • Array(index) and object(key) access as text
  • Manipulation
    • || : concatenate/merge objects
    • - : delete a key (or array element)
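  A few self-contained examples of the operators above (values made up):

      SELECT '["foo", "bar"]'::jsonb ? 'bar';                -- exists: true
      SELECT '{"a": {"b": 1}}'::jsonb @> '{"a": {"b": 1}}';  -- containment, works on nested elements
      SELECT ('{"a": {"b": 1}}'::jsonb)['a']['b'];           -- subscripting (PG 14+), returns jsonb
      SELECT '{"a": {"b": 1}}'::jsonb #>> '{a,b}';           -- path access, returns text
      SELECT '{"a": 1}'::jsonb || '{"b": 2}'::jsonb;         -- concatenation/merge
      SELECT '{"a": 1, "b": 2}'::jsonb - 'b';                -- delete a key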

Updating jsonb values

It’s important to note that in each of these examples, you’re not actually updating a single field of the JSON data. Instead, you’re creating a temporary, modified version of the data, and assigning that modified version back to the column. In practice, the result should be the same, but keeping this in mind should make complex updates, like the last example, more understandable. SO answer
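Two hedged examples of that pattern (table/column names follow the websites/doc example above; add a WHERE clause in practice):

    -- build a modified copy with jsonb_set and assign it back
    UPDATE websites SET doc = jsonb_set(doc, '{site_name}', '"New name"');
    -- same idea via subscripting assignment (PG 14+)
    UPDATE websites SET doc['site_name'] = '"New name"';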

Indexes on jsonb

  • To use a GIN index, the operator has to be applied to the whole indexed expression. With a simple index on the jsonb column, filtering on a specific field won’t use the index unless the operator is applied to the column itself; for a specific field, create an expression index on that field.
  • Supported operators: ?, ?|, ?&, @>, and the jsonpath operators @?, @@ (jsonb_path_ops supports only @>, @? and @@)
  • Types

    • jsonb_path_ops
      • Seems to have some advantage over jsonb_ops that I don’t care about at the moment
    • jsonb_ops
      • simple & expression
  • Expression indexes vs Simple indexes (jsonb_ops)

    • Simple index
      • Eg. CREATE INDEX idxgin ON api USING GIN (jdoc);
      • Will store copies of every key and value in the jdoc column
      • More flexible, supports any key but bigger in size and might be slower
    • Expression index
      • Eg. CREATE INDEX idxgintags ON api USING GIN ((jdoc -> 'tags'));
      • Stores only data found under the tags key in the example
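  Example queries shaped so the indexes above can be used (api/jdoc follow the examples above; field names made up):

      -- the simple index (idxgin) can serve a containment query on the whole column
      SELECT jdoc->'site_name' FROM api WHERE jdoc @> '{"tags": ["qui"]}';
      -- the expression index (idxgintags) requires applying the operator to the indexed expression
      SELECT jdoc->'site_name' FROM api WHERE jdoc->'tags' ? 'qui';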

Management/Admin

See Operations cheat sheet - PostgreSQL wiki

See Difference between running Postgres for yourself and for others

Postmaster & Backend workers & Backend processes

  • postmaster is the primary daemon process for a PostgreSQL server.
  • The term isn’t really used these days; the primary daemon process is simply called postgres now.
  • Responsibilities
    • Initializes the database system, prepares it for incoming connections, and ensures that the system is in a consistent state.
    • Manages connections (spun off processes/ backends processes)
    • Logging
  • Backend processes
    • Config: max_connections
    • Backend processes communicate with each other and with other processes of the instance using semaphores and shared memory to ensure data integrity throughout concurrent data access.
    • i.e. connection = backend process = session (unless using pools)
  • Background workers
    • Config: max_worker_processes
    • The backend processes spin up background workers for the actual work if parallel query is enabled.
    • This is different from the background writer, which is an auxiliary process.
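  A quick way to see these processes from SQL (one row per backend/auxiliary process):

      SELECT pid, backend_type, state, query
      FROM pg_stat_activity
      ORDER BY backend_type;
      SHOW max_connections;
      SHOW max_worker_processes;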

Shared Memory

  • As we read things from disk, we put them in memory
  • Everything that is shared lives here: pages, WAL records, etc.
  • Every backend process and backend worker and everything else has access to it.
  • Related terms: shared_buffers, buffer pools
  • Provisioned by mmap (?)

Auxiliary Processes

  • background writer: Writes dirty pages to the OS periodically. It frees up shared memory in the postgres process, but doesn’t ensure the data actually made it to disk; that’s up to the OS and its filesystem cache.
  • checkpointer: Flushes pages + WAL to disk and ensures that they have actually reached the disk.
  • Autovacuum Launcher
    • config: autovacuum_max_workers
    • It launches autovacuum workers
  • wal
    • archiver: Archives completed WAL segments (can be used for replay)
    • receiver: Runs where we want to receive the WAL changes.
    • senders: Send the WAL to replicas
    • writer: Writes WAL records from shared memory to disk. Controlled by wal_writer_delay
  • startup
    • Uses the last checkpoint plus the WAL to perform REDO (roll-forward recovery).
    • Doesn’t let any connection in until it has loaded the needed WAL into shared memory and re-applied the changes since the last checkpoint (re-dirtying those pages).
    • Now you’re back at the same point as before the crash happened.

Cluster

What it’s NOT

  • It is NOT what we popularly understand as a cluster (compute cluster). This tripped me up for years man.
    • If you were to do that, you’d end up creating a cluster of postgres clusters. (This is basically bad naming for historic reasons)
  • It’s unrelated to the keyword CLUSTER which allows you to organize a table.

What it is

  • A postmaster / primary postgres process and a group of subsidiary processes
  • These processes manage a shared data directory that contains one or more databases.
  • It’s created for you when you do initdb
  • Unusual for a basic or intermediate user to ever need to create clusters or manage multiple clusters, as one cluster can manage multiple DBs
  • Having multiple clusters means having more than one postmaster / primary postgres process running, on different ports. This is usually not needed.
  • See PostgreSQL: Documentation: 16: 19.2. Creating a Database Cluster

System Metadata

System Catalogs

System Views

Users and Roles

  • ALTER: Modify a role’s attributes
  • GRANT/REVOKE: Groups/memberships leading to inheritance of privileges

Roles

  • select * from pg_roles; same as \du
  • PostgreSQL manages database access permissions using the concept of roles
  • They have nothing to do with OS users.
  • ROLE = user / group of users. (Any role can act as a user, a group, or both)
  • Some attributes: SUPERUSER,CREATEDB,CREATEROLE,INHERIT,LOGIN,REPLICATION,BYPASSRLS (from the CREATE ROLE command)
  • Can own
    • databases
    • objects in that database
  • Can have
    • privileges on objects
  • Can give/grant
    • access to other role (s) on the owned objects
    • membership in its role to another role (once two roles share membership, they can share privileges)
  • Predefined roles

    • Predefined roles were called “default roles” before PG 14; PG 14 also added pg_read_all_data / pg_write_all_data.
    • Useful for things like granting read-only access etc.
  • “postgres” role

    • The default postgres role doesn’t have an “empty password”, it literally has no password which means one cannot connect using a password-based method.
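  A hedged sketch (name and password made up) of creating a login role and leaning on a predefined role:

      CREATE ROLE readonly_user LOGIN PASSWORD 'change-me';
      GRANT pg_read_all_data TO readonly_user;              -- predefined role, PG 14+
      SELECT rolname, rolsuper, rolcanlogin FROM pg_roles;  -- roughly what \du shows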

Privileges

  • Objects can require certain privileges to be accessed.
  • These privileges can be granted to roles
    • Indirectly: via membership in other roles
    • Directly: on the objects
    • See GRANT for more info.
  • Inheritance

    • Whether the privileges of the granted role should be “inherited” by the new member.
    • NOTE: It does not apply to the special role attributes set by CREATE ROLE and ALTER ROLE.
      • Eg. Being a member of a role with the CREATEDB attribute does not immediately grant the ability to create databases, even if the INHERIT attribute is set
    • Controlled in 2 places
      • Role level: Role can have the INHERIT attribute to inherit.
      • Command level: GRANT has WITH INHERIT to explicitly inherit.
  • ALTER

    • Assigned attributes to a ROLE can be modified with ALTER ROLE
    • ALTER DEFAULT PRIVILEGES allows you to set the privileges that will be applied to objects created in the future. (It does not affect privileges assigned to already-existing objects.)
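  A small sketch of the two ALTERs mentioned above (role names made up):

      ALTER ROLE reporting CREATEDB;                       -- change a role attribute
      ALTER ROLE reporting SET statement_timeout = '30s';  -- per-role configuration default
      ALTER DEFAULT PRIVILEGES FOR ROLE reporting
        GRANT SELECT ON TABLES TO readonly_user;           -- only affects objects reporting creates later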

GRANT / REVOKE

  • Used for mutating memberships and privileges on objects
  • GRANT and REVOKE can also be done by a role that is not the owner of the affected object
    • If it is a member of the role that owns the object
    • If it is a member of a role that holds privileges WITH GRANT OPTION on the object.
  • Can be GRANT’ed on 2 things
    • objects
      • GRANT INSERT ON films TO PUBLIC;
      • GRANT ALL PRIVILEGES ON kinds TO manuel;
      • One of: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, CONNECT, TEMPORARY, EXECUTE, USAGE, SET, ALTER SYSTEM and ALL PRIVILEGES
    • roles
      • Eg. Grant membership in role admins to user joe: GRANT admins TO joe;

Vertically Scaling Postgres (Finetuning with ZFS)

See Scaling Databases

WAL management

Should WAL be stored separately? (It depends, but not really necessary)

TLS/SSL for postgres

  • TLS/SSL in a trusted network is not really necessary for postgres
  • But you still would want to have SCRAM even in trusted networks

Optimizing Postgres

Backup and Replication & Upgrade

  • Backup and replication are related in postgres. Backup methods need to be looked at separately, but where backups are done continuously they reuse the machinery postgres has for replication, and vice-versa.

The following table is specific to postgres terminology

|              | Backup                                                          | Replication                                                                            |
| Physical     | Backing up of on-disk structure (can be continuous or snapshot) | Same as Streaming replication: streaming of literal bytes on disk (WAL in case of PG)  |
| Logical      | sql_dump                                                        |                                                                                        |
| Synchronous  | Not a thing                                                     | Usecase of streaming replication (greater consistency guarantee)                       |
| Asynchronous | Not a thing                                                     | Usecase of streaming replication                                                       |
| Streaming    | Same as Physical Backup (sub-category)                          | Same as Physical replication                                                           |
| Snapshot     | Same as Physical Backup (sub-category)                          | Not a thing                                                                            |

Meta ideas for replication and backups in postgres

Standby servers

Standby servers are usually read-only, unless promoted to be primary.

  • Warm standby (pg_standby) / Log shipping (2008)

    • AKA “Log shipping”
    • pg_standby is more of a restore utility
    • pg_standby is no longer the most popular choice for doing streaming backup/restore as of postgres16.
    • It stays in permanent recovery (keeps applying WAL) until promotion/trigger (manual intervention), so it can’t be read from while it is replicating. Usually there will be some lag from the primary.
    • pg_standby allows creation of this.
    • pg_standby runs on a separate server from the primary, and depends on the primary setting a proper archive_command, archive_library, archive_timeout etc. The archive_command somehow (eg. rsync) transfers the WAL files to the standby server; the standby runs pg_standby as its restore_command, keeps reading the WAL, and is triggered via a trigger file to leave “standby” mode whenever there is a failover.
      • There are newer alternatives like pg_receivewal for shipping logs. BUT IT SHOULD NOT BE USED WITH remote_apply because the WAL will never be applied. It uses the postgres streaming protocol but it is NOT A SYNCHRONOUS STANDBY; it just streams WAL into the filesystem using the streaming protocol.
  • Hot standby (2010)

    • +1up from warm standby.
      • warm standbys were only useful in failover scenarios
      • with hot standby you can use secondary servers to spread out read-only load.
    • Ability to connect to the server and run read-only queries while the server is in archive recovery or standby mode.
    • Hot standby allows read-only queries and can be quickly promoted.
    • CAN BE PROMOTED TO PRIMARY from STANDBY “QUICKLY”
    • Q: What allowed hot standby to be possible?
      • Hot standbys were enabled by improvements in different areas, Improved WAL record interpretation, Advanced buffer management, Enhanced transaction visibility logic, Conflict resolution mechanisms, Special handling for read-only transactions etc.
  • Synchronous standby

    • Rare usecase, stronger consistency guarantees
    • Ensures transactions are written to both primary and standby before considering them committed.
    • This usecase is different than WAL archiving, WAL archiving is needed for PITR
    • 2 SYNCHRONOUS STANDBY BAD! 3 STANDBY GOOD!
      • In a basic two-server setup, if the standby goes down, the primary may become unavailable for writes, waiting indefinitely for confirmation from the standby.
    • This CANNOT be achieved with WAL archiving alone, because WAL archiving works at WAL-segment granularity and this needs commit-level granularity; it would also be too slow.
    • For this we NEED to use streaming replication protocol
      • Need to configure synchronous_commit and synchronous_standby_names on the “PRIMARY”
    • Can be warm or hot(allows readonly queries)
      • Using hot_standby = on
    • Synchronous v/s Async commit

      When using remote_apply, be careful not to pair it with pg_receivewal: pg_receivewal never applies the WAL, so the commit would wait forever. It’s fine to use otherwise.

      • It is possible to have both synchronous and asynchronous commit transactions running concurrently
      • Certain transactions are always synchronous, eg. DROP TABLE, Two Phase Commit (2PC) prepared transaction etc.
      • Options:
        • on (default): Wait for the local WAL flush to disk; if synchronous standbys are configured via synchronous_standby_names, also wait for the standby to receive and flush it
        • off: Don’t wait for any confirmation (fastest but least safe)
        • local: Wait only for the local flush, even if synchronous standbys are configured
        • If using “synchronous standby servers” via synchronous_standby_names, the stricter options are
          • remote_write: Wait for the standby to receive and write (but not flush) the data
          • remote_apply: Wait for the standby to receive and apply the transaction (both also imply the local flush, i.e. at least “on” locally)
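      A hedged sketch of the primary-side settings for this (standby name made up):

          ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
          ALTER SYSTEM SET synchronous_commit = 'remote_apply';  -- or remote_write / on
          SELECT pg_reload_conf();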

Q: How to do manual syncing two databases (for development)

The following is from a Reddit comment by /u/Key-Chip

  • pg_dump and pg_restore: This is a standard PostgreSQL tool for backing up and restoring databases. You can use pg_dump to dump the data from the source database into a file, transfer that file to the destination server, and then use pg_restore to restore it into the destination database. This method is efficient for large datasets and can be scripted for automation.
  • pg_dump with custom format and pg_restore: Using pg_dump with the custom format (-Fc flag) allows for parallel dumping of data, making it faster. Then, you can use pg_restore to restore the dumped data into the destination database.
  • psql with COPY command: The COPY command in psql allows you to copy data between tables, or between a table and a file. You can use this to export data from the source database to a file, transfer the file, and then import it into the destination database. However, this method might not be the most efficient for very large datasets.
  • Foreign Data Wrapper (FDW): PostgreSQL supports FDWs, which allow you to access data stored in remote databases as if they were local tables. You can set up an FDW in the destination database that connects to the source database, and then use SQL to insert the data directly from one database to the other.

Q: When do we evolve from “streaming replication” to needing multi-master etc. kind of setups?

  • Read this: https://www.reddit.com/r/PostgreSQL/comments/122p5am/can_someone_share_experience_configuring_highly/ (Holy grail)
  • See Data Replication
  • If you have such requirement, rather than trying to make PG fit in those maybe check something else, otherwise following link discusses postgres HA if we really need it.
  • The officially supported features are streaming replication and logical replication, and with streaming replication the replicas (standbys) are always read-only.
  • Multi-master replication allows multiple database servers to accept write operations and replicate changes between each other. For leader-follower there are different variants as well, but as far as postgres goes we can consider the ones officially supported; streaming replication itself is a variant of leader-follower.
  • Some third-party solutions attempt to provide this functionality.
  • When this functionality is needed, better to jump ship and use something else. See Scaling Databases and Data Replication.

TODO Postgres Major version upgrade

Replication

Oogabooga replication / Log shipping / File based

  • This is what happened first
  • Ship WAL logs continuously to remote destination and continuously keep reading from it. This is what pg_standby did.

Streaming replication / Streaming binary / Streaming wal records/commits

  • Instead of relying solely on the WAL archive and recovery functionalities to transport WAL data, a standby server could connect directly to the primary server via the existing libpq protocol and obtain WAL data that way, so-called streaming replication.
  • This is much faster than log based WAL archival
  • This is a Leader <--> Follower setup, i.e. only one server handles reads & writes; the others are read-only.
  • Side effects

    • synchronous standbys/replication

      • Allowing commits to be received/applied by standbys
      • 2 SYNCHRONOUS STANDBY BAD! 3 STANDBY GOOD!
        • a basic two-server setup, if the standby goes down, the primary may become unavailable for writes, waiting indefinitely for confirmation from the standby.
    • pg_basebackup

      • pg_basebackup would use a normal libpq connection to pull down a base backup, thus avoiding complicated connection and authentication setups for external tools.
      • Makes the base backup easier instead of having to use rsync.
    • cascading replication

      • Allowing standbys to be created from standbys
      • pg_basebackup could do this
    • pg_last_xact_replay_timestamp

      function for easy monitoring of standby lag.

  • The streaming replication protocol (builtin)

    • Streaming replication itself is configured with primary_conninfo on the standby (plus wal_level = replica and max_wal_senders on the primary); setting synchronous_standby_names (and optionally synchronous_commit) on the primary is what makes it synchronous.
    • The primary method of replication in PostgreSQL. It allows the standby server to connect directly to the primary and receive WAL records as they’re generated, reducing lag.
    • WAL data is sent immediately rather than waiting for a whole segment to be produced and shipped.
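    A hedged configuration sketch for plain streaming replication (host/user names made up):

        -- on the primary
        ALTER SYSTEM SET wal_level = 'replica';
        ALTER SYSTEM SET max_wal_senders = 10;
        -- on the standby (which also needs a standby.signal file in its data directory)
        ALTER SYSTEM SET primary_conninfo = 'host=primary-host port=5432 user=replicator';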

Logical replication

  • Usecase

    • Logical replication is a good option when your target is not Postgres, or when you only want specific tables replicated
    • Not a good fit for a “postgres backup”
  • History

    • PostgreSQL 9.4: Logical decoding
    • Introduced in PostgreSQL 10, this method allows replication of specific tables rather than the entire database. It’s more flexible but can be more complex to set up.
    • In logical replication setups, subscribers can be writable, but these are not the same as physical standbys.
  • Main ideas

    • publishers (servers that do CREATE PUBLICATION)
    • subscribers (servers that do CREATE SUBSCRIPTION)
    • Servers can also be publishers and subscribers at the same time.
    • wal_level needs to be logical (for logical decoding)
    • Logical replication allows replication of data changes on a per-table basis. In addition, a server that is publishing its own changes can also subscribe to changes from another server, allowing data to flow in multiple directions.
      • third-party extensions can also provide similar functionality.
  • Gotchas

    • Doubt: do things like CURRENT_TIME(), NOW() get re-executed on the replica and drift from the original? (Built-in logical replication decodes row changes from the WAL, so the already-computed values are what gets replicated, much like physical replication; statement- or trigger-based third-party solutions are where this could bite.)
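    A minimal publisher/subscriber sketch (names and connection string made up):

        -- on the publisher (requires wal_level = logical)
        CREATE PUBLICATION app_pub FOR TABLE orders, customers;
        -- on the subscriber
        CREATE SUBSCRIPTION app_sub
          CONNECTION 'host=pub-host dbname=app user=replicator'
          PUBLICATION app_pub;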

Backup

| Type     | Tool                         | Nature                                 | Use                                                    |
| Logical  | SQL Dump                     | Consistent but might not be up to date | OK for small scale, can combine w/ restic              |
| Physical | File/Data/FS Backup/Snapshot | STW (frozen), might need to stop pg    | OK for small scale, can combine w/ restic              |
| Physical | Continuous Archival          |                                        | Useful when large DB, no extra space to take a pg_dump |

SQL Dump (Logical)

When you somehow back up the shape and state of the data; later you’d have a way to put it back together. Backs up the “database objects”.

  • pg_dump : Takes consistent backups, other writes can keep happening concurrently. But using this as a backup solution might result in data loss. It’ll read every(based on params) db object, converting it to equivalent SQL, and writing it to some location.
    • This is what you’d call a “full backup” I presume.
    • use pg_dump with -Fc to ensure that you are using a custom format. This is the only way to prepare to do a parallelized restore using pg_restore (with the -j parameter)
  • Concerns
    • If a system has a simple pg_dump cron job that ships the archive to remote storage, then when the leader crashes or dies, the downtime is the time to detect the failure, copy the archive to the follower, and complete pg_restore.
    • If the cron job runs at a fixed time of day, everything written between that run and the crash is potential data loss.
    • It involves duplicating data over and over again. De-dupe (as restic does) doesn’t make sense for Postgres backups; the full/diff/trn-log approach that pgbackrest takes is what’s needed. But I guess it works for small databases.

Filesystem Snapshots (Physical)

Can be referred to as a snapshot backup. It’s a stop-the-world (STW) kind of thing.

  • Depending on whether the FS allows atomic snapshots, we may or may not need to stop pg before taking the snapshot
  • ZFS filesystem: With ZFS you can trust snapshots + send/recv if, and only if, everything is on the same dataset (WAL and DATA). If you need PITR then you need something like pgbackrest.

Continuous Archival (Physical)

See the continuous archival section

Continuous Archival

  • In PG 8.0 (2005) we got the ability to copy the WAL somewhere else.

  • With this, we can later play it back, either all the way or to a particular point in time (ideally starting from a base backup).

  • So (WAL storage separation + ability to replay from any point) = Continuous Archival + PITR

  • This had NO concept of replication initially. Simple, store the WAL elsewhere and play it back. simple.

  • Later, people figured you could archive WAL on one server and continuously recover WAL on another server. This gives us streaming!

    • Initially people did this manually, later we got pg_standby with which built-in replication in PG was officially born(2008).
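    A hedged sketch of the archiving-side settings (archive destination made up; %p/%f are expanded by postgres):

        ALTER SYSTEM SET archive_mode = 'on';                            -- needs a restart
        ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal-archive/%f';
        ALTER SYSTEM SET archive_timeout = '60s';
        -- on the restore/standby side: restore_command = 'cp /mnt/wal-archive/%f %p'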

Meta Questions

  • Relevant configs/commands

    • WORS: Without Replication Slot
    • WRS: With Replication Slot

    | Type   | Strategy                              | Name                    | What it does |
    |        |                                       | pg_standby              |              |
    |        |                                       | pg_basebackup           |              |
    |        |                                       | max_slot_wal_keep_size  |              |
    |        |                                       | max_wal_size            |              |
    |        |                                       | pg_start_backup         |              |
    |        |                                       | pg_stop_backup          |              |
    |        |                                       | wal_keep_size           | Specifies the minimum size of past WAL files kept in the pg_wal directory, in case a standby server needs to fetch them for streaming replication. Useful in case of slow connections etc. |
    | config | WAL shipping (WORS)/Streaming/Logical | wal_level               | Needs to be replica for WAL shipping/streaming replication |
    |        | Streaming                             | max_wal_senders         |              |
    |        | ??                                    | archive_mode            |              |
    |        | WAL shipping (WORS)                   | archive_command         | Takes a script (ideally idempotent). If it exits 0 (success), the WAL file is recycled; otherwise archiving is retried until it succeeds. |
    |        | WAL shipping (WORS)                   | restore_command         | Whatever restores the archive on the warm standby when promoted |
    |        | WAL shipping (WORS)                   | archive_timeout         | Without this, sending out the archive can have long delays |
    |        | WAL shipping (WORS)                   | archive_cleanup_command | Needs to be run on the standby, as the primary automatically recycles WAL files after checkpoints |
  • Tools Exploration

    These help you do things like uploading to various object storage services, putting all the files together for restore, manage the lifecycle of the backup files, taking base backup periodically so that WAL replay does not become too big etc etc.

    | Name          | Description                                |
    | pg_basebackup | official, allows for a base backup         |
    | wal-g         | written in go, supports more than postgres |
    | pgbackrest    | See this comparison w/ barman              |
    | barman        | Makes base backups and does WAL archival   |
    | repmgr        | idk what this does                         |

    I’ve finally decided to go with wal-g for my continuous backup strategy

  • WAL Recycling

    From the docs

    The segment files are given numeric names that reflect their position in the abstract WAL sequence. When not using WAL archiving, the system normally creates just a few segment files and then “recycles” them by renaming no-longer-needed segment files to higher segment numbers. It’s assumed that segment files whose contents precede the last checkpoint are no longer of interest and can be recycled.

Continuous Archival with PITR

When you drop a table in prod, you should be able to easily recover it. PITR allows that.

  • Log shipping based

    Main idea

    • One base backup as a starting point
    • Continuous deltas in the form of the append-only log (WAL) files

    NOTE:

    • This does not involve a replication slot, as far as I understand.
    • Unlike “streaming replication”, here we ship whole WAL segments (16MB) at certain intervals. So data loss is still possible, since WAL files are only closed and archived at finite intervals. If you archive the WAL every 5 minutes, you can lose 5 minutes of data. But this is still better than using a plain sql_dump
    • Base backup

      • Missing WAL logs from the archive might hamper future restore, so regular base backups will help keep the error surface area a little small.
      • old base backup + too many WAL logs to restore increase the restore time.
      • Use the pg_basebackup command from an external machine (or the same machine, with a different data directory setting), providing the connection info to connect to the leader.
    • WAL shipping

      • Each WAL file is exactly 16MB, but they can be compressed down to only a few KB if they’re mostly empty
      • This allows backups to be taken much more frequently; the Postgres documentation suggests shipping the WAL as often as once a minute.
  • Streaming replication based

    Using true streaming-protocol-based replication; what do you need?

    • In streaming replication, WAL data is sent immediately rather than waiting for a whole segment to be produced and shipped.
    • Streaming replication based backup + WAL Archival

      • The log is written in 16MB segments in a directory (normally pg_wal). It will keep growing if you have replication slots enabled AND replication is failing (or the link is too slow). It will shrink as the data is able to be replicated.
    • Using replication slot based (no extra WAL archival needed)

      • ????
    • pg_receivewal based

      • ?????
  • Using both (fallback)

    You can and usually should combine the two methods, using streaming replication usually, but with archive_command enabled. Then on the replica, set a restore_command to allow the replica to fall back to restore from WAL archives if there are direct connectivity issues between primary and replica.

    • The case of replication slots

      • Without replication slots

        • See this thread: https://www.reddit.com/r/PostgreSQL/comments/18yiyfb/question_on_archive_command_usage_with_streaming/
          • Has barman usage
        • In the simplest case, you still needed a WAL archive for complete robustness.
        • The primary server didn’t actually keep a list of its supposed standby servers; it just streamed whatever WAL happened to be requested if it still had it. If the standby fell sufficiently far behind, streaming replication would fail and recovery from the archive would kick in. If you didn’t have an archive, the standby would then no longer be able to catch up and would have to be rebuilt.
        • >If the asynchronous replica falls too far behind the master, the master might throw away information the replica needs if wal_keep_segments is too low and no slot is used. — What if wal_keep_segments is 0 (default) and a replication slot is in use?
          • Then a replication slot is in use, so WAL is retained indefinitely. However, in PostgreSQL 13 there is now a max_slot_wal_keep_size option that sets a limit for slot WAL retention. Consider it an advanced feature, you probably don’t need it. Also note that wal_keep_segments was replaced with the max_wal_size setting.
        • streaming replication is possible without using archive_command
          • provided the data ingestion throughput never exceeds the rate of the follower streaming the logs directly from the leader, and applying them locally (also depends on the network latency).
          • If the follower is not able to keep up with the logs, the logs on leader might get recycled, and the follower will keep waiting for the now-non-existent WAL file. Force-starting the follower in case of failure will result in data loss.
      • With replication slots (introduced later)

        • if you’re using replication slots, then you don’t need a WAL archive.
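        A hedged sketch of the slot-based variant (slot name made up):

            -- on the primary: WAL is retained until this slot has consumed it
            SELECT pg_create_physical_replication_slot('standby1_slot');
            -- cap how much WAL a lagging or abandoned slot may retain (PG 13+)
            ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
            SELECT slot_name, active, restart_lsn FROM pg_replication_slots;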

Continuous Archival without PITR (Not recommended, not really a backup)

If you drop a table on the master, oops, it’s dropped on the replicas too!

  • This is basically Logical and Streaming replication and using it in a way that it can be considered a “backup” for your usecase.
  • We continuously put the stuff out somewhere but we don’t get any point in time recovery mechanism. We have what we have replicated.

Resources

Features

Temporary Tables

  • Implemented like regular tables, but they use temp_buffers and spill to disk for metadata and overflow etc.
  • Unlike regular tables, they can’t be used by multiple connections at the same time, are not subject to the same locking, are not written to WAL, etc.
  • Their metadata lives in the PostgreSQL system catalogs. In addition, for each table one or more files are created on disk (by default, in the same folder as the files for regular tables).
  • Allows different sessions to use the same temporary table name for different purposes
  • Automatically dropped at the end of a session, or optionally at the end of the current transaction based on ON COMMIT

Issues

  • Autovacuum daemon cannot access and therefore cannot vacuum or analyze temporary tables.
  • Need to be purged
    • DELETE is bad because MVCC is used for temporary tables as well, and deleting records is slow (see Concurrency)
    • So we TRUNCATE: TRUNCATE simply creates a new file on disk and does an UPDATE of the pg_class table.
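  A minimal sketch of a transaction-scoped temp table (names and values made up):

      BEGIN;
      CREATE TEMP TABLE staging_rows (id bigint, payload jsonb) ON COMMIT DROP;
      INSERT INTO staging_rows VALUES (1, '{"k": "v"}');
      COMMIT;  -- the table disappears here
      -- for a session-long temp table, purge with TRUNCATE rather than DELETE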

Read more

UPSERT

-- canonical
INSERT INTO table (col1, col2, col3)
VALUES (val1, val2, val3)
ON CONFLICT conflict_target conflict_action;
-- example
INSERT INTO employees (id, name, email)
VALUES (2, 'Dennis', 'dennisp@weyland.corp')
ON CONFLICT (id) DO UPDATE
  SET name = EXCLUDED.name, email = EXCLUDED.email;
-- conflict_target: (id)
-- conflict_action: DO UPDATE SET ...

Granular Access Control

Row Level Security (RLS)

  • It needs to be enabled on per-table basis
  • Access control is fully modeled inside the database
  • It doesn’t matter how you connect to it anymore
  • As long as your request can be authenticated as the appropriate database role, the policies decide which rows in the table you can access.
  • Row security policies can be specific to commands, or to roles, or to both.
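  A hedged sketch (table, column and setting names made up) of a per-tenant policy:

      ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
      CREATE POLICY tenant_isolation ON documents
        FOR ALL
        USING (tenant_id = current_setting('app.tenant_id')::bigint);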

Nested/Sub Transactions

BEGIN; -- Start the main transaction
-- Perform some operations
INSERT INTO accounts (account_id, amount) VALUES (1, 1000);
-- Create a savepoint
SAVEPOINT my_savepoint;
-- Perform operations within the nested transaction
INSERT INTO transactions (trans_id, account_id, amount) VALUES (101, 1, 100);
UPDATE accounts SET amount = amount - 100 WHERE account_id = 1;
-- Decide to roll back the nested transaction
ROLLBACK TO SAVEPOINT my_savepoint;
-- The nested transaction's changes are undone, but the main transaction is still in progress
-- Continue with other operations
INSERT INTO logs (message) VALUES ('Nested transaction was rolled back.');
-- Finally, commit the main transaction
COMMIT;
  • See Database Transactions
  • We want to avoid such nested transactions if possible.
  • Let the application logic create the transaction and put things in. If possible don’t try to handle this at the DB level.
  • Subtransactions can commit or abort without affecting their parent transactions, allowing parent transactions to continue.
  • Started using SAVEPOINT (or an EXCEPTION block in PL/pgSQL)

WAL

  • See A write-ahead log is not a universal part of durability | Hacker News
  • Main idea: Changes to data files (where tables and indexes reside) must be written only after those changes have been logged
  • With the WAL, we will be able to recover the database so we need not flush pages on every transaction
  • When restoring the WAL after a crash, we’ll recover to the checkpoint created by the checkpointer

WAL flush v/s Data flush

  • The WAL file is written sequentially, and so the cost of syncing the WAL is much less than the cost of flushing the data pages.
  • If DB is handling many small concurrent transactions, one fsync of the WAL file may suffice to commit many transactions. commit_delay can be used to increase the window of transactions to be flushed.

Journaling filesystems

  • In certain cases the underlying filesystem (via the SQLite VFS interface) has journaling support; there we could skip SQLite’s own journaling and gain some performance, but this is usually not needed and unnecessarily complex. It can also go the other way: we can disable filesystem journaling instead.

MVCC Implementation

See MVCC

VACUUM

See A Detailed Understanding of MVCC and Autovacuum Internals in PostgreSQL 14

Terms

  • Live Tuples: Tuples that are inserted or up-to-date and can be read or modified.
  • Dead Tuples: Tuples that have been changed (updated/deleted) and are unavailable to any future transactions. (This is what VACUUM clears)

Usecase

  • MVCC Version Cleanup

    • VACUUM is like a Garbage collector for older versions of rows created by MVCC
    • transaction id is a 32-bit integer
      • the VACUUM process is responsible for making sure that the id does not overflow.
      • Never disable the VACUUM, else transaction wraparound
    • See Database Transactions
  • Run ANALYZE on tables
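  A quick check of how many dead tuples are waiting to be vacuumed, and when autovacuum last ran:

      SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
      FROM pg_stat_user_tables
      ORDER BY n_dead_tup DESC;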

Freezing

autovacuum conditions

Views & Materialized views

See SQL for more info on “views” in general.

Maintaining views

  • Theory

    • See pgivm docs
  • Implementation

    • REFRESH MATERIALIZED VIEW
      • complete re-computation of contents
    • Incremental View Maintenance (IVM) (pg extension)
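    A tiny sketch of the re-computation approach (view name made up):

        REFRESH MATERIALIZED VIEW sales_summary;                -- full recomputation, locks out readers
        REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary;   -- needs a unique index, but doesn’t block reads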

TODO Locks

Internals

Pages

  • AKA Data blocks
  • Page number starts from 0

Buffer Pool

Other extra info on Postgres

Psycopg2 notes

Core Concepts

Connection > Transaction (each with conn block) > Cursor

  • Connection

    • Encapsulates a database session
    • Thread Safe and can be shared among many threads.
      • There’s a big problem here. (See my psycopg2 FAQ section in this note)
    • closing a connection without committing the changes first will cause any pending change to be discarded
  • Transactions

    • transactions are handled by the connection class
      • The connection is responsible for terminating its transaction, calling either the commit() or rollback()
      • If close() is called, or something goes wrong in the middle of a transaction, it’s discarded by the DB server
    • autocommit mode
      • connection can be set to autocommit mode
      • no rollback possible
      • All the commands executed will be immediately committed
    • Transactions are per-session, i.e. per-connection.
  • Cursor

    • Not thread safe
    • Client side cursor

      • Allows Python code to execute PostgreSQL command in a database session(connection -> transaction)
      • Bound to a connection for the entire lifetime
    • Server side cursor

      • When creating a cursor, if a name is specified, it’ll be a server side cursor
      • By default a named cursor is declared without SCROLL and WITHOUT HOLD

Context manager (with)

|            | After with-block exit, is it closed? | After with-block exit, is the transaction committed?      |
| connection | NO                                   | YES, if no exception (i.e. the transaction is terminated) |
| cursor     | YES, any resources are released      | NO, no notion of a transaction here                       |
  • with can be used with both connection and cursor, so the with-exit action only applies to whatever it is used on. Eg. having a connection but only using with on the cursor will NOT close the connection.
    • In a way using with for connection is more conceptually along the lines of using context manager for the transaction
  • A connection can be used in more than a with statement (i.e more than one context)
    • Each with block is effectively wrapped in a separate transaction
  • Since the context manager will not call connection.close() for you, make sure to call it yourself wherever it seems fit if you don’t want a long-running database session.

psycopg2 FAQ

  • In the docs there are mentions of “command”; it simply means an SQL query. Got me confused.
  • Do we need to commit SELECT(s)?

    • A SELECT runs in a transaction too
    • You can get by without committing it, but it’s good practice to commit (context managers will do it for you)
  • What about using psycopg2 in multi-threaded setting?

    • Multiple concerns
    • According to the docs, the connection objects are thread-safe and shareable among threads. cursors are not.
      • But cursors created from the same context (connection) are not isolated!
      • On top of that, it’s not super clear what “the connection object is shareable among threads” actually means.
      • You do need to be very careful when sharing a connection across threads, because commits and rollbacks are performed for the entire connection. If you have a multistep transaction and another thread performs a commit in the middle, you may not like the result.
    • For all these general confusion, go with the general rule: Each thread should have its own database connection.
    • Now we have 2 options
      • Create a new connection in each thread and close it
        • Something like: with psycopg2.connect(your_dsn) as conn:
      • Use a connection pool
        • With psycopg2, the context manager is not supported for pools; just something to be aware of.
        • We should be releasing the connection instead of closing it
        • Some more context: Transaction Handling with Psycopg2 | Libelli (might not be 100% correct)
    • MAX CONNECTIONS in psycopg2 and ThreadPoolExecutor/ Python Threading
  • Long running connections

    • The difficulty with long-lived connections is that you might not be entirely sure that they’re still there.
      • Generally try to close the connection asap if possible
    • If using something like a webserver, re-creating the connection can be expensive. This is when you’d want to use connection pools, server-side or client-side based on the architecture of your application

To read (TODO)

Relevant Notes

  • Initializing a postgres directory: After logging into the postgres user, you can create a directory with necessary postgres files with the initdb command. It creates a directory in the file system and then you can start a postgres server directing it to that directory.
  • Tablespaces: All tables are by default created in pg_default tablespace, creating in a tablespace does not affect the logical SQL schema.

Tools

Routing

Others

Extension

Ecosystem

Other comments

  • Some PostgreSQL footguns:

    • default configuration
    • long transactions mixed with OLTP workload
    • repmgr (and other HA tools not based on consensus algorithms)
    • LISTEN/NOTIFY
    • “pg_dump is a utility for backing up a PostgreSQL database” says the official doc
    • moving/copying PGDATA and ignoring glibc version change
    • hot_standby_feedback on/off dilemma
    • partitioning of tables having high number of indexes and receiving QPS >> 1k
    • “setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions” says the official doc
    • using replication slots w/o setting max_slot_wal_keep_size
    • relying only on statement_timeout & idle_in_transaction_session_timeout and thinking it’s enough (lack of transaction_timeout)
    • data types “money”, “enum”, “timestamp” (3 different cases)
    • int4 PK
    • massive DELETE
    • attempts to use transactional version of DDL (e.g., CREATE INDEX) under heavy loads
    • subtransactions
    • DDL under load without lock_timeout