Database Indexing

tags : Database, Systems, Data Structures

FAQ

Are indexes only for performance?

Not necessarily
Sometimes they can be used to hold strict constraints like uniqueness constraints etc.

Ordered indexes(eg. B+trees) v/s hash tables?

See Why databases use ordered indexes but programming uses hash tables
Some day Btrees are better for disk and hash table better for in-memory, but this is NOT fundamental property and is very contextual
Persistent data has different needs than in-memory data.
In databases, we tend to use ordered indexes because they’re more general hence can support large variety of operations.

Hash Table(See Data Structures) v/s Trees

	Lookup	Ordering	Range Ops(`topk` / `starts_with`)	Spatial Locality
Hash Table	$O (1)$ (avg, can get worse)	Unordered	Need to scan entire table	No spatial locality
Trees	$O (l o g n)$	Shit’s ordered	Ez, can do, cuz ordered	Can prefetch!

How are tables stored? and how are indexes stored?

Table generated by GPT4^
InnoDB is similar to SQLite all tables are clustered indexes
- If you don’t have suitable primary key it creates a hidden one similar to rowid in SQLite.
PG only has unordered heaps, indexes are always secondary data structures. PG could really use true clustered indexes as an option.
MS-SQL and Oracle give you a choice for table storage to either have unordered heaps or ordered b-trees, depending on whether a clustered index is defined.

Query planner, selectivity, statistics and index

Assume we need to answer, Which index to use when doing query if we have index on name and address column? This needs context. The querry planner tries its best to answer this, else we(hoomans) chime in.
Sometimes even after having an index, it query planner might just choose not to use the index if the performance would be same as a sequential scan

On uuid v4 and Indexes

uuidv4 is random and should not be used for indexes because

they’re random and indexes are meant to be ordered based on the specific implementation
If we’re adding new items to the index and they’re random, we’d need to balance/re-order the index on each insert to the index
this causes thrashing which is bad for performance

Indexes and Foreign keys

In PostgreSQL, indexes are not created on FK(s) on the referencing(child) side, but it’s suggested to do so.
See DB Data Modeling

Best practices for Tuning

Performing index tuning must be done on production, or on a staging environment that is as close to production as possible. Doing index tuning locally might not even use index depending what the query planner thinks etc.
Ordinary CREATE INDEX commands require a lock that blocks writes but not reads. Prefer using CONCURRENTLY which can take longer but will not lock.
When using REINDEX, One strategy to achieve the same result on a live site is to build an index concurrently on the same table and columns but with a different name, and then drop the original index and rename the new one. This procedure, while much longer, doesn’t require any long running locks on the live tables. (From Heroku)

Types

Based on structure

Clustered Indexes

A table can only have one clustered index.
This physically changes the storage of the data in long term memory
The column you index with becomes sort of the primary key
Different databases have their own ways. sqlite has WITHOUT ROWID, PostgreSQL uses CLUSTER command etc.

Non-Clustered/Standard/Regular Indexes

A non-clustered index is a separate/secondary index that references back to the original data(primary key at the leaf)
In some DBs the secondary index is part of the table instead of begin a separate entity, but this is implementation detail.

Based on number of columns/indexes

Based on usecase, we might need to create both.

Multi-Column Indexes (AND)

This uses multiple columns but results in a single index.
Index gets used when you have AND in the query.
An example
- Index on a table defined as (col1, col2, col3)
- A query filtering on col1 can efficiently use this index.
- A query filtering on col1 and col2 (in that order) can also use it effectively.
- However, a query filtering just on col2 or on col2 and col3 cannot fully utilize this index, as the col1 is not used in the query.
Most efficient when there are constraints on the leading (leftmost) columns.
In PostgreSQL, only the B-tree, GiST, GIN, and BRIN index types support multiple-key-column indexes. Upto 32 columns but can be altered.
When writing query the order of the columns AND‘ed don’t matter (NEED TO CONFIRM)

Composite/Compound Indexes
- Eg. INDEX(last_name, first_name)

Covering Indexes
- These are not necessarily multi-column indexes, you can make covering indexes with single columns also but it’s mostly multicolumn in its usecase.
- Multicolumn example: INDEX(last_name, first_name, address)
- The query must reference only columns stored in the index.
- Not all indexing methods support covering indexes, Btrees do.
- In PostgreSQL it also has to do something with MVCC. i.e significant fraction of the table’s heap pages have to have their all-visible map bits set for this to be useful.
- In PostgreSQL, you use INCLUDE for this. Before INCLUDE people just tried to hack their way to introduce an extra playload using multicolumn indexes.
- Unique index on covering index
```
-- uniqueness condition applies to just column x
-- not to the combination of x and y
CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y);
```

Multiple indexes (OR/AND)

This is not really a index type, just a situation where you have multiple standard indexes.
INDEX(first_name), INDEX(last_name)

Based on constraints

Unique Indexes

Can be combined with: [ Multi-column indexes , Partial indexes ]
Only B-tree indexes can be declared unique.
Any attempt to insert a duplicate value results in an error. If creating on exiting column it should throw error if there exists duplicates.
PostgreSQL has UNIQIE constraint also creates a unique index on it automatically
See NULL in SQL

Based on conditions

Partial Indexes

Index on a subset of a table. using the WHERE clause.
The subset is defined by a condition(predicate of the partial index)
Because it’s only on a subset of the table, it’s usually smaller and overall results in faster updates
Basically helps in not making an index on values of a table that we’re probably not going to query that much anyway
Such partial indexes are best used for data distributions that do not change.
Matching takes place at query planning time(compile time), not at run time. NO PARAMETERIZED
- As a result, parameterized query clauses do not work with a partial index.
- As replacement in parameterized query happen at run/execution time. (See SQL)
NOT AN SUBSTITUTE FOR PARTITION (See 16: 11.8. Partial Indexes)
Can also be used to override the system’s query plan choices

Example/Usecase of partial index

Exclude common/uninteresting values: Table has public and private IPs, but you only query public IPs. So you can have an index only on those which are public IPs.

Querying on a different column but indexing on another

CREATE INDEX orders_unbilled_index ON orders (order_nr) WHERE billed is not true;
-- index is used fully
SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;
-- in the following even if index is not in amount, the index will help
-- partially. i.e does a full scan for "amount"
SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;
-- index does not get used
-- query planner doesn't know if result is even in the index because no WHERE clause
-- so it skips using the index
SELECT * FROM orders WHERE order_nr = 3501;

Unique index over a subset of a table
- Enforces uniqueness among the rows that satisfy the index predicate
- Setting up a partial index indicates that you know at least as much as the query planner knows
- partial + unique + multicolumn index can also be made

Index on expression

SELECT * FROM test1 WHERE lower(col1) = 'value';
CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));

Index column can be a function or scalar expression computed from one or more columns of the table.
Fast access to tables based on the results of computations.
Unique index also can be set on this, will enforce the uniqueness of the computation
COALESCE Index
- This is just an example of using expression index
- Eg. CREATE INDEX idx_test ON testtable(COALESCE(col1, col2));

Index Access methods

See PostgreSQL: Documentation: 16: 11.2. Index Types

B+tree

See Trees
Fetching ‘consecutive’ (according to the index) items from a BTree is very efficient because they are stored consecutively.
A BTree lookup is quite fast and efficient. For a million-row table there might be 3 levels of BTree, and the top two levels are probably cached.
B-tree indexes can be used for equality and range queries efficiently.
B-tree indexes are optimized for when a row has a single key value.

GiST

Hash

BRIN

BRIN index might need a bit of a stronger disclaimer, as far as I understand you really want to use that only for ordered data, and in those cases it is exceptionally good at its job. But it is a lot worse if that condition is not met.
BRIN is generally useful for datasets that have high local correlation.

Bloom

A traditional btree index is faster than a bloom index, but it can require many btree indexes to support all possible queries where one needs only a single bloom index.
Note however that bloom indexes only support equality queries, whereas btree indexes can also perform inequality and range searches.

Scan Types

Full table scan

Scan that happens when there’s no index

Single index scan

Regular index scans
Each row retrieval requires fetching data from both the index(btree) and the heap(table).

Index only scan

Using Covering Indexes

Multi index scan

This is related to Multiple indexes
This also using the same index multiple times if needed
Eg. A query like WHERE x = 42 OR x = 47 OR x = 53 OR x = 99 could be broken down into four separate scans of an index on x
To combine multiple indexes
- System scans each index and prepares a bitmap in memory
- Bitmap contains locations of table rows that are match index’s conditions.
- The bitmaps are then ANDed and ORed together as needed by the query.

🐏 mogoz

Table of Contents

Database Indexing

FAQ

Are indexes only for performance?

Ordered indexes(eg. B+trees) v/s hash tables?

How are tables stored? and how are indexes stored?

Query planner, selectivity, statistics and index

On uuid v4 and Indexes

Indexes and Foreign keys

Best practices for Tuning

Types

Based on structure

Clustered Indexes

Non-Clustered/Standard/Regular Indexes

Based on number of columns/indexes

Multi-Column Indexes (AND)

Multiple indexes (OR/AND)

Based on constraints

Unique Indexes

Based on conditions

Partial Indexes

Index on expression

Index Access methods

B+tree

GiST

Hash

BRIN

Bloom

Scan Types

Full table scan

Single index scan

Index only scan

Multi index scan

Resources

Graph View

Backlinks

🐏 mogoz

Table of Contents

Database Indexing

FAQ §

Are indexes only for performance? §

Ordered indexes(eg. B+trees) v/s hash tables? §

How are tables stored? and how are indexes stored? §

Query planner, selectivity, statistics and index §

On uuid v4 and Indexes §

Indexes and Foreign keys §

Best practices for Tuning §

Types §

Based on structure §

Clustered Indexes §

Non-Clustered/Standard/Regular Indexes §

Based on number of columns/indexes §

Multi-Column Indexes (AND) §

Multiple indexes (OR/AND) §

Based on constraints §

Unique Indexes §

Based on conditions §

Partial Indexes §

Index on expression §

Index Access methods §

B+tree §

GiST §

Hash §

BRIN §

Bloom §

Scan Types §

Full table scan §

Single index scan §

Index only scan §

Multi index scan §

Resources §

Graph View

Backlinks

FAQ

Are indexes only for performance?

Ordered indexes(eg. B+trees) v/s hash tables?

How are tables stored? and how are indexes stored?

Query planner, selectivity, statistics and index

On uuid v4 and Indexes

Indexes and Foreign keys

Best practices for Tuning

Types

Based on structure

Clustered Indexes

Non-Clustered/Standard/Regular Indexes

Based on number of columns/indexes

Multi-Column Indexes (AND)

Multiple indexes (OR/AND)

Based on constraints

Unique Indexes

Based on conditions

Partial Indexes

Index on expression

Index Access methods

B+tree

GiST

Hash

BRIN

Bloom

Scan Types

Full table scan

Single index scan

Index only scan

Multi index scan

Resources