Managing indexes
Once your vector table starts to grow, you will likely want to add an index to speed up queries. Without an index, every query performs a sequential scan, which becomes resource-intensive as the number of records grows.
IVFFlat indexes#
Today pgvector indexes use an algorithm called IVFFlat. IVF stands for 'inverted file indexes'. It works by clustering your vectors in order to reduce the similarity search scope. Rather than comparing a vector to every other vector, the vector is only compared against vectors within the same cell cluster (or nearby clusters, depending on your configuration).
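The clustering idea can be illustrated with a small, self-contained Python sketch. This is a toy model, not pgvector's implementation: the function names, the 2-D data, and the fixed centroids are all assumptions made for illustration (a real IVFFlat index derives its centroids from the data, typically with k-means).

```python
import math
import random

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_lists(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def search_one_list(query, vectors, centroids, lists):
    """Compare the query only against vectors in its nearest list."""
    nearest = min(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = lists[nearest]
    best = min(candidates, key=lambda i: l2(query, vectors[i]))
    return best, len(candidates)

random.seed(0)
# Two well-separated, hypothetical clusters of 2-D points.
vectors = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
vectors += [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(50)]
# Centroids are hard-coded here for clarity; this is an assumption of the sketch.
centroids = [(0.0, 0.0), (10.0, 10.0)]

lists = build_lists(vectors, centroids)
query = (9.5, 10.2)
best_id, compared = search_one_list(query, vectors, centroids, lists)
print(f"best match: {best_id}, vectors compared: {compared} of {len(vectors)}")
```

In this sketch the query is compared against only the vectors in one list rather than the full set, which is where the speed-up comes from.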
Inverted lists (cell clusters)#
When you create the index, you choose the number of inverted lists (cell clusters). Increase this number to speed up queries, but at the expense of recall.
For example, to create an index with 100 lists on a column that uses the cosine operator:
create index on items using ivfflat (column_name vector_cosine_ops) with (lists = 100);
For more info on the different operators, see Distance operators.
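As a rough way to pick a starting value for lists, the sketch below encodes a rule of thumb that pgvector's README has suggested: rows / 1000 for tables of up to 1M rows, and sqrt(rows) above that. Treat the rule as an assumption to check against the README for your pgvector version; the helper names `suggested_lists` and `create_index_sql` are hypothetical and only format a statement of the same shape as the example above.

```python
import math

def suggested_lists(row_count):
    """Starting point only: rows / 1000 up to 1M rows, sqrt(rows) beyond that."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))

def create_index_sql(table, column, opclass, lists):
    """Format a CREATE INDEX statement; table and column names are placeholders."""
    return (
        f"create index on {table} using ivfflat "
        f"({column} {opclass}) with (lists = {lists});"
    )

print(create_index_sql("items", "column_name", "vector_cosine_ops",
                       suggested_lists(100_000)))
```

Whatever value you start from, measure query speed and recall on your own data before settling on it.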
For every query, you can set the number of probes (1 by default). The number of probes corresponds to the number of nearby cells to probe for a match. Increase this for better recall at the expense of speed.
To set the number of probes for the duration of the session, run:
set ivfflat.probes = 10;
To set the number of probes only for the current transaction, run:
begin;
set local ivfflat.probes = 10;
select ...
commit;
If the number of probes is the same as the number of lists, exact nearest neighbor search will be performed and the planner won't use the index.
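The effect of probes can be seen in a toy Python model. This is a sketch under stated assumptions, not pgvector's code: `ivf_search`, the 1-D vectors, and the hand-built lists are all hypothetical. With one probe the search misses the true nearest neighbor because it sits just across a cell boundary; with more probes it is found.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ivf_search(query, vectors, centroids, lists, probes=1):
    """Return the id of the nearest vector found in the `probes` closest lists."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in lists[c]]
    return min(candidates, key=lambda i: l2(query, vectors[i]))

# Hypothetical 1-D data: three lists with centroids at 0, 10 and 20.
vectors = [(1.0,), (3.0,), (5.5,), (9.0,), (19.0,)]
centroids = [(0.0,), (10.0,), (20.0,)]
lists = {0: [0, 1], 1: [2, 3], 2: [4]}  # each id sits in its nearest centroid's list

query = (4.9,)
print(ivf_search(query, vectors, centroids, lists, probes=1))  # 1 (misses id 2)
print(ivf_search(query, vectors, centroids, lists, probes=2))  # 2 (true nearest)
print(ivf_search(query, vectors, centroids, lists, probes=3))  # 2 (every list scanned)
```

When probes equals the number of lists, every vector is a candidate, so the toy search is equivalent to a full scan. This mirrors why the planner skips the index in that case.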
Approximate nearest neighbor#
One important note with IVF indexes is that nearest neighbor search is approximate, since exact search on high dimensional data can't be indexed efficiently. This means that similarity results will change (slightly) after you add an index (trading recall for speed).
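One way to quantify this trade-off is to run the same query with and without the index and compare the returned ids. The helper below is a minimal sketch of that comparison; `recall_at_k` and the id lists are hypothetical and stand in for results you would collect from your own queries.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)

# Hypothetical results for one query: ids from a sequential scan vs. an indexed scan.
exact_ids = [7, 3, 9, 1, 4]
approx_ids = [7, 3, 1, 8, 4]
print(recall_at_k(approx_ids, exact_ids))  # 0.8
```

Averaging this value over a representative set of queries gives a single number to track while tuning lists and probes.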
Distance operators#
The type of index required depends on the distance operator you are using. pgvector includes 3 distance operators:
Operator | Description | Operator class |
---|---|---|
<-> | Euclidean distance | vector_l2_ops |
<#> | negative inner product | vector_ip_ops |
<=> | cosine distance | vector_cosine_ops |
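For reference, the three distances in the table can be written out in plain Python. This is an illustrative sketch of the underlying math, not pgvector's code; the function names are hypothetical, and the cosine version assumes neither vector is all zeros.

```python
import math

def l2_distance(a, b):
    """Euclidean distance, the quantity returned by <->."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Negative inner product, the quantity returned by <#>."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), the quantity returned by <=>."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)  # assumes non-zero vectors

a = [3.0, 4.0]
b = [4.0, 3.0]
print(l2_distance(a, b))             # sqrt(2), about 1.414
print(negative_inner_product(a, b))  # -24.0
print(cosine_distance(a, b))         # about 0.04
```

In all three cases a smaller value means a closer match, which is why the inner product is negated.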
Use the following SQL commands to create an index for the operator(s) used in your queries.
Euclidean L2 distance (vector_l2_ops)#
create index on items using ivfflat (column_name vector_l2_ops) with (lists = 100);
Inner product (vector_ip_ops)#
create index on items using ivfflat (column_name vector_ip_ops) with (lists = 100);
Cosine distance (vector_cosine_ops)#
create index on items using ivfflat (column_name vector_cosine_ops) with (lists = 100);
Currently vectors with up to 2,000 dimensions can be indexed.
If you are using the vecs Python library, follow the instructions in Managing collections to create indexes.
When should you add indexes?#
pgvector recommends adding indexes only after the table has sufficient data, so that the internal IVFFlat cell clusters are based on your data's distribution. Anytime the distribution changes significantly, consider recreating indexes.
Resources#
Read more about indexing on pgvector's GitHub page.