Sparse vector field type
A sparse_vector field can index features and weights so that they can later be used to query documents with a sparse_vector query.
This field can also be used with a legacy text_expansion query.
sparse_vector is the field type that should be used with ELSER mappings.
resp = client.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "text.tokens": {
                "type": "sparse_vector"
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index',
  body: {
    mappings: {
      properties: {
        'text.tokens' => {
          type: 'sparse_vector'
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index",
  mappings: {
    properties: {
      "text.tokens": {
        type: "sparse_vector",
      },
    },
  },
});
console.log(response);

PUT my-index
{
  "mappings": {
    "properties": {
      "text.tokens": {
        "type": "sparse_vector"
      }
    }
  }
}
Token pruning
For newly created indices, token pruning is turned on by default with appropriate default settings. You can control this behaviour using the optional index_options parameters for the field:
PUT my-index
{
  "mappings": {
    "properties": {
      "text.tokens": {
        "type": "sparse_vector",
        "index_options": {
          "prune": true,
          "pruning_config": {
            "tokens_freq_ratio_threshold": 5,
            "tokens_weight_threshold": 0.4
          }
        }
      }
    }
  }
}
See semantic search with ELSER for a complete example on adding documents to a sparse_vector mapped field using ELSER.
Parameters for sparse_vector fields
The following parameters are accepted by sparse_vector fields:
store
Indicates whether the field value should be stored and retrievable independently of the _source field.
Accepted values: true or false (default).
The field’s data is stored using term vectors, a disk-efficient structure compared to the original JSON input.
The input map can be retrieved during a search request via the

index_options
(Optional, object) You can set index options for your

Parameters for index_options are:

prune
(Optional, boolean) Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If

pruning_config
(Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if

Parameters for pruning_config include:

tokens_freq_ratio_threshold
(Optional, integer) Tokens whose frequency is more than

tokens_weight_threshold
(Optional, float) Tokens whose weight is less than
The default values for tokens_freq_ratio_threshold and tokens_weight_threshold were chosen based on tests using ELSERv2 and are the values that provided the best results.
When token pruning is applied, non-significant tokens will be pruned from the query. Non-significant tokens can be defined as tokens that meet both of the following criteria:
- The token appears much more frequently than most tokens, indicating that it is a very common word and may not benefit the overall search results much.
- The weight/score is so low that the token is likely not very relevant to the original term.
Both the token frequency threshold and weight threshold must show the token is non-significant in order for the token to be pruned. This ensures that:
- The tokens that are kept are frequent enough and have significant scoring.
- Very infrequent tokens that may not have as high of a score are removed.
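The two-criteria rule can be illustrated with a small Python sketch. This is not Elasticsearch's implementation: the function name, the token_freqs and avg_freq inputs, and the reading of tokens_freq_ratio_threshold as a multiple of the average token frequency are assumptions made for illustration. The threshold values mirror the mapping example above.

```python
def prune_tokens(query_weights, token_freqs, avg_freq,
                 tokens_freq_ratio_threshold=5,
                 tokens_weight_threshold=0.4):
    """Return the query tokens that survive pruning.

    Hypothetical helper: a token is dropped only when BOTH checks
    mark it as non-significant.
    """
    kept = {}
    for token, weight in query_weights.items():
        # Assumption: "too frequent" means more than threshold x average frequency.
        too_frequent = token_freqs.get(token, 0) > tokens_freq_ratio_threshold * avg_freq
        # "Too light" means the weight falls below the weight threshold.
        too_light = weight < tokens_weight_threshold
        if too_frequent and too_light:
            continue  # non-significant on both counts: prune
        kept[token] = weight
    return kept


query = {"the": 0.1, "carrots": 0.8, "often": 0.9, "rare": 0.05}
freqs = {"the": 1000, "carrots": 10, "often": 1000, "rare": 5}
survivors = prune_tokens(query, freqs, avg_freq=20)
```

In this sketch only "the" is dropped: it is both far more frequent than average and low in weight. "often" is frequent but carries a high weight, and "rare" has a low weight but is not frequent, so both are kept.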
Multi-value sparse vectors
When passing in arrays of values for sparse vectors, the max value for similarly named features is selected.
The paper Adapting Learned Sparse Retrieval for Long Documents (https://arxiv.org/pdf/2305.18494.pdf) discusses this in more detail. In summary, research findings support representation aggregation typically outperforming score aggregation.
For instances where you want to have overlapping feature names, you should store them separately or use nested fields.
Below is an example of passing in a document with overlapping feature names.
Consider that in this example two categories exist for positive sentiment and negative sentiment.
However, for the purposes of retrieval we also want the overall impact rather than specific sentiment.
In the example, impact is stored as a multi-value sparse vector and only the max values of overlapping names are stored.
More specifically, the final GET query here returns a _score of ~1.2 (which is max(impact.delicious[0], impact.delicious[1]) and is approximate because of the relative error of 0.4% explained below).
resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "text": {"type": "text", "analyzer": "standard"},
            "impact": {"type": "sparse_vector"},
            "positive": {"type": "sparse_vector"},
            "negative": {"type": "sparse_vector"}
        }
    },
)
print(resp)

resp1 = client.index(
    index="my-index-000001",
    document={
        "text": "I had some terribly delicious carrots.",
        "impact": [
            {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
            {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}
        ],
        "positive": {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
        "negative": {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}
    },
)
print(resp1)

resp2 = client.search(
    index="my-index-000001",
    query={"term": {"impact": {"value": "delicious"}}},
)
print(resp2)

const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      text: { type: "text", analyzer: "standard" },
      impact: { type: "sparse_vector" },
      positive: { type: "sparse_vector" },
      negative: { type: "sparse_vector" },
    },
  },
});
console.log(response);

const response1 = await client.index({
  index: "my-index-000001",
  document: {
    text: "I had some terribly delicious carrots.",
    impact: [
      { I: 0.55, had: 0.4, some: 0.28, terribly: 0.01, delicious: 1.2, carrots: 0.8 },
      { I: 0.54, had: 0.4, some: 0.28, terribly: 2.01, delicious: 0.02, carrots: 0.4 },
    ],
    positive: { I: 0.55, had: 0.4, some: 0.28, terribly: 0.01, delicious: 1.2, carrots: 0.8 },
    negative: { I: 0.54, had: 0.4, some: 0.28, terribly: 2.01, delicious: 0.02, carrots: 0.4 },
  },
});
console.log(response1);

const response2 = await client.search({
  index: "my-index-000001",
  query: {
    term: {
      impact: { value: "delicious" },
    },
  },
});
console.log(response2);

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "text": { "type": "text", "analyzer": "standard" },
      "impact": { "type": "sparse_vector" },
      "positive": { "type": "sparse_vector" },
      "negative": { "type": "sparse_vector" }
    }
  }
}

POST my-index-000001/_doc
{
  "text": "I had some terribly delicious carrots.",
  "impact": [
    {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
    {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}
  ],
  "positive": {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
  "negative": {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}
}

GET my-index-000001/_search
{
  "query": {
    "term": {
      "impact": { "value": "delicious" }
    }
  }
}
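To make the max selection concrete, the following Python sketch reproduces on the client side what the text describes for the impact field. The helper name merge_multi_value is hypothetical and this is only an illustration of the aggregation rule, not how Elasticsearch stores the data internally.

```python
def merge_multi_value(vectors):
    """Collapse a list of sparse vectors into one, keeping the
    maximum weight for every feature name that appears more than once."""
    merged = {}
    for vector in vectors:
        for feature, weight in vector.items():
            # Keep the larger of the stored weight and the new one.
            merged[feature] = max(weight, merged.get(feature, weight))
    return merged


impact = [
    {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
    {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4},
]
merged_impact = merge_multi_value(impact)
```

Here merged_impact["delicious"] is 1.2 and merged_impact["terribly"] is 2.01, which is why the term query on impact for "delicious" scores about 1.2.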
sparse_vector fields cannot be included in indices that were created on Elasticsearch versions between 8.0 and 8.10.
sparse_vector fields only support strictly positive values.
Negative values will be rejected.
sparse_vector fields do not support analyzers, querying, sorting or aggregating.
They may only be used within specialized queries.
The recommended query to use on these fields is the sparse_vector query.
They may also be used within legacy text_expansion queries.
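As a rough sketch of what a sparse_vector query body might look like when precomputed token weights are supplied, the helper below assembles the request as a plain dictionary. The helper name is hypothetical, and the key names (field, query_vector, prune, pruning_config) reflect the sparse_vector query DSL as understood here; check them against the query reference for your Elasticsearch version.

```python
def build_sparse_vector_query(field, query_vector, prune=None, pruning_config=None):
    """Build a sparse_vector query body from precomputed token weights.

    Hypothetical helper; key names are assumptions to verify against
    the sparse_vector query documentation.
    """
    body = {"field": field, "query_vector": dict(query_vector)}
    if prune is not None:
        body["prune"] = prune
    if pruning_config is not None:
        body["pruning_config"] = dict(pruning_config)
    return {"sparse_vector": body}


sparse_query = build_sparse_vector_query(
    "impact",
    {"delicious": 1.0, "carrots": 0.5},
    prune=True,
    pruning_config={
        "tokens_freq_ratio_threshold": 5,
        "tokens_weight_threshold": 0.4,
    },
)
```

The resulting dictionary could then be passed as the query argument of client.search, in the same way as the term query in the example above.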
sparse_vector fields only preserve 9 significant bits for the precision, which translates to a relative error of about 0.4%.
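The 0.4% figure follows from the bit count: 9 significant bits means one implicit leading bit plus 8 stored fraction bits, so the worst-case relative error of truncation is 2^-8, or about 0.39%. The Python sketch below illustrates this by truncating a 32-bit float to 8 explicit mantissa bits. It assumes truncation of an IEEE 754 single-precision value and is meant to show the arithmetic, not to document the exact on-disk encoding.

```python
import struct


def truncate_to_9_significant_bits(value: float) -> float:
    """Keep 1 implicit + 8 explicit mantissa bits of a float32 value."""
    if value <= 0:
        raise ValueError("sparse_vector weights must be strictly positive")
    # Reinterpret the float32 as a 32-bit unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    # float32 has 23 mantissa bits; clearing the lowest 15 leaves 8.
    bits = (bits >> 15) << 15
    return struct.unpack(">f", struct.pack(">I", bits))[0]


stored = truncate_to_9_significant_bits(1.2)
relative_error = (1.2 - stored) / 1.2
worst_case = 2 ** -8  # 0.00390625, i.e. about 0.4%
```

For the weight 1.2 used in the example above, the truncated value is 1.19921875, which is why the returned _score is only approximately 1.2.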