Elasticsearch Scripting: Understanding The Difference Between doc And params

Painless scripts let you customize many things in Elasticsearch. Almost every script has one thing in common: it reads document fields. There are two different ways to do that, and every developer should know both, because the choice can have a huge impact on performance.

An Example

Let’s start with some sample data.

POST superhero/_doc/1
{
  "name": "Winter Monk",
  "race": "Cyborg",
  "eye_color": "green",
  "alignment": "good",
  "strength": 102
}
 
POST superhero/_doc/2
{
  "name": "Jungle Banana",
  "race": "Mutant",
  "eye_color": "red",
  "alignment": "bad",
  "strength": 492
}
         
POST superhero/_doc/3
{
  "name": "Green Flash",
  "race": "Human",
  "eye_color": "green",
  "alignment": "bad",
  "strength": 98
}

As explained in the last article, we could use scripted sorting to add some custom sort logic.

GET superhero/_search
{
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": "params.mapping[doc['alignment.keyword'].value]",
          "params": {
            "mapping": {
              "neutral": 1,
              "good": 2,
              "bad": 3
            }
          }
        }
      }
    }
  ]
}

This script maps the alignment to a custom sort order (that differs from the natural order of the alignment strings).

Two different ways to access document attributes

The example above shows one way to access document fields. The doc variable refers to the current document, and its fields can be accessed dictionary-style.

doc['field_name'].value

This is the recommended way. It uses a special data structure called doc_values that is created at index time. Think of it as a column-oriented mapping from each document to the values of a field. It is used for sorting, aggregations, and fast value lookups from scripts. Doc values are stored on disk, and Elasticsearch relies on the file system cache to keep frequently used entries in RAM. That costs some memory but results in faster execution. And since search is (in most cases) about query speed, this is the approach you should go for.

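There are limits. .value returns a single value, so for a multi-valued field (an array) you have to iterate over doc['field_name'] instead, and the inner structure of nested objects is not available this way. Also, analyzed text fields have no doc_values by default; reading them from a script requires fielddata, which loads all field terms into heap memory. Stick to non-analyzed fields (keywords, numbers), which is why the example above reads alignment.keyword rather than alignment.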
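
One more practical caveat: in recent Elasticsearch versions, doc['field_name'].value throws an error when a document has no value for the field. Below is a minimal sketch of a more defensive variant of the sort script from the example. It assumes that documents without an alignment, or with one that is missing from the mapping, should get a fallback rank of 0; that rank is an arbitrary choice for illustration. The triple-quoted script string is Kibana Console syntax.

```
GET superhero/_search
{
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": """
            // Assumption: rank 0 is an arbitrary fallback for illustration.
            // Guard against documents that have no value for the field.
            if (doc['alignment.keyword'].size() == 0) {
              return 0;
            }
            // Fall back to 0 as well when the alignment is not in the mapping.
            return params.mapping.getOrDefault(doc['alignment.keyword'].value, 0);
          """,
          "params": {
            "mapping": {
              "neutral": 1,
              "good": 2,
              "bad": 3
            }
          }
        }
      }
    }
  ]
}
```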

The other option is accessing the document source directly.

params['_source']['field_name']

This gives you full access to the document as it was indexed, including arrays and nested objects, and even fields that were never indexed. But there is a pitfall: Elasticsearch has to load and parse the stored document source for every document the script runs on, and that costs a lot of time. Avoid it whenever possible.
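
To see both access styles side by side, here is a small sketch that uses script_fields to read the same field once through doc_values and once through the source. The two output field names (strength_from_doc_values and strength_from_source) are made up for this example. For the sample documents, both fields should carry the same strength value; the difference lies in how Elasticsearch gets there.

```
GET superhero/_search
{
  "script_fields": {
    "strength_from_doc_values": {
      "script": {
        "lang": "painless",
        "source": "doc['strength'].value"
      }
    },
    "strength_from_source": {
      "script": {
        "lang": "painless",
        "source": "params['_source']['strength']"
      }
    }
  }
}
```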

Conclusion

Accessing fields via the source is not a real option unless your index is really, really small. If you need to look up something that is not available through doc_values, consider remodeling your index mapping instead.
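
As a rough idea of what such a remodeling could look like, the following sketch creates an index with an explicit mapping in which every field a script needs is a keyword or a number and therefore backed by doc_values. The index name superhero_v2 is hypothetical, the field types are assumptions based on the sample data, and the typeless mapping format assumes Elasticsearch 7 or later.

```
PUT superhero_v2
{
  "mappings": {
    "properties": {
      "name":      { "type": "text" },
      "race":      { "type": "keyword" },
      "eye_color": { "type": "keyword" },
      "alignment": { "type": "keyword" },
      "strength":  { "type": "integer" }
    }
  }
}
```

With a mapping like this, a script could read doc['alignment'].value directly instead of going through the alignment.keyword sub-field.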
