Marcos G. Zimmermann edited this page Sep 26, 2023 · 12 revisions

The Esse::Index class is an abstraction of an Elasticsearch index. It is responsible for defining the index name, settings, mappings, data sources, and documents.

Here is a minimal example of an index:

class ArticlesIndex < Esse::Index
  repository :article do
    collection do |**context, &block|
      batch = [
        { id: 1, title: 'Article 1', body: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.' },
        { id: 2, title: 'Article 2', body: 'Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium.' },
      ]
      batch.delete_if { |item| item[:id] != context[:id] } if context[:id] # Just to simulate a filter
      block.call(batch, **context)
    end

    document do |item, **_context|
      {
        _id: item[:id], # The _id is a convention to define the document id. More on this later.
        title: item[:title],
        body: item[:body],
      }
    end
  end
end

Now, let's see what's happening here:

  • The ArticlesIndex class inherits from Esse::Index and defines a repository block.
  • The repository block defines a new repo identified by :article, with a collection and a document.
  • The collection block is responsible for fetching data from a data source. It may receive a context that can be used to filter the data, and a block that must be called with the fetched data.
  • The document block is responsible for transforming each item of the collection into an Esse::Document. Note that we are using a Hash as a document to keep things simple, but under the hood it will be converted to a generic Esse::HashDocument object. Always prefer to implement your own Esse::Document class.

The resulting documents can be inspected from a console:

> ArticlesIndex.documents
=> #<Enumerator: ...>
> ArticlesIndex.documents.to_a
=> [
  #<Esse::HashDocument @object={:_id=>1, :title=>"Article 1", :body=>"Lorem ipsum dolor sit amet, consectetur adipiscing elit."}, @options={}>,
  #<Esse::HashDocument @object={:_id=>2, :title=>"Article 2", :body=>"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium."}, @options={}>
]
> ArticlesIndex.documents(id: 1).to_a
=> [
  #<Esse::HashDocument @object={:_id=>1, :title=>"Article 1", :body=>"Lorem ipsum dolor sit amet, consectetur adipiscing elit."}, @options={}>
]
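
As a mental model only, the interplay between the two blocks can be sketched in plain Ruby without the gem. This is not Esse's actual implementation: COLLECTION, DOCUMENT and the documents helper are hypothetical names invented for the illustration, and the sketch returns plain hashes rather than Esse::HashDocument objects.

```ruby
# ASSUMPTION: simplified stand-ins for the collection and document blocks.
# Esse's real internals are not shown on this page and may differ.
COLLECTION = lambda do |**context, &block|
  batch = [
    { id: 1, title: 'Article 1' },
    { id: 2, title: 'Article 2' },
  ]
  # Simulate the :id filter from the example above
  batch = batch.select { |item| item[:id] == context[:id] } if context[:id]
  block.call(batch, **context)
end

DOCUMENT = lambda do |item, **_context|
  { _id: item[:id], title: item[:title] }
end

# Hypothetical helper: lazily walks every batch and maps each item
# through the document lambda.
def documents(**context)
  Enumerator.new do |yielder|
    COLLECTION.call(**context) do |batch|
      batch.each { |item| yielder << DOCUMENT.call(item, **context) }
    end
  end
end
```

Calling documents.to_a walks both items, while documents(id: 1).to_a only yields the first one, which mirrors the console session above.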

Now let's go deeper in each part of the index definition.

Repository

The repository is used to define a data source for the index. It can be a database table, a file, a web service, etc. The repository is responsible for fetching the data, enriching it and transforming it into documents. One index can have multiple repositories.

Defining a repository with a block:

class GeosIndex < Esse::Index
  repository :county do
    # ...
  end

  repository :city do
    # ...
  end
end

The identifier of the repository must be unique within the index. By default, a constantized version of the identifier is used as the repository class name. In the example above, the :county repository is represented by the GeosIndex::County class and the :city repository by the GeosIndex::City class. You can also access the repositories using the GeosIndex.repo method:

> GeosIndex.repo(:county) == GeosIndex::County
=> true

> GeosIndex.repo_hash
=> {"county"=>GeosIndex::County, "city"=>GeosIndex::City}

If you don't want to generate the repo constant, you can pass const: false to the repository method:

class GeosIndex < Esse::Index
  repository :county, const: false do
    # ...
  end
end

> GeosIndex.constants.include?(:County)
=> false

Collection

The collection block is responsible for fetching the data from the data source. It receives the context as keyword arguments, plus a block that must be called with the fetched data. The context can be anything you want, but it is important to support an :id filter so that a single document can be fetched.

A collection can be defined through a block or a class that implements the Enumerable interface.

# app/indices/geos_index.rb
class GeosIndex < Esse::Index
  repository :county do
    collection do |**context, &block|
      # ...
    end
  end

  repository :city do
    collection Collections::CityCollection
  end
end

# app/indices/geos_index/collections/city_collection.rb
class GeosIndex::Collections::CityCollection
  include Enumerable

  # @param [Hash] context
  def initialize(**context)
    @context = context
  end

  # @yield [Array<Object>] batch of objects
  def each(&block)
    # ...
  end
end
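
To make the class-based variant more concrete, here is a minimal sketch of a collection that yields batches and honours the :id filter. The CITIES array is a hypothetical in-memory stand-in for a real data source, and the batch size is an arbitrary choice for the illustration.

```ruby
# ASSUMPTION: hypothetical in-memory data standing in for a table, file or API.
CITIES = [
  { id: 1, name: 'Chicago', state_abbr: 'IL' },
  { id: 2, name: 'Springfield', state_abbr: 'IL' },
  { id: 3, name: 'Austin', state_abbr: 'TX' },
].freeze

class CityCollection
  include Enumerable

  BATCH_SIZE = 2 # arbitrary, small enough to show more than one batch

  # @param [Hash] context filters such as :id
  def initialize(**context)
    @context = context
  end

  # @yield [Array<Hash>] batch of cities
  def each
    return enum_for(:each) unless block_given?

    rows = CITIES
    rows = rows.select { |row| row[:id] == @context[:id] } if @context[:id]
    rows.each_slice(BATCH_SIZE) { |batch| yield batch }
  end
end
```

With this sketch, CityCollection.new.to_a returns two batches, and CityCollection.new(id: 3).to_a returns a single batch holding one city.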

Document

The document block is responsible for coercing each item of the collection into an Esse::Document. It always receives an item of the collection and the context as keyword arguments. The context can be anything you want: filters, policies, etc.

A document can be defined through a block or a class that implements the Esse::Document interface.

# app/indices/geos_index.rb
class GeosIndex < Esse::Index
  repository :county do
    document do |item, **context|
      { _id: item.id, name: item.name }
    end
  end

  repository :city do
    document Documents::CityDocument
  end
end

# app/indices/geos_index/documents/city_document.rb
class GeosIndex::Documents::CityDocument < Esse::Document
  # @return [String]
  def id
    object.id
  end

  # @return [Hash]
  def source
    # You can access the context using the `options` method
    { name: object.name }
  end
end

Elasticsearch 5.x or lower requires a type to be defined for each document. You can define the type using the #type method or by rendering _type: 'doc_type' in hash documents.

At the document level, you can also define the #routing method to set the routing of the document.

Please look at the Esse::Document source code to see all the methods you can override.
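
The sketch below shows what a document with #id, #routing and a context-aware #source could look like. To keep it runnable without the gem, SketchDocument is a hypothetical stand-in that only mimics the object and options readers mentioned on this page; in a real index you would inherit from Esse::Document instead. The City struct and the :include_state option are also invented for the example.

```ruby
# ASSUMPTION: minimal stand-in for Esse::Document, exposing only
# `object` and `options`. It is not the real base class.
class SketchDocument
  attr_reader :object, :options

  def initialize(object, **options)
    @object = object
    @options = options
  end
end

# Hypothetical record type for the example
City = Struct.new(:id, :name, :state_abbr)

class CityDocument < SketchDocument
  # @return [Integer]
  def id
    object.id
  end

  # Route documents of the same state together (illustrative choice)
  # @return [String]
  def routing
    object.state_abbr
  end

  # @return [Hash]
  def source
    data = { name: object.name }
    # The context arrives through `options`
    data[:state_abbr] = object.state_abbr if options[:include_state]
    data
  end
end
```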

ActiveRecord Repository

As most Ruby applications are built on Rails, I'm going to show how to create an Esse index that loads data from an ActiveRecord model.

# app/indices/geographies_index.rb
class GeographiesIndex < Esse::Index
  repository :city do
    collection do |**context, &block|
      query = ::City.includes(:state)
      query = query.where(id: context[:id]) if context[:id]
      query = query.where(state_abbr: context[:state_abbr]) if context[:state_abbr]
      query.find_in_batches(&block)
    end

    document do |city, **_context|
      {
        _id: city.id,
        name: city.name,
        state: {
          id: city.state.id,
          name: city.state.name,
        }
      }
    end
  end
end

But thanks to the plugin system, we can use the esse-active_record plugin and simplify the implementation above to a few lines of code:

# app/indices/geographies_index.rb
class GeographiesIndex < Esse::Index
  plugin :active_record

  repository :city do
    collection ::City.includes(:state) do
      scope :state_abbr, ->(abbr) { where(state_abbr: abbr) }
    end
    document Documents::CityDocument
  end
end

Much better, huh? The esse-active_record plugin will automatically create a collection block. You can define multiple scopes to handle the context filters. There is also a pretty nice feature named batch_context that can be useful to preload associations. Please refer to the esse-active_record documentation for more details and more examples.

Index Settings

The index settings can be defined using the settings method, which accepts a block or a Hash as its argument.

class ArticlesIndex < Esse::Index
  settings number_of_shards: 2, number_of_replicas: 1
end

# or
class ArticlesIndex < Esse::Index
  settings do
    # Useful when you need to define dynamic settings
    {
      number_of_shards: 2,
      number_of_replicas: 1,
    }
  end
end

If you want something more complex, you can pass any object as the argument. The object must respond to #to_h and return a Hash with the settings definition.
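
As a minimal sketch, such an object could be a small plain-Ruby class. The ArticleSettings name and its constructor arguments are hypothetical; the only requirement stated here is that it responds to #to_h.

```ruby
# ASSUMPTION: hypothetical settings object, the name and arguments are
# made up for illustration.
class ArticleSettings
  def initialize(shards:, replicas:)
    @shards = shards
    @replicas = replicas
  end

  # @return [Hash] the settings definition
  def to_h
    {
      number_of_shards: @shards,
      number_of_replicas: @replicas,
    }
  end
end

ARTICLE_SETTINGS = ArticleSettings.new(shards: 2, replicas: 1)
```

An instance like ARTICLE_SETTINGS would then be handed to the settings method in place of the Hash.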

Note that the settings can also be defined in Esse.config.cluster. The global settings will be deep merged with the settings defined in the index.

# config/initializers/esse.rb
Esse.configure do |config|
  config.cluster do |cluster|
    cluster.settings = {
      number_of_shards: 2,
      number_of_replicas: 0,
      refresh_interval: '30s',
    }
  end
end

# app/indices/articles_index.rb
class ArticlesIndex < Esse::Index
  settings number_of_replicas: 1
end

ArticlesIndex.settings_hash
# => {:settings=>{:number_of_shards=>2, :number_of_replicas=>1, :refresh_interval=>"30s"}}

Index Mappings

The index mappings can be defined using the mappings method:

class ArticlesIndex < Esse::Index
  mappings do
    {
      properties: {
        title: { type: 'text' },
        body: { type: 'text' },
      }
    }
  end
end

If you want something more complex, you can pass any object as the argument. The object must respond to #to_h and return a Hash with the mappings definition.

Note that the mappings can also be defined in Esse.config.cluster. The global mappings will be deep merged with the mappings defined in the index.

# config/initializers/esse.rb
Esse.configure do |config|
  config.cluster do |cluster|
    cluster.mappings = {
      dynamic_templates: [
        {
          strings_as_keywords: {
            mapping: {
              ignore_above: 1024,
              type: 'keyword',
            },
            match_mapping_type: 'string',
          },
        },
      ],
      properties: {
        created_at: { type: 'date' },
      },
    }
  end
end

# app/indices/articles_index.rb
class ArticlesIndex < Esse::Index
  mappings do
    {
      properties: {
        title: { type: 'text' },
        body: { type: 'text' },
      }
    }
  end
end

ArticlesIndex.mappings_hash
# => {:mappings=>
#   {:dynamic_templates=>[{:strings_as_keywords=>{:mapping=>{:ignore_above=>1024, :type=>"keyword"}, :match_mapping_type=>"string"}}],
#    :properties=>{:created_at=>{:type=>"date"}, :title=>{:type=>"text"}, :body=>{:type=>"text"}}}}
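
To see why a deep merge matters here, the following plain-Ruby sketch reproduces the merge of the two hashes above. The deep_merge helper is a hypothetical illustration, not the gem's own code; a shallow Hash#merge would have replaced the global properties entirely instead of combining them.

```ruby
# ASSUMPTION: illustrative helper, Esse's own merge logic is not shown
# on this page. Nested hashes are merged recursively; any other value
# from `overrides` wins.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

GLOBAL_MAPPINGS = {
  dynamic_templates: [
    {
      strings_as_keywords: {
        mapping: { ignore_above: 1024, type: 'keyword' },
        match_mapping_type: 'string',
      },
    },
  ],
  properties: {
    created_at: { type: 'date' },
  },
}.freeze

INDEX_MAPPINGS = {
  properties: {
    title: { type: 'text' },
    body: { type: 'text' },
  },
}.freeze

MERGED_MAPPINGS = deep_merge(GLOBAL_MAPPINGS, INDEX_MAPPINGS)
```

MERGED_MAPPINGS keeps the global dynamic_templates and ends up with the created_at, title and body properties side by side.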

If you are working with Elasticsearch 5.x or lower, you must define a type in the mappings' properties. Please adjust this to your needs.

Naming and Aliases

The index name is defined by the class name. The ArticlesIndex class will be represented by the articles index. If you want to change the index name, you can use the index_name= method:

class ArticlesIndex < Esse::Index
  self.index_name = 'my_articles'
end

This gem uses a combination of index_prefix + index_name + index_suffix to define the real index name, and the alias is defined by the index name without the suffix. This is useful to implement a zero-downtime deployment strategy. Please refer to the article at https://www.elastic.co/blog/changing-mapping-with-zero-downtime for more details.

  • index_prefix is a prefix that can be defined with Esse.config.cluster.index_prefix=. It's useful to separate the indexes by environment.
  • index_name is the index name defined by the class name. The ArticlesIndex class will be represented by the articles index.
  • index_suffix is automatically generated from the current timestamp in the %Y%m%d%H%M%S format, unless you define a hardcoded index_suffix= in the index class.

Here is an example of how the index name is generated:

> Esse.config.cluster.index_prefix = 'esse'
=> "esse"
> ArticlesIndex.index_name
=> "esse_articles"
> ArticlesIndex.index_name(suffix: 'v2')
=> "esse_articles_v2"
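
The composition above can be approximated in a few lines of plain Ruby. Both helpers are hypothetical and only mirror what the console output suggests (parts joined with an underscore, suffix in the %Y%m%d%H%M%S format); they are not the gem's implementation.

```ruby
# ASSUMPTION: illustrative helpers, the "_" separator is inferred from
# the console output above.
def build_index_name(name:, prefix: nil, suffix: nil)
  [prefix, name, suffix].compact.join('_')
end

# Timestamp-based suffix in the %Y%m%d%H%M%S format
def timestamp_suffix(time = Time.now)
  time.strftime('%Y%m%d%H%M%S')
end
```

For example, build_index_name(prefix: 'esse', name: 'articles') gives the alias name, and adding suffix: timestamp_suffix gives a physical index name that changes on every run.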

The suffix parameter is available in most of the methods that interact with the index, including the CLI commands. This is useful for long-running tasks that can be performed in the background without affecting the current index.

Let's say you have a breaking change in the index mappings and you need to reindex all the documents. You can create a new index with the new mappings and reindex all the documents from the old index to the new one. When the reindex is done, you can switch the alias to the new index and delete the old one. This way, you can perform the reindex without affecting the current index.

# Create initial index
> ArticlesIndex.create_index(alias: true)
=> {"acknowledged"=>true, "shards_acknowledged"=>true, "index"=>"esse_articles_20230926105739"}
> ArticlesIndex.indices_pointing_to_alias
=> ["esse_articles_20230926105739"]

# Let's say you have a breaking change in the index mappings and you need to reindex all the documents.
> suffix = Esse.timestamp
=> "20230926105811"
> ArticlesIndex.create_index(suffix: suffix, alias: false)
=> {"acknowledged"=>true, "shards_acknowledged"=>true, "index"=>"esse_articles_20230926105811"}
> ArticlesIndex.indices_pointing_to_alias
=> ["esse_articles_20230926105739"]
> ArticlesIndex.import(suffix: suffix)
=> 231

# Now you can switch the alias to the new index and delete the old one.
> ArticlesIndex.update_aliases(suffix: suffix)
=> {"acknowledged"=>true}
> ArticlesIndex.indices_pointing_to_alias
=> ["esse_articles_20230926105811"]

# And finally delete the old index
> ArticlesIndex.delete_index(suffix: "20230926105739")
=> {"acknowledged"=>true}
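
The console steps above can be wrapped in a small helper. This is a sketch under assumptions: reindex_with_zero_downtime is a hypothetical name, it only uses the index methods shown in the session, and it assumes every physical index is named as the alias followed by an underscore and the suffix.

```ruby
# ASSUMPTION: hypothetical helper, not part of the gem. `index` is any
# object responding to the methods used in the console session above.
def reindex_with_zero_downtime(index, suffix:)
  # Remember which suffixes are live before touching anything
  alias_prefix = "#{index.index_name}_"
  old_suffixes = index.indices_pointing_to_alias.map do |name|
    name.delete_prefix(alias_prefix)
  end

  # Build and fill the new index while the alias still serves the old one
  index.create_index(suffix: suffix, alias: false)
  imported = index.import(suffix: suffix)

  # Switch the alias, then drop the indices it used to point to
  index.update_aliases(suffix: suffix)
  old_suffixes.each { |old_suffix| index.delete_index(suffix: old_suffix) }

  imported
end
```

It could be invoked as reindex_with_zero_downtime(ArticlesIndex, suffix: Esse.timestamp). Error handling, such as cleaning up the new index when the import fails, is deliberately left out.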

Of course, you can do this from the CLI too, but that is covered in the CLI section.
