ElasticSearch: Understanding Full-Text Search

Shambhavi Shandilya
5 min read · Jun 29, 2024

The Elastic Stack, formerly known as the ELK Stack, is a group of open-source tools that help users search, analyze, and visualize data in real time. Here’s a breakdown of the three main components:

📌 Elasticsearch: Elasticsearch is a powerful search engine built on Apache Lucene. It excels at storing, searching, and analyzing large volumes of data from various sources in near real-time.

📌 Logstash: This data pipeline acts as the input engine for Elasticsearch. It ingests data from diverse sources like application logs, system metrics, and website activity. Logstash can also parse and transform this data before feeding it into Elasticsearch for storage and analysis.

📌 Kibana: Kibana is the visualization layer. It interacts with Elasticsearch to create dashboards, charts, and graphs, allowing users to explore and understand their data in a user-friendly way.

The versatility of Elasticsearch extends across various industries and applications such as log analysis, website search, real-time monitoring, e-commerce product search, etc.

Full-text search is one of the most powerful features of Elasticsearch. Unlike exact-match lookups, full-text search analyzes both the documents and the query, so it can match variations of the query terms (different casing, different word forms and, when a synonym filter is configured, synonyms) and rank the results by relevance. This broadens the set of documents a query can find and makes search more effective.

In this blog, I will cover the basics of how Elasticsearch implements its full-text search capabilities.

🛠️ Architecture

Before diving into the details of how the search functionality is implemented, here is the architecture of how data is stored in an ElasticSearch cluster.

A high-level architecture of an ElasticSearch Cluster

Some points to note about the architecture are:

  1. A document is the basic unit that can be indexed by ElasticSearch. It is immutable in nature. Any update to the document is done by deletion and re-insertion. Documents are stored in JSON format and contain fields representing the data points. These documents are what you search and analyze within your Elasticsearch cluster.
  2. Routing logic dictates how documents are distributed across the shards of an index. By default, Elasticsearch picks the shard from a hash of the document’s ID, but custom routing can be used to keep related documents on the same shard.
  3. Each shard is a Lucene index, which is in turn composed of smaller, immutable units called segments.
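
The default routing rule from point 2 can be sketched roughly as follows. This is a simplified illustration rather than the actual implementation: Elasticsearch uses a Murmur3 hash and a slightly more involved formula, while this sketch substitutes `zlib.crc32` from the Python standard library. The function name `pick_shard`, the shard count, and the order and customer IDs are all made up for the example.

```python
import zlib

def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Map a routing value (the document ID by default) to a shard number."""
    # Stand-in hash: Elasticsearch itself uses Murmur3, not CRC32.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards

NUM_SHARDS = 3  # assumed shard count for this sketch

# Default routing: the document ID decides the shard.
for doc_id in ["1", "2", "42"]:
    print(doc_id, "->", pick_shard(doc_id, NUM_SHARDS))

# Custom routing: documents that share a routing value (here a hypothetical
# customer ID) always land on the same shard.
orders = [("order-1", "customer-7"), ("order-2", "customer-7")]
shards = {pick_shard(routing, NUM_SHARDS) for _, routing in orders}
print(shards)  # a set containing a single shard number
```

Because the shard number depends on the number of primary shards, changing that number would send existing documents to different shards, which is one reason the primary shard count is chosen when the index is created.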

🔎 Full-Text Search capabilities of ElasticSearch

Elasticsearch goes beyond simply storing text. It supports fuzzy search, which accounts for typos and spelling variations, wildcard search, which matches terms with unknown characters, and more.
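
As a rough illustration, the request bodies for a fuzzy and a wildcard query can be written as Python dictionaries. The field name `title` and the search values are assumptions made for this sketch. The bodies are only built and printed here, not sent to a cluster.

```python
import json

# Fuzzy query: matches terms within a small edit distance of the value,
# so the misspelling below can still match "elasticsearch".
fuzzy_query = {
    "query": {
        "fuzzy": {
            "title": {"value": "elasticsaerch", "fuzziness": "AUTO"}
        }
    }
}

# Wildcard query: "*" matches any run of characters, "?" a single character.
wildcard_query = {
    "query": {
        "wildcard": {
            "title": {"value": "elastic*"}
        }
    }
}

print(json.dumps(fuzzy_query, indent=2))
print(json.dumps(wildcard_query, indent=2))
```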

1️⃣ Data Preparation

Elasticsearch documents are prepared as JSON objects. Each document represents a piece of information you want to search within (e.g., an article, a product description, a customer review), much like a row in a SQL table. Apart from the submitted data, Elasticsearch also attaches metadata fields such as _index, _id, and _version (older releases also had _type).
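
To make the split between submitted data and metadata concrete, here is a small sketch of what a stored document roughly looks like when fetched by ID. The index name `reviews`, the ID, and the review text are invented for the example, and a real response carries a few more fields than shown.

```python
import json

# The data you submit: a plain JSON-serializable object (hypothetical review).
source = {
    "product": "Wireless mouse",
    "review": "Comfortable grip and long battery life",
}

# Rough shape of the stored document: metadata fields start with an
# underscore, and the submitted data sits under "_source".
stored = {
    "_index": "reviews",  # assumed index name
    "_id": "1",           # assumed document ID
    "_version": 1,        # incremented each time the document is replaced
    "_source": source,
}

print(json.dumps(stored, indent=2))
```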

2️⃣ Analysis Pipeline

After a document is submitted, indexing begins. The first step is analysis: the text passes through a customisable analyzer, which can carry out tasks such as:

🟢 Tokenization: Breaking down text into individual terms (words).
🟢 Normalization: Lowercase letters, remove punctuation, etc.
🟢 Stemming/Lemmatization: Reducing words to their root form (e.g., “running” becomes “run”). This lets a query match different forms of the same word.
🟢 Stop Word Removal: Removing common words like “the,” “a,” and “an,” which don’t contribute much to search meaning.
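
The four steps above can be sketched as a toy pipeline in plain Python. This is a minimal sketch, not how Elasticsearch implements analysis: the stemmer is a hand-written suffix stripper rather than a real algorithm, the stop-word list contains only the three words mentioned above, and all function names are made up for the example.

```python
import re

STOP_WORDS = {"the", "a", "an"}  # only the three examples from the list above

def tokenize(text: str) -> list[str]:
    # Split on anything that is not a letter or digit; this also drops punctuation.
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]

def normalize(tokens: list[str]) -> list[str]:
    return [token.lower() for token in tokens]

def stem(token: str) -> str:
    # Toy suffix stripper, NOT a real stemming algorithm.
    if token.endswith("ing") and len(token) > 5:
        base = token[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(base) > 2 and base[-1] == base[-2]:
            base = base[:-1]
        return base
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [token for token in tokens if token not in STOP_WORDS]

def analyze(text: str) -> list[str]:
    tokens = normalize(tokenize(text))
    tokens = [stem(token) for token in tokens]
    return remove_stop_words(tokens)

print(analyze("The Engineer is running an ElasticSearch search!"))
# ['engineer', 'is', 'run', 'elasticsearch', 'search']
```

The same analysis is applied to the query text at search time, which is why a query for “Running” can match a document that contains “run”.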

The standard analyzer is the default analyzer. It tokenizes text on word boundaries and drops most punctuation. Its pipeline consists of no character filters, the standard tokenizer, and two token filters: lowercase and stop words, with the stop-word filter disabled by default.

3️⃣ Inverted Index

The inverted index is the core component behind the efficiency of full-text search in Elasticsearch. But before getting to the details of the inverted index, let’s understand how it differs from a regular forward index.

Forward indexing, also known as document indexing, is a straightforward method where each document is indexed individually. In this approach, each document is stored in its entirety along with its metadata (like document ID, timestamps, etc.) in the index. When you perform a search, the search engine scans through these documents to find matches based on the query.

Inverted indexing is a more sophisticated method used by many modern search engines, including Elasticsearch. It revolves around creating an index that maps terms (or tokens) to the documents in which they occur and to their positions within those documents. This index is called an inverted index because it “inverts” the structure of forward indexing.

For example, let’s consider two documents and see the difference between the indexing patterns.

{
  "id": 1,
  "title": "Introduction to ElasticSearch",
  "content": "ElasticSearch is a distributed search engine…"
}
{
  "id": 2,
  "title": "Getting Started with ElasticSearch",
  "content": "To begin using ElasticSearch, you need to…"
}

A forward index might look like:

Index:
{
  Document ID: 1,
  Title: "Introduction to ElasticSearch",
  Content: "ElasticSearch is a distributed search engine…"
},
{
  Document ID: 2,
  Title: "Getting Started with ElasticSearch",
  Content: "To begin using ElasticSearch, you need to…"
}

But an inverted index will look like this (shown for the title field, with token positions counted from 0):

Term: "Introduction"
{
  Document ID: 1,
  Positions: [0]
}

Term: "ElasticSearch"
{
  Document ID: 1,
  Positions: [2]
},
{
  Document ID: 2,
  Positions: [3]
}

Term: "Getting"
{
  Document ID: 2,
  Positions: [0]
}
...

In ElasticSearch and similar systems, inverted indexing is preferred due to its efficiency and scalability in handling large volumes of textual data and complex search queries. It allows for fast retrieval of relevant documents based on user queries, making it suitable for applications requiring real-time search capabilities.
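
The difference between the two lookups can be sketched in a few lines of Python. This is a simplified model that indexes only the two titles from the example, lowercases them, and splits on whitespace. The function names are made up, and a real inverted index in Lucene is stored very differently.

```python
from collections import defaultdict

docs = {
    1: "Introduction to ElasticSearch",
    2: "Getting Started with ElasticSearch",
}

def build_inverted_index(documents: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map each lowercased term to {doc_id: [0-based positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

def forward_scan(documents: dict[int, str], term: str) -> list[int]:
    """Forward-index style lookup: every document has to be inspected."""
    return [doc_id for doc_id, text in documents.items()
            if term in text.lower().split()]

index = build_inverted_index(docs)

# Inverted lookup: one dictionary access, no document scan.
print(index.get("elasticsearch", {}))       # {1: [2], 2: [3]}
print(index.get("getting", {}))             # {2: [0]}

# The forward scan finds the same documents but touches all of them.
print(forward_scan(docs, "elasticsearch"))  # [1, 2]
```

With two documents the cost is the same either way. The gap appears as the collection grows: the scan's cost grows with the number of documents, while the inverted lookup depends mainly on how many documents contain the term.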

🗒️ Conclusion

Elasticsearch focuses on near real-time search, indexing, and analytics of structured and unstructured data, with a strong emphasis on text search. While text analyzers and inverted indexes improve search performance, another key factor in query speed is the parallel processing made possible by the distributed nature of Elasticsearch.

Elasticsearch stores data in shards, which are partitions of an index spread across the nodes of the cluster. This distribution allows a search to run in parallel: each relevant shard executes the query locally on its own data, so the work is spread across nodes instead of being handled by a single process. The coordinating node then collects the results from all relevant shards, merges them, and ranks them by relevance score or by any user-defined sorting criteria.
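
This scatter-and-gather pattern can be sketched as follows. It is a toy model under stated assumptions: the shards are in-memory dictionaries with invented document IDs and scores, Python threads stand in for separate nodes, and `search_shard` and `coordinate` are hypothetical names rather than Elasticsearch APIs.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-shard data: doc_id -> relevance score for some query.
shards = [
    {"doc-1": 2.4, "doc-4": 0.7},
    {"doc-2": 1.9, "doc-5": 3.1},
    {"doc-3": 0.2},
]

def search_shard(shard: dict[str, float], size: int) -> list[tuple[float, str]]:
    """Each shard returns only its own top `size` hits as (score, doc_id)."""
    hits = [(score, doc_id) for doc_id, score in shard.items()]
    return heapq.nlargest(size, hits)

def coordinate(all_shards: list[dict[str, float]], size: int) -> list[tuple[float, str]]:
    """Fan the query out to every shard, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(all_shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, size), all_shards))
    merged = [hit for partial in partials for hit in partial]
    # Global ranking: keep the best `size` hits across all shards.
    return heapq.nlargest(size, merged)

print(coordinate(shards, size=3))
# [(3.1, 'doc-5'), (2.4, 'doc-1'), (1.9, 'doc-2')]
```

Each shard only has to return its own top hits, so the coordinating step merges a handful of short lists instead of the full data set.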

By empowering users to find information quickly and efficiently, Elasticsearch serves as a valuable tool for various applications, from log analysis and website search to enterprise content management and data visualization.
