Search Engine with Node.js and Elasticsearch

Elasticsearch is an open source search engine, which is gaining popularity due to its high performance and distributed architecture. In this article, I will discuss its key features and walk you through the process of using it to create a Node.js search engine.

Introduction to Elasticsearch

Elasticsearch is built on top of Apache Lucene, a high-performance text search engine library. Although Elasticsearch can store and retrieve data, its main purpose is not to serve as a database. Rather, it is a search engine (server) whose main goal is indexing, searching, and providing real-time statistics on the data.

Elasticsearch has a distributed architecture that allows horizontal scaling by adding more nodes and taking advantage of the extra hardware. It supports thousands of nodes for processing petabytes of data. Horizontal scaling also gives it high availability, because the data is rebalanced across the remaining nodes if any node fails.

Once data is imported, it becomes available for searching almost immediately. Elasticsearch is schema-free, stores data in JSON documents, and can automatically detect the data structure and type.

Elasticsearch is also completely API driven. This means that almost any operation can be done via a simple RESTful API using JSON data over HTTP. It has client libraries for almost any programming language, including Node.js. In this tutorial we will use the official client library.

Elasticsearch is very flexible when it comes to hardware and software requirements. Although the recommended production setting is 64GB memory and as many CPU cores as possible, you can still run it on a resource-constrained system and get decent performance (assuming your data set is not huge). For following the examples in this article, a system with 2GB memory and a single CPU core will suffice.

You can run Elasticsearch on all major operating systems (Linux, Mac OS, and Windows). To do so, you need the latest version of the Java Runtime Environment installed (see the Installing Elasticsearch section). To follow the examples in this article, you’ll also need to have Node.js installed (any version after v0.11.0 will do), as well as npm.

Elasticsearch terminology

Elasticsearch uses its own terminology, which in some cases is different from typical database systems. Below is a list of common terms in Elasticsearch and their meaning.

Index: This term has two meanings in the Elasticsearch context. The first is the operation of adding data. When data is added, the text is broken down into tokens (e.g. words) and every token is indexed. However, an index also refers to where all the indexed data is stored. Basically, when you import data, it is indexed into an index. Every time you want to perform any operation on data, you need to specify its index name.

Type: Elasticsearch provides a more detailed categorization of documents within an index, which is called type. Every document in an index should also have a type. For example, we can define a library index, then index multiple types of data such as article, book, report, and presentation into it. Since indices have almost fixed overhead, it is recommended to have fewer indices and more types, rather than more indices and fewer types.

Search: This term means what you might think. You can search data in different indices and types. Elasticsearch provides many types of search queries such as term, phrase, range, fuzzy, and even queries for geo data.

Filter: Elasticsearch allows you to filter search results based on different criteria, to further narrow down the results. If you add new search queries to a set of documents, it might change the order based on relevancy, but if you add the same query as a filter, the order remains unchanged.

Aggregations: These provide you with different types of statistics on aggregated data, such as minimum, maximum, average, summation, histograms, and so on.

Suggestions: Elasticsearch provides different types of suggestions for input text. These suggestions could be term or phrase based, or even completion suggestions.

Installing Elasticsearch

Elasticsearch is available under the Apache 2 license; it can be downloaded, used, and modified free of charge. Before installing it, you need to make sure you have the Java Runtime Environment (JRE) installed on your computer. Elasticsearch is written in Java and relies on Java libraries to run. To check whether you have Java installed on your system, you can run java -version in the command line.

Using the latest stable version of Java is recommended (1.8 at the time of writing this article). You can find a guide for installing Java on your system here.

Next, to download the latest version of Elasticsearch (2.4.0 at the time of writing this article), go to the download page and download the ZIP file. Elasticsearch requires no installation and the single zip file contains the complete set of files to run the program on all of the supported operating systems. Unzip the downloaded file and you are done! There are several other ways to get Elasticsearch running, such as getting the TAR file or packages for different Linux distributions (look here).

If you are running Mac OS X and you have Homebrew installed, you can install Elasticsearch using brew install elasticsearch. Homebrew automatically adds the executables to your path and installs the required services. It also helps you update the application with a single command: brew upgrade elasticsearch.

To run Elasticsearch on Windows, from the unzipped directory, run bin\elasticsearch.bat from the command line. For every other OS, run ./bin/elasticsearch from the terminal. At this point it should be running on your system.

As I mentioned earlier, almost all operations you can do with Elasticsearch can be done via RESTful APIs. Elasticsearch uses port 9200 by default. To make sure you are running it correctly, head to http://localhost:9200/ in your browser, and it should display some basic information about your running instance.

For further reading about installation and troubleshooting, you can visit the documentation.

Graphical User Interface

Elasticsearch provides almost all its functionality through REST APIs and does not ship with a graphical user interface (GUI). While I cover how you can perform all the necessary operations through APIs and Node.js, there are several GUI tools that provide visual information about indices and data, and even some high level analytics.

Kibana, which is developed by the same company, provides a real-time summary of the data, plus several customized visualization and analytics options. Kibana is free and has detailed documentation.

There are other tools developed by the community, including elasticsearch-head, Elasticsearch GUI, and even a Chrome extension called ElasticSearch Toolbox. These tools help you explore your indices and data in the browser, and even try out different search and aggregation queries. All these tools provide a walkthrough for installation and use.

Setting Up a Node.js Environment

Elasticsearch provides an official module for Node.js, called elasticsearch. First, you need to add the module to your project folder and save the dependency for future use by running npm install elasticsearch --save.

Then, you can import the module in your script with require('elasticsearch').

Finally, you need to set up the client that handles the communication with Elasticsearch. In this case, I assume you are running Elasticsearch on your local machine with an IP address of 127.0.0.1 and the port 9200 (default setting).
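A minimal sketch of that setup, assuming the elasticsearch module has been installed with npm. The createClient helper is a name introduced here for illustration; it receives the module object as an argument so the snippet can be loaded on its own.

```javascript
// Connection settings for a single local node (default address and port).
const clientOptions = {
  host: '127.0.0.1:9200',
  log: 'error' // only log errors
};

// `elasticsearch` is the module object returned by require('elasticsearch').
// createClient is a hypothetical helper, not part of the library.
function createClient(elasticsearch) {
  return new elasticsearch.Client(clientOptions);
}

// Typical use once the module is installed:
//   const elasticsearch = require('elasticsearch');
//   const esClient = createClient(elasticsearch);
```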

The log option ensures that all the errors are logged. In the rest of this article, I will use the same esClient object to communicate with Elasticsearch. The complete documentation for the node module is provided here.

Note: all of the source code for this tutorial is provided on GitHub. The easiest way to follow along is to clone the repo to your PC and run the examples from there.

Importing the Data

Throughout this tutorial, I will use an academic articles dataset with randomly generated content. The data is provided in JSON format, and there are 1000 articles in the dataset. To show what the data looks like, one item from the dataset is shown below.

The field names are self-explanatory. The only point to note is that the body field is not displayed here, since it contains a complete, randomly generated article (with between 100 and 200 paragraphs). You can find the complete data set here.

While Elasticsearch provides methods for indexing, updating, and deleting single data points, we’re going to make use of Elasticsearch’s bulk method to import the data, which is used to perform operations on large data sets in a more efficient manner:
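The following is a minimal sketch of such an import, not necessarily the exact listing from the repository. It assumes each article carries an id field, and it takes the client as its first argument so that the snippet stands alone; buildBulkBody is a helper name introduced here.

```javascript
// Build the body for the bulk API: two entries per document,
// an action entry followed by the document itself.
function buildBulkBody(index, type, data) {
  const bulkBody = [];
  data.forEach(item => {
    bulkBody.push({
      // Assumption: the document ID comes from the item's own id field.
      index: { _index: index, _type: type, _id: item.id }
    });
    bulkBody.push(item);
  });
  return bulkBody;
}

// Send the documents with the client's bulk method and report failures.
function bulkIndex(esClient, index, type, data) {
  return esClient.bulk({ body: buildBulkBody(index, type, data) })
    .then(response => {
      const failed = response.items.filter(item => item.index && item.index.error);
      console.log(`Indexed ${data.length - failed.length} of ${data.length} items`);
      return response;
    });
}
```

It would be called as bulkIndex(esClient, 'library', 'article', articles), where articles is the parsed JSON array.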

Here, we are calling the bulkIndex function, passing it library as the index name, article as the type, and the JSON data we wish to have indexed. The bulkIndex function in turn calls the bulk method on the esClient object. This method takes an object with a body property as an argument. The value supplied to the body property is an array with two entries for each operation. In the first entry, the type of the operation is specified as a JSON object. Within this object, the index property determines the operation to be performed (indexing a document in this case), as well as the index name, type name, and the document ID. The next entry corresponds to the document itself.

Note that in future, you might add other types of documents (such as books or reports) to the same index in this way. We could also assign a unique ID to each document, but this is optional — if you do not provide one, Elasticsearch will assign a unique randomly generated ID to each document for you.

Assuming you have cloned the repository, you can now import the data into Elasticsearch by executing the following command from the project root:

Checking the data was indexed correctly

One of the great features of Elasticsearch is near real-time search. This means that once documents are indexed, they become available for search within one second (see here). Once the data is indexed, you can check the index information by running indices.js (link to source):
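A short sketch of what such a script could contain, assuming an esClient configured as described earlier. The listIndices name is introduced here for illustration.

```javascript
// Ask the cat API for a table of all indices; v: true adds the header row.
function listIndices(esClient) {
  return esClient.cat.indices({ v: true })
    .then(table => {
      console.log(table);
      return table;
    });
}

// Usage: listIndices(esClient).catch(err => console.error(err));
```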

Methods in the client’s cat object provide different information about the current running instance. The indices method lists all the indices, their health status, number of their documents, and their size on disk. The v option adds a header to the response from the cat methods.

When you run the above snippet, you will notice it reports a health status (red, yellow, or green) for each index. Red means at least one primary shard is unassigned, so some of the data is unavailable. Yellow means all primary shards are assigned but one or more replica shards are not, and green means every shard is assigned. Most likely (depending on your settings) you will get a yellow status when running on your local machine. This is because, in the version used here, each index is created by default with five primary shards and one replica of each, and a replica is never placed on the same node as its primary. With only one instance running on your local machine, the replicas cannot be assigned. While you should always aim for green status in a production environment, for the purpose of this tutorial you can continue to use Elasticsearch in yellow status.

Dynamic and custom mapping

As I mentioned earlier, Elasticsearch is schema-free. This means that you do not have to define the structure of your data (similar to defining a table in a SQL database) before you import it. Instead, Elasticsearch automatically detects it for you. But despite being called schema-free, there are some limitations on the data structure.

Elasticsearch refers to the structure of the data as mapping. If no mapping exists, when the data is indexed, Elasticsearch looks at each field of the JSON data, and automatically defines the mapping based on its type. If a mapping entry already exists for that field, it ensures the new data being added follows the same format. Otherwise, it will throw an error.

For instance, if {"key1": 12} is already indexed, Elasticsearch automatically maps field key1 as long. Now, if you try to index {"key1": "value1", "key2": "value2"}, it throws an error stating that it expects field key1 to be of type long. At the same time, the object {"key1": 13, "key2": "value2"} would be indexed without any issue, with key2 of type string added to the mapping.

Mappings are beyond the scope of this article, and for the most part, the automatic mapping works fine. I would recommend looking at the Elasticsearch documentation, which provides an in-depth discussion of mappings.

Building the Search Engine

Once the data has been indexed, we are ready to implement the search engine. Elasticsearch provides an intuitive full search query structure called Query DSL—which is based on JSON—to define queries. There are many types of search queries available, but in this article we’re going to look at several of the more common ones. Complete documentation of Query DSL can be found here.

Please remember that I provide a link to the code behind every example shown. After setting up your environment and indexing the test data, you can clone the repo and run any of the examples on your machine. To do this, just run node filename.js from the command line.

Return all documents in one or more indices

To perform our search, we will use the various search methods provided by the client. The simplest query is match_all, which returns all the documents in one or multiple indices. The example below shows how we can get all the stored documents in an index (link to source).
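A minimal sketch of such a request, assuming the library index created earlier. The search wrapper is a helper introduced here so the body can be shown separately from the call.

```javascript
// Thin wrapper: run a search body against one index
// (or a comma-separated list of indices).
function search(esClient, index, body) {
  return esClient.search({ index: index, body: body });
}

// Return every document; size and from are optional paging controls.
const matchAllBody = {
  size: 20,
  from: 0,
  query: {
    match_all: {}
  }
};

// Usage:
//   search(esClient, 'library', matchAllBody)
//     .then(results => console.log(results.hits.total));
```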

The main search query is included within the query object. As we will see later, we can add different types of search queries to this object. For each query, we add a key with the query type (match_all in this example), with the value being an object containing the search options. There are no options in this example as we want to return all of the documents in the index.

In addition to the query object, the search body can contain other optional properties, including size and from. The size property determines the number of documents to be included in the response. If this value is not present, by default ten documents are returned. The from property determines the starting index of the returned documents. This is useful for pagination.
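To make the pagination idea concrete, here is a small sketch that converts a page number into the size and from values. The paginate helper and its 1-based page numbering are assumptions made for this example.

```javascript
// Turn a 1-based page number into the size/from pair used by the search body.
function paginate(query, page, pageSize) {
  return {
    size: pageSize,
    from: (page - 1) * pageSize,
    query: query
  };
}

// Third page of ten results: skips the first 20 documents.
const thirdPage = paginate({ match_all: {} }, 3, 10);
```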

Understanding the search API response

If you were to log out the response from the search API (results in the above example), it might initially look overwhelming as it includes a lot of information.

At the highest level, the response includes a took property for the number of milliseconds it took to find the results, timed_out, which is true if the search exceeded the maximum allowed time, _shards for information about how many shards were searched and how many of them succeeded or failed, and hits, which includes the search results.

Within the hits property, we have an object with the following properties:

  • total — indicates the total number of matched items
  • max_score — the maximum score of the found items
  • hits — an array that includes the found items. Within each document in the hits array, we have the index, type, document ID, score, and the document itself (within the _source element).

It’s pretty complicated, but the good news is once you implement a method to extract the results, regardless of your search query, you will always get the results in the same format.
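One way to write such an extraction method is sketched below. The extractHits name and the shape of the returned objects are choices made for this example; the fields read from each hit (_id, _score, _source) are the ones described above.

```javascript
// Flatten a search response into a simple array of results.
function extractHits(response) {
  return response.hits.hits.map(hit => ({
    id: hit._id,
    score: hit._score,
    source: hit._source
  }));
}

// Usage:
//   esClient.search({ index: 'library', body: { query: { match_all: {} } } })
//     .then(response => extractHits(response))
//     .then(items => items.forEach(item => console.log(item.id, item.source.title)));
```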

Also note that one of the advantages of Elasticsearch is that it automatically assigns a score to each matched document. This score is used to quantify the document’s relevancy, and results are returned ordered by decreasing score, by default. In a case where we retrieve all documents with match_all, the score is meaningless and all scores are calculated as 1.0.

Match documents that contain specific values in a field

Now, let’s look at some more interesting examples. To match documents that contain specific values in a field, we can use the match query. A simple search body with a match query is shown below (link to source).
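A sketch of such a body, with a placeholder query string. It would be passed as the body of esClient.search together with the index name.

```javascript
// Match documents whose title contains any of the words in the query.
const matchBody = {
  size: 20,
  from: 0,
  query: {
    match: {
      title: {
        query: 'search terms go here'
      }
    }
  }
};
```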

As I mentioned earlier, we first add an entry to a query object with the search type, which is match in the above example. Inside the search type object, we identify the document field to be searched, which is title here. Inside that, we put search-related data, including the query property. I hope that after testing the above example, you are starting to be impressed by the speed of search.

The above search query returns documents whose title field matches any words in the query property. We can set a minimum number of matched terms as follows.
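A sketch of the same match body with the minimum_should_match option added; the query string is a placeholder.

```javascript
// Require at least three of the query terms to appear in the title.
const minimumMatchBody = {
  size: 20,
  from: 0,
  query: {
    match: {
      title: {
        query: 'search terms go here',
        minimum_should_match: 3
      }
    }
  }
};
```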

This query matches documents that have at least three of the specified words in their title. If there are fewer than three words in the query, all must be present in the title for the document to be matched. Another useful feature to add to search queries is fuzziness. This is useful if the user makes a typo in writing the query, as fuzzy matching will find closely spelled terms. For strings, the fuzziness value is based on the maximum permitted Levenshtein distance for each term. Below is an example with fuzziness.
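A sketch of a match body with fuzziness, assuming a maximum edit distance of 2. The misspelled word in the query string is deliberate.

```javascript
// Allow each term to differ from an indexed term by up to two edits.
const fuzzyBody = {
  size: 20,
  from: 0,
  query: {
    match: {
      title: {
        query: 'search tems go here', // "tems" is an intentional typo
        minimum_should_match: 3,
        fuzziness: 2
      }
    }
  }
};
```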

Search within multiple fields

If you want to search within multiple fields, the multi_match search type can be used. It is similar to match, except instead of having the field as a key in the search query object, we add a fields key, which is an array of fields to be searched. Here, we search within the title, authors.firstname, and authors.lastname fields. (link to source)
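A sketch of a multi_match body over those three fields, with a placeholder query string.

```javascript
// Search the same text in the title and in both author name fields.
const multiMatchBody = {
  size: 20,
  from: 0,
  query: {
    multi_match: {
      query: 'search terms go here',
      fields: ['title', 'authors.firstname', 'authors.lastname']
    }
  }
};
```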

The multi_match query supports other search properties such as minimum_should_match and fuzziness. Elasticsearch supports wildcards (e.g., *) for matching multiple fields, so we can shorten the above example to ['title', 'authors.*name'].

Matching a complete phrase

Elasticsearch can also match a phrase exactly as entered, without matching at term level. This query is an extension to the regular match query, called match_phrase. Below is an example of a match_phrase. (link to source)
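A sketch of a match_phrase body on the title field, with a placeholder phrase.

```javascript
// Match only documents whose title contains the words as one phrase, in order.
const matchPhraseBody = {
  size: 20,
  from: 0,
  query: {
    match_phrase: {
      title: {
        query: 'search phrase goes here'
      }
    }
  }
};
```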

Combining multiple queries

So far, in the examples we have only used a single query per request. Elasticsearch, however, allows you to combine multiple queries. The most common compound query is bool. The bool query accepts four types of keys: must, should, must_not, and filter. As their names imply, documents in the results must match queries within must, must not match queries within must_not, and will get a higher score if they match queries within should. Queries within filter must also match, but they do not contribute to the score. Each one of the mentioned elements can receive multiple search queries in the form of an array of queries.

Below, we use the bool query along with a new query type called query_string. This allows you to write more advanced queries using keywords such as AND and OR. The complete documentation for the query_string syntax can be found here. In addition, we use the range query (documentation here), which allows us to restrict a field to a given range. (link to source)
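A sketch of such a compound query. The terms term1, term2, and term3 and the phrase are placeholders, and the should clause uses match_phrase, the phrase form of the match query.

```javascript
const boolBody = {
  size: 20,
  from: 0,
  query: {
    bool: {
      // Required: first name has term1 or last name has term2, and title has term3.
      must: [
        {
          query_string: {
            query: '(authors.firstname:term1 OR authors.lastname:term2) AND (title:term3)'
          }
        }
      ],
      // Optional: a phrase match in the body raises the score.
      should: [
        { match_phrase: { body: 'search phrase goes here' } }
      ],
      // Excluded: anything published from 2011 through 2013.
      must_not: [
        { range: { year: { gte: 2011, lte: 2013 } } }
      ]
    }
  }
};
```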

In the above example, the query returns documents where the author’s first name contains term1 or their last name contains term2, and their title has term3, and they were not published in years 2011, 2012, or 2013. Also, documents that have the given phrase in their body are awarded higher scores and are shown at the top of the results (since the match query is in the should clause).

Filters, Aggregations, and Suggestions

In addition to its advanced search capabilities, Elasticsearch provides other functionalities. Here, we look at three of the more common features.

Filters

Often you might want to refine your search results based on specific criteria. Elasticsearch provides this functionality through filters. In our articles data, imagine your search returned several articles, of which you want to select only the articles that were published in five specific years. You can simply filter out everything that does not match your criteria from the search results, without changing the search order.

The difference between a filter and the same query in the must clause of the bool query is that a filter does not affect the search scores, while must queries do. When search results are returned and the user filters on some specific criteria, they do not want the original order of the results to change. Instead, they only want irrelevant documents removed from the results. Filters follow the same format as the search, but more often, they are defined on fields with definitive values, rather than strings of text. Elasticsearch recommends adding filters through the filter clause of the bool compound search query.

Staying with the example above, imagine that we want to limit the results of our search to articles published between 2011 and 2015. To do this, we only need to add a range query to the filter section of the original search query. This will remove any unmatched documents from the results. Below is an example of a filtered query. (link to source)
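One way to sketch this is a small helper that adds a year range to the filter clause of an existing bool search body. The addYearFilter name is hypothetical, and the helper assumes the body it receives already has a query.bool object.

```javascript
// Return a copy of a bool search body with a year range added to its filter clause.
// The original body is left untouched.
function addYearFilter(searchBody, fromYear, toYear) {
  const bool = searchBody.query.bool;
  // [].concat accepts an existing filter given as an array, a single object, or nothing.
  const filter = [].concat(bool.filter || [], {
    range: { year: { gte: fromYear, lte: toYear } }
  });
  return Object.assign({}, searchBody, {
    query: { bool: Object.assign({}, bool, { filter: filter }) }
  });
}

// Usage with a placeholder query:
const filteredBody = addYearFilter(
  { size: 20, from: 0, query: { bool: { must: [{ match: { title: 'search terms go here' } }] } } },
  2011,
  2015
);
```

Because the range sits in the filter clause, it narrows the results without changing the scores produced by the must clause.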

Aggregations

The aggregations framework provides various aggregated data and statistics based on a search query. The two main types of aggregation are metric and bucketing, where metric aggregations keep track of and compute metrics over a set of documents and bucketing aggregations build buckets, with each bucket being associated with a key and a document criterion. Examples of metric aggregations are average, minimum, maximum, summation, and value count. Examples of bucketing aggregations are range, date range, histogram, and terms. An in-depth explanation of the aggregators can be found here.

Aggregations are placed within an aggregations object, which itself is placed directly in the search object body. Within the aggregations object, each key is a name assigned to an aggregator by the user. The aggregator type and options should be placed as the value for that key. Below, we look at two different aggregators, one metric and one bucket. As a metric aggregator we try to find the minimum year value in our dataset (oldest article), and for bucket aggregator we try to find how many times each keyword has appeared. (link to source)
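A sketch of a search body with these two aggregators. Running them over a match_all query is an assumption made here to keep the example short; any query could take its place.

```javascript
const aggregationBody = {
  size: 20,
  from: 0,
  query: {
    match_all: {}
  },
  aggregations: {
    // Metric aggregator: the smallest value of the year field.
    min_year: {
      min: { field: 'year' }
    },
    // Bucket aggregator: one bucket per distinct keyword, with its document count.
    keywords: {
      terms: { field: 'keywords' }
    }
  }
};
```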

In the above example, we named the metric aggregator as min_year (this name can be anything), which is of type min over field year. The bucket aggregator is named keywords, which is of type terms over field keywords. The results for aggregations are enclosed within the aggregations element in the response, and at a deeper level, they contain each defined aggregator (min_year and keywords here) along with its results. Below is a partial response from this example.
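To illustrate how those results can be read, here is a sketch that pulls both aggregations out of a response. The summarizeAggregations name and the shape of its return value are choices made for this example.

```javascript
// Pull the two aggregation results out of a search response.
// Metric aggregations such as min expose a single value;
// terms aggregations expose buckets, each with a key and a doc_count.
function summarizeAggregations(response) {
  const aggs = response.aggregations;
  return {
    minYear: aggs.min_year.value,
    keywordCounts: aggs.keywords.buckets.map(bucket => ({
      keyword: bucket.key,
      count: bucket.doc_count
    }))
  };
}
```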

By default there are at most 10 buckets returned in the response. You can add a size key next to the field in the request to determine the maximum number of buckets returned. In the version used here, setting this value to 0 returns all the buckets.

Suggestions

Elasticsearch has multiple types of suggesters that provide replacement or completion suggestions for the entered terms (documentation here). We will look at term and phrase suggesters here. The term suggester provides suggestions (if any) for each term in the entered text, while the phrase suggester looks at the entered text as a whole phrase (as opposed to breaking it down into terms) and provides other phrase suggestions (if any). To use the suggestions API, we need to call the suggest method on the Node.js client. Below is an example of a term suggester. (link to source)
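A sketch of a term suggester request against the library index, assuming the client version used in this article, which exposes a suggest method. The suggestTitleTerms wrapper is a name introduced here.

```javascript
// Ask for up to five term suggestions per token, based on the title field.
function suggestTitleTerms(esClient, text) {
  return esClient.suggest({
    index: 'library',
    body: {
      text: text,
      titleSuggester: {
        term: {
          field: 'title',
          size: 5
        }
      }
    }
  });
}

// Usage: suggestTitleTerms(esClient, 'text goes here').then(console.log);
```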

In the request body, consistent with all the other client methods, we have an index field determining the index for the search. In the body property we add the text for which we are seeking suggestions, and (as with aggregation objects) we give each suggester a name (titleSuggester in this case). Its value determines the type and options for the suggester. In this case, we are using a term suggester for the title field, and limiting the maximum number of suggestions per token to five (size: 5).

The response from the suggest API contains one key for every suggester you requested. Its value is an array with one entry for each term in your text field. Each entry has an options array containing the suggestions, with the suggested text in the text field of each option. Below is a portion of the response from the above request.
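To show how that structure can be traversed, here is a sketch that collects the suggested texts for each entered term. The extractSuggestions name and its return shape are choices made for this example.

```javascript
// For one named suggester, list each entered term with its suggested replacements.
function extractSuggestions(response, suggesterName) {
  return response[suggesterName].map(entry => ({
    term: entry.text,
    suggestions: entry.options.map(option => option.text)
  }));
}

// Usage: extractSuggestions(response, 'titleSuggester')
```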

To get phrase suggestions, we can follow the same format as above, and just change the suggester type to phrase. In the following example, the response follows the same format as explained above. (link to source)
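A sketch of the phrase variant, under the same assumptions as the term suggester example. Only the suggester type changes; suggestTitlePhrases is again a name introduced here.

```javascript
// Ask for up to five whole-phrase suggestions, based on the title field.
function suggestTitlePhrases(esClient, text) {
  return esClient.suggest({
    index: 'library',
    body: {
      text: text,
      titleSuggester: {
        phrase: {
          field: 'title',
          size: 5
        }
      }
    }
  });
}
```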

Further Reading

Elasticsearch provides a wide range of features that are well beyond the scope of this single article. In this article I tried to explain its features from a high level and refer you to proper resources for further studying. Elasticsearch is very reliable and has fantastic performance (which I hope you have noticed when running examples). This, coupled with growing community support has increased Elasticsearch adoption in industry, especially in firms dealing with real-time or big data.

After going over the examples provided here, I highly recommend looking at the documentation. They provide two main sources, one as a reference to Elasticsearch and its features, and the other as a guide that focuses more on implementation, use cases, and best practices. You can also find detailed documentation of the Node.js client here.

Are you already using Elasticsearch? What are your experiences? Or maybe you’re going to give it a shot after reading this article. Let me know in the comments below.
