SEMANTIC SEARCH ENGINE FOR Q&A USING ELASTIC SEARCH AND DOCKER

Hey guys, today’s article is about a DevOps project that I created with my partner.

PROBLEM DEFINITION

When we use Stack Overflow and need an answer, we type a question into the search box and it returns some related questions. Our problem statement is essentially the same.

Given a question, can we find similar questions in a repository of questions and answers that we already have?

The main objective is to return similar questions within a small amount of time. Search engines typically respond in under 500 milliseconds, and the ordering of the related questions matters just as much as retrieving them. Speed is therefore extremely important when designing this kind of engine.

We want high precision and high recall, and the computational and server costs should be low.

DATASET

We have picked a dataset known as StackSample, a sample of Stack Overflow questions and answers.

One can find the dataset in the below link.

DESIGN

Using Elasticsearch.

Elasticsearch gives us a built-in implementation of the inverted index. It provides default scoring using TF-IDF-based schemes (BM25 in recent versions) and also gives us the flexibility to build our own scoring function. It is distributed, and it serves queries in real time, so latency is very low.
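To get an intuition for TF-IDF-style scoring, here is a rough pure-Python sketch. This is a toy version of the idea (term frequency weighted by how rare the term is across documents), not Elasticsearch's exact formula:

```python
import math

def tf_idf(term, doc, docs):
    """Toy TF-IDF: term frequency in the doc times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)      # how many docs contain the term
    idf = math.log(len(docs) / (1 + df)) + 1    # dampened; +1 in denominator avoids div-by-zero
    return tf * idf

docs = [["install", "docker", "on", "linux"],
        ["docker", "compose", "tutorial"],
        ["install", "python", "packages"]]

# "docker" appears in 2 of 3 docs while "compose" appears in only 1,
# so "compose" scores higher for the second document: it is more discriminative.
print(tf_idf("docker", docs[1], docs))
print(tf_idf("compose", docs[1], docs))
```

Rare terms dominating the score is exactly why keyword search surfaces documents sharing distinctive words with the query.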

SENTENCE VECTOR

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. For text, one of the most common fixed-length representations is the bag-of-words, but this method discards a lot of information, such as word order and semantics. Word embeddings instead represent words in an N-dimensional vector space so that semantically similar words (e.g. “king” and “monarch”) or semantically related words (e.g. “bird” and “fly”) end up close together, depending on the training method (using words as context or using documents as context).
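To see what the bag-of-words representation throws away, consider two sentences with very different meanings but identical word counts (a small illustrative sketch):

```python
from collections import Counter

def bag_of_words(sentence):
    """Represent a sentence as unordered word counts."""
    return Counter(sentence.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The two sentences mean opposite things, yet their
# bag-of-words representations are exactly the same.
print(a == b)  # True
```

This is why we use sentence embeddings instead: they encode word order and semantics, not just counts.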

COSINE SIMILARITY

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined as the cosine of the angle between them, which is the same as the inner product of the two vectors after normalizing both to length 1.
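In code, this definition looks like the following (a plain-Python sketch; in practice a library such as NumPy would do this vectorized):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [1, 0]))            # same direction: 1.0
print(cosine_similarity([1, 0], [0, 1]))            # orthogonal: 0.0
print(round(cosine_similarity([1, 2], [2, 4]), 6))  # parallel: length is ignored
```

Because only the angle matters, two question vectors pointing in the same direction are considered similar regardless of their magnitudes.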

INSTALLATION

We need to install Docker, and then run Elasticsearch inside a Docker container.

Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their software, libraries, and configuration files; they can communicate with each other through well-defined channels.

To install docker check this link.

To install Elasticsearch, run these commands.

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.7.0
docker image ls
docker run -m 6G -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name myelastic docker.elastic.co/elasticsearch/elasticsearch:7.7.0
docker ps
docker stats
docker exec -it myelastic bash

INSTALL PACKAGES

yum -y update
yum install -y python3
yum install -y vim
yum -y install wget
yum clean all
pip3.6 install --upgrade pip
pip3.6 --version
pip3.6 install elasticsearch
pip3.6 install pandas
pip3.6 install --upgrade --no-cache-dir tensorflow
pip3.6 install --upgrade tensorflow-hub

ARCHITECTURE

WORKING

The first thing we need to do is convert sentences into tensors. The code below loads the model and keeps it in memory.

import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (USE4) from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Embed a sentence and convert the resulting tensor to a plain Python list
embeddings = tf.make_ndarray(tf.make_tensor_proto(embed(["The quick brown fox jumps over the lazy dog."]))).tolist()[0]
print(type(embeddings))
print(len(embeddings))  # 512-dimensional vector
print(embeddings)

Files downloaded on the host (such as the dataset zip and the USE4 model archive) are not automatically visible inside the container; we need to copy them into Docker. To do so, type the below command.

docker cp /universal-sentence-encoder_4.tar.gz elasticsearch:/usr/share/elasticsearch/searchqa/data

Elastic Search Indexing

From our data folder, we read the questions one after the other. For each question we call our model, which returns a vector, and we insert both the question text and its vector into Elasticsearch. This whole process is known as indexing. The code for that is as follows.

import json
import time
import sys
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import csv
import tensorflow as tf
import tensorflow_hub as hub

# connect to ES on localhost on port 9200
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
if es.ping():
    print('Connected to ES!')
else:
    print('Could not connect!')
    sys.exit()

print("*********************************************************************************")

# index in ES = DB in an RDBMS
# Read each question and index it into an index called questions-index
# Indexing only titles for this example to improve speed. In practice, it's good to index CONCATENATE(title+body)

# Refer: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
# Mapping: structure of the index
# Property/Field: name and type
b = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "title_vector": {
                "type": "dense_vector",
                "dims": 512
            }
        }
    }
}

ret = es.indices.create(index='questions-index', ignore=400, body=b)  # 400 caused by IndexAlreadyExistsException
print(json.dumps(ret, indent=4))

# TRY this in a browser: http://localhost:9200/questions-index

print("*********************************************************************************")

# load the USE4 model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# CONSTANTS
NUM_QUESTIONS_INDEXED = 200000

# Col-Names: Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
cnt = 0

with open('./data/Questions.csv', encoding="latin1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV, None)  # skip the headers
    for row in readCSV:
        doc_id = row[0]
        title = row[5]
        vec = tf.make_ndarray(tf.make_tensor_proto(embed([title]))).tolist()[0]

        b = {"title": title,
             "title_vector": vec}

        res = es.index(index="questions-index", id=doc_id, body=b)

        # keep count of rows processed
        cnt += 1
        if cnt % 100 == 0:
            print(cnt)

        if cnt == NUM_QUESTIONS_INDEXED:
            break

print("Completed indexing....")

print("*********************************************************************************")

This file, indexES.py, takes all of our questions, computes the vector for each title, and inserts both into Elasticsearch under an index called questions-index.

To check whether the index was created, use the command:

curl -X GET "localhost:9200/questions-index/_stats?pretty"

We can also search by id.

http://localhost:9200/questions-index/_doc/80


The script top200KQues.py reads the data from Questions.csv and creates a new file called top200KQuesData containing just the id and title of each question. The code looks like this.

import json
import time
import sys
import csv

# CONSTANTS
NUM_QUESTIONS_INDEXED = 200000

# Col-Names: Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
cnt = 0

f = open("top200KQuesData", "w", encoding="latin1")
with open('./data/Questions.csv', encoding="latin1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV, None)  # skip the headers
    for row in readCSV:
        doc_id = row[0]
        title = row[5]
        # write id and title to the output file
        f.write(doc_id + "," + title + "\n")
        # keep count of rows processed
        cnt += 1
        if cnt % 100 == 0:
            print(cnt)
        if cnt == NUM_QUESTIONS_INDEXED:
            break
print("Completed writing....")
print("*********************************************************************************")
f.close()
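One caveat with writing `doc_id + "," + title` by hand: question titles frequently contain commas themselves, which corrupts the output file. The `csv.writer` from the standard library quotes such fields automatically. A small sketch with made-up in-memory sample rows (the ids and titles here are illustrative, not from the dataset):

```python
import csv
import io

rows = [("80", "SQLStatement.execute() - multiple queries, one statement"),
        ("90", "Good branching, merging tutorials for TortoiseSVN?")]

buf = io.StringIO()
writer = csv.writer(buf)
for doc_id, title in rows:
    writer.writerow([doc_id, title])  # commas inside fields get quoted automatically

# Reading the data back recovers the original two fields per row intact.
recovered = list(csv.reader(io.StringIO(buf.getvalue())))
print(recovered[0])
```

The same `csv.writer` call could replace the `f.write` line in top200KQues.py.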

We can also download the USE4 model to disk once and load it locally whenever we need it (the search scripts below load it from ./data/USE4/).

To search for similar questions we can use searchES.py, which uses cosine similarity to fetch related questions alongside a plain keyword search.

import json
import time
import sys
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import csv
import tensorflow as tf
import tensorflow_hub as hub


def connect2ES():
    # connect to ES on localhost on port 9200
    es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
    if es.ping():
        print('Connected to ES!')
    else:
        print('Could not connect!')
        sys.exit()
    print("*********************************************************************************")
    return es


def keywordSearch(es, q):
    # Search by keywords on the title field
    b = {
        'query': {
            'match': {
                "title": q
            }
        }
    }
    res = es.search(index='questions-index', body=b)
    print("Keyword Search:\n")
    for hit in res['hits']['hits']:
        print(str(hit['_score']) + "\t" + hit['_source']['title'])
    print("*********************************************************************************")


# Search by vector similarity
def sentenceSimilaritybyNN(embed, es, sent):
    query_vector = tf.make_ndarray(tf.make_tensor_proto(embed([sent]))).tolist()[0]
    b = {
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }
    }
    res = es.search(index='questions-index', body=b)
    print("Semantic Similarity Search:\n")
    for hit in res['hits']['hits']:
        print(str(hit['_score']) + "\t" + hit['_source']['title'])
    print("*********************************************************************************")


if __name__ == "__main__":
    es = connect2ES()
    embed = hub.load("./data/USE4/")
    while True:
        query = input("Enter a Query:")
        start = time.time()
        if query == "END":
            break
        print("Query: " + query)
        keywordSearch(es, query)
        sentenceSimilaritybyNN(embed, es, query)
        end = time.time()
        print(end - start)
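A note on the `cosineSimilarity(params.query_vector, 'title_vector') + 1.0` in the script above: Elasticsearch requires script scores to be non-negative, and cosine similarity ranges from -1 to 1, so adding 1.0 shifts the score into [0, 2]. A quick pure-Python check of the shift:

```python
import math

def cosine_similarity(u, v):
    """Plain-Python cosine similarity, mirroring what the ES script computes."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

# Opposite vectors have cosine -1, so the shifted score bottoms out at 0;
# identical vectors have cosine 1, so the shifted score tops out at 2.
low = cosine_similarity([1, 2], [-1, -2]) + 1.0
high = cosine_similarity([1, 2], [1, 2]) + 1.0
print(low, high)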

We can create a Flask API and use it for searching similar questions. Before doing so, we need to install some packages.
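The API below imports Flask, so at minimum we need it installed inside the container (assuming the same pip3.6 used in the earlier install steps):

```shell
pip3.6 install flask
```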

To run the app in Flask, use the code in searchES_FlaskAPI.py. It looks like this.

import json
import time
import sys
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import csv
import tensorflow as tf
import tensorflow_hub as hub
from flask import Flask


def connect2ES():
    # connect to ES on localhost on port 9200
    es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
    if es.ping():
        print('Connected to ES!')
    else:
        print('Could not connect!')
        sys.exit()
    print("*********************************************************************************")
    return es


def keywordSearch(es, q):
    # Search by keywords
    b = {
        'query': {
            'match': {
                "title": q
            }
        }
    }
    res = es.search(index='questions-index', body=b)
    return res


# Search by vector similarity
def sentenceSimilaritybyNN(es, sent):
    query_vector = tf.make_ndarray(tf.make_tensor_proto(embed([sent]))).tolist()[0]
    b = {
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }
    }
    res = es.search(index='questions-index', body=b)
    return res


app = Flask(__name__)
es = connect2ES()
embed = hub.load("./data/USE4/")


@app.route('/search/<query>')
def search(query):
    q = query.replace("+", " ")
    res_kw = keywordSearch(es, q)
    res_semantic = sentenceSimilaritybyNN(es, q)
    ret = ""
    for hit in res_kw['hits']['hits']:
        ret += (" KW: " + str(hit['_score']) + "\t" + hit['_source']['title'] + "\n")
    for hit in res_semantic['hits']['hits']:
        ret += (" Semantic: " + str(hit['_score']) + "\t" + hit['_source']['title'] + "\n")
    return ret

The final output of the search shows the keyword and semantic results side by side, and searchES.py also prints the time taken for each query. We can additionally scrape the container metrics with Prometheus and use Grafana to visualize memory consumption and other metrics.

Guys, here we come to the end of this blog. I am sure you have enjoyed this use case. I would also really like to thank my friend NIKHIL GR for being a team member and helping me complete this project. I hope you all liked it and found it informative. If you have any query, feel free to reach out to me :)

Follow me for more such blogs, and if you have any feedback, please let me know so I can keep those points in mind next time. If you want to read more such blogs or know more about me, here is my website link: https://sites.google.com/view/adityvgupta/home. Please do not hesitate to keep 👏👏👏👏👏 for it (an open secret: you can clap up to 50 times for a post, and it wouldn't cost you anything), and feel free to share it across. This really means a lot to me.

Gupta Aditya
