1. Inverted Index

1️⃣ Overview

Inverted Index = core data structure used in search engines
Used by:
- Elasticsearch
- Apache Lucene
- Search engines (e.g., Google)

📌 Idea

term (word) → list of document IDs

2️⃣ Basic Concepts

Terms

Term	Meaning
Corpus	Collection of documents
Document	Individual item with ID
Term	Word/token
Posting List	List of doc IDs for a term

Corpus (documents):

Doc1: "fish is tasty"
Doc2: "wall painting"
Doc3: "fish on wall"

Corpus → contains Documents → broken into Terms → stored in Inverted Index

3️⃣ Naive Search vs Inverted Index

❌ Naive Search

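result = []
for doc in docs:            # docs: list of document texts
    if "fish" in doc:       # substring scan over the full text
        result.append(doc)

Time: O(N * doc_size), where N is the number of documents
Inefficient: every query rescans the whole corpus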

✅ Inverted Index

index = {
  "fish": [1, 3],      # posting list
  "wall": [2, 3],
  "tasty": [1],
  "painting": [2]
}

Direct lookup → fast
No full scan needed
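
A minimal sketch of the lookup, assuming the index is a plain Python dict keyed by term. The `search` helper is a hypothetical name used only for illustration.

```python
# Toy index for the three-document corpus above (doc IDs only).
index = {
    "fish": [1, 3],
    "wall": [2, 3],
    "tasty": [1],
    "painting": [2],
}

def search(term: str) -> list[int]:
    """Return the posting list for a term, or an empty list if the term is unseen."""
    # A dict lookup is O(1) on average, independent of the number of documents.
    return index.get(term, [])

print(search("fish"))   # [1, 3]
print(search("shark"))  # []
```

The cost moves from query time to index-build time: the corpus is scanned once, and each query afterwards is a single hash lookup.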

4️⃣ How Index is Built

🔄 Pipeline

1. Tokenization

Break text into words (tokens)

"Fish on Wall" → ["Fish", "on", "Wall"]

2. Lowercasing

Fish → fish
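
Steps 1 and 2 can be sketched together. This assumes whitespace is a good enough token boundary, which holds for the example corpus but not for real text. `tokenize` is a hypothetical helper name.

```python
def tokenize(text: str) -> list[str]:
    """Split on whitespace (step 1), then lowercase each token (step 2)."""
    return [token.lower() for token in text.split()]

print(tokenize("Fish on Wall"))  # ['fish', 'on', 'wall']
```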

3. Remove punctuation

"wall," → wall

4. Stemming

housing → hous (Porter; a stem need not be a real word)
cars → car

5. Lemmatization (better but slower)

Converts each word to its dictionary form (lemma), so the result is a valid word
Both techniques (stemming and lemmatization) aim to reduce words to a base form

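Original Word	Stemming (crude suffix stripping)	Lemmatization (Lemma)
Studies	studi	study
Caring	car	care

Note: exact stems depend on the stemmer. Porter gives studi for studies, but its extra rules restore the final e and return care for caring.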
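
The difference between the two columns can be sketched with toy versions of both techniques. Everything here is an assumption for illustration: `SUFFIX_RULES` is a tiny hand-picked rule set (not the Porter algorithm), and `LEMMAS` is a hypothetical three-entry dictionary standing in for a real vocabulary.

```python
# Toy rules only; a real stemmer such as Porter applies many more conditions.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

# Hypothetical hand-written lemma dictionary; real lemmatizers use a full vocabulary.
LEMMAS = {"studies": "study", "caring": "care", "cars": "car"}

def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule, without checking the result is a word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three characters to remain, so "is" is left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["Studies", "Caring", "cars"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer is cheap because it only looks at the characters of the word. The lemmatizer needs a lookup resource, which is where the extra cost comes from.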

6. Remove Stop Words

is, the, are, was...
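
Steps 3 and 6 both trim or filter tokens, so they can be sketched in one pass. The stop-word set below is a short assumed list, not a standard one, and `clean` is a hypothetical helper name.

```python
import string

# Assumed, illustrative stop-word list; production lists are much longer.
STOP_WORDS = {"is", "the", "are", "was", "on"}

def clean(tokens: list[str]) -> list[str]:
    """Strip surrounding punctuation, then drop empty tokens and stop words."""
    stripped = (t.strip(string.punctuation) for t in tokens)
    return [t for t in stripped if t and t not in STOP_WORDS]

print(clean(["fish", "is", "tasty"]))  # ['fish', 'tasty']
print(clean(["fish", "on", "wall,"]))  # ['fish', 'wall']
```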

Final Step

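from collections import defaultdict

index = defaultdict(list)      # term → posting list
for term in set(tokens):       # tokens of one document; set() avoids duplicate doc IDs
    index[term].append(doc_id)

Result for the corpus above:

index = {
  "fish": [1, 3],
  "wall": [2, 3],
  "tasty": [1],
  "painting": [2]
}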
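
Putting the pipeline together, here is a minimal end-to-end sketch for the example corpus. It skips stemming and lemmatization for brevity, and the stop-word list and the `analyze` and `build_index` names are assumptions made for this example.

```python
from collections import defaultdict
import string

STOP_WORDS = {"is", "on", "the", "are", "was"}  # assumed list for this example

corpus = {1: "fish is tasty", 2: "wall painting", 3: "fish on wall"}

def analyze(text: str) -> list[str]:
    """Tokenize, lowercase, strip punctuation, and drop stop words."""
    tokens = (t.lower().strip(string.punctuation) for t in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Return term -> sorted list of doc IDs containing the term."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                    # ascending IDs keep lists sorted
        for term in set(analyze(docs[doc_id])):    # one entry per (term, doc)
            index[term].append(doc_id)
    return dict(index)

index = build_index(corpus)
print(index["fish"], index["wall"], index["painting"])  # [1, 3] [2, 3] [2]
```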

5️⃣ Posting List (Advanced)

Instead of just doc IDs:

fish → [
  (doc1, freq=1, pos=1, offset=0),
  (doc3, freq=1, pos=1, offset=0)
]
wall → [
  (doc2, freq=1, pos=1, offset=0),
  (doc3, freq=1, pos=3, offset=8)
]

- Frequency → how many times the term appears in the doc
- Position  → word index within the doc (1-based here)
- Offset    → character position where the term starts (0-based here)
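
A sketch of how such entries could be computed, assuming words are matched with the regex `\w+`, positions are 1-based, offsets are 0-based, and stop words are kept so that positions reflect the original word order. The tuple layout `(doc_id, freq, positions, offsets)` and the function name are choices made for this example.

```python
from collections import defaultdict
import re

def positional_postings(docs: dict[int, str]) -> dict[str, list[tuple]]:
    """Map term -> [(doc_id, freq, positions, offsets), ...]."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        per_term = defaultdict(lambda: ([], []))   # term -> (positions, offsets)
        for pos, match in enumerate(re.finditer(r"\w+", docs[doc_id]), start=1):
            positions, offsets = per_term[match.group().lower()]
            positions.append(pos)                  # 1-based word position
            offsets.append(match.start())          # 0-based character offset
        for term, (positions, offsets) in per_term.items():
            index[term].append((doc_id, len(positions), positions, offsets))
    return dict(index)

corpus = {1: "fish is tasty", 2: "wall painting", 3: "fish on wall"}
postings = positional_postings(corpus)
print(postings["wall"])  # [(2, 1, [1], [0]), (3, 1, [3], [8])]
```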

Why Store This?

- Proximity and phrase search (“fish near wall”)
- Better ranking (e.g., term frequency can feed relevance scoring)
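
A proximity query can be sketched on top of stored positions. The layout below (term → doc ID → list of 1-based word positions) and the `near` function are assumptions for illustration, and the pairwise comparison is the simplest approach rather than the fastest.

```python
# Assumed layout: term -> {doc_id: sorted list of 1-based word positions}.
positions = {
    "fish": {1: [1], 3: [1]},
    "wall": {2: [1], 3: [3]},
}

def near(term_a: str, term_b: str, max_distance: int) -> list[int]:
    """Return doc IDs where the two terms occur within max_distance words."""
    docs_a = positions.get(term_a, {})
    docs_b = positions.get(term_b, {})
    hits = []
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):  # docs with both terms
        close = any(
            abs(pa - pb) <= max_distance
            for pa in docs_a[doc_id]
            for pb in docs_b[doc_id]
        )
        if close:
            hits.append(doc_id)
    return hits

print(near("fish", "wall", 2))  # [3]
print(near("fish", "wall", 1))  # []
```

In Doc3 ("fish on wall") the terms sit at positions 1 and 3, so they match at distance 2 but not at distance 1.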

6️⃣ Query Lookup

Query "fish AND wall"

fish → [1,3]
wall → [2,3]

Intersect the two posting lists to get the docs that contain both terms
- Other boolean operators work the same way: OR is a union, NOT is a set difference

👉 Intersection:

result = [3]
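
The boolean operators map directly onto set operations. A minimal sketch using Python sets, where `docs_for` is a hypothetical helper and `all_docs` (needed for a standalone NOT) is assumed to be known to the index:

```python
index = {"fish": [1, 3], "wall": [2, 3], "tasty": [1], "painting": [2]}
all_docs = {1, 2, 3}  # a standalone NOT needs the full set of doc IDs

def docs_for(term: str) -> set[int]:
    """Posting list of a term as a set (empty if the term is not indexed)."""
    return set(index.get(term, []))

fish_and_wall = sorted(docs_for("fish") & docs_for("wall"))  # AND → intersection
fish_or_wall = sorted(docs_for("fish") | docs_for("wall"))   # OR  → union
fish_not_wall = sorted(docs_for("fish") - docs_for("wall"))  # AND NOT → difference
not_fish = sorted(all_docs - docs_for("fish"))               # NOT → complement

print(fish_and_wall, fish_or_wall, fish_not_wall, not_fish)
# [3] [1, 2, 3] [1] [2]
```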

7️⃣ Optimizations ⚡

1. Sorted Posting Lists

Keep posting lists sorted by doc ID: two lists of lengths n and m can then be intersected with a linear merge, O(n + m)
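
The linear merge can be sketched as a two-pointer walk. This assumes both lists are sorted in ascending order with no duplicates. `intersect_sorted` is a name chosen for this example.

```python
def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    """Intersect two ascending posting lists in O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:        # doc contains both terms
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:       # advance the pointer at the smaller doc ID
            i += 1
        else:
            j += 1
    return result

print(intersect_sorted([1, 3], [2, 3]))  # [3]
```

Each step advances at least one pointer, so no element is visited twice and no hash set has to be built.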

2. Compression

Reduce memory usage
Techniques:
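- Delta encoding (store the gaps between consecutive doc IDs instead of the IDs)
- Varints (variable-length integers: small numbers take fewer bytes)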
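
The two techniques combine naturally: sorted doc IDs become small gaps, and small gaps fit in fewer varint bytes. The sketch below assumes one common varint layout (7 payload bits per byte, low-order group first, high bit set on every byte except the last). The function names and sample doc IDs are made up for illustration.

```python
def delta_encode(doc_ids: list[int]) -> list[int]:
    """Replace each sorted doc ID with its gap from the previous one."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps: list[int]) -> list[int]:
    """Rebuild doc IDs with a running sum of the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

def varint_encode(numbers: list[int]) -> bytes:
    """Encode non-negative ints, 7 bits per byte, high bit = 'more bytes follow'."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)

def varint_decode(data: bytes) -> list[int]:
    """Inverse of varint_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: more groups follow
            shift += 7
        else:                    # last byte of this number
            numbers.append(current)
            current, shift = 0, 0
    return numbers

doc_ids = [1000, 1003, 1010, 1500]
gaps = delta_encode(doc_ids)
encoded = varint_encode(gaps)
print(gaps, len(encoded))  # [1000, 3, 7, 490] 6
```

Here four doc IDs take 6 bytes, compared with 16 bytes as fixed 32-bit integers. The saving grows as posting lists get denser and gaps get smaller.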

Interview One-Liner

“An inverted index maps terms to the list of documents containing them, enabling fast full-text search without scanning all documents.”
