1️⃣ Overview

  • Inverted Index = core data structure used in search engines
  • Used by:
    • Elasticsearch
    • Apache Lucene
    • Search engines (e.g., Google),

πŸ“Œ Idea

term (word) β†’ list of document IDs

2️⃣ Basic Concepts

  • Terms
TermMeaning
CorpusCollection of documents
DocumentIndividual item with ID
TermWord/token
Posting ListList of doc IDs for a term
  • Corpus (documents):
Doc1: "fish is tasty"
Doc2: "wall painting"
Doc3: "fish on wall"
  • Corpus β†’ contains Documents β†’ broken into Terms β†’ stored in Inverted Index

3️⃣ Naive Search vs Inverted Index

for doc in docs:
    if "fish" in doc:
        result.append(doc)
  • Time: O(N * doc_size)
  • Inefficient

βœ… Inverted Index

index = {
  "fish": [1, 3], ← posting list
  "wall": [2, 3],
  "tasty": [1],
  "painting": [1]
}
  • Direct lookup β†’ fast
  • No full scan needed

4️⃣ How Index is Built

πŸ”„ Pipeline

1. Tokenization

  • Break text into words or token
"Fish on Wall" β†’ ["Fish", "on", "Wall"]

2. Lowercasing

Fish β†’ fish

3. Remove punctuation

"wall," β†’ wall

4. Stemming

housing β†’ house
cars β†’ car

5. Lemmatization (better but slower)

  • Converts to more correct root word
  • both techniques (stemming & Lemmatization) aim to reduce words to their base form
Original WordStemming (Porter)Lemmatization (Lemma)
Studiesstudistudy
Caringcarcare

6. Remove Stop Words

is, the, are, was...

Final Step

for term in tokens:
    index[term].append(doc_id)
 
index = {
  "fish": [1, 3],
  "wall": [2, 3],
  "tasty": [1],
  "painting": [1]
}

5️⃣ Posting List (Advanced)

Instead of just doc IDs:

fish β†’ [
  (doc1, freq=1, pos=7, offset=28),
  (doc3, freq=1, pos=1, offset=0)
]
wall β†’ [
  (doc1, freq=1, pos=7, offset=28),
  (doc3, freq=1, pos=1, offset=0)
]
 
 
- Frequency β†’ how many times term appears
- Position  β†’ word order in doc
- Offset    β†’ character position

Why Store This?

  • Proximity search (β€œfish near wall”).
  • Better ranking

6️⃣ Query Lookup

Query "fish AND wall"

fish β†’ [1,3]
wall β†’ [2,3]
  • set intersection to get docs that contains both
    • Apply boolean algebra - OR, NOT, etc

πŸ‘‰ Intersection:

result = [3]

7️⃣ Optimizations ⚑

1. Sorted Posting Lists

  • Keep posting list sorted - Faster merge (O(n))

2. Compression

  • Reduce memory usage
  • Techniques:
    • Delta encoding
    • Varints

Interview One-Liner

β€œAn inverted index maps terms to the list of documents containing them, enabling fast full-text search without scanning all documents.”