1οΈβ£ Overview
- Inverted Index = core data structure used in search engines
- Used by:
- Elasticsearch
- Apache Lucene
- Search engines (e.g., Google),
π Idea
term (word) β list of document IDs
2οΈβ£ Basic Concepts
- Terms
| Term | Meaning |
|---|---|
| Corpus | Collection of documents |
| Document | Individual item with ID |
| Term | Word/token |
| Posting List | List of doc IDs for a term |
- Corpus (documents):
Doc1: "fish is tasty"
Doc2: "wall painting"
Doc3: "fish on wall"- Corpus β contains Documents β broken into Terms β stored in Inverted Index
3οΈβ£ Naive Search vs Inverted Index
β Naive Search
for doc in docs:
if "fish" in doc:
result.append(doc)- Time:
O(N * doc_size) - Inefficient
β Inverted Index
index = {
"fish": [1, 3], β posting list
"wall": [2, 3],
"tasty": [1],
"painting": [1]
}- Direct lookup β fast
- No full scan needed
4οΈβ£ How Index is Built
π Pipeline
1. Tokenization
- Break text into words or token
"Fish on Wall" β ["Fish", "on", "Wall"]2. Lowercasing
Fish β fish3. Remove punctuation
"wall," β wall4. Stemming
housing β house
cars β car5. Lemmatization (better but slower)
- Converts to more correct root word
- both techniques (stemming & Lemmatization) aim to reduce words to their base form
| Original Word | Stemming (Porter) | Lemmatization (Lemma) |
|---|---|---|
| Studies | studi | study |
| Caring | car | care |
6. Remove Stop Words
is, the, are, was...Final Step
for term in tokens:
index[term].append(doc_id)
index = {
"fish": [1, 3],
"wall": [2, 3],
"tasty": [1],
"painting": [1]
}5οΈβ£ Posting List (Advanced)
Instead of just doc IDs:
fish β [
(doc1, freq=1, pos=7, offset=28),
(doc3, freq=1, pos=1, offset=0)
]
wall β [
(doc1, freq=1, pos=7, offset=28),
(doc3, freq=1, pos=1, offset=0)
]
- Frequency β how many times term appears
- Position β word order in doc
- Offset β character positionWhy Store This?
- Proximity search (βfish near wallβ).
- Better ranking
6οΈβ£ Query Lookup
Query "fish AND wall"
fish β [1,3]
wall β [2,3]- set intersection to get docs that contains both
- Apply boolean algebra - OR, NOT, etc
π Intersection:
result = [3]7οΈβ£ Optimizations β‘
1. Sorted Posting Lists
- Keep posting list sorted - Faster merge (
O(n))
2. Compression
- Reduce memory usage
- Techniques:
- Delta encoding
- Varints
Interview One-Liner
βAn inverted index maps terms to the list of documents containing them, enabling fast full-text search without scanning all documents.β