Term Frequency-Inverse Document Frequency (TF-IDF) is one of the most fundamental and widely used techniques in natural language processing and information retrieval. Whether you’re building a search engine, performing document classification, or analyzing text data, understanding how to calculate TF-IDF score in Python is an essential skill for any data scientist or NLP practitioner.
This comprehensive guide will walk you through the mathematical foundations of TF-IDF, demonstrate multiple implementation approaches, and provide practical examples that you can apply to your own projects. By the end of this article, you’ll have a thorough understanding of TF-IDF and the ability to implement it effectively in Python.
Understanding TF-IDF: The Mathematical Foundation
What is TF-IDF?
TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection of documents (corpus). The technique combines two key components:
Term Frequency (TF): Measures how frequently a term appears in a document relative to the total number of terms in that document.
Inverse Document Frequency (IDF): Measures how rare or common a term is across the entire corpus, giving higher weights to rare terms and lower weights to common terms.
The combination of these two metrics creates a scoring system that identifies terms that are both frequent in a specific document and rare across the corpus, making them highly distinctive and informative.
The Mathematical Formulas
Understanding the mathematics behind TF-IDF is crucial for proper implementation:
Term Frequency (TF):
TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)
Inverse Document Frequency (IDF):
IDF(t,D) = log(Total number of documents / Number of documents containing term t)
TF-IDF Score:
TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)
Where:
- t = term (word)
- d = document
- D = corpus (collection of documents)
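For example, suppose the term ‘cat’ appears once in a 6-word document, and only 1 of the 3 documents in the corpus contains ‘cat’. Then TF = 1/6 ≈ 0.167, IDF = log(3/1) ≈ 1.099 (using the natural logarithm, as the implementations below do), and the TF-IDF score is about 0.167 × 1.099 ≈ 0.18. By contrast, a word like ‘the’ that appears in every document gets IDF = log(3/3) = 0, so its TF-IDF score is zero no matter how often it occurs.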
Method 1: Manual Implementation from Scratch
Building the Foundation
Let’s start by implementing TF-IDF manually to understand the underlying mechanics. This approach gives you complete control over the calculation process and helps solidify your understanding of the algorithm.
import math

def calculate_tf(text):
    """Calculate term frequency for a document"""
    words = text.lower().split()
    word_count = len(words)
    tf_dict = {}
    for word in words:
        tf_dict[word] = tf_dict.get(word, 0) + 1
    # Normalize by total word count
    for word in tf_dict:
        tf_dict[word] = tf_dict[word] / word_count
    return tf_dict

def calculate_idf(documents):
    """Calculate inverse document frequency for all terms"""
    N = len(documents)
    idf_dict = {}
    all_words = set(word for doc in documents for word in doc.lower().split())
    for word in all_words:
        containing_docs = sum(1 for doc in documents if word in doc.lower().split())
        idf_dict[word] = math.log(N / containing_docs)
    return idf_dict

def calculate_tfidf(documents):
    """Calculate TF-IDF scores for all documents"""
    # Calculate IDF for all terms
    idf_dict = calculate_idf(documents)
    # Calculate TF-IDF for each document
    tfidf_documents = []
    for doc in documents:
        tf_dict = calculate_tf(doc)
        tfidf_dict = {}
        for word, tf_value in tf_dict.items():
            tfidf_dict[word] = tf_value * idf_dict[word]
        tfidf_documents.append(tfidf_dict)
    return tfidf_documents

# Example usage
documents = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "cats and dogs are pets"
]

tfidf_scores = calculate_tfidf(documents)
for i, doc_scores in enumerate(tfidf_scores):
    print(f"Document {i+1} TF-IDF scores:")
    for word, score in sorted(doc_scores.items(), key=lambda x: x[1], reverse=True):
        print(f"  {word}: {score:.4f}")
    print()
Understanding the Manual Implementation
This manual implementation demonstrates several key concepts:
Text Preprocessing: The code converts text to lowercase and splits on whitespace, representing basic tokenization.
TF Calculation: Term frequency is calculated by counting word occurrences and normalizing by document length.
IDF Calculation: Inverse document frequency uses the logarithm of the ratio between total documents and documents containing each term.
Score Combination: The final TF-IDF score multiplies TF and IDF values for each term.
Method 2: Using Scikit-learn’s TfidfVectorizer
The Professional Approach
While manual implementation is educational, scikit-learn’s TfidfVectorizer provides a robust, optimized solution for production use. This approach handles edge cases, offers extensive customization options, and integrates seamlessly with other machine learning tools. Note that its default formulation differs slightly from the textbook formulas above: it uses a smoothed IDF (log((1 + N) / (1 + df)) + 1) and L2-normalizes each document vector, so its scores will not match the manual implementation exactly.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def sklearn_tfidf_example():
    """Demonstrate TF-IDF calculation using scikit-learn"""
    # Sample documents
    documents = [
        "The quick brown fox jumps over the lazy dog",
        "Machine learning is a subset of artificial intelligence",
        "Natural language processing deals with text analysis",
        "Python is a popular programming language for data science"
    ]

    # Initialize TfidfVectorizer
    vectorizer = TfidfVectorizer(
        lowercase=True,         # Convert to lowercase
        stop_words='english',   # Remove English stop words
        max_features=1000,      # Limit vocabulary size
        ngram_range=(1, 2)      # Include unigrams and bigrams
    )

    # Fit and transform documents
    tfidf_matrix = vectorizer.fit_transform(documents)

    # Get feature names (vocabulary)
    feature_names = vectorizer.get_feature_names_out()

    # Convert to DataFrame for better visualization
    tfidf_df = pd.DataFrame(
        tfidf_matrix.toarray(),
        columns=feature_names,
        index=[f"Doc_{i+1}" for i in range(len(documents))]
    )
    return tfidf_df, vectorizer

# Run the example
tfidf_df, vectorizer = sklearn_tfidf_example()

# Display top TF-IDF scores for each document
print("Top 5 TF-IDF scores per document:")
for idx, row in tfidf_df.iterrows():
    top_scores = row.nlargest(5)
    print(f"\n{idx}:")
    for term, score in top_scores.items():
        if score > 0:
            print(f"  {term}: {score:.4f}")
Advanced Scikit-learn Configuration
The TfidfVectorizer offers numerous parameters for customization:
Preprocessing Options:
- lowercase: Convert text to lowercase
- stop_words: Remove common words that don’t carry much meaning
- token_pattern: Define custom tokenization patterns
- preprocessor: Apply custom preprocessing functions
Vocabulary Control:
- max_features: Limit vocabulary size to the most frequent terms
- min_df: Ignore terms appearing in fewer than the specified number of documents
- max_df: Ignore terms appearing in more than the specified fraction of documents
- vocabulary: Use a predefined vocabulary
N-gram Configuration:
- ngram_range: Include unigrams, bigrams, trigrams, etc.
- analyzer: Choose between ‘word’, ‘char’, or custom analyzers
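As a quick illustration, several of these parameters can be combined when constructing the vectorizer. The values below are arbitrary examples, not recommended settings:

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative configuration; the thresholds and pattern are examples only
vectorizer = TfidfVectorizer(
    min_df=2,                                # keep terms that appear in at least 2 documents
    max_df=0.9,                              # drop terms that appear in more than 90% of documents
    ngram_range=(1, 3),                      # unigrams, bigrams, and trigrams
    token_pattern=r"(?u)\b[a-zA-Z]{2,}\b",   # custom tokenization: alphabetic tokens of 2+ characters
)

After configuring the vectorizer, you would call fit_transform on your corpus exactly as in the example above.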
Method 3: Custom Implementation with Advanced Features
Enhanced Manual Implementation
For specialized use cases, you might need a custom implementation that combines the transparency of manual coding with advanced features:
import re
from collections import defaultdict
import numpy as np
from math import log

class CustomTfIdf:
    def __init__(self, lowercase=True, remove_stopwords=True, min_df=1, max_df=1.0):
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.min_df = min_df
        self.max_df = max_df
        self.vocabulary = {}
        self.idf_values = {}
        # Basic English stop words
        self.stop_words = {
            'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
            'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'being'
        }

    def preprocess_text(self, text):
        """Clean and preprocess text"""
        if self.lowercase:
            text = text.lower()
        # Remove punctuation and split
        words = re.findall(r'\b\w+\b', text)
        if self.remove_stopwords:
            words = [word for word in words if word not in self.stop_words]
        return words

    def fit(self, documents):
        """Fit the TF-IDF model to documents"""
        processed_docs = [self.preprocess_text(doc) for doc in documents]

        # Count how many documents each word appears in
        word_doc_count = defaultdict(int)
        for doc in processed_docs:
            unique_words = set(doc)
            for word in unique_words:
                word_doc_count[word] += 1

        # Filter vocabulary based on document frequency
        n_docs = len(documents)
        min_count = self.min_df if isinstance(self.min_df, int) else int(self.min_df * n_docs)
        max_count = self.max_df if isinstance(self.max_df, int) else int(self.max_df * n_docs)

        # Assign contiguous column indices only to words that pass the filter
        self.vocabulary = {
            word: idx
            for idx, word in enumerate(
                word for word, count in word_doc_count.items()
                if min_count <= count <= max_count
            )
        }

        # Calculate IDF values
        for word, doc_count in word_doc_count.items():
            if word in self.vocabulary:
                self.idf_values[word] = log(n_docs / doc_count)
        return self

    def transform(self, documents):
        """Transform documents to TF-IDF vectors"""
        processed_docs = [self.preprocess_text(doc) for doc in documents]
        n_docs = len(documents)
        n_features = len(self.vocabulary)
        tfidf_matrix = np.zeros((n_docs, n_features))

        for doc_idx, doc in enumerate(processed_docs):
            word_counts = defaultdict(int)
            for word in doc:
                if word in self.vocabulary:
                    word_counts[word] += 1

            doc_length = len(doc)
            if doc_length > 0:
                for word, count in word_counts.items():
                    tf = count / doc_length
                    idf = self.idf_values[word]
                    word_idx = self.vocabulary[word]
                    tfidf_matrix[doc_idx, word_idx] = tf * idf
        return tfidf_matrix

    def fit_transform(self, documents):
        """Fit model and transform documents in one step"""
        return self.fit(documents).transform(documents)

    def get_feature_names(self):
        """Get vocabulary terms ordered by column index"""
        return [word for word, _ in sorted(self.vocabulary.items(), key=lambda x: x[1])]

# Example usage
custom_tfidf = CustomTfIdf(min_df=1, max_df=0.8)

documents = [
    "Python programming is powerful and versatile",
    "Machine learning with Python is popular",
    "Data science uses Python for analysis",
    "Programming languages include Python and Java"
]

tfidf_matrix = custom_tfidf.fit_transform(documents)
feature_names = custom_tfidf.get_feature_names()

print("Custom TF-IDF Implementation Results:")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}: '{doc[:50]}'")
    doc_scores = [(feature_names[j], tfidf_matrix[i, j])
                  for j in range(len(feature_names)) if tfidf_matrix[i, j] > 0]
    doc_scores.sort(key=lambda x: x[1], reverse=True)
    for word, score in doc_scores[:5]:
        print(f"  {word}: {score:.4f}")
Practical Applications and Use Cases
Document Similarity and Search
TF-IDF scores can be used to measure document similarity and build search systems:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def document_similarity_example():
    """Demonstrate document similarity using TF-IDF"""
    documents = [
        "Machine learning algorithms for data analysis",
        "Deep learning neural networks and AI",
        "Data science with Python programming",
        "Artificial intelligence and machine learning"
    ]

    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)

    # Calculate pairwise cosine similarity between document vectors
    similarity_matrix = cosine_similarity(tfidf_matrix)

    print("Document Similarity Matrix:")
    for i in range(len(documents)):
        for j in range(len(documents)):
            print(f"Doc{i+1} vs Doc{j+1}: {similarity_matrix[i,j]:.3f}")
        print()

document_similarity_example()
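The same machinery extends naturally to a simple search scenario: fit the vectorizer on the corpus, transform the query with that same vectorizer, and rank documents by cosine similarity. The function below is a minimal sketch along those lines; the corpus and query are placeholders chosen for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search(query, documents):
    """Rank documents by TF-IDF cosine similarity to a query (minimal sketch)."""
    vectorizer = TfidfVectorizer(stop_words='english')
    doc_matrix = vectorizer.fit_transform(documents)      # fit on the corpus
    query_vector = vectorizer.transform([query])          # reuse the same vocabulary for the query
    scores = cosine_similarity(query_vector, doc_matrix).ravel()
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)

# Placeholder corpus and query
corpus = [
    "Machine learning algorithms for data analysis",
    "Deep learning neural networks and AI",
    "Data science with Python programming",
]
for doc, score in search("machine learning with Python", corpus):
    print(f"{score:.3f}  {doc}")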
Feature Selection for Classification
TF-IDF can help identify the most discriminative terms for text classification:
def feature_importance_analysis(documents, labels):
    """Analyze feature importance using TF-IDF"""
    vectorizer = TfidfVectorizer(max_features=100)
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    # Calculate average TF-IDF scores per class
    unique_labels = list(set(labels))
    class_profiles = {}
    for label in unique_labels:
        label_docs = [i for i, l in enumerate(labels) if l == label]
        class_tfidf = tfidf_matrix[label_docs].mean(axis=0).A1
        class_profiles[label] = dict(zip(feature_names, class_tfidf))
    return class_profiles
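For illustration, the function can be called on a small hand-labeled corpus; the documents and labels below are invented purely to show the call pattern.

# Invented mini-corpus and labels, purely for illustration
docs = [
    "the team won the final game in overtime",
    "the striker scored twice in the match",
    "parliament passed the new budget bill",
    "the election results were announced today",
]
labels = ["sports", "sports", "politics", "politics"]

profiles = feature_importance_analysis(docs, labels)
for label, profile in profiles.items():
    top_terms = sorted(profile.items(), key=lambda x: x[1], reverse=True)[:3]
    print(label, [(term, round(float(score), 3)) for term, score in top_terms])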
Performance Optimization and Best Practices
Memory and Speed Considerations
When working with large corpora, consider these optimization strategies:
Sparse Matrix Usage: TF-IDF matrices are typically sparse, so use scipy sparse matrices to save memory.
Batch Processing: For very large datasets, process documents in batches to manage memory usage.
Vocabulary Limiting: Use the max_features, min_df, and max_df parameters to control vocabulary size.
Preprocessing Efficiency: Optimize text preprocessing steps for better performance.
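Two of these points, sparse storage and vocabulary limiting, are easy to see in practice. The snippet below is a small sketch with a placeholder corpus and arbitrary thresholds:

from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp

# Placeholder corpus; in practice this would be a large document collection
corpus = [
    "python for data analysis",
    "data analysis with pandas",
    "machine learning in python",
    "deep learning with neural networks",
]

# max_features, min_df, and max_df shrink the vocabulary; the thresholds here are illustrative
vectorizer = TfidfVectorizer(max_features=50000, min_df=2, max_df=0.95)
tfidf_matrix = vectorizer.fit_transform(corpus)

print(sp.issparse(tfidf_matrix))               # True: a scipy sparse matrix, not a dense array
print(tfidf_matrix.shape, tfidf_matrix.nnz)    # matrix dimensions and count of stored non-zero values

# Keep the matrix sparse; calling .toarray() on a large corpus can exhaust memory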
Common Pitfalls and Solutions
Zero Division Issues: Handle empty documents and ensure proper normalization.
Memory Limitations: Monitor memory usage when processing large document collections.
Vocabulary Explosion: Control vocabulary size to prevent excessive memory usage and computational complexity.
Inconsistent Preprocessing: Ensure consistent preprocessing between training and inference.
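The last pitfall is worth a concrete illustration: fit the vectorizer once on the training data, then only transform new data with it, so both pass through the identical vocabulary and preprocessing. A minimal sketch with placeholder texts:

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["machine learning with python", "data analysis in python"]   # placeholders
test_texts = ["deep learning for text analysis"]                            # placeholders

vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X_train = vectorizer.fit_transform(train_texts)   # learn vocabulary and IDF from training data only
X_test = vectorizer.transform(test_texts)         # reuse the fitted vocabulary; never re-fit on test data

print(X_train.shape, X_test.shape)                # both matrices share the same number of columns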
Conclusion
Understanding how to calculate TF-IDF score in Python opens up numerous possibilities for text analysis and natural language processing tasks. Whether you choose manual implementation for educational purposes, scikit-learn for production efficiency, or custom solutions for specialized requirements, TF-IDF remains a powerful and versatile technique.
The key to successful TF-IDF implementation lies in understanding the mathematical foundations, choosing appropriate preprocessing steps, and selecting the right approach for your specific use case. By mastering these concepts and techniques, you’ll be well-equipped to tackle a wide range of text analysis challenges and build sophisticated NLP applications.
Remember that TF-IDF is just one tool in the NLP toolkit. As you advance in your text analysis journey, consider exploring more advanced techniques like word embeddings, transformer models, and deep learning approaches that can complement or enhance TF-IDF-based solutions.