How to Calculate TF-IDF Score in Python

Term Frequency-Inverse Document Frequency (TF-IDF) is one of the most fundamental and widely used techniques in natural language processing and information retrieval. Whether you’re building a search engine, performing document classification, or analyzing text data, knowing how to calculate TF-IDF scores in Python is an essential skill for any data scientist or NLP practitioner.

This comprehensive guide will walk you through the mathematical foundations of TF-IDF, demonstrate multiple implementation approaches, and provide practical examples that you can apply to your own projects. By the end of this article, you’ll have a thorough understanding of TF-IDF and the ability to implement it effectively in Python.

Understanding TF-IDF: The Mathematical Foundation

What is TF-IDF?

TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection of documents (corpus). The technique combines two key components:

Term Frequency (TF): Measures how frequently a term appears in a document relative to the total number of terms in that document.

Inverse Document Frequency (IDF): Measures how rare or common a term is across the entire corpus, giving higher weights to rare terms and lower weights to common terms.

The combination of these two metrics creates a scoring system that identifies terms that are both frequent in a specific document and rare across the corpus, making them highly distinctive and informative.

The Mathematical Formulas

Understanding the mathematics behind TF-IDF is crucial for proper implementation:

Term Frequency (TF):

TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)

Inverse Document Frequency (IDF):

IDF(t,D) = log(Total number of documents / Number of documents containing term t)

TF-IDF Score:

TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)

Where:

  • t = term (word)
  • d = document
  • D = corpus (collection of documents)
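
As a quick worked example, suppose the term "learning" appears 3 times in a 100-word document, and 2 of the 10 documents in the corpus contain it. Using the natural logarithm (as the Python implementations below do):

TF = 3 / 100 = 0.03
IDF = log(10 / 2) = log(5) ≈ 1.609
TF-IDF = 0.03 × 1.609 ≈ 0.048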

Method 1: Manual Implementation from Scratch

Building the Foundation

Let’s start by implementing TF-IDF manually to understand the underlying mechanics. This approach gives you complete control over the calculation process and helps solidify your understanding of the algorithm.

import math

def calculate_tf(text):
    """Calculate term frequency for a document"""
    words = text.lower().split()
    word_count = len(words)
    tf_dict = {}
    
    for word in words:
        tf_dict[word] = tf_dict.get(word, 0) + 1
    
    # Normalize by total word count
    for word in tf_dict:
        tf_dict[word] = tf_dict[word] / word_count
    
    return tf_dict

def calculate_idf(documents):
    """Calculate inverse document frequency for all terms"""
    N = len(documents)
    idf_dict = {}
    all_words = set(word for doc in documents for word in doc.lower().split())
    
    for word in all_words:
        containing_docs = sum(1 for doc in documents if word in doc.lower().split())
        idf_dict[word] = math.log(N / containing_docs)
    
    return idf_dict

def calculate_tfidf(documents):
    """Calculate TF-IDF scores for all documents"""
    # Calculate IDF for all terms
    idf_dict = calculate_idf(documents)
    
    # Calculate TF-IDF for each document
    tfidf_documents = []
    
    for doc in documents:
        tf_dict = calculate_tf(doc)
        tfidf_dict = {}
        
        for word, tf_value in tf_dict.items():
            tfidf_dict[word] = tf_value * idf_dict[word]
        
        tfidf_documents.append(tfidf_dict)
    
    return tfidf_documents

# Example usage
documents = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "cats and dogs are pets"
]

tfidf_scores = calculate_tfidf(documents)
for i, doc_scores in enumerate(tfidf_scores):
    print(f"Document {i+1} TF-IDF scores:")
    for word, score in sorted(doc_scores.items(), key=lambda x: x[1], reverse=True):
        print(f"  {word}: {score:.4f}")
    print()
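
Running this on the three sample documents, the word "mat" in the first document gets a TF of 1/6 ≈ 0.167 (one occurrence among six tokens) and an IDF of log(3/1) ≈ 1.099 (it appears in only one of the three documents), giving a TF-IDF score of about 0.183. The common word "the", by contrast, appears in two of the three documents, so its lower IDF of log(3/2) ≈ 0.405 pulls its score down to about 0.135 despite its higher term frequency.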

Understanding the Manual Implementation

This manual implementation demonstrates several key concepts:

Text Preprocessing: The code converts text to lowercase and splits on whitespace, a basic form of tokenization.

TF Calculation: Term frequency is calculated by counting word occurrences and normalizing by document length.

IDF Calculation: Inverse document frequency uses the logarithm of the ratio between total documents and documents containing each term.

Score Combination: The final TF-IDF score multiplies TF and IDF values for each term.

Method 2: Using Scikit-learn’s TfidfVectorizer

The Professional Approach

While manual implementation is educational, scikit-learn’s TfidfVectorizer provides a robust, optimized solution for production use. This approach handles edge cases, offers extensive customization options, and integrates seamlessly with other machine learning tools. Note that its default formula differs slightly from the textbook version above: it uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector, so its scores will not exactly match those of the manual implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def sklearn_tfidf_example():
    """Demonstrate TF-IDF calculation using scikit-learn"""
    
    # Sample documents
    documents = [
        "The quick brown fox jumps over the lazy dog",
        "Machine learning is a subset of artificial intelligence",
        "Natural language processing deals with text analysis",
        "Python is a popular programming language for data science"
    ]
    
    # Initialize TfidfVectorizer
    vectorizer = TfidfVectorizer(
        lowercase=True,          # Convert to lowercase
        stop_words='english',    # Remove English stop words
        max_features=1000,       # Limit vocabulary size
        ngram_range=(1, 2)       # Include unigrams and bigrams
    )
    
    # Fit and transform documents
    tfidf_matrix = vectorizer.fit_transform(documents)
    
    # Get feature names (vocabulary)
    feature_names = vectorizer.get_feature_names_out()
    
    # Convert to DataFrame for better visualization
    tfidf_df = pd.DataFrame(
        tfidf_matrix.toarray(),
        columns=feature_names,
        index=[f"Doc_{i+1}" for i in range(len(documents))]
    )
    
    return tfidf_df, vectorizer

# Run the example
tfidf_df, vectorizer = sklearn_tfidf_example()

# Display top TF-IDF scores for each document
print("Top 5 TF-IDF scores per document:")
for idx, row in tfidf_df.iterrows():
    top_scores = row.nlargest(5)
    print(f"\n{idx}:")
    for term, score in top_scores.items():
        if score > 0:
            print(f"  {term}: {score:.4f}")
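
Because the fitted vectorizer stores both the vocabulary and the IDF weights, it can also score documents it has never seen, which is exactly what keeps preprocessing consistent between training and inference. Here is a minimal sketch reusing the vectorizer returned above (the new sentence is invented for illustration):

# Score an unseen document with the already-fitted vectorizer;
# terms outside the fitted vocabulary are silently ignored
new_doc = ["Deep learning is a branch of machine learning"]
new_tfidf = vectorizer.transform(new_doc)

feature_names = vectorizer.get_feature_names_out()
row = new_tfidf.toarray()[0]
for idx in row.nonzero()[0]:
    print(f"{feature_names[idx]}: {row[idx]:.4f}")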

Advanced Scikit-learn Configuration

The TfidfVectorizer offers numerous parameters for customization:

Preprocessing Options:

  • lowercase: Convert text to lowercase
  • stop_words: Remove common words that don’t carry much meaning
  • token_pattern: Define custom tokenization patterns
  • preprocessor: Apply custom preprocessing functions

Vocabulary Control:

  • max_features: Limit vocabulary size to most frequent terms
  • min_df: Ignore terms appearing in fewer than specified documents
  • max_df: Ignore terms appearing in more than specified fraction of documents
  • vocabulary: Use predefined vocabulary

N-gram Configuration:

  • ngram_range: Include unigrams, bigrams, trigrams, etc.
  • analyzer: Choose between ‘word’, ‘char’, or custom analyzers
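
As an illustration, a vectorizer that drops very rare and very common terms while including longer n-grams might be configured as follows (the parameter values are arbitrary examples, not tuned recommendations):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    min_df=2,                              # ignore terms found in fewer than 2 documents
    max_df=0.9,                            # ignore terms found in more than 90% of documents
    ngram_range=(1, 3),                    # unigrams, bigrams, and trigrams
    token_pattern=r"(?u)\b[a-zA-Z]{2,}\b"  # keep only alphabetic tokens of 2+ characters
)

Keep in mind that on a very small corpus, aggressive min_df and max_df settings can prune the vocabulary down to nothing, in which case fit_transform raises a ValueError.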

Method 3: Custom Implementation with Advanced Features

Enhanced Manual Implementation

For specialized use cases, you might need a custom implementation that combines the transparency of manual coding with advanced features:

import re
from collections import defaultdict
import numpy as np
from math import log

class CustomTfIdf:
    def __init__(self, lowercase=True, remove_stopwords=True, min_df=1, max_df=1.0):
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.min_df = min_df
        self.max_df = max_df
        self.vocabulary = {}
        self.idf_values = {}
        
        # Basic English stop words
        self.stop_words = {
            'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 
            'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'being'
        }
    
    def preprocess_text(self, text):
        """Clean and preprocess text"""
        if self.lowercase:
            text = text.lower()
        
        # Remove punctuation and split
        words = re.findall(r'\b\w+\b', text)
        
        if self.remove_stopwords:
            words = [word for word in words if word not in self.stop_words]
        
        return words
    
    def fit(self, documents):
        """Fit the TF-IDF model to documents"""
        processed_docs = [self.preprocess_text(doc) for doc in documents]
        
        # Build vocabulary
        word_doc_count = defaultdict(int)
        for doc in processed_docs:
            unique_words = set(doc)
            for word in unique_words:
                word_doc_count[word] += 1
        
        # Filter vocabulary based on document frequency
        n_docs = len(documents)
        min_count = self.min_df if isinstance(self.min_df, int) else int(self.min_df * n_docs)
        max_count = self.max_df if isinstance(self.max_df, int) else int(self.max_df * n_docs)
        
        self.vocabulary = {
            word: idx for idx, (word, count) in enumerate(word_doc_count.items())
            if min_count <= count <= max_count
        }
        
        # Calculate IDF values
        for word, doc_count in word_doc_count.items():
            if word in self.vocabulary:
                self.idf_values[word] = log(n_docs / doc_count)
        
        return self
    
    def transform(self, documents):
        """Transform documents to TF-IDF vectors"""
        processed_docs = [self.preprocess_text(doc) for doc in documents]
        n_docs = len(documents)
        n_features = len(self.vocabulary)
        
        tfidf_matrix = np.zeros((n_docs, n_features))
        
        for doc_idx, doc in enumerate(processed_docs):
            word_counts = defaultdict(int)
            for word in doc:
                if word in self.vocabulary:
                    word_counts[word] += 1
            
            doc_length = len(doc)
            if doc_length > 0:
                for word, count in word_counts.items():
                    tf = count / doc_length
                    idf = self.idf_values[word]
                    tfidf = tf * idf
                    
                    word_idx = self.vocabulary[word]
                    tfidf_matrix[doc_idx, word_idx] = tfidf
        
        return tfidf_matrix
    
    def fit_transform(self, documents):
        """Fit model and transform documents in one step"""
        return self.fit(documents).transform(documents)
    
    def get_feature_names(self):
        """Get vocabulary terms"""
        return [word for word, _ in sorted(self.vocabulary.items(), key=lambda x: x[1])]

# Example usage
custom_tfidf = CustomTfIdf(min_df=1, max_df=0.8)
documents = [
    "Python programming is powerful and versatile",
    "Machine learning with Python is popular",
    "Data science uses Python for analysis",
    "Programming languages include Python and Java"
]

tfidf_matrix = custom_tfidf.fit_transform(documents)
feature_names = custom_tfidf.get_feature_names()

print("Custom TF-IDF Implementation Results:")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}: '{doc[:50]}...'")
    doc_scores = [(feature_names[j], tfidf_matrix[i, j]) 
                  for j in range(len(feature_names)) if tfidf_matrix[i, j] > 0]
    doc_scores.sort(key=lambda x: x[1], reverse=True)
    
    for word, score in doc_scores[:5]:
        print(f"  {word}: {score:.4f}")

Practical Applications and Use Cases

Document Similarity and Search

TF-IDF scores can be used to measure document similarity and build search systems:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def document_similarity_example():
    """Demonstrate document similarity using TF-IDF"""
    documents = [
        "Machine learning algorithms for data analysis",
        "Deep learning neural networks and AI",
        "Data science with Python programming",
        "Artificial intelligence and machine learning"
    ]
    
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    
    # Calculate cosine similarity
    similarity_matrix = cosine_similarity(tfidf_matrix)
    
    print("Document Similarity Matrix:")
    for i in range(len(documents)):
        for j in range(len(documents)):
            print(f"Doc{i+1} vs Doc{j+1}: {similarity_matrix[i,j]:.3f}")
        print()

document_similarity_example()
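
The same idea extends to a rudimentary search function: vectorize the query with the fitted vectorizer and rank documents by their cosine similarity to it. The sketch below is illustrative; the search function and the query string are invented for this example:

def search(query, documents):
    """Rank documents by TF-IDF cosine similarity to a query (illustrative sketch)"""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])

    # Similarity between the query and every document
    scores = cosine_similarity(query_vector, doc_matrix)[0]

    # Pair each document with its score, best match first
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)

docs = [
    "Machine learning algorithms for data analysis",
    "Deep learning neural networks and AI",
    "Data science with Python programming",
    "Artificial intelligence and machine learning"
]

for doc, score in search("python machine learning", docs):
    print(f"{score:.3f}  {doc}")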

Feature Selection for Classification

TF-IDF can help identify the most discriminative terms for text classification:

def feature_importance_analysis(documents, labels):
    """Analyze feature importance using TF-IDF"""
    vectorizer = TfidfVectorizer(max_features=100)
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    
    # Calculate average TF-IDF scores per class
    unique_labels = list(set(labels))
    class_profiles = {}
    
    for label in unique_labels:
        label_docs = [i for i, l in enumerate(labels) if l == label]
        class_tfidf = tfidf_matrix[label_docs].mean(axis=0).A1
        class_profiles[label] = dict(zip(feature_names, class_tfidf))
    
    return class_profiles
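
A quick usage sketch with an invented two-class toy dataset (the documents and labels below are made up purely for illustration):

docs = [
    "stock market prices rise after strong earnings report",
    "investors buy shares as profits and dividends grow",
    "team wins championship game in dramatic overtime",
    "striker scores the winning goal in the final match"
]
labels = ["finance", "finance", "sports", "sports"]

profiles = feature_importance_analysis(docs, labels)
for label, scores in profiles.items():
    top_terms = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:5]
    print(label, [term for term, _ in top_terms])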

Performance Optimization and Best Practices

Memory and Speed Considerations

When working with large corpora, consider these optimization strategies:

Sparse Matrix Usage: TF-IDF matrices are typically sparse, so keep them as scipy sparse matrices to save memory (see the short comparison below).

Batch Processing: For very large datasets, process documents in batches to manage memory usage.

Vocabulary Limiting: Use max_features, min_df, and max_df parameters to control vocabulary size.

Preprocessing Efficiency: Optimize text preprocessing steps for better performance.
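
As a concrete illustration of the sparse matrix point, the matrix returned by TfidfVectorizer is already a scipy CSR sparse matrix; avoiding an unnecessary .toarray() call is often the single biggest memory saving. A small sketch on a synthetic corpus (the documents are generated only for this comparison):

from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic corpus used only to compare memory footprints
corpus = [f"sample document number {i} about python data and text analysis"
          for i in range(2000)]

vectorizer = TfidfVectorizer()
sparse_matrix = vectorizer.fit_transform(corpus)   # scipy.sparse CSR matrix

# Bytes needed for the stored non-zero values versus a dense copy
sparse_bytes = (sparse_matrix.data.nbytes
                + sparse_matrix.indices.nbytes
                + sparse_matrix.indptr.nbytes)
dense_bytes = sparse_matrix.toarray().nbytes

print(f"Sparse representation: {sparse_bytes / 1e6:.2f} MB")
print(f"Dense representation:  {dense_bytes / 1e6:.2f} MB")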

Common Pitfalls and Solutions

Zero Division Issues: Handle empty documents and ensure proper normalization (a defensive example follows below).

Memory Limitations: Monitor memory usage when processing large document collections.

Vocabulary Explosion: Control vocabulary size to prevent excessive memory usage and computational complexity.

Inconsistent Preprocessing: Ensure consistent preprocessing between training and inference.
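
For the zero-division pitfall, a defensive variant of the term-frequency helper from Method 1 (a minimal sketch) simply returns an empty dictionary for empty or whitespace-only documents instead of dividing by zero:

from collections import Counter

def calculate_tf_safe(text):
    """Term frequency that tolerates empty or whitespace-only documents"""
    words = text.lower().split()
    if not words:  # empty document: no terms, no scores
        return {}
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}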

Conclusion

Understanding how to calculate TF-IDF scores in Python opens up numerous possibilities for text analysis and natural language processing tasks. Whether you choose manual implementation for educational purposes, scikit-learn for production efficiency, or custom solutions for specialized requirements, TF-IDF remains a powerful and versatile technique.

The key to successful TF-IDF implementation lies in understanding the mathematical foundations, choosing appropriate preprocessing steps, and selecting the right approach for your specific use case. By mastering these concepts and techniques, you’ll be well-equipped to tackle a wide range of text analysis challenges and build sophisticated NLP applications.

Remember that TF-IDF is just one tool in the NLP toolkit. As you advance in your text analysis journey, consider exploring more advanced techniques like word embeddings, transformer models, and deep learning approaches that can complement or enhance TF-IDF-based solutions.
