BibDedupe Documentation

BibDedupe is an open-source Python library for deduplication of bibliographic records, tailored for literature reviews. Unlike traditional deduplication methods, BibDedupe focuses on entity resolution, linking duplicate records instead of simply deleting them.

Features

  • Automated Duplicate Linking with Zero False Positives: BibDedupe automates the duplicate linking process with a focus on eliminating false positives.

  • Preprocessing Approach: BibDedupe uses a preprocessing approach that reflects the unique error generation process in academic databases, such as author re-formatting, journal abbreviation or translations.

  • Entity Resolution: BibDedupe does not simply delete duplicates, but it links duplicates to resolve the entitity and integrates the data. This allows for validation, and undo operations.

  • Programmatic Access: BibDedupe is designed for seamless integration into existing research workflows, providing programmatic access for easy incorporation into scripts and applications.

  • Transparent and Reproducible Rules: BibDedupe’s blocking and matching rules are transparent and easily reproducible to promote reproducibility in deduplication processes.

  • Continuous Benchmarking: Continuous integration tests running on GitHub Actions ensure ongoing benchmarking, maintaining the library’s reliability and performance across datasets.

  • Efficient and Parallel Computation: BibDedupe implements computations efficiently and in parallel, using appropriate data structures and functions for optimal performance.

Installation

BibDedupe is available on PyPI, and can be installed via pip (see installation instructions):

pip install bib-dedupe

Getting Started

The BibDedupe library can be used in different ways, including Python scripts, Jupyter notebooks, or in other Python packages. In a basic setup, the library can be used as follows:

import pandas as pd
from bib_dedupe.bib_dedupe import merge

# Load your bibliographic dataset into a pandas DataFrame
records_df = pd.read_csv("records.csv")

# Get the merged_df
merged_df = merge(records_df)

For advanced use cases, it is also possible to complete and customize each step individually

from bib_dedupe.bib_dedupe import prep, block, match, merge, export_maybe, import_maybe

records_df = prep(records_df)

# Block records
blocked_df = block(records_df)

# Identify matches
matched_df = match(blocked_df)

# Check maybe cases
export_maybe(matched_df, records_df)
matches = import_maybe(matched_df)

# Merge
merged_df = merge(records_df, matches=matches)