Usage

It is possible to complete and customize each step individually:

import pandas as pd
from bib_dedupe.bib_dedupe import prep, block, match, merge, export_maybe, import_maybe

# Load your bibliographic dataset into a pandas DataFrame
records_df = pd.read_csv("records.csv")

# Preprocess records
records_df = prep(records_df)

# Block records
blocked_df = block(records_df)

# Identify matches
matched_df = match(blocked_df)

# Check maybe cases
export_maybe(records_df, matched_df=matched_df)
matched_df = import_maybe(matched_df)

# Merge
merged_df = merge(records_df, matched_df=matched_df)

Fields used by BibDedupe

Name	Definition
ID	A unique ID
ENTRYTYPE	The type of publication (e.g., article, book, inproceedings)
author	The author(s) of the publication
title	The title of the publication
year	The year of publication
journal	The name of the journal in which the publication appeared
volume	The volume number of the publication
number	The issue number of the publication
pages	The page numbers of the publication
doi	The Digital Object Identifier (DOI)
abstract	The abstract
search_set	Distinct sets of papers (e.g., old_search), can be empty. *

* The merge() function ensures that records from the same search_set are not merged. The match() function ensures that individual pairs (e.g., A-B, B-C) do not come from the same search_set. match() does not consider transitive relations (i.e., A-C could be from the same search_set). The cluster() and get_connected_components() functions (part of merge()) ensure that records are not merged if the component already contains a record from the same search_set.
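The difference between the pairwise check and the component-level check can be illustrated with a small, self-contained sketch. It is not the library's implementation: the toy data, the union-find helper, and the function names `connected_components` and `has_search_set_conflict` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: record ID -> search_set label.
search_set = {"A": "old_search", "B": "new_search", "C": "old_search"}

# Matched pairs; each pair on its own links records from different search sets.
pairs = [("A", "B"), ("B", "C")]


def connected_components(pairs):
    """Group record IDs that are linked directly or transitively (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def has_search_set_conflict(component, search_set):
    """True if two records in the component share a non-empty search_set label."""
    labels = [search_set[r] for r in component if search_set[r]]
    return len(labels) != len(set(labels))


# Every individual pair passes the pairwise check ...
pairwise_ok = all(search_set[a] != search_set[b] for a, b in pairs)

# ... but the component {A, B, C} contains A and C, both from old_search.
components = connected_components(pairs)
conflicts = [c for c in components if has_search_set_conflict(c, search_set)]
```

Here `pairwise_ok` is true while `conflicts` still contains the component, which is why a check at the component level is needed in addition to the pairwise one.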

Search updates

When updating a literature search, the old_search can be assumed to contain no duplicates. To prevent the records in such a set from being merged with each other, assign them a shared label (e.g., old_search) in the search_set column.
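A minimal sketch of labeling records before deduplication, using only pandas. The toy records and the names `old_df` and `new_df` are hypothetical, and the sketch assumes that an empty string counts as an "empty" search_set value.

```python
import pandas as pd

# Hypothetical toy data: two records from the previous search, two from the update.
old_df = pd.DataFrame(
    {"ID": ["old_1", "old_2"], "title": ["Stroke rehabilitation", "Aphasia therapy"]}
)
new_df = pd.DataFrame(
    {"ID": ["new_1", "new_2"], "title": ["Stroke rehabilitation", "Gait training"]}
)

# Label the records that were already deduplicated; leave the new ones unlabeled.
# Assumption: an empty string is treated as "no search set".
old_df["search_set"] = "old_search"
new_df["search_set"] = ""

# Combine both sets into the single DataFrame that is passed to the pipeline.
records_df = pd.concat([old_df, new_df], ignore_index=True)
```

The combined `records_df` can then go through the prep, block, match, and merge steps shown in the usage example.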

Example data

Data from the example datasets can be loaded as follows:

from bib_dedupe.bib_dedupe import merge
from bib_dedupe.bib_dedupe import load_example_data

# Load example dataset
records_df = load_example_data("stroke")

# Get the merged_df
merged_df = merge(records_df)

# Save as csv
merged_df.to_csv("merged.csv", index=False)

Import file formats

BibDedupe can process any bibliographic dataset once it is loaded into a pandas DataFrame that contains the columns listed above. Import functions are available as part of the CoLRev project.
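Before passing a DataFrame to the pipeline, it can help to check which of the listed columns are present. The sketch below is a hypothetical pre-flight helper, not part of BibDedupe; the names `EXPECTED_COLUMNS` and `missing_columns` are made up for illustration, and the text above does not state whether every column is strictly required.

```python
# Column names taken from the field table in this document.
EXPECTED_COLUMNS = [
    "ID", "ENTRYTYPE", "author", "title", "year", "journal",
    "volume", "number", "pages", "doi", "abstract", "search_set",
]


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# With a DataFrame, pass records_df.columns instead of a plain list.
example_columns = ["ID", "ENTRYTYPE", "author", "title", "year", "doi"]
print(missing_columns(example_columns))
```

Reporting the missing names rather than raising an error leaves the decision of how to handle them (add empty columns, or fix the import) to the caller.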