Usage¶
It is possible to complete and customize each step individually:
import pandas as pd
from bib_dedupe.bib_dedupe import prep, block, match, merge, export_maybe, import_maybe
# Load your bibliographic dataset into a pandas DataFrame
records_df = pd.read_csv("records.csv")
# Preproces records
records_df = prep(records_df)
# Block records
blocked_df = block(records_df)
# Identify matches
matched_df = match(blocked_df)
# Check maybe cases
export_maybe(records_df, matched_df=matched_df)
matched_df = import_maybe(matched_df)
# Merge
merged_df = merge(records_df, matched_df=matched_df)
Fields used by BibDeduper
Name |
Definition |
---|---|
ID |
A unique ID |
ENTRYTYPE |
The type of publication (e.g., article, book, inproceedings) |
author |
The author(s) of the publication |
title |
The title of the publication |
year |
The year of publication |
journal |
The name of the journal in which the publication appeared |
volume |
The volume number of the publication |
number |
The issue number of the publication |
pages |
The page numbers of the publication |
doi |
The Digital Object Identifier (DOI) |
abstract |
The abstract |
search_set |
Distinct sets of papers (e.g., old_search), can be empty. * |
* The merge() function ensures that records from the same search_set are not merged. The match() function ensures that individual pairs (e.g., A-B, B-C) do not come from the same search_set. match() does not consider transitive relations (i.e., A-C could be from the same search_set). The cluster() and get_connected_components() functions (part of merge()) ensure that records are not merged if the component already contains a record from the same search_set.
Search updates¶
When updating a literature search, the old_search can be assumed to have no duplicates. To exclude a set of papers from deduplication, it is possible to pass a corresponding label to the search_set column.
Example data¶
Data from the example datasets can be loaded as follows:
from bib_dedupe.bib_dedupe import merge
from bib_dedupe.bib_dedupe import load_example_data
# Load example dataset
records_df = load_example_data("stroke")
# Get the merged_df
merged_df = merge(records_df)
# Save as csv
merged_df.to_csv("merged.csv", index=False)
Import file formats¶
BibDedupe can process any bibliographic data set once it is in a pandas DataFrame, and contains the columns listed above. Import functions are available as part of the CoLRev project.