Usage¶
It is possible to complete and customize each step individually:
import pandas as pd
from bib_dedupe.bib_dedupe import prep, block, match, merge, export_maybe, import_maybe
# Load your bibliographic dataset into a pandas DataFrame
records_df = pd.read_csv("records.csv")
# Preproces records
records_df = prep(records_df)
# Block records
blocked_df = block(records_df)
# Identify matches
matched_df = match(blocked_df)
# Check maybe cases
export_maybe(records_df, matched_df=matched_df)
matched_df = import_maybe(matched_df)
# Merge
merged_df = merge(records_df, matched_df=matched_df)
Fields used by BibDeduper
Name |
Definition |
---|---|
ID |
A unique ID |
ENTRYTYPE |
The type of publication (e.g., article, book, inproceedings) |
author |
The author(s) of the publication |
title |
The title of the publication |
year |
The year of publication |
journal |
The name of the journal in which the publication appeared |
volume |
The volume number of the publication |
number |
The issue number of the publication |
pages |
The page numbers of the publication |
doi |
The Digital Object Identifier (DOI) |
abstract |
The abstract |
search_set |
Distinct sets of papers (e.g., old_search), can be empty. * |
* The merge() function ensures that records from the same search_set are not merged. The match() function ensures that individual pairs (e.g., A-B, B-C) do not come from the same search_set. match() does not consider transitive relations (i.e., A-C could be from the same search_set). The cluster() and get_connected_components() functions (part of merge()) ensure that records are not merged if the component already contains a record from the same search_set.
Search updates¶
When updating a literature search, the old_search can be assumed to have no duplicates. To exclude a set of papers from deduplication, it is possible to pass a corresponding label to the search_set column.
Example data¶
Data from the example datasets can be loaded as follows:
from bib_dedupe.bib_dedupe import merge
from bib_dedupe.bib_dedupe import load_example_data
# Load example dataset
records_df = load_example_data("stroke")
# Get the merged_df
merged_df = merge(records_df)
# Save as csv
merged_df.to_csv("merged.csv", index=False)
Import file formats¶
BibDedupe can process any bibliographic data set once it is in a pandas DataFrame, and contains the columns listed above. Given that each database follows its own schema with slightly different column names, import functionality must be customized to the specific database. We are working on corresponding import functions as part of the CoLRev project. Once the import functions are available, they will be described here (see this issue for more information).