Usage
====================================

It is possible to complete and customize each step individually:

.. code-block:: python

    import pandas as pd

    from bib_dedupe.bib_dedupe import prep, block, match, merge, export_maybe, import_maybe

    # Load your bibliographic dataset into a pandas DataFrame
    records_df = pd.read_csv("records.csv")

    # Preprocess records
    records_df = prep(records_df)

    # Block records
    blocked_df = block(records_df)

    # Identify matches
    matched_df = match(blocked_df)

    # Check maybe cases
    export_maybe(records_df, matched_df=matched_df)
    matched_df = import_maybe(matched_df)

    # Merge
    merged_df = merge(records_df, matched_df=matched_df)

Fields used by BibDedupe
-------------------------

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - **Name**
     - **Definition**
   * - **ID**
     - A unique ID
   * - **ENTRYTYPE**
     - The type of publication (e.g., article, book, inproceedings)
   * - **author**
     - The author(s) of the publication
   * - **title**
     - The title of the publication
   * - **year**
     - The year of publication
   * - **journal**
     - The name of the journal in which the publication appeared
   * - **volume**
     - The volume number of the publication
   * - **number**
     - The issue number of the publication
   * - **pages**
     - The page numbers of the publication
   * - **doi**
     - The Digital Object Identifier (DOI)
   * - **abstract**
     - The abstract
   * - **search_set**
     - Distinct sets of papers (e.g., old_search), can be empty. \*

\* The `merge()` function ensures that records from the same `search_set` are not merged.
The `match()` function ensures that individual pairs (e.g., A-B, B-C) do not come from the same `search_set`; it does not consider transitive relations (i.e., A and C could still come from the same `search_set`).
The `cluster()` and `get_connected_components()` functions (part of `merge()`) ensure that records are not merged if the component already contains a record from the same `search_set`.

Search updates
-----------------------

When updating a literature search, the previous search (`old_search`) can be assumed to contain no duplicates. To exclude such a set of papers from deduplication, it is possible to pass a corresponding label to the `search_set` column (a short sketch is provided at the end of this page).

Example data
-----------------------

Data from the `example datasets`_ can be loaded as follows:

.. code-block:: python

    from bib_dedupe.bib_dedupe import merge
    from bib_dedupe.bib_dedupe import load_example_data

    # Load example dataset
    records_df = load_example_data("stroke")

    # Get the merged_df
    merged_df = merge(records_df)

    # Save as csv
    merged_df.to_csv("merged.csv", index=False)

Import file formats
-----------------------

BibDedupe can process any bibliographic dataset once it is in a pandas DataFrame and contains the columns listed above. Import functions are available as part of the `CoLRev project <https://github.com/CoLRev-Environment/colrev>`_. Given that each database follows its own schema with slightly different column names, import functionality must be customized to the specific database.

.. code-block:: python

    from pathlib import Path

    import colrev.loader.load_utils

    # import bib, ris, csv, xlsx, json, ...
    records_df = colrev.loader.load_utils.load_df(filename=Path("records.bib"))

.. _example datasets: https://github.com/CoLRev-Environment/bib-dedupe/tree/main/data
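
For illustration, a minimal DataFrame with the columns listed in the table above can also be constructed directly. The following is only a sketch with made-up record values (two variants of the same hypothetical publication):

.. code-block:: python

    import pandas as pd

    from bib_dedupe.bib_dedupe import merge

    # Illustrative records: two slightly different variants of the same (hypothetical) paper
    records_df = pd.DataFrame(
        [
            {
                "ID": "r1",
                "ENTRYTYPE": "article",
                "author": "Smith, J.",
                "title": "A study on X",
                "year": "2020",
                "journal": "Journal of X",
                "volume": "12",
                "number": "3",
                "pages": "45-67",
                "doi": "10.1000/xyz123",
                "abstract": "",
                "search_set": "",
            },
            {
                "ID": "r2",
                "ENTRYTYPE": "article",
                "author": "Smith, J",
                "title": "A Study on X",
                "year": "2020",
                "journal": "J. of X",
                "volume": "12",
                "number": "3",
                "pages": "45-67",
                "doi": "10.1000/xyz123",
                "abstract": "",
                "search_set": "",
            },
        ]
    )

    # Deduplicate and merge the records
    merged_df = merge(records_df)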
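
Finally, the search-update workflow described above can be sketched as follows. The file names and labelling scheme are assumptions for this example: only the previous, already deduplicated search is labelled via `search_set`, so that its records are not merged with each other, while the new records are left with an empty `search_set` (the field table above notes that the column can be empty):

.. code-block:: python

    import pandas as pd

    from bib_dedupe.bib_dedupe import merge

    # Assumed file names: the old search is already deduplicated,
    # the new search may contain duplicates
    old_records_df = pd.read_csv("old_search.csv")
    new_records_df = pd.read_csv("new_search.csv")

    # Label the old search so that its records are not merged with each other;
    # leave search_set empty for the new records
    old_records_df["search_set"] = "old_search"
    new_records_df["search_set"] = ""

    # Deduplicate the combined set of records
    records_df = pd.concat([old_records_df, new_records_df], ignore_index=True)
    merged_df = merge(records_df)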