API

bib_dedupe.bib_dedupe.prep(records_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) → DataFrame

Preprocesses records for deduplication.

Args:
    records_df (pd.DataFrame): The dataframe containing the records to be preprocessed.
    verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
    cpu (int, optional): Presumably the number of CPU cores to use for parallel processing. Defaults to -1.
Returns:
    pd.DataFrame: The preprocessed records dataframe.

bib_dedupe.bib_dedupe.block(records_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) → DataFrame

Blocks pairs of records for deduplication.

Args:
    records_df (pd.DataFrame): The dataframe containing the records to be deduplicated.
    verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
    cpu (int, optional): Presumably the number of CPU cores to use for parallel processing. Defaults to -1.
Returns:
    pd.DataFrame: The dataframe containing the blocked pairs for deduplication.

bib_dedupe.bib_dedupe.match(pairs_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) → DataFrame

Identifies true/maybe matches from the given pairs based on similarity scores.

Args:
    pairs_df (pd.DataFrame): The dataframe containing the pairs to be compared.
    verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
    cpu (int, optional): Presumably the number of CPU cores to use for parallel processing. Defaults to -1.
Returns:
    pd.DataFrame: The dataframe containing the true/maybe matches.
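The parameter names suggest that these three functions chain: the output of prep feeds block, and the pairs returned by block feed match. The following is a minimal sketch under that assumption, not a definitive recipe. It also assumes that the bib_dedupe package is installed and that records_df already has whatever columns the library expects (they are not listed on this page). find_matches is a hypothetical helper name, and the import is deferred so that the definition loads even where the package is absent.

```python
def find_matches(records_df, *, cpu=-1, verbosity_level=None):
    """Run prep -> block -> match; return (prepared records, matches)."""
    # Deferred import: bib_dedupe is a third-party package that must be installed.
    from bib_dedupe.bib_dedupe import block, match, prep

    # Step 1: preprocess the raw records.
    prepared_df = prep(records_df, verbosity_level=verbosity_level, cpu=cpu)

    # Step 2 (assumed chaining): block the preprocessed records into candidate pairs.
    pairs_df = block(prepared_df, verbosity_level=verbosity_level, cpu=cpu)

    # Step 3: classify the candidate pairs as true/maybe matches.
    matched_df = match(pairs_df, verbosity_level=verbosity_level, cpu=cpu)

    return prepared_df, matched_df
```

The helper returns the prepared records alongside the matches because the later steps (export_maybe and merge) take both a records dataframe and a matches dataframe.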

bib_dedupe.bib_dedupe.export_maybe(records_df: DataFrame, *, matched_df: DataFrame, verbosity_level: int | None = None) → None

Exports ‘maybe’ cases for manual review during deduplication.

Args:
    records_df (pd.DataFrame): The dataframe containing the records.
    matched_df (pd.DataFrame): The dataframe containing the matched pairs.
    verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
Returns:
    None

bib_dedupe.bib_dedupe.import_maybe(matched_df: DataFrame, *, verbosity_level: int | None = None) → DataFrame

Imports decisions for ‘maybe’ cases after manual review.

Args:
    matched_df (pd.DataFrame): The dataframe containing the matches.
    verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
Returns:
    pd.DataFrame: The dataframe containing the updated matches.
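export_maybe and import_maybe read as two halves of a manual-review round trip: export the uncertain pairs, let a person decide, then import the decisions. The sketch below illustrates that sequence using only the signatures documented here. Where export_maybe writes its output and how decisions are recorded are not stated on this page, so the sketch leaves that step to a caller-supplied callback. review_maybe_cases and wait_for_review are hypothetical names introduced for illustration.

```python
def review_maybe_cases(records_df, matched_df, *, wait_for_review=None,
                       verbosity_level=None):
    """Export 'maybe' cases, pause for manual review, then import the decisions."""
    # Deferred import: bib_dedupe is a third-party package that must be installed.
    from bib_dedupe.bib_dedupe import export_maybe, import_maybe

    # Export the 'maybe' pairs for manual review (the call returns None).
    export_maybe(records_df, matched_df=matched_df, verbosity_level=verbosity_level)

    # Hypothetical hook: block here until the reviewer has recorded decisions.
    if wait_for_review is not None:
        wait_for_review()

    # Import the reviewed decisions and return the updated matches.
    return import_maybe(matched_df, verbosity_level=verbosity_level)
```

In an interactive session, wait_for_review could be as simple as a function that prompts the user to press Enter once the review is finished.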

bib_dedupe.bib_dedupe.merge(records_df: DataFrame, *, matched_df: DataFrame | None = None, duplicate_id_sets: list | None = None, verbosity_level: int | None = None) → DataFrame

Merges duplicate records in the given dataframe.

Args:
    records_df (pd.DataFrame): The dataframe containing the records to be merged.
    matched_df (pd.DataFrame, optional): The dataframe containing the matched pairs. Defaults to None.
    duplicate_id_sets (list, optional): List of sets containing duplicate record IDs. If None, the function runs the deduplication process itself to identify duplicates. Defaults to None.
    verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
Returns:
    pd.DataFrame: The merged dataframe.
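The signature allows merge to be called in three ways: with explicit duplicate_id_sets, with a matched_df, or with records only (in which case, per the description above, it identifies duplicates itself). The sketch below shows the three call shapes side by side. merge_records is a hypothetical wrapper, the precedence it gives to duplicate_id_sets over matched_df is an illustrative choice rather than documented library behavior, and how merge treats matched_df internally is not stated on this page.

```python
def merge_records(records_df, *, matched_df=None, duplicate_id_sets=None):
    """Call merge with whichever duplicate information is available."""
    # Deferred import: bib_dedupe is a third-party package that must be installed.
    from bib_dedupe.bib_dedupe import merge

    if duplicate_id_sets is not None:
        # Duplicates already known, e.g. a list of sets of record IDs.
        return merge(records_df, duplicate_id_sets=duplicate_id_sets)

    if matched_df is not None:
        # Matches already computed (for example by match or import_maybe).
        return merge(records_df, matched_df=matched_df)

    # Nothing supplied: merge identifies the duplicates on its own.
    return merge(records_df)
```

A call such as merge_records(records_df, duplicate_id_sets=[{"id_1", "id_2"}]) would ask for the two records with those (made-up) IDs to be merged into one.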