API

bib_dedupe.bib_dedupe.prep(records_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) -> DataFrame

Preprocesses records for deduplication.

Args:

records_df (pd.DataFrame): The dataframe containing the records to be preprocessed.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.

Returns:

pd.DataFrame: The preprocessed records dataframe.

bib_dedupe.bib_dedupe.block(records_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) -> DataFrame

Blocks pairs of records for deduplication.

Args:

records_df (pd.DataFrame): The dataframe containing the records to be deduplicated.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.

Returns:

pd.DataFrame: The dataframe containing the blocked pairs for deduplication.
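As a general illustration of what blocking means, the sketch below pairs only those records that share a key, so that later comparison does not have to consider every possible pair. It is not bib_dedupe's blocking logic, which this page does not describe: `toy_block`, the `ID` field, and the choice of `year` as the key are all hypothetical, and plain dictionaries stand in for dataframe rows.

```python
from collections import defaultdict
from itertools import combinations


def toy_block(records, key):
    """Return candidate ID pairs for records that share the same blocking key."""
    buckets = defaultdict(list)
    for record in records:
        # Group record IDs by the value of the (hypothetical) blocking key.
        buckets[key(record)].append(record["ID"])

    pairs = []
    for ids in buckets.values():
        # Only records inside the same bucket become candidate pairs.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"ID": "r1", "title": "Deep learning", "year": "2015"},
    {"ID": "r2", "title": "Deep Learning.", "year": "2015"},
    {"ID": "r3", "title": "Graph theory", "year": "1999"},
]
candidate_pairs = toy_block(records, key=lambda r: r["year"])
```

Here only `r1` and `r2` share a year, so they form the single candidate pair, and `r3` is never compared with either.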

bib_dedupe.bib_dedupe.match(pairs_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) -> DataFrame

Identifies true/maybe matches from the given pairs based on similarity scores.

Args:

pairs_df (pd.DataFrame): The DataFrame containing the pairs to be compared.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.

Returns:

pd.DataFrame: The DataFrame containing the true/maybe matches.
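The idea of sorting pairs into true and maybe matches by similarity score can be sketched with a single string comparison. This is an assumption-laden toy: the similarity measures and thresholds bib_dedupe actually uses are not given on this page, so `toy_classify`, the use of `difflib.SequenceMatcher`, and the 0.9 and 0.7 cut-offs are stand-ins chosen only for illustration.

```python
from difflib import SequenceMatcher


def toy_classify(title_a, title_b, true_threshold=0.9, maybe_threshold=0.7):
    """Label a pair 'true', 'maybe', or 'no' from one title-similarity score."""
    # Case-insensitive similarity ratio in [0, 1]; hypothetical stand-in metric.
    score = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    if score >= true_threshold:
        return "true"
    if score >= maybe_threshold:
        return "maybe"
    return "no"


labels = [
    toy_classify("Deep Learning", "deep learning"),  # identical after lowercasing
    toy_classify("abcdefghij", "abcdefghxy"),        # 8 of 10 characters align
    toy_classify("abc", "xyz"),                      # nothing in common
]
```

The middle band is what makes a manual review step useful: pairs that are neither clearly duplicates nor clearly distinct are set aside rather than decided automatically.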

bib_dedupe.bib_dedupe.export_maybe(records_df: DataFrame, *, matched_df: DataFrame, verbosity_level: int | None = None) -> None

Exports ‘maybe’ cases for manual review during deduplication.

Args:

records_df (pd.DataFrame): The dataframe containing the records.
matched_df (pd.DataFrame): The dataframe containing the matched pairs.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.

bib_dedupe.bib_dedupe.import_maybe(matched_df: DataFrame, *, verbosity_level: int | None = None) -> DataFrame

Imports decisions for ‘maybe’ cases after manual review.

Args:

matched_df (pd.DataFrame): The dataframe containing the matches.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.

Returns:

pd.DataFrame: The dataframe containing the updated matches.
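A minimal sketch of how export_maybe and import_maybe might be combined around a manual review step, using only the keyword arguments shown in the signatures above. The helper `review_maybe_cases` and its `wait_for_review` hook are hypothetical, and this page does not say where the exported cases are written. The `api` argument is expected to be the `bib_dedupe.bib_dedupe` module; a stub replaces it here so the sketch runs without the library installed.

```python
from types import SimpleNamespace


def review_maybe_cases(records_df, matched_df, api, wait_for_review):
    """Export 'maybe' cases, pause for manual review, then import the decisions."""
    api.export_maybe(records_df, matched_df=matched_df)
    # Hypothetical hook: block until a person has finished reviewing the cases.
    wait_for_review()
    return api.import_maybe(matched_df)


# Stand-in for `bib_dedupe.bib_dedupe` that only records the order of calls.
events = []
stub_api = SimpleNamespace(
    export_maybe=lambda records_df, *, matched_df: events.append("export"),
    import_maybe=lambda matched_df: events.append("import") or matched_df,
)
updated = review_maybe_cases(
    ["record"], ["pair"], stub_api, lambda: events.append("review")
)
```

With the library installed, the same function could presumably be called with `import bib_dedupe.bib_dedupe as api` in place of the stub.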

bib_dedupe.bib_dedupe.merge(records_df: DataFrame, *, matched_df: DataFrame | None = None, duplicate_id_sets: list | None = None, verbosity_level: int | None = None) -> DataFrame

Merges duplicate records in the given dataframe.

Args:

records_df (pd.DataFrame): The DataFrame containing the records to be merged.
matched_df (pd.DataFrame, optional): The DataFrame containing the matched pairs. Defaults to None.
duplicate_id_sets (list, optional): List of sets containing duplicate record IDs. If None, the function performs the deduplication process to identify duplicates. Defaults to None.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.

Returns:

pd.DataFrame: The merged DataFrame.
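Read together, the signatures above suggest a pipeline in which each function consumes the output of the previous one. The sketch below chains them in that order. The order itself, and the choice to pass the preprocessed records to merge, are inferred from the parameter names rather than stated on this page. `run_pipeline` is a hypothetical helper, and the stub namespace exists only so the example runs without the library installed.

```python
from types import SimpleNamespace


def run_pipeline(records_df, api, *, cpu=-1):
    """Chain prep -> block -> match -> merge using the documented keywords."""
    prepared_df = api.prep(records_df, cpu=cpu)
    pairs_df = api.block(prepared_df, cpu=cpu)
    matched_df = api.match(pairs_df, cpu=cpu)
    # Assumption: merge receives the preprocessed records plus the matches.
    return api.merge(prepared_df, matched_df=matched_df)


# Stand-in for `bib_dedupe.bib_dedupe`: each stub logs its call and echoes its input.
calls = []


def _stub(name):
    def fn(df, **kwargs):
        calls.append((name, sorted(kwargs)))
        return df
    return fn


stub_api = SimpleNamespace(
    prep=_stub("prep"),
    block=_stub("block"),
    match=_stub("match"),
    merge=_stub("merge"),
)
result = run_pipeline(["record"], stub_api)
```

Passing the module as `api` keeps the sketch independent of how the library is imported. According to the description of `duplicate_id_sets` above, merge can also identify duplicates itself when that argument is None.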