API
- bib_dedupe.bib_dedupe.prep(records_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) -> DataFrame
Preprocesses records for deduplication.
- Args:
records_df (pd.DataFrame): The dataframe containing the records to be preprocessed.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
- Returns:
pd.DataFrame: The preprocessed records dataframe.
- bib_dedupe.bib_dedupe.block(records_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) -> DataFrame
Blocks pairs of records for deduplication.
- Args:
records_df (pd.DataFrame): The dataframe containing the records to be deduplicated.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
- Returns:
pd.DataFrame: The dataframe containing the blocked pairs for deduplication.
- bib_dedupe.bib_dedupe.match(pairs_df: DataFrame, *, verbosity_level: int | None = None, cpu: int = -1) -> DataFrame
Identifies true/maybe matches from the given pairs based on similarity scores.
- Args:
pairs_df (pd.DataFrame): The DataFrame containing the pairs to be compared.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
debug (bool, optional): If True, enables debug mode. Defaults to False.
- Returns:
pd.DataFrame: The DataFrame containing the true/maybe matches.
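The three functions above can plausibly be chained, as in the minimal sketch below. It assumes that the output of `prep` is a valid input for `block`, and that the pairs returned by `block` are what `match` expects; the signatures suggest this but do not state it. The wrapper name `find_matches` is hypothetical and not part of the package.

```python
def find_matches(records_df, *, verbosity_level=None, cpu=-1):
    """Run prep -> block -> match and return the true/maybe matches.

    Hypothetical helper; assumes each step accepts the previous step's output.
    """
    # Imported inside the function so the sketch can be defined
    # even where bib_dedupe is not installed.
    from bib_dedupe.bib_dedupe import block, match, prep

    # Preprocess the raw records.
    prepared_df = prep(records_df, verbosity_level=verbosity_level, cpu=cpu)
    # Build candidate pairs from the preprocessed records.
    pairs_df = block(prepared_df, verbosity_level=verbosity_level, cpu=cpu)
    # Classify the candidate pairs as true/maybe matches.
    return match(pairs_df, verbosity_level=verbosity_level, cpu=cpu)
```

With the package installed and a records dataframe at hand, `matched_df = find_matches(records_df)` would yield the dataframe that the functions below take as `matched_df`.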
- bib_dedupe.bib_dedupe.export_maybe(records_df: DataFrame, *, matched_df: DataFrame, verbosity_level: int | None = None) -> None
Exports ‘maybe’ cases for manual review during deduplication.
- Args:
records_df (pd.DataFrame): The dataframe containing the records.
matched_df (pd.DataFrame): The dataframe containing the matched pairs.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
- bib_dedupe.bib_dedupe.import_maybe(matched_df: DataFrame, *, verbosity_level: int | None = None) -> DataFrame
Imports decisions for ‘maybe’ cases after manual review.
- Args:
matched_df (pd.DataFrame): The dataframe containing the matches.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
- Returns:
pd.DataFrame: The dataframe containing the updated matches.
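The export and import steps form a manual review round trip, which might be wired together as sketched below. The helper name `review_maybe_cases` and its `wait_for_review` hook are hypothetical. Where `export_maybe` writes the cases and how decisions are recorded are not described here, so the sketch only pauses between the two calls.

```python
def review_maybe_cases(records_df, matched_df, *, wait_for_review=input):
    """Export 'maybe' cases, pause for manual review, then import the decisions.

    Hypothetical helper; wait_for_review is any callable taking a prompt string.
    """
    # Imported inside the function so the sketch can be defined
    # even where bib_dedupe is not installed.
    from bib_dedupe.bib_dedupe import export_maybe, import_maybe

    # Write out the 'maybe' cases for a human reviewer (returns None).
    export_maybe(records_df, matched_df=matched_df)
    # Block until the reviewer signals that the decisions are recorded.
    wait_for_review("Review the exported 'maybe' cases, then press Enter: ")
    # Read the decisions back and return the updated matches.
    return import_maybe(matched_df)
```

Passing a different `wait_for_review` callable makes the pause non-interactive, which may be convenient in scripts or notebooks.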
- bib_dedupe.bib_dedupe.merge(records_df: DataFrame, *, matched_df: DataFrame | None = None, duplicate_id_sets: list | None = None, verbosity_level: int | None = None) -> DataFrame
Merges duplicate records in the given dataframe.
- Args:
records_df (pd.DataFrame): The DataFrame containing the records to be merged.
matched_df (pd.DataFrame, optional): The DataFrame containing the matched pairs. Defaults to None.
duplicate_id_sets (list, optional): List of sets containing duplicate record IDs. If None, the function performs the deduplication process to identify duplicates. Defaults to None.
verbosity_level (int, optional): Level of verbosity for logging. Defaults to None.
- Returns:
pd.DataFrame: The merged DataFrame.
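When duplicates are already known as pairs of record IDs, they can be grouped into the list-of-sets shape that `duplicate_id_sets` describes. The sketch below uses a small union-find. The helpers `group_duplicate_ids` and `merge_known_duplicates` are hypothetical, and the IDs are assumed to be whatever identifiers the records dataframe uses.

```python
def group_duplicate_ids(id_pairs):
    """Group pairwise duplicate decisions into sets of connected record IDs."""
    parent = {}

    def find(record_id):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(record_id, record_id)
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]  # path halving
            record_id = parent[record_id]
        return record_id

    for left, right in id_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # union the two groups

    groups = {}
    for record_id in parent:
        groups.setdefault(find(record_id), set()).add(record_id)
    return list(groups.values())


def merge_known_duplicates(records_df, id_pairs):
    """Merge records using precomputed duplicate groups (hypothetical helper)."""
    # Imported inside the function so the sketch can be defined
    # even where bib_dedupe is not installed.
    from bib_dedupe.bib_dedupe import merge

    return merge(records_df, duplicate_id_sets=group_duplicate_ids(id_pairs))
```

Grouping transitively means that if A matches B and B matches C, all three IDs land in one set, even though A and C were never paired directly.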