colrev.ops.dedupe.Dedupe

class Dedupe(*, review_manager, notify_state_transition_operation=True)[source]

Bases: Operation

Deduplicate records (entity resolution)

Methods

apply_merges

Apply deduplication decisions

check_precondition

Check the operation precondition

conclude

Conclude the operation (stop Docker containers)

connected_components

Find the connected components in a graph.

decorate

Decorator for operations

fix_errors

Fix lists of errors

get_info

Get info on cuts (overlap of search sources) and same source merges

get_records_for_dedupe

Get (pre-processed) records for dedupe

main

Return type: Any

merge_based_on_global_ids

Merge records based on global IDs (e.g., doi)

merge_records

Merge records by ID sets

notify

Notify the review_manager about the next operation

unmerge_records

Undo the duplicate (merge) decision for the records identified by their IDs.

Attributes

DUPLICATES_TO_VALIDATE

NON_DUPLICATE_FILE_TXT

NON_DUPLICATE_FILE_XLSX

PREVENTED_SAME_SOURCE_MERGE_FILE

SAME_SOURCE_MERGE_FILE

debug

type

apply_merges(*, id_sets, complete_dedupe=False, preferred_masterdata_sources=None)[source]

Apply deduplication decisions

Return type:

None

Args:

id_sets (list): Sets of record IDs to merge, in the form [[ID_1, ID_2, ID_3], …].

complete_dedupe: Whether all potential duplicates were considered. When they were not, records cannot be set to md_processed for non-duplicate decisions.

check_precondition()

Check the operation precondition

Return type:

None

conclude()

Conclude the operation (stop Docker containers)

Return type:

None

classmethod connected_components(id_sets)[source]

Find the connected components in a graph.

Return type:

list

Args:

id_sets (list): A list of id sets.

Returns:

list: A list of connected components.
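The idea of grouping overlapping ID sets into connected components can be illustrated with a minimal, standalone sketch. It does not reproduce the colrev implementation: the function name connected_components_sketch, the union-find approach, and the sorted output are assumptions chosen for illustration only.

```python
from collections import defaultdict


def connected_components_sketch(id_sets):
    """Group overlapping ID sets into connected components.

    Hypothetical helper (not part of colrev): two IDs end up in the same
    component if they are linked through a chain of shared ID sets.
    """
    parent = {}

    def find(item):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    def union(left, right):
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    for id_set in id_sets:
        ids = list(id_set)
        for other in ids:
            # Link every ID in the set to the first one.
            union(ids[0], other)

    groups = defaultdict(set)
    for record_id in list(parent):
        groups[find(record_id)].add(record_id)

    # Sorting is only for a deterministic, readable result.
    return sorted(sorted(group) for group in groups.values())


components = connected_components_sketch(
    [["ID_1", "ID_2"], ["ID_2", "ID_3"], ["ID_4", "ID_5"]]
)
```

Here ID_1 and ID_3 never appear in the same set, but both are paired with ID_2, so all three land in one component while ID_4 and ID_5 form a second one.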

classmethod decorate()

Decorator for operations

Return type:

Callable

fix_errors(*, false_positives, false_negatives)[source]

Fix lists of errors

Return type:

None

get_info()[source]

Get info on cuts (overlap of search sources) and same source merges

Return type:

dict

classmethod get_records_for_dedupe(*, records_df, verbosity_level=0)[source]

Get (pre-processed) records for dedupe

Return type:

DataFrame

merge_based_on_global_ids(*, apply=False)[source]

Merge records based on global IDs (e.g., doi)

Return type:

None
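As a rough illustration of what merging on a global ID involves, the following standalone sketch groups records that share a DOI into ID sets. It is not the colrev implementation: the function id_sets_from_doi, the plain-dict record layout, and the "doi" key are assumptions made for this example.

```python
from collections import defaultdict


def id_sets_from_doi(records):
    """Return sets of record IDs that share the same DOI.

    Hypothetical helper (not part of colrev). `records` is assumed to be a
    dict mapping record IDs to dicts of fields, with an optional "doi" key.
    """
    by_doi = defaultdict(list)
    for record_id, record in records.items():
        # DOIs are case-insensitive, so compare them in lower case.
        doi = record.get("doi", "").strip().lower()
        if doi:
            by_doi[doi].append(record_id)
    # Only DOIs shared by at least two records indicate a duplicate.
    return [sorted(ids) for ids in by_doi.values() if len(ids) > 1]


example_records = {
    "R1": {"doi": "10.1000/ABC"},
    "R2": {"doi": "10.1000/abc"},
    "R3": {"title": "No DOI available"},
    "R4": {"doi": "10.1000/xyz"},
}
doi_id_sets = id_sets_from_doi(example_records)
```

The result has the same [[ID_1, ID_2], …] shape as the id_sets argument described for apply_merges; records without a DOI, or with a DOI that no other record shares, are left out.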

merge_records(*, merge)[source]

Merge records by ID sets

Return type:

None

notify(*, state_transition=True)

Notify the review_manager about the next operation

Return type:

None

unmerge_records(*, current_record_ids)[source]

Undo the duplicate (merge) decision for the records identified by their IDs.

current_record_ids identifies the records by their current IDs; for each of them, the most recent merge in its history is undone.

Return type:

None