colrev.ops.dedupe.Dedupe
- class Dedupe(*, review_manager, notify_state_transition_operation=True)
Bases: Operation
Deduplicate records (entity resolution)
Methods
Apply deduplication decisions
Check the operation precondition
Conclude the operation (stop Docker containers)
Find the connected components in a graph.
Decorator for operations
Fix lists of errors
Get info on cuts (overlap of search sources) and same source merges
Get (pre-processed) records for dedupe
main (return type: Any)
Merge records based on global IDs (e.g., doi)
Merge records by ID sets
Notify the review_manager about the next operation
Unmerge duplicate decision of the records, as identified by their ids.
Attributes
DUPLICATES_TO_VALIDATE
NON_DUPLICATE_FILE_TXT
NON_DUPLICATE_FILE_XLSX
PREVENTED_SAME_SOURCE_MERGE_FILE
SAME_SOURCE_MERGE_FILE
debug
type
- apply_merges(*, id_sets, complete_dedupe=False, preferred_masterdata_sources=None)
Apply deduplication decisions
- Return type:
None
- Args:
id_sets (list): A list of ID lists, e.g. [[ID_1, ID_2, ID_3], …].
complete_dedupe (bool): When not all potential duplicates were considered, records cannot be set to md_processed for non-duplicate decisions.
- check_precondition()
Check the operation precondition
- Return type:
None
- conclude()
Conclude the operation (stop Docker containers)
- Return type:
None
- classmethod connected_components(id_sets)
Find the connected components in a graph.
- Return type:
list
- Args:
id_sets (list): A list of id sets.
- Returns:
list: A list of connected components.
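To illustrate what finding connected components over ID sets means, here is a minimal union-find sketch. It is not taken from the colrev source: the function name `connected_components_sketch` and the sorted output order are assumptions made for this example, and the actual classmethod may be implemented differently.

```python
from collections import defaultdict


def connected_components_sketch(id_sets):
    """Group overlapping ID sets into connected components (illustrative only)."""
    parent = {}

    def find(record_id):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(record_id, record_id)
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]  # path halving
            record_id = parent[record_id]
        return record_id

    def union(first, second):
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    for id_set in id_sets:
        ids = list(id_set)
        if not ids:
            continue
        find(ids[0])  # make sure single-element sets are registered
        for other in ids[1:]:
            union(ids[0], other)

    # Collect the members of each component under its root.
    groups = defaultdict(list)
    for record_id in list(parent):
        groups[find(record_id)].append(record_id)

    # Sorting is an assumption of this sketch, chosen for deterministic output.
    return sorted(sorted(group) for group in groups.values())


components = connected_components_sketch([["a", "b"], ["b", "c"], ["d", "e"]])
```

In this sketch, the sets `["a", "b"]` and `["b", "c"]` share `b`, so they collapse into one component, while `["d", "e"]` stays separate.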
- classmethod decorate()
Decorator for operations
- Return type:
Callable
- get_info()
Get info on cuts (overlap of search sources) and same source merges
- Return type:
dict
- classmethod get_records_for_dedupe(*, records_df, verbosity_level=0)
Get (pre-processed) records for dedupe
- Return type:
DataFrame
- merge_based_on_global_ids(*, apply=False)
Merge records based on global IDs (e.g., doi)
- Return type:
None
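The idea of merging on a global ID can be sketched as grouping records that share the same identifier value. The example below is a hypothetical illustration rather than colrev's implementation: the helper name `id_sets_by_global_id`, the record field names `ID` and `doi`, and the case-insensitive comparison are all assumptions.

```python
from collections import defaultdict


def id_sets_by_global_id(records, key="doi"):
    """Return lists of record IDs that share the same global ID (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # Assumption: identifiers are compared after trimming and lowercasing.
        value = record.get(key, "").strip().lower()
        if value:
            groups[value].append(record["ID"])
    # Only groups with at least two records are merge candidates.
    return [ids for ids in groups.values() if len(ids) > 1]


sample_records = [
    {"ID": "r1", "doi": "10.1/ABC"},
    {"ID": "r2", "doi": "10.1/abc"},
    {"ID": "r3"},
    {"ID": "r4", "doi": "10.2/x"},
]
candidate_id_sets = id_sets_by_global_id(sample_records)
```

The result has the same nested-list shape as the `id_sets` argument described for `apply_merges`.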
- notify(*, state_transition=True)
Notify the review_manager about the next operation
- Return type:
None