colrev.ops.dedupe.Dedupe
- class colrev.ops.dedupe.Dedupe(*, review_manager, notify_state_transition_operation=True)
Bases: Operation
Deduplicate records (entity resolution)
Methods
apply_merges
Apply deduplication decisions
check_precondition
Check the operation precondition
conclude
Conclude the operation (stop Docker containers)
connected_components
Find the connected components in a graph.
decorate
Decorator for operations
fix_errors
Fix lists of errors
get_info
Get info on cuts (overlap of search sources) and same-source merges
get_records_for_dedupe
Get (pre-processed) records for dedupe
main
Return type: Any
merge_based_on_global_ids
Merge records based on global IDs (e.g., doi)
merge_records
Merge records by ID sets
notify
Notify the review_manager about the next operation
unmerge_records
Undo the duplicate (merge) decisions for the records identified by their IDs.
Attributes
DUPLICATES_TO_VALIDATE
NON_DUPLICATE_FILE_TXT
NON_DUPLICATE_FILE_XLSX
PREVENTED_SAME_SOURCE_MERGE_FILE
SAME_SOURCE_MERGE_FILE
debug
type
- apply_merges(*, id_sets, complete_dedupe=False, preferred_masterdata_sources=None)
Apply deduplication decisions
- Return type:
None
- Args:
id_sets (list): Nested list of record IDs, e.g. [[ID_1, ID_2, ID_3], …]
complete_dedupe (bool): When not all potential duplicates were considered, records cannot be set to md_processed for non-duplicate decisions.
- check_precondition()
Check the operation precondition
- Return type:
None
- conclude()
Conclude the operation (stop Docker containers)
- Return type:
None
- classmethod connected_components(id_sets)
Find the connected components in a graph.
- Return type:
list
- Args:
id_sets (list): A list of id sets.
- Returns:
list: A list of connected components.
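The idea of finding connected components over ID sets can be illustrated with a small standalone sketch. It is not colrev's implementation: the function below is a hypothetical stand-in that assumes `id_sets` is a list of lists of record IDs, treats IDs appearing in the same set as linked, and returns each component as a sorted list.

```python
from collections import defaultdict


def connected_components(id_sets):
    """Group overlapping ID sets into connected components (illustrative only)."""
    # Build an undirected adjacency map: IDs in the same set are linked.
    graph = defaultdict(set)
    for id_set in id_sets:
        ids = list(id_set)
        for record_id in ids:
            graph[record_id]  # register every ID, including singletons
        for first, second in zip(ids, ids[1:]):
            graph[first].add(second)
            graph[second].add(first)

    seen = set()
    components = []
    for start in list(graph):
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited ID.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# ID_2 appears in two sets, so the first two sets collapse into one component.
merged = connected_components(
    [["ID_1", "ID_2"], ["ID_2", "ID_3"], ["ID_4", "ID_5"]]
)
print(merged)
```

Collapsing overlapping sets this way means that pairwise duplicate decisions such as (ID_1, ID_2) and (ID_2, ID_3) end up in a single group of three records.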
- classmethod decorate()
Decorator for operations
- Return type:
Callable
- get_info()
Get info on cuts (overlap of search sources) and same-source merges
- Return type:
dict
- classmethod get_records_for_dedupe(*, records_df, verbosity_level=0)
Get (pre-processed) records for dedupe
- Return type:
DataFrame
- merge_based_on_global_ids(*, apply=False)
Merge records based on global IDs (e.g., doi)
- Return type:
None
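As a rough illustration of merging on a global identifier, the sketch below groups records that share the same DOI. It is an assumption-laden example and does not reflect how colrev implements this method: the helper `id_sets_from_global_ids`, the plain-dict record layout, and the `"ID"` and `"doi"` keys are all hypothetical. DOIs are compared case-insensitively, and records without a DOI are skipped.

```python
from collections import defaultdict


def id_sets_from_global_ids(records, key="doi"):
    """Return lists of record IDs that share the same non-empty global ID.

    Hypothetical helper for illustration; not part of colrev.
    """
    groups = defaultdict(list)
    for record in records:
        # Normalize the identifier (DOIs are case-insensitive).
        value = record.get(key, "").strip().lower()
        if value:  # skip records that lack the identifier
            groups[value].append(record["ID"])
    # Only identifiers shared by two or more records indicate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


records = [
    {"ID": "Smith2020", "doi": "10.1000/ABC"},
    {"ID": "Smith2020a", "doi": "10.1000/abc"},
    {"ID": "Jones2019", "doi": "10.1000/xyz"},
    {"ID": "Lee2021"},
]
id_sets = id_sets_from_global_ids(records)
print(id_sets)
```

The result has the same nested-list shape as the `id_sets` argument described for `apply_merges`.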
- notify(*, state_transition=True)
Notify the review_manager about the next operation
- Return type:
None