colrev.dataset.Dataset

class colrev.dataset.Dataset(*, review_manager)[source]

Bases: object

The CoLRev dataset (records and their history in git).

Initialize the instance.

Methods

format_records_file

Format the records file (Entrypoint for pre-commit hooks).

get_committed_origin_state_dict

Get the committed origin_state_dict.

get_origin_state_dict

Get the origin_state_dict (to determine state transitions efficiently).

load_records_dict

Load the records.

load_records_from_history

Iterates through Git history, yielding records file contents as dictionaries.

propagated_id

Check whether an ID is propagated (i.e., its record's status is beyond md_processed).

read_next_record

Iterate over the records that satisfy the given conditions.

reset_log_if_no_changes

Reset the report log file if there are no changes.

save_records_dict

Save the records dict in RECORDS_FILE.

save_records_dict_to_file

Save the records dict.

set_ids

Set the IDs of records according to predefined formats or according to the LocalIndex.

Attributes

git_repo

Get the GitRepo object for the review_manager path.

format_records_file()[source]

Format the records file (Entrypoint for pre-commit hooks).

Return type:

dict

get_committed_origin_state_dict()[source]

Get the committed origin_state_dict.

Return type:

dict

get_origin_state_dict(records_string='')[source]

Get the origin_state_dict (to determine state transitions efficiently).

Example: {'30_example_records.bib/Staehr2010': <RecordState.pdf_not_available: 10>}

Return type:

dict
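To make the shape of this mapping concrete, the following is a minimal sketch that derives an origin-to-state dictionary from an already loaded records dict. `RecordState` here is a stand-in enum carrying only the two members shown on this page, and `build_origin_state_dict` is a hypothetical helper, not part of colrev; the real method may obtain the information differently.

```python
from enum import Enum


class RecordState(Enum):
    """Stand-in for colrev's RecordState (only the members shown on this page)."""

    md_imported = 2
    pdf_not_available = 10


def build_origin_state_dict(records: dict) -> dict:
    """Hypothetical helper: map every origin of every record to the record's status."""
    origin_state = {}
    for record in records.values():
        for origin in record["colrev_origin"]:
            origin_state[origin] = record["colrev_status"]
    return origin_state


records = {
    "Staehr2010": {
        "ID": "Staehr2010",
        "colrev_origin": ["30_example_records.bib/Staehr2010"],
        "colrev_status": RecordState.pdf_not_available,
    }
}
origin_state_dict = build_origin_state_dict(records)
print(origin_state_dict)
```

Keying by origin rather than by record ID means a state lookup for a source entry does not depend on the ID the record currently carries.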

property git_repo: GitRepo

Get the GitRepo object for the review_manager path.

load_records_dict(*, header_only=False)[source]

Load the records.

header_only:

{"Staehr2010": {'ID': 'Staehr2010', 'colrev_origin': ['30_example_records.bib/Staehr2010'], 'colrev_status': <RecordState.md_imported: 2>, 'screening_criteria': 'criterion1=in;criterion2=out', 'file': PosixPath('data/pdfs/Smith2000.pdf'), 'colrev_data_provenance': {Fields.AUTHOR: {"source": "…", "note": "…"}}}}

Return type:

dict[str, dict[str, Any]]

load_records_from_history(commit_sha='')[source]

Iterates through Git history, yielding records file contents as dictionaries.

Starts iteration from a provided commit SHA. Skips commits where the records file is unchanged. Useful for tracking dataset changes over time.

Return type:

Iterator[dict]

Parameters

commit_sha (str, optional): Start iteration from this commit SHA. Defaults to beginning of Git history if not provided.

Yields

dict: Records file contents at a specific Git history point, as a dictionary.
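The skip-unchanged behaviour described above can be sketched without Git at all. In the snippet below, `history` is a plain list of file contents standing in for successive commits, and `parse_records` and `iter_changed_snapshots` are hypothetical helpers (the toy `ID=status` format is also an assumption made for brevity); it only illustrates the generator pattern, not how colrev reads its history.

```python
from typing import Iterable, Iterator


def parse_records(text: str) -> dict:
    """Hypothetical parser for a toy format: one 'ID=status' pair per line."""
    return dict(line.split("=", 1) for line in text.splitlines() if line)


def iter_changed_snapshots(snapshots: Iterable[str]) -> Iterator[dict]:
    """Yield parsed snapshots, skipping any that equal the one visited before."""
    previous = None
    for text in snapshots:
        if text == previous:
            continue  # the records file did not change in this commit
        previous = text
        yield parse_records(text)


# Stand-in for the records file contents at three consecutive commits.
history = [
    "Staehr2010=md_processed",
    "Staehr2010=md_processed",  # a commit that did not touch the records file
    "Staehr2010=md_imported",
]
versions = list(iter_changed_snapshots(history))
print(len(versions))
```

Because the function is a generator, a caller can stop iterating as soon as it finds the version it needs, without parsing the remaining history.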

propagated_id(*, record_id)[source]

Check whether an ID is propagated (i.e., its record’s status is beyond md_processed).

Return type:

bool
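As a rough illustration of what "beyond md_processed" could mean, the sketch below compares positions in an ordered status enum. Everything in it is an assumption: `RecordState` is a stand-in, the numeric value given to `md_processed` is assumed (only the values of `md_imported` and `pdf_not_available` appear on this page), `is_propagated` is a hypothetical helper, and "beyond" is read as strictly later than `md_processed`.

```python
from enum import Enum


class RecordState(Enum):
    """Stand-in for colrev's RecordState."""

    md_imported = 2
    md_processed = 5  # assumed value, not taken from this page
    pdf_not_available = 10


def is_propagated(status: RecordState) -> bool:
    """Hypothetical check: True if the status comes strictly after md_processed."""
    return status.value > RecordState.md_processed.value


print(is_propagated(RecordState.md_imported))
print(is_propagated(RecordState.pdf_not_available))
```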

read_next_record(*, conditions)[source]

Iterate over the records that satisfy the given conditions.

Return type:

Iterator[dict]
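The format of `conditions` is not documented on this page. The sketch below assumes it is a list of dicts of field/value pairs and that a record is yielded when it matches at least one of them; `iter_matching` is a hypothetical helper that shows the lazy filtering pattern, not colrev's implementation.

```python
from typing import Iterable, Iterator


def iter_matching(records: Iterable[dict], conditions: list) -> Iterator[dict]:
    """Yield each record that matches at least one condition dict (assumed semantics)."""
    for record in records:
        for condition in conditions:
            if all(record.get(key) == value for key, value in condition.items()):
                yield record
                break  # avoid yielding the same record twice


# Toy records; status values are plain strings here for brevity.
records = [
    {"ID": "Staehr2010", "colrev_status": "md_imported"},
    {"ID": "Smith2000", "colrev_status": "md_processed"},
]
matches = list(iter_matching(records, [{"colrev_status": "md_imported"}]))
print([record["ID"] for record in matches])
```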

reset_log_if_no_changes()[source]

Reset the report log file if there are no changes.

Return type:

None

save_records_dict(records, *, partial=False)[source]

Save the records dict in RECORDS_FILE.

Return type:

None

save_records_dict_to_file(records)[source]

Save the records dict.

Return type:

None

set_ids(selected_ids=None)[source]

Set the IDs of records according to predefined formats or according to the LocalIndex.

Return type:

dict