CEP002 - Data schema

Author

Carlo Tang, Gerit Wagner

Status

Draft

Created

2023-10-02

Table of contents

Abstract

ENTRYTYPEs

Field sets

Fields

Schema Mapping

Defect codes

Test data values

Abstract

This document describes the standard data schema for CoLRev records, including ENTRYTYPEs and record fields. Fields can be part of field sets; the corresponding metadata is stored in the provenance fields (colrev_data_provenance, colrev_masterdata_provenance), and quality defects are tracked by defect codes. The document also outlines principles for mapping the schemata of feed records (retrieved from SearchSources) and provides unified test data.

ENTRYTYPEs

Each record has an ENTRYTYPE with respective required fields. Missing required fields and fields that are inconsistent with the ENTRYTYPE are flagged by the QualityModel (the missing-field <https://colrev.readthedocs.io/en/latest/manual/appendix/quality_model.html#missing-field> and inconsistent-with-entrytype checkers).

Note that fields (like title) can have a different meaning depending on the ENTRYTYPE.

The following listing shows the available BibTeX ENTRYTYPEs with their respective required fields (data elements) (source: bibtex.eu). Additional information and optional fields are linked from the name of each entry type.

ENTRYTYPE: article

  • author

  • title

  • journal

  • year

  • volume (if exists)

  • number (if exists)

ENTRYTYPE: book

  • author

  • year

  • title

  • publisher

  • address

ENTRYTYPE: conference

  • author

  • title

  • booktitle

  • year

ENTRYTYPE: inbook

  • author

  • title (i.e., chapter)

  • booktitle

  • publisher

  • year

ENTRYTYPE: incollection

  • author

  • title - TBD: chapter?

  • booktitle

  • publisher

  • year

ENTRYTYPE: inproceedings

  • author

  • title

  • booktitle

  • year

ENTRYTYPE: manual

  • title

  • year

ENTRYTYPE: mastersthesis

  • author

  • title

  • school

  • year

ENTRYTYPE: misc

  • author

  • title

  • year

ENTRYTYPE: phdthesis

  • author

  • title

  • school

  • year

ENTRYTYPE: proceedings

  • title

  • year

ENTRYTYPE: techreport

  • author

  • title

  • institution

  • year

  • number (if exists)

ENTRYTYPE: unpublished

  • author

  • title

  • institution

  • year
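The required-field listing above can be read as a lookup table. The following is a minimal sketch of a missing-field check under that reading; it is not the QualityModel implementation. The names REQUIRED_FIELDS and missing_fields are hypothetical, and fields marked "(if exists)" are left out because the listing treats them as conditional.

```python
# Required fields per ENTRYTYPE, transcribed from the listing above.
# Conditional fields ("volume"/"number", marked "if exists") are omitted.
# REQUIRED_FIELDS and missing_fields are hypothetical names for this sketch.
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "book": ["author", "year", "title", "publisher", "address"],
    "conference": ["author", "title", "booktitle", "year"],
    "inbook": ["author", "title", "booktitle", "publisher", "year"],
    "incollection": ["author", "title", "booktitle", "publisher", "year"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "manual": ["title", "year"],
    "mastersthesis": ["author", "title", "school", "year"],
    "misc": ["author", "title", "year"],
    "phdthesis": ["author", "title", "school", "year"],
    "proceedings": ["title", "year"],
    "techreport": ["author", "title", "institution", "year"],
    "unpublished": ["author", "title", "institution", "year"],
}


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    required = REQUIRED_FIELDS.get(record.get("ENTRYTYPE", ""), [])
    return [field for field in required if not str(record.get(field, "")).strip()]


record = {
    "ENTRYTYPE": "article",
    "author": "Davie, Ulyana",
    "title": "Article title",
    "year": "2020",
}
print(missing_fields(record))  # ['journal']
```

Unknown ENTRYTYPEs yield an empty list in this sketch; a real checker would more likely report them separately.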

Field sets

The following field sets are distinguished (work-in-progress):

  • IDENTIFYING_FIELD_KEYS

  • colrev_data_provenance/colrev_masterdata_provenance

Fields

This section lists the standardized field names with explanations. Value restrictions are implemented in the QualityModel.

Fields should be in Unicode (i.e., not contain LaTeX or HTML characters or tags).

Fields not listed in the ENTRYTYPEs section are optional.

  • author (format: LastName, FirstName; multiple authors are separated by “ and ”; institutional authors are escaped with double braces; particles are escaped together with the last name using braces)

  • title

  • year

  • journal

  • booktitle

  • chapter

  • publisher

  • volume

  • number

  • pages

  • editor (format: see author)

  • language (ISO 639-1 standard language codes)

  • abstract

  • keywords (separated by “,”)

  • url

  • fulltext

  • note: containing custom notes entered by users (note fields from SearchSources do not replace this field)

  • cited_by: current number of citations (volatile)

Work in progress:

  • Identifiers

  • title fields in different languages (e.g., title_deu)
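To illustrate the author format described in the field list, the following sketch splits an author value into individual names. It assumes that institutional authors appear as {{Name}} and that a braced last name (including its particle) appears as {van der Aalst}, Wil; both concrete spellings are assumptions based on the description, and split_authors and parse_name are hypothetical helpers, not CoLRev functions.

```python
def split_authors(author_field: str) -> list:
    """Split an author value on ' and ', ignoring separators inside braces."""
    names, current, depth, i = [], "", 0, 0
    while i < len(author_field):
        char = author_field[i]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        # Only split at the top level, so "{{Food and Drug Administration}}"
        # stays intact.
        if depth == 0 and author_field.startswith(" and ", i):
            names.append(current)
            current = ""
            i += len(" and ")
            continue
        current += char
        i += 1
    names.append(current)
    return names


def parse_name(name: str) -> dict:
    """Parse one name into an institution or a last/first name pair."""
    if name.startswith("{{") and name.endswith("}}"):
        return {"institution": name[2:-2]}
    last, _, first = name.partition(", ")
    return {"last": last.strip("{}"), "first": first}


authors = split_authors("{{Food and Drug Administration}} and {van der Aalst}, Wil")
print([parse_name(name) for name in authors])
```

Tracking the brace depth keeps an " and " inside an institutional name from being treated as a separator, which a plain string split would get wrong.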

Schema Mapping

Upon load, the SearchSource fields are mapped to the standardized fields. This is necessary to handle naming conflicts (e.g., field name “authors” in one SearchSource and “author” in another), and type/domain conflicts (e.g., “citations” containing an integer in one SearchSource and a list of citing papers in another). Fields that cannot be mapped receive a SearchSource-specific prefix (e.g., “colrev.dblp.dblp_key”).

The schema mapping should be completed in the search methods. Search feeds should contain raw (non-prefixed) fields.
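The mapping step described above can be sketched as follows. This is an illustration under stated assumptions, not the CoLRev implementation: FIELD_MAPPING, SOURCE_PREFIX, and map_to_standard_schema are hypothetical names, and the mapping entries are invented for the example. Only the prefixed key colrev.dblp.dblp_key is taken from the text.

```python
# Hypothetical mapping from one SearchSource's raw field names to the
# standardized field names; the entries are illustrative only.
FIELD_MAPPING = {
    "authors": "author",
    "title": "title",
    "publication_year": "year",
}

# SearchSource-specific prefix for fields without a standardized counterpart.
SOURCE_PREFIX = "colrev.dblp"


def map_to_standard_schema(feed_record: dict) -> dict:
    """Rename mappable fields and prefix all remaining fields."""
    mapped = {}
    for key, value in feed_record.items():
        if key in FIELD_MAPPING:
            mapped[FIELD_MAPPING[key]] = value
        else:
            mapped[f"{SOURCE_PREFIX}.{key}"] = value
    return mapped


raw = {"authors": "Davie, Ulyana", "title": "PhD thesis title", "dblp_key": "example/key"}
print(map_to_standard_schema(raw))
```

The feed record itself stays untouched (raw, non-prefixed), and the function returns a new dictionary, which matches the requirement that search feeds keep the original field names.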

Defect codes

Defect codes are stored in the provenance of the respective field. A defect code can be marked as a false positive, and thereby ignored, by adding the IGNORE: prefix.

The standardized defect codes are in the QualityModel and PDFQualityModel.
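A minimal sketch of how the IGNORE: prefix could be handled. It assumes that the defect codes of a field are kept as a comma-separated string in its provenance entry; that storage format, the code strings used in the example, and the helper names active_defects and ignore_defect are assumptions, not taken from the QualityModel.

```python
IGNORE_PREFIX = "IGNORE:"


def active_defects(provenance_note: str) -> list:
    """Return the defect codes that are not marked as false positives."""
    codes = [code.strip() for code in provenance_note.split(",") if code.strip()]
    return [code for code in codes if not code.startswith(IGNORE_PREFIX)]


def ignore_defect(provenance_note: str, defect_code: str) -> str:
    """Mark one defect code as a false positive by adding the IGNORE: prefix."""
    codes = [code.strip() for code in provenance_note.split(",") if code.strip()]
    updated = [IGNORE_PREFIX + code if code == defect_code else code for code in codes]
    return ",".join(updated)


note = "missing-field,inconsistent-with-entrytype"
note = ignore_defect(note, "missing-field")
print(note)                  # IGNORE:missing-field,inconsistent-with-entrytype
print(active_defects(note))  # ['inconsistent-with-entrytype']
```

Keeping the ignored code in place, rather than deleting it, preserves the information that the defect was reviewed and judged a false positive.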

Test data values

The following five example entries provide the dummy values used in the tests.

@article{ID274107,
   author                        = {Marilena, Ferdinand and Ethelinda, Aignéis},
   title                         = {Article title},
   journal                       = {Journal name},
   year                          = {2020},
   volume                        = {23},
   number                        = {78},
}

@book{ID438965,
   author                        = {Romilius, Milivoj and Alphaeus, Cheyanne},
   year                          = {2020},
   title                         = {Book title},
   publisher                     = {Publisher name},
   address                       = {Publisher address},
}


@conference{ID461901,
   author                        = {Derry, Wassa and Wemba, Sandip},
   title                         = {Conference title},
   booktitle                     = {Conference book title},
   year                          = {2020},
}

@inproceedings{ID110380,
   author                        = {Raanan, Cathrine and Philomena, Miigwan},
   title                         = {Inproceedings title},
   booktitle                     = {Inproceedings book title},
   year                          = {2020},
}

@phdthesis{ID833501,
   author                        = {Davie, Ulyana},
   title                         = {PhD thesis title},
   school                        = {PhD school name},
   year                          = {2020},
}