CEP002 - Data schema
Author | Carlo Tang, Gerit Wagner
Status | Draft
Created | 2023-10-02
Abstract
This document describes the standard data schema for CoLRev records, including ENTRYTYPEs and record fields. Fields can be part of field sets, with corresponding metadata stored in the provenance fields (colrev_data_provenance, colrev_masterdata_provenance) and quality defects tracked by defect codes. It also outlines principles for mapping the schemata of feed records (retrieved from SearchSources) and provides unified test data.
ENTRYTYPEs
Each record has an ENTRYTYPE with respective required fields. Required and inconsistent fields are evaluated by the QualityModel (the missing-field checker, https://colrev-environment.github.io/colrev/manual/appendix/quality_model.html#missing-field, and the inconsistent-with-entrytype checker).
Note that fields (like title) can have a different meaning depending on the ENTRYTYPE.
The following listing shows the available BibTeX ENTRYTYPEs with their respective required fields (data elements) (source: bibtex.eu). Additional information and optional fields can be accessed with the link included in the name of each entry type.
ENTRYTYPE: article
author
title
journal
year
volume (if exists)
number (if exists)
ENTRYTYPE: book
author
year
title
publisher
address
ENTRYTYPE: conference
author
title
booktitle
year
ENTRYTYPE: inbook
author
title (i.e., chapter)
booktitle
publisher
year
ENTRYTYPE: incollection
author
title - TBD: chapter?
booktitle
publisher
year
ENTRYTYPE: inproceedings
author
title
booktitle
year
ENTRYTYPE: manual
title
year
ENTRYTYPE: mastersthesis
author
title
school
year
ENTRYTYPE: misc
author
title
year
ENTRYTYPE: phdthesis
author
title
school
year
ENTRYTYPE: proceedings
title
year
ENTRYTYPE: techreport
author
title
institution
year
number (if exists)
ENTRYTYPE: unpublished
author
title
institution
year
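The required-field listing above can be expressed as a lookup table. The following is a minimal sketch, not the QualityModel implementation: the names REQUIRED_FIELDS and missing_fields are hypothetical, only a subset of ENTRYTYPEs is transcribed, and fields marked "if exists" (volume, number) are assumed not to count as required.

```python
# Hypothetical sketch: this is NOT the CoLRev QualityModel code.
# Required fields per ENTRYTYPE, transcribed from the listing above (subset).
# Fields marked "if exists" (volume, number) are deliberately left out.
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "book": ["author", "year", "title", "publisher", "address"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "phdthesis": ["author", "title", "school", "year"],
    "techreport": ["author", "title", "institution", "year"],
}


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    required = REQUIRED_FIELDS.get(record.get("ENTRYTYPE", ""), [])
    return [name for name in required if not str(record.get(name, "")).strip()]


record = {
    "ENTRYTYPE": "article",
    "author": "Davie, Ulyana",
    "title": "Article title",
    "year": "2020",
}
print(missing_fields(record))  # ['journal']
```

Unknown ENTRYTYPEs yield an empty list in this sketch; a real checker would probably report them separately.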
Field sets
The following field sets are distinguished (work-in-progress):
IDENTIFYING_FIELD_KEYS
colrev_data_provenance/colrev_masterdata_provenance
Fields
Standardized field names and explanations. Value restrictions are implemented in the QualityModel.
Fields should be in Unicode (i.e., they should not contain LaTeX or HTML characters or tags).
Fields not listed in the ENTRYTYPEs section are optional.
author (format: LastName, FirstName; multiple authors are separated by “ and ”; institutional authors are escaped with double braces; particles are kept together with the last name using braces)
title
year
journal
booktitle
chapter
publisher
volume
number
pages
editor (format: see author)
language (ISO 639-1 standard language codes)
abstract
keywords (separated by “,”)
url
fulltext
note: containing custom notes entered by users (note fields from SearchSources do not replace this field)
cited_by: current number of citations (volatile)
work-in-progress
Identifiers
title fields in different languages (e.g., title_deu)
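To illustrate the author format described in the field list, the following sketch splits an author field into individual names. It is an assumption-laden illustration rather than CoLRev code: the function names are hypothetical, and the split is brace-aware so that an institutional author such as {{Food and Drug Administration}} (an invented example value) is not broken at its inner " and ".

```python
def split_top_level(author_field: str, sep: str = " and ") -> list:
    """Split on the separator, ignoring separators nested inside braces."""
    parts, current, depth, i = [], "", 0, 0
    while i < len(author_field):
        char = author_field[i]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        if depth == 0 and author_field.startswith(sep, i):
            parts.append(current)
            current = ""
            i += len(sep)
            continue
        current += char
        i += 1
    parts.append(current)
    return parts


def split_authors(author_field: str) -> list:
    """Return (last_name, first_name) tuples; institutions get an empty first name."""
    authors = []
    for name in split_top_level(author_field):
        name = name.strip()
        if name.startswith("{{") and name.endswith("}}"):
            # Institutional author escaped with double braces
            authors.append((name[2:-2], ""))
        elif "," in name:
            last, first = name.split(",", 1)
            authors.append((last.strip(), first.strip()))
        else:
            authors.append((name, ""))
    return authors


print(split_authors("Derry, Wassa and Wemba, Sandip"))
# [('Derry', 'Wassa'), ('Wemba', 'Sandip')]
```

Braces around particles are left untouched in this sketch, so a last name written with braces keeps them.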
Schema Mapping
Upon load, the SearchSource fields are mapped to the standardized fields. This is necessary to handle naming conflicts (e.g., field name “authors” in one SearchSource and “author” in another), and type/domain conflicts (e.g., “citations” containing an integer in one SearchSource and a list of citing papers in another). Fields which cannot be mapped receive a SearchSource-specific prefix (e.g., “colrev.dblp.dblp_key”).
The schema mapping should be completed in the search methods. Search feeds should contain raw (non-prefixed) fields.
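The mapping step can be sketched as a rename table with a prefixed fallback. This is a hypothetical illustration, not the CoLRev implementation: FIELD_MAP, map_fields, the source field names, and the dummy dblp_key value are assumptions; only the prefix format "colrev.dblp.dblp_key" comes from the text above.

```python
# Hypothetical mapping for an imaginary SearchSource (illustrative names only).
FIELD_MAP = {
    "authors": "author",
    "publication_year": "year",
    "title": "title",
}


def map_fields(raw: dict, prefix: str) -> dict:
    """Rename known fields; give unmapped fields a SearchSource-specific prefix."""
    mapped = {}
    for key, value in raw.items():
        if key in FIELD_MAP:
            mapped[FIELD_MAP[key]] = value
        else:
            mapped[f"{prefix}.{key}"] = value
    return mapped


# Raw (non-prefixed) feed record with dummy values
raw = {
    "authors": "Davie, Ulyana",
    "title": "PhD thesis title",
    "dblp_key": "dummy-key",
}
mapped = map_fields(raw, prefix="colrev.dblp")
print(sorted(mapped))  # ['author', 'colrev.dblp.dblp_key', 'title']
```

Keeping the feed record raw and applying the mapping on load means that the mapping can be corrected later without retrieving the records again.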
Defect codes
Defect codes are stored in the field provenance. They can be ignored as false positives based on the IGNORE: prefix.
The standardized defect codes are in the QualityModel and PDFQualityModel.
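As a rough illustration of the IGNORE: prefix, the following sketch filters out defect codes that were marked as false positives. The data layout is an assumption made for this example (a dictionary mapping each field name to a list of defect code strings); the actual provenance structure is not specified in this document, and active_defects is a hypothetical helper.

```python
IGNORE_PREFIX = "IGNORE:"


def active_defects(provenance: dict) -> dict:
    """Return, per field, the defect codes that are not marked as false positives."""
    result = {}
    for field_name, codes in provenance.items():
        active = [code for code in codes if not code.startswith(IGNORE_PREFIX)]
        if active:
            result[field_name] = active
    return result


# Assumed layout: field name -> list of defect codes
provenance = {
    "journal": ["missing-field"],
    "booktitle": ["IGNORE:inconsistent-with-entrytype"],
}
print(active_defects(provenance))  # {'journal': ['missing-field']}
```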
Test data values
The following five example entries provide the dummy values used in the tests.
@article{ID274107,
author = {Marilena, Ferdinand and Ethelinda, Aignéis},
title = {Article title},
journal = {Journal name},
year = {2020},
volume = {23},
number = {78},
}
@book{ID438965,
author = {Romilius, Milivoj and Alphaeus, Cheyanne},
year = {2020},
title = {Book title},
publisher = {Publisher name},
address = {Publisher address},
}
@conference{ID461901,
author = {Derry, Wassa and Wemba, Sandip},
title = {Conference title},
booktitle = {Conference book title},
year = {2020},
}
@inproceedings{ID110380,
author = {Raanan, Cathrine and Philomena, Miigwan},
title = {Inproceedings title},
booktitle = {Inproceedings book title},
year = {2020},
}
@phdthesis{ID833501,
author = {Davie, Ulyana},
title = {PhD thesis title},
school = {PhD school name},
year = {2020},
}
Links informing the standard
First source: bibtex.com (required and optional fields are not specified)
Better: bibtex.eu
However, the definitions are not consistent across BibTeX managers, e.g., “field” or “manual” in the following tool: Bib-it (SourceForge project documentation)
Listing of fields and the entry types in which they are required: https://www.bibtex.com/format/fields/
Examples of different fields and descriptions: https://www.ncbi.nlm.nih.gov/books/NBK3827/
BibTeX Definition in Web Ontology Language (OWL) Version 0.2