CEP002 - Data schema¶
Author | Carlo Tang, Gerit Wagner
Status | Draft
Created | 2023-10-02
Abstract¶
This document describes the standard data schema for CoLRev records, including ENTRYTYPEs and record fields. Fields can be part of field sets, with corresponding metadata stored in the provenance fields (colrev_data_provenance, colrev_masterdata_provenance) and quality defects tracked by defect codes. The document also outlines principles for mapping the schemata of feed records (retrieved from SearchSources) and provides unified test data.
ENTRYTYPEs¶
Each record has an ENTRYTYPE with respective required fields. Required and inconsistent fields are evaluated by the QualityModel (the missing-field checker, https://colrev-environment.github.io/colrev/manual/appendix/quality_model.html#missing-field, and the inconsistent-with-entrytype checker).
Note that a field (like title) can have a different meaning depending on the ENTRYTYPE.
The following listing shows the available BibTeX ENTRYTYPEs with their respective required fields (data elements) (source: bibtex.eu). Additional information and optional fields can be accessed with the link included in the name of each entry type.
ENTRYTYPE: article¶
author
title
journal
year
volume (if exists)
number (if exists)
ENTRYTYPE: book¶
author
year
title
publisher
address
ENTRYTYPE: conference¶
author
title
booktitle
year
ENTRYTYPE: inbook¶
author
title (i.e., chapter)
booktitle
publisher
year
ENTRYTYPE: incollection¶
author
title - TBD: chapter?
booktitle
publisher
year
ENTRYTYPE: inproceedings¶
author
title
booktitle
year
ENTRYTYPE: manual¶
title
year
ENTRYTYPE: mastersthesis¶
author
title
school
year
ENTRYTYPE: misc¶
author
title
year
ENTRYTYPE: phdthesis¶
author
title
school
year
ENTRYTYPE: proceedings¶
title
year
ENTRYTYPE: techreport¶
author
title
institution
year
number (if exists)
ENTRYTYPE: unpublished¶
author
title
institution
year
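The required-field listing above can be read as a lookup table. The following is a minimal sketch of such a check, not the QualityModel implementation: the names REQUIRED_FIELDS and missing_fields are hypothetical, only a subset of the ENTRYTYPEs is included, and fields marked "(if exists)" are assumed to be conditional and are therefore left out.

```python
# Required fields per ENTRYTYPE, copied from the listing above (subset).
# Fields marked "(if exists)", such as volume and number, are omitted here
# on the assumption that they are only expected when the publication has them.
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "book": ["author", "year", "title", "publisher", "address"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "phdthesis": ["author", "title", "school", "year"],
    "techreport": ["author", "title", "institution", "year"],
}


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    required = REQUIRED_FIELDS.get(record.get("ENTRYTYPE", ""), [])
    return [field for field in required if not record.get(field)]


# Dummy record in the style of the test data; the journal field is left out.
record = {
    "ENTRYTYPE": "article",
    "ID": "ID274107",
    "author": "Marilena, Ferdinand",
    "title": "Article title",
    "year": "2020",
}
print(missing_fields(record))  # ['journal']
```

An ENTRYTYPE that is not in the table yields an empty list in this sketch; a real checker would probably treat an unknown ENTRYTYPE as a defect of its own.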
Field sets¶
The following field sets are distinguished (work-in-progress):
IDENTIFYING_FIELD_KEYS
colrev_data_provenance/colrev_masterdata_provenance
Fields¶
Standardized field names and explanations. Value restrictions are implemented in the QualityModel.
Fields should be in Unicode (i.e., they should not contain LaTeX or HTML characters or tags).
Fields not listed in the ENTRYTYPEs section are optional.
author (format: Last-name, First-name; multiple authors are separated by " and "; institutional authors are escaped with double braces; particles are kept together with the last name using braces)
title
year
journal
booktitle
chapter
publisher
volume
number
pages
editor (format: see author)
language (ISO 639-1 standard language codes)
abstract
keywords (separated by “,”)
url
fulltext
note: containing custom notes entered by users (note fields from SearchSources do not replace this field)
cited_by: current number of citations (volatile)
Work-in-progress fields:
Identifiers
Title fields in different languages (e.g., title_deu)
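To illustrate the author format described in this section, the following sketch splits an author field into individual names and checks each one. The functions split_authors and follows_author_format are hypothetical helpers, not part of CoLRev, and the check is deliberately coarse: it only looks for the comma-separated form or for double braces around an institutional author.

```python
def split_authors(author_field: str) -> list:
    """Split an author field on " and ", ignoring separators inside braces."""
    authors, current, depth = [], "", 0
    i = 0
    while i < len(author_field):
        char = author_field[i]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        # Only split at the top level, so that an institutional author such as
        # {{Research and Development Group}} stays in one piece.
        if depth == 0 and author_field.startswith(" and ", i):
            authors.append(current)
            current = ""
            i += len(" and ")
            continue
        current += char
        i += 1
    authors.append(current)
    return authors


def follows_author_format(name: str) -> bool:
    """Accept "Last-name, First-name" or a double-braced institutional author."""
    if name.startswith("{{") and name.endswith("}}"):
        return True
    last, separator, first = name.partition(", ")
    return bool(separator and last.strip() and first.strip())


names = split_authors("{{Research and Development Group}} and Davie, Ulyana")
print(names)
print([follows_author_format(name) for name in names])
```

A name written as "Ulyana Davie" (without the comma) would be rejected by follows_author_format in this sketch.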
Schema Mapping¶
Upon load, the SearchSource fields are mapped to the standardized fields. This is necessary to handle naming conflicts (e.g., field name “authors” in one SearchSource and “author” in another), and type/domain conflicts (e.g., “citations” containing an integer in one SearchSource and a list of citing papers in another). Fields that cannot be mapped receive a SearchSource-specific prefix (e.g., “colrev.dblp.dblp_key”).
The schema mapping should be completed in the search methods. Search feeds should contain raw (non-prefixed) fields.
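The mapping step can be sketched as follows. This is an illustration under assumptions rather than the CoLRev implementation: map_feed_record, STANDARD_FIELDS, and the field_mapping argument are hypothetical names, the set of standardized fields is a small subset, and the dblp_key value is a dummy. Type/domain conflicts (such as the “citations” example) are not handled here.

```python
# Illustrative subset of the standardized field names from the Fields section.
STANDARD_FIELDS = {"author", "title", "year", "journal", "booktitle", "pages", "url"}


def map_feed_record(feed_record: dict, field_mapping: dict, source_prefix: str) -> dict:
    """Rename known fields, keep standardized ones, and prefix the rest."""
    standardized = {}
    for key, value in feed_record.items():
        if key in field_mapping:
            # Naming conflict: e.g., "authors" in the feed becomes "author".
            standardized[field_mapping[key]] = value
        elif key in STANDARD_FIELDS:
            standardized[key] = value
        else:
            # Unmappable field: add the SearchSource-specific prefix.
            standardized[f"{source_prefix}.{key}"] = value
    return standardized


# Raw (non-prefixed) feed record with dummy values.
feed_record = {
    "authors": "Derry, Wassa and Wemba, Sandip",
    "title": "Conference title",
    "year": "2020",
    "dblp_key": "dummy-key",
}
mapped = map_feed_record(feed_record, {"authors": "author"}, "colrev.dblp")
print(mapped)
```

The feed record itself is left untouched, which matches the principle that search feeds keep the raw fields while the mapped record uses the standardized and prefixed names.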
Defect codes¶
Defect codes are stored in the provenance of the respective field. A defect code that is a false positive can be ignored by adding the IGNORE: prefix.
The standardized defect codes are in the QualityModel and PDFQualityModel.
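The effect of the IGNORE: prefix can be sketched as follows. How defect codes are serialized inside the provenance fields is not specified in this section, so the example assumes a plain list of code strings per field; active_defects and ignore_defect are hypothetical helper names, and the two codes are taken from the checker names mentioned in the ENTRYTYPEs section.

```python
IGNORE_PREFIX = "IGNORE:"


def active_defects(defect_codes: list) -> list:
    """Return the defect codes that are not marked as false positives."""
    return [code for code in defect_codes if not code.startswith(IGNORE_PREFIX)]


def ignore_defect(defect_codes: list, code: str) -> list:
    """Mark one defect code as a false positive by adding the prefix."""
    return [IGNORE_PREFIX + c if c == code else c for c in defect_codes]


# Assumed representation: one list of defect codes for a single field.
codes = ["missing-field", "IGNORE:inconsistent-with-entrytype"]
print(active_defects(codes))
print(ignore_defect(codes, "missing-field"))
```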
Test data values¶
Five example entries with the dummy values used in the tests:
@article{ID274107,
author = {Marilena, Ferdinand and Ethelinda, Aignéis},
title = {Article title},
journal = {Journal name},
year = {2020},
volume = {23},
number = {78},
}
@book{ID438965,
author = {Romilius, Milivoj and Alphaeus, Cheyanne},
year = {2020},
title = {Book title},
publisher = {Publisher name},
address = {Publisher address},
}
@conference{ID461901,
author = {Derry, Wassa and Wemba, Sandip},
title = {Conference title},
booktitle = {Conference book title},
year = {2020},
}
@inproceedings{ID110380,
author = {Raanan, Cathrine and Philomena, Miigwan},
title = {Inproceedings title},
booktitle = {Inproceedings book title},
year = {2020},
}
@phdthesis{ID833501,
author = {Davie, Ulyana},
title = {PhD thesis title},
school = {PhD school name},
year = {2020},
}
Links informing the standard¶
First source: bibtex.com (required and optional fields are not specified)
Better source: bibtex.eu
However, the specifications are not consistent across different BibTeX managers, e.g., “field” or “manual” in the following tool: Bib-it
Listing of field variables and the entry types in which they are required: https://www.bibtex.com/format/fields/
Examples of different fields and descriptions: https://www.nlm.nih.gov/bsd/mms/medlineelements.html
BibTeX Definition in Web Ontology Language (OWL) Version 0.2