Evaluation

Evaluation

We run a continuous evaluation of tools for duplicate removal in bibliographic datasets. The evaluation script is executed by a GitHub workflow on a weekly basis. Detailed results are exported to a csv file, and aggregated summaries are exported to current_results.md.

Tools are ranked according to false-positive rate. The lower the false-positive rate, the better the tool.

Rank

Tool

Type

Comment

1

BibDedupe

Python library

NA

2

ASySD

R package

NA

3

ASReview Data Dedup

Python library

NA

NA

Endnote

Proprietary software (local)

No programmatic access (prevents automated evaluation)

NA

Covidence

Proprietary software (web-based)

No programmatic access (prevents automated evaluation)

Datasets

Dataset

Reference

Size (n)

Status

cytology_screening

[4]

1,856

Included (based on OSF)

haematology

[4]

1,415

Included (based on OSF)

respiratory

[4]

1,988

Included (based on OSF)

stroke

[4]

1,292

Included (based on OSF)

cardiac

[2]

8,948

Included (based on OSF)

depression

[2]

79,880

Included (based on OSF)

diabetes

[2]

1,845

Included (based on OSF)

neuroimaging

[2]

3,438

Included (based on OSF)

srsr

[2]

53,001

Included (based on OSF)

digital_work

[5]

7,159

Included

TBD

[1]

NA

Requested: 2023-11-14

TBD

[3]

NA

Requested: 2023-11-14

Note

The SYNERGY datasets are not useful to evaluate duplicate identification algorithms because they only contain IDs, and the associated metadata would have no variance.

Duplicate definition

Duplicates are defined as potentially differing bibliographic representations of the same real-world record (Rathbone et al. 2015,[4]). This conceptual definition is operationalized as follows. The following are considered duplicates:

  • Papers referring to the same record (per definition)

  • Paper versions, including the author’s original, submitted, accepted, proof, and corrected versions

  • Papers that are continuously updated (e.g., versions of Cochrane reviews)

  • Papers with different DOIs if they refer to the same record (e.g., redundantly registered DOIs for online and print versions)

The following are considered non-duplicates:

  • Papers reporting on the same study if they are published separately (e.g., involving different stages of the study such as pilots and protocols, or differences in outcomes, interventions, or populations)

  • A conference paper and its extended journal publication

  • A journal paper and a reprint in another journal

It is noted that the focus is on duplicates of bibliographic records. The linking of multiple records reporting results from the same study is typically done in a separate step after full-text retrieval, using information from the full-text document, querying dedicated registers, and potentially corresponding with the authors [see Higgins et al. sections 4.6.2 and 4.6.2].

The datasets may have applied a different understanding of duplicates. We double-checked borderline cases to make sure that the duplicate pairs in the data correspond to our definition.

  • Rathbone et al. (2015) [4]: “A duplicate record was defined as being the same bibliographic record (irrespective of how the citation details were reported, e.g. variations in page numbers, author details, accents used or abridged titles). Where further reports from a single study were published, these were not classed as duplicates as they are multiple reports which can appear across or within journals. Similarly, where the same study was reported in both journal and conference proceedings, these were treated as separate bibliographic records.”

  • Borissov et al. (2022) [1]: “Following a standardized definition [6, 7, 9], we defined one or more duplicates as an existing unique record having the same title, authors, journal, DOI, year, issue, volume, and page number range metadata.”

Evaluation: Dataset model and confusion matrix

Record list before de-duplication

ID

Author

Title

1

John Doe

Introduction to Data Science

2

  1. Smith

the art of problem solving

3

Jane A. Smith

The Art of Problem Solving

4

Jane M. Smith

the art of problem solving

5

Alex Johnson

beyond the basics: advanced programming

Duplicate matrix:

1

2

3

4

5

1

2

3

X

4

X

X

5

Components:

ID

Component

1

c_1

2

c_2

3

c_2

4

c_2

5

c_3

Record list without duplicates:

ID

Author

Title

1

John Doe

Introduction to Data Science

2

  1. Smith

the art of problem solving

5

Alex Johnson

beyond the basics: advanced programming

Note: Instead of paper 2, papers 3 or 4 could have been retained. It is not pre-determined which duplicates are retained or removed. That makes the evaluation challenging because the following list would also be correct:

ID

Author

Title

1

John Doe

Introduction to Data Science

4

Jane M. Smith

the art of problem solving

5

Alex Johnson

beyond the basics: advanced programming

We use the compare_dedupe_id() method of bib_dedupe.dedupe_benchmark, which compares sets.

Given the set of duplicate IDs did = [2,3,4] as the ground truth, it is evident that only one of the IDs should be retained in the merged list ml (although any selection among the IDs in did would be valid).

  • If none of the duplicate IDs is retained, there is one false positive (FP), i.e., a record that was erroneously removed as a duplicate. The remaining (len(did)-1) records are counted as true positives (TP).

  • The first duplicate ID that is retained is counted as the true negative (TN), i.e., the record correctly marked as a non-duplicate. Additional records in ml are marked as false negatives (FN) because they should have been removed. Remaining records from did that are not in ml are marked as true positives (TP) because they were correctly removed from ml.

References