Evaluation

We continuously evaluate tools for duplicate removal in bibliographic datasets. The evaluation script is executed weekly by a GitHub workflow. Detailed results are exported to a CSV file, and aggregated summaries are exported to current_results.md.

Tools are ranked according to false-positive rate. The lower the false-positive rate, the better the tool.

Rank	Tool	Type	Comment
1	BibDedupe	Python library	🏅 State-of-the-art (Python)
2	ASySD	R package	🏅 State-of-the-art (R)
3	Buhos	Web-based (Ruby)	☢️ High FP error rate ❓ Evaluation incomplete ⌛ Long runtime
4	ASReview Data Dedup	Python library	☢️ High FP error rate
NA	dedupe.io / pandas-dedupe	Python library	😩 Requires manual training ❓ Performance not reproducible
NA	deduklick	Proprietary software	❓ Performance unknown 🚫 No programmatic access 🔒 Proprietary code
NA	SRA-Dedupe	Proprietary software	❓ Performance unknown 🚫 No programmatic access
NA	Endnote	Proprietary software (local)	😩 Requires manual review ❓ Performance unknown 🚫 No programmatic access 🔒 Proprietary code
NA	Covidence	Proprietary software (web-based)	❓ Performance unknown 🚫 No programmatic access 🔒 Proprietary code
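
The ranking criterion can be made concrete with a short sketch. It assumes the textbook definition of the false-positive rate, FP / (FP + TN); the evaluation script may compute it differently. The `Counts` class, the function names, and the example counts are hypothetical and are not taken from the evaluation results.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one tool (illustrative values only)."""

    tp: int
    fp: int
    tn: int
    fn: int


def false_positive_rate(c: Counts) -> float:
    # Assumed textbook definition: share of true non-duplicates
    # that were erroneously removed as duplicates.
    denominator = c.fp + c.tn
    return c.fp / denominator if denominator else 0.0


def rank_tools(results: dict[str, Counts]) -> list[str]:
    # The lower the false-positive rate, the better the rank.
    return sorted(results, key=lambda name: false_positive_rate(results[name]))


# Hypothetical counts for two made-up tools.
example = {
    "tool_a": Counts(tp=90, fp=5, tn=895, fn=10),
    "tool_b": Counts(tp=95, fp=1, tn=899, fn=5),
}
print(rank_tools(example))  # ['tool_b', 'tool_a']
```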

Datasets

Dataset	Reference	Size (n)	Status
cytology_screening	[4]	1,856	Included (based on OSF)
haematology	[4]	1,415	Included (based on OSF)
respiratory	[4]	1,988	Included (based on OSF)
stroke	[4]	1,292	Included (based on OSF)
cardiac	[2]	8,948	Included (based on OSF)
depression	[2]	79,880	Included (based on OSF)
diabetes	[2]	1,845	Included (based on OSF)
neuroimaging	[2]	3,438	Included (based on OSF)
srsr	[2]	53,001	Included (based on OSF)
digital_work	[5]	7,159	Included
TBD	[1]	NA	Requested: 2023-11-14
TBD	[3]	NA	Requested: 2023-11-14

Note

The SYNERGY datasets are not useful for evaluating duplicate identification algorithms because they only contain IDs, so the associated metadata would have no variance.

Duplicate definition

Duplicates are defined as potentially differing bibliographic representations of the same real-world record (Rathbone et al. 2015 [4]). This conceptual definition is operationalized as follows. The following are considered duplicates:

Papers referring to the same record (by definition)
Paper versions, including the author’s original, submitted, accepted, proof, and corrected versions
Papers that are continuously updated (e.g., versions of Cochrane reviews)
Papers with different DOIs if they refer to the same record (e.g., redundantly registered DOIs for online and print versions)

The following are considered non-duplicates:

Papers reporting on the same study if they are published separately (e.g., involving different stages of the study such as pilots and protocols, or differences in outcomes, interventions, or populations)
A conference paper and its extended journal publication
A journal paper and a reprint in another journal

Note that the focus is on duplicates of bibliographic records. Linking multiple records that report results from the same study is typically done in a separate step after full-text retrieval, using information from the full-text document, querying dedicated registers, and potentially corresponding with the authors [see Higgins et al., section 4.6.2].

The datasets may have applied a different understanding of duplicates. We double-checked borderline cases to make sure that the duplicate pairs in the data correspond to our definition.

Rathbone et al. (2015) [4]: “A duplicate record was defined as being the same bibliographic record (irrespective of how the citation details were reported, e.g. variations in page numbers, author details, accents used or abridged titles). Where further reports from a single study were published, these were not classed as duplicates as they are multiple reports which can appear across or within journals. Similarly, where the same study was reported in both journal and conference proceedings, these were treated as separate bibliographic records.”
Borissov et al. (2022) [1]: “Following a standardized definition [6, 7, 9], we defined one or more duplicates as an existing unique record having the same title, authors, journal, DOI, year, issue, volume, and page number range metadata.”

Evaluation: Dataset model and confusion matrix

Record list before de-duplication

ID	Author	Title
1	John Doe	Introduction to Data Science
2	Smith	the art of problem solving
3	Jane A. Smith	The Art of Problem Solving
4	Jane M. Smith	the art of problem solving
5	Alex Johnson	beyond the basics: advanced programming

Duplicate matrix:

	2	3
1
2
3	X
4	X	X
5

Components:

ID	Component
1	c_1
2	c_2
3	c_2
4	c_2
5	c_3
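
The step from the duplicate matrix to the component table can be sketched as a connected-components computation. This is a minimal illustration using union-find, not necessarily how bib_dedupe derives components; the function name `connected_components` is hypothetical.

```python
def connected_components(ids, duplicate_pairs):
    """Group record IDs into components of transitively linked duplicates."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one, so that the
            # lowest ID ends up representing the component.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return [groups[root] for root in sorted(groups)]


ids = [1, 2, 3, 4, 5]
pairs = [(3, 2), (4, 2), (4, 3)]  # the X cells of the duplicate matrix
components = connected_components(ids, pairs)
print(components)  # [[1], [2, 3, 4], [5]]

# Label components in the style of the table: c_1, c_2, c_3.
labels = {i: f"c_{n}" for n, group in enumerate(components, start=1) for i in group}
print(labels)  # {1: 'c_1', 2: 'c_2', 3: 'c_2', 4: 'c_2', 5: 'c_3'}
```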

Record list without duplicates:

ID	Author	Title
1	John Doe	Introduction to Data Science
2	Smith	the art of problem solving
5	Alex Johnson	beyond the basics: advanced programming

Note: Instead of paper 2, papers 3 or 4 could have been retained. It is not pre-determined which duplicates are retained or removed. That makes the evaluation challenging because the following list would also be correct:

ID	Author	Title
1	John Doe	Introduction to Data Science
4	Jane M. Smith	the art of problem solving
5	Alex Johnson	beyond the basics: advanced programming

We use the compare_dedupe_id() method of bib_dedupe.dedupe_benchmark, which compares sets.

Given the set of duplicate IDs did = [2,3,4] as the ground truth, it is evident that only one of the IDs should be retained in the merged list ml (although any selection among the IDs in did would be valid).

If none of the duplicate IDs is retained in ml, one record is counted as a false positive (FP), i.e., a record that was erroneously removed as a duplicate. The remaining (len(did)-1) records are counted as true positives (TP).
Otherwise, the first duplicate ID that is retained is counted as a true negative (TN), i.e., a record correctly marked as a non-duplicate. Additional IDs from did that are retained in ml are counted as false negatives (FN) because they should have been removed. The remaining records from did that are not in ml are counted as true positives (TP) because they were correctly removed.
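
The counting rules for a single set of duplicate IDs can be sketched as follows. This is an illustration of the rules as described in the text, not the implementation of compare_dedupe_id(); the function name `count_for_duplicate_set` and its signature are hypothetical.

```python
def count_for_duplicate_set(did, ml):
    """Apply the counting rules to one set of duplicate IDs.

    did: IDs that are duplicates of each other (ground truth).
    ml:  IDs retained in the merged list produced by a tool.
    """
    merged = set(ml)
    retained = [i for i in did if i in merged]
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}

    if not retained:
        # Every copy was removed: one record is lost erroneously (FP),
        # the other removals were correct (TP).
        counts["FP"] = 1
        counts["TP"] = len(did) - 1
    else:
        # One retained copy is correct (TN), further retained copies
        # should have been removed (FN), removed copies are correct (TP).
        counts["TN"] = 1
        counts["FN"] = len(retained) - 1
        counts["TP"] = len(did) - len(retained)
    return counts


did = [2, 3, 4]
print(count_for_duplicate_set(did, [1, 2, 5]))     # {'TP': 2, 'FP': 0, 'TN': 1, 'FN': 0}
print(count_for_duplicate_set(did, [1, 4, 5]))     # {'TP': 2, 'FP': 0, 'TN': 1, 'FN': 0}
print(count_for_duplicate_set(did, [1, 2, 3, 5]))  # {'TP': 1, 'FP': 0, 'TN': 1, 'FN': 1}
print(count_for_duplicate_set(did, [1, 5]))        # {'TP': 2, 'FP': 1, 'TN': 0, 'FN': 0}
```

Because only membership in ml is checked, the two equally valid merged lists (retaining paper 2 or paper 4) yield identical counts.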
