1 of 54

Introduction to Bibliometric Data Sources

Slides by: Nicolas Robinson-Garcia

2 of 54

What will we learn?

🤓

Learn which the main bibliometric databases are
Understand their strengths and weaknesses
Learn how to retrieve bibliometric data and assess its quality
Be able to critically decide which database is most suitable for a bibliometric analysis

3 of 54

The last 20 years have witnessed an explosion of new scientific data sources

Timeline (source logos not reproduced): 1964, 2000, 2004, 2009, 2011, 2012, 2013, 2015, 2017, 2018, 2022

4 of 54

But we only have 45 minutes to go through them…

… so we will focus on these four ⬆️⬆️⬆️

Timeline (source logos not reproduced): 1964, 2000, 2004, 2009, 2011, 2012, 2013, 2015, 2017, 2018, 2022

😥

5 of 54

It is not only about bibliometric data sources, but also about the accessibility and manipulation of the data

Still, a lack of transparency and reproducibility hinders the responsible use of metrics

6 of 54

In this session we will focus on just four of these new data sources

Non-traditional bibliometric data source
Free AND OPEN access
Transparent
Limited quality control
≠ options

Search engine
Free NOT OPEN access
Lack of transparency
Download only available through web scraping or third-party software

Traditional bibliometric data source
Requires subscription
Quality control
≠ options but API not well-documented

Traditional bibliometric data source
Requires subscription
Quality control
≠ options

7 of 54

Web of Science

Indexing and coverage
Journal Citation Reports
Metadata
Data retrieval process

8 of 54

Indexing and coverage

Created in 1964 by E. Garfield
Multidisciplinary
English-language bias
+90M records
+2B cited references
+22K scientific journals
Databases included:

Core Collection: SCIE, SSCI, AHCI, ESCI, BKCI, CPCI, DCI
Others: SciELO, MEDLINE…

12 of 54

Indexing and coverage

How can we download the complete list of journals?

Create a personal account
Click on Products
Go to Master Journal List
In Downloads you can get an Excel file with lists of journals per Citation Index + JCR + Essential Science Indicators

😋

16 of 54

Journal Citation Reports

The JCR is a valuable source of journal-level metrics

Let’s explore the information provided

Click on the filtering options to refine your results

Add up to 23 different indicators

Download the top 600 journals!

18 of 54

Metadata

The bread and butter of bibliometrics

Bibliographic fields are key to:

Improving data retrieval
Analyzing and describing the dataset
Building variables and indicators

19 of 54

Data retrieval process

Complete bibliographic records: at most 500 records per download
Basic fields only (authors, title, source): at most 5,000 records per download
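Because of these caps, a large result set has to be exported in consecutive batches. The sketch below only computes the record ranges to request; the function name `export_batches` is made up for illustration, and the default of 500 is taken from the limit quoted above.

```python
def export_batches(total_records, batch_size=500):
    """Return 1-based, inclusive (first, last) record ranges covering all records.

    batch_size defaults to the full-record limit quoted on the slide (500);
    pass 5000 when exporting basic fields only.
    """
    if total_records < 0 or batch_size <= 0:
        raise ValueError("total_records must be >= 0 and batch_size > 0")
    return [
        (first, min(first + batch_size - 1, total_records))
        for first in range(1, total_records + 1, batch_size)
    ]


# Example: a search returning 1,200 records needs three full-record exports.
for first, last in export_batches(1200):
    print(f"Export records {first} to {last}")
```

Each printed range corresponds to one manual export in the interface.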

20 of 54

Scopus

Indexing and coverage
Journal-level metrics
Metadata
Data retrieval process

21 of 54

Indexing and coverage

Owned by Elsevier and launched in 2004
Main competitor of WoS along with Dimensions
Includes its own set of journal-level indicators
Single citation index without subcollections

Singh et al., 2021. The journal coverage of Web of Science, Scopus and Dimensions: A comparative analysis. Scientometrics 126:5113-5142

22 of 54

Indexing and coverage

Scopus adds new journals two to three times a year.
In 2009 it created a Scopus Content Selection and Advisory Board to avoid potential accusations of conflict of interest.
The master list of journals and books can be downloaded here.

23 of 54

Journal-level metrics

Scopus includes three types of journal-level metrics:

SCImago Journal Rank (SJR). Weights citations using an adaptation of the PageRank algorithm.
Source Normalized Impact per Paper (SNIP). Normalizes citations to allow comparison between fields.
CiteScore. Follows a process very similar to that of the Journal Impact Factor.
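To see why CiteScore and the Journal Impact Factor are so similar, here is a minimal sketch. It assumes the commonly described definitions: a two-year window for the Journal Impact Factor and a four-year window for CiteScore. All figures are invented for a hypothetical journal, and the real indicators apply further rules on which document types count.

```python
def citations_per_document(citations, documents):
    """Both indicators reduce to a citations-to-documents ratio."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return citations / documents


# Invented figures for a hypothetical journal.
documents_by_year = {2019: 90, 2020: 100, 2021: 120, 2022: 130}

# JIF-style ratio for 2022: citations received in 2022 by items
# published in 2020-2021, divided by the items published in 2020-2021.
citations_2022_to_2020_2021 = 440
jif_like = citations_per_document(
    citations_2022_to_2020_2021,
    documents_by_year[2020] + documents_by_year[2021],
)

# CiteScore-style ratio for 2022: citations received in 2019-2022 by
# documents published in 2019-2022, divided by those documents.
citations_2019_2022_to_window = 1100
citescore_like = citations_per_document(
    citations_2019_2022_to_window,
    sum(documents_by_year.values()),
)

print(jif_like, citescore_like)
```

The two ratios differ mainly in the length of the window and in which citing years are counted, which is why their values are related but not interchangeable.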

24 of 54

Metadata

Bibliographic records have a structure similar to that of WoS.
Note that there are differences in field labels, document types, classification schemes, etc.
Combining data from Scopus and WoS therefore requires further data processing and homogenization.
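A minimal sketch of that homogenization step follows. It assumes WoS exports use two-letter field tags (AU, TI, SO, PY, DI, TC) and Scopus CSV exports use column names such as "Authors" and "Source title"; check both against your own export files. The sample records, the common field names and the helper functions are invented for illustration.

```python
# Assumed export labels mapped onto a shared, made-up schema.
WOS_TO_COMMON = {
    "AU": "authors", "TI": "title", "SO": "source",
    "PY": "year", "DI": "doi", "TC": "times_cited",
}
SCOPUS_TO_COMMON = {
    "Authors": "authors", "Title": "title", "Source title": "source",
    "Year": "year", "DOI": "doi", "Cited by": "times_cited",
}


def harmonize(record, mapping, database):
    """Rename fields to the shared schema and normalize the DOI."""
    common = {new: str(record.get(old) or "").strip() for old, new in mapping.items()}
    common["doi"] = common["doi"].lower()  # DOIs are case-insensitive
    common["database"] = database
    return common


def merge_by_doi(records):
    """Keep the first record seen per DOI; records without a DOI are kept as-is."""
    by_doi, no_doi = {}, []
    for record in records:
        if record["doi"]:
            by_doi.setdefault(record["doi"], record)
        else:
            no_doi.append(record)
    return list(by_doi.values()) + no_doi


# Invented example: the same paper exported from both databases.
wos_record = {"AU": "Doe, J", "TI": "An example title", "SO": "EXAMPLE JOURNAL",
              "PY": "2020", "DI": "10.1234/EXAMPLE.1", "TC": "12"}
scopus_record = {"Authors": "Doe J.", "Title": "An example title",
                 "Source title": "Example Journal", "Year": "2020",
                 "DOI": "10.1234/example.1", "Cited by": "15"}

combined = merge_by_doi([
    harmonize(wos_record, WOS_TO_COMMON, "wos"),
    harmonize(scopus_record, SCOPUS_TO_COMMON, "scopus"),
])
print(len(combined), combined[0]["database"])
```

Note that citation counts differ between the two sources, so a real merge also has to decide which database's value to keep.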

25 of 54

Metadata - Author profiles

A nice feature of Scopus is its algorithmically generated Author Profiles.
Each profile offers a list of author-level metrics (e.g., h-index, total publications).
It also includes other author-level metadata (e.g., affiliation changes).
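As a reminder of what one of these author-level metrics measures, the h-index can be computed from a list of citation counts. This is a minimal sketch of the standard definition (the largest h such that h publications have at least h citations each), not Scopus's own implementation.

```python
def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


# Five papers with 10, 8, 5, 4 and 3 citations give an h-index of 4.
print(h_index([10, 8, 5, 4, 3]))
```

Because the result depends entirely on which publications and citations a database has indexed, the same author will usually have a different h-index in each data source.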

27 of 54

Data retrieval process

Scopus also offers several download formats
Once you have selected a format, you indicate the fields you need
Here the maximum number of records is 20,000

28 of 54

Google Scholar

Indexing and coverage
Usability
Google Scholar Citations
Metadata
Data retrieval process
Publish or Perish

29 of 54

Indexing and coverage

Unlike other data sources, Google Scholar is a search engine and not a database.

This means that data in Google Scholar are dynamic and volatile.

It also means that there is no quality control of the metadata or the indexed records.

30 of 54

Indexing and coverage

Which sources does Google Scholar index?

Scientific publishers’ websites

Including predatory publishers

Research-like institutions

Any article-like PDF document hosted under the web domains of universities, research centres, etc.

Repositories

Provided they fulfil certain technical criteria. For instance, Zenodo, the EU repository, is not indexed in Google Scholar

31 of 54

Indexing and coverage

32 of 54

Usability

Unlike other scientific data sources, Google Scholar’s interface is a simple search engine with limited ‘advanced search’ options

These options are also available through the use of commands, e.g., author: allintitle: source:
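A minimal sketch of how such commands can be combined into a single query string is shown below. The operators are the ones named above; the helper names, the quoting of phrases and the ordering of the operators are assumptions, and Google Scholar does not document their exact behaviour.

```python
from urllib.parse import urlencode


def build_scholar_query(title_words=None, author=None, source=None):
    """Combine the search commands from the slide into one query string."""
    parts = []
    if title_words:
        parts.append("allintitle: " + " ".join(title_words))
    if author:
        parts.append(f'author:"{author}"')
    if source:
        parts.append(f'source:"{source}"')
    return " ".join(parts)


def scholar_search_url(query):
    """URL that opens the query in the regular Google Scholar web interface."""
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})


query = build_scholar_query(["citation", "analysis"], author="garfield")
print(query)
print(scholar_search_url(query))
```

The resulting URL is meant to be opened in a browser; it is not a substitute for a download option.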

33 of 54

Usability

It also identifies and merges different versions of a document including preprints and OA versions

Unlike other scientific data sources, Google Scholar offers direct access to the full text of documents when available

34 of 54

Google Scholar Citations

One of the services best known among scientists is Google Scholar Citations (GS Citations). Let’s take a close look at how it handles documents and authors

35 of 54

Google Scholar Citations

36 of 54

Metadata

Authors with a GS profile have unique identifiers.

Record data are structured, signaling the use of metadata, but the metadata are often incomplete

Records also seem to have unique identifiers, but these differ between profiles and from those used in the Google Scholar search engine

The fact that DOIs are not included makes it difficult to accurately link records from different profiles
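Without DOIs, a common workaround is to link records on a normalized title plus the publication year. The sketch below illustrates the idea; it is not how Google Scholar works internally, and the record fields `title` and `year` as well as the helper names are assumptions. Exact matching on normalized titles will still miss truncated or misspelled titles.

```python
import re
import unicodedata


def title_key(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9]+", " ", without_accents.lower())
    return " ".join(cleaned.split())


def link_records(profile_a, profile_b):
    """Pair records from two profiles that share a normalized title and year."""
    index = {(title_key(r["title"]), r.get("year")): r for r in profile_b}
    pairs = []
    for record in profile_a:
        key = (title_key(record["title"]), record.get("year"))
        if key in index:
            pairs.append((record, index[key]))
    return pairs


# Invented records: the same paper as listed in two co-authors' profiles.
profile_a = [{"title": "Mapping Science: A Case-Study", "year": 2020}]
profile_b = [{"title": "mapping science - a case study", "year": 2020},
             {"title": "An unrelated paper", "year": 2019}]
print(len(link_records(profile_a, profile_b)))
```

Any pairs produced this way should be spot-checked by hand, which is part of the verification cost of working with Google Scholar data.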

37 of 54

Data retrieval process

One of the main issues of Google Scholar is the lack of a download option or API for data collection

This does not mean data cannot be retrieved, but retrieval comes at a cost:

Data verification
Large-scale downloads

38 of 54

Publish or Perish

Created in 2007 by Anne-Wil Harzing, this software allows users to download up to 1,000 records from a Google Scholar search

39 of 54

OpenAlex

Origins
Source description
Metadata
Data retrieval

40 of 54

Origins

Microsoft Academic Graph (MAG), the direct competitor to Google Scholar, offered a fully open knowledge graph of over 225M records.

Although not as popular as Google Scholar among users, it became a promising data source for bibliometricians, as it combined the perks of both Google Scholar and traditional bibliometric databases (i.e., WoS, Scopus)

41 of 54

Origins

But in 2021, Microsoft decided to discontinue the project…

In response, the Arcadia Foundation funded OpenAlex, a project led by the non-profit organization OurResearch, to continue it.

In 2022, OpenAlex was released.

It does not only feed from the MAG project, but combines information from a variety of data sources.

42 of 54

Source description

43 of 54

Metadata

Although OpenAlex is built upon MAG, it has greatly improved the quality of its metadata

But of course, there is room for improvement

The project is still in its very early stages

🤌

44 of 54

Metadata

Since spring 2024 it has included a web interface, which is still changing constantly.

45 of 54

Metadata

Expect to find poorer quality in the metadata as well as incomplete bibliographic records.

46 of 54

Data retrieval

47 of 54

Data retrieval
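As a rough illustration of programmatic retrieval from OpenAlex, the sketch below pages through search results with cursor pagination. The endpoint and the parameter names (`search`, `per-page`, `cursor`, `mailto`) follow the public OpenAlex API documentation as I understand it. Treat them as assumptions and check the current documentation, since access rules such as rate limits or API keys may have changed. The helper names are made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"


def works_url(search, per_page=25, cursor="*", mailto=None):
    """Build a request URL for one page of works matching a search string."""
    params = {"search": search, "per-page": per_page, "cursor": cursor}
    if mailto:
        params["mailto"] = mailto  # optional contact address for polite use
    return OPENALEX_WORKS + "?" + urlencode(params)


def fetch_page(url):
    """Download and decode one JSON page (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_works(search, max_records=100, fetch=fetch_page, mailto=None):
    """Yield up to max_records works, following the cursor from page to page."""
    cursor, retrieved = "*", 0
    while cursor and retrieved < max_records:
        page = fetch(works_url(search, cursor=cursor, mailto=mailto))
        if not page["results"]:
            break
        for work in page["results"]:
            yield work
            retrieved += 1
            if retrieved >= max_records:
                return
        cursor = page["meta"].get("next_cursor")


# Usage (needs an internet connection):
# for work in iter_works("bibliometrics", max_records=50):
#     print(work["id"], work.get("display_name"))
```

Passing a different `fetch` function makes it possible to test the paging logic offline or to add caching and rate limiting.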