1 of 54

Introduction to Bibliometric Data Sources

Slides by: Nicolas Robinson-Garcia

2 of 54

What will we learn?

🤓

  1. Identify the main bibliometric databases
  2. Understand their strengths and weaknesses
  3. Learn how to retrieve bibliometric data and assess its quality
  4. Be able to critically decide which database is most suitable for a bibliometric analysis

3 of 54

The last 20 years have witnessed an explosion of new scientific data sources

[Timeline of data sources by launch year: 1964, 2000, 2004, 2009, 2011, 2012, 2013, 2015, 2017, 2018, 2022]

4 of 54

But we only have 45 minutes to go through them…

… so we will focus on these four ⬆️⬆️⬆️

[Timeline of data sources by launch year: 1964, 2000, 2004, 2009, 2011, 2012, 2013, 2015, 2017, 2018, 2022]

😥

5 of 54

It is not only about bibliometric data sources, but also about how accessible the data are and how easily they can be manipulated

Still, a lack of transparency and reproducibility hinders the responsible use of metrics

6 of 54

In this session we will focus on just four of these new data sources

  • Non-traditional bibliometric data source
  • Free AND OPEN access
  • Transparent
  • Limited quality control
  • ≠ options

  • Search engine
  • Free NOT OPEN access
  • Lack of transparency
  • Download only available through web scraping or third-party software

  • Traditional bibliometric data source
  • Requires subscription
  • Quality control
  • ≠ options but API not well-documented

  • Traditional bibliometric data source
  • Requires subscription
  • Quality control
  • ≠ options

7 of 54

Web of Science

  1. Indexing and coverage
  2. Journal Citation Reports
  3. Metadata
  4. Data retrieval process

8 of 54

Indexing and coverage

  • Created in 1964 by E. Garfield
  • Multidisciplinary
  • English-language bias
  • +90M records
  • +2B cited references
  • +22K scientific journals
  • Databases included:
    • Core Collection: SCIE, SSCI, AHCI, ESCI, BKCI, CPCI, DCI
    • Others: SciELO, MEDLINE…

12 of 54

Indexing and coverage

How can we download the complete list of journals?

  1. Create a personal account
  2. Click on Products
  3. Go to Master Journal List
  4. In Downloads you can get an Excel file with lists of journals per Citation Index + JCR + Essential Science Indicators

😋

16 of 54

Journal Citation Reports

The JCR is a valuable source of journal-level metrics

Let’s explore the information provided

Click on the filtering options to refine your results

Add up to 23 different indicators

Download the top 600 journals!

18 of 54

Metadata

The bread and butter of bibliometrics

Bibliographic fields are key to:

  1. Improving data retrieval
  2. Analyzing and describing the dataset
  3. Building variables and indicators

19 of 54

Data retrieval process

  • To download complete bibliographic records, the maximum number of records is 500
  • To download basic fields only (authors, title, source), the maximum number of records is 5,000
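
When a result set is larger than these limits, the download has to be split into consecutive record ranges. The following is a minimal sketch, assuming the limit applies to each export request and that exports can be requested by record range. The function name `export_batches` is hypothetical and not part of any Web of Science tool.

```python
def export_batches(total_records, batch_size=500):
    """Return 1-based, inclusive (first, last) record ranges covering the result set.

    batch_size=500 mirrors the limit for complete records mentioned above;
    use 5000 when only basic fields are needed.
    """
    if total_records < 0 or batch_size <= 0:
        raise ValueError("total_records must be >= 0 and batch_size must be > 0")
    return [
        (first, min(first + batch_size - 1, total_records))
        for first in range(1, total_records + 1, batch_size)
    ]


# 1,234 complete records need three exports
print(export_batches(1234))  # [(1, 500), (501, 1000), (1001, 1234)]

# With basic fields only, a single export is enough
print(export_batches(1234, batch_size=5000))  # [(1, 1234)]
```

Keeping the ranges explicit also makes it easier to document which export produced which file, which helps reproducibility.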

20 of 54

Scopus

  • Indexing and coverage
  • Journal-level metrics
  • Metadata
  • Data retrieval process

21 of 54

Indexing and coverage

  • Owned by Elsevier and launched in 2004
  • Main competitor of WoS along with Dimensions
  • Includes its own set of journal-level indicators
  • Single citation index without subcollections

22 of 54

Indexing and coverage

  • Scopus indexes new journals between 2 and 3 times a year.
  • In 2009 it created a Scopus Content Selection and Advisory Board to avoid potential accusations of conflict of interest.
  • The master list of journals and books can be downloaded here.

23 of 54

Journal-level metrics

Scopus includes three types of journal-level metrics:

  • SCImago Journal Rank (SJR). Weights citations using an adaptation of the PageRank algorithm.
  • Source Normalized Impact per Paper (SNIP). Normalizes citations to allow comparisons between fields.
  • CiteScore. Follows a calculation very similar to that of the Journal Impact Factor.
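
To see why CiteScore and the Journal Impact Factor are close relatives, it helps to write both as a ratio of citations to documents over a time window. The sketch below assumes the two-year window of the Journal Impact Factor and a four-year window for CiteScore, and it ignores the rules each provider applies about which document types count. All numbers are invented for illustration.

```python
def impact_factor(year, docs, cites):
    """Citations received in `year` by items published in the two previous years,
    divided by the number of items published in those two years."""
    window = (year - 1, year - 2)
    citations = sum(cites.get((year, pub), 0) for pub in window)
    items = sum(docs.get(pub, 0) for pub in window)
    return citations / items if items else 0.0


def citescore(year, docs, cites):
    """Citations received during the four years ending in `year` by documents
    published in those same four years, divided by the number of those documents."""
    window = range(year - 3, year + 1)
    citations = sum(cites.get((citing, pub), 0) for citing in window for pub in window)
    items = sum(docs.get(pub, 0) for pub in window)
    return citations / items if items else 0.0


# Invented journal: documents published per year
docs = {2020: 100, 2021: 100, 2022: 100, 2023: 100}

# Invented citation counts keyed by (citing year, publication year of the cited document)
cites = {
    (2020, 2020): 10,
    (2021, 2020): 90, (2021, 2021): 20,
    (2022, 2020): 180, (2022, 2021): 120, (2022, 2022): 30,
    (2023, 2020): 200, (2023, 2021): 250, (2023, 2022): 150, (2023, 2023): 40,
}

print(impact_factor(2023, docs, cites))  # (250 + 150) / 200
print(citescore(2023, docs, cites))      # 1090 / 400
```

The two functions differ only in the window they use, which is why the indicators tend to rank journals similarly without being interchangeable.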

24 of 54

Metadata

  • Bibliographic records have a similar structure to those in WoS
  • Bear in mind that there are differences in labeling, document types, classification scheme, etc.
  • Combining data from Scopus and WoS requires further data processing and homogenization.
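
A minimal sketch of what this homogenization can look like: field labels from each export are mapped to a common schema, and records are then merged by DOI. The field labels used here (two-letter tags for WoS, column headers for Scopus) are assumptions about the export formats and should be checked against the actual files. The records are invented.

```python
# Assumed field labels in each export, mapped to a common schema
WOS_FIELDS = {"TI": "title", "SO": "source", "PY": "year", "DI": "doi", "DT": "doc_type"}
SCOPUS_FIELDS = {
    "Title": "title",
    "Source title": "source",
    "Year": "year",
    "DOI": "doi",
    "Document Type": "doc_type",
}


def harmonize(record, field_map, database):
    """Rename fields to the common schema and normalize values used for matching."""
    out = {common: (record.get(label) or "").strip() for label, common in field_map.items()}
    out["doi"] = out["doi"].lower()            # DOIs are case-insensitive
    out["doc_type"] = out["doc_type"].lower()  # labels still differ between databases
    out["database"] = database
    return out


def merge_by_doi(records):
    """Keep one record per DOI and remember every database that indexes it."""
    merged = {}
    without_doi = []
    for rec in records:
        if not rec["doi"]:
            # Records without a DOI cannot be matched here and are kept separately
            without_doi.append({**rec, "databases": [rec["database"]]})
            continue
        entry = merged.setdefault(rec["doi"], {**rec, "databases": []})
        if rec["database"] not in entry["databases"]:
            entry["databases"].append(rec["database"])
    return list(merged.values()) + without_doi


# Invented example records
wos = [{"TI": "Mapping science", "SO": "SCIENTOMETRICS", "PY": "2020",
        "DI": "10.1000/ABC", "DT": "Article"}]
scopus = [
    {"Title": "Mapping science", "Source title": "Scientometrics", "Year": "2020",
     "DOI": "10.1000/abc", "Document Type": "Article"},
    {"Title": "Another paper", "Source title": "Some Journal", "Year": "2021",
     "DOI": "", "Document Type": "Review"},
]

records = [harmonize(r, WOS_FIELDS, "WoS") for r in wos]
records += [harmonize(r, SCOPUS_FIELDS, "Scopus") for r in scopus]
merged = merge_by_doi(records)

for rec in merged:
    print(rec["doi"] or "(no DOI)", rec["databases"])
```

Document types and classification schemes would still need a manually built crosswalk, since the two databases do not use the same categories.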

25 of 54

Metadata - Author profiles

  • A nice feature of Scopus is its algorithmically generated Author Profiles
  • They offer a list of author-level metrics (e.g., h-index, total publications).
  • They also include other author-level metadata (e.g., affiliation changes).
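
As a reminder of how one of these author-level metrics is built, here is a short sketch of the h-index: the largest number h such that the author has h publications with at least h citations each. The citation counts are invented.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Six publications with invented citation counts
print(h_index([25, 8, 5, 3, 3, 1]))  # prints 3
```

Because the value depends on which publications and citations a database indexes, the same author will usually have a different h-index in each data source.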

27 of 54

Data retrieval process

  • Scopus also offers different download formats
  • Once a format is selected, you indicate the fields you need
  • Here the maximum number of records is 20,000

28 of 54

Google Scholar

  • Indexing and coverage
  • Usability
  • Google Scholar Citations
  • Metadata
  • Data retrieval process
  • Publish or Perish

29 of 54

Indexing and coverage

Unlike other data sources, Google Scholar is a search engine and not a database.

This means that the data in Google Scholar are dynamic and volatile.

It also means that there is no quality control of the metadata or of the indexed records.

30 of 54

Indexing and coverage

Which sources does Google Scholar index?

  • Scientific publishers’ websites, including those of predatory publishers
  • Research-like institutions: any article-like PDF document hosted under the domain of universities, research centres, etc.
  • Repositories, although they need to fulfil certain technical criteria. For instance, Zenodo, the EU repository, is not indexed in Google Scholar

31 of 54

Indexing and coverage

32 of 54

Usability

Unlike other scientific data sources, Google Scholar’s interface is a simple search engine with limited ‘advanced search’ options

These options are also available through search operators, e.g., author:, allintitle:, source:
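
A small sketch of how these operators can be combined into a single search string. It only builds the query and the corresponding search URL and does not send any request. The helper name `scholar_query` is hypothetical, and the way Google Scholar interprets combined operators is an assumption that should be checked in the interface.

```python
from urllib.parse import urlencode


def scholar_query(title_words=None, author=None, source=None):
    """Combine Google Scholar search operators into one query string."""
    parts = []
    if title_words:
        # allintitle: restricts the words that follow to the document title
        parts.append("allintitle: " + " ".join(title_words))
    if author:
        parts.append(f'author:"{author}"')
    if source:
        parts.append(f'source:"{source}"')
    return " ".join(parts)


query = scholar_query(["bibliometric", "databases"], author="harzing")
print(query)

# The same query as a URL that can be pasted into a browser
print("https://scholar.google.com/scholar?" + urlencode({"q": query}))
```

Writing the query down in this form makes a search easier to report and repeat, even though the results themselves may change over time.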

33 of 54

Usability

Unlike other scientific data sources, Google Scholar offers direct access to the full text of documents when available

It also identifies and merges different versions of a document including preprints and OA versions

34 of 54

Google Scholar Citations

One of the services best known among scientists is Google Scholar Citations. Let’s take a close look at it to see how it handles documents and authors

35 of 54

Google Scholar Citations

36 of 54

Metadata

Authors with a GS profile have unique identifiers.

Record data is structured, signaling the use of metadata, but it is often incomplete

Records also seem to have unique identifiers, but these differ between profiles and from those used by the Google Scholar search engine

The fact that DOIs are not included makes it difficult to accurately link records from different profiles
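
Without DOIs, a common workaround is to match records on a normalized version of the title plus the publication year. The sketch below illustrates the idea with invented records. The field names `title` and `year` are assumptions about how the profile data has been collected, and this kind of matching will still miss titles that differ by more than case, accents or punctuation.

```python
import re
import unicodedata


def title_key(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(text.split())


def match_records(profile_a, profile_b):
    """Pair records from two profiles whose normalized title and year coincide."""
    index = {(title_key(r["title"]), r["year"]): r for r in profile_b}
    pairs = []
    for rec in profile_a:
        key = (title_key(rec["title"]), rec["year"])
        if key in index:
            pairs.append((rec, index[key]))
    return pairs


# Invented records from two co-authors' profiles
profile_a = [{"title": "Análisis bibliométrico: a case study", "year": 2019}]
profile_b = [
    {"title": "Analisis Bibliometrico - A Case Study", "year": 2019},
    {"title": "An unrelated paper", "year": 2019},
]

matches = match_records(profile_a, profile_b)
print(len(matches))  # prints 1
```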

37 of 54

Data retrieval process

One of the main issues with Google Scholar is the lack of a download option or API for data collection

This does not mean data cannot be retrieved, but it has a cost

  • Data verification
  • Large-scale downloads

38 of 54

Publish or Perish

Created in 2007 by Anne-Wil Harzing, this software allows users to download up to 1,000 records from a Google Scholar search

39 of 54

OpenAlex

  • Origins
  • Source description
  • Metadata
  • Data retrieval

40 of 54

Origins

Microsoft Academic Graph, the direct competitor to Google Scholar, offered a fully open knowledge graph of over 225M records.

Although not as popular as Google Scholar among users, it became a promising data source for bibliometricians as it included the perks of both Google Scholar and traditional bibliometric databases (i.e., WoS, Scopus)

41 of 54

Origins

But in 2021, Microsoft decided to discontinue the project…

In response, the Arcadia Foundation funded OpenAlex, a project led by the non-profit organization OurResearch, to continue it.

In 2022, OpenAlex was released.

It does not only feed from the MAG project, but also combines information from a variety of data sources.

42 of 54

Source description

43 of 54

Metadata

Although OpenAlex is built upon MAG, it has greatly improved the quality of its metadata

But of course, there is room for improvement

The project is still in its very early stages

🤌

44 of 54

Metadata

Since spring 2024, it has included a web interface, which is constantly changing.

45 of 54

Metadata

Expect to find poorer quality in the metadata as well as incomplete bibliographic records.

46 of 54

Data retrieval

51 of 54

Data retrieval

Let’s go through its different options and features…

  • Record type
  • Filtering options
  • API query
  • Export options
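
The record type and filtering options chosen in the interface translate into an API query. The following is a minimal sketch of how such a query URL can be assembled, assuming the public endpoint `https://api.openalex.org`, the `filter`, `per-page`, `cursor` and `mailto` parameters, and the filter names `publication_year` and `type`. These should be checked against the current OpenAlex documentation, since the project changes often. The helper `openalex_url` is hypothetical, and the code builds the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org"  # assumed public API endpoint


def openalex_url(entity="works", filters=None, per_page=200, cursor="*", mailto=None):
    """Build a query URL for one record type with optional filters.

    entity   -- record type, e.g. "works", "authors", "sources"
    filters  -- mapping of filter name to value, joined as name:value pairs
    cursor   -- "*" asks for the first page when paging through results
    mailto   -- optional contact e-mail to identify the requester
    """
    params = {}
    if filters:
        params["filter"] = ",".join(f"{name}:{value}" for name, value in filters.items())
    params["per-page"] = per_page
    params["cursor"] = cursor
    if mailto:
        params["mailto"] = mailto
    # Keep ":", "," and "*" readable instead of percent-encoding them
    return f"{BASE_URL}/{entity}?{urlencode(params, safe=':,*')}"


url = openalex_url(filters={"publication_year": 2022, "type": "article"})
print(url)
```

Saving the exact URL together with the retrieval date is a simple way to make a download reproducible.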

52 of 54

Which database should I choose?

Things to consider:

  1. Coverage
  2. Data quality
  3. Transparency and interpretability (e.g., classification scheme)
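
Coverage can be compared empirically by running the same query in each database and looking at how the retrieved sets overlap. The sketch below does this for sets of DOIs. The identifiers are invented placeholders and the function name `coverage_report` is hypothetical.

```python
def coverage_report(sources):
    """sources maps a database name to the set of DOIs retrieved for the same query."""
    union = set().union(*sources.values())
    report = {}
    for name, dois in sources.items():
        others = set().union(*(d for n, d in sources.items() if n != name))
        report[name] = {
            "records": len(dois),
            # Share of all records found anywhere that this database covers
            "share_of_union": len(dois) / len(union) if union else 0.0,
            # Records found in this database and in no other
            "unique": len(dois - others),
        }
    return report


# Invented placeholder identifiers
sources = {
    "WoS": {"doi-a", "doi-b", "doi-c"},
    "Scopus": {"doi-b", "doi-c", "doi-d"},
    "OpenAlex": {"doi-a", "doi-b", "doi-c", "doi-d", "doi-e"},
}

report = coverage_report(sources)
for name, stats in report.items():
    print(name, stats)
```

Larger coverage does not imply better data: the records unique to one source should still be checked for data quality before they are used.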

53 of 54

Final thoughts

  • Changing and growing landscape of bibliometric data sources
  • No ground truth, but different angles of it
  • Consider coverage, data quality and purpose

54 of 54

Thank you, questions?

☝️
