1 of 8

What Drives a Country’s Data Performance?

Using World Bank Statistical Performance Indicators (2023)

Mahya Tazike | Statistics for Data Science | Spring 2026

2 of 8

Audience and Objective

Audience

International development policymakers (World Bank, UN)

Objective

Identify which data capabilities are most strongly associated with overall statistical performance.

Source: World Bank Statistical Performance Indicators (SPI)

3 of 8

Data Overview

Countries: 187

Year Focus: 2023

Score Range: 0 - 100

Outcome: overall score

Sub-Dimensions: 5

The 5 Sub-Dimensions Measured:

Data Use: How data is used for decision-making and policy

Data Services: Availability and accessibility of data to users

Data Products: Quality and range of statistical outputs (reports, indicators)

Data Sources: Data collection systems (surveys, administrative data)

Data Infrastructure: Systems and tools supporting data storage and management

Source: World Bank Statistical Performance Indicators (SPI)

4 of 8

Distribution of Overall Statistical Performance (2023)

Key insight: Scores range widely (28 to 95). Analysis shows high-income countries score ~25 points higher than low-income countries.

Why histogram? Shows the distribution of overall scores across countries

What it shows: Most countries fall in the mid-to-high range (60–90), with noticeable variation across countries

5 of 8

Overall Statistical Performance by Income Group (2023)

Key insight: High-income countries score about 25 points higher than low-income countries on average, indicating a clear income-based gap in statistical performance.

Why boxplot? Compares the distribution of scores across different income groups

What it shows:

Median scores increase with income level
High-income countries have consistently higher scores
Lower-income groups show more variability
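The per-group comparison a boxplot draws can be sketched numerically. The snippet below is a minimal illustration using only the Python standard library; the `records` list and its values are hypothetical placeholders, not SPI data, and `summarize_by_group` is a helper name invented here.

```python
from statistics import median, quantiles

# Hypothetical (income_group, overall_score) pairs for illustration only.
# These are NOT actual SPI values.
records = [
    ("Low income", 42.0), ("Low income", 55.0),
    ("Low income", 61.0), ("Low income", 70.0),
    ("High income", 78.0), ("High income", 82.0),
    ("High income", 85.0), ("High income", 90.0),
]


def summarize_by_group(rows):
    """Return median, interquartile range, and count for each group."""
    groups = {}
    for group, score in rows:
        groups.setdefault(group, []).append(score)

    summary = {}
    for group, scores in groups.items():
        # quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3).
        q1, _, q3 = quantiles(scores, n=4)
        summary[group] = {
            "median": median(scores),
            "iqr": q3 - q1,  # box height: a rough measure of variability
            "n": len(scores),
        }
    return summary


summary = summarize_by_group(records)
for group, stats in summary.items():
    print(group, stats)
```

Plotting libraries may compute quartiles with a slightly different interpolation rule than `statistics.quantiles`, so box edges in a chart can differ a little from these numbers.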

6 of 8

Relationship Between Data Use and Overall Performance (2023)

Key insight: There is a strong positive relationship between data use and overall statistical performance.

Why scatterplot? Shows the relationship between two numeric variables

What it shows:

As data use increases, overall performance increases
The relationship appears roughly linear
Some variation exists, but the upward trend is clear
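The strength of a roughly linear trend like this one is commonly summarized with the Pearson correlation coefficient. The sketch below computes it from scratch with the standard library; the two score lists are made-up illustrative values, not SPI data, and `pearson_r` is a helper name chosen here. In a one-predictor linear regression, R-squared equals the square of this coefficient, so an R-squared of 0.735 corresponds to r of about 0.86.

```python
from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data use and overall scores (illustrative only, not SPI values).
data_use = [40.0, 55.0, 60.0, 72.0, 85.0, 90.0]
overall = [48.0, 60.0, 58.0, 74.0, 80.0, 88.0]

r = pearson_r(data_use, overall)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```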

7 of 8

Analysis: Hypothesis Test and Regression

Part A: Hypothesis Test

Question: Do high-income countries score significantly higher overall than low-income countries?

H₀: No difference in mean overall score between the two groups
H₁: High-income countries have a higher mean overall score

Method: Two-sample t-test

Result: p-value = 2.76e-10 → highly significant difference

Mean high-income: 81.2 | Mean low-income: 56.4
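A two-sample t-test of this kind can be sketched as follows. The slide does not say whether the pooled or the unequal-variance (Welch) version was used; this example assumes Welch's version, and the score lists are hypothetical placeholders rather than SPI data. It computes only the test statistic and the degrees of freedom, since the standard library has no t-distribution for the p-value.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    A positive t means the mean of `a` exceeds the mean of `b`.
    """
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (fmean(a) - fmean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical overall scores (illustrative only, not SPI values).
high_income = [80.0, 82.0, 84.0, 86.0]
low_income = [50.0, 55.0, 60.0, 65.0]

t_stat, dof = welch_t(high_income, low_income)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

If SciPy is available, `scipy.stats.ttest_ind(high_income, low_income, equal_var=False, alternative="greater")` runs the same Welch test and also returns the one-sided p-value matching H₁.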

Part B: Regression Model

Question:�Which sub-dimension is most strongly associated with overall score?

Step 1 (primary): overall_score ~ data_use_score

Step 2 (extended): overall_score ~ data_use_score + data_products_score

Interpretation: A 1-point increase in data use score is associated with a 0.75-point increase in overall score.

Model 1 R-squared = 0.735 (73.5% of variation explained)

Model 2 R-squared = 0.812 (81.2% with data products added)

Why regression? Variables are numeric, and we want to measure the strength of association with overall score.
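The two-step comparison can be sketched with ordinary least squares written out by hand. This is a minimal standard-library illustration, not the code behind the reported results: the score lists are hypothetical, and `fit_one` and `fit_two` are helper names invented here. Because R-squared cannot decrease when a predictor is added to a least-squares model, the size of the increase matters more than its direction.

```python
from statistics import fmean


def centered(values):
    """Subtract the mean so the intercept drops out of the algebra."""
    m = fmean(values)
    return [v - m for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_one(x, y):
    """Slope and R-squared for y ~ x."""
    xc, yc = centered(x), centered(y)
    slope = dot(xc, yc) / dot(xc, xc)
    sse = sum((b - slope * a) ** 2 for a, b in zip(xc, yc))
    return slope, 1 - sse / dot(yc, yc)


def fit_two(x1, x2, y):
    """Coefficients and R-squared for y ~ x1 + x2 (2x2 normal equations)."""
    a, b, yc = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, yc), dot(b, yc)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    sse = sum((yi - b1 * ai - b2 * bi) ** 2 for ai, bi, yi in zip(a, b, yc))
    return (b1, b2), 1 - sse / dot(yc, yc)


# Hypothetical scores (illustrative only, not SPI values).
data_use = [40.0, 55.0, 60.0, 72.0, 85.0, 90.0]
data_products = [50.0, 45.0, 70.0, 65.0, 80.0, 95.0]
overall = [48.0, 60.0, 58.0, 74.0, 80.0, 88.0]

slope, r2_step1 = fit_one(data_use, overall)
coefs, r2_step2 = fit_two(data_use, data_products, overall)
print(f"Step 1: slope = {slope:.3f}, R-squared = {r2_step1:.3f}")
print(f"Step 2: coefficients = {coefs}, R-squared = {r2_step2:.3f}")
```

For real data, a library such as statsmodels or scikit-learn would also give standard errors and diagnostics, which this sketch omits.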

Based on the patterns observed in the EDA, I now perform formal statistical analysis. First, I test whether high-income and low-income countries differ significantly in overall performance. The p-value is extremely small, about 2.76 times 10 to the negative 10, which indicates a highly significant difference. On average, high-income countries score about 81, compared to about 56 for low-income countries, confirming the gap we saw earlier.

Next, I use regression to quantify relationships between variables. In the simple model, a one-point increase in data use score is associated with about a 0.75-point increase in overall score. This model explains about 73.5% of the variation in performance.

When adding data products, the model improves, explaining about 81% of the variation. Both variables show a strong positive association with overall performance.

8 of 8

Key Findings and Recommendations

01

Income gap is significant

High-income countries score ~25 points higher than low-income countries on average.

02

Data use shows the strongest association

Countries with higher data use scores tend to have higher overall performance.

03

Data products also matter

Including data products adds explanatory value to the model.

Recommendation:

International organizations should prioritize data use capacity and data products when supporting lower-income countries. These dimensions show the strongest association with overall statistical performance.

Limitations:

Cross-sectional analysis (2023 only). Relationships are associative, not causal.

To summarize, there are three key findings from this analysis. First, there is a significant income gap: high-income countries score about 25 points higher than low-income countries on average. Second, data use shows the strongest association with overall performance. Countries that use data more effectively tend to perform better overall. Third, data products also play an important role, adding explanatory value to the model.

Based on these findings, the main recommendation is that international organizations should prioritize investment in data use capacity and data products when supporting lower-income countries.

Finally, it is important to note that this analysis is based on cross-sectional data from 2023, and the results show associations rather than causal relationships.

Thank you for your time and attention.