1 of 41

2 of 41

Introduction

Link to slides: https://bit.ly/mlab-ais2019

3 of 41

What is M-Lab?

  • People/Organization: A joint initiative between staff at Code for Science & Society, Google, and Princeton PlanetLab
  • Data: An open repository of user-contributed, longitudinal, open-source derived Internet infrastructure data
  • Infrastructure: A global infrastructure deployed, built, and run to aid in the creation of that data repository

4 of 41

M-Lab’s Mission

Measure the Internet.
Save the data.
Make it universally accessible and useful.

Note: we don’t measure the Internet by ourselves. People measure the Internet using their own computers and phones together with our servers; we collect the data and support them in their measurements.

5 of 41

M-Lab Principles

  • All measurements are active measurements
    • All the data is synthetic data; we take user privacy seriously.
    • Tests are client-initiated only; servers do not start tests on their own.
  • Experiments are primarily developed by academics, and are curated and approved by a panel of reviewers
  • Edge clients come from the community.
    • Anyone can develop them.
    • Anyone can run tests against the platform.
  • Openness
    • All of the data is released CC0.
    • All of the code is open source.

6 of 41

2018 numbers are projected as: current total / (7/12)

7 of 41

High capacity servers placed next to content

M-Lab measures user experience of the full route from user to content

8 of 41

Today — 500+ Servers in 130+ locations

9 of 41

In Africa

  • ~30 Servers
  • 7 Locations
    • Cape Town (TENET)
    • Johannesburg (TENET)
    • Maputo (Maluana Science and Technology Park)
    • Nairobi (KENET)
    • Antananarivo (Telecom Malagasy)
    • Tunis (ATI)
  • 3 10 Gbps Sites

10 of 41

Where do tests come from?

CIRA’s IPT, Google Search, Software Integrations (uTorrent), Router Integrations, Fingbox, Chrome Extension, the M-Lab Website

11 of 41

Where do tests come from?

  • Partnerships with Regulatory Agencies: Monitoring the progress of broadband deployment for policy-making.
  • Examples: United States (FCC & SamKnows), EC (Alladin.IT), Canada/CIRA, Greece, Cyprus, Thailand, Austria, Netherlands.

12 of 41

Running an M-Lab Speed Test

On demand:

  • Browser - speed.measurementlab.net
  • uTorrent - www.utorrent.com
  • Android - OONI Probe App in Play Store

Scheduled:

  • Docker - Raspberry Pi - github.com/m-lab/murakami
  • Chrome Extension - “M-Lab Measure”

13 of 41

Where does M-Lab have data from?

Let’s check the quick visualization site

14 of 41

How can the M-Lab datasets be useful to you?

  • ...to diagnose operational issues?
  • ...to provide data to policy-makers?
  • ...to do your own research into the Internet infrastructure?

15 of 41

Tutorial

16 of 41

Goal

By the end of this session, each of you will have constructed a unique query to gather and display M-Lab data of interest to you.

For example:

https://datastudio.google.com/s/nqN5k9ktVns

which visualizes the results of

https://console.cloud.google.com/bigquery?sq=754187384106:a67f3ad29f474169b1902b55de2b4e0d

17 of 41

Agenda

  • Introduce M-Lab datasets (NDT, PTR, Sidestream, Switch)
  • Access BigQuery and run queries
  • NDT in Depth
  • Visualize results in Data Studio
  • Learn about BigQuery Aggregate, Approximate Aggregate, and Statistical Functions
  • If we have time...
    • BigQuery Geographic Information Systems (GIS)
    • Combining with RIPE Atlas

18 of 41

M-Lab Datasets

19 of 41

Experiments

  • Glasnost: Max Planck Institute for Software Systems
  • MobiPerf: University of Michigan
  • Network Diagnostic Tool: Internet2
  • Neubot: Nexa Center for Internet and Society, Politecnico di Torino
  • NPAD: Pittsburgh Supercomputing Center
  • Reverse Traceroute: University of Washington
  • Paris Traceroute: University Pierre et Marie Curie
  • Project Bismark: Princeton University
  • ShaperProbe: Georgia Tech College of Computing
  • WindRider: Northwestern University

20 of 41

Datasets in BigQuery

21 of 41

Datasets

  • Network Diagnostic Tool: Internet2
  • Paris Traceroute: University Pierre et Marie Curie

New schema coming soon!

Access data schemas in BigQuery.

22 of 41

M-Lab Data in BigQuery - Free!

23 of 41

NDT

24 of 41

Network Diagnostic Tool - NDT

  • NDT measures “single stream performance” or “bulk transfer capacity” as defined in IETF’s RFC 3148

Fun facts and links

25 of 41

NDT

26 of 41

NDT BigQuery Schema

  • test_id, log_time, parse_time: metadata for every row
  • connection_spec.*: client metadata
  • connection_spec.client_geolocation.*: lat/lon, country, region, etc.
  • connection_spec.data_direction: 1 = download, 0 = upload
  • connection_spec.client.network.asn: client ASN
  • connection_spec.server.network.asn: M-Lab server ASN
  • web100_log_entry.connection_spec.*: server and client IP & ports
  • web100_log_entry.snap.*: Web100 metrics
  • web100_log_entry.snap.HCThruOctetsAcked: download byte count
  • web100_log_entry.snap.HCThruOctetsReceived: upload byte count
  • web100_log_entry.snap.SndLimTimeRwin: Receiver Limited Time
  • web100_log_entry.snap.SndLimTimeCwnd: Congestion Limited Time
  • web100_log_entry.snap.SndLimTimeSnd: Sender Limited Time
  • web100_log_entry.snap.CongSignals: total congestion events

27 of 41

  • NDT saves the variables used to calculate metrics like download and upload speed
  • Download Speed in Mbps:

    8 * (web100_log_entry.snap.HCThruOctetsAcked /
         (web100_log_entry.snap.SndLimTimeRwin +
          web100_log_entry.snap.SndLimTimeCwnd +
          web100_log_entry.snap.SndLimTimeSnd))

  • Upload Speed in Mbps:

    8 * (web100_log_entry.snap.HCThruOctetsReceived / web100_log_entry.snap.Duration)
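The two formulas above can be tried out on a single row with a few lines of Python. This is only a sketch: it assumes the byte counters are in bytes and the Web100 time fields are in microseconds (so bits per microsecond equals Mbps), and the function names and sample values are hypothetical.

```python
def download_mbps(octets_acked, rwin_us, cwnd_us, snd_us):
    """Download speed in Mbps: 8 * bytes / (sum of the three SndLimTime* fields)."""
    total_us = rwin_us + cwnd_us + snd_us
    if total_us <= 0:
        return None  # avoid dividing by zero on incomplete rows
    return 8 * octets_acked / total_us


def upload_mbps(octets_received, duration_us):
    """Upload speed in Mbps: 8 * bytes / Duration."""
    if duration_us <= 0:
        return None
    return 8 * octets_received / duration_us


# Hypothetical snapshot: 12.5 MB acknowledged over a 10 second test.
print(download_mbps(12_500_000, 2_000_000, 7_000_000, 1_000_000))  # 10.0
# Hypothetical snapshot: 1.25 MB received over a 10 second test.
print(upload_mbps(1_250_000, 10_000_000))  # 1.0
```

Returning None for a zero denominator plays the same role as SAFE_DIVIDE in a BigQuery query.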

28 of 41

Geolocation Annotations

  • IP address geolocation using MaxMind GeoLite2
  • Data is annotated after data collection
    • connection_spec.client_geolocation.*
    • connection_spec.server_geolocation.*
  • Client latitude / longitude represents the location of the ISP infrastructure providing the IP addresses, not individual household addresses
  • The locations are not precise, but if you have a better location dataset, you can join on IP address to get better geolocation information.
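As an illustration of joining on IP address, the sketch below replaces an existing annotation with an entry from a more precise location table when the client IP falls inside one of its prefixes. The table, its prefixes (documentation ranges), the row layout, and the function name are all hypothetical; in practice the join would more likely be done in BigQuery itself.

```python
import ipaddress

# Hypothetical higher-precision location table keyed by network prefix.
# The prefixes are documentation ranges, not real M-Lab clients.
BETTER_LOCATIONS = {
    ipaddress.ip_network("192.0.2.0/24"): {"city": "Example City A"},
    ipaddress.ip_network("192.0.2.128/25"): {"city": "Example City B"},
}


def better_location(client_ip, annotated):
    """Return the most specific matching location, else the existing annotation."""
    ip = ipaddress.ip_address(client_ip)
    matches = [net for net in BETTER_LOCATIONS if ip in net]
    if not matches:
        return annotated
    most_specific = max(matches, key=lambda net: net.prefixlen)
    return BETTER_LOCATIONS[most_specific]


# Hypothetical test rows carrying the coarse annotation.
rows = [
    {"client_ip": "192.0.2.10", "geo": {"city": "ISP hub"}},
    {"client_ip": "192.0.2.200", "geo": {"city": "ISP hub"}},
    {"client_ip": "203.0.113.5", "geo": {"city": "ISP hub"}},
]
for row in rows:
    row["geo"] = better_location(row["client_ip"], row["geo"])
    print(row["client_ip"], row["geo"]["city"])
```

Rows with no match keep the original annotation, so the join never loses data.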

29 of 41

Querying in BigQuery

30 of 41

Back to the example we started with...

  • 2019 Median Speeds by Country (Query)
    • Add ASN (Query, DataStudio HeatMap)

31 of 41

Now let’s add time...

  • 2019 Median Speeds by Country, ASN (Query, DataStudio HeatMap)
    • Add Day (Query, DataStudio Timeseries )

32 of 41

  • Example: Counts
    • COUNT(*) or COUNT(DISTINCT field), e.g. COUNT(DISTINCT connection_spec.client_ip)
  • Example: Download Medians
    • APPROX_QUANTILES(8 * (web100_log_entry.snap.HCThruOctetsAcked / (web100_log_entry.snap.SndLimTimeRwin + web100_log_entry.snap.SndLimTimeCwnd + web100_log_entry.snap.SndLimTimeSnd)), 100)[SAFE_ORDINAL(51)]
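Why ordinal 51 gives the median: APPROX_QUANTILES(x, 100) returns 101 boundaries (minimum, 99 interior quantiles, maximum), and SAFE_ORDINAL is 1-based, so the 51st element is the 50th percentile. The sketch below mimics that indexing with exact boundaries on a small list; it is not BigQuery's approximation algorithm, and the function name and data are hypothetical.

```python
def quantile_boundaries(values, buckets):
    """Return buckets + 1 boundaries: minimum, interior quantiles, maximum."""
    ordered = sorted(values)
    if not ordered:
        return []
    last = len(ordered) - 1
    # Pick the element nearest to each evenly spaced position.
    return [ordered[round(i * last / buckets)] for i in range(buckets + 1)]


# Hypothetical download speeds in Mbps: 1.0, 2.0, ..., 101.0
speeds = [float(v) for v in range(1, 102)]

boundaries = quantile_boundaries(speeds, 100)
median = boundaries[51 - 1]  # SAFE_ORDINAL(51) is 1-based; Python lists are 0-based

print(len(boundaries))  # 101
print(median)           # 51.0
```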

33 of 41

Mapping & GIS

34 of 41

  • BigQuery recently added GIS functions & field types
  • Spatial joins, point-in-polygon functions can be used with spatial datasets
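To make the idea of a point-in-polygon spatial join concrete, here is a minimal planar sketch: each test's client coordinates are checked against region polygons and the tests are counted per region. It uses the ray casting technique on flat coordinates, whereas BigQuery's GIS functions work on geodesic geographies; the regions, coordinates, and names are hypothetical.

```python
from collections import Counter


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical regions as lists of (longitude, latitude) vertices.
regions = {
    "west": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "east": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

# Hypothetical client locations, one per test.
tests = [(5, 5), (15, 5), (12, 8), (25, 5)]

# The "spatial join": pair each test with the region that contains it.
counts = Counter(
    name
    for lon, lat in tests
    for name, polygon in regions.items()
    if point_in_polygon(lon, lat, polygon)
)
print(dict(counts))  # {'west': 1, 'east': 2}
```

The point (25, 5) falls outside both regions and is dropped, just as a row that matches no polygon is dropped by a join filtered on ST_WITHIN.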

35 of 41

BigQuery (GIS) - Sample Query

#standardSQL
SELECT
  COUNT(test_id) AS count_tests,
  COUNT(DISTINCT connection_spec.client_ip) AS count_ips,
  APPROX_QUANTILES(8 * SAFE_DIVIDE(web100_log_entry.snap.HCThruOctetsAcked,
    (web100_log_entry.snap.SndLimTimeRwin +
     web100_log_entry.snap.SndLimTimeCwnd +
     web100_log_entry.snap.SndLimTimeSnd)), 101)[SAFE_ORDINAL(51)] AS download_Mbps,
  APPROX_QUANTILES(web100_log_entry.snap.MinRTT, 101)[SAFE_ORDINAL(51)] AS min_rtt,
  state_name AS name,
  ANY_VALUE(state_geom) AS WKT
FROM
  `measurement-lab.ndt.downloads`,
  `bigquery-public-data.geo_us_boundaries.us_states`
WHERE
  connection_spec.server_geolocation.country_name = "United States"
  AND partition_date BETWEEN '2019-01-01' AND '2019-05-30'
  AND ST_WITHIN(ST_GEOGPOINT(connection_spec.client_geolocation.longitude, connection_spec.client_geolocation.latitude), state_geom)
GROUP BY name

36 of 41

BigQuery (GIS) - Sample Query

SELECT
  paris_traceroute_hop.src_geolocation.country_name AS src,
  paris_traceroute_hop.dest_geolocation.country_name AS dest,
  COUNT(*) AS hops,
  TIMESTAMP_TRUNC(log_time, DAY) AS day,
  APPROX_QUANTILES(paris_traceroute_hop.rtt[OFFSET(1)], 101)[SAFE_ORDINAL(51)] AS rtt,
  MAX(paris_traceroute_hop.rtt[OFFSET(1)]) AS max_rtt,
  MIN(paris_traceroute_hop.rtt[OFFSET(1)]) AS min_rtt,
  ST_MAKELINE(
    ST_GEOGPOINT(ANY_VALUE(paris_traceroute_hop.src_geolocation.longitude),
                 ANY_VALUE(paris_traceroute_hop.src_geolocation.latitude)),
    ST_GEOGPOINT(ANY_VALUE(paris_traceroute_hop.dest_geolocation.longitude),
                 ANY_VALUE(paris_traceroute_hop.dest_geolocation.latitude))) AS WKT
FROM
  `measurement-lab.node.traceroute`
WHERE
  TIMESTAMP_TRUNC(log_time, DAY) > TIMESTAMP("2019-01-01")
  AND (paris_traceroute_hop.src_geolocation.continent_code = "AF"
    OR paris_traceroute_hop.dest_geolocation.continent_code = "AF")
GROUP BY src, dest, day
HAVING
  hops > 50
  AND src != ""
  AND dest != ""
ORDER BY hops DESC

37 of 41

38 of 41

RIPE Atlas

39 of 41

Using M-Lab Servers with RIPE Atlas

  • Servers have public IP Addresses:
    • Site Info Data
  • Suggestions:
    • RIPE Atlas Ping or Traceroute to M-Lab Servers

40 of 41

Resources

41 of 41