1 of 31

Case study - how to use big-data machine learning to predict future delays and bottlenecks in public transport


Presented at GeoPython 2018

Presenters:

Ture Friese

2 of 31

Presenter

Ture Friese · tf@computas.com · +47 99230426

  • Master in Business Administration and Engineering from the University of Karlsruhe, Germany
  • Relocated to Norway in 2000
  • Currently working as a senior software engineer at the IT consultancy Computas AS, Stavanger


3 of 31

Agenda

  • The task at hand
  • Learning pipeline
  • Prediction pipeline
  • The Google Cloud Platform
  • Systems architecture
  • Data processing technologies
  • Machine Learning
  • Future work


4 of 31

The task at hand

  • Public transport company wants predictions of future bus delays, positions and bottlenecks
  • Public call for tenders, August 2017
  • Computas, in partnership with Google, wins the assignment
  • Three-person project, September 2017 to February 2018


5 of 31

Questions


Is machine learning mature enough?

Can we provide travellers with accurate predictions beyond ½ hour ahead?

Can ML help to reduce costs and diminish delays?

Can we improve the timetable with the predictions?

Where will the biggest bus delays occur this summer?

Where should we place extra buses during the next holiday rush?

Can we discover new patterns that we don't yet know about?

Can we make predictions for future bus schedules?

6 of 31

Prerequisite

Input

  • Real-time bus tracking data (in SIRI VM XML format)
  • Weather data
  • Holiday data
  • Timetable data (in GTFS format)
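Reading the SIRI VM input boils down to pulling a few fields out of each VehicleActivity element. A minimal sketch with the standard library, assuming the usual SIRI namespace; the XML fragment below is hypothetical and heavily trimmed (real messages carry many more elements):

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed SIRI VM fragment.
SAMPLE = """<Siri xmlns="http://www.siri.org.uk/siri">
  <VehicleActivity>
    <MonitoredVehicleJourney>
      <LineRef>6</LineRef>
      <Delay>PT73S</Delay>
      <VehicleLocation>
        <Longitude>5.7331</Longitude>
        <Latitude>58.9700</Latitude>
      </VehicleLocation>
    </MonitoredVehicleJourney>
  </VehicleActivity>
</Siri>"""

NS = {"siri": "http://www.siri.org.uk/siri"}

def parse_vehicle_activities(xml_text):
    """Extract (line, delay, lon, lat) tuples from a SIRI VM document."""
    root = ET.fromstring(xml_text)
    rows = []
    for mvj in root.iterfind(".//siri:MonitoredVehicleJourney", NS):
        line = mvj.findtext("siri:LineRef", namespaces=NS)
        delay = mvj.findtext("siri:Delay", namespaces=NS)  # ISO-8601 duration
        lon = float(mvj.findtext(".//siri:Longitude", namespaces=NS))
        lat = float(mvj.findtext(".//siri:Latitude", namespaces=NS))
        rows.append((line, delay, lon, lat))
    return rows

print(parse_vehicle_activities(SAMPLE))  # → [('6', 'PT73S', 5.7331, 58.97)]
```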

Output

  • Pre-existing mapping application


© Kolumbus 2018


7 of 31

Idea of solution

1. Bus position calculated from an assumed future route schedule
2. Delay prediction (e.g. 25 sec)
3. Calculate position according to route

8 of 31

Learning pipeline

From raw tracking data to a trained model, in five stages (from the pipeline diagram):

1. Source: 4,300,000+ JSON files in Azure Blob Storage (big data / bus tracking data, plus other data sources)
2. Assemble, co-locate (Dataflow / Apache Beam) → 24+ CSV files in Google Cloud Storage
3. Join, aggregate (Pandas / Python) → wildcard tables in BigQuery
4. Explore, feature engineering (Pandas / Python) → source file in Google Cloud Storage
5. Machine Learning (TensorFlow / Keras) → model in Google Cloud ML

Diagram annotations: execution times of roughly 8 h, 2.5 h and 9 h for the processing stages; effort ("how much work?") of roughly 3, 10, 5 and 4 weeks, spread over transfer, convert, filter, preprocess, feature engineering, training and testing. A side note on the diagram asks: "Why not fix the data at the same time?"

9 of 31

Prediction Pipeline

The prediction pipeline runs every 10 seconds, from START to FINISH:

1. If the selected time window is in the future, make a request to the prediction API.
2. Intelligent lookup in the "future" bus schedule: 200 buses at their scheduled positions at the future point in time.
3. Combine the bus data with the weather forecast and holiday info into a batch prediction call to Cloud ML.
4. Calculate the delayed positions and headings, stipulated along the bus routes.
5. Color the buses according to their delay and plot everything.
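The "calculate delayed positions" step can be sketched as a schedule lookup shifted back by the predicted delay, interpolating linearly between scheduled stops. A simplified sketch; the function name and route data are hypothetical:

```python
from bisect import bisect_right

def delayed_position(stops, query_time, predicted_delay):
    """
    stops: list of (scheduled_time_sec, lon, lat) along the route, sorted by time.
    A bus running `predicted_delay` seconds late is where the schedule says
    it should have been `predicted_delay` seconds ago.
    """
    t = query_time - predicted_delay
    times = [s[0] for s in stops]
    i = bisect_right(times, t)
    if i == 0:
        return stops[0][1:]       # before departure: sit at the first stop
    if i == len(stops):
        return stops[-1][1:]      # after arrival: sit at the last stop
    (t0, lon0, lat0), (t1, lon1, lat1) = stops[i - 1], stops[i]
    f = (t - t0) / (t1 - t0)      # linear interpolation between the two stops
    return (lon0 + f * (lon1 - lon0), lat0 + f * (lat1 - lat0))

# Toy route: three stops at t = 0 s, 300 s, 600 s.
route = [(0, 5.70, 58.96), (300, 5.72, 58.97), (600, 5.75, 58.99)]
print(delayed_position(route, 450, predicted_delay=150))  # → (5.72, 58.97)
```

At query time 450 s with 150 s of predicted delay, the bus is where the schedule placed it at 300 s, i.e. exactly at the middle stop.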

10 of 31

Dataflow in Prediction Pipeline

A prediction request with a point in time yields the complete feature set for 300 buses; the updated feature set is fed to TensorFlow, which returns a delay per bus, from which the position is derived.

11 of 31

Google Cloud Platform

A detour looking at the cloud platform we used


12 of 31

Data Science & Google Cloud Platform

  • Toolset for common data-processing and storage tasks
  • Cost-effective, no-ops
  • Different modules play well together
  • Python affinity
  • Embraces open source
  • Muscle to handle really high data throughputs


Figure: "Data Science Is Multidisciplinary" by Brendan Tierney, 2012 — spanning exploration and automation.

13 of 31

GCP Data & Analytics Ecosystem

  • Capture & Ingest: Cloud Pub/Sub, Cloud Storage, Ads Data Transfer
  • Prepare & Process: Cloud Dataflow, Cloud Dataprep, Cloud Dataproc
  • Store & Serve: Cloud Bigtable, Cloud Storage, Cloud SQL, Cloud Datastore, Google BigQuery
  • Analytics & ML: Google BigQuery, Cloud Machine Learning, Cloud Dataproc
  • Analyze & Enrich: Video API, Natural Language API, Vision API, Speech API
  • Exploration & Collaboration: Data Studio, Cloud Datalab

Source: "Operational Machine Learning" presentation by Khalid Salaman, Google

14 of 31

Getting started...


15 of 31

System Architecture

of the “time machine” solution


16 of 31

Architecture of the time machine solution

Inputs — bus tracking data, timetable data, weather and school holidays — are ingested via Cloud Storage.

  • BigQuery — data warehouse
  • Cloud Dataflow — unified data processing
  • Cloud ML Engine — model training and deployment: training on one side, serving via a Prediction API on the other
  • Cloud SQL — fast-access database
  • Cloud App Engine — serves the map to users

17 of 31

Big Data technologies

… used in the project


18 of 31

Cloud Dataflow

  • https://beam.apache.org
  • Code-centric ETL tool for writing data processing pipelines
  • Supports, among other things: groupBy, windowing, streaming data
  • Write your pipeline in Python, then deploy and execute it on the platform of your liking:
    • DirectRunner – runs locally on your machine
    • ApexRunner – runs on Apache Apex
    • FlinkRunner – runs on Apache Flink
    • SparkRunner – runs on Apache Spark
    • GearpumpRunner – runs on Apache Gearpump
    • DataflowRunner – runs on Google Cloud Dataflow


19 of 31

Cloud Dataflow sample code

import argparse

import six
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

# WordExtractingDoFn and format_result are defined elsewhere in the sample.


def count_ones(word_ones):
    """Count the occurrences of each word."""
    (word, ones) = word_ones
    return (word, sum(ones))


def run(argv=None):
    """Main entry point; defines and runs the word count pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', dest='input',
                        default='gs://dataflow-samples/sp/kinglear.txt')
    parser.add_argument('--output', dest='output', required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    # Read the text file[pattern] into a PCollection.
    lines = p | 'read' >> ReadFromText(known_args.input)

    counts = (lines
              | 'split' >> (beam.ParDo(WordExtractingDoFn())
                            .with_output_types(six.text_type))
              | 'pair_with_one' >> beam.Map(lambda x: (x, 1))
              | 'group' >> beam.GroupByKey()
              | 'count' >> beam.Map(count_ones))

    output = counts | 'format' >> beam.Map(format_result)
    output | 'write' >> WriteToText(known_args.output)

    result = p.run()
    result.wait_until_finish()

wordcount.py — note the call-out function (count_ones), the typical pipeline definition, the use of a lambda, and the configuration via PipelineOptions.

20 of 31

BigQuery

  • Google’s data warehouse product in the cloud
  • No indexes, no foreign keys
  • Distributed, massively parallel query execution with caching
  • SQL syntax roughly 95% similar to Oracle and SQL Server
  • Analytics functions
  • Use it for joining large datasets from different sources
  • Access it via
    • web browser: https://bigquery.cloud.google.com/
    • the bq command-line tool
    • Python scripts
    • ODBC
    • other tools (Data Studio, Tableau, Power BI)


21 of 31

BigQuery sample code

import time

from google.cloud import bigquery

client = bigquery.Client(project='kolumbus-time-machine')
dataset = client.dataset('siri')

source_table_name = 'VehicleActivities_20180124_150352'
source_table_ref = dataset.table(source_table_name)

# Enrich the vehicle activities with weather observations.
query = '''
    SELECT va.*, wo.precipitation, wo.temperature
    FROM siri.%s va
    LEFT JOIN `weather.weatherObservationsProcessed` wo
        ON va.stationId = wo.stationId
        AND va.recordedAtHour = wo.datetimeWeatherCES
''' % source_table_name

# Write the result back over the source table, allowing new columns.
job_config = bigquery.QueryJobConfig()
job_config.destination = source_table_ref
job_config.write_disposition = 'WRITE_TRUNCATE'
job_config.schema_update_options = ['ALLOW_FIELD_ADDITION']

job = client.query(query, job_config=job_config)
start = time.time()
job.result()  # wait for the query job to finish


Use it for data manipulation

Good separation of query and code

22 of 31

Machine Learning

used in the project


23 of 31

Machine Learning


24 of 31

Model selection


25 of 31

Neural Networks

Pros

  • Extremely powerful
  • Can model even very complex relationships
  • No need to understand the underlying data
  • Almost works by “magic”

Cons

  • Prone to overfitting
  • Long training time
  • Requires significant computing power for large datasets
  • Model is essentially unreadable

Good at

  • Images
  • Video
  • “Human-intelligence” type tasks like driving or flying
  • Robotics


26 of 31

Feature Engineering

  • Raw data: 28 columns (lineId, tripId, longitude, latitude, vehicleId, directionRef, recordedAtTime, timeTraveled, ...)
  • Remove 16 columns:
    • dependent on the delay
    • redundant
    • not relevant
    (examples called out in the diagram: coordinates, nextStop, the name of the bus stop)
  • Machine learning: 12 columns become 8,424 (× 125,000,000 rows!)
    • the date column is split up into weekday, hour and month
    • one-hot encoding: one column for each possible value, e.g. weekdays [0, 1, 2, 3, 4, 5, 6] or lineId [0, 1, ..., 8000]
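The one-hot expansion can be sketched in plain Python (pandas' get_dummies does the same at scale). Column names and category ranges here are illustrative, not the project's exact schema:

```python
def one_hot(prefix, value, categories):
    """Expand one categorical column into len(categories) binary columns."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}

def expand_row(row):
    """Split the date into weekday/hour/month and one-hot encode everything."""
    features = {}
    features.update(one_hot("weekday", row["weekday"], range(7)))
    features.update(one_hot("hour", row["hour"], range(24)))
    features.update(one_hot("month", row["month"], range(1, 13)))
    features.update(one_hot("lineId", row["lineId"], range(8001)))
    return features

row = {"weekday": 0, "hour": 15, "month": 6, "lineId": 6}
encoded = expand_row(row)
print(len(encoded))   # → 8044: 7 + 24 + 12 + 8001 binary columns from 4 inputs
print(encoded["lineId_6"], encoded["weekday_1"])   # → 1 0
```

This is how a dozen input columns balloon into thousands of model features: each high-cardinality column (like lineId) contributes one binary column per possible value.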

27 of 31

Feedforward neural network

Input: 8,409 features, e.g. "Line 6 to Sandnes, Monday 15:35, 10 °C, some rain, ..."
Output: a single delay value, e.g. "73 seconds behind schedule"

  • 3 hidden layers
  • batch size: 2,048 rows
  • Adam optimizer, dropout (20%), ReLU/Tanh activations
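The network above can be sketched in miniature as a plain-Python forward pass. Toy layer sizes and random weights only; the real model takes 8,409 inputs, and dropout and the Adam optimizer matter only during training, so they are omitted here:

```python
import random

random.seed(0)

def relu(x):
    return max(0.0, x)

def layer(inputs, weights, biases, act):
    """One fully connected layer: act(W · x + b) for each output neuron."""
    return [act(sum(w * x for w, x in zip(wrow, inputs)) + b)
            for wrow, b in zip(weights, biases)]

def init(n_out, n_in):
    """Random Gaussian weights and zero biases for one layer."""
    return ([[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

# Toy stand-in for the slide's network: input -> 3 hidden layers -> 1 output.
sizes = [8, 5, 5, 5, 1]            # toy widths; the real input width is 8409
params = [init(n_out, n_in) for n_in, n_out in zip(sizes, sizes[1:])]

def predict_delay(features):
    """Forward pass; returns predicted seconds behind schedule."""
    h = features
    for k, (w, b) in enumerate(params):
        is_output = (k == len(params) - 1)
        h = layer(h, w, b, act=(lambda v: v) if is_output else relu)
    return h[0]

print(predict_delay([1.0] * 8))    # some scalar delay prediction
```

The output layer uses a linear activation because the target (seconds of delay) is an unbounded regression value.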

28 of 31

Google Cloud ML - learning for 1 year of data

Epoch | Training time        | Error, train data | Error, test data | Price
------+----------------------+-------------------+------------------+--------
1     | 1 hour               | 132 seconds       | 107 seconds      | 45 NOK
5     | 4 hours              | 88 seconds        | 92 seconds       | 180 NOK
10    | 8 hours              | 66 seconds        | 86 seconds       | 360 NOK
15    | 12 hours 30 minutes  | 55 seconds        | 84 seconds       | 567 NOK


29 of 31

Model and deployment


Stack: Python / Keras / TensorFlow.

Train model:

$ gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $JOB_DIR \
    --module-name trainer.keras_train_cloudml \
    --scale-tier=$TIER \
    --region $REGION \
    <..> \
    -- \
    --train-file <..> \
    --job-dir $JOB_DIR \
    --test-split $TS \
    --batch-size 1024 \
    --epochs $E \
    --learning-rate $LR

Predict:

$ gcloud ml-engine predict --model model_1_year_2016_2017 --version v1 --json-instances sample.json

30 of 31

Future work

  • Preventing "jumpy" buses when predictions change
  • Integrate short-term with long-term prediction; options:
    • Kalman filter
    • Long Short-Term Memory cells (LSTM)
  • Improve model accuracy
  • Find better measures for the quality of the predictor
  • Refine and extend the model:
    • travel days, wind, fog, driver, bus types, "blocking", boat traffic, passenger numbers, car traffic volume
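Of the two integration options above, the Kalman filter is the classic way to blend a long-term model prediction with a fresh real-time observation. A textbook one-dimensional sketch; all numbers are hypothetical:

```python
def kalman_update(estimate, variance, measurement, meas_variance):
    """One scalar Kalman update: blend a prior delay estimate with a new observation."""
    k = variance / (variance + meas_variance)       # Kalman gain in [0, 1]
    new_estimate = estimate + k * (measurement - estimate)
    new_variance = (1 - k) * variance
    return new_estimate, new_variance

# Prior: the ML model predicts a 73 s delay, with high uncertainty (variance 400).
est, var = 73.0, 400.0
# Real-time tracking then reports a 90 s delay, with lower uncertainty (variance 100).
est, var = kalman_update(est, var, measurement=90.0, meas_variance=100.0)
print(round(est, 1), round(var, 1))  # → 86.6 80.0
```

Because the observation is more certain than the prior, the gain is high (0.8) and the blended estimate moves most of the way toward the real-time measurement while the variance shrinks.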


31 of 31

Thank you!

Questions?
