1 of 31

Case study - how to use big-data machine learning to predict future delays and bottlenecks in public transport


Presented at GeoPython 2018

Presenters:

Ture Friese

2 of 31

Presenter

Ture Friese · tf@computas.com · +47 99230426

  • Master in Business Administration and Engineering from the University of Karlsruhe, Germany
  • Relocated to Norway in 2000
  • Currently working as a senior software engineer at the IT consultancy Computas AS, Stavanger


3 of 31

Agenda

  • The task at hand
  • Learning pipeline
  • Prediction pipeline
  • The Google Cloud Platform
  • Systems architecture
  • Data processing technologies
  • Machine Learning
  • Future work


4 of 31

The task at hand

  • Public transport company wants predictions of future bus delays, positions and bottlenecks
  • Public call for tenders, August 2017
  • Computas, in partnership with Google, wins the assignment
  • Three-person project, September 2017 to February 2018


5 of 31

Questions


Is machine learning mature enough?

Can we provide travellers with accurate predictions beyond ½ hour ahead?

Can ML help to reduce costs and diminish delays?

Can we improve the timetable with the predictions?

Where will the biggest bus delays occur this summer?

Where should we place extra buses during the next holiday rush?

Can we discover new patterns that we don't yet know about?

Can we make predictions for future bus schedules?

6 of 31

Prerequisite

Input

  • Real-time bus tracking data (in SIRI VM XML format)
  • Weather data
  • Holiday data
  • Timetable data (in GTFS format)
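Reading the SIRI VM input boils down to pulling a few fields out of each VehicleActivity element. A minimal sketch with the standard library, assuming the usual SIRI namespace; the XML fragment below is hypothetical and heavily trimmed (real messages carry many more elements):

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed SIRI VM fragment.
SAMPLE = """<Siri xmlns="http://www.siri.org.uk/siri">
  <VehicleActivity>
    <MonitoredVehicleJourney>
      <LineRef>6</LineRef>
      <Delay>PT73S</Delay>
      <VehicleLocation>
        <Longitude>5.7331</Longitude>
        <Latitude>58.9700</Latitude>
      </VehicleLocation>
    </MonitoredVehicleJourney>
  </VehicleActivity>
</Siri>"""

NS = {"siri": "http://www.siri.org.uk/siri"}

def parse_vehicle_activities(xml_text):
    """Extract (line, delay, lon, lat) tuples from a SIRI VM document."""
    root = ET.fromstring(xml_text)
    rows = []
    for mvj in root.iterfind(".//siri:MonitoredVehicleJourney", NS):
        line = mvj.findtext("siri:LineRef", namespaces=NS)
        delay = mvj.findtext("siri:Delay", namespaces=NS)  # ISO-8601 duration
        lon = float(mvj.findtext(".//siri:Longitude", namespaces=NS))
        lat = float(mvj.findtext(".//siri:Latitude", namespaces=NS))
        rows.append((line, delay, lon, lat))
    return rows

print(parse_vehicle_activities(SAMPLE))  # → [('6', 'PT73S', 5.7331, 58.97)]
```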

Output

  • Pre-existing mapping application


© Kolumbus 2018


7 of 31

Idea of solution

1. Bus position calculated from an assumed future route schedule
2. Delay prediction (e.g. 25 sec)
3. Calculate position according to route

8 of 31

Learning pipeline

From raw tracking data to a trained model, in five stages (from the pipeline diagram):

1. Source: 4,300,000+ JSON files in Azure Blob Storage (big data / bus tracking data, plus other data sources)
2. Assemble, co-locate (Dataflow / Apache Beam) → 24+ CSV files in Google Cloud Storage
3. Join, aggregate (Pandas / Python) → wildcard tables in BigQuery
4. Explore, feature engineering (Pandas / Python) → source file in Google Cloud Storage
5. Machine Learning (TensorFlow / Keras) → model in Google Cloud ML

Diagram annotations: execution times of roughly 8 h, 2.5 h and 9 h for the processing stages; effort ("how much work?") of roughly 3, 10, 5 and 4 weeks, spread over transfer, convert, filter, preprocess, feature engineering, training and testing. A side note on the diagram asks: "Why not fix the data at the same time?"

9 of 31

Prediction Pipeline

The prediction pipeline runs every 10 seconds, from START to FINISH:

1. If the selected time window is in the future, make a request to the prediction API.
2. Intelligent lookup in the "future" bus schedule: 200 buses at their scheduled positions at the future point in time.
3. Combine the bus data with the weather forecast and holiday info into a batch prediction call to Cloud ML.
4. Calculate the delayed positions and headings, stipulated along the bus routes.
5. Color the buses according to their delay and plot everything.
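The "calculate delayed positions" step can be sketched as a schedule lookup shifted back by the predicted delay, interpolating linearly between scheduled stops. A simplified sketch; the function name and route data are hypothetical:

```python
from bisect import bisect_right

def delayed_position(stops, query_time, predicted_delay):
    """
    stops: list of (scheduled_time_sec, lon, lat) along the route, sorted by time.
    A bus running `predicted_delay` seconds late is where the schedule says
    it should have been `predicted_delay` seconds ago.
    """
    t = query_time - predicted_delay
    times = [s[0] for s in stops]
    i = bisect_right(times, t)
    if i == 0:
        return stops[0][1:]       # before departure: sit at the first stop
    if i == len(stops):
        return stops[-1][1:]      # after arrival: sit at the last stop
    (t0, lon0, lat0), (t1, lon1, lat1) = stops[i - 1], stops[i]
    f = (t - t0) / (t1 - t0)      # linear interpolation between the two stops
    return (lon0 + f * (lon1 - lon0), lat0 + f * (lat1 - lat0))

# Toy route: three stops at t = 0 s, 300 s, 600 s.
route = [(0, 5.70, 58.96), (300, 5.72, 58.97), (600, 5.75, 58.99)]
print(delayed_position(route, 450, predicted_delay=150))  # → (5.72, 58.97)
```

At query time 450 s with 150 s of predicted delay, the bus is where the schedule placed it at 300 s, i.e. exactly at the middle stop.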

10 of 31

Dataflow in Prediction Pipeline

A prediction request with a point in time yields the complete feature set for 300 buses; the updated feature set is fed to TensorFlow, which returns a delay per bus, from which the position is derived.

11 of 31

Google Cloud Platform

A detour looking at the cloud platform we used


12 of 31

Data Science & Google Cloud Platform

  • Toolset for common data-processing and storage tasks
  • Cost-effective, no-ops
  • Different modules play well together
  • Python affinity
  • Embraces open source
  • Muscle to handle really high data throughputs


Figure: "Data Science Is Multidisciplinary" by Brendan Tierney, 2012 — spanning exploration and automation.

13 of 31

GCP Data & Analytics Ecosystem

  • Capture & Ingest: Cloud Pub/Sub, Cloud Storage, Ads Data Transfer
  • Prepare & Process: Cloud Dataflow, Cloud Dataprep, Cloud Dataproc
  • Store & Serve: Cloud Bigtable, Cloud Storage, Cloud SQL, Cloud Datastore, Google BigQuery
  • Analytics & ML: Google BigQuery, Cloud Machine Learning, Cloud Dataproc
  • Analyze & Enrich: Video API, Natural Language API, Vision API, Speech API
  • Exploration & Collaboration: Data Studio, Cloud Datalab

Source: "Operational Machine Learning" presentation by Khalid Salaman, Google

14 of 31

Getting started...


15 of 31

System Architecture

of the “time machine” solution


16 of 31

Architecture of the time machine solution

Inputs — bus tracking data, timetable data, weather and school holidays — are ingested via Cloud Storage.

  • BigQuery — data warehouse
  • Cloud Dataflow — unified data processing
  • Cloud ML Engine — model training and deployment: training on one side, serving via a Prediction API on the other
  • Cloud SQL — fast-access database
  • Cloud App Engine — serves the map to users

17 of 31

Big Data technologies

… used in the project


18 of 31

Cloud Dataflow

  • https://beam.apache.org
  • Code-centric ETL tool for writing data processing pipelines
  • Supports, among other things: groupBy, windowing, streaming data
  • Write your pipeline in Python, then deploy and execute it on the platform of your liking:
    • DirectRunner – runs locally on your machine
    • ApexRunner – runs on Apache Apex
    • FlinkRunner – runs on Apache Flink
    • SparkRunner – runs on Apache Spark
    • GearpumpRunner – runs on Apache Gearpump
    • DataflowRunner – runs on Google Cloud Dataflow


19 of 31

Cloud Dataflow sample code

import argparse

import six
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

# WordExtractingDoFn and format_result are defined elsewhere in the sample.


def count_ones(word_ones):
    """Count the occurrences of each word."""
    (word, ones) = word_ones
    return (word, sum(ones))


def run(argv=None):
    """Main entry point; defines and runs the word count pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', dest='input',
                        default='gs://dataflow-samples/sp/kinglear.txt')
    parser.add_argument('--output', dest='output', required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    # Read the text file[pattern] into a PCollection.
    lines = p | 'read' >> ReadFromText(known_args.input)

    counts = (lines
              | 'split' >> (beam.ParDo(WordExtractingDoFn())
                            .with_output_types(six.text_type))
              | 'pair_with_one' >> beam.Map(lambda x: (x, 1))
              | 'group' >> beam.GroupByKey()
              | 'count' >> beam.Map(count_ones))

    output = counts | 'format' >> beam.Map(format_result)
    output | 'write' >> WriteToText(known_args.output)

    result = p.run()
    result.wait_until_finish()

wordcount.py — note the call-out function (count_ones), the typical pipeline definition, the use of a lambda, and the configuration via PipelineOptions.

20 of 31

BigQuery

  • Google’s data warehouse product in the cloud
  • No indexes, no foreign keys
  • Distributed, massively parallel query execution with caching
  • SQL syntax roughly 95% similar to Oracle and SQL Server
  • Analytics functions
  • Use it for joining large datasets from different sources
  • Access it via
    • web browser: https://bigquery.cloud.google.com/
    • the bq command-line tool
    • Python scripts
    • ODBC
    • other tools (Data Studio, Tableau, Power BI)


21 of 31

BigQuery sample code

import time

from google.cloud import bigquery

client = bigquery.Client(project='kolumbus-time-machine')
dataset = client.dataset('siri')

source_table_name = 'VehicleActivities_20180124_150352'
source_table_ref = dataset.table(source_table_name)

# Enrich the vehicle activities with weather observations.
query = '''
    SELECT va.*, wo.precipitation, wo.temperature
    FROM siri.%s va
    LEFT JOIN `weather.weatherObservationsProcessed` wo
        ON va.stationId = wo.stationId
        AND va.recordedAtHour = wo.datetimeWeatherCES
''' % source_table_name

# Write the result back over the source table, allowing new columns.
job_config = bigquery.QueryJobConfig()
job_config.destination = source_table_ref
job_config.write_disposition = 'WRITE_TRUNCATE'
job_config.schema_update_options = ['ALLOW_FIELD_ADDITION']

job = client.query(query, job_config=job_config)
start = time.time()
job.result()  # wait for the query job to finish


Use it for data manipulation

Good separation of query and code

22 of 31

Machine Learning

used in the project


23 of 31

Machine Learning


24 of 31

Model selection


25 of 31

Neural Networks

Pros

  • Extremely powerful
  • Can model even very complex relationships
  • No need to understand the underlying data
  • Almost works by “magic”

Cons

  • Prone to overfitting
  • Long training time
  • Requires significant computing power for large datasets
  • Model is essentially unreadable

Good at

  • Images
  • Video
  • “Human-intelligence” type tasks like driving or flying
  • Robotics


26 of 31

Feature Engineering

  • Raw data: 28 columns (lineId, tripId, longitude, latitude, vehicleId, directionRef, recordedAtTime, timeTraveled, ...)
  • Remove 16 columns:
    • dependent on the delay
    • redundant
    • not relevant
    (examples called out in the diagram: coordinates, nextStop, the name of the bus stop)
  • Machine learning: 12 columns become 8,424 (× 125,000,000 rows!)
    • the date column is split up into weekday, hour and month
    • one-hot encoding: one column for each possible value, e.g. weekdays [0, 1, 2, 3, 4, 5, 6] or lineId [0, 1, ..., 8000]
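The one-hot expansion can be sketched in plain Python (pandas' get_dummies does the same at scale). Column names and category ranges here are illustrative, not the project's exact schema:

```python
def one_hot(prefix, value, categories):
    """Expand one categorical column into len(categories) binary columns."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}

def expand_row(row):
    """Split the date into weekday/hour/month and one-hot encode everything."""
    features = {}
    features.update(one_hot("weekday", row["weekday"], range(7)))
    features.update(one_hot("hour", row["hour"], range(24)))
    features.update(one_hot("month", row["month"], range(1, 13)))
    features.update(one_hot("lineId", row["lineId"], range(8001)))
    return features

row = {"weekday": 0, "hour": 15, "month": 6, "lineId": 6}
encoded = expand_row(row)
print(len(encoded))   # → 8044: 7 + 24 + 12 + 8001 binary columns from 4 inputs
print(encoded["lineId_6"], encoded["weekday_1"])   # → 1 0
```

This is how a dozen input columns balloon into thousands of model features: each high-cardinality column (like lineId) contributes one binary column per possible value.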

27 of 31

Feedforward neural network

Input: 8,409 features, e.g. "Line 6 to Sandnes, Monday 15:35, 10 °C, some rain, ..."
Output: a single delay value, e.g. "73 seconds behind schedule"

  • 3 hidden layers
  • batch size: 2,048 rows
  • Adam optimizer, dropout (20%), ReLU/Tanh activations
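The network above can be sketched in miniature as a plain-Python forward pass. Toy layer sizes and random weights only; the real model takes 8,409 inputs, and dropout and the Adam optimizer matter only during training, so they are omitted here:

```python
import random

random.seed(0)

def relu(x):
    return max(0.0, x)

def layer(inputs, weights, biases, act):
    """One fully connected layer: act(W · x + b) for each output neuron."""
    return [act(sum(w * x for w, x in zip(wrow, inputs)) + b)
            for wrow, b in zip(weights, biases)]

def init(n_out, n_in):
    """Random Gaussian weights and zero biases for one layer."""
    return ([[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

# Toy stand-in for the slide's network: input -> 3 hidden layers -> 1 output.
sizes = [8, 5, 5, 5, 1]            # toy widths; the real input width is 8409
params = [init(n_out, n_in) for n_in, n_out in zip(sizes, sizes[1:])]

def predict_delay(features):
    """Forward pass; returns predicted seconds behind schedule."""
    h = features
    for k, (w, b) in enumerate(params):
        is_output = (k == len(params) - 1)
        h = layer(h, w, b, act=(lambda v: v) if is_output else relu)
    return h[0]

print(predict_delay([1.0] * 8))    # some scalar delay prediction
```

The output layer uses a linear activation because the target (seconds of delay) is an unbounded regression value.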

28 of 31

Google Cloud ML - learning for 1 year of data

Epoch | Training time        | Error, train data | Error, test data | Price
------+----------------------+-------------------+------------------+--------
1     | 1 hour               | 132 seconds       | 107 seconds      | 45 NOK
5     | 4 hours              | 88 seconds        | 92 seconds       | 180 NOK
10    | 8 hours              | 66 seconds        | 86 seconds       | 360 NOK
15    | 12 hours 30 minutes  | 55 seconds        | 84 seconds       | 567 NOK


29 of 31

Model and deployment


Stack: Python / Keras / TensorFlow.

Train model:

$ gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $JOB_DIR \
    --module-name trainer.keras_train_cloudml \
    --scale-tier=$TIER \
    --region $REGION \
    <..> \
    -- \
    --train-file <..> \
    --job-dir $JOB_DIR \
    --test-split $TS \
    --batch-size 1024 \
    --epochs $E \
    --learning-rate $LR

Predict:

$ gcloud ml-engine predict --model model_1_year_2016_2017 --version v1 --json-instances sample.json

30 of 31

Future work

  • Preventing "jumpy" buses when predictions change
  • Integrate short-term with long-term prediction; options:
    • Kalman filter
    • Long Short-Term Memory cells (LSTM)
  • Improve model accuracy
  • Find better measures for the quality of the predictor
  • Refine and extend the model:
    • travel days, wind, fog, driver, bus types, "blocking", boat traffic, passenger numbers, car traffic volume
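Of the two integration options above, the Kalman filter is the classic way to blend a long-term model prediction with a fresh real-time observation. A textbook one-dimensional sketch; all numbers are hypothetical:

```python
def kalman_update(estimate, variance, measurement, meas_variance):
    """One scalar Kalman update: blend a prior delay estimate with a new observation."""
    k = variance / (variance + meas_variance)       # Kalman gain in [0, 1]
    new_estimate = estimate + k * (measurement - estimate)
    new_variance = (1 - k) * variance
    return new_estimate, new_variance

# Prior: the ML model predicts a 73 s delay, with high uncertainty (variance 400).
est, var = 73.0, 400.0
# Real-time tracking then reports a 90 s delay, with lower uncertainty (variance 100).
est, var = kalman_update(est, var, measurement=90.0, meas_variance=100.0)
print(round(est, 1), round(var, 1))  # → 86.6 80.0
```

Because the observation is more certain than the prior, the gain is high (0.8) and the blended estimate moves most of the way toward the real-time measurement while the variance shrinks.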


31 of 31

Thank you!

Questions?
