Case study - how to use big-data machine learning to predict future delays and bottlenecks in public transport
Presented at GeoPython 2018
Presenter: Ture Friese (tf@computas.com, +47 99230426)
Agenda
The task at hand
Questions
Is machine learning mature enough?
Can we provide travellers with accurate predictions more than half an hour ahead?
Can ML help to reduce costs and delays?
Can we improve the timetable with the predictions?
Where will the biggest bus delays occur this summer?
Where should we place extra buses during the next holiday rush?
Can we discover new patterns that we don't know yet?
Can we make predictions for future bus schedules?
Prerequisite
Input
Output
© Kolumbus 2018
Idea of solution
1. Bus position calculated from an assumed future route schedule
25 sec
2. Delay prediction
3. Calculate position according to route
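The three steps above can be sketched in plain Python: given the planned route schedule and a predicted delay, the bus's current position is the point the schedule assigns to a moment `delay` seconds earlier, interpolated between waypoints. This is a minimal sketch; the waypoint format and the helper name are illustrative assumptions, not from the talk.

```python
from bisect import bisect_right

def delayed_position(schedule, query_time, delay_s):
    """Estimate where a delayed bus is at query_time.

    schedule: list of (seconds_since_midnight, lat, lon) waypoints
              from the planned route, sorted by time.
    delay_s:  predicted delay in seconds (positive = behind schedule).

    The bus is where the schedule says it should have been delay_s
    seconds earlier, linearly interpolated between waypoints.
    """
    t = query_time - delay_s
    times = [p[0] for p in schedule]
    if t <= times[0]:
        return schedule[0][1:]          # not yet departed
    if t >= times[-1]:
        return schedule[-1][1:]         # already at the terminus
    i = bisect_right(times, t)
    (t0, la0, lo0), (t1, la1, lo1) = schedule[i - 1], schedule[i]
    f = (t - t0) / (t1 - t0)            # fraction of the segment covered
    return (la0 + f * (la1 - la0), lo0 + f * (lo1 - lo0))
```

A bus predicted to be 73 seconds behind schedule is simply drawn 73 schedule-seconds back along its planned route.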
Learning pipeline
Why not fix the data at the same time?

Stages: 1. Source → 2. Assemble, co-locate → 3. Join, aggregate → 4. Explore, feature engineering → 5. Machine Learning
Tooling: Dataflow (Apache Beam), Pandas (Python), TensorFlow (Keras)
Data artifacts along the way: Big Data / bus tracking data and other data sources → 4,300,000+ JSON files at Azure Blob Storage → 24+ CSV files at Google Cloud Storage → wildcard tables in BigQuery → source file at Google Cloud Storage → model at Google Cloud ML
How much work? 3w, 10w, 5w, 4w
Execution times: 8h, 2.5h, 9h
Prediction Pipeline
Future point in time
200 buses at their scheduled positions
Cloud ML
Calculate delayed positions and heading
every 10 sec
Intelligent lookup in “future” bus schedule
Combine bus data with the weather forecast and holiday info into a batch prediction call.
Calculate and estimate positions along the bus route
Color buses according to their delay and plot everything
If selected time window is in the future: make a request to prediction API
START
FINISH
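The pipeline above hinges on one batch prediction call per tick: every active bus's schedule-derived features are merged with the shared weather forecast and holiday flag. A minimal sketch of that assembly step (field names are illustrative assumptions, not the project's actual schema):

```python
def build_batch_request(buses, weather, is_holiday):
    """Assemble one batch prediction request for all active buses.

    buses:      list of per-bus feature dicts (line, position, time, ...)
    weather:    shared forecast, e.g. {'temperature': ..., 'precipitation': ...}
    is_holiday: whether the queried point in time is a school holiday
    """
    instances = []
    for bus in buses:
        instances.append({
            **bus,                                      # per-bus features
            "temperature": weather["temperature"],      # shared forecast
            "precipitation": weather["precipitation"],
            "isHoliday": int(is_holiday),
        })
    # One request body covering all buses, sent to the prediction API.
    return {"instances": instances}
```

Batching all ~200 buses into one call keeps the per-tick overhead to a single round trip to the prediction service.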
Dataflow in Prediction Pipeline
delay
position
Prediction request with a point in time
TensorFlow
Complete feature set for 300 buses
Updated feature set
Google Cloud Platform
A detour looking at the cloud platform we used
Data Science & Google Cloud Platform
Exploration
Automation
Data Science Is Multidisciplinary
By Brendan Tierney, 2012
Prepare & Process
GCP Data & Analytics Ecosystem
Cloud Dataflow
Exploration & Collaboration
Data Studio
Cloud Datalab
Capture & Ingest
Cloud Pub/Sub
Store & Serve
Cloud Bigtable
Cloud Storage
Analytics & ML
Google BigQuery
Cloud Dataproc
Analyze & Enrich
Video API
Natural Language API
Vision API
Speech API
Cloud Storage
Cloud Dataprep
Cloud Machine Learning
Google BigQuery
Ads Data Transfer
Cloud Dataproc
Cloud SQL
Cloud Datastore
Source: “Operational Machine Learning” presentation by Khalid Salaman, Google
Getting started...
System Architecture
of the “time machine” solution
Architecture of the time machine solution
Cloud Storage
Cloud App Engine
BigQuery
Data Warehouse
Cloud Dataflow
Unified �Data Processing
Cloud ML Engine
Model Training and Deployment
Prediction API
Training
Serving
Fast Access Database
Cloud SQL
Bus tracking data
Time table data
Weather
School holidays
Map
Big Data technologies
… used in the project
Cloud Dataflow
Cloud Dataflow sample code
import argparse
import re

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


class WordExtractingDoFn(beam.DoFn):
    """Split each line of text into individual words."""
    def process(self, element):
        return re.findall(r"[\w']+", element)


def count_ones(word_ones):
    """Count the occurrences of each word."""
    (word, ones) = word_ones
    return (word, sum(ones))


def format_result(word_count):
    """Format a (word, count) pair as a line of text output."""
    (word, count) = word_count
    return '%s: %d' % (word, count)


def run(argv=None):
    """Main entry point; defines and runs the word count pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', dest='input',
                        default='gs://dataflow-samples/sp/kinglear.txt')
    parser.add_argument('--output', dest='output', required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    p = beam.Pipeline(options=pipeline_options)

    # Read the text file[pattern] into a PCollection.
    lines = p | 'read' >> ReadFromText(known_args.input)
    counts = (lines
              | 'split' >> beam.ParDo(WordExtractingDoFn()).with_output_types(str)
              | 'pair_with_one' >> beam.Map(lambda x: (x, 1))
              | 'group' >> beam.GroupByKey()
              | 'count' >> beam.Map(count_ones))
    output = counts | 'format' >> beam.Map(format_result)
    output | 'write' >> WriteToText(known_args.output)
    result = p.run()
    result.wait_until_finish()
wordcount.py
Use of call-out function
Typical pipeline definition
Use of lambda
Configuration
BigQuery
BigQuery sample code
import time

from google.cloud import bigquery

client = bigquery.Client(project='kolumbus-time-machine')
dataset = client.dataset('siri')
source_table_name = 'VehicleActivities_20180124_150352'
source_table_ref = dataset.table(source_table_name)

query = ''' SELECT va.*, wo.precipitation, wo.temperature
    FROM siri.%s va
    LEFT JOIN `weather.weatherObservationsProcessed` wo
        ON va.stationId = wo.stationId
        AND va.recordedAtHour = wo.datetimeWeatherCES
''' % source_table_name

# Write the join result back over the source table, allowing the two
# new weather columns to be added to its schema.
job_config = bigquery.QueryJobConfig()
job_config.destination = source_table_ref
job_config.write_disposition = 'WRITE_TRUNCATE'
job_config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]

job = client.query(query, job_config=job_config)
start = time.time()
job.result()  # block until the query job finishes
Use it for data manipulation
Good separation of query and code
Machine Learning
used in the project
Machine Learning
Model selection
Neural Networks
Pros
Cons
Good at
Feature Engineering
coordinates, nextStop
name of bus stop
weekday, hour, month
weekdays [0, 1, 2, 3, 4, 5, 6]
lineId [0, 1, …, 8000]
lineId, tripId, longitude, latitude, vehicleId, directionRef, recordedAtTime, timeTraveled, ...
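The raw fields above need encoding before they reach the network. One common scheme, shown here as an illustrative sketch (not necessarily what the project used): cyclic sin/cos pairs for weekday and hour, and a one-hot vector for lineId.

```python
import math

def encode_features(record, n_lines=8001):
    """Turn one raw tracking record into numeric model inputs.

    record: dict with 'weekday' (0-6), 'hour' (0-23), 'lineId'.

    Cyclic values become sin/cos pairs so that, e.g., hour 23 and
    hour 0 end up close together in feature space; lineId becomes
    a one-hot vector over the [0, 8000] id range.
    """
    features = {
        "weekday_sin": math.sin(2 * math.pi * record["weekday"] / 7),
        "weekday_cos": math.cos(2 * math.pi * record["weekday"] / 7),
        "hour_sin": math.sin(2 * math.pi * record["hour"] / 24),
        "hour_cos": math.cos(2 * math.pi * record["hour"] / 24),
    }
    one_hot = [0] * n_lines
    one_hot[record["lineId"]] = 1   # exactly one position set
    features["line_one_hot"] = one_hot
    return features
```

The cyclic encoding matters for delay prediction: midnight-adjacent hours and Sunday/Monday behave similarly, and a plain integer encoding would place them maximally far apart.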
Feedforward neural network
8409
Input: Line 6 to Sandnes, Monday 15:35, 10 °C, some rain, …
Output: "73 seconds behind schedule"
3 Hidden Layers
Batch Size: 2048 rows
Adam optimizer, dropout (20%), ReLU/tanh
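Putting the hyperparameters above together, the model definition might look like this Keras sketch. Only the depth (3 hidden layers), 20% dropout, Adam optimizer, ReLU/tanh activations, and batch size 2048 come from the slide; the layer widths, loss function, and function name are illustrative assumptions.

```python
from tensorflow import keras

def build_model(n_features):
    """Feedforward regression net: features in, delay in seconds out."""
    model = keras.Sequential([
        keras.layers.Dense(256, activation="relu",
                           input_shape=(n_features,)),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(64, activation="tanh"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(1),    # predicted delay in seconds
    ])
    model.compile(optimizer="adam", loss="mae")
    return model

# Training would then use the slide's batch size, e.g.:
# model.fit(X_train, y_train, batch_size=2048, epochs=15, ...)
```

Mean absolute error fits how the results table reports performance (error in seconds), but the actual loss used in the project is not stated on the slide.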
Google Cloud ML - training on 1 year of data
Epoch | Training time       | Error (train data) | Error (test data) | Price
1     | 1 hour              | 132 seconds        | 107 seconds       | 45 NOK
5     | 4 hours             | 88 seconds         | 92 seconds        | 180 NOK
10    | 8 hours             | 66 seconds         | 86 seconds        | 360 NOK
15    | 12 hours 30 minutes | 55 seconds         | 84 seconds        | 567 NOK
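A quick sanity check on the table: price scales almost linearly with training time, at roughly 45 NOK per hour of Cloud ML compute.

```python
# (hours, NOK) pairs from each row of the training-cost table.
rows = [(1.0, 45), (4.0, 180), (8.0, 360), (12.5, 567)]
rates = [price / hours for hours, price in rows]
# Every row lands near the same ~45 NOK/hour rate.
assert all(abs(rate - 45) < 1.5 for rate in rates)
```

So the marginal cost of more epochs is predictable, while the table shows the test error flattening out after about 10 epochs.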
Model and deployment
$ gcloud ml-engine predict --model model_1_year_2016_2017 --version v1 --json-instances sample.json

$ gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $JOB_DIR \
    --module-name trainer.keras_train_cloudml \
    --scale-tier=$TIER \
    --region $REGION \
    <..> \
    -- \
    --train-file <..> \
    --job-dir $JOB_DIR \
    --test-split $TS \
    --batch-size 1024 \
    --epochs $E \
    --learning-rate $LR
Python
Keras
TensorFlow
Train model
Predict
Future work
Thank you!
Questions?