In 2012, something remarkable happened in computer vision research.  A yearly competition called the ‘ImageNet Large Scale Visual Recognition Challenge’ (ILSVRC) had been running since 2010.  Competitors write algorithms which generate labels for objects in photos; for example, “this image contains a dog”.  In 2010, the best-performing system had an error rate of 28% (i.e. it got 28% of the labels wrong).  In 2011, the best-performing system nudged that error rate down to 26%: a tiny improvement.  In 2012, the second-best-performing system had an error rate of 26%.  But the best-performing system in 2012 [Krizhevsky et al. 2012] used a relatively new technique called ‘deep learning’.  It devoured the other systems, delivering an error rate of 15%.  This truly ground-breaking result was arguably the single most important result to wake the world up to the idea that deep neural networks can deliver outstanding performance on some tasks.  We now know that previous approaches are very unlikely to stand up against deep learning systems on image classification tasks.  Fast forward to 2014 and Google’s team won the classification challenge with a 6.66% error rate (also using deep learning) and also performed very well on image localisation.  (Here is a blog post from Google describing their approach, and their paper is Szegedy et al. 2014.)  Here is an example image from the Google blog post showing image localisation and classification:


In other words, because the computer vision community had a robust mechanism to compare image classification techniques, it was easy for the community to spot which algorithms worked well and which did not work so well.  The high-performing algorithms were adopted by the community and pushed to remarkable performance levels.  The low-performing algorithms were dropped.  There was measurable progress year-on-year.  The algorithms evolved.  (See Russakovsky et al. 2014 for a fascinating overview of the ILSVRC challenge.)  Some other machine learning fields have established benchmarks and competitions (such as the TIMIT corpus for speech recognition and the MNIST dataset for handwritten digit recognition).

The aim of this project is to do for energy disaggregation what ILSVRC did for computer vision!

What is energy disaggregation?

Energy disaggregation estimates an itemised electricity bill from a single, whole-house electricity meter.  The ultimate aim is to help people better manage their energy consumption (there is evidence that consumers are best able to manage their energy consumption if given appliance-by-appliance consumption data rather than whole-house aggregate demand data).  For example, here is a breakdown of energy usage by appliance:


For the ‘classic’ paper on energy disaggregation, see Hart 1992.  For a recent review of energy disaggregation, see Armel et al. 2013.

Energy disaggregation research has a big problem

One of the big problems in energy disaggregation research is that it is currently impossible to directly compare the performance of any pair of disaggregation algorithms described in the literature.  We cannot do direct comparisons because different researchers use different data sets, different metrics, different pre-processing etc.  For example, if paper A claims 80% accuracy and paper B claims 85% accuracy, we cannot infer that paper B is better.  This problem is so bad that we cannot say with confidence that modern disaggregation approaches outperform the first approaches developed in the 1980s.  i.e. we cannot be certain that we have made progress in the last 30 years!
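To make this concrete, here is a toy illustration (all numbers hypothetical) of why two papers reporting different metrics cannot be compared: two common disaggregation metrics can rank the same pair of algorithms in opposite orders.

```python
# Toy per-appliance energy totals in kWh (hypothetical numbers).
truth  = {"fridge": 10.0, "kettle": 5.0}
algo_x = {"fridge": 15.0, "kettle": 5.0}   # over-estimates the fridge
algo_y = {"fridge": 9.0,  "kettle": 3.0}   # under-estimates both appliances

def fraction_assigned(truth, est):
    """Fraction of total energy correctly assigned to each appliance."""
    assigned = sum(min(est[a], truth[a]) for a in truth)
    return assigned / sum(truth.values())

def mean_abs_error(truth, est):
    """Mean absolute error (kWh) of the per-appliance energy estimates."""
    return sum(abs(est[a] - truth[a]) for a in truth) / len(truth)

# Algorithm X 'wins' on fraction-of-energy-assigned...
print(fraction_assigned(truth, algo_x))  # 1.0
print(fraction_assigned(truth, algo_y))  # 0.8
# ...but 'loses' on mean absolute error.
print(mean_abs_error(truth, algo_x))     # 2.5
print(mean_abs_error(truth, algo_y))     # 1.5
```

So "paper A scored 80%, paper B scored 85%" tells us nothing unless both papers used the same metric on the same data, which is exactly what a shared validation tool would enforce.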

This problem is well recognised and there is strong demand from academic and industrial energy disaggregation researchers for a way to measure performance across research groups.


The aim of this project is to produce a web-based ‘competition’ or ‘validation tool’ to test the performance of energy disaggregation algorithms.  This is not as easy as it might sound because there are a large number of different metrics, different data sets, different energy disaggregation algorithms, different ways to ‘game’ the system etc.

The only existing example of a web-based energy disaggregation competition is the Belkin Energy Disaggregation Competition on Kaggle (described in a great blog post by disaggregation researcher Oli Parson).  You might choose to model your approach on this competition.  Or not!  It’s up to you.

Your project might follow several stages:

  1. Discuss the exact specification of your web app with the disaggregation community on the energy disaggregation mailing list and maybe in person at meetups (it’s a very friendly community with a strong presence in the UK, and if a validation tool is to have any success then it must have buy-in from the community).  Perhaps you could collaboratively construct the spec for your project on the energy disaggregation wiki (which doesn’t exist yet but soon will!) or on a public GitHub repository.
  2. Build a prototype web app.
  3. Try hard to ‘game’ the web app to cheat!
  4. Modify the web app to make it harder (or impossible) to cheat in the way you identified.
  5. Repeat steps 3 & 4 until you run out of time!

If you succeed in building a usable validation tool then your work will be a very strong candidate for presenting at a research conference!

You don’t need to build all the code from scratch.  You can build on top of the open-source NILM toolkit, NILMTK.  (NILM is another name for energy disaggregation; it stands for ‘non-intrusive load monitoring’.)  We also have our own dataset, UK-DALE.

This project would suit a group with interests which include:

  1. Solving a real-world problem and interacting with a real community to refine the specification
  2. Building a web app
  3. Data visualisation
  4. Handling large quantities of data
  5. Energy conservation

Some challenges

Keep the validation data private?

There are lots of companies involved in energy disaggregation.  They are very unlikely to share their disaggregation algorithms or even share an executable of their algorithms.  As such, the competition would have to work something like this: competitors download some labelled training data to train their systems (this data would include the whole-house aggregate power demand signal, as well as the ground-truth power demand measured for each individual appliance).  Competitors would then download some unlabelled testing data (i.e. just the whole-house aggregate power demand signal), try their best to disaggregate this signal, and upload their appliance-by-appliance estimates to the competition website.  These results are then scored against the hidden ground-truth appliance-by-appliance validation data.
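The server-side scoring step described above might be sketched as follows (the function name, data layout and choice of error metric are all assumptions, not a finalised design; the real tool would need to support several metrics):

```python
# Sketch of scoring an uploaded submission against hidden ground truth.
# Both are mappings: appliance name -> list of power estimates in watts,
# one value per timestep.

def score_submission(submission, ground_truth):
    """Return mean absolute error (watts) across all appliances and timesteps.

    Appliances missing from the submission are treated as all-zero estimates,
    so competitors cannot improve their score by omitting hard appliances.
    """
    total_err, n = 0.0, 0
    for appliance, truth_series in ground_truth.items():
        est_series = submission.get(appliance, [0.0] * len(truth_series))
        for est, truth in zip(est_series, truth_series):
            total_err += abs(est - truth)
            n += 1
    return total_err / n

# Hypothetical three-timestep example:
hidden = {"fridge": [80.0, 85.0, 0.0], "kettle": [0.0, 2000.0, 0.0]}
entry  = {"fridge": [75.0, 90.0, 0.0], "kettle": [0.0, 1900.0, 100.0]}
print(score_submission(entry, hidden))  # 35.0
```

The key architectural point is that `ground_truth` never leaves the server: competitors only ever see their aggregate score.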

A big problem with this approach is that it would be possible to ‘game’ the system.  If competitors can submit an unlimited number of attempts then they could reverse-engineer the hidden ground-truth validation data.  Some solutions to this problem might be:

  1. Limit the number of attempts each team can submit.  Perhaps only allow one submission per day or per week.  You’d also have to protect against competitors signing up with multiple accounts (to detect duplicate accounts you could check that IP addresses, cookies and e-mail addresses are unique; or perhaps you could write a little JavaScript program which generates a ‘fingerprint’ for each machine based on characteristics of its browser and hardware).
  2. At the extreme, perhaps the competition should only be run once a year, perhaps co-located with a conference.  But a year is a long time for disaggregation researchers to wait to see if their algorithm performs better than everyone else’s!
  3. Penalise competitors for submitting too often.
  4. Use new validation data every, say, 3 months.  This has several disadvantages.  One is that you need a continual source of new data.  Another is that you can’t directly compare results over time.
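Solution 1 (limiting submission frequency) could be sketched as a small server-side gate like the one below.  The class and method names are hypothetical; a real deployment would persist the timestamps in a database and combine this with the duplicate-account checks described above.

```python
import time

# Allow at most one scored submission per team per 24 hours.
SUBMISSION_INTERVAL = 24 * 60 * 60  # seconds

class SubmissionGate:
    def __init__(self, interval=SUBMISSION_INTERVAL, clock=time.time):
        self.interval = interval
        self.clock = clock             # injectable clock, handy for testing
        self.last_submission = {}      # team_id -> timestamp of last attempt

    def try_submit(self, team_id):
        """Return True if the team may submit now, and record the attempt."""
        now = self.clock()
        last = self.last_submission.get(team_id)
        if last is not None and now - last < self.interval:
            return False               # too soon: reject without scoring
        self.last_submission[team_id] = now
        return True
```

Note that the gate only slows down reverse-engineering of the hidden validation data; it does not prevent it, which is why it would likely be combined with solutions 3 or 4.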

Public validation data

Another way to look at this would be to assume that companies wouldn’t want to publicly validate their disaggregation algorithms, so perhaps we only care about scoring academic disaggregation algorithms.  In that case we can fairly safely use public validation data (as long as the competitors publish how their algorithms work and, even better, publish their code so other people can run their disaggregation algorithms).  If the validation data is public then this also allows the community to help identify issues with the validation data (incorrect labels etc.).

File formats

Some disaggregation algorithms will be written in C++; some in Java; some in Python; some in Matlab; some in Julia; etc.  All these researchers need a file format which will work for them.  Perhaps the safest approach is to put the time-series data into CSV files and the metadata into YAML or JSON files.  We have been working on a file format specification which should help, and NILMTK has dataset converters.
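A language-neutral submission could look something like the sketch below: one CSV of per-appliance power estimates plus a JSON metadata file.  The column names and metadata fields are assumptions for illustration, not the finalised spec, and the code uses only the Python standard library so the same files could equally be produced from C++, Java, Matlab or Julia.

```python
import csv
import io
import json

# Hypothetical metadata sidecar (fields are illustrative, not a spec):
metadata = {
    "dataset": "UK-DALE",
    "house": 1,
    "sample_period_seconds": 6,
    "appliances": ["fridge", "kettle"],
}

# Per-appliance power estimates (watts), one row per timestep:
rows = [
    {"timestamp": "2014-01-01T00:00:00Z", "fridge": 82.0, "kettle": 0.0},
    {"timestamp": "2014-01-01T00:00:06Z", "fridge": 85.0, "kettle": 2100.0},
]

# Write the CSV to an in-memory buffer (a real tool would write to disk).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["timestamp"] + metadata["appliances"])
writer.writeheader()
writer.writerows(rows)

print(json.dumps(metadata, indent=2))
print(buf.getvalue())
```

CSV plus JSON trades compactness for universality: every language on the list above can read both without third-party libraries, which matters more for a cross-community tool than file size does.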


Update 24/11/2014: Microsoft’s Indoor Localization Competition at IPSN 2014 might be an interesting case study.