Arxiv Paper Analysis Worksheet (Responses)
Timestamp
Your Name (Full)
Paper Name
Paper Url
High level summary
Does it claim a State of the Art result?
What datasets are used in the paper? Can you provide a link to the dataset?
What techniques are used in the paper?
Are you excited about this research paper?
Put a rating between 1 and 5 for the paper.
Any other notes
Researcher affiliation
If you're excited, why are you excited?
2/27/2017 16:24:09
Jack Clark
Robot gains Social Intelligence through Multimodal Deep Reinforcement Learning
Researchers train a robot via a multimodal deep Q-network (MDQN) to learn social interaction skills with people. The robot is trained to select from four possible actions (waiting, looking towards human, waving hand, handshaking). It gets a positive reward for a successful handshake, 0 for the other actions, and -0.1 for an unsuccessful handshake. During testing, human volunteers evaluated the correctness of the robot's responses to various situations. The robot learns to do interesting things, like inferring people's walking trajectories, or learning which actions increase the likelihood of a handshake (like waving to someone walking towards it).
New dataset, Deep RHI
deep learning, multimodal deep Q-network (MDQN)
Seems quite novel to me. Not a common area of research
Uses Aldebaran's Pepper robot. Next steps are to increase the action space beyond four actions, and to use a recurrent attention model so they can get more state on the robot. This paper is interesting more for the future it indicates than for the proof-of-concept research it consists of.
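The reward scheme described in this entry can be sketched in a few lines. This is only an illustration: the notes say "a reward" for a handshake without giving the value, so the +1 below is an assumption, and the action names and the `reward` function are hypothetical.

```python
# Illustrative reward scheme for the four-action handshake task.
# Assumption: +1 for a successful handshake (the notes only say "a reward").
ACTIONS = ("wait", "look_towards_human", "wave_hand", "handshake")


def reward(action: str, handshake_succeeded: bool = False) -> float:
    """Return the scalar reward for a single robot action."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "handshake":
        # Only the handshake is ever rewarded or penalised.
        return 1.0 if handshake_succeeded else -0.1
    # Waiting, looking and waving all yield zero reward.
    return 0.0
```

With such a sparse signal, anything the robot learns about waving or looking has to come from how those actions change the odds of a later successful handshake.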
2/27/2017 19:52:44
Jack Clark
Neural Map: Structured Memory for Deep Reinforcement Learning
Represent a memory system to an RL agent as a map. Do this by making the memory a spatially structured 2D memory image that can learn to store info about the environment. Structure the map so the agent takes it in as an embedding along with its current position, then uses it to navigate. Experiments claim better performance than MemNN-32, a random baseline, and an LSTM on a new "goal search" environment. Uses gated recurrent units to get better performance. Next plan is an "ego-centric" map, similar to recent work by Sergey Levine and others.
deep learning, gated recurrent units (GRU), LSTM, reinforcement learning
Mapping + neural feels like a new area, and this intersects with 2014-onward work on memory networks etc. Quite original.
Maps look like a neat, hacky way to deal with complex memory representations of navigation/action tasks. Also: interpretable!
2/27/2017 20:25:54
Matthew Gibson
A multi-task convolutional neural network for mega-city analysis using very high resolution satellite imagery and geospatial data
Authors train a pretty standard CNN on satellite imagery to compute some statistics of interest to remote sensing researchers: land-cover classification, building density, and floor area ratio. They then use these statistics to estimate population density. Results for land-cover classification seem reasonable, but there is not a lot of quantitative assessment.
DigitalGlobe satellite imagery + private dataset derived from Wuhan planning bureau
deep learning, CNNs
Somewhat novel, applies known technique to known problem, but this particular technique-problem pair is novel.
The paper suffers from not benchmarking against other known methods or datasets such as the UC Merced Land Use Dataset. There is not a lot of interest here from an ML/CV point of view apart from the application. Seems oriented toward remote sensing researchers.
Wuhan University
2/27/2017 20:54:12
Jack Clark
Deceiving Google's Perspective API Built for Detecting Toxic Comments
Targets Google's recently released 'Perspective' service for detecting abusive comments. Demonstrates its vulnerability via adversarial examples (text that contains abusive phrases in a disguised form). Simple misspellings trick Google's API. E.g., "They are stupid and ignorant with no class" (91% abusive) > "They are st.upid and ig.norant with no class" (11%). There are also more subtle perturbations, e.g., "They are liberal idiots who are uneducated." (90%) > "They are not liberal idiots who are uneducated." (83%), but note the significantly less pronounced effect.
No
Google Perspective
basically none?
no
1
No methods section, so it is unclear how the adversarial phrases were derived. If in an automated way then I might move this to a 2, but otherwise it is hard to see the ML component here.
University of Washington
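The misspelling trick quoted in this entry is easy to reproduce by hand. The sketch below only builds the perturbed string; it does not call the Perspective API, and since the notes point out that the paper has no methods section, this should not be read as the authors' procedure. The `perturb` helper and its parameters are hypothetical.

```python
def perturb(text: str, targets: set, position: int = 2) -> str:
    """Insert a dot inside each target word, e.g. 'stupid' -> 'st.upid'."""
    words = []
    for word in text.split(" "):
        # Only touch words that are on the target list and long enough to split.
        if word.lower() in targets and len(word) > position:
            word = word[:position] + "." + word[position:]
        words.append(word)
    return " ".join(words)


original = "They are stupid and ignorant with no class"
print(perturb(original, {"stupid", "ignorant"}))
```

A human reader still parses the perturbed sentence without effort, which is the point of the attack.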
2/27/2017 21:34:58
Lynn Langit
Skin Lesion Classification Using Hybrid Deep Neural Networks
Tests a method (pre-trained convolutional neural networks plus ensemble learning) for classifying skin lesion images, using 2,000 images for training and 150 images for testing. The result is 84.8% and 93.6% correct for melanoma and seborrheic keratosis labeling, respectively, via binary classification, which compares to diagnosis by an experienced dermatologist. Proposes that a larger training dataset would result in increased model performance.
The available images from the ISIC 2017 challenge were used for training purposes. This dataset is composed of 2000 color dermoscopic skin images with corresponding labels. Link is here --
pre-trained convolutional neural networks and ensemble learning
I would like to see this research repeated with a much larger dataset.
University of Vienna, Vienna, Austria & TissueGnostics GmbH, Vienna, Austria
2/27/2017 21:48:55
Jack Clark
CHAOS: A Parallelization Scheme for Training Convolutional Neural Networks on Intel Xeon Phi
New hardware alert! Researchers introduce 'Controlled Hogwild with Arbitrary Order of Synchronization' (CHAOS), a parallelization method for neural nets. Has thread and vector parallelism and other nice timey-wimey features. Can show speedups on Xeon Phi of 103X compared to execution on one thread, 14X compared to sequential execution on a Xeon E5, and 58X compared to an Intel Core i5. Contains a nice review of the history of parallelization approaches. Future scaling ahoy: "results of the performance model indicate that CHAOS scales well beyond the 240 hardware threads of the Intel Xeon Phi that is used in this paper for experimentation."
New architecture means new optimizations. James Mickens would be proud.
Proves out Xeon Phi performance, but the lack of comparison to GPUs makes it hard to put in context. Valuable signal of the benefit of optimizations, though.
Linnaeus University, Machine Intelligence Research Labs
2/27/2017 22:37:43
Asim Imdad
Multi-Label Segmentation via Residual-Driven Adaptive Regularization
Researchers present a multi-label segmentation algorithm that uses a robust Huber loss for both the data term and the regularizer. The proposed model is designed to make use of a prior, or regularization functional, that adapts to the residual during convergence of the algorithm in both space and time. The authors designed the cost function in a parameter-free way, and it can be minimized within a convex optimization framework. The results are tested and compared with the state of the art in the field, such as total-variation-based algorithms.
Berkeley segmentation dataset
Convex Optimization and Regularization
It uses a new regularization scheme to control the segmentation
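For reference, the Huber loss mentioned in this entry is quadratic for small residuals and linear for large ones. The sketch below is the standard textbook definition with a fixed threshold `delta`; the threshold value is an assumption for illustration, and the paper's adaptive, parameter-free formulation is not reproduced here.

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    magnitude = abs(residual)
    if magnitude <= delta:
        # Small residuals are penalised like squared error.
        return 0.5 * residual * residual
    # Large residuals grow only linearly, which limits the effect of outliers.
    return delta * (magnitude - 0.5 * delta)
```

The two branches meet at `abs(residual) == delta`, so the loss stays continuous.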
2/27/2017 22:55:28
Cynthia Yeung
Learning Hierarchical Features from Generative Models
Researchers show that hierarchical generative models aren't much better than non-hierarchical ones. They then propose an alternative approach (Variational Ladder Autoencoders), which they train over three datasets, and obtain results similar to those obtained with InfoGAN. However, this new approach is still limited in that it can't specify the type of feature to be disentangled and can't learn structures except for disentanglement.
Various datasets available through GitHub:
hierarchical variational autoencoders and variational ladder autoencoders
2/27/2017 22:58:28
Dhruv Parthasarathy
On the Origin of Deep Learning
This paper covers the history of deep learning and provides a wonderful survey of the field as well. It goes through how the original ideas were created and provides the biological inspirations where applicable. It goes through the original findings in each piece of DL today - perceptrons, backprop, deep nets, ... all the way to GANs. It ends by showing how historical results can help us discover new areas for research.
yes - you rarely see excellent summaries of the history of the field and how it has evolved.
This is just a must read for anyone interested in the space. Well written and inspiring.
2/27/2017 23:01:30
Adrian Ulloa
Boundary-Seeking Generative Adversarial Networks

A new GAN structure is introduced, which relies on generating samples that lie on the decision boundary of the discriminator in each update.
It is shown empirically that this approach works for discrete data (and continues to work with continuous data too), an active area of GAN research lately, and performs better than the reparametrization technique (aka Gumbel-softmax). It also provides a strong first step towards building a unified learning framework for both types of data.
SVHN, quantized CelebA
deep convolutional GANs
Yes, as far as I'm aware. It is a fairly hot research area though
I would love to see an implementation, hope to see a release of the code soon :)
University of Montreal, University of Waterloo, New York University, CIFAR
2/27/2017 23:18:45
Cynthia Yeung
Deep Voice: Real-time Neural Text-to-Speech
Researchers create a fully-neural TTS system consisting of the following: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. The system requires zero manual annotation. The result? A 400X speedup over previous WaveNet inference implementations.
1. Internal English speech database containing approximately 20 hours of speech data segmented into 13,079 utterances. 2. Subset of the Blizzard 2013 data (Prahallad et al., 2013)
deep neural networks
Yes. Results are impressive.
Faster-than-real-time inference!
2/27/2017 23:19:50
Remya Cherussery Varriem
A case study on English-Malayalam Machine Translation
The research focuses on a comparison between Statistical Machine Translation (SMT) and Rule-Based Machine Translation (RBMT) for English-Malayalam translation. Because of their highly different structure, these languages come with challenges in terms of ambiguity in meaning, as well as differences in vocabulary. The authors preprocessed data and trained the SMT on around 25k-30k sentences. They then did error analysis on the output of the two models. SMT outperformed RBMT, but it could not do morphological analysis, so the goal is to continue the research by giving the SMT model a morphology analyzer to lower the error rate even further.
Could not find the exact data. But it is part of
Statistical Machine Learning
I think so. I speak Malayalam and I am yet to see excellent translation software. I can only imagine how it can be extended to other languages, which all have their origin in Sanskrit.
This research has great potential if it can take into consideration how complex and rich Malayalam is, and that there is no way to translate word for word. Taking into consideration variations in slang, dialect, etc., it can get even more complex. I am excited to have found this paper and related papers.
IIT Bombay
2/27/2017 23:22:25
Rinat Maksutov
Spatially Aware Melanoma Segmentation Using Hybrid Deep Learning Techniques
A new network architecture is proposed for a more accurate delineation of skin lesion images from the dataset of the ISBI 2017 lesion segmentation challenge. The researchers propose a hybrid method, which is a combination of deep convolutional networks with recurrent networks. It is aimed at dealing with the relatively low accuracy of previously proposed architectures on low-contrast and hair-occluded images of a lesion. By injecting 4 recurrent network layers between the convolutional encoder and pooling layers and the convolutional decoder layer, it was possible to significantly outperform the traditional methods.
Recurrent neural networks, convolutional neural networks
Does not seem novel, but provides an alternative way of solving a certain problem using the existing techniques.
3
Deakin University
2/27/2017 23:28:20
Dhruv Parthasarathy
Generative Adversarial Active Learning
The authors apply GANs in the field of Active Learning. In Active Learning, the algorithm has a pool of training samples to choose from but can only get labels for a limited set. This method uses GANs to synthesize the most optimal training data to get labeled. This is the first application of GANs to this field. The results are comparable to state of the art results but only in some cases. More work will need to be done to verify that this is a promising approach.
Maybe
CIFAR-10, MNIST, SVHN
GANs, Active Learning
Yes - This seems to be the first paper to apply GANs in the context of Active Learning.
3
Boston College
2/28/2017 0:00:24
Dhruv Parthasarathy
Adversarial Networks for the Detection of Aggressive Prostate Cancer
The authors use adversarial training on U-Net to segment MRI images to reveal tumors. This seems well suited to the task given that such a method does well in fields with limited labeled data such as medical imaging. The authors find that such a method outperforms simple cross entropy on the given data set both in terms of sensitivity and Dice coefficient.
MRI dataset from 152 patients acquired at the National Center for Tumor Diseases in Heidelberg, Germany.
Convnets trained using adversarial training.
Yes - the authors claim that this is the first paper to introduce adversarial training for semantic segmentation of medical images.
German Cancer Research Center (DKFZ)
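The Dice coefficient used as a metric in this entry measures the overlap between a predicted mask and the ground-truth mask. A minimal sketch on flat binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not details taken from the paper.

```python
def dice_coefficient(pred, target) -> float:
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two flat 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count pixels that are set in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

Unlike plain pixel accuracy, this score is not inflated by the large background region that dominates most tumor segmentation images.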
2/28/2017 0:05:53
Prateek Dorwal
Low-Precision Batch-Normalized Activations
This paper presents a novel approach to quantizing batch-normalized activations to reduce memory requirements and computational costs. It presents low-precision approximation formulas that can be applied after normalization but before the activation function, so as to replace costly multiplication operations with cheaper addition operations. This decrease in cost comes with a trade-off of reduced accuracy (a minor loss, though). In the paper, the researcher presents results of his experiments with ResNet models on ImageNet.
convolutional neural network
No. There are papers with similar approach to achieve lower precision, eg . However the researcher applies these formulas before activations, which is a new approach.
3
Facebook AI research
2/28/2017 0:32:40
Anthony Cooper
Automated Verification and Synthesis of Embedded Systems using Machine Learning
The author presents an overview of current problems and areas of focus regarding embedded computer systems (C, C++ based). Autonomous vehicles like UAVs are referenced, along with disease diagnosis, and some possible issues are highlighted. Unfortunately, most of the research problems (RP 1-6) have particularly specific applications, and the paper fails to deliver new insight about machine learning. Little about ML is referenced or discussed, and the paper's usefulness outside of specific engineering applications is questionable.
SMT-related (Little on ML)
Nothing extremely groundbreaking here, although it does present current issues around embedded systems.
The paper is quite short, at only 3 pages, and I feel it would have been better combined with a larger piece of work. It presents current issues which can definitely help raise awareness and promote discussion, but it doesn't strike me as a must-read. Those working with C or C++ based systems in autonomous vehicles, like self driving cars or UAVs, might find this paper more relevant. :)
University of Oxford
2/28/2017 4:21:26
Sagar Uprety
Dynamic Word Embeddings via Skip-Gram Filtering
Word embeddings are representations of words as vectors that capture their semantic context. This paper investigates the semantic evolution of words over time by evaluating continuous word embeddings over a very long time period. Cosine similarities between the word vectors of individual words are plotted against time, capturing how the meaning of a word has changed over time.
1) Google books corpus:
2) State of the Union (SoU) addresses of US Presidents:
3) Twitter data for 21 randomly picked days between 2010 and 2016
Skip-gram filtering, Skip-gram smoothing, Stochastic gradient ascent
I find it novel because it studies the continuous evolution of word embeddings. The evolution of word embeddings has been studied before, but that work was based on static embeddings. This paper gives better results.
4
Disney Research
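The cosine-similarity-over-time plots described in this entry boil down to a small computation. The sketch below uses toy vectors and hypothetical helper names; it only shows how a word's similarity to a fixed reference vector can be tracked per year, not how the paper trains its embeddings.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def similarity_over_time(embeddings_by_year: dict, reference) -> dict:
    """Map each year to the similarity between that year's vector and a reference."""
    return {
        year: cosine_similarity(vector, reference)
        for year, vector in sorted(embeddings_by_year.items())
    }
```

Plotting the returned values against the years gives the kind of drift curve the summary mentions.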
2/28/2017 4:57:48
Nikolay Morozov
Bayesian Nonparametric Feature and Policy Learning for Decision-Making
The authors propose a Bayesian nonparametric model based on the Indian Buffet Process, where features are inferred from observed data under the assumption that observations are represented by latent features. Features are based on policies; an agent's decisions are based on a superset of those policies. The approach shows good results, although high noise and a high number of features appear to be challenging. Application to a self-driving car scenario was considered, and real data from the KITTI benchmark were used to validate the algorithm; reliable results were obtained (overall accuracy is consistently over 70%).
Bayesian Nonparametric Model on the Indian Buffet Process
Learning from demonstration is not something unseen before; however, the present algorithmic approach demonstrated reliable results on realistic data in application to self-driving cars.
Technische Universität Darmstadt
2/28/2017 8:18:50
Nikolaos Sarafianos
Seeing What is Not There: Learning Context to Determine Where Objects Are Missing
The authors are interested in finding the potential locations of missing objects in images (in their case curb ramps). To do so they utilize contextual information (heat maps) to find where an object should occur in the image, and object detection for the primary object. They use a Siamese fully convolutional context network fed with (i) the masked image for classification and (ii) the raw image, which forces the network to output similar results regardless of whether the binary object mask is used or not.
Yes
Street View Curb Ramps Dataset
ConvNets for context and objects in images
Yes, as it combines contextual information with object detection masks (to hide the already detected object) so as to identify locations in the image where an object could appear
Very well motivated. Although very different from both it reminded me of these 2 papers: (i) (ii)
University of Maryland
2/28/2017 8:52:23
Kostas Oraiopoulos
A Unifying Framework for Convergence Analysis of Approximate Newton Methods
The paper proposes a framework to analyze local convergence properties of sub-sampled Newton methods. Sub-sampled Newton methods are a recent approach to solving optimization problems where, instead of calculating the complete Hessian matrix as in the classical Newton method, a small subset of samples is used to approximate the Hessian.
They explore the convergence properties of the proposed sub-sampling methods. Namely, they try to address three basic questions:
1. Is the Lipschitz continuity of the Hessian necessary to achieve a linear-quadratic convergence rate?
2. What sketching size does the sketch Newton method require to obtain linear convergence?
3. Can regularized sub-sampled Newton methods use a smaller sample size than conventional sub-sampled Newton methods?
Maybe
No dataset
calculus
Yes. It is focused on approximating second-order optimization methods, which has been an active area over the last couple of years. Not a breakthrough, but it explores important questions in the area.
The paper requires heavy mathematical knowledge of the field, and its details are accessible only to scientists who work closely in the field. On the other hand, the research on these sub-sampled Newton methods seems extremely interesting.
Shanghai Jiao Tong University, Peking University
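To make the sub-sampling idea in this entry concrete, here is a toy one-dimensional sketch. The quadratic objective, the sample size and every name are assumptions chosen for illustration and are not taken from the paper: the gradient uses all terms, while the second derivative (the 1-D Hessian) is estimated from a random subset at each step.

```python
import random


def subsampled_newton_1d(a, b, x0=0.0, sample_size=20, steps=50, seed=0):
    """Minimise f(x) = (1/n) * sum_i 0.5 * a[i] * (x - b[i])**2.

    The exact minimiser is sum(a[i] * b[i]) / sum(a). Each step uses the
    full gradient but a Hessian estimate from `sample_size` random terms.
    """
    rng = random.Random(seed)
    n = len(a)
    x = x0
    for _ in range(steps):
        # Full gradient over all n terms.
        grad = sum(ai * (x - bi) for ai, bi in zip(a, b)) / n
        # Sub-sampled second derivative: average of a[i] over a random subset.
        subset = rng.sample(range(n), sample_size)
        hess_estimate = sum(a[i] for i in subset) / sample_size
        x -= grad / hess_estimate
    return x
```

Because the Hessian is only approximate, each step shrinks the error by a constant factor rather than landing on the minimiser in one step, which matches the linear convergence regime that the questions above ask about.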
2/28/2017 9:27:13
Matt Blomquist
Uniform Deviation Bounds for Unbounded Loss Functions like k-Means
Researchers develop a framework to obtain uniform deviation bounds for k-means loss functions when the kurtosis of the underlying distribution is bounded. The researchers present a general form for obtaining a uniform deviation bound on the k-means function and describe its benefits over previous research. These benefits include scale invariance and stability for an unbounded solution space. The researchers provide all of the mathematical proofs for the new framework and see a significant improvement in the complexity compared to previous research.
Yes
No.
k-means clustering
The paper is somewhat novel, as there is a significant improvement over previous research. However, broad implementation of the findings is not presented.
Better results are obtained under stronger assumptions such as a bounded higher moment, subgaussianity, or bounded support.
ETH Zurich
2/28/2017 10:00:16
Dmytro Filatov
An Efficient Pseudo-likelihood Method for Sparse Binary Pairwise Markov Network Estimation
Researchers offer a faster alternative for optimizing the pseudo-likelihood (PL) by posing it as a logistic regression (LR) problem in the context of learning an L1-regularized binary pairwise Markov network (BPMN).
The proposed method, Pseudolikelihood method using glmnet (PLG), offers a substantial speedup without losing accuracy.
Researchers conduct a performance experiment using senators' voting records. The task of interest was to investigate the clustering effect of voting consistency, to find the senators who are more likely to cast similar votes on bills.
U.S. Senate Roll Call Votes 109th Congress - 2nd Session (2006),
Logistic regression (LR), pseudo-likelihood approach (PL), Pseudolikelihood method using glmnet (PLG), binary pairwise Markov networks (BPMNs), node-wise logistic regression (NLR)
Somewhat. Researchers introduce a new optimization method and claim that it substantially outperforms the existing state-of-the-art implementation of PL ("Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods", Holger Hofling, Robert Tibshirani, 2010).
3
University of Wisconsin
2/28/2017 11:48:42
Michele Cavaioni
Asymmetric Tri-training for Unsupervised Domain Adaptation
This paper proposes a tri-training method for unsupervised domain adaptation. It aims to learn discriminative representations by utilizing pseudo-labels assigned to unlabeled target samples. It uses three classifiers, where two networks assign pseudo-labels to unlabeled target samples and the remaining network learns from them.
MNIST, SVHN dataset; Synthetic Traffic Signs and German Traffic Signs Recognition Benchmark datasets; Amazon Reviews dataset.
Asymmetric tri-training for unsupervised domain adaptation, where asymmetric means assigning different roles to the three classifiers.
It's not too novel, as it leverages the existing methodology of co-training (using multiple classifiers to artificially label unlabeled samples and retrain the classifiers).
The University of Tokyo, Tokyo, Japan
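The pseudo-labelling step summarised in this entry can be sketched as follows. This is a deliberately simplified illustration: it keeps a target sample only when the two labelling classifiers agree, and leaves out any confidence threshold or training detail the paper may use. The classifiers are hypothetical callables.

```python
def pseudo_label(samples, classifier_a, classifier_b):
    """Return (sample, label) pairs for samples both classifiers label the same."""
    labelled = []
    for x in samples:
        label_a = classifier_a(x)
        label_b = classifier_b(x)
        # Disagreement means the label is not trusted, so the sample is skipped.
        if label_a == label_b:
            labelled.append((x, label_a))
    return labelled
```

The third, target-specific classifier would then be trained on the returned pairs.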
2/28/2017 14:47:49
Claudio Bernardo Rodriguez Rodriguez
Efficient Learning of Graded Membership Models
Researchers propose a new method for calculating graded membership. Authors present their new approach, PTPQP, and walk through the mathematical steps. Authors compare the real mean squared error with that of other popular methods and show that as N increases their method has better accuracy and less computation. Authors provide Matlab code for their method.
Yes
No dataset
It uses a Dirichlet distribution to allocate a variable into a cluster to discover latent variables. To do this they propose their method, partitioned tensor parallel quadratic programming (PTPQP), which can be parallelized.
Seems very novel and very useful for unsupervised learning.
Authors tackle Latent Variable Model issues with negative numbers. The method has prediction rates similar to those of TPM and HALS. Next step would be to validate results on more datasets and test parallelization of the algorithm.
Duke University
2/28/2017 17:30:15
Anthony Cooper
Related Pins at Pinterest: The Evolution of a Real-World Recommender System
The authors begin by presenting the thought that no state-of-the-art system is built instantly. Instead, the system requires many hours of development and iterations to be successful. This paper highlights the attempts, results, and reasoning of the Pinterest researchers and engineers who worked on 'Related Pins', their machine-learning-driven system that suggests pins for users. In order to maximize the number of users who click the suggested pins, the researchers employ several methods (candidate generation, Memboost, ranking), many of which leverage machine-learning techniques. The paper also documents the challenges the researchers faced when building their system, and their solutions.
Recommender systems, gradient-boosted decision trees (GBDT), Random walk service
The authors share their journey, starting with 2 engineers shipping a product to test in 3 weeks, and the three years of research and implementation that come after. I enjoyed how thorough and honest the paper was. I was also pleasantly surprised with how readable and accessible the material is, compared to some other highly technical, math-heavy papers.
2/28/2017 18:26:04
Jack Clark
Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning
Tries to solve the problem of RL mostly being good for single agents. Claims some success on a Starcraft unit micromanagement task. Modern RL methods frequently use an experience replay memory. How do we make such a memory work with multiple independent Q-learners? You learn a policy conditioned on an estimate of the policies of the other agents, and you infer this from their behavior. Also adds some tweaks to the memory to make it work better at distinguishing recent important actions from old ones, particularly when analyzing different agents' actions. One of the tricks here seems to be the assumption that you're training all your agents in cooperation with one another.
IQL, Deep Learning, Q Learning, RNN
Multi-agent learning seems bloody expensive (computationally) compared to single-agent learning. Seems like we need many more researchers working on this problem to stop computation costs exploding. The 'fingerprint' idea outlined in this paper is intriguing but a little hard to parse on an early read.
University of Oxford, Microsoft Research Redmond
Provides another view on how to tackle complex, dynamic, partially observable worlds like Starcraft.
2/28/2017 19:29:44
Anthony Cooper
Show, Attend and Interact: Perceivable Human-Robot Social Interaction through Neural Attention Q-Network
Researchers use a robot called "Pepper" to test human-robot social interaction (HRSI). The robot can do four things: wait, look towards human, wave hand, and handshake. They use a new model called a Multimodal Deep Attention Recurrent Q-Network (MDARQN) to train the robot, which is composed of two Q-network streams running independently. Each stream consists of convolutional neural networks, long short-term memory (LSTM) networks, and an attention network. The researchers train Pepper over 14 days, with the result that Pepper learns to infer intent from human behavioral patterns and actions.
New Dataset + Code, Physical human-robot social interaction
Deep learning, reinforcement learning, dual-stream convolutional neural networks, multimodal deep attention recurrent Q-network (MDARQN), LSTM
They use Aldebaran's Pepper robot, I believe.
Osaka University
Even though Pepper only had four actions to choose from, the robot managed to learn and adapt in an uncontrolled, real world environment. In the future, a similar sized robot may have four hundred actions to choose from - the applications are tremendous! The researchers' proposed network looks very promising, and it shows that there is at least one way to successfully train an agent to understand human behavioral cues.
2/28/2017 20:14:45
Anthony Cooper
Multi-agent systems and decentralized artificial superintelligence
The paper is simply an outline of how one could use existing infrastructure and tools to create decentralized AI, using multi-agent systems across several domains. Nothing modern is referenced, such as machine learning or neural networks. It focuses more on theory, and without actual, current data, I do not see how this paper presents anything new.
Berkeley Open Infrastructure for Network Computing (BOINC),
While I found the topic and proposal interesting, the most recent reference is from 1997. Despite this, modern technologies like Siry (Siri?), and the Ethereum blockchain are brought up along with Windows 95. The paper also contains grammatical and formatting errors which impair readability and flow.

There is no research conducted; the paper resembles an outline on possible methods to create decentralized AI using existing tools, like BOINC or P2P networks.
Moscow Institute of Physics and Technology
2/28/2017 21:18:30
Fred Tanada
Improving Machine Learning Ability with Fine-Tuning
The supplemental IRT training data has two qualities that make it desirable as training data: (i) the data were identified because of their local independence from each other in the process of fitting an evaluation metric to measure latent ability, and (ii) each example in the data has a large number

Our contributions are as follows: (i) we show that finetuning parameter weights for a memory-augmented neural network (Munkhdalai & Yu, 2017) with a specialized supplemental training set improves model performance for certain tasks when compared with a human population, (ii) this fine-tuning does not overfit on the supplemental data and therefore does not negatively affect generalization in terms of accuracy, and (iii) we motivate the use of supplemental training data as a way to improve performance in terms of ability with regards to a human population.

The model may have learned a set of parameters that better models the data than the human population, and updating parameters to reflect the human distribution could lead to a drop in performance.

In this case the fact that performance in terms of theta improved is a surprising result. Fine-tuning with the full SICK training set negatively impacts performance in all cases, while fine-tuning with a random sample from SICK improves performance over the original model in all cases except G4-Neutral, where the original model performed best. By introducing only a subset of new data to the model in the fine-tuning phase, the model is able to update its parameters without completely re-learning based on the new data.
Sentences Involving Compositional Knowledge (SICK) data set, another widely used RTE data set (Marelli et al., 2014)
Algorithm 1 Fine-Tuning Procedure
Input: NumEpochs e, XN_train, X_test, FT_train, IRT_test, Loss Function l
for i = 1 to e do
  Train NSE with XN_train with loss function CCE
end for
for i = 1 to e do
  Train NSE with FT_train with loss function l
end
Yes
Columbia University
I'm excited about this research because it will improve the overfitting results from human performance.
2/28/2017 21:50:40
Fred Tanada
Improving the Neural GPU Architecture for Algorithm Learning
The proposed improvements allow us to achieve substantial gains: the model can learn binary multiplication in 800 steps versus 30000 steps that are needed for the original Neural GPU, and, most importantly all the trained models generalize to 100 times longer inputs with less than 1% error.

It can learn fairly complicated algorithms such as addition and binary multiplication, but only a small fraction of the trained models generalize well.

The improved architecture can easily learn a variety of tasks including the binary multiplication on which other architectures struggle.

The correct generalization of the learned models to arbitrary large inputs is still an open problem, and it is not even clear why some models generalize, and others do not.
Datasets can be found in Appendix
AdaMax optimizer, Neural Turing Machine, Diagonal Convolutional Gated Recurrent Unit (DCGRU)
YesCornell University
I am excited because I would like to improve performance in my GPU instances which I use in the self-driving program.
2/28/2017 22:03:30Dmytro Filatov
Deep Forest: Towards An Alternative to Deep Neural Networks
Researchers propose gcForest (multi-Grained Cascade forest), a novel decision tree ensemble approach with performance highly competitive with deep neural networks (DNNs). gcForest includes two important components: cascade forest and multi-grained scanning. Compared to DNNs, gcForest has far fewer hyper-parameters and is less sensitive to parameter settings.

This approach showed good accuracy on:
1. Face Recognition – ORL dataset : 98.30% gcForest vs 92.50% DNN (CNN)
2. Music Classification – The GTZAN dataset: 65.67% gcForest vs 59.20% CNN
3. Sentiment Classification – The IMDB dataset: 89.32% gcForest vs 89.02% DNN (CNN)
MNIST, GTZAN Genre Collection, sEMG for Basic Hand movements, The IMDB dataset, UCI-datasets,
multi-Grained Cascade forest (gcForest), cascade forest, multi-grained scanning, deep neural networks (DNN), convolutional neural network (CNN)
Shows a good performance on CPUs: run on 2 Intel E5 2670 v3 CPUs (24 cores)
Nanjing University
gcForest has much fewer hyper-parameters to tweak than deep neural networks; performs well with default setting across different data from different domains; easy to analyze
2/28/2017 22:40:34Anuj Gupta
Related Pins at Pinterest: The Evolution of a Real-World Recommender System
The paper begins by describing how the initial recommendation engine, built with limited computational resources, has been transformed into an advanced engine that adds real-time complexity. It covers challenges such as avoiding feedback loops and evaluating performance, as well as the trade-offs required for it to succeed as a not-so-state-of-the-art system. They began their journey by observing the problems in the initial product and tackling them, first designing the simplest, highest-leverage products and then reaching incremental milestones using several methods (such as Memboost, ranking, and GBDT). They conclude by showing the results, evaluation, and comparison with the previous engine.
Recommendation engine, GBDT, memboost, random walk
I am excited since this research combines challenging legacy concepts and proving them wrong with today's not-so-state-of-the-art system. Plus, it can grow into a large system with the latest technologies like deep learning, CNNs, etc.
3/1/2017 2:49:49Octavia Efraim
A Roadmap for a Rigorous Science of Interpretability
This paper is concerned with a scientific approach (i.e. a rigorous definition and evaluation) to interpretability in the context of ML. Interpretable means understandable by a human. Interpretability is involved in the pursuit of other criteria generally desired of ML systems (fairness/unbiasedness, privacy, robustness, causality, usability, and trustworthiness). The researchers argue that explanations, and thus interpretability, are needed when the problem at hand is incompletely formalised. The paper puts forward a taxonomy of evaluation approaches focused on interpretability: (1) application-grounded (real humans on real tasks, i.e. extrinsic evaluation of the system, by humans); (2) human-grounded (real humans on simplified tasks, e.g. identifying errors or patterns in relation to a system's output rather than evaluating it in its context of application); (3) functionally-grounded (no humans, proxy tasks, e.g. optimising a ML model already shown to be interpretable; avoids problems related to evaluation by humans). The authors go on to propose a data-driven approach aimed at discovering factors of interpretability, which should be based on data about the performance of different methods on their specific end tasks; they suggest that data repositories for this purpose be created (similar to existing ML repositories). Method-performance data in matrix form can reveal latent dimensions to do with interpretability (dimensions related either to the task, or to the method used). Finally, the authors recommend that the research community follow a couple of general principles: the claim of the research should match the type of evaluation; applications and methods should be categorised using a common taxonomy.
Harvard John A. Paulson School of Engineering and Applied Sciences, University of Washington Department of Computer Science and Engineering
This paper is about research methodology, which is often underrated, but remains crucial for quality work. It is important that rigorous frameworks be used to support research claims.
3/1/2017 3:13:13Ethan Caballero
Bridging the Gap Between Value and Policy Based Reinforcement Learning
Acknowledging that γ-discounted entropy regularization is used in reward expectation, they formulate a new notion of softmax temporal consistency for optimal Q-values as:
Q∗(s,a) = r(s,a) + γτ log Σ_{a′} exp(Q∗(s′,a′)/τ)
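
A minimal sketch of the softmax backup on the right-hand side of this consistency relation (reward plus γτ times the log of the sum over next actions a′ of exp(Q/τ)), next to the usual hard-max backup for comparison. The function names and the toy numbers are illustrative assumptions, not code from the paper.

```python
import math

def softmax_backup(reward, next_q_values, gamma, tau):
    """r + gamma * tau * log(sum over a' of exp(Q(s', a') / tau))."""
    # Shift by the maximum before exponentiating for numerical stability.
    m = max(next_q_values)
    log_sum = m / tau + math.log(sum(math.exp((q - m) / tau) for q in next_q_values))
    return reward + gamma * tau * log_sum

def hardmax_backup(reward, next_q_values, gamma):
    """Standard Bellman backup: r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)

# Toy values (hypothetical): two next actions with Q-values 0 and 2.
soft = softmax_backup(1.0, [0.0, 2.0], gamma=0.9, tau=1.0)
hard = hardmax_backup(1.0, [0.0, 2.0], gamma=0.9)
near_hard = softmax_backup(1.0, [0.0, 2.0], gamma=0.9, tau=0.01)
```

For τ > 0 the softmax backup is never below the hard-max backup, and it approaches the hard-max backup as τ shrinks toward 0.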

They then introduce a new RL algorithm called Path Consistency Learning (PCL) that extends this softmax temporal consistency to arbitrary (multi-step) trajectories. Pseudocode of PCL Algorithm can be seen in Section 3.5 .

Unlike algorithms using Qπ- values, PCL seamlessly combines on-policy and off-policy traces. Unlike algorithms based on hard-max consistency and Q◦-values, PCL easily generalizes to multi-step backups on arbitrary paths, while maintaining proper regularization and consistent mathematical justification (that they outline in Section 3 and the appendix).

PCL is similar to A3C (and actor-critics in general) and PGQ (policy gradient q-learning) [arxiv:1611.01626], but has some key differences/improvements.

A3C vs PCL:
In comparison to A3C, PCL’s advantage function is more aligned with rewards in that advantage is 0 on every trajectory for its optimal policy, which is not the case for A3C. Also, PCL’s value function is not dependent on the current policy.

PGQ relates the optimal policy to the hard-max Q-values in the limit of τ = 0, and thus proposes to augment the actor-critic objective with offline updates that minimize a set of single-step hard-max Bellman errors; PGQ's weakness is that it assumes the limit τ = 0 (τ is the entropy regularization weight).
PCL extends the relationship to τ > 0 by exploiting a notion of softmax consistency of Q-values.

They compare the performance of PCL, A3C, and DQN on the algorithmic tasks from OpenAI Gym, and PCL outperforms or matches A3C and DQN on all tasks.

Most impressively in my opinion:
PCL can incorporate expert trajectories very easily and very effectively.
Inserting just 10 expert trajectories in a size 400 batch drastically improves performance (e.g. solving tasks with max reward in just ~50 iterations as opposed to only getting to ~half of max reward after 4000 iterations).
They did not perform a direct comparison of effect of adding in expert trajectories to other RL algorithms (mostly because it’s more complicated to do so for others).
This attribute could be very useful for tasks with a few expert trajectories such as Description2Code.
Algorithmic tasks from OpenAI Gym:
Path Consistency Learning (PCL), Softmax Temporal Consistency; Value and Policy Based Reinforcement Learning that acknowledges γ-discounted entropy regularization; on and off policy; optional ability to effectively/easily incorporate expert trajectories
Test it out on tasks more complicated than the algorithmic tasks from OpenAI Gym.
Google Brain, University of Alberta
Beats or matches all the baselines. Can use on-policy and/or off-policy data (any state-action subsequence). Can incorporate expert trajectories very easily and very effectively; adding just a few expert trajectories to a batch yields drastic improvements.
3/1/2017 8:04:37Gregory Besson
Fast and Accurate Inference with Adaptive Ensemble Prediction in Image Classification w/ Deep NNs
Ensembling multiple predictions is a widely used technique for improving accuracy in image classification tasks. Its drawback is the computation cost. The approach proposed in this paper is to add a confidence level for each input based on the probability of the predicted label (the softmax output).
When the prediction for an input reaches a high enough probability on the basis of the confidence level, we stop ensembling for this input. This way, the computation time is drastically reduced while achieving similar accuracy to the regular ensembling technique.
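
A rough sketch of the early-stopping idea described above: ensemble members are evaluated one at a time, their softmax outputs are averaged, and evaluation stops once the averaged probability of the predicted label reaches a threshold. The fixed threshold, the function name, and the toy probabilities are assumptions for illustration; the paper's exact stopping criterion may differ.

```python
def adaptive_ensemble_predict(model_outputs, threshold=0.9):
    """Average softmax outputs lazily and stop once the top probability is high enough.

    model_outputs: iterable of probability lists, one per ensemble member, for ONE input.
    Returns (predicted_label, number_of_members_used).
    """
    running_sum = None
    used = 0
    label = None
    for probs in model_outputs:
        if running_sum is None:
            running_sum = list(probs)
        else:
            running_sum = [s + p for s, p in zip(running_sum, probs)]
        used += 1
        averaged = [s / used for s in running_sum]
        label = max(range(len(averaged)), key=averaged.__getitem__)
        if averaged[label] >= threshold:
            break  # confident enough: skip the remaining ensemble members
    if label is None:
        raise ValueError("model_outputs must contain at least one prediction")
    return label, used

# Hypothetical per-member softmax outputs for an easy and a hard input.
easy = adaptive_ensemble_predict([[0.95, 0.05], [0.10, 0.90]])
hard = adaptive_ensemble_predict([[0.6, 0.4], [0.7, 0.3], [0.8, 0.2]])
```

Passing a generator as `model_outputs` means the later models are never run for inputs that stop early, which is where the compute saving comes from.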
ILSVRC 2012, Street View House Numbers (SVHN), CIFAR-10, and CIFAR-100 (with fine and coarse labels)
For the ILSVRC 2012 dataset, GoogLeNet is the network architecture, trained using stochastic gradient descent with momentum. For the other datasets, the network has six convolutional layers with batch normalization followed by two fully connected layers, trained using Adam.
Because this technique would help to achieve the ensemble technique accuracy while keeping the computation time low.
3/1/2017 8:11:23Nikolaos Sarafianos
MIML-FCN+: Multi-instance Multi-label Learning via Fully Convolutional Networks with Privileged Information
The authors propose a framework for multi-label object recognition from multiple instances by employing the Learning Using Privileged Information Paradigm in their learning. A 2-stream network is used (one for primary, one for privileged information) along with a new loss which incorporates the privileged information inside the regularization term.
VOC 2007, VOC 2012, MS COCO
Deep Learning and the Learning Using Privileged Information paradigm
NoNTU, Singapore
An interesting take-home message from that paper could be the idea of PI pooling (performing pooling only from the bounding box region which is considered as privileged information). However a discussion about the similarities and the differences of the proposed method (even at a high level) with existing approaches that address the same problem is missing. The idea of incorporating priv. info as a regularization term has also been proposed by Wang and Ji (Classifier Learning with Hidden Information, CVPR 2015) whereas, employing additional info during training of a ConvNet for ImageNet (with segmentation masks as priv. info) by Chen et al. (Training Group Orthogonal Neural Networks with Privileged Information, ICLR 2017 submission).
3/1/2017 8:21:03Spiros Raptis
eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys
The paper applies deep learning to cybersecurity detection problems: finding malicious URLs, file paths, and registry keys.

Previous machine learning detection techniques focused on automatically classifying file paths or registry keys and operated on groups of dynamic host-based observations. The proposed technique (eXpose) operates on individual events and learns the representation of input strings rather than using feature engineering (Very time consuming) and detects from the input character string based on lexical semantics.

eXpose applies a neural network directly to the input strings. It uses character embedding for the input string. It then feeds the previous layer into multiple convolutions for detecting features. For classification it uses fully connected layers where the final layer is a sigmoid layer.

The results achieve a 5-10% higher detection rate than manual feature extraction.

However, the overall results for file paths and registry keys are worse than for URLs. The explanation is that there’s a difficulty in properly labeling samples.

Finally, because the proposed technique is very computationally expensive, next steps include testing more complex architectures in better hardware.
For the first problem (malicious URLs), 19067879 unique URLs were downloaded from VirusTotal. Malicious file paths and registry keys were extracted from 18M Cuckoo sandbox runs from VirusTotal.
deep learning, convolutional neural networks, character embedding.
I believe this paper can be a proof of concept. It doesn't appear that anybody had tried something like that before.
Invincea Inc.
I'm excited because deep learning is applied to a very hard problem. Current techniques include manual feature extraction which is a very time consuming task. The paper presents the first signs that automatically recognizing malicious URLs, registry keys and file paths could be revolutionized with deep learning.
3/1/2017 9:00:44Joseph DeBartola
Building Fast and Compact Convolutional Neural Networks for Offline Handwritten Chinese Character Recognition
Handwritten Chinese Character Recognition has seen promising results in offline applications thus far, but truly state-of-the-art results demand large and costly CNNs. Here, the researchers present an application of two new techniques--Adaptive Drop-Weight and Global Supervised Low-Rank Expansion--in an attempt to both reduce the size of networks and improve their efficiency. They succeeded in attaining a computational cost an order of magnitude lower than the state of the art while reducing the network to 1/18 its original size--with only a 0.21% drop in accuracy.
CASIA Online and Offline Chinese Handwriting Databases - Namely CASIA HWDB1.0 and CASIA HWDB1.1
Deep Learning, Convolutional Neural Networks, Adaptive Drop Weight--where the pruning threshold for redundant connections is calculated dynamically, Global Supervised Low-Rank Expansion--where low-rank filters are used in place of the typically higher-dimension filters seen in CNNs
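
The two ideas named above can be illustrated with a simplified sketch: magnitude pruning where the threshold is derived from the weights themselves, and the parameter saving from replacing an m×n weight matrix with rank-r factors. This is not the paper's Adaptive Drop-Weight schedule or its Global Supervised Low-Rank Expansion training procedure; the function names and numbers are assumptions for illustration.

```python
def prune_by_magnitude(weights, prune_fraction):
    """Zero out roughly the smallest-magnitude fraction of weights.

    The threshold is computed from the current weights rather than fixed in advance.
    Returns (pruned_weights, threshold).
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * prune_fraction)
    if k == 0:
        return [row[:] for row in weights], 0.0
    threshold = magnitudes[k - 1]
    pruned = [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]
    return pruned, threshold

def low_rank_param_count(m, n, r):
    """Parameters in an (m x r) times (r x n) factorization of an m x n matrix."""
    return r * (m + n)

# Hypothetical 2x2 layer: drop the half of the weights with the smallest magnitude.
pruned, threshold = prune_by_magnitude([[0.1, -2.0], [0.05, 3.0]], prune_fraction=0.5)
dense_params = 512 * 512
factored_params = low_rank_param_count(512, 512, 32)
```

A rank-32 factorization of a 512×512 layer stores 32,768 values instead of 262,144, which is the kind of saving that low-rank filters aim for.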
The researchers plan to explore similar application of their new techniques to image classification and object detection.
School of Electronic and Information Engineering - South China University of Technology, Fujitsu Research & Development Center Co. Ltd.
The ability to move more demanding, state-of-the-art solutions to deep learning problems to offline applications will dictate the degree to which we can apply these models in our day-to-day life. Handwriting recognition is just the start here! This has major implications for all applications of CNNs which otherwise require massive, high-dimensional calculations.
3/1/2017 9:56:58Rene Wang
Deep and Hierarchical Implicit Models
The authors develop a new variational inference algorithm to estimate the parameters of two competing models in a Bayesian inference framework. One is the hierarchical implicit model, which estimates the density of a mixture model and can be applied to networks where the likelihood is intractable to compute. The other is the deep implicit model, which estimates multi-layer representations and can be applied to learn complicated representations. The new variational inference uses KL divergence as its objective and estimates the ratio for the two competing models (hierarchical implicit and deep implicit). To be scalable, stochastic gradient descent with minibatches, along with MCMC, is used for both models. With the new ratio-estimation-based variational inference, a Bayesian GAN performs better than a Bayesian neural network alone. Other experiments, such as a predator-prey simulation and a denoising autoencoder with a large network size, are conducted to demonstrate the computational efficiency of ratio-based variational inference. However, the stability of training is highly dependent on the model assumptions, and training may be unstable and fail to converge when the model deviates greatly from the true data.
simulated data set (not provided in the paper)
KL-based variational inference algorithm along with mini-batched SGD/MCMC
Columbia University, Princeton University
3/1/2017 10:13:48Ioanna Stypsanelli
Depth Creates No Bad Local Minima
This paper proposes a new simplified proof that depth alone creates no bad local minima. It builds on previous work of Kawaguchi (one of the co-authors).
So researchers trying to understand machine learning complexity can use this proof to work on designing efficient deep learning networks without special attention to depth but rather focusing on nonlinearity.
YesNo datasetcalculus/optimisationNo
Haihao Lu, Kenji Kawaguchi, MIT
3/1/2017 14:06:20Kostas Oraiopoulo
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
In very deep feedforward nets, gradients come to resemble white noise. The authors call this phenomenon the 'shattered gradients problem'; in contrast, networks with skip connections, like resnets, are more resilient to it.
They investigate how batch normalization affects shattered gradients.
At the end they investigate how "looks-linear" initialization affects very deep networks.
YesCIFAR-10resnet, CReluYes
Victoria University of Wellington New Zealand , Disney Research Zurich Switzerland
It identifies an important problem of very deep networks and opens a gate for investigation on how to fix the problem
3/1/2017 16:23:06Adam Tetelman
Learning Latent Networks in Vector Auto Regressive Models
Researchers considered the problem of causal inference from observational time series data. Given a partially observed VAR model and a set of time-series data, the researchers attempted to prove that dependencies among observed processes could be inferred. They first defined conditions under which the underlying model could be recovered from the observed processes. They then defined a limit stating how much of the model could be recovered. They derived two algorithms to recover the minimal latent network for their datasets. They built a small experimental dataset to test their theorems and saw good results. They were able to reproduce their good results with real economic data.
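
To make the setting concrete, here is a small sketch of a partially observed VAR(1) process: every process follows x_t = A x_{t-1} + noise, but only some coordinates are visible to the analyst. The coefficient matrix, noise level, and function names are made up for illustration and are not taken from the paper.

```python
import random

def simulate_var1(A, steps, noise_std=0.1, seed=0):
    """Simulate x_t = A x_{t-1} + Gaussian noise, starting from the zero vector."""
    rng = random.Random(seed)
    n = len(A)
    x = [0.0] * n
    history = []
    for _ in range(steps):
        x = [sum(A[i][j] * x[j] for j in range(n)) + rng.gauss(0.0, noise_std)
             for i in range(n)]
        history.append(x)
    return history

def observe(history, observed_indices):
    """Keep only the observed processes; the rest stay latent."""
    return [[x[i] for i in observed_indices] for x in history]

# Hypothetical system: process 2 is latent and drives processes 0 and 1.
A = [[0.5, 0.0, 0.4],
     [0.0, 0.5, 0.4],
     [0.0, 0.0, 0.8]]
full = simulate_var1(A, steps=200)
visible = observe(full, observed_indices=[0, 1])
```

In this toy system the two visible series are correlated only through the hidden third process, which is the kind of structure the recovery algorithms are meant to uncover.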
Federal Reserve Economic Database - FRED

A collection of three US dairy prices from January 1986 to December 2016

Quarterly West German consumption expenditures X1, fixed investment X2, and disposable income X3 during 1960-1982
VAR model, Directed Tree Recovery Algorithm (DTR), Node Merging Algorithm (NM)
University of Illinois at Urbana-Champaign
This research focuses on VAR models and economic modeling. Their results are more aligned with quantitative analysis than machine learning research.
3/1/2017 18:21:31Naveed Ahmed
Scaffolding Networks for Teaching and Learning to Comprehend
Researchers have developed a learning/teaching platform that would help teachers teach their students using a reasoning approach. More questioning allows one to learn any concept better and more deeply. Therefore, the authors 1) develop the Scaffolding Network, an attention-based NN that reasons over a dynamic memory, 2) build a question simulator to generate different related questions for continuous learning, and 3) train the network with an annotated dataset. Both real and synthetic datasets were used for qualitative analysis of the network.
Reinforcement Learning(RL), Deep Q-Learning (DQN)
YesMicrosoft Research
Yes, I am excited about both the problem statement and its implementation. It showed me another dimension where AI can outperform conventional teaching approaches. I also think this kind of network can be deployed outside a school environment.
3/1/2017 18:46:01Cristóbal Felipe Fica
Scene Flow to Action Map: A New Representation for RGB-D based Action Recognition with Convolutional Neural Networks Pichao Wang, Wanqing Li, Zhimin Gao, Yuyao Zhang, Chang Tang, Philip Ogunbona
With new devices like the Kinect that can capture depth in addition to a color map, human action recognition is being pursued.
Researchers propose a new method (SFAM-CTKRP) to transform scene flow (3D optical flow) into an action map (SFAM), first fusing the acquired RGB and depth data into a single 3-channel scene flow map, then using ConvNets to learn the intrinsic relationship between the scene flow map and an RGB-analogous image that can be used with existing ConvNet models pre-trained on ImageNet.
The paper also proposes a self-calibration method for RGB-D data that could be misaligned. Finally, the method surpasses the previous state-of-the-art SFAM techniques on the ChaLearn LAP IsoGD dataset (RGB-D videos, each representing one gesture instance). For the M2I dataset, however, the model fails to converge due to the lack of training data.
YesChaLearn LAP IsoGD; M2I Dataset
ConvNets, RankPooling, Multiple-Score fusion, Primal-Dual Flow
Advanced Multimedia Research Lab, University of Wollongong, Australia; University of Geosciences, Wuhan, China.
This paper achieved the highest accuracy for transforming scene flow into an action map, which was previously done mainly by hand-crafted methods, showing again the enormous power of ConvNets and how deep learning is displacing other algorithms and methods in different fields. Maybe this will lead to smoother interaction between humans and machines using gestures.
3/2/2017 1:34:30Saurav Gupta
Active Learning using uncertainty information
Researchers propose a new formulation for active learning. Importantly, they incorporate the known label-distribution information within the min-max framework. The key idea of the min-max framework is to minimize the gain of the objective function no matter what the label for the new data point is. Using the existing label distribution information in this framework shows that this method works marginally better than existing methods (EER and MLI) on a majority of 49 real-world datasets.
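
The plain min-max selection rule mentioned above can be sketched as follows: for each candidate point, take the worst-case objective over all labels it might receive, then query the candidate whose worst case is smallest. The objective here is a toy stand-in and the names are hypothetical; the paper's use of label-distribution information is not modeled.

```python
def select_query_minmax(candidates, labels, objective):
    """Pick the candidate whose worst-case objective over possible labels is smallest."""
    best_candidate = None
    best_worst_case = float("inf")
    for x in candidates:
        worst_case = max(objective(x, y) for y in labels)
        if worst_case < best_worst_case:
            best_candidate, best_worst_case = x, worst_case
    return best_candidate, best_worst_case

def toy_objective(x, y):
    """Hypothetical objective value after adding point x with label y to the training set."""
    return abs(x - 0.5) + (0.2 if y == 1 else 0.1)

choice, value = select_query_minmax([0.1, 0.45, 0.9], [0, 1], toy_objective)
```

In practice the objective would be the retrained model's loss after adding the candidate with the assumed label, which makes each evaluation far more expensive than in this toy.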
Active learningNo
TU Delft, University of Copenhagen
3/2/2017 5:42:20Kris Roosen
Learning Deep NBNN Representations for Robust Place Categorization
In previous work, spatial classification was usually performed by feeding features, previously extracted by pre-trained CNN models, to an NBNN (Naïve Bayes Nearest Neighbor) classifier. This paper aims to turn this two-step process into a single step by integrating the NBNN fully into a CNN and exploiting local deep representations. This method proves to be more robust, outperforming previously used techniques, as well as faster and computationally cheaper. Another advantage is the possibility of end-to-end training (a one-step process).
- ImageNet [J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.]
- Places [B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning
deep features for scene recognition using places database,” in NIPS,
2014.] [ B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places:
An image database for deep scene understanding,” arXiv preprint
1610.02055, 2016.]
- Sports8 [L.-J. Li and L. Fei-Fei, “What, where and who? classifying events by
scene and object recognition,” in ICCV, 2007.]
- Scene15 [S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features:
Spatial pyramid matching for recognizing natural scene categories,”
in CVPR, 2006.]
- MIT67 [A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in CVPR,
- MIT8
- [Fully Convolutional] Convolutional Neural Networks ([FC]CNN)
- Naïve Bayes Nearest Neighbors (NBNN)
- Naïve Bayes Non Linear Learning (NBNL)
There was no practical information on how to implement the ideas in practice, only a very technical mathematical derivation.
University of Rome La Sapienza, Fondazione Bruno Kessler, University of Perugia
3/2/2017 9:50:09Julie Zhu
Learning Deep Nearest Neighbor Representations Using Differentiable Boundary Trees
The two big challenges of k-NN are 1) finding good representations and distance measures between samples, and 2) computational and memory requirements. The boundary tree algorithm (published in 2015) allows for efficient nearest neighbor classification, regression and retrieval, which addresses the second challenge. The researchers in this paper improve the boundary tree algorithm. By modeling traversals in the tree as stochastic events, they form a differentiable cost function associated with the tree's predictions. They then use a deep neural network to transform the data, and backpropagating through the tree allows them to learn good representations for k-NN methods. The new algorithm works efficiently on datasets that are not large-scale, and provides high accuracy and very simple, interpretable structures.
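
One way to picture "traversals as stochastic events" is to turn distances from a query to candidate tree nodes into transition probabilities with a softmax over negative distances, which is smooth in the (learned) representation. This is only an illustrative sketch under that assumption; the function name, temperature parameter, and toy points are hypothetical and the paper's exact formulation may differ.

```python
import math

def transition_probabilities(query, candidates, temperature=1.0):
    """Softmax over negative squared Euclidean distances from the query to each candidate."""
    distances = [sum((q - c) ** 2 for q, c in zip(query, cand)) for cand in candidates]
    logits = [-d / temperature for d in distances]
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-D query and three candidate nodes; the first is closest.
probs = transition_probabilities([0.0, 0.0], [[0.1, 0.0], [1.0, 1.0], [2.0, 0.0]])
```

Because the probabilities vary smoothly with the point coordinates, a loss defined on them can be backpropagated into the network that produces the representation.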
half-moons dataset, MNIST handwritten digit dataset, CIFAR10 dataset
deep learning, boundary tree (an improved k-NN)
YesDeepMind, London, UK
The algorithm works well on data that is not very large-scale. At the end of training, the result is an extremely simple tree. The advantages of the tree are that 1) it is fast to use for classification, regression and retrieval, 2) its accuracy is high, and 3) its structure is very interpretable.
3/2/2017 20:49:22Anthony Cooper
Skin Lesion Analysis Towards Melanoma Detection
The paper outlines the method and results of an entry in the ISIC (International Skin Imaging Collaboration) 2017 challenge. It addresses two specific areas of the challenge:

1. Lesion Segmentation

The author resizes the images and applies rotations and distortions. This increases the number of training examples to 20,000. The Adam optimization algorithm trains the network over a total of 200 epochs.

2. Lesion Classification

Training occurs on 6,000 images. The author notes that images with gauze and images with a bright light become classed together. The architecture used is AlexNet (deep CNN). Training uses Adam with 10-fold cross validation and continues for 300 epochs.

Overall, it seems like this was a great method, especially considering the small set of training examples (which had to be 'duplicated') and the low epoch count.
No(possibly on ISIC 2017 website)
Deep Convolutional Networks, U-Net/AlexNet architecture, Dropout, ReLu, Adam optimization algorithm, Preprocessing
YesMatt Berseth
This is yet another step towards using machine learning and CNN's to detect cancer for early treatment and prevention. Exciting stuff. :)
3/2/2017 22:23:55Anthony Cooper
TumorNet: Lung Nodule Characterization Using Multi-View Convolutional Neural Network with Gaussian Process
In order to better identify lung nodules as malignant or benign, researchers employ a unique Convolutional Neural Network called TumorNet.

The traditional process of lung nodule classification involves hand crafted features classified using Support Vector Machines or Random Forest classifiers. More recently, pre-trained CNNs were only used for identifying features, with SVM and RF's being used for classifiers.

The researchers' method involves an end-to-end CNN that learns all the features itself! They take about 1,000 scans annotated by radiologists, and augment the data by random rotation and up/down-scaling. The CNN consists of 5 convolutional layers, 3 fully connected layers, and a softmax classification layer. Gaussian process (GP) is used to add uncertainty to the predictions.

One result the researchers noted is the link between high level nodule attributes and the deep learning features. They intend to explore more in this area, using transfer learning to help with the lack of labelled data.
YesLIDC-IDRI Dataset (reference [10])
CNN, data annotation, data augmentation, Gaussian process regression, computer vision, classification
University of Central Florida
This is absolutely incredible - using only 1,000 images, they manage to create an end-to-end CNN that classifies malignancy. This can really change lung cancer treatment + prevention, and we will see more and more examples of better methods.

I look forward to reading their follow up research. Definitely read this paper; it's worth it!
3/2/2017 22:37:55Jonathan Yan
Understanding Synthetic Gradients and Decoupled Neural Interfaces
Unlocked training can be achieved through the use of Synthetic Gradients (SG) and Decoupled Neural Interfaces (DNI). In this paper, the authors study their theoretical soundness, and show that (1) SG can preserve critical points of the original optimization problem, but can also introduce new ones; (2) under good conditions, the SG model converges to the solution of the original problem; and (3) the representations learned using SG are different from the ones learned using standard backpropagation.
Synthetic Gradients, Backpropagation, Feedback Alignment, Direct Feedback Alignment, Kickback
In Section 6, the authors provide a unified framework to connect several forms of approximate error propagation. I'm excited because this is new to me. Expressing these methods in the language of "conspiring networks" provides additional intuition and helps to spawn future innovations in this direction.
3/3/2017 20:31:55Adam Letts
A Robust Adaptive Stochastic Gradient Method for Deep Learning
Extensive update to the original (2014) Adasecant paper. Adasecant is an adaptive learning rate algorithm which automatically tunes the learning rates. They show that it obtains better performance in deep learning scenarios than popular SGD algorithms.
- Contains a much-improved Experiments section whereby the Adasecant algorithm is decomposed into its component parts, and graphs present comparative performance
- New description of close relationship to diagonal approximation to the Hessian
- Better description of their use of blockwise gradient normalization
- Overall presentation is now more fashionable
For experimental and comparison purposes:
Penn Treebank (PTB) - [ authoritative source unknown - available on many github repos - e.g. ]
MNIST database -
deep learning, Adasecant, Adagrad, blockwise gradient normalization, variance reduction, outlier detection, adaptive learning rate
The paper is an extension of the 2014 original - which had last been updated in 2015. Here is the link to the original:
Université de Montréal, University of Oxford
Removing the need to tune the learning rate can speed up development, and help achieve better results in less time. I feel that in general, approaches that reduce necessary tuning are extremely important.
In terms of the new changes to the paper, I believe that the decomposition detail should be valuable for research and tuning purposes. I wonder if the new paper could renew interest and result in more implementations of the algorithm, or provide the additional accessible detail necessary for its findings to influence other algorithms and/or development methodology.
3/4/2017 0:53:05Hyungwon Chae
Unsupervised Image-to-Image Translation Networks
Researchers at NVIDIA proposed the UNsupervised Image-to-image Translation (UNIT) framework, which is based on variational autoencoders and generative adversarial networks and can learn the translation function without any corresponding images in the two domains. This reduces the need to collect large sets of paired data. Experiments shown in this paper include translating RGB day images to night images, thermal images to color images, and rainy-day images to sunny-day images. The team trained the UNIT networks using a Tesla P100 card in an NVIDIA DGX-1 machine. The UNIT framework achieved better results than competing algorithms on benchmark datasets and beat the previous state-of-the-art approach in accuracy.
multispectral pedestrian detection benchmark, KAIST
CelebFaces dataset, CUHK
mnist dataset,
Street View House Number, stanford
USPS dataset
UNsupervised Image-to-image Translation (UNIT) framework, GAN, VAE, Unsupervised Domain Adaptation (UDA)
The team is interested in extending the framework to the unsupervised language-to-language translation task.
This research has lots of potential and can be applied to improve tasks such as autonomous driving. Self-driving cars can benefit from easily obtainable images of various road conditions. Most existing image-to-image translation approaches are based on supervised learning and require training datasets consisting of pairs of corresponding images in two domains, which can be hard to obtain, but this makes it much easier.
3/4/2017 7:11:42Dmytro Mishkin
Shortscience
Guys, what about using ? It is a website made exactly for such efforts, and it already has a bunch of papers plus search.
I am not the author of that website, but I am excited about it. And sorry for the spam.
-Because it is cool
3/5/2017 8:01:01Martijn Handels
Addresses an unsupervised learning method for language models (LM). Instead of pairing input-output samples, the authors exploit sequential statistics of output labels, in the form of N-gram language models, which can be obtained independently of input data and thus at low or no cost. The method's success over previous unsupervised approaches comes from introducing a novel cost function for this unsupervised learning setting, whose profiles are analyzed and shown to be highly non-convex with large barriers near the global optimum. A new stochastic primal-dual gradient (SPDG) method is developed to optimize this very difficult type of cost function, using dual variables to reduce the barriers. The paper demonstrates in experimental evaluation, with both synthetic and real-world data sets, that the new method for unsupervised learning gives drastically lower
TensorFlow neural net on a linear model, new stochastic primal-dual gradient method
While the current work is limited to unsupervised linear models for prediction, it is straightforward to generalize the current cost function and SPDG algorithm to nonlinear models such as deep neural nets. They also plan to extend their current method from exploiting N-gram LM to exploiting state-of-the-art LM so that the full unsupervised learning and prediction can be formulated as an end-to-end system.

Microsoft Research
Promising results with unsupervised language models
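The entry above relies on N-gram statistics of output labels that can be gathered without any paired input data. As a rough illustration of that one ingredient only, here is a minimal sketch of a maximum-likelihood bigram model estimated from label sequences. The function name `bigram_lm` and the lack of smoothing are assumptions made for illustration; this is not the paper's cost function or its SPDG optimizer.

```python
from collections import Counter, defaultdict


def bigram_lm(label_sequences):
    """Estimate P(next label | previous label) from label sequences alone.

    Hypothetical helper: plain maximum-likelihood counts, no smoothing.
    """
    counts = defaultdict(Counter)
    for seq in label_sequences:
        # Count every adjacent (previous, next) pair in the sequence.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    lm = {}
    for prev, ctr in counts.items():
        total = sum(ctr.values())
        lm[prev] = {nxt: c / total for nxt, c in ctr.items()}
    return lm


lm = bigram_lm(["abab", "abb"])
```

Because only the label sequences are used, such statistics can come from any text source, which is the property the summary highlights.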
3/5/2017 12:06:18Alex Korbonits
Coarse Grained Exponential Variational Autoencoders
Researchers relax the assumption of Gaussian priors in variational autoencoders and allow a much larger family of (parametrized) semi-continuous functions whose properties are not just great for learning more complex posterior distributions, but also, due to their semi-continuous nature, make them easy to manipulate in a discretized way for computation. This paper proposes a new method, CG-BPEF-VAE, within the variational autoencoder framework. They also prove some theoretical bounds, and touch on how to build a discrete latent structure to factor information in a continuous representation.
Deep learning, variational autoencoders, Bayesian statistics
Computer, Electrical and Mathematical Sciences and Engineering Division King Abdullah University of Science and Technology (KAUST)
I'm excited about this research because they prove theoretical bounds and extend the abstraction of variational auto-encoders beyond previous results. It's good to see proofs and theoretical results in the deep learning literature. I am not excited that they used MNIST as their primary dataset on which to test out their modeling, but I think the approach is smart and worth attempting on other datasets. Wouldn't mind seeing this new approach validated experimentally on other datasets. The writing could be much improved. The density of the mathematics in the paper made it hard to digest, but I'm sure that with a more careful reading it would be easier to parse.
3/5/2017 12:32:56Alex Korbonits
Revisiting NARX Recurrent Neural Networks for Long-Term Dependencies
MIxed hiSTory RNNs, or MIST RNNs, resolve two issues that plague the use of NARX RNNs (which attempt to address the inability of RNNs to learn long-term dependencies adequately due to vanishing gradients): (1) by using exponential delays instead of continuous delays, they greatly reduce worst-case bounds on the number of previous edges in a set of delays that need to be visited; and (2) (quoted) "Second, by restricting ourselves to a (learned) convex combination of previous states, we maintain a computational complexity that is similar to LSTM and GRUs."

I.e. they are attacking the problem of learning long-term dependencies by introducing connections from previous units but in a way that is not too computationally complex (compared to NARX RNNs) AND with better bounds.

They also spend a lot of time (in a good way) going through the mechanics of MIST RNNs so that they could be easily implemented.
TIMIT corpus (Garofolo et al., 1993); MNIST
deep learning, recurrent neural networks, NARX RNNs, MIST RNNs, vanilla RNNs, LSTMs, GRUs
The authors do a great job differentiating between different kinds of RNNs and their advantages and disadvantages in attacking the 4 general problems that this paper applies MIST RNNs to (Copy (D = 100), Addition (L = 100), TIMIT, Sequential pMNIST). They also acknowledge future work that would combine additional regularization and optimization techniques that are common across different RNN types, and they stress the importance of reproducibility by being transparent and offering code for their implementation of MIST RNNs.
Johns Hopkins University, Technische Universität München, Institute for Advanced Study at Technische Universität München
I'm excited about this research because as the authors point out, MIST RNNs are an orthogonal approach to LSTMs and RNNs, and can be combined in future work to gain (potentially) better performance. I had not reviewed many papers before that considered such specific introductions of delays to previous units in order to address the vanishing gradient problem in learning long-term dependencies.
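The two ideas in the MIST RNN entry above (exponentially spaced delays and a convex combination of previous states) can be illustrated with a small sketch. It assumes delays of 1, 2, 4, ... steps and uses a softmax over fixed scores to obtain the convex weights. The function names, the softmax parameterization, and the zero-vector fallback are illustrative assumptions, not the authors' implementation, where the weights would be learned.

```python
import math


def exponential_delays(n_delays):
    """Delays 1, 2, 4, ..., 2**(n_delays - 1) rather than 1, 2, 3, ..."""
    return [2 ** i for i in range(n_delays)]


def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_history(states, t, scores):
    """Convex combination of the states at steps t - 1, t - 2, t - 4, ...

    states: list of hidden-state vectors; states[k] is the state at step k.
    scores: one unnormalized score per delay (fixed here for illustration).
    Delays that reach before step 0 contribute a zero vector.
    """
    delays = exponential_delays(len(scores))
    weights = softmax(scores)
    dim = len(states[0])
    mixed = [0.0] * dim
    for w, d in zip(weights, delays):
        past = states[t - d] if t - d >= 0 else [0.0] * dim
        for j in range(dim):
            mixed[j] += w * past[j]
    return mixed


states = [[0.0], [1.0], [2.0], [3.0], [4.0]]
mixed = mixed_history(states, 5, [0.0, 0.0, 0.0])
```

With n delays the sketch can reach back 2**(n - 1) steps while still mixing only n states per step, which is the intuition behind keeping the cost close to that of an LSTM or GRU.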
3/5/2017 17:17:35Srinivas Neppalli
Learning Discrete Representations via Information Maximizing Self Augmented Training
This paper proposes a method called Information Maximizing Self Augmented Training (IMSAT), an information theoretic method for unsupervised discrete representation learning using deep neural networks with end-to-end regularization. IMSAT is then applied to clustering and hash learning to achieve "state-of-the-art performance" on several benchmark datasets.

In IMSAT, the data points are mapped into their discrete representations by a deep neural network and it is regularized by encouraging its prediction to be invariant to data augmentation. The predicted discrete representations then exhibit the invariance specified by the augmentation. This regularization method is called Self Augmented Training (SAT). Following the Regularized Information Maximization (RIM) for clustering (Gomes et al., 2010), researchers maximized the information theoretic dependency between inputs and their mapped outputs, while regularizing the mapping function to arrive at Information Maximizing Self Augmented Training (IMSAT).
Deep Neural Networks
Unsupervised Learning
Regularized Information Maximization (RIM)
Preferred Networks, Inc.
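The IMSAT entry above combines two terms: the information-theoretic dependency between inputs and their discrete outputs, and a penalty that keeps predictions invariant under data augmentation. A minimal sketch of both quantities follows, computed on lists of predicted class probabilities. The KL form of the penalty, the weight `lam`, and all function names are assumptions made for illustration rather than the authors' code.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of one discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mutual_information(probs):
    """I(X; Y) = H(Y) - H(Y|X), estimated from per-sample predictions p(y|x)."""
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    conditional = sum(entropy(p) for p in probs) / n
    return entropy(marginal) - conditional


def sat_penalty(probs, probs_aug, eps=1e-12):
    """Mean KL divergence between predictions on original and augmented inputs."""
    total = 0.0
    for p, q in zip(probs, probs_aug):
        total += sum(
            pi * math.log((pi + eps) / (qi + eps))
            for pi, qi in zip(p, q)
            if pi > 0.0
        )
    return total / len(probs)


def imsat_objective(probs, probs_aug, lam=0.1):
    """Loss to minimize: invariance penalty minus weighted mutual information."""
    return sat_penalty(probs, probs_aug) - lam * mutual_information(probs)


confident = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
```

Confident, balanced predictions such as `confident` maximize the mutual information term, while uninformative predictions such as `uniform` give zero.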
3/5/2017 21:33:50Saurav Gupta
Meta Networks
This paper introduces a novel network architecture for one-shot learning. The model acquires meta-level knowledge across tasks. The model consists of two modules: a meta learner and a base learner. The meta learner operates across tasks and is responsible for fast weight parameterization of both the base learner and the meta learner.
The base learner passes meta information in the form of loss gradients to the meta learner. MetaNet is shown to have good generalization and continual learning properties.
Meta learning, one-shot learning
Yes
UMass Amherst
Gets SOTA in one shot task. Uses meta learning as a technique inspired from the brain.
3/11/2017 13:14:30Darshan Pai
MoleculeNet: A Benchmark for Molecular Machine Learning
The authors introduce a benchmark system for molecular machine learning techniques. Molecular compounds are complex in nature, and determining their molecular properties (called tasks by the authors) with ab initio computations is a very complex process. Data on molecular compounds is hard to gather, and measuring chemical properties with high accuracy requires an extensive process; hence the datasets are not very large. Moreover, the lack of a common benchmark precludes comparison of methods already proposed in the literature. Machine learning helps predict these molecular properties at a much higher rate and much more accurately. The challenges include curating the data so that it can be fed to a machine learning algorithm. The goal of the paper is to provide a benchmark suite of tools named MoleculeNet that contributes a data-loading framework, guidance and algorithms for featurization techniques that provide a consistent description of the original heterogeneous and highly variable molecule data, techniques for splitting data into training and testing samples, and a range of learning models to be applied. The authors also present an analysis of 12 datasets from 4 different categories of tasks. The analysis concludes that data-driven methods for prediction can outperform physical algorithms with moderate amounts of data. Techniques based on graph convolution models give the best results with low overfitting. However, some datasets need good featurizations that are currently not available.
Most datasets are cited to individual papers given in references.
Some dataset links are provided.

Quantum Machine
DeepChem package has datasets in it:
The benchmark provides the following models that are available to users for data analysis and prediction:
logistic regression, random forest, multitask networks, bypass networks, influence relevance voting, graph convolution models
Featurization techniques for data representation look interesting.
Stanford University, Stanford School of Medicine, Schrodinger Inc
I understand the need for a benchmarking suite, which is definitely needed for better comparison of new and older approaches. However, I do not see original research other than it being an assortment of multiple research sources gathered into a common platform. It will definitely help further molecular research. Maybe the authors have some of their own techniques that they can compare. However, this was not highlighted. The authors definitely take credit for the DeepChem package, which is the base platform for MoleculeNet.
Option 1