Arxiv Paper Analysis Worksheet (Responses)

	A	B	C	D	E	F	G	H	I	J	K	L	M	N
1	Timestamp	Your Name (Full)	Paper Name	Paper Url	High level summary	Does it claim a State of the Art result?	What datasets are used in the paper? Can you provide a link to the dataset?	What techniques are used in the paper?	Are you excited about this research paper?	Put a rating between 1 and 5 for the paper.	Any other notes	Researcher affiliation	If you're excited, why are you excited?

2	2/27/2017 16:24:09	Jack Clark	Robot gains Social Intelligence through Multimodal Deep Reinforcement Learning	https://arxiv.org/abs/1702.07492	Researchers train a robot via MDQN to learn social interaction skills with people. The robot is trained to select from four possible actions (waiting, looking towards human, waving hand, handshaking). It gets a reward for a handshake, 0 for the other actions, and -0.1 for an unsuccessful handshake. During testing, human volunteers evaluated the correctness of the robot's response to various situations. The robot learns to do interesting things, like infer people's walking trajectories, or learn which actions to take to increase the likelihood of a handshake (like waving to someone walking towards it).	Maybe	New dataset, Deep RHI https://sites.google.com/a/irl.sys.es.osaka-u.ac.jp/member/home/ahmedqureshi/deephri	deep learning, multimodal deep Q-network (MDQN)	Seems quite novel to me. Not a common area of research	3	Uses Aldebaran's Pepper robot. Next steps are to increase the action space beyond four actions, and use a recurrent attention model so they can get more state on the robot. This paper is interesting more for the future it indicates than for the proof-of-concept research it consists of.
3	2/27/2017 19:52:44	Jack Clark	Neural Map: Structured Memory for Deep Reinforcement Learning	https://arxiv.org/abs/1702.08360	Represent a memory system to an RL agent as a map. Do this by making the memory a spatially structured 2D memory image that can learn to store info about the environment. Structure the map so the agent takes it in as an embedding with current position, then use it to navigate the agent. Experiments claim better performance than MemNN-32, random, and an LSTM on a new "goal search" environment. Uses gated recurrent units to create a better performance characteristic. Next plan is an "ego-centric" map, similar to recent work by Sergey Levine and others.	Maybe	VizDoom	deep learning, gated recurrent units, LSTM, reinforcement learning, GRU	Mapping + neural feels like a new area, and this intersects with work on memory networks etc. from 2014 onward. Quite original.	4	Maps look like a neat, hacky way to deal with complex memory representations of navigation/action tasks. Also: interpretable!	CMU
4	2/27/2017 20:25:54	Matthew Gibson	A multi-task convolutional neural network for mega-city analysis using very high resolution satellite imagery and geospatial data	https://arxiv.org/abs/1702.07985	Authors train a pretty standard CNN on satellite imagery to compute some statistics interesting to remote sensing researchers: land-cover classification, building density, and floor area ratio. They then use these statistics to estimate population density. Results for land-cover classification seem reasonable, but there is not a lot of quantitative assessment.	No	DigitalGlobe satellite imagery + private dataset derived from Wuhan planning bureau	deep learning, CNNs	Somewhat novel, applies known technique to known problem, but this particular technique-problem pair is novel.	2	The paper suffers from not benchmarking against other known methods or datasets such as the UC Merced Land Use Dataset. There is not a lot interesting here from an ML/CV point of view save the application. Seems oriented toward remote sensing researchers.	Wuhan University
5	2/27/2017 20:54:12	Jack Clark	Deceiving Google's Perspective API Built for Detecting Toxic Comments	https://arxiv.org/abs/1702.08138	Targets recently released Google 'Perspective' service to detect abusive comments. Demonstrates vulnerability via creation of adversarial examples (text that contains abusive phrases but is somehow hidden). Simple misspellings trick Google's API. E.g. "They are stupid and ignorant with no class" (91% abusive) > "They are st.upid and ig.norant with no class" (11%). Additional, more subtle perturbations, e.g. "They are liberal idiots who are uneducated." (90%) > "They are not liberal idiots who are uneducated." (83%), but note the significantly less pronounced effect.	No	Google Perspective	basically none?	no	1	No methods section, so unclear how adversarial phrases were derived - if in an automated way then I might move this to a 2, but otherwise hard to see the ML component here.	University of Washington
6	2/27/2017 21:34:58	Lynn Langit	Skin Lesion Classification Using Hybrid Deep Neural Networks	https://arxiv.org/abs/1702.08434	Tests this method (pre-trained convolutional neural networks and ensemble learning) to classify skin lesions (images) using 2,000 images for training and 150 images for testing. Result is 84.8% and 93.6% correct for melanoma and seborrheic keratosis labeling via binary classification, which compares to diagnosis via an experienced dermatologist. Proposes that a larger training dataset would result in increased model performance.	No	The available images from ISIC 2017 challenge were used for training purposes. This dataset is composed of 2000 color dermoscopic skin images with corresponding labels. Link is here -- https://challenge.kitware.com/#phase/5840f53ccad3a51cc66c8dab	pre-trained convolutional neural networks and ensemble learning	No	3	I would like to see this research repeated with a much larger dataset.	University of Vienna, Vienna, Austria & TissueGnostics GmbH, Vienna, Austria
7	2/27/2017 21:48:55	Jack Clark	CHAOS: A Parallelization Scheme for Training Convolutional Neural Networks on Intel Xeon Phi	https://arxiv.org/abs/1702.07908	New hardware alert! Researchers introduce 'Controlled Hogwild with Arbitrary Order of Synchronization' (CHAOS), a parallelization method for neural nets. Has thread and vector parallelism and other nice timey-wimey features. Can show speedups of 103X on Xeon Phi, compared to execution on one thread, 14X to sequential execution of Xeon E5, and 58X of Intel Core i5. Contains a nice review of history of parallelization approaches. Future scaling a'hoy "results of the performance model indicate that CHAOS scales well beyond the 240 hardware threads of the Intel Xeon Phi that is used in this paper for experimentation."	Maybe	N/A	HOGWILD, CHAOS, cnns,	New architecture means new optimizations. James Mickens would be proud.	2	Proves out Xeon Phi performance, but lack of comparison to GPUs makes hard to compare. Valuable signal of benefit of optimizations, though.	Linnaeus University, Machine Intelligence Research Labs
8	2/27/2017 22:37:43	Asim Imdad	Multi-Label Segmentation via Residual-Driven Adaptive Regularization	https://arxiv.org/pdf/1702.08336.pdf	Researchers present a multi-label segmentation algorithm which uses a robust Huber loss for both the data term and the regularizer. The proposed model has been designed to make use of a prior, or regularization functional, that adapts to the residual during convergence of the algorithm in both space and time. The authors designed the cost function in a parameter-free way and it can be minimized with a convex optimization framework. The results are tested and compared with the state of the art in the field, such as total variation based algorithms.	Maybe	Berkeley segmentation dataset https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/	Convex Optimization and Regularization	It uses a new regularization scheme to control the segmentation	5		Udacity
9	2/27/2017 22:55:28	Cynthia Yeung	Learning Hierarchical Features from Generative Models	https://arxiv.org/abs/1702.08396	Researchers show that hierarchical generative models aren't much better than non-hierarchical ones. They then propose an alternative approach (Variational Ladder Autoencoders), which they train over three datasets, and obtain results similar to those obtained with InfoGAN. However, this new approach is still limited in that it can't specify the type of feature to be disentangled and can't learn structures except for disentanglement.	No	Various datasets available through GitHub: https://github.com/ShengjiaZhao/variational-Ladder-Autoencoder	hierarchical variational autoencoders and variational ladder autoencoders	No.	2		Stanford
10	2/27/2017 22:58:28	Dhruv Parthasarathy	On the Origin of Deep Learning	arxiv.org/abs/1702.07800	This paper covers the history of deep learning and provides a wonderful survey of the field as well. It goes through how the original ideas were created and provides the biological inspirations where applicable. It goes through the original findings in each piece of DL today - perceptrons, backprop, deep nets, ... all the way to GANs. It ends by showing how historical results can help us discover new areas for research.	No	None	history	yes - you rarely see excellent summaries of the history of the field and how it has evolved.	5	This is just a must read for anyone interested in the space. Well written and inspiring.	CMU
11	2/27/2017 23:01:30	Adrian Ulloa	Boundary-Seeking Generative Adversarial Networks	https://arxiv.org/abs/1702.08431	A new GAN structure is introduced, which relies on generating samples that lie on the decision boundary of the discriminator in each update. It is shown empirically that this approach works for discrete data (and continues to work with continuous data), an active area of research on GANs lately, and performs better than the reparametrization technique (aka Gumbel-softmax). It also provides a strong first step in building a unified learning framework for both types of data.	Yes	SVHN (http://ufldl.stanford.edu/housenumbers/), quantized CelebA (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)	deep convolutional GANs	Yes, as far as I'm aware. It is a fairly hot research area though	4	I would love to see an implementation, hope to see a release of the code soon :)	University of Montreal, University of Waterloo, New York University, CIFAR
12	2/27/2017 23:18:45	Cynthia Yeung	Deep Voice: Real-time Neural Text-to-Speech	Deep Voice: Real-time Neural Text-to-Speech	Researchers create a fully-neural TTS system consisting of the following: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. The system requires zero manual annotation. The result? A 400X speedup over previous WaveNet inference implementations.	Yes	1. Internal English speech database containing approximately 20 hours of speech data segmented into 13,079 utterances. 2. Subset of the Blizzard 2013 data (Prahallad et al., 2013)	deep neural networks	Yes. Results are impressive.	4	Faster-than-real-time inference!	Baidu
13	2/27/2017 23:19:50	Remya Cherussery Varriem	A case study on English-Malayalam Machine Translation	https://arxiv.org/ftp/arxiv/papers/1702/1702.08217.pdf	The research focuses on a comparison between Statistical Machine Translation (SMT) and Rule Based Machine Translation (RBMT) for English-Malayalam translation. Because of their highly different structure, these languages come with challenges in terms of ambiguity in meaning, as well as differences in vocabulary. The authors preprocessed data and trained the SMT on around 25k - 30k sentences. They ran text through the two models and did error analysis on the results. SMT outperformed RBMT but it could not do morphology analysis, so the goal is to continue the research by giving the SMT model a morphology analyzer and lowering the error rate even further.	Yes	Could not find the exact data. But it is part of http://www.iiitmk.ac.in/vrclc/	Statistical Machine Learning	I think so. I speak Malayalam and I am yet to see excellent translator software. I can only imagine how it can be extended to other languages, which all have their origin in Sanskrit.	3	This research has great potential if it can take into consideration how complex and rich Malayalam is and that there is no way to translate word for word. Taking into consideration variations in slang, dialect etc., it can get even more complex. I am excited to have found this paper and related papers	IIT Bombay
14	2/27/2017 23:22:25	Rinat Maksutov	Spatially Aware Melanoma Segmentation Using Hybrid Deep Learning Techniques	https://arxiv.org/abs/1702.07963	A new network architecture is proposed for a more accurate delineation of skin lesion images from the dataset of the ISBI 2017 lesion segmentation challenge. The researchers propose a hybrid method, which is a combination of deep convolutional networks with recurrent networks. It is aimed at dealing with the relatively low accuracy of the previously proposed architectures on low contrast and hair occluded images of a lesion. By injecting 4 recurrent network layers between the convolutional encoders and pool layers and the convolutional decoder layer it was possible to significantly outperform the traditional methods.	Yes	https://challenge.kitware.com/#phase/5841916ccad3a51cc66c8db0	Recurrent neural networks, convolutional neural networks	Does not seem novel, but provides an alternative way of solving a certain problem using the existing techniques.	3		Deakin University
15	2/27/2017 23:28:20	Dhruv Parthasarathy	Generative Adversarial Active Learning	https://arxiv.org/abs/1702.07956	The authors apply GANs in the field of Active Learning. In Active Learning, the algorithm has a pool of training samples to choose from but can only get labels for a limited set. This method uses GANs to synthesize the optimal training data to get labeled. This is the first application of GANs to this field. The results are comparable to state of the art results but only in some cases. More work will need to be done to verify that this is a promising approach.	Maybe	CIFAR-10, MNIST, SVHN	GANs, Active Learning	Yes - This seems to be the first paper to apply GANs in the context of Active Learning.	3		Boston College
16	2/28/2017 0:00:24	Dhruv Parthasarathy	Adversarial Networks for the Detection of Aggressive Prostate Cancer	https://arxiv.org/abs/1702.08014	The authors use adversarial training on U-Net to segment MRI images to reveal tumors. This seems well suited to the task given that such a method does well in fields with limited labeled data such as medical imaging. The authors find that such a method outperforms simple cross entropy on the given data set both in terms of sensitivity and Dice coefficient.	Yes	MRI dataset from 152 patients acquired at the National Center for Tumor Diseases in Heidelberg, Germany.	Convnets trained using adversarial training.	Yes - the authors claim that this is the first paper to introduce adversarial training for semantic segmentation of medical images.	3		German Cancer Research Center (DKFZ)
17	2/28/2017 0:05:53	Prateek Dorwal	Low-Precision Batch-Normalized Activations	https://arxiv.org/abs/1702.08231	This paper presents a novel approach to quantization of batch normalized activations to reduce memory requirements and computational costs. It presents low precision approximation formulas that can be applied after normalization but before the activation function, so as to replace costly multiplication operations with cheaper addition operations. This decrease in costs comes with a trade-off of reduced accuracy (a minor loss though). In the paper, the researcher has presented results of his experimentation on ResNet models on ImageNet	No	CIFAR-10	convolutional neural network	No. There are papers with a similar approach to achieving lower precision, e.g. http://arxiv.org/abs/1603.01025 . However the researcher applies these formulas before activations, which is a new approach.	3		Facebook AI research
18	2/28/2017 0:32:40	Anthony Cooper	Automated Verification and Synthesis of Embedded Systems using Machine Learning	https://arxiv.org/abs/1702.07847	The author presents an overview of current problems and areas of focus regarding embedded computer systems (C, C++ based). Autonomous vehicles like UAVs are referenced, along with disease diagnosis, and some possible issues are highlighted. Unfortunately, most of the research problems (RP 1-6) have particularly specific applications, and the paper fails to deliver new insight about machine learning. Little about ML is referenced or discussed, and the paper's usefulness outside of specific engineering applications is questionable.	Maybe	N/A	SMT-related (Little on ML)	Nothing extremely groundbreaking here, although it does present current issues around embedded systems.	2	The paper is quite short, at only 3 pages, and I feel it would have been better combined with a larger piece of work. It presents current issues which can definitely help raise awareness and promote discussion, but it doesn't strike me as a must-read. Those working with C or C++ based systems in autonomous vehicles, like self driving cars or UAVs, might find this paper more relevant. :)	University of Oxford
19	2/28/2017 4:21:26	Sagar Uprety	Dynamic Word Embeddings via Skip-Gram Filtering	https://arxiv.org/pdf/1702.08359.pdf	Word embeddings are representations of words as vectors which capture their semantic context. This paper investigates the semantic evolution of words over time, by evaluating continuous word embeddings over a very long time period. Cosine similarity between the word vectors of individual words is plotted against time, which captures how the meaning of a word has changed over time.	Yes	1) Google books corpus: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html 2) State of the Union (SoU) addresses of US Presidents: http://www.presidency.ucsb.edu/sou.php 3) Twitter data for 21 randomly picked days between 2010 and 2016	Skip-gram filtering, Skip-gram smoothing, Stochastic gradient ascent	I find it novel because it studies the continuous evolution of word embeddings. Evolution of word embeddings has been studied before but those studies were based on static evolution. This paper gives better results.	4		Disney Research
20	2/28/2017 4:57:48	Nikolay Morozov	Bayesian Nonparametric Feature and Policy Learning for Decision-Making	https://arxiv.org/abs/1702.08001	Authors propose the use of a Bayesian Nonparametric Model on the Indian Buffet Process, where features are inferred from observed data with the assumption that observations are represented by latent features. Features are based on policies; the agent's decisions are based on a superset of those policies. The approach shows good results, although high noise and a high number of features appear to be a challenging problem. Application to a self-driving car scenario was considered and real data from the KITTI benchmark were used to validate algorithm efficiency; reliable results were obtained (overall accuracy is consistently over 70%).	Maybe	http://www.cvlibs.net/datasets/kitti/raw_data.php	Bayesian Nonparametric Model on the Indian Buffet Process	Learning from demonstration in application is not something unseen before, however the present algorithmic approach demonstrated reliable results on realistic data in application to self-driving cars.	4		Technische Universität Darmstadt
21	2/28/2017 8:18:50	Nikolaos Sarafianos	Seeing What is Not There: Learning Context to Determine Where Objects Are Missing	https://arxiv.org/pdf/1702.07971.pdf	The authors are interested in finding the potential locations of missing objects in images (in their case road curbs). To do so they utilize contextual information (heat maps) to find where an object should occur in the image and object detection for the primary object. They use a Siamese Fully convolutional Context network fed with (i) the masked image for classification and (ii) the raw image which enforces the network to output similar results regardless of whether the binary object mask is used or not.	Yes	Street View Curb Ramps Dataset	ConvNets for context and objects in images	Yes, as it combines contextual information with object detection masks (to hide the already detected object) so as to identify locations in the image where an object could appear	4	Very well motivated. Although very different from both it reminded me of these 2 papers: (i) https://arxiv.org/pdf/1512.06974.pdf (ii) https://arxiv.org/pdf/1701.06772.pdf	University of Maryland
22	2/28/2017 8:52:23	Kostas Oraiopoulos	A Unifying Framework for Convergence Analysis of Approximate Newton Methods	https://arxiv.org/pdf/1702.08124.pdf	The paper proposes a framework to analyze local convergence properties of sub sampled Newton methods. Sub sampled Newton methods are a new approach to solving optimization problems, where instead of calculating the complete Hessian matrix as in the classical Newton method, a small subset of samples is used to approximate the Hessian. They try to explore the convergence properties of the proposed sub sampling methods. Namely, they try to address three basic questions. 1. Is the Lipschitz continuity of the Hessian necessary to achieve a linear-quadratic convergence rate? 2. What is the size of the sketch Newton method that is required to obtain linear convergence? 3. Can regularized sub sampled Newton methods use a smaller sample size than conventional sub sampled Newton methods?	Maybe	No dataset	calculus	Yes. It is focused on approximating second order optimization methods, which has been an active area in the last couple of years. Not a breakthrough but exploring important questions of the area	3	The paper requires heavy mathematical knowledge of the field and the details of the paper are accessible only to scientists that are closely related to the field. On the other hand the research on these subsampled Newton methods seems extremely interesting	Shanghai Jiao Tong University, Peking University
23	2/28/2017 9:27:13	Matt Blomquist	Uniform Deviation Bounds for Unbounded Loss Functions like k-Means	https://arxiv.org/pdf/1702.08249.pdf	Researchers develop a framework to obtain uniform deviation bounds for k-means loss functions when the kurtosis of the underlying distribution is bounded. The researchers present a general form for obtaining a uniform deviation bound on the k-means function and describe its benefits over previous research. These benefits include being scale-invariant and stability for an unbounded solution space. The researchers provide all of the mathematical proofs for the new framework and see a significant improvement in the complexity as compared to previous research.	Yes	No.	k-means clustering	The paper is somewhat novel as there is a significant improvement from previous research. However, broad implementation of the findings is not presented.	3	Better results are obtained on stronger assumptions such as a bounded higher moment, subgaussianity, or bounded support.	ETH Zurich
24	2/28/2017 10:00:16	Dmytro Filatov	An Efficient Pseudo-likelihood Method for Sparse Binary Pairwise Markov Network Estimation	https://arxiv.org/abs/1702.08320	Researchers offer a faster alternative to the optimization of pseudo-likelihood by posing pseudo-likelihood (PL) as a logistic regression (LR) problem in the context of learning an L1-regularized binary pairwise Markov network (BPMN). The proposed method, Pseudolikelihood method using glmnet (PLG), offers a substantial speedup without losing accuracy. Researchers conduct a performance experiment using senator voting records. The task of interest was to investigate the clustering effect of voting consistency, to find the senators who are more likely to cast similar votes on bills.	Yes	U.S. Senate Roll Call Votes 109th Congress - 2nd Session (2006), https://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_109_2.htm	Logistic regression (LR), pseudo-likelihood approach (PL), Pseudolikelihood method using glmnet (PLG), binary pairwise Markov networks (BPMNs), node-wise logistic regression (NLR)	Somewhat. Researchers introduce a new optimization method and claim that this method substantially outperforms the existing state-of-the-art implementation of PL: https://pdfs.semanticscholar.org/92cb/c010c401d26fd1b811b2bb10acd83ed5078e.pdf ("Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods", Holger Hofling, Robert Tibshirani, 2010)	3		University of Wisconsin
25	2/28/2017 11:48:42	Michele Cavaioni	Asymmetric Tri-training for Unsupervised Domain Adaptation	https://arxiv.org/pdf/1702.08400.pdf	This paper proposes a tri-training method for unsupervised domain adaptation. It aims to learn discriminative representations by utilizing pseudo-labels assigned to unlabeled target samples. It uses three classifiers, where two networks assign pseudo-labels to unlabeled target samples and the remaining network learns from them.	Yes	MNIST, SVHN dataset; Synthetic Traffic Signs and German Traffic Signs Recognition Benchmark datasets; Amazon Reviews dataset.	Asymmetric tri-training for unsupervised domain adaptation, where asymmetric means assigning different roles to three classifiers.	It's not too novel as it leverages the existing methodology of Co-training (leveraging multiple classifiers to artificially label unlabeled samples and retrain the classifiers).	2		The University of Tokyo, Tokyo, Japan
26	2/28/2017 14:47:49	Claudio Bernardo Rodriguez Rodriguez	Efficient Learning of Graded Membership Models	https://arxiv.org/abs/1702.07933	Researchers propose a new method for calculating graded membership. Authors propose their new approach PTPQP and walk through the mathematical steps. Authors compare the real mean squared error with other popular methods and show how, as N increases, their method has better accuracy and less computation. Authors present Matlab code for their method.	Yes	No dataset	It uses Dirichlet to allocate a variable into a cluster to discover latent variables. To do this they propose their method, partitioned tensor parallel quadratic programming (PTPQP), which can be parallelized.	Seems very novel and very useful for unsupervised learning.	3	Authors tackle Latent Variable Model issues with negative numbers. Research method has similar prediction rates as TPM and HALS. Next step would be to validate results on more datasets and test parallelization of the algorithm.	Duke University
27	2/28/2017 17:30:15	Anthony Cooper	Related Pins at Pinterest: The Evolution of a Real-World Recommender System	https://arxiv.org/abs/1702.07969	The authors begin by presenting the thought that no state-of-the-art system is built instantly. Instead, the system requires many hours of development and iterations to be successful. This paper highlights the attempts, results, and reasoning of the Pinterest researchers and engineers who worked on 'Related Pins', their machine-learning driven system that suggests pins for users. In order to maximize the number of users who click the suggested pins, the researchers employ several methods (candidate generation, Memboost, ranking), many of which leverage machine-learning techniques. The paper also documents the challenges the researchers faced when building their system and the solutions they found.	No	N/A	Recommender systems, gradient-boosted decision trees (GBDT), Random walk service			The authors share their journey, starting with 2 engineers shipping a product to test in 3 weeks, and the three years of research and implementation that come after. I enjoyed how thorough and honest the paper was. I was also pleasantly surprised with how readable and accessible the material is, compared to some other highly technical, math-heavy papers.	Pinterest
28	2/28/2017 18:26:04	Jack Clark	Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning	https://arxiv.org/abs/1702.08887	Tries to solve the problem of RL mostly being good for single agents. Claims some success on a Starcraft unit micromanagement task. Modern RL methods frequently use an experience replay memory. How do we make such a memory work with multiple independent Q-learners? Instead you learn a policy conditioned on an estimate of the policies of other agents, and you infer this from their behavior. Also adds in some tweaks to memory to make it work better with inferring recent important actions versus old ones, particularly when analyzing different agents' actions. One of the tricks here seems to be the assumption that you're training all your agents in cooperation with one another.	Maybe	Starcraft	IQL, Deep Learning, Q Learning, RNN	Yes		Multi-agent learning seems bloody expensive (computationally) compared to single agent learning. Seems like we need many more researchers working on this problem to stop computation costs exploding. The 'fingerprint' idea outlined in this paper is intriguing but a little hard to parse on an early read.	University of Oxford, Microsoft Research Redmond	Provides another view on how to tackle complex, dynamic, partially observable worlds like Starcraft.
29	2/28/2017 19:29:44	Anthony Cooper	Show, Attend and Interact: Perceivable Human-Robot Social Interaction through Neural Attention Q-Network	https://arxiv.org/abs/1702.08626	Researchers create a robot called "Pepper" to test human-robot social interactions (HSRI). The robot can do four things - wait, look towards human, wave hand, and handshake. They use a new model called a Multimodal Deep Attention Q-Network (MDARQN) to train the robot, which is composed of two Q-network streams running independently. Each stream consists of convolutional neural networks, long short term memory (LSTM) networks, and an attention network. The researchers train Pepper over 14 days, with the result of Pepper learning intention and intent from human behavioral patterns and actions.	Yes	New Dataset + Code, Physical human-robot social interaction https://sites.google.com/a/irl.sys.es.osaka-u.ac.jp/member/home/ahmed-qureshi/deephri	Deep learning, reinforcement learning, dual-stream convolutional neural networks, multimodal deep attention recurrent Q-network (MDARQN), LSTM	Yes		They use Aldebaran's Pepper robot, I believe.	Osaka University	Even though Pepper only had four actions to choose from, the robot managed to learn and adapt in an uncontrolled, real world environment. In the future, a similar sized robot may have four hundred actions to choose from - the applications are tremendous! The researchers' proposed network looks very promising, and it shows that there is at least one way to successfully train an agent to understand human behavioral cues.
30	2/28/2017 20:14:45	Anthony Cooper	Multi-agent systems and decentralized artificial superintelligence	https://arxiv.org/abs/1702.08529	The paper is simply an outline of how one could use existing infrastructure and tools to create decentralized AI, using multi-agent systems across several domains. Nothing modern is referenced, such as machine learning or neural networks. It focuses more on theory, and without actual, current data, I do not see how this paper presents anything new.	No	N/A	Berkeley Open Infrastructure for Network Computing (BOINC),	No		While I found the topic and proposal interesting, the most recent reference is from 1997. Despite this, modern technologies like Siry (Siri?), and the Ethereum blockchain are brought up along with Windows 95. The paper also contains grammatical and formatting errors which impair readability and flow. There is no research conducted; the paper resembles an outline on possible methods to create decentralized AI using existing tools, like BOINC or P2P networks.	Moscow Institute of Physics and Technology	N/A
31	2/28/2017 21:18:30	Fred Tanada	Improving Machine Learning Ability with Fine-Tuning	https://arxiv.org/pdf/1702.08563.pdf	The supplemental IRT training data has two qualities that make it desirable as training data: (i) the data were identified because of their local independence from each other in the process of fitting an evaluation metric to measure latent ability, and (ii) each example in the data has a large number. Our contributions are as follows: (i) we show that fine-tuning parameter weights for a memory-augmented neural network (Munkhdalai & Yu, 2017) with a specialized supplemental training set improves model performance for certain tasks when compared with a human population, (ii) this fine-tuning does not overfit on the supplemental data and therefore does not negatively affect generalization in terms of accuracy, and (iii) we motivate the use of supplemental training data as a way to improve performance in terms of ability with regards to a human population. The model may have learned a set of parameters that better models the data than the human population, and updating parameters to reflect the human distribution could lead to a drop in performance. In this case the fact that performance in terms of theta improved is a surprising result. Fine-tuning with the full SICK training set negatively impacts performance in all cases, while fine-tuning with a random sample from SICK improves performance over the original model in all cases except G4-Neutral, where the original model performed best. By introducing only a subset of new data to the model in the fine-tuning phase, the model is able to update its parameters without completely re-learning based on the new data.	Yes	Hovy, Dirk, Berg-Kirkpatrick, Taylor, Vaswani, Ashish, and Hovy, Eduard. Learning whom to trust with MACE. http://aclweb.org/anthology/	Sentences Involving Compositional Knowledge (SICK) data set, another widely used RTE data set (Marelli et al., 2014). Algorithm 1, Fine-Tuning Procedure. Input: NumEpochs e, XN_train, X_test, FT_train, IRT_test, loss function l. For i = 1 to e: train NSE with XN_train with loss function CCE. For i = 1 to e: train NSE with FT_train with loss function l.	Yes			Columbia University	I'm excited about this research because it will improve the overfitting results from human performance.
32	2/28/2017 21:50:40	Fred Tanada	Improving the Neural GPU Architecture for Algorithm Learning	https://arxiv.org/pdf/1702.08727.pdf	The proposed improvements allow us to achieve substantial gains: the model can learn binary multiplication in 800 steps versus 30000 steps that are needed for the original Neural GPU, and, most importantly, all the trained models generalize to 100 times longer inputs with less than 1% error. It can learn fairly complicated algorithms such as addition and binary multiplication, but only a small fraction of the trained models generalize well. The improved architecture can easily learn a variety of tasks including the binary multiplication on which other architectures struggle. The correct generalization of the learned models to arbitrary large inputs is still an open problem, and it is not even clear why some models generalize, and others do not.	Yes	Datasets can be found in Appendix https://arxiv.org/pdf/1702.08727.pdf	AdaMax optimizer, Neural Turing Machine, Diagonal Convolutional Gated Recurrent Unit (DCGRU)	Yes			Cornell University	I am excited because I would like to improve performance in my GPU instances which I use in the self-driving program.
33	2/28/2017 22:03:30	Dmytro Filatov	Deep Forest: Towards An Alternative to Deep Neural Networks	https://arxiv.org/abs/1702.08835	Researchers propose gcForest (multi-Grained Cascade forest), a novel decision tree ensemble approach with performance highly competitive with deep neural networks (DNN). gcForest includes two important components: cascade forest and multi-grained scanning. Compared to a DNN, gcForest has far fewer hyper-parameters and is less sensitive to parameter settings. This approach shows good accuracy on: 1. Face Recognition – ORL dataset: 98.30% gcForest vs 92.50% DNN (CNN) 2. Music Classification – The GTZAN dataset: 65.67% gcForest vs 59.20% CNN 3. Sentiment Classification – The IMDB dataset: 89.32% gcForest vs 89.02% DNN (CNN)	Maybe	MNIST http://yann.lecun.com/exdb/mnist/, GTZAN Genre Collection http://marsyasweb.appspot.com/download/data_sets/, sEMG for Basic Hand movements https://archive.ics.uci.edu/ml/datasets/sEMG+for+Basic+Hand+movements, The IMDB dataset https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset, UCI-datasets https://archive.ics.uci.edu/ml/datasets.html,	multi-Grained Cascade forest (gcForest), cascade forest, multi-grained scanning, deep neural networks (DNN), convolutional neural network (CNN)	Yes		Shows good performance on CPUs: run on 2 Intel E5 2670 v3 CPUs (24 cores)	Nanjing University	gcForest has far fewer hyper-parameters to tweak than deep neural networks; performs well with default settings across different data from different domains; easy to analyze
34	2/28/2017 22:40:34	Anuj Gupta	Related Pins at Pinterest: The Evolution of a Real-World Recommender System	https://arxiv.org/abs/1702.07969	The paper opens by describing how the initial recommendation engine, built with limited computational resources, has been transformed into an advanced engine with added real-time complexity. It covers the challenges (avoiding feedback loops, performance evaluation, etc.) and the trade-offs required for the engine to succeed without a state-of-the-art system. The authors began by observing problems in the initial product and addressing them, first designing the simplest, highest-leverage products and then reaching incremental milestones using several methods (such as memboost, ranking, and GBDT). They conclude by presenting results, an evaluation, and a comparison with the previous engine.	No	N/A	Recommendation engine, GBDT, memboost, random walk	Yes			Pinterest	I am excited because this research combines challenging legacy concepts and proving them wrong with today's not-so-state-of-the-art system. It can also grow into a large system with the latest technologies such as deep learning, CNNs, etc.
35	3/1/2017 2:49:49	Octavia Efraim	A Roadmap for a Rigorous Science of Interpretability	https://arxiv.org/abs/1702.08608	This paper is concerned with a scientific approach (i.e. a rigorous definition and evaluation) to interpretability in the context of ML. Interpretable means understandable by a human. Interpretability is involved in the pursuit of other criteria generally desired of ML systems (fairness/unbiasedness, privacy, robustness, causality, usability, and trustworthiness). The researchers argue that explanations, and thus interpretability, are needed when the problem at hand is incompletely formalised. The paper puts forward a taxonomy of evaluation approaches focused on interpretability: (1) application-grounded (real humans on real tasks, i.e. extrinsic evaluation of the system, by humans); (2) human-grounded (real humans on simplified tasks, e.g. identifying errors or patterns in relation to a system's output rather than evaluating it in its context of application); (3) functionally-grounded (no humans, proxy tasks, e.g. optimising an ML model already shown to be interpretable; avoids problems related to evaluation by humans). The authors go on to propose a data-driven approach aimed at discovering factors of interpretability, which should be based on data about the performance of different methods on their specific end tasks; they suggest that data repositories for this purpose be created (similar to existing ML repositories). Method-performance data in matrix form can reveal latent dimensions to do with interpretability (dimensions related either to the task, or to the method used). Finally, the authors recommend that the research community follow a couple of general principles: the claim of the research should match the type of evaluation; applications and methods should be categorised using a common taxonomy.	No	none	none	Yes			Harvard John A. Paulson School of Engineering and Applied Sciences, University of Washington Department of Computer Science and Engineering	This paper is about research methodology, which is often underrated, but remains crucial for quality work. It is important that rigorous frameworks be used to support research claims.
36	3/1/2017 3:13:13	Ethan Caballero	Bridging the Gap Between Value and Policy Based Reinforcement Learning	https://arxiv.org/abs/1702.08892	Acknowledging that γ-discounted entropy regularization is used in the expected reward, they formulate a new notion of softmax temporal consistency for optimal Q-values as: Q∗(s,a) = r(s,a) + γτ log Σ_a′ exp(Q∗(s′,a′)/τ). They then introduce a new RL algorithm called Path Consistency Learning (PCL) that extends this softmax temporal consistency to arbitrary (multi-step) trajectories. Pseudocode of the PCL algorithm can be seen in Section 3.5. Unlike algorithms using Qπ-values, PCL seamlessly combines on-policy and off-policy traces. Unlike algorithms based on hard-max consistency and Q◦-values, PCL easily generalizes to multi-step backups on arbitrary paths, while maintaining proper regularization and consistent mathematical justification (that they outline in Section 3 and the appendix). PCL is similar to A3C (and actor-critics in general) and PGQ (policy gradient Q-learning) [arxiv:1611.01626], but has some key differences/improvements. A3C vs PCL: In comparison to A3C, PCL’s advantage function is more aligned with rewards in that advantage is 0 on every trajectory for its optimal policy, which is not the case for A3C. Also, PCL’s value function is not dependent on the current policy. PGQ vs PCL: PGQ relates the optimal policy to the hard-max Q-values in the limit of τ = 0, and thus proposes to augment the actor-critic objective with offline updates that minimize a set of single-step hard-max Bellman errors; PGQ's weakness is assuming the limit of τ = 0 (τ is the entropy regularization temperature). PCL extends the relationship to τ > 0 by exploiting a notion of softmax consistency of Q-values. They compare performance of PCL, A3C, & DQN on the algorithmic tasks from OpenAI Gym, and PCL matches or outperforms A3C & DQN on all tasks. Most impressively in my opinion: PCL can incorporate expert trajectories very easily and very effectively. Inserting just 10 expert trajectories in a size 400 batch drastically improves performance (e.g. solving tasks with max reward in just ~50 iterations as opposed to only getting to ~half of max reward after 4000 iterations). They did not perform a direct comparison of the effect of adding expert trajectories to other RL algorithms (mostly because it’s more complicated to do so for others). This attribute could be very useful for tasks with a few expert trajectories such as Description2Code.	Yes	Algorithmic tasks from OpenAI Gym: https://gym.openai.com/envs#algorithmic	Path Consistency Learning (PCL), Softmax Temporal Consistency; Value and Policy Based Reinforcement Learning that acknowledges γ-discounted entropy regularization; on and off policy; optional ability to effectively/easily incorporate expert trajectories	Yes		Test it out on tasks more complicated than the algorithmic tasks from OpenAI Gym.	Google Brain, University of Alberta	Beats or matches all the baselines. Can use on-policy and/or off-policy data (any state-action subsequence). Can incorporate expert trajectories very easily and very effectively; adding just a few expert trajectories to a batch yields drastic improvements.
37	3/1/2017 8:04:37	Gregory Besson	Fast and Accurate Inference with Adaptive Ensemble Prediction in Image Classification w/ Deep NNs	https://arxiv.org/pdf/1702.08259.pdf	Ensembling multiple predictions is a widely used technique to improve accuracy in image classification tasks; its drawback is the computation cost. The approach proposed in this paper is to add a confidence level for each input based on the probability of the predicted label (the softmax output). When the prediction for an input reaches a high enough probability on the basis of the confidence level, ensembling stops for this input. This way, the computation time is drastically reduced while achieving similar accuracy to the regular ensembling technique.	No	ILSVRC 2012, Street View House Numbers (SVHN), CIFAR-10, and CIFAR-100 (with fine and coarse labels)	For the ILSVRC 2012 dataset, GoogLeNet as the network architecture, trained using stochastic gradient descent with momentum as the optimization method. For other datasets, a network with six convolutional layers with batch normalization followed by two fully connected layers (with Adam as the optimization method).	Yes			IBM	Because this technique would help to achieve the accuracy of the ensemble technique while keeping the computation time low.
38	3/1/2017 8:11:23	Nikolaos Sarafianos	MIML-FCN+: Multi-instance Multi-label Learning via Fully Convolutional Networks with Privileged Information	https://arxiv.org/abs/1702.08681	The authors propose a framework for multi-label object recognition from multiple instances by employing the Learning Using Privileged Information paradigm in their learning. A 2-stream network is used (one for primary, one for privileged information) along with a new loss which incorporates the privileged information inside the regularization term.	Yes	VOC 2007, VOC 2012, MS COCO (http://host.robots.ox.ac.uk:8080/pascal/VOC/voc2007/ - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ - http://mscoco.org/)	Deep Learning and the Learning Using Privileged Information paradigm	No			NTU, Singapore	An interesting take-home message from that paper could be the idea of PI pooling (performing pooling only from the bounding box region which is considered as privileged information). However, a discussion about the similarities and the differences of the proposed method (even at a high level) with existing approaches that address the same problem is missing. The idea of incorporating priv. info as a regularization term has also been proposed by Wang and Ji (Classifier Learning with Hidden Information, CVPR 2015), whereas employing additional info during training of a ConvNet for ImageNet (with segmentation masks as priv. info) was proposed by Chen et al. (Training Group Orthogonal Neural Networks with Privileged Information, ICLR 2017 submission).
39	3/1/2017 8:21:03	Spiros Raptis	eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys	https://arxiv.org/abs/1702.08568	The paper uses deep learning to tackle cybersecurity detection problems such as finding malicious URLs, file paths and registry keys. Previous machine learning detection techniques focused on automatically classifying file paths or registry keys and operated on groups of dynamic host-based observations. The proposed technique (eXpose) operates on individual events and learns the representation of input strings rather than using feature engineering (very time consuming) and detects from the input character string based on lexical semantics. eXpose applies a neural network directly to the input strings. It uses character embedding for the input string. It then feeds the previous layer into multiple convolutions for detecting features. For classification it uses fully connected layers where the final layer is a sigmoid layer. The results achieve a 5-10% higher detection rate than manual feature extraction. However, the overall results for file paths and registry keys are worse than for URLs. The explanation is that there’s a difficulty in properly labeling samples. Finally, because the proposed technique is very computationally expensive, next steps include testing more complex architectures on better hardware.	Yes	For the first problem (malicious URLs), 19067879 unique URLs were downloaded from VirusTotal. The malicious file paths and registry keys were extracted from 18M Cuckoo sandbox runs from VirusTotal.	deep learning, convolutional neural networks, character embedding.	Yes		I believe this paper can be a proof of concept. It doesn't appear that anybody had tried something like that before.	Invincea Inc.	I'm excited because deep learning is applied to a very hard problem. Current techniques include manual feature extraction which is a very time consuming task. The paper presents the first signs that automatically recognizing malicious URLs, registry keys and file paths could be revolutionized with deep learning.
40	3/1/2017 9:00:44	Joseph DeBartola	Building Fast and Compact Convolutional Neural Networks for Offline Handwritten Chinese Character Recognition	https://arxiv.org/abs/1702.07975	Handwritten Chinese Character Recognition has seen promising results in offline applications thus far, but truly state of the art results demand large and costly CNNs. Here, the researchers present an application of two new techniques--Adaptive Drop-Weight and Global Supervised Low-Rank Expansion--in an attempt to both reduce the size of networks and improve their efficiency. They were successful in attaining a computational cost an order of magnitude lower than the state-of-the-art while reducing the network to 1/18 its original size--with only a 0.21% drop in accuracy.	Yes	CASIA Online and Offline Chinese Handwriting Databases - Namely CASIA HWDB1.0 and CASIA HWDB1.1 http://www.nlpr.ia.ac.cn/databases/handwriting/Home.html	Deep Learning, Convolutional Neural Networks, Adaptive Drop Weight--where the pruning threshold for redundant connections is calculated dynamically, Global Supervised Low-Rank Expansion--where low-rank filters are used in place of the typically higher-dimension filters seen in CNNs	Yes		The researchers plan to explore similar application of their new techniques to image classification and object detection.	School of Electronic and Information Engineering - South China University of Technology, Fujitsu Research & Development Center Co. Ltd.	The ability to move more demanding, state-of-the-art solutions to deep learning problems to offline applications will dictate the degree to which we can apply these models in our day-to-day life. Handwriting recognition is just the start here! This has major implications for all applications of CNNs which otherwise require massive, high-dimensional calculations.
41	3/1/2017 9:56:58	Rene Wang	Deep and Hierarchical Implicit Models	https://arxiv.org/abs/1702.08896	The authors develop a new variational inference algorithm to estimate parameters of two competing models in a Bayesian inference framework. One is a hierarchical implicit model for estimating the density of a mixture model, which can be applied to networks where the likelihood is intractable to compute. The other is a deep implicit model for estimating multi-layer representations, which can be applied to learn complicated representations. The new variational inference uses KL divergence as its objective and estimates the ratio for the two competing models (hierarchical implicit and deep implicit models). To be scalable, minibatch stochastic gradient descent along with MCMC is used for both models. With the new variational inference, which focuses on ratio estimation, a Bayesian GAN performs better than a Bayesian neural network alone. Other experiments, such as a predator-prey simulation and a denoising autoencoder with a large network, are conducted to demonstrate the computational efficiency of ratio-based variational inference. However, the stability of training is highly dependent on the model assumptions, and training may be unstable and fail to converge when the model deviates greatly from the true data.	Maybe	simulated data set (not provided in the paper)	KL-based variational inference algorithm along with mini-batched SGD/MCMC	Yes			Columbia University, Princeton University
42	3/1/2017 10:13:48	Ioanna Stypsanelli	Depth Creates No Bad Local Minima	https://arxiv.org/abs/1702.08580	This paper proposes a new simplified proof that depth alone creates no bad local minima. It builds on previous work of Kawaguchi (one of the co-authors). So researchers trying to understand machine learning complexity can use this proof to work on designing efficient deep learning networks without special attention to depth but rather focusing on nonlinearity.	Yes	No dataset	calculus/optimisation	No			Haihao Lu, Kenji Kawaguchi, MIT
43	3/1/2017 14:06:20	Kostas Oraiopoulo	The Shattered Gradients Problem: If resnets are the answer, then what is the question?	https://arxiv.org/pdf/1702.08591.pdf	In very deep feedforward nets, gradients come to resemble white noise. The authors call this phenomenon the 'shattered gradients problem'; in contrast, networks with skip connections such as resnets are more resilient to it. They investigate how batch normalization affects shattered gradients. At the end they investigate how "looks-linear" initialization affects very deep networks.	Yes	CIFAR-10	resnet, CReLU	Yes			Victoria University of Wellington, New Zealand; Disney Research Zurich, Switzerland	It identifies an important problem of very deep networks and opens a gate for investigation on how to fix the problem
44	3/1/2017 16:23:06	Adam Tetelman	Learning Latent Networks in Vector Auto Regressive Models	https://arxiv.org/pdf/1702.08575.pdf	Researchers considered the problem of causal inference from observational time series data. Given a partially observed VAR model and a set of time-series data, the researchers attempted to prove that dependencies among observed processes could be inferred. They first defined conditions under which the underlying model could be recovered from the observed processes. They then defined a limit stating how much of the model could be recovered. They derived two algorithms to recover the minimal latent network for their datasets. They built a small experimental dataset to validate their theorems and saw good results. They were able to duplicate their good results with real economic data.	Maybe	Federal Reserve Economic Database - FRED http://research.stlouisfed.org/fred2/ A collection of three US dairy prices from January 1986 to December 2016 http://future.aae.wisc.edu/tab/prices.html Quarterly West German consumption expenditures X1, fixed investment X2, and disposable income X3 during 1960-1982 http://www.jmulti.de/data_imtsa.html	VAR model, Directed Tree Recovery Algorithm (DTR), Node Merging Algorithm (NM)	No			University of Illinois at Urbana-Champaign	This research focuses on VAR models and economic modeling. Their results are more aligned with quantitative analysis than machine learning research.
45	3/1/2017 18:21:31	Naveed Ahmed	Scaffolding Networks for Teaching and Learning to Comprehend	https://arxiv.org/abs/1702.08653	Researchers have developed a learning/teaching platform that would help teachers teach their students using a reasoning approach. It is evident that more questioning allows one to learn any concept better and more deeply. Therefore, the authors 1) develop the Scaffolding Network, an attention-based NN that reasons over a dynamic memory, 2) build a question simulator to generate different related questions for continuous learning, and 3) train the network with an annotated dataset. Both real and synthetic datasets were used for qualitative analysis of the network.	Yes	http://fb.ai/babi	Reinforcement Learning (RL), Deep Q-Learning (DQN)	Yes			Microsoft Research	Yes, I am quite excited about both the problem statement and its implementation. It showed me another dimension where AI can show its strength and outperform conventional teaching approaches. I also think this kind of network can be deployed outside a school environment.
46	3/1/2017 18:46:01	Cristóbal Felipe Fica	Scene Flow to Action Map: A New Representation for RGB-D based Action Recognition with Convolutional Neural Networks Pichao Wang, Wanqing Li, Zhimin Gao, Yuyao Zhang, Chang Tang, Philip Ogunbona	https://arxiv.org/abs/1702.08652	With new devices like the Kinect, which can capture depth in addition to a color map, human action recognition is being pursued. Researchers propose a new method (SFAM-CTKRP) to transform scene flow (3D optical flow) to an action map (SFAM), by first fusing the acquired RGB and depth data into a single 3-channel scene flow map, then using ConvNets to learn the intrinsic relationship between the scene flow map and an RGB-analogous image which can be used with existing ConvNet models trained on ImageNet. The paper also proposes a self-calibration method for RGB-D data that could be misaligned. Finally, the method surpasses the previous state of the art SFAM techniques on the ChaLearn LAP IsoGD dataset (RGB-D videos, each representing one gesture instance). For the M2I dataset, however, the model fails to converge due to the lack of training data.	Yes	ChaLearn LAP IsoGD; M2I Dataset	ConvNets, RankPooling, Multiple-Score fusion, Primal-Dual Flow	Yes			Advanced Multimedia Research Lab, University of Wollongong, Australia; University of Geosciences, Wuhan, China.	This paper achieved the highest accuracy for transforming scene flow to an action map, previously done mainly by hand-crafted methods, showing again the enormous power of ConvNets and how deep learning is displacing other algorithms and methods in different fields. Maybe this will lead to more seamless interaction between humans and machines using gestures.
47	3/2/2017 1:34:30	Saurav Gupta	Active Learning using uncertainty information	https://arxiv.org/pdf/1702.08540.pdf	Researchers propose a new formulation for active learning. Importantly, they incorporate the known label-distribution information within the min-max framework. The key idea of the min-max framework is to minimize the gain of the objective function no matter what the label for the new data point is. Using the existing label distribution information in this framework shows that this method works marginally better than existing methods (EER and MLI) on a majority of 49 real-world datasets.	Maybe	https://archive.ics.uci.edu/ml/datasets.html	Active learning	No			TU Delft, University of Copenhagen
48	3/2/2017 5:42:20	Kris Roosen	Learning Deep NBNN Representations for Robust Place Categorization	https://arxiv.org/pdf/1702.07898.pdf	In previous work, spatial classification was usually performed by feeding features, previously extracted by pre-trained CNN models, to an NBNN (Naïve Bayes Nearest Neighbor) classifier. This paper aims to turn this 2-step process into a single step by integrating the NBNN fully into a CNN and exploiting local deep representations. This method proves more robust, outperforming previously used techniques, and is faster and computationally cheaper. Another advantage is the possibility of end-to-end training (1-step process).	Yes	- ImageNet [J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.] - Places [B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014.] [ B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places: An image database for deep scene understanding,” arXiv preprint 1610.02055, 2016.] - Sports8 [L.-J. Li and L. Fei-Fei, “What, where and who? classifying events by scene and object recognition,” in ICCV, 2007.] - Scene15 [S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.] - MIT67 [A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in CVPR, 2009.] - COLD - KTH-IDOL - MIT8	- [Fully Convolutional] Convolutional Neural Networks ([FC]CNN) - Naïve Bayes Nearest Neighbors (NBNN) - Naïve Bayes Non Linear Learning (NBNL)			There was no practical information on how to implement the ideas in practice, only a very technical mathematical derivation.	University of Rome La Sapienza, Fondazione Bruno Kessler, University of Perugia
49	3/2/2017 9:50:09	Julie Zhu	Learning Deep Nearest Neighbor Representations Using Differentiable Boundary Trees	https://arxiv.org/abs/1702.08833	The two big challenges of k-NN are 1) finding good representations and distance measures between samples, and 2) computational and memory requirements. The boundary tree algorithm (published in 2015) allows for efficient nearest neighbor classification, regression and retrieval, which addresses the second challenge. Researchers in this paper improved the boundary tree algorithm. By modeling traversals in the tree as stochastic events, they form a differentiable cost function which is associated with the tree's predictions. They then use a deep neural network to transform the data, and backpropagating through the tree allows good representations to be learned for k-NN methods. The new algorithm works efficiently on datasets that are not large-scale, and provides high accuracy and very simple, interpretable structures.	Maybe	half-moons dataset, MNIST handwritten digit dataset, CIFAR10 dataset	deep learning, boundary tree (an improved k-NN)	Yes			DeepMind, London, UK	The algorithm works well on data that is not very large-scale. At the end of the training, the result is an extremely simple tree. The advantages of the tree are 1) using the tree for classification, regression and retrieval is fast, 2) the accuracy is high, and 3) the structure is very interpretable.
50	3/2/2017 20:49:22	Anthony Cooper	Skin Lesion Analysis Towards Melanoma Detection	https://arxiv.org/abs/1703.00523	The paper outlines the method and results of an entry in the ISIC (International Skin Imaging Collaboration) 2017 challenge. It addresses two specific areas of the challenge: 1. Lesion Segmentation: The author resizes the images and applies rotations and distortions. This increases the number of training examples to 20,000. The Adam optimization algorithm trains the network over a total of 200 epochs. 2. Lesion Classification: Training occurs on 6,000 images. The author notes that images with gauze and images with a bright light become classed together. The architecture used is AlexNet (deep CNN). Training uses Adam with 10-fold cross validation and continues for 300 epochs. Overall, it seems like this was a great method, especially considering the small set of training examples (which had to be 'duplicated') and the low epoch count.	No	(possibly on ISIC 2017 website)	Deep Convolutional Networks, U-Net/AlexNet architecture, Dropout, ReLU, Adam optimization algorithm, Preprocessing	Yes			Matt Berseth	This is yet another step towards using machine learning and CNNs to detect cancer for early treatment and prevention. Exciting stuff. :)
51	3/2/2017 22:23:55	Anthony Cooper	TumorNet: Lung Nodule Characterization Using Multi-View Convolutional Neural Network with Gaussian Process	https://arxiv.org/abs/1703.00645	In order to better identify lung nodules as malignant or benign, researchers employ a unique Convolutional Neural Network called TumorNet. The traditional process of lung nodule classification involves hand-crafted features classified using Support Vector Machines or Random Forest classifiers. More recently, pre-trained CNNs were only used for identifying features, with SVMs and RFs being used as classifiers. The researchers' method involves an end-to-end CNN that learns all the features itself! They take about 1,000 scans annotated by radiologists, and augment the data by random rotation and up/down-scaling. The CNN consists of 5 convolutional layers, 3 fully connected layers, and a softmax classification layer. A Gaussian process (GP) is used to add uncertainty to the predictions. One result the researchers noted is the link between high level nodule attributes and the deep learning features. They intend to explore more in this area, using transfer learning to help with the lack of labelled data.	Yes	LIDC-IDRI Dataset (reference [10])	CNN, data annotation, data augmentation, Gaussian process regression, computer vision, classification	Yes			University of Central Florida	This is absolutely incredible - using only 1,000 images, they manage to create an end-to-end CNN that classifies malignancy. This can really change lung cancer treatment + prevention, and we will see more and more examples of better methods. I look forward to reading their follow up research. Definitely read this paper; it's worth it!
52	3/2/2017 22:37:55	Jonathan Yan	Understanding Synthetic Gradients and Decoupled Neural Interfaces	https://arxiv.org/abs/1703.00522	Unlocked training can be achieved through the use of Synthetic Gradients (SG) and Decoupled Neural Interfaces (DNI). In this paper, the authors study their theoretical soundness, and show that (1) SG can preserve critical points of the original optimization problem, but can also introduce new ones; (2) under good conditions, the SG model converges to the solution of the original problem; and (3) the representations learned using SG are different from the ones learned using standard backpropagation.	No	MNIST http://yann.lecun.com/exdb/mnist/	Synthetic Gradients, Backpropagation, Feedback Alignment, Direct Feedback Alignment, Kickback	Yes			DeepMind	In Section 6, the authors provide a unified framework to connect several forms of approximate error propagation. I'm excited because this is new to me. Expressing these methods in the language of "conspiring networks" provides additional intuition and helps to spawn future innovations in this direction.
53	3/3/2017 20:31:55	Adam Letts	A Robust Adaptive Stochastic Gradient Method for Deep Learning	https://arxiv.org/abs/1703.00788	Extensive update to the original (2014) Adasecant paper. Adasecant is an adaptive learning rate algorithm which automatically tunes the learning rates. They show that it obtains better performance in deep learning scenarios than popular SGD algorithms. - Contains a much-improved Experiments section whereby the Adasecant algorithm is decomposed into its component parts, and graphs present comparative performance - New description of close relationship to diagonal approximation to the Hessian - Better description of their use of blockwise gradient normalization - Overall presentation is now more fashionable	Maybe	For experimental and comparison purposes: Penn Treebank (PTB) - [ authoritative source unknown - available on many github repos - e.g. https://github.com/yoonkim/lstm-char-cnn/tree/master/data/ptb ] MNIST database - http://yann.lecun.com/exdb/mnist/ IAM-OnDB - http://www.fki.inf.unibe.ch/databases/iam-on-line-handwriting-database/download-the-iam-on-line-handwriting-database	deep learning, Adasecant, Adagrad, blockwise gradient normalization, variance reduction, outlier detection, adaptive learning rate	Yes		The paper is an extension of the 2014 original - which had last been updated in 2015. Here is the link to the original: https://arxiv.org/abs/1412.7419	Université de Montréal, University of Oxford	Removing the need to tune the learning rate can speed up development, and help achieve better results in less time. I feel that in general, approaches that reduce necessary tuning are extremely important. In terms of the new changes to the paper, I believe that the decomposition detail should be valuable for research and tuning purposes. I wonder if the new paper could renew interest and result in more implementations of the algorithm, or provide the additional accessible detail necessary for its findings to influence other algorithms and/or development methodology.
54	3/4/2017 0:53:05	Hyungwon Chae	Unsupervised Image-to-Image Translation Networks	https://arxiv.org/abs/1703.00848	Researchers at NVIDIA proposed the UNsupervised Image-to-image Translation (UNIT) framework, which is based on variational autoencoders and generative adversarial networks and can learn the translation function without any corresponding images in two domains. This reduces the need to collect large sets of paired data. Some experiments shown in this paper include translating RGB day images to night images, thermal images to color images, and rainy-day images to sunny-day images. The team trained the UNIT networks using a Tesla P100 card in an NVIDIA DGX-1 machine. The UNIT framework achieved better results than competing algorithms on benchmark datasets and beat the previous state-of-the-art approach in accuracy.	Yes	multispectral pedestrian detection benchmark, KAIST https://sites.google.com/site/pedestrianbenchmark/ CelebFaces dataset, CUHK http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html MNIST dataset, http://yann.lecun.com/exdb/mnist/ Street View House Number, Stanford http://ufldl.stanford.edu/housenumbers/ USPS dataset http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html	UNsupervised Image-to-image Translation (UNIT) framework, GAN, VAE, Unsupervised Domain Adaptation (UDA)	Yes		The team is interested in extending the framework to the unsupervised language-to-language translation task.	NVIDIA	This research has lots of potential and can be applied to improve tasks such as autonomous driving. Self-driving cars can benefit from easily obtainable images of various road conditions. Most of the existing image-to-image translation approaches are based on supervised learning and require training datasets consisting of pairs of corresponding images in two domains, which can be hard to obtain; this framework makes it much easier.
55	3/4/2017 7:11:42	Dmytro Mishkin	Shortscience	http://www.shortscience.org/	Guys, what about using http://www.shortscience.org/ ? It is a website built exactly for such efforts, and it already has a bunch of papers plus search.	No	-	crowdsourcing	Yes		I am not the author of that website, but I am excited about it. And sorry for the spam.	-	Because it is cool
56	3/5/2017 8:01:01	Martijn Handels	An Unsupervised Learning Method Exploiting Sequential Output Statistics	https://arxiv.org/abs/1702.07817	Addresses an unsupervised learning method for language models (LM). Instead of pairing input-output samples, they exploit sequential statistics of output labels, in the form of N-gram language models, which can be obtained independently of input data and thus with low or no cost. The method's success over previous unsupervised approaches comes from introducing a novel cost function for this unsupervised learning setting, whose profiles are analyzed and shown to be highly non-convex with large barriers near the global optimum. A new stochastic primal-dual gradient method is developed to optimize this very difficult type of cost function via the use of dual variables to reduce the barriers. The paper demonstrates in experimental evaluation, with both synthetic and real-world data sets, that the new method for unsupervised learning gives drastically lower	Maybe	http://isis-data.science.uva.nl/events/dlia//datasets/uwash3.html	TensorFlow neural net on linear model, new stochastic primal-dual gradient method	Yes		While the current work is limited to unsupervised linear models for prediction, it is straightforward to generalize the current cost function and SPDG algorithm to nonlinear models such as deep neural nets. They also plan to extend their current method from exploiting N-gram LM to exploiting state-of-the-art LM so that the full unsupervised learning and prediction can be formulated as an end-to-end system.	Microsoft Research	Promising result with unsupervised language models
57	3/5/2017 12:06:18	Alex Korbonits	Coarse Grained Exponential Variational Autoencoders	https://arxiv.org/abs/1702.07805	Researchers relax assumptions of Gaussian priors for modeling variational autoencoders and allow a much larger family of (parametrized) semi-continuous functions whose properties are not just great for generalization and for learning more complex posterior distributions, but also, due to their semi-continuous nature, easy to manipulate in a discretized way for computation. This paper proposes a new method, CG-BPEF-VAE, within the variational auto-encoder framework. They also prove some theoretical bounds, and touch on how to build a discrete latent structure to factor information in a continuous representation.	Yes	MNIST, SVHN	Deep learning, variational autoencoders, Bayesian statistics	Yes		Implementation: https://github.com/sunk/cgvae	Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST)	I'm excited about this research because they prove theoretical bounds and extend the abstraction of variational auto-encoders beyond previous results. It's good to see proofs and theoretical results in the deep learning literature. I am not excited that they used MNIST as their primary dataset on which to test out their modeling, but I think the approach is smart and worth attempting on other datasets. Wouldn't mind seeing this new approach validated experimentally on other datasets. The writing could be much improved. The density of the mathematics in the paper made it hard to digest, but I'm sure that with a more careful reading it would be easier to parse.
58	3/5/2017 12:32:56	Alex Korbonits	Revisiting NARX Recurrent Neural Networks for Long-Term Dependencies	https://arxiv.org/abs/1702.07805	MIxed hiSTory RNNs, or MIST RNNs, resolve two issues that plague the use of NARX RNNs (which attempt to address the inability of RNNs to learn long-term dependencies adequately due to vanishing gradients): (1) by using exponential delays instead of continuous delays, they greatly reduce worst-case bounds on the number of previous edges in a set of delays that need to be visited; and (2) (quoted) "Second, by restricting ourselves to a (learned) convex combination of previous states, we maintain a computational complexity that is similar to LSTM and GRUs." I.e. they are attacking the problem of learning long-term dependencies by introducing connections from previous units but in a way that is not too computationally complex (compared to NARX RNNs) AND with better bounds. They also spend a lot of time (in a good way) going through the mechanics of MIST RNNs so that they could be easily implemented.	Yes	TIMIT corpus (Garofolo et al., 1993); MNIST	deep learning, recurrent neural networks, NARX RNNs, MIST RNNs, vanilla RNNs, LSTMs, GRUs	Yes		The authors do a great job differentiating between different kinds of RNNs and their advantages and disadvantages in attacking the 4 general problems that this paper applies MIST RNNs to (Copy (D = 100), Addition (L = 100), TIMIT, Sequential pMNIST). They also acknowledge future work to be done that would combine other regularization/optimization techniques that are common across different RNN types, and they stress the importance of reproducibility by being transparent and offering code for their implementation of MIST RNNs.	Johns Hopkins University, Technische Universitat Munchen, Institute for Advanced Study at Technische Universitat Munchen	I'm excited about this research because, as the authors point out, MIST RNNs are an orthogonal approach to LSTMs and RNNs, and can be combined in future work to gain (potentially) better performance. I had not reviewed many papers before that considered such specific introductions of delays to previous units in order to address the vanishing gradient problem in learning long-term dependencies.
59	3/5/2017 17:17:35	Srinivas Neppalli	Learning Discrete Representations via Information Maximizing Self Augmented Training	https://arxiv.org/abs/1702.08720	This paper proposes a method called Information Maximizing Self Augmented Training (IMSAT), an information theoretic method for unsupervised discrete representation learning using deep neural networks with end-to-end regularization. IMSAT is then applied to clustering and hash learning to achieve "state-of-the-art performance" on several benchmark datasets. In IMSAT, the data points are mapped into their discrete representations by a deep neural network, which is regularized by encouraging its predictions to be invariant to data augmentation. The predicted discrete representations then exhibit the invariance specified by the augmentation. This regularization method is called Self Augmented Training (SAT). Following the Regularized Information Maximization (RIM) for clustering (Gomes et al., 2010), researchers maximized the information theoretic dependency between inputs and their mapped outputs, while regularizing the mapping function to arrive at Information Maximizing Self Augmented Training (IMSAT).	Maybe	MNIST: http://yann.lecun.com/exdb/mnist/ Omniglot: https://github.com/brendenlake/omniglot STL: https://cs.stanford.edu/~acoates/stl10/ CIFAR10: https://www.cs.toronto.edu/~kriz/cifar.html CIFAR100: https://www.cs.toronto.edu/~kriz/cifar.html SVHN: http://ufldl.stanford.edu/housenumbers/ Reuters: http://www.daviddlewis.com/resources/testcollections/reuters21578/ 20news: http://qwone.com/~jason/20Newsgroups/	Deep Neural Networks, Unsupervised Learning, Regularization, Regularized Information Maximization (RIM)	No			Preferred Networks, Inc.
60	3/5/2017 21:33:50	Saurav Gupta	Meta Networks	https://arxiv.org/pdf/1703.00837.pdf	This paper introduces a novel network architecture for one-shot learning. The model acquires meta-level knowledge across tasks. The model consists of two modules: a meta learner and a base learner. The meta learner operates across tasks and is responsible for fast weight parameterization of both the base and meta learner. The base learner passes meta information in the form of loss gradients to the meta learner. MetaNet is shown to have good generalization and continual learning properties.	Yes	https://github.com/brendenlake/omniglot	meta learning, one-shot learning	Yes			U Mass Amherst	Gets SOTA on a one-shot task. Uses meta learning as a technique inspired by the brain.
61	3/11/2017 13:14:30	Darshan Pai	MoleculeNet: A Benchmark for Molecular Machine Learning	https://arxiv.org/abs/1703.00564	The authors introduce a benchmark system for molecular machine learning techniques. Molecular compounds are complex in nature, and understanding the molecular properties of compounds (called tasks by the authors) is a very complex process using ab-initio computations. Molecular compounds are hard to gather and go through an extensive process to determine their chemical properties with high accuracy. Hence the datasets are not very large. Moreover, the lack of a common benchmark precludes comparison of proposed methods already in the literature. Machine learning can help predict these molecular properties at a much higher rate and much more accurately. The challenges include curating the data so that it can be fed to machine learning algorithms. The goal of the paper is to provide a benchmark suite of tools named MoleculeNet that contributes a data-loading framework, guidance and algorithms for featurization techniques to provide a consistent description of the original heterogeneous and highly variable molecule data to feed into a machine learning algorithm, techniques for splitting data into training and testing samples, and a range of learning models to be applied. The authors also present an analysis of 12 datasets from 4 different categories of tasks. The analysis concludes that data-driven methods for prediction can outperform physical algorithms with moderate amounts of data. Techniques based on graph convolution models give the best results with low overfitting. However, some datasets need good featurizations that are not currently available.	No	Most datasets are cited to individual papers given in references. Some dataset links are provided.
QM7/QM7b ESOL FreeSolv PCBA MUV PDBbind Toxcast SIDER ClinTox Quantum Machine http://quantum-machine.org/datasets HIV https://wiki.nci.nih.gov/display/NCIDTPdat/AIDS+Antiviral+Screen+Data Tox21 https://tripod.nih.gov/tox21/challenge http://www.meddra.org DeepChem package has datasets in it: https://github.com/deepchem/deepchem	The benchmark provides the following models that are available to users for data analysis and prediction: logistic regression, random forest, multitask networks, bypass networks, influence relevance voting, graph convolution models	No		Featurization techniques for data representation look interesting.	Stanford University, Stanford School of Medicine, Schrodinger Inc	I understand the need for a benchmarking suite for better comparison of new and older approaches. However, I do not see original research other than it being an assortment of multiple research sources in a common platform. It will definitely help further molecular research. Maybe the authors have some of their own techniques that they can compare. However, it was not highlighted. The authors definitely take credit for the DeepChem package, which is the base platform for MoleculeNet.	Option 1