ML@GT Spring 2017 Poster Showcase
# | Title | Authors | Abstract
1 | Complex Event Recognition from Images with Few Training Examples | Unaiza Ahsan, Chen Sun, James Hays, Irfan Essa | We propose to leverage concept-level representations for complex event recognition in photographs given limited training examples. We introduce a novel framework to discover event concept attributes from the web and use them to extract semantic features from images and classify them into social event categories with few training examples. Discovered concepts include a variety of objects, scenes, actions and event sub-types, leading to a discriminative and compact representation for event images. Web images are obtained for each discovered event concept and we use (pre-trained) CNN features to train concept classifiers. Extensive experiments on challenging event datasets demonstrate that our proposed method outperforms several baselines using deep CNN features directly in classifying images into events with limited training examples. We also demonstrate that our method achieves the best overall accuracy on a dataset with unseen event categories using a single training example.
2 | LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation | Jianwei Yang, Anitha Kannan, Dhruv Batra, Devi Parikh | We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitch the foregrounds on the background in a contextually relevant manner to produce a complete natural image. For each foreground, the model learns to generate its appearance, shape and pose. The whole model is unsupervised, and is trained in an end-to-end manner with gradient descent methods. The experiments demonstrate that LR-GAN can generate more natural images with objects that are more human recognizable than DCGAN.
3 | Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data | Payam Siyari, Bistra Dilkina, Constantine Dovrolis | Data represented as strings abounds in biology, linguistics, document mining, web search and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because there is an underlying evolutionary process that creates repeatedly more complex strings from simpler substrings. We propose a framework, referred to as Lexis, that produces an optimized hierarchical representation of a given set of "target" strings. The resulting hierarchy, "Lexis-DAG", shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After we prove its NP-hardness for two cost formulations, we propose an efficient greedy algorithm for the construction of Lexis-DAGs. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the "core" of a Lexis-DAG, which is important in the analysis of Lexis-DAGs. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents.
4 | Scribbler: Controlling Deep Image Synthesis with Sketch and Color | Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, James Hays | Recently, there have been several promising methods to generate realistic imagery from deep convolutional networks. These methods sidestep the traditional computer graphics rendering pipeline and instead generate imagery at the pixel level by learning from large collections of photos (e.g. faces or bedrooms). However, these methods are of limited utility because it is difficult for a user to control what the network produces. In this paper, we propose a deep adversarial image synthesis architecture that is conditioned on sketched boundaries and sparse color strokes to generate realistic cars, bedrooms, or faces. We demonstrate a sketch-based image synthesis system which allows users to ‘scribble’ over the sketch to indicate preferred colors for objects. Our network can then generate convincing images that satisfy both the color and the sketch constraints of the user. The network is feed-forward, which allows users to see the effect of their edits in real time. We compare to recent work on sketch-to-image synthesis and show that our approach can generate more realistic, more diverse, and more controllable outputs. The architecture is also effective at user-guided colorization of grayscale images.
5 | Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning | Abhishek Das*, Satwik Kottur*, José M. F. Moura, Stefan Lee, Dhruv Batra | We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision. Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain with supervised dialog data and show that the RL 'fine-tuned' agents significantly outperform SL agents. Interestingly, the RL Qbot learns to ask questions that Abot is good at, ultimately resulting in more informative dialog and a better team.
6 | Hierarchical clustering via spreading metrics | Aurko Roy, Sebastian Pokutta | We study the cost function for hierarchical clusterings introduced by [Dasgupta, 2015] where hierarchies are treated as first-class objects rather than deriving their cost from projections into flat clusters. It was also shown in [Dasgupta, 2015] that a top-down algorithm returns a hierarchical clustering of cost at most $O(\alpha_n \log n)$ times the cost of the optimal hierarchical clustering, where $\alpha_n$ is the approximation ratio of the Sparsest Cut subroutine used. Thus using the best known approximation algorithm for Sparsest Cut due to Arora-Rao-Vazirani, the top-down algorithm returns a hierarchical clustering of cost at most $O(\log^{3/2} n)$ times the cost of the optimal solution. We improve this by giving an $O(\log n)$-approximation algorithm for this problem. Our main technical ingredients are a combinatorial characterization of ultrametrics induced by this cost function, deriving an Integer Linear Programming (ILP) formulation for this family of ultrametrics, and showing how to iteratively round an LP relaxation of this formulation by using the idea of sphere growing, which has been extensively used in the context of graph partitioning. We also prove that our algorithm returns an $O(\log n)$-approximate hierarchical clustering for a generalization of this cost function also studied in [Dasgupta, 2015]. Experiments show that the hierarchies found by using the ILP formulation as well as our rounding algorithm often have better projections into flat clusters than the standard linkage-based algorithms. We conclude with an inapproximability result for this problem, namely that no polynomial-sized LP or SDP can be used to obtain a constant factor approximation for this problem.
7 | Temporal Models for Robot Classification of Human Interruptibility | Siddhartha Banerjee, Sonia Chernova | Robots are increasingly being deployed in unstructured human environments where they will need to approach and interrupt collocated humans. Most prior work on robot interruptions has focused on how to interrupt a person or on estimating a human's awareness of the robot. Our work makes three contributions to this research area. First, we introduce an ordinal scale of interruptibility that can be used to rate the interruptibility of a human. Second, we propose the use of Conditional Random Fields (CRFs) and their variants, Hidden CRFs, and Latent-Dynamic CRFs, for classifying interruptibility. Third, we introduce the use of object labels as a visual cue to the context of an interruption in order to improve interruptibility estimates. Our results show that Latent-Dynamic CRFs outperform all other models across all tested conditions, and that the inclusion of object labels as a cue to context improves interruptibility classification performance, yielding the best overall results.
8 | It Takes Two to Tango: Towards Theory of AI’s Mind | Arjun Chandrasekaran, Deshraj Yadav, Prithvijit Chattopadhyay, Viraj Prabhu, Devi Parikh | Theory of Mind is the ability to attribute mental states (beliefs, intents, knowledge, perspectives, etc.) to others and recognize that these mental states may differ from one’s own. Theory of Mind is critical to effective communication and to teams demonstrating higher collective performance. To effectively leverage the progress in Artificial Intelligence (AI) to make our lives more productive, it is important for humans and AI to work well together in a team. Traditionally, there has been much emphasis on research to make AI more accurate, and (to a lesser extent) on having it better understand human intentions, tendencies, beliefs, and contexts. The latter involves making AI more human-like and having it develop a theory of our minds. In this work, we argue that for human-AI teams to be effective, humans must also develop a theory of AI’s mind – get to know its strengths, weaknesses, beliefs, and quirks. We instantiate these ideas within the domain of Visual Question Answering (VQA). We find that using just a few examples (50), lay people can be trained to better predict responses and oncoming failures of a complex VQA model. Surprisingly, we find that having access to the model’s internal states – its confidence in its top-k predictions, explicit or implicit attention maps which highlight regions in the image (and words in the question) the model is looking at (and listening to) while answering a question about an image – does not help people better predict its behavior.
9 | Video and Accelerometer-Based Motion Analysis for Automated Surgical Skills Assessment | Aneeq Zia, Yachna Sharma, Vinay Bettadapura, Eric L. Sarin, Irfan Essa | Basic surgical skills of suturing and knot tying are an essential part of medical training. Having an automated system for surgical skills assessment could help save experts' time and improve training efficiency. There have been some recent attempts at automated surgical skills assessment using either video analysis or acceleration data. In this paper, we present a novel approach for automated assessment of OSATS-based surgical skills and provide an analysis of different features on multi-modal data (video and accelerometer data).
10 | Estimation of Collagenous Tissue Elastic Property From Microscopy Images Using Deep Learning | Liang Liang, Minliang Liu, Wei Sun | Advanced microscopy imaging, especially second harmonic generation (SHG) imaging, can noninvasively reveal the network of collagen fibers in bovine pericardium tissue, which is a primary determinant of its elastic properties. We used deep learning, a branch of machine learning, to directly deduce tissue elastic properties from microscopy images. Briefly, using an unsupervised learning method, a deep convolutional neural network was trained on the tissue images and mechanical testing data. The trained deep neural network was used for classification to identify the tissue as soft or stiff, and for regression to predict the strain-stress curves from equi-biaxial testing.
11 | Localizing and Orienting Street Views Using Overhead Imagery | Nam Vo, James Hays | In this paper we aim to determine the location and orientation of a ground-level query image by matching to a reference database of overhead (e.g. satellite) images. For this task we collect a new dataset with one million pairs of street view and overhead images sampled from eleven U.S. cities. We explore several deep CNN architectures for cross-domain matching -- Classification, Hybrid, Siamese, and Triplet networks. Classification and Hybrid architectures are accurate but slow since they allow only partial feature precomputation. We propose a new loss function which significantly improves the accuracy of Siamese and Triplet embedding networks while maintaining their applicability to large-scale retrieval tasks like image geolocalization. This image matching task is challenging not just because of the dramatic viewpoint difference between ground-level and overhead imagery but because the orientation (i.e. azimuth) of the street views is unknown, making correspondence even more difficult. We examine several mechanisms to match in spite of this -- training for rotation invariance, sampling possible rotations at query time, and explicitly predicting relative rotation of ground and overhead images with our deep networks. It turns out that explicit orientation supervision also improves matching accuracy. Our best performing architectures are roughly 2.5 times as accurate as the commonly used Siamese network baseline.
12 | Let's Dance: Learning from Online Dance Videos | Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, Irfan Essa | In recent years, deep neural network approaches have naturally extended to the video domain, in their simplest case by aggregating per-frame classifications as a baseline for action recognition. A majority of the work in this area extends from the imaging domain, leading to visual-feature heavy approaches on temporal data. To address this issue we introduce "Let's Dance", a 1000-video dataset (and growing) comprised of 10 visually overlapping dance categories that require motion for their classification. We compare our dataset's performance using imaging techniques with UCF-101 and demonstrate its difficulty. We present a comparison of numerous state-of-the-art techniques on our dataset using three different representations (video, optical flow and multi-person pose data) in order to analyze these approaches. We discuss the motion parameterization of each of them and their value in learning to categorize online dance videos. Lastly, we plan to release this dataset (and its three representations) for the research community to use.
13 | Localized Detection of Surgical Tools in Laparoscopic Videos | Daniel Castro, Aneeq Zia, Prabhudev Prakash, Shaohui Xu, Irfan Essa | Understanding surgeries based on unobtrusive data sources is a field of growing interest in the medical community. Surgical tool presence detection is a developing field that has contributed to the better understanding of operating rooms (ORs) for a variety of applications. We apply a faster than real-time bounding box detection system to laparoscopic videos in order to obtain per-frame localized tool detection. We apply prior distribution thresholding and frame consolidation techniques to this result in order to improve our performance with limited training data. Existing methods generally perform tool presence detection on a per-frame level, which prevents the recognition of repeated tools in a single frame; we leverage such repeats for increased confidence in detection. In order to demonstrate the performance of our method we compare to recent competition results for tool presence detection, where we achieve a state-of-the-art mean AP of 68.94, which is further improved to 83.41 with additionally collected training data. We compare our approach to available methods presented by the research community and provide our bounding box annotations for seven surgical tools, which we believe will be of tremendous value to the research community.
15 | TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition | Chih-Yao Ma, Min-Hung Chen, Zsolt Kira, Ghassan AlRegib | Recent two-stream deep Convolutional Neural Networks (ConvNets) have made significant progress in recognizing human actions in videos. Despite their success, methods extending the basic two-stream ConvNet have not systematically explored possible network architectures to further exploit spatiotemporal dynamics within video sequences. Further, such networks often use different baseline two-stream networks. Therefore, the differences and the distinguishing factors between various methods using Recurrent Neural Networks (RNN) or convolutional networks on temporally-constructed feature vectors (Temporal-ConvNet) are unclear. In this work, we first demonstrate a strong baseline two-stream ConvNet using ResNet-101. We use this baseline to thoroughly examine the use of both RNNs and Temporal-ConvNets for extracting spatiotemporal information. Building upon our experimental results, we then propose and investigate two different networks to further integrate spatiotemporal information: 1) temporal segment RNN and 2) Inception-style Temporal-ConvNet. We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance. However, each of these methods requires proper care to achieve state-of-the-art performance; for example, LSTMs require pre-segmented data or else they cannot fully exploit temporal information. Our analysis identifies specific limitations for each method that could form the basis of future work. Our experimental results on UCF101 and HMDB51 datasets achieve state-of-the-art performances, 94.1% and 69.0%, respectively, without requiring extensive temporal augmentation.
16 | Dynamic low-rank matrix recovery | Liangbei Xu, Mark Davenport | We propose the locally weighted matrix smoothing (LOWEMS) framework as one possible approach to dynamic low-rank matrix recovery. We then establish error bounds for LOWEMS in both the matrix sensing and matrix completion observation models. Our results quantify the potential benefits of exploiting dynamic constraints both in terms of recovery accuracy and sample complexity. To illustrate these benefits we provide both synthetic and real-world experimental results.
17 | Critical Lab Predictions for ICU Sepsis Patients | Andrea McCarter, Soorya Eswaran, Kevin Johnson, Kevin Gu, Sai Srujana Buddi, Kshitish Deo | Sepsis is the body’s overpowering reaction to infection which can lead to organ damage, organ failure, and death. Approximately 27 million people in the world develop sepsis each year, and, in Georgia alone, the mortality rate was 85 deaths per 100,000 persons. Total spending on sepsis in US hospitals in 2011 was $20.3 billion, which was 5.2% of hospital costs. For each hour that treatment (antibiotics) of sepsis is delayed, mortality increases by about 8%. The goal of this study is to predict the values of six labs which are advance indicators of sepsis, so that the progress of sepsis can be prevented by proactive treatment. Using an extract of the MIMIC III critical care database, an openly available dataset developed by the MIT Lab for Computational Physiology, which includes laboratory tests, vital signs, medications, timestamps, and more, we developed a Gaussian regression prediction model for white blood cell counts and five other relevant lab tests. The prediction model will be implemented with the Emory FHIR server providing MIMIC electronic health record data and a second server providing streaming vitals information, so that the predictions can be updated in real time, each time new data is received by the application. When a lab prediction is in the critical range, a mobile message is sent to the physician for evaluation and possible action. The physician can then make a decision on whether additional labs, antibiotics, or other treatment is needed for the patient. This work will also have implications for the computational phenotyping of sepsis, savings on lab tests, and provision of general trending information of the patient's health to the physician.
18 | Generating Adaptive and Robust Filter Sets using an Unsupervised Learning Framework | Mohit Prabhushankar, Dogancan Temel, Ghassan AlRegib | In this paper, we introduce an adaptive unsupervised learning framework, which utilizes natural images to train filter sets. The applicability of these filter sets is demonstrated by evaluating their performance in two contrasting applications: image quality assessment and texture retrieval. While assessing image quality, the filters need to capture perceptual differences based on dissimilarities between a reference image and its distorted version. In texture retrieval, the filters need to assess similarity between texture images to retrieve closest matching textures. Based on experiments, we show that the filter responses span a set in which a monotonicity-based metric can measure both the perceptual dissimilarity of natural images and the similarity of texture images. In addition, we corrupt the images in the test set and demonstrate that the proposed method leads to robust and reliable retrieval performance compared to existing methods.
19 | One-Shot Learning for Semantic Segmentation | Amirreza Shaban, Shray Bansal, Irfan Essa, Byron Boots | Despite the recent success of Deep Neural Networks for semantic image segmentation, learning a new semantic class from few examples remains a very challenging problem. This is due to the reliance of training techniques on the availability of large quantities of well annotated data. Recent work on low-shot learning for image classification has pioneered techniques for learning from sparse data; we extend these techniques to dense semantic segmentation. Specifically, we train a network that, given a small set of annotated images, produces parameters for a Fully Convolutional Network (FCN). We then use this FCN to perform dense pixel-level prediction on a test image for the new semantic class. Our architecture shows 25% relative mean IoU improvement in one-shot semantic segmentation compared to the best baseline methods on unseen classes from the PASCAL VOC 2012 dataset.
20 | ShimonHero: Deep Q-Learning for Musical Robotic Control | Lamtharn (Hanoi) Hantrakul, Zachary Kondak, Gil Weinberg | We use a Deep Q-network to train Shimon, a 4-armed marimba-playing improvising robot, to correctly hit a sequence of musical notes using Deep Reinforcement Learning in a virtual game environment. The agent is given access to only the game pixels and a reward signal. The architecture is similar to DeepMind's Atari-playing Deep Q-network, but augmented with shared CNN layers that connect to control outputs for each arm. We show the network is able to learn control for 1 arm and 2 arms, but has yet to generalize to the full 4 arms. The purpose of this work goes beyond path planning; it aims to develop an intermediate representation reflecting the robot's physical state. This representation can be jointly fed with a musical embedding into, say, a note-generating LSTM, so that the LSTM does not produce notes which are physically impossible to play in the robot's current configuration. Thus, both the physical embedding and musical embedding can be jointly optimized to inform the next musical note.
21 | PASSAGE: A Travel Safety Assistant with Safe Path Recommendations for Pedestrians | Matthew Garvey, Nilaksh Das, Jiaxing Su, Meghna Natraj, Bhanu Verma | Atlanta has consistently ranked as one of the most dangerous cities in America with over 2.5 million crime events recorded within the past six years. People who commute by walking are highly susceptible to crime here. To address this problem, our group has developed a mobile application, PASSAGE, that integrates Atlanta-based crime data to find "safe paths" between any given start and end locations in Atlanta. It also provides security features in a convenient user interface to further enhance safety while walking.
22 | Prediction of onset of sepsis using machine learning | Adam Lieberman, Jyothi Narayana, Brent Zucker, Biswajyoti Pal, Ravish Chawla, Robert Schwieterman | Sepsis is a life-threatening condition that arises when the human body's response to infection injures its own tissues and organs. It is a common occurrence among patients admitted to Intensive Care Units in hospitals, with one million cases in the US, increasing on average by 11.9% annually. The rate of mortality increases every hour by 10% after the onset of sepsis. Hence, we attempt to predict the onset using machine learning techniques by training models on historical electronic medical records available in the MIMIC-III database. We are also creating an online prediction pipeline using the models on top of the Unscrambl platform, which pulls in data from a Fast Healthcare Interoperability Resources server and presents the prediction on a custom data visualization application.
23 | Visual Exploration of Machine Learning Results using Data Cube Analysis | Minsuk Kahng, Dezhi Fang, Polo Chau | As complex machine learning systems become more widely adopted, it becomes increasingly challenging for users to understand their underlying models and interpret the model results. We present our ongoing work on developing interactive and visual approaches for exploring and understanding machine learning results using data cube analysis.
25 | Accelerate Sparse Tensor Decomposition on GPUs | Jiajia Li, Yuchen Ma, Jimeng Sun, Richard Vuduc | This paper improves two critical computational kernels for sparse tensors on GPUs considering GPU architecture and algorithmic features.
26 | DeepNav: Learning to Navigate Large Cities | Samarth Brahmbhatt, James Hays | We present DeepNav, a Convolutional Neural Network (CNN) based algorithm for navigating large cities using locally visible street-view images. The DeepNav agent learns to reach its destination quickly by making the correct navigation decisions at intersections. We collect a large-scale dataset of street-view images organized in a graph where nodes are connected by roads. This dataset contains 10 city graphs and more than 1 million street-view images. We propose 3 supervised learning approaches for the navigation task and show how A* search in the city graph can be used to generate supervision for the learning. Our annotation process is fully automated using publicly available mapping services and requires no human input. We evaluate the proposed DeepNav models on 4 held-out cities for navigating to 5 different types of destinations. Our algorithms outperform previous work that uses hand-crafted features and Support Vector Regression (SVR).
27 | FACETS: Adaptive Local Exploration of Large Graphs | Robert Pienta, Minsuk Kahng, Zhiyuan Lin, Jilles Vreeken, Partha Talukdar, James Abello, Ganesh Parameswaran, Duen Horng Chau | Visualization is a powerful paradigm for exploratory data analysis. Visualizing large graphs, however, often results in excessive edge crossings and overlapping nodes. We propose a new scalable approach called FACETS that helps users adaptively explore large million-node graphs from a local perspective, guiding them to focus on nodes and neighborhoods that are most subjectively interesting to users. We contribute novel ideas to measure this interestingness in terms of how surprising a neighborhood is given the background distribution, as well as how well it matches what the user has chosen to explore. FACETS uses Jensen-Shannon divergence over information-theoretically optimized histograms to calculate the subjective user interest and surprise scores. Participants in a user study found FACETS easy to use, easy to learn, and exciting to use. Empirical runtime analyses demonstrated FACETS’s practical scalability on large real-world graphs with up to 5 million edges, returning results in fewer than 1.5 seconds.
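
Note on the FACETS abstract (poster 27): it mentions Jensen-Shannon divergence over histograms as the basis for its interest and surprise scores. For readers unfamiliar with the measure, the following is a minimal sketch of the standard base-2 Jensen-Shannon divergence between two histograms with the same bins. The function name `js_divergence` and the example counts are hypothetical, and the abstract does not describe how FACETS builds or optimizes its histograms, so this is not the authors' implementation.

```python
import math


def js_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two histograms.

    `p` and `q` are sequences of non-negative counts over the same bins.
    Both are normalized to probability distributions first. The result
    lies in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each histogram must have positive total mass")

    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence KL(a || b); empty bins of `a`
        # contribute nothing, and b > 0 wherever a > 0 because b is M.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical example: a neighborhood's attribute histogram compared
# with a background histogram over the same three bins.
neighborhood = [8, 1, 1]
background = [3, 4, 3]
score = js_divergence(neighborhood, background)
```

A larger score means the neighborhood's distribution departs further from the background, which is one plausible reading of "surprising" in the abstract; the measure is symmetric, so the order of the arguments does not matter.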